- Code: Flickr Developer Blog
- Etsy
- YouDevise Developer blog
- Foursquare Engineering Blog
- BackType Technology
- Foursquare Blog
- The Twitter Engineering Blog
- Engineering @ Facebook's Facebook Notes
- The Netflix Tech Blog
- Twilio Engineering Blog
- bizo developer blog
- SimpleGeo · Developers
- LinkedIn Engineering - Blog
- IMAX
- Adku
- Mixpanel Engineering
- mgm technology blog
- Stephen Colebourne's blog
- Engineering Rapleaf
- Sematext Blog
- Neo4j Blog
- Linode Blog
- Amazon Web Services Blog
- Simplegeo
- Facebook developer blog
- Server Fault Blog
- Lucid Imagination
- Instagram Engineering
9.12.11
List of engineering blogs
8.7.10
MapReduce / Hadoop videos and tutorials
Besides the official documentation, here are my sources of information about Hadoop:
- One of the best resources I've seen on the web about Hadoop and its tools (HBase, Pig, Hive): the Cloudera Videos.
- Two lectures from a UC Berkeley course about MapReduce and Hadoop here and here.
- A free book about MapReduce algorithms from here.
- Blogs:
Another very good source of information about the inner workings of Hadoop was Hadoop: The Definitive Guide.
2.6.10
[HOWTO] Extract relevant text from news articles / blog posts
I needed a way to extract the text from news articles. Because the text was 'decorated' with extra text like menus, latest articles, top articles, weather info, exchange rates, etc., it was hard to take only the relevant text of the article without creating a specialized parser for every site.
The first idea was to build trees from the HTML structure, then 'match' the trees of different pages and extract the content that differs. A benefit of this approach is that the comments could also be extracted by analyzing the repetitive nodes. (I haven't implemented this method yet, but I may write another post about it.)
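To make the tree idea a bit more concrete, here is a minimal sketch of how such matching could look. It is not an implementation from this post: the `Node` class is a hypothetical, simplified stand-in for a parsed HTML element, and the matching is purely positional (child i of one page is compared with child i of the other), which assumes both pages share the same template. A real version would need a proper tree alignment to cope with inserted or missing nodes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class TreeDiff {

    /** Hypothetical, simplified stand-in for a parsed HTML element. */
    static final class Node {
        final String tag;
        final String text;
        final List<Node> children;

        Node(String tag, String text, Node... children) {
            this.tag = tag;
            this.text = text;
            this.children = Arrays.asList(children);
        }
    }

    /** Returns the text of 'page' that has no identical counterpart at the same position in 'other'. */
    static List<String> uniqueText(Node page, Node other) {
        List<String> out = new ArrayList<String>();
        collect(page, other, out);
        return out;
    }

    private static void collect(Node page, Node other, List<String> out) {
        boolean sameTag = other != null && page.tag.equals(other.tag);
        boolean sameText = sameTag && page.text.equals(other.text);
        // Text that also appears at the same place in the other page is treated as decoration.
        if (!page.text.isEmpty() && !sameText) {
            out.add(page.text);
        }
        for (int i = 0; i < page.children.size(); i++) {
            // Positional matching: compare child i with child i, if the other tree has one.
            Node counterpart = (sameTag && i < other.children.size())
                    ? other.children.get(i) : null;
            collect(page.children.get(i), counterpart, out);
        }
    }

    public static void main(String[] args) {
        Node first = new Node("body", "",
                new Node("div", "Home | News | Sport"),
                new Node("div", "", new Node("p", "Text of the first article")),
                new Node("div", "Weather: sunny"));
        Node second = new Node("body", "",
                new Node("div", "Home | News | Sport"),
                new Node("div", "", new Node("p", "Text of the second article")),
                new Node("div", "Weather: sunny"));
        System.out.println(uniqueText(first, second)); // prints [Text of the first article]
    }
}
```

The menu and weather nodes are identical in both trees, so only the article paragraph survives. Comparing against a single other page is the simplest case; comparing against several pages and keeping only text that differs from most of them would be more robust.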
Another, much simpler method is to extract the text of each page and compare the pages line by line. The pages need to be from the same day and from the same site. The main idea is that if a line appears in more than a given percentage of the analyzed pages, then it is a decorator line (latest/top articles, other info). Here's a demo program (it uses the htmlparser library):
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;

import org.htmlparser.beans.StringBean;

public class TestH {

    public static void main(String[] args) {
        String[] urls = {"http://site/page1",
                         "http://site/page2",
                         "http://site/page3",
                         "http://site/page4",
                         "http://site/page5"};

        // The lines of every page, plus how often each line occurs over all pages.
        Collection<String[]> contentLines = new ArrayList<String[]>();
        Map<String, Long> freq = new HashMap<String, Long>();

        for ( String url : urls ) {
            // StringBean fetches the page and returns its text without the markup.
            StringBean sb = new StringBean();
            sb.setLinks(false);
            sb.setReplaceNonBreakingSpaces(true);
            sb.setCollapse(true);
            sb.setURL( url );
            String s = sb.getStrings();

            String[] content = s.split("\r\n|\r|\n");
            contentLines.add( content.clone() );

            for ( String c : content ) {
                Long l = freq.get( c );
                if ( l == null ) {
                    l = 0L;
                }
                l++;
                freq.put(c, l);
            }
        }

        for ( String[] c : contentLines ) {
            for ( String s : c ) {
                // A line seen more than 3 times (over the 5 pages) is a decorator line: skip it.
                if ( freq.get(s) != null && freq.get(s) > 3 ) {
                    continue;
                }
                System.out.println( s );
            }
            // Separator between pages.
            System.out.println();
            System.out.println( "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx" );
            System.out.println( "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx" );
            System.out.println();
        }
    }
}
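The demo uses a fixed count as the cut-off, while the description talks about a percentage of the analyzed pages. As a rough sketch of that variant, the filter can be written as a function over page texts that are already in memory, which also makes it easy to try without fetching anything. The class name `DecoratorFilter`, the method `relevantLines` and the parameter `maxShare` are made up for this example. Unlike the demo, it counts a line at most once per page, so a line repeated inside a single article does not look like a decorator.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class DecoratorFilter {

    /**
     * For each page, keeps the lines that occur in at most maxShare
     * (a fraction between 0 and 1) of all pages.
     */
    static List<List<String>> relevantLines(List<String> pages, double maxShare) {
        List<String[]> split = new ArrayList<String[]>();
        Map<String, Integer> pageCount = new HashMap<String, Integer>();

        for (String page : pages) {
            String[] lines = page.split("\r\n|\r|\n");
            split.add(lines);
            // Count each distinct line once per page.
            Set<String> seen = new HashSet<String>(Arrays.asList(lines));
            for (String line : seen) {
                Integer n = pageCount.get(line);
                pageCount.put(line, n == null ? 1 : n + 1);
            }
        }

        List<List<String>> result = new ArrayList<List<String>>();
        for (String[] lines : split) {
            List<String> kept = new ArrayList<String>();
            for (String line : lines) {
                double share = (double) pageCount.get(line) / pages.size();
                if (share <= maxShare) {
                    kept.add(line);
                }
            }
            result.add(kept);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> pages = Arrays.asList(
                "Home | News | Sport\nFirst article body\nWeather: sunny",
                "Home | News | Sport\nSecond article body\nWeather: sunny",
                "Home | News | Sport\nThird article body\nWeather: sunny");
        // Lines present in more than half of the pages are dropped.
        for (List<String> page : relevantLines(pages, 0.5)) {
            System.out.println(page);
        }
    }
}
```

With the three sample pages, the menu and weather lines occur on every page and are dropped, leaving only the article line of each page. The right value for `maxShare` depends on how many pages are compared and how much of the template they share, so it is best treated as something to tune.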
18.4.10
[HOWTO] Using Hadoop MapReduce (0.20) with input and output to HBase (0.20) tables
create 'webpage','pageContent'
Then I wrote three classes: a Mapper, a Reducer and a third class that wires the two together. The Mapper reads every row from the table and checks whether the row already has the text extracted from the HTML (pages are added at any time, so a check is needed; the check is simple: html content != null AND text content == null). It then creates Put operations to be executed by the Reducer (the Put could be executed directly in the Mapper, but I chose to use a Reducer). So, here's the code:
The Mapper:
package ro.zava.crawler.mr.htmlstripper;

import java.io.IOException;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.io.Writable;

import ro.zava.crawler.htmlparser.HTMLExtractor;

public class HtmlMapper extends TableMapper<ImmutableBytesWritable, Writable> {

    private static final Log logger = LogFactory.getLog( HtmlMapper.class );

    private static enum Counters {
        ROWS
    }

    // Target cell for the extracted text: family "pageContent", qualifier "textVal".
    private static byte[] qBtes = "pageContent".getBytes();
    private static byte[] cBtes = "textVal".getBytes();

    @Override
    public void map(ImmutableBytesWritable row, Result values,
            Context context) throws IOException {
        byte[] htmlContent = null;
        byte[] textContent = null;
        String column = null;

        // Pick up the raw HTML and, if already present, the extracted text.
        for (KeyValue value : values.list()) {
            column = new String( value.getColumn() );
            if ( "pageContent:val".equalsIgnoreCase(column) ) {
                htmlContent = value.getValue();
            } else if ( "pageContent:textVal".equalsIgnoreCase(column) ) {
                textContent = value.getValue();
            }
        }

        // Only rows that have HTML but no extracted text yet need processing.
        if ( htmlContent != null && textContent == null ) {
            context.getCounter(Counters.ROWS).increment(1);
            try {
                KeyValue kv = new KeyValue(
                        row.get(),
                        qBtes,
                        cBtes,
                        System.currentTimeMillis(),
                        HTMLExtractor.extract(new String(htmlContent)).getBytes());
                Put p = new Put( row.get() );
                p.add( kv );
                context.write(row, p);
            } catch ( Exception e ) {
                logger.error("Could not extract text for row " + new String(row.get()), e );
            }
        }
    }
}
The Reducer:
package ro.zava.crawler.mr.htmlstripper;

import java.io.IOException;

import org.apache.hadoop.hbase.mapreduce.TableReducer;
import org.apache.hadoop.io.Writable;

public class HtmlReducer extends TableReducer<Writable, Writable, Writable> {

    @Override
    public void reduce(Writable key, Iterable<Writable> values, Context context)
            throws IOException, InterruptedException {
        for (Writable putOrDelete : values) {
            context.write(key, putOrDelete);
        }
    }
}
The 'link' for the two classes:
package ro.zava.crawler.mr.htmlstripper;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableOutputFormat;
import org.apache.hadoop.mapreduce.Job;

public class HtmlStripper {

    public static Job createSubmittableJob(Configuration conf)
            throws IOException {
        String tableName = "webpage";

        Job job = new Job(conf, "rowCounter_" + tableName);
        job.setJarByClass(HtmlStripper.class);

        Scan scan = new Scan();
        scan.addColumns("pageContent:val");
        scan.addColumns("pageContent:textVal");

        TableMapReduceUtil.initTableMapperJob(
                tableName, scan,
                HtmlMapper.class, ImmutableBytesWritable.class, Put.class, job);
        TableMapReduceUtil.initTableReducerJob(tableName, HtmlReducer.class, job);

        job.setOutputFormatClass(TableOutputFormat.class);
        job.setReducerClass( HtmlReducer.class );
        return job;
    }

    public static void main(String[] args) throws Exception {
        HBaseConfiguration conf = new HBaseConfiguration();
        Job job = createSubmittableJob(conf);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
Other articles which helped me were this, the documentation and the test cases from HBase.
13.4.10
[HOWTO] Make a backup and restore of HDFS data
To save:
hadoop fs -copyToLocal /HBASE /home/adi/dev/backups
To restore the backup:
hadoop fs -copyFromLocal /home/adi/dev/backups/HBASE /