8.7.10

MapReduce / Hadoop videos and tutorials

Besides the official documentation, here are my sources of information about Hadoop:

- One of the best resources I've seen on the web about Hadoop and its related tools (HBase, Pig, Hive): the Cloudera Videos.
- Two lectures about MapReduce and Hadoop from a UC Berkeley course, here and here.
- A free book about MapReduce algorithms from here.
- Blogs:

Another very good source of information about the inner workings of Hadoop was Hadoop: The Definitive Guide.

2.6.10

[HOWTO] Extract relevant text from news articles / blog posts

I needed a way to extract the text from news articles. Because the text was 'decorated' with extra text like menus, latest articles, top articles, weather info, exchange rates, etc., it was hard to take only the relevant text of the article without writing a specialized parser for every site.

The first idea was to build trees from the HTML structure, then 'match' the trees of different pages and extract the content that differed. A benefit of this approach is that the comments could also be extracted by analyzing the repetitive nodes. (I haven't implemented this method yet, but I may write another post about it.)
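As a rough illustration of the tree-matching idea, here is a minimal sketch. It is not an implementation I have used: the `Node` class is a hypothetical, simplified stand-in for a parsed HTML element, and children are matched purely by position, which a real page would probably need to relax.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class TreeDiff {

    /** Hypothetical, simplified DOM node: a tag name, optional text, child nodes. */
    static class Node {
        final String tag;
        final String text;
        final List<Node> children;

        Node(String tag, String text, Node... children) {
            this.tag = tag;
            this.text = text;
            this.children = Arrays.asList(children);
        }
    }

    /** Collects text from 'a' that has no identical counterpart at the same position in 'b'. */
    static void diff(Node a, Node b, List<String> out) {
        if (b == null || !a.tag.equals(b.tag)) {
            // The structure differs here, so the whole subtree is unique to 'a'.
            collectAll(a, out);
            return;
        }
        if (a.text != null && !a.text.equals(b.text)) {
            out.add(a.text);
        }
        // Assumption: children are compared pairwise by index.
        for (int i = 0; i < a.children.size(); i++) {
            Node other = i < b.children.size() ? b.children.get(i) : null;
            diff(a.children.get(i), other, out);
        }
    }

    static void collectAll(Node n, List<String> out) {
        if (n.text != null) {
            out.add(n.text);
        }
        for (Node c : n.children) {
            collectAll(c, out);
        }
    }

    public static void main(String[] args) {
        Node page1 = new Node("body", null,
                new Node("ul", null, new Node("li", "Home"), new Node("li", "Sports")),
                new Node("div", null, new Node("p", "Article one text")));
        Node page2 = new Node("body", null,
                new Node("ul", null, new Node("li", "Home"), new Node("li", "Sports")),
                new Node("div", null, new Node("p", "Article two text")));

        List<String> unique = new ArrayList<String>();
        diff(page1, page2, unique);
        System.out.println(unique); // prints [Article one text]
    }
}
```

The menu items are identical in both trees and are skipped; only the paragraph whose text differs is kept.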

Another, much simpler method was to extract the text and compare its lines across pages. The pages need to be from the same day and from the same site. The main idea is that if a line appears in more than a given percentage of the analyzed pages, then it's a decorator line (latest/top articles, other info). Here's a demo program (it uses the htmlparser library):

import java.util.ArrayList;
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;

import org.htmlparser.beans.StringBean;

public class TestH {

 public static void main(String[] args) {
  String[] urls = {"http://site/page1",
    "http://site/page2",
    "http://site/page3",
    "http://site/page4",
    "http://site/page5"};
  
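  // Lines of every page, kept in page order, plus a count of how often each line occurs overall.
  Collection<String[]> contentLines = new ArrayList<String[]>();
  Map<String, Long> freq = new HashMap<String, Long>();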
  
  for ( String url : urls ) {
   StringBean sb = new StringBean();
   sb.setLinks(false);
   sb.setReplaceNonBreakingSpaces(true);
   sb.setCollapse(true);
   sb.setURL( url );
   
   String s = sb.getStrings();
   String[] content = s.split("\r\n|\r|\n");
   contentLines.add( content.clone() );
   
   for ( String c : content ) {
    Long l = freq.get( c );
    if ( l == null ) {
     l = 0L;
    }
    l++;
    freq.put(c, l);
   }
  }
  
  for ( String[] c : contentLines ) {
   for ( String s : c ) {
    // A line counted more than 3 times across the 5 pages is treated as decoration and skipped.
    if ( freq.get(s) != null && freq.get(s) > 3 ) {
     continue;
    }
    System.out.println( s );
   }
   System.out.println();
   System.out.println( "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx" );
   System.out.println( "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx" );
   System.out.println();
  }
 }

}
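
The demo above hard-codes the threshold (more than 3 occurrences). Here is a minimal sketch of the same idea with the threshold expressed as a percentage of pages, as described earlier. It works on already-extracted page text, so it needs no htmlparser; the class name `DecoratorFilter` and the `maxRatio` parameter are my own illustrative choices. Unlike the demo, it counts a line at most once per page, which is an assumption about what "appears in a percentage of the pages" should mean.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class DecoratorFilter {

    /** For each page, keeps only the lines that occur in at most maxRatio of all pages. */
    static List<List<String>> filter(List<String> pages, double maxRatio) {
        // Page frequency: in how many pages does each distinct line occur?
        Map<String, Integer> pagesWithLine = new HashMap<String, Integer>();
        List<String[]> split = new ArrayList<String[]>();
        for (String page : pages) {
            String[] lines = page.split("\r\n|\r|\n");
            split.add(lines);
            // A set ensures a line repeated inside one page is counted once for that page.
            Set<String> seen = new HashSet<String>(Arrays.asList(lines));
            for (String line : seen) {
                Integer n = pagesWithLine.get(line);
                pagesWithLine.put(line, n == null ? 1 : n + 1);
            }
        }

        List<List<String>> result = new ArrayList<List<String>>();
        for (String[] lines : split) {
            List<String> kept = new ArrayList<String>();
            for (String line : lines) {
                double ratio = (double) pagesWithLine.get(line) / pages.size();
                if (ratio <= maxRatio) {
                    kept.add(line);
                }
            }
            result.add(kept);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> pages = Arrays.asList(
                "Menu\nStory A\nWeather",
                "Menu\nStory B\nWeather",
                "Menu\nStory C\nWeather");
        // "Menu" and "Weather" occur in 100% of the pages and are dropped.
        System.out.println(filter(pages, 0.5)); // prints [[Story A], [Story B], [Story C]]
    }
}
```

A ratio makes the cut-off independent of how many pages are analyzed, at the cost of having to pick a sensible value per site.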

18.4.10

[HOWTO] Using Hadoop MapReduce (0.20) with input and output to HBase (0.20) tables

I needed a MapReduce job to populate an HBase table, where the input and the output were the same table. I couldn't find an example of how to do this, so here's one. I wanted to take the HTML content of the webpages stored in a table and extract its text (the HTML content was already in the table, and the parsed text had to be inserted). The simplified structure of the table: table name 'webpage', a column family 'pageContent', and two columns named 'val' and 'textVal'. The script to create this (simplified) table is
create 'webpage','pageContent'
Then I wrote three classes: a Mapper, a Reducer, and a third class that makes the two work together. The Mapper reads every row from the table and checks whether the row already has the text extracted from the HTML (pages are added at any time, so a check is needed; the check is simple: html content != null AND text content == null). It then creates Put operations to be executed by the Reducer (the Put could be executed directly in the Mapper, but I chose to use a Reducer). So, here's the code:

The Mapper:
package ro.zava.crawler.mr.htmlstripper;

import java.io.IOException;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.io.Writable;

import ro.zava.crawler.htmlparser.HTMLExtractor;

public class HtmlMapper extends TableMapper<ImmutableBytesWritable, Writable> {

 private static final Log logger = LogFactory.getLog( HtmlMapper.class );
 
 private static enum Counters {
  ROWS
 }

 // Note: despite the names, qBtes holds the column family and cBtes the column qualifier.
 private static byte[] qBtes = "pageContent".getBytes();
 private static byte[] cBtes = "textVal".getBytes();
 
 @Override
 public void map(ImmutableBytesWritable row, Result values,
   Context context) throws IOException {
  byte[] htmlContent = null;
  byte[] textContent = null;
  
  String column = null;
  for (KeyValue value : values.list()) {
   column = new String( value.getColumn() );
   if ( "pageContent:val".equalsIgnoreCase(column) ) {
    htmlContent = value.getValue();
   } else if ( "pageContent:textVal".equalsIgnoreCase(column) ) {
    textContent = value.getValue();
   }
  }
  
  if ( htmlContent != null && textContent == null ) {
   context.getCounter(Counters.ROWS).increment(1);
   try {
    KeyValue kv = new KeyValue(
      row.get(),
      qBtes,
      cBtes,
      System.currentTimeMillis(),
      HTMLExtractor.extract(new String(htmlContent)).getBytes());
 
    Put p = new Put( row.get() );
    p.add( kv );
    
    context.write(row, p);
   } catch ( Exception e ) {
    logger.error("Failed to extract text for row " + new String(row.get()), e );
   }
  }
 }

}
The Reducer:
package ro.zava.crawler.mr.htmlstripper;

import java.io.IOException;

import org.apache.hadoop.hbase.mapreduce.TableReducer;
import org.apache.hadoop.io.Writable;

public class HtmlReducer extends TableReducer<Writable, Writable, Writable> {
 
 @Override
 public void reduce(Writable key, Iterable<Writable> values, Context context)
   throws IOException, InterruptedException {
  for (Writable putOrDelete : values) {
   context.write(key, putOrDelete);
  }
 }
 
}
The class that links the two together (the job driver):
package ro.zava.crawler.mr.htmlstripper;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableOutputFormat;
import org.apache.hadoop.mapreduce.Job;

public class HtmlStripper {
 public static Job createSubmittableJob(Configuration conf)
   throws IOException {
  String tableName = "webpage";
  Job job = new Job(conf, "htmlStripper_" + tableName);
  job.setJarByClass(HtmlStripper.class);
  Scan scan = new Scan();
  scan.addColumns("pageContent:val");
  scan.addColumns("pageContent:textVal");
  
  TableMapReduceUtil.initTableMapperJob(
    tableName, scan,
    HtmlMapper.class, ImmutableBytesWritable.class, Put.class, job);
  
  TableMapReduceUtil.initTableReducerJob(tableName, HtmlReducer.class, job);
  
  // initTableReducerJob already sets these two; they are repeated here only for explicitness.
  job.setOutputFormatClass(TableOutputFormat.class);
  job.setReducerClass( HtmlReducer.class );

  return job;
 }

 public static void main(String[] args) throws Exception {
  HBaseConfiguration conf = new HBaseConfiguration();
  Job job = createSubmittableJob(conf);
  System.exit(job.waitForCompletion(true) ? 0 : 1);
 }
}


Other resources that helped me were this article, the documentation, and the test cases from HBase.

13.4.10

[HOWTO] Make a backup and restore of HDFS data

I use this to back up my HBase data, but it can be used with any data in HDFS.
To save:
hadoop fs -copyToLocal /HBASE /home/adi/dev/backups
To restore the backup:
hadoop fs -copyFromLocal /home/adi/dev/backups/HBASE /