Friday, 24 April 2009

Open Source Google for Everyone

There's a real spike of activity going on at the Apache Software Foundation at the moment. I wrote about CouchDB in an earlier post, but there are a number of very interesting projects running currently. Probably the most significant is Hadoop. Hadoop was promoted to an Apache 'Top Level Project' a year ago and it's now really taking off in the Open Source community.

Hadoop is distributed computing middleware designed to process petabytes of data across thousands of commodity hardware nodes. It implements a computational approach called Map/Reduce on top of a distributed file system to deliver a highly fault-tolerant platform for processing very large data sets in parallel. Hadoop is 'inspired' by Google's MapReduce and Google File System (GFS) papers.

So how does it work?

There are two major components to Hadoop:
  • HDFS - a distributed file system that replicates data across many nodes
  • Map/Reduce - an execution middleware that distributes processing to nodes where the data resides
Files loaded onto HDFS are split into blocks, and these blocks are replicated across multiple nodes in the Hadoop cluster. The system monitors for hardware and processing failures and re-replicates blocks onto other nodes, providing very high levels of fault tolerance.
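
To make that concrete, here's a minimal sketch of my own (not from the Hadoop examples; the file names and paths are purely hypothetical) that uses Hadoop's standard FileSystem API to copy a local file into HDFS and then report the block size and replication factor it was stored with:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Illustrative only: copy a local file into HDFS and report how it is stored.
public class HdfsPutExample {
  public static void main(String[] args) throws Exception {
    // Picks up the cluster settings (e.g. fs.default.name) from the
    // Hadoop configuration files on the classpath.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    Path local = new Path("mydata.txt");              // hypothetical local file
    Path remote = new Path("/user/demo/mydata.txt");  // hypothetical HDFS destination
    fs.copyFromLocalFile(local, remote);

    // The file is now stored as replicated blocks across the cluster.
    FileStatus status = fs.getFileStatus(remote);
    System.out.println("block size  : " + status.getBlockSize());
    System.out.println("replication : " + status.getReplication());
  }
}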

In the Hadoop programming framework data is record oriented. Input files are broken into records (lines, or whatever sub-element is appropriate for the application logic) and each Hadoop process running on a node works on a subset of these records. Wherever possible, processes act on data held on the local node's disks rather than transferring it across the network; Hadoop's strategy is to move the computation to the data rather than the data to the computation. This is what gives Hadoop its performance.
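
By way of illustration, here's a small sketch (my own fragment, not from the Hadoop distribution) of how a job declares the way its input files are broken into records using the org.apache.hadoop.mapred API that the WordCount example below also uses. With TextInputFormat, the default for plain text, each record handed to a mapper is a (byte offset, line of text) pair, and input splits line up with HDFS blocks so map tasks can be scheduled on nodes that already hold the data:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;

// Illustrative only: tell Hadoop how input files become records.
public class RecordConfigExample {
  public static JobConf configure(Class<?> jobClass, String inputDir) {
    JobConf conf = new JobConf(jobClass);
    // Each record a mapper receives is (byte offset in file, one line of text).
    conf.setInputFormat(TextInputFormat.class);
    FileInputFormat.addInputPath(conf, new Path(inputDir));
    return conf;
  }
}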



The splitting and recombining of data and processing is handled by the Map/Reduce algorithm. Records are processed in isolation by tasks called Mappers; the output from the Mappers is then brought together by a second set of tasks called Reducers, where results from the different Mappers can be merged.



The clever aspect of Hadoop is that it takes pretty much all of the cluster management and distributed processing plumbing away from the developer, letting them focus on the application logic.

In my early programming career I worked on Apollo Domain Workstations, and I always remember that one of the coolest programming examples shipped with the operating system (AEGIS) was a Mandelbrot generator that computed elements of the set on different nodes in the network in parallel. That was my first experience of the power of distributed parallel computing. The problem with the program, though, was that all the inter-process and inter-node communication was coded 'low level' through TCP socket programming and the like. If I remember rightly, most of the code was handling all of this IPC plumbing rather than generating the Mandelbrot sequences. This is exactly the problem Hadoop solves.

The architecture of Hadoop exhibits flat scalability. On a small cluster with small data sets the performance advantage is minimal, if there is one at all, but once your program is running on two nodes with 1 GB of data, it'll scale to thousands of nodes and petabytes of data without modification.

For an example Hadoop application, imagine you wanted to write a program that counted the occurrences of each word across multiple text files. Example text files would look like:
text1.txt: google is the best search engine

text2.txt: a9 is the better search engine

The output would look like:
a9 1
google 1
is 2
the 2
best 1
better 1
search 2
engine 2

Pseudocode for a Map/Reduce approach to solving this looks like:
mapper (filename, file-contents):
  for each word in file-contents:
    emit (word, 1)

reducer (word, values):
  sum = 0
  for each value in values:
    sum = sum + value
  emit (word, sum)

Several instances of the mapper function get created on different machines in the cluster. Each instance receives a different input file (it is assumed that we have many such files). The mappers output (word, 1) pairs, which are then forwarded to the reducers. Several instances of the reducer method are also instantiated on the different machines. Each reducer is responsible for processing the list of values associated with a different word. The list of values will be a list of 1's; the reducer sums up those ones into a final count associated with a single word. The reducer then emits the final (word, count) output, which is written to an output file.

The Hadoop distribution ships with a sample Java program that does essentially this task. It's available in the Hadoop distribution download under src/examples/org/apache/hadoop/examples/WordCount.java and is partially reproduced below:

/**
 * A mapper class that emits a (word, 1) pair for every word in the input.
 */
public static class MapClass extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output,
                  Reporter reporter) throws IOException {
    String line = value.toString();
    StringTokenizer itr = new StringTokenizer(line);
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      output.collect(word, one);
    }
  }
}

/**
* A reducer class that just emits the sum of the input values.
*/
public static class Reduce extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output,
                     Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}

The final component of the Map/Reduce algorithm is the Driver. The driver initializes the job and instructs the Hadoop platform to execute your code on a set of input files, and controls where the output files are placed.

public void run(String inputPath, String outputPath) throws Exception {
  JobConf conf = new JobConf(WordCount.class);
  conf.setJobName("wordcount");

  // the keys are words (strings)
  conf.setOutputKeyClass(Text.class);
  // the values are counts (ints)
  conf.setOutputValueClass(IntWritable.class);

  conf.setMapperClass(MapClass.class);
  conf.setReducerClass(Reduce.class);

  FileInputFormat.addInputPath(conf, new Path(inputPath));
  FileOutputFormat.setOutputPath(conf, new Path(outputPath));

  JobClient.runJob(conf);
}
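
To actually run the job you would package these classes into a jar, load your input files into HDFS, and submit it from the command line with the bin/hadoop jar command (for example, something like bin/hadoop jar wordcount.jar WordCount input output, where the jar name and paths here are just illustrative). Hadoop then writes the final (word, count) pairs into part files under the output directory, which must not already exist.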

The Apache Hadoop project also has a number of sub-projects that utilise or complement the core Hadoop middleware, including:
  • HBase - a distributed, column-oriented database modelled on Google's Bigtable
  • Pig - a high-level data flow language that eases the development of parallel programs for Hadoop
  • ZooKeeper - a distributed coordination service for Hadoop and other distributed applications
  • Hive - a data warehousing infrastructure built on top of Hadoop
  • Mahout - machine learning libraries built on the Map/Reduce processing model
So is anyone using Hadoop, and what for?

You bet. Probably the biggest names using Hadoop are Facebook, Amazon and Yahoo. Facebook uses Hadoop to perform analytics on its service, and Amazon uses it to produce the product search indices for its A9 search engine. Even Microsoft is getting in on the act via its acquisition of Powerset, a natural language search engine. Yahoo uses Hadoop to fight spam.

The New York Times used a Hadoop-based solution to convert 11 million TIFF images into PDFs, all running on Amazon's EC2 and S3!

A company called Cloudera has started offering development, consulting and implementation services to clients wanting to build Hadoop solutions.

I believe the future for Hadoop looks good. It opens up large-scale parallel computing to organisations and companies that previously couldn't touch it without dedicated supercomputing capabilities. Couple Hadoop with on-demand cloud computing services such as Amazon's EC2 and S3 and you have supercomputing for the masses.

Google's success was built on the foundation of Bigtable and its Map/Reduce technology; having such technology available as Open Source will, I believe, drive a whole new generation of Internet computing services and applications.
