MapReduce and Hadoop

A short introduction to the basic ideas.

1 Cloud Computing

1.1 Everyone Agrees

1.2 Driving Factors

1.3 Death of Moore's Law

Figure: CPU clock frequency over time.

1.4 Google, Goose Creek, SC

Google Data Center in Goose Creek, SC. [5]

1.5 Governor Sanford

Photo op. [6]

1.6 MapReduce
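
MapReduce is Google's programming model for processing very large data sets on clusters of commodity machines; Hadoop [8] is its open-source implementation. The model takes its two core operations, map and reduce, from functional programming.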

2 Functional Programming

2.1 Functions

d = function (x) {return 2*x;};

t = function (x) {return 3*x;};

f = function (a,b) {
    return d(a) + t(b);
};
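
For example, f combines the two functions above (a minimal check; the value follows directly from the definitions):

f(2, 3);  // d(2) + t(3) = 4 + 9 = 13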


2.2 By Value Only

a = "hello";

reverse = function(x) {
  var result = "";
  for (var i = 0; i < x.length; i++) {
    result = x[i] + result;
  }
  return result;
};

reverse(a);
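
The call returns a new string while the argument is untouched, since the function only works on a copy of the value:

reverse(a);  // "olleh"
a;           // still "hello"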

2.3 Map

t = function(x) {return 3*x;};

[1,2,3,4,5].map(t);
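
map applies t to each element and collects the results in a new array: [3, 6, 9, 12, 15].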

2.4 Reduce

add = function(a, b){ return a + b; };

[0, 1, 2, 3].reduce(add);
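
reduce folds the array from the left, so this computes ((0 + 1) + 2) + 3 = 6.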

2.5 Map then Reduce

count = function(x) {return x.length;};

add = function(x,y) {return x + y;};

l = ["The", "status", "is", "not", "quo"];

l.map(count).reduce(add);
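
map turns the words into their lengths, [3, 6, 2, 3, 3], and reduce then adds them, giving 17: the total number of characters.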

2.6 On a Large Scale
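
Because map and reduce take pure functions and fix the data-flow pattern, a runtime is free to partition the data and run those functions on many machines at once. That observation is the heart of MapReduce.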

3 Hadoop

3.1 Map Function
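
In the MapReduce model, map takes an input key-value pair and emits a list of intermediate key-value pairs:

map (k1, v1) -> list(k2, v2)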

3.2 Reduce Function
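
reduce receives an intermediate key along with all the values emitted for that key and merges them into a (typically smaller) list of values:

reduce (k2, list(v2)) -> list(v2)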

3.3 Process

  1. The programmer defines the map and reduce functions.
  2. The map input is a very large file; Hadoop splits it and distributes the pieces among machines.
  3. Each machine runs the map function on its own section of the input.
  4. The output keys and their values are accumulated and sorted.
  5. Reduce is applied to all the values of each output key.
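
Sections 3.5 and 3.6 below show this process for a concrete job: counting word occurrences.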

3.4 Parallelism
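
Map invocations are independent of one another, so they can run in parallel across the cluster; reduce invocations for different output keys can likewise run in parallel once the map output for their keys is available.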

3.5 Example: Count Word Occurrences

// input_key: document name
// input_value: document contents
map(String input_key, String input_value):
  for each word w in input_value:
    EmitIntermediate(w, "1");

// output_key: a word
// output_values: a list of counts
reduce(String output_key, Iterator intermediate_values):
  int result = 0;
  for each v in intermediate_values:
    result += ParseInt(v);
  Emit(AsString(result));

3.6 WordCount in Java

/**
 * Licensed to the Apache Software Foundation (ASF) under one
 * or more contributor license agreements. */
import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

/**
 * This is an example Hadoop Map/Reduce application.
 * It reads the text input files, breaks each line into words
 * and counts them. The output is a locally sorted list of words and the 
 * count of how often they occurred.
 *
 * To run: bin/hadoop jar build/hadoop-examples.jar wordcount
 *            [-m <i>maps</i>] [-r <i>reduces</i>] <i>in-dir</i> <i>out-dir</i> 
 */
public class WordCount extends Configured implements Tool {
  
  /**
   * Counts the words in each line.
   * For each line of input, break the line into words and emit them as
   * (<b>word</b>, <b>1</b>).
   */
  public static class MapClass extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {
    
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();
    
    public void map(LongWritable key, Text value, 
                    OutputCollector<Text, IntWritable> output, 
                    Reporter reporter) throws IOException {
      String line = value.toString();
      StringTokenizer itr = new StringTokenizer(line);
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        output.collect(word, one);
      }
    }
  }
  
  /**
   * A reducer class that just emits the sum of the input values.
   */
  public static class Reduce extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {
    
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output, 
                       Reporter reporter) throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(key, new IntWritable(sum));
    }
  }
  
  static int printUsage() {
    System.out.println("wordcount [-m <maps>] [-r <reduces>] <input> <output>");
    ToolRunner.printGenericCommandUsage(System.out);
    return -1;
  }
  
  /**
   * The main driver for word count map/reduce program.
   * Invoke this method to submit the map/reduce job.
   * @throws IOException When there are communication problems with the
   *                     job tracker.
   */
  public int run(String[] args) throws Exception {
    JobConf conf = new JobConf(getConf(), WordCount.class);
    conf.setJobName("wordcount");
 
    // the keys are words (strings)
    conf.setOutputKeyClass(Text.class);
    // the values are counts (ints)
    conf.setOutputValueClass(IntWritable.class);
    
    conf.setMapperClass(MapClass.class);        
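    // Reuse the reducer as a combiner: each map's output is partially
    // summed on the local machine before it is sent across the network.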
    conf.setCombinerClass(Reduce.class);
    conf.setReducerClass(Reduce.class);
    
    List<String> other_args = new ArrayList<String>();
    for(int i=0; i < args.length; ++i) {
      try {
        if ("-m".equals(args[i])) {
          conf.setNumMapTasks(Integer.parseInt(args[++i]));
        } else if ("-r".equals(args[i])) {
          conf.setNumReduceTasks(Integer.parseInt(args[++i]));
        } else {
          other_args.add(args[i]);
        }
      } catch (NumberFormatException except) {
        System.out.println("ERROR: Integer expected instead of " + args[i]);
        return printUsage();
      } catch (ArrayIndexOutOfBoundsException except) {
        System.out.println("ERROR: Required parameter missing from " +
                           args[i-1]);
        return printUsage();
      }
    }
    // Make sure there are exactly 2 parameters left.
    if (other_args.size() != 2) {
      System.out.println("ERROR: Wrong number of parameters: " +
                         other_args.size() + " instead of 2.");
      return printUsage();
    }
    conf.setInputPath(new Path(other_args.get(0)));
    conf.setOutputPath(new Path(other_args.get(1)));
        
    JobClient.runJob(conf);
    return 0;
  }
  
  
  public static void main(String[] args) throws Exception {
    int res = ToolRunner.run(new Configuration(), new WordCount(), args);
    System.exit(res);
  }

}

3.7 Language Support
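
The native API is Java, as above, but Hadoop Streaming lets any executable that reads standard input and writes standard output act as the mapper or reducer, and Hadoop Pipes provides a C++ interface.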

3.8 Hadoop Behind the Scenes

  1. Hadoop tries to schedule each map task on the machine that already holds its section of the data.
  2. If a machine dies, its tasks are automatically re-run on other machines.
  3. If a particular key-value pair repeatedly causes a crash, it is skipped.
  4. If a map task is slow, several speculative copies are started; the first to finish wins.
  5. A combiner (mini-reduce) can run on the same machine as a map.

3.9 Hadoop Conclusions

URLs

  1. Craig Mundie, http://www.technologyreview.com/computing/21422/?a=f
  2. RoundTable, http://www.technologyreview.com/computing/21422/?a=f
  3. Bezos, http://cnettv.cnet.com/2001-1_53-24640.html
  4. Blue Cloud, http://www.youtube.com/watch?v=zfLVvk7CjY4
  5. http://www.flickr.com/photos/29456235@N04/sets/72157607883586121/
  6. http://thedigitel.com/news/searching-times-1783-1008
  7. *, http://www.cs.stanford.edu/people/ang/papers/nips06-mapreducemulticore.pdf
  8. Hadoop, http://hadoop.apache.org

This talk available at http://jmvidal.cse.sc.edu/talks/mapreduce/
Copyright © 2009 José M. Vidal . All rights reserved.

27 October 2008, 03:03PM