HADOOP Tutorial

I recommend the Map/Reduce tutorial from the Hadoop documentation.
[http://hadoop.apache.org/mapreduce/docs/current/mapred_tutorial.html]

This file is not a tutorial. It is merely a cheat sheet to help fire things up
and get past the first tiny example. (As always with Java, things are
extremely sensitive to versions. A nightmare.)

Hadoop is actually installed on our Beowulf cluster, and you can have a peek
at it through various web interfaces:

* HDFS status        [http://hadoopmaster:50070/dfsnodelist.jsp?whatNodes=LIVE]
* Map/Reduce master  [http://hadoopmaster:50030/]
* Map/Reduce workers [http://hadoopslave??:50060/],
  where ?? ranges from 01 to 29 (with some numbers dead)

Unfortunately, I failed to secure write access rights on the HDFS in time for
the tutorial, so we can't use this installation. Instead, I've downloaded the
latest Hadoop release [http://hadoop.apache.org/common/releases.html] and
installed it as a stand-alone application. That is, it can execute Map/Reduce
jobs, but it runs on a single machine in a single Java VM, working on the
local file system rather than on HDFS.

To use this installation, set your environment as follows:

export HADOOP_HOME=/scratch/hadoop-0.23.0
export JAVA_HOME=/usr/java/default
export PATH=${HADOOP_HOME}/bin:${JAVA_HOME}/bin:${PATH}

Now we can try to work through the first WordCount example from the online
tutorial.

* Create a directory `wordcount' and change into it.
* Create a file `WordCount.java' containing the Java code below
  (or fetch it as shown in the first step under Compiling).

Compiling:

* wget http://www.macs.hw.ac.uk/~hwloidl/Courses/F21DP/srcs/WordCount.java
* mkdir wordcount_classes
* javac -classpath "${HADOOP_HOME}/lib/*:${HADOOP_HOME}/share/hadoop/common/*:${HADOOP_HOME}/share/hadoop/common/lib/*" -d wordcount_classes WordCount.java
* jar -cvf wordcount.jar -C wordcount_classes .

Providing input:

* mkdir input
* echo "Hello World Bye World" > input/file01
* echo "Hello Hadoop Goodbye Hadoop" > input/file02

Running (locally):

* java -cp wordcount.jar:${HADOOP_HOME}/*:${HADOOP_HOME}/lib/*:${HADOOP_HOME}/share/hadoop/common/*:${HADOOP_HOME}/share/hadoop/common/lib/* org.myorg.WordCount input output

Running (on the cluster; requires the input files to be uploaded to HDFS
first):

* hadoop jar wordcount.jar org.myorg.WordCount input output

Before running the same job again, you must remove or rename the output
directory, or else Hadoop will complain that it already exists:

* mv output output_old

To run the job on some larger inputs:

* mv input input_old
* mkdir input
* wget http://www.macs.hw.ac.uk/~hwloidl/Courses/F21DP/srcs/WaD.txt
* mv WaD.txt input/file01
* wget http://www.macs.hw.ac.uk/~hwloidl/Courses/F21DP/srcs/TMM.txt
* mv TMM.txt input/file02
* java -cp wordcount.jar:${HADOOP_HOME}/*:${HADOOP_HOME}/lib/*:${HADOOP_HOME}/share/hadoop/common/*:${HADOOP_HOME}/share/hadoop/common/lib/* org.myorg.WordCount input output

Have fun. You may try the rest of the online tutorial, or you may try
something completely different, like sumEuler (a sketch follows the WordCount
listing below).
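
For reference: with the two one-line input files above, the job should produce
a single result file in the output directory (typically part-00000 for a
single-reducer job; check with `ls output'), containing one tab-separated
word/count pair per line:

Bye       1
Goodbye   1
Hadoop    2
Hello     2
World     2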
-------------------------------------------------------------------------------
package org.myorg;

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;

public class WordCount {

    // Mapper: splits each input line into tokens and emits the pair
    // (word, 1) for every token.
    public static class Map extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> output,
                        Reporter reporter) throws IOException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                output.collect(word, one);
            }
        }
    }

    // Reducer (also used as combiner below): sums up all counts
    // collected for a given word.
    public static class Reduce extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {

        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> output,
                           Reporter reporter) throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");

        // Types of the final (key, value) output pairs.
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        // Summing counts is associative and commutative, so the
        // reducer can double as a combiner for local pre-aggregation.
        conf.setMapperClass(Map.class);
        conf.setCombinerClass(Reduce.class);
        conf.setReducerClass(Reduce.class);

        // Plain text in, plain text out.
        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);

        // Input and output directories from the command line.
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
    }
}
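
Since sumEuler is suggested above as a follow-up exercise, here is a minimal
sketch of how it might look as a Map/Reduce job: the mapper computes Euler's
totient phi(n) for every integer n read from the input (one number per line)
and emits it under a single shared key, and the reducer (doubling as combiner)
adds everything up. This is only an illustration, not part of the course
sources: the class name SumEuler, the constant key "sum", the input format and
the brute-force phi are all my choices. It uses the same old-style
org.apache.hadoop.mapred API as WordCount above (release 0.23 also ships the
newer org.apache.hadoop.mapreduce API; don't mix classes from the two
packages).

-------------------------------------------------------------------------------
package org.myorg;

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

// Sketch: sum of Euler's totient function over all integers in the input.
// Input: text files with one positive integer per line (blank lines ignored).
// Output: a single line "sum <total>".
public class SumEuler {

    public static class Map extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, LongWritable> {

        // All pairs share one key, so a single reduce sees every value.
        private static final Text SUM_KEY = new Text("sum");

        private static long gcd(long a, long b) {
            return b == 0 ? a : gcd(b, a % b);
        }

        // Brute-force totient: count the k in [1..n] coprime to n.
        private static long phi(long n) {
            long count = 0;
            for (long k = 1; k <= n; k++)
                if (gcd(n, k) == 1)
                    count++;
            return count;
        }

        public void map(LongWritable key, Text value,
                        OutputCollector<Text, LongWritable> output,
                        Reporter reporter) throws IOException {
            String line = value.toString().trim();
            if (line.isEmpty()) return;
            long n = Long.parseLong(line);
            output.collect(SUM_KEY, new LongWritable(phi(n)));
        }
    }

    public static class Reduce extends MapReduceBase
            implements Reducer<Text, LongWritable, Text, LongWritable> {

        public void reduce(Text key, Iterator<LongWritable> values,
                           OutputCollector<Text, LongWritable> output,
                           Reporter reporter) throws IOException {
            long sum = 0;
            while (values.hasNext())
                sum += values.next().get();
            output.collect(key, new LongWritable(sum));  // grand total
        }
    }

    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(SumEuler.class);
        conf.setJobName("sumeuler");

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(LongWritable.class);

        conf.setMapperClass(Map.class);
        conf.setCombinerClass(Reduce.class);  // summing is associative
        conf.setReducerClass(Reduce.class);

        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
    }
}
-------------------------------------------------------------------------------

It compiles and runs exactly like WordCount above (substitute SumEuler for
WordCount throughout); a test input can be generated with, e.g.,
seq 1 1000 > input/file01.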