HADOOP Tutorial

I recommend the Map/Reduce tutorial from the Hadoop documentation.
[http://hadoop.apache.org/mapreduce/docs/current/mapred_tutorial.html]

This file is not a tutorial. It is merely a cheat sheet to help fire things up
and get past the first tiny example. (As always with Java, things are
extremely sensitive to versions. A nightmare.)

Hadoop is actually installed on our Beowulf cluster, and you can have a peek
at it through various web interfaces:

* HDFS status        [http://hadoopmaster:50070/dfsnodelist.jsp?whatNodes=LIVE]
* Map/Reduce master  [http://hadoopmaster:50030/]
* Map/Reduce workers [http://hadoopslave??:50060/],
  where ?? ranges from 01 to 29 (with some numbers dead)

Unfortunately, I failed to secure write access rights on the HDFS in time for
the tutorial, so we can't use this installation. Instead, I've downloaded the
latest Hadoop release [http://hadoop.apache.org/common/releases.html] and
installed it as a stand-alone application. That is, it can execute Map/Reduce
jobs, but it runs on a single machine in a single Java VM, working on the
local file system rather than on HDFS.

To use this installation, set your environment as follows:

export HADOOP_HOME=/scratch/hadoop-0.23.0
export JAVA_HOME=/usr/java/default
export PATH=${HADOOP_HOME}/bin:${JAVA_HOME}/bin:${PATH}

Now we can try to work through the first WordCount example from the online
tutorial.

* Create a directory `wordcount' and change into it.
* Create a file `WordCount.java' containing the Java code below
  (or fetch it as shown in the first step under Compiling).

Compiling:

* wget http://www.macs.hw.ac.uk/~hwloidl/Courses/F21DP/srcs/WordCount.java
* mkdir wordcount_classes
* javac -classpath "${HADOOP_HOME}/lib/*:${HADOOP_HOME}/share/hadoop/common/*:${HADOOP_HOME}/share/hadoop/common/lib/*" -d wordcount_classes WordCount.java
* jar -cvf wordcount.jar -C wordcount_classes .

Providing input:

* mkdir input
* echo "Hello World Bye World" > input/file01
* echo "Hello Hadoop Goodbye Hadoop" > input/file02

Running (locally):

* java -cp wordcount.jar:${HADOOP_HOME}/*:${HADOOP_HOME}/lib/*:${HADOOP_HOME}/share/hadoop/common/*:${HADOOP_HOME}/share/hadoop/common/lib/* org.myorg.WordCount input output

Running (on the cluster; requires the input files to be uploaded to HDFS
first):

* hadoop jar wordcount.jar org.myorg.WordCount input output

Before running the same job again, you must remove or rename the output
directory, or else Hadoop will complain that it already exists:

* mv output output_old

To run the job on some larger inputs:

* mv input input_old
* mkdir input
* wget http://www.macs.hw.ac.uk/~hwloidl/Courses/F21DP/srcs/WaD.txt
* mv WaD.txt input/file01
* wget http://www.macs.hw.ac.uk/~hwloidl/Courses/F21DP/srcs/TMM.txt
* mv TMM.txt input/file02
* java -cp wordcount.jar:${HADOOP_HOME}/*:${HADOOP_HOME}/lib/*:${HADOOP_HOME}/share/hadoop/common/*:${HADOOP_HOME}/share/hadoop/common/lib/* org.myorg.WordCount input output

Have fun. You may try the rest of the online tutorial, or you may try
something completely different, like sumEuler (a sketch follows the WordCount
listing below).
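
For reference: with the two one-line input files above, the job should produce
a single result file in the output directory (typically part-00000 for a
single-reducer job; check with `ls output'), containing one tab-separated
word/count pair per line:

Bye       1
Goodbye   1
Hadoop    2
Hello     2
World     2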
-------------------------------------------------------------------------------
package org.myorg;

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;

public class WordCount {

    // Mapper: splits each input line into tokens and emits the pair
    // (word, 1) for every token.
    public static class Map extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> output,
                        Reporter reporter) throws IOException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                output.collect(word, one);
            }
        }
    }

    // Reducer (also used as combiner below): sums up all counts
    // collected for a given word.
    public static class Reduce extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {

        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> output,
                           Reporter reporter) throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");

        // Types of the final (key, value) output pairs.
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        // Summing counts is associative and commutative, so the
        // reducer can double as a combiner for local pre-aggregation.
        conf.setMapperClass(Map.class);
        conf.setCombinerClass(Reduce.class);
        conf.setReducerClass(Reduce.class);

        // Plain text in, plain text out.
        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);

        // Input and output directories from the command line.
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
    }
}
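
Since sumEuler is suggested above as a follow-up exercise, here is a minimal
sketch of how it might look as a Map/Reduce job: the mapper computes Euler's
totient phi(n) for every integer n read from the input (one number per line)
and emits it under a single shared key, and the reducer (doubling as combiner)
adds everything up. This is only an illustration, not part of the course
sources: the class name SumEuler, the constant key "sum", the input format and
the brute-force phi are all my choices. It uses the same old-style
org.apache.hadoop.mapred API as WordCount above (release 0.23 also ships the
newer org.apache.hadoop.mapreduce API; don't mix classes from the two
packages).

-------------------------------------------------------------------------------
package org.myorg;

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

// Sketch: sum of Euler's totient function over all integers in the input.
// Input: text files with one positive integer per line (blank lines ignored).
// Output: a single line "sum <total>".
public class SumEuler {

    public static class Map extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, LongWritable> {

        // All pairs share one key, so a single reduce sees every value.
        private static final Text SUM_KEY = new Text("sum");

        private static long gcd(long a, long b) {
            return b == 0 ? a : gcd(b, a % b);
        }

        // Brute-force totient: count the k in [1..n] coprime to n.
        private static long phi(long n) {
            long count = 0;
            for (long k = 1; k <= n; k++)
                if (gcd(n, k) == 1)
                    count++;
            return count;
        }

        public void map(LongWritable key, Text value,
                        OutputCollector<Text, LongWritable> output,
                        Reporter reporter) throws IOException {
            String line = value.toString().trim();
            if (line.isEmpty()) return;
            long n = Long.parseLong(line);
            output.collect(SUM_KEY, new LongWritable(phi(n)));
        }
    }

    public static class Reduce extends MapReduceBase
            implements Reducer<Text, LongWritable, Text, LongWritable> {

        public void reduce(Text key, Iterator<LongWritable> values,
                           OutputCollector<Text, LongWritable> output,
                           Reporter reporter) throws IOException {
            long sum = 0;
            while (values.hasNext())
                sum += values.next().get();
            output.collect(key, new LongWritable(sum));  // grand total
        }
    }

    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(SumEuler.class);
        conf.setJobName("sumeuler");

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(LongWritable.class);

        conf.setMapperClass(Map.class);
        conf.setCombinerClass(Reduce.class);  // summing is associative
        conf.setReducerClass(Reduce.class);

        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
    }
}
-------------------------------------------------------------------------------

It compiles and runs exactly like WordCount above (substitute SumEuler for
WordCount throughout); a test input can be generated with, e.g.,
seq 1 1000 > input/file01.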