SICSA Multicore Challenge
Concordance Specification
My Contribution
|
A MapReduce approach to parallelizing the Concordance application, implemented using Hadoop.
Read THESE slides for full details of my MapReduce implementation, a complete set of results varying {reducers, input size, number of nodes, Concordance N}, and the Hadoop configuration used. |
|
Source Code
Key Result: Throughput Scalability
A number of sample input files were specified, to benchmark each implementation. These were: WaD2a.txt, WaD2.txt, WaD.txt, bible.txt, ascii100MB.txt and correspond to the graph below, in respective order.
What does this graph say? I would argue that this quanitifies the claim that Hadoop is a high-throughput, scalable data processing MapReduce framework.
Implementation Optimizations
This excercise was useful in identifying the scalability limits of certain Java classes, and also to expose parameters for runtime tuning, depending on input size, and the number of nodes in the cluster.- Don't use java.lang.String if you are to concatenate output values, many times, for a given key. Use java.lang.StringBuilder instead. This is summarized nicely here.
- Think carefully about the number of map records you want to produce, to be reduced. For the concordance application, if you split a text file as one map per line, then the mapping operation will complete relatively quickly, but this produces millions of records for large files.
- Expose Hadoop APIs as runtime options
- Combiner functions At the very least, expose a combiner as an option to be used at runtime. These operate on the mapping nodes, whilst the key/value pairs are still in memory. It can in certain cases (i.e. when mapping buffer size is large) be beneficial to runtime performance at the reducing stage.
- Data compression Allows the user to specify that the map outputs to be compressed before being sent to the reduce node. This would be more useful, probably, for image processing or such like.