SICSA Multicore Challenge


Full details of the challenge: HERE.

Concordance Specification

  • Given Text file containing English text in ASCII encoding. An integer N.
  • Find For all sequences of words, up to length N, occurring in the input file, the number of occurrences of this sequence in the text, together with a list of start indices. Optionally, sequences with only 1 occurrence should be omitted.

  • My Contribution

    A MapReduce approach to parallelizing the Concordance application, implemented using Hadoop.
    Read THESE slides for full details of my MapReduce implementation, a complete set of results varying {reducers, input size, number of nodes, Concordance N}, and the Hadoop configuration used.

    Source Code

  • My Hadoop implementation - here
  • Others: Sequential C, Sequential Haskell, To appear soon: Java with Fork/Join, Python, Groovy, OpenMP, Eden, GpH, C# Patterns & Erlang.

  • Key Result: Throughput Scalability

    A number of sample input files were specified, to benchmark each implementation. These were: WaD2a.txt, WaD2.txt, WaD.txt, bible.txt, ascii100MB.txt and correspond to the graph below, in respective order.


    What does this graph say? I would argue that this quanitifies the claim that Hadoop is a high-throughput, scalable data processing MapReduce framework.

    Implementation Optimizations

    This excercise was useful in identifying the scalability limits of certain Java classes, and also to expose parameters for runtime tuning, depending on input size, and the number of nodes in the cluster.