Title: Parallel Big-data Computation on Hadoop

Proposer: Hans-Wolfgang Loidl

Suggested supervisors: Hans-Wolfgang Loidl

Goal: Implement a typical big-data application on the Hadoop software infrastructure for pattern-based parallel computation, and assess its performance.


Big Data computing poses challenges on several fronts. It requires the processing of enormous amounts of data, which exceeds the computational capabilities of a single commodity server. Therefore, parallel programming technologies need to be applied to complete the computations in acceptable time.

The goal of this project is to use the Hadoop [1] software infrastructure on the departmental Beowulf cluster to implement a typical, data-intensive application. Potential application domains are high-performance scientific computation and bioinformatics. The application can be implemented either in Java, using the low-level Hadoop API, or in one of the higher-level languages supported by the Hadoop framework, such as Pig or Hive. The project should summarise the effort involved in prototyping, transforming and implementing the initial application, identify fundamental problems encountered along the way that might hinder automating this process, and assess the overall performance and scalability of the final, parallel version.
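
As a rough illustration of the performance and scalability assessment, the sketch below computes the two standard measures, speedup (runtime on one node divided by runtime on n nodes) and parallel efficiency (speedup divided by n), from a table of runtimes. The class name ScalingReport, the method names and the runtimes are hypothetical placeholders, not measurements from the cluster or part of the Hadoop API.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ScalingReport {

    // Speedup: baseline (1-node) runtime divided by the runtime on n nodes.
    static double speedup(double baselineSeconds, double parallelSeconds) {
        return baselineSeconds / parallelSeconds;
    }

    // Parallel efficiency: speedup divided by the node count (1.0 is ideal).
    static double efficiency(double baselineSeconds, double parallelSeconds, int nodes) {
        return speedup(baselineSeconds, parallelSeconds) / nodes;
    }

    public static void main(String[] args) {
        // Hypothetical runtimes in seconds, keyed by node count.
        // These are placeholder values, not real measurements.
        Map<Integer, Double> runtimes = new LinkedHashMap<>();
        runtimes.put(1, 120.0);
        runtimes.put(2, 65.0);
        runtimes.put(4, 36.0);
        runtimes.put(8, 22.0);

        double baseline = runtimes.get(1);
        for (Map.Entry<Integer, Double> entry : runtimes.entrySet()) {
            int nodes = entry.getKey();
            double seconds = entry.getValue();
            System.out.printf("%d nodes: speedup %.2f, efficiency %.2f%n",
                    nodes,
                    speedup(baseline, seconds),
                    efficiency(baseline, seconds, nodes));
        }
    }
}
```

In practice the runtimes would come from repeated, timed runs of the Hadoop job on a varying number of cluster nodes. An efficiency that drops well below 1.0 as nodes are added points to overheads such as data shuffling or load imbalance.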

Resources required: Hadoop on our Beowulf cluster

Degree of difficulty: moderate

Background needed: Good general programming skills; some background in parallel programming (e.g. F21DP)