Title: Parallel Twitter Data Analysis using Hadoop

Proposer: Hans-Wolfgang Loidl

Suggested supervisors: Hans-Wolfgang Loidl

Goal: Implement a big-data-analytics application over Twitter data on the Hadoop software infrastructure, and assess its parallel performance


Big Data computing poses challenges on several fronts. It requires the processing of enormous amounts of data, beyond the computational capabilities of a single commodity server. Therefore, parallel programming technologies need to be applied to perform the computations in acceptable time.

The goal of this project is to use the Hadoop [1] software infrastructure on the departmental Beowulf cluster to perform data analytics on representative, sizable Twitter [2] data. The application should analyse the relationships encoded in the Twitter data to expose non-trivial connections within large data-sets. The application can be implemented either in Java, using the low-level Hadoop API, or in one of the emerging scripting languages supported by the Hadoop framework, such as Pig or Hive. The project should summarise the effort involved in prototyping, transforming and implementing the application, document the fundamental problems encountered (in particular those that would hinder automating this process), and assess the overall performance and scalability of the final, parallel version.
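To give a flavour of the kind of analysis involved, the sketch below shows the MapReduce pattern applied to one simple relationship: counting @-mentions across a set of tweets. This is a hypothetical, plain-Java illustration with no Hadoop dependency; a real implementation would express the same two phases by extending Hadoop's Mapper and Reducer classes, with the framework handling data distribution and shuffling across the cluster.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Plain-Java sketch of the MapReduce pattern for counting @-mentions.
// Class and method names are illustrative, not part of any Hadoop API.
public class MentionCount {

    // "Map" phase: emit a (mention, 1) pair for every @-mention in a tweet.
    static List<Map.Entry<String, Integer>> map(String tweet) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        Matcher m = Pattern.compile("@(\\w+)").matcher(tweet);
        while (m.find()) {
            out.add(new AbstractMap.SimpleEntry<>(m.group(1), 1));
        }
        return out;
    }

    // "Reduce" phase: sum the emitted counts for each key.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            counts.merge(p.getKey(), p.getValue(), Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        String[] tweets = {
            "@alice great talk with @bob",
            "replying to @alice",
        };
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String t : tweets) {
            pairs.addAll(map(t));
        }
        // Sorted counts per mentioned user, e.g. {alice=2, bob=1}
        System.out.println(reduce(pairs));
    }
}
```

In a real Hadoop job the map phase runs in parallel over partitions of the tweet data on different cluster nodes, and the framework groups the intermediate pairs by key before the reduce phase; the serial sketch above only mimics that data flow.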

Resources required: Hadoop on our Beowulf cluster

Degree of difficulty: moderate

Background needed: Good general programming skills; some background in parallel programming (e.g. F21DP)