Title: Parallel Twitter Data Analysis using Hadoop
Proposer: Hans-Wolfgang Loidl
Suggested supervisors: Hans-Wolfgang Loidl
Goal: Implement and assess the performance of a big-data-analytics application, using Twitter data, on the Hadoop software infrastructure for parallel pattern computation
Big Data computing poses challenges on several fronts. It requires the processing of enormous amounts of data, which is beyond the computational capabilities of commodity server hardware. Therefore, parallel programming technologies need to be applied to perform the computations in time.
The goal of this project is to use the Hadoop  software infrastructure on the departmental Beowulf cluster in order to perform data-analytics on representative, sizable Twitter  data. The application should perform an analysis of the relationships encoded in the Twitter data to expose non-trivial connections from large data-sets. The application can be implemented either in Java, using the low-level Hadoop API, or in one of the emerging scripting languages supported by the Hadoop framework, such as Pig or Hive. This project should summarise the effort involved in prototyping, transforming and implementing the initial application, fundamental problems encountered in this project, which might be problematic in automatising this process, and assess the overall performance and scalability of the final, parallel version.
Resources required: Hadoop on our Beowulf cluster
Degree of difficulty: moderate
Background needed: Good general programming skills; some background on parallel programming (e.g. F21DP)