This page covers the C+MPI tutorials in the course F21DP (Haskell tutorials are here).
Sequential C
As background for the sequential C part, refer to the slides for Lecture 2: C Revision (Part 1) and Lecture 3: C Revision (Part 2), and to the online C Book. Pick up sample sources here.
Practicals
List exercises from lecture 2:
- Write a function node *append(node *x, node *y) that appends the elements of the second list y to the end of the first list x (i.e. append([1,2],[3,4]) ==> [1,2,3,4]).
- How does this affect the list x?
- Write a second version that does not modify the input lists.
- Under which condition is it safe to use the first version?
- Write a function node *reverse(node *x) that reverses the elements in the list (i.e. reverse([1,2,3]) ==> [3,2,1])
Tree exercises from lecture 3:
- Write a function that checks whether a tree is balanced.
- Write a function struct node *readTree(FILE *fin, int n) that reads n integer values from a file and builds a balanced tree.
- Write a function void insert(struct node *t, int n) that inserts a value n into a tree whose leaves are sorted in ascending order.
- For an input array of length n, how much space does such a tree consume?
- Change the tree data structure to use a more efficient representation of the tag.
Sample solutions:
Theory
For each of the following C definitions/declarations, describe the definition/declaration in English and calculate how much space is required for/allocated to an instance.
double a[7];
char b[5];
struct details {int age; float height;};
struct person1 {char name[4]; struct details d;};
struct person2 {char name[5]; struct details d;};
struct person1 people[15];
struct person2 teams[7][9];
struct person1 * persons[9];
struct person2 ** groups;
struct cell {int val; struct cell * left, * right;};
struct cell cells[50];
struct cell * nodes[25];
Suppose cells, defined above, has been fully populated with appropriate values. Explain why it might not be sensible to:
- coerce its content to (char *) and send it as a byte stream over a channel to another processor;
- on the 2nd processor, put the byte stream into an array of char, which is then coerced to (cell *) and copied into an array of cell;
- assume a parallel tree search is implemented by sending the left and right subtrees, together with the search element, to different processors, using the above byte-stream conversion; at which point would the search fail?
NB space allocated to a structure includes padding to align on an appropriate word boundary. This depends on the architecture (32-bit or 64-bit).
Answers (this refers to a 64-bit machine):
- In most cases, the size of the structs is just the sum of the sizes of the components.
The exceptional cases are person2 and cell.
- a: 7*8=56 (double is 8 byte on a 64-bit architecture)
- b: 5*1=5
- struct details: 4+4=8 (size of int doesn't depend on arch)
- struct person1: 4*1+8=12
- struct person2: 5*1+p+8=16 (where p is the padding so that the nested struct details, which is 4-byte aligned, starts on a 4-byte boundary)
- people: 15*12=180
- teams: 7*9*16=1008
- persons: 9*8=72 (pointer is 8 byte on a 64-bit architecture)
- groups: 8 (only 1 pointer)
- struct cell: 4+p+8+8=24 (where p is the padding to word-align the pointer)
- cells: 50*24=1200
- nodes: 25*8=200 (pointer is 8 byte on a 64-bit architecture)
- On transferring cells: The main issue is that all pointers point to locations in the original processor's heap. If the structure is packed as-is, those addresses refer to bogus locations in the heap of the target processor. To do it properly, the pointers should be packed as offsets into the buffer; on unpacking, the offsets are turned back into pointers to the unpacked data. This way the tree structure is properly reconstructed.
C + MPI
Practicals
To go through the C+MPI tutorial, using the matrix multiplication example from class, follow this link to a FAQ.
- Hello world (very easy): Download and run the MPI hello world program from class.
- Multiple guessers (easy): As a gentle introduction to C+MPI, extend the guesser example to several guessers, who pick random values in the known interval. The master needs to distinguish the guessers and announce the winner of the competition.
- Seq matrix multiplication (easy): To get started, download and run the different parallel versions of matrix multiplication.
- Parallel matrix multiplication (moderately challenging): Modify the naive parallel version (matrix4.c) to work with any number of processors. Compare the performance of this version with the final parallel version (matrix8.c) on the large input data, and explain the difference in runtime.
- Parallel matrix multiplication (block-structured) (challenging): Extend matrix8.c (using row clustering) to a version with block clustering, by applying the same clustering approach when sending columns.
- Parallel matrix multiplication (collectives) (fairly challenging): Devise a parallel matrix multiplication algorithm using MPI collective communication (MPI_Bcast, MPI_Scatter, MPI_Gather).
- Exponential function (moderately challenging): Extend the partial solution to the parallel exponential function to a general solution for all possible inputs, checking for convergence of the result and tuning the parallel performance so that the final stages don't dominate the compute time. (seq version)
- Exponential function with GMP arithmetic (a bit more challenging): As above, but instead of starting from the naive sequential implementation, start from this competitive sequential implementation of the exponential function: it uses the GNU GMP library for arbitrary-precision arithmetic, so that the result (e^z) can be computed to arbitrary precision. You supply z and the precision d on the command line (see the comments at the start of the program for explanations). When you study the code, you will see that the overall structure is the same as in the naive version, but all * and + etc. operations are replaced by GMP primitives. Test the program on some inputs of increasing size, and check results and runtimes. The concrete task in this exercise is to parallelise the power_e function, addressing the issues of load balancing, chunking etc. as discussed in class. You can start from the sketch of a parallel solution above, but you will need to convert the internal representation of GMP variables, of type mpq_t etc., into a serialised data structure you can send. As a side remark, you might want to compare this C+GMP implementation with this sequential Haskell implementation of the same algorithm. Note that the core of the implementation is 2 lines of Haskell (or one very long line ;-) compared to ca. 100 lines in C+GMP. This improved Haskell version uses circular data structures to avoid repeated multiplications when computing e^z and n!, reducing the overall runtime significantly.
Theory
- Analyse the processing and communication factors in parallelising a binary search algorithm, using the tree data structure from lecture 3.
- Discuss the deployment of multiple processors to speed up the sorting of a very large file of numbers using C with MPI.
Haskell
Haskell exercises can be found here.
Hadoop
To get started, follow the instructions in the Hadoop FAQ to run a simple word count example program on Hadoop.
After that, generate your own Hadoop implementation of the Euler totient function, using the same structure as in WordCount.java but modifying the map and reduce instances.
Eden
This tutorial is based on the slides from Jost Berthold's Eden presentation at Heriot-Watt, March 2013.
There is a FAQ on Eden usage here, explaining how to compile and run Eden programs.
You can pick up the following Eden sample source code for the exercises:
You can find many more examples, and an in-depth Eden tutorial, on the Eden CEFP tutorial page.