

Internals

This chapter discusses details of the implementation of GranSim.

[THIS TEXT IS TAKEN FROM A DRAFT OF A PREVIOUS PAPER. IT NEEDS UPDATES TO COVER THE MORE RECENT CHANGES LIKE ADDING GRANSIM-LIGHT. SEE ALSO THE HPFC'95 PAPER FOR THE BASICS]

Global Structure

The following picture depicts the global structure of GranSim:

[Picture: Global structure of GranSim]

GranSim performs an event-driven simulation with a centralised event queue. Each of the simulated processors has its own spark queue and thread queue, as well as its own clock. The clocks are synchronised via the event queue, which is sorted by the time stamps of the events.
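As an illustration, the following Haskell sketch (not GranSim's actual implementation; all names are invented) shows the essence of such an event-driven simulation: events carry a time stamp and a processor number, the event queue is kept sorted by time stamp, and processing an event advances the clock of the processor it belongs to.

```haskell
import Data.List (insertBy)
import Data.Ord (comparing)

type Time = Int

-- An event happens at a given time on a given processor.
data Event = Event { time :: Time, pe :: Int, label :: String }
  deriving Show

-- Inserting into the queue preserves the ordering by time stamp.
insertEvent :: Event -> [Event] -> [Event]
insertEvent = insertBy (comparing time)

-- The simulation loop removes the earliest event and advances the
-- clock of the processor that the event belongs to.
run :: [Event] -> [(Int, Time)] -> [(Int, Time)]
run [] clocks = clocks
run (Event t p _ : rest) clocks =
  run rest [ (q, if q == p then max c t else c) | (q, c) <- clocks ]
```

In GranSim itself the event queue also drives communication and scheduling; this sketch captures only the clock-synchronisation aspect.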

Accuracy

GranSim is built on top of GHC, a state-of-the-art compiler for Haskell, and therefore benefits from the huge variety of optimisations that GHC can perform. The instruction count function, which determines the number of instructions executed for a given piece of abstract C code, has been carefully tuned by analysing the assembler code generated by GHC, and the results have been compared with the number of instructions executed in real Haskell programs. These comparisons have shown that the instruction count of the simulation lies within 10% of the real values for arithmetic operations, within 2% for load and store operations, within 20% for branch instructions and within 14% for floating-point instructions.

To allow the modelling of different kinds of architectures, the instructions have been split into five classes with different weights: arithmetic and floating-point operations (1 cycle each), load and store operations (4 cycles each), and branch instructions (2 cycles). These weights model a SPARC processor and have been verified with Haskell programs from the nofib suite. The weights are tunable in order to simulate other kinds of processors.
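The cost model described above amounts to a weighted sum over instruction classes. The following sketch illustrates it; the class names, the weight function and blockCost are illustrative, not GranSim's actual code.

```haskell
-- The five instruction classes distinguished by the cost model.
data InstrClass = Arith | Float | Load | Store | Branch
  deriving (Show, Eq)

-- Default SPARC-like cycle counts, as described in the text;
-- in GranSim these weights are tunable.
weight :: InstrClass -> Int
weight Arith  = 1
weight Float  = 1
weight Load   = 4
weight Store  = 4
weight Branch = 2

-- Cost of a piece of code: the weighted sum of its instruction counts.
blockCost :: [(InstrClass, Int)] -> Int
blockCost counts = sum [ n * weight c | (c, n) <- counts ]
```

For example, a block with 3 arithmetic operations, 2 loads and 1 branch costs 3*1 + 2*4 + 1*2 = 13 cycles under these weights.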

Flexibility

GranSim has a large number of tunable parameters (@xref{RTS Options}). One set of parameters makes it possible to model a broad variety of processors. To achieve an accuracy similar to that obtained for SPARC processors, the same kind of measurements can be performed. Another set of parameters tunes the costs for communication latency, message pack time etc. Finally, the overhead imposed by the runtime system can be captured by adjusting the parameters for task creation time, context switch time etc. This allows us to model architectures ranging from shared-memory to distributed-memory machines, and we can quite easily simulate the effect of small changes in the runtime system on the overall behaviour.
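Conceptually, these parameter sets can be thought of as a record of machine costs. The sketch below is purely illustrative: the field names and values are invented here, since the real parameters are set via RTS flags (@xref{RTS Options}).

```haskell
-- A hypothetical bundle of cost parameters, all in simulated cycles.
data SimParams = SimParams
  { latency       :: Int  -- communication latency
  , packTime      :: Int  -- time to pack a message
  , unpackTime    :: Int  -- time to unpack a message
  , threadCreate  :: Int  -- task creation overhead
  , contextSwitch :: Int  -- context switch overhead
  } deriving Show

-- A shared-memory-like machine has cheap communication; a
-- distributed-memory machine is modelled simply by raising the
-- communication costs (the numbers here are made up).
sharedMem, distMem :: SimParams
sharedMem = SimParams 10 1 1 10 10
distMem   = SimParams 1000 100 100 10 10
```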

Additionally, the GranSim-Light setup makes it possible to study the parallelism inherent in a program, rather than choosing a fixed architecture on which to run the program. Our experience with parallelising rather large programs shows that this is an important additional feature for tuning the performance of a parallel program.

Visualisation

Visualisation tools are needed especially for analysing the granularity of the created tasks. We have developed a set of tools for that purpose and used it on a wide range of example programs to analyse their performance (see section The Parallel NoFib Suite). With these tools we were able to show a very high correlation between execution time and the amount of allocated heap space. Apart from these tools, which show the granularity of the resulting tasks, we have also developed a set of tools for showing the activity of the simulated machine as a whole, of all processors and of all tasks in the system. These tools proved indispensable in the parallelisation and optimisation of a linear system solver. Based on the experience from this implementation, we believe such tools are essential when working with a lazy language, in which the order of evaluation is not at all obvious from the program source. See section Visualisation Tools for a more detailed discussion of these tools.

Efficiency


The high accuracy of the simulation also implies a rather high overhead, as more bookkeeping has to be done. This is mainly due to the exact modelling of communication; therefore, programs that perform a lot of communication impose a higher overhead on the simulation. Early measurements of some example programs showed a factor between 14 and more than 100 between a GranSim simulation and an optimised sequential execution. In the meantime we have spent some effort on improving the efficiency of the simulator, and the most recent measurements show factors between 10 and 15. Given the amount of information we generate from a simulation, we think this is an acceptable factor. A comparison with HBCPP, another simulator for annotated Haskell programs, is of particular interest: on a set of small example programs GranSim was between 1.5 and 2 times slower than HBCPP, and in some cases the GranSim-Light setup was about as fast as the HBCPP version.

We have observed one potential problem with the GranSim-Light setup, though. If the parallel program creates a huge number of parallel threads (several thousand), the bookkeeping for all these running threads slows down the simulation significantly. This can make the GranSim-Light setup slower than the plain GranSim setup, which has a limited number of running threads. In such a situation it is advisable to increase the time slice for each thread in the simulation (-bwN option).

Integration into GHC

GranSim is part of the overall GHC system. Therefore, all the features of compilation via GHC can be used with GranSim, too. In particular, the ccall mechanism can be used to call C functions in a parallel program (indeed, we have experimented with the use of a computer algebra library in implementing a parallel resultant algorithm). The interaction between certain optimisations, such as deforestation, and the parallel execution of a program should also be of interest (deforestation might eliminate intermediate lists that are crucial for the parallel execution of the program).
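For illustration, here is a small example of the ccall mechanism in its modern foreign import ccall form (older GHC versions used the _ccall_ syntax); sin from the C maths library merely stands in for a call into a computer algebra library, and the Haskell names are our own choice.

```haskell
{-# LANGUAGE ForeignFunctionInterface #-}

-- Import the C function sin from the maths library under the
-- (freely chosen) Haskell name c_sin.
foreign import ccall "math.h sin" c_sin :: Double -> Double

-- The imported function can be used like any Haskell function,
-- and could equally be called from inside a sparked computation.
halfPiSine :: Double
halfPiSine = c_sin (pi / 2)
```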

Finally, in parallelising the LOLITA natural language processing system it was crucial to have a profiler available for the sequential version of the program. This allowed us to determine the important parts of the computation early in the parallelisation process.

