Go to the first, previous, next, last section, table of contents.

Overview

GranSim is mainly intended to be a research tool for studying the dynamic behaviour of parallel, lazy functional programs. By simulating the parallel execution it is possible to concentrate on the amount and structure of the parallelism exposed by the program without being distracted by `external' details like load of the machine produced by other users or the basic network traffic. However, for obtaining results that are of relevance for a particular machine it is possible to set parameters in GranSim that very accurately model the behaviour of this machine.

Components of the GranSim System

The overall GranSim System consists of the following components:

The GranSim Simulator proper.
Some Visualisation Tools.
The GranSim Toolbox.
The GranSim T-shirt (not available, yet, sorry).

The GranSim simulator is built on top of the runtime-system (RTS) of the Glasgow Haskell Compiler (GHC). Thus, the major part of it is a (rather big) runtime system for GHC. This part is responsible for implementing the event driven simulation of parallelism. For example communication is modelled by adding new features to the appropriate routines in the sequential runtime-system.

For simulating parallelism the generated code has to be instrumented, which is achieved by a slightly modified code generator in GHC. To produce instrumented code the compile time option -gransim of the Haskell compiler has to be used. This is what we mean by saying `compiling under GranSim'. Note that it is not possible to use object file compiled for another setup of GHC to generate a GranSim executable file.

Our approach is similar to that of GUM (a portable parallel implementation of Haskell), which has also been developed in this department. Both systems are basically extensions of the sequential runtime-system. GranSim is also similar to GUM in the way it implements several features like the communication mechansim. To a large extend both systems actually use the same code. See section GranSim vs GUM for a more detailled comparison.

In order to analyse the result of a simulation it is very important to visualise the abundance of data. To this end the GranSim system contains a set of visualisation tools that generate PostScript graphs, focusing on specific aspects of the execution. The most important usage of these tools is the generation of an overall activity profile of the execution.

The system contains a set of small tools, usually scripts processing the output of a simulation. These tools are collected in the so-called GranSim Toolbox.

The final component, the GranSim T-shirt has not been implemented, yet. If you have suggestions, feel free to email me.

Semi-explicit Parallelism

The underlying execution model of GranSim is one of semi-explicit parallelism where expressions that should be evaluated in parallel have to be annotated by using the par annotation. However, it is up to the runtime-system to decide when and where to create a parallel thread. It may even discard potential parallelism altogether.

The communication between the threads is performed via accessing shared data structures. If a thread tries to access a value that is under evaluation, it is put into a blocking queue attached to that closure. When the closure is updated by the result this blocking queue is released. So, the only extension of the sequential evaluation mechanism is the addition of blocking queues.

Because of these characteristics of hiding as many low-level details as possible in the runtime-system we call this model semi-explicit. It is only necessary to mark potential parallelism in the program. All issues related to communication, load balancing etc are handled by the runtime system and can be ignored by the programmer.

For example in the expression

 let 
   squares = [ i^2 | i <- [1..n] ] 
 in 
 squares `par` sum squares

the list of squares will be produced by one thread (the producer) and the result sum will be computed by another thread (the consumer). Note that this producer-consumer structure of the algorithm is solely determined by the data-dependencies in the program. All the programmer has to do is to annotate a named subexpression (in this example squares), which should be computed in parallel. Thread placement, communication and whether the thread should be created at all are up to the runtime-system.

In this introduction we don't go into the operational details of this example. @xref{Example: Sum of Squares} for a discussion how to improve the parallelism of the algorithm.

This model of using par and seq annotations is the same as it is used for the GRIP machine and for the GUM system. In fact, there is a strong correspondence between the GranSim and GUM. This allows to carry results of a GranSim simulation over to a real parallel machine operating under GUM (this issue will be addressed in more detail in section GranSim vs GUM.).

GranSim Modes

. .

This section outlines a methodology for parallelising lazy functional programs. The development and performance tuning of a parallel algorithm typically proceeds in several stages:

Profiling of the sequential algorithm if the goal is the parallelisation of an already existing algorithm. This stage should give an idea about the `big computations' in the program. The programmer can then concentrate on parallelising this part of the algorithm.
Extracting parallelism out of the algorithm. In our model of computation this means adding seq and par annotations to the program (See section Parallelism Annotations, for a discussion of all available annotations).
Restructuring parts of the algorithm to increase the amount of parallelism. This can be called performance tuning on an abstract level where the performance of the algorithm is solely modelled by the amount of parallelism in the program.
Optimising the parallel algorithm for execution on a specific machine. In this stage low-level details have to be addressed.

To facilitate such a process of parallelisation the GHC system provides several tools. In particular GHC and GranSim support different modes reflecting different stages:

The sequential profiling setup of GHC allows to get accurate information about computation time and heap usage of program expressions. Especially for parallelising big sequential programs the profiler gives invaluable information for the proper parallelisation. A more detailed description of the profiler is part of the overall GHC documentation.
The GranSim-Light mode simulates the parallel execution of a program with an unbounded number of processors. In this mode no communication is modelled (i.e. access of remote data is as expensive as access of local data). Thus, this mode mainly shows the amount of parallelism in the program and indicates an upper bound of the parallelism when running the program on a real machine.
In the usual mode GranSim can simulate the execution of up to 32 processors(1) with an exact modelling of communication. Thus, the execution on a particular machine with the chosen characteristics is simulated.
A set of visualisation tools allows to generate activity and granularity profiles of the execution. These tools allow to generate overall profiles as well as per-PE and per-thread profiles (see section Visualisation). This especially facilitates the performance tuning of the program.

A Simple Example Program

. . . .

This section gives an example of how to write a simple parallel program in the parallel, lazy functional execution model. Due to a sad lack of imagination we take our good old friend, nfib, as an example program. This is the sequential function from which we start:

nfib :: Int -> Int
nfib 0 = 1
nfib 1 = 1
nfib n = nf1+nf2+1
         where nf1  = nfib (n-1)
               nf2  = nfib (n-2)

The straightforward idea for parallelising nfib is to start parallel processes for computing both recursive calls. This is expressed with the par annotation, which `sparks' a parallel process for computing its first argument and whose result is its second argument (the exact operational semantics of par is discussed in more detail in chapter section Parallelism Annotations.). This gives the following parallel program:

parfib :: Int -> Int
parfib 0 = 1
parfib 1 = 1
parfib n = nf2 `par` (nf1 `par` (nf1+nf2+1))
           where nf1 = parfib (n-1)
                 nf2 = parfib (n-2)

The drawback of this program is the blocking of the main task on the two created child-tasks. Only when both child tasks have returned a result, the main task can continue. It is more efficient to have one of the recursive calls computed by the main task and to spark only one parallel process for the other recursive call. In order to guarantee that the main expression is evaluated in the right order (i.e. without blocking the main task on the child task) the seq annotation is used:

parfib :: Int -> Int
parfib 0 = 1
parfib 1 = 1
parfib n = nf2 `par` (nf1 `seq` (nf1+nf2+1))
           where nf1 = parfib (n-1)
                 nf2 = parfib (n-2)

The operational reading of this function is as follows: First spark a parallel process for computing the expression parfib (n-2). Then evaluate the expression parfib (n-1). Finally, evaluate the main expression. If the parallel process has not finished by this time the main task will be blocked on nf2. Otherwise it will just pick up the result.

This simple example already shows several basic aspects of parallel, lazy functional programming:

For adding parallelism to the program par annotations on local variables (that appear in the main expression) are used.
To specify evaluation order of the program seq annotations are used. It reduces the first argument (usually a variable occurring in the definition of the second argument) to weak head normal form (WHNF).
For the efficiency of the parallel program it is important to avoid unnecessary blocking during the execution.

Another aspect not shown in this example is the importance of evaluation degree. If the parallel expressions create a compound type (e.g. a list) then it is important that they are evaluated to a certain degree. Otherwise there won't be a lot of parallelism in the program. We will revisit this aspect again in section Parallel Functional Programming in the Large: Strategies.

Finally, here is the total example as a stand-alone program:

module Main where

import Parallel

main = getArgs abort ( \ a -> 
       let 
         n = head ( map (\ a1 -> fst ((readDec a1) !! 0)) a )
       in 
       print ("parfib " ++ (show n) ++ " = " ++ (show (parfib n)) ++ "\n")  )

nfib :: Int -> Int
nfib 0 = 1
nfib 1 = 1
nfib n = nf1+nf2+1
         where nf1 = nfib (n-1)
               nf2 = nfib (n-2)

parfib :: Int -> Int
parfib 0 = 1
parfib 1 = 1
parfib n = nf2 `par` (nf1 `seq` (nf1+nf2+1) )
           where nf1 = parfib (n-1)
                 nf2 = parfib (n-2)

Running the Example Program

. .

In a standard installation of GHC with GranSim enabled the example program can be compiled as usually but with adding the compiler option -gransim. So, if the above program is in file `parfib.hs' compile a GranSim version of the program with

ghc -gransim -fvia-C parfib.hs

GranSim requires a compilation using C code as an intermediate stage. Compiling with the native code generator is not supported.

To simulate the execution on a machine with 4 processors type

./a.out +RTS -bP -bp4

This will generate a granularity profile (`.gr' extension). For getting an overall activity profile of the execution use

gr2ps -O a.out.gr

Use your favourite PostScript viewer to examine the activity profile in `a.out.ps'. You'll get the best results with GNU ghostview and ghostscript version 2.6.1 or later.

Go to the first, previous, next, last section, table of contents.