Info file: user.info, -*-Text-*-
produced by `texinfo-format-buffer'
from file `user.texi'
using `texinfmt.el' version 2.32 of 19 November 1993.

The GranSim User's Guide discusses how to use the GranSim simulator for
the parallel execution of Haskell programs.

Copyright (C) 1994 -- 1996, Hans-Wolfgang Loidl
for the GRASP/AQUA Project, Glasgow University


File: user.info, Node: Internals of GranSim, Next: GranSim vs GUM, Prev: Parallel NoFib Suite, Up: Top

Internals
*********

This chapter discusses details of the implementation of GranSim.

[THIS TEXT IS TAKEN FROM A DRAFT OF A PREVIOUS PAPER. IT NEEDS UPDATES
TO COVER THE MORE RECENT CHANGES LIKE ADDING GRANSIM-LIGHT. SEE ALSO
THE HPFC'95 PAPER FOR THE BASICS]

* Menu:

* Global Structure::
* Accuracy::
* Flexibility::
* Visualisation::
* Efficiency::
* Integration into GHC::


File: user.info, Node: Global Structure, Next: Accuracy, Prev: Internals of GranSim, Up: Internals of GranSim

Global Structure
================

GranSim performs an event-driven simulation with a centralised event
queue. Each of the simulated processors has its own spark queue and
thread queue as well as its own clock. The clocks are synchronised via
the event queue, which is sorted by the time stamps of the events.


File: user.info, Node: Accuracy, Next: Flexibility, Prev: Global Structure, Up: Internals of GranSim

Accuracy
========

GranSim is built on top of, and therefore makes use of, a
state-of-the-art compiler for Haskell, including the large variety of
optimisations it can perform. The instruction count function, which
determines the number of instructions executed for a given piece of
abstract C code, has been carefully tuned by analysing the assembler
code generated by GHC, and the results have been compared with the
number of instructions executed in real Haskell programs.
These comparisons have shown that the instruction count of the
simulation lies within 10% of the real values for arithmetic
operations, within 2% for load and store operations, within 20% for
branch instructions and within 14% for floating point instructions. To
allow the modelling of different kinds of architectures, the
instructions have been split into five classes with different weights:
arithmetic operations and floating point operations (1 cycle each),
load and store operations (4 cycles each) and branch instructions (2
cycles each). These weights model a SPARC processor and have been
verified with Haskell programs from the nofib suite. The weights are
tunable in order to simulate other kinds of processors.


File: user.info, Node: Flexibility, Next: Visualisation, Prev: Accuracy, Up: Internals of GranSim

Flexibility
===========

GranSim has a large number of tunable parameters (*Note RTS Options::).
One set of parameters models a broad variety of processors. To achieve
an accuracy similar to that obtained for SPARC processors, the same
kind of measurements can be performed. Another set of parameters tunes
the costs for communication latency, message pack time etc. Finally,
the overhead imposed by the runtime-system can be captured by adjusting
the parameters for task creation time, context switch time etc. This
allows us to model architectures ranging from shared memory to
distributed memory machines, and we can quite easily simulate the
effect of small changes in the runtime system on the overall behaviour.
Additionally, the GranSim-Light setup makes it possible to study the
parallelism inherent in a program rather than choosing a fixed
architecture to run the program on. Our experience with parallelising
rather large programs shows that this is an important additional
feature for tuning the performance of a parallel program.
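To make the role of these parameters concrete, the per-class
instruction weights from the Accuracy section can be pictured as a
small cost model. The following Haskell fragment is only an
illustrative sketch, not GranSim's actual implementation; the names
(`InstrClass', `Weights', `cycles') are invented for this example, and
only the default cycle counts (1/1/4/4/2) come from the text above.

     -- The five instruction classes distinguished by the simulator.
     data InstrClass = Arith | Float | Load | Store | Branch
       deriving (Eq, Show)

     -- Tunable weights (in cycles) for each instruction class.
     data Weights = Weights
       { wArith, wFloat, wLoad, wStore, wBranch :: Int }

     -- Default weights modelling a SPARC processor, as quoted above.
     sparcWeights :: Weights
     sparcWeights = Weights { wArith = 1, wFloat = 1
                            , wLoad  = 4, wStore = 4, wBranch = 2 }

     weight :: Weights -> InstrClass -> Int
     weight w Arith  = wArith w
     weight w Float  = wFloat w
     weight w Load   = wLoad w
     weight w Store  = wStore w
     weight w Branch = wBranch w

     -- Total simulated time for a sequence of counted instructions.
     cycles :: Weights -> [InstrClass] -> Int
     cycles w = sum . map (weight w)

Retargeting the simulator to another processor then amounts to
supplying a different `Weights' record: the same instruction stream
simply yields a different simulated time.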
File: user.info, Node: Visualisation, Next: Efficiency, Prev: Flexibility, Up: Internals of GranSim

Visualisation
=============

Visualisation tools are needed especially for analysing the granularity
of the created tasks. We have developed a set of tools for that purpose
and used it on a wide range of example programs to analyse their
performance (*Note Parallel NoFib Suite::). With these tools we were
able to show a very high correlation between execution time and the
amount of allocated heap space. Apart from these tools, which show the
granularity of the resulting tasks, we have also developed a set of
tools for showing the activity of the simulated machine as a whole, of
all processors and of all tasks in the system. These tools have proven
indispensable in the parallelisation and optimisation of a linear
system solver. Based on the experience from this implementation, such
tools are essential when working with a lazy language, in which the
order of evaluation is not at all obvious from the program source.
*Note Visualisation Tools:: for a more detailed discussion of these
tools.


File: user.info, Node: Efficiency, Next: Integration into GHC, Prev: Visualisation, Up: Internals of GranSim

Efficiency
==========

The high accuracy of the simulation also implies a rather high
overhead, as more bookkeeping has to be done. This is mainly due to the
exact modelling of communication; therefore, programs that perform a
lot of communication impose a higher overhead on the simulation. Early
measurements of some example programs showed slowdowns of a factor
between 14 and more than 100 for a GranSim simulation compared to an
optimised sequential execution. In the meantime we have spent some
effort on improving the efficiency of the simulator, and the most
recent measurements show factors between 10 and 15. Considering the
amount of information we generate from a simulation, we think this is
an acceptable factor.
A comparison with HBCPP, another simulator for annotated Haskell
programs, is of particular interest. On a set of small example
programs, GranSim was between 1.5 and 2 times slower than HBCPP; in
some cases the GranSim-Light setup was about as fast as the HBCPP
version. We observed one potential problem with the GranSim-Light
setup, though: if the parallel program creates a huge number of
parallel threads (several thousand), the bookkeeping for all these
running threads slows down the simulation significantly. This might
make the GranSim-Light setup slower than the plain GranSim setup, which
has a limited number of running threads. In such a setup it is
advisable to increase the time slice for each thread in the simulation
(-bwN option).


File: user.info, Node: Integration into GHC, Prev: Efficiency, Up: Internals of GranSim

Integration into GHC
====================

GranSim is part of the overall GHC system. Therefore, all the features
of a compilation via GHC can be used for GranSim, too. In particular,
the ccall mechanism can be used to call C functions in a parallel
program (indeed, we have experimented with the use of a computer
algebra library in implementing a parallel resultant algorithm). The
interaction between certain optimisations, like deforestation, and the
parallel execution of a program should also be of interest
(deforestation might eliminate intermediate lists that are crucial for
the parallel execution of the program). Finally, in parallelising the
LOLITA natural language processing system it was crucial to have a
profiler for the sequential version of the program available. This
allowed us to determine the important parts of the computation early on
in the parallelisation process.


File: user.info, Node: GranSim vs GUM, Next: Future Extensions, Prev: Internals of GranSim, Up: Top

GranSim vs GUM
**************

[MAINLY A PLACEHOLDER FOR A REAL COMPARISON]

In a nutshell: The GUM runtime-system is very close to GranSim.
To a fairly large degree the same code is used by both versions. The
main source of inaccuracy is the modelling of stealing sparks and
threads. GUM uses a fishing mechanism: a fish message is sent from
processor to processor, looking for sparks. In GranSim we use the
global knowledge of the system and weigh the costs for stealing a spark
with the inverse of the probability of hitting a processor with
available sparks. Furthermore, GranSim has no counterpart of GUM's
FETCHME closures, which act as pointers to remote data.

On the other hand, GranSim is also significantly more powerful than
GUM. It supports thread migration, offers different strategies for
packing a graph (based on the number of thunks), and it allows a choice
between different fetching strategies, which determine what to do while
one thread fetches remote data. These features mainly deal with the
aspect of data locality in the program. Another important aspect for
the performance is the granularity of the parallel program. This can be
tuned by choosing among three basic methods for granularity control
(*Note Granularity Control Mechanisms::).


File: user.info, Node: Future Extensions, Next: Bug Reports, Prev: GranSim vs GUM, Up: Top

Future Extensions
*****************

This version of GranSim should be fairly stable. It has been stress
tested on a number of non-trivial programs (including a program of more
than 100,000 lines of Haskell and a wee bit of C). It contains most of
the features I want to have in it. One future extension of GranSim
might be the implementation of a more accurate modelling of the
spark/thread stealing mechanism (based on GUM's fishing model).


File: user.info, Node: Bug Reports, Next: Determinant (Example), Prev: Future Extensions, Up: Top

Bug Reports
***********

Send all bug reports, including the source of the failing program, the
compile and RTS options and the error message, to
hwloidl@dcs.gla.ac.uk. Be sure to add "GranSim Bug" in the subject
line.
Earlier versions of GranSim sometimes had problems when using the
generational garbage collector (the default). If you run into problems
with GranSim you can try the following:

* Use the option -Sstderr to get garbage collection messages. This will
  tell you whether there is any garbage collection going on.

* Increase the heap size. This should give you a hint whether the
  problem is related to garbage collection at all.

* Try the runtime-system option -F2s to force two-space garbage
  collection. This garbage collector uses more heap than the
  generational one. Whenever I had problems with a test program before,
  they went away with this garbage collector.

* Perhaps also use -Z (this omits update frame squeezing; you don't
  have to worry about what that means, though).

In any case, please report it as a bug to the above address.


File: user.info, Node: Determinant (Example), Next: Options Index, Prev: Bug Reports, Up: Top

Example Program: Determinant Computation
****************************************

The example program in this chapter is a parallel determinant
computation. It uses the data type SqMatrix a to represent a matrix as
a list of lists together with its bounds. To obtain a parallel version
of the program, strategies (*Note A Class of Strategies::) are used.
Thus, much of the parallelism comes from applying the parList strategy.
In order to demonstrate the use of strategies in a parallel functional
program, this example contains parallelism down to a very fine
granularity.

     determinant :: (NFData a, Integral a) => SqMatrix a -> a
     determinant (SqMatrixC ((iLo,jLo),(iHi,jHi)) mat)
      | jHi-jLo+1 == 1 = let [[mat_1_1]] = mat
                         in  mat_1_1
      | jHi-jLo+1 == 2 = let [[mat_1_1,mat_1_2],
                              [mat_2_1,mat_2_2] ] = mat
                             a = mat_1_1 * mat_2_2
                             b = mat_1_2 * mat_2_1
                             strategy r = a `par` b `par` ()
                         in  a - b `using` strategy
      | otherwise = sum (l_par `using` parList rnf)
        where
          newLine _ []   = []
          newLine j line = pre ++ post `using` strategyN
            where
              pre  = [ line !! (k-1) | k <- [jLo..j-1] ]
              post = [ line !! (k-1) | k <- [j+1..jHi] ]
              strategyN r = pre `par` post `par` ()

          determine1 j = (if pivot > 0
                            then sign*pivot*det' `using` strategyD1
                            else 0) `using` sPar sign
            where
              sign  = if (even (j-jLo)) then 1 else -1
              pivot = (head mat) !! (j-1)
              mat'  = SqMatrixC ((iLo,jLo),(iHi-1,jHi-1))
                                (map (newLine j) (tail mat))
              det'  = determinant mat'
              strategyD1 r = parSqMatrix (parList rwhnf) mat' `seq`
                             det' `par` ()

          l_par = map determine1 [jLo..jHi]


File: user.info, Node: Options Index, Next: Concept Index, Prev: Determinant (Example), Up: Top

Runtime-System Options Index
****************************

* Menu:

* -baN: Communication Parameters.
* -bAN: Processor Characteristics.
* -bBN: Processor Characteristics.
* -bC: Miscellaneous Options.
* -bcN: Runtime-System Parameters.
* -bDE: Debugging Options.
* -bdN: Runtime-System Parameters.
* -be: Miscellaneous Options.
* -bfN: Miscellaneous Options.
* -bFN: Processor Characteristics.
* -bG: Bulk Fetching.
* -bgN: Communication Parameters.
* -bh: Basic Options.
* -bHN: Processor Characteristics.
* -bI: Granularity Control Mechanisms.
* -bKN: Granularity Control Mechanisms.
* -blN: Communication Parameters.
* -bLN: Processor Characteristics.
* -bM: Migration.
* -bmN: Communication Parameters.
* -bN: Miscellaneous Options.
* -bnN: Runtime-System Parameters.
* -bON: Granularity Control Mechanisms.
* -bP: Basic Options.
* -bpN: Basic Options.
* -bQN: Bulk Fetching.
* -bqN: Runtime-System Parameters.
* -brN: Communication Parameters.
* -bs: Basic Options.
* -bSN: Processor Characteristics.
* -bT: Miscellaneous Options.
* -btN: Runtime-System Parameters.
* -buN: Runtime-System Parameters.
* -bwN: Miscellaneous Options.
* -bxN: Communication Parameters.
* -bXX: Granularity Control Mechanisms.
* -byN: Asynchronous Communication.
* -bYN: Granularity Control Mechanisms.
* -bZ: Asynchronous Communication.
* -oN: General GHC RTS Options.
* -QN: Bulk Fetching.
* -SF: General GHC RTS Options.


File: user.info, Node: Concept Index, Prev: Options Index, Up: Top

Concept Index
*************

* Menu:

* ACQUIRED event: Spark Events.
* Asynchronous Communication: Asynchronous Communication.
* BLOCK event: Basic Thread Events.
* Bucket statistics: Bucket Graphs.
* Bulk Fetching: Bulk Fetching.
* Compiling under GranSim: Running it.
* Cumulative statistics: Cumulative Graphs.
* DESCHEDULE event: Basic Thread Events.
* Determinant computation (parallel): Determinant (Example).
* Distributed memory setup: Distributed Memory Machines (DMMs).
* Distributed memory setup: Strongly Connected DMMs.
* Emacs mode for GranSim profiles: The Emacs GranSim Profile Mode.
* Emacs support: The Emacs GranSim Profile Mode.
* Emacs support (highlighting): Features.
* Emacs support (narrowing): Features.
* END event: END Events.
* Evaluation degree: Sum of Squares (Example).
* Events: Body.
* Example: Determinant (Example).
* Example: Example.
* Example: Parallel NoFib Suite.
* Example: Sum of Squares (Example).
* EXPORTED event: Spark Events.
* FETCH event: Communication Events.
* Fetching Strategies: Asynchronous Communication.
* FETCHME closures: GranSim vs GUM.
* Forcing functions: Sum of Squares (Example).
* Garbage collectors: General GHC RTS Options.
* GHC: Components.
* `gr2ap': Activity Profiles.
* `gr2pe': Activity Profiles.
* `gr2ps': Activity Profiles.
* GranSim profile mode: The Emacs GranSim Profile Mode.
* GranSim-Light: GranSim Modes.
* Granularity Control Mechanisms: Granularity Control Mechanisms.
* GUM: GranSim vs GUM.
* HBCPP: Efficiency.
* Highlighting in GranSim profile mode: Features.
* Histogram: Bucket Graphs.
* Ideal setup: The Ideal Machine.
* Incremental Fetching: Bulk Fetching.
* Migration: Migration.
* Narrowing in GranSim profile mode: Features.
* nfib: Example.
* Node: General Structure.
* Overview: Overview.
* Packing Strategies: Bulk Fetching.
* par: Basic Annotations.
* Parallel determinant computation: Determinant (Example).
* Parallel module: Example.
* Parallel nofib suite: Parallel NoFib Suite.
* parAt: Advanced Annotations.
* parAtAbs: Experimental Annotations.
* parAtForNow: Advanced Annotations.
* parAtRel: Experimental Annotations.
* parfib: Example.
* parGlobal: Advanced Annotations.
* parLocal: Advanced Annotations.
* Producer consumer parallelism: Sum of Squares (Example).
* Profiling: GranSim Modes.
* PRUNED event: Spark Events.
* Quicksort: Quicksort (Example).
* REPLY event: Communication Events.
* RESUME event: Basic Thread Events.
* RESUME(Q) event: Basic Thread Events.
* Running an example program: Running it.
* SCHEDULE event: Basic Thread Events.
* seq: Basic Annotations.
* Setups: Specific Setups.
* Setups (distributed memory): Distributed Memory Machines (DMMs).
* Setups (distributed memory): Strongly Connected DMMs.
* Setups (ideal): The Ideal Machine.
* Setups (shared memory): Shared Memory Machines.
* Shared memory setup: Shared Memory Machines.
* Spark: Basic Annotations.
* SPARK event: Spark Events.
* SPARKAT event: Spark Events.
* Stack size: General GHC RTS Options.
* START event: Basic Thread Events.
* START(Q) event: Basic Thread Events.
* STEALING event: Thread Migration Events.
* STOLEN event: Thread Migration Events.
* STOLEN(Q) event: Thread Migration Events.
* Strategies: Strategies.
* Sum of squares (parallel): Sum of Squares (Example).
* Synchronous Communication: Asynchronous Communication.
* SYSTEM_END event: Debugging Events.
* SYSTEM_START event: Debugging Events.
* Thread id: General Structure.
* Thread migration: Migration.
* Thread stealing: Migration.
* Thunk: Bulk Fetching.
* Time stamps: General Structure.