Info file: user.info, -*-Text-*- produced by `texinfo-format-buffer' from file `user.texi' using `texinfmt.el' version 2.32 of 19 November 1993. The GranSim User's Guide discusses how to use the GranSim simulator for the parallel execution of Haskell programs. Copyright (C) 1994 -- 1996, Hans-Wolfgang Loidl for the GRASP/AQUA Project, Glasgow University  File: user.info, Node: Top, Next: Quick Intro, Prev: (dir), Up: (dir) GranSim User's Guide ******************** This user's guide describes how to use GranSim for simulating the parallel execution of (annotated) Haskell programs. In passing we will discuss how to write parallel, lazy functional programs and how to tune their performance. To this end, some visualisation tools for generating activity and granularity profiles of the execution will be discussed. A set of example programs demonstrates the use of GranSim. GranSim is part of the Glasgow Haskell Compiler (GHC), in fact it is a special setup of GHC, which uses a slightly modified compiler (for instrumenting the code) and an extended runtime-system. For users who are already familiar with the GHC and parallel functional programming in general there is a quick introduction to GranSim available (*Note Quick Intro::). * Menu: * Quick Intro:: * Overview:: * Setting-up GranSim:: * Parallelism Annotations:: * Runtime-System Options:: * Strategies:: * Visualisation Tools:: * GranSim Profiles:: * Parallel NoFib Suite:: * Internals of GranSim:: * GranSim vs GUM:: * Future Extensions:: * Bug Reports:: * Determinant (Example) :: * Options Index:: * Concept Index::  File: user.info, Node: Quick Intro, Next: Overview, Prev: Top, Up: Top A Quick Introduction to GranSim ******************************* If you already know how to compile a Haskell program with GHC and if you have an installed version of GranSim available there are only a few changes necessary to simulate parallel execution. Basically, a compile time flag has to be used to generate instrumented code. Runtime-system flags then control the behaviour of the simulation. 1. Compile all modules with the additional options -gransim and -fvia-C. Use -gransim also when linking object files. For example ghc -gransim -fvia-C -o foo foo.hs creates a GranSim executable file foo. 2. When running the program use the runtime-system option -bP to generate a full GranSim profile (*Note Types of GranSim Profiles::). *Note Runtime-System Options:: for a description of all options that allow you to control the behaviour of the simulated parallel architecture. For example ./foo +RTS -bP -bp16 -bl400 starts a simulation for a machine with 16 processors and a latency of 400 machine cycles. It generates a GranSim profile `foo.gr' . 3. Use one of the visualisation tools (*Note Visualisation Tools::) to examine the behaviour of the program. The first bet is to use `gr2ps', which generates a graph showing the overall activity of the machine in a global picture. For example gr2ps -O foo.gr generates an activity profile as a colour PostScript file `foo.ps'. It shows how many threads in total have been running, runnable (but not running), blocked (on data under evaluation), fetching (remote data) and migrating (to another processor) at each point during the execution. Other tools you might want to try are `gr2pe' (giving a per-PE activity profile) and `gr2ap' (giving a per-thread activity profile). Additionally, another set of visualisation tools allows to focus on the granularity of the generated threads. 
The most important one is `gr2gran', which generates bucket statistics showing the runtime of the individual threads (*Note Granularity Profiles::).


File: user.info, Node: Overview, Next: Setting-up GranSim, Prev: Quick Intro, Up: Top

Overview
********

GranSim is mainly intended to be a research tool for studying the dynamic behaviour of parallel, lazy functional programs. By simulating the parallel execution it is possible to concentrate on the amount and structure of the parallelism exposed by the program without being distracted by `external' details like the load of the machine produced by other users or the basic network traffic. However, for obtaining results that are of relevance for a particular machine it is possible to set parameters in GranSim that very accurately model the behaviour of this machine.

* Menu:

* Components::
* Semi-explicit Parallelism::
* GranSim Modes::
* Example::
* Running it::


File: user.info, Node: Components, Next: Semi-explicit Parallelism, Prev: Overview, Up: Overview

Components of the GranSim System
================================

The overall GranSim System consists of the following components:

   * The *GranSim Simulator* proper.
   * Some *Visualisation Tools*.
   * The *GranSim Toolbox*.
   * The *GranSim T-shirt* (not available yet, sorry).

The GranSim simulator is built on top of the "runtime-system" (RTS) of the Glasgow Haskell Compiler ("GHC"). Thus, the major part of it is a (rather big) *runtime system for GHC*. This part is responsible for implementing the event driven simulation of parallelism. For example, communication is modelled by adding new features to the appropriate routines in the sequential runtime-system.

For simulating parallelism the generated code has to be instrumented, which is achieved by a slightly modified code generator in GHC. To produce instrumented code the compile time option -gransim of the Haskell compiler has to be used. This is what we mean by saying `compiling under GranSim'. Note that it is not possible to use object files compiled for another setup of GHC to generate a GranSim executable file.

Our approach is similar to that of "GUM" (a portable parallel implementation of Haskell), which has also been developed in this department. Both systems are basically extensions of the sequential runtime-system. GranSim is also similar to GUM in the way it implements several features like the communication mechanism. To a large extent both systems actually use the same code. *Note GranSim vs GUM:: for a more detailed comparison.

In order to analyse the result of a simulation it is very important to visualise the abundance of data. To this end the GranSim system contains a set of *visualisation tools* that generate PostScript graphs, focusing on specific aspects of the execution. The most important usage of these tools is the generation of an overall activity profile of the execution.

The system contains a set of small tools, usually scripts processing the output of a simulation. These tools are collected in the so-called *GranSim Toolbox*.

The final component, the *GranSim T-shirt*, has not been implemented yet. If you have suggestions, feel free to email me.


File: user.info, Node: Semi-explicit Parallelism, Next: GranSim Modes, Prev: Components, Up: Overview

Semi-explicit Parallelism
=========================

The underlying execution model of GranSim is one of semi-explicit parallelism where expressions that should be evaluated in parallel have to be annotated by using the par annotation.
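For instance, with a hypothetical function f and a consumer g, such an annotation could look as follows (a minimal sketch; the annotations themselves are described in detail in *Note Parallelism Annotations::):

     let x = f a
     in  x `par` g x

Here a spark for evaluating x is created before the evaluation of g x starts.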
However, it is up to the runtime-system to decide when and where to create a parallel thread. It may even discard potential parallelism altogether. The communication between the threads is performed via accessing shared data structures. If a thread tries to access a value that is under evaluation, it is put into a blocking queue attached to that closure. When the closure is updated with the result this blocking queue is released. So, the only extension of the sequential evaluation mechanism is the addition of blocking queues.

Because this model hides as many low-level details as possible in the runtime-system we call it semi-explicit. It is only necessary to mark potential parallelism in the program. All issues related to communication, load balancing etc. are handled by the runtime system and can be ignored by the programmer.

For example, in the expression

     let squares = [ i^2 | i <- [1..n] ]
     in  squares `par` sum squares

the list of squares will be produced by one thread (the "producer") and the resulting sum will be computed by another thread (the "consumer"). Note that this producer-consumer structure of the algorithm is solely determined by the data-dependencies in the program. All the programmer has to do is to annotate a named subexpression (in this example squares), which should be computed in parallel. Thread placement, communication and whether the thread should be created at all are up to the runtime-system. In this introduction we don't go into the operational details of this example. *Note Example: Sum of Squares:: for a discussion of how to improve the parallelism of the algorithm.

This model of using par and seq annotations is the same as that used for the GRIP machine and for the GUM system. In fact, there is a strong correspondence between GranSim and GUM. This makes it possible to carry results of a GranSim simulation over to a real parallel machine operating under GUM (this issue will be addressed in more detail in *Note GranSim vs GUM::.).


File: user.info, Node: GranSim Modes, Next: Example, Prev: Semi-explicit Parallelism, Up: Overview

GranSim Modes
=============

This section outlines a methodology for parallelising lazy functional programs. The development and performance tuning of a parallel algorithm typically proceeds in several stages:

  1. *Profiling* of the sequential algorithm if the goal is the parallelisation of an already existing algorithm. This stage should give an idea about the `big computations' in the program. The programmer can then concentrate on parallelising this part of the algorithm.

  2. *Extracting* parallelism out of the algorithm. In our model of computation this means adding seq and par annotations to the program (*Note Parallelism Annotations::, for a discussion of all available annotations).

  3. *Restructuring* parts of the algorithm to increase the amount of parallelism. This can be called performance tuning on an abstract level, where the performance of the algorithm is solely modelled by the amount of parallelism in the program.

  4. *Optimising* the parallel algorithm for execution on a specific machine. In this stage low-level details have to be addressed.

To facilitate such a process of parallelisation the GHC system provides several tools. In particular GHC and GranSim support different modes reflecting the different stages:

  1. The sequential *profiling* setup of GHC makes it possible to obtain accurate information about computation time and heap usage of program expressions.
Especially for parallelising big sequential programs the profiler gives invaluable information for the proper parallelisation. A more detailed description of the profiler is part of the overall GHC documentation. 2. The *GranSim-Light* mode simulates the parallel execution of a program with an unbounded number of processors. In this mode no communication is modelled (i.e. access of remote data is as expensive as access of local data). Thus, this mode mainly shows the amount of parallelism in the program and indicates an upper bound of the parallelism when running the program on a real machine. 3. In the usual mode *GranSim* can simulate the execution of up to 32 processors (1) (*Note GranSim Modes-Footnotes::) with an exact modelling of communication. Thus, the execution on a particular machine with the chosen characteristics is simulated. 4. A set of *visualisation tools* allows to generate activity and granularity profiles of the execution. These tools allow to generate overall profiles as well as per-PE and per-thread profiles (*Note Visualisation::). This especially facilitates the performance tuning of the program.  File: user.info Node: GranSim Modes-Footnotes, Up: GranSim Modes (1) In general the word size of the machine is the upper bound for the number of processors.  File: user.info, Node: Example, Next: Running it, Prev: GranSim Modes, Up: Overview A Simple Example Program ======================== This section gives an example of how to write a simple parallel program in the parallel, lazy functional execution model. Due to a sad lack of imagination we take our good old friend, nfib, as an example program. This is the sequential function from which we start: nfib :: Int -> Int nfib 0 = 1 nfib 1 = 1 nfib n = nf1+nf2+1 where nf1 = nfib (n-1) nf2 = nfib (n-2) The straightforward idea for parallelising nfib is to start parallel processes for computing both recursive calls. This is expressed with the par annotation, which `sparks' a parallel process for computing its first argument and whose result is its second argument (the exact operational semantics of par is discussed in more detail in chapter *Note Parallelism Annotations::.). This gives the following parallel program: parfib :: Int -> Int parfib 0 = 1 parfib 1 = 1 parfib n = nf2 `par` (nf1 `par` (nf1+nf2+1)) where nf1 = parfib (n-1) nf2 = parfib (n-2) The drawback of this program is the blocking of the main task on the two created child-tasks. Only when both child tasks have returned a result, the main task can continue. It is more efficient to have one of the recursive calls computed by the main task and to spark only one parallel process for the other recursive call. In order to guarantee that the main expression is evaluated in the right order (i.e. without blocking the main task on the child task) the seq annotation is used: parfib :: Int -> Int parfib 0 = 1 parfib 1 = 1 parfib n = nf2 `par` (nf1 `seq` (nf1+nf2+1)) where nf1 = parfib (n-1) nf2 = parfib (n-2) The operational reading of this function is as follows: First spark a parallel process for computing the expression parfib (n-2). Then evaluate the expression parfib (n-1). Finally, evaluate the main expression. If the parallel process has not finished by this time the main task will be blocked on nf2. Otherwise it will just pick up the result. This simple example already shows several basic aspects of parallel, lazy functional programming: * For adding parallelism to the program par annotations on local variables (that appear in the main expression) are used. 
   * To specify the evaluation order of the program seq annotations are used. A seq annotation reduces its first argument (usually a variable occurring in the definition of the second argument) to weak head normal form (WHNF).

   * For the efficiency of the parallel program it is important to avoid unnecessary blocking during the execution.

Another aspect not shown in this example is the importance of evaluation degree. If the parallel expressions create a compound type (e.g. a list) then it is important that they are evaluated to a certain degree. Otherwise there won't be a lot of parallelism in the program. We will revisit this aspect in *Note Strategies::.

Finally, here is the complete example as a stand-alone program:

     module Main where

     import Parallel

     main = getArgs abort ( \ a ->
              let n = head ( map (\ a1 -> fst ((readDec a1) !! 0)) a )
              in  print ("parfib " ++ (show n) ++ " = " ++
                         (show (parfib n)) ++ "\n") )

     nfib :: Int -> Int
     nfib 0 = 1
     nfib 1 = 1
     nfib n = nf1+nf2+1
              where nf1 = nfib (n-1)
                    nf2 = nfib (n-2)

     parfib :: Int -> Int
     parfib 0 = 1
     parfib 1 = 1
     parfib n = nf2 `par` (nf1 `seq` (nf1+nf2+1))
              where nf1 = parfib (n-1)
                    nf2 = parfib (n-2)


File: user.info, Node: Running it, Prev: Example, Up: Overview

Running the Example Program
===========================

In a standard installation of GHC with GranSim enabled the example program can be compiled as usual, except that the compiler option -gransim has to be added. So, if the above program is in the file `parfib.hs', compile a GranSim version of the program with

     ghc -gransim -fvia-C parfib.hs

GranSim requires a compilation using C code as an intermediate stage. Compiling with the native code generator is not supported.

To simulate the execution on a machine with 4 processors type

     ./a.out +RTS -bP -bp4

This will generate a granularity profile (`.gr' extension). For getting an overall activity profile of the execution use

     gr2ps -O a.out.gr

Use your favourite PostScript viewer to examine the activity profile in `a.out.ps'. You'll get the best results with GNU ghostview and ghostscript version 2.6.1 or later.


File: user.info, Node: Setting-up GranSim, Next: Parallelism Annotations, Prev: Overview, Up: Top

Setting-up GranSim
******************

[WARNING: THIS IS VERY DRAFTY FOR NOW. IF YOU HAVE SEVERE INSTALLATION PROBLEMS PLEASE LET ME KNOW.]

This chapter describes how to get the latest version of GranSim, and how to install it. The system requirements are basically the same as for GHC itself. So far, GranSim has been tested on SUNs under SunOS 4.1 and Solaris 2, and on DEC Alphas under OSF 1 and OSF 3.2. The visualisation tools require Perl, Bash and Gnuplot (the latter only for granularity profiles).

* Menu:

* Retrieving::
* Installing::
* Trouble Shooting::


File: user.info, Node: Retrieving, Next: Installing, Prev: Setting-up GranSim, Up: Setting-up GranSim

Retrieving
==========

The important addresses to check for the latest information about GHC and GranSim are:

   * GranSim Home Page -- http://www.dcs.glasgow.ac.uk/fp/software/gransim.html
   * Anonymous FTP Server -- ftp://ftp.dcs.glasgow.ac.uk/pub/haskell/glasgow/
   * GHC home page -- http://www.dcs.glasgow.ac.uk/fp/software/ghc.html (*Note (GHC)::)
   * Glasgow FP group page -- http://www.dcs.glasgow.ac.uk/fp/

To get the current version of GranSim go to the anonymous FTP Server at Glasgow and retrieve version 0.29 of GHC.
The sources of the compiler with GranSim support are in ghc-0.29-src.tar.gz Only if you don't already have an installed version of GHC for bootstrapping the new version you'll have to download the HC files in that directory, too. Check the README file on the FTP server to get more information about the different versions you can download. There should be at least one binary installation of GranSim available for Suns. If you have the right machine and operating system you can just download and unpack this version.  File: user.info, Node: Installing, Next: Trouble Shooting, Prev: Retrieving, Up: Setting-up GranSim Installing ========== Follow the instructions in the GHC Installation Guide (in the subdirectory `ghc/docs/install_guide'. Note that so far GranSim has only been tested with Haskell 1.2. I recommend using that version until I have had a closer look at the interaction of GranSim with the new, shiny GHC for 1.3. The only things different from a normal installation are: * Call the `configure' script with the additional option --enable-gransim * Typing make all (after `configure' and `STARTUP') should create all necessary libraries. All GranSim specific libraries (and hi files) have the extension `_mg.a' (`_mg.hi'). You can find a copy of the GranSim User's Guide in the subdirectory `ghc/docs/gransim'.  File: user.info, Node: Trouble Shooting, Prev: Installing, Up: Setting-up GranSim Trouble Shooting ================ If an installation with make all doesn't run through smoothly, try to build the components by hand. Often it's just a problem with dependencies or old interface files. Try to recompile the modules by hand on which the failing module depends. The top level files for the installation are * `hsc' in the directory `ghc/compiler/': the compiler proper. * The libraries `libHS_mg.a' and `libHS13_mg.a' in the directory `ghc/lib/': these are the necessary built-in libraries (preludes). The libraries `libHSghc_mg.a' and `libHShbc_mg.a' are optional. * `libHSrts_mg.a' in the directory `ghc/runtime/': the runtime-system. You can create a version of the RTS with debugging information (and all GranSim debugging options) by typing make EXTRA_HC_OPTS="-optcO-DGRAN -optcO-DGRAN_CHECK -optcO-DDEBUG -optcO-g" libHSrts_mg.a For those Bravehearts who want to actually hack on the RTS I recommend first having a closer look at the hackers sections of the general GHC User's Guide (*Note (ghc-user-guide.info)Top::). Eventually, I'll add GranSim specific stuff to the chapter about the internals of GranSim (*Note GranSim Internals::) but for now there is not much in this chapter.  File: user.info, Node: Parallelism Annotations, Next: Runtime-System Options, Prev: Setting-up GranSim, Up: Top Parallelism Annotations *********************** This chapter discusses the constructs for adding parallelism to a Haskell program. All constructs are annotations that do not change the semantics of the program (one exception to this rule is the annotation for forcing sequential evaluation since it may change the strictness properties of the program). Normally the basic annotations that are discussed first are sufficient to write and tune a parallel program. The advanced annotations allow to name static spark sites and to provide granularity information. *NB:* To use these annotations the module `Parallel' must be imported as shown in the introductory example (*Note Example::). 
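For instance, assuming `Parallel' has been imported, a parallel version of map could be sketched with these annotations as follows (an illustration only, written just with the `par' and `seq' annotations described below; it is not taken from this guide's examples):

     parMap :: (a -> b) -> [a] -> [b]
     parMap f []     = []
     parMap f (x:xs) = fx `par` (rest `seq` (fx : rest))
                       where fx   = f x
                             rest = parMap f xs

Each application of f is sparked, and forcing the rest of the result list ensures that all sparks are created early.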
* Menu: * Basic Annotations:: * Advanced Annotations:: * Experimental Annotations::  File: user.info, Node: Basic Annotations, Next: Advanced Annotations, Prev: Parallelism Annotations, Up: Parallelism Annotations Basic Annotations ================= The two basic parallelism annotations in GranSim (as well as in GUM and GRIP) are `par' and `seq'. Both take two arguments and return the value of the second argument as result. However, they differ in their operational behaviour: -- Annotation: par :: a -> b -> b `par x y' creates a spark for evaluating X and returns the value of Y as a result. -- Annotation: seq :: a -> b -> b `seq x y' reduces X to weak head normal form and then returns the value of Y as a result. The process of sparking a parallel thread creates potential parallelism (a "spark"). Such a spark may be turned into a thread or may be discarded by the runtime system. Several important facts should be noted about these annotations: * Semantically the value of `seq x y' equals that of Y only if the evaluation of X terminates and is error free. Thus, `seq' is strict in its first argument. * Typically X is a local variable and Y is an expression with X as a free variable. A typical idiom for using `par' is let x = ... in x `par` f x If X does not occur in Y speculative parallelism is created, which might trigger the evaluation of objects that are not needed for computing the result. This may also change the termination properties of the program. * The evaluation degree of Y is unaffected by these annotations. It is solely determined by the expression in which the result of this annotated expression is used.  File: user.info, Node: Advanced Annotations, Next: Experimental Annotations, Prev: Basic Annotations, Up: Parallelism Annotations Advanced Annotations ==================== The annotations in the previous section are actually specialisations of the following set of built-in functions which can be called directly: -- Annotation: parGlobal :: Int -> Int -> Int -> Int -> a -> b -> b `parGlobal n g s p x y' creates a spark for evaluating X and returns the value of Y as a result. The spark gets the name N with the granularity information G, the size information S and a degree of parallelism of P. The G, S and P fields just forward information to the runtime system. -- Annotation: parLocal :: Int -> Int -> Int -> Int -> a -> b -> b `parLocal n g s p x y' behaves like `parGlobal' except that the thread (if it is generated at all) must be executed on the current processor. -- Annotation: parAt :: Int -> Int -> Int -> Int -> a -> b -> c -> c `parAt n g s p v x y' behaves like `parGlobal' except that the thread (if it is generated at all) must be executed on the processor that contains the expression V. -- Annotation: parAtForNow :: Int -> Int -> Int -> Int -> a -> b -> c -> c `parAtForNow n g s p v x y' behaves like `parAt' except that the generated thread may be migrated to another processor during the execution of the program.  File: user.info, Node: Experimental Annotations, Prev: Advanced Annotations, Up: Parallelism Annotations Experimental Annotations ======================== Some experimental annotations currently available in GranSim are: -- Annotation: parAtAbs :: Int -> Int -> Int -> Int -> Int -> a -> b -> b `parAtAbs n g s p m x y' behaves like `parAt' except that the thread must be executed on processor number M (the processors are numbered from 0 to p-1 where p is the runtime system argument supplied via the -bp option). 
-- Annotation: parAtRel :: Int -> Int -> Int -> Int -> Int -> a -> b -> b `parAtRel n g s p m x y' behaves like `parAtAbs' except that the value of M is added to the number of the current processor to determine the target processor. -- Annotation: copyable :: a -> a `copyable x' marks the expression X to be copyable. This means that the expression will be transferred in its unevaluated form if it is needed on another processor. This may duplicate work. *This annotation is not yet implemented.* -- Annotation: noFollow :: a -> a `noFollow x' *This annotation is not yet implemented.*  File: user.info, Node: Runtime-System Options, Next: Strategies, Prev: Parallelism Annotations, Up: Top Runtime-System Options ********************** GranSim provides a large number of runtime-system options for controlling the simulation. Most of these options allow to specify a particular parallel architecture in very much detail. As general convention all GranSim related options start with -b to separate them from other GHC RTS options. To separate the RTS options from usual options given to the Haskell program the meta-option +RTS has to be used. If you are not interested in the details of the available options and just want to specify a somewhat generic setup for one class of parallel machines go to the last section of this chapter (*Note Specific Setups::). * Menu: * Basic Options:: * Special Features:: * Communication Parameters:: * Runtime-System Parameters:: * Processor Characteristics:: * Granularity Control Mechanisms:: * Miscellaneous Options:: * Debugging Options:: * General GHC RTS Options:: * Specific Setups::  File: user.info, Node: Basic Options, Next: Special Features, Prev: Runtime-System Options, Up: Runtime-System Options Basic Options ============= The options in this section are probably the most important GranSim options the programmer has to be aware of. They define the basic behaviour of GranSim rather than going down to a low level of specifying machine characteristics. -- RTS Option: -bP This option controls the generation of a GranSim profile (*Note GranSim Profiles::). By default a reduced profile (only END events) is created. With -bP a full GranSim profile is generated. Such a profile is necessary to create activity profiles. With -b-P no profile is generated at all. -- RTS Option: -bs Generate a spark profile. The GranSim profile will contain events describing the creation, movement, consumption and pruning of sparks. *Note:* This option will drastically increase the size of the generated GranSim profile. -- RTS Option: -bh Generate a heap profile. The GranSim profile will contain events describing heap allocations. *Note:* This option will drastically increase the size of the generated GranSim profile. -- RTS Option: -bpN Choose the number of processors to simulate. The value of N must be less than or equal to the word size on the machine (i.e. usually 32). If N is 0 GranSim-Light mode is enabled. -- RTS Option: -bp: Enable GranSim-Light (same as -bp0). In this mode there is no limit on the number of processors but no communication costs are recorded.  File: user.info, Node: Special Features, Next: Communication Parameters, Prev: Basic Options, Up: Runtime-System Options Special Features ================ The options in this section allow to simulate special features in the runtime-system of the simulated machine. This allows to study how these options influence the behaviour of different kinds of parallel machines. 
All of these flags can be turned off by inserting a - symbol after the b (as in -b-P).

* Menu:

* Asynchronous Communication::
* Bulk Fetching::
* Migration::


File: user.info, Node: Asynchronous Communication, Next: Bulk Fetching, Prev: Special Features, Up: Special Features

Asynchronous Communication
--------------------------

If a thread fetches remote data, the processor is by default blocked until the data arrives. This "synchronous communication" is usually only advantageous on machines with very low latency. On such machines it is better to wait and avoid the overhead of a context switch. Synchronous communication may also increase data locality as a thread is only descheduled when it gets blocked on data that is under evaluation by another thread.

On machines with high latency "asynchronous communication" is usually better as it allows the processor to perform some useful computation while a thread waits for the arrival of remote data. However, the processor might be even more aggressive in trying to get work while another thread is waiting for data. How aggressively the processor tries to get new work is determined by the "fetching strategy". Currently five different strategies are supported.

 -- RTS Option: -byN
     Choose a Fetching Strategy (i.e. determine what to do while the running thread fetches remote data):
       0. Synchronous communication (default).
       1. This and all higher fetching strategies implement asynchronous communication. This strategy schedules another runnable thread if one is available. This gives the same behaviour as -bZ.
       2. If no runnable thread is available a local spark is turned into a thread. This adds task creation overhead to context switch overhead for asynchronous communication.
       3. If no local sparks are available the processor tries to acquire a remote spark.
       4. If the processor can't get a remote spark it tries to acquire a runnable thread from another busy processor, provided that migration is also turned on (-bM).

 -- RTS Option: -bZ
     Enable asynchronous communication. This causes a thread to be descheduled when it fetches remote data. Its processor schedules another runnable thread. If none is available it becomes idle. This gives the same behaviour as -by1.

Note that fetching strategies 3 and 4 involve the sending of messages to other processors. Therefore, it's likely that by the time a spark (or thread) has been fetched, the original thread has already received its data and is being executed again. Hence, in most cases strategies 1 and 2 yield better results than 3 and 4.


File: user.info, Node: Bulk Fetching, Next: Migration, Prev: Asynchronous Communication, Up: Special Features

Bulk Fetching
-------------

When fetching remote data there are several possibilities for transferring the data. The options -bG and -bQN allow you to choose among them. By default GranSim uses "incremental fetching" (also called single closure fetching, or lazy fetching). This means that only the closure that is immediately required is fetched from a remote processor. Again, this strategy is preferable for low latency systems as it minimises the total amount of data that has to be transferred. However, if the overheads for creating a packet and for sending a message are high it is better to perform "bulk fetching", which transfers a subgraph with the required closure as its root. The size of the subgraph can be bounded by specifying the maximal size of a packet or by specifying the maximal number of "thunks" (unevaluated closures) that should be put into one packet.
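For instance, using the options described below, bulk fetching with packets of at most 256 words and at most 3 non-root thunks per packet could be simulated with an invocation like the following (a sketch only; the executable name `parfib' and the chosen figures are merely illustrative):

     ./parfib 30 +RTS -bP -bp16 -bl2000 -bG -bQ4 -Q256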
The way in which the closures to put into a packet are chosen is called the "packing strategy".

 -- RTS Option: -bG
     Enable bulk fetching. This causes the whole subgraph with the needed closure as its root to be transferred.

 -- RTS Option: -bQN
     Pack at most N-1 non-root thunks into a packet. Choosing a value of 1 means that only normal form closures are transferred, with the possible exception of the root, of course. The value 0 is interpreted as infinity (i.e. pack the whole subgraph). This is the default setting.

 -- RTS Option: -QN
     Pack at most N words into one packet. This limits the size of the packet and does not distinguish between thunks and normal forms. The default packet size is 1024 words. The option for setting the packet size differs from the usual GranSim naming scheme as it is also available for GUM. In fact, both implementations use almost the same source code for packing a graph.


File: user.info, Node: Migration, Prev: Bulk Fetching, Up: Special Features

Migration
---------

When an idle processor looks for work it first checks its local spark pool, then it tries to get a remote spark. It might happen that no sparks are available in the system any more and that some processors have several runnable threads. In such a situation it might be advantageous to transfer a runnable thread from a busy processor to an idle one. However, this "thread migration" is very expensive and should be avoided unless absolutely necessary. Therefore, by default thread migration is turned off.

 -- RTS Option: -bM
     Enable thread migration. When an idle processor has no local sparks and can't find global sparks it tries to migrate (steal) a runnable thread from another busy processor. Note that thread migration often causes a lot of fetching in order to move all required data to the new processor, too. This bears the risk of destroying data locality.


File: user.info, Node: Communication Parameters, Next: Runtime-System Parameters, Prev: Special Features, Up: Runtime-System Options

Communication Parameters
========================

The options in this section allow you to specify the overheads for communication. Note that in GranSim-Light mode all of these values are irrelevant (communication is futile).

 -- RTS Option: -blN
     Set the latency in the system to N machine cycles. Typical values for shared memory machines are 60 -- 100 cycles, for GRIP (a closely-coupled distributed memory machine) around 400 cycles, and for standard distributed memory machines between 1000 and 5000 cycles. The default value is 1000 cycles.

 -- RTS Option: -baN
     Set the additional latency in the system to N machine cycles. The additional latency is the latency of follow-up packets within the same message. Usually this is much smaller than the latency of the first packet (default: 100 cycles).

 -- RTS Option: -bmN
     Set the overhead for message packing to N machine cycles. This is the overhead for constructing a packet, independent of its size.

 -- RTS Option: -bxN
     Set the overhead for tidying up the packet after sending it to N machine cycles. On some systems significant work is needed after having sent a packet.

 -- RTS Option: -brN
     Set the overhead for unpacking a message to N machine cycles. Again, this overhead is independent of the message size.

 -- RTS Option: -bgN
     Set the overhead for fetching remote data to N machine cycles. By default this value is two times the latency plus the message unpack time.
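As an example, a machine with a first-packet latency of 2000 cycles, a follow-up packet latency of 200 cycles, and packing and unpacking overheads of 400 and 300 cycles could be simulated with an invocation like this one (again only a sketch; the executable name and all figures are purely illustrative and not measurements of any real machine):

     ./parfib 30 +RTS -bP -bp16 -bl2000 -ba200 -bm400 -br300
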
File: user.info, Node: Runtime-System Parameters, Next: Processor Characteristics, Prev: Communication Parameters, Up: Runtime-System Options Runtime-System Parameters ========================= The options in this section model overhead that is related to the runtime-system of the simulated parallel machine. -- RTS Option: -btN Set the overhead for thread creation to N machine cycles. This overhead includes costs for initialising a control structure describing the thread and allocating stack space for the execution. -- RTS Option: -bqN Set the overhead for putting a thread into the blocking queue of a closure to N machine cycles. -- RTS Option: -bcN Set the overhead for scheduling a thread to N machine cycles. -- RTS Option: -bdN Set the overhead for descheduling a thread to N machine cycles. -- RTS Option: -bnN Set the overhead for global unblocking (i.e. blocking on a remote closure) to N machine cycles. This value does not contain the overhead caused by the communication between the processors. -- RTS Option: -buN Set the overhead for local unblocking (i.e. putting a thread out of a blocking queue and into a runnable queue) to N machine cycles. This value does not contain the overhead caused by the communication between the processors.  File: user.info, Node: Processor Characteristics, Next: Granularity Control Mechanisms, Prev: Runtime-System Parameters, Up: Runtime-System Options Processor Characteristics ========================= The options in this section specify the characteristics of the microprocessor of the simulated parallel machine. To this end the instructions of the processor are divided into six groups. These groups have different weights reflecting their different relative costs. The groups of processor instructions are: * Arithmetic instructions * Load instructions * Store instructions * Branch instructions * Floating point instruction * Heap allocations The options for assigning weights to these groups are: -- RTS Option: -bAN Set weight for arithmetic operations to N machine cycles. -- RTS Option: -bLN Set weight for load operations to N machine cycles. -- RTS Option: -bSN Set weight for store operations to N machine cycles. -- RTS Option: -bBN Set weight for branch operations to N machine cycles. -- RTS Option: -bFN Set weight for floating point operations to N machine cycles. -- RTS Option: -bHN Set weight for heap allocations to N machine cycles. Strictly speaking, the heap allocation costs is a parameter of the runtime-system. However, in our underlying machine model allocating heap is such a basic operation that one can think of it as a special instruction.  File: user.info, Node: Granularity Control Mechanisms, Next: Miscellaneous Options, Prev: Processor Characteristics, Up: Runtime-System Options Granularity Control Mechanisms ============================== There are three granularity control mechanisms: 1. *Explicit threshold* No spark whose priority is smaller than a given threshold will be turned into a thread. 2. *Priority sparking* The spark queue is sorted by priority. This guarantees that the highest priority spark is turned into a thread. Priorities are not maintained for threads. 3. *Priority scheduling* The thread queue is sorted by priority, too. This guarantees that the biggest available thread is scheduled next. This imposes a higher runtime overhead. -- RTS Option: -bYN Use the value N as a threshold when turning sparks into threads. No spark with a priority smaller than N will be turned into a thread. -- RTS Option: -bXX Enable priority sparking. 
The letter X indicates how to use the granularity information attached to a spark site in the source code:

   *    Use the granularity information field as a priority.
   * I  Use the granularity information field as an inverse priority.
   * R  Ignore the granularity information field and use a random priority.
   * N  Ignore the granularity information field and don't use priorities at all.

 -- RTS Option: -bI
     Enable priority scheduling.

 -- RTS Option: -bKN
     Set the overhead for inserting a spark into a sorted spark queue to N machine cycles.

 -- RTS Option: -bON
     Set the overhead for inserting a thread into a sorted thread queue to N machine cycles.


File: user.info, Node: Miscellaneous Options, Next: Debugging Options, Prev: Granularity Control Mechanisms, Up: Runtime-System Options

Miscellaneous Options
=====================

 -- RTS Option: -bC
     Force the system to eagerly turn a spark into a thread. This basically disables the lazy thread creation mechanism of GranSim and ensures that no sparks are discarded (except for sparks whose closures are already in normal form).

 -- RTS Option: -be
     Enable deterministic mode. This means that no random number generator is used for deciding where to get work from. With this option two runs of the same program with the same input will yield exactly the same result.

 -- RTS Option: -bT
     Prefer stealing threads over stealing sparks when looking for remote work. This is mainly an experimental option.

 -- RTS Option: -bN
     When creating a new thread prefer sparks generated by local closures over sparks that have been stolen from other processors. This is mainly an experimental option, which might improve data locality.

 -- RTS Option: -bfN
     Specify the maximal number of outstanding requests to acquire sparks from remote processors. High values of N may cause a more even distribution of sparks, avoiding bottlenecks caused by a `spark explosion' on one processor. However, this might harm data locality. The default value is 1.

 -- RTS Option: -bwN
     Set the time slice of the simulator to N machine cycles. This is an internal variable that changes the behaviour of the simulation rather than that of the simulated machine. A longer time slice means a faster but less accurate simulation. The default time slice is 1000 cycles.


File: user.info, Node: Debugging Options, Next: General GHC RTS Options, Prev: Miscellaneous Options, Up: Runtime-System Options

Debugging Options
=================

These options are mainly intended for debugging GranSim. Only those options that might be of interest to the friendly (i.e. non-hacking) user of GranSim are listed here for now.

*Note:* These options are only available if the RTS has been compiled with the cpp flag GRAN_CHECK (*Note Installing::).

 -- RTS Option: -bDE
     Print event statistics at the end of the computation. These also include statistics about the packets sent (if bulk fetching has been enabled).

If you are *really* interested in all the hidden options in GranSim look into the file `ghc/runtime/main/RtsFlags.lc'.


File: user.info, Node: General GHC RTS Options, Next: Specific Setups, Prev: Debugging Options, Up: Runtime-System Options

General GHC RTS Options
=======================

Some options of the GHC runtime-system that are not specific to GranSim are of special interest, too. They are discussed in this section.

 -- RTS Option: -oN
     This option sets the initial size of the stack of a thread to N words. This might be of importance for GranSim-Light, which can create an abundance of parallel threads, filling up the heap.
The default stack size in GranSim-Light is already reduced to 200 words (usually the default is 1024 words). If you run into problems with heap size in a GranSim-Light setup you might want to reduce it further. -- RTS Option: -SF Print messages about garbage collections to the file F (or to STDERR). -- RTS Option: -F2s Use a two-space garbage collector instead of a generational garbage collector. Previous versions of GranSim had problems with the latter. If you experience problems try this option and send me a bug report (*Note Bug Reports::).  File: user.info, Node: Specific Setups, Prev: General GHC RTS Options, Up: Runtime-System Options Specific Setups =============== When using GranSim a programmer often just wants to specify the general architecture of the machine rather than going down to the details of a specific machine. To facilitate this approach this section presents examples of standard set-ups for GranSim. Note that these setups specify the characteristics of a machine, but not of the runtime-system. Thus characteristics like thread creation costs are left open. However, the default setting fairly closely reflect the real costs for example under GUM. So, unless you have a different implementation of runtime-system details in mind the default settings should be sufficiently accurate. * Menu: * The Ideal Machine:: * Shared Memory Machines:: * Strongly Connected DMMs:: * Distributed Memory Machines (DMMs)::  File: user.info, Node: The Ideal Machine, Next: Shared Memory Machines, Prev: Specific Setups, Up: Specific Setups The Ideal GranSim Setup ----------------------- This setup reflects the ideal case, where communication is for free and where there is no limit on the number of processors. This is used to show the maximal amount of parallelism in the program. Using such a *GranSim-Light* setup is usually the first step in tuning the performance of a parallel, lazy functional program (*Note GranSim Modes::). The typical GranSim-Light setup is: +RTS -bP -b:  File: user.info, Node: Shared Memory Machines, Next: Strongly Connected DMMs, Prev: The Ideal Machine, Up: Specific Setups GranSim Setup for Shared-Memory Machines ---------------------------------------- In a shared memory machine the latency is roughly reduced to the costs of a load operation. Potentially, some additional overhead for managing the shared memory has to be added. Also the caching mechanism might be more expensive than in a sequential machine. In general, the latency should be between 5 and 20 machine cycles. For machines where the latency is of the same order of magnitude as loading and storing data, it is reasonable to assume incremental, synchronous communication. Migration should also be possible. This gives the following setup (for 32 processors): +RTS -bP -bp32 -bl10 -b-G -by0 -bM  File: user.info, Node: Strongly Connected DMMs, Next: Distributed Memory Machines (DMMs), Prev: Shared Memory Machines, Up: Specific Setups GranSim Setup for Strongly Connected Distributed Memory Machines ---------------------------------------------------------------- Strongly connected DMMs put a specific emphasis on keeping the latency in the system as low as possible. One example of such a machine is GRIP, which has been built specifically for performing parallel graph reduction. Therefore, this setup is of special interest for us. Most importantly the latency in such machines is typically between 100 and 500 cycles (400 for GRIP). 
Furthermore, the GRIP runtime-system, as an example of this kind of machine, uses incremental, synchronous communication. Migration is also possible. This gives the following setup (for 32 processors):

     +RTS -bP -bp32 -bl400 -b-G -by0 -bM


File: user.info, Node: Distributed Memory Machines (DMMs), Prev: Strongly Connected DMMs, Up: Specific Setups

GranSim Setup for Distributed Memory Machines
---------------------------------------------

General distributed memory machines usually have latencies that are an order of magnitude higher than those of strongly connected DMMs. However, especially in this class of machines the differences between specific machines are quite significant. So, I strongly recommend using the exact machine characteristics if GranSim is to be used to predict the behaviour on such a machine.

The high latency requires a fundamentally different runtime-system to avoid long delays when fetching remote data. Therefore, usually asynchronous bulk fetching is used. I'd recommend choosing a fetching strategy of 1 or 2 (it's hard to say which one is better in general). Thread migration is so expensive on DMMs that it is often not supported at all. This gives the following setup (for 32 processors):

     +RTS -bP -bp32 -bl2000 -bG -by2 -b-M


File: user.info, Node: Strategies, Next: Visualisation Tools, Prev: Runtime-System Options, Up: Top

Parallel Functional Programming in the Large: Strategies
********************************************************

[SEE THE STRATEGY PAPER: "STRATEGIES FOR WRITING PARALLEL NON-STRICT PROGRAMS", TRINDER P.W., LOIDL H.W., HAMMOND K., PEYTON JONES S.L. (IN PREPARATION).]

This chapter deals with parallel, lazy functional programming in the large. It introduces the notion of strategies and shows how they make it easier to develop large parallel programs.

* Menu:

* Motivation::
* The Main Idea::
* Quicksort (Example)::
* Sum of Squares (Example)::
* A Class of Strategies::
* Experiences with Strategies::
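As a foretaste of the sections listed above, the basic combinators can be sketched roughly in the style of the strategy paper cited above (only an outline; the definitions introduced in the following sections are authoritative and may differ in detail):

     type Strategy a = a -> ()

     using :: a -> Strategy a -> a
     using x s = s x `seq` x

     rwhnf :: Strategy a                -- reduce to weak head normal form
     rwhnf x = x `seq` ()

     parList :: Strategy a -> Strategy [a]
     parList strat []     = ()
     parList strat (x:xs) = strat x `par` parList strat xs

An expression such as map f xs `using` parList rwhnf then separates the algorithm (map f xs) from the specification of its parallelism (parList rwhnf).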