Info file: user.info,    -*-Text-*-
produced by `texinfo-format-buffer'
from file `user.texi'
using `texinfmt.el' version 2.32 of 19 November 1993.

The GranSim User's Guide discusses how to use the GranSim simulator
for the parallel execution of Haskell programs.

Copyright (C) 1994 -- 1996, Hans-Wolfgang Loidl
for the GRASP/AQUA Project, Glasgow University


File: user.info, Node: Motivation, Next: The Main Idea, Prev: Strategies, Up: Strategies

Motivation for Strategies
=========================

The following naive version of `parmap' often does not give good
parallelism:

     parmap :: (a -> b) -> [a] -> [b]
     parmap f []     = []
     parmap f (x:xs) = fx `par` pmxs `par` (fx:pmxs)
                       where fx   = f x
                             pmxs = parmap f xs

If in this version the spark for `pmxs' is discarded, almost all
parallelism in the program is lost. On the other hand, if all `par's
are executed, more parallel threads than necessary are created (two
for each list element). An alternative version of `parmap' reduces the
total number of generated threads by replacing the second `par' with a
`seq':

     parmap :: (a -> b) -> [a] -> [b]
     parmap f []     = []
     parmap f (x:xs) = fx `par` pmxs `seq` (fx:pmxs)
                       where fx   = f x
                             pmxs = parmap f xs

The problem with this version is that in a context that requires the
result of `parmap' only in weak head normal form, the recursive call
to `parmap' will be deferred until its value is needed. This
drastically reduces the total amount of parallelism in the program. In
particular, if such a function is used as the producer in a program
with producer-consumer parallelism, this gives very poor parallelism
(*Note Sum of Squares (Example)::).
To improve this behaviour one can write a forcing function, which
guarantees that all of the result is demanded immediately:

     forcelist []     = ()
     forcelist (x:xs) = x `seq` (forcelist xs)

Using this function in `parmap' gives the following function:

     parmap :: (a -> b) -> [a] -> [b]
     parmap f []     = []
     parmap f (x:xs) = fx `par` (forcelist pmxs) `seq` (fx:pmxs)
                       where fx   = f x
                             pmxs = parmap f xs

However, following this approach yields a big set of forcing functions
and a mixture of defining the result and driving the computation. We
want to avoid both.


File: user.info, Node: The Main Idea, Next: Quicksort (Example), Prev: Motivation, Up: Strategies

The Main Idea
=============

[FOR NOW MAINLY THE ABSTRACT OF THE STRATEGY PAPER]

Separate the components of a parallel, lazy functional program:
  1. Parallelism
  2. Evaluation degree
  3. Definition of the result

In most current parallel non-strict functional languages, a function
must both describe the value to be computed and the *dynamic
behaviour*. The dynamic behaviour of a function has two aspects:

   * *parallelism*, i.e. what values could be computed in parallel,
     and
   * *evaluation-degree*, i.e. how much of each value should be
     constructed.

Our experience is that specifying the dynamic behaviour of a function
obscures its semantics. Strategies have been developed to address this
problem; a strategy is an explicit description of a function's
potential dynamic behaviour. The philosophy is that *it should be
possible to understand the semantics of a function without considering
its dynamic behaviour*.

In essence a strategy is a runtime function that traverses a data
structure, specifying how much of it should be evaluated, and possibly
sparking threads to perform the construction in parallel. Because
strategies are functions they can be defined on any type and can be
combined to specify sophisticated dynamic behaviour. For example, a
strategy can control thread granularity or specify evaluation over an
infinite structure.
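As a sketch of the last point, the following self-contained fragment
evaluates only a prefix of an infinite list. The definitions of
`Strategy', `using', `rwhnf' and `parListN' mirror those of the
strategies module discussed in *Note A Class of Strategies::; they are
repeated here only to make the example stand alone, and `par' is taken
from GHC's base library rather than from the GranSim setup.

     import GHC.Conc (par)  -- stand-in for the built-in `par' annotation

     type Strategy a = a -> ()

     -- strategy application, as in the strategies module
     using :: a -> Strategy a -> a
     using x s = s x `seq` x

     -- reduce to weak head normal form
     rwhnf :: Strategy a
     rwhnf x = x `seq` ()

     -- apply a strategy to the first n elements of a list, in parallel
     parListN :: Int -> Strategy a -> Strategy [a]
     parListN _ _ []         = ()
     parListN 0 _ _          = ()
     parListN n strat (x:xs) = strat x `par` parListN (n-1) strat xs

     -- evaluation over an infinite structure: only the first 10
     -- squares are (speculatively) evaluated; the rest of the list
     -- remains untouched
     firstSquares :: [Int]
     firstSquares = take 10 ([ i*i | i <- [1..] ] `using` parListN 10 rwhnf)

The strategy terminates on the infinite list because `parListN' stops
after n elements; the definition of the list itself says nothing about
how much of it is evaluated.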
A disadvantage is that a strategy requires an additional runtime pass
over parts of the data structure. Example programs are given that use
strategies on divide-and-conquer, pipeline and data-parallel
applications.


File: user.info, Node: Quicksort (Example), Next: Sum of Squares (Example), Prev: The Main Idea, Up: Strategies

Example Program: Quicksort
==========================

This section compares two versions of parallel quicksort: one using a
forcing function and another one using strategies.

* Menu:

* Forcing Function Version::
* Strategy Version::


File: user.info, Node: Forcing Function Version, Next: Strategy Version, Prev: Quicksort (Example), Up: Quicksort (Example)

Parallel Quicksort using Forcing Functions
------------------------------------------

The following naive attempt to introduce parallelism in quicksort
fails because the threads generating `losort' and `hisort' only create
a single cons cell.

     quicksortN []     = []
     quicksortN [x]    = [x]
     quicksortN (x:xs) = losort `par` hisort `par` result
                         where losort = quicksortN [y|y <- xs, y < x]
                               hisort = quicksortN [y|y <- xs, y >= x]
                               result = losort ++ (x:hisort)

The current practice of parallel Haskell programmers is to introduce a
*forcing function* that forces the evaluation of the required elements
of a data structure. For example

     forceList :: [a] -> ()
     forceList []     = ()
     forceList (x:xs) = x `seq` forceList xs

Quicksort can be rewritten to have the desired behaviour using
`forceList' as follows.

     quicksortF []     = []
     quicksortF [x]    = [x]
     quicksortF (x:xs) = (forceList losort) `par`
                         (forceList hisort) `par`
                         losort ++ (x:hisort)
                         where losort = quicksortF [y|y <- xs, y < x]
                               hisort = quicksortF [y|y <- xs, y >= x]

To obtain the required dynamic behaviour for the `parMap' example we
might use `forceList' within the definition of `f':

     f x = forceList result `seq` result
           where result = [fib x]

When programs are written in this style, a number of forcing functions
are required: at least one for each type, e.g. `forcePair'.
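`forcePair' is mentioned but not defined in this guide; a plausible
definition along the same lines as `forceList' would be the following
(the `minMax' demo function is purely illustrative and not part of
GranSim):

     -- force both components of a pair to weak head normal form;
     -- analogous to forceList, one such function per type is needed
     forcePair :: (a, b) -> ()
     forcePair (x, y) = x `seq` y `seq` ()

     -- illustrative use: demand both components of the result pair
     -- before the pair itself is returned
     minMax :: [Int] -> (Int, Int)
     minMax xs = forcePair result `seq` result
       where result = (minimum xs, maximum xs)

This is exactly the kind of per-type boilerplate that the strategy
version replaces with a single overloaded `rnf'.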
To obtain good dynamic behaviour, `par', `seq' and forcing functions
are inserted throughout the program. In consequence the dynamic
behaviour and static semantics of the computation are intertwined.
This intertwining obscures the semantics of the program. Small-scale
programs remain manageable, but in larger programs, particularly those
with complex data structures and parallel behaviour, discerning the
meaning of a program becomes very hard.


File: user.info, Node: Strategy Version, Prev: Forcing Function Version, Up: Quicksort (Example)

Parallel Quicksort using Strategies
-----------------------------------

In the strategy version we can specify the evaluation degree by
applying one of the predefined strategies to the components of the
result. These strategies are then combined by using either `par' or
`seq' to specify the parallelism. In quicksort, each parallel thread
should construct all of its result list, and `rnf' expresses this
neatly. The interesting equation becomes

     quicksortS (x:xs) = losort ++ (x:hisort) `using` strategy
                         where losort = quicksortS [y|y <- xs, y < x]
                               hisort = quicksortS [y|y <- xs, y >= x]
                               strategy result = rnf losort `par`
                                                 rnf hisort `par`
                                                 rnf result `par`
                                                 ()


File: user.info, Node: Sum of Squares (Example), Next: A Class of Strategies, Prev: Quicksort (Example), Up: Strategies

Example Program: Sum of Squares
===============================

Although the sum-of-squares program is a very simple producer-consumer
program, it already shows some basic problems in this model.
Unfortunately, the simple version of the program, which was presented
in *Note Semi-explicit Parallelism::, produces hardly any parallelism
at all. The reason for this behaviour is that, because of lazy
evaluation, the producer will only create the top cons cell of the
intermediate list. When the consumer then demands the rest of the
list, it computes the squares as well as the resulting sum. In the
example below, version 1 (with `res_naive') shows this behaviour.
To improve the parallelism one can use a "forcing function" on the
intermediate list. The parallel thread is then the application of this
forcing function to the squares list. As expected, this creates two
threads that are computing most of the time and occasionally have to
exchange data. Version 2 (with `res_force') shows this behaviour.

A better way to achieve the same operational behaviour is to define a
strategy describing how to compute the result. The strategy describes
the parallelism and the evaluation degree. In doing so, it uses the
reduce-to-normal-form (`rnf') strategy, which is defined in the class
`NFData'. Compared to version 2, the strategy version 3 (with
`res_strat') achieves a clean separation of defining the result and
specifying operational details like parallelism and evaluation degree.

     module Main where

     import StrategiesVII

     main = getArgs abort ( \ a ->
            let
              args :: [Int]
              args    = map (\ a1 -> fst ((readDec a1) !! 0)) a
              version = args !! 0
              n       = args !! 1

              squares :: [Int]
              squares = [ i^2 | i <- [1..n] ]

              -- Naive version sparks a producer for the list squares
              -- but doesn't force it
              res_naive = squares `par` sum squares

              -- This version sparks a forcing function on the list
              -- (producer)
              res_force = (foldl1 seq squares) `par` sum squares

              -- The strategy version
              res_strat = sum squares `using` (rnf squares `par` rnf)

              res = case version of
                      1 -> res_naive
                      2 -> res_force
                      3 -> res_strat
                      _ -> res_naive

              str = case version of
                      1 -> "Naive version"
                      2 -> "Forcing version"
                      3 -> "Strategy version"
                      _ -> "Naive version"
            in
              print ("Sum of squares (" ++ str ++ ") of length " ++
                     (show n) ++ " = " ++ (show res) ++ "\n") )

When running the naive version of this program the spark for the
producer is pruned and all the computation of the producer is subsumed
by the consumer. This gives a totally sequential program.
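The three result variants can also be written down in a self-contained
form. The sketch below replays them with GHC's `par' from the base
library and with `rnf' monomorphised to the two types involved (the
real module overloads `rnf' via the `NFData' class); the function
names are illustrative, not part of GranSim.

     import GHC.Conc (par)  -- stand-in for the built-in `par' annotation

     type Strategy a = a -> ()

     using :: a -> Strategy a -> a
     using x s = s x `seq` x

     -- rnf specialised to the types used here
     rnfInt :: Strategy Int
     rnfInt x = x `seq` ()

     rnfList :: Strategy [Int]
     rnfList []     = ()
     rnfList (x:xs) = x `seq` rnfList xs

     squares :: Int -> [Int]
     squares n = [ i^2 | i <- [1..n] ]

     -- version 1: spark the producer but never force it
     resNaive :: Int -> Int
     resNaive n = let sq = squares n in sq `par` sum sq

     -- version 2: spark a forcing function on the producer's list
     resForce :: Int -> Int
     resForce n = let sq = squares n in foldl1 seq sq `par` sum sq

     -- version 3: the same behaviour expressed as a strategy
     resStrat :: Int -> Int
     resStrat n = let sq = squares n
                  in  sum sq `using` (rnfList sq `par` rnfInt)

All three compute the same value; only their dynamic behaviour under a
parallel runtime differs.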
The relevant parts of the GranSim profile are:

     Granularity Simulation for sq 1 99 +RTS -bP -bp: -bs

     ++++++++++++++++++++

     PE  0 [0]: START 0 0x100620 [SN 0]
     PE  0 [67445]: SPARK 20004 0x347dbc [sparks 0]
     PE  0 [68241]: PRUNED 20004 0x347dbc [sparks 1]
     PE  0 [327088]: END 0, SN 0, ST 0, EXP F, BB 847, HA 5426,
                     RT 327088, BT 0 (0), FT 0 (0), LS 0, GS 1, MY T

The other two versions of the program create one producer and one
consumer thread, as expected.


File: user.info, Node: A Class of Strategies, Next: Experiences with Strategies, Prev: Sum of Squares (Example), Up: Strategies

Using a Class of Strategies
===========================

[FOR NOW THIS SECTION IS JUST A SHORT DISCUSSION OF THE STRATEGIES
MODULE]

This section discusses the contents of the strategies module. It shows
how strategies are implemented on top of the basic `par' and `seq'
annotations. Haskell's overloading mechanism is used to define a
strategy for reducing to normal form. The current implementation is
based on Haskell 1.2, but in the near future we want to use
constructor classes as provided in Haskell 1.3.

* Menu:

* Type and Application::
* Primitive Strategies::
* Basic Strategies::
* Strategy Combinators::


File: user.info, Node: Type and Application, Next: Primitive Strategies, Prev: A Class of Strategies, Up: A Class of Strategies

Type Definition of Strategies
-----------------------------

A strategy does not return a meaningful result; it only specifies
parallelism and evaluation degree. Therefore, we use the following
type for strategies:

     type Strategy a = a -> ()

We can now define a "strategy application", which adds information
about the operational behaviour of the execution to an expression:

     using :: a -> Strategy a -> a
     using x s = s x `seq` x

Note that `x `using` s' is a projection on `x', i.e.
both:

   * a retraction: `x `using` s' is no more defined than `x', and
   * idempotent: `(x `using` s) `using` s = x `using` s'


File: user.info, Node: Primitive Strategies, Next: Basic Strategies, Prev: Type and Application, Up: A Class of Strategies

Primitive Strategies
--------------------

The primitive strategies are basically just syntactic sugar to fit
`par' and `seq' into the strategy scheme.

`sPar' is a strategy corresponding to `par', i.e. `x `par` e <=> e
`using` sPar x':

     sPar :: a -> Strategy b
     sPar x y = x `par` ()

`sSeq' is a strategy corresponding to `seq', i.e. `x `seq` e <=> e
`using` sSeq x':

     sSeq :: a -> Strategy b
     sSeq x y = x `seq` ()


File: user.info, Node: Basic Strategies, Next: Strategy Combinators, Prev: Primitive Strategies, Up: A Class of Strategies

Basic Strategies
----------------

Three basic strategies describe the evaluation degree of a data
structure:

   * `r0' is a strategy which performs no evaluation at all:

          r0 :: Strategy a
          r0 x = ()

   * `rwhnf' reduces a data structure to weak head normal form:

          rwhnf :: Strategy a
          rwhnf x = x `seq` ()

   * `rnf' reduces a data structure to normal form. This strategy
     cannot be defined independently of the type of the data
     structure. Therefore, we define a class `NFData' and overload the
     `rnf' strategy:

          class NFData a where
            -- rnf reduces its argument to (head) normal form
            rnf :: Strategy a
            -- Default method. Useful for base types. A specific
            -- method is necessary for constructed types
            rnf = rwhnf

     For primitive types like `Int's and `Char's the default methods
     can be used:

          instance NFData Int
          instance NFData Char
          ...


File: user.info, Node: Strategy Combinators, Prev: Basic Strategies, Up: A Class of Strategies

Strategy Combinators
--------------------

When defining a strategy on a compound data type we can use these
three basic strategies and compose them with the strategy combinators
below. These combinators describe the parallelism in the evaluation.
     -- Pairs

     instance (NFData a, NFData b) => NFData (a,b) where
       rnf (x,y) = rnf x `seq` rnf y

     seqPair :: Strategy a -> Strategy b -> Strategy (a,b)
     seqPair strata stratb (x,y) = strata x `seq` stratb y

     parPair :: Strategy a -> Strategy b -> Strategy (a,b)
     parPair strata stratb (x,y) = strata x `par` stratb y

     -- Lists

     instance NFData a => NFData [a] where
       rnf []     = ()
       rnf (x:xs) = rnf x `seq` rnf xs

     -- Applies a strategy to every element of a list in parallel
     parList :: Strategy a -> Strategy [a]
     parList strat []     = ()
     parList strat (x:xs) = strat x `par` (parList strat xs)

     -- Applies a strategy to the first n elements of a list in
     -- parallel
     parListN :: (Integral b) => b -> Strategy a -> Strategy [a]
     parListN n strat []     = ()
     parListN 0 strat xs     = ()
     parListN n strat (x:xs) = strat x `par` (parListN (n-1) strat xs)

     -- Sequentially applies a strategy to each element of a list
     seqList :: Strategy a -> Strategy [a]
     seqList strat []     = ()
     seqList strat (x:xs) = strat x `seq` (seqList strat xs)

     -- Sequentially applies a strategy to the first n elements of a
     -- list
     seqListN :: (Integral a) => a -> Strategy b -> Strategy [b]
     seqListN n strat []     = ()
     seqListN 0 strat xs     = ()
     seqListN n strat (x:xs) = strat x `seq` (seqListN (n-1) strat xs)

Now we can use these predefined strategies to define a parallel map
function:

     -- parMap applies a function to each element of the argument list
     -- in parallel. The result of the function is evaluated using
     -- `strat'
     parMap :: Strategy b -> (a -> b) -> [a] -> [b]
     parMap strat f xs = map f xs `using` parList strat

For bigger programs the programmer only has to define strategies on
the most important data structures in order to introduce parallelism
into the program. With this approach the parallel program will not be
cluttered with `par' annotations, and it is easier to determine the
semantics and dynamic behaviour of the individual functions.
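To illustrate the claim that one instance per important data structure
suffices, here is a sketched instance for a user-defined binary tree.
The `Tree' type, `parTree' and `treeSum' are invented for this
example; the supporting definitions mirror the module above so that
the fragment stands alone, with `par' taken from GHC's base library.

     import GHC.Conc (par)  -- stand-in for the built-in `par' annotation

     type Strategy a = a -> ()

     using :: a -> Strategy a -> a
     using x s = s x `seq` x

     rwhnf :: Strategy a
     rwhnf x = x `seq` ()

     class NFData a where
       rnf :: Strategy a
       rnf = rwhnf            -- default method, enough for base types

     instance NFData Int

     -- a user-defined type: one NFData instance and one parallel
     -- strategy make it usable with `using'
     data Tree a = Leaf | Node (Tree a) a (Tree a)

     instance NFData a => NFData (Tree a) where
       rnf Leaf         = ()
       rnf (Node l x r) = rnf l `seq` rnf x `seq` rnf r

     -- evaluate both subtrees in parallel, the node value in the
     -- current thread
     parTree :: Strategy a -> Strategy (Tree a)
     parTree _     Leaf         = ()
     parTree strat (Node l x r) =
       parTree strat l `par` parTree strat r `par` strat x

     treeSum :: Tree Int -> Int
     treeSum Leaf         = 0
     treeSum (Node l x r) = treeSum l + x + treeSum r

     result :: Int
     result = treeSum (t `using` parTree rnf)
       where t = Node (Node Leaf 1 Leaf) 2 (Node Leaf 3 Leaf)

The definition of `result' states only what is computed; `parTree rnf'
states, separately, that the whole tree should be constructed with the
two subtrees evaluated in parallel.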
File: user.info, Node: Experiences with Strategies, Prev: A Class of Strategies, Up: Strategies

Experiences with Strategies
===========================

[STRATEGIES ARE COOL]


File: user.info, Node: Visualisation Tools, Next: GranSim Profiles, Prev: Strategies, Up: Top

Visualisation Tools
*******************

The visualisation tools that come with GranSim take a GranSim profile
as input and create Level 2 PostScript files showing either the
activity or the granularity of the execution. A collection of scripts
in the GranSim Toolbox makes it possible to focus on specific aspects
of the execution, like the `node flow', i.e. the movement of nodes
(closures) between the PEs.

Most of these tools are implemented as Perl scripts. This means that
they are very versatile and it should be easy to modify them. However,
processing a large GranSim profile can involve quite a lot of
computation (and memory). Speeding up crucial parts of the scripts is
on my ToDo-list, but better don't hold your breath.

* Menu:

* Activity Profiles::
* Granularity Profiles::
* Scripts::


File: user.info, Node: Activity Profiles, Next: Granularity Profiles, Prev: Visualisation Tools, Up: Visualisation Tools

Activity Profiles
=================

Three tools show the activity during the execution in three levels of
detail:

   * The *overall activity profile* (created with `gr2ps') shows the
     activity of the whole machine (*Note Overall Activity
     Profile::).

   * The *per-processor activity profile* (created with `gr2pe') shows
     the activity of all simulated processors (*Note Per-Processor
     Activity Profile::).

   * The *per-thread activity profile* (created with `gr2ap') shows
     the activity of all generated threads (*Note Per-Thread Activity
     Profile::).

All tools discussed in this section print a help message when called
with the option `-h'. This message shows the available options.
In general, all tools understand the option `-o' for specifying the
output file and `-m' for generating a monochrome profile (by default
the tools generate colour PostScript).

The `gr2ps' and `gr2ap' scripts work in two stages:
  1. First the `.gr' file is translated into a `.qp' file, which is
     basically a stripped-down version of a GranSim profile.
  2. Then the `.qp' file is translated into a `.ps' file.
The `gr2pe' script works directly on the `.gr' file.

* Menu:

* Overall Activity Profile::
* Per-Processor Activity Profile::
* Per-Thread Activity Profile::


File: user.info, Node: Overall Activity Profile, Next: Per-Processor Activity Profile, Prev: Activity Profiles, Up: Activity Profiles

Overall Activity Profile
------------------------

The *overall activity profile* (created via `gr2ps') shows the
activity of the whole machine by separating the threads into up to
five different groups. These groups describe the number of

   * running threads (i.e. threads that are currently performing a
     reduction),
   * runnable threads (i.e. threads that could be executed but that
     have not found an idle PE),
   * blocked threads (i.e. threads that wait for a result that is
     being computed by another thread),
   * fetching threads (i.e. threads that are currently fetching data
     from a remote PE),
   * migrating threads (i.e. threads that are currently being
     transferred from a busy PE to an idle PE).

If the GranSim profile includes information about sparks (`-bs'
option), it is also possible to show the number of sparks. However,
this number is usually much bigger than the number of all threads.
Therefore, it doesn't make much sense to mix the groups for sparks and
threads in one profile.

The option `-O' reduces the size of the generated PostScript file. The
option `-I' is used to specify which groups of threads should be shown
in which order. For example

     gr2ps -O -I "arb" parfib.gr

generates a profile `parfib.ps' showing only active (`a'), runnable
(`r') and blocked (`b') threads.
The letter codes for the other groups are `f' for fetching, `m' for
migrating and `s' for sparks.

In the current version the marks on the y-axis of the generated
profile may be stretched or compressed. This might happen if many
events occur at exactly the same time. In this case the initial count
of the maximal number of y-values may be wrong, causing a rescaling at
the very end. In practice this happens rarely (more often for
GranSim-Light profiles, though).


File: user.info, Node: Per-Processor Activity Profile, Next: Per-Thread Activity Profile, Prev: Overall Activity Profile, Up: Activity Profiles

Per-Processor Activity Profile
------------------------------

The idea of the *per-processor activity profile* is to show the most
important pieces of information about each processor in one graph.
This makes it easy to compare the behaviour of the different
processors and to spot imbalances in the computation.

This profile shows one strip for each of the simulated processors.
Each of these strips encodes three kinds of information:

   * Is the processor *active* at a certain point? If it is active,
     the strip appears in some shade of green (gray in the monochrome
     version). If it is idle, it appears in red (white in the
     monochrome version). The area before starting the first thread
     and after terminating the last thread is left blank in both
     versions.

   * How high is the *load* of the processor? The load is measured by
     the number of runnable threads on this processor. A high load is
     shown by a dark shading of green (or grey). A palette at the top
     of the graph shows the available shadings (two ticks indicate the
     range that is used in the graph). It is possible to distinguish
     between 20 different values. Therefore, all processors with more
     than 20 runnable threads are shown in the same (dark) colour.

   * How many *blocked threads* are on the processor? This information
     is shown by the thickness of the blue (black) bar at the bottom
     of each strip.
Without any blocked threads no bar is shown. If 20 or more threads are
blocked, the bar covers 80% of the strip. Thus, the load information
is always visible `in the background'.

This script also allows one to produce variants of the same kind of
graph that focus on different features of GranSim:

   * *Migration:* With the option `-M' this script produces a graph
     that draws arrows between processors indicating the migration of
     a thread from one processor to another. No load or blocking
     information is shown in this graph.

   * *Sparking:* With the option `-S' a spark graph is generated. It
     shows information about the number of sparks on a processor in
     the same way as the number of runnable threads (i.e. by shading).
     This graph is useful to highlight processors that create a lot of
     work.

No more than about 32 processors should be shown in one graph,
otherwise the strips get too small. This profile cannot be generated
for a GranSim-Light profile.


File: user.info, Node: Per-Thread Activity Profile, Prev: Per-Processor Activity Profile, Up: Activity Profiles

Per-Thread Activity Profile
---------------------------

The *per-thread activity profile* shows the activity of all generated
threads. For each thread a horizontal line is shown. The line starts
when the thread is created and ends when it is terminated. The
thickness of the line indicates the state of the thread. The possible
states correspond to the groups shown in the overall activity profile
(*Note Overall Activity Profile::). The states are encoded in the
following way:

   * A *running* thread is shown as a thick green (gray) line.
   * A *runnable* thread is shown as a medium red (black) line.
   * A *fetching* (or *migrating*) thread is shown as a thin blue
     (black) line.
   * A *blocked* thread is shown as a gap in the line.

This profile gives the most accurate kind of information, and it often
makes it possible to `step through' the computation by relating events
on different processors with each other.
For example, the typical pattern at the beginning of the computation
is some short computation for starting the thread, followed by
fetching remote data. After that, the thread may become runnable if
another thread has been started in the meantime.

However, such a detailed analysis is only possible for programs with a
rather small number of threads. Usually, GranSim profiles of bigger
executions have to be pre-processed to reduce the number of threads
that are shown on one graph (*Note Scripts::). As the level of detail
provided by this graph is rarely needed for bigger executions, no
automatic splitting of a profile into several graphs has been
implemented.


File: user.info, Node: Granularity Profiles, Next: Scripts, Prev: Activity Profiles, Up: Visualisation Tools

Granularity Profiles
====================

The tools for generating *granularity profiles* aim at showing the
relative sizes of the generated threads. Of special interest is the
number of tiny threads, for which the overhead of thread creation is
relatively high.

All tools discussed in this section require Gnuplot to generate the
granularity profiles. I am using version 3.5, but it should work with
older versions, too.

For showing granularity, basically two kinds of graphs can be
generated:

   * A *bucket graph* (*Note Bucket Graphs::), which collects threads
     with similar runtime in the same bucket and shows the number of
     threads in each bucket.

   * A *cumulative graph* (*Note Cumulative Graphs::), which shows how
     many threads have a runtime less than or equal to a given number.

The main tools for generating such graphs are:

   * `gr2gran' creates one bucket graph and one cumulative graph from
     a given GranSim profile. The information about the partitioning
     for the bucket statistics and other set-up information is usually
     provided in a *template file* (*Note Template Files::), which is
     specified via the `-t' option (`-t ,' uses the global template
     file in `$GRANDIR/bin'). This script works in three stages:

       1.
          First an `RTS' file is generated, which is just a sorted
          list of runtimes extracted from the END events of a GranSim
          profile (*Note GranSim Profiles::).
       2. The main stage generates a `gnuplot' file by grouping the
          threads into buckets and computing cumulative values.
       3. Finally, Gnuplot is used to generate `PostScript' files
          showing the graphs.

   * `gran-extr' is based on the same idea as `gr2gran', but it
     produces even more graphs, showing the communication percentage,
     determining a correlation coefficient between heap allocations
     and runtime etc.

* Menu:

* Bucket Graphs::
* Cumulative Graphs::
* Template Files::
* Statistics Package::


File: user.info, Node: Bucket Graphs, Next: Cumulative Graphs, Prev: Granularity Profiles, Up: Granularity Profiles

Bucket Graphs
-------------

In a bucket graph the x-axis, indicating execution times of the
threads, is partitioned into intervals ("buckets"). The graph shows a
histogram of the number of threads in each bucket (i.e. the threads
whose execution time falls into this interval). For generating this
kind of graph only a restricted GranSim profile (containing only END
events) is required.

For example, one of the files generated by running

     gr2gran -t , pf.gr

is `g.ps', which contains such a bucket graph. The `-t' option of this
tool selects the right template file (`,' is a shorthand for the
global template in `$GRANDIR/bin'). The necessary change in the
template file for these bucket statistics is

     -- Intervals for pure exec. times
     G: (100, 200, 500, 1000, 2000, 5000, 10000)


File: user.info, Node: Cumulative Graphs, Next: Template Files, Prev: Bucket Graphs, Up: Granularity Profiles

Cumulative Graphs
-----------------

In a cumulative graph the x-axis again represents execution times of
the individual threads. The value in the graph at time t represents
the number of threads whose execution time is smaller than t.
Therefore, the values in the graph are monotonically increasing, and
at the right end the graph shows the total number of threads in the
execution.

Again, running

     gr2gran -t , pf.gr

generates cumulative graphs for the runtime in the files `cumu-rts.ps'
and `cumu-rts0.ps' (one file shows absolute numbers of threads on the
y-axis, the other the percentage of threads).


File: user.info, Node: Template Files, Next: Statistics Package, Prev: Cumulative Graphs, Up: Granularity Profiles

Template Files
--------------

The functions for reading template files can be found in
`template.pl'. This file also contains documentation about the
available fields.


File: user.info, Node: Statistics Package, Prev: Template Files, Up: Granularity Profiles

Statistics Package
------------------

A set of statistics functions for computing the mean value, standard
deviation, correlation coefficient etc. can be found in `stats.pl'.


File: user.info, Node: Scripts, Prev: Granularity Profiles, Up: Visualisation Tools

Scripts
=======

The GranSim Toolbox contains not only visualisation tools but also a
set of scripts that work on GranSim profiles and provide specific
information.

   * The `tf' script aims at showing the task flow (as well as the
     node flow) in the execution of a program. It is used in the
     GranSim Emacs mode to narrow a GranSim profile (*Note The Emacs
     GranSim Profile Mode::).

   * The `SN' script creates a summary of the spark names that occur
     in a GranSim profile. This summary is shown as an impulses graph
     via Gnuplot. It allows one to compare the relative number of
     threads generated by each static spark site.

   * The `AVG' and `avg-RTS' scripts compute the average runtime from
     an RTS file, which is generated by `gr2RTS'.


File: user.info, Node: GranSim Profiles, Next: Parallel NoFib Suite, Prev: Visualisation Tools, Up: Top

GranSim Profiles
****************

This chapter describes the contents of a "GranSim profile" (a `.gr'
file).
In most cases the profiles generated by the visualisation tools should
provide sufficient information for tuning the performance of the
program. However, it is possible to extract more information out of
the generated GranSim profile. This chapter describes how to do that.

* Menu:

* Types of GranSim Profiles::
* Contents of a GranSim Profile::
* The Emacs GranSim Profile Mode::


File: user.info, Node: Types of GranSim Profiles, Next: Contents of a GranSim Profile, Prev: GranSim Profiles, Up: GranSim Profiles

Types of GranSim Profiles
=========================

Depending on some runtime-system options, different kinds of profiles
are generated:

   * A *reduced* GranSim profile contains only END events in the body
     component. This is sufficient to extract granularity profiles,
     but it is not sufficient to generate activity profiles. This is
     the default setting for GranSim profiles.

   * A *full* GranSim profile contains one line for every major event
     in the system (*Note Contents of a GranSim Profile::). The
     generation of such a profile is enabled by the RTS option `-bP'.

   * A *spark* profile additionally contains events related to sparks
     (for creating, using, pruning, exporting and acquiring sparks).
     Such a profile is generated when using the RTS option `-bs'.

   * A *heap* profile additionally contains events for allocating
     heap. Such a profile is generated when using the RTS option
     `-bh'.


File: user.info, Node: Contents of a GranSim Profile, Next: The Emacs GranSim Profile Mode, Prev: Types of GranSim Profiles, Up: GranSim Profiles

Contents of a GranSim Profile
=============================

This section describes the syntactic structure of a GranSim profile.

* Menu:

* Header::
* Body::


File: user.info, Node: Header, Next: Body, Prev: Contents of a GranSim Profile, Up: Contents of a GranSim Profile

Header
------

The header contains general information about the execution. It is
split into several sections separated by lines consisting only of `-'
symbols.
The end of the header is indicated by a line consisting only of `+'
symbols. The sections of the header are:

   * The name of the program, its arguments and the start time.

   * General parameters describing the parallel architecture. This
     covers the number of processors, flags for thread migration,
     asynchronous communication etc. Finally, this section describes
     basic costs of the parallel machine like thread creation time,
     context switch time etc.

   * Communication parameters describing basic costs for sending
     messages, like latency, message creation costs etc.

   * Instruction costs describing the costs of different classes of
     machine operations.


File: user.info, Node: Body, Prev: Header, Up: Contents of a GranSim Profile

Body
----

The body of the GranSim profile contains the events that are generated
during the execution of the program. The following subsections first
describe the general structure of the events and then go into the
details of several classes of events.

* Menu:

* General Structure::
* END Events::
* Basic Thread Events::
* Communication Events::
* Thread Migration Events::
* Spark Events::
* Debugging Events::


File: user.info, Node: General Structure, Next: END Events, Prev: Body, Up: Body

General Structure of an Event
.............................

Each line in the body of a GranSim profile represents one event during
the execution of the program. The general structure of one such line
is:

   * The keyword PE.

   * The *processor number* where the event happened.

   * The *time stamp* of the event (in square brackets).

   * The *name of the event*.

   * The *thread id* of the affected thread (a hex number).

   * Optionally a *node* as an additional argument to the event (e.g.
     the node to be reduced in the case of a START event). This is
     either a hex number or the special string `______' indicating a
     Nil-closure.

   * Additional information depending on the event. This can be the
     processor from which data is fetched or the length of the spark
     queue after starting a new thread.
The fields are separated by whitespace. A : symbol must follow the time stamp (which must be in square brackets).

File: user.info, Node: END Events, Next: Basic Thread Events, Prev: General Structure, Up: Body

END Events
..........

END events are an exception to this general structure. The reason for their special structure is that they summarise the most important information about the thread. Therefore, information about e.g. the granularity of the threads can be extracted from END events alone, without having to generate a full GranSim profile. The structure of an END event is:

* The keyword PE.

* The *processor number* where the event happened.

* The *time stamp* of the event (in square brackets).

* The *name of the event* (END in this case).

* The keyword SN followed by the *spark name* of the thread. This
  information makes it possible to associate a thread with its static
  spark site in the program (*Note Parallelism Annotations::, on how to
  give names to spark sites.)

* The keyword ST followed by the *start time* of the thread.

* The keyword EXP followed by a flag indicating whether this thread has
  been *exported* to another processor or has been evaluated locally
  (possible values are T and F).

* The keyword BB followed by the number of *basic blocks* that have
  been executed by this thread.

* The keyword HA followed by the number of *heap allocations*.

* The keyword RT followed by the total *runtime*. This is the most
  important information in an END event. It is used by the
  visualisation tools for generating granularity profiles.

* The keyword BT followed by the total *block time* and a block count
  (i.e. how often the thread has been blocked).

* The keyword FT followed by the total *fetch time* and a fetch count
  (i.e. how often the thread fetched remote data).

* The keyword LS followed by the *number of local sparks* (sparks that
  have to be executed on the local processor) generated by the thread.
* The keyword GS followed by the *number of global sparks* generated by
  the thread.

* The keyword EXP followed by a flag indicating whether this thread was
  *mandatory* or only advisable (in the current version this flag is
  not used; it would be important in a combination of GranSim with a
  concurrent set-up).

File: user.info, Node: Basic Thread Events, Next: Communication Events, Prev: END Events, Up: Body

Basic Thread Events
...................

The main events directly related to threads are:

* START: Generated when starting a thread (after adding the overhead
  for thread creation to the clock of the current processor). After the
  thread id it has two additional fields: one specifying the node to be
  evaluated (as a hex number) and the spark site that generated this
  thread (format: [SN N] where N is a dec number).

* START(Q): Same as START but the new thread is put into the runnable
  queue rather than being executed (only if the current processor is
  busy at that point).

* BLOCK: A thread is blocked on data that is under evaluation by
  another thread. It is descheduled and put into the blocking queue of
  that node. Two additional fields contain the node on which the thread
  is blocked and the processor on which this node resides (format:
  (from N) where N is a processor (dec) number).

* RESUME: Continue to execute the thread after it has been blocked or
  has been waiting for remote data. This event does not contain
  additional fields.

* RESUME(Q): Same as RESUME but the new thread is put into the runnable
  queue rather than being executed (only if the current processor is
  busy at that point).

* SCHEDULE: The thread is scheduled on the given processor (no
  additional fields). This event is usually emitted after terminating a
  thread on the processor. It may also occur after a FETCH (if
  asynchronous communication is turned on) or after a BLOCK event.

* DESCHEDULE: The thread is descheduled on the given processor (no
  additional fields).
After this event the thread is in the runnable queue. This event is not used for the implicit descheduling that is performed after events like BLOCK or FETCH. DESCHEDULE events should only occur if fair scheduling is turned on.

File: user.info, Node: Communication Events, Next: Thread Migration Events, Prev: Basic Thread Events, Up: Body

Communication Events
....................

Events that are issued when sending data between processors are:

* FETCH: Send a fetch request from the given thread (on the given
  processor) to another processor. This event has two additional
  fields: the first field is the node (hex number) that should be
  fetched. The last field is the processor where this node resides and
  from which the data has to be fetched (format: (from N) where N is a
  processor (dec) number).

* REPLY: A reply to a fetch request of the given thread arrived at the
  given processor. The first additional field contains the node and the
  last field contains the processor from which it arrived (format:
  (from N) where N is a processor (dec) number). Note: This event only
  marks the arrival of the data. It is usually followed by a RESUME or
  RESUME(Q) event for the thread that asked for the data.

File: user.info, Node: Thread Migration Events, Next: Spark Events, Prev: Communication Events, Up: Body

Thread Migration Events
.......................

These events are only produced when thread migration is enabled (-bM):

* STEALING: Indicates the stealing of a thread on the given processor.
  The thread which is being stolen appears in the thread field. One
  additional field (the last field) indicates which processor is
  stealing that thread (format: (by N) where N is a processor (dec)
  number).

* STOLEN: Indicates the arrival of a stolen thread on the given
  processor. Two additional fields show the node which will be
  evaluated by this thread next. The last field shows from which
  processor the thread has been stolen (format: (from N) where N is a
  processor (dec) number).
Note: This thread is immediately executed by the given processor (no RESUME event follows).

* STOLEN(Q): Same as STOLEN but the new thread is put into the runnable
  queue rather than being executed (only if the current processor is
  busy at that point).

File: user.info, Node: Spark Events, Next: Debugging Events, Prev: Thread Migration Events, Up: Body

Spark Events
............

When spark profiling is enabled, events related to sparks will appear in the profile:

* SPARK: Indicates the generation of a spark on the given processor for
  the given node. At that point it is added to this processor's spark
  pool. Two additional fields show the node to which this spark is
  pointing and the current size of the spark pool (format: [sparks N]
  where N is a dec number).

* SPARKAT: Same as SPARK but with explicit placement of the spark on
  this processor. This is usually achieved in the program by using a
  parLocal or parAt rather than a parGlobal annotation (*Note
  Parallelism Annotations::).

* USED: Indicates that this spark is turned into a thread on the given
  processor. A START or START(Q) event will follow soon afterwards.

* PRUNED: A spark is removed from the spark pool of the given
  processor. This might occur when the spark points to a normal form
  (there is no work to do for that spark). This is checked when
  creating a spark and when searching the spark pool for new work.

* EXPORTED: A spark is exported from the given processor. Two
  additional fields show the node to which this spark is pointing and
  the current size of the spark pool (format: [sparks N] where N is a
  dec number).

* ACQUIRED: A spark that has been exported by another processor is
  acquired by the given processor. Two additional fields have the same
  meaning as for EXPORTED.

File: user.info, Node: Debugging Events, Prev: Spark Events, Up: Body

Debugging Events
................

Certain debug options generate additional events that make it possible to monitor the internal behaviour of the simulator.
This information shouldn't be of interest for the friendly user but might come in handy for those who dare to hack on the runtime system:

* SYSTEM_START: Indicates that the simulator is executing a "system"
  routine (a routine in the runtime-system that is not directly related
  to graph reduction). This makes it possible to show exactly when
  rescheduling is done in the simulator. It may be useful in
  GranSim-Light to check that the costs during system operations are
  attached to the right thread.

* SYSTEM_END: See previous event. From this point on normal graph
  reduction is performed.

File: user.info, Node: The Emacs GranSim Profile Mode, Prev: Contents of a GranSim Profile, Up: GranSim Profiles

The Emacs GranSim Profile Mode
==============================

Looking up information directly in a GranSim profile is very tedious (believe me, I have done it quite often). To make this task easier the GranSim Toolbox contains a GNU Emacs mode for GranSim profiles: the "GranSim Profile Mode". The most useful features (IMNSHO) are the highlighting of parts of a GranSim profile and the narrowing of the profile to specific PEs, threads, events etc.

* Menu:

* Installation::
* Customisation::
* Features::

File: user.info, Node: Installation, Next: Customisation, Prev: The Emacs GranSim Profile Mode, Up: The Emacs GranSim Profile Mode

Installation
------------

To use this mode just put the file `GrAnSim.el' somewhere on your Emacs LOAD-PATH and load the file. I don't have autoload support at the moment, but the file is very short anyway, so directly loading it is quite fast. Currently, the mode requires the HILIT19 package for highlighting parts of the profile. It also requires the `tf' script in the bin dir of your GranSim installation. I use Emacs 19.31 with the default `hilit19.el' package, but the GranSim profile mode has also been successfully tested with Emacs 19.27. However, if you have problems with the mode please report them to the address shown at the end of this document (*Note Bug Reports::).
File: user.info, Node: Customisation, Next: Features, Prev: Installation, Up: The Emacs GranSim Profile Mode

Customisation
-------------

A few Emacs variables control the behaviour of the GranSim Profile mode:

 -- Variable: gransim-auto-hilit
     This variable indicates whether highlighting is turned on by
     default. Note that you can customise `hilit19' such that it does
     not automatically highlight buffers that are bigger than a given
     size. Since GranSim profiles tend to be extremely large you might
     want to reduce the default value.

 -- Variable: grandir
     The root of the GranSim installation. The mode searches for
     scripts of the GranSim Toolbox in the directory grandir/bin. By
     default this variable is set to the contents of the environment
     variable `GRANDIR'.

 -- Variable: hwl-hi-node-face
     Face to be used for specific highlighting of a node.

 -- Variable: hwl-hi-thread-face
     Face to be used for specific highlighting of a thread.

Here are the hilit19 variables that are of some interest for the GranSim Profile Mode:

 -- Variable: hilit-auto-highlight
     T if we should highlight all buffers as we find 'em, nil to
     disable automatic highlighting by the find-file hook. Default
     value: t.

 -- Variable: hilit-auto-highlight-maxout
     Auto-highlight is disabled in buffers larger than this. Default
     value: 60000.

File: user.info, Node: Features, Prev: Customisation, Up: The Emacs GranSim Profile Mode

Features
--------

The main features of the GranSim profile mode are:

* *Highlighting* of parts of the profile. Colour coding is used to
  distinguish between events that start a reduction, finish a reduction
  and block a reduction. Within END events the total runtime is
  specially highlighted.

* *Narrowing* of the profile. This should not be confused with the
  narrowing mode in Emacs. The narrowing in GranSim profile mode is
  done by running a script (`tf') over the buffer and displaying the
  output in another buffer.
Hence, the narrowing can be further refined by improving the `tf' script, which is written in Perl. It is possible to narrow a GranSim profile to a specific

* *processor* (PE),
* *event*,
* *thread*,
* *node*,
* *spark* (only possible for spark profiles)

This feature is particularly useful, e.g., to follow a node that has been moved between processors or to concentrate on the reductions on one specific processor. Of course, for those pagans who don't believe in Emacs, it is also possible to run the `tf' script directly on a `.gr' file.

* A second form of *highlighting*, specialised for nodes and threads,
  is available, too. With the commands `hwl-hi-thread' and
  `hwl-hi-node' every occurrence of the thread or node in the profile
  after the current point is highlighted. The function `hwl-hi-clear'
  undoes all such highlighting.

* There is a menu item for calling most of the functions described
  here. It automatically appears in any GranSim profile (i.e. any file
  that has a `.gr' extension).

Default key bindings in GranSim profile mode:

 -- : `C-c T', `M-x hwl-truncate'
     Truncate event lines such that exactly one line is shown for one
     event in the body of a profile.

 -- : `C-c W', `M-x hwl-wrap'
     Wrap lines to show them in full length.

 -- : `C-c ', `M-x hwl-toggle-truncate-wrap'
     Toggle between the above two modes.

 -- : `C-c H', `M-x hilit-rehighlight-buffer'
     Rehighlight the whole buffer.

 -- : `C-c P', `M-x hwl-narrow-to-pe'
     Narrow the profile to a PE.

 -- : `C-c T', `M-x hwl-narrow-to-thread'
     Narrow the profile to a thread.

 -- : `C-c E', `M-x hwl-narrow-to-event'
     Narrow the profile to an event.

 -- : `C-c C-e', `(lambda () (hwl-narrow-to-event "END"))'
     Narrow the profile to an END event.

 -- : `C-c N', `M-x hwl-hi-node'
     Highlight a node in the profile.

 -- : `C-c T', `M-x hwl-hi-thread'
     Highlight a thread in the profile.

 -- : `C-c C-c', `M-x hwl-hi-clear'
     Remove highlightings of nodes and threads.
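For those who prefer to process profiles programmatically rather than through Emacs or the `tf' script, the fixed prefix of an event line described earlier (the keyword PE, the processor number, the time stamp in square brackets followed by a :, the event name and the thread id) can be parsed with a few lines of Haskell. This is only an illustrative sketch; the record and function names below are our own and are not part of any GranSim tool:

```haskell
import Data.Char (isDigit)

-- One parsed event line; the field names are illustrative, not GranSim's.
data Event = Event
  { evProc   :: Int      -- processor number
  , evTime   :: Integer  -- time stamp
  , evName   :: String   -- event name (START, END, FETCH, ...)
  , evThread :: String   -- thread id (a hex number)
  , evRest   :: [String] -- remaining, event-specific fields
  } deriving Show

-- Parse the fixed prefix "PE <proc> [<time>]: <event> <thread> ...".
-- Header lines and anything else that does not match yield Nothing.
parseEvent :: String -> Maybe Event
parseEvent line =
  case words line of
    ("PE" : pe : time : name : tid : rest)
      | all isDigit pe && any isDigit time ->
          Just (Event (read pe) (read (filter isDigit time)) name tid rest)
    _ -> Nothing
```

A narrowing to one PE, similar to what the `tf' script does, is then e.g. `filter (maybe False ((== 0) . evProc) . parseEvent) (lines profileText)'.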
File: user.info, Node: Parallel NoFib Suite, Next: Internals of GranSim, Prev: GranSim Profiles, Up: Top

The Parallel NoFib Suite
************************

While developing GranSim (and GUM) we have started to collect a set of interesting and non-trivial parallel programs written in parallel Haskell. These programs are used as a test suite for GranSim and form part of the NoFib Suite of GHC. Currently this suite contains the following programs:

1. The Sieve of Eratosthenes (`sieve'), which creates a bidirectional
   pipeline, where each processor filters the multiples of one prime
   number out of an input list.

2. Warshall's shortest path algorithm in a graph (`warshall'), which
   creates a cyclic pipeline of processors.

3. A function `matmult', which performs a parallel matrix
   multiplication.

4. A function `determinant', which computes the determinant of a given
   square matrix.

5. A function `LinSolv', which solves a given set of linear equations
   over integers by using multiple homomorphic images, computing the
   result in each image via Cramer's Rule. The main part of this
   algorithm is the parallel determinant computation.

6. A function `coins', which computes the number of ways to pay a given
   value with a given set of coins.

7. An award assignment program that basically has to compute
   permutations of a given list (`pperms').

8. A parallel version of Quicksort (`sort').

9. A word search program `soda7', which searches a given grid of
   letters for a given set of words.

10. An RSA encryption program `rsa' (taken out of the sequential nofib
    suite).

11. An n-queens program `queens'.

12. The parallel database manager for GUM `dcbm'. It performs queries
    to a database in parallel.

13. The bill of materials program `bom' developed (under GranSim) by
    Phil Trinder as part of the Parallel Databases Project.

14. A ray-tracer `nray' based on a version taken from Kelly's thesis
    and ported to GRIP by Hammond.

15.
A univariate resultant computation, using the SACLIB library for
    computer algebra. Mainly the basic polynomial arithmetic has been
    taken from that library.

16. One of the FLARE programs, solving a problem in the area of
    computational fluid dynamics (`cfd').

17. A function `minimax', which computes the best next move for a given
    position in the game of tic-tac-toe, using an alpha-beta pruning
    algorithm.

18. Several NESL programs, including a numerical integration algorithm
    `integrate', a quick hull algorithm `qhull' for computing the
    convex hull of a set of points in a plane, a matrix inversion based
    on Gauss-Jordan elimination `mi' and a fast Fourier transformation
    `fft'. All algorithms are based on the NESL versions published by
    Guy Blelloch in CACM 39(3) and translated to Haskell.

19. A Newton-Raphson iteration for finding a root of a polynomial. This
    program has been taken from the Impala suite of implicitly parallel
    benchmark programs (written in Id).

Despite its name, the parallel nofib suite also contains some fib-ish programs. These programs should be of interest for getting started with parallel functional programming using GranSim:

1. A parallel factorial function (`parfact'), which computes the sum of
   all numbers from 1 up to a given value n by bisecting the interval
   and computing the results for the subintervals in parallel.

2. A parallel fib-like function (`parfib'), which performs an
   additional gcd computation and 2 additional multiplications in each
   recursive call.
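To give a flavour of these introductory programs, here is a minimal fib-style divide-and-conquer function written with `par' and `seq' in the spirit of the `parmap' examples earlier in this guide. It is only a sketch: it omits the gcd computation and the extra multiplications that the actual `parfib' performs in each call, and it assumes the `par' combinator as provided by GHC's Control.Parallel module (a GranSim set-up may use its own annotations instead):

```haskell
import Control.Parallel (par)  -- assumed source of `par'; GranSim may differ

-- Spark the first recursive call for parallel evaluation, evaluate the
-- second one locally, and only then combine the two results.  The `seq'
-- ensures that nf2 is demanded before the sum is built, as in the
-- parmap examples.
parfib :: Int -> Integer
parfib n
  | n <= 1    = 1
  | otherwise = nf1 `par` (nf2 `seq` (nf1 + nf2))
  where
    nf1 = parfib (n - 1)
    nf2 = parfib (n - 2)
```

Running such a program under GranSim and inspecting the resulting `.gr' profile with the tools described above is a good first exercise.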