Info file: user.info, -*-Text-*- produced by `texinfo-format-buffer' from file `user.texi' using `texinfmt.el' version 2.32 of 19 November 1993. The GranSim User's Guide discusses how to use the GranSim simulator for the parallel execution of Haskell programs. Copyright (C) 1994 -- 1996, Hans-Wolfgang Loidl for the GRASP/AQUA Project, Glasgow University  File: user.info, Node: Top, Next: Quick Intro, Prev: (dir), Up: (dir) GranSim User's Guide ******************** This user's guide describes how to use GranSim for simulating the parallel execution of (annotated) Haskell programs. In passing we will discuss how to write parallel, lazy functional programs and how to tune their performance. To this end, some visualisation tools for generating activity and granularity profiles of the execution will be discussed. A set of example programs demonstrates the use of GranSim. GranSim is part of the Glasgow Haskell Compiler (GHC), in fact it is a special setup of GHC, which uses a slightly modified compiler (for instrumenting the code) and an extended runtime-system. For users who are already familiar with the GHC and parallel functional programming in general there is a quick introduction to GranSim available (*Note Quick Intro::). * Menu: * Quick Intro:: * Overview:: * Setting-up GranSim:: * Parallelism Annotations:: * Runtime-System Options:: * Strategies:: * Visualisation Tools:: * GranSim Profiles:: * Parallel NoFib Suite:: * Internals of GranSim:: * GranSim vs GUM:: * Future Extensions:: * Bug Reports:: * Determinant (Example) :: * Options Index:: * Concept Index::  File: user.info, Node: Quick Intro, Next: Overview, Prev: Top, Up: Top A Quick Introduction to GranSim ******************************* If you already know how to compile a Haskell program with GHC and if you have an installed version of GranSim available there are only a few changes necessary to simulate parallel execution. Basically, a compile time flag has to be used to generate instrumented code. Runtime-system flags then control the behaviour of the simulation. 1. Compile all modules with the additional options -gransim and -fvia-C. Use -gransim also when linking object files. For example ghc -gransim -fvia-C -o foo foo.hs creates a GranSim executable file foo. 2. When running the program use the runtime-system option -bP to generate a full GranSim profile (*Note Types of GranSim Profiles::). *Note Runtime-System Options:: for a description of all options that allow you to control the behaviour of the simulated parallel architecture. For example ./foo +RTS -bP -bp16 -bl400 starts a simulation for a machine with 16 processors and a latency of 400 machine cycles. It generates a GranSim profile `foo.gr' . 3. Use one of the visualisation tools (*Note Visualisation Tools::) to examine the behaviour of the program. The first bet is to use `gr2ps', which generates a graph showing the overall activity of the machine in a global picture. For example gr2ps -O foo.gr generates an activity profile as a colour PostScript file `foo.ps'. It shows how many threads in total have been running, runnable (but not running), blocked (on data under evaluation), fetching (remote data) and migrating (to another processor) at each point during the execution. Other tools you might want to try are `gr2pe' (giving a per-PE activity profile) and `gr2ap' (giving a per-thread activity profile). Additionally, another set of visualisation tools allows to focus on the granularity of the generated threads. 
The most important one is `gr2gran', which generates bucket statistics showing the runtime of the individual threads (*Note Granularity Profiles::).


File: user.info, Node: Overview, Next: Setting-up GranSim, Prev: Quick Intro, Up: Top

Overview
********

GranSim is mainly intended to be a research tool for studying the dynamic behaviour of parallel, lazy functional programs. By simulating the parallel execution it is possible to concentrate on the amount and structure of the parallelism exposed by the program without being distracted by `external' details like the load of the machine produced by other users or the basic network traffic. However, for obtaining results that are of relevance for a particular machine it is possible to set parameters in GranSim that very accurately model the behaviour of this machine.

* Menu:

* Components::
* Semi-explicit Parallelism::
* GranSim Modes::
* Example::
* Running it::


File: user.info, Node: Components, Next: Semi-explicit Parallelism, Prev: Overview, Up: Overview

Components of the GranSim System
================================

The overall GranSim System consists of the following components:

   * The *GranSim Simulator* proper.
   * Some *Visualisation Tools*.
   * The *GranSim Toolbox*.
   * The *GranSim T-shirt* (not available yet, sorry).

The GranSim simulator is built on top of the "runtime-system" (RTS) of the Glasgow Haskell Compiler ("GHC"). Thus, the major part of it is a (rather big) *runtime system for GHC*. This part is responsible for implementing the event driven simulation of parallelism. For example, communication is modelled by adding new features to the appropriate routines in the sequential runtime-system.

For simulating parallelism the generated code has to be instrumented, which is achieved by a slightly modified code generator in GHC. To produce instrumented code the compile time option -gransim of the Haskell compiler has to be used. This is what we mean by saying `compiling under GranSim'. Note that it is not possible to use object files compiled for another setup of GHC to generate a GranSim executable file.

Our approach is similar to that of "GUM" (a portable parallel implementation of Haskell), which has also been developed in this department. Both systems are basically extensions of the sequential runtime-system. GranSim is also similar to GUM in the way it implements several features like the communication mechanism. To a large extent both systems actually use the same code. *Note GranSim vs GUM:: for a more detailed comparison.

In order to analyse the result of a simulation it is very important to visualise the abundance of data. To this end the GranSim system contains a set of *visualisation tools* that generate PostScript graphs, focusing on specific aspects of the execution. The most important usage of these tools is the generation of an overall activity profile of the execution.

The system contains a set of small tools, usually scripts processing the output of a simulation. These tools are collected in the so-called *GranSim Toolbox*.

The final component, the *GranSim T-shirt*, has not been implemented yet. If you have suggestions, feel free to email me.


File: user.info, Node: Semi-explicit Parallelism, Next: GranSim Modes, Prev: Components, Up: Overview

Semi-explicit Parallelism
=========================

The underlying execution model of GranSim is one of semi-explicit parallelism where expressions that should be evaluated in parallel have to be annotated by using the par annotation.
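For instance, with a hypothetical function f and a consumer g, such an annotation could look as follows (a minimal sketch; the annotations themselves are described in detail in *Note Parallelism Annotations::):

     let x = f a
     in  x `par` g x

Here a spark for evaluating x is created before the evaluation of g x starts.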
However, it is up to the runtime-system to decide when and where to create a parallel thread. It may even discard potential parallelism altogether. The communication between the threads is performed via accessing shared data structures. If a thread tries to access a value that is under evaluation, it is put into a blocking queue attached to that closure. When the closure is updated with the result this blocking queue is released. So, the only extension of the sequential evaluation mechanism is the addition of blocking queues.

Because this model hides as many low-level details as possible in the runtime-system we call it semi-explicit. It is only necessary to mark potential parallelism in the program. All issues related to communication, load balancing etc. are handled by the runtime system and can be ignored by the programmer.

For example, in the expression

     let squares = [ i^2 | i <- [1..n] ]
     in  squares `par` sum squares

the list of squares will be produced by one thread (the "producer") and the resulting sum will be computed by another thread (the "consumer"). Note that this producer-consumer structure of the algorithm is solely determined by the data-dependencies in the program. All the programmer has to do is to annotate a named subexpression (in this example squares), which should be computed in parallel. Thread placement, communication and whether the thread should be created at all are up to the runtime-system. In this introduction we don't go into the operational details of this example. *Note Example: Sum of Squares:: for a discussion of how to improve the parallelism of the algorithm.

This model of using par and seq annotations is the same as that used for the GRIP machine and for the GUM system. In fact, there is a strong correspondence between GranSim and GUM. This makes it possible to carry results of a GranSim simulation over to a real parallel machine operating under GUM (this issue will be addressed in more detail in *Note GranSim vs GUM::.).


File: user.info, Node: GranSim Modes, Next: Example, Prev: Semi-explicit Parallelism, Up: Overview

GranSim Modes
=============

This section outlines a methodology for parallelising lazy functional programs. The development and performance tuning of a parallel algorithm typically proceeds in several stages:

  1. *Profiling* of the sequential algorithm if the goal is the parallelisation of an already existing algorithm. This stage should give an idea about the `big computations' in the program. The programmer can then concentrate on parallelising this part of the algorithm.

  2. *Extracting* parallelism out of the algorithm. In our model of computation this means adding seq and par annotations to the program (*Note Parallelism Annotations::, for a discussion of all available annotations).

  3. *Restructuring* parts of the algorithm to increase the amount of parallelism. This can be called performance tuning on an abstract level, where the performance of the algorithm is solely modelled by the amount of parallelism in the program.

  4. *Optimising* the parallel algorithm for execution on a specific machine. In this stage low-level details have to be addressed.

To facilitate such a process of parallelisation the GHC system provides several tools. In particular GHC and GranSim support different modes reflecting the different stages:

  1. The sequential *profiling* setup of GHC makes it possible to obtain accurate information about computation time and heap usage of program expressions.
Especially for parallelising big sequential programs the profiler gives invaluable information for the proper parallelisation. A more detailed description of the profiler is part of the overall GHC documentation. 2. The *GranSim-Light* mode simulates the parallel execution of a program with an unbounded number of processors. In this mode no communication is modelled (i.e. access of remote data is as expensive as access of local data). Thus, this mode mainly shows the amount of parallelism in the program and indicates an upper bound of the parallelism when running the program on a real machine. 3. In the usual mode *GranSim* can simulate the execution of up to 32 processors (1) (*Note GranSim Modes-Footnotes::) with an exact modelling of communication. Thus, the execution on a particular machine with the chosen characteristics is simulated. 4. A set of *visualisation tools* allows to generate activity and granularity profiles of the execution. These tools allow to generate overall profiles as well as per-PE and per-thread profiles (*Note Visualisation::). This especially facilitates the performance tuning of the program.  File: user.info Node: GranSim Modes-Footnotes, Up: GranSim Modes (1) In general the word size of the machine is the upper bound for the number of processors.  File: user.info, Node: Example, Next: Running it, Prev: GranSim Modes, Up: Overview A Simple Example Program ======================== This section gives an example of how to write a simple parallel program in the parallel, lazy functional execution model. Due to a sad lack of imagination we take our good old friend, nfib, as an example program. This is the sequential function from which we start: nfib :: Int -> Int nfib 0 = 1 nfib 1 = 1 nfib n = nf1+nf2+1 where nf1 = nfib (n-1) nf2 = nfib (n-2) The straightforward idea for parallelising nfib is to start parallel processes for computing both recursive calls. This is expressed with the par annotation, which `sparks' a parallel process for computing its first argument and whose result is its second argument (the exact operational semantics of par is discussed in more detail in chapter *Note Parallelism Annotations::.). This gives the following parallel program: parfib :: Int -> Int parfib 0 = 1 parfib 1 = 1 parfib n = nf2 `par` (nf1 `par` (nf1+nf2+1)) where nf1 = parfib (n-1) nf2 = parfib (n-2) The drawback of this program is the blocking of the main task on the two created child-tasks. Only when both child tasks have returned a result, the main task can continue. It is more efficient to have one of the recursive calls computed by the main task and to spark only one parallel process for the other recursive call. In order to guarantee that the main expression is evaluated in the right order (i.e. without blocking the main task on the child task) the seq annotation is used: parfib :: Int -> Int parfib 0 = 1 parfib 1 = 1 parfib n = nf2 `par` (nf1 `seq` (nf1+nf2+1)) where nf1 = parfib (n-1) nf2 = parfib (n-2) The operational reading of this function is as follows: First spark a parallel process for computing the expression parfib (n-2). Then evaluate the expression parfib (n-1). Finally, evaluate the main expression. If the parallel process has not finished by this time the main task will be blocked on nf2. Otherwise it will just pick up the result. This simple example already shows several basic aspects of parallel, lazy functional programming: * For adding parallelism to the program par annotations on local variables (that appear in the main expression) are used. 
   * To specify the evaluation order of the program seq annotations are used. A seq annotation reduces its first argument (usually a variable occurring in the definition of the second argument) to weak head normal form (WHNF).

   * For the efficiency of the parallel program it is important to avoid unnecessary blocking during the execution.

Another aspect not shown in this example is the importance of evaluation degree. If the parallel expressions create a compound type (e.g. a list) then it is important that they are evaluated to a certain degree. Otherwise there won't be a lot of parallelism in the program. We will revisit this aspect in *Note Strategies::.

Finally, here is the complete example as a stand-alone program:

     module Main where

     import Parallel

     main = getArgs abort ( \ a ->
              let n = head ( map (\ a1 -> fst ((readDec a1) !! 0)) a )
              in  print ("parfib " ++ (show n) ++ " = " ++
                         (show (parfib n)) ++ "\n") )

     nfib :: Int -> Int
     nfib 0 = 1
     nfib 1 = 1
     nfib n = nf1+nf2+1
              where nf1 = nfib (n-1)
                    nf2 = nfib (n-2)

     parfib :: Int -> Int
     parfib 0 = 1
     parfib 1 = 1
     parfib n = nf2 `par` (nf1 `seq` (nf1+nf2+1))
              where nf1 = parfib (n-1)
                    nf2 = parfib (n-2)


File: user.info, Node: Running it, Prev: Example, Up: Overview

Running the Example Program
===========================

In a standard installation of GHC with GranSim enabled the example program can be compiled as usual, except that the compiler option -gransim has to be added. So, if the above program is in the file `parfib.hs', compile a GranSim version of the program with

     ghc -gransim -fvia-C parfib.hs

GranSim requires a compilation using C code as an intermediate stage. Compiling with the native code generator is not supported.

To simulate the execution on a machine with 4 processors type

     ./a.out +RTS -bP -bp4

This will generate a granularity profile (`.gr' extension). For getting an overall activity profile of the execution use

     gr2ps -O a.out.gr

Use your favourite PostScript viewer to examine the activity profile in `a.out.ps'. You'll get the best results with GNU ghostview and ghostscript version 2.6.1 or later.


File: user.info, Node: Setting-up GranSim, Next: Parallelism Annotations, Prev: Overview, Up: Top

Setting-up GranSim
******************

[WARNING: THIS IS VERY DRAFTY FOR NOW. IF YOU HAVE SEVERE INSTALLATION PROBLEMS PLEASE LET ME KNOW.]

This chapter describes how to get the latest version of GranSim, and how to install it. The system requirements are basically the same as for GHC itself. So far, GranSim has been tested on SUNs under SunOS 4.1 and Solaris 2, and on DEC Alphas under OSF 1 and OSF 3.2. The visualisation tools require Perl, Bash and Gnuplot (the latter only for granularity profiles).

* Menu:

* Retrieving::
* Installing::
* Trouble Shooting::


File: user.info, Node: Retrieving, Next: Installing, Prev: Setting-up GranSim, Up: Setting-up GranSim

Retrieving
==========

The important addresses to check for the latest information about GHC and GranSim are:

   * GranSim Home Page -- http://www.dcs.glasgow.ac.uk/fp/software/gransim.html
   * Anonymous FTP Server -- ftp://ftp.dcs.glasgow.ac.uk/pub/haskell/glasgow/
   * GHC home page -- http://www.dcs.glasgow.ac.uk/fp/software/ghc.html (*Note (GHC)::)
   * Glasgow FP group page -- http://www.dcs.glasgow.ac.uk/fp/

To get the current version of GranSim go to the anonymous FTP Server at Glasgow and retrieve version 0.29 of GHC.
The sources of the compiler with GranSim support are in ghc-0.29-src.tar.gz Only if you don't already have an installed version of GHC for bootstrapping the new version you'll have to download the HC files in that directory, too. Check the README file on the FTP server to get more information about the different versions you can download. There should be at least one binary installation of GranSim available for Suns. If you have the right machine and operating system you can just download and unpack this version.  File: user.info, Node: Installing, Next: Trouble Shooting, Prev: Retrieving, Up: Setting-up GranSim Installing ========== Follow the instructions in the GHC Installation Guide (in the subdirectory `ghc/docs/install_guide'. Note that so far GranSim has only been tested with Haskell 1.2. I recommend using that version until I have had a closer look at the interaction of GranSim with the new, shiny GHC for 1.3. The only things different from a normal installation are: * Call the `configure' script with the additional option --enable-gransim * Typing make all (after `configure' and `STARTUP') should create all necessary libraries. All GranSim specific libraries (and hi files) have the extension `_mg.a' (`_mg.hi'). You can find a copy of the GranSim User's Guide in the subdirectory `ghc/docs/gransim'.  File: user.info, Node: Trouble Shooting, Prev: Installing, Up: Setting-up GranSim Trouble Shooting ================ If an installation with make all doesn't run through smoothly, try to build the components by hand. Often it's just a problem with dependencies or old interface files. Try to recompile the modules by hand on which the failing module depends. The top level files for the installation are * `hsc' in the directory `ghc/compiler/': the compiler proper. * The libraries `libHS_mg.a' and `libHS13_mg.a' in the directory `ghc/lib/': these are the necessary built-in libraries (preludes). The libraries `libHSghc_mg.a' and `libHShbc_mg.a' are optional. * `libHSrts_mg.a' in the directory `ghc/runtime/': the runtime-system. You can create a version of the RTS with debugging information (and all GranSim debugging options) by typing make EXTRA_HC_OPTS="-optcO-DGRAN -optcO-DGRAN_CHECK -optcO-DDEBUG -optcO-g" libHSrts_mg.a For those Bravehearts who want to actually hack on the RTS I recommend first having a closer look at the hackers sections of the general GHC User's Guide (*Note (ghc-user-guide.info)Top::). Eventually, I'll add GranSim specific stuff to the chapter about the internals of GranSim (*Note GranSim Internals::) but for now there is not much in this chapter.  File: user.info, Node: Parallelism Annotations, Next: Runtime-System Options, Prev: Setting-up GranSim, Up: Top Parallelism Annotations *********************** This chapter discusses the constructs for adding parallelism to a Haskell program. All constructs are annotations that do not change the semantics of the program (one exception to this rule is the annotation for forcing sequential evaluation since it may change the strictness properties of the program). Normally the basic annotations that are discussed first are sufficient to write and tune a parallel program. The advanced annotations allow to name static spark sites and to provide granularity information. *NB:* To use these annotations the module `Parallel' must be imported as shown in the introductory example (*Note Example::). 
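For instance, assuming `Parallel' has been imported, a parallel version of map could be sketched with these annotations as follows (an illustration only, written just with the `par' and `seq' annotations described below; it is not taken from this guide's examples):

     parMap :: (a -> b) -> [a] -> [b]
     parMap f []     = []
     parMap f (x:xs) = fx `par` (rest `seq` (fx : rest))
                       where fx   = f x
                             rest = parMap f xs

Each application of f is sparked, and forcing the rest of the result list ensures that all sparks are created early.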
* Menu: * Basic Annotations:: * Advanced Annotations:: * Experimental Annotations::  File: user.info, Node: Basic Annotations, Next: Advanced Annotations, Prev: Parallelism Annotations, Up: Parallelism Annotations Basic Annotations ================= The two basic parallelism annotations in GranSim (as well as in GUM and GRIP) are `par' and `seq'. Both take two arguments and return the value of the second argument as result. However, they differ in their operational behaviour: -- Annotation: par :: a -> b -> b `par x y' creates a spark for evaluating X and returns the value of Y as a result. -- Annotation: seq :: a -> b -> b `seq x y' reduces X to weak head normal form and then returns the value of Y as a result. The process of sparking a parallel thread creates potential parallelism (a "spark"). Such a spark may be turned into a thread or may be discarded by the runtime system. Several important facts should be noted about these annotations: * Semantically the value of `seq x y' equals that of Y only if the evaluation of X terminates and is error free. Thus, `seq' is strict in its first argument. * Typically X is a local variable and Y is an expression with X as a free variable. A typical idiom for using `par' is let x = ... in x `par` f x If X does not occur in Y speculative parallelism is created, which might trigger the evaluation of objects that are not needed for computing the result. This may also change the termination properties of the program. * The evaluation degree of Y is unaffected by these annotations. It is solely determined by the expression in which the result of this annotated expression is used.  File: user.info, Node: Advanced Annotations, Next: Experimental Annotations, Prev: Basic Annotations, Up: Parallelism Annotations Advanced Annotations ==================== The annotations in the previous section are actually specialisations of the following set of built-in functions which can be called directly: -- Annotation: parGlobal :: Int -> Int -> Int -> Int -> a -> b -> b `parGlobal n g s p x y' creates a spark for evaluating X and returns the value of Y as a result. The spark gets the name N with the granularity information G, the size information S and a degree of parallelism of P. The G, S and P fields just forward information to the runtime system. -- Annotation: parLocal :: Int -> Int -> Int -> Int -> a -> b -> b `parLocal n g s p x y' behaves like `parGlobal' except that the thread (if it is generated at all) must be executed on the current processor. -- Annotation: parAt :: Int -> Int -> Int -> Int -> a -> b -> c -> c `parAt n g s p v x y' behaves like `parGlobal' except that the thread (if it is generated at all) must be executed on the processor that contains the expression V. -- Annotation: parAtForNow :: Int -> Int -> Int -> Int -> a -> b -> c -> c `parAtForNow n g s p v x y' behaves like `parAt' except that the generated thread may be migrated to another processor during the execution of the program.  File: user.info, Node: Experimental Annotations, Prev: Advanced Annotations, Up: Parallelism Annotations Experimental Annotations ======================== Some experimental annotations currently available in GranSim are: -- Annotation: parAtAbs :: Int -> Int -> Int -> Int -> Int -> a -> b -> b `parAtAbs n g s p m x y' behaves like `parAt' except that the thread must be executed on processor number M (the processors are numbered from 0 to p-1 where p is the runtime system argument supplied via the -bp option). 
-- Annotation: parAtRel :: Int -> Int -> Int -> Int -> Int -> a -> b -> b `parAtRel n g s p m x y' behaves like `parAtAbs' except that the value of M is added to the number of the current processor to determine the target processor. -- Annotation: copyable :: a -> a `copyable x' marks the expression X to be copyable. This means that the expression will be transferred in its unevaluated form if it is needed on another processor. This may duplicate work. *This annotation is not yet implemented.* -- Annotation: noFollow :: a -> a `noFollow x' *This annotation is not yet implemented.*  File: user.info, Node: Runtime-System Options, Next: Strategies, Prev: Parallelism Annotations, Up: Top Runtime-System Options ********************** GranSim provides a large number of runtime-system options for controlling the simulation. Most of these options allow to specify a particular parallel architecture in very much detail. As general convention all GranSim related options start with -b to separate them from other GHC RTS options. To separate the RTS options from usual options given to the Haskell program the meta-option +RTS has to be used. If you are not interested in the details of the available options and just want to specify a somewhat generic setup for one class of parallel machines go to the last section of this chapter (*Note Specific Setups::). * Menu: * Basic Options:: * Special Features:: * Communication Parameters:: * Runtime-System Parameters:: * Processor Characteristics:: * Granularity Control Mechanisms:: * Miscellaneous Options:: * Debugging Options:: * General GHC RTS Options:: * Specific Setups::  File: user.info, Node: Basic Options, Next: Special Features, Prev: Runtime-System Options, Up: Runtime-System Options Basic Options ============= The options in this section are probably the most important GranSim options the programmer has to be aware of. They define the basic behaviour of GranSim rather than going down to a low level of specifying machine characteristics. -- RTS Option: -bP This option controls the generation of a GranSim profile (*Note GranSim Profiles::). By default a reduced profile (only END events) is created. With -bP a full GranSim profile is generated. Such a profile is necessary to create activity profiles. With -b-P no profile is generated at all. -- RTS Option: -bs Generate a spark profile. The GranSim profile will contain events describing the creation, movement, consumption and pruning of sparks. *Note:* This option will drastically increase the size of the generated GranSim profile. -- RTS Option: -bh Generate a heap profile. The GranSim profile will contain events describing heap allocations. *Note:* This option will drastically increase the size of the generated GranSim profile. -- RTS Option: -bpN Choose the number of processors to simulate. The value of N must be less than or equal to the word size on the machine (i.e. usually 32). If N is 0 GranSim-Light mode is enabled. -- RTS Option: -bp: Enable GranSim-Light (same as -bp0). In this mode there is no limit on the number of processors but no communication costs are recorded.  File: user.info, Node: Special Features, Next: Communication Parameters, Prev: Basic Options, Up: Runtime-System Options Special Features ================ The options in this section allow to simulate special features in the runtime-system of the simulated machine. This allows to study how these options influence the behaviour of different kinds of parallel machines. 
All of these flags can be turned off by inserting a - symbol after the b (as in -b-P).

* Menu:

* Asynchronous Communication::
* Bulk Fetching::
* Migration::


File: user.info, Node: Asynchronous Communication, Next: Bulk Fetching, Prev: Special Features, Up: Special Features

Asynchronous Communication
--------------------------

If a thread fetches remote data, the processor is by default blocked until the data arrives. This "synchronous communication" is usually only advantageous on machines with very low latency. On such machines it is better to wait and avoid the overhead of a context switch. Synchronous communication may also increase data locality as a thread is only descheduled when it gets blocked on data that is under evaluation by another thread.

On machines with high latency "asynchronous communication" is usually better as it allows the processor to perform some useful computation while a thread waits for the arrival of remote data. However, the processor might be even more aggressive in trying to get work while another thread is waiting for data. How aggressively the processor tries to get new work is determined by the "fetching strategy". Currently five different strategies are supported.

 -- RTS Option: -byN
     Choose a Fetching Strategy (i.e. determine what to do while the running thread fetches remote data):
       0. Synchronous communication (default).
       1. This and all higher fetching strategies implement asynchronous communication. This strategy schedules another runnable thread if one is available. This gives the same behaviour as -bZ.
       2. If no runnable thread is available a local spark is turned into a thread. This adds task creation overhead to context switch overhead for asynchronous communication.
       3. If no local sparks are available the processor tries to acquire a remote spark.
       4. If the processor can't get a remote spark it tries to acquire a runnable thread from another busy processor, provided that migration is also turned on (-bM).

 -- RTS Option: -bZ
     Enable asynchronous communication. This causes a thread to be descheduled when it fetches remote data. Its processor schedules another runnable thread. If none is available it becomes idle. This gives the same behaviour as -by1.

Note that fetching strategies 3 and 4 involve the sending of messages to other processors. Therefore, it's likely that by the time a spark (or thread) has been fetched, the original thread has already received its data and is being executed again. Hence, in most cases strategies 1 and 2 yield better results than 3 and 4.


File: user.info, Node: Bulk Fetching, Next: Migration, Prev: Asynchronous Communication, Up: Special Features

Bulk Fetching
-------------

When fetching remote data there are several possibilities for transferring the data. The options -bG and -bQN allow you to choose among them. By default GranSim uses "incremental fetching" (also called single closure fetching, or lazy fetching). This means that only the closure that is immediately required is fetched from a remote processor. Again, this strategy is preferable for low latency systems as it minimises the total amount of data that has to be transferred. However, if the overheads for creating a packet and for sending a message are high it is better to perform "bulk fetching", which transfers a subgraph with the required closure as its root. The size of the subgraph can be bounded by specifying the maximal size of a packet or by specifying the maximal number of "thunks" (unevaluated closures) that should be put into one packet.
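For instance, using the options described below, bulk fetching with packets of at most 256 words and at most 3 non-root thunks per packet could be simulated with an invocation like the following (a sketch only; the executable name `parfib' and the chosen figures are merely illustrative):

     ./parfib 30 +RTS -bP -bp16 -bl2000 -bG -bQ4 -Q256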
The way in which the closures to put into a packet are chosen is called the "packing strategy".

 -- RTS Option: -bG
     Enable bulk fetching. This causes the whole subgraph with the needed closure as its root to be transferred.

 -- RTS Option: -bQN
     Pack at most N-1 non-root thunks into a packet. Choosing a value of 1 means that only normal form closures are transferred, with the possible exception of the root, of course. The value 0 is interpreted as infinity (i.e. pack the whole subgraph). This is the default setting.

 -- RTS Option: -QN
     Pack at most N words into one packet. This limits the size of the packet and does not distinguish between thunks and normal forms. The default packet size is 1024 words. The option for setting the packet size differs from the usual GranSim naming scheme as it is also available for GUM. In fact, both implementations use almost the same source code for packing a graph.


File: user.info, Node: Migration, Prev: Bulk Fetching, Up: Special Features

Migration
---------

When an idle processor looks for work it first checks its local spark pool, then it tries to get a remote spark. It might happen that no sparks are available in the system any more and that some processors have several runnable threads. In such a situation it might be advantageous to transfer a runnable thread from a busy processor to an idle one. However, this "thread migration" is very expensive and should be avoided unless absolutely necessary. Therefore, by default thread migration is turned off.

 -- RTS Option: -bM
     Enable thread migration. When an idle processor has no local sparks and can't find global sparks it tries to migrate (steal) a runnable thread from another busy processor. Note that thread migration often causes a lot of fetching in order to move all required data to the new processor, too. This bears the risk of destroying data locality.


File: user.info, Node: Communication Parameters, Next: Runtime-System Parameters, Prev: Special Features, Up: Runtime-System Options

Communication Parameters
========================

The options in this section allow you to specify the overheads for communication. Note that in GranSim-Light mode all of these values are irrelevant (communication is futile).

 -- RTS Option: -blN
     Set the latency in the system to N machine cycles. Typical values for shared memory machines are 60 -- 100 cycles, for GRIP (a closely-coupled distributed memory machine) around 400 cycles, and for standard distributed memory machines between 1000 and 5000 cycles. The default value is 1000 cycles.

 -- RTS Option: -baN
     Set the additional latency in the system to N machine cycles. The additional latency is the latency of follow-up packets within the same message. Usually this is much smaller than the latency of the first packet (default: 100 cycles).

 -- RTS Option: -bmN
     Set the overhead for message packing to N machine cycles. This is the overhead for constructing a packet, independent of its size.

 -- RTS Option: -bxN
     Set the overhead for tidying up the packet after sending it to N machine cycles. On some systems significant work is needed after having sent a packet.

 -- RTS Option: -brN
     Set the overhead for unpacking a message to N machine cycles. Again, this overhead is independent of the message size.

 -- RTS Option: -bgN
     Set the overhead for fetching remote data to N machine cycles. By default this value is two times the latency plus the message unpack time.
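As an example, a machine with a first-packet latency of 2000 cycles, a follow-up packet latency of 200 cycles, and packing and unpacking overheads of 400 and 300 cycles could be simulated with an invocation like this one (again only a sketch; the executable name and all figures are purely illustrative and not measurements of any real machine):

     ./parfib 30 +RTS -bP -bp16 -bl2000 -ba200 -bm400 -br300
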
File: user.info, Node: Runtime-System Parameters, Next: Processor Characteristics, Prev: Communication Parameters, Up: Runtime-System Options Runtime-System Parameters ========================= The options in this section model overhead that is related to the runtime-system of the simulated parallel machine. -- RTS Option: -btN Set the overhead for thread creation to N machine cycles. This overhead includes costs for initialising a control structure describing the thread and allocating stack space for the execution. -- RTS Option: -bqN Set the overhead for putting a thread into the blocking queue of a closure to N machine cycles. -- RTS Option: -bcN Set the overhead for scheduling a thread to N machine cycles. -- RTS Option: -bdN Set the overhead for descheduling a thread to N machine cycles. -- RTS Option: -bnN Set the overhead for global unblocking (i.e. blocking on a remote closure) to N machine cycles. This value does not contain the overhead caused by the communication between the processors. -- RTS Option: -buN Set the overhead for local unblocking (i.e. putting a thread out of a blocking queue and into a runnable queue) to N machine cycles. This value does not contain the overhead caused by the communication between the processors.  File: user.info, Node: Processor Characteristics, Next: Granularity Control Mechanisms, Prev: Runtime-System Parameters, Up: Runtime-System Options Processor Characteristics ========================= The options in this section specify the characteristics of the microprocessor of the simulated parallel machine. To this end the instructions of the processor are divided into six groups. These groups have different weights reflecting their different relative costs. The groups of processor instructions are: * Arithmetic instructions * Load instructions * Store instructions * Branch instructions * Floating point instruction * Heap allocations The options for assigning weights to these groups are: -- RTS Option: -bAN Set weight for arithmetic operations to N machine cycles. -- RTS Option: -bLN Set weight for load operations to N machine cycles. -- RTS Option: -bSN Set weight for store operations to N machine cycles. -- RTS Option: -bBN Set weight for branch operations to N machine cycles. -- RTS Option: -bFN Set weight for floating point operations to N machine cycles. -- RTS Option: -bHN Set weight for heap allocations to N machine cycles. Strictly speaking, the heap allocation costs is a parameter of the runtime-system. However, in our underlying machine model allocating heap is such a basic operation that one can think of it as a special instruction.  File: user.info, Node: Granularity Control Mechanisms, Next: Miscellaneous Options, Prev: Processor Characteristics, Up: Runtime-System Options Granularity Control Mechanisms ============================== There are three granularity control mechanisms: 1. *Explicit threshold* No spark whose priority is smaller than a given threshold will be turned into a thread. 2. *Priority sparking* The spark queue is sorted by priority. This guarantees that the highest priority spark is turned into a thread. Priorities are not maintained for threads. 3. *Priority scheduling* The thread queue is sorted by priority, too. This guarantees that the biggest available thread is scheduled next. This imposes a higher runtime overhead. -- RTS Option: -bYN Use the value N as a threshold when turning sparks into threads. No spark with a priority smaller than N will be turned into a thread. -- RTS Option: -bXX Enable priority sparking. 
The letter X indicates how to use the granularity information attached to a spark site in the source code:

   *    Use the granularity information field as a priority.
   * I  Use the granularity information field as an inverse priority.
   * R  Ignore the granularity information field and use a random priority.
   * N  Ignore the granularity information field and don't use priorities at all.

 -- RTS Option: -bI
     Enable priority scheduling.

 -- RTS Option: -bKN
     Set the overhead for inserting a spark into a sorted spark queue to N machine cycles.

 -- RTS Option: -bON
     Set the overhead for inserting a thread into a sorted thread queue to N machine cycles.


File: user.info, Node: Miscellaneous Options, Next: Debugging Options, Prev: Granularity Control Mechanisms, Up: Runtime-System Options

Miscellaneous Options
=====================

 -- RTS Option: -bC
     Force the system to eagerly turn a spark into a thread. This basically disables the lazy thread creation mechanism of GranSim and ensures that no sparks are discarded (except for sparks whose closures are already in normal form).

 -- RTS Option: -be
     Enable deterministic mode. This means that no random number generator is used for deciding where to get work from. With this option two runs of the same program with the same input will yield exactly the same result.

 -- RTS Option: -bT
     Prefer stealing threads over stealing sparks when looking for remote work. This is mainly an experimental option.

 -- RTS Option: -bN
     When creating a new thread prefer sparks generated by local closures over sparks that have been stolen from other processors. This is mainly an experimental option, which might improve data locality.

 -- RTS Option: -bfN
     Specify the maximal number of outstanding requests to acquire sparks from remote processors. High values of N may cause a more even distribution of sparks, avoiding bottlenecks caused by a `spark explosion' on one processor. However, this might harm data locality. The default value is 1.

 -- RTS Option: -bwN
     Set the time slice of the simulator to N machine cycles. This is an internal variable that changes the behaviour of the simulation rather than that of the simulated machine. A longer time slice means a faster but less accurate simulation. The default time slice is 1000 cycles.


File: user.info, Node: Debugging Options, Next: General GHC RTS Options, Prev: Miscellaneous Options, Up: Runtime-System Options

Debugging Options
=================

These options are mainly intended for debugging GranSim. Only those options that might be of interest to the friendly (i.e. non-hacking) user of GranSim are listed here for now.

*Note:* These options are only available if the RTS has been compiled with the cpp flag GRAN_CHECK (*Note Installing::).

 -- RTS Option: -bDE
     Print event statistics at the end of the computation. These also include statistics about the packets sent (if bulk fetching has been enabled).

If you are *really* interested in all the hidden options in GranSim look into the file `ghc/runtime/main/RtsFlags.lc'.


File: user.info, Node: General GHC RTS Options, Next: Specific Setups, Prev: Debugging Options, Up: Runtime-System Options

General GHC RTS Options
=======================

Some options of the GHC runtime-system that are not specific to GranSim are of special interest, too. They are discussed in this section.

 -- RTS Option: -oN
     This option sets the initial size of the stack of a thread to N words. This might be of importance for GranSim-Light, which can create an abundance of parallel threads, filling up the heap.
The default stack size in GranSim-Light is already reduced to 200 words (usually the default is 1024 words). If you run into problems with heap size in a GranSim-Light setup you might want to reduce it further. -- RTS Option: -SF Print messages about garbage collections to the file F (or to STDERR). -- RTS Option: -F2s Use a two-space garbage collector instead of a generational garbage collector. Previous versions of GranSim had problems with the latter. If you experience problems try this option and send me a bug report (*Note Bug Reports::).  File: user.info, Node: Specific Setups, Prev: General GHC RTS Options, Up: Runtime-System Options Specific Setups =============== When using GranSim a programmer often just wants to specify the general architecture of the machine rather than going down to the details of a specific machine. To facilitate this approach this section presents examples of standard set-ups for GranSim. Note that these setups specify the characteristics of a machine, but not of the runtime-system. Thus characteristics like thread creation costs are left open. However, the default setting fairly closely reflect the real costs for example under GUM. So, unless you have a different implementation of runtime-system details in mind the default settings should be sufficiently accurate. * Menu: * The Ideal Machine:: * Shared Memory Machines:: * Strongly Connected DMMs:: * Distributed Memory Machines (DMMs)::  File: user.info, Node: The Ideal Machine, Next: Shared Memory Machines, Prev: Specific Setups, Up: Specific Setups The Ideal GranSim Setup ----------------------- This setup reflects the ideal case, where communication is for free and where there is no limit on the number of processors. This is used to show the maximal amount of parallelism in the program. Using such a *GranSim-Light* setup is usually the first step in tuning the performance of a parallel, lazy functional program (*Note GranSim Modes::). The typical GranSim-Light setup is: +RTS -bP -b:  File: user.info, Node: Shared Memory Machines, Next: Strongly Connected DMMs, Prev: The Ideal Machine, Up: Specific Setups GranSim Setup for Shared-Memory Machines ---------------------------------------- In a shared memory machine the latency is roughly reduced to the costs of a load operation. Potentially, some additional overhead for managing the shared memory has to be added. Also the caching mechanism might be more expensive than in a sequential machine. In general, the latency should be between 5 and 20 machine cycles. For machines where the latency is of the same order of magnitude as loading and storing data, it is reasonable to assume incremental, synchronous communication. Migration should also be possible. This gives the following setup (for 32 processors): +RTS -bP -bp32 -bl10 -b-G -by0 -bM  File: user.info, Node: Strongly Connected DMMs, Next: Distributed Memory Machines (DMMs), Prev: Shared Memory Machines, Up: Specific Setups GranSim Setup for Strongly Connected Distributed Memory Machines ---------------------------------------------------------------- Strongly connected DMMs put a specific emphasis on keeping the latency in the system as low as possible. One example of such a machine is GRIP, which has been built specifically for performing parallel graph reduction. Therefore, this setup is of special interest for us. Most importantly the latency in such machines is typically between 100 and 500 cycles (400 for GRIP). 
Furthermore, the GRIP runtime-system, as an example of this kind of machine, uses incremental, synchronous communication. Migration is also possible. This gives the following setup (for 32 processors):

     +RTS -bP -bp32 -bl400 -b-G -by0 -bM


File: user.info, Node: Distributed Memory Machines (DMMs), Prev: Strongly Connected DMMs, Up: Specific Setups

GranSim Setup for Distributed Memory Machines
---------------------------------------------

General distributed memory machines usually have latencies that are an order of magnitude higher than those of strongly connected DMMs. However, especially in this class of machines the differences between specific machines are quite significant. So, I strongly recommend using the exact machine characteristics if GranSim is to be used to predict the behaviour on such a machine.

The high latency requires a fundamentally different runtime-system to avoid long delays when fetching remote data. Therefore, usually asynchronous bulk fetching is used. I'd recommend choosing a fetching strategy of 1 or 2 (it's hard to say which one is better in general). Thread migration is so expensive on DMMs that it is often not supported at all. This gives the following setup (for 32 processors):

     +RTS -bP -bp32 -bl2000 -bG -by2 -b-M


File: user.info, Node: Strategies, Next: Visualisation Tools, Prev: Runtime-System Options, Up: Top

Parallel Functional Programming in the Large: Strategies
********************************************************

[SEE THE STRATEGY PAPER: "STRATEGIES FOR WRITING PARALLEL NON-STRICT PROGRAMS", TRINDER P.W., LOIDL H.W., HAMMOND K., PEYTON JONES S.L. (IN PREPARATION).]

This chapter deals with parallel, lazy functional programming in the large. It introduces the notion of strategies and shows how they make it easier to develop large parallel programs.

* Menu:

* Motivation::
* The Main Idea::
* Quicksort (Example)::
* Sum of Squares (Example)::
* A Class of Strategies::
* Experiences with Strategies::
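As a foretaste of the sections listed above, the basic combinators can be sketched roughly in the style of the strategy paper cited above (only an outline; the definitions introduced in the following sections are authoritative and may differ in detail):

     type Strategy a = a -> ()

     using :: a -> Strategy a -> a
     using x s = s x `seq` x

     rwhnf :: Strategy a                -- reduce to weak head normal form
     rwhnf x = x `seq` ()

     parList :: Strategy a -> Strategy [a]
     parList strat []     = ()
     parList strat (x:xs) = strat x `par` parList strat xs

An expression such as map f xs `using` parList rwhnf then separates the algorithm (map f xs) from the specification of its parallelism (parList rwhnf).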