Info file: user.info, -*-Text-*-
produced by `texinfo-format-buffer'
from file `user.texi'
using `texinfmt.el' version 2.32 of 19 November 1993.

The GranSim User's Guide discusses how to use the GranSim simulator for
the parallel execution of Haskell programs.

Copyright (C) 1994 -- 1996, Hans-Wolfgang Loidl
for the GRASP/AQUA Project, Glasgow University


File: user.info, Node: Internals of GranSim, Next: GranSim vs GUM, Prev: Parallel NoFib Suite, Up: Top

Internals
*********

This chapter discusses details of the implementation of GranSim.

[THIS TEXT IS TAKEN FROM A DRAFT OF A PREVIOUS PAPER. IT NEEDS UPDATES
TO COVER THE MORE RECENT CHANGES LIKE ADDING GRANSIM-LIGHT. SEE ALSO
THE HPFC'95 PAPER FOR THE BASICS]

* Menu:

* Global Structure::
* Accuracy::
* Flexibility::
* Visualisation::
* Efficiency::
* Integration into GHC::


File: user.info, Node: Global Structure, Next: Accuracy, Prev: Internals of GranSim, Up: Internals of GranSim

Global Structure
================

GranSim performs an event-driven simulation with a centralised event
queue. Each of the simulated processors has its own spark queue and
thread queue as well as its own clock. The clocks are synchronised via
the event queue, which is sorted by the time stamps of the events.


File: user.info, Node: Accuracy, Next: Flexibility, Prev: Global Structure, Up: Internals of GranSim

Accuracy
========

GranSim is built on top of, and therefore makes use of, a
state-of-the-art compiler for Haskell, including the large variety of
optimisations it can perform. The instruction count function, which
determines the number of instructions executed for a given piece of
abstract C code, has been carefully tuned by analysing the assembler
code generated by GHC, and the results have been compared with the
number of instructions executed in real Haskell programs.
These comparisons have shown that the instruction count of the
simulation lies within 10% of the real values for arithmetic
operations, within 2% for load and store operations, within 20% for
branch instructions and within 14% for floating point instructions. To
allow the modelling of different kinds of architectures, the
instructions have been split into five classes with different weights:
arithmetic operations and floating point operations (1 cycle each),
load and store operations (4 cycles each) and branch instructions (2
cycles each). These weights model a SPARC processor and have been
verified with Haskell programs from the nofib suite. The weights are
tunable in order to simulate other kinds of processors.


File: user.info, Node: Flexibility, Next: Visualisation, Prev: Accuracy, Up: Internals of GranSim

Flexibility
===========

GranSim has a large number of tunable parameters (*Note RTS Options::).
One set of parameters models a broad variety of processors. To achieve
an accuracy similar to that obtained for SPARC processors, the same
kind of measurements can be performed. Another set of parameters tunes
the costs for communication latency, message pack time etc. Finally,
the overhead imposed by the runtime-system can be captured by adjusting
the parameters for task creation time, context switch time etc. This
allows us to model architectures ranging from shared memory to
distributed memory machines, and we can quite easily simulate the
effect of small changes in the runtime system on the overall behaviour.
Additionally, the GranSim-Light setup makes it possible to study the
parallelism inherent in a program rather than choosing a fixed
architecture to run the program on. Our experience with parallelising
rather large programs shows that this is an important additional
feature for tuning the performance of a parallel program.
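To make the role of these parameters concrete, the per-class
instruction weights from the Accuracy section can be pictured as a
small cost model. The following Haskell fragment is only an
illustrative sketch, not GranSim's actual implementation; the names
(`InstrClass', `Weights', `cycles') are invented for this example, and
only the default cycle counts (1/1/4/4/2) come from the text above.

     -- The five instruction classes distinguished by the simulator.
     data InstrClass = Arith | Float | Load | Store | Branch
       deriving (Eq, Show)

     -- Tunable weights (in cycles) for each instruction class.
     data Weights = Weights
       { wArith, wFloat, wLoad, wStore, wBranch :: Int }

     -- Default weights modelling a SPARC processor, as quoted above.
     sparcWeights :: Weights
     sparcWeights = Weights { wArith = 1, wFloat = 1
                            , wLoad  = 4, wStore = 4, wBranch = 2 }

     weight :: Weights -> InstrClass -> Int
     weight w Arith  = wArith w
     weight w Float  = wFloat w
     weight w Load   = wLoad w
     weight w Store  = wStore w
     weight w Branch = wBranch w

     -- Total simulated time for a sequence of counted instructions.
     cycles :: Weights -> [InstrClass] -> Int
     cycles w = sum . map (weight w)

Retargeting the simulator to another processor then amounts to
supplying a different `Weights' record: the same instruction stream
simply yields a different simulated time.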
File: user.info, Node: Visualisation, Next: Efficiency, Prev: Flexibility, Up: Internals of GranSim

Visualisation
=============

Visualisation tools are needed especially for analysing the granularity
of the created tasks. We have developed a set of tools for that purpose
and used it on a wide range of example programs to analyse their
performance (*Note Parallel NoFib Suite::). With these tools we were
able to show a very high correlation between execution time and the
amount of allocated heap space. Apart from these tools, which show the
granularity of the resulting tasks, we have also developed a set of
tools for showing the activity of the simulated machine as a whole, of
all processors and of all tasks in the system. These tools have proven
indispensable in the parallelisation and optimisation of a linear
system solver. Based on the experience from this implementation, such
tools are essential when working with a lazy language, in which the
order of evaluation is not at all obvious from the program source.
*Note Visualisation Tools:: for a more detailed discussion of these
tools.


File: user.info, Node: Efficiency, Next: Integration into GHC, Prev: Visualisation, Up: Internals of GranSim

Efficiency
==========

The high accuracy of the simulation also implies a rather high
overhead, as more bookkeeping has to be done. This is mainly due to the
exact modelling of communication; therefore, programs that perform a
lot of communication impose a higher overhead on the simulation. Early
measurements of some example programs showed slowdowns of a factor
between 14 and more than 100 for a GranSim simulation compared to an
optimised sequential execution. In the meantime we have spent some
effort on improving the efficiency of the simulator, and the most
recent measurements show factors between 10 and 15. Considering the
amount of information we generate from a simulation, we think this is
an acceptable factor.
A comparison with HBCPP, another simulator for annotated Haskell
programs, is of particular interest. On a set of small example
programs, GranSim was between 1.5 and 2 times slower than HBCPP; in
some cases the GranSim-Light setup was about as fast as the HBCPP
version. We observed one potential problem with the GranSim-Light
setup, though: if the parallel program creates a huge number of
parallel threads (several thousand), the bookkeeping for all these
running threads slows down the simulation significantly. This might
make the GranSim-Light setup slower than the plain GranSim setup, which
has a limited number of running threads. In such a setup it is
advisable to increase the time slice for each thread in the simulation
(-bwN option).


File: user.info, Node: Integration into GHC, Prev: Efficiency, Up: Internals of GranSim

Integration into GHC
====================

GranSim is part of the overall GHC system. Therefore, all the features
of a compilation via GHC can be used for GranSim, too. In particular,
the ccall mechanism can be used to call C functions in a parallel
program (indeed, we have experimented with the use of a computer
algebra library in implementing a parallel resultant algorithm). The
interaction between certain optimisations, like deforestation, and the
parallel execution of a program should also be of interest
(deforestation might eliminate intermediate lists that are crucial for
the parallel execution of the program). Finally, in parallelising the
LOLITA natural language processing system it was crucial to have a
profiler for the sequential version of the program available. This
allowed us to determine the important parts of the computation early on
in the parallelisation process.


File: user.info, Node: GranSim vs GUM, Next: Future Extensions, Prev: Internals of GranSim, Up: Top

GranSim vs GUM
**************

[MAINLY A PLACEHOLDER FOR A REAL COMPARISON]

In a nutshell: The GUM runtime-system is very close to GranSim.
To a fairly large degree the same code is used by both versions. The
main source of inaccuracy is the modelling of stealing sparks and
threads. GUM uses a fishing mechanism: a fish message is sent from
processor to processor, looking for sparks. In GranSim we use the
global knowledge of the system and weigh the costs for stealing a spark
with the inverse of the probability of hitting a processor with
available sparks. Furthermore, GranSim has no counterpart of GUM's
FETCHME closures, which act as pointers to remote data.

On the other hand, GranSim is also significantly more powerful than
GUM. It supports thread migration, offers different strategies for
packing a graph (based on the number of thunks), and it allows a choice
between different fetching strategies, which determine what to do while
one thread fetches remote data. These features mainly deal with the
aspect of data locality in the program. Another important aspect for
the performance is the granularity of the parallel program. This can be
tuned by choosing among three basic methods for granularity control
(*Note Granularity Control Mechanisms::).


File: user.info, Node: Future Extensions, Next: Bug Reports, Prev: GranSim vs GUM, Up: Top

Future Extensions
*****************

This version of GranSim should be fairly stable. It has been stress
tested on a number of non-trivial programs (including a program of more
than 100,000 lines of Haskell and a wee bit of C). It contains most of
the features I want to have in it. One future extension of GranSim
might be the implementation of a more accurate modelling of the
spark/thread stealing mechanism (based on GUM's fishing model).


File: user.info, Node: Bug Reports, Next: Determinant (Example), Prev: Future Extensions, Up: Top

Bug Reports
***********

Send all bug reports, including the source of the failing program, the
compile and RTS options and the error message, to
hwloidl@dcs.gla.ac.uk. Be sure to add "GranSim Bug" in the subject
line.
Earlier versions of GranSim sometimes had problems when using the
generational garbage collector (the default). If you run into problems
with GranSim you can try the following:

* Use the option -Sstderr to get garbage collection messages. This will
  tell you whether there is any garbage collection going on.

* Increase the heap size. This should give you a hint whether the
  problem is related to garbage collection at all.

* Try the runtime-system option -F2s to force two-space garbage
  collection. This garbage collector uses more heap than the
  generational one. Whenever I had problems with a test program before,
  they went away with this garbage collector.

* Perhaps also use -Z (this omits update frame squeezing; you don't
  have to worry about what that means, though).

In any case, please report it as a bug to the above address.


File: user.info, Node: Determinant (Example), Next: Options Index, Prev: Bug Reports, Up: Top

Example Program: Determinant Computation
****************************************

The example program in this chapter is a parallel determinant
computation. It uses the data type SqMatrix a to represent a matrix as
a list of lists together with its bounds. To obtain a parallel version
of the program, strategies (*Note A Class of Strategies::) are used.
Thus, much of the parallelism comes from applying the parList strategy.
In order to demonstrate the use of strategies in a parallel functional
program, this example contains parallelism down to a very fine
granularity.

     determinant :: (NFData a, Integral a) => SqMatrix a -> a
     determinant (SqMatrixC ((iLo,jLo),(iHi,jHi)) mat)
      | jHi-jLo+1 == 1 = let [[mat_1_1]] = mat
                         in  mat_1_1
      | jHi-jLo+1 == 2 = let [[mat_1_1,mat_1_2],
                              [mat_2_1,mat_2_2] ] = mat
                             a = mat_1_1 * mat_2_2
                             b = mat_1_2 * mat_2_1
                             strategy r = a `par` b `par` ()
                         in  a - b `using` strategy
      | otherwise = sum (l_par `using` parList rnf)
        where
          newLine _ []   = []
          newLine j line = pre ++ post `using` strategyN
            where
              pre  = [ line !! (k-1) | k <- [jLo..j-1] ]
              post = [ line !! (k-1) | k <- [j+1..jHi] ]
              strategyN r = pre `par` post `par` ()

          determine1 j = (if pivot > 0
                            then sign*pivot*det' `using` strategyD1
                            else 0) `using` sPar sign
            where
              sign  = if (even (j-jLo)) then 1 else -1
              pivot = (head mat) !! (j-1)
              mat'  = SqMatrixC ((iLo,jLo),(iHi-1,jHi-1))
                                (map (newLine j) (tail mat))
              det'  = determinant mat'
              strategyD1 r = parSqMatrix (parList rwhnf) mat' `seq`
                             det' `par` ()

          l_par = map determine1 [jLo..jHi]


File: user.info, Node: Options Index, Next: Concept Index, Prev: Determinant (Example), Up: Top

Runtime-System Options Index
****************************

* Menu:

* -baN: Communication Parameters.
* -bAN: Processor Characteristics.
* -bBN: Processor Characteristics.
* -bC: Miscellaneous Options.
* -bcN: Runtime-System Parameters.
* -bDE: Debugging Options.
* -bdN: Runtime-System Parameters.
* -be: Miscellaneous Options.
* -bfN: Miscellaneous Options.
* -bFN: Processor Characteristics.
* -bG: Bulk Fetching.
* -bgN: Communication Parameters.
* -bh: Basic Options.
* -bHN: Processor Characteristics.
* -bI: Granularity Control Mechanisms.
* -bKN: Granularity Control Mechanisms.
* -blN: Communication Parameters.
* -bLN: Processor Characteristics.
* -bM: Migration.
* -bmN: Communication Parameters.
* -bN: Miscellaneous Options.
* -bnN: Runtime-System Parameters.
* -bON: Granularity Control Mechanisms.
* -bP: Basic Options.
* -bpN: Basic Options.
* -bQN: Bulk Fetching.
* -bqN: Runtime-System Parameters.
* -brN: Communication Parameters.
* -bs: Basic Options.
* -bSN: Processor Characteristics.
* -bT: Miscellaneous Options.
* -btN: Runtime-System Parameters.
* -buN: Runtime-System Parameters.
* -bwN: Miscellaneous Options.
* -bxN: Communication Parameters.
* -bXX: Granularity Control Mechanisms.
* -byN: Asynchronous Communication.
* -bYN: Granularity Control Mechanisms.
* -bZ: Asynchronous Communication.
* -oN: General GHC RTS Options.
* -QN: Bulk Fetching.
* -SF: General GHC RTS Options.


File: user.info, Node: Concept Index, Prev: Options Index, Up: Top

Concept Index
*************

* Menu:

* ACQUIRED event: Spark Events.
* Asynchronous Communication: Asynchronous Communication.
* BLOCK event: Basic Thread Events.
* Bucket statistics: Bucket Graphs.
* Bulk Fetching: Bulk Fetching.
* Compiling under GranSim: Running it.
* Cumulative statistics: Cumulative Graphs.
* DESCHEDULE event: Basic Thread Events.
* Determinant computation (parallel): Determinant (Example).
* Distributed memory setup: Distributed Memory Machines (DMMs).
* Distributed memory setup: Strongly Connected DMMs.
* Emacs mode for GranSim profiles: The Emacs GranSim Profile Mode.
* Emacs support: The Emacs GranSim Profile Mode.
* Emacs support (highlighting): Features.
* Emacs support (narrowing): Features.
* END event: END Events.
* Evaluation degree: Sum of Squares (Example).
* Events: Body.
* Example: Determinant (Example).
* Example: Example.
* Example: Parallel NoFib Suite.
* Example: Sum of Squares (Example).
* EXPORTED event: Spark Events.
* FETCH event: Communication Events.
* Fetching Strategies: Asynchronous Communication.
* FETCHME closures: GranSim vs GUM.
* Forcing functions: Sum of Squares (Example).
* Garbage collectors: General GHC RTS Options.
* GHC: Components.
* `gr2ap': Activity Profiles.
* `gr2pe': Activity Profiles.
* `gr2ps': Activity Profiles.
* GranSim profile mode: The Emacs GranSim Profile Mode.
* GranSim-Light: GranSim Modes.
* Granularity Control Mechanisms: Granularity Control Mechanisms.
* GUM: GranSim vs GUM.
* HBCPP: Efficiency.
* Highlighting in GranSim profile mode: Features.
* Histogram: Bucket Graphs.
* Ideal setup: The Ideal Machine.
* Incremental Fetching: Bulk Fetching.
* Migration: Migration.
* Narrowing in GranSim profile mode: Features.
* nfib: Example.
* Node: General Structure.
* Overview: Overview.
* Packing Strategies: Bulk Fetching.
* par: Basic Annotations.
* Parallel determinant computation: Determinant (Example).
* Parallel module: Example.
* Parallel nofib suite: Parallel NoFib Suite.
* parAt: Advanced Annotations.
* parAtAbs: Experimental Annotations.
* parAtForNow: Advanced Annotations.
* parAtRel: Experimental Annotations.
* parfib: Example.
* parGlobal: Advanced Annotations.
* parLocal: Advanced Annotations.
* Producer consumer parallelism: Sum of Squares (Example).
* Profiling: GranSim Modes.
* PRUNED event: Spark Events.
* Quicksort: Quicksort (Example).
* REPLY event: Communication Events.
* RESUME event: Basic Thread Events.
* RESUME(Q) event: Basic Thread Events.
* Running an example program: Running it.
* SCHEDULE event: Basic Thread Events.
* seq: Basic Annotations.
* Setups: Specific Setups.
* Setups (distributed memory): Distributed Memory Machines (DMMs).
* Setups (distributed memory): Strongly Connected DMMs.
* Setups (ideal): The Ideal Machine.
* Setups (shared memory): Shared Memory Machines.
* Shared memory setup: Shared Memory Machines.
* Spark: Basic Annotations.
* SPARK event: Spark Events.
* SPARKAT event: Spark Events.
* Stack size: General GHC RTS Options.
* START event: Basic Thread Events.
* START(Q) event: Basic Thread Events.
* STEALING event: Thread Migration Events.
* STOLEN event: Thread Migration Events.
* STOLEN(Q) event: Thread Migration Events.
* Strategies: Strategies.
* Sum of squares (parallel): Sum of Squares (Example).
* Synchronous Communication: Asynchronous Communication.
* SYSTEM_END event: Debugging Events.
* SYSTEM_START event: Debugging Events.
* Thread id: General Structure.
* Thread migration: Migration.
* Thread stealing: Migration.
* Thunk: Bulk Fetching.
* Time stamps: General Structure.