Info file: user.info,    -*-Text-*-
produced by `texinfo-format-buffer'
from file `user.texi'
using `texinfmt.el' version 2.32 of 19 November 1993.

The GranSim User's Guide discusses how to use the GranSim simulator
for the parallel execution of Haskell programs.

Copyright (C) 1994 -- 1996, Hans-Wolfgang Loidl
for the GRASP/AQUA Project, Glasgow University


File: user.info, Node: Motivation, Next: The Main Idea, Prev: Strategies, Up: Strategies

Motivation for Strategies
=========================

The following naive version of `parmap' often does not give good
parallelism:

     parmap :: (a -> b) -> [a] -> [b]
     parmap f []     = []
     parmap f (x:xs) = fx `par` pmxs `par` (fx:pmxs)
                       where fx   = f x
                             pmxs = parmap f xs

If in this version the spark for `pmxs' is discarded, almost all
parallelism in the program is lost. On the other hand, if all `par's
are executed, more parallel threads than necessary are created (two
for each list element). An alternative version of `parmap' reduces the
total number of generated threads by replacing the second `par' with a
`seq':

     parmap :: (a -> b) -> [a] -> [b]
     parmap f []     = []
     parmap f (x:xs) = fx `par` pmxs `seq` (fx:pmxs)
                       where fx   = f x
                             pmxs = parmap f xs

The problem with this version is that in a context that requires the
result of `parmap' only in weak head normal form, the recursive call
to `parmap' will be deferred until its value is needed. This
drastically reduces the total amount of parallelism in the program. In
particular, if such a function is used as the producer in a program
with producer-consumer parallelism, this gives very poor parallelism
(*Note Sum of Squares (Example)::).
To improve this behaviour one can write a forcing function, which
guarantees that all of the result is demanded immediately:

     forcelist []     = ()
     forcelist (x:xs) = x `seq` (forcelist xs)

Using this function in `parmap' gives the following function:

     parmap :: (a -> b) -> [a] -> [b]
     parmap f []     = []
     parmap f (x:xs) = fx `par` (forcelist pmxs) `seq` (fx:pmxs)
                       where fx   = f x
                             pmxs = parmap f xs

However, following this approach yields a big set of forcing functions
and a mixture of defining the result and driving the computation. We
want to avoid both.


File: user.info, Node: The Main Idea, Next: Quicksort (Example), Prev: Motivation, Up: Strategies

The Main Idea
=============

[FOR NOW MAINLY THE ABSTRACT OF THE STRATEGY PAPER]

Separate the components of a parallel, lazy functional program:
  1. Parallelism
  2. Evaluation degree
  3. Definition of the result

In most current parallel non-strict functional languages, a function
must both describe the value to be computed and the *dynamic
behaviour*. The dynamic behaviour of a function has two aspects:

   * *parallelism*, i.e. what values could be computed in parallel,
     and
   * *evaluation-degree*, i.e. how much of each value should be
     constructed.

Our experience is that specifying the dynamic behaviour of a function
obscures its semantics. Strategies have been developed to address this
problem; a strategy is an explicit description of a function's
potential dynamic behaviour. The philosophy is that *it should be
possible to understand the semantics of a function without considering
its dynamic behaviour*.

In essence a strategy is a runtime function that traverses a data
structure, specifying how much of it should be evaluated, and possibly
sparking threads to perform the construction in parallel. Because
strategies are functions they can be defined on any type and can be
combined to specify sophisticated dynamic behaviour. For example, a
strategy can control thread granularity or specify evaluation over an
infinite structure.
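As a sketch of the last point, the following self-contained fragment
evaluates only a prefix of an infinite list. The definitions of
`Strategy', `using', `rwhnf' and `parListN' mirror those of the
strategies module discussed in *Note A Class of Strategies::; they are
repeated here only to make the example stand alone, and `par' is taken
from GHC's base library rather than from the GranSim setup.

     import GHC.Conc (par)  -- stand-in for the built-in `par' annotation

     type Strategy a = a -> ()

     -- strategy application, as in the strategies module
     using :: a -> Strategy a -> a
     using x s = s x `seq` x

     -- reduce to weak head normal form
     rwhnf :: Strategy a
     rwhnf x = x `seq` ()

     -- apply a strategy to the first n elements of a list, in parallel
     parListN :: Int -> Strategy a -> Strategy [a]
     parListN _ _ []         = ()
     parListN 0 _ _          = ()
     parListN n strat (x:xs) = strat x `par` parListN (n-1) strat xs

     -- evaluation over an infinite structure: only the first 10
     -- squares are (speculatively) evaluated; the rest of the list
     -- remains untouched
     firstSquares :: [Int]
     firstSquares = take 10 ([ i*i | i <- [1..] ] `using` parListN 10 rwhnf)

The strategy terminates on the infinite list because `parListN' stops
after n elements; the definition of the list itself says nothing about
how much of it is evaluated.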
A disadvantage is that a strategy requires an additional runtime pass
over parts of the data structure. Example programs are given that use
strategies on divide-and-conquer, pipeline and data-parallel
applications.


File: user.info, Node: Quicksort (Example), Next: Sum of Squares (Example), Prev: The Main Idea, Up: Strategies

Example Program: Quicksort
==========================

This section compares two versions of parallel quicksort: one using a
forcing function and another one using strategies.

* Menu:

* Forcing Function Version::
* Strategy Version::


File: user.info, Node: Forcing Function Version, Next: Strategy Version, Prev: Quicksort (Example), Up: Quicksort (Example)

Parallel Quicksort using Forcing Functions
------------------------------------------

The following naive attempt to introduce parallelism in quicksort
fails because the threads generating `losort' and `hisort' only create
a single cons cell.

     quicksortN []     = []
     quicksortN [x]    = [x]
     quicksortN (x:xs) = losort `par` hisort `par` result
                         where losort = quicksortN [y|y <- xs, y < x]
                               hisort = quicksortN [y|y <- xs, y >= x]
                               result = losort ++ (x:hisort)

The current practice of parallel Haskell programmers is to introduce a
*forcing function* that forces the evaluation of the required elements
of a data structure. For example

     forceList :: [a] -> ()
     forceList []     = ()
     forceList (x:xs) = x `seq` forceList xs

Quicksort can be rewritten to have the desired behaviour using
`forceList' as follows.

     quicksortF []     = []
     quicksortF [x]    = [x]
     quicksortF (x:xs) = (forceList losort) `par`
                         (forceList hisort) `par`
                         losort ++ (x:hisort)
                         where losort = quicksortF [y|y <- xs, y < x]
                               hisort = quicksortF [y|y <- xs, y >= x]

To obtain the required dynamic behaviour for the `parMap' example we
might use `forceList' within the definition of `f':

     f x = forceList result `seq` result
           where result = [fib x]

When programs are written in this style, a number of forcing functions
are required: at least one for each type, e.g. `forcePair'.
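`forcePair' is mentioned but not defined in this guide; a plausible
definition along the same lines as `forceList' would be the following
(the `minMax' demo function is purely illustrative and not part of
GranSim):

     -- force both components of a pair to weak head normal form;
     -- analogous to forceList, one such function per type is needed
     forcePair :: (a, b) -> ()
     forcePair (x, y) = x `seq` y `seq` ()

     -- illustrative use: demand both components of the result pair
     -- before the pair itself is returned
     minMax :: [Int] -> (Int, Int)
     minMax xs = forcePair result `seq` result
       where result = (minimum xs, maximum xs)

This is exactly the kind of per-type boilerplate that the strategy
version replaces with a single overloaded `rnf'.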
To obtain good dynamic behaviour, `par', `seq' and forcing functions
are inserted throughout the program. In consequence the dynamic
behaviour and static semantics of the computation are intertwined.
This intertwining obscures the semantics of the program. Small-scale
programs remain manageable, but in larger programs, particularly those
with complex data structures and parallel behaviour, discerning the
meaning of a program becomes very hard.


File: user.info, Node: Strategy Version, Prev: Forcing Function Version, Up: Quicksort (Example)

Parallel Quicksort using Strategies
-----------------------------------

In the strategy version we can specify the evaluation degree by
applying one of the predefined strategies to the components of the
result. These strategies are then combined by using either `par' or
`seq' to specify the parallelism. In quicksort, each parallel thread
should construct all of its result list, and `rnf' expresses this
neatly. The interesting equation becomes

     quicksortS (x:xs) = losort ++ (x:hisort) `using` strategy
                         where losort = quicksortS [y|y <- xs, y < x]
                               hisort = quicksortS [y|y <- xs, y >= x]
                               strategy result = rnf losort `par`
                                                 rnf hisort `par`
                                                 rnf result `par`
                                                 ()


File: user.info, Node: Sum of Squares (Example), Next: A Class of Strategies, Prev: Quicksort (Example), Up: Strategies

Example Program: Sum of Squares
===============================

Although the sum-of-squares program is a very simple producer-consumer
program, it already shows some basic problems in this model.
Unfortunately, the simple version of the program, which was presented
in *Note Semi-explicit Parallelism::, produces hardly any parallelism
at all. The reason for this behaviour is that, because of lazy
evaluation, the producer will only create the top cons cell of the
intermediate list. When the consumer then demands the rest of the
list, it computes the squares as well as the resulting sum. In the
example below, version 1 (with `res_naive') shows this behaviour.
To improve the parallelism one can use a "forcing function" on the
intermediate list. The parallel thread is then the application of this
forcing function to the squares list. As expected, this creates two
threads that are computing most of the time and occasionally have to
exchange data. Version 2 (with `res_force') shows this behaviour.

A better way to achieve the same operational behaviour is to define a
strategy describing how to compute the result. The strategy describes
the parallelism and the evaluation degree. In doing so, it uses the
reduce-to-normal-form (`rnf') strategy, which is defined in the class
`NFData'. Compared to version 2, the strategy version 3 (with
`res_strat') achieves a clean separation of defining the result and
specifying operational details like parallelism and evaluation degree.

     module Main where

     import StrategiesVII

     main = getArgs abort ( \ a ->
            let
              args :: [Int]
              args    = map (\ a1 -> fst ((readDec a1) !! 0)) a
              version = args !! 0
              n       = args !! 1

              squares :: [Int]
              squares = [ i^2 | i <- [1..n] ]

              -- Naive version sparks a producer for the list squares
              -- but doesn't force it
              res_naive = squares `par` sum squares

              -- This version sparks a forcing function on the list
              -- (producer)
              res_force = (foldl1 seq squares) `par` sum squares

              -- The strategy version
              res_strat = sum squares `using` (rnf squares `par` rnf)

              res = case version of
                      1 -> res_naive
                      2 -> res_force
                      3 -> res_strat
                      _ -> res_naive

              str = case version of
                      1 -> "Naive version"
                      2 -> "Forcing version"
                      3 -> "Strategy version"
                      _ -> "Naive version"
            in
              print ("Sum of squares (" ++ str ++ ") of length " ++
                     (show n) ++ " = " ++ (show res) ++ "\n") )

When running the naive version of this program the spark for the
producer is pruned and all the computation of the producer is subsumed
by the consumer. This gives a totally sequential program.
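The three result variants can also be written down in a self-contained
form. The sketch below replays them with GHC's `par' from the base
library and with `rnf' monomorphised to the two types involved (the
real module overloads `rnf' via the `NFData' class); the function
names are illustrative, not part of GranSim.

     import GHC.Conc (par)  -- stand-in for the built-in `par' annotation

     type Strategy a = a -> ()

     using :: a -> Strategy a -> a
     using x s = s x `seq` x

     -- rnf specialised to the types used here
     rnfInt :: Strategy Int
     rnfInt x = x `seq` ()

     rnfList :: Strategy [Int]
     rnfList []     = ()
     rnfList (x:xs) = x `seq` rnfList xs

     squares :: Int -> [Int]
     squares n = [ i^2 | i <- [1..n] ]

     -- version 1: spark the producer but never force it
     resNaive :: Int -> Int
     resNaive n = let sq = squares n in sq `par` sum sq

     -- version 2: spark a forcing function on the producer's list
     resForce :: Int -> Int
     resForce n = let sq = squares n in foldl1 seq sq `par` sum sq

     -- version 3: the same behaviour expressed as a strategy
     resStrat :: Int -> Int
     resStrat n = let sq = squares n
                  in  sum sq `using` (rnfList sq `par` rnfInt)

All three compute the same value; only their dynamic behaviour under a
parallel runtime differs.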
The relevant parts of the GranSim profile are:

     Granularity Simulation for sq 1 99 +RTS -bP -bp: -bs

     ++++++++++++++++++++

     PE  0 [0]: START 0 0x100620 [SN 0]
     PE  0 [67445]: SPARK 20004 0x347dbc [sparks 0]
     PE  0 [68241]: PRUNED 20004 0x347dbc [sparks 1]
     PE  0 [327088]: END 0, SN 0, ST 0, EXP F, BB 847, HA 5426,
                     RT 327088, BT 0 (0), FT 0 (0), LS 0, GS 1, MY T

The other two versions of the program create one producer and one
consumer thread, as expected.


File: user.info, Node: A Class of Strategies, Next: Experiences with Strategies, Prev: Sum of Squares (Example), Up: Strategies

Using a Class of Strategies
===========================

[FOR NOW THIS SECTION IS JUST A SHORT DISCUSSION OF THE STRATEGIES
MODULE]

This section discusses the contents of the strategies module. It shows
how strategies are implemented on top of the basic `par' and `seq'
annotations. Haskell's overloading mechanism is used to define a
strategy for reducing to normal form. The current implementation is
based on Haskell 1.2, but in the near future we want to use
constructor classes as provided in Haskell 1.3.

* Menu:

* Type and Application::
* Primitive Strategies::
* Basic Strategies::
* Strategy Combinators::


File: user.info, Node: Type and Application, Next: Primitive Strategies, Prev: A Class of Strategies, Up: A Class of Strategies

Type Definition of Strategies
-----------------------------

A strategy does not return a meaningful result; it only specifies
parallelism and evaluation degree. Therefore, we use the following
type for strategies:

     type Strategy a = a -> ()

We can now define a "strategy application", which adds information
about the operational behaviour of the execution to an expression:

     using :: a -> Strategy a -> a
     using x s = s x `seq` x

Note that `x `using` s' is a projection on `x', i.e.
both:

   * a retraction: `x `using` s' is no more defined than `x', and
   * idempotent: `(x `using` s) `using` s = x `using` s'


File: user.info, Node: Primitive Strategies, Next: Basic Strategies, Prev: Type and Application, Up: A Class of Strategies

Primitive Strategies
--------------------

The primitive strategies are basically just syntactic sugar to fit
`par' and `seq' into the strategy scheme.

`sPar' is a strategy corresponding to `par', i.e. `x `par` e <=> e
`using` sPar x':

     sPar :: a -> Strategy b
     sPar x y = x `par` ()

`sSeq' is a strategy corresponding to `seq', i.e. `x `seq` e <=> e
`using` sSeq x':

     sSeq :: a -> Strategy b
     sSeq x y = x `seq` ()


File: user.info, Node: Basic Strategies, Next: Strategy Combinators, Prev: Primitive Strategies, Up: A Class of Strategies

Basic Strategies
----------------

Three basic strategies describe the evaluation degree of a data
structure:

   * `r0' is a strategy which performs no evaluation at all:

          r0 :: Strategy a
          r0 x = ()

   * `rwhnf' reduces a data structure to weak head normal form:

          rwhnf :: Strategy a
          rwhnf x = x `seq` ()

   * `rnf' reduces a data structure to normal form. This strategy
     cannot be defined independently of the type of the data
     structure. Therefore, we define a class `NFData' and overload the
     `rnf' strategy:

          class NFData a where
            -- rnf reduces its argument to (head) normal form
            rnf :: Strategy a
            -- Default method. Useful for base types. A specific
            -- method is necessary for constructed types
            rnf = rwhnf

     For primitive types like `Int's and `Char's the default methods
     can be used:

          instance NFData Int
          instance NFData Char
          ...


File: user.info, Node: Strategy Combinators, Prev: Basic Strategies, Up: A Class of Strategies

Strategy Combinators
--------------------

When defining a strategy on a compound data type we can use these
three basic strategies and compose them with the strategy combinators
below. These combinators describe the parallelism in the evaluation.
     -- Pairs

     instance (NFData a, NFData b) => NFData (a,b) where
       rnf (x,y) = rnf x `seq` rnf y

     seqPair :: Strategy a -> Strategy b -> Strategy (a,b)
     seqPair strata stratb (x,y) = strata x `seq` stratb y

     parPair :: Strategy a -> Strategy b -> Strategy (a,b)
     parPair strata stratb (x,y) = strata x `par` stratb y

     -- Lists

     instance NFData a => NFData [a] where
       rnf []     = ()
       rnf (x:xs) = rnf x `seq` rnf xs

     -- Applies a strategy to every element of a list in parallel
     parList :: Strategy a -> Strategy [a]
     parList strat []     = ()
     parList strat (x:xs) = strat x `par` (parList strat xs)

     -- Applies a strategy to the first n elements of a list in
     -- parallel
     parListN :: (Integral b) => b -> Strategy a -> Strategy [a]
     parListN n strat []     = ()
     parListN 0 strat xs     = ()
     parListN n strat (x:xs) = strat x `par` (parListN (n-1) strat xs)

     -- Sequentially applies a strategy to each element of a list
     seqList :: Strategy a -> Strategy [a]
     seqList strat []     = ()
     seqList strat (x:xs) = strat x `seq` (seqList strat xs)

     -- Sequentially applies a strategy to the first n elements of a
     -- list
     seqListN :: (Integral a) => a -> Strategy b -> Strategy [b]
     seqListN n strat []     = ()
     seqListN 0 strat xs     = ()
     seqListN n strat (x:xs) = strat x `seq` (seqListN (n-1) strat xs)

Now we can use these predefined strategies to define a parallel map
function:

     -- parMap applies a function to each element of the argument list
     -- in parallel. The result of the function is evaluated using
     -- `strat'
     parMap :: Strategy b -> (a -> b) -> [a] -> [b]
     parMap strat f xs = map f xs `using` parList strat

For bigger programs the programmer only has to define strategies on
the most important data structures in order to introduce parallelism
into the program. With this approach the parallel program will not be
cluttered with `par' annotations, and it is easier to determine the
semantics and dynamic behaviour of the individual functions.
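To illustrate the claim that one instance per important data structure
suffices, here is a sketched instance for a user-defined binary tree.
The `Tree' type, `parTree' and `treeSum' are invented for this
example; the supporting definitions mirror the module above so that
the fragment stands alone, with `par' taken from GHC's base library.

     import GHC.Conc (par)  -- stand-in for the built-in `par' annotation

     type Strategy a = a -> ()

     using :: a -> Strategy a -> a
     using x s = s x `seq` x

     rwhnf :: Strategy a
     rwhnf x = x `seq` ()

     class NFData a where
       rnf :: Strategy a
       rnf = rwhnf            -- default method, enough for base types

     instance NFData Int

     -- a user-defined type: one NFData instance and one parallel
     -- strategy make it usable with `using'
     data Tree a = Leaf | Node (Tree a) a (Tree a)

     instance NFData a => NFData (Tree a) where
       rnf Leaf         = ()
       rnf (Node l x r) = rnf l `seq` rnf x `seq` rnf r

     -- evaluate both subtrees in parallel, the node value in the
     -- current thread
     parTree :: Strategy a -> Strategy (Tree a)
     parTree _     Leaf         = ()
     parTree strat (Node l x r) =
       parTree strat l `par` parTree strat r `par` strat x

     treeSum :: Tree Int -> Int
     treeSum Leaf         = 0
     treeSum (Node l x r) = treeSum l + x + treeSum r

     result :: Int
     result = treeSum (t `using` parTree rnf)
       where t = Node (Node Leaf 1 Leaf) 2 (Node Leaf 3 Leaf)

The definition of `result' states only what is computed; `parTree rnf'
states, separately, that the whole tree should be constructed with the
two subtrees evaluated in parallel.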
File: user.info, Node: Experiences with Strategies, Prev: A Class of Strategies, Up: Strategies

Experiences with Strategies
===========================

[STRATEGIES ARE COOL]


File: user.info, Node: Visualisation Tools, Next: GranSim Profiles, Prev: Strategies, Up: Top

Visualisation Tools
*******************

The visualisation tools that come with GranSim take a GranSim profile
as input and create Level 2 PostScript files showing either the
activity or the granularity of the execution. A collection of scripts
in the GranSim Toolbox makes it possible to focus on specific aspects
of the execution, like the `node flow', i.e. the movement of nodes
(closures) between the PEs.

Most of these tools are implemented as Perl scripts. This means that
they are very versatile and it should be easy to modify them. However,
processing a large GranSim profile can involve quite a lot of
computation (and memory). Speeding up crucial parts of the scripts is
on my ToDo-list, but better don't hold your breath.

* Menu:

* Activity Profiles::
* Granularity Profiles::
* Scripts::


File: user.info, Node: Activity Profiles, Next: Granularity Profiles, Prev: Visualisation Tools, Up: Visualisation Tools

Activity Profiles
=================

Three tools show the activity during the execution in three levels of
detail:

   * The *overall activity profile* (created with `gr2ps') shows the
     activity of the whole machine (*Note Overall Activity
     Profile::).

   * The *per-processor activity profile* (created with `gr2pe') shows
     the activity of all simulated processors (*Note Per-Processor
     Activity Profile::).

   * The *per-thread activity profile* (created with `gr2ap') shows
     the activity of all generated threads (*Note Per-Thread Activity
     Profile::).

All tools discussed in this section print a help message when called
with the option `-h'. This message shows the available options.
In general, all tools understand the option `-o' for specifying the
output file and `-m' for generating a monochrome profile (by default
the tools generate colour PostScript).

The `gr2ps' and `gr2ap' scripts work in two stages:
  1. First the `.gr' file is translated into a `.qp' file, which is
     basically a stripped-down version of a GranSim profile.
  2. Then the `.qp' file is translated into a `.ps' file.
The `gr2pe' script works directly on the `.gr' file.

* Menu:

* Overall Activity Profile::
* Per-Processor Activity Profile::
* Per-Thread Activity Profile::


File: user.info, Node: Overall Activity Profile, Next: Per-Processor Activity Profile, Prev: Activity Profiles, Up: Activity Profiles

Overall Activity Profile
------------------------

The *overall activity profile* (created via `gr2ps') shows the
activity of the whole machine by separating the threads into up to
five different groups. These groups describe the number of

   * running threads (i.e. threads that are currently performing a
     reduction),
   * runnable threads (i.e. threads that could be executed but that
     have not found an idle PE),
   * blocked threads (i.e. threads that wait for a result that is
     being computed by another thread),
   * fetching threads (i.e. threads that are currently fetching data
     from a remote PE),
   * migrating threads (i.e. threads that are currently being
     transferred from a busy PE to an idle PE).

If the GranSim profile includes information about sparks (`-bs'
option), it is also possible to show the number of sparks. However,
this number is usually much bigger than the number of all threads.
Therefore, it doesn't make much sense to mix the groups for sparks and
threads in one profile.

The option `-O' reduces the size of the generated PostScript file. The
option `-I' is used to specify which groups of threads should be shown
in which order. For example

     gr2ps -O -I "arb" parfib.gr

generates a profile `parfib.ps' showing only active (`a'), runnable
(`r') and blocked (`b') threads.
The letter codes for the other groups are `f' for fetching, `m' for
migrating and `s' for sparks.

In the current version the marks on the y-axis of the generated
profile may be stretched or compressed. This might happen if many
events occur at exactly the same time. In this case the initial count
of the maximal number of y-values may be wrong, causing a rescaling at
the very end. In practice this happens rarely (more often for
GranSim-Light profiles, though).


File: user.info, Node: Per-Processor Activity Profile, Next: Per-Thread Activity Profile, Prev: Overall Activity Profile, Up: Activity Profiles

Per-Processor Activity Profile
------------------------------

The idea of the *per-processor activity profile* is to show the most
important pieces of information about each processor in one graph.
This makes it easy to compare the behaviour of the different
processors and to spot imbalances in the computation.

This profile shows one strip for each of the simulated processors.
Each of these strips encodes three kinds of information:

   * Is the processor *active* at a certain point? If it is active,
     the strip appears in some shade of green (gray in the monochrome
     version). If it is idle, it appears in red (white in the
     monochrome version). The area before starting the first thread
     and after terminating the last thread is left blank in both
     versions.

   * How high is the *load* of the processor? The load is measured by
     the number of runnable threads on this processor. A high load is
     shown by a dark shading of green (or grey). A palette at the top
     of the graph shows the available shadings (two ticks indicate the
     range that is used in the graph). It is possible to distinguish
     between 20 different values. Therefore, all processors with more
     than 20 runnable threads are shown in the same (dark) colour.

   * How many *blocked threads* are on the processor? This information
     is shown by the thickness of the blue (black) bar at the bottom
     of each strip.
Without any blocked threads no bar is shown. If 20 or more threads are
blocked, the bar covers 80% of the strip. Thus, the load information
is always visible `in the background'.

This script also allows one to produce variants of the same kind of
graph that focus on different features of GranSim:

   * *Migration:* With the option `-M' this script produces a graph
     that draws arrows between processors indicating the migration of
     a thread from one processor to another. No load or blocking
     information is shown in this graph.

   * *Sparking:* With the option `-S' a spark graph is generated. It
     shows information about the number of sparks on a processor in
     the same way as the number of runnable threads (i.e. by shading).
     This graph is useful to highlight processors that create a lot of
     work.

No more than about 32 processors should be shown in one graph,
otherwise the strips get too small. This profile cannot be generated
for a GranSim-Light profile.


File: user.info, Node: Per-Thread Activity Profile, Prev: Per-Processor Activity Profile, Up: Activity Profiles

Per-Thread Activity Profile
---------------------------

The *per-thread activity profile* shows the activity of all generated
threads. For each thread a horizontal line is shown. The line starts
when the thread is created and ends when it is terminated. The
thickness of the line indicates the state of the thread. The possible
states correspond to the groups shown in the overall activity profile
(*Note Overall Activity Profile::). The states are encoded in the
following way:

   * A *running* thread is shown as a thick green (gray) line.
   * A *runnable* thread is shown as a medium red (black) line.
   * A *fetching* (or *migrating*) thread is shown as a thin blue
     (black) line.
   * A *blocked* thread is shown as a gap in the line.

This profile gives the most accurate kind of information, and it often
makes it possible to `step through' the computation by relating events
on different processors with each other.
For example, the typical pattern at the beginning of the computation
is some short computation for starting the thread, followed by
fetching remote data. After that, the thread may become runnable if
another thread has been started in the meantime.

However, such a detailed analysis is only possible for programs with a
rather small number of threads. Usually, GranSim profiles of bigger
executions have to be pre-processed to reduce the number of threads
that are shown on one graph (*Note Scripts::). As the level of detail
provided by this graph is rarely needed for bigger executions, no
automatic splitting of a profile into several graphs has been
implemented.


File: user.info, Node: Granularity Profiles, Next: Scripts, Prev: Activity Profiles, Up: Visualisation Tools

Granularity Profiles
====================

The tools for generating *granularity profiles* aim at showing the
relative sizes of the generated threads. Of special interest is the
number of tiny threads, for which the overhead of thread creation is
relatively high.

All tools discussed in this section require Gnuplot to generate the
granularity profiles. I am using version 3.5, but it should work with
older versions, too.

For showing granularity, basically two kinds of graphs can be
generated:

   * A *bucket graph* (*Note Bucket Graphs::), which collects threads
     with similar runtime in the same bucket and shows the number of
     threads in each bucket.

   * A *cumulative graph* (*Note Cumulative Graphs::), which shows how
     many threads have a runtime less than or equal to a given number.

The main tools for generating such graphs are:

   * `gr2gran' creates one bucket graph and one cumulative graph from
     a given GranSim profile. The information about the partitioning
     for the bucket statistics and other set-up information is usually
     provided in a *template file* (*Note Template Files::), which is
     specified via the `-t' option (`-t ,' uses the global template
     file in `$GRANDIR/bin'). This script works in three stages:

       1.
          First an `RTS' file is generated, which is just a sorted
          list of runtimes extracted from the END events of a GranSim
          profile (*Note GranSim Profiles::).
       2. The main stage generates a `gnuplot' file by grouping the
          threads into buckets and computing cumulative values.
       3. Finally, Gnuplot is used to generate `PostScript' files
          showing the graphs.

   * `gran-extr' is based on the same idea as `gr2gran', but it
     produces even more graphs, showing the communication percentage,
     determining a correlation coefficient between heap allocations
     and runtime etc.

* Menu:

* Bucket Graphs::
* Cumulative Graphs::
* Template Files::
* Statistics Package::


File: user.info, Node: Bucket Graphs, Next: Cumulative Graphs, Prev: Granularity Profiles, Up: Granularity Profiles

Bucket Graphs
-------------

In a bucket graph the x-axis, indicating execution times of the
threads, is partitioned into intervals ("buckets"). The graph shows a
histogram of the number of threads in each bucket (i.e. the threads
whose execution time falls into this interval). For generating this
kind of graph only a restricted GranSim profile (containing only END
events) is required.

For example, one of the files generated by running

     gr2gran -t , pf.gr

is `g.ps', which contains such a bucket graph. The `-t' option of this
tool selects the right template file (`,' is a shorthand for the
global template in `$GRANDIR/bin'). The necessary change in the
template file for these bucket statistics is

     -- Intervals for pure exec. times
     G: (100, 200, 500, 1000, 2000, 5000, 10000)


File: user.info, Node: Cumulative Graphs, Next: Template Files, Prev: Bucket Graphs, Up: Granularity Profiles

Cumulative Graphs
-----------------

In a cumulative graph the x-axis again represents execution times of
the individual threads. The value in the graph at time t represents
the number of threads whose execution time is smaller than t.
Therefore, the values in the graph are monotonically increasing, and
at the right end the graph shows the total number of threads in the
execution.

Again, running

     gr2gran -t , pf.gr

generates cumulative graphs for the runtime in the files `cumu-rts.ps'
and `cumu-rts0.ps' (one file shows absolute numbers of threads on the
y-axis, the other the percentage of threads).


File: user.info, Node: Template Files, Next: Statistics Package, Prev: Cumulative Graphs, Up: Granularity Profiles

Template Files
--------------

The functions for reading template files can be found in
`template.pl'. This file also contains documentation about the
available fields.


File: user.info, Node: Statistics Package, Prev: Template Files, Up: Granularity Profiles

Statistics Package
------------------

A set of statistics functions for computing the mean value, standard
deviation, correlation coefficient etc. can be found in `stats.pl'.


File: user.info, Node: Scripts, Prev: Granularity Profiles, Up: Visualisation Tools

Scripts
=======

The GranSim Toolbox contains not only visualisation tools but also a
set of scripts that work on GranSim profiles and provide specific
information.

   * The `tf' script aims at showing the task flow (as well as the
     node flow) in the execution of a program. It is used in the
     GranSim Emacs mode to narrow a GranSim profile (*Note The Emacs
     GranSim Profile Mode::).

   * The `SN' script creates a summary of the spark names that occur
     in a GranSim profile. This summary is shown as an impulses graph
     via Gnuplot. It allows one to compare the relative number of
     threads generated by each static spark site.

   * The `AVG' and `avg-RTS' scripts compute the average runtime from
     an RTS file, which is generated by `gr2RTS'.


File: user.info, Node: GranSim Profiles, Next: Parallel NoFib Suite, Prev: Visualisation Tools, Up: Top

GranSim Profiles
****************

This chapter describes the contents of a "GranSim profile" (a `.gr'
file).
In most cases the profiles generated by the visualisation tools should
provide sufficient information for tuning the performance of the
program. However, it is possible to extract more information out of
the generated GranSim profile. This chapter describes how to do that.

* Menu:

* Types of GranSim Profiles::
* Contents of a GranSim Profile::
* The Emacs GranSim Profile Mode::


File: user.info, Node: Types of GranSim Profiles, Next: Contents of a GranSim Profile, Prev: GranSim Profiles, Up: GranSim Profiles

Types of GranSim Profiles
=========================

Depending on some runtime-system options, different kinds of profiles
are generated:

   * A *reduced* GranSim profile contains only END events in the body
     component. This is sufficient to extract granularity profiles,
     but it is not sufficient to generate activity profiles. This is
     the default setting for GranSim profiles.

   * A *full* GranSim profile contains one line for every major event
     in the system (*Note Contents of a GranSim Profile::). The
     generation of such a profile is enabled by the RTS option `-bP'.

   * A *spark* profile additionally contains events related to sparks
     (for creating, using, pruning, exporting and acquiring sparks).
     Such a profile is generated when using the RTS option `-bs'.

   * A *heap* profile additionally contains events for allocating
     heap. Such a profile is generated when using the RTS option
     `-bh'.


File: user.info, Node: Contents of a GranSim Profile, Next: The Emacs GranSim Profile Mode, Prev: Types of GranSim Profiles, Up: GranSim Profiles

Contents of a GranSim Profile
=============================

This section describes the syntactic structure of a GranSim profile.

* Menu:

* Header::
* Body::


File: user.info, Node: Header, Next: Body, Prev: Contents of a GranSim Profile, Up: Contents of a GranSim Profile

Header
------

The header contains general information about the execution. It is
split into several sections separated by lines consisting only of `-'
symbols.
The end of the header is indicated by a line consisting only of `+'
symbols. The sections of the header are:

   * The name of the program, its arguments and the start time.

   * General parameters describing the parallel architecture. This
     covers the number of processors, flags for thread migration,
     asynchronous communication etc. Finally, this section describes
     basic costs of the parallel machine like thread creation time,
     context switch time etc.

   * Communication parameters describing basic costs for sending
     messages, like latency, message creation costs etc.

   * Instruction costs describing the costs of different classes of
     machine operations.


File: user.info, Node: Body, Prev: Header, Up: Contents of a GranSim Profile

Body
----

The body of the GranSim profile contains the events that are generated
during the execution of the program. The following subsections first
describe the general structure of the events and then go into the
details of several classes of events.

* Menu:

* General Structure::
* END Events::
* Basic Thread Events::
* Communication Events::
* Thread Migration Events::
* Spark Events::
* Debugging Events::


File: user.info, Node: General Structure, Next: END Events, Prev: Body, Up: Body

General Structure of an Event
.............................

Each line in the body of a GranSim profile represents one event during
the execution of the program. The general structure of one such line
is:

   * The keyword PE.

   * The *processor number* where the event happened.

   * The *time stamp* of the event (in square brackets).

   * The *name of the event*.

   * The *thread id* of the affected thread (a hex number).

   * Optionally a *node* as an additional argument to the event (e.g.
     the node to be reduced in the case of a START event). This is
     either a hex number or the special string `______' indicating a
     Nil-closure.

   * Additional information depending on the event. This can be the
     processor from which data is fetched or the length of the spark
     queue after starting a new thread.
The fields are separated by whitespace. A : symbol must follow the time stamp (which must be in square brackets).

File: user.info, Node: END Events, Next: Basic Thread Events, Prev: General Structure, Up: Body

END Events
..........

END events are an exception to this general structure. The reason for their special structure is that they summarise the most important information about the thread. Therefore, information about e.g. the granularity of the threads can be extracted from END events alone, without having to generate a full GranSim profile. The structure of an END event is:

* The keyword PE.

* The *processor number* where the event happened.

* The *time stamp* of the event (in square brackets).

* The *name of the event* (END in this case).

* The keyword SN followed by the *spark name* of the thread. This
  information makes it possible to associate a thread with its static
  spark site in the program (*Note Parallelism Annotations::, on how to
  give names to spark sites.)

* The keyword ST followed by the *start time* of the thread.

* The keyword EXP followed by a flag indicating whether this thread has
  been *exported* to another processor or has been evaluated locally
  (possible values are T and F).

* The keyword BB followed by the number of *basic blocks* that have
  been executed by this thread.

* The keyword HA followed by the number of *heap allocations*.

* The keyword RT followed by the total *runtime*. This is the most
  important information in an END event. It is used by the
  visualisation tools for generating granularity profiles.

* The keyword BT followed by the total *block time* and a block count
  (i.e. how often the thread has been blocked).

* The keyword FT followed by the total *fetch time* and a fetch count
  (i.e. how often the thread fetched remote data).

* The keyword LS followed by the *number of local sparks* (sparks that
  have to be executed on the local processor) generated by the thread.
* The keyword GS followed by the *number of global sparks* generated by
  the thread.

* The keyword EXP followed by a flag indicating whether this thread was
  *mandatory* or only advisable (in the current version this flag is
  not used; it would be important in a combination of GranSim with a
  concurrent set-up).

File: user.info, Node: Basic Thread Events, Next: Communication Events, Prev: END Events, Up: Body

Basic Thread Events
...................

The main events directly related to threads are:

* START: Generated when starting a thread (after adding the overhead
  for thread creation to the clock of the current processor). After the
  thread id it has two additional fields: one specifying the node to be
  evaluated (as a hex number) and the spark site that generated this
  thread (format: [SN N] where N is a dec number).

* START(Q): Same as START but the new thread is put into the runnable
  queue rather than being executed (only if the current processor is
  busy at that point).

* BLOCK: A thread is blocked on data that is under evaluation by
  another thread. It is descheduled and put into the blocking queue of
  that node. Two additional fields contain the node on which the thread
  is blocked and the processor on which this node resides (format:
  (from N) where N is a processor (dec) number).

* RESUME: Continue to execute the thread after it has been blocked or
  has been waiting for remote data. This event does not contain
  additional fields.

* RESUME(Q): Same as RESUME but the new thread is put into the runnable
  queue rather than being executed (only if the current processor is
  busy at that point).

* SCHEDULE: The thread is scheduled on the given processor (no
  additional fields). This event is usually emitted after terminating a
  thread on the processor. It may also occur after a FETCH (if
  asynchronous communication is turned on) or after a BLOCK event.

* DESCHEDULE: The thread is descheduled on the given processor (no
  additional fields).
After this event the thread is in the runnable queue. This event is not used for the implicit descheduling that is performed after events like BLOCK or FETCH. DESCHEDULE events should only occur if fair scheduling is turned on.

File: user.info, Node: Communication Events, Next: Thread Migration Events, Prev: Basic Thread Events, Up: Body

Communication Events
....................

Events that are issued when sending data between processors are:

* FETCH: Send a fetch request from the given thread (on the given
  processor) to another processor. This event has two additional
  fields: the first field is the node (hex number) that should be
  fetched. The last field is the processor where this node resides and
  from which the data has to be fetched (format: (from N) where N is a
  processor (dec) number).

* REPLY: A reply to a fetch request of the given thread arrived at the
  given processor. The first additional field contains the node and the
  last field contains the processor from which it arrived (format:
  (from N) where N is a processor (dec) number). Note: This event only
  marks the arrival of the data. It is usually followed by a RESUME or
  RESUME(Q) event for the thread that asked for the data.

File: user.info, Node: Thread Migration Events, Next: Spark Events, Prev: Communication Events, Up: Body

Thread Migration Events
.......................

These events are only produced when thread migration is enabled (-bM):

* STEALING: Indicates the stealing of a thread on the given processor.
  The thread which is being stolen appears in the thread field. One
  additional field (the last field) indicates which processor is
  stealing that thread (format: (by N) where N is a processor (dec)
  number).

* STOLEN: Indicates the arrival of a stolen thread on the given
  processor. Two additional fields show the node which will be
  evaluated by this thread next. The last field shows from which
  processor the thread has been stolen (format: (from N) where N is a
  processor (dec) number).
Note: This thread is immediately executed by the given processor (no RESUME event follows).

* STOLEN(Q): Same as STOLEN but the new thread is put into the runnable
  queue rather than being executed (only if the current processor is
  busy at that point).

File: user.info, Node: Spark Events, Next: Debugging Events, Prev: Thread Migration Events, Up: Body

Spark Events
............

When spark profiling is enabled, events related to sparks will appear in the profile:

* SPARK: Indicates the generation of a spark on the given processor for
  the given node. At that point it is added to this processor's spark
  pool. Two additional fields show the node to which this spark is
  pointing and the current size of the spark pool (format: [sparks N]
  where N is a dec number).

* SPARKAT: Same as SPARK but with explicit placement of the spark on
  this processor. This is usually achieved in the program by using a
  parLocal or parAt rather than a parGlobal annotation (*Note
  Parallelism Annotations::).

* USED: Indicates that this spark is turned into a thread on the given
  processor. A START or START(Q) event will follow soon afterwards.

* PRUNED: A spark is removed from the spark pool of the given
  processor. This might occur when the spark points to a normal form
  (there is no work to do for that spark). This is checked when
  creating a spark and when searching the spark pool for new work.

* EXPORTED: A spark is exported from the given processor. Two
  additional fields show the node to which this spark is pointing and
  the current size of the spark pool (format: [sparks N] where N is a
  dec number).

* ACQUIRED: A spark that has been exported by another processor is
  acquired by the given processor. Two additional fields have the same
  meaning as for EXPORTED.

File: user.info, Node: Debugging Events, Prev: Spark Events, Up: Body

Debugging Events
................

Certain debug options generate additional events that make it possible to monitor the internal behaviour of the simulator.
This information shouldn't be of interest for the friendly user but might come in handy for those who dare to hack on the runtime system:

* SYSTEM_START: Indicates that the simulator is executing a "system"
  routine (a routine in the runtime-system that is not directly related
  to graph reduction). This makes it possible to show exactly when
  rescheduling is done in the simulator. It may be useful in
  GranSim-Light to check that the costs during system operations are
  attached to the right thread.

* SYSTEM_END: See previous event. From this point on normal graph
  reduction is performed.

File: user.info, Node: The Emacs GranSim Profile Mode, Prev: Contents of a GranSim Profile, Up: GranSim Profiles

The Emacs GranSim Profile Mode
==============================

Looking up information directly in a GranSim profile is very tedious (believe me, I have done it quite often). To make this task easier the GranSim Toolbox contains a GNU Emacs mode for GranSim profiles: the "GranSim Profile Mode". The most useful features (IMNSHO) are the highlighting of parts of a GranSim profile and the narrowing of the profile to specific PEs, threads, events etc.

* Menu:

* Installation::
* Customisation::
* Features::

File: user.info, Node: Installation, Next: Customisation, Prev: The Emacs GranSim Profile Mode, Up: The Emacs GranSim Profile Mode

Installation
------------

To use this mode just put the file `GrAnSim.el' somewhere on your Emacs LOAD-PATH and load the file. I don't have autoload support at the moment, but the file is very short anyway, so directly loading it is quite fast. Currently, the mode requires the HILIT19 package for highlighting parts of the profile. It also requires the `tf' script in the bin dir of your GranSim installation. I use Emacs 19.31 with the default `hilit19.el' package, but the GranSim profile mode has also been successfully tested with Emacs 19.27. However, if you have problems with the mode please report them to the address shown at the end of this document (*Note Bug Reports::).
File: user.info, Node: Customisation, Next: Features, Prev: Installation, Up: The Emacs GranSim Profile Mode

Customisation
-------------

A few Emacs variables control the behaviour of the GranSim Profile mode:

 -- Variable: gransim-auto-hilit
     This variable indicates whether highlighting is turned on by
     default. Note that you can customise `hilit19' such that it does
     not automatically highlight buffers that are bigger than a given
     size. Since GranSim profiles tend to be extremely large you might
     want to reduce the default value.

 -- Variable: grandir
     The root of the GranSim installation. The mode searches for
     scripts of the GranSim Toolbox in the directory grandir/bin. By
     default this variable is set to the contents of the environment
     variable `GRANDIR'.

 -- Variable: hwl-hi-node-face
     Face to be used for specific highlighting of a node.

 -- Variable: hwl-hi-thread-face
     Face to be used for specific highlighting of a thread.

Here are the hilit19 variables that are of some interest for the GranSim Profile Mode:

 -- Variable: hilit-auto-highlight
     T if we should highlight all buffers as we find 'em, nil to
     disable automatic highlighting by the find-file hook. Default
     value: t.

 -- Variable: hilit-auto-highlight-maxout
     Auto-highlight is disabled in buffers larger than this. Default
     value: 60000.

File: user.info, Node: Features, Prev: Customisation, Up: The Emacs GranSim Profile Mode

Features
--------

The main features of the GranSim profile mode are:

* *Highlighting* of parts of the profile. Colour coding is used to
  distinguish between events that start a reduction, finish a reduction
  and block a reduction. Within END events the total runtime is
  specially highlighted.

* *Narrowing* of the profile. This should not be confused with the
  narrowing mode in Emacs. The narrowing in GranSim profile mode is
  done by running a script (`tf') over the buffer and displaying the
  output in another buffer.
Hence, the narrowing can be further refined by improving the `tf' script, which is written in Perl. It is possible to narrow a GranSim profile to a specific

* *processor* (PE),
* *event*,
* *thread*,
* *node*,
* *spark* (only possible for spark profiles)

This feature is particularly useful, e.g., to follow a node that has been moved between processors or to concentrate on the reductions on one specific processor. Of course, for those pagans who don't believe in Emacs, it is also possible to run the `tf' script directly on a `.gr' file.

* A second form of *highlighting*, specialised for nodes and threads,
  is available, too. With the commands `hwl-hi-thread' and
  `hwl-hi-node' every occurrence of the thread or node in the profile
  after the current point is highlighted. The function `hwl-hi-clear'
  undoes all such highlighting.

* There is a menu item for calling most of the functions described
  here. It automatically appears in any GranSim profile (i.e. any file
  that has a `.gr' extension).

Default key bindings in GranSim profile mode:

 -- : `C-c T', `M-x hwl-truncate'
     Truncate event lines such that exactly one line is shown for one
     event in the body of a profile.

 -- : `C-c W', `M-x hwl-wrap'
     Wrap lines to show them in full length.

 -- : `C-c ', `M-x hwl-toggle-truncate-wrap'
     Toggle between the above two modes.

 -- : `C-c H', `M-x hilit-rehighlight-buffer'
     Rehighlight the whole buffer.

 -- : `C-c P', `M-x hwl-narrow-to-pe'
     Narrow the profile to a PE.

 -- : `C-c T', `M-x hwl-narrow-to-thread'
     Narrow the profile to a thread.

 -- : `C-c E', `M-x hwl-narrow-to-event'
     Narrow the profile to an event.

 -- : `C-c C-e', `(lambda () (hwl-narrow-to-event "END"))'
     Narrow the profile to an END event.

 -- : `C-c N', `M-x hwl-hi-node'
     Highlight a node in the profile.

 -- : `C-c T', `M-x hwl-hi-thread'
     Highlight a thread in the profile.

 -- : `C-c C-c', `M-x hwl-hi-clear'
     Remove highlightings of nodes and threads.
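For those who prefer to process profiles programmatically rather than through Emacs or the `tf' script, the fixed prefix of an event line described earlier (the keyword PE, the processor number, the time stamp in square brackets followed by a :, the event name and the thread id) can be parsed with a few lines of Haskell. This is only an illustrative sketch; the record and function names below are our own and are not part of any GranSim tool:

```haskell
import Data.Char (isDigit)

-- One parsed event line; the field names are illustrative, not GranSim's.
data Event = Event
  { evProc   :: Int      -- processor number
  , evTime   :: Integer  -- time stamp
  , evName   :: String   -- event name (START, END, FETCH, ...)
  , evThread :: String   -- thread id (a hex number)
  , evRest   :: [String] -- remaining, event-specific fields
  } deriving Show

-- Parse the fixed prefix "PE <proc> [<time>]: <event> <thread> ...".
-- Header lines and anything else that does not match yield Nothing.
parseEvent :: String -> Maybe Event
parseEvent line =
  case words line of
    ("PE" : pe : time : name : tid : rest)
      | all isDigit pe && any isDigit time ->
          Just (Event (read pe) (read (filter isDigit time)) name tid rest)
    _ -> Nothing
```

A narrowing to one PE, similar to what the `tf' script does, is then e.g. `filter (maybe False ((== 0) . evProc) . parseEvent) (lines profileText)'.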
File: user.info, Node: Parallel NoFib Suite, Next: Internals of GranSim, Prev: GranSim Profiles, Up: Top

The Parallel NoFib Suite
************************

While developing GranSim (and GUM) we have started to collect a set of interesting and non-trivial parallel programs written in parallel Haskell. These programs are used as a test suite for GranSim and form part of the NoFib Suite of GHC. Currently this suite contains the following programs:

1. The Sieve of Eratosthenes (`sieve'), which creates a bidirectional
   pipeline, where each processor filters the multiples of one prime
   number out of an input list.

2. Warshall's shortest path algorithm in a graph (`warshall'), which
   creates a cyclic pipeline of processors.

3. A function `matmult', which performs a parallel matrix
   multiplication.

4. A function `determinant', which computes the determinant of a given
   square matrix.

5. A function `LinSolv', which solves a given set of linear equations
   over integers by using multiple homomorphic images, computing the
   result in each image via Cramer's Rule. The main part of this
   algorithm is the parallel determinant computation.

6. A function `coins', which computes the number of ways to pay a given
   value with a given set of coins.

7. An award assignment program that basically has to compute
   permutations of a given list (`pperms').

8. A parallel version of Quicksort (`sort').

9. A word search program `soda7', which searches a given grid of
   letters for a given set of words.

10. An RSA encryption program `rsa' (taken out of the sequential nofib
    suite).

11. An n-queens program `queens'.

12. The parallel database manager for GUM `dcbm'. It performs queries
    to a database in parallel.

13. The bill of materials program `bom' developed (under GranSim) by
    Phil Trinder as part of the Parallel Databases Project.

14. A ray-tracer `nray' based on a version taken from Kelly's thesis
    and ported to GRIP by Hammond.

15.
A univariate resultant computation, using the SACLIB library for
    computer algebra. Mainly the basic polynomial arithmetic has been
    taken from that library.

16. One of the FLARE programs, solving a problem in the area of
    computational fluid dynamics (`cfd').

17. A function `minimax', which computes the best next move for a given
    position in the game of tic-tac-toe, using an alpha-beta pruning
    algorithm.

18. Several NESL programs, including a numerical integration algorithm
    `integrate', a quick hull algorithm `qhull' for computing the
    convex hull of a set of points in a plane, a matrix inversion based
    on Gauss-Jordan elimination `mi' and a fast Fourier transformation
    `fft'. All algorithms are based on the NESL versions published by
    Guy Blelloch in CACM 39(3) and translated to Haskell.

19. A Newton-Raphson iteration for finding a root of a polynomial. This
    program has been taken from the Impala suite of implicitly parallel
    benchmark programs (written in Id).

Despite its name, the parallel nofib suite also contains some fib-ish programs. These programs should be of interest for getting started with parallel functional programming using GranSim:

1. A parallel factorial function (`parfact'), which computes the sum of
   all numbers from 1 up to a given value n by bisecting the interval
   and computing the results for the subintervals in parallel.

2. A parallel fib-like function (`parfib'), which performs an
   additional gcd computation and 2 additional multiplications in each
   recursive call.
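To give a flavour of these introductory programs, here is a minimal fib-style divide-and-conquer function written with `par' and `seq' in the spirit of the `parmap' examples earlier in this guide. It is only a sketch: it omits the gcd computation and the extra multiplications that the actual `parfib' performs in each call, and it assumes the `par' combinator as provided by GHC's Control.Parallel module (a GranSim set-up may use its own annotations instead):

```haskell
import Control.Parallel (par)  -- assumed source of `par'; GranSim may differ

-- Spark the first recursive call for parallel evaluation, evaluate the
-- second one locally, and only then combine the two results.  The `seq'
-- ensures that nf2 is demanded before the sum is built, as in the
-- parmap examples.
parfib :: Int -> Integer
parfib n
  | n <= 1    = 1
  | otherwise = nf1 `par` (nf2 `seq` (nf1 + nf2))
  where
    nf1 = parfib (n - 1)
    nf2 = parfib (n - 2)
```

Running such a program under GranSim and inspecting the resulting `.gr' profile with the tools described above is a good first exercise.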