GranSim provides a large number of runtime-system options for controlling
the simulation. Most of these options allow you to specify a particular
parallel architecture in great detail.
As a general convention, all GranSim-related options start with -b to
separate them from other GHC RTS options. To separate the RTS options
from the ordinary options given to the Haskell program, the meta-option
+RTS has to be used.
If you are not interested in the details of the available options and
just want to specify a somewhat generic setup for one class of parallel
machines go to the last section of this chapter (see section Specific Setups).
The options in this section are probably the most important GranSim
options the programmer has to be aware of. They define the basic
behaviour of GranSim rather than descending to the low level of
individual machine characteristics.
- RTS Option: -bP
This option controls the generation of a GranSim profile (see section GranSim Profiles). By default a
reduced profile (containing only END events) is created. With -bP a
full GranSim profile is generated; such a profile is necessary for creating
activity profiles. With -b-P no profile is generated at all.
- RTS Option: -bs
Generate a spark profile. The GranSim profile will contain events
describing the creation, movement, consumption and pruning of sparks.
Note: This option will drastically increase the size of the
generated GranSim profile.
- RTS Option: -bh
Generate a heap profile. The GranSim profile will contain events
describing heap allocations.
Note: This option will drastically increase the size of the
generated GranSim profile.
- RTS Option: -bpn
Choose the number of processors to simulate. The value of n must
be less than or equal to the word size of the machine (usually 32).
If n is 0, GranSim-Light mode is enabled.
- RTS Option: -bp:
Enable GranSim-Light (same as -bp0). In this mode there is no limit
on the number of processors, and no communication costs are recorded.
The options in this section allow you to simulate special features in the
runtime-system of the simulated machine. This makes it possible to study
how these features influence the behaviour of different kinds of
parallel machines. All of these flags can be turned off by inserting a
- symbol after b (as in -b-P).
If a thread fetches remote data, by default the processor is blocked
until the data arrives. This synchronous communication is usually
only advantageous on machines with very low latency, where it is better
to wait than to pay the overhead of a context switch. Synchronous
communication may also increase data locality, as a thread is only
descheduled when it blocks on data that is under evaluation by another
thread.
On machines with high latency, asynchronous communication is
usually better, as it allows the processor to perform some useful
computation while a thread waits for the arrival of remote data.
The processor may be even more aggressive in trying to obtain
work while a thread is waiting for data. This aggressiveness in
acquiring new work is determined by the fetching
strategy. Currently five different strategies are
supported.
- RTS Option: -byn
Choose a fetching strategy (i.e. determine what to do while the running
thread fetches remote data):
- 0
Synchronous communication (default).
- 1
This and all higher fetching strategies implement asynchronous
communication. This strategy schedules another runnable thread if one is
available. This gives the same behaviour as -bZ.
- 2
If no runnable thread is available, a local spark is turned into a thread.
This adds thread creation overhead to the context-switch overhead of
asynchronous communication.
- 3
If no local sparks are available, the processor tries to acquire a remote
spark.
- 4
If the processor can't get a remote spark, it tries to acquire a runnable
thread from another busy processor, provided that migration is also
turned on (-bM).
- RTS Option: -bZ
Enable asynchronous communication. This causes a thread to be
descheduled when it fetches remote data; its processor then schedules
another runnable thread or, if none is available, becomes idle.
This gives the same behaviour as -by1.
Note that fetching strategies 3 and 4 involve sending messages to
other processors. It is therefore likely that by the time a spark (or
thread) has been fetched, the original thread has already received its
data and is being executed again. In most cases, strategies 1
and 2 therefore yield better results than 3 and 4.
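The escalation described above can be pictured as a decision ladder. The following Python fragment is purely illustrative (the function and argument names are invented, and this is not GranSim source code); it sketches what a simulated processor does while its running thread fetches remote data, for each value of -by:

```python
def on_fetch(strategy, has_runnable, has_local_spark,
             got_remote_spark, migration_enabled):
    """Action of a simulated processor whose running thread has just
    started fetching remote data, for fetching strategies 0..4 (-by)."""
    if strategy == 0:
        return "block"                      # synchronous communication
    if has_runnable:
        return "schedule runnable thread"   # strategy 1 and higher
    if strategy >= 2 and has_local_spark:
        return "turn local spark into thread"
    if strategy >= 3 and got_remote_spark:
        return "acquire remote spark"
    if strategy >= 4 and migration_enabled:
        return "migrate thread from busy processor"
    return "idle"
```

Note that the last two rungs need a message round trip before they yield any work, which is exactly why the original thread's data often arrives first.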
When fetching remote data there are several possibilities for
transferring the data; the options -bG and -bQn allow you to
choose among them. By default GranSim uses incremental fetching
(also called single-closure fetching, or lazy fetching).
This means that only the closure that is immediately required is fetched
from the remote processor. Again, this strategy is preferable on low-latency
systems, as it minimises the total amount of data that has to be
transferred. However, if the overheads for creating a packet and for
sending a message are high, it is better to perform bulk fetching,
which transfers a subgraph with the required closure as its root.
The size of this subgraph can be bounded by specifying the maximal size
of a packet or the maximal number of thunks (unevaluated
closures) that may be put into one packet. The method of determining
which closures to put into a packet is called the packing strategy.
- RTS Option: -bG
Enable bulk fetching. This causes the whole subgraph with the needed
closure as its root to be transferred.
- RTS Option: -bQn
Pack at most n-1 non-root thunks into a packet. Choosing a value
of 1 means that only normal-form closures are transferred, with the
possible exception of the root, of course. The value 0 is interpreted as
infinity (i.e. pack the whole subgraph). This is the default setting.
- RTS Option: -Qn
Pack at most n words into one packet. This limits the size of the
packet and does not distinguish between thunks and normal forms. The
default packet size is 1024 words.
The option for setting the packet size differs from the usual GranSim
naming scheme because it is also available in GUM. In fact, both
implementations use almost the same source code for packing a graph.
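The interplay of the thunk limit (-bQ) and the word limit (-Q) can be sketched as follows. This is not GranSim's actual packing code (which it shares with GUM); the graph representation and all names are invented for illustration:

```python
from collections import deque

def pack_subgraph(root, children, is_thunk, size, max_thunks, max_words):
    """Breadth-first sketch of bulk-fetch packing: collect the subgraph
    under `root` into one packet, with at most `max_thunks` non-root
    thunks (-bQn allows n-1 of them) and `max_words` words (-Q).
    Thunks over the limit are left behind; normal forms keep packing."""
    packet, words, thunks = [], 0, 0
    queue = deque([root])
    while queue:
        node = queue.popleft()
        if words + size[node] > max_words:
            break                              # packet full
        if node != root and is_thunk[node]:
            if thunks == max_thunks:
                continue                       # leave this thunk behind
            thunks += 1
        packet.append(node)
        words += size[node]
        queue.extend(children[node])
    return packet
```

With max_thunks set to 0 (i.e. -bQ1) only normal forms besides the root end up in the packet, matching the description above.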
When an idle processor looks for work it first checks its local spark
pool, then it tries to get a remote spark. It may happen that no
sparks are available in the system any more while some processors
have several runnable threads. In such a situation it can be
advantageous to transfer a runnable thread from a busy processor to an
idle one. However, this thread migration is very expensive and
should be avoided unless absolutely necessary. Therefore, thread
migration is turned off by default.
- RTS Option: -bM
Enable thread migration. When an idle processor has no local sparks and
can't find global sparks, it tries to migrate (steal) a runnable thread from
another busy processor.
Note that thread migration often causes a lot of fetching, too, in order to
move all required data to the new processor. This bears the risk of
destroying data locality.
The options in this section allow you to specify the overheads for
communication. Note that in GranSim-Light mode all of these values are
irrelevant, as no communication costs are recorded.
- RTS Option: -bln
Set the latency in the system to n machine cycles. Typical values
are 60 -- 100 cycles for shared memory machines, around 400 cycles for
GRIP (a closely-coupled distributed memory machine) and between 1000 and
5000 cycles for standard distributed memory machines. The
default value is 1000 cycles.
- RTS Option: -ban
Set the additional latency in the system to n machine cycles. The
additional latency is the latency of follow-up packets within the same
message. Usually this is much smaller than the latency of the first
packet (default: 100 cycles).
- RTS Option: -bmn
Set the overhead for message packing to n machine cycles. This is
the overhead for constructing a packet independent of its size.
- RTS Option: -bxn
Set the overhead for tidying up the packet after sending it to n
machine cycles. On some systems significant work is needed after having
sent a packet.
- RTS Option: -brn
Set the overhead for unpacking a message to n machine
cycles. Again, this overhead is independent of the message size.
- RTS Option: -bgn
Set the overhead for fetching remote data to n machine cycles.
By default this value is twice the latency plus the message unpacking time.
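Taken together, the options above define a simple cost model for a message. The sketch below is one plausible reading of these parameters, with invented helper names and illustrative figures; it is not the exact formula used by GranSim:

```python
def message_cost(packets, latency, extra_latency, pack_cost, unpack_cost):
    """Cycles to deliver one message consisting of `packets` packets:
    packing overhead (-bm), latency of the first packet (-bl),
    additional latency per follow-up packet (-ba), and unpacking
    overhead (-br).  Tidy-up time (-bx) is omitted for brevity."""
    return pack_cost + latency + (packets - 1) * extra_latency + unpack_cost

def default_fetch_cost(latency, unpack_cost):
    """The default -bg value: twice the latency plus message unpack time."""
    return 2 * latency + unpack_cost
```

With the default latency of 1000 cycles and an assumed unpack time of 50 cycles, the default fetch overhead would come to 2050 cycles.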
The options in this section model overhead that is related to the
runtime-system of the simulated parallel machine.
- RTS Option: -btn
Set the overhead for thread creation to n machine cycles. This
overhead includes the costs of initialising a control structure describing
the thread and of allocating stack space for its execution.
- RTS Option: -bqn
Set the overhead for putting a thread into the blocking queue of a
closure to n machine cycles.
- RTS Option: -bcn
Set the overhead for scheduling a thread to n machine cycles.
- RTS Option: -bdn
Set the overhead for descheduling a thread to n machine cycles.
- RTS Option: -bnn
Set the overhead for global unblocking (i.e. unblocking a thread that was
blocked on a remote closure) to n machine cycles. This value does not
include the overhead caused by the communication between the processors.
- RTS Option: -bun
Set the overhead for local unblocking (i.e. taking a thread out of a
blocking queue and putting it into a runnable queue) to n machine
cycles. This value does not include any overhead caused by communication
between the processors.
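As a worked example of how these overheads add up, the following sketch charges one thread for its runtime-system bookkeeping over its lifetime. The accounting and the cycle counts in the test are illustrative only (the function is invented), not GranSim's actual charging scheme:

```python
def thread_bookkeeping(create, schedule, deschedule, enqueue, unblock, n_blocks):
    """Cycles of runtime-system overhead for one thread that is created
    (-bt) and blocks n_blocks times on local closures (-bq to enqueue,
    -bu to unblock locally, -bd/-bc to deschedule and reschedule)."""
    # one creation, one schedule per run segment, and per blocking
    # episode: deschedule + blocking-queue insertion + local unblocking
    return (create
            + (n_blocks + 1) * schedule
            + n_blocks * (deschedule + enqueue + unblock))
```

A thread that never blocks is charged only its creation and a single scheduling.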
The options in this section specify the characteristics of the
microprocessor of the simulated parallel machine. To this end the
instructions of the processor are divided into six groups.
These groups have different weights reflecting their different relative costs.
The groups of processor instructions are:
-
Arithmetic instructions
-
Load instructions
-
Store instructions
-
Branch instructions
-
Floating point instructions
-
Heap allocations
The options for assigning weights to these groups are:
- RTS Option: -bAn
Set the weight of arithmetic operations to n machine cycles.
- RTS Option: -bLn
Set the weight of load operations to n machine cycles.
- RTS Option: -bSn
Set the weight of store operations to n machine cycles.
- RTS Option: -bBn
Set the weight of branch operations to n machine cycles.
- RTS Option: -bFn
Set the weight of floating point operations to n machine cycles.
- RTS Option: -bHn
Set the weight of heap allocations to n machine cycles.
Strictly speaking, the heap allocation cost is a parameter of the
runtime-system. However, in our underlying machine model allocating heap
is such a basic operation that one can think of it as a special
instruction.
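The weighted instruction counts translate into simulated time in the obvious way. The fragment below illustrates the accounting; the weights shown are assumed values for illustration, not GranSim's defaults:

```python
def block_cost(counts, weights):
    """Simulated cost of a basic block: per-class instruction counts
    multiplied by the per-class weights (-bA, -bL, -bS, -bB, -bF, -bH)."""
    return sum(counts[cls] * weights[cls] for cls in counts)

# Illustrative weights: most instructions cost one cycle, floating
# point and heap allocation are more expensive (assumed, not defaults).
weights = {"arith": 1, "load": 1, "store": 1, "branch": 1,
           "float": 3, "heap": 2}
```

Raising a single weight (say -bF) then slows down exactly the programs that are dominated by that instruction class.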
There are three granularity control mechanisms:
-
Explicit threshold
No spark whose priority is smaller than a given threshold will be turned
into a thread.
-
Priority sparking
The spark queue is sorted by priority. This guarantees that the highest
priority spark is turned into a thread. Priorities are not maintained
for threads.
-
Priority scheduling
The thread queue is sorted by priority, too. This guarantees that the
biggest available thread is scheduled next. This imposes a higher
runtime overhead.
- RTS Option: -bYn
Use the value n as a threshold when turning sparks into threads: no
spark with a priority smaller than n will be turned into a thread.
- RTS Option: -bXx
Enable priority sparking. The letter x indicates how to use the
granularity information attached to a spark site in the source code:
-
Use the granularity information field as a priority.
- I
Use the granularity information field as an inverse priority.
- R
Ignore the granularity information field and use a random priority.
- N
Ignore the granularity information field and don't use priorities at all.
- RTS Option: -bI
Enable priority scheduling.
- RTS Option: -bKn
Set the overhead for inserting a spark into a sorted spark queue
to n machine cycles.
- RTS Option: -bOn
Set the overhead for inserting a thread into a sorted thread queue
to n machine cycles.
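Priority sparking with an explicit threshold can be pictured with a small sketch: a spark pool kept as a priority queue, plus a threshold check when a spark is about to become a thread. The data structures and names below are invented for illustration and are not GranSim's internals:

```python
import heapq

def next_thread(sparks, threshold):
    """Priority sparking (-bX) with an explicit threshold (-bY): the
    spark pool is kept sorted by priority, and the highest-priority
    spark becomes a thread only if it reaches the threshold."""
    pool = [(-priority, name) for name, priority in sparks]
    heapq.heapify(pool)                 # max-heap via negated priorities
    if not pool:
        return None
    neg_priority, name = heapq.heappop(pool)
    return name if -neg_priority >= threshold else None
```

Priority scheduling (-bI) would apply the same sorted-queue idea to the thread queue, at the cost of the sorted insertions charged by -bK and -bO.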
- RTS Option: -bC
Force the system to eagerly turn a spark into a thread. This basically
disables the lazy thread creation mechanism of GranSim and ensures that
no sparks are discarded (except for sparks whose closures are already in
normal form).
- RTS Option: -be
Enable deterministic mode. This means that no random number generator is
used for deciding where to get work from. With this option two runs of
the same program with the same input will yield exactly the same result.
- RTS Option: -bT
Prefer stealing threads over stealing sparks when looking for remote
work. This is mainly an experimental option.
- RTS Option: -bN
When creating a new thread prefer sparks generated by local closures over
sparks that have been stolen from other processors.
This is mainly an experimental option, which might improve data locality.
- RTS Option: -bfn
Specify the maximal number of outstanding requests for acquiring sparks
from remote processors. High values of n may cause a more even
distribution of sparks, avoiding bottlenecks caused by a `spark explosion'
on one processor. However, this might harm data locality. The default
value is 1.
- RTS Option: -bwn
Set the time slice of the simulator to n machine cycles.
This is an internal variable that changes the behaviour of the
simulation rather than that of the simulated machine. A longer time slice
means a faster but less accurate simulation. The default time slice is 1000
cycles.
These options are mainly intended for debugging GranSim. Only those
options that might be of interest to the friendly (i.e. non-hacking) user
of GranSim are listed here for now.
Note: These options are only available if the RTS has been
compiled with the cpp flag GRAN_CHECK (see section Installing).
- RTS Option: -bDE
Print event statistics at the end of the computation. This also
includes statistics about the packets sent (if bulk fetching has
been enabled).
If you are really interested in all the hidden options in GranSim
look into the file `ghc/runtime/main/RtsFlags.lc'.
Some options of the GHC runtime-system that are not specific to GranSim
are also of special interest. They are discussed in this section.
- RTS Option: -on
This option sets the initial stack size of a thread to n
words. This can be of importance for GranSim-Light, which may
create an abundance of parallel threads filling up the heap. The default
stack size in GranSim-Light is already reduced to 200 words (usually the
default is 1024 words). If you run into heap-size problems in a
GranSim-Light setup, you might want to reduce it further.
- RTS Option: -Sf
Print messages about garbage collections to the file f (or to
stderr).
- RTS Option: -F2s
Use a two-space garbage collector instead of a generational garbage
collector. Previous versions of GranSim had problems with the latter. If
you experience problems try this option and send me a bug report
(see section Bug Reports).
When using GranSim, a programmer often just wants to specify the general
architecture of the machine rather than going down to the details of a
specific machine. To facilitate this, this section presents examples of
standard set-ups for GranSim.
Note that these setups specify the characteristics of a machine, but not of
the runtime-system; characteristics like thread creation costs are
left open. However, the default settings fairly closely reflect the real
costs, for example under GUM. So, unless you have a different
implementation of runtime-system details in mind, the default settings
should be sufficiently accurate.
This setup reflects the ideal case, where communication is free and
there is no limit on the number of processors. It is used to
show the maximal amount of parallelism in the program. Using such a
GranSim-Light setup is usually the first step in tuning the
performance of a parallel, lazy functional program (see section GranSim Modes).
The typical GranSim-Light setup is:
+RTS -bP -bp:
In a shared memory machine the latency is roughly reduced to the cost
of a load operation. Potentially, some additional overhead for managing
the shared memory has to be added, and the caching mechanism might be
more expensive than in a sequential machine.
In general, the latency should be between 5 and 20 machine cycles.
For machines where the latency is of the same order of magnitude as
loading and storing data, it is reasonable to assume incremental,
synchronous communication. Migration should also be possible.
This gives the following setup (for 32 processors):
+RTS -bP -bp32 -bl10 -b-G -by0 -bM
Strongly connected DMMs put a specific emphasis on keeping the latency in the
system as low as possible. One example of such a machine is GRIP, which
was built specifically for performing parallel graph reduction;
this setup is therefore of special interest for us.
Most importantly, the latency in such machines is typically between 100
and 500 cycles (400 for GRIP). Furthermore, the GRIP runtime-system, as
an example of this kind of machine, uses incremental, synchronous
communication. Migration is also possible.
This gives the following setup (for 32 processors):
+RTS -bP -bp32 -bl400 -b-G -by0 -bM
General distributed memory machines usually have latencies that are an
order of magnitude higher than those of strongly connected DMMs. However,
especially in this class of machines the differences between specific
machines are quite significant. So, I strongly recommend using the
exact machine characteristics if GranSim is to be used to predict the
behaviour on such a machine.
The high latency requires a fundamentally different runtime-system to
avoid long delays when fetching remote data. Therefore, usually
asynchronous bulk fetching is used. I'd recommend choosing a fetching
strategy of 1 or 2 (it's hard to say which one is better in general).
Thread migration is so expensive on DMMs that it is often not
supported at all.
This gives the following setup (for 32 processors):
+RTS -bP -bp32 -bl2000 -bG -by2 -b-M