GranSim provides a large number of runtime-system options for controlling
the simulation. Most of these options allow you to specify a particular
parallel architecture in great detail.
As a general convention, all GranSim-related options start with -b to
separate them from other GHC RTS options. To separate the RTS options
from the ordinary options given to the Haskell program, the meta-option
+RTS has to be used.
If you are not interested in the details of the available options and
just want to specify a somewhat generic setup for one class of parallel
machines go to the last section of this chapter (see section Specific Setups).
The options in this section are probably the most important GranSim
options the programmer has to be aware of. They define the basic
behaviour of GranSim rather than descending to the low level of
individual machine characteristics.
- RTS Option: -bP
This option controls the generation of a GranSim profile (see section GranSim Profiles). By default a
reduced profile (containing only END events) is created. With -bP a
full GranSim profile is generated; such a profile is necessary for creating
activity profiles. With -b-P no profile is generated at all.
- RTS Option: -bs
Generate a spark profile. The GranSim profile will contain events
describing the creation, movement, consumption and pruning of sparks.
Note: This option will drastically increase the size of the
generated GranSim profile.
- RTS Option: -bh
Generate a heap profile. The GranSim profile will contain events
describing heap allocations.
Note: This option will drastically increase the size of the
generated GranSim profile.
- RTS Option: -bpn
Choose the number of processors to simulate. The value of n must
be less than or equal to the word size of the machine (usually 32).
If n is 0, GranSim-Light mode is enabled.
- RTS Option: -bp:
Enable GranSim-Light (same as -bp0). In this mode there is no limit
on the number of processors, and no communication costs are recorded.
The options in this section allow you to simulate special features in the
runtime-system of the simulated machine. This makes it possible to study
how these features influence the behaviour of different kinds of
parallel machines. All of these flags can be turned off by inserting a
- symbol after b (as in -b-P).
If a thread fetches remote data, by default the processor is blocked
until the data arrives. This synchronous communication is usually
only advantageous on machines with very low latency, where it is better
to wait than to pay the overhead of a context switch. Synchronous
communication may also increase data locality, as a thread is only
descheduled when it blocks on data that is under evaluation by another
thread.
On machines with high latency, asynchronous communication is
usually better, as it allows the processor to perform some useful
computation while a thread waits for the arrival of remote data.
The processor may be even more aggressive in trying to obtain
work while a thread is waiting for data. This aggressiveness in
acquiring new work is determined by the fetching
strategy. Currently five different strategies are
supported.
- RTS Option: -byn
Choose a fetching strategy (i.e. determine what to do while the running
thread fetches remote data):
- 0
Synchronous communication (default).
- 1
This and all higher fetching strategies implement asynchronous
communication. This strategy schedules another runnable thread if one is
available. This gives the same behaviour as -bZ.
- 2
If no runnable thread is available, a local spark is turned into a thread.
This adds thread creation overhead to the context-switch overhead of
asynchronous communication.
- 3
If no local sparks are available, the processor tries to acquire a remote
spark.
- 4
If the processor can't get a remote spark, it tries to acquire a runnable
thread from another busy processor, provided that migration is also
turned on (-bM).
- RTS Option: -bZ
Enable asynchronous communication. This causes a thread to be
descheduled when it fetches remote data; its processor then schedules
another runnable thread or, if none is available, becomes idle.
This gives the same behaviour as -by1.
Note that fetching strategies 3 and 4 involve sending messages to
other processors. It is therefore likely that by the time a spark (or
thread) has been fetched, the original thread has already received its
data and is being executed again. In most cases, strategies 1
and 2 therefore yield better results than 3 and 4.
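The escalation described above can be pictured as a decision ladder. The following Python fragment is purely illustrative (the function and argument names are invented, and this is not GranSim source code); it sketches what a simulated processor does while its running thread fetches remote data, for each value of -by:

```python
def on_fetch(strategy, has_runnable, has_local_spark,
             got_remote_spark, migration_enabled):
    """Action of a simulated processor whose running thread has just
    started fetching remote data, for fetching strategies 0..4 (-by)."""
    if strategy == 0:
        return "block"                      # synchronous communication
    if has_runnable:
        return "schedule runnable thread"   # strategy 1 and higher
    if strategy >= 2 and has_local_spark:
        return "turn local spark into thread"
    if strategy >= 3 and got_remote_spark:
        return "acquire remote spark"
    if strategy >= 4 and migration_enabled:
        return "migrate thread from busy processor"
    return "idle"
```

Note that the last two rungs need a message round trip before they yield any work, which is exactly why the original thread's data often arrives first.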
When fetching remote data there are several possibilities for
transferring the data; the options -bG and -bQn allow you to
choose among them. By default GranSim uses incremental fetching
(also called single-closure fetching, or lazy fetching).
This means that only the closure that is immediately required is fetched
from the remote processor. Again, this strategy is preferable on low-latency
systems, as it minimises the total amount of data that has to be
transferred. However, if the overheads for creating a packet and for
sending a message are high, it is better to perform bulk fetching,
which transfers a subgraph with the required closure as its root.
The size of this subgraph can be bounded by specifying the maximal size
of a packet or the maximal number of thunks (unevaluated
closures) that may be put into one packet. The method of determining
which closures to put into a packet is called the packing strategy.
- RTS Option: -bG
Enable bulk fetching. This causes the whole subgraph with the needed
closure as its root to be transferred.
- RTS Option: -bQn
Pack at most n-1 non-root thunks into a packet. Choosing a value
of 1 means that only normal-form closures are transferred, with the
possible exception of the root, of course. The value 0 is interpreted as
infinity (i.e. pack the whole subgraph). This is the default setting.
- RTS Option: -Qn
Pack at most n words into one packet. This limits the size of the
packet and does not distinguish between thunks and normal forms. The
default packet size is 1024 words.
The option for setting the packet size differs from the usual GranSim
naming scheme because it is also available in GUM. In fact, both
implementations use almost the same source code for packing a graph.
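The interplay of the thunk limit (-bQ) and the word limit (-Q) can be sketched as follows. This is not GranSim's actual packing code (which it shares with GUM); the graph representation and all names are invented for illustration:

```python
from collections import deque

def pack_subgraph(root, children, is_thunk, size, max_thunks, max_words):
    """Breadth-first sketch of bulk-fetch packing: collect the subgraph
    under `root` into one packet, with at most `max_thunks` non-root
    thunks (-bQn allows n-1 of them) and `max_words` words (-Q).
    Thunks over the limit are left behind; normal forms keep packing."""
    packet, words, thunks = [], 0, 0
    queue = deque([root])
    while queue:
        node = queue.popleft()
        if words + size[node] > max_words:
            break                              # packet full
        if node != root and is_thunk[node]:
            if thunks == max_thunks:
                continue                       # leave this thunk behind
            thunks += 1
        packet.append(node)
        words += size[node]
        queue.extend(children[node])
    return packet
```

With max_thunks set to 0 (i.e. -bQ1) only normal forms besides the root end up in the packet, matching the description above.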
When an idle processor looks for work it first checks its local spark
pool, then it tries to get a remote spark. It may happen that no
sparks are available in the system any more while some processors
have several runnable threads. In such a situation it can be
advantageous to transfer a runnable thread from a busy processor to an
idle one. However, this thread migration is very expensive and
should be avoided unless absolutely necessary. Therefore, thread
migration is turned off by default.
- RTS Option: -bM
Enable thread migration. When an idle processor has no local sparks and
can't find global sparks, it tries to migrate (steal) a runnable thread from
another busy processor.
Note that thread migration often causes a lot of fetching, too, in order to
move all required data to the new processor. This bears the risk of
destroying data locality.
The options in this section allow you to specify the overheads for
communication. Note that in GranSim-Light mode all of these values are
irrelevant, as no communication costs are recorded.
- RTS Option: -bln
Set the latency in the system to n machine cycles. Typical values
are 60 -- 100 cycles for shared memory machines, around 400 cycles for
GRIP (a closely-coupled distributed memory machine) and between 1000 and
5000 cycles for standard distributed memory machines. The
default value is 1000 cycles.
- RTS Option: -ban
Set the additional latency in the system to n machine cycles. The
additional latency is the latency of follow-up packets within the same
message. Usually this is much smaller than the latency of the first
packet (default: 100 cycles).
- RTS Option: -bmn
Set the overhead for message packing to n machine cycles. This is
the overhead for constructing a packet independent of its size.
- RTS Option: -bxn
Set the overhead for tidying up the packet after sending it to n
machine cycles. On some systems significant work is needed after having
sent a packet.
- RTS Option: -brn
Set the overhead for unpacking a message to n machine
cycles. Again, this overhead is independent of the message size.
- RTS Option: -bgn
Set the overhead for fetching remote data to n machine cycles.
By default this value is twice the latency plus the message unpacking time.
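Taken together, the options above define a simple cost model for a message. The sketch below is one plausible reading of these parameters, with invented helper names and illustrative figures; it is not the exact formula used by GranSim:

```python
def message_cost(packets, latency, extra_latency, pack_cost, unpack_cost):
    """Cycles to deliver one message consisting of `packets` packets:
    packing overhead (-bm), latency of the first packet (-bl),
    additional latency per follow-up packet (-ba), and unpacking
    overhead (-br).  Tidy-up time (-bx) is omitted for brevity."""
    return pack_cost + latency + (packets - 1) * extra_latency + unpack_cost

def default_fetch_cost(latency, unpack_cost):
    """The default -bg value: twice the latency plus message unpack time."""
    return 2 * latency + unpack_cost
```

With the default latency of 1000 cycles and an assumed unpack time of 50 cycles, the default fetch overhead would come to 2050 cycles.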
The options in this section model overhead that is related to the
runtime-system of the simulated parallel machine.
- RTS Option: -btn
Set the overhead for thread creation to n machine cycles. This
overhead includes the costs of initialising a control structure describing
the thread and of allocating stack space for its execution.
- RTS Option: -bqn
Set the overhead for putting a thread into the blocking queue of a
closure to n machine cycles.
- RTS Option: -bcn
Set the overhead for scheduling a thread to n machine cycles.
- RTS Option: -bdn
Set the overhead for descheduling a thread to n machine cycles.
- RTS Option: -bnn
Set the overhead for global unblocking (i.e. unblocking a thread that was
blocked on a remote closure) to n machine cycles. This value does not
include the overhead caused by the communication between the processors.
- RTS Option: -bun
Set the overhead for local unblocking (i.e. taking a thread out of a
blocking queue and putting it into a runnable queue) to n machine
cycles. This value does not include any overhead caused by communication
between the processors.
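As a worked example of how these overheads add up, the following sketch charges one thread for its runtime-system bookkeeping over its lifetime. The accounting and the cycle counts in the test are illustrative only (the function is invented), not GranSim's actual charging scheme:

```python
def thread_bookkeeping(create, schedule, deschedule, enqueue, unblock, n_blocks):
    """Cycles of runtime-system overhead for one thread that is created
    (-bt) and blocks n_blocks times on local closures (-bq to enqueue,
    -bu to unblock locally, -bd/-bc to deschedule and reschedule)."""
    # one creation, one schedule per run segment, and per blocking
    # episode: deschedule + blocking-queue insertion + local unblocking
    return (create
            + (n_blocks + 1) * schedule
            + n_blocks * (deschedule + enqueue + unblock))
```

A thread that never blocks is charged only its creation and a single scheduling.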
The options in this section specify the characteristics of the
microprocessor of the simulated parallel machine. To this end the
instructions of the processor are divided into six groups.
These groups have different weights reflecting their different relative costs.
The groups of processor instructions are:
-
Arithmetic instructions
-
Load instructions
-
Store instructions
-
Branch instructions
-
Floating point instructions
-
Heap allocations
The options for assigning weights to these groups are:
- RTS Option: -bAn
Set the weight of arithmetic operations to n machine cycles.
- RTS Option: -bLn
Set the weight of load operations to n machine cycles.
- RTS Option: -bSn
Set the weight of store operations to n machine cycles.
- RTS Option: -bBn
Set the weight of branch operations to n machine cycles.
- RTS Option: -bFn
Set the weight of floating point operations to n machine cycles.
- RTS Option: -bHn
Set the weight of heap allocations to n machine cycles.
Strictly speaking, the heap allocation cost is a parameter of the
runtime-system. However, in our underlying machine model allocating heap
is such a basic operation that one can think of it as a special
instruction.
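The weighted instruction counts translate into simulated time in the obvious way. The fragment below illustrates the accounting; the weights shown are assumed values for illustration, not GranSim's defaults:

```python
def block_cost(counts, weights):
    """Simulated cost of a basic block: per-class instruction counts
    multiplied by the per-class weights (-bA, -bL, -bS, -bB, -bF, -bH)."""
    return sum(counts[cls] * weights[cls] for cls in counts)

# Illustrative weights: most instructions cost one cycle, floating
# point and heap allocation are more expensive (assumed, not defaults).
weights = {"arith": 1, "load": 1, "store": 1, "branch": 1,
           "float": 3, "heap": 2}
```

Raising a single weight (say -bF) then slows down exactly the programs that are dominated by that instruction class.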
There are three granularity control mechanisms:
-
Explicit threshold
No spark whose priority is smaller than a given threshold will be turned
into a thread.
-
Priority sparking
The spark queue is sorted by priority. This guarantees that the highest
priority spark is turned into a thread. Priorities are not maintained
for threads.
-
Priority scheduling
The thread queue is sorted by priority, too. This guarantees that the
biggest available thread is scheduled next. This imposes a higher
runtime overhead.
- RTS Option: -bYn
Use the value n as a threshold when turning sparks into threads: no
spark with a priority smaller than n will be turned into a thread.
- RTS Option: -bXx
Enable priority sparking. The letter x indicates how to use the
granularity information attached to a spark site in the source code:
-
Use the granularity information field as a priority.
- I
Use the granularity information field as an inverse priority.
- R
Ignore the granularity information field and use a random priority.
- N
Ignore the granularity information field and don't use priorities at all.
- RTS Option: -bI
Enable priority scheduling.
- RTS Option: -bKn
Set the overhead for inserting a spark into a sorted spark queue
to n machine cycles.
- RTS Option: -bOn
Set the overhead for inserting a thread into a sorted thread queue
to n machine cycles.
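Priority sparking with an explicit threshold can be pictured with a small sketch: a spark pool kept as a priority queue, plus a threshold check when a spark is about to become a thread. The data structures and names below are invented for illustration and are not GranSim's internals:

```python
import heapq

def next_thread(sparks, threshold):
    """Priority sparking (-bX) with an explicit threshold (-bY): the
    spark pool is kept sorted by priority, and the highest-priority
    spark becomes a thread only if it reaches the threshold."""
    pool = [(-priority, name) for name, priority in sparks]
    heapq.heapify(pool)                 # max-heap via negated priorities
    if not pool:
        return None
    neg_priority, name = heapq.heappop(pool)
    return name if -neg_priority >= threshold else None
```

Priority scheduling (-bI) would apply the same sorted-queue idea to the thread queue, at the cost of the sorted insertions charged by -bK and -bO.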
- RTS Option: -bC
Force the system to eagerly turn a spark into a thread. This basically
disables the lazy thread creation mechanism of GranSim and ensures that
no sparks are discarded (except for sparks whose closures are already in
normal form).
- RTS Option: -be
Enable deterministic mode. This means that no random number generator is
used for deciding where to get work from. With this option two runs of
the same program with the same input will yield exactly the same result.
- RTS Option: -bT
Prefer stealing threads over stealing sparks when looking for remote
work. This is mainly an experimental option.
- RTS Option: -bN
When creating a new thread prefer sparks generated by local closures over
sparks that have been stolen from other processors.
This is mainly an experimental option, which might improve data locality.
- RTS Option: -bfn
Specify the maximal number of outstanding requests for acquiring sparks
from remote processors. High values of n may cause a more even
distribution of sparks, avoiding bottlenecks caused by a `spark explosion'
on one processor. However, this might harm data locality. The default
value is 1.
- RTS Option: -bwn
Set the time slice of the simulator to n machine cycles.
This is an internal variable that changes the behaviour of the
simulation rather than that of the simulated machine. A longer time slice
means a faster but less accurate simulation. The default time slice is 1000
cycles.
These options are mainly intended for debugging GranSim. Only those
options that might be of interest to the friendly (i.e. non-hacking) user
of GranSim are listed here for now.
Note: These options are only available if the RTS has been
compiled with the cpp flag GRAN_CHECK (see section Installing).
- RTS Option: -bDE
Print event statistics at the end of the computation. This also
includes statistics about the packets sent (if bulk fetching has
been enabled).
If you are really interested in all the hidden options in GranSim
look into the file `ghc/runtime/main/RtsFlags.lc'.
Some options of the GHC runtime-system that are not specific to GranSim
are also of special interest. They are discussed in this section.
- RTS Option: -on
This option sets the initial stack size of a thread to n
words. This can be of importance for GranSim-Light, which may
create an abundance of parallel threads filling up the heap. The default
stack size in GranSim-Light is already reduced to 200 words (usually the
default is 1024 words). If you run into heap-size problems in a
GranSim-Light setup, you might want to reduce it further.
- RTS Option: -Sf
Print messages about garbage collections to the file f (or to
stderr).
- RTS Option: -F2s
Use a two-space garbage collector instead of a generational garbage
collector. Previous versions of GranSim had problems with the latter. If
you experience problems try this option and send me a bug report
(see section Bug Reports).
When using GranSim, a programmer often just wants to specify the general
architecture of the machine rather than going down to the details of a
specific machine. To facilitate this, this section presents examples of
standard set-ups for GranSim.
Note that these setups specify the characteristics of a machine, but not of
the runtime-system; characteristics like thread creation costs are
left open. However, the default settings fairly closely reflect the real
costs, for example under GUM. So, unless you have a different
implementation of runtime-system details in mind, the default settings
should be sufficiently accurate.
This setup reflects the ideal case, where communication is free and
there is no limit on the number of processors. It is used to
show the maximal amount of parallelism in the program. Using such a
GranSim-Light setup is usually the first step in tuning the
performance of a parallel, lazy functional program (see section GranSim Modes).
The typical GranSim-Light setup is:
+RTS -bP -bp:
In a shared memory machine the latency is roughly reduced to the cost
of a load operation. Potentially, some additional overhead for managing
the shared memory has to be added, and the caching mechanism might be
more expensive than in a sequential machine.
In general, the latency should be between 5 and 20 machine cycles.
For machines where the latency is of the same order of magnitude as
loading and storing data, it is reasonable to assume incremental,
synchronous communication. Migration should also be possible.
This gives the following setup (for 32 processors):
+RTS -bP -bp32 -bl10 -b-G -by0 -bM
Strongly connected DMMs put a specific emphasis on keeping the latency in the
system as low as possible. One example of such a machine is GRIP, which
was built specifically for performing parallel graph reduction;
this setup is therefore of special interest for us.
Most importantly, the latency in such machines is typically between 100
and 500 cycles (400 for GRIP). Furthermore, the GRIP runtime-system, as
an example of this kind of machine, uses incremental, synchronous
communication. Migration is also possible.
This gives the following setup (for 32 processors):
+RTS -bP -bp32 -bl400 -b-G -by0 -bM
General distributed memory machines usually have latencies that are an
order of magnitude higher than those of strongly connected DMMs. However,
especially in this class of machines the differences between specific
machines are quite significant. So, I strongly recommend using the
exact machine characteristics if GranSim is to be used to predict the
behaviour on such a machine.
The high latency requires a fundamentally different runtime-system to
avoid long delays when fetching remote data. Therefore, usually
asynchronous bulk fetching is used. I'd recommend choosing a fetching
strategy of 1 or 2 (it's hard to say which one is better in general).
Thread migration is so expensive on DMMs that it is often not
supported at all.
This gives the following setup (for 32 processors):
+RTS -bP -bp32 -bl2000 -bG -by2 -b-M