Visualisation Tools

The visualisation tools that come with GranSim take a GranSim profile as input and create level 2 PostScript files showing either the activity or the granularity of the execution. A collection of scripts in the GranSim Toolbox allows you to focus on specific aspects of the execution, such as the `node flow', i.e. the movement of nodes (closures) between the PEs.

Most of these tools are implemented as Perl scripts. This means that they are very versatile and should be easy to modify. However, processing a large GranSim profile can require a considerable amount of computation (and memory). Speeding up crucial parts of the scripts is on my to-do list, but don't hold your breath.

Activity Profiles

Three tools show the activity during the execution at three levels of detail:

The overall activity profile (created with `gr2ps') shows the activity of the whole machine (see section Overall Activity Profile).
The per-processor activity profile (created with `gr2pe') shows the activity of all simulated processors (see section Per-processor Activity Profile).
The per-thread activity profile (created with `gr2ap') shows the activity of all generated threads (see section Per-Thread Activity Profile).

All tools discussed in this section print a help message when called with the option -h. This message shows the available options. In general, all tools understand the option -o <file> for specifying the output file and -m for generating a monochrome profile (by default the tools generate colour PostScript).

The `gr2ps' and `gr2ap' scripts work in two stages:

First the `.gr' file is translated into a `.qp' file, which is basically a stripped down version of a GranSim profile.
Then the `.qp' file is translated into a `.ps' file.

The `gr2pe' script works directly on the `.gr' file.

Overall Activity Profile

The overall activity profile (created via gr2ps) shows the activity of the whole machine by separating the threads into up to five different groups. These groups describe the number of

running threads (i.e. threads that are currently performing a reduction),
runnable threads (i.e. threads that could be executed but that have not found an idle PE),
blocked threads (i.e. threads that wait for a result that is being computed by another thread),
fetching threads (i.e. threads that are currently fetching data from a remote PE),
migrating threads (i.e. threads that are currently being transferred from a busy PE to an idle PE).

If the GranSim profile includes information about sparks (-bs option) it is also possible to show the number of sparks. However, this number is usually much bigger than the total number of threads, so it rarely makes sense to mix the groups for sparks and threads in one profile.

The option -O reduces the size of the generated PostScript file. The option -I <string> is used to specify which groups of threads should be shown in which order. For example

gr2ps -O -I "arb" parfib.gr

generates a profile `parfib.ps' showing only active, i.e. running (`a'), runnable (`r') and blocked (`b') threads. The letter codes for the other groups are `f' for fetching, `m' for migrating and `s' for sparks.
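The letter codes can be summarised in a small lookup table. The following Python sketch is purely illustrative and is not part of `gr2ps'; the names GROUP_CODES and decode_groups are hypothetical, and only the letters themselves are taken from the text above.

```python
# Letter codes as listed in the text; the labels are descriptive only.
GROUP_CODES = {
    "a": "active (running)",
    "r": "runnable",
    "b": "blocked",
    "f": "fetching",
    "m": "migrating",
    "s": "sparks",
}


def decode_groups(spec):
    """Return the groups selected by a -I string, in the order given."""
    unknown = [c for c in spec if c not in GROUP_CODES]
    if unknown:
        raise ValueError("unknown group code(s): " + "".join(unknown))
    return [GROUP_CODES[c] for c in spec]


if __name__ == "__main__":
    # The string used in the example call above.
    print(decode_groups("arb"))
```

The order of the letters matters, since the option specifies both which groups are shown and in which order.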

In the current version the marks on the y-axis of the generated profile may be stretched or compressed. This might happen if many events occur at exactly the same time. If this is the case, the initial count of the maximal number of y-values may be wrong causing a rescaling at the very end. In practice that happens rarely (more often for GranSim-Light profiles, though).

The picture below shows an overall activity profile for a simple parallel divide-and-conquer program. The header of the graph shows the runtime-system options for the execution.

The overall runtime is measured in machine cycles. The average parallelism is the area covered by the running (green or medium-gray) threads, normalised with respect to the total runtime. In this graph only three groups show up. Most of the time 16 threads are running, utilising all available processors, with occasional dips in the green area. The big amber (or light-gray) area of runnable threads indicates that this program can easily use all available processors. The large amount of blocking, indicated by the large red (or black) area in the graph, is caused by nodes near the root of the computation tree, which have to wait for the results of their children in order to combine them into the overall result. The sequential part at the beginning of the computation is due to I/O overhead, including the initialisation of the basic I/O monad. Towards the end of the computation, when results are combined near the root of the computation tree, the overall utilisation drops significantly. In this setup the latency is so small that no large areas of fetching (blue) threads appear. Also, migration is turned off in this case (it can be turned on with the runtime-system option -bM).

Overall activity profile
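The definition of average parallelism used in the graph header (area under the running threads divided by total runtime) can be written down as a short Python sketch. This is an illustration under stated assumptions, not the code used by `gr2ps': the input is assumed to be a list of (time, number of running threads) samples in which each count holds until the next sample, and the function name average_parallelism is hypothetical.

```python
def average_parallelism(samples, end_time):
    """Area under the running-thread step function divided by the runtime.

    samples  -- list of (time, running_count), sorted by time; each count
                is assumed to hold until the next sample (or end_time).
    end_time -- time at which the execution finishes.
    """
    if not samples:
        return 0.0
    start_time = samples[0][0]
    total = end_time - start_time
    if total <= 0:
        raise ValueError("end_time must lie after the first sample")
    area = 0
    # Pair every sample with the start of the following interval.
    following = samples[1:] + [(end_time, 0)]
    for (t, running), (t_next, _) in zip(samples, following):
        area += running * (t_next - t)
    return area / total


if __name__ == "__main__":
    # Made-up numbers: a sequential start, a long phase with 16 running
    # threads, and a tail with lower utilisation.
    print(average_parallelism([(0, 1), (100, 16), (900, 4)], 1000))
```

With these made-up samples the area is 1*100 + 16*800 + 4*100 = 13300 thread-cycles over 1000 cycles, i.e. an average parallelism of 13.3.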

Per-processor Activity Profile

The idea of the per-processor activity profile is to show the most important pieces of information about each processor in one graph. Therefore, it is easy to compare the behaviour of the different processors and to spot imbalances in the computation.

This profile shows one strip for each of the simulated processors. Each of these strips encodes three kinds of information:

Is the processor active at a certain point? If it is active the strip appears in some shade of green (gray in the monochrome version). If it is idle it appears in red (white in the monochrome version). The area before starting the first thread and after terminating the last thread is left blank in both versions.
How high is the load of the processor? The load is measured by the number of runnable threads on this processor. A high load is shown by a dark shading of green (or grey). A palette at the top of the graph shows the available shadings (two ticks indicate the range that is used in the graph). It is possible to distinguish between 20 different values. Therefore, all processors with more than 20 runnable threads are shown in the same (dark) colour.
How many blocked threads are on the processor? This information is shown by the thickness of a blue (black) bar at the bottom of each strip. Without any blocked threads no bar is shown. If 20 or more threads are blocked the bar covers 80% of the strip. Thus, the load information is always visible `in the background'.
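The encoding of load and blocking described above can be sketched in Python as follows. The limits (20 distinguishable values, 80% maximal bar height) come from the text; the linear scaling between zero and the limit is an assumption, and the function names are hypothetical rather than taken from `gr2pe'.

```python
MAX_LEVELS = 20         # from the text: 20 distinguishable values
MAX_BAR_FRACTION = 0.8  # from the text: bar covers 80% of the strip at most


def shade_level(runnable):
    """Palette index for a number of runnable threads, clamped to 0..20."""
    return min(max(runnable, 0), MAX_LEVELS)


def blocked_bar_fraction(blocked):
    """Fraction of the strip height covered by the bar for blocked threads.

    Assumption: the bar grows linearly with the number of blocked threads
    until it reaches 80% of the strip at 20 blocked threads.
    """
    clamped = min(max(blocked, 0), MAX_LEVELS)
    return MAX_BAR_FRACTION * clamped / MAX_LEVELS


if __name__ == "__main__":
    for runnable, blocked in [(0, 0), (5, 10), (42, 35)]:
        print(runnable, shade_level(runnable),
              blocked, blocked_bar_fraction(blocked))
```

Because the bar never exceeds 80% of the strip, at least a fifth of the shading that encodes the load remains visible.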

This script can also produce variants of the same kind of graph that focus on different features of GranSim:

Migration: With the option -M this script produces a graph that draws arrows between processors indicating the migration of a thread from one processor to another. No load or blocking information is shown in this graph.
Sparking: With the option -S a spark graph is generated. It shows information about the number of sparks on a processor in the same way as the number of runnable threads (i.e. by shading). This graph is useful to highlight processors that create a lot of work.

No more than about 32 processors should be shown in one graph; otherwise the strips become too small. This profile cannot be generated for a GranSim-Light profile.

The graph below shows a per-processor activity profile for the same parallel divide-and-conquer program as in the previous section.

The graph shows the activity of each of the 16 processors in this simulation. The dark green areas in the first third of the computation show that processors 2, 4 and 14 have a significantly higher load of runnable threads than the rest. Also note the pattern that a decreasing number of blocked threads (thinner blue bar) is accompanied by an increasing number of runnable threads (darker green area).

Per-processor activity profile for parfib

Per-Thread Activity Profile

The per-thread activity profile shows the activity of all generated threads. For each thread a horizontal line is shown. The line starts when the thread is created and ends when it is terminated. The thickness of the line indicates the state of the thread. The possible states correspond to the groups shown in the overall activity profile (see section Overall Activity Profile).

The states are encoded in the following way:

A running thread is shown as a thick green (gray) line.
A runnable thread is shown as a medium red (black) line.
A fetching (or migrating) thread is shown as a thin blue (black) line.
A blocked thread is shown as a gap in the line.

This profile gives the most accurate kind of information and it often allows you to `step through' the computation by relating events on different processors with each other. For example, the typical pattern at the beginning of the computation is some short computation for starting the thread followed by fetching remote data. After that the thread may become runnable if another thread has been started in the meantime.

The picture below shows an example of a per-thread activity profile. Note the short period of fetching immediately after starting a thread in order to get the data for the spark that has just been turned into a thread. The high degree of suspension is mainly due to the fact that migration is turned off in this example.

Per-thread activity profile

However, such a detailed analysis is only possible for programs with a rather small number of threads. Usually, GranSim profiles of bigger executions have to be pre-processed to reduce the number of threads that are shown in one graph (see section Scripts). As the level of detail provided by this graph is rarely needed for bigger executions, no automatic splitting of a profile into several graphs has been implemented.

Granularity Profiles

The tools for generating granularity profiles aim at showing the relative sizes of the generated threads. Of particular interest is the number of tiny threads, for which the overhead of thread creation is relatively high.

All tools discussed in this section require Gnuplot to generate the granularity profiles. I am using version 3.5 but it should work with older versions, too.

Basically, two kinds of graphs can be generated to show granularity:

A bucket graph (see section Bucket Graphs), which collects threads with similar runtime in the same bucket and shows the number of threads in each bucket.
A cumulative graph (see section Cumulative Graphs), which shows how many threads have a runtime less than or equal to a given number.

The main tools for generating such graphs are:

gr2gran creates one bucket graph and one cumulative graph from a given GranSim profile. The information about the partitioning for the bucket statistics and other set-up information is usually provided in a template file (see section Template Files), which is specified via the -t option (-t , uses the global template file in $GRANDIR/bin). This script works in three stages:
1. First an `RTS' file is generated, which is only a sorted list of runtimes extracted from the END events of a GranSim profile (see section GranSim Profiles).
2. The main stage generates a `gnuplot' file by grouping the threads into buckets and computing cumulative values.
3. Finally, Gnuplot is used to generate `PostScript' files showing the graphs.
gran-extr is based on the same idea as gr2gran, but it produces even more graphs, showing the communication percentage, determining a correlation coefficient between heap allocations and runtime etc.

Bucket Graphs

In a bucket graph the x-axis, which indicates the execution times of the threads, is partitioned into intervals (buckets). The graph shows a histogram of the number of threads in each bucket (i.e. threads whose execution time falls into this interval). For generating this kind of graph only a restricted GranSim profile (containing only END events) is required.

For example, one of the files generated by running

gr2gran -t , pf.gr

is g.ps, which contains such bucket statistics. The -t option of this tool selects the right template file (, is a shorthand for the global template in $GRANDIR/bin).

Here are the bucket statistics of executing parfib 22:

Bucket statistics

It shows that this program creates a huge number of tiny threads (note the log scale in the graph). Refining the intervals for these tiny threads further gives the following bucket statistics:

Bucket statistics

The necessary change in the template file for these bucket statistics is:

-- Intervals for pure exec. times
G: (100, 200, 500, 1000, 2000, 5000, 10000)
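To make the grouping into buckets concrete, here is a minimal Python sketch that counts threads per interval using the boundaries from the template line above. It is not the `gr2gran' implementation. Whether a runtime that equals a boundary belongs to the lower or the upper bucket is an assumption (here: the lower one), and the name bucket_counts as well as the sample runtimes are made up for illustration.

```python
from bisect import bisect_left

# Interval boundaries taken from the template line above.
BOUNDS = [100, 200, 500, 1000, 2000, 5000, 10000]


def bucket_counts(runtimes, bounds):
    """Count the runtimes falling into each bucket.

    Bucket 0 holds runtimes <= bounds[0], bucket i holds runtimes in
    (bounds[i-1], bounds[i]], and the last bucket holds everything above
    bounds[-1]. Treating the upper boundary as inclusive is an assumption.
    """
    counts = [0] * (len(bounds) + 1)
    for runtime in runtimes:
        counts[bisect_left(bounds, runtime)] += 1
    return counts


if __name__ == "__main__":
    # Made-up runtimes, for illustration only.
    print(bucket_counts([50, 100, 150, 700, 700, 12000], BOUNDS))
```

Seven boundaries yield eight buckets; the last one collects all threads whose runtime exceeds the largest boundary.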

Cumulative Graphs

In a cumulative graph the x-axis again represents execution times of the individual threads. The value in the graph at the time t represents the number of threads whose execution time is smaller than t. Therefore, the values in the graph are monotonically increasing until the right end shows the total number of threads in the execution.
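The value of a cumulative graph at a time t can be sketched in Python as follows, assuming a sorted list of thread runtimes as input. This is an illustration of the definition above and not the code used by `gr2gran'; the function names and the sample runtimes are hypothetical.

```python
from bisect import bisect_left


def cumulative_count(sorted_runtimes, t):
    """Number of threads whose runtime is smaller than t."""
    # bisect_left returns the number of entries strictly below t;
    # bisect_right would count runtimes less than or equal to t instead.
    return bisect_left(sorted_runtimes, t)


def cumulative_percentage(sorted_runtimes, t):
    """The same value as a percentage of all threads."""
    if not sorted_runtimes:
        return 0.0
    return 100.0 * cumulative_count(sorted_runtimes, t) / len(sorted_runtimes)


if __name__ == "__main__":
    # Made-up runtimes, for illustration only.
    runtimes = sorted([50, 100, 150, 700, 700, 12000])
    for t in (100, 700, 701, 20000):
        print(t, cumulative_count(runtimes, t),
              cumulative_percentage(runtimes, t))
```

The two functions correspond to the two variants of the graph, one with absolute numbers of threads and one with percentages on the y-axis.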

Again, running

gr2gran -t , pf.gr

generates cumulative graphs for the runtime in the files cumu-rts.ps and cumu-rts0.ps (one file shows absolute numbers of threads, the other the percentage of the threads on the y-axis).

Here are the cumulative runtime statistics of executing parfib 22:

Cumulative runtime statistics

The tf script aims at showing the task flow (as well as node flow) in the execution of a program. It is used in the GranSim Emacs mode to narrow a GranSim profile (see section The Emacs GranSim Profile Mode).
The SN script creates a summary of spark names that occur in a GranSim profile. This summary is shown as an impulses graph via Gnuplot. It allows you to compare the relative number of threads generated by each static spark site.
The AVG and avg-RTS scripts compute the average runtime from an RTS file, which is generated by `gr2RTS'.