
Go backward to General
Go up to Large Parallel Applications
Go forward to Lolita
Methodology
Our emerging methodology for parallelising large non-strict functional
programs is outlined below. The approach is top-down, starting with
the top level pipeline, and then parallelising successive components
of the program. The first five stages are machine-independent. Our
approach uses several ancillary tools, including time
profiling [Sansom and Peyton Jones, 1995] and the GranSim
simulator [Hammond et al., 1995]. Several stages use GranSim, which is
fully integrated with the GUM parallel runtime system [Trinder et al., 1996]. A
crucial property of GranSim is that it can be parameterised to
simulate both real architectures and an idealised machine with, for
example, zero-cost communication and an infinite number of processors.
The stages in our methodology are as follows.
- Sequential implementation.
Start with a correct implementation of an inherently-parallel algorithm
or algorithms.
- Parallelise Top-level Pipeline. Most non-trivial programs have
a number of stages, e.g. lex, parse and typecheck in a
compiler. Pipelining the output of each stage into the next is very
easy to specify, and often gains some parallelism for minimal
change.
- Time Profile the sequential application to discover the "big
eaters", i.e. the computationally intensive pipeline stages.
- Parallelise Big Eaters using evaluation strategies.
It is sometimes possible to introduce adequate parallelism without
changing the algorithm; otherwise the algorithm may need to be revised
to introduce an appropriate form of parallelism,
e.g. divide-and-conquer or data-parallelism.
- Idealised Simulation. Simulate the parallel execution of the
program on an idealised execution model, i.e. with an infinite number
of processors, no communication latency, no thread-creation costs
etc. This is a "proving" step: if the program isn't parallel on an
idealised machine it won't be on a real machine. We now use
GranSim, but have previously used hbcpp. A simulator is often easier to
use, more heavily instrumented, and can be run in a more convenient
environment, e.g. a workstation.
- Realistic Simulation. GranSim can be parameterised to closely
resemble the GUM runtime system for a particular machine, forming a
bridge between the idealised and real machines. A major concern at
this stage is to improve thread granularity so as to offset
communication and thread-creation costs.
- Real Machine. The GUM runtime system supports some of the GranSim
performance visualisation tools. This seamless integration
helps understand real parallel performance.
It is more conventional to start with a sequential program and then
move almost immediately to working on the target parallel machine.
This has often proved highly frustrating: the development environments
on parallel machines are usually much worse than those available on
sequential counterparts, and, although it is crucial to achieve good
speedups, detailed performance information is frequently not
available. It is also often unclear whether poor performance is due
to use of algorithms that are inherently sequential, or simply
artifacts of the communication system or other dynamic
characteristics. In its overall structure our methodology is similar
to others used for large-scale parallel functional
programming [Hartel et al., 1995].
{trinder,hwloidl,simonpj}@dcs.gla.ac.uk, kh@dcs.st-and.ac.uk
