Algorithm + Strategy = Parallelism -- Methodology

Go backward to General
Go up to Large Parallel Applications
Go forward to Lolita

Methodology

Our emerging methodology for parallelising large non-strict functional programs is outlined below. The approach is top-down, starting with the top level pipeline, and then parallelising successive components of the program. The first five stages are machine-independent. Our approach uses several ancillary tools, including time profiling [Sansom and Peyton Jones, 1995] and the GranSim simulator [Hammond et al., 1995]. Several stages use GranSim, which is fully integrated with the GUM parallel runtime system [Trinder et al., 1996]. A crucial property of GranSim is that it can be parameterised to simulate both real architectures and an idealised machine with, for example, zero-cost communication and an infinite number of processors.

The stages in our methodology are as follows.

Sequential implementation. Start with a correct implementation of an inherently-parallel algorithm or algorithms.
Parallelise Top-level Pipeline. Most non-trivial programs have a number of stages, e.g. lex, parse and typecheck in a compiler. Pipelining the output of each stage into the next is very easy to specify, and often gains some parallelism for minimal change.
Time Profile the sequential application to discover the "big eaters", i.e. the computationally intensive pipeline stages.
Parallelise Big Eaters using evaluation strategies. It is sometimes possible to introduce adequate parallelism without changing the algorithm; otherwise the algorithm may need to be revised to introduce an appropriate form of parallelism, e.g. divide-and-conquer or data-parallelism.
Idealised Simulation. Simulate the parallel execution of the program on an idealised execution model, i.e. with an infinite number of processors, no communication latency, no thread-creation costs etc. This is a "proving" step: if the program isn't parallel on an idealised machine it won't be on a real machine. We now use GranSim, but have previously used hbcpp. A simulator is often easier to use, more heavily instrumented, and can be run in a more convenient environment, e.g. a workstation.
Realistic Simulation. GranSim can be parameterised to closely resemble the GUM runtime system for a particular machine, forming a bridge between the idealised and real machines. A major concern at this stage is to improve thread granularity so as to offset communication and thread-creation costs.
Real Machine. The GUM runtime system supports some of the GranSim performance visualisation tools. This seamless integration helps understand real parallel performance.

It is more conventional to start with a sequential program and then move almost immediately to working on the target parallel machine. This has often proved highly frustrating: the development environments on parallel machines are usually much worse than those available on sequential counterparts, and, although it is crucial to achieve good speedups, detailed performance information is frequently not available. It is also often unclear whether poor performance is due to use of algorithms that are inherently sequential, or simply artifacts of the communication system or other dynamic characteristics. In its overall structure our methodology is similar to others used for large-scale parallel functional programming [Hartel et al., 1995].

{trinder,hwloidl,simonpj}@dcs.gla.ac.uk, kh@dcs.st-and.ac.uk