Abstract: This thesis demonstrates how to reduce the runtime of large non-strict functional programs using parallel evaluation. The parallelisation of several programs shows the importance of granularity, i.e. the computation costs of program expressions. The aspect of granularity is studied both on a practical level, by presenting and measuring runtime granularity improvement mechanisms, and at a more formal level, by devising a static granularity analysis.
By parallelising several large functional programs this thesis demonstrates for the first time the advantages of combining lazy and parallel evaluation on a large scale: laziness aids modularity, while parallelism reduces runtime. One of the parallel programs is the Lolita system, which, at more than 47,000 lines of code, is the largest existing parallel non-strict functional program. Evaluation strategies, a new mechanism for parallel programming to which this thesis contributes, prove useful in this parallelisation. Evaluation strategies simplify parallel programming by separating algorithmic code from code specifying dynamic behaviour. For large programs the abstraction provided by functions is maintained by using a data-oriented style of parallelism, which defines parallelism over intermediate data structures rather than inside function bodies.
A highly parameterised simulator, GranSim, has been constructed collaboratively and is discussed in detail in this thesis. GranSim is a tool for architecture-independent parallelisation and a testbed for implementing runtime-system features of the parallel graph reduction model. By providing an idealised as well as an accurate model of the underlying parallel machine, GranSim has proven to be an essential part of an integrated parallel software engineering environment. Several parallel runtime-system features, such as granularity improvement mechanisms, have been tested via GranSim. It is publicly available and in active use at several universities worldwide.
In order to provide granularity information this thesis presents an inference-based static granularity analysis. This analysis combines two existing analyses, one for cost and one for size information. It determines an upper bound for the computation costs of evaluating an expression in a simple strict higher-order language. By exposing recurrences during cost reconstruction and using a library of recurrences and their closed forms, it is possible to infer the costs for some recursive functions. The possible performance improvements are assessed by measuring the parallel performance of a hand-analysed and annotated program.
Abstract: The design and implementation of useful programming languages, whether sequential or parallel, should be driven by large, realistic applications. In constructing several medium- and large-scale programs in Glasgow Parallel Haskell, GPH, a parallel extension of Haskell, the group at Glasgow has investigated several important engineering issues:
Abstract: In this paper we report on the ongoing parallelisation of Lolita, a natural language engineering system. Although Lolita currently exhibits only modest parallelism, we believe that it is the largest parallel functional program ever, comprising more than 47,000 lines of Haskell. Lolita has the following interesting features common to real world applications of lazy languages:
Our expectations in parallelising the program were to achieve moderate speedups with small changes to the code. To date speedups of up to 2.4 have been achieved for Lolita running under a realistic simulation of our 4-processor shared-memory target machine. Most notably, the parallelism is achieved with a very small number of changes to, and without requiring an understanding of, most of the application. On the Sun SPARCserver target machine wall-clock speedup is currently limited by physical memory availability.
Abstract: The process of writing large parallel programs is complicated by the need to specify both the parallel behaviour of the program and the algorithm that is to be used to compute its result. This paper introduces evaluation strategies, lazy higher-order functions that control the parallel evaluation of non-strict functional languages. Using evaluation strategies, it is possible to achieve a clean separation between algorithmic and behavioural code. The result is enhanced clarity and shorter parallel programs.
Evaluation strategies are a very general concept: this paper shows how they can be used to model a wide range of commonly used programming paradigms, including divide-and-conquer, pipeline parallelism, producer/consumer parallelism, and data-oriented parallelism. Because they are based on unrestricted higher-order functions, they can also capture irregular parallel structures. Evaluation strategies are not just of theoretical interest: they have evolved out of our experience in parallelising several large-scale parallel applications, where they have proved invaluable in helping to manage the complexities of parallel behaviour. Some of these applications are described in detail here. The largest application we have studied to date, Lolita, is a 60,000 line natural language engineering system. Initial results show that for these programs we can achieve acceptable parallel performance, for relatively little programming effort.
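The core of the strategies interface can be illustrated in a few lines of plain Haskell. The type and the names (`Strategy`, `r0`, `rwhnf`, `using`, `parList`) follow the paper's interface, but this is a sequential sketch: where real GPH strategies spark parallel evaluation with the `par` combinator, the stand-in below uses `seq`, so it demonstrates only the separation of algorithm from evaluation behaviour, not actual parallelism.

```haskell
-- A strategy describes how (and how far) to evaluate a value.
type Strategy a = a -> ()

-- Perform no evaluation at all.
r0 :: Strategy a
r0 _ = ()

-- Evaluate to weak head normal form.
rwhnf :: Strategy a
rwhnf x = x `seq` ()

-- Apply a strategy to a value, then return the value: the
-- algorithm (the expression) is kept separate from its
-- dynamic behaviour (the strategy).
using :: a -> Strategy a -> a
using x s = s x `seq` x

-- Apply a strategy to every element of a list. In GPH each
-- element would be sparked with `par`; `seq` stands in here.
parList :: Strategy a -> Strategy [a]
parList _ []       = ()
parList s (x : xs) = s x `seq` parList s xs

main :: IO ()
main = print (map (* 2) [1 .. 5 :: Int] `using` parList rwhnf)
```

The algorithmic code (`map (* 2) [1 .. 5]`) never mentions evaluation order; swapping `parList rwhnf` for `r0` or another strategy changes only the dynamic behaviour.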
Abstract: This paper studies critical runtime-system issues encountered when packing data for transmission in a lazy, parallel graph reduction system. In particular, we aim to answer two questions:
In order to answer the first question, we compare various packing schemes, of which one extreme packs just the node that is demanded (``incremental fetching''), and the other packs all the graph that is reachable from that node (``bulk fetching''). The second question is addressed by considering various mechanisms for latency hiding during communication, ranging from fully synchronous communication with no attempt to mask latency, to full thread migration during asynchronous communication.
In order to make our results as general as possible, we have used the GranSim simulator to study a wide variety of parallel machine configurations. Based on these measurements we propose concrete improvements for parallel graph reducers such as the GUM implementation of Glasgow Parallel Haskell.
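The difference between the two extreme packing schemes can be sketched abstractly: given a heap graph, incremental fetching packs only the demanded node, while bulk fetching packs its entire reachable subgraph. The graph representation and function names below are hypothetical, invented purely to illustrate the two extremes; a real graph reducer packs closures, not integer node identifiers.

```haskell
import qualified Data.Map as Map
import qualified Data.Set as Set

-- A hypothetical heap graph: each node lists its children.
type NodeId = Int
type Heap   = Map.Map NodeId [NodeId]

-- Incremental fetching: pack just the demanded node.
packIncremental :: Heap -> NodeId -> [NodeId]
packIncremental _ n = [n]

-- Bulk fetching: pack everything reachable from the demanded node.
packBulk :: Heap -> NodeId -> [NodeId]
packBulk heap root = Set.toList (go Set.empty [root])
  where
    go visited []       = visited
    go visited (n : ns)
      | n `Set.member` visited = go visited ns
      | otherwise =
          go (Set.insert n visited)
             (Map.findWithDefault [] n heap ++ ns)

-- A small demonstration graph with shared node 4.
demoHeap :: Heap
demoHeap = Map.fromList [(1, [2, 3]), (2, [4]), (3, [4]), (4, [])]

main :: IO ()
main = do
  print (packIncremental demoHeap 1)  -- [1]
  print (packBulk demoHeap 1)         -- [1,2,3,4]
```

Every intermediate scheme between these two extremes corresponds to cutting off the traversal in `packBulk` at some depth or packet-size bound.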
Abstract: "An invisible car came out of nowhere, struck my vehicle and vanished." "I pulled away from the side of the road, glanced at my mother-in-law, and headed for the embankment." "As I approached the intersection a sign suddenly appeared in a place where no stop sign had ever appeared before."
Luckily, we don't normally have to deal with problems as bizarre as these. One interesting application that does arise at the Centre for Transport Studies consists of matching police reports of several accidents so as to locate accident blackspots. The application provides an interesting, data-intensive, test-bed for the persistent functional language PFL. We report here on an approach aimed at improving the performance of this application using Glasgow Parallel Haskell.
The accident application is one of several large parallel Haskell programs under development at Glasgow. Our objective is to achieve wall-clock speedups over the best sequential implementations, and we report modest wall-clock speedups for a demonstration program. From experience with these and other programs the group is developing a methodology for parallelising large functional programs. We have also developed strategies, a mechanism to separately specify a function's algorithm and its dynamic behaviour.
Abstract: This paper describes an inference system, whose purpose is to determine the cost of evaluating expressions in a strict purely functional language. Upper bounds can be derived for both computation cost and the size of data structures. We outline a static analysis based on this inference system for inferring size and cost information. The analysis is a synthesis of the sized types of Hughes et al., and the polymorphic time system of Dornic et al., which was extended to static dependent costs by Reistad and Gifford.
Our main interest in cost information is for scheduling tasks in the parallel execution of functional languages. Using the GranSim parallel simulator, we show that the information provided by our analysis is sufficient to characterise relative task granularities for a simple functional program. This information can be used in the runtime-system of the Glasgow Parallel Haskell compiler to improve dynamic program performance.
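To give a flavour of the kind of information such an analysis derives, the closed form for the cost of mapping a function over a list can be expressed directly in terms of the argument's size and the function's cost. The unit-cost model below is invented for illustration and is not the paper's inference system; the point is only that size information turns a recurrence into a static upper bound.

```haskell
-- A hypothetical, deliberately crude cost model in which each
-- constructor allocation costs one unit. For a list of size n,
--   cost(map f xs) <= n * (cost f + cCons) + cNil
costMap :: Int  -- size of the input list
        -> Int  -- upper bound on the cost of one application of f
        -> Int  -- upper bound on the cost of the whole map
costMap n costF = n * (costF + cCons) + cNil
  where
    cCons = 1  -- cost of building one (:) cell
    cNil  = 1  -- cost of returning []

main :: IO ()
main = print (costMap 100 5)  -- bound of 601 units
```

A bound like this is exactly what a runtime system needs to decide whether a `map` over a given list is coarse enough to be worth sparking as a separate thread.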
Abstract: GUM is a portable, parallel implementation of the Haskell functional language. Despite sustained research interest in parallel functional programming, GUM is one of the first such systems to be made publicly available. GUM is message-based, and portability is facilitated by using the PVM communications harness that is available on many multi-processors. As a result, GUM is available for both shared-memory (Sun SPARCserver multiprocessors) and distributed-memory (networks of workstations) architectures. The high message-latency of distributed machines is ameliorated by sending messages asynchronously, and by sending large packets of related data in each message. Initial performance figures demonstrate absolute speedups relative to the best sequential compiler technology. To improve the performance of a parallel Haskell program GUM provides tools for monitoring and visualising the behaviour of threads and of processors during execution.
Abstract: This paper studies the runtime behaviour of various parallel divide-and-conquer algorithms written in a non-strict functional language, under three common granularity control mechanisms: a simple cut-off, priority thread creation, and priority scheduling. These mechanisms use granularity information, currently provided via annotations, to improve the performance of the parallel programs.
The programs we examine are several variants of a generic divide-and-conquer program, an unbalanced divide-and-conquer algorithm and a parallel determinant computation. Our results indicate that for balanced computation trees a simple, low-overhead mechanism performs well, whereas the more complex mechanisms offer further improvements for unbalanced computation trees.
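The simple cut-off mechanism can be sketched against a generic divide-and-conquer skeleton: below a given problem-size threshold, the potentially parallel recursive calls are made sequential, so no tiny threads are created. This is a hedged sequential sketch in plain Haskell with names of our own choosing; in GPH the non-cut-off branch would spark its subproblems with `par` rather than mapping over them sequentially.

```haskell
-- A generic divide-and-conquer skeleton with a cut-off: problems
-- smaller than the threshold are solved entirely sequentially.
divConq :: (a -> Bool)   -- is the problem trivial?
        -> (a -> b)      -- solve a trivial problem
        -> (a -> [a])    -- divide into subproblems
        -> ([b] -> b)    -- combine subresults
        -> (a -> Int)    -- problem size, for the cut-off test
        -> Int           -- cut-off threshold
        -> a -> b
divConq trivial solve divide combine size threshold = go
  where
    go p
      | trivial p          = solve p
      | size p < threshold = combine (map go (divide p))     -- sequential
      | otherwise          = combine (parMap go (divide p))  -- "parallel"
    -- Sequential stand-in: in GPH each call would be sparked with `par`.
    parMap = map

-- Fibonacci as a divide-and-conquer instance of the skeleton.
pfib :: Int -> Int -> Int
pfib threshold =
  divConq (< 2) id (\n -> [n - 1, n - 2]) sum id threshold

main :: IO ()
main = print (pfib 10 20)  -- fib 20 = 6765
```

Raising the threshold trades potential parallelism for coarser threads, which is precisely the trade-off the measurements in the paper quantify.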
Abstract: To take advantage of distributed-memory parallel machines it is essential to have good control of task granularity. This paper describes a fairly accurate parallel simulator for Haskell, based on the Glasgow compiler, and complementary tools for visualising task granularities. Together these tools allow us to study the effects of various annotations on task granularity on a variety of simulated parallel architectures. They also provide a more precise tool for the study of parallel execution than has previously been available for Haskell programs. These tools have already confirmed that thread migration is essential in parallel systems, demonstrated a close correlation between thread execution times and total heap allocations, and shown that fetching data synchronously normally gives better overall performance than asynchronous fetching, if data is fetched on demand.
Abstract: At Glasgow our research into parallel functional programming has been moving away from our novel architecture, GRIP, towards the provision of a general parallel runtime environment. We call this GRAPH (Graph Reduction for an Assortment of Parallel Hardware).
This paper describes the design of a new memory and load management model for GRAPH, which is intended to match shared- and distributed-memory machines better and to provide a framework for our research into parallel functional databases and granularity control. This model is currently being implemented, so no performance results can be provided at present.
Abstract: We describe a version of the GRAPH system (Graph Reduction for an Assortment of Parallel Hardware) designed to execute parallel functional programs on a range of distributed-memory MIMD machines running the PVM communications harness. GRAPH was developed from the runtime system for the novel GRIP multiprocessor. Although that system has proved highly successful, its novel architecture makes it hard to compare results with other architectures or implementations, and the CPUs used in the GRIP design are now rather old.
The principal extensions from GRAPH for GRIP to GRAPH for PVM are intended to handle high latencies more efficiently, by exploiting asynchronous communication, multi-threaded scheduling, and by grouping packets into larger entities where possible. The main innovation is the development of new, sophisticated packet flushing and packet fetching algorithms whose purpose is to allow graphs to be exported to and fetched from global memory without inter-processor synchronisation. As communication and reduction can be interleaved, a new form of synchronising communication and reduction is necessary.
We also give a brief outlook on the design of a new portable distributed graph reduction system, GRAPH for UMM, which adapts the ideas and techniques originally developed for the GRIP system to a distributed-memory environment.
GRAPH for PVM has been implemented on a network of SUN workstations and it is currently being tested and debugged. Although no extensive performance measurements have been performed yet, the available data shows a significant decrease in the overall communication and thus an improvement in the overall performance of the system.
Abstract: This paper introduces the general area of parallel functional programming, surveying the current state of research and suggesting areas which could profitably be explored in the future. No new results are presented. The paper contains 97 references selected from the 500 or so publications in this field.
Abstract: This paper considers the issue of dynamic task control in the context of a parallel Haskell implementation on the GRIP multiprocessor. For the first time, we report the effect of our task control strategies on task granularity, as measured in terms of dynamic heap allocations. This gives a concrete means of measuring the effectiveness of these strategies other than wall-clock timings, which are notoriously uninformative.
Abstract: The GRIP architecture allows efficient execution of functional programs on a multi-processor built from standard hardware components. State-of-the-art compilation techniques are combined with sophisticated runtime resource-control to give good parallel performance. This paper reports the results of running GRIP on an application which is apparently unsuited to the basic functional model: a database transaction manager incorporating updates as well as lookup transactions. The results obtained show good relative speedups for GRIP, with real performance advantages over the same application executing on sequential machines.
Abstract: This paper describes a portable implementation of the Graph Reduction in Parallel (GRIP) software, built on the Parallel Virtual Machine (PVM) process control system. Using this system, it is easy for anyone to ``get a GRIP'' without needing to invest in special-purpose hardware. An important contribution of this paper is its detailed description of GRIP's Intelligent Memory Unit (IMU) software.
Abstract: It is widely claimed that functional languages are particularly suitable for programming parallel computers. A claimed advantage is that the programmer is not burdened with details of task creation, placement, scheduling, and synchronisation, these decisions being taken by the system instead. Leaving aside the question of whether a pure functional language is expressive enough to encompass all the parallel algorithms we might wish to program, there remains the question of how effectively the compiler and run-time system map the program onto a real parallel system, a task usually carried out mostly by the programmer. This is the question we address in this paper.
We first introduce the system architecture of GRIP, a shared-memory parallel machine supporting an implementation of the functional language Haskell. GRIP executes functional programs in parallel using compiled supercombinator graph reduction, a form of declarative rule system.
We then describe several strategies for run-time resource control which we have tried, presenting comprehensive measurements of their effectiveness. We are particularly concerned with strategies controlling task creation, in order to improve task granularity and minimise communication overheads. This is, so far as we know, one of the first attempts to make a systematic study of task-control strategies in a high-performance parallel functional-language system. GRIP's high absolute performance renders these results credible for real applications.
Abstract: GRIP is a shared-memory multiprocessor designed for efficient parallel evaluation of functional languages using compiled graph reduction. In this paper, we consider the feasibility of implementing a database manager on GRIP, and present results obtained from a pilot implementation. A database implemented in a pure functional language must be modified non-destructively, i.e. the original database must be preserved and a new copy constructed. The naive implementation provides evidence for the feasibility of a pure functional database in the form of modest real-time speed-ups, and acceptable real-time performance. The functional database is also used to investigate the GRIP architecture, compared with an idealized machine. The particular features investigated are the thread-creation costs and the caching of GRIP's distributed memory.
Work in progress:
Abstract: Many compilers do some of their work by means of correctness-preserving, and hopefully performance-improving, program transformations. The Glasgow Haskell Compiler (GHC) takes this idea of "compilation by transformation" as its war-cry, trying to express as much as possible of the compilation process in the form of program transformations.
This paper reports on our practical experience of the transformational approach to compilation, in the context of a substantial compiler.
Abstract: We give an overview of the Glasgow Haskell compiler, focusing especially on ways in which we have been able to exploit the rich theory of functional languages to give very practical improvements in the compiler. The compiler is portable, modular, generates good code, and is freely available.
Abstract: One of the most attractive features of functional programming languages is their suitability for programming parallel computers. This paper is devoted to discussion of such a claim. Firstly, parallel functional programming is discussed from the programmer's point of view. Secondly, since most parallel functional language implementations are based on the concept of graph reduction, the issues raised by graph reduction are discussed. Finally, the paper concludes with a case study of a particular parallel graph reduction machine and a survey of other parallel architectures.
Abstract: Parallel graph reduction is an attractive implementation for functional programming languages because of its simplicity and inherently distributed nature. This paper outlines some of the issues raised by parallel compiled graph reduction, and presents the approach we have adopted for our parallel machine, GRIP. We concentrate on two main areas: static and dynamic techniques to control the growth of parallelism, so as to provide enough parallelism of an appropriate granularity to keep the machine busy without swamping it; and dynamic techniques to exploit the memory hierarchy, so that frequently-referenced data is held near to the processor that references it.
Abstract: The author investigates the use of strictness analysis to determine at compile time which parts of program evaluation can be carried out simultaneously. He gives a practical explanation of this technique and presents measurements of the actual parallelism produced.
Hans-Wolfgang Loidl <firstname.lastname@example.org> Last modified: Tue Mar 31 12:15:57 1998