Glasgow Parallel Haskell

List of Publications (with Abstracts)

[64] C. Brown, H.-W. Loidl, J. Berthold, and K. Hammond. Improving your CASH flow: The Computer Algebra SHell (Extended Abstract). In IFL'10 - Intl. Workshop on the Implementation of Functional Languages, Sept. 2010. Draft Proceedings. [ bib | .pdf ]
Some important challenges in the field of symbolic computation - and functional programming - are transparent access to complex mathematical software, the exchange of data between independent systems with specialised tasks, and the exploitation of modern parallel hardware. One attempt to address these challenges is SymGrid-Par, a system for exploiting parallel hardware in the context of computer algebra. Specifically, SymGrid-Par provides an easy-to-use platform for parallel computation that connects several underlying computer algebra systems, communicating through a standardised protocol for symbolic computation.

In this paper we describe a new component of SymGrid-Par known as CaSH: the Computer Algebra SHell. CaSH is a system that allows direct access to SymGrid-Par via GHCi. CaSH thus allows Haskell programmers to exploit high-performance parallel computations using a system designed for solving problems in computer algebra, whilst still maintaining the purity and expressibility offered by the Haskell environment. We demonstrate access to both sequential and parallel services of SymGrid-Par. For the latter we use parallel skeletons, implemented in the parallel Haskell dialect Eden; these skeletons are called from CaSH but exploit the computer algebra system GAP to offload the mathematical complexity.

Keywords: parallel applications, Eden, symbolic computation
[63] V. Janjic and K. Hammond. Granularity-Aware Work-Stealing for Computationally-Uniform Grids. In CCGRID 2010 - Intl. Conf. on Cluster, Cloud and Grid Computing, pages 123-134, Melbourne, Australia, May 2010. [ bib | DOI ]
Good scheduling is important for ensuring effective use of Grid resources, while maximising parallel performance. In this paper, we show how a basic “Random-Stealing” load balancing algorithm for computational Grids can be improved by using information about the task granularity of parallel programs. We propose several strategies (SSL, SLL and LLL) for using granularity information to improve load balancing, presenting results both from simulations and from a real implementation (the Grid-GUM Runtime System for Parallel Haskell). We assume a common model of task creation which subsumes both master/worker and data-parallel programming paradigms under a task-stealing work distribution strategy. Overall, we achieve improvements in runtime of up to 19.4 percent for irregular problems in the real implementation, and up to 40 percent for the simulations (typical improvements of more than 15 percent for irregular programs, and of 5-10 percent for regular ones). Our results show that, for computationally-uniform Grids, advanced load balancing methods that exploit granularity information generally have the greatest impact on reducing the runtimes of irregular parallel programs. Moreover, the more irregular the program is, the better the improvements that can be achieved.

Keywords: implementation, GpH, non-uniform topology, Grid
[62] S. Marlow, P. Maier, H.-W. Loidl, M. K. Aswad, and P. W. Trinder. Seq no more: Better Strategies for Parallel Haskell. In Haskell Symposium 2010, Baltimore, MD, USA, Sept. 2010. ACM Press. To appear. [ bib | .html ]
We present a complete redesign of evaluation strategies, a key abstraction for specifying pure, deterministic parallelism in Haskell. Our new formulation preserves the compositionality and modularity benefits of the original, while providing significant new benefits. First, we introduce an evaluation-order monad to provide clearer, more generic, and more efficient specification of parallel evaluation. Secondly, the new formulation resolves a subtle space management issue with the original strategies, allowing parallelism (sparks) to be preserved while reclaiming heap associated with superfluous parallelism. Related to this, the new formulation provides far better support for speculative parallelism as the garbage collector now prunes unneeded speculation. Finally, the new formulation provides improved compositionality: we can directly express parallelism embedded within lazy data structures, producing more compositional strategies, and our basic strategies are parametric in the coordination combinator, facilitating a richer set of parallelism combinators.

We give measurements over a range of benchmarks demonstrating that the runtime overheads of the new formulation relative to the original are low, and the new strategies even yield slightly better speedups on average than the original strategies.

Keywords: implementation, GpH, evaluation strategies
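
The redesign described here is essentially the API now shipped as Control.Parallel.Strategies in the parallel package. A minimal sketch in that style (our own example, not code from the paper):

    import Control.Parallel.Strategies
    import Control.DeepSeq (NFData)

    -- The algorithm (map f xs) is kept separate from its parallel
    -- coordination (parList rdeepseq), preserving compositionality.
    parMap' :: NFData b => (a -> b) -> [a] -> [b]
    parMap' f xs = map f xs `using` parList rdeepseq

    fib :: Int -> Integer
    fib m = if m < 2 then 1 else fib (m - 1) + fib (m - 2)

    -- The same parallelism written directly in the evaluation-order monad:
    pair :: Int -> (Integer, Integer)
    pair n = runEval $ do
      a <- rpar (fib (n + 1))  -- spark evaluation in parallel
      b <- rseq (fib n)        -- evaluate to WHNF now
      _ <- rseq a              -- wait for the spark to finish
      return (a, b)
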
[61] M. K. Aswad, P. W. Trinder, A. Al Zain, G. J. Michaelson, and J. Berthold. Low Pain vs No Pain Multi-core Haskells. In TFP'09 - Symp. on Trends in Functional Programming, volume 10 of Trends in Functional Programming, Komarno, Slovakia, June 2009. Intellect. To appear. [ bib | .pdf ]
Multicores are becoming the dominant processor technology and functional languages are theoretically well suited to exploit them. In practice, however, implementing effective high level parallel functional languages is extremely challenging.

This paper is the first programming and performance comparison of functional multicore technologies and reports some of the first ever multicore results for two languages. As such it reflects the growing maturity of the field by systematically evaluating four parallel Haskell implementations on a common multicore architecture. The comparison contrasts the programming effort each language requires with the parallel performance delivered. The study uses 15 'typical' programs to compare a 'no pain', i.e. entirely implicit, parallel language with three 'low pain', i.e. semi-explicit, languages.

The parallel Haskell implementations use different versions of GHC compiler technology, and hence the comparative performance metric is speedup which normalises against sequential performance. We ground the speedup comparisons by reporting both sequential and parallel runtimes and efficiencies for three of the languages. Our experiments focus on the number of programs improved, the absolute speedups delivered, the parallel scalability, and the program changes required to coordinate parallelism. The results are encouraging and, on occasion, surprising.

Keywords: models of parallelism, GpH, Eden, GHC-SMP, GUM, multicore
[60] J. Berthold, S. Marlow, K. Hammond, and A. Al Zain. Comparing and Optimising Parallel Haskell Implementations for Multicore Machines. In ADPNA 2009 - Intl. Workshop on Advanced Distributed and Parallel Network Applications, pages 386-393, Vienna, Austria, Sept. 2009. IEEE Computer Society. [ bib | DOI ]
In this paper, we investigate the differences and tradeoffs imposed by two parallel Haskell dialects running on multicore machines. GpH and Eden are both constructed using the highly-optimising sequential GHC compiler, and share thread scheduling, and other elements, from a common code base. The GpH implementation investigated here uses a physically-shared heap, which should be well-suited to multicore architectures. In contrast, the Eden implementation adopts an approach that has been designed for use on distributed-memory parallel machines: a system of multiple, independent heaps (one per core), with inter-core communication handled by message-passing rather than through shared heap cells. We report two main results. Firstly, we report on the effect of a number of optimisations that we applied to the shared-memory GpH implementation in order to address some performance issues that were revealed by our testing: for example, we implemented a work-stealing approach to task allocation. Our optimisations improved the performance of the shared-heap GpH implementation by as much as 30 percent on eight cores. Secondly, the shared heap approach is, rather surprisingly, not superior to a distributed heap implementation: both give similar performance results.

Keywords: models of parallelism, implementation, GpH, Eden, multicore
[59] D. Jones Jr., S. Marlow, and S. Singh. Parallel Performance Tuning for Haskell. In Haskell Symposium 2009, pages 81-92, Edinburgh, Scotland, Sept. 2009. ACM Press. [ bib | DOI | http ]
Parallel Haskell programming has entered the mainstream with support now included in GHC for multiple parallel programming models, along with multicore execution support in the runtime. However, tuning programs for parallelism is still something of a black art. Without much in the way of feedback provided by the runtime system, it is a matter of trial and error combined with experience to achieve good parallel speedups.

This paper describes an early prototype of a parallel profiling system for multicore programming with GHC. The system comprises three parts: fast event tracing in the runtime, a Haskell library for reading the resulting trace files, and a number of tools built on this library for presenting the information to the programmer. We focus on one tool in particular, a graphical timeline browser called ThreadScope.

The paper illustrates the use of ThreadScope through a number of case studies, and describes some useful methodologies for parallelizing Haskell programs.

Keywords: tools, GpH, GHC-SMP, multicore, profiling
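
A sketch of the workflow the paper describes, using a small sumEuler-style program (our own example, not the paper's code):

    import Control.Parallel.Strategies

    euler :: Int -> Int
    euler n = length [ k | k <- [1 .. n], gcd n k == 1 ]

    main :: IO ()
    main = print (sum (map euler [1 .. 5000] `using` parListChunk 100 rseq))

    -- Compile with event tracing, run, then browse the timeline:
    --   ghc -O2 -threaded -eventlog SumEuler.hs
    --   ./SumEuler +RTS -N2 -ls      (writes SumEuler.eventlog)
    --   threadscope SumEuler.eventlog
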
[58] S. Marlow, S. L. Peyton-Jones, and S. Singh. Runtime Support for Multicore Haskell. In ICFP 2009 - Intl. Conf. on Functional Programming, pages 65-78, Edinburgh, Scotland, Aug. 2009. ACM Press. [ bib | DOI | http ]
Purely functional programs should run well on parallel hardware because of the absence of side effects, but it has proved hard to realise this potential in practice. Plenty of papers describe promising ideas, but vastly fewer describe real implementations with good wall-clock performance. We describe just such an implementation, and quantitatively explore some of the complex design tradeoffs that make such implementations hard to build. Our measurements are necessarily detailed and specific, but they are reproducible, and we believe that they offer some general insights.

Keywords: implementation, GpH, GHC-SMP, multicore
[57] P. W. Trinder, M. I. Cole, H.-W. Loidl, and G. J. Michaelson. Characterising Effective Resource Analyses for Parallel and Distributed Coordination. In FOPARA 2009 - Intl. Workshop on Foundational and Practical Aspects of Resource Analysis, Revised Selected Papers, volume 6324 of Lecture Notes in Computer Science, pages 67-83, Eindhoven, The Netherlands, Nov. 2009. Springer. [ bib | DOI | .pdf ]
An important application of resource analysis is to improve the performance of parallel and distributed programs. In this context key resources are time, space and communication. Given the spectrum of cost models and associated analysis techniques available, what combination should be selected for a specific parallel or distributed context?

We address the question as follows. We outline a continuum of coordination cost models and a range of analysis techniques. We consider six representative parallel/distributed applications of resource analysis techniques, and aim to extract general principles governing why each combination of techniques is effective in its context.

Keywords: semantics, models of parallelism, survey, resource analysis
[56] A. Al Zain, K. Hammond, J. Berthold, P. W. Trinder, G. J. Michaelson, and M. K. Aswad. Low-Pain, High-Gain Multicore Programming in Haskell: Coordinating Irregular Symbolic Computations on Multicore Architectures. In DAMP'09 - Intl. Workshop on Declarative Aspects of Multicore Programming, pages 25-36, Savannah, GA, USA, Jan. 2009. ACM Press. [ bib | DOI | .pdf ]
With the emergence of commodity multicore architectures, exploiting tightly-coupled parallelism has become increasingly important. Functional programming languages, such as Haskell, are, in principle, well placed to take advantage of this trend, offering the ability to easily identify large amounts of fine-grained parallelism. Unfortunately, obtaining real performance benefits has often proved hard to realise in practice.

This paper reports on a new approach using middleware that has been constructed using the Eden parallel dialect of Haskell. Our approach is “low pain” in the sense that the programmer constructs a parallel program by inserting a small number of higher-order algorithmic skeletons at key points in the program. It is “high gain” in the sense that we are able to get good parallel speedups. Our approach is unusual in that we do not attempt to use shared memory directly, but rather coordinate parallel computations using a message-passing implementation. This approach has a number of advantages. Firstly, coordination, i.e. locking and communication, is both confined to limited shared memory areas, essentially the communication buffers, and is also isolated within well-understood libraries. Secondly, the coarse thread granularity that we obtain reduces coordination overheads, so locks are normally needed only on (relatively large) messages, and not on individual data items, as is often the case for simple shared-memory implementations. Finally, cache coherency requirements are reduced since individual tasks do not share caches, and can garbage collect independently.

We report results for two representative computational algebra problems. Computational algebra is a challenging application area that has not been widely studied in the general parallelism community. Computational algebra applications have high computational demands, and are, in principle, often suitable for parallel execution, but usually display a high degree of irregularity in terms of both task and data structure. This makes it difficult to construct parallel applications that perform well in practice. Using our system, we are able to obtain both extremely good processor utilisation (97 percent) and very good absolute speedups (up to 7.7) on an eight-core machine.

Keywords: parallel applications, Eden, multicore, symbolic computation
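
The algorithmic skeletons mentioned above are ordinary Eden functions. A representative sketch (assuming Eden's Control.Parallel.Eden module; the paper's actual middleware is considerably more elaborate):

    import Control.Parallel.Eden

    -- A farm-style skeleton: one Eden process per list element. The
    -- Trans contexts come from Eden and govern channel communication.
    parMapEden :: (Trans a, Trans b) => (a -> b) -> [a] -> [b]
    parMapEden f xs = [ process f # x | x <- xs ]
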
[55] J. Berthold, A. Al Zain, and H.-W. Loidl. Scheduling Light-Weight Parallelism in ArTCoP. In PADL 2008 - Intl. Symp. on Practical Aspects of Declarative Languages, volume 4902 of Lecture Notes in Computer Science, pages 214-229, San Francisco, CA, USA, Jan. 2008. Springer. [ bib | DOI | .pdf ]
We present the design and prototype implementation of the scheduling component in ArTCoP (architecture transparent control of parallelism), a novel run-time environment (RTE) for parallel execution of high-level languages. A key feature of ArTCoP is its support for deep process and memory hierarchies, shown in the scheduler by supporting light-weight threads. To realise a system with easily exchangeable components, the system defines a micro-kernel, providing basic infrastructure, such as garbage collection. All complex RTE operations, including the handling of parallelism, are implemented at a separate system level. By choosing Concurrent Haskell as high-level system language, we obtain a prototype in the form of an executable specification that is easier to maintain and more flexible than conventional RTEs. We demonstrate the flexibility of this approach by presenting an implementation of a scheduler for light-weight threads in ArTCoP, based on GHC Version 6.6.

Keywords: implementation, run-time system
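
A flavour of the executable-specification style, as a minimal work-pool scheduler in Concurrent Haskell (our own sketch; ArTCoP's scheduler is considerably richer):

    import Control.Concurrent
    import Control.Concurrent.MVar

    -- The pool is a shared FIFO of pending tasks.
    type WorkPool = MVar [IO ()]

    -- Each worker repeatedly takes and runs a task until the pool is empty.
    worker :: WorkPool -> IO ()
    worker pool = do
      next <- modifyMVar pool $ \q -> case q of
        []       -> return ([], Nothing)
        (t : ts) -> return (ts, Just t)
      case next of
        Nothing -> return ()            -- nothing left: retire
        Just t  -> t >> worker pool     -- run the task, then loop

    -- Run the given tasks on n worker threads and wait for completion.
    runAll :: Int -> [IO ()] -> IO ()
    runAll n tasks = do
      pool  <- newMVar tasks
      dones <- sequence
        [ do d <- newEmptyMVar
             _ <- forkIO (worker pool >> putMVar d ())
             return d
        | _ <- [1 .. n] ]
      mapM_ takeMVar dones
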
[54] H.-W. Loidl, P. W. Trinder, K. Hammond, A. Al Zain, and C. A. Baker-Finch. Semi-Explicit Parallel Programming in a Purely Functional Style: GpH. In M. Alexander and B. Gardner, editors, Process Algebra for Parallel and Distributed Processing: Algebraic Languages in Specification-Based Software Development, pages 47-76. Chapman and Hall, Dec. 2008. [ bib | .pdf ]
Declarative programming languages can play an important role in the process of designing and implementing parallel systems. They bridge the gap between a high-level specification, with proven properties of the overall system, and the execution of the system on real hardware. Efficiently exploiting parallelism on a wide range of architectures is a challenging task and should in our view be handled by a sophisticated runtime environment. Based on this design philosophy we have developed and formalised Glasgow parallel Haskell (GpH), and implemented it as a conservative extension of the Glasgow Haskell Compiler.

The high-level nature of declarative languages eases the task of mapping an algebraic specification down to executable code. In fact, the operational components of the specification can already be considered an implementation, with the associated properties acting as assertions to the code. Based on a formal model of the declarative language, the validity of these properties can be established by manual proof, which works on a level of detail similar to the specification language itself. Many operational aspects, usually complicating a proof of an implementation, do not come into the picture at this level. Most importantly, unnecessary sequentialisation of the code is avoided.

However, the goal of implicit parallelism has proven an elusive one. Often the automatically generated threads are too fine-grained to be efficient. In other cases the data-dependencies between expressions prohibit the generation of a sufficient amount of parallelism. Thus, we employ an approach of semi-explicit parallelism, where only potential parallelism has to be annotated in a program, and all aspects of coordination are delegated to the runtime environment. A corresponding formal model, in the form of a structured operational semantics, handling pools of realised and of potential parallelism, is used to establish the correctness of the code employing semi-explicit parallelism. The runtime environment itself synchronises parallel threads, by automatically blocking on data under evaluation and by simulating virtual shared memory across networks of machines. Because our system is embedded in the optimised runtime environment of a sequential language, we achieve efficient execution of a high-level language close to the original specification language, while minimising the programmer effort needed to parallelise the code, and we scale to large applications that can be executed on heterogeneous networks and computational Grids.

Keywords: semantics, GpH
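
The semi-explicit annotations are just two combinators: par marks potential parallelism and seq (pseq in today's libraries) orders evaluation. A classic GpH-style example (a sketch, not taken from the chapter):

    import Control.Parallel (par, pseq)

    -- `par` sparks x for possible parallel evaluation; whether a thread
    -- is actually created is left entirely to the runtime environment.
    pfib :: Int -> Integer
    pfib n
      | n < 2     = 1
      | otherwise = x `par` (y `pseq` x + y + 1)
      where
        x = pfib (n - 1)
        y = pfib (n - 2)
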
[53] S. Marlow, T. Harris, R. P. James, and S. L. Peyton-Jones. Parallel Generational-Copying Garbage Collection with a Block-structured Heap. In ISMM 2008 - Intl. Symp. on Memory Management, pages 11-20, Tucson, AZ, USA, June 2008. ACM Press. [ bib | DOI ]
We present a parallel generational-copying garbage collector implemented for the Glasgow Haskell Compiler. We use a block-structured memory allocator, which provides a natural granularity for dividing the work of GC between many threads, leading to a simple yet effective method for parallelising copying GC. The results are encouraging: we demonstrate wall-clock speedups of on average a factor of 2 in GC time on a commodity 4-core machine with no programmer intervention, compared to our best sequential GC.

Keywords: implementation, GpH, GHC-SMP, multicore
[52] P. B. Vasconcelos. Cost Inference and Analysis for Recursive Functional Programs. PhD thesis, University of St Andrews, Feb. 2008. [ bib | .pdf ]
Programming resource-sensitive systems, such as real-time embedded systems, requires guaranteeing both the functional correctness of computations and also that time and space usage fit within constraints imposed by hardware limits or the environment. Functional programming languages have proved very good at meeting the former logical kind of guarantees but not the latter resource guarantees.

This thesis contributes to demonstrating the applicability of functional programming to resource-sensitive systems by presenting an automatic program analysis for obtaining guaranteed upper bounds on the dynamic space usage of functional programs.

Our analysis is developed for a core subset of Hume, a domain-specific functional language targeting resource-sensitive systems (Hammond et al. 2007), and presented as a type and effect system that builds on previous sized type systems (Hughes et al. 1996, Chin and Khoo 2001) and effect systems for costs (Dornic et al. 1992, Reistad and Gifford 1994, Hughes and Pareto 1999). It extends previous approaches by using abstract interpretation techniques to automatically infer linear approximations of the sizes of recursive data types and the stack and heap costs of recursive functions.

The correctness of the analysis is formally proved with respect to an operational semantics for the language, and an inference algorithm that automatically reconstructs size and cost bounds is presented.

A prototype implementation of the analysis and operational semantics has been constructed and used to experimentally assess the quality of the cost bounds with some examples, including implementations of textbook functional programming algorithms and simplified embedded systems.

Keywords: semantics, resource analysis, thesis
[51] A. Al Zain, P. W. Trinder, G. J. Michaelson, and H.-W. Loidl. Evaluating a High-Level Parallel Language (GpH) for Computational GRIDs. IEEE Transactions on Parallel and Distributed Systems, 19(2):219-233, 2008. [ bib | DOI | .pdf ]
Computational Grids potentially offer low cost, readily available, and large-scale high-performance platforms. For the parallel execution of programs, however, computational Grids pose serious challenges: they are heterogeneous, and have hierarchical and often shared interconnects, with high and variable latencies between clusters.

This paper investigates whether a programming language with high-level parallel coordination and a Distributed Shared Memory model (DSM) can deliver good, and scalable, performance on a range of computational Grid configurations. The high-level language, Glasgow parallel Haskell (GpH), abstracts over the architectural complexities of the computational Grid, and we have developed GridGUM2, a sophisticated grid-specific implementation of GpH, to produce the first high-level DSM parallel language implementation for computational Grids.

We report a systematic performance evaluation of GridGUM2 on combinations of high/low and homo/hetero-geneous computational Grids. We measure the performance of a small set of kernel parallel programs representing a variety of application areas, two parallel paradigms, and ranges of communication degree and parallel irregularity. We investigate GridGUM2's performance scalability on medium-scale heterogeneous and high-latency computational Grids, and analyse the performance with respect to the program characteristics of communication frequency and degree of irregular parallelism.

Keywords: models of parallelism, GpH, non-uniform topology, GridGUM
[50] K. Hammond, A. Al Zain, G. Cooperman, D. Petcu, and P. W. Trinder. SymGrid: A Framework for Symbolic Computation on the Grid. In Euro-Par 2007 - Intl. Conf. on Parallel Processing, volume 4641 of Lecture Notes in Computer Science, pages 457-466, Rennes, France, Aug. 2007. Springer. [ bib | DOI | .pdf ]
This paper introduces the design of SymGrid, a new Grid framework that will, for the first time, allow multiple invocations of symbolic computing applications to interact via the Grid. SymGrid is designed to support the specific needs of symbolic computation, including computational steering (greater interactivity), complex data structures, and domain-specific computational patterns (for irregular parallelism). A key issue is heterogeneity: SymGrid is designed to orchestrate components from different symbolic systems into a single coherent (possibly parallel) Grid application, building on the OpenMath standard for data exchange between mathematically-oriented applications. The work is being developed as part of a major EU infrastructure project.

Keywords: parallel applications, symbolic computation
[49] A. Al Zain, K. Hammond, P. W. Trinder, S. Linton, H.-W. Loidl, and M. Costanti. SymGrid-Par: Designing a Framework for Executing Computational Algebra Systems on Computational Grids. In PAPP 2007 - Intl. Workshop on Practical Aspects of High-level Parallel Programming, volume 4488 of Lecture Notes in Computer Science, pages 617-624, Beijing, China, May 2007. Springer. [ bib | DOI | .pdf ]
SymGrid-Par is a new framework for executing large computer algebra problems on computational Grids. We present the design of SymGrid-Par, which supports multiple computer algebra packages, and hence provides the novel possibility of composing a system using components from different packages. Orchestration of the components on the Grid is provided by a Grid-enabled parallel Haskell (GpH). We present a prototype implementation of a core component of SymGrid-Par, together with promising measurements of two programs on a modest Grid to demonstrate the feasibility of our approach.

Keywords: parallel applications, symbolic computation
[48] A. Al Zain, P. W. Trinder, H.-W. Loidl, and G. J. Michaelson. Supporting High-Level Grid Parallel Programming: the Design and Implementation of Grid-GUM2 (Full Paper). In UK e-Science 2007 All Hands Meeting, Nottingham, UK, Sept. 2007. [ bib | .pdf ]
Computational Grids are much more complex than classical high-performance architectures: they have high, and dynamically varying communication latencies, and comprise heterogeneous collections of processing elements. We argue that such a complex and dynamic architecture is extremely challenging for the programmer to explicitly manage, and advocate a high-level parallel programming language, like GpH, where the programmer controls only a few key aspects of the parallel coordination. Such a high-level language requires a sophisticated implementation to manage parallel execution.

This paper presents the design and implementation of Grid-GUM2, a Grid-specific runtime environment for GpH. At the core of the design are monitoring mechanisms that collect static and partial dynamic information, and new load management mechanisms that use this information to better schedule tasks on computational Grids. These mechanisms are built on top of a virtual shared memory (VSM), graph-reduction-based computation engine. A systematic evaluation of the implementation, reported elsewhere, shows acceptable scale-up and good speedups on a range of computational Grids.

Keywords: implementation, GpH, non-uniform topology, GridGUM
[47] A. Al Zain, P. W. Trinder, H.-W. Loidl, and G. J. Michaelson. Managing Heterogeneity in a Grid Parallel Haskell. Scalable Computing: Practice and Experience, 7(3):9-25, 2006. Selected papers from PAPP 2005 - Intl. Workshop on Practical Aspects of High-level Parallel Programming, Atlanta, GA, USA, May 2005. [ bib | .pdf ]
Computational Grids potentially offer cheap large-scale high-performance systems, but are a very challenging architecture, being heterogeneous, shared and hierarchical. Rather than requiring a programmer to explicitly manage this complex environment, we recommend using a high-level parallel functional language, like GpH, with largely automatic management of parallel coordination.

We present GRID-GUM, an initial port of the distributed virtual shared-memory implementation of GpH for computational Grids. We show that GRID-GUM delivers acceptable speedups on relatively low latency homogeneous and heterogeneous computational Grids. Moreover, we find that for heterogeneous computational Grids, load management limits performance.

We present the initial design of GRID-GUM2, which incorporates new load management mechanisms that cheaply and effectively combine static and dynamic information to adapt to heterogeneous Grids. The mechanisms are evaluated by measuring four non-trivial programs with different parallel properties. The measurements show that the new mechanisms improve load distribution over the original implementation, reducing runtime by between 17 and 57 percent, with the greatest improvement obtained for the most dynamic program.

Keywords: implementation, GpH, non-uniform topology, GridGUM
[46] T. Harris, S. Marlow, and S. L. Peyton-Jones. Haskell on a Shared-Memory Multiprocessor. In Haskell Workshop 2005, pages 49-61, Tallinn, Estonia, Sept. 2005. ACM Press. [ bib | DOI | http ]
Multi-core processors are coming, and we need ways to program them. The combination of purely-functional programming and explicit, monadic threads, communicating using transactional memory, looks like a particularly promising way to do so. This paper describes a full-scale implementation of shared-memory parallel Haskell, based on the Glasgow Haskell Compiler. Our main technical contribution is a lock-free mechanism for evaluating shared thunks that eliminates the major performance bottleneck in parallel evaluation of a lazy language. Our results are preliminary but promising: we can demonstrate wall-clock speedups of a serious application (GHC itself), even with only two processors, compared to the same application compiled for a uni-processor.

Keywords: implementation, GpH, GHC-SMP, multicore
[45] M. Lange and H.-W. Loidl. Parallel and Symbolic Model Checking for Fixpoint Logic with Chop. Electr. Notes Theor. Comput. Sci., 128(3):125-138, 2005. Selected Papers from PDMC'04 - Intl. Workshop on Parallel and Distributed Techniques in Verification, London, UK, Sep 2004. [ bib | DOI | .ps ]
We consider the model checking problem for FLC, a modal fixpoint logic capable of defining non-regular properties. This paper presents a refinement of a symbolic model checker and discusses how to parallelise this algorithm. It reports on a prototype implementation of the algorithm in Glasgow Parallel Haskell (GpH) and its performance on a cluster of workstations.

Keywords: parallel applications, GpH, model checking
[44] H.-W. Loidl, F. Rubio, N. Scaife, K. Hammond, S. Horiguchi, U. Klusik, R. Loogen, G. J. Michaelson, R. Peña, S. Priebe, Á. J. Rebón Portillo, and P. W. Trinder. Comparing Parallel Functional Languages: Programming and Performance. Higher-Order and Symbolic Computation, 16(3):203-251, 2003. [ bib | DOI | .ps ]
This paper presents a practical evaluation and comparison of three state-of-the-art parallel functional languages. The evaluation is based on implementations of three typical symbolic computation programs, with performance measured on a Beowulf-class parallel architecture.

We assess three mature parallel functional languages: PMLS, a system for implicitly parallel execution of ML programs; GPH, a mainly implicit parallel extension of Haskell; and Eden, a more explicit parallel extension of Haskell designed for both distributed and parallel execution. While all three languages employ a completely implicit approach to communication, each language takes a different approach to specifying and controlling parallelism, ranging from explicit identification of processes as language constructs (Eden) through annotation of potential parallelism (GPH) to automatic detection of parallel skeletons in sequential code (PMLS).

We present detailed performance measurements of all three systems on a widely available parallel architecture: a Beowulf cluster of low-cost commodity workstations. We use three representative symbolic applications: a matrix multiplication algorithm, an exact linear system solver, and a simple ray-tracer. Our results show how moderate speedups can be achieved with little or no changes to the sequential code, and that parallel performance can be significantly improved even within our high-level model of parallel functional programming by controlling key aspects of the program such as load distribution and thread granularity.

Keywords: models of parallelism, GpH, Eden, PMLS, distributed memory
[43] A. R. Du Bois, R. F. Pointon, H.-W. Loidl, and P. W. Trinder. Implementing Declarative Parallel Bottom-Avoiding Choice. In SBAC-PAD 2002 - Symp. on Computer Architecture and High Performance Computing, pages 82-92, Vitoria, Brazil, Oct. 2002. IEEE Computer Society. [ bib | DOI | .ps ]
Non-deterministic choice supports efficient parallel speculation, but unrestricted non-determinism destroys the referential transparency of purely-declarative languages by removing unfoldability, and risks wasting resources on unnecessary computations. While numerous choice mechanisms have been proposed that preserve unfoldability, and some concurrent implementations exist, we believe that no compiled parallel implementation has previously been constructed. This paper presents the design, semantics, implementation and use of a family of bottom-avoiding choice operators for Glasgow parallel Haskell. The subtle semantic properties of our choice operations are described, including a careful classification using an existing framework, together with a discussion of operational semantics issues and the pragmatics of distributed memory implementation. The expressiveness of our choice operators is demonstrated by constructing a branch-and-bound search, a merge and a speculative conditional. Their effectiveness is demonstrated by comparing the parallel performance of the speculative search with naive and 'perfect' implementations. Their efficiency is assessed by measuring runtime overhead and heap consumption.

Keywords: implementation, semantics, GpH, speculative parallelism
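
A model of the idea in plain Concurrent Haskell (an illustration only; the paper's operators are implemented inside GpH's runtime and have subtler semantics):

    import Control.Concurrent
    import Control.Concurrent.MVar
    import Control.Monad (void)

    -- Race both arguments to weak head normal form; the first to finish
    -- supplies the answer, so the choice is defined whenever either
    -- argument is, i.e. it avoids bottom where possible.
    amb :: a -> a -> IO a
    amb x y = do
      result <- newEmptyMVar
      let race v = v `seq` void (tryPutMVar result v)  -- first write wins
      t1 <- forkIO (race x)
      t2 <- forkIO (race y)
      answer <- takeMVar result
      killThread t1
      killThread t2
      return answer
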
[42] A. R. Du Bois, H.-W. Loidl, and P. W. Trinder. Thread Migration in a Parallel Graph Reducer. In IFL'02 - Intl. Workshop on the Implementation of Functional Languages, volume 2670 of Lecture Notes in Computer Science, pages 199-214, Madrid, Spain, Sept. 2002. Springer. [ bib | .ps ]
To support high level coordination, parallel functional languages need effective and automatic work distribution mechanisms. Many implementations distribute potential work, i.e. sparks or closures, but there is good evidence that the performance of certain classes of program can be improved if current work, or threads, are also distributed. Migrating a thread incurs significant execution cost and requires careful scheduling and an elaborate implementation.

This paper describes the design, implementation and performance of thread migration in the GUM runtime system underlying Glasgow parallel Haskell (GpH). Measurements of nontrivial programs on a high-latency cluster architecture show that thread migration can improve the performance of data-parallel and divide-and-conquer programs with low processor utilisation. Thread migration also reduces the variation in performance results obtained in separate executions of a program. Moreover, migration does not incur significant overheads if there are no migratable threads, or on a single processor. However, for programs that already exhibit good processor utilisation, migration may increase performance variability and very occasionally reduce performance.

Keywords: implementation, GpH, GUM
[41] S. B. Junaidu and P. W. Trinder. Parallelising large irregular programs: an experience with Naira. Information Sciences, 140(3-4):229-240, 2002. Title of pre-print differs from published article. [ bib | .ps.gz ]
Naira is a compiler for Haskell, written in Glasgow parallel Haskell. It exhibits modest, but irregular, parallelism that is determined by properties of the program being compiled, e.g. the complexity of the types and of the pattern matching. We report four experiments into Naira's parallel behaviour using a set of realistic inputs: namely the 18 Haskell modules of Naira itself. The issues investigated are:

* Does increasing input size improve sequential efficiency and speedup?

* To what extent do high communications latencies reduce average parallelism and speedup?

* Does migrating running threads between processors improve average parallelism and speedup at all latencies?

Keywords: parallel applications, GpH
[40] H.-W. Loidl. The Virtual Shared Memory Performance of a Parallel Graph Reducer. In CCGrid 2002 - Intl. Symp. on Cluster Computing and the Grid, pages 311-318, Berlin, Germany, May 2002. IEEE Computer Society. [ bib | .html ]
This paper assesses the costs of maintaining a virtual shared heap in our parallel graph reducer (GUM), which implements a parallel functional language. GUM performs automatic and dynamic resource management for both work and data. We introduce extensions to the original design of GUM, aiming at a more flexible memory management and communication mechanism to deal with high-latency systems. We then present measurements of running GUM on a Beowulf cluster, evaluating the overhead of dynamic distributed memory management and the effectiveness of the new memory management and communication mechanisms.

Keywords: implementation, GpH, GUM
[39] Á. J. Rebón Portillo, K. Hammond, H.-W. Loidl, and P. B. Vasconcelos. Cost Analysis Using Automatic Size and Time Inference. In IFL'02 - Intl. Workshop on the Implementation of Functional Languages, volume 2670 of Lecture Notes in Computer Science, pages 232-247, Madrid, Spain, Sept. 2002. Springer. [ bib | .ps ]
Cost information can be exploited in a variety of contexts, including parallelizing compilers, autonomic GRIDs and real-time systems. In this paper, we introduce a novel type and effect system: the sized time system that is capable of determining upper bounds for both time and space costs, and which we initially intend to apply to determining good granularity for parallel tasks. The analysis is defined for a simple, strict, higher-order and polymorphic functional language, L, incorporating arbitrarily-sized list data structures. The inference algorithm implementing this analysis constructs cost- and size-terms for L-expressions, plus constraints over free size and cost variables in those terms that can be solved to produce information for higher-order functions. The paper presents both the analysis and the inference algorithm, providing examples that illustrate the primary features of the analysis.

Keywords: semantics, resource analysis
[38] P. W. Trinder, H.-W. Loidl, and R. F. Pointon. Parallel and Distributed Haskells. Journal of Functional Programming, 12(4&5):469-510, 2002. [ bib | .html ]
Parallel and distributed languages specify computations on multiple processors and have a computation language to describe the algorithm, i.e. what to compute, and a coordination language to describe how to organise the computations across the processors. Haskell has been used as the computation language for a wide variety of parallel and distributed languages, and this paper is a comprehensive survey of implemented languages. We outline parallel and distributed language concepts and classify Haskell extensions using them. Similar example programs are used to illustrate and contrast the coordination languages, and the comparison is facilitated by the common computation language. A lazy language is not an obvious choice for parallel or distributed computation, and we address the question of why Haskell is a common functional computation language.

Keywords: models of parallelism, Eden, GpH, GdH, survey
[37] H.-W. Loidl. Load Balancing in a Parallel Graph Reducer. In SFP'01 - Scottish Functional Programming Workshop, volume 3 of Trends in Functional Programming, pages 63-74, Stirling, Scotland, Aug. 2001. Intellect. [ bib | .html ]
Parallel graph reducers such as GUM use dynamic techniques to manage resources during execution. One important aspect of the dynamic behaviour is the distribution of work. The load balancing mechanism, which controls this aspect, should be flexible, to adjust the distribution of work to hardware characteristics as well as dynamic program characteristics, and scalable, to achieve high utilisation of all processors even on massively parallel machines.

In this paper we study the behaviour of GUM's load balancing mechanism on a high-latency Beowulf multi-processor. We present modifications to the basic load balancing mechanism and discuss runtime measurements, which indicate that these modifications can significantly enhance the scalability of the system.

Keywords: implementation, GpH, GUM
[36] H.-W. Loidl, P. W. Trinder, and C. Butz. Tuning Task Granularity and Data Locality of Data Parallel GpH Programs. Parallel Processing Letters, 11(4):471-486, 2001. Selected papers from HLPP'01 - Intl. Workshop on High-level Parallel Programming and Applications, Orleans, France, 26-27 March, 2001. [ bib | .html ]
The performance of data parallel programs often hinges on two key coordination aspects: the computational costs of the parallel tasks relative to their management overhead - task granularity; and the communication costs induced by the distance between tasks and their data - data locality. In data parallel programs both granularity and locality can be improved by clustering, i.e. arranging for parallel tasks to operate on related sub-collections of data.

The GpH parallel functional language automatically manages most coordination aspects, but also allows some high-level control of coordination using evaluation strategies. We study the coordination behavior of two typical data parallel programs, and find that while they can be improved by introducing clustering evaluation strategies, further performance improvements can be achieved by restructuring the program.

We introduce a new generic Cluster class that allows clustering to be systematically introduced, and improved by program transformation. In contrast to many other parallel program transformation approaches, we transform realistic programs and report performance results on a 32-processor Beowulf cluster. The cluster class is highly-generic and extensible, amenable to reasoning, and avoids conflating computation and coordination aspects of the program.

Keywords: models of parallelism, GpH, evaluation strategies, granularity
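
A hedged reconstruction of the clustering interface (the names follow the paper; the exact signatures are our own):

    -- Clustering arranges a collection into sub-collections so that
    -- parallel tasks operate on related data, improving granularity
    -- and locality at once.
    class Cluster c where
      cluster   :: Int -> c a -> c (c a)  -- group into clusters of size n (n > 0)
      decluster :: c (c a) -> c a         -- flatten the clustering again

    instance Cluster [] where
      cluster _ [] = []
      cluster n xs = take n xs : cluster n (drop n xs)
      decluster    = concat
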
[35] R. F. Pointon, S. Priebe, H.-W. Loidl, R. Loogen, and P. W. Trinder. Functional vs Object-Oriented Distributed Languages. In EUROCAST'01 - Intl. Conf. on Computer Aided Systems Theory, volume 2178 of Lecture Notes in Computer Science, pages 642-656, Las Palmas de Gran Canaria, Spain, Feb. 2001. Springer. [ bib | DOI | .ps ]
Conventional distributed programming languages require the programmer to explicitly specify many aspects of distributed coordination, including resource location, task placement, communication and synchronisation. Functional languages aim to provide higher-level abstraction, and this paper investigates the effectiveness of this for distributed coordination. The investigation compares and contrasts Java and two Haskell-based distributed functional languages, Eden and GdH. Three distributed programs are used as case studies, and the performance and programming effort are reported.

Keywords: models of parallelism, Eden, GdH, Java, survey
[34] C. A. Baker-Finch, D. J. King, and P. W. Trinder. An Operational Semantics for Parallel Lazy Evaluation. In ICFP'00 - Intl. Conf. on Functional Programming, pages 162-173, Montreal, Canada, Sept. 2000. ACM Press. [ bib | DOI | .pdf ]
We present an operational semantics for parallel lazy evaluation that accurately models the parallel behaviour of the non-strict parallel functional language GpH. Parallelism is modelled synchronously, that is, single reductions are carried out separately, then combined before proceeding to the next set of reductions. Consequently the semantics has two levels, with single-step transition rules at one level and combining rules at the other. Parallel threads are modelled by labelled bindings, and to the best of our knowledge this is the first semantics that models thread states. A set of labelled bindings corresponds to a heap and is used to model sharing.

The semantics is set at a higher level of abstraction than an abstract machine and is therefore more manageable for proofs about programs rather than implementations. At the same time, it is low level enough to allow us to reason about programs in terms of parallelism (i.e., the number of processors used), as well as work and run-time with different numbers of processors.

The framework used by the semantics is flexible in that the model can easily be adapted to express other evaluation models such as sequential call-by-need and fully-speculative evaluation, non-deterministic choice and others.

Keywords: semantics, GpH
[33] H.-W. Loidl, U. Klusik, K. Hammond, R. Loogen, and P. W. Trinder. GpH and Eden: Comparing Two Parallel Functional Languages on a Beowulf Cluster. In SFP'00 - Scottish Functional Programming Workshop, volume 2 of Trends in Functional Programming, pages 39-52, St Andrews, Scotland, July 2000. Intellect. [ bib | .ps ]
We investigate two similar but contrasting parallel functional language designs: Eden and GpH. Both languages use the non-strict functional language Haskell as a core expression language, both their implementations are based on the same host sequential implementation - the high performance Glasgow Haskell Compiler (GHC), and both implementations are available on the same distributed architecture - the St Andrews Beowulf cluster. This allows an exceptionally pure comparison of the language design characteristics and their impact on parallel performance.

The comparison is illustrated by two parallel benchmarks which expose differences in the communication, process creation, and work distribution mechanisms employed by Eden and GpH. Our results show that the explicit process model favoured by Eden gives good parallel performance for coarse-grained applications running on the Beowulf cluster. In comparison, the implicit process model used in GpH gives poorer absolute speedup for this class of application on this architecture. Further work is needed to verify the reasons for this difference in performance and to extend these results to other architectures.

Keywords: models of parallelism, GpH, Eden, distributed memory
[32] R. F. Pointon, P. W. Trinder, and H.-W. Loidl. The Design and Implementation of Glasgow Distributed Haskell. In IFL'00 - Intl. Workshop on the Implementation of Functional Languages, volume 2011 of Lecture Notes in Computer Science, pages 53-70, Aachen, Germany, Sept. 2000. Springer. [ bib | DOI | .ps ]
This paper presents the design and implementation of Glasgow distributed Haskell (GdH), a non-strict distributed functional language. The language is intended for constructing scalable, reliable distributed applications, and is Haskell'98 compliant, being a superset of both Concurrent Haskell and Glasgow parallel Haskell (GpH).

GdH distributes both pure and impure threads across multiple Processing Elements (PEs); each location is made explicit so a program can use resources unique to a PE, and objects including threads can be created on a named PE. The location that uniquely owns a resource is identified by a method of a new Immobile type class. Impure threads communicate and synchronise explicitly to co-ordinate actions on the distributed state, but both pure and impure threads synchronise and communicate implicitly to share data. Limited support for fault-tolerant programming is provided by distributed exception handling. The language constructs are illustrated by example, and two demonstration programs give a flavour of GdH programming.

Although many distributed functional languages have been designed, relatively few have robust implementations. The GdH implementation fuses and extends two mature implementation technologies: the GUM runtime system (RTS) of GpH and the RTS for Concurrent Haskell. The fused RTS is extended with a small number of primitives from which more sophisticated constructs can be constructed, and libraries are adapted to the distributed context.

Keywords: implementation, GdH
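
A sketch of the location-aware style this enables, assuming GdH's primitives as described in the paper (GdH is a research implementation, so we simply state the primitives we rely on):

    -- Assumed GdH primitives:
    --   myPEId  :: IO PEId                -- the PE this thread runs on
    --   allPEId :: IO [PEId]              -- all PEs in the program
    --   revalIO :: IO a -> PEId -> IO a   -- run an action on a named PE

    -- Ask every PE to report its identity.
    ping :: IO [PEId]
    ping = do
      pes <- allPEId
      mapM (revalIO myPEId) pes
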
[31] P. W. Trinder, H.-W. Loidl, E. Barry Jr., M. K. Davis, K. Hammond, U. Klusik, S. L. Peyton-Jones, and Á. J. Rebón Portillo. The Multi-Architecture Performance of the Parallel Functional Language GpH (Research Note). In Euro-Par 2000 - Intl. Conf. on Parallel Processing, volume 1900 of Lecture Notes in Computer Science, pages 739-743, Munich, Germany, Aug. 2000. Springer. [ bib | DOI | .html ]
In principle, functional languages promise straightforward architecture-independent parallelism, because of their high level description of parallelism, dynamic management of parallelism and deterministic semantics. However, these language features come at the expense of a sophisticated compiler and/or runtime-system. The problem we address is whether such an elaborate system can deliver acceptable performance on a variety of parallel architectures. In particular we report performance measurements for the GUM runtime-system on eight parallel architectures, including massively parallel, distributed-memory, shared-memory and workstation networks.

Keywords: models of parallelism, GpH
[30] P. W. Trinder, R. F. Pointon, and H.-W. Loidl. Runtime System Level Fault Tolerance for a Distributed Functional Language. In SFP'00 - Scottish Functional Programming Workshop, volume 2 of Trends in Functional Programming, pages 103-114, St Andrews, Scotland, July 2000. Intellect. [ bib | .ps ]
Functional languages potentially offer benefits for distributed fault tolerance: many computations are pure, and hence have no side-effects to be reversed during error recovery; moreover functional languages have a high-level runtime system (RTS) where computations and data are readily manipulated. We propose a new RTS level of fault tolerance for distributed functional languages, and outline a design for its implementation for the GdH language. The design distinguishes between pure and impure computations: impure computations must be recovered using conventional exception-based techniques, but the RTS attempts implicit recovery of pure computations.

Keywords: implementation, GdH, GUM, fault tolerance
[29] H.-W. Loidl, P. W. Trinder, K. Hammond, S. B. Junaidu, R. G. Morgan, and S. L. Peyton-Jones. Engineering Parallel Symbolic Programs in GpH. Concurrency - Practice and Experience, 11(12):701-752, 1999. [ bib | .ps ]
We investigate the claim that functional languages offer low-cost parallelism in the context of symbolic programs on modest parallel architectures. In our investigation we present the first comparative study of the construction of large applications in a parallel functional language, in our case in Glasgow Parallel Haskell (GPH). The applications cover a range of application areas, use several parallel programming paradigms, and are measured on two very different parallel architectures.

On the applications level the most significant result is that we are able to achieve modest wall-clock speedups (between factors of 2 and 10) over the optimised sequential versions for all but one of the programs. Speedups are obtained even for programs that were not written with the intention of being parallelised. These gains are achieved with a relatively small programmer-effort. One reason for the relative ease of parallelisation is the use of evaluation strategies, a new parallel programming technique that separates the algorithm from the coordination of parallel behaviour.

On the language level we show that the combination of lazy and parallel evaluation is useful for achieving a high level of abstraction. In particular we can describe top-level parallelism, and also preserve module abstraction by describing parallelism over the data structures provided at the module interface (“data-oriented parallelism”). Furthermore, we find that the determinism of the language is helpful, as is the largely-implicit nature of parallelism in GPH.

Keywords: parallel applications, GpH
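
The data-oriented parallelism described here reads naturally in today's strategies library (a modern rendering, not the 1999 code):

    import qualified Data.Map as M
    import Control.Parallel.Strategies
    import Control.DeepSeq (NFData)

    -- Parallelism is expressed over the data structure a module exports,
    -- here a Map, without touching the algorithm itself.
    parMapValues :: NFData b => (a -> b) -> M.Map k a -> M.Map k b
    parMapValues f m = M.map f m `using` parTraversable rdeepseq
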
[28] P. W. Trinder. Motivation for Glasgow distributed Haskell, a non-strict Functional Language. In PDSIA'99 - Intl. Workshop on Parallel and Distributed Computing for Symbolic and Irregular Applications, pages 72-81, Sendai, Japan, July 1999. World Scientific. [ bib | .html ]
Non-strict functional languages offer potential benefits for constructing distributed systems: namely a highly-dynamic model of distribution, a relatively high degree of distribution transparency, and the potential to abstract over distribution-control primitives. We describe our motivation for implementing such a language, a variant of Haskell, and evaluating it. The implementation is a fusion of existing Glasgow Haskell Compiler technologies. The evaluation will be based on experiences implementing a distributed interactive simulation, and comparing it with a Java version.

Keywords: models of parallelism, GdH
[27] P. W. Trinder, H.-W. Loidl, and K. Hammond. Large-Scale Functional Applications. In K. Hammond and G. J. Michaelson, editors, Research Directions in Parallel Functional Programming, pages 399-426. Springer, Nov. 1999. [ bib | http ]
This chapter describes applications that have been written using parallel functional language technology. While the majority of these applications have been produced as part of computer science research projects aimed at testing the limits of the technology, others, such as those written by Ericsson in Erlang, are being used in real commercial products. Furthermore, projects such as Digital's Smart Kiosk application, written using the data-flow-inspired Stampede system, represent serious trial uses of the technology by commercial organisations.

Functional programming is a wide-spectrum, general purpose paradigm, and the applications therefore cover a wide area, including high-performance numerical applications, symbolic applications, data-intensive applications and computer vision. Some of these areas present hard problems for traditional parallel approaches, which are usually best suited to regular computation patterns and regular data structures. Symbolic or irregular applications that are described here include compilers, theorem provers, computer algebra and natural language processing.

It is a common misconception that functional programming can only be applied to small-scale programs, or that parallelism has only been obtained for toy examples. This chapter describes several real parallel applications that are large in absolute terms, e.g. the 230000 line Erlang mobility server, or the 47000 line Lolita natural language parser. Since functional programs are often a factor of 5-10 shorter than the equivalent C or Fortran, these represent serious, and realistic bases for assessing the quality of a parallel language and its development environment.

Although, as the preceding chapters have shown, there is still much research to be done, parallel functional programming technology has been maturing rapidly, and many large parallel functional applications have been produced in recent years. Although this chapter aims to be reasonably comprehensive, it is necessarily selective in the applications that are covered. We are particularly heartened by results from systems such as SAC and SISAL, which are not merely competitive with the performance of C, but have actually exceeded the performance of equivalent Fortran programs compiled using high-quality parallelising compilers.

The chapter uses the taxonomy that was introduced in Chapter 3. Implicit, semi-explicit, coordination, explicit and derivational approaches are each covered in separate sections. Within those sections, applications are covered by area. Ideally we would like to report a consistent set of program and parallel platform characteristics, and performance figures for each application. In the absence of a standard set of parallel benchmarks, such results are currently unobtainable. Results are therefore given using either absolute or relative speedup figures as recorded by the authors of the research papers on particular parallel architectures.

Keywords: parallel applications, GpH
[26] J. G. Hall, C. A. Baker-Finch, P. W. Trinder, and D. J. King. Towards an Operational Semantics for a Parallel Non-Strict Functional Language. In IFL'98 - Intl. Workshop on the Implementation of Functional Languages, volume 1595 of Lecture Notes in Computer Science, pages 54-71, London, UK, Sept. 1998. Springer. [ bib | .html ]
Parallel programs must describe both computation and coordination, i.e. what to compute and how to organise the computation. In functional languages equational reasoning is often used to reason about computation. In contrast, there have been many different coordination constructs for functional languages, and far less work on reasoning about coordination.

We present an initial semantics for GpH, a small extension of the Haskell language, that allows us to reason about coordination. In particular we can reason about work, average parallelism, and runtime. The semantics captures the notions of limited (physical) resources, the preservation of sharing and speculative evaluation. We show a consistency result with Launchbury's well-known lazy semantics.

Keywords: semantics, GpH
[25] K. Hammond. Strategic SPMD. In Glasgow Workshop on Functional Programming, Pitlochry, Scotland, Sept. 1998. [ bib | .html ]
This paper extends the existing concept of evaluation strategies, higher-order functions that can be used to separate behaviour from algorithm in parallel functional programs, to support the SPMD model of parallelism. Our model allows SPMD sub-programs to be composed with, or arbitrarily nested within, other parallel paradigms. This represents a major step towards supporting the Group SPMD model, a generalisation of SPMD, in Glasgow Parallel Haskell. It also represents the first attempt of which we are aware to model SPMD programming in a system supporting implicit communications.

In order to ensure that the necessary barrier communications are modelled correctly at the high level, we use the implementation concept of synchronisation through logically shared graph reduction. We therefore ensure, in common with other work on evaluation strategies, that no new constructs need be added to the host functional language other than basic ones to introduce parallelism and enforce sequential execution. In order to model the SPMD paradigm accurately it is, however, convenient to introduce a primitive to control task placement.

Although there is, as yet, no efficient implementation of the SPMD strategy, we are confident that this could be constructed using an algorithmic skeleton, or similar implementation technique.

Keywords: models of parallelism, evaluation strategies, SPMD
[24] D. J. King, J. G. Hall, and P. W. Trinder. A Strategic Profiler for Glasgow Parallel Haskell. In IFL'98 - Intl. Workshop on the Implementation of Functional Languages 1998, volume 1595 of Lecture Notes in Computer Science, pages 88-102, London, UK, Sept. 1998. Springer. [ bib | .html ]
Execution profiling is a crucial part of engineering large parallel functional programs. This paper presents the design, implementation, and use of a new execution time profiler for Glasgow Parallel Haskell (GpH), a well-established parallel derivative of Haskell. Parallelism is introduced in GpH by using evaluation strategies, which provide a clean way of coordinating parallel threads without destroying a program's original structure. The new profiler attributes the cost of evaluating parallel threads to the strategies that created them. A unique feature of the strategic profiler is that the call-graph of evaluation strategies is maintained, allowing the programmer to discover the sequence of (perhaps nested) strategies that were used to create any given thread. The strategic profiler is demonstrated on several examples, and compared with other profilers.

Keywords: tools, GpH, profiling, evaluation strategies
[23] H.-W. Loidl. Granularity in Large-Scale Parallel Functional Programming. PhD thesis, Dept. of Computing Science, Univ. of Glasgow, Mar. 1998. [ bib | .html ]
This thesis demonstrates how to reduce the runtime of large non-strict functional programs using parallel evaluation. The parallelisation of several programs shows the importance of granularity, i.e. the computation costs of program expressions. The aspect of granularity is studied both on a practical level, by presenting and measuring runtime granularity improvement mechanisms, and at a more formal level, by devising a static granularity analysis.

By parallelising several large functional programs this thesis demonstrates for the first time the advantages of combining lazy and parallel evaluation on a large scale: laziness aids modularity, while parallelism reduces runtime. One of the parallel programs is the Lolita system which, with more than 47,000 lines of code, is the largest existing parallel non-strict functional program. Evaluation strategies, a new mechanism for parallel programming to which this thesis contributes, are shown to be useful in this parallelisation. Evaluation strategies simplify parallel programming by separating algorithmic code from code specifying dynamic behaviour. For large programs the abstraction provided by functions is maintained by using a data-oriented style of parallelism, which defines parallelism over intermediate data structures rather than inside the functions.

A highly parameterised simulator, GranSim, has been constructed collaboratively and is discussed in detail in this thesis. GranSim is a tool for architecture-independent parallelisation and a testbed for implementing runtime-system features of the parallel graph reduction model. By providing an idealised as well as an accurate model of the underlying parallel machine, GranSim has proven to be an essential part of an integrated parallel software engineering environment. Several parallel runtime-system features, such as granularity improvement mechanisms, have been tested via GranSim. It is publicly available and in active use at several universities worldwide.

In order to provide granularity information this thesis presents an inference-based static granularity analysis. This analysis combines two existing analyses, one for cost and one for size information. It determines an upper bound for the computation costs of evaluating an expression in a simple strict higher-order language. By exposing recurrences during cost reconstruction and using a library of recurrences and their closed forms, it is possible to infer the costs for some recursive functions. The possible performance improvements are assessed by measuring the parallel performance of a hand-analysed and annotated program.

Keywords: parallel applications, GpH, granularity, thesis
[22] P. W. Trinder, E. Barry Jr., M. K. Davis, K. Hammond, S. B. Junaidu, U. Klusik, H.-W. Loidl, and S. L. Peyton-Jones. GpH: An Architecture-independent Functional Language. Unpublished draft, July 1998. [ bib | .html ]
In principle, pure functional languages promise straightforward architecture-independent parallelism. We investigate the validity of this claim in the context of our highly-portable implementation of an implicitly-parallel functional language: the GUM implementation of Glasgow Parallel Haskell (GpH). We discuss architecture independence at two levels: low-level (i.e. the implementation) and high-level (i.e. the programmer).

Low-level architecture independence is achieved by choosing a message-passing model for GUM, and implementing it using portable C and a widely-supported message-passing library like PVM. In fact GUM is largely independent of the message-passing library, and has been adapted to use MPI and the CM-5 CMMD libraries as well as PVM. As a result, GUM is easily ported, and is currently available on seven platforms including shared-memory machines, distributed memory machines, and networks of workstations. We provide indicative measurements of how efficient and effective our architecture-independent runtime system is across a range of architectures.

The GpH programming model provides higher-level architecture independence. The parallelism in GpH is mainly implicit, and hence relatively small parts of the program need to be changed for a new architecture. The coordination that is required is expressed with a new high-level construct, evaluation strategies. Evaluation strategies provide a clean separation between algorithm and coordination, easing the task of changing either facet to suit new parallel environments. Moreover, GpH programs can systematically be developed for multiple target architectures, using a suite of simulation, profiling and visualisation tools. Much of the development is architecture-independent but, once a particular target architecture has been selected, the tools are parameterised to support tuning for that architecture. We demonstrate the systematic development of two real programs to the point of achieving good speedups on four architectures.

Keywords: models of parallelism, GpH, architecture independence, unpublished
[21] P. W. Trinder, E. Barry Jr., M. K. Davis, K. Hammond, S. B. Junaidu, U. Klusik, H.-W. Loidl, and S. L. Peyton-Jones. Low level Architecture-independence of Glasgow Parallel Haskell (GpH). In Glasgow Workshop on Functional Programming, Pitlochry, Scotland, Sept. 1998. [ bib | .html ]
In principle, pure functional languages promise straightforward architecture-independent parallelism. We investigate the validity of this claim in the context of our highly-portable implementation of an implicitly-parallel functional language: the GUM implementation of Glasgow Parallel Haskell (GpH). We focus here on the low-level architecture independence of GUM: the high-level architecture independence of the GpH programming model is discussed elsewhere.

Low-level architecture independence is achieved by choosing a message-passing model for GUM, and implementing it using portable C and a widely-supported message-passing library like PVM. In fact GUM is largely independent of the message-passing library, and has been adapted to use MPI and the CM-5 CMMD libraries as well as PVM. As a result, GUM is easily ported, and is currently available on seven platforms including shared-memory machines, distributed memory machines, and networks of workstations. We provide indicative measurements of the efficiency and speedups delivered by GUM on a range of architectures.

Keywords: models of parallelism, GpH, architecture independence
[20] P. W. Trinder, K. Hammond, H.-W. Loidl, and S. L. Peyton-Jones. Algorithm + Strategy = Parallelism. Journal of Functional Programming, 8(1):23-60, 1998. [ bib | DOI | .html ]
The process of writing large parallel programs is complicated by the need to specify both the parallel behaviour of the program and the algorithm that is to be used to compute its result. This paper introduces evaluation strategies, lazy higher-order functions that control the parallel evaluation of non-strict functional languages. Using evaluation strategies, it is possible to achieve a clean separation between algorithmic and behavioural code. The result is enhanced clarity and shorter parallel programs.

Evaluation strategies are a very general concept: this paper shows how they can be used to model a wide range of commonly used programming paradigms, including divide-and-conquer, pipeline parallelism, producer/consumer parallelism, and data-oriented parallelism. Because they are based on unrestricted higher-order functions, they can also capture irregular parallel structures.

Evaluation strategies are not just of theoretical interest: they have evolved out of our experience in parallelising several large-scale parallel applications, where they have proved invaluable in helping to manage the complexities of parallel behaviour. These applications are described in detail here. The largest application we have studied to date, Lolita, is a 60,000 line natural language parser. Initial results show that for these programs we can achieve acceptable parallel performance, while incurring minimal overhead for using evaluation strategies.
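As a concrete illustration, the core of the paper can be sketched in a few lines of Haskell. This is a minimal reconstruction using today's Control.Parallel and Control.DeepSeq modules (the original formulation predates these modules; pseq stands in for the paper's use of seq):

    import Control.Parallel (par, pseq)
    import Control.DeepSeq (NFData, rnf)

    -- A strategy specifies how, and how much of, a value is evaluated.
    type Strategy a = a -> ()

    -- Reduce a value to full normal form, sequentially.
    rnfStrat :: NFData a => Strategy a
    rnfStrat = rnf

    -- Apply a strategy to every element of a list in parallel.
    parList :: Strategy a -> Strategy [a]
    parList _     []     = ()
    parList strat (x:xs) = strat x `par` parList strat xs

    -- Attach a strategy to a value: the algorithm stays on the left of
    -- `using`, the coordination on the right.
    using :: a -> Strategy a -> a
    using x strat = strat x `pseq` x

    -- "Algorithm + Strategy = Parallelism": a parallel map in one line.
    parMap :: NFData b => (a -> b) -> [a] -> [b]
    parMap f xs = map f xs `using` parList rnfStrat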

Keywords: models of parallelism, GpH, evaluation strategies
[19] C. V. Hall, H.-W. Loidl, P. W. Trinder, K. Hammond, and J. T. O'Donnell. Refining a Parallel Algorithm for Calculating Bowings. In Glasgow Workshop on Functional Programming, Ullapool, Scotland, Sept. 1997. Draft proceedings. [ bib | .html ]
String players know that bowing properly is the hardest skill they have to learn. In other work [FH97], we develop an algorithm that calculates bowings for BowTech, a project that supports string performers. This algorithm takes a significant amount of time to execute (in our second test case, an hour for the implementation written in Ada). We have implemented a parallel version of the algorithm in GpH, a parallel superset of Haskell, and measured the quality of output on six pieces of music. The parallel program has been refined using the GranSim simulator and measured on two parallel architectures: a shared-memory multiprocessor and a network of workstations.

Keywords: parallel applications, GpH
[18] K. Hammond, H.-W. Loidl, and P. W. Trinder. Parallel Cost Centre Profiling. In Glasgow Workshop on Functional Programming, pages 51-72, Ullapool, Scotland, Sept. 1997. Draft proceedings. [ bib | .html ]
Good profiling is a major issue in extracting performance from parallel programs. We report on a novel synthesis of sequential cost-centre profiling and state-of-the-art parallel simulation for the pure functional language Haskell that promises to provide detailed and accurate information about parallel Haskell programs. Exploiting simulation also improves the quality of sequential cost-centre profiling, though at a significant performance overhead.

Keywords: tools, GpH, profiling
[17] S. B. Junaidu, A. J. T. Davie, and K. Hammond. Naira: A Parallel^2 Haskell Compiler. In IFL'97 - Intl. Workshop on the Implementation of Functional Languages, volume 1467 of Lecture Notes in Computer Science, pages 214-230, St Andrews, Scotland, Sept. 1997. Springer. [ bib | .html ]
Naira is a compiler for a parallel dialect of Haskell, compiling to a dataflow-inspired parallel abstract machine. Unusually (perhaps even uniquely), Naira has itself been parallelised using state-of-the-art tools developed at Glasgow and St. Andrews. This paper reports initial performance results that have been obtained using the GranSim simulator, both for the top-level pipeline and for the individual compilation stages. Our results show that a modest but useful degree of parallelism can be achieved even for a distributed-memory machine.

Keywords: parallel applications, GpH
[16] H.-W. Loidl. LinSolv: a Case Study in Strategic Parallelism. In Glasgow Workshop on Functional Programming, Ullapool, Scotland, Sept. 1997. Draft proceedings. [ bib | .html ]
This paper discusses the parallelisation and performance tuning of a typical computer algebra algorithm, LinSolv, using evaluation strategies. We present three steps in the parallelisation process starting with a naive parallel version. As this algorithm uses infinite data structures as intermediate values it is necessary to define very sophisticated strategies in order to improve parallel performance.

We also compare the strategic parallel code with pre-strategy code. This comparison shows how evaluation strategies help to localise changes needed for parallelisation. In particular, the separation between algorithmic and parallel code makes the structure of the parallelism much clearer.

Keywords: parallel applications, GpH
[15] H.-W. Loidl, R. G. Morgan, P. W. Trinder, S. Poria, C. Cooper, S. L. Peyton-Jones, and R. Garigliano. Parallelising a Large Functional Program or: Keeping LOLITA Busy. In IFL'97 - Intl. Workshop on the Implementation of Functional Languages, volume 1467 of Lecture Notes in Computer Science, pages 198-213, St Andrews, Scotland, Sept. 1997. Springer. [ bib | DOI | .html ]
In this paper we report on the ongoing parallelisation of Lolita, a natural language engineering system. Although Lolita currently exhibits only modest parallelism, we believe that it is the largest parallel functional program ever, comprising more than 47,000 lines of Haskell. Lolita has the following interesting features common to real world applications of lazy languages:

* the code was not specifically designed for parallelism;

* laziness is essential for efficiency in Lolita;

* Lolita interfaces to data structures outside the Haskell heap, using a foreign language interface;

* Lolita was not written by those most closely involved in the parallelisation.

Our expectations in parallelising the program were to achieve moderate speedups with small changes to the code. To date, speedups of up to 2.4 have been achieved for Lolita running under a realistic simulation of our 4-processor shared-memory target machine. Most notably, the parallelism is achieved with a very small number of changes to, and without requiring an understanding of, most of the application. On the Sun SPARCserver target machine wall-clock speedup is currently limited by physical memory availability.

Keywords: parallel applications, GpH
[14] H.-W. Loidl and P. W. Trinder. Engineering Large Parallel Functional Programs. In IFL'97 - Intl. Workshop on the Implementation of Functional Languages, volume 1467 of Lecture Notes in Computer Science, pages 178-197, St Andrews, Scotland, Sept. 1997. Springer. [ bib | DOI | .html ]
The design and implementation of useful programming languages, whether sequential or parallel, should be driven by large, realistic applications. In constructing several medium- and large-scale programs in Glasgow Parallel Haskell (GpH), a parallel extension of Haskell, the group at Glasgow has investigated several important engineering issues:

* Real Application Parallelism: The programs achieve good wall-clock speedups and acceptable scale-up on both a shared-memory and a distributed memory machine. The programs typify a number of application areas and use a number of different parallel paradigms, e.g. pipelining or divide-and-conquer, often combining several paradigms in a single program.

* Language Issues: Although the largely implicit parallelism in GpH is a satisfactory programming model in general, the base constructs for introducing and controlling parallelism tend to obfuscate the semantics of large programs. As a result we developed evaluation strategies, a more abstract and systematic mechanism for introducing and controlling parallelism.

* Engineering Environment: The development and performance tuning of these programs emphasised the importance of an integrated engineering environment. In the process we have refined components of this environment like the simulator, the runtime system, and the profiling tools.

Keywords: parallel applications, GpH
[13] H.-W. Loidl and K. Hammond. A Sized Time System for a Parallel Functional Language. In Glasgow Workshop on Functional Programming, Ullapool, Scotland, July 1996. [ bib | .html ]
This paper describes an inference system whose purpose is to determine the cost of evaluating expressions in a strict, purely functional language. Upper bounds can be derived for both computation cost and the size of data structures. We outline a static analysis based on this inference system for inferring size and cost information. The analysis is a synthesis of the sized types of Hughes et al. and the polymorphic time system of Dornic et al., which was extended to static dependent costs by Reistad and Gifford.

Our main interest in cost information is for scheduling tasks in the parallel execution of functional languages. Using the GranSim parallel simulator, we show that the information provided by our analysis is sufficient to characterise relative task granularities for a simple functional program. This information can be used in the runtime-system of the Glasgow Parallel Haskell compiler to improve dynamic program performance.

Keywords: semantics, resource analysis
[12] H.-W. Loidl and K. Hammond. Making a Packet: Cost-Effective Communication for a Parallel Graph Reducer. In IFL'96 - Intl. Workshop on the Implementation of Functional Languages, volume 1268 of Lecture Notes in Computer Science, pages 184-199, Bad-Godesberg, Germany, Sept. 1996. Springer. [ bib | .html ]
This paper studies critical runtime-system issues encountered when packing data for transmission in a lazy, parallel graph reduction system. In particular, we aim to answer two questions:

* How much graph should go into a packet?

* How aggressively should a processor look for work after requesting remote data?

In order to answer the first question, we compare various packing schemes, of which one extreme packs just the node that is demanded (“incremental fetching”), and the other packs all the graph that is reachable from that node (“bulk fetching”). The second question is addressed by considering various mechanisms for latency hiding during communication, ranging from fully synchronous communication with no attempt to mask latency, to full thread migration during asynchronous communication.
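To make the spectrum concrete, the following toy sketch (our own abstraction, not GUM's actual packing code; NodeId, Heap and the function names are hypothetical) shows the two extremes over a heap represented as an adjacency map:

    import qualified Data.Map as Map
    import qualified Data.Set as Set

    type NodeId = Int
    type Heap   = Map.Map NodeId [NodeId]   -- node -> nodes it points to

    -- Incremental fetching: pack just the node that is demanded.
    packIncremental :: Heap -> NodeId -> [NodeId]
    packIncremental _ root = [root]

    -- Bulk fetching: pack everything reachable from the demanded node.
    packBulk :: Heap -> NodeId -> [NodeId]
    packBulk heap root = Set.toList (go Set.empty [root])
      where
        go visited []              = visited
        go visited (n:ns)
          | n `Set.member` visited = go visited ns
          | otherwise              =
              go (Set.insert n visited)
                 (Map.findWithDefault [] n heap ++ ns)

The packing schemes measured in the paper lie between these two extremes, bounding how much of the reachable graph goes into a single packet.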

In order to make our results as general as possible, we have used the GranSim simulator to study a wide variety of parallel machine configurations. Based on these measurements we propose concrete improvements for parallel graph reducers such as the GUM implementation of Glasgow Parallel Haskell.

Keywords: implementation, GpH, GUM
[11] P. W. Trinder, K. Hammond, H.-W. Loidl, S. L. Peyton-Jones, and J. Wu. Accidents always Come in Threes: A Case Study of Data-intensive Programs in Parallel Haskell. In Glasgow Workshop on Functional Programming, Ullapool, Scotland, July 1996. [ bib | .html ]
Accidents happen:

* "An invisible car came out of nowhere, struck my vehicle and vanished."

* "I pulled away from the side of the road, glanced at my mother-in-law, and headed for the embankment."

* "As I approached the intersection a sign suddenly appeared in a place where no stop sign had ever appeared before."

Luckily, we don't normally have to deal with problems as bizarre as these. One interesting application that does arise at the Centre for Transport Studies consists of matching police reports of several accidents so as to locate accident blackspots. The application provides an interesting, data-intensive, test-bed for the persistent functional language PFL. We report here on an approach aimed at improving the performance of this application using Glasgow Parallel Haskell.

The accident application is one of several large parallel Haskell programs under development at Glasgow. Our objective is to achieve wall-clock speedups over the best sequential implementations, and we report modest wall-clock speedups for a demonstration program. From experience with these and other programs the group is developing a methodology for parallelising large functional programs. We have also developed strategies, a mechanism to separately specify a function's algorithm and its dynamic behaviour.

Keywords: parallel applications, GpH
[10] P. W. Trinder, K. Hammond, J. S. Mattson Jr., A. S. Partridge, and S. L. Peyton-Jones. GUM: A Portable Parallel Implementation of Haskell. In PLDI'96 - ACM Conf. on Programming Language Design and Implementation, pages 79-88, Philadelphia, PA, USA, May 1996. ACM Press. [ bib | DOI | .html ]
GUM is a portable, parallel implementation of the Haskell functional language. Despite sustained research interest in parallel functional programming, GUM is one of the first such systems to be made publicly available. GUM is message-based, and portability is facilitated by using the PVM communications harness that is available on many multi-processors. As a result, GUM is available for both shared-memory (Sun SPARCserver multiprocessors) and distributed-memory (networks of workstations) architectures. The high message-latency of distributed machines is ameliorated by sending messages asynchronously, and by sending large packets of related data in each message. Initial performance figures demonstrate absolute speedups relative to the best sequential compiler technology. To improve the performance of a parallel Haskell program GUM provides tools for monitoring and visualising the behaviour of threads and of processors during execution.

Keywords: implementation, GpH, GUM
[9] K. Hammond, H.-W. Loidl, and A. S. Partridge. Visualising Granularity in Parallel Programs: A Graphical Winnowing System for Haskell. In HPFC'95 - Conf. on High Performance Functional Computing, pages 208-221, Denver, CO, USA, Apr. 1995. [ bib | .html ]
To take advantage of distributed-memory parallel machines it is essential to have good control of task granularity. This paper describes a fairly accurate parallel simulator for Haskell, based on the Glasgow compiler, and complementary tools for visualising task granularities. Together these tools allow us to study the effects of various annotations on task granularity on a variety of simulated parallel architectures. They also provide a more precise tool for the study of parallel execution than has previously been available for Haskell programs.

These tools have already confirmed that thread migration is essential in parallel systems, demonstrated a close correlation between thread execution times and total heap allocations, and shown that fetching data synchronously normally gives better overall performance than asynchronous fetching, if data is fetched on demand.

Keywords: tools, GpH, simulator, granularity
[8] H.-W. Loidl and K. Hammond. On the Granularity of Divide-and-Conquer Parallelism. In Glasgow Workshop on Functional Programming, Workshops in Computing, Ullapool, Scotland, July 1995. Springer. [ bib | .html | .pdf ]
This paper studies the runtime behaviour of various parallel divide-and-conquer algorithms written in a non-strict functional language, when three common granularity control mechanisms are used: a simple cut-off, priority thread creation, and priority scheduling. These mechanisms use granularity information, currently provided via annotations, to improve the performance of the parallel programs.

The programs we examine are several variants of a generic divide-and-conquer program, an unbalanced divide-and-conquer algorithm and a parallel determinant computation. Our results indicate that for balanced computation trees a simple, low-overhead mechanism performs well whereas the more complex mechanisms offer further improvements for unbalanced computation trees.
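As an illustration of the simple cut-off mechanism (our own hypothetical example, not one of the paper's benchmark programs), a divide-and-conquer function can stop sparking below a size threshold, so that every thread does a worthwhile amount of work:

    import Control.Parallel (par, pseq)

    -- Parallel Fibonacci with a cut-off: below the threshold no sparks
    -- are created, keeping thread granularity coarse.
    parFib :: Int -> Int -> Integer
    parFib threshold n
      | n < 2         = fromIntegral n
      | n < threshold = seqFib n                  -- too small: sequential
      | otherwise     = x `par` (y `pseq` x + y)  -- spark one subproblem
      where
        x = parFib threshold (n - 1)
        y = parFib threshold (n - 2)

    seqFib :: Int -> Integer
    seqFib n | n < 2     = fromIntegral n
             | otherwise = seqFib (n - 1) + seqFib (n - 2)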

Keywords: models of parallelism, GpH, granularity
[7] K. Hammond, J. S. Mattson Jr., and S. L. Peyton-Jones. Automatic Spark Strategies and Granularity for a Parallel Functional Language Reducer. In CONPAR'94 - Intl. Conf. on Parallel Processing, volume 854 of Lecture Notes in Computer Science, pages 521-532, Linz, Austria, Sept. 1994. Springer. [ bib | DOI | .html ]
This paper considers the issue of dynamic task control in the context of a parallel Haskell implementation on the GRIP multiprocessor. For the first time, we report the effect of our task control strategies on task granularity, as measured in terms of dynamic heap allocations. This gives a concrete means of measuring the effectiveness of these strategies other than wall-clock timings, which are notoriously uninformative.

Keywords: models of parallelism, GpH, GRIP, granularity
[6] H.-W. Loidl and K. Hammond. GRAPH for PVM: Graph Reduction for Distributed Hardware. In IFL'94 - Intl. Workshop on the Implementation of Functional Languages, Norwich, UK, Sept. 1994. [ bib | .html ]
We describe a version of the GRAPH system (Graph Reduction for an Assortment of Parallel Hardware) designed to execute parallel functional programs on a range of distributed-memory MIMD machines running the PVM communications harness. GRAPH was developed from the runtime system for the novel GRIP multiprocessor. Although that system has proved highly successful, its novel architecture makes it hard to compare results with other architectures or implementations. The CPUs used in the GRIP design are also rather old.

The principal extensions from GRAPH for GRIP to GRAPH for PVM are intended to handle high latencies more efficiently, by exploiting asynchronous communication and multi-threaded scheduling, and by grouping packets into larger entities where possible. The main innovation is the development of new, sophisticated packet flushing and packet fetching algorithms whose purpose is to allow graphs to be exported to and fetched from global memory without inter-processor synchronisation. Because communication and reduction can be interleaved, a new mechanism for synchronising the two is necessary.

We also give a brief outlook on the design of a new portable distributed graph reduction system, GRAPH for UMM, which adapts the ideas and techniques originally developed for the GRIP system to a distributed-memory environment.

GRAPH for PVM has been implemented on a network of Sun workstations and is currently being tested and debugged. Although no extensive performance measurements have yet been performed, the available data shows a significant decrease in communication and thus an improvement in the overall performance of the system.

Keywords: implementation, GpH, run-time system
[5] K. Hammond, H.-W. Loidl, J. S. Mattson Jr., A. S. Partridge, S. L. Peyton-Jones, and P. W. Trinder. GRAPHing the Future. In IFL'94 - Intl. Workshop on the Implementation of Functional Languages, Norwich, UK, Sept. 1994. [ bib | .html ]
At Glasgow our research into parallel functional programming has been moving away from our novel architecture, GRIP, towards the provision of a general parallel runtime environment. We call this GRAPH (Graph Reduction for an Assortment of Parallel Hardware).

This paper describes the design of a new memory and load management model for GRAPH, which is intended to match shared- and distributed-memory machines better and to provide a framework for our research into parallel functional databases and granularity control. This model is currently under implementation. No performance results can therefore be provided at the present time.

Keywords: implementation, GpH, run-time system
[4] G. Akerholt, K. Hammond, S. L. Peyton-Jones, and P. W. Trinder. Processing Transactions on GRIP, a Parallel Graph Reducer. In PARLE'93 - Intl. Conf. on Parallel Architectures and Languages Europe, volume 694 of Lecture Notes in Computer Science, pages 634-647, Munich, Germany, June 1993. Springer. [ bib | DOI | .html ]
The GRIP architecture allows efficient execution of functional programs on a multi-processor built from standard hardware components. State-of-the-art compilation techniques are combined with sophisticated runtime resource-control to give good parallel performance. This paper reports the results of running GRIP on an application which is apparently unsuited to the basic functional model: a database transaction manager incorporating updates as well as lookup transactions. The results obtained show good relative speedups for GRIP, with real performance advantages over the same application executing on sequential machines.

Keywords: parallel applications, GpH, GRIP
[3] K. Hammond. Getting a GRIP. In Intl. Workshop on the Parallel Implementation of Functional Languages, Nijmegen, The Netherlands, Sept. 1993. [ bib | .html ]
This paper describes a portable implementation of the Graph Reduction in Parallel (GRIP) software, built on the Parallel Virtual Machine (PVM) process control system. Using this system, it is easy for anyone to “get a GRIP” without needing to invest in special-purpose hardware. An important contribution of this paper is its detailed description of GRIP's Intelligent Memory Unit (IMU) software.

Keywords: implementation, GpH, run-time system, GRIP
[2] K. Hammond and S. L. Peyton-Jones. Profiling Scheduling Strategies on the GRIP Multiprocessor. In Intl. Workshop on the Parallel Implementation of Functional Languages, pages 73-98, Aachen, Germany, Sept. 1992. [ bib | .html ]
It is widely claimed that functional languages are particularly suitable for programming parallel computers. A claimed advantage is that the programmer is not burdened with details of task creation, placement, scheduling, and synchronisation, these decisions being taken by the system instead. Leaving aside the question of whether a pure functional language is expressive enough to encompass all the parallel algorithms we might wish to program, there remains the question of how effectively the compiler and run-time system map the program onto a real parallel system, a task usually carried out mostly by the programmer. This is the question we address in this paper.

We first introduce the system architecture of GRIP, a shared-memory parallel machine supporting an implementation of the functional language Haskell. GRIP executes functional programs in parallel using compiled supercombinator graph reduction, a form of declarative rule system.

We then describe several strategies for run-time resource control which we have tried, presenting comprehensive measurements of their effectiveness. We are particularly concerned with strategies controlling task creation, in order to improve task granularity and minimise communication overheads. This is, so far as we know, one of the first attempts to make a systematic study of task-control strategies in a high-performance parallel functional-language system. GRIP's high absolute performance renders these results credible for real applications.

Keywords: implementation, GpH, run-time system, GRIP
[1] K. Hammond and S. L. Peyton-Jones. Some Early Experiments on the GRIP Parallel Reducer. In Intl. Workshop on the Parallel Implementation of Functional Languages, pages 51-72, Nijmegen, The Netherlands, June 1990. [ bib | .html ]
GRIP is a multiprocessor designed to execute functional programs in parallel using graph reduction. We have implemented a compiler for GRIP, based on the Spineless Tagless G-machine, and can now run parallel functional programs with substantial absolute speedup over the same program running on a uniprocessor Sun.

Parallel functional programming shifts some of the burden of resource allocation from the programmer to the system. Examples of such decisions include: when to create a new concurrent activity (or thread), when to execute such threads, where to execute them, and so on.

It is clearly desirable that the system should take such decisions, provided it does a good enough job. For example, a paged virtual memory system almost always does an adequate job, and a programmer very seldom has to interfere with it. The big question for parallel functional programming is whether good resource-allocation strategies exist, and how well they perform under a variety of conditions.

Now that we have an operational system, we are starting to carry out experiments to develop resource-allocation strategies, and measure their effectiveness. This paper reports on some very preliminary results. They mainly concern the question of when, or even whether, to create a new thread. This is an aspect which has so far received little attention - existing work has focused mainly on load sharing rather than on thread creation. Our work is aided significantly by the fact that sparks indicate only the possibility of creating a new parallel thread, not the necessity of creating this thread. It is therefore always safe to discard sparks, while affecting only the performance of the computation, not its result.
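In modern notation (Control.Parallel rather than the primitives of the GRIP system itself; the names below are illustrative), the safety of discarding sparks is visible in a two-line example:

    import Control.Parallel (par, pseq)

    -- `a `par` b` merely records a spark for a and returns b. The runtime
    -- may turn the spark into a thread or silently discard it; the value
    -- of the expression is the same either way.
    example :: Int
    example = a `par` (b `pseq` a + b)
      where
        a = costly 40                      -- may run on an idle processor
        b = costly 41
        costly n = sum [1 .. n * 100000]   -- hypothetical stand-in workload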

We begin with an overview of the GRIP system, and how we perform compiled graph reduction. This is followed by a summary of the statistics-gathering system and the results we have obtained so far.

Keywords: implementation, GpH, run-time system, GRIP