Glasgow Parallel Haskell

GpH Bibliography

@inproceedings{BrownLBH10,
  author = {Brown, Christopher and Loidl, Hans-Wolfgang and Berthold, Jost and Hammond, Kevin},
  title = {Improving your {CASH} flow: The {Computer} {Algebra} {SHell} (Extended Abstract)},
  booktitle = {IFL'10 --- Intl. Workshop on the Implementation of Functional Languages},
  month = sep,
  year = {2010},
  note = {Draft Proceedings},
  keywords = {parallel applications, Eden, symbolic computation},
  url = {http://www.cs.st-andrews.ac.uk/~hwloidl/SCIEnce/SymGrid-Par/ifl2010.pdf},
  abstract = {
    Some important challenges in the field of symbolic computation ---
    and functional programming --- are the transparent access to
    complex mathematical software, the exchange of data between
    independent systems with specialised tasks, and the exploitation of
    modern parallel hardware. One attempt to address these challenges is
    SymGrid-Par, a system for exploiting parallel hardware in the
    context of computer algebra.  Specifically, SymGrid-Par provides
    an easy-to-use platform for parallel computation that connects
    several underlying computer algebra systems, communicating through
    a standardised protocol for symbolic computation.

    In this paper we describe a new component of SymGrid-Par known as
    CaSH: the Computer Algebra SHell. CaSH is a system that allows
    direct access to SymGrid-Par via GHCi. CaSH thus allows Haskell
    programmers to exploit high-performance parallel computations
    using a system designed for solving problems in computer
    algebra, whilst still maintaining the purity and expressibility
    offered by the Haskell environment. We demonstrate access to both
    sequential and parallel services of SymGrid-Par.  For the latter
    we use parallel skeletons, implemented in the Haskell dialect of
    Eden; these skeletons are called from CaSH but exploit a
    computational algebra system known as GAP to offload the
    mathematical complexity.
  }
}
@inproceedings{JanjicH10,
  author = {Janjic, Vladimir and Hammond, Kevin},
  title = {Granularity-Aware Work-Stealing for Computationally-Uniform Grids},
  booktitle = {CCGRID 2010 --- Intl. Conf. on Cluster, Cloud and Grid Computing},
  address = {Melbourne, Australia},
  month = may,
  year = {2010},
  pages = {123--134},
  keywords = {implementation, GpH, non-uniform topology, Grid},
  doi = {http://dx.doi.org/10.1109/CCGRID.2010.49},
  abstract = {
    Good scheduling is important for ensuring effective use of Grid
    resources, while maximising parallel performance. In this paper,
    we show how a basic ``Random-Stealing'' load balancing algorithm
    for computational Grids can be improved by using information about
    the task granularity of parallel programs. We propose several
    strategies (SSL, SLL and LLL) for using granularity information to
    improve load balancing, presenting results both from simulations
    and from a real implementation (the Grid-GUM Runtime System for
    Parallel Haskell). We assume a common model of task creation which
    subsumes both master/worker and data-parallel programming
    paradigms under a task-stealing work distribution
    strategy. Overall, we achieve runtime improvements of up to 19.4
    percent for irregular problems in the real implementation, and up
    to 40 percent for the simulations (typical improvements of more
    than 15 percent for irregular programs, and of 5--10 percent for
    regular ones). Our results show that, for computationally-uniform
    Grids, advanced load balancing methods that exploit granularity
    information generally have the greatest impact on reducing the
    runtimes of irregular parallel programs. Moreover, the more
    irregular the program is, the better the improvements that can be
    achieved.
  }
}
@inproceedings{MarlowMLAT10,
  author = {Marlow, Simon and Maier, Patrick and Loidl, Hans-Wolfgang and Aswad, Mustafa K. and Trinder, Philip W.},
  title = {Seq no more: Better Strategies for Parallel {Haskell}},
  booktitle = {Haskell Symposium 2010},
  address = {Baltimore, MD, USA},
  month = sep,
  year = {2010},
  publisher = {ACM Press},
  note = {To appear},
  keywords = {implementation, GpH, evaluation strategies},
  url = {http://www.macs.hw.ac.uk/~dsg/gph/papers/abstracts/new-strategies.html},
  abstract = {
    We present a complete redesign of evaluation strategies, a key
    abstraction for specifying pure, deterministic parallelism in
    Haskell.  Our new formulation preserves the compositionality and
    modularity benefits of the original, while providing significant
    new benefits.  First, we introduce an evaluation-order monad to
    provide clearer, more generic, and more efficient specification of
    parallel evaluation.  Secondly, the new formulation resolves a
    subtle space management issue with the original strategies,
    allowing parallelism (sparks) to be preserved while reclaiming
    heap associated with superfluous parallelism.  Related to this,
    the new formulation provides far better support for speculative
    parallelism as the garbage collector now prunes unneeded
    speculation.  Finally, the new formulation provides improved
    compositionality: we can directly express parallelism embedded
    within lazy data structures, producing more compositional
    strategies, and our basic strategies are parametric in the
    coordination combinator, facilitating a richer set of parallelism
    combinators.

    We give measurements over a range of benchmarks demonstrating that
    the runtime overheads of the new formulation relative to the
    original are low, and the new strategies even yield slightly
    better speedups on average than the original strategies.
  }
}
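
The redesigned strategies described above later shipped as the Eval monad
in the parallel package. A minimal sketch of the new style (combinator
names as in Control.Parallel.Strategies; details are illustrative, not
taken from the paper):

    import Control.Parallel.Strategies

    -- Evaluate both components of a pair in parallel, then combine.
    parPair :: (a -> b) -> (a, a) -> (b, b)
    parPair f (x, y) = runEval $ do
      x' <- rpar (f x)      -- spark f x for parallel evaluation
      y' <- rseq (f y)      -- evaluate f y on the current thread
      _  <- rseq x'         -- wait for the sparked result
      return (x', y')

    -- The same compositional style over lists:
    parMap' :: NFData b => (a -> b) -> [a] -> [b]
    parMap' f xs = map f xs `using` parList rdeepseq
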
@inproceedings{AswadTZMB09,
  author = {Aswad, Mustafa K. and Trinder, Philip W. and Al Zain, Abdallah and Michaelson, Greg J. and Berthold, Jost},
  title = {Low Pain vs No Pain Multi-core {Haskells}},
  booktitle = {TFP'09 --- Symp. on Trends in Functional Programming},
  address = {Komarno, Slovakia},
  month = jun,
  year = {2009},
  publisher = {Intellect},
  series = {Trends in Functional Programming},
  volume = {10},
  note = {To appear},
  keywords = {models of parallelism, GpH, Eden, GHC-SMP, GUM, multicore},
  url = {http://www.macs.hw.ac.uk/~trinder/papers/LowvsNoPainHaskells.pdf},
  abstract = {
    Multicores are becoming the dominant processor technology and
    functional languages are theoretically well suited to exploit
    them. In practice, however, implementing effective high level
    parallel functional languages is extremely challenging.

    This paper is the first programming and performance comparison of
    functional multicore technologies and reports some of the first
    ever multicore results for two languages. As such it reflects the
    growing maturity of the field by systematically evaluating four
    parallel Haskell implementations on a common multicore
    architecture.  The comparison contrasts the programming effort
    each language requires with the parallel performance
    delivered. The study uses 15 'typical' programs to compare a 'no
    pain', i.e. entirely implicit, parallel language with three 'low
    pain', i.e. semi-explicit, languages.

    The parallel Haskell implementations use different versions of GHC
    compiler technology, and hence the comparative performance metric
    is speedup, which normalises against sequential performance. We
    ground the speedup comparisons by reporting both sequential and
    parallel runtimes and efficiencies for three of the languages. Our
    experiments focus on the number of programs improved, the absolute
    speedups delivered, the parallel scalability, and the program
    changes required to coordinate parallelism. The results are
    encouraging and, on occasion, surprising.
  }
}
@inproceedings{BertholdMHZ09,
  author = {Berthold, Jost and Marlow, Simon and Hammond, Kevin and Al Zain, Abdallah},
  title = {Comparing and Optimising Parallel {Haskell} Implementations for Multicore Machines},
  booktitle = {ADPNA 2009 --- Intl. Workshop on Advanced Distributed and Parallel Network Applications},
  address = {Vienna, Austria},
  month = sep,
  year = {2009},
  publisher = {IEEE Computer Society},
  pages = {386--393},
  keywords = {models of parallelism, implementation, GpH, Eden, multicore},
  doi = {http://doi.ieeecomputersociety.org/10.1109/ICPPW.2009.10},
  abstract = {
    In this paper, we investigate the differences and tradeoffs
    imposed by two parallel Haskell dialects running on multicore
    machines. GpH and Eden are both constructed using the
    highly-optimising sequential GHC compiler, and share thread
    scheduling, and other elements, from a common code base. The GpH
    implementation investigated here uses a physically-shared heap,
    which should be well-suited to multicore architectures. In
    contrast, the Eden implementation adopts an approach that has been
    designed for use on distributed-memory parallel machines: a system
    of multiple, independent heaps (one per core), with inter-core
    communication handled by message-passing rather than through
    shared heap cells. We report two main results. Firstly, we report
    on the effect of a number of optimisations that we applied to the
    shared-memory GpH implementation in order to address some
    performance issues that were revealed by our testing: for example,
    we implemented a work-stealing approach to task allocation. Our
    optimisations improved the performance of the shared-heap GpH
    implementation by as much as 30 percent on eight cores. Secondly,
    the shared heap approach is, rather surprisingly, not superior to
    a distributed heap implementation: both give similar performance
    results.
  }
}
@inproceedings{JonesMS09,
  author = {{Jones Jr.}, Don and Marlow, Simon and Singh, Satnam},
  title = {Parallel Performance Tuning for {Haskell}},
  booktitle = {Haskell Symposium 2009},
  address = {Edinburgh, Scotland},
  month = sep,
  year = {2009},
  publisher = {ACM Press},
  pages = {81--92},
  keywords = {tools, GpH, GHC-SMP, multicore, profiling},
  doi = {http://doi.acm.org/10.1145/1596638.1596649},
  url = {http://research.microsoft.com/apps/pubs/default.aspx?id=80976},
  abstract = {
    Parallel Haskell programming has entered the mainstream with
    support now included in GHC for multiple parallel programming
    models, along with multicore execution support in the
    runtime. However, tuning programs for parallelism is still
    something of a black art. Without much in the way of feedback
    provided by the runtime system, it is a matter of trial and error
    combined with experience to achieve good parallel speedups.

    This paper describes an early prototype of a parallel profiling
    system for multicore programming with GHC. The system comprises
    three parts: fast event tracing in the runtime, a Haskell library
    for reading the resulting trace files, and a number of tools built
    on this library for presenting the information to the
    programmer. We focus on one tool in particular, a graphical
    timeline browser called ThreadScope.

    The paper illustrates the use of ThreadScope through a number of
    case studies, and describes some useful methodologies for
    parallelizing Haskell programs.
  }
}
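
ThreadScope reads the event log written by an instrumented GHC runtime.
A minimal end-to-end sketch (flags as commonly documented for GHC and
ThreadScope; exact options may vary by version):

    -- Par.hs: two coarse computations sparked in parallel.
    -- Build and trace (commands shown as comments):
    --   ghc -O2 -threaded -eventlog -rtsopts Par.hs
    --   ./Par +RTS -N2 -ls        -- writes Par.eventlog
    --   threadscope Par.eventlog
    import Control.Parallel (par, pseq)

    main :: IO ()
    main = a `par` (b `pseq` print (a + b))
      where
        a = sum [1 .. 10000000] :: Integer
        b = sum [1 .. 20000000] :: Integer
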
@inproceedings{MarlowPS09,
  author = {Marlow, Simon and Peyton-Jones, Simon L. and Singh, Satnam},
  title = {Runtime Support for Multicore {Haskell}},
  booktitle = {ICFP 2009 --- Intl. Conf. on Functional Programming},
  address = {Edinburgh, Scotland},
  month = aug,
  year = {2009},
  publisher = {ACM Press},
  pages = {65--78},
  keywords = {implementation, GpH, GHC-SMP, multicore},
  doi = {http://doi.acm.org/10.1145/1596550.1596563},
  url = {http://research.microsoft.com/apps/pubs/default.aspx?id=79856},
  abstract = {
    Purely functional programs should run well on parallel hardware
    because of the absence of side effects, but it has proved hard to
    realise this potential in practice. Plenty of papers describe
    promising ideas, but vastly fewer describe real implementations
    with good wall-clock performance. We describe just such an
    implementation, and quantitatively explore some of the complex
    design tradeoffs that make such implementations hard to build. Our
    measurements are necessarily detailed and specific, but they are
    reproducible, and we believe that they offer some general
    insights.
  }
}
@inproceedings{TrinderCLM09,
  author = {Trinder, Philip W. and Cole, Murray I. and Loidl, Hans-Wolfgang and Michaelson, Greg J.},
  title = {Characterising Effective Resource Analyses for Parallel and Distributed Coordination},
  booktitle = {FOPARA 2009 --- Intl. Workshop on Foundational and Practical Aspects of Resource Analysis, Revised Selected Papers},
  address = {Eindhoven, The Netherlands},
  month = nov,
  year = {2009},
  publisher = {Springer},
  series = {Lecture Notes in Computer Science},
  volume = {6324},
  pages = {67--83},
  keywords = {semantics, models of parallelism, survey, resource analysis},
  doi = {http://dx.doi.org/10.1007/978-3-642-15331-0_5},
  url = {http://www.macs.hw.ac.uk/~trinder/papers/FOPARA09.pdf},
  abstract = {
    An important application of resource analysis is to improve the
    performance of parallel and distributed programs. In this context
    key resources are time, space and communication. Given the
    spectrum of cost models and associated analysis techniques
    available, what combination should be selected for a specific
    parallel or distributed context?

    We address the question as follows. We outline a continuum of
    coordination cost models and a range of analysis techniques. We
    consider six representative parallel/distributed applications of
    resource analysis techniques, and aim to extract general
    principles governing why the combination of techniques is
    effective in its context.
  }
}
@inproceedings{ZainHBTMA09,
  author = {Al Zain, Abdallah and Hammond, Kevin and Berthold, Jost and Trinder, Philip W. and Michaelson, Greg J. and Aswad, Mustafa K.},
  title = {Low-Pain, High-Gain Multicore Programming in {Haskell}: Coordinating Irregular Symbolic Computations on Multicore Architectures},
  booktitle = {DAMP'09 --- Intl. Workshop on Declarative Aspects of Multicore Programming},
  address = {Savannah, GA, USA},
  month = jan,
  year = {2009},
  publisher = {ACM Press},
  pages = {25--36},
  keywords = {parallel applications, Eden, multicore, symbolic computation},
  doi = {http://doi.acm.org/10.1145/1481839.1481843},
  url = {http://www.macs.hw.ac.uk/~trinder/papers/damp09.pdf},
  abstract = {
    With the emergence of commodity multicore architectures,
    exploiting tightly-coupled parallelism has become increasingly
    important. Functional programming languages, such as Haskell, are,
    in principle, well placed to take advantage of this trend,
    offering the ability to easily identify large amounts of
    fine-grained parallelism. Unfortunately, real performance
    benefits have often proved hard to realise in practice.

    This paper reports on a new approach using middleware that has
    been constructed using the Eden parallel dialect of Haskell. Our
    approach is ``low pain'' in the sense that the programmer
    constructs a parallel program by inserting a small number of
    higher-order algorithmic skeletons at key points in the
    program. It is ``high gain'' in the sense that we are able to get
    good parallel speedups. Our approach is unusual in that we do not
    attempt to use shared memory directly, but rather coordinate
    parallel computations using a message-passing implementation. This
    approach has a number of advantages. Firstly, coordination,
    i.e. locking and communication, is both confined to limited shared
    memory areas, essentially the communication buffers, and is also
    isolated within well-understood libraries. Secondly, the coarse
    thread granularity that we obtain reduces coordination overheads,
    so locks are normally needed only on (relatively large) messages,
    and not on individual data items, as is often the case for simple
    shared-memory implementations. Finally, cache coherency
    requirements are reduced since individual tasks do not share
    caches, and can garbage collect independently.

    We report results for two representative computational algebra
    problems. Computational algebra is a challenging application area
    that has not been widely studied in the general parallelism
    community. Computational algebra applications have high
    computational demands, and are, in principle, often suitable for
    parallel execution, but usually display a high degree of
    irregularity in terms of both task and data structure. This makes
    it difficult to construct parallel applications that perform well
    in practice. Using our system, we are able to obtain both
    extremely good processor utilisation (97 percent) and very good
    absolute speedups (up to 7.7) on an eight-core machine.
  }
}
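
The higher-order skeletons referred to above follow Eden's pattern: the
whole parallel structure is introduced by a single combinator. A
self-contained sketch with a sequential stand-in (the parMap signature
follows the Eden literature; treat it as illustrative rather than the
exact middleware API):

    -- Placeholder for Eden's Trans class of transmissible types,
    -- so the sketch compiles without the Eden runtime.
    class Trans a
    instance Trans Int

    -- Skeleton interface; the body here is a sequential stand-in,
    -- whereas Eden's real parMap creates one process per element.
    parMap :: (Trans a, Trans b) => (a -> b) -> [a] -> [b]
    parMap = map

    -- Parallelising a pipeline is then a local change, e.g.
    --   map expensiveStep tasks  ==>  parMap expensiveStep tasks
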
@inproceedings{BertholdZL08,
  author = {Berthold, Jost and Al Zain, Abdallah and Loidl, Hans-Wolfgang},
  title = {Scheduling Light-Weight Parallelism in {ArTCoP}},
  booktitle = {PADL 2008 --- Intl. Symp. on Practical Aspects of Declarative Languages},
  address = {San Francisco, CA, USA},
  month = jan,
  year = {2008},
  publisher = {Springer},
  series = {Lecture Notes in Computer Science},
  volume = {4902},
  pages = {214--229},
  keywords = {implementation, run-time system},
  doi = {http://dx.doi.org/10.1007/978-3-540-77442-6_15},
  url = {http://www.tcs.informatik.uni-muenchen.de/~hwloidl/publications/PADL08.pdf},
  abstract = {
    We present the design and prototype implementation of the
    scheduling component in ArTCoP (architecture transparent control
    of parallelism), a novel run-time environment (RTE) for parallel
    execution of high-level languages.  A key feature of ArTCoP is its
    support for deep process and memory hierarchies, shown in the
    scheduler by supporting light-weight threads.  To realise a system
    with easily exchangeable components, the system defines a
    micro-kernel, providing basic infrastructure, such as garbage
    collection. All complex RTE operations, including the handling of
    parallelism, are implemented at a separate system level.  By
    choosing Concurrent Haskell as the high-level system language, we
    obtain a prototype in the form of an executable specification that
    is easier to maintain and more flexible than conventional RTEs.
    We demonstrate the flexibility of this approach by presenting
    implementations of a scheduler for light-weight threads in ArTCoP,
    based on GHC Version 6.6.
  }
}
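
The "executable specification" idea above can be seen in miniature: a
scheduler is just an ordinary Haskell function over a task queue. A toy
sketch of ours (not ArTCoP code):

    import Control.Concurrent.MVar

    -- Run queued IO tasks in order; tasks may enqueue further work
    -- through the shared queue before they finish.
    scheduler :: MVar [IO ()] -> IO ()
    scheduler q = do
      tasks <- takeMVar q
      case tasks of
        []       -> putMVar q []                  -- nothing left: stop
        (t : ts) -> putMVar q ts >> t >> scheduler q

    main :: IO ()
    main = do
      q <- newMVar [putStrLn "task 1", putStrLn "task 2"]
      scheduler q
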
@incollection{LoidlTHZB08,
  author = {Loidl, Hans-Wolfgang and Trinder, Philip W. and Hammond, Kevin and Al Zain, Abdallah and Baker-Finch, Clement A.},
  editor = {Alexander, Michael and Gardner, Bill},
  title = {Semi-Explicit Parallel Programming in a Purely Functional Style: {GpH}},
  booktitle = {Process Algebra for Parallel and Distributed Processing: Algebraic Languages in Specification-Based Software Development},
  publisher = {Chapman and Hall},
  month = dec,
  year = {2008},
  pages = {47--76},
  keywords = {semantics, GpH},
  url = {http://www.macs.hw.ac.uk/~trinder/papers/PAPDP.pdf},
  abstract = {
    Declarative programming languages can play an important role in
    the process of designing and implementing parallel systems. They
    bridge the gap between a high-level specification, with proven
    properties of the overall system, and the execution of the system
    on real hardware.  Efficiently exploiting parallelism on a wide
    range of architectures is a challenging task and should in our
    view be handled by a sophisticated runtime environment.  Based on
    this design philosophy we have developed and formalised Glasgow
    parallel Haskell (GpH), and implemented it as a conservative
    extension of the Glasgow Haskell Compiler.

    The high-level nature of declarative languages eases the task of
    mapping an algebraic specification down to executable code.  In
    fact, the operational components of the specification can already
    be considered an implementation, with the associated properties
    acting as assertions to the code. Based on a formal model of the
    declarative language, the validity of these properties can be
    established by manual proof, which works on a level of detail
    similar to the specification language itself.  Many operational
    aspects, usually complicating a proof of an implementation, do not
    come into the picture at this level.  Most importantly,
    unnecessary sequentialisation of the code is avoided.

    However, the goal of implicit parallelism has proven an elusive
    one. Often the automatically generated threads are too
    fine-grained to be efficient.  In other cases the
    data-dependencies between expressions prohibit the generation of a
    sufficient amount of parallelism.  Thus, we employ an approach of
    semi-explicit parallelism, where only potential parallelism has to
    be annotated in a program, and all aspects of coordination are
    delegated to the runtime environment.  A corresponding formal
    model, in the form of a structured operational semantics, handling
    pools of realised and of potential parallelism, is used to
    establish the correctness of the code employing semi-explicit
    parallelism.  The runtime environment itself is capable of
    synchronising parallel threads, using automatic blocking on data
    under evaluation, and by simulating virtual shared memory across
    networks of machines.  Because it is embedded in the optimised
    runtime environment of a sequential language, our system achieves
    efficient execution of a high-level language close to the original
    specification language, minimises the programmer effort needed to
    parallelise the code, and scales to large applications that can be
    executed on heterogeneous networks and computational Grids.
  }
}
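
The semi-explicit model described in this chapter rests on just two
combinators: par, which records potential parallelism as a spark, and
pseq, which fixes evaluation order. A classic illustration (a sketch,
not taken from the chapter):

    import Control.Parallel (par, pseq)

    -- Spark the first recursive call; the runtime decides whether it
    -- actually runs in parallel. All other coordination is implicit.
    pfib :: Int -> Int
    pfib n
      | n < 2     = n
      | otherwise = x `par` (y `pseq` x + y)
      where
        x = pfib (n - 1)
        y = pfib (n - 2)

    main :: IO ()
    main = print (pfib 30)
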
@inproceedings{MarlowHJP08,
  author = {Marlow, Simon and Harris, Tim and James, Roshan P. and Peyton-Jones, Simon L.},
  title = {Parallel Generational-Copying Garbage Collection with a Block-structured Heap},
  booktitle = {ISMM 2008 --- Intl. Symp. on Memory Management},
  address = {Tucson, AZ, USA},
  month = jun,
  year = {2008},
  publisher = {ACM Press},
  pages = {11--20},
  keywords = {implementation, GpH, GHC-SMP, multicore},
  doi = {http://doi.acm.org/10.1145/1375634.1375637},
  abstract = {
    We present a parallel generational-copying garbage collector
    implemented for the Glasgow Haskell Compiler. We use a
    block-structured memory allocator, which provides a natural
    granularity for dividing the work of GC between many threads,
    leading to a simple yet effective method for parallelising copying
    GC. The results are encouraging: we demonstrate wall-clock
    speedups in GC time of a factor of 2 on average, on a commodity
    4-core machine with no programmer intervention, compared to our
    best sequential GC.
  }
}
@phdthesis{Vasconcelos08,
  author = {Vasconcelos, Pedro B.},
  title = {Cost Inference and Analysis for Recursive Functional Programs},
  school = {University of St Andrews},
  month = feb,
  year = {2008},
  keywords = {semantics, resource analysis, thesis},
  url = {http://www.ncc.up.pt/~pbv/research/PB_Vasconcelos_PhD_thesis.pdf},
  abstract = {
    Programming resource-sensitive systems, such as real-time embedded
    systems, requires guaranteeing both the functional correctness of
    computations and also that time and space usage fit within
    constraints imposed by hardware limits or the environment.
    Functional programming languages have proved very good at meeting
    the former logical kind of guarantees but not the latter resource
    guarantees.

    This thesis contributes to demonstrating the applicability of
    functional programming in resource-sensitive systems, providing an
    automatic program analysis for obtaining guaranteed upper bounds
    on the dynamic space usage of functional programs.

    Our analysis is developed for a core subset of Hume, a
    domain-specific functional language targeting resource-sensitive
    systems (Hammond et al. 2007), and presented as a type and effect
    system that builds on previous sized type systems (Hughes et
    al. 1996, Chin and Khoo 2001) and effect systems for costs (Dornic
    et al. 1992, Reistad and Gifford 1994, Hughes and Pareto 1999). It
    extends previous approaches by using abstract interpretation
    techniques to automatically infer linear approximations of the
    sizes of recursive data types and the stack and heap costs of
    recursive functions.

    The correctness of the analysis is formally proved with respect to
    an operational semantics for the language, and an inference
    algorithm that automatically reconstructs size and cost bounds is
    presented.

    A prototype implementation of the analysis and operational
    semantics has been constructed and used to experimentally assess
    the quality of the cost bounds with some examples, including
    implementations of textbook functional programming algorithms and
    simplified embedded systems.
  }
}
@article{ZainTML08,
  author = {Al Zain, Abdallah and Trinder, Philip W. and Michaelson, Greg J. and Loidl, Hans-Wolfgang},
  title = {Evaluating a High-Level Parallel Language ({GpH}) for Computational {GRIDs}},
  journal = {IEEE Transactions on Parallel and Distributed Systems},
  volume = {19},
  number = {2},
  year = {2008},
  pages = {219--233},
  keywords = {models of parallelism, GpH, non-uniform topology, GridGUM},
  doi = {http://doi.ieeecomputersociety.org/10.1109/TPDS.2007.70728},
  url = {http://www.macs.hw.ac.uk/~trinder/papers/TPDS.pdf},
  abstract = {
    Computational Grids potentially offer low cost, readily available,
    and large-scale high-performance platforms. For the parallel
    execution of programs, however, computational Grids pose serious
    challenges: they are heterogeneous, and have hierarchical and
    often shared interconnects, with high and variable latencies
    between clusters.

    This paper investigates whether a programming language with
    high-level parallel coordination and a Distributed Shared Memory
    model (DSM) can deliver good, and scalable, performance on a range
    of computational Grid configurations. The high-level language,
    Glasgow parallel Haskell (GpH), abstracts over the architectural
    complexities of the computational Grid, and we have developed
    GridGUM2, a sophisticated grid-specific implementation of GpH, to
    produce the first high-level DSM parallel language implementation
    for computational Grids.

    We report a systematic performance evaluation of GridGUM2 on
    combinations of high-/low-latency and homo-/heterogeneous computational
    Grids. We measure the performance of a small set of kernel
    parallel programs representing a variety of application areas, two
    parallel paradigms, and ranges of communication degree and
    parallel irregularity. We investigate GridGUM2's performance
    scalability on medium-scale heterogeneous and high-latency
    computational Grids, and analyse the performance with respect to
    the program characteristics of communication frequency and degree
    of irregular parallelism.
  }
}
@inproceedings{HammondZCPT07,
  author = {Hammond, Kevin and Al Zain, Abdallah and Cooperman, Gene and Petcu, Dana and Trinder, Philip W.},
  title = {{SymGrid}: A Framework for Symbolic Computation on the Grid},
  booktitle = {Euro-Par 2007 --- Intl. Conf. on Parallel Processing},
  address = {Rennes, France},
  month = aug,
  year = {2007},
  publisher = {Springer},
  series = {Lecture Notes in Computer Science},
  volume = {4641},
  pages = {457--466},
  keywords = {parallel applications, symbolic computation},
  doi = {http://dx.doi.org/10.1007/978-3-540-74466-5_49},
  url = {http://www.macs.hw.ac.uk/~trinder/papers/europar2007.pdf},
  abstract = {
    This paper introduces the design of SymGrid, a new Grid framework
    that will, for the first time, allow multiple invocations of
    symbolic computing applications to interact via the Grid. SymGrid
    is designed to support the specific needs of symbolic computation,
    including computational steering (greater interactivity), complex
    data structures, and domain-specific computational patterns (for
    irregular parallelism). A key issue is heterogeneity: SymGrid is
    designed to orchestrate components from different symbolic systems
    into a single coherent (possibly parallel) Grid application,
    building on the OpenMath standard for data exchange between
    mathematically-oriented applications. The work is being developed
    as part of a major EU infrastructure project.
  }
}
@inproceedings{ZainHTLLC07,
  author = {Al Zain, Abdallah and Hammond, Kevin and Trinder, Philip W. and Linton, Steve and Loidl, Hans-Wolfgang and Costanti, Marco},
  title = {{SymGrid-Par}: Designing a Framework for Executing Computational Algebra Systems on Computational Grids},
  booktitle = {PAPP 2007 --- Intl. Workshop on Practical Aspects of High-level Parallel Programming},
  address = {Beijing, China},
  month = may,
  year = {2007},
  publisher = {Springer},
  series = {Lecture Notes in Computer Science},
  volume = {4488},
  pages = {617--624},
  keywords = {parallel applications, symbolic computation},
  doi = {http://dx.doi.org/10.1007/978-3-540-72586-2_90},
  url = {http://www.macs.hw.ac.uk/~trinder/papers/papp2007.pdf},
  abstract = {
    SymGrid-Par is a new framework for executing large computer
    algebra problems on computational Grids. We present the design of
    SymGrid-Par, which supports multiple computer algebra packages,
    and hence provides the novel possibility of composing a system
    using components from different packages.  Orchestration of the
    components on the Grid is provided by a Grid-enabled parallel
    Haskell (GpH).  We present a prototype implementation of a core
    component of SymGrid-Par, together with promising measurements of
    two programs on a modest Grid to demonstrate the feasibility of
    our approach.
  }
}
@inproceedings{ZainTLM07,
  author = {Al Zain, Abdallah and Trinder, Philip W. and Loidl, Hans-Wolfgang and Michaelson, Greg J.},
  title = {Supporting High-Level Grid Parallel Programming: the Design and Implementation of {Grid-GUM2} (Full Paper)},
  booktitle = {UK e-Science 2007 All Hands Meeting},
  address = {Nottingham, UK},
  month = sep,
  year = {2007},
  keywords = {implementation, GpH, non-uniform topology, GridGUM},
  url = {http://www.macs.hw.ac.uk/~dsg/gph/papers/pdf/GridGUM2Design.pdf},
  abstract = {
    Computational Grids are much more complex than classical
    high-performance architectures: they have high and dynamically
    varying communication latencies, and comprise heterogeneous
    collections of processing elements.  We argue that such a complex
    and dynamic architecture is extremely challenging for the
    programmer to explicitly manage, and advocate a high-level
    parallel programming language, like GpH, where the programmer
    controls only a few key aspects of the parallel coordination. Such
    a high-level language requires a sophisticated implementation to
    manage parallel execution.

    This paper presents the design and implementation of Grid-GUM2, a
    Grid-specific runtime environment for GpH.  At the core of the design
    are monitoring mechanisms that collect static and partial dynamic
    information, and new load management mechanisms that use the
    information to better schedule tasks on computational Grids. These
    mechanisms are built on top of a virtual shared memory (VSM),
    graph-reduction-based computation engine. A systematic evaluation
    of the implementation, reported elsewhere, shows acceptable
    scaleup and good speedups on a range of computational Grids.
  }
}
@article{ZainTLM06,
  author = {Al Zain, Abdallah and Trinder, Philip W. and Loidl, Hans-Wolfgang and Michaelson, Greg J.},
  title = {Managing Heterogeneity in a Grid Parallel {Haskell}},
  journal = {Scalable Computing: Practice and Experience},
  volume = {7},
  number = {3},
  year = {2006},
  pages = {9--25},
  note = {Selected papers from PAPP 2005 --- Intl. Workshop on Practical Aspects of High-level Parallel Programming, Atlanta, GA, USA, May 2005},
  keywords = {implementation, GpH, non-uniform topology, GridGUM},
  url = {http://www.macs.hw.ac.uk/~trinder/papers/PAPP-Journal.pdf},
  abstract = {
    Computational Grids potentially offer cheap large-scale
    high-performance systems, but are a very challenging architecture,
    being heterogeneous, shared and hierarchical. Rather than
    requiring a programmer to explicitly manage this complex
    environment, we recommend using a high-level parallel functional
    language, like GpH, with largely automatic management of parallel
    coordination.

    We present GRID-GUM, an initial port of the distributed virtual
    shared-memory implementation of GpH for computational Grids. We
    show that GRID-GUM delivers acceptable speedups on relatively low
    latency homogeneous and heterogeneous computational
    Grids. Moreover, we find that for heterogeneous computational
    Grids, load management limits performance.

    We present the initial design of GRID-GUM2, which incorporates new
    load management mechanisms that cheaply and effectively combine
    static and dynamic information to adapt to heterogeneous
    Grids. The mechanisms are evaluated by measuring four non-trivial
    programs with different parallel properties. The measurements show
    that the new mechanisms improve load distribution over the
    original implementation, reducing runtime by factors ranging from
    17 to 57 percent, and the greatest improvement is obtained for the
    most dynamic program.
  }
}
@inproceedings{HarrisMP05,
  author = {Harris, Tim and Marlow, Simon and Peyton-Jones, Simon L.},
  title = {{Haskell} on a Shared-Memory Multiprocessor},
  booktitle = {Haskell Workshop 2005},
  address = {Tallinn, Estonia},
  month = sep,
  year = {2005},
  publisher = {ACM Press},
  pages = {49--61},
  keywords = {implementation, GpH, GHC-SMP, multicore},
  doi = {http://doi.acm.org/10.1145/1088348.1088354},
  url = {http://research.microsoft.com/apps/pubs/default.aspx?id=67424},
  abstract = {
    Multi-core processors are coming, and we need ways to program
    them. The combination of purely-functional programming and
    explicit, monadic threads, communicating using transactional
    memory, looks like a particularly promising way to do so. This
    paper describes a full-scale implementation of shared-memory
    parallel Haskell, based on the Glasgow Haskell Compiler. Our main
    technical contribution is a lock-free mechanism for evaluating
    shared thunks that eliminates the major performance bottleneck in
    parallel evaluation of a lazy language. Our results are
    preliminary but promising: we can demonstrate wall-clock speedups
    of a serious application (GHC itself), even with only two
    processors, compared to the same application compiled for a
    uni-processor.
  }
}
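
The programming model evaluated above, explicit monadic threads
communicating through transactional memory, looks as follows in
Concurrent Haskell (a minimal sketch using today's stm API):

    import Control.Concurrent
    import Control.Concurrent.STM

    main :: IO ()
    main = do
      counter <- atomically (newTVar (0 :: Int))
      done    <- newEmptyMVar
      let bump = atomically $ do
            n <- readTVar counter
            writeTVar counter (n + 1)   -- an atomic transaction
      mapM_ (\_ -> forkIO (bump >> putMVar done ())) [1 .. 10 :: Int]
      mapM_ (\_ -> takeMVar done) [1 .. 10 :: Int]
      print =<< atomically (readTVar counter)   -- prints 10
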
@article{LangeL05,
  author = {Lange, Martin and Loidl, Hans-Wolfgang},
  title = {Parallel and Symbolic Model Checking for Fixpoint Logic with Chop},
  journal = {Electr. Notes Theor. Comput. Sci.},
  volume = {128},
  number = {3},
  year = {2005},
  pages = {125--138},
  note = {Selected Papers from PDMC'04 --- Intl. Workshop on Parallel and Distributed Techniques in Verification, London, UK, Sep 2004},
  keywords = {parallel applications, GpH, model checking},
  doi = {http://dx.doi.org/10.1016/j.entcs.2004.10.023},
  url = {http://www.macs.hw.ac.uk/~dsg/gph/papers/ps/pdmc04.ps},
  abstract = {
    We consider the model checking problem for FLC, a modal fixpoint
    logic capable of defining non-regular properties. This paper
    presents a refinement of a symbolic model checker and discusses
    how to parallelise this algorithm. It reports on a prototype
    implementation of the algorithm in Glasgow Parallel Haskell (GpH)
    and its performance on a cluster of workstations.
  }
}
@article{LoidlRSHHKLMPPRT03,
  author = {Loidl, Hans-Wolfgang and Rubio, Fernando and Scaife, Norman and Hammond, Kevin and Horiguchi, Susumu and Klusik, Ulrike and Loogen, Rita and Michaelson, Greg J. and Pe{\~n}a, Ricardo and Priebe, Steffen and {Reb{\'o}n Portillo}, {\'A}lvaro J. and Trinder, Philip W.},
  title = {Comparing Parallel Functional Languages: Programming and Performance},
  journal = {Higher-Order and Symbolic Computation},
  volume = {16},
  number = {3},
  year = {2003},
  pages = {203--251},
  keywords = {models of parallelism, GpH, Eden, PMLS, distributed memory},
  doi = {http://dx.doi.org/10.1023/A:1025641323400},
  url = {http://www.macs.hw.ac.uk/~dsg/gph/papers/ps/HOSC03.ps},
  abstract = {
    This paper presents a practical evaluation and comparison of three
    state-of-the-art parallel functional languages. The evaluation is
    based on implementations of three typical symbolic computation
    programs, with performance measured on a Beowulf-class parallel
    architecture.

    We assess three mature parallel functional languages: PMLS, a
    system for implicitly parallel execution of ML programs; GPH, a
    mainly implicit parallel extension of Haskell; and Eden, a more
    explicit parallel extension of Haskell designed for both
    distributed and parallel execution. While all three languages
    employ a completely implicit approach to communication, each
    language takes a different approach to specifying and controlling
    parallelism, ranging from explicit identification of processes as
    language constructs (Eden) through annotation of potential
    parallelism (GPH) to automatic detection of parallel skeletons in
    sequential code (PMLS).

    We present detailed performance measurements of all three systems
    on a widely available parallel architecture: a Beowulf cluster of
    low-cost commodity workstations. We use three representative
    symbolic applications: a matrix multiplication algorithm, an exact
    linear system solver, and a simple ray-tracer. Our results show
    how moderate speedups can be achieved with little or no change to
    the sequential code, and that parallel performance can be
    significantly improved even within our high-level model of
    parallel functional programming by controlling key aspects of the
    program such as load distribution and thread granularity.
  }
}
@inproceedings{BoisPLT02,
  author = {Du Bois, Andr{\'e} Rauber and Pointon, Robert F. and Loidl, Hans-Wolfgang and Trinder, Philip W.},
  title = {Implementing Declarative Parallel Bottom-Avoiding Choice},
  booktitle = {SBAC-PAD 2002 --- Symp. on Computer Architecture and High Performance Computing},
  address = {Vitoria, Brazil},
  month = oct,
  year = {2002},
  publisher = {IEEE Computer Society},
  pages = {82--92},
  keywords = {implementation, semantics, GpH, speculative parallelism},
  doi = {http://doi.ieeecomputersociety.org/10.1109/CAHPC.2002.1180763},
  url = {http://www.macs.hw.ac.uk/~dsg/gph/papers/ps/pad02.ps},
  abstract = {
    Non-deterministic choice supports efficient parallel speculation,
    but unrestricted non-determinism destroys the referential
    transparency of purely-declarative languages by removing
    unfoldability, and risks wasting resources on
    unnecessary computations. While numerous choice mechanisms have
    been proposed that preserve unfoldability, and some concurrent
    implementations exist, we believe that no compiled parallel
    implementation has previously been constructed. This paper
    presents the design, semantics, implementation and use of a family
    of bottom-avoiding choice operators for Glasgow parallel
    Haskell. The subtle semantic properties of our choice operations
    are described, including a careful classification using an
    existing framework, together with a discussion of operational
    semantics issues and the pragmatics of distributed memory
    implementation. The expressiveness of our choice operators is
    demonstrated by constructing a branch and bound search, a merge
    and a speculative conditional. Their effectiveness is demonstrated
    by comparing the parallel performance of the speculative search
    with naive and 'perfect' implementations. Their efficiency is
    assessed by measuring runtime overhead and heap consumption.
  }
}
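
For intuition, a bottom-avoiding binary choice can be simulated in
plain Concurrent Haskell by racing two evaluations (our illustration
only; the paper's operators are compiled GpH primitives with carefully
classified semantics):

    import Control.Concurrent
    import Control.Exception (evaluate)
    import Control.Monad (void)

    -- Return whichever argument reaches weak head normal form first;
    -- the result is bottom only if both arguments are.
    amb :: a -> a -> IO a
    amb x y = do
      m  <- newEmptyMVar
      t1 <- forkIO (evaluate x >>= void . tryPutMVar m)
      t2 <- forkIO (evaluate y >>= void . tryPutMVar m)
      r  <- takeMVar m
      killThread t1
      killThread t2
      return r

    main :: IO ()
    main = print =<< amb (length (repeat ())) 42   -- prints 42
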
@inproceedings{BoisLT02,
  author = {Du Bois, Andr{\'e} Rauber and Loidl, Hans-Wolfgang and Trinder, Philip W.},
  title = {Thread Migration in a Parallel Graph Reducer},
  booktitle = {IFL'02 --- Intl. Workshop on the Implementation of Functional Languages},
  address = {Madrid, Spain},
  month = sep,
  year = {2002},
  publisher = {Springer},
  series = {Lecture Notes in Computer Science},
  volume = {2670},
  pages = {199--214},
  keywords = {implementation, GpH, GUM},
  url = {http://www.macs.hw.ac.uk/~dsg/gph/papers/ps/Migration-IFL02.ps},
  abstract = {
    To support high level coordination, parallel functional languages
    need effective and automatic work distribution mechanisms. Many
    implementations distribute potential work, i.e. sparks or
    closures, but there is good evidence that the performance of
    certain classes of program can be improved if current work, or
    threads, are also distributed. Migrating a thread incurs
    significant execution cost and requires careful scheduling and an
    elaborate implementation.

    This paper describes the design, implementation and performance of
    thread migration in the GUM runtime system underlying Glasgow
    parallel Haskell (GpH). Measurements of nontrivial programs on a
    high-latency cluster architecture show that thread migration can
    improve the performance of data-parallel and divide-and-conquer
    programs with low processor utilisation. Thread migration also
    reduces the variation in performance results obtained in separate
    executions of a program. Moreover, migration does not incur
    significant overheads if there are no migratable threads, or on a
    single processor. However, for programs that already exhibit good
    processor utilisation, migration may increase performance
    variability and very occasionally reduce performance.
  }
}
@article{JunaiduT02,
  author = {Junaidu, Sahalu B. and Trinder, Philip W.},
  title = {Parallelising large irregular programs: an experience with {Naira}},
  journal = {Information Sciences},
  volume = {140},
  number = {3--4},
  year = {2002},
  pages = {229--240},
  note = {Title of pre-print differs from published article},
  keywords = {parallel applications, GpH},
  url = {http://www.macs.hw.ac.uk/~trinder/papers/MeasuringNaira.ps.gz},
  abstract = {
    Naira is a compiler for Haskell, written in Glasgow parallel
    Haskell. It exhibits modest, but irregular, parallelism that is
    determined by properties of the program being compiled, e.g. the
    complexity of the types and of the pattern matching. We report
    four experiments into Naira's parallel behaviour using a set of
    realistic inputs: namely the 18 Haskell modules of Naira
    itself. The issues investigated are:

      * Does increasing input size improve sequential efficiency and
        speedup?

      * To what extent do high communications latencies reduce average
        parallelism and speedup?

      * Does migrating running threads between processors improve
        average parallelism and speedup at all latencies?
  }
}
@inproceedings{Loidl02,
  author = {Loidl, Hans-Wolfgang},
  title = {The Virtual Shared Memory Performance of a Parallel Graph Reducer},
  booktitle = {CCGrid 2002 --- Intl. Symp. on Cluster Computing and the Grid},
  address = {Berlin, Germany},
  month = may,
  year = {2002},
  publisher = {IEEE Computer Society},
  pages = {311--318},
  keywords = {implementation, GpH, GUM},
  url = {http://www.macs.hw.ac.uk/~dsg/gph/papers/abstracts/dsm02.html},
  abstract = {
    This paper assesses the costs of maintaining a virtual shared heap
    in our parallel graph reducer (GUM), which implements a parallel
    functional language. GUM performs automatic and dynamic resource
    management for both work and data. We introduce extensions to the
    original design of GUM, aiming at a more flexible memory
    management and communication mechanism to deal with high-latency
    systems. We then present measurements of running GUM on a Beowulf
    cluster, evaluating the overhead of dynamic distributed memory
    management and the effectiveness of the new memory management and
    communication mechanisms.
  }
}
@inproceedings{RebonPortilloHLV02,
  author = {{Reb{\'o}n Portillo}, {\'A}lvaro J. and Hammond, Kevin and Loidl, Hans-Wolfgang and Vasconcelos, Pedro B.},
  title = {Cost Analysis Using Automatic Size and Time Inference},
  booktitle = {IFL'02 --- Intl. Workshop on the Implementation of Functional Languages},
  address = {Madrid, Spain},
  month = sep,
  year = {2002},
  publisher = {Springer},
  series = {Lecture Notes in Computer Science},
  volume = {2670},
  pages = {232--247},
  keywords = {semantics, resource analysis},
  url = {http://www.macs.hw.ac.uk/~dsg/gph/papers/ps/Analysis-IFL02.ps},
  abstract = {
    Cost information can be exploited in a variety of contexts,
    including parallelizing compilers, autonomic GRIDs and real-time
    systems. In this paper, we introduce a novel type and effect
    system: the sized time system, which is capable of determining upper
    bounds for both time and space costs, and which we initially
    intend to apply to determining good granularity for parallel
    tasks. The analysis is defined for a simple, strict, higher-order
    and polymorphic functional language, L, incorporating
    arbitrarily-sized list data structures. The inference algorithm
    implementing this analysis constructs cost- and size-terms for
    L-expressions, plus constraints over free size and cost variables
    in those terms that can be solved to produce information for
    higher-order functions. The paper presents both the analysis and
    the inference algorithm, providing examples that illustrate the
    primary features of the analysis.
  }
}
@article{TrinderLP02,
  author = {Trinder, Philip W. and Loidl, Hans-Wolfgang and Pointon, Robert F.},
  title = {Parallel and Distributed {Haskells}},
  journal = {Journal of Functional Programming},
  volume = {12},
  number = {4{\&}5},
  year = {2002},
  pages = {469--510},
  keywords = {models of parallelism, Eden, GpH, GdH, survey},
  url = {http://www.macs.hw.ac.uk/~dsg/gph/papers/abstracts/jfp.html},
  abstract = {
    Parallel and distributed languages specify computations on
    multiple processors and have a computation language to describe
    the algorithm, i.e. what to compute, and a coordination language
    to describe how to organise the computations across the
    processors. Haskell has been used as the computation language for
    a wide variety of parallel and distributed languages, and this
    paper is a comprehensive survey of implemented languages. We
    outline parallel and distributed language concepts and classify
    Haskell extensions using them. Similar example programs are used
    to illustrate and contrast the coordination languages, and the
    comparison is facilitated by the common computation language. A
    lazy language is not an obvious choice for parallel or distributed
    computation, and we address the question of why Haskell is a
    common functional computation language.
  }
}
@inproceedings{Loidl01,
  author = {Loidl, Hans-Wolfgang},
  title = {Load Balancing in a Parallel Graph Reducer},
  booktitle = {SFP'01 --- Scottish Functional Programming Workshop},
  address = {Stirling, Scotland},
  month = aug,
  year = {2001},
  publisher = {Intellect},
  series = {Trends in Functional Programming},
  volume = {3},
  pages = {63--74},
  keywords = {implementation, GpH, GUM},
  url = {http://www.macs.hw.ac.uk/~dsg/gph/papers/abstracts/sfp01.html},
  abstract = {
    Parallel graph reducers such as GUM use dynamic techniques to
    manage resources during execution. One important aspect of the
    dynamic behaviour is the distribution of work. The load balancing
    mechanism, which controls this aspect, should be flexible, to
    adjust the distribution of work to hardware characteristics as
    well as dynamic program characteristics, and scalable, to achieve
    high utilisation of all processors even on massively parallel
    machines.

    In this paper we study the behaviour of GUM's load balancing
    mechanism on a high-latency Beowulf multi-processor. We present
    modifications to the basic load balancing mechanism and discuss
    runtime measurements, which indicate that these modifications can
    significantly enhance the scalability of the system.
  }
}
@article{LoidlTB01,
  author = {Loidl, Hans-Wolfgang and Trinder, Philip W. and Butz, Carsten},
  title = {Tuning Task Granularity and Data Locality of Data Parallel {GpH} Programs},
  journal = {Parallel Processing Letters},
  volume = {11},
  number = {4},
  year = {2001},
  pages = {471--486},
  keywords = {models of parallelism, GpH, evaluation strategies, granularity},
  url = {http://www.macs.hw.ac.uk/~dsg/gph/papers/abstracts/hlpp01.html},
  note = {Selected papers from HLPP'01 --- Intl. Workshop on High-level Parallel Programming and Applications, Orleans, France, 26--27 March, 2001},
  abstract = {
    The performance of data parallel programs often hinges on two key
    coordination aspects: the computational costs of the parallel
    tasks relative to their management overhead - task granularity;
    and the communication costs induced by the distance between tasks
    and their data - data locality. In data parallel programs both
    granularity and locality can be improved by clustering,
    i.e. arranging for parallel tasks to operate on related
    sub-collections of data.

    The GpH parallel functional language automatically manages most
    coordination aspects, but also allows some high-level control of
    coordination using evaluation strategies. We study the
    coordination behavior of two typical data parallel programs, and
    find that while they can be improved by introducing clustering
    evaluation strategies, further performance improvements can be
    achieved by restructuring the program.

    We introduce a new generic Cluster class that allows clustering to
    be systematically introduced, and improved by program
    transformation. In contrast to many other parallel program
    transformation approaches, we transform realistic programs and
    report performance results on a 32-processor Beowulf cluster. The
    cluster class is highly-generic and extensible, amenable to
    reasoning, and avoids conflating computation and coordination
    aspects of the program.
  }
}
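
The generic Cluster class mentioned above can be pictured along these
lines (a reconstruction for illustration; the published class and its
laws differ in detail):

    -- Clustering groups a collection into sub-collections so that one
    -- parallel task works on one sub-collection, improving both task
    -- granularity and data locality.
    class Cluster c where
      cluster   :: Int -> c a -> c (c a)   -- group into chunks
      decluster :: c (c a) -> c a          -- flatten back

    -- A list instance, chunking n elements at a time, e.g.
    --   cluster 2 [1..5] == [[1,2],[3,4],[5]]
    instance Cluster [] where
      cluster _ [] = []
      cluster n xs = let (c, rest) = splitAt n xs
                     in c : cluster n rest
      decluster    = concat
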
@inproceedings{PointonPLLT01,
  author = {Pointon, Robert F. and Priebe, Steffen and Loidl, Hans-Wolfgang and Loogen, Rita and Trinder, Philip W.},
  title = {Functional vs Object-Oriented Distributed Languages},
  booktitle = {EUROCAST'01 --- Intl. Conf. on Computer Aided Systems Theory},
  address = {Las Palmas de Gran Canaria, Spain},
  month = feb,
  year = {2001},
  publisher = {Springer},
  series = {Lecture Notes in Computer Science},
  volume = {2178},
  pages = {642--656},
  keywords = {models of parallelism, Eden, GdH, Java, survey},
  doi = {http://dx.doi.org/10.1007/3-540-45654-6_49},
  url = {http://www.macs.hw.ac.uk/~dsg/gph/papers/ps/eurocast01.ps},
  abstract = {
    Conventional distributed programming languages require the
    programmer to explicitly specify many aspects of distributed
    coordination, including resource location, task placement,
    communication and synchronisation. Functional languages aim to
    provide higher-level abstraction, and this paper investigates the
    effectiveness of this for distributed co-ordination. The
    investigation compares and contrasts Java and two
    Haskell-based distributed functional languages, Eden and
    GdH. Three distributed programs are used as case studies, and the
    performance and programming effort are reported.
  }
}
@inproceedings{BakerFinchKT00,
  author = {Baker-Finch, Clement A. and King, David J. and Trinder, Philip W.},
  title = {An Operational Semantics for Parallel Lazy Evaluation},
  booktitle = {ICFP'00 --- Intl. Conf. on Functional Programming},
  address = {Montreal, Canada},
  month = sep,
  year = {2000},
  publisher = {ACM Press},
  pages = {162--173},
  keywords = {semantics, GpH},
  doi = {http://doi.acm.org/10.1145/351240.351256},
  url = {http://www.macs.hw.ac.uk/~trinder/papers/ICFPFinal.pdf},
  abstract = {
    We present an operational semantics for parallel lazy evaluation
    that accurately models the parallel behaviour of the non-strict
    parallel functional language GpH. Parallelism is modelled
    synchronously, that is, single reductions are carried out
    separately, then combined before proceeding to the next set of
    reductions. Consequently the semantics has two levels, with
    single-step transition rules at one level and combining rules at
    the other. Parallel threads are modelled by labelled bindings and
    to the best of our knowledge this is the first semantics that
    models thread states. A set of labelled bindings corresponds to a
    heap and is used to model sharing.

    The semantics is set at a higher level of abstraction than an
    abstract machine and is therefore more manageable for proofs about
    programs rather than implementations. At the same time, it is low
    level enough to allow us to reason about programs in terms of
    parallelism (i.e., the number of processors used), as well as work
    and run-time with different numbers of processors.

    The framework used by the semantics is flexible in that the model
    can easily be adapted to express other evaluation models such as
    sequential call-by-need and fully-speculative evaluation,
    non-deterministic choice and others.
  }
}
@inproceedings{LoidlKHLT00,
  author = {Loidl, Hans-Wolfgang and Klusik, Ulrike and Hammond, Kevin and Loogen, Rita and Trinder, Philip W.},
  title = {{GpH} and {Eden}: Comparing Two Parallel Functional Languages on a Beowulf Cluster},
  booktitle = {SFP'00 --- Scottish Functional Programming Workshop},
  address = {St Andrews, Scotland},
  month = jul,
  year = {2000},
  publisher = {Intellect},
  series = {Trends in Functional Programming},
  volume = {2},
  pages = {39--52},
  keywords = {models of parallelism, GpH, Eden, distributed memory},
  url = {http://www.macs.hw.ac.uk/~dsg/gph/papers/ps/sfp00.ps},
  abstract = {
    We investigate two similar but contrasting parallel functional
    language designs: Eden and GpH.  Both languages use the non-strict
    functional language Haskell as a core expression language, both
    their implementations are based on the same host sequential
    implementation --- the high performance Glasgow Haskell Compiler
    (GHC), and both implementations are available on the same
    distributed architecture --- the St Andrews Beowulf cluster.  This
    allows an exceptionally pure comparison of the language design
    characteristics and their impact on parallel performance.

    The comparison is illustrated by two parallel benchmarks which
    expose differences in the communication, process creation, and
    work distribution mechanisms employed by Eden and GpH.  Our
    results show that the explicit process model favoured by Eden
    gives good parallel performance for coarse-grained applications
    running on the Beowulf cluster.  In comparison, the implicit
    process model used in GpH gives poorer absolute speedup for this
    class of application on this architecture.  Further work is needed
    to verify the reasons for this difference in performance and to
    extend these results to other architectures.
  }
}
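
The language contrast drawn above is visible in how each language
expresses a parallel map (a sketch: the GpH half uses the standard
strategies combinators, while the Eden half shows signatures from the
Eden literature as comments, since they need the Eden runtime):

    import Control.Parallel.Strategies

    -- GpH: annotate potential parallelism; placement is implicit.
    parMapGpH :: NFData b => (a -> b) -> [a] -> [b]
    parMapGpH f xs = map f xs `using` parList rdeepseq

    -- Eden: one explicit process per element (illustrative only):
    --   process :: (Trans a, Trans b) => (a -> b) -> Process a b
    --   ( # )   :: (Trans a, Trans b) => Process a b -> a -> b
    --   parMapEden f xs = [ process f # x | x <- xs ]
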
@inproceedings{PointonTL00,
  author = {Pointon, Robert F. and Trinder, Philip W. and Loidl, Hans-Wolfgang},
  title = {The Design and Implementation of {Glasgow} Distributed {Haskell}},
  booktitle = {IFL'00  --- Intl. Workshop on the Implementation of Functional Languages},
  address = {Aachen, Germany},
  month = sep,
  year = {2000},
  publisher = {Springer},
  series = {Lecture Notes in Computer Science},
  volume = {2011},
  pages = {53--70},
  keywords = {implementation, GdH},
  doi = {http://dx.doi.org/10.1007/3-540-45361-X_4},
  url = {http://www.macs.hw.ac.uk/~dsg/gph/papers/ps/ifl00.ps},
  abstract = {
    This paper presents the design and implementation of Glasgow
    distributed Haskell (GdH), a non-strict distributed functional
    language. The language is intended for constructing scalable,
    reliable distributed applications, and is Haskell'98 compliant,
    being a superset of both Concurrent Haskell and Glasgow parallel
    Haskell (GpH).

    GdH distributes both pure and impure threads across multiple
    Processing Elements (PEs); each location is made explicit so a
    program can use resources unique to a PE, and objects, including
    threads, can be created on a named PE. The location that uniquely
    owns a resource is identified by a method of a new Immobile type
    class. Impure threads communicate and synchronise explicitly to
    co-ordinate actions on the distributed state, but both pure and
    impure threads synchronise and communicate implicitly to share
    data. Limited support for fault tolerant programming is provided
    by distributed exception handling. The language constructs are
    illustrated by example, and two demonstration programs give a
    flavour of GdH programming.

    Although many distributed functional languages have been designed,
    relatively few have robust implementations. The GdH implementation
    fuses and extends two mature implementation technologies: the GUM
    runtime system (RTS) of GpH and the RTS for Concurrent
    Haskell. The fused RTS is extended with a small number of
    primitives from which more sophisticated constructs can be built,
    and libraries are adapted to the distributed context.
  }
}
@inproceedings{TrinderLBDHKPR00,
  author = {Trinder, Philip W. and Loidl, Hans-Wolfgang and {Barry Jr.}, Ed. and Davis, M. Kei and Hammond, Kevin and Klusik, Ulrike and Peyton-Jones, Simon L. and {Reb{\'o}n Portillo}, {\'A}lvaro J.},
  title = {The Multi-Architecture Performance of the Parallel Functional Language {GpH} (Research Note)},
  booktitle = {Euro-Par 2000 --- Intl. Conf. on Parallel Processing},
  address = {Munich, Germany},
  month = aug,
  year = {2000},
  publisher = {Springer},
  series = {Lecture Notes in Computer Science},
  volume = {1900},
  pages = {739--743},
  keywords = {models of parallelism, GpH},
  doi = {http://dx.doi.org/10.1007/3-540-44520-X_101},
  url = {http://www.macs.hw.ac.uk/~dsg/gph/papers/abstracts/europar00.html},
  abstract = {
    In principle, functional languages promise straightforward
    architecture-independent parallelism, because of their high level
    description of parallelism, dynamic management of parallelism and
    deterministic semantics.  However, these language features come at
    the expense of a sophisticated compiler and/or runtime-system. The
    problem we address is whether such an elaborate system can deliver
    acceptable performance on a variety of parallel architectures.  In
    particular we report performance measurements for the GUM
    runtime-system on eight parallel architectures, including
    massively parallel, distributed-memory, shared-memory and
    workstation networks.
  }
}
@inproceedings{TrinderPL00,
  author = {Trinder, Philip W. and Pointon, Robert F. and Loidl, Hans-Wolfgang},
  title = {Runtime System Level Fault Tolerance for a Distributed Functional Language},
  booktitle = {SFP'00 --- Scottish Functional Programming Workshop},
  address = {St Andrews, Scotland},
  month = jul,
  year = {2000},
  publisher = {Intellect},
  series = {Trends in Functional Programming},
  volume = {2},
  pages = {103--114},
  keywords = {implementation, GdH, GUM, fault tolerance},
  url = {http://www.macs.hw.ac.uk/~dsg/gph/papers/ps/sfp00a.ps},
  abstract = {
    Functional languages potentially offer benefits for distributed
    fault tolerance: many computations are pure, and hence have no
    side-effects to be reversed during error recovery; moreover
    functional languages have a high-level runtime system (RTS) where
    computations and data are readily manipulated. We propose a new
    RTS level of fault tolerance for distributed functional languages,
    and outline a design for its implementation for the GdH
    language. The design distinguishes between pure and impure
    computations: impure computations must be recovered using
    conventional exception-based techniques, but the RTS attempts
    implicit recovery of pure computations.
  }
}
@article{LoidlTHJMP99,
  author = {Loidl, Hans-Wolfgang and Trinder, Philip W. and Hammond, Kevin and Junaidu, Sahalu B. and Morgan, Richard G. and Peyton-Jones, Simon L.},
  title = {Engineering Parallel Symbolic Programs in {GpH}},
  journal = {Concurrency --- Practice and Experience},
  volume = {11},
  number = {12},
  year = {1999},
  pages = {701--752},
  keywords = {parallel applications, GpH},
  url = {http://www.macs.hw.ac.uk/~dsg/gph/papers/ps/cpe.ps},
  abstract = {
    We investigate the claim that functional languages offer low-cost
    parallelism in the context of symbolic programs on modest parallel
    architectures. In our investigation we present the first
    comparative study of the construction of large applications in a
    parallel functional language, in our case in Glasgow Parallel
    Haskell (GPH). The applications cover a range of application
    areas, use several parallel programming paradigms, and are
    measured on two very different parallel architectures.

    On the applications level the most significant result is that we
    are able to achieve modest wall-clock speedups (between factors of
    2 and 10) over the optimised sequential versions for all but one
    of the programs. Speedups are obtained even for programs that were
    not written with the intention of being parallelised. These gains
    are achieved with a relatively small programmer-effort. One reason
    for the relative ease of parallelisation is the use of evaluation
    strategies, a new parallel programming technique that separates
    the algorithm from the coordination of parallel behaviour.

    On the language level we show that the combination of lazy and
    parallel evaluation is useful for achieving a high level of
    abstraction. In particular we can describe top-level parallelism,
    and also preserve module abstraction by describing parallelism
    over the data structures provided at the module interface
    (``data-oriented parallelism'').  Furthermore, we find that the
    determinism of the language is helpful, as is the largely-implicit
    nature of parallelism in GPH.
  }
}
@inproceedings{Trinder99,
  author = {Trinder, Philip W.},
  title = {Motivation for {Glasgow} distributed {Haskell}, a non-strict Functional Language},
  booktitle = {PDSIA'99 --- Intl. Workshop on Parallel and Distributed Computing for Symbolic and Irregular Applications},
  address = {Sendai, Japan},
  month = jul,
  year = {1999},
  publisher = {World Scientific},
  pages = {72--81},
  keywords = {models of parallelism, GdH},
  url = {http://www.macs.hw.ac.uk/~dsg/gph/papers/abstracts/pdsia99.html},
  abstract = {
    Non-strict functional languages offer potential benefits for
    constructing distributed systems: namely a highly-dynamic model of
    distribution, a relatively high degree of distribution
    transparency, and the potential to abstract over
    distribution-control primitives. We describe our motivation for
    implementing such a language, a variant of Haskell, and evaluating
    it. The implementation is a fusion of existing Glasgow Haskell
    Compiler technologies. The evaluation will be based on experiences
    implementing a distributed interactive simulation, and comparing
    it with a Java version.
  }
}
@incollection{TrinderLH99,
  author = {Trinder, Philip W. and Loidl, Hans-Wolfgang and Hammond, Kevin},
  editor = {Hammond, Kevin and Michaelson, Greg J.},
  title = {Large-Scale Functional Applications},
  booktitle = {Research Directions in Parallel Functional Programming},
  publisher = {Springer},
  month = nov,
  year = {1999},
  pages = {399--426},
  keywords = {parallel applications, GpH},
  url = {http://www-fp.dcs.st-and.ac.uk/pfpbook/},
  abstract = {
    This chapter describes applications that have been written using
    parallel functional language technology.  While the majority of
    these applications have been produced as part of a computer
    science research project aimed at testing the limits of the
    technology, others, such as those written by Ericsson in Erlang,
    are being used in real commercial products.  Furthermore, projects
    such as Digital's Smart Kiosk application written using the
    data-flow inspired Stampede system represent serious trial uses of
    the technology by commercial organisations.

    Functional programming is a wide-spectrum, general purpose
    paradigm, and the applications therefore cover a wide area,
    including high-performance numerical applications, symbolic
    applications, data-intensive applications and computer vision.
    Some of these areas present hard problems for traditional parallel
    approaches, which are usually best suited to regular computation
    patterns and regular data structures.  Symbolic or irregular
    applications that are described here include compilers, theorem
    provers, computer algebra and natural language processing.

    It is a common misconception that functional programming can only
    be applied to small-scale programs, or that parallelism has only
    been obtained for toy examples.  This chapter describes several
    real parallel applications that are large in absolute terms,
    e.g. the 230000 line Erlang mobility server, or the 47000 line
    Lolita natural language parser.  Since functional programs are
    often a factor of 5-10 shorter than the equivalent C or Fortran,
    these represent serious, and realistic bases for assessing the
    quality of a parallel language and its development environment.

    Although, as the preceding chapters have shown, there is still
    much research to be done, parallel functional programming
    technology has been maturing rapidly, and many large parallel
    functional applications have been produced in recent years.
    Although this chapter aims to be reasonably comprehensive, it is
    therefore necessarily selective in the applications that are
    covered.  We are particularly heartened by results from systems
    such as SAC and SISAL, which are not merely competitive with the
    performance of C, but have actually exceeded the performance of
    equivalent Fortran programs compiled using high-quality
    parallelising compilers.

    The chapter uses the taxonomy that was introduced in Chapter 3.
    Implicit, semi-explicit, coordination, explicit and derivational
    approaches are each covered in separate sections.  Within those
    sections, applications are covered by area.  Ideally we would like
    to report a consistent set of program and parallel platform
    characteristics, and performance figures for each application.  In
    the absence of a standard set of parallel benchmarks, such results
    are currently unobtainable.  Results are therefore given using
    either absolute or relative speedup figures as recorded by the
    authors of the research papers on particular parallel
    architectures.
  }
}
@inproceedings{HallBTK98,
  author = {Hall, Jon G. and Baker-Finch, Clement A. and Trinder, Philip W. and King, David J.},
  title = {Towards an Operational Semantics for a Parallel Non-Strict Functional Language},
  booktitle = {IFL'98 --- Intl. Workshop on the Implementation of Functional Languages},
  address = {London, UK},
  month = sep,
  year = {1998},
  publisher = {Springer},
  series = {Lecture Notes in Computer Science},
  volume = {1595},
  pages = {54--71},
  keywords = {semantics, GpH},
  url = {http://www.macs.hw.ac.uk/~dsg/gph/papers/abstracts/semantics.html},
  abstract = {
    Parallel programs must describe both computation and coordination,
    i.e. what to compute and how to organise the computation. In
    functional languages equational reasoning is often used to reason
    about computation. In contrast, there have been many different
    coordination constructs for functional languages, and far less
    work on reasoning about coordination.

    We present an initial semantics for GpH, a small extension of the
    Haskell language, that allows us to reason about coordination. In
    particular we can reason about work, average parallelism, and
    runtime. The semantics captures the notions of limited (physical)
    resources, the preservation of sharing and speculative
    evaluation. We show a consistency result with Launchbury's
    well-known lazy semantics.
  }
}
@inproceedings{Hammond98,
  author = {Hammond, Kevin},
  title = {Strategic {SPMD}},
  booktitle = {Glasgow Workshop on Functional Programming},
  address = {Pitlochry, Scotland},
  month = sep,
  year = {1998},
  keywords = {models of parallelism, evaluation strategies, SPMD},
  url = {http://www.macs.hw.ac.uk/~dsg/gph/papers/abstracts/strategic-spmd.html},
  abstract = {
    This paper extends the existing concepts of evaluation strategies,
    higher-order functions that can be used to separate behaviour from
    algorithm in parallel functional programs, to support the SPMD
    model of parallelism. Our model allows SPMD sub-programs to be
    composed with, or arbitrarily nested within, other parallel
    paradigms. This represents a major step
    towards supporting the Group SPMD model, a generalisation of SPMD,
    in Glasgow Parallel Haskell. It also represents the first attempt
    of which we are aware to model SPMD programming in a system
    supporting implicit communications.

    In order to ensure that the necessary barrier communications are
    modelled correctly at the high level, we use the implementation
    concept of synchronisation through logically shared graph
    reduction. We therefore ensure, in common with other work on
    evaluation strategies, that no new constructs need be added to the
    host functional language other than basic ones to introduce
    parallelism and enforce sequential execution. In order to ensure
    accurate modelling of the SPMD paradigm it is, however, convenient
    to introduce a primitive to control task placement.

    Although there is, as yet, no efficient implementation of the SPMD
    strategy, we are confident that this could be constructed using an
    algorithmic skeleton, or similar implementation technique.
  }
}
@inproceedings{KingHT98,
  author = {King, David J. and Hall, Jon G. and Trinder, Philip W.},
  title = {A Strategic Profiler for {Glasgow} Parallel {Haskell}},
  booktitle = {IFL'98 --- Intl. Workshop on the Implementation of Functional Languages},
  address = {London, UK},
  month = sep,
  year = {1998},
  publisher = {Springer},
  series = {Lecture Notes in Computer Science},
  volume = {1595},
  pages = {88--102},
  keywords = {tools, GpH, profiling, evaluation strategies},
  url = {http://www.macs.hw.ac.uk/~dsg/gph/papers/abstracts/gransp.html},
  abstract = {
    Execution profiling is a crucial part of engineering large
    parallel functional programs. This paper presents the design,
    implementation, and use of a new execution time profiler for
    Glasgow Parallel Haskell (GpH) --- a well-established parallel
    derivative of Haskell. Parallelism is introduced in GpH by using
    evaluation strategies, which provide a clean way of co-ordinating
    parallel threads without destroying a program's original
    structure. The new profiler attributes the cost of evaluating
    parallel threads to the strategies that created them. A unique
    feature of the strategic profiler is that the call-graph of
    evaluation strategies is maintained, allowing the programmer to
    discover the sequence of (perhaps nested) strategies that were
    used to create any given thread. The strategic profiler is
    demonstrated on several examples, and compared with other
    profilers.
  }
}
@phdthesis{Loidl98,
  author = {Loidl, Hans-Wolfgang},
  title = {Granularity in Large-Scale Parallel Functional Programming},
  school = {Dept. of Computing Science, Univ. of Glasgow},
  month = mar,
  year = {1998},
  keywords = {parallel applications, GpH, granularity, thesis},
  url = {http://www.macs.hw.ac.uk/~dsg/gph/papers/abstracts/loidl-thesis.html},
  abstract = {
    This thesis demonstrates how to reduce the runtime of large
    non-strict functional programs using parallel evaluation. The
    parallelisation of several programs shows the importance of
    granularity, i.e. the computation costs of program
    expressions. The aspect of granularity is studied both on a
    practical level, by presenting and measuring runtime granularity
    improvement mechanisms, and at a more formal level, by devising a
    static granularity analysis.

    By parallelising several large functional programs this thesis
    demonstrates for the first time the advantages of combining lazy
    and parallel evaluation on a large scale: laziness aids
    modularity, while parallelism reduces runtime.  One of the
    parallel programs is the Lolita system which, with more than
    47,000 lines of code, is the largest existing parallel non-strict
    functional program.  A new mechanism for parallel programming,
    evaluation strategies, to which this thesis contributes, is shown
    to be useful in this parallelisation.  Evaluation strategies
    simplify parallel programming by separating algorithmic code from
    code specifying dynamic behaviour. For large programs the
    abstraction provided by functions is maintained by using a
    data-oriented style of parallelism, which defines parallelism over
    intermediate data structures rather than inside the functions.

    A highly parameterised simulator, GranSim, has been constructed
    collaboratively and is discussed in detail in this thesis. GranSim
    is a tool for architecture-independent parallelisation and a
    testbed for implementing runtime-system features of the parallel
    graph reduction model. By providing an idealised as well as an
    accurate model of the underlying parallel machine, GranSim has
    proven to be an essential part of an integrated parallel software
    engineering environment. Several parallel runtime-system features,
    such as granularity improvement mechanisms, have been tested via
    GranSim.  It is publicly available and in active use at several
    universities worldwide.

    In order to provide granularity information this thesis presents
    an inference-based static granularity analysis.  This analysis
    combines two existing analyses, one for cost and one for size
    information. It determines an upper bound for the computation
    costs of evaluating an expression in a simple strict higher-order
    language. By exposing recurrences during cost reconstruction and
    using a library of recurrences and their closed forms, it is
    possible to infer the costs for some recursive functions.  The
    possible performance improvements are assessed by measuring the
    parallel performance of a hand-analysed and annotated program.
  }
}
@unpublished{TrinderBDHJKLP98a,
  author = {Trinder, Philip W. and {Barry Jr.}, Ed. and Davis, M. Kei and Hammond, Kevin and Junaidu, Sahalu B. and Klusik, Ulrike and Loidl, Hans-Wolfgang and Peyton-Jones, Simon L.},
  title = {{GpH}: An Architecture-independent Functional Language},
  month = jul,
  year = {1998},
  note = {Unpublished draft},
  keywords = {models of parallelism, GpH, architecture independence, unpublished},
  url = {http://www.macs.hw.ac.uk/~dsg/gph/papers/abstracts/arch-indep.html},
  abstract = {
    In principle, pure functional languages promise straightforward
    architecture-independent parallelism. We investigate the validity
    of this claim in the context of our highly-portable implementation
    of an implicitly-parallel functional language: the GUM
    implementation of Glasgow Parallel Haskell (GpH). We discuss
    architecture independence at two levels: low-level (i.e. the
    implementation) and high-level (i.e. the programmer).

    Low-level architecture independence is achieved by choosing a
    message-passing model for GUM, and implementing it using portable
    C and a widely-supported message-passing library like PVM. In fact
    GUM is largely independent of the message-passing library, and has
    been adapted to use MPI and the CM-5 CMMD libraries as well as
    PVM. As a result, GUM is easily ported, and is currently available
    on seven platforms including shared-memory machines, distributed
    memory machines, and networks of workstations. We provide
    indicative measurements of how efficient and effective our
    architecture-independent runtime system is across a range of
    architectures.

    The GpH programming model provides higher-level architecture
    independence. The parallelism in GpH is mainly implicit, and hence
    relatively small parts of the program need to be changed for a new
    architecture. The coordination that is required is expressed with
    a new high-level construct, evaluation strategies. Evaluation
    strategies provide a clean separation between algorithm and
    coordination, easing the task of changing either facet to suit new
    parallel environments. Moreover, GpH programs can systematically
    be developed for multiple target architectures, using a suite of
    simulation, profiling and visualisation tools. Much of the
    development is architecture-independent but, once a particular
    target architecture has been selected, the tools are parameterised
    to support tuning for that architecture. We demonstrate the
    systematic development of two real programs to the point of
    achieving good speedups on four architectures.
  }
}
@inproceedings{TrinderBDHJKLP98b,
  author = {Trinder, Philip W. and {Barry Jr.}, Ed. and Davis, M. Kei and Hammond, Kevin and Junaidu, Sahalu B. and Klusik, Ulrike and Loidl, Hans-Wolfgang and Peyton-Jones, Simon L.},
  title = {Low level Architecture-independence of {Glasgow} Parallel {Haskell} ({GpH})},
  booktitle = {Glasgow Workshop on Functional Programming},
  address = {Pitlochry, Scotland},
  month = sep,
  year = {1998},
  keywords = {models of parallelism, GpH, architecture independence},
  url = {http://www.macs.hw.ac.uk/~dsg/gph/papers/abstracts/low-level-gph.html},
  abstract = {
    In principle, pure functional languages promise straightforward
    architecture-independent parallelism. We investigate the validity
    of this claim in the context of our highly-portable implementation
    of an implicitly-parallel functional language: the GUM
    implementation of Glasgow Parallel Haskell (GpH). We focus here on
    the low-level architecture independence of GUM: the high-level
    architecture independence of the GpH programming model is
    discussed elsewhere.

    Low-level architecture independence is achieved by choosing a
    message-passing model for GUM, and implementing it using portable
    C and a widely-supported message-passing library like PVM. In fact
    GUM is largely independent of the message-passing library, and has
    been adapted to use MPI and the CM-5 CMMD libraries as well as
    PVM. As a result, GUM is easily ported, and is currently available
    on seven platforms including shared-memory machines, distributed
    memory machines, and networks of workstations. We provide
    indicative measurements of the efficiency and speedups delivered
    by GUM on a range of architectures.
  }
}
@article{TrinderHLP98,
  author = {Trinder, Philip W. and Hammond, Kevin and Loidl, Hans-Wolfgang and Peyton-Jones, Simon L.},
  title = {{Algorithm} + {Strategy} = {Parallelism}},
  journal = {Journal of Functional Programming},
  volume = {8},
  number = {1},
  year = {1998},
  pages = {23--60},
  keywords = {models of parallelism, GpH, evaluation strategies},
  doi = {http://dx.doi.org/10.1017/S0956796897002967},
  url = {http://www.macs.hw.ac.uk/~dsg/gph/papers/abstracts/strategies.html},
  abstract = {
    The process of writing large parallel programs is complicated by
    the need to specify both the parallel behaviour of the program and
    the algorithm that is to be used to compute its result. This paper
    introduces evaluation strategies, lazy higher-order functions that
    control the parallel evaluation of non-strict functional
    languages. Using evaluation strategies, it is possible to achieve
    a clean separation between algorithmic and behavioural code. The
    result is enhanced clarity and shorter parallel programs.

    Evaluation strategies are a very general concept: this paper shows
    how they can be used to model a wide range of commonly used
    programming paradigms, including divide-and-conquer, pipeline
    parallelism, producer/consumer parallelism, and data-oriented
    parallelism.  Because they are based on unrestricted higher-order
    functions, they can also capture irregular parallel structures.

    Evaluation strategies are not just of theoretical interest: they
    have evolved out of our experience in parallelising several
    large-scale parallel applications, where they have proved
    invaluable in helping to manage the complexities of parallel
    behaviour.  These applications are described in detail here. The
    largest application we have studied to date, Lolita, is a 60,000
    line natural language parser. Initial results show that for these
    programs we can achieve acceptable parallel performance, while
    incurring minimal overhead for using evaluation strategies.
  }
}
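For readers new to the idea, here is a minimal sketch of an evaluation
strategy in the style this paper introduced, written against the
modern Control.Parallel.Strategies library (whose names postdate the
paper but descend directly from it); the point illustrated is the
separation of algorithm from coordination, not any particular API:

  import Control.Parallel.Strategies (parList, rdeepseq, using)

  -- Algorithm: an ordinary map, with no mention of parallelism.
  squares :: [Int] -> [Int]
  squares = map (^ 2)

  -- Coordination: a strategy, attached with `using`, that sparks the
  -- evaluation of each list element to normal form in parallel.
  parSquares :: [Int] -> [Int]
  parSquares xs = squares xs `using` parList rdeepseq

  main :: IO ()
  main = print (sum (parSquares [1 .. 10000]))

Swapping parList rdeepseq for a different strategy changes the dynamic
behaviour without touching squares, which is exactly the clean
separation the abstract claims.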
@inproceedings{HallLTHD97,
  author = {Hall, Cordelia V. and Loidl, Hans-Wolfgang and Trinder, Philip W. and Hammond, Kevin and O'Donnell, John T.},
  title = {Refining a Parallel Algorithm for Calculating Bowings},
  booktitle = {Glasgow Workshop on Functional Programming},
  address = {Ullapool, Scotland},
  month = sep,
  year = {1997},
  keywords = {parallel applications, GpH},
  url = {http://www.macs.hw.ac.uk/~dsg/gph/papers/abstracts/bowings.html},
  note = {Draft proceedings},
  abstract = {
    String players know that bowing properly is the hardest skill they
    have to learn. In other work [FH97], we develop an algorithm that
    calculates bowings for BowTech, a project that supports string
    performers. This algorithm takes a significant amount of time to
    execute (in our second test case, an hour for the implementation,
    written in Ada). We have implemented a parallel version of the
    algorithm in GpH, a parallel superset of Haskell, and measured the
    quality of output on six pieces of music. The parallel program
    has been refined using the GranSim simulator and measured on two
    parallel architectures: a shared memory multiprocessor and a
    network of workstations.
  }
}
@inproceedings{HammondLT97,
  author = {Hammond, Kevin and Loidl, Hans-Wolfgang and Trinder, Philip W.},
  title = {Parallel Cost Centre Profiling},
  booktitle = {Glasgow Workshop on Functional Programming},
  address = {Ullapool, Scotland},
  month = sep,
  year = {1997},
  pages = {51--72},
  keywords = {tools, GpH, profiling},
  url = {http://www.macs.hw.ac.uk/~dsg/gph/papers/abstracts/grancc.html},
  note = {Draft proceedings},
  abstract = {
    Good profiling is a major issue in extracting performance from
    parallel programs. We report on a novel synthesis of sequential
    cost-centre profiling and state-of-the-art parallel simulation for
    the pure functional language Haskell that promises to provide
    detailed and accurate information about parallel Haskell
    programs. Exploiting simulation also improves the quality of
    sequential cost centre profiling, though at a significant
    performance overhead.
  }
}
@inproceedings{JunaiduDH97,
  author = {Junaidu, Sahalu B. and Davie, Antony J. T. and Hammond, Kevin},
  title = {{Naira}: A Parallel$^2$ {Haskell} Compiler},
  booktitle = {IFL'97 --- Intl. Workshop on the Implementation of Functional Languages},
  address = {St Andrews, Scotland},
  month = sep,
  year = {1997},
  publisher = {Springer},
  series = {Lecture Notes in Computer Science},
  volume = {1467},
  pages = {214--230},
  keywords = {parallel applications, GpH},
  url = {http://www.macs.hw.ac.uk/~dsg/gph/papers/abstracts/naira.html},
  abstract = {
    Naira is a compiler for a parallel dialect of Haskell, compiling
    to a dataflow-inspired parallel abstract machine. Unusually
    (perhaps even uniquely), Naira has itself been parallelised using
    state-of-the-art tools developed at Glasgow and St. Andrews. This
    paper reports initial performance results that have been obtained
    using the GranSim simulator, both for the top-level pipeline and
    for the individual compilation stages. Our results show that a
    modest but useful degree of parallelism can be achieved even for a
    distributed memory machine.
  }
}
@inproceedings{Loidl97,
  author = {Loidl, Hans-Wolfgang},
  title = {{LinSolv}: a Case Study in Strategic Parallelism},
  booktitle = {Glasgow Workshop on Functional Programming},
  address = {Ullapool, Scotland},
  month = sep,
  year = {1997},
  keywords = {parallel applications, GpH},
  url = {http://www.macs.hw.ac.uk/~dsg/gph/papers/abstracts/linsolv.html},
  note = {Draft proceedings},
  abstract = {
    This paper discusses the parallelisation and performance tuning of
    a typical computer algebra algorithm, LinSolv, using evaluation
    strategies. We present three steps in the parallelisation process
    starting with a naive parallel version. As this algorithm uses
    infinite data structures as intermediate values it is necessary to
    define very sophisticated strategies in order to improve parallel
    performance.

    We also compare the strategic parallel code with pre-strategy
    code. This comparison shows how evaluation strategies help to
    localise changes needed for parallelisation. In particular, the
    separation between algorithmic and parallel code makes the
    structure of the parallelism much clearer.
  }
}
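The difficulty with infinite intermediate structures lends itself to a
hedged sketch (modern Control.Parallel.Strategies names; LinSolv's
actual strategies are more elaborate): a strategy such as parList
would diverge on an infinite list because it traverses the whole
spine, whereas a bounded rolling window does not:

  import Control.Parallel.Strategies (parBuffer, rdeepseq, using)

  -- Stand-in for an expensive per-element computation.
  slowSquare :: Integer -> Integer
  slowSquare n = n * n + sum (replicate 1000 0)

  -- An infinite result list: parBuffer sparks a rolling window of 64
  -- elements, so a consumer may still demand only a finite prefix.
  results :: [Integer]
  results = map slowSquare [1 ..] `using` parBuffer 64 rdeepseq

  main :: IO ()
  main = print (take 20 results)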
@inproceedings{LoidlMTPCPG97,
  author = {Loidl, Hans-Wolfgang and Morgan, Richard G. and Trinder, Philip W. and Poria, Sanjay and Cooper, Chris and Peyton-Jones, Simon L. and Garigliano, Roberto},
  title = {Parallelising a Large Functional Program or: Keeping {LOLITA} Busy},
  booktitle = {IFL'97 --- Intl. Workshop on the Implementation of Functional Languages},
  address = {St Andrews, Scotland},
  month = sep,
  year = {1997},
  publisher = {Springer},
  series = {Lecture Notes in Computer Science},
  volume = {1467},
  pages = {198--213},
  keywords = {parallel applications, GpH},
  doi = {http://dx.doi.org/10.1007/BFb0055432},
  url = {http://www.macs.hw.ac.uk/~dsg/gph/papers/abstracts/lolita.html},
  abstract = {
    In this paper we report on the ongoing parallelisation of Lolita,
    a natural language engineering system. Although Lolita currently
    exhibits only modest parallelism, we believe that it is the
    largest parallel functional program ever, comprising more than
    47,000 lines of Haskell. Lolita has the following interesting
    features common to real world applications of lazy languages:

      * the code was not specifically designed for parallelism; 

      * laziness is essential for efficiency in Lolita; 

      * Lolita interfaces to data structures outside the Haskell heap,
        using a foreign language interface;

      * Lolita was not written by those most closely involved in the
        parallelisation.

    Our expectations in parallelising the program were to achieve
    moderate speedups with small changes in the code.  To date
    speedups of up to 2.4 have been achieved for Lolita running under
    a realistic simulation of our 4 processor shared-memory target
    machine. Most notably, the parallelism is achieved with a very
    small number of changes to, and without requiring an understanding
    of most of the application.  On the Sun SPARCserver target machine
    wall-clock speedup is currently limited by physical memory
    availability.
  }
}
@inproceedings{LoidlT97,
  author = {Loidl, Hans-Wolfgang and Trinder, Philip W.},
  title = {Engineering Large Parallel Functional Programs},
  booktitle = {IFL'97 --- Intl. Workshop on the Implementation of Functional Languages},
  address = {St Andrews, Scotland},
  month = sep,
  year = {1997},
  publisher = {Springer},
  series = {Lecture Notes in Computer Science},
  volume = {1467},
  pages = {178--197},
  keywords = {parallel applications, GpH},
  doi = {http://dx.doi.org/10.1007/BFb0055431},
  url = {http://www.macs.hw.ac.uk/~dsg/gph/papers/abstracts/eng.html},
  abstract = {
    The design and implementation of useful programming languages,
    whether sequential or parallel, should be driven by large,
    realistic applications.  In constructing several medium- and
    large-scale programs in Glasgow Parallel Haskell, GPH, a parallel
    extension of Haskell, the group at Glasgow has investigated
    several important engineering issues:

      * Real Application Parallelism: The programs achieve good
        wall-clock speedups and acceptable scale-up on both a
        shared-memory and a distributed memory machine. The programs
        typify a number of application areas and use a number of
        different parallel paradigms, e.g. pipelining or
        divide-and-conquer, often combining several paradigms in a
        single program.

      * Language Issues: Although the largely implicit parallelism in
        GPH is a satisfactory programming model in general, the base
        constructs for introducing and controlling parallelism tend to
        obfuscate the semantics of large programs.  As a result we
        developed evaluation strategies, a more abstract, and
        systematic mechanism for introducing and controlling
        parallelism.

      * Engineering Environment: The development and performance
        tuning of these programs emphasised the importance of an
        integrated engineering environment. In the process we have
        refined components of this environment like the simulator, the
        runtime system, and the profiling tools.
  }
}
@inproceedings{LoidlH96a,
  author = {Loidl, Hans-Wolfgang and Hammond, Kevin},
  title = {A Sized Time System for a Parallel Functional Language},
  booktitle = {Glasgow Workshop on Functional Programming},
  address = {Ullapool, Scotland},
  month = jul,
  year = {1996},
  keywords = {semantics, resource analysis},
  url = {http://www.macs.hw.ac.uk/~dsg/gph/papers/abstracts/sized.html},
  abstract = {
    This paper describes an inference system, whose purpose is to
    determine the cost of evaluating expressions in a strict purely
    functional language. Upper bounds can be derived for both
    computation cost and the size of data structures. We outline a
    static analysis based on this inference system for inferring size
    and cost information. The analysis is a synthesis of the sized
    types of Hughes et al., and the polymorphic time system of Dornic
    et al., which was extended to static dependent costs by Reistad
    and Gifford.

    Our main interest in cost information is for scheduling tasks in
    the parallel execution of functional languages. Using the GranSim
    parallel simulator, we show that the information provided by our
    analysis is sufficient to characterise relative task granularities
    for a simple functional program. This information can be used in
    the runtime-system of the Glasgow Parallel Haskell compiler to
    improve dynamic program performance.
  }
}
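As a flavour of what such an inference system derives (schematic
notation of ours, not the paper's): judgements pair a sized type with
an upper cost bound, so that, for instance, mapping a function of
per-element cost c_f over a list of length at most n yields a bound
proportional to n * c_f:

  \[
  \Gamma \vdash e : \tau \,\&\, k
  \qquad\text{e.g.}\qquad
  \Gamma \vdash \mathit{map}\,f\,\mathit{xs} :
    \mathrm{List}^{n}\,\beta \,\&\, n \cdot c_f + O(n)
  \]

where List^n is the type of lists of length at most n and c_f bounds
the cost of one application of f.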
@inproceedings{LoidlH96b,
  author = {Loidl, Hans-Wolfgang and Hammond, Kevin},
  title = {Making a Packet: Cost-Effective Communication for a Parallel Graph Reducer},
  booktitle = {IFL'96 --- Intl. Workshop on the Implementation of Functional Languages},
  address = {Bad-Godesberg, Germany},
  month = sep,
  year = {1996},
  publisher = {Springer},
  series = {Lecture Notes in Computer Science},
  volume = {1268},
  pages = {184--199},
  keywords = {implementation, GpH, GUM},
  url = {http://www.macs.hw.ac.uk/~dsg/gph/papers/abstracts/packet.html},
  abstract = {
    This paper studies critical runtime-system issues encountered when
    packing data for transmission in a lazy, parallel graph reduction
    system. In particular, we aim to answer two questions:

      * How much graph should go into a packet?

      * How aggressively should a processor look for work after
        requesting remote data?

    In order to answer the first question, we compare various packing
    schemes, of which one extreme packs just the node that is demanded
    (``incremental fetching''), and the other packs all the graph that
    is reachable from that node (``bulk fetching''). The second
    question is addressed by considering various mechanisms for
    latency hiding during communication, ranging from fully
    synchronous communication with no attempt to mask latency, to full
    thread migration during asynchronous communication.

    In order to make our results as general as possible, we have used
    the GranSim simulator to study a wide variety of parallel machine
    configurations. Based on these measurements we propose concrete
    improvements for parallel graph reducers such as the GUM
    implementation of Glasgow Parallel Haskell.
  }
}
@inproceedings{TrinderHLPW96,
  author = {Trinder, Philip W. and Hammond, Kevin and Loidl, Hans-Wolfgang and Peyton-Jones, Simon L. and Wu, J.},
  title = {Accidents always Come in Threes: A Case Study of Data-intensive Programs in Parallel {Haskell}},
  booktitle = {Glasgow Workshop on Functional Programming},
  address = {Ullapool, Scotland},
  month = jul,
  year = {1996},
  keywords = {parallel applications, GpH},
  url = {http://www.macs.hw.ac.uk/~dsg/gph/papers/abstracts/data-intensive.html},
  abstract = {
    Accidents happen:

      * "An invisible car came out of nowhere, struck my vehicle and
        vanished."

      * "I pulled away from the side of the road, glanced at my
        mother-in-law, and headed for the embankment."

      * "As I approached the intersection a sign suddenly appeared in
        a place where no stop sign had ever appeared before."

    Luckily, we don't normally have to deal with problems as bizarre
    as these. One interesting application that does arise at the
    Centre for Transport Studies consists of matching police reports
    of several accidents so as to locate accident blackspots. The
    application provides an interesting, data-intensive, test-bed for
    the persistent functional language PFL. We report here on an
    approach aimed at improving the performance of this application
    using Glasgow Parallel Haskell.

    The accident application is one of several large parallel Haskell
    programs under development at Glasgow. Our objective is to achieve
    wall-clock speedups over the best sequential implementations, and
    we report modest wall-clock speedups for a demonstration
    program. From experience with these and other programs the group
    is developing a methodology for parallelising large functional
    programs. We have also developed strategies, a mechanism to
    separately specify a function's algorithm and its dynamic
    behaviour.
  }
}
@inproceedings{TrinderHMPP96,
  author = {Trinder, Philip W. and Hammond, Kevin and {Mattson Jr.}, James S. and Partridge, Andrew S. and Peyton-Jones, Simon L.},
  title = {{GUM}: A Portable Parallel Implementation of {Haskell}},
  booktitle = {PLDI'96 --- ACM Conf. on Programming Language Design and Implementation},
  address = {Philadelphia, PA, USA},
  month = may,
  year = {1996},
  publisher = {ACM Press},
  pages = {79--88},
  doi = {http://doi.acm.org/10.1145/231379.231392},
  url = {http://www.macs.hw.ac.uk/~dsg/gph/papers/abstracts/gum.html},
  keywords = {implementation, GpH, GUM},
  abstract = {
    GUM is a portable, parallel implementation of the Haskell
    functional language. Despite sustained research interest in
    parallel functional programming, GUM is one of the first such
    systems to be made publicly available. GUM is message-based, and
    portability is facilitated by using the PVM communications harness
    that is available on many multi-processors. As a result, GUM is
    available for both shared-memory (Sun SPARCserver multiprocessors)
    and distributed-memory (networks of workstations)
    architectures. The high message-latency of distributed machines is
    ameliorated by sending messages asynchronously, and by sending
    large packets of related data in each message. Initial performance
    figures demonstrate absolute speedups relative to the best
    sequential compiler technology. To improve the performance of a
    parallel Haskell program GUM provides tools for monitoring and
    visualising the behaviour of threads and of processors during
    execution.
  }
}
@inproceedings{HammondLP95,
  author = {Hammond, Kevin and Loidl, Hans-Wolfgang and Partridge, Andrew S.},
  title = {Visualising Granularity in Parallel Programs: A Graphical Winnowing System for {Haskell}},
  booktitle = {HPFC'95 --- Conf. on High Performance Functional Computing},
  address = {Denver, CO, USA},
  month = apr,
  year = {1995},
  pages = {208--221},
  keywords = {tools, GpH, simulator, granularity},
  url = {http://www.macs.hw.ac.uk/~dsg/gph/papers/abstracts/winnowing.html},
  abstract = {
    To take advantage of distributed-memory parallel machines it is
    essential to have good control of task granularity. This paper
    describes a fairly accurate parallel simulator for Haskell, based
    on the Glasgow compiler, and complementary tools for visualising
    task granularities. Together these tools allow us to study the
    effects of various annotations on task granularity on a variety of
    simulated parallel architectures. They also provide a more precise
    tool for the study of parallel execution than has previously been
    available for Haskell programs.

    These tools have already confirmed that thread migration is
    essential in parallel systems, demonstrated a close correlation
    between thread execution times and total heap allocations, and
    shown that fetching data synchronously normally gives better
    overall performance than asynchronous fetching, if data is fetched
    on demand.
  }
}
@inproceedings{LoidlH95,
  author = {Loidl, Hans-Wolfgang and Hammond, Kevin},
  title = {On the Granularity of Divide-and-Conquer Parallelism},
  booktitle = {Glasgow Workshop on Functional Programming},
  address = {Ullapool, Scotland},
  month = jul,
  year = {1995},
  publisher = {Springer},
  series = {Workshops in Computing},
  keywords = {models of parallelism, GpH, granularity},
  url = {http://www.macs.hw.ac.uk/~dsg/gph/papers/abstracts/div-conc.html},
  pdf = {http://www.bcs.org/upload/pdf/ewic_fp95_paper13.pdf},
  abstract = {
    This paper studies the runtime behaviour of various parallel
    divide-and-conquer algorithms written in a non-strict functional
    language, when three common granularity control mechanisms are
    used: a simple cut-off, a priority thread creation and a priority
    scheduling mechanism. These mechanisms use granularity information
    that is currently provided via annotations to improve the
    performance of the parallel programs.

    The programs we examine are several variants of a generic
    divide-and-conquer program, an unbalanced divide-and-conquer
    algorithm and a parallel determinant computation. Our results
    indicate that for balanced computation trees a simple,
    low-overhead mechanism performs well whereas the more complex
    mechanisms offer further improvements for unbalanced computation
    trees.
  }
}
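The ``simple cut-off'' mechanism lends itself to a short sketch
(hedged: illustrative modern Haskell using Control.Parallel, not the
annotation scheme measured in the paper): below a depth threshold the
divide-and-conquer function stops sparking, keeping threads coarse:

  import Control.Parallel (par, pseq)

  -- Divide-and-conquer with a depth cut-off: above the threshold one
  -- branch is sparked for parallel evaluation; below it both branches
  -- run sequentially, avoiding a flood of tiny threads.
  dcFib :: Int -> Int -> Int
  dcFib depth n
    | n < 2      = n
    | depth <= 0 = dcFib 0 (n - 1) + dcFib 0 (n - 2)
    | otherwise  = x `par` (y `pseq` (x + y))
    where
      x = dcFib (depth - 1) (n - 1)
      y = dcFib (depth - 1) (n - 2)

  main :: IO ()
  main = print (dcFib 8 30)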
@inproceedings{HammondMP94,
  author = {Hammond, Kevin and {Mattson Jr.}, James S. and Peyton-Jones, Simon L.},
  title = {Automatic Spark Strategies and Granularity for a Parallel Functional Language Reducer},
  booktitle = {CONPAR'94 --- Intl. Conf. on Parallel Processing},
  address = {Linz, Austria},
  month = sep,
  year = {1994},
  publisher = {Springer},
  series = {Lecture Notes in Computer Science},
  volume = {854},
  pages = {521--532},
  keywords = {models of parallelism, GpH, GRIP, granularity},
  doi = {http://dx.doi.org/10.1007/3-540-58430-7_46},
  url = {http://www.macs.hw.ac.uk/~dsg/gph/papers/abstracts/spark-strats.html},
  abstract = {
    This paper considers the issue of dynamic task control in the
    context of a parallel Haskell implementation on the GRIP
    multiprocessor. For the first time, we report the effect of our
    task control strategies on task granularity, as measured in terms
    of dynamic heap allocations. This gives a concrete means of
    measuring the effectiveness of these strategies other than
    wall-clock timings, which are notoriously uninformative.
  }
}
@inproceedings{LoidlH94,
  author = {Loidl, Hans-Wolfgang and Hammond, Kevin},
  title = {{GRAPH} for {PVM}: Graph Reduction for Distributed Hardware},
  booktitle = {IFL'94 --- Intl. Workshop on the Implementation of Functional Languages},
  address = {Norwich, UK},
  month = sep,
  year = {1994},
  keywords = {implementation, GpH, run-time system},
  url = {http://www.macs.hw.ac.uk/~dsg/gph/papers/abstracts/graph-pvm.html},
  abstract = {
    We describe a version of the GRAPH system (Graph Reduction for an
    Assortment of Parallel Hardware) designed to execute parallel
    functional programs on a range of distributed-memory MIMD machines
    running the PVM communications harness.  GRAPH was developed from
    the runtime system for the novel GRIP multiprocessor.  Although
    this system has proved highly successful, its novel architecture
    makes it hard to compare results with other
    architectures or implementations.  The CPUs used in the GRIP
    design are also rather old.

    The principal extensions from GRAPH for GRIP to GRAPH for PVM are
    intended to handle high latencies more efficiently, by exploiting
    asynchronous communication and multi-threaded scheduling, and by
    grouping packets into larger entities where possible.  The main
    innovation is the development of new, sophisticated packet
    flushing and packet fetching algorithms whose purpose is to allow
    graphs to be exported to and fetched from global memory without
    inter-processor synchronisation.  As communication and reduction
    can be interleaved, a new form of synchronising communication and
    reduction is necessary.

    We also give a short outlook on the design of a new portable
    distributed graph reduction system, GRAPH for UMM, which adapts the
    ideas and techniques originally developed for the GRIP system to a
    distributed memory environment.

    GRAPH for PVM has been implemented on a network of SUN
    workstations and it is currently being tested and debugged.
    Although no extensive performance measurements have been performed
    yet, the available data shows a significant decrease in the
    overall communication and thus an improvement in the overall
    performance of the system.
  }
}
@inproceedings{TrinderHLMPP94,
  author = {Hammond, Kevin and Loidl, Hans-Wolfgang and {Mattson Jr.}, James S. and Partridge, Andrew S. and Peyton-Jones, Simon L. and Trinder, Philip W.},
  title = {{GRAPHing} the Future},
  booktitle = {IFL'94 --- Intl. Workshop on the Implementation of Functional Languages},
  address = {Norwich, UK},
  month = sep,
  year = {1994},
  keywords = {implementation, GpH, run-time system},
  url = {http://www.macs.hw.ac.uk/~dsg/gph/papers/abstracts/graphing.html},
  abstract = {
    At Glasgow our research into parallel functional programming has
    been moving away from our novel architecture, GRIP towards the
    provision of a general parallel runtime environment.  We call this
    GRAPH (Graph Reduction for an Assortment of Parallel Hardware).

    This paper describes the design of a new memory and load
    management model for GRAPH, which is intended to match shared- and
    distributed-memory machines better and to provide a framework for
    our research into parallel functional databases and granularity
    control.  This model is currently under implementation.  No
    performance results can therefore be provided at the present time.
  }
}
@inproceedings{AkerholtHPT93,
  author = {Akerholt, Gert and Hammond, Kevin and Peyton-Jones, Simon L. and Trinder, Philip W.},
  title = {Processing Transactions on {GRIP}, a Parallel Graph Reducer},
  booktitle = {PARLE'93 --- Intl. Conf. on Parallel Architectures and Languages Europe},
  address = {Munich, Germany},
  month = jun,
  year = {1993},
  publisher = {Springer},
  series = {Lecture Notes in Computer Science},
  volume = {694},
  pages = {634--647},
  keywords = {parallel applications, GpH, GRIP},
  doi = {http://dx.doi.org/10.1007/3-540-56891-3_51},
  url = {http://www.macs.hw.ac.uk/~dsg/gph/papers/abstracts/trans-grip.html},
  abstract = {
    The GRIP architecture allows efficient execution of functional
    programs on a multi-processor built from standard hardware
    components. State-of-the-art compilation techniques are combined
    with sophisticated runtime resource-control to give good parallel
    performance. This paper reports the results of running GRIP on an
    application which is apparently unsuited to the basic functional
    model: a database transaction manager incorporating updates as
    well as lookup transactions. The results obtained show good
    relative speedups for GRIP, with real performance advantages over
    the same application executing on sequential machines.
  }
}
@inproceedings{Hammond93,
  author = {Hammond, Kevin},
  title = {Getting a {GRIP}},
  booktitle = {Intl. Workshop on the Parallel Implementation of Functional Languages},
  address = {Nijmegen, The Netherlands},
  month = sep,
  year = {1993},
  keywords = {implementation, GpH, run-time system, GRIP},
  url = {http://www.macs.hw.ac.uk/~dsg/gph/papers/abstracts/grip.html},
  abstract = {
    This paper describes a portable implementation of the Graph
    Reduction in Parallel (GRIP) software, built on the Parallel
    Virtual Machine (PVM) process control system. Using this system,
    it is easy for anyone to ``get a GRIP'' without needing to invest
    in special-purpose hardware. An important contribution of this
    paper is its detailed description of GRIP's Intelligent Memory
    Unit (IMU) software.
  }
}
@inproceedings{HammondP92,
  author = {Hammond, Kevin and Peyton-Jones, Simon L.},
  title = {Profiling Scheduling Strategies on the {GRIP} Multiprocessor},
  booktitle = {Intl. Workshop on the Parallel Implementation of Functional Languages},
  address = {Aachen, Germany},
  month = sep,
  year = {1992},
  pages = {73--98},
  keywords = {implementation, GpH, run-time system, GRIP},
  url = {http://www.macs.hw.ac.uk/~dsg/gph/papers/abstracts/grip-sched.html},
  abstract = {
    It is widely claimed that functional languages are particularly
    suitable for programming parallel computers. A claimed advantage
    is that the programmer is not burdened with details of task
    creation, placement, scheduling, and synchronisation, these
    decisions being taken by the system instead. Leaving aside the
    question of whether a pure functional language is expressive
    enough to encompass all the parallel algorithms we might wish to
    program, there remains the question of how effectively the
    compiler and run-time system map the program onto a real parallel
    system, a task usually carried out mostly by the programmer. This
    is the question we address in this paper.

    We first introduce the system architecture of GRIP, a
    shared-memory parallel machine supporting an implementation of the
    functional language Haskell. GRIP executes functional programs in
    parallel using compiled supercombinator graph reduction, a form of
    declarative rule system.

    We then describe several strategies for run-time resource control
    which we have tried, presenting comprehensive measurements of
    their effectiveness. We are particularly concerned with strategies
    controlling task creation, in order to improve task granularity
    and minimise communication overheads. This is, so far as we know,
    one of the first attempts to make a systematic study of
    task-control strategies in a high-performance parallel
    functional-language system. GRIP's high absolute performance
    render these results credible for real applications.
  }
}
@inproceedings{HammondP90,
  author = {Hammond, Kevin and Peyton-Jones, Simon L.},
  title = {Some Early Experiments on the {GRIP} Parallel Reducer},
  booktitle = {Intl. Workshop on the Parallel Implementation of Functional Languages},
  address = {Nijmegen, The Netherlands},
  month = jun,
  year = {1990},
  pages = {51--72},
  keywords = {implementation, GpH, run-time system, GRIP},
  url = {http://www.macs.hw.ac.uk/~dsg/gph/papers/abstracts/early-grip.html},
  abstract = {
    GRIP is a multiprocessor designed to execute functional programs
    in parallel using graph reduction. We have implemented a compiler
    for GRIP, based on the Spineless Tagless G-machine, and can now
    run parallel functional programs with substantial absolute speedup
    over the same program running on a uniprocessor Sun.

    Parallel functional programming shifts some of the burden of
    resource allocation from the programmer to the system. Examples of
    such decisions include: when to create a new concurrent activity
    (or thread), when to execute such threads, where to execute them,
    and so on.

    It is clearly desirable that the system should take such
    decisions, provided it does a good enough job. For example, a
    paged virtual memory system almost always does an adequate job,
    and a programmer very seldom has to interfere with it. The big
    question for parallel functional programming is whether good
    resource-allocation strategies exist, and how well they perform
    under a variety of conditions.

    Now that we have an operational system, we are starting to carry
    out experiments to develop resource-allocation strategies, and
    measure their effectiveness. This paper reports on some very
    preliminary results. They mainly concern the question of when, or
    even whether, to create a new thread. This is an aspect which has
    so far received little attention --- existing work has focused
    mainly on load sharing rather than on thread creation. Our work is
    aided significantly by the fact that sparks indicate only the
    possibility of creating a new parallel thread, not the necessity
    of creating this thread. It is therefore always safe to discard
    sparks, while affecting only the performance of the computation,
    not its result.

    We begin with an overview of the GRIP system, and how we perform
    compiled graph reduction. This is followed by a summary of the
    statistics-gathering system and the results we have obtained so
    far.
  }
}
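The spark semantics sketched in this abstract survives essentially
unchanged in today's GHC runtime: par merely records a candidate for
parallel evaluation, and the system may discard it freely. A minimal
illustration (modern Control.Parallel names, which postdate GRIP):

  import Control.Parallel (par, pseq)

  expensive :: Integer
  expensive = sum [1 .. 10000000]

  -- par records `expensive` as a spark; whether or not the runtime
  -- ever turns that spark into a thread, the main thread still
  -- demands the value itself, so discarding sparks can only affect
  -- performance, never the result.
  main :: IO ()
  main = expensive `par` (expensive `pseq` print expensive)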