MMnet'13: Language and Runtime Support for Concurrent Systems
Wednesday 8th May 2013
Heriot-Watt University, Edinburgh
http://www.macs.hw.ac.uk/~dsg/events/mmnet13
MMnet'13 will start at around 11:00 on Wednesday 8 May and finish by 17:30. Talks will be either 15 or 30 minutes, including time for questions. Lunch and refreshments will be provided throughout the day.
Schedule for the day (all in EM 1.82):
10:30-11:00 | Arrival and Coffee
11:00-11:30 | Tim Harris | Oracle Labs, Cambridge
Cluster computing is becoming increasingly important because the size of workloads continues to grow faster than the size of individual machines. In this talk I will argue that:
- The resource demands of emerging workloads (e.g., distributed graph
  analytics) look different from those of software traditionally
  deployed on clusters (HPC and distributed/replicated servers).
- With jobs spanning multiple machines, no individual system is in
  control of traditional OS functions such as scheduling and resource
  management. This leads to poor interactions (e.g., cluster-wide
  scheduling of jobs to machines that is unaware of the exact load on
  individual machines) and wasted resources (e.g., where machines or
  VMs are statically assigned but their resources go unused).
I will describe some of the trends I am seeing, and research directions I am exploring in the design of distributed runtime systems. This is an informal work-in-progress talk - feedback very welcome.
11:30-12:00 | Lokesh Gidra | University Pierre & Marie Curie, Paris
Large-scale multicore architectures create new challenges for garbage
collectors (GCs). In particular, throughput-oriented stop-the-world
algorithms demonstrate good performance with a small number of cores, but
have been shown to degrade badly beyond approximately 8 cores on a 48-core
machine running OpenJDK 7. This negative result raises the question of
whether the stop-the-world design has intrinsic limitations that would
require a radically different approach. Our study suggests that the answer
is no, and that there is no compelling scalability reason to discard the
existing highly optimised throughput-oriented GC code on contemporary
hardware. In this talk, I will present our paper, which studies the default
throughput-oriented garbage collector of OpenJDK 7, Parallel Scavenge. We
identify its bottlenecks and show how to eliminate them using
well-established parallel programming techniques. On the SPECjbb2005,
SPECjvm2008 and DaCapo 9.12 benchmarks, the improved GC matches the
performance of Parallel Scavenge at low core counts, but scales well up to
48 cores.
12:00-12:30 | Susmit Sarkar | Univ of St Andrews
Shared memory concurrent programming and verification traditionally
assumes Sequential Consistency, but real-world hardware only provides
relaxed consistency models. Furthermore, the consistency models of
different hardware architectures vary widely and have often been
poorly defined, while programming language models (aiming to abstract
from hardware details) are different again. The IBM Power line of
multiprocessors has long been recognised to have one of the most
complex and subtle memory consistency models, and ARM multiprocessors
have a broadly similar model. Together, they directly inspired
features of the new C/C++ concurrency model in ISO C/C++11. However,
it was an open question among the ISO concurrency committee whether
the model even made sense, or could be implemented. I will point out
some of the key features of the C/C++ concurrency model, and what is
involved in implementing it correctly.
A formal proof of implementability depends crucially on
mathematically precise abstract models for the target hardware. I will
discuss a model of the Power and ARM multiprocessors, which has shed
light on hitherto unknown programmer-observable behaviour. The model
was developed based on extensive experiments and on discussion with
IBM and ARM architects; it is expressed as an abstract machine,
abstracting as much as possible (but no more) from microarchitectural
implementation details.
I will also describe proofs that show that a recommended
implementation strategy for C/C++11 concurrency to Power is sound. All
compilers will have to implement such a strategy, and furthermore, all
have to agree on a common strategy for linking to be sound. This proof
builds confidence in both models, and also provides insight into how
the two very different models really work.
12:30-13:30 | Lunch
13:30-14:00 | Jeremy Singer | Univ of Glasgow
We formulate heap sizing as a control problem, apply and tune a standard controller algorithm, and evaluate its performance on a set of well-known benchmarks. We find our controller adapts the heap size more responsively than existing mechanisms.
This responsiveness allows tighter virtual machine memory footprints while preserving target application throughput, which is ideal for both embedded and utility computing domains.
In short, we argue that formal, systematic
approaches to memory management should be replacing ad-hoc heuristics
as the discipline matures.
Control-theoretic heap sizing is one such systematic approach.
Paper: ISMM'13 paper
14:00-14:30 | David Matthews | Prolingua Limited
The Isabelle proof system is a major user of Poly/ML, and the addition of
concurrency to Poly/ML has focussed attention on the storage management
system, particularly for larger proofs. The talk will describe the new
parallelised garbage collector of Poly/ML and the heap-sizing strategy
used, as well as the extra "sharing" pass that can be activated when
memory is particularly short. The talk will conclude with some
performance results from Isabelle runs on multi-core processors.
14:30-15:00 | Sven-Bodo Scholz | Heriot-Watt Univ, Edinburgh
Truly implicit memory management of large data structures benefits
from non-delayed garbage collection implemented through reference counting.
When pushing this approach towards nested forms of massively parallel
execution, new challenges arise with respect to the implementation of
reference counting: naive implementations can easily introduce overheads
that badly impact overall performance.
This talk proposes a staged, multi-modal technique for reference counting
which has been implemented in the context of nested functional and data-parallel
concurrency in SaC. We show some performance figures that demonstrate
very low overheads for fine-granular nested concurrency on a massively
parallel micro-threading architecture.
15:00-15:30 | Coffee
15:30-16:00 | Richard Jones | University of Kent
Experimental evaluation is key to systems research. Because modern systems are complex and non-deterministic, good experimental methodology demands that researchers account for uncertainty. To obtain valid results, they are expected to run many iterations of benchmarks, invoke virtual machines (VMs) several times, or even rebuild VM or benchmark binaries more than once. All this repetition costs time to complete experiments. Currently, many evaluations give up on sufficient repetition or rigorous statistical methods, or even run benchmarks only in training sizes. The results reported often lack proper variation estimates and, when a small difference between two systems is reported, some are simply unreliable.
In contrast, we provide a statistically rigorous methodology for repetition and summarising results that makes efficient use of experimentation time. Time efficiency comes from two key observations. First, a given benchmark on a given platform is typically prone to much less non-determinism than the common worst-case of published corner-case studies. Second, repetition is most needed where most uncertainty arises (whether between builds, between executions or between iterations). We capture experimentation cost with a novel mathematical model, which we use to identify the number of repetitions at each level of an experiment necessary and sufficient to obtain a given level of precision.
We present our methodology as a cookbook that guides researchers on the number of repetitions they should run to obtain reliable results. We also show how to present results with an effect size confidence interval. As an example, we show how to use our methodology to conduct throughput experiments with the DaCapo and SPEC CPU benchmarks on three recent platforms.
16:00-16:30 | Kevin Hammond | Univ of St Andrews
This talk describes the first successful attempt, of which we are
aware, to define an automatic, type-based static analysis of resource
bounds for lazy functional programs. Our analysis uses the automatic amortisation approach developed by Hofmann and Jost,
which was previously restricted to eager evaluation. In this paper,
we extend this work to a lazy setting by capturing the costs of unevaluated expressions in type annotations and by amortising the
payment of these costs using a notion of lazy potential. We present
our analysis as a proof system for predicting heap allocations of
a minimal functional language (including higher-order functions
and recursive data types) and define a formal cost model based on
Launchbury's natural semantics for lazy evaluation. We prove the
soundness of our analysis with respect to the cost model. Our approach is illustrated by a number of representative and non-trivial
examples that have been analysed using a prototype implementation of our analysis.
Paper: ICFP'12 paper
16:30-17:00 | Paul Keir | Codeplay Software Ltd, Edinburgh
A single-source heterogeneous parallel programming model may be attractive
for a number of reasons, including: the provision of an intuitive
development interface; ease of debugging; and the facilitation of a concise
codebase. Common examples are CUDA, OpenACC, and C++ AMP. At Codeplay
Software we also provide a range of single-source C++ solutions within our
Offload technology.
Such programming models allow large, existing codebases to be rapidly
parallelised for heterogeneous architectures. The sword has a second edge,
however: the data transfers between distinct physical memories, and their
corresponding address spaces, become implicit.
Our Offload technology is a novel, patented set of heterogeneous parallel
programming language technologies featuring enhanced pointer types that
are implicitly, statically, and immutably associated with a single address
space, typically corresponding to a unit of physical memory.
In this talk I will describe our ongoing research within the EU Framework 7
project LPGPU to visualise and react to expensive direct memory access
operations, applying new tools that augment the Offload C++ compiler for
PlayStation 3.
17:00-17:15 | Malak Aljabri | Heriot-Watt Univ, Edinburgh
The most widely available high performance platforms today are
multi-level clusters of multi-cores. GHC provides at least two parallel
Haskell implementations: GHC-SMP for shared memory, and GUM for distributed memory.
Both implementations use different but related runtime-environment (RTE)
mechanisms. Good performance results can be achieved on shared memory
architectures and on networks individually, but a combination of both, for networks
of multi-cores, is lacking.
We present the design of GUMSMP, a new, multi-level, scalable parallel
Haskell implementation that better exploits such platforms. It efficiently combines
distributed memory parallelism using a virtual shared heap over the cluster with
optimised, low-overhead shared memory parallelism on the multi-cores. Key design
objectives in realising this system are even but asymmetric load balance,
effective latency hiding, and mostly passive load distribution.
We give early results on a network of 8-core machines, using
a flat distributed memory GUM and a 2-level GUMSMP implementation.
17:15 | Closing