MMnet'13: Language and Runtime Support for Concurrent Systems
Wednesday 8th May 2013
Heriot-Watt University, Edinburgh
http://www.macs.hw.ac.uk/~dsg/events/mmnet13
MMnet'13 will start at around 11:00 on Wednesday 8 May and finish by 17:30. Talks will be either 15 or 30 minutes, including time for questions. Lunch and refreshments will be provided throughout the day.
Schedule for the day (all in EM 1.82):
10:30-11:00 | Arrival and Coffee
11:00-11:30 | Tim Harris | Oracle Labs, Cambridge
Cluster computing is becoming increasingly important because the size of workloads continues to grow faster than the size of individual machines. In this talk I will argue that:
- The resource demands of emerging workloads (e.g., distributed graph
  analytics) look different from those of software traditionally
  deployed on clusters (HPC and distributed/replicated servers).
- With jobs spanning multiple machines, no individual system is in
  control of traditional OS functions such as scheduling and resource
  management. This leads to poor interactions (e.g., cluster-wide
  scheduling of jobs to machines that is unaware of the exact load on
  individual machines) and wasted resources (e.g., where machines or
  VMs are statically assigned but their resources go unused).
I will describe some of the trends I am seeing, and research directions I am exploring in the design of distributed runtime systems. This is an informal work-in-progress talk - feedback very welcome.
11:30-12:00 | Lokesh Gidra | University Pierre & Marie Curie, Paris
Large-scale multicore architectures create new challenges for garbage
collectors (GCs). In particular, throughput-oriented stop-the-world
algorithms demonstrate good performance with a small number of cores, but
have been shown to degrade badly beyond approximately 8 cores on a 48-core
machine running OpenJDK 7. This negative result raises the question of
whether the stop-the-world design has intrinsic limitations that would
require a radically different approach. Our study suggests that the answer
is no, and that there is no compelling scalability reason to discard the
existing highly optimised throughput-oriented GC code on contemporary
hardware. In this talk, I will present our paper, which studies the default
throughput-oriented garbage collector of OpenJDK 7, Parallel Scavenge. We
identify its bottlenecks and show how to eliminate them using
well-established parallel programming techniques. On the SPECjbb2005,
SPECjvm2008 and DaCapo 9.12 benchmarks, the improved GC matches the
performance of Parallel Scavenge at low core counts, but scales well up to
48 cores.
12:00-12:30 | Susmit Sarkar | Univ of St Andrews
Shared memory concurrent programming and verification traditionally
assumes Sequential Consistency, but real-world hardware only provides
relaxed consistency models. Furthermore, the consistency models of
different hardware architectures vary widely and have often been
poorly defined, while programming language models (aiming to abstract
from hardware details) are different again. The IBM Power line of
multiprocessors has long been recognised to have one of the most
complex and subtle memory consistency models, and ARM multiprocessors
have a broadly similar model. Together, they directly inspired
features of the new C/C++ concurrency model in ISO C/C++11. However,
it was an open question among the ISO concurrency committee whether
the model even made sense, or could be implemented. I will point out
some of the key features of the C/C++ concurrency model, and what is
involved in implementing it correctly.
A formal proof of implementability depends crucially on
mathematically precise abstract models for the target hardware. I will
discuss a model of the Power and ARM multiprocessors, which has shed
light on hitherto unknown programmer-observable behaviour. The model
was developed based on extensive experiments and on discussion with
IBM and ARM architects; it is expressed as an abstract machine,
abstracting as much as possible (but no more) from microarchitectural
implementation details.
I will also describe proofs that show that a recommended
implementation strategy for C/C++11 concurrency to Power is sound. All
compilers will have to implement such a strategy, and furthermore, all
have to agree on a common strategy for linking to be sound. This proof
builds confidence in both models, and also provides insight into how
the two very different models really work.
12:30-13:30 | Lunch
13:30-14:00 | Jeremy Singer | Univ of Glasgow
We formulate heap sizing as a control problem, apply and tune a standard controller algorithm, and evaluate its performance on a set of well-known benchmarks. We find our controller adapts the heap size more responsively than existing mechanisms.
This responsiveness allows tighter virtual machine memory footprints while preserving target application throughput, which is ideal for both embedded and utility computing domains.
In short, we argue that formal, systematic
approaches to memory management should be replacing ad-hoc heuristics
as the discipline matures.
Control-theoretic heap sizing is one such systematic approach.
Paper: ISMM'13 paper
14:00-14:30 | David Matthews | Prolingua Limited
The Isabelle proof system is a major user of Poly/ML, and the addition of
concurrency to Poly/ML has focussed attention on the storage management
system, particularly for larger proofs. The talk will describe the new
parallelised garbage collector of Poly/ML and the heap-sizing strategy
used, as well as the extra "sharing" pass that can be activated when
memory is particularly short. The talk will conclude with some
performance results from Isabelle runs on multi-core processors.
14:30-15:00 | Sven-Bodo Scholz | Heriot-Watt Univ, Edinburgh
Truly implicit memory management of large data structures benefits
from non-delayed garbage collection implemented through reference counting.
When pushing this approach towards nested forms of massively parallel
execution, new challenges arise with respect to the implementation of
reference counting: naive implementations can easily introduce overheads
that badly impact overall performance.
This talk proposes a staged, multi-modal technique for reference counting
which has been implemented in the context of nested functional and data-parallel
concurrency in SaC. We show some performance figures that demonstrate
very low overheads for fine-granular nested concurrency on a massively
parallel micro-threading architecture.
15:00-15:30 | Coffee
15:30-16:00 | Richard Jones | University of Kent
Experimental evaluation is key to systems research. Because modern systems are complex and non-deterministic, good experimental methodology demands that researchers account for uncertainty. To obtain valid results, they are expected to run many iterations of benchmarks, invoke virtual machines (VMs) several times, or even rebuild VM or benchmark binaries more than once. All this repetition costs time to complete experiments. Currently, many evaluations give up on sufficient repetition or rigorous statistical methods, or even run benchmarks only in training sizes. The results reported often lack proper variation estimates and, when a small difference between two systems is reported, some are simply unreliable.
In contrast, we provide a statistically rigorous methodology for repetition and summarising results that makes efficient use of experimentation time. Time efficiency comes from two key observations. First, a given benchmark on a given platform is typically prone to much less non-determinism than the common worst-case of published corner-case studies. Second, repetition is most needed where most uncertainty arises (whether between builds, between executions or between iterations). We capture experimentation cost with a novel mathematical model, which we use to identify the number of repetitions at each level of an experiment necessary and sufficient to obtain a given level of precision.
We present our methodology as a cookbook that guides researchers on the number of repetitions they should run to obtain reliable results. We also show how to present results with an effect size confidence interval. As an example, we show how to use our methodology to conduct throughput experiments with the DaCapo and SPEC CPU benchmarks on three recent platforms.
16:00-16:30 | Kevin Hammond | Univ of St Andrews
This talk describes the first successful attempt, of which we are
aware, to define an automatic, type-based static analysis of resource
bounds for lazy functional programs. Our analysis uses the automatic amortisation approach developed by Hofmann and Jost,
which was previously restricted to eager evaluation. In this paper,
we extend this work to a lazy setting by capturing the costs of unevaluated expressions in type annotations and by amortising the
payment of these costs using a notion of lazy potential. We present
our analysis as a proof system for predicting heap allocations of
a minimal functional language (including higher-order functions
and recursive data types) and define a formal cost model based on
Launchbury's natural semantics for lazy evaluation. We prove the
soundness of our analysis with respect to the cost model. Our approach is illustrated by a number of representative and non-trivial
examples that have been analysed using a prototype implementation of our analysis.
Paper: ICFP'12 paper
16:30-17:00 | Paul Keir | Codeplay Software Ltd, Edinburgh
A single-source heterogeneous parallel programming model may be attractive
for a number of reasons, including: the provision of an intuitive
development interface; ease of debugging; and the facilitation of a concise
codebase. Common examples are CUDA, OpenACC, and C++ AMP. At Codeplay
Software we also provide a range of single-source C++ solutions within our
Offload technology.
Such programming models allow large, existing codebases to be rapidly
parallelised for heterogeneous architectures. The sword has a second edge,
however: the data transfers between distinct physical memories, and their
corresponding address spaces, become implicit.
Our Offload technology is a novel, patented set of heterogeneous parallel
programming language technologies featuring enhanced pointer types that
are implicitly, statically, and immutably associated with a single address
space, typically corresponding to a unit of physical memory.
In this talk I will describe our ongoing research within the EU Framework 7
project LPGPU to visualise and react to expensive direct memory access
operations, applying new tools that augment the Offload C++ compiler for
PlayStation 3.
17:00-17:15 | Malak Aljabri | Heriot-Watt Univ, Edinburgh
The most widely available high performance platforms today are
multi-level clusters of multi-cores. GHC provides at least two parallel
Haskell implementations: GHC-SMP for shared memory, and GUM for distributed memory.
Both implementations use different but related runtime-environment (RTE)
mechanisms. Good performance results can be achieved on shared memory
architectures and on networks individually, but a combination of both, for networks
of multi-cores, is lacking.
We present the design of GUMSMP, a new, multi-level, scalable parallel
Haskell implementation that better exploits such platforms. It efficiently combines
distributed memory parallelism using a virtual shared heap over the cluster with
optimised, low-overhead shared memory parallelism on the multi-cores. Key design
objectives in realising this system are even but asymmetric load balance,
effective latency hiding, and mostly passive load distribution.
We give early results on a network of 8-core machines, using
a flat distributed memory GUM and a 2-level GUMSMP implementation.
17:15 | Closing