This section discusses the internals of the GUM runtime-system. This should give the user of our system an idea about possible performance costs in the system, without going into too much detail. The current version focuses on the memory management in GUM. Separate sections on communication and scheduling will follow. For a more detailed discussion see our GUM paper.
In this section we give an overview of the design of GUM with special emphasis on memory management. We motivate the design decisions made for GUM and discuss improvements to the basic runtime-system as well as the impact of certain design changes in the new runtime-system for GHC starting with version 4.00 [NewRTS]. For a more detailed discussion of GUM the interested reader is referred to [GUM-IFL].
The key concepts in the design of GUM are an underlying distributed-memory model, a virtual shared heap implemented on top of that model and automatic resource control for both work and data. Based on this design we can identify several components with interrelated tasks:
The ``evaluate-and-die'' thread management model we use in GUM has been developed for the GRIP parallel graph reduction machine [GRIP-eval]. This model uses sparks, pointers to graph structures, in order to represent potential parallelism. Sparks are generated via an explicit par annotation in the program and maintained by the runtime-system in a flat spark pool. A ``sparked'' expression may be executed by an independent thread. However, if another thread needs the value of the expression, and no other thread is evaluating it, this thread will perform the computation itself. We call this behaviour thread subsumption because the potentially parallel work is inlined by another thread, typically the parent thread. This idea of dynamically increasing the granularity of the threads by deferring the decision whether to generate a thread is similar to the independently developed lazy task creation model [LTK].
The synchronisation between threads is implicit via accessing shared ``closures'', nodes in the graph structure. We distinguish between normal-form closures, which represent data, and thunks, which represent work (unevaluated data). If a thread needs the result of a graph structure that is currently being evaluated or that resides on another PE it is blocked on this structure. As soon as the result becomes available the thread is awoken and can continue.
One basic characteristic of the memory management in GUM is its focus on heap allocated data structures and related to that the transfer of arbitrary graph structures, possibly including unevaluated expressions. Most closely related to our work are Kesseler's studies on graph copying costs in a distributed implementation of Concurrent Clean [Kesseler]. However, rather than adjusting the generic graph packing algorithm, which is similar to our normal-form-only packing, he focuses on using arrays as representation of data that has to be transported. Thus, some of the complexity is lifted from the system to the program level and possible advantages of a lazy data transfer of complex graphs may be lost. Another virtual shared heap implementation of a parallel functional language is pHluid [pHluid]. In this implementation the heap is partitioned into pages, being owned by a specific processor. A cache coherence protocol decides when an entire page has to be sent to another processor, resulting in very coarse-grained communication without the flexibility of a special treatment based on the type of the closure, which we have in GUM. Goldstein [Goldstein] presents a thorough analysis of the performance of different thread management and memory management models in the context of the dataflow-inspired TAM machine. His focus, however, is on the efficiency of a work stealing scheduler with the transfer of stack segments (for different stack representations) rather than arbitrary heap objects.
The older GRIP model used a two level memory hierarchy, where only closures in the global heap have global addresses (GAs), i.e.\ globally unique identifiers. Because no pointers from global to local heap were permitted, graphs in the local heap often had to be globalised, or ``flushed'' into the global heap, for example when updating a potentially shared graph. To improve parallel performance the amount of flushing had to be minimised, a problem that GRIP shares with the more recent design of a shared-memory runtime-system for GHC [SMP-RTS]. See [GRIP-Sparking] for a detailed discussion of GRIP's memory management model.
In order to provide a more generic memory management model the following design decisions have been made for GUM. A flat memory hierarchy is used, i.e. the access to any closure in one PE's heap is uniform. Every globally visible closure in the heap is identified via a global address (GA). The mapping of these GAs to local heap addresses and vice versa is done via a hash table, the global indirection table (GIT). Global indirections (FetchMes) always use the GA to identify the remote object (see Figure ). This design enables separate local garbage collection on each of the PEs, provided that the GIT is rebuilt after every garbage collection to map those GAs, which are still alive, to their new addresses in the heap. Rebuilding the GIT is therefore one major source of overhead of the memory management model. This design is based on the assumption, verified on GRIP, that only a small fraction of the heap is globally visible and that rebuilding the GIT is therefore cheaper than allocating and maintaining one additional word, the GA, for every closure in the system.
The mapping of global to local addresses is only needed for determining whether a copy of a newly imported graph structure already exists on that PE. In this case, the less evaluated version of the graph will become an indirection to the further evaluated version. This avoids duplicating data that might have been imported via different PEs (one exception of this rule is discussed in the following paragraph). The decision of ``globalising'' both normal forms and thunks, i.e. generating GAs for both, reduces the total memory consumption in the system. However, it increases packing costs and the size of the packet which is sent. In Section we will measure different globalisation schemes.
An important feature of GUM is the independent local garbage collection on each PE. In order to enable global garbage collection weighted reference counting on GAs is used [Bevan]. Whenever a new reference to a globally visible closure is created, it receives half the weight from the original reference. When during local garbage collection a reference to a remote closure is freed, the weight of that reference is added to the reference maintained in the GIT. Once a GA contains the maximal weight, there are no more remote references to this closure and it can be removed from the GIT.
When the weight of the GA becomes one, it can no longer be split. In this case a new GA has to be allocated, with both GAs referring to the same closure. The new GA will be used for further splitting of weights if necessary. However, since the GAs are different now, it is no longer possible to determine that they refer to the same local object and sharing is lost in this case.
While GRIP was designed for a low-latency machine, using low-overhead, synchronous, single-closure communication, GUM was designed for a generic distributed-memory machine. In order to reduce the total amount of communication and to permit latency hiding, i.e. overlapping communication with computation, GUM uses asynchronous, bulk communication, as shown in Figure . The packing scheme determines how much of the graph structure to put into a communication packet. By default GUM uses full subgraph packing, limited by the fixed packet size, i.e. when a graph is larger than the packet size only the initial portion of the graph is sent. In Section we will measure different packing schemes.
The packing algorithm traverses the graph in a breadth-first fashion. For each closure a new GA is allocated, if it does not already have one. When packing the closure itself, thunks and normal-forms have to be treated differently. In order to avoid the duplication of work, thunks are never copied but moved between processors. Therefore, the original thunk is replaced with a ``revertible black hole''. Should other threads demand the value before the transfer is complete they will be blocked. Normal-forms on the other hand can be freely copied.
On the receiver side, the graph is unpacked, checking for the presence of other copies of the imported closures in order to maintain sharing. For thunks new GAs are allocated, which determine the location of the closure. After unpacking the whole graph a mapping of old to new GAs is transmitted to the sender, which then replaces all revertible black holes with global indirections to the new GAs, and the old GAs become garbage. Figure summarises the transfer of graph structures in GUM.