

Parallel Functional Programming in the Large: Strategies


[SEE THE STRATEGY PAPER: "STRATEGIES FOR WRITING PARALLEL NON-STRICT PROGRAMS", TRINDER P.W., LOIDL H.W., HAMMOND K., PEYTON JONES S.L. (IN PREPARATION). ]

This chapter deals with parallel, lazy functional programming in the large. It introduces the notion of strategies and shows how they make it easier to develop large parallel programs.

Motivation for Strategies

The following naive version of parmap often does not give good parallelism:

parmap :: (a -> b) -> [a] -> [b]
parmap  f []     = []
parmap  f (x:xs) = fx `par` pmxs `par` (fx:pmxs)
                    where fx = f x
                          pmxs = parmap f xs

If the spark for pmxs is discarded in this version, almost all parallelism in the program is lost. On the other hand, if all pars are executed, more parallel threads than necessary are created (two for each list element).

An alternative version of parmap reduces the total number of generated threads by replacing the second par with a seq:

parmap :: (a -> b) -> [a] -> [b]
parmap  f []     = []
parmap  f (x:xs) = fx `par` pmxs `seq` (fx:pmxs)
                    where fx = f x
                          pmxs = parmap f xs

The problem with this version is that, in a context that requires the result of parmap only in weak head normal form, the recursive call to parmap will be deferred until its value is needed. This drastically reduces the total amount of parallelism in the program. In particular, if such a function is used as the producer in a program with producer-consumer parallelism, it gives very poor parallelism (see section Example Program: Sum of Squares).

To improve this behaviour one can write a forcing function, which guarantees that all of the result is demanded immediately:

forcelist []     = ()
forcelist (x:xs) = x `seq` (forcelist xs)

Using this function in parmap gives the following definition:

parmap :: (a -> b) -> [a] -> [b]
parmap  f []     = []
parmap  f (x:xs) = fx `par`  (forcelist pmxs) `seq` (fx:pmxs)
                   where fx = f x
                         pmxs = parmap f xs

However, following this approach yields a large collection of forcing functions and mixes the definition of the result with the code that drives the computation. We want to avoid both.

The Main Idea

[FOR NOW MAINLY THE ABSTRACT OF THE STRATEGY PAPER]

Separate the components of a parallel, lazy functional program:

  1. Parallelism
  2. Evaluation degree
  3. Definition of the result

In most current parallel non-strict functional languages, a function must describe both the value to be computed and the dynamic behaviour. The dynamic behaviour of a function has two aspects: parallelism, i.e. what values could be computed in parallel, and evaluation degree, i.e. how much of each value should be constructed. Our experience is that specifying the dynamic behaviour of a function obscures its semantics. Strategies have been developed to address this problem; a strategy is an explicit description of a function's potential dynamic behaviour. The philosophy is that it should be possible to understand the semantics of a function without considering its dynamic behaviour.

In essence, a strategy is a function that traverses a data structure at runtime, specifying how much of it should be evaluated and possibly sparking threads to construct parts of it in parallel. Because strategies are functions, they can be defined on any type and can be combined to specify sophisticated dynamic behaviour. For example, a strategy can control thread granularity or specify evaluation over an infinite structure; a small sketch follows below. A disadvantage is that a strategy requires an additional runtime pass over parts of the data structure. Example programs are given that use strategies in divide-and-conquer, pipeline and data-parallel applications.
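
For instance, the following sketch (a hypothetical example using the parListN and rnf strategies and the `using` application, all defined later in this chapter) evaluates only the first 100 elements of an infinite list in parallel, bounding both the amount of evaluation and the number of threads:

firstSquares :: [Int]
firstSquares = squares `using` parListN 100 rnf
               where squares = [ i*i | i <- [1..] ]

Only the sparked prefix is evaluated; the rest of the list remains lazy and available to the consumer.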

Example Program: Quicksort


This section compares two versions of parallel quicksort: one using a forcing function and one using strategies.

Parallel Quicksort using Forcing Functions

The following naive attempt to introduce parallelism into quicksort fails because the threads generating losort and hisort only create a single cons cell each.

quicksortN []     = []
quicksortN [x]    = [x]
quicksortN (x:xs) = 
 losort `par`
 hisort `par`
 result        
  where    
   losort = quicksortN [y|y <- xs, y < x] 
   hisort = quicksortN [y|y <- xs, y >= x]
   result = losort ++ (x:hisort)

The current practice of parallel Haskell programmers is to introduce a forcing function that forces the evaluation of the required elements of a data structure. For example

forceList :: [a] -> ()
forceList [] = ()
forceList (x:xs) = x `seq` forceList xs

Quicksort can be rewritten to have the desired behaviour using forceList as follows.

quicksortF []      = []
quicksortF [x]     = [x]
quicksortF (x:xs)  = 
 (forceList losort) `par`
 (forceList hisort) `par`
 losort ++ (x:hisort)
  where
   losort = quicksortF [y|y <- xs, y < x] 
   hisort = quicksortF [y|y <- xs, y >= x]

To obtain the required dynamic behaviour for the parmap example, we might use forceList within the definition of f:

f x = forceList result `seq` result
      where
        result = [fib x]

When programs are written in this style, a number of forcing functions are required: at least one for each type, e.g. forcePair for pairs. To obtain good dynamic behaviour, par, seq and forcing functions are inserted throughout the program. In consequence, the dynamic behaviour and the static semantics of the computation are intertwined. This intertwining obscures the semantics of the program. Small-scale programs remain manageable, but in larger programs, particularly those with complex data structures and parallel behaviour, discerning the meaning of a program becomes very hard.
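
For pairs, one plausible shape for such a forcing function, parameterised by forcing functions for the two components, is the following sketch (not the actual library code):

forcePair :: (a -> ()) -> (b -> ()) -> (a,b) -> ()
forcePair fa fb (x,y) = fa x `seq` fb y `seq` ()

For example, forcePair forceList forceList forces both components of a pair of lists.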

Parallel Quicksort using Strategies

In the strategy version we can specify the evaluation degree by applying one of the predefined strategies to the components of the result. These strategies are then combined by using either par or seq to specify the parallelism.

In quicksort, each parallel thread should construct all of its result list, and rnf expresses this neatly. The interesting equation becomes

quicksortS (x:xs) = losort ++ (x:hisort) `using` strategy
  where
    losort = quicksortS [y|y <- xs, y < x]
    hisort = quicksortS [y|y <- xs, y >= x]
    strategy result = rnf losort `par`
                      rnf hisort `par`
                      rnf result `par`
                      ()
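
For completeness, the base cases are the same as in quicksortF:

quicksortS []     = []
quicksortS [x]    = [x]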

Example Program: Sum of Squares


Although the sum-of-squares program is a very simple producer-consumer program, it already shows some basic problems of this model. Unfortunately, the simple version of the program, which was presented in section Semi-explicit Parallelism, produces hardly any parallelism at all. The reason for this behaviour is that, due to lazy evaluation, the producer only creates the top cons cell of the intermediate list. When the consumer then demands the rest of the list, it computes the squares as well as the resulting sum itself. In the example below, version 1 (with res_naive) shows this behaviour.

To improve the parallelism one can use a forcing function on the intermediate list. The parallel thread then applies this forcing function to the squares list. As expected, this creates two threads that compute most of the time and occasionally have to exchange data. Version 2 (with res_force) shows this behaviour.

A better way to achieve the same operational behaviour is to define a strategy that specifies how to compute the result. The strategy describes both the parallelism and the evaluation degree. In doing so, it uses the reduce-to-normal-form strategy rnf, which is defined in the class NFData. Compared to version 2, the strategy version 3 (with res_strat) achieves a clean separation between defining the result and specifying operational details like parallelism and evaluation degree.

module Main where

import StrategiesVII

main = getArgs abort ( \ a -> 
       let 
         args :: [Int]
         args = map (\ a1 -> fst ((readDec a1) !! 0)) a       
         version =  args !! 0      
         n = args !! 1
         -- l = [1..]
         squares :: [Int] 
         squares = [ i^2 | i <- [1..n] ]
         -- Naive version sparks a producer for list squares but doesn't force it
         res_naive = squares `par` sum squares
         -- This version sparks a forcing function on the list (producer)
         res_force = (foldl1 seq squares) `par` sum squares
         -- The strategy version
         res_strat = sum squares `using` (rnf squares `par` rnf)

         res = case version of
                1 -> res_naive		     
                2 -> res_force		     
                3 -> res_strat		     
                _ -> res_naive
         str = case version of
                1 -> "Naive version"
                2 -> "Forcing version"		     
                3 -> "Strategy version"
                _ -> "Naive version"
       in 
       print ("Sum of squares (" ++ str ++ ") of length " ++ (show n) ++ 
              " = " ++ (show res) ++ "\n")  )

When running the naive version of this program, the spark for the producer is pruned and all of the producer's computation is subsumed by the consumer (the main thread). This gives a totally sequential program. The relevant parts of the GranSim profile are:

Granularity Simulation for sq 1 99 +RTS -bP -bp: -bs 

++++++++++++++++++++

PE  0 [0]: START    	0	0x100620	[SN 0]
PE  0 [67445]: SPARK    	20004	0x347dbc	[sparks 0]
PE  0 [68241]: PRUNED   	20004	0x347dbc	[sparks 1]
PE  0 [327088]: END 0, SN 0, ST 0, EXP F, BB 847, HA 5426, RT 327088, BT 0 (0), FT 0 (0), LS 0, GS 1, MY T

The other two versions of the program create one producer and one consumer thread as expected.

The per thread activity profile shows the behaviour of the strategy version of the program. Thread 1 is the main thread which computes sum squares and therefore is the consumer process. It creates thread 2 as the producer process, which evaluates squares. The producer is evaluating the list to normal form and therefore is one continuous thread. The consumer has to wait for the results from the producer from time to time and is blocked during these periods.

[Figure: sq3-ap.ps]

Per-thread activity profile for sum of squares

Using a Class of Strategies

[FOR NOW THIS SECTION IS JUST A SHORT DISCUSSION OF THE STRATEGIES MODULE]

This section discusses the contents of the strategies module. It shows how strategies are implemented on top of the basic par and seq annotations. Haskell's overloading mechanism is used to define a strategy for reducing to normal form. The current implementation is based on Haskell 1.2 but in the near future we want to use constructor classes as provided in Haskell 1.3.

Type Definition of Strategies

A strategy does not return a meaningful result; it only specifies parallelism and evaluation degree. Therefore, we use the following type for strategies:

type Strategy a = a -> ()

We can now define strategy application, which attaches information about the operational behaviour of the evaluation to an expression:

using :: a -> Strategy a -> a
using x s = s x `seq` x

Note that x `using` s is a projection on x, i.e. it is both a retraction (x `using` s is no more defined than x) and idempotent ((x `using` s) `using` s = x `using` s).

Primitive Strategies

The primitive strategies are basically just syntactic sugar to fit par and seq into the strategy scheme.

sPar is a strategy corresponding to par, i.e. x `par` e <=> e `using` sPar x

sPar :: a -> Strategy b
sPar x y = x `par` ()

sSeq is a strategy corresponding to seq, i.e. x `seq` e <=> e `using` sSeq x

sSeq :: a -> Strategy b
sSeq x y = x `seq` ()

Basic Strategies

Three basic strategies describe the evaluation degree of a data structure: r0 performs no evaluation at all, rwhnf reduces its argument to weak head normal form, and rnf reduces it to normal form. Since normal form depends on the type of the data structure, rnf is overloaded as a method of the class NFData.
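
Their definitions in the strategies module are essentially the following (shown as a sketch; rnf is overloaded via the class NFData, with reduction to weak head normal form as the default method):

r0 :: Strategy a
r0 x = ()                 -- no evaluation at all

rwhnf :: Strategy a
rwhnf x = x `seq` ()      -- reduce to weak head normal form

class NFData a where
  -- reduce to normal form; instances force all components
  rnf :: Strategy a
  rnf = rwhnf             -- default method, adequate for flat types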

For primitive types like Int and Char, weak head normal form and normal form coincide, so the default method can be used:

instance NFData Int 
instance NFData Char
...
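
For a user-defined data type the programmer provides an rnf method that forces all the components. For example, for a hypothetical binary tree type the instance might read:

data Tree a = Leaf | Node (Tree a) a (Tree a)

instance NFData a => NFData (Tree a) where
  rnf Leaf         = ()
  rnf (Node l x r) = rnf l `seq` rnf x `seq` rnf r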

Strategy Combinators

When defining a strategy on a compound data type we can use these three basic strategies and compose them with the strategy combinators below. These combinators determine whether the component strategies are applied in parallel or in sequence.

-- Pairs
instance (NFData a, NFData b) => NFData (a,b) where
  rnf (x,y) = rnf x `seq` rnf y

seqPair :: Strategy a -> Strategy b -> Strategy (a,b)
seqPair strata stratb (x,y) = strata x `seq` stratb y 

parPair :: Strategy a -> Strategy b -> Strategy (a,b)
parPair strata stratb (x,y) = strata x `par` stratb y

-- Lists
instance NFData a => NFData [a] where
  rnf [] = ()
  rnf (x:xs) = rnf x `seq` rnf xs

-- Applies a strategy to every element of a list in parallel
parList :: Strategy a -> Strategy [a]
parList strat []     = ()
parList strat (x:xs) = strat x `par` (parList strat xs)

-- Applies a strategy to the first  n elements of a list  in parallel 
parListN :: (Integral b) => b -> Strategy a -> Strategy [a]
parListN n strat []     = ()
parListN 0 strat xs     = ()
parListN n strat (x:xs) = strat x `par` (parListN (n-1) strat xs)

-- Sequentially applies a strategy to each element of a list 
seqList :: Strategy a -> Strategy [a]
seqList strat []     = ()
seqList strat (x:xs) = strat x `seq` (seqList strat xs)

-- Sequentially applies a strategy to the first  n elements of a list 
seqListN :: (Integral a) => a -> Strategy b -> Strategy [b]
seqListN n strat []     = ()
seqListN 0 strat xs     = ()
seqListN n strat (x:xs) = strat x `seq` (seqListN (n-1) strat xs)

Now we can use these predefined strategies to define a parallel map function:

-- parMap applies a function to each element of the argument list in
-- parallel.  The result of the function is evaluated using `strat' 
parMap :: Strategy b -> (a -> b) -> [a] -> [b]
parMap strat f xs 	= map f xs `using` parList strat
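
For example, mapping a naive Fibonacci function over a list of inputs reduces each result to normal form in its own potential thread (fib is a hypothetical helper, used here only to generate some work per element):

results :: [Int]
results = parMap rnf fib [10, 20, 30, 35]

fib :: Int -> Int
fib n | n < 2     = n
      | otherwise = fib (n-1) + fib (n-2)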

For bigger programs the programmer only has to define strategies on the most important data structures in order to introduce parallelism into the program. With this approach the parallel program is not cluttered with par annotations, and it is easier to determine the semantics and the dynamic behaviour of the individual functions.

Experiences with Strategies

[STRATEGIES ARE COOL]

