Supplementary Parallel Haskell

foreword

these are part of my study notes for my project on real-time neural networks in haskell. the notes on this page are largely stolen from http://chimera.labs.oreilly.com/books/1230000000929/ch02.html#sec_par-eval-whnf

how lazy evaluation works

given the following expression

let x = 1 + 2 :: Int

x is bound to the expression, but the expression is not evaluated until its value is needed

the :sprint command prints an expression without forcing its evaluation

so we can write

:sprint x

>  x = _

showing that it is unevaluated

we can extend that to more complex expressions

let x = 1 + 2 :: Int
let y = x + 1
:sprint x 
x = _
:sprint y 
y = _

we can't see the linked thunk structure, but it's there

we can force evaluation using seq; the seq function evaluates its first argument and returns its second, in this case ()

seq y () 
()
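the same forcing behaviour works in a compiled program; a minimal sketch:

```haskell
-- sketch: seq evaluates its first argument to WHNF and returns its second
main :: IO ()
main = do
  let x = 1 + 2 :: Int
      y = x + 1
  print (seq y ())  -- forces y (and hence x), then prints ()
  print y           -- y is already evaluated here; prints 4
```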

data structure thunks

given

let x = 1 + 2 :: Int
let z = (x,x)

:sprint z 
z = (_,_)

the components of z both point to the unevaluated thunk for x

let's perform an operation on z

import Data.Tuple
let z = swap (x,x+1)

:sprint z
z = _
seq z ()
()
:sprint z
z = (_,_)

applying seq to z caused it to be evaluated, but its components are still unevaluated. seq evaluates its argument only as far as the outermost constructor and doesn't go any deeper; this is called weak head normal form (WHNF). normal form means fully evaluated; for that we use deepseq

finally we evaluate x

seq x ()
()
:sprint z
z = (_, 3)

we see that the x expression has been evaluated to 3 (it became the second component, because swap reversed the pair), but the x + 1 expression is still an unevaluated thunk
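that seq stops at the outermost constructor can be demonstrated directly: a pair containing undefined can be forced to WHNF safely, because the components are never touched (a small sketch; deepseq on the same pair would crash):

```haskell
main :: IO ()
main = do
  let z = (undefined :: Int, undefined :: Int)
  -- seq only evaluates as far as the (,) constructor, so the
  -- undefined components are never demanded and no exception is raised
  z `seq` putStrLn "reached WHNF without touching the components"
```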

map :: (a -> b) -> [a] -> [b]
map f []     = []
map f (x:xs) = let x'  = f x
                   xs' = map f xs
               in x' : xs'

so we see that map is evaluated lazily: it builds a cons cell whose head (f x) and tail (map f xs) are themselves unevaluated thunks, to be evaluated on demand

so, given:

let xs = map (+1) [1..10] :: [Int]

:sprint xs 
xs = _

seq xs ()
()
:sprint xs
xs = _ : _

now map has been evaluated to the first (:) constructor, and all we know so far is that the list has at least one element

given the definition of length

length :: [a] -> Int
length [] = 0
length (_:xs) = 1 + length xs

so if we then evaluate

length xs
10

we note that length ignores the head of the list (the pattern (_:xs) never examines the head) and recurses on the tail, so when length is applied to a list it descends the whole spine of the list but never examines any of the elements. we can see that if we

:sprint xs
xs = [_,_,_,_,_,_,_,_,_,_]

ghc noticed that the list cells were all evaluated, so it uses bracketed notation to show that it knows the cells are there, but not their contents
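that length never examines the elements can be checked by giving it a list whose elements would crash if evaluated (a small sketch):

```haskell
main :: IO ()
main = do
  -- every element is a bottom, but length only walks the spine
  let xs = map (\_ -> undefined) [1..10] :: [Int]
  print (length xs)  -- prints 10; no element is ever evaluated
```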

calling sum forces the contents of the cells to be evaluated

sum xs
65
:sprint xs
xs =  [ 2,3,4,5,6,7,8,9,10,11]

parallel

to see the effect of rpar and rseq, suppose we have a function f along with two arguments x and y, and we want to calculate f x and f y in parallel.

let's say that f x takes longer to evaluate than f y

runEval $ do
  a <- rpar (f x)
  b <- rpar (f y)
  return (a, b)

we see that f x and f y execute in parallel, and return returns immediately

runEval $ do
  a <- rpar (f x)
  b <- rseq (f y)
  return (a, b)

here f x and f y are evaluated in parallel, but now the final return doesn't happen until f y is complete, because rseq blocks until its argument is evaluated

we can then wait for both to complete

runEval $ do
  a <- rpar (f x)
  b <- rseq (f y)
  rseq a
  return (a, b)
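putting the rpar/rseq pattern into a complete program (a sketch assuming the parallel package; fib here is just a stand-in for an expensive f):

```haskell
import Control.Parallel.Strategies (runEval, rpar, rseq)

-- a deliberately slow function to give the spark some real work
fib :: Int -> Int
fib n | n < 2     = n
      | otherwise = fib (n - 1) + fib (n - 2)

main :: IO ()
main = do
  let (x, y) = runEval $ do
        a <- rpar (fib 30)  -- sparked: may run on another core
        b <- rseq (fib 29)  -- evaluated before return
        rseq a              -- wait for the spark too
        return (a, b)
  print (x + y)
```

compile with `ghc -O2 -threaded` and run with `+RTS -N2` to actually see the parallelism.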

which to use

  • rpar/rseq when you know which computation will definitely finish first; kind of ugly, since you need knowledge of the relative run times
    • rpar/rpar/rseq/rseq waits for both results, so it works regardless of which finishes first, and it's nice and pretty
  • rpar/rpar
    • if you don't depend on the results of the operations
    • and you need to generate more parallelism
  • rseq/rseq
    • finished generating parallelism
    • or results are needed

compiling

compiling with ghc

ghc -O2 rpar.hs -threaded

in order to use parallelism we must use the -threaded option

ghc -O2 rpar.hs -threaded -eventlog

./rpar +RTS -N2 -l

compiling with -eventlog and then running with -l causes the program to emit an event log file called {programname}.eventlog, which we can pass to threadscope

running

./rpar 2 +RTS -N2

the +RTS -N2 flag tells the runtime to use two cores to run the program (make sure you have at least 2 cores)

interpreting the output

  • the first timestamp is printed when the rpar fragment returns
  • the second timestamp is printed when the last calculation finishes or when the program exits

getting stats

+RTS -s

instructs the runtime to emit statistics

look at total time which is split into

  • cpu time
    • time spent on a cpu core
  • wall clock time
    • time spent in total running the program
    • speedups are usually calculated as a ratio of wall-clock times

using the force!

rpar only evaluates its argument to WHNF

if we

rpar (map solve as)

the evaluation would stop at the first (:) constructor, so the rpar would not cause any work to take place in parallel

Instead we use force

as <- rpar (force (map solve as))

the force function evaluates the entire structure of its argument

it reduces the argument to normal form (read fully evaluated)

it is provided by

import Control.DeepSeq
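a minimal sketch of the difference, assuming the deepseq package: seq would stop at the first (:) constructor, while force evaluates the whole list, elements included:

```haskell
import Control.DeepSeq (force)

main :: IO ()
main = do
  let xs = map (+1) [1..5] :: [Int]
  -- seq xs would only reach the first (:) constructor;
  -- force xs reduces the whole list to normal form
  print (force xs)  -- prints [2,3,4,5,6]
```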

think

for each rpar

“How much of this structure do I want to evaluate?”

  • force is default behaviour for the Par monad

Threadscope

  • x axis is time
  • horizontal bars
    • show how the program executed over time
  • bars
    • top most bar
      • activity profile
        • shows how many cores were executing haskell code
    • other bars
      • one bar per core
      • upper half – (green) where the core is executing haskell
      • lower – (orange) performing garbage collection

look out for uneven distribution of load; it shows inefficient scheduling

  • in practise chunks rarely contain equal work, so there will be some imbalance
  • the parallelism we can achieve is limited by the number of chunks; in our example, even if the workloads were even, we could never achieve more than a 2× speedup

fixed partitioning vs dynamic partitioning

  • ghc doesn't force us to use a fixed number of rpar calls; we can call it as many times as we like
  • a fixed division of work is called static partitioning

ghc already provides the mechanism for dynamic partitioning; we just have to supply it with enough tasks by calling rpar

the argument to rpar is called a spark; the runtime collects sparks in a pool and uses this as a source of work when there are spare processors available

dynamic partitioning

parMap :: (a -> b) -> [a] -> Eval [b]
parMap f []     = return []
parMap f (a:as) = do
  b  <- rpar (f a)
  bs <- parMap f as
  return (b:bs)

this is rather like the monadic version of map except that we have used rpar to lift the application of the function f to the element a into the Eval monad.

parMap runs down the whole list, eagerly creating sparks for the application of f to each element, and finally returns the new list

when parMap returns, it will have created one spark for each element of the list; the evaluation of all the results can now happen in parallel

main :: IO ()
main = do
  [f] <- getArgs
  file <- readFile f
  let puzzles   = lines file
      solutions = runEval (parMap solve puzzles)
  print (length (filter isJust solutions))

Running this version yields more speedup

sparks

  • overflowed

the spark pool was full, so the spark was dropped

  • dud

expression already evaluated

  • GC’d

the expression was unused by the program so the runtime removed the spark

  • fizzled

the expression was unevaluated when it was sparked, but was later evaluated by the program itself, so the spark was wasted

evaluate

the evaluate function is like seq in the IO monad: it evaluates its argument to weak head normal form and then returns it

evaluate :: a -> IO a
evaluate (length puzzles)

forcing the lines to be evaluated early reduces the parallelism slightly, because the file reading is no longer lazy and the program now waits for it to complete
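evaluate comes from Control.Exception in base; a minimal sketch, with a string literal standing in for the file contents:

```haskell
import Control.Exception (evaluate)

main :: IO ()
main = do
  let puzzles = lines "abc\ndef\nghi"  -- stand-in for the real input file
  -- force the length (and hence the whole spine of the list) right now
  n <- evaluate (length puzzles)
  print n  -- prints 3
```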

amdahl's law

a program that has a sequential portion can never achieve perfect speedup

we can calculate the maximum achievable speedup for a given number of cores using amdahl's law

speedup = 1 / ((1 - P) + P/N)

where

  • P is the portion of the runtime that can be parallelised
  • N is the number of processors

amdahl's law tells us that not only does parallel speedup become harder to achieve the more processors we add, but in practise most programs have a theoretical maximum speedup
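amdahl's law as a small worked example (speedup is a hypothetical helper, not from the book):

```haskell
-- maximum speedup with parallelisable portion p on n processors
speedup :: Double -> Double -> Double
speedup p n = 1 / ((1 - p) + p / n)

main :: IO ()
main = do
  print (speedup 0.9 4)   -- 90% parallel on 4 cores: ~3.08, not 4
  print (speedup 0.9 64)  -- even on 64 cores the speedup is under 9
```

as n grows, speedup p n approaches 1 / (1 - p), so a program that is 90% parallelisable can never exceed a 10× speedup.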
