## foreword

these are part of my study notes for my project on real-time neural networks in haskell. the notes on this page are largely taken from http://chimera.labs.oreilly.com/books/1230000000929/ch02.html#sec_par-eval-whnf

## how lazy evaluation works

given the following expression

```
let x = 1 + 2 :: Int
```

x is bound to the expression but it is not evaluated until needed

the :sprint command prints an unevaluated expression without forcing evaluation

so we can write

```
> :sprint x
x = _
```

showing that it is unevaluated

we can extend that to more complex expressions

```
> let x = 1 + 2 :: Int
> let y = x + 1
> :sprint x
x = _
> :sprint y
y = _
```

we can't see the linked thunk structure, but it's there

we can force evaluation using seq; the seq function evaluates its first argument and returns its second, in this case ()

```
> seq y ()
()
```
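the same session can be sketched as a standalone program (base only; the bindings mirror the GHCi example above):

```haskell
-- minimal sketch of forcing a thunk with seq
main :: IO ()
main = do
  let x = 1 + 2 :: Int   -- bound, but not yet evaluated
      y = x + 1          -- a thunk that refers to the thunk for x
  -- seq evaluates its first argument to WHNF, then returns its second:
  print (seq y ())       -- prints (), having forced y (and hence x)
  print y                -- y is now the evaluated value 4
```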

## data structure thunks

given

```
> let x = 1 + 2 :: Int
> let z = (x,x)
```

```
> :sprint z
z = (_,_)
```

the components of z both point to the unevaluated thunk for x

let's perform an operation on z:

```
> import Data.Tuple
> let z = swap (x, x+1)
```

```
> :sprint z
z = _
> seq z ()
()
> :sprint z
z = (_,_)
```

applying seq to z caused it to be evaluated, but its components are still unevaluated. seq evaluates its argument only as far as the first constructor and doesn't evaluate any deeper; this is called weak head normal form (WHNF). normal form means fully evaluated; for that we use deepseq

finally we evaluate x

```
> seq x ()
()
> :sprint z
z = (_,3)
```

we see that the x expression has been evaluated (the second component of z), but the x + 1 expression has not

```
map :: (a -> b) -> [a] -> [b]
map f []     = []
map f (x:xs) = let x'  = f x
                   xs' = map f xs
               in x' : xs'
```

so we see that map builds its result lazily: it returns the first (:) constructor with f x and map f xs as thunks, which are then evaluated on demand

so, given:

```
let xs = map (+1) [1..10] :: [Int]
```

```
> :sprint xs
xs = _
> seq xs ()
()
> :sprint xs
xs = _ : _
```

now map has been evaluated to the first constructor, and all we know so far is that the list has at least one element

given the definition of length

```
length :: [a] -> Int
length []     = 0
length (_:xs) = 1 + length xs
```

if we then evaluate

```
> length xs
10
```

we note that length ignores the head of the list (the pattern (_:xs) never actually examines the head) and recurses on the tail, so when length is applied to a list it descends the structure of the list but never examines any elements. we can see that if we:

```
> :sprint xs
xs = [_,_,_,_,_,_,_,_,_,_]
```

ghc noticed that the list cells were evaluated, so it uses bracketed notation to show that it knows the cells are there, but not their contents
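the claim that length walks the spine without touching the elements can be checked directly with undefined elements (a small sketch, base only):

```haskell
-- length forces the list's spine but never its elements,
-- so a list of undefined elements still has a length
main :: IO ()
main = do
  let xs = [undefined, undefined, undefined] :: [Int]
  print (length xs)   -- prints 3; no element is ever examined
```

evaluating any element (e.g. with sum) would instead throw an exception.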

calling sum forces the contents of the cells to be evaluated

```
> sum xs
65
> :sprint xs
xs = [2,3,4,5,6,7,8,9,10,11]
```

## parallel

to see the effect of rpar and rseq, suppose we have a function f and two arguments x and y, and we want to calculate f x and f y in parallel.

let's say that f x takes longer to evaluate than f y.

```
runEval $ do
  a <- rpar (f x)
  b <- rpar (f y)
  return (a,b)
```

we see that f x and f y begin evaluating in parallel, and return returns immediately

```
runEval $ do
  a <- rpar (f x)
  b <- rseq (f y)
  return (a,b)
```

here f x and f y are evaluated in parallel, but now the final return doesn't happen until f y is complete, because rseq blocks until its argument is evaluated

we can then wait for both to complete

```
runEval $ do
  a <- rpar (f x)
  b <- rseq (f y)
  rseq a
  return (a,b)
```
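this last pattern can be made into a runnable sketch, assuming the parallel package provides Control.Parallel.Strategies, and using a naive fib as a stand-in for the expensive f (fib is my placeholder, not from the original):

```haskell
import Control.Parallel.Strategies (runEval, rpar, rseq)

-- naive fib, a deliberately slow stand-in for f
fib :: Int -> Integer
fib n | n < 2     = fromIntegral n
      | otherwise = fib (n-1) + fib (n-2)

main :: IO ()
main = do
  let (a, b) = runEval $ do
        a' <- rpar (fib 30)  -- spark the longer computation
        b' <- rseq (fib 25)  -- evaluate the shorter one, blocking until done
        _  <- rseq a'        -- then also wait for the sparked result
        return (a', b')
  print (a, b)
```

compile with -threaded and run with +RTS -N2 to see the two computations actually overlap.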

### which to use

- rpar/rseq: only useful when you know which computation will finish first; ugly, because it requires knowledge of the system
- rpar/rpar/rseq/rseq: the same effect, but cleaner and without needing that knowledge
- rpar/rpar
  - if you don't depend on the results of the operations yet
  - and you need to generate more parallelism
- rseq/rseq
  - when you are finished generating parallelism
  - or the results are needed

## compiling

### compiling with ghc

```
ghc -O2 rpar.hs -threaded
```

in order to use parallelism we must use the -threaded option

```
ghc -O2 rpar.hs -threaded -eventlog
./rpar +RTS -N2 -l
```

compiling with -eventlog and then running with +RTS -l causes the program to emit a log file called {programname}.eventlog, which we can pass to **threadscope**

### running

```
./rpar 2 +RTS -N2
```

the +RTS -N2 flag tells the runtime to use two cores to run the program (ensure you have at least 2 cores)

### interpreting the output

- the first timestamp is printed when the rpar fragment returns
- the second timestamp is printed when the last calculation finishes or when the program exits

### getting stats

```
+RTS -s
```

instructs GHC to emit stats

look at the total time, which is split into:

- cpu time
  - time spent on a cpu core
- wall clock time
  - time spent in total running the program
  - speedups are usually calculated as a ratio of wall-clock times

### using the force!

rpar only evaluates to WHNF

if we

```
rpar (map solve as)
```

the evaluation would stop at the first (:) constructor, so the rpar would not cause any work to take place in parallel

Instead we use force

```
as <- rpar (force (map solve as))
```

The force function evaluates the entire structure of its argument: it reduces the argument to normal form (read: fully evaluated)

it is provided by

```
import Control.DeepSeq
```
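a small sketch of WHNF vs normal form (deepseq ships with GHC; evaluate is from base's Control.Exception):

```haskell
import Control.DeepSeq (force)
import Control.Exception (evaluate)

main :: IO ()
main = do
  let xs = map (+1) [1..10] :: [Int]
  -- evaluate alone would only reach WHNF (the first (:) cell);
  -- force reduces the whole list to normal form
  ys <- evaluate (force xs)
  print (sum ys)   -- prints 65
```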

#### think

for each rpar

“How much of this structure do I want to evaluate”

- force is the default behaviour for the **Par** monad

### Threadscope

- x axis is time
- horizontal bars show how the program executed over time
- top-most bar
  - the activity profile
  - shows how many cores were executing haskell code
- other bars
  - one bar per core
  - upper half (green): the core is executing haskell
  - lower half (orange): the core is performing garbage collection

look out for an uneven distribution of load; it shows inefficient scheduling

- in practice chunks rarely contain equal work, so there will be some imbalance
- the parallelism we can achieve is limited by the number of chunks; in our example, even if the workloads were even, we could never achieve more than a 2x speedup

### fixed partitioning vs dynamic partitioning

- ghc doesn't force us to use a fixed number of rpar calls; we can call it as many times as we like
- a fixed division of work is called static partitioning

ghc already provides the mechanism for dynamic partitioning; we just have to supply it with enough tasks by calling rpar

the argument to rpar is called a spark; the runtime collects sparks in a pool and uses this as a source of work when there are spare processors available

### dynamic partitioning

```
parMap :: (a -> b) -> [a] -> Eval [b]
parMap f []     = return []
parMap f (a:as) = do
  b  <- rpar (f a)
  bs <- parMap f as
  return (b:bs)
```

this is rather like the monadic version of map except that we have used rpar to lift the application of the function f to the element a into the Eval monad.

parMap runs down the whole list, eagerly creating sparks for the application of f to each element, and finally returns the new list.

when parMap returns, it will have created one spark for each element of the list, and the evaluation of all the results can happen in parallel
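the definition above can be wrapped into a self-contained runnable module (a sketch assuming the parallel package; the doubling function is just a placeholder for real work):

```haskell
import Control.Parallel.Strategies (Eval, runEval, rpar)

-- as in the text: spark f for each element, return the list of results
parMap :: (a -> b) -> [a] -> Eval [b]
parMap _ []     = return []
parMap f (a:as) = do
  b  <- rpar (f a)
  bs <- parMap f as
  return (b:bs)

main :: IO ()
main = print (runEval (parMap (*2) [1..5 :: Int]))   -- prints [2,4,6,8,10]
```

the result is the same with or without -threaded; the sparks only change *when* the elements get evaluated.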

```
main :: IO ()
main = do
  [f] <- getArgs
  file <- readFile f
  let puzzles   = lines file
      solutions = runEval (parMap solve puzzles)
  print (length (filter isJust solutions))
```

Running this version yields more speedup

### sparks

- overflowed

the spark pool was full, so the spark was dropped

- dud

expression already evaluated

- GC’d

the expression was unused by the program so the runtime removed the spark

- fizzled

the expression was unevaluated at the time it was sparked, but was later evaluated by the program itself

### evaluate

the evaluate function is like seq in the IO monad: it evaluates its argument to weak head normal form and then returns it

```
evaluate :: a -> IO a
```

```
evaluate (length puzzles)
```

forcing the lines to be evaluated early reduces the parallelism slightly, because the file is no longer read lazily and the program now waits for the read to complete

### amdahl's law

a program that has a sequential portion can never achieve perfect speedup

we can calculate the maximum achievable speedup for a given number of cores using amdahl's law

1 / ((1 - P)+P/N)

where

- P is the portion of the runtime that can be parallelised
- N is the number of processors

Amdahl's law tells us that not only does parallel speedup become harder to achieve the more processors we add, but in practice most programs have a theoretical maximum parallelism
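the formula above is easy to play with as a function (base only; the example figures are mine):

```haskell
-- amdahl's law: maximum speedup with parallel fraction p on n cores
amdahl :: Double -> Double -> Double
amdahl p n = 1 / ((1 - p) + p / n)

main :: IO ()
main = do
  print (amdahl 0.9 2)     -- ~1.82x: 90% parallel on 2 cores
  print (amdahl 0.9 1000)  -- ~9.9x: capped near 1/(1-p) = 10, however many cores
```

the second case shows the "theoretical maximum": with a 10% sequential portion, no number of cores gets past a 10x speedup.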