Parallel Functional Programming – Lab A: “Parallel Programming in Haskell” (DAT280 / DIT261, LP4 2017)

(When you are done, please submit your solution using the Fire system)

In this lab assignment, you will get started with parallel programming in Haskell, including the use of Criterion for benchmarking and Threadscope for profiling.

Note: Please make sure you follow the submission guidelines when you write your code.


Assignment 1

This first assignment is about parallelising an “embarrassingly parallel” function.

An embarrassingly parallel function is one that splits into many separate, non-communicating tasks. In Haskell, it is often a map. The file given.hs contains the function jackknife, which maps the mean function over a list of lists (created using resamples). The code is borrowed from this lecture (I think by Bryan O’Sullivan, author of Criterion, the benchmarking tool you will use).

Your job is to parallelise that map in a number of different ways, and to benchmark and report on your results. (If you get bored with parallelising map, you can always move on to resamples. I have not tried that!)
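
To fix ideas on the benchmarking side, here is a minimal, self-contained Criterion sketch; the data and the crude leave-one-out resampling below are placeholders of my own, not the code in given.hs:

  import Criterion.Main (defaultMain, bench, nf)

  mean :: [Double] -> Double
  mean xs = sum xs / fromIntegral (length xs)

  -- Crude leave-one-out resampling, only a stand-in for given.hs:
  jackknife :: ([Double] -> Double) -> [Double] -> [Double]
  jackknife f xs =
    [ f (take i xs ++ drop (i + 1) xs) | i <- [0 .. length xs - 1] ]

  main :: IO ()
  main = defaultMain
    [ bench "jackknife mean" (nf (jackknife mean) [1 .. 1000 :: Double]) ]

Compiling with -O2 -threaded -rtsopts and running the binary gives you an estimated mean execution time (with confidence intervals) per benchmark, which is the kind of figure to report for each parallel variant.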

At a minimum, you should do the following (a rough sketch of all four approaches is given below):

  1. Parallelise jackknife using par and pseq (for example by defining a parallel map function that uses par and pseq).

  2. Parallelise jackknife using rpar and rseq from the Eval monad by defining a parallel map function that uses rpar and rseq. Compare with the built-in parMap.

  3. Parallelise jackknife using Strategies.

  4. Parallelise jackknife using the Par Monad.

Remember that PCPH (Parallel and Concurrent Programming in Haskell) is a great source of information and examples. Note that there is sample code on Hackage; play with it!
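
As one possible starting point, here is a rough sketch of the four approaches as parallel map functions; the names pMap, eMap, sMap and pmMap are my own, and how you plug them into jackknife (and at what granularity) is up to you:

  import Control.DeepSeq             (NFData)
  import Control.Monad.Par           (runPar, spawnP, get)
  import Control.Parallel            (par, pseq)
  import Control.Parallel.Strategies (runEval, rpar, rseq, using, parList, rdeepseq)

  -- 1. par/pseq: spark the head, evaluate the rest, then cons.
  pMap :: (a -> b) -> [a] -> [b]
  pMap _ []     = []
  pMap f (x:xs) = y `par` (ys `pseq` (y : ys))
    where
      y  = f x
      ys = pMap f xs

  -- 2. The Eval monad with rpar/rseq (compare against the built-in parMap).
  eMap :: (a -> b) -> [a] -> [b]
  eMap _ []     = []
  eMap f (x:xs) = runEval $ do
    y  <- rpar (f x)
    ys <- rseq (eMap f xs)
    return (y : ys)

  -- 3. Strategies: keep the algorithm sequential, add the strategy afterwards.
  sMap :: NFData b => (a -> b) -> [a] -> [b]
  sMap f xs = map f xs `using` parList rdeepseq

  -- 4. The Par monad: spawn one fully evaluated task per element.
  pmMap :: NFData b => (a -> b) -> [a] -> [b]
  pmMap f xs = runPar (mapM (spawnP . f) xs >>= mapM get)

A parallel jackknife could then be, for example, pMap f . resamples 500 (with resamples from given.hs). Measure each variant with Criterion and look at the spark statistics before deciding which granularity is sensible.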


Assignment 2

Write a simple merge sort on lists and parallelise it in at least two different ways. Comment on the resulting performance. Include a method that gives you control over task granularity (for example a depth parameter). Benchmark your implementations systematically. Report on your results.
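
As one possible shape for such a solution (a sketch under my own naming, not the required design), here is a list merge sort where a depth parameter controls how far down the recursion new sparks are created:

  import Control.DeepSeq             (NFData, force)
  import Control.Parallel.Strategies (runEval, rpar, rseq)

  merge :: Ord a => [a] -> [a] -> [a]
  merge [] ys = ys
  merge xs [] = xs
  merge (x:xs) (y:ys)
    | x <= y    = x : merge xs (y:ys)
    | otherwise = y : merge (x:xs) ys

  mergeSort :: Ord a => [a] -> [a]
  mergeSort []  = []
  mergeSort [x] = [x]
  mergeSort xs  = merge (mergeSort as) (mergeSort bs)
    where (as, bs) = splitAt (length xs `div` 2) xs

  -- Spark both halves until the depth budget runs out, then go sequential.
  pMergeSort :: (Ord a, NFData a) => Int -> [a] -> [a]
  pMergeSort d xs
    | d <= 0 || length xs < 2 = mergeSort xs
    | otherwise = runEval $ do
        left  <- rpar (force (pMergeSort (d - 1) as))
        right <- rseq (force (pMergeSort (d - 1) bs))
        return (merge left right)
    where (as, bs) = splitAt (length xs `div` 2) xs

The force calls make each spark do the whole sub-sort rather than stopping at weak head normal form; benchmark with and without them, and with different depths, and report what you see.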

If you wish, you may also submit other attempts to speed up sorting in Haskell in order to get feedback from the TAs. (Don’t expect great speedups for sorting on lists; moderate speedups can be obtained with relatively little effort, however.)

The main thing is to gain experience with the various libraries and with figuring out why you get the speedups (or non-speedups) that you observe.


Getting more practice

If you have run out of assignments, consider tackling LabA from last year (which contains some trickier tasks). You may submit solutions to these too. This is not obligatory.

Various kinds of search would also be interesting to parallelise.

Advice on GHC flags, garbage collection, task size, etc. (Nick Frolov, former TA)

If you’re not getting speedups, or if the performance of a parallel program is worse than that of a sequential one, make sure that garbage collection is not the bottleneck in your program. If a significant part of your Threadscope plots is GC (orange colour), try running your program as if you had unlimited memory, so that no GC happens. Set a ridiculously large nursery size with the -A flag (100 MB will do); this effectively turns the collector off, since the nursery never fills up and therefore no GC is needed.

If your program actually requires more memory for intermediate values over its runtime, you might want to consider recycling memory. A small nursery causes more frequent GC but also more eager promotion. The former brings an overhead when allocated data is not short-lived, and the latter in the opposite case (if short-lived data is promoted to an older generation, it still has to be collected there). A good nursery size should usually not be larger than the L2 cache, so as to exploit locality, but the best value can only be found experimentally. The -H flag (setting the initial heap size) should not be neglected either.

While the heap will expand if its initial size is not enough, doing so is extra work for the collector. Set it generously (perhaps 1 GB); there is no harm in this other than exhausting your RAM. Ideally, you shouldn’t do that, since swapping is much more expensive than GC, just as serving cache misses from main memory is if the nursery exceeds the cache size.
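
For instance (the file and program names are just placeholders), a compile-and-run line that applies this advice might look like:

  ghc -O2 -threaded -rtsopts -eventlog given.hs
  ./given +RTS -N4 -A100m -H1G -s -ls

Here -N4 uses four capabilities, -A100m sets the nursery size, -H1G suggests a large initial heap, -s prints GC and spark statistics, and -ls writes the eventlog that Threadscope reads.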

Don’t forget to experiment with depth (or unit-of-work size, if you prefer that). If you’re getting many fizzled sparks (check with the -s flag or in Threadscope), it is a clear sign that your units of work are too small and not worth sparking. Remember that running even a Haskell green thread costs tens of instructions; that does not justify sparking an addition of 32-bit integers or a similarly cheap operation. Another cause of fizzled sparks is uneven division of work into chunks, which can also lead to sparked results being needed before the corresponding sparks get a chance to be converted. You can see spark size statistics in Threadscope (the “Spark sizes” tab).
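
One common way to get bigger units of work with Strategies is chunking, for example with parListChunk (the chunk size below is an arbitrary guess you would have to tune):

  import Control.DeepSeq             (NFData)
  import Control.Parallel.Strategies (using, parListChunk, rdeepseq)

  -- Each spark now evaluates a whole chunk of 100 results, not one element.
  chunkedMap :: NFData b => (a -> b) -> [a] -> [b]
  chunkedMap f xs = map f xs `using` parListChunk 100 rdeepseq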

If you’re wondering why your sequential scan with an “expensive” operator runs much faster than the parallelised version, try switching off optimisation. Some “expensive” operators are not that expensive once GHC takes a good look at them, especially if it sees a chance for aggressive inlining (and there are many such chances when no sparks are being created).

Submission

This lab has two deadlines:

Your submission needs to include the following information:

Before you submit your code, Clean It Up! Remember, submitting clean code is Really Important, and simply the polite thing to do. After you feel you are done, spend some time cleaning your code: make it simpler, remove unnecessary things, etc. We will reject your solution if it is not clean. Clean code:

When you are done, please submit using the Fire system.

Good luck!