Assignments

General Information

There are two lab assignments, each with two parts, so we end up with labs A to D. You should work in pairs. If you have a good reason for doing the assignments by yourself, please contact Nick, Mary or John. You need to pass all lab assignments in order to pass the course.

The assignments should be handed in using the Fire system.

The final deadline for all labs is May 28 at midnight. That means that you should aim to have all labs finished (and preferably also approved!) by then. Each lab should, in addition, be submitted at least once before the initial deadline for that lab.

Rules - IMPORTANT

Please read these early and carefully!
  1. Deadlines are hard. If, for some reason, you cannot make the deadline, contact us before the deadline, tell us what your reason is, and include a realistic proposal of a new personal deadline for you. You may then get an extension of the deadline.
  2. Your last attempt has to be submitted before the final deadline. If you fail to do this, your submission will be rejected.
  3. Cheating is taken very seriously. Before you start working on the assignments, please read the note on cheating.

Submission

Clean Code

Before you submit your code, Clean It Up! Submitting clean code is Really Important, and simply the polite thing to do. After you feel you are done, spend some time cleaning your code: make it simpler, remove unnecessary things, etc. We will reject your solution if it is not clean.

What to include

Your submission needs to include the following information:

Before you submit, please read the note on cheating.

You are supposed to submit your solution using the Fire system.

Note: You do NOT have to use zip/gzip/tar/bzip on your files! Just upload each file that is part of your solution. Don't forget to submit when you have uploaded all the files.

Benchmarking Requirements (Haskell part)

Include Threadscope plots in the report and draw conclusions from them

Draw conclusions from runtime statistics (+RTS -s)

Play with different granularities (for example by introducing a threshold or depth parameter).
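To make the benchmarking workflow concrete, here is a minimal sketch of a program you can compile and run with the flags above. The program itself is illustrative (two independent sums, sparked with `par` from GHC.Conc, which `Control.Parallel` re-exports); the compile and run commands in the comments are the part to take away.

```haskell
-- Compile:  ghc -O2 -threaded -rtsopts -eventlog Bench.hs
-- Run:      ./Bench +RTS -N4 -s -ls
--   -s  prints runtime and GC statistics to stderr
--   -ls writes Bench.eventlog, which you can inspect with:
--       threadscope Bench.eventlog
-- (Add e.g. -A100m here if you want to experiment with nursery size.)
import GHC.Conc (par, pseq)

-- Two independent sums: spark one, evaluate the other, then combine.
parSums :: Integer
parSums = a `par` (b `pseq` (a + b))
  where
    a = sum [1 .. 2000000]
    b = sum [2000001 .. 4000000]

main :: IO ()
main = print parSums
```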

Tools

The stuDAT (Linux) computers have a recent Haskell Platform installed. We will also be using Amazon's Cloud (EC2) to provide you with access to parallel machines. More on that shortly.

The GHC user guide contains a chapter about Using GHC (command-line arguments and the like).

Threadscope is also available on the (Linux) stuDAT computers. (Type threadscope at the prompt.)

To get the Haskell Platform for your own laptop, go to the Haskell Platform page on haskell.org (which is a great source of Haskell info).

The Threadscope page includes information about how to use the tool, as well as how to install it.

Installing on Linux laptops seems straightforward. There are binary releases for Windows and Mac, thank goodness.

You should plan to get used to using Threadscope during the first week of the course.

Advice on GHC flags, garbage collection, task size etc. from Nick Frolov (TA)

If you're not getting speedups, or if the performance of a parallel program is worse than that of a sequential one, make sure that garbage collection is not the bottleneck in your program. If a significant part of your Threadscope plots is GC (orange), try running your program as if you had unlimited memory, so that no GC happens: set a ridiculously large nursery size with the -A flag (100 MB will do). This effectively turns the collector off, since the nursery never fills up and so no GC is ever needed.

If your program actually requires more memory for intermediate values over its runtime, you might want to consider recycling memory. A small nursery causes more frequent GC but also more eager promotion. The former brings an overhead when allocated data does not tend to be short-lived, and the latter in the opposite case (if short-lived data ends up being promoted to an older generation, it still has to be collected). A good nursery size should usually not be larger than the L2 cache, to exploit locality, but a better one can only be found experimentally. The -H flag (setting the initial heap size) should not be neglected either.

While the heap will expand if its initial size was not enough, doing so is extra work for the collector. Set it generously (perhaps 1 GB); there is no harm in doing so other than the risk of exhausting your RAM. Ideally you shouldn't do that, as swapping is much more expensive than GC, just as serving cache misses from main memory is expensive when the nursery exceeds the cache size.

* * *

Don't forget to experiment with depth (or unit of work size, if you prefer that). If you're getting many fizzled sparks (check with the -s flag or in Threadscope), it is a clear sign that your units of work are too small and not worth the overhead of sparking. Remember that running even a Haskell green thread costs tens of instructions, which is not justified for an addition of 32-bit integers or a similarly cheap operation. Another cause of fizzled sparks is uneven division of work into chunks, which can also lead to sparked results being needed before the corresponding sparks get a chance to be converted. You can see spark size statistics in Threadscope (the "Spark sizes" tab).
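The usual way to control granularity is a depth parameter that stops sparking once the work units get too small. A minimal sketch (the function name `psum` and the particular cut-off are made up for illustration; tune the depth experimentally):

```haskell
import GHC.Conc (par, pseq)

-- Divide-and-conquer sum: spark the left half while evaluating the
-- right, but only while depth > 0; below the threshold, fall back to
-- the sequential sum so each spark carries a worthwhile amount of work.
psum :: Int -> [Integer] -> Integer
psum _ []  = 0
psum _ [x] = x
psum depth xs
  | depth <= 0 = sum xs
  | otherwise  = l `par` (r `pseq` (l + r))
  where
    (ls, rs) = splitAt (length xs `div` 2) xs
    l = psum (depth - 1) ls
    r = psum (depth - 1) rs

main :: IO ()
main = print (psum 4 [1 .. 100000])
```

With depth 4 this creates at most 2^4 - 1 sparks regardless of the input size, which is the property you want: the number of sparks should depend on the machine, not on the problem.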

* * *

If you're wondering why your sequential scan with an "expensive" operator runs much faster than the parallelised version, try switching off optimisation. Some "expensive" operators are not that expensive once GHC takes a good look at them, especially if it sees a chance for aggressive inlining (and there are many such chances when no sparks are being created).

Exercise 1 - Getting Started

Study the use of Threadscope. Write a simple merge sort program in Haskell and parallelise it.
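One possible shape for this exercise, as a starting point rather than a model answer: a plain merge sort, parallelised with `par`/`pseq` and a depth cut-off (the names `msort`/`pmsort` are of course just illustrative).

```haskell
import GHC.Conc (par, pseq)

merge :: Ord a => [a] -> [a] -> [a]
merge [] ys = ys
merge xs [] = xs
merge (x:xs) (y:ys)
  | x <= y    = x : merge xs (y:ys)
  | otherwise = y : merge (x:xs) ys

-- Sequential merge sort, used below the depth threshold.
msort :: Ord a => [a] -> [a]
msort []  = []
msort [x] = [x]
msort xs  = merge (msort ls) (msort rs)
  where (ls, rs) = splitAt (length xs `div` 2) xs

-- Parallel version: spark the left recursive call while the current
-- thread evaluates the right one.  Note that par only evaluates to
-- weak head normal form; forcing the whole sorted list (e.g. with
-- force from the deepseq package) is part of the exercise.
pmsort :: Ord a => Int -> [a] -> [a]
pmsort d xs
  | d <= 0 || length xs < 2 = msort xs
  | otherwise = l `par` (r `pseq` merge l r)
  where
    (ls, rs) = splitAt (length xs `div` 2) xs
    l = pmsort (d - 1) ls
    r = pmsort (d - 1) rs

main :: IO ()
main = print (pmsort 3 [5, 1, 4, 2, 3 :: Int])
```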

Details.

Lab A

Parallelising scan and a Fast Fourier Transform algorithm in Haskell. The lab description also contains pointers to interesting papers and slides.
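To give a feel for the scan part before you read the lab description: one classic way to parallelise a scan is to split the input into chunks, scan each chunk independently (in parallel), and then add each chunk's running offset. A rough list-based sketch, assuming `scanl1 (+)` as the specification; the names `chunksOf` and `pscan1` are made up here, and the lab may well ask for a different formulation:

```haskell
import GHC.Conc (par)

chunksOf :: Int -> [a] -> [[a]]
chunksOf _ [] = []
chunksOf n xs = let (c, rest) = splitAt n xs in c : chunksOf n rest

-- Phase 1: scan every chunk independently (each one is sparked,
-- forced via its last element).  Phase 2: the last element of each
-- scanned chunk is that chunk's sum, so a sequential scan over those
-- gives the offset to add to each chunk.
pscan1 :: Int -> [Integer] -> [Integer]
pscan1 n xs = foldr (par . last) result chunks
  where
    chunks  = map (scanl1 (+)) (chunksOf n xs)
    offsets = scanl (+) 0 (map last chunks)
    result  = concat (zipWith (\o -> map (+ o)) offsets chunks)

main :: IO ()
main = print (pscan1 2 [1, 2, 3, 4])
```

The sequential offset pass in phase 2 is what makes scan harder to parallelise well than, say, a sum; that tension is what the lab (and the papers it points to) is about.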

Details

Lab B

Parallelising a Sudoku solver in Erlang. The deadline is Friday April 25th at midnight.

Details, problems.txt, sudoku.erl

Lab C

NESL-style programming and tutorial writing.

Details

Lab D

Distributed Erlang and Map-Reduce.

Details

Exercise 2 : Using Accelerate (not obligatory, but advised)

It would be a good idea to get some experience of using Accelerate for GPU-acceleration. Chapter 6 of Simon Marlow's book is a good introduction to using Accelerate. My approach was to read that and to play with the array manipulation functions in the API by messing about with BMP pictures, which are easy to import, transform and write out to file. It's quite fun to program up your own pictures too. (Here is one of my productions (z1.bmp) so that you have one to start with). You might get some inspiration from last year's tutorial winner (which is about Repa, mind you). I also found this Haskell library for image manipulation quite inspiring. It is interesting to experience the difference between Repa programming (where you are in control of the array representation) and Accelerate programming (where you are not). Note that you can get all the examples from Marlow's book in the package parconc-examples on Hackage.

Study the Accelerate examples at https://github.com/AccelerateHS/accelerate-examples, including how they are benchmarked. For a non-graphics exercise, see page 14 and onwards in this set of exercises by Simon Marlow. (You will also be able to find the solution online if you look hard enough.)

I followed Simon Marlow's advice to install accelerate-cuda with -fdebug, so that -ddump-cc can be used when running examples, to see what kernels are generated. This didn't work for me at the same time as using Criterion, however. (I am running on MacOS and the installation worked without any problems. Unix should also be easy. Windows is harder, though it is rumoured that Geoff Mainland has made this work.) If you would like to use Amazon's EC2 GPU instances (in Linux), send your SSH keys to Nick (frolov at chalmers.se) and he will set you up. These instances no longer commit to providing a Tesla, but they should still be an easy way to get access to a decent GPU.

For the guide on how to create SSH keys, see the GitHub tips section on generating SSH keys.