This lecture sets the scene for the course. It introduces what we mean by parallelism and distinguishes it from concurrency. It explains why functional languages are well suited to parallel programming, even though in the past they failed to deliver on the promise of solving parallel programming. So why are they interesting now?
You should submit your code using the Fire system. This is for fun, so you may form any group you like. We suggest that a single person submits the code (with comments about who has worked on it).
Last year's competition was to make a parallel sort implementation in Haskell. The submitted entries and John's slides giving the results are available in this zipped file.
On this page of Simon Marlow's papers, you can find notes from his course on Parallel and Concurrent Programming in Haskell. The notes give a good explanation of why the topics of this course are interesting. They also make the same distinction between concurrency and parallelism as that made in this course. (We consider only the parallelism part.) Later in the course, we will discuss Simon's work on the Par Monad.
Haskell on a Shared-Memory Multiprocessor, Harris, Marlow and Peyton Jones, Haskell'05
Feedback Directed Implicit Parallelism, Harris and Singh, ICFP'07
Runtime Support for Multicore Haskell, Marlow, Peyton Jones and Singh, ICFP'09
This lecture considers par and pseq more critically, and concludes that it might be a good idea to separate the control of behaviour relating to parallelisation from the description of the algorithm itself. The idea of Strategies is described in a well-known paper called Algorithms + Strategies = Parallelism by Trinder, Hammond, Loidl and Peyton Jones. More recently, Marlow and some of the original authors have updated the idea, in Seq no more: Better Strategies for Parallel Haskell. We expect you to read both of these papers. The lecture is based on the newer paper. (Note that in week 2 of the course Kevin Hammond will come and give a lecture.)
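As a rough illustration of the separation argued for above, the sketch below writes a computation in two styles: once with par and pseq woven into the algorithm, and once as an ordinary map with a Strategy attached by `using`. It assumes the parallel package (Control.Parallel and Control.Parallel.Strategies). The function nfib, the threshold of 20 and the inputs are made up for illustration and do not come from the lecture or the papers.

```haskell
import Control.Parallel (par, pseq)
import Control.Parallel.Strategies (parList, rdeepseq, using)

-- A deliberately expensive function (illustrative workload only).
nfib :: Int -> Integer
nfib n
  | n < 2 = 1
  | otherwise = nfib (n - 1) + nfib (n - 2) + 1

-- Style 1: control of parallelism is mixed into the algorithm.
-- Below the (arbitrary) threshold we fall back to sequential code.
parNfib :: Int -> Integer
parNfib n
  | n < 20 = nfib n
  | otherwise = x `par` (y `pseq` (x + y + 1))
  where
    x = parNfib (n - 1)
    y = parNfib (n - 2)

-- Style 2: the algorithm is a plain map; the Strategy describes how the
-- result is evaluated (each list element in parallel, to normal form).
nfibs :: [Int] -> [Integer]
nfibs ns = map nfib ns `using` parList rdeepseq

main :: IO ()
main = do
  print (parNfib 25)
  print (sum (nfibs [20 .. 27]))
```

To see any speedup, compile with -threaded and run with +RTS -N. In the second style, changing the parallel behaviour means changing only the Strategy, not the map.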
See above for papers
The documentation of the Strategies Library is very helpful.
One thing we hope to do at the end of the course is to compile a list of suggestions for useful improvements to Threadscope. So keep a note of your ideas along these lines.
You would be well advised to study the code and to try some of the exercises.
This lecture is about a new programming model for deterministic parallelism, introduced by Simon Marlow: the Par Monad. It shows how I-structures are used to exchange information between parallel tasks (or "blobs"); see Marlow's Haskell'11 paper with Ryan Newton and Simon PJ. Take a look at the I-Structures paper referred to in the lecture.
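A minimal sketch of this style of programming, assuming the monad-par package (module Control.Monad.Par). The two workloads, sumSquares and sumCubes, are hypothetical and chosen only to give the tasks something to do. The first version creates the IVars (the I-structures) explicitly; the second uses spawnP, which forks a pure computation and returns the IVar holding its result.

```haskell
import Control.Monad.Par (Par, fork, get, new, put, runPar, spawnP)

-- Two independent (illustrative) computations.
sumSquares :: Int -> Int
sumSquares n = sum [k * k | k <- [1 .. n]]

sumCubes :: Int -> Int
sumCubes n = sum [k * k * k | k <- [1 .. n]]

-- Explicit IVars: each forked task writes its IVar exactly once,
-- and get blocks until the value has been written.
withIVars :: Int -> Par Int
withIVars n = do
  a <- new
  b <- new
  fork (put a (sumSquares n))
  fork (put b (sumCubes n))
  x <- get a
  y <- get b
  return (x + y)

-- The same dataflow graph, written with spawnP.
withSpawn :: Int -> Par Int
withSpawn n = do
  a <- spawnP (sumSquares n)
  b <- spawnP (sumCubes n)
  x <- get a
  y <- get b
  return (x + y)

main :: IO ()
main = do
  print (runPar (withIVars 1000))
  print (runPar (withSpawn 1000))
```

Because runPar is a pure function, both versions give the same answer on every run, whatever the scheduling.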
These lectures will introduce high-level structured parallel programming in Parallel Haskell using high-level patterns of parallelism. The parallel patterns that are introduced can be implemented using standard par/pseq constructs in Parallel Haskell building on algorithmic skeletons and evaluation strategy approaches. The lectures will introduce a range of data parallel and task parallel patterns, including bulk synchronous parallelism, map-reduce and parallel folds, and show how these can be implemented in Parallel Haskell.
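One of the patterns mentioned above, a parallel fold, might be sketched with par and pseq roughly as follows. This is not the code from the lectures. The names parFold and parMapReduce and the threshold parameter are assumptions made for illustration, and the sketch assumes the parallel package.

```haskell
import Control.Parallel (par, pseq)

-- Parallel fold of an associative operator over a non-empty list.
-- The list is split in half and the halves are reduced in parallel,
-- until a chunk is no longer than the (arbitrary) threshold.
parFold :: Int -> (a -> a -> a) -> [a] -> a
parFold threshold op xs
  | n <= max 1 threshold = foldr1 op xs
  | otherwise = left `par` (right `pseq` (left `op` right))
  where
    n = length xs
    (ls, rs) = splitAt (n `div` 2) xs
    left = parFold threshold op ls
    right = parFold threshold op rs

-- A simple map-reduce pattern: map a function over the inputs, then
-- combine the results with the parallel fold. The map is lazy, so each
-- application of f is evaluated inside the chunk that needs it.
parMapReduce :: Int -> (a -> b) -> (b -> b -> b) -> [a] -> b
parMapReduce threshold f op = parFold threshold op . map f

main :: IO ()
main = print (parMapReduce 1000 (\x -> x * x) (+) [1 .. 100000 :: Int])
```

Note that par only evaluates its first argument to weak head normal form. That is enough for numbers, but a fold that builds a lazy structure would need a deeper evaluation strategy.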
The paper about solving problems with parallel quicksort (and about execution replay) is available here. A slightly revised version will appear in the proceedings of TFP 2012 (copyright Springer).
This lecture covers imperative and functional approaches to GPU programming. There are many references at the end, and the assignments page also points to references (for Lab B). The lecture will cover Obsidian, which is the topic of Joel's PhD thesis work. Other approaches to GPU programming in Haskell include Accelerate and Nikola, and the following papers are very interesting:
Manuel M. T. Chakravarty, Gabriele Keller, Sean Lee, Trevor L. McDonell, and Vinod Grover.
Accelerating Haskell array codes with multicore GPUs.
In Proceedings of the Sixth Workshop on Declarative Aspects of Multicore Programming, DAMP ’11, ACM, 2011.
[pdf]
Trevor L. McDonell, Manuel M. T. Chakravarty, Gabriele Keller, and Ben Lippmeier.
Optimising Purely Functional GPU Programs.
Submitted to ICFP 2013.
[pdf]
Geoffrey Mainland and Greg Morrisett. Nikola: Embedding Compiled GPU Functions in Haskell.
Proceedings of the 2010 ACM SIGPLAN Symposium on Haskell (Haskell '10), 2010.
[pdf]
This lecture is about how Obsidian is implemented as a compiled embedded language in Haskell.
Joel would like to thank the students for the constructive feedback that they provided after his second lecture.
Jost discusses skeletons as a means to structure parallel computations -- viewing skeletons as higher order functions. He distinguishes three types of skeletons: small scale skeletons (like parMap), process communication topology skeletons, and proper algorithmic skeletons (like divide and conquer). He introduces the Eden dialect as a way to both implement and use skeletons in Haskell.
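To make the "skeletons as higher order functions" view concrete, here is a small divide-and-conquer skeleton in plain Haskell. It is not Eden code and does not use the Eden libraries. The name divConq and its argument order are assumptions made for this sketch. The first argument decides how the sub-problems are processed: passing map gives a sequential skeleton, and a parallel map (for instance parMap rseq from the parallel package) could be passed instead.

```haskell
-- A divide-and-conquer skeleton as a higher-order function.
divConq
  :: ((a -> b) -> [a] -> [b]) -- how to process the sub-problems
  -> (a -> Bool)              -- is the problem trivial?
  -> (a -> b)                 -- solve a trivial problem directly
  -> (a -> [a])               -- split a problem into sub-problems
  -> ([b] -> b)               -- combine the sub-results
  -> a
  -> b
divConq mapper trivial solve split combine = go
  where
    go x
      | trivial x = solve x
      | otherwise = combine (mapper go (split x))

-- Merge sort as an instance of the skeleton (sequential, via map).
mergeSort :: Ord a => [a] -> [a]
mergeSort = divConq map trivial id split combine
  where
    trivial xs = length xs <= 1
    split xs = let (l, r) = splitAt (length xs `div` 2) xs in [l, r]
    combine [l, r] = merge l r
    combine rs = concat rs -- not reached: split always yields two parts

-- Merge two sorted lists into one sorted list.
merge :: Ord a => [a] -> [a] -> [a]
merge [] ys = ys
merge xs [] = xs
merge (x : xs) (y : ys)
  | x <= y = x : merge xs (y : ys)
  | otherwise = y : merge (x : xs) ys

main :: IO ()
main = print (mergeSort [5, 3, 8, 1, 9, 2 :: Int])
```

The algorithm-specific parts (trivial, split, combine) never mention parallelism; only the mapper argument does.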
Recommended reading (Tutorial)
Eden Libraries (Modules and Skeleton) on Hackage:
http://hackage.haskell.org/package/edenmodules/
http://hackage.haskell.org/package/edenskel/
#>cabal install edenskel
This builds a *simulation* of Eden using Concurrent Haskell; you can use it with -threaded and get (some) speedup.
For a real distributed-heap execution use the modified GHC available at
http://www.mathematik.uni-marburg.de/~eden/?content=down_eden&navi=down
(or http://james.mathematik.uni-marburg.de:8080/ for the latest version)
Three different variants exist:
-parpvm (Linux): cluster execution using PVM
-parmpi (Linux): cluster execution using MPI
-parcp (Linux/Windows): multicore execution using shared memory regions
This lecture is all about Guy Blelloch's seminal work on the NESL programming language and on parallel functional algorithms and associated cost models. The best introduction is to watch the video of his marvellous invited talk at ICFP 2010, which John and Mary had the pleasure of attending. There are pages about NESL and about his publications in general. For the notions of work and depth, see this part of the 1996 CACM paper, and also this page, which considers work and depth for three algorithms.
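As a small worked example of work and depth (not one of the three algorithms on the page referred to above), consider summing n numbers with a balanced binary tree of additions. Work is the total number of additions; depth is the length of the longest chain of additions that depend on one another. The function sumCost below is a hypothetical helper that computes both by following the shape of the reduction tree.

```haskell
-- (work, depth) of summing n elements with a balanced reduction tree.
-- Each internal node of the tree is one addition: it adds 1 to the
-- work, and 1 to the depth of its deeper subtree.
sumCost :: Int -> (Int, Int)
sumCost n
  | n <= 1 = (0, 0)
  | otherwise = (workL + workR + 1, max depthL depthR + 1)
  where
    half = n `div` 2
    (workL, depthL) = sumCost half
    (workR, depthR) = sumCost (n - half)

main :: IO ()
main = mapM_ (print . sumCost) [1, 2, 8, 1000]
```

The work is always n - 1, so it grows linearly, while the depth grows logarithmically in n. The ratio of work to depth indicates how much parallelism is available to a scheduler.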
This lecture gave a brief intro to Data Parallel Haskell by showing the Barnes-Hut algorithm in DPH. The slides contain a link to the original Barnes-Hut paper from Nature. It is great!
Next, the lecture covered data parallel programming using the Repa library, which gives flat data parallelism. A main source is the Repa paper from ICFP 2010. And then there are two more Repa papers, one from Haskell'11 and one (on Repa 3) from Haskell'12.
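A minimal sketch of what flat data parallelism looks like in Repa, assuming the Repa 3 interface of the repa package (module Data.Array.Repa). The helper squaresP and the input data are made up for illustration. R.map builds a delayed array, and computeP evaluates it in parallel into an unboxed manifest array.

```haskell
import Data.Array.Repa as R

-- Square every element in parallel. R.map only describes the
-- computation; computeP performs it and fills an unboxed array.
squaresP :: Monad m => Array U DIM1 Double -> m (Array U DIM1 Double)
squaresP xs = computeP (R.map (\x -> x * x) xs)

main :: IO ()
main = do
  let xs = fromListUnboxed (Z :. (5 :: Int)) [1, 2, 3, 4, 5] :: Array U DIM1 Double
  ys <- squaresP xs
  total <- sumAllP ys -- parallel sum of all elements
  print (toList ys)
  print total
```

As with the other parallel examples, the program needs to be compiled with -threaded and run with +RTS -N to use more than one core.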
This lecture introduced Erlang for Haskell programmers, taking parallelising quicksort as an example, both within one Erlang VM and distributed across a network. The latest version of the Erlang system can be downloaded from here. There is a Windows installer. Many Linux versions have an Erlang package available, but not necessarily a package suitable for development of Erlang code, and not necessarily the latest version. On Ubuntu, try
sudo apt-get install erlang-dev
If that doesn't work or you can't find an appropriate package, build the VM from source.
In 2012, his lecture focussed on the fault tolerance constructs in Erlang (links and system processes) and the motivation for the "Let It Crash" philosophy. It introduced supervision trees and the Open Telecom Platform, and developed a simple generic server.
Lennart told us about how functional programming is used in the investment banking part of Standard Chartered. He explained that many of the pricing and risk analysis problems that demand heavy computation are embarrassingly parallel, so that a form of pmap is just about the only construct used to express parallelism. A strategy parameter determines whether the resulting computation is run on multiple threads or processes on a local machine, or is sent off to a grid. The grid computations must be pure, and Lennart stressed the usefulness of the type system of either Mu (a strict version of Haskell) or Haskell in enforcing this. He emphasised that putting the Quant library, Context (based on the financial contracts paper by Peyton Jones, Eber and Seward), into practical use at many sites around the world involved a lot of hard engineering work: both data and functions must be serialised, and different sites may be running different versions of the library, on different architectures. Along the way, Lennart mentioned that it is well known that some programmers are ten times more productive than others, and pointed out that such programmers can, in fact, get paid ten times as much if they choose the right employer :)
Scalable parallel programming in Erlang demands dividing tasks into sufficiently many processes that can run in parallel, and avoiding heavy sequential parts, such as the last step of a divide-and-conquer algorithm, which is often the most expensive and runs in parallel with nothing. But even then, congestion for shared resources can spoil performance. The lecture discussed ways of reducing congestion, both at the Erlang source level and in the virtual machine, for example by replacing one resource shared by n processes with n^2 resources shared by just two. "Invisible" shared resources, such as the scheduler queue(s) and the L3 cache, can hit performance badly, so even Erlang programmers need to be aware of architectural limitations such as cache sizes.
We present a divide-and-conquer algorithm for parsing that enables both parallel and incremental parsing of context-free languages in polylogarithmic time, under certain conditions that seem to hold in practice. These conditions occur for example when parsing program texts written by humans. Our algorithm is a refinement of Valiant's (1975), who reduced the problem of parsing to that of doing matrix multiplications, yielding sub-cubic complexity for the context-free recognition problem. We are able to obtain a much improved complexity result in practice, because the multiplications performed by Valiant's algorithm involve an overwhelming majority of empty matrices. Under our assumptions, our implementation of Valiant's algorithm takes O(n log^3 n) time when run sequentially, and O(log^4 n) time when run using O(n) processors in parallel, or when making an incremental update.
Google's Map-Reduce framework has become a popular approach for processing very large datasets in distributed clusters. Although originally implemented in C++, its connections with functional programming are close: the original inspiration came from the map and reduce functions in LISP; MapReduce is a higher-order function for distributed computing; purely functional behaviour of mappers and reducers is exploited for fault tolerance; it is ideal for implementation in Erlang. This lecture explains what Map-Reduce is, shows a sequential and a simple parallel implementation in Erlang, and discusses the refinements needed to make it work in reality.
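The lecture's implementations are in Erlang. Purely as an illustration of Map-Reduce as a higher-order function, here is a sequential sketch in Haskell using Data.Map from the containers package. The names mapReduceSeq and wordCount are assumptions made for this sketch and do not come from the lecture or from Google's framework.

```haskell
import qualified Data.Map.Strict as Map

-- Sequential map-reduce.
--   mapper  turns one input into a list of (key, value) pairs
--   reducer combines all the values collected for one key
mapReduceSeq
  :: Ord k
  => (a -> [(k, v)])
  -> (k -> [v] -> r)
  -> [a]
  -> [(k, r)]
mapReduceSeq mapper reducer inputs =
  Map.toList (Map.mapWithKey reducer grouped)
  where
    pairs = concatMap mapper inputs
    -- Group the values by key, keeping them in input order.
    grouped = Map.fromListWith (flip (++)) [(k, [v]) | (k, v) <- pairs]

-- Word count: the mapper emits (word, 1); the reducer sums the ones.
wordCount :: [String] -> [(String, Int)]
wordCount =
  mapReduceSeq (\line -> [(w, 1) | w <- words line]) (\_ ones -> sum ones)

main :: IO ()
main = print (wordCount ["the cat sat", "the cat ran"])
```

Because mapper and reducer are pure functions, any single application can be repeated without changing the result, which is the property exploited for fault tolerance.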
In 2012, Richard told us why Erlang is a good fit for Klarna, emphasizing that though Erlang's performance can, of course, be beaten, it lets you get close enough to the best possible performance in a very short time. He talked about designing for parallelism, for example splitting shared resources to reduce contention. Databases in parallel distributed systems bring consistency problems, and Richard explained the famous CAP theorem. Finally, he mentioned that Klarna are always hiring!
Over the last few decades, the performance of processors has grown at a much faster pace than the performance of memories. The issue becomes even more severe with the advent of the (massive) multicore era. This gap is addressed by clever design and use of caches. One wouldn't be wrong to say that the design of parallel computers is, above all, the design of caches. The quantitative study of algorithms in a parallel setting has already extended the time and space complexity analyses with notions of work and depth. In this lecture, we take one more step and show how to reason about the cache behavior of algorithms.