Parallel Functional Programming

This page describes each lecture and contains links to related materials.

Course Introduction

This is the lecture that sets the scene for the course. It explains why functional languages are well suited for parallel programming, and this despite the fact that they really failed to deliver on the promise of solving parallel programming in the past. So why are they interesting now?

Slides:

The slides

Reading:

Note that we distinguish between parallel and concurrent programming. Our undergraduate curriculum at Chalmers has arguably overemphasized the latter (locks, semaphores, synchronisation mechanisms and the like) and underemphasized the business of making programs run faster by using many cores (see this discussion by Simon Marlow (previously Microsoft Research, now Facebook)), who makes distinctions that exactly mirror our view). Similar opinions were later expressed by Bob Harper, who has radically reshaped the introductory computing curriculum at CMU to place parallelism at the centre — a major source of inspiration for this course.

Simon Marlow’s book on Parallel and Concurrent Programming in Haskell gives a good explanation of why the topics of this course are interesting. It also makes the same distinction between concurrency and parallelism as that made in this course. We consider only Part I on parallelism. We will simply call the book PCPH. Simon will kindly give us a guest lecture in Week 4 of the course.

Be inspired by this video of Simon Peyton Jones lecturing on parallel programming in Haskell
The three papers listed on the second last slide of the first lecture are
- Haskell on a Shared-Memory Multiprocessr, Harris, Marlow and Peyton Jones, Haskell’05
- Feedback Directed Implicit Parallelism, Harris and Singh, ICFP’07
- Runtime Support for Multicore Haskell, Marlow, Peyton Jones and Singh, ICFP’09
Make sure to read the last of these.

from par and pseq to Strategies

This lecture considers par and pseq more critically, and concludes that it might be a good idea to separate the control of behaviour relating to parallelisation from the description of the algorithm itself. The idea of Strategies is described in a well-known paper called Algorithms + Strategies = Parallelism by Trinder, Hammond, Loidl and Peyton Jones. More recently, Marlow and some of the original authors have updated the idea, in Seq no more: Better Strategies for Parallel Haskell. We expect you to read both of these papers. The lecture is based on the newer paper. See also PCPH chapters 2 and 3.

Slides: * The slides

Other Material:

The documentation of the Strategies Library is very helpful.

exercise session on parallelising Haskell

Code:

Haskell file

The Par Monad

This lecture is about a programming model for deterministic parallelism, introduced by Simon Marlow and colleagues. It introduces the Par Monad, a monad for deterministic parallelism, and shows how I-structures are used to exchange information between parallel tasks (or “blobs”), see Marlow’s Haskell’11 paper with Ryan Newton and Simon PJ. You should read this paper.

Take a look at the I-Structures paper referred to in the lecture (not obligatory but interesting). See PCPH chapter 4.

Also, Phil Wadler’s “Essence of Functional Programming” is a very interesting read, and it covers monads and continuation passing style.

The lecture starts with a presentation by Koen Claessen on his Poor Man’s Concurrency Monad (see his JFP Pearl).

Slides:

GHC Heap Internals

In this lecture, Nick (TA) will tell you things that you need to know to make a good job of parallel programming in Haskell, based on experience from previous years.

There is only so much parallelism the memory can handle (the effect known as “memory wall”). While both functional and imperative languages use the concept of heap for managing memory, the behavior of programs written in pure languages like Haskell is radically different from that of programs written with aggressive use of side effects — there is no mutation of data but much more allocation of it. We will review the major design decisions behind GHC’s implementation of heap, including garbage collection, multithreading and I/O management. We will also take a look at how tweaking heap runtime parameters can impact performance of a program, with help of Threadscope.

Slides:

The slides

Parallel Programming in Erlang

This lecture introduces Erlang for Haskell programmers, taking parallelising quicksort as an example, both within one Erlang VM and distributed across a network. The latest version of the Erlang system can be downloaded from here. There is a Windows installer. Many linux versions have an Erlang packagage available, but not necessarily a package suitable for development of Erlang code, and not necessarily the latest version. On Ubuntu, try

sudo apt-get install erlang-dev

If that doesn’t work or you can’t find an appropriate package, build the VM from source.

Slides:

The slides

Skeletons for Parallel Scientific Computing (David Duke, Leeds Univ.)

Scientific datasets arising from observation (e.g. satellites, microscopes) or simulation (supercomputing) are simply files of numbers. While small data (kilobytes and megabytes) are sometimes valuable, more common examples are at the gigabyte/terabyte scale, with peta-byte datasets now routine. And work is underway to develop machines capable of processing exabyte-scale data. But numbers alone are useless to scientists - value comes from converting them into forms that answer open questions of shed new insight into the phenomena being studied. In computational science this step is achieved through a combination of mathematical analysis to extract structural features of the data, and visualization to present those features.

Given the size of datasets in computational science, parallel computing is routine. But while some algorithms are “embarrassingly parallel”, others have proven more difficult. A particularly awkward class are those for extracting topological abstractions, which provide highly compact and useful descriptors of the “shape” of phenomena. In this lecture I will introduce two of these abstractions, and then explain how we exploited Haskell to carry out analysis of data from nuclear simulations, first using a sequential implementation, and then moving to parallel for scalability. This work, carried out in collaboration with Nicholas Schunck, a physicist at the Lawrence Livermore National Laboratory in the US, has resulted in multiple papers including two publications [1],[2] in Physical Review C, and new insight into the process of nuclear fission.

The lecture will highlight an benefit of parallel programming in Haskell or indeed other functional languages: the opportunity to develop new higher-order abstractions, in the form of “skeletons” that capture patterns of computation. In principle these afford concise and elegant construction of parallel computations by applying computational patterns to problem-specific code. The reality is somewhat messier: effective use of parallel resources sometimes requires deep domain-specific knowledge and the creation of specialised skeletons. In the lecture I will illustrate these challenges using examples from computational topology, along with the steps needed to tune performance of our parallel applications on both shared and distributed memory architectures. Finally, I will also touch on some of the non-technical challenges in working with Haskell.

To read more about algorithmic skeletons, the recommended reference is

Parallel Functional Programming in Eden, R. Loogen, Y. Ortega-Mallen, and R. Pena-Mari, Journal of Functional Programming, 15(3), 431-475, 2005

Slides :

The slides (pdf)
The slides (pptx)

Robust Erlang

This lecture focusses on the fault tolerance constructs in Erlang–links and system processes–and the motivation for the “Let It Crash” philosophy. It introduces supervision trees and the Open Telecoms Platform, and develops a simple generic server.

Slides :

The slides

Haskell in Production at Facebook (Simon Marlow)

Facebook has a large existing system that identifies and remediates abuse: primarily spam, but also other types of abuse, using a combination of techniques including manually written rules and machine learning classifiers. This system actively and automatically prevents vast amounts of undesirable content from reaching users of Facebook. The system provides a domain-specific language in which the detection logic is written, and we are in the process of migrating this language from an in-house functional language called FXL to Haskell. At the current time, the system is running nearly all of its requests on Haskell. We believe this is the largest Haskell deployment currently in existence. In this talk I’ll explain the problem domain, and why Haskell is uniquely suited to it. The key value proposition of the DSL implementation is implicit concurrency, and I’ll outline our solution to this (the Haxl framework). I’ll also cover many of the engineering problems we had to solve, including how we deploy new code, going from a source code change to running new code on all the machines in a few minutes. The migration path we are on is littered with the corpses of bugs found and problems solved; I’ll share a few of the war stories and the lessons we have learned. Bio: Simon Marlow is a Software Engineer at Facebook in London. He is a co-author of the Glasgow Haskell Compiler, author of the book “Parallel and Concurrent Programming in Haskell”, and has a string of research publications in functional programming, language design, compilers, and language implementation.

Slides :

The slides (pdf)
The slides (pptx)

Paper:

Paper about desugaring Haskell’s do-notation into applicative operations (in submission)

Parallel Functional Programming in Java (Peter Sestoft)

It has long been assumed in academic circles that functional programming, and declarative processing of streams of immutable data, are convenient and effective tools for parallel programming. Evidence for this is now provided, paradoxically, by the object-imperative Java language, whose version 8 (from 2014) supports functional programming, parallelizable stream processing, and parallel array prefix operations. We illustrate some of these features and use them to solve computational problems that are usually handled by (hard to parallelize) for-loops, and also combinatorial problems such as the n-queens problem, using only streams, higher-order functions and recursion. We show that this declarative approach leads to very good performance on shared-memory multicore machines with a near-trivial parallelization effort on this widely used programming platform. We also highlight a few of the warts caused by the embedding in Java. Some of the examples presented are from Sestoft: Java Precisely, 3rd edition, MIT Press 2016.

Slides:

The slides

Single Assignment C — Functional Programming for HP^3 (Sven-Bodo Scholz)

SaC is designed to combine High-Productivity with High-Performance and High-Portability. The key to achieving this goal is a purely functional core of the language combined with several advanced compilation and runtime techniques. This lecture gives an overview of the key design choices that SaC is based upon and it sketches how these can be leveraged to producing codes for various heterogeneous many-core systems that often outperform hand-written low-level counterparts.

Slides:

The slides

Map Reduce

Google’s Map-Reduce framework has become a popular approach for processing very large datasets in distributed clusters. Although originally implemented in C++, it’s connections with functional programming are close: the original inspiration came from the map and reduce functions in LISP; MapReduce is a higher-order function for distributed computing; purely functional behaviour of mappers and reducers is exploited for fault tolerance; it is ideal for implementation in Erlang. This lecture explains what Map-Reduce is, shows a sequential and a simple parallel implementation in Erlang, and discusses the refinements needed to make it work in reality.

Reading:

One of
MapReduce: Simplified Data Processing on Large Clusters, the original paper from 2004.
MapReduce: Simplified Data Processing on Large Clusters, a retrospective published in CACM in 2008.

Yes, both papers have the same title (and the same authors). What can you do?

Slides:

The slides

The Erlang Virtual Machine (Erik Stenman)

Slides:

The slides

Parallel Functional Programming in Erlang at Klarna (Richard Carlsson)

Slides:

The slides

Cache Complexity and Parallelism

Over the last few decades, performance of processors has grown at a much faster pace than performance of memories. The issue becomes even more severe with the advent of the (massive) multicore era. This gap is addressed by clever design and use of caches. One wouldn’t be wrong to say that design of parallel computers is, above all, caches. The quantitative study of algorithms in a parallel setting has already extended the time and space complexity analyses with notions of work and depth. In this lecture, we take one more step and show how to reason about the cache behavior of algorithms.

Reading:

Cache-Oblivious Algorithms, Harald Prokop, MSc Thesis, MIT, 1999.

Slides:

The slides

Data Parallel Programming

This lecture is all about Guy Blelloch’s seminal work on the NESL programming language, and on parallel functional algorithms and associated cost models.

Material:

The best introduction is to watch the video of his marvellous invited talk at ICFP 2010
A page about NESL (including interactive tutorial, papers and information about applications)

Reading:

To read about Work and Depth, start with this page. Note that there will be an exam question about calculating work and depth of a NESL program. So study examples!
We advise reading the whole of Blelloch’s Programming Parallel Algorithms, CACM 39(3).

Slides:

The slides

General Purpose Computations on GPUs

Graphics Processing Units (GPUs) are massively parallel computers that offer great performance benefits over traditional CPUs for highly data parallel problems. The cost of this performance benefit is a more complicated software development procedure. When programming GPUs the programmer needs to manage, layout and store intermediate or often used values manually in a scratchpad memory. This can be compared to the transparent service provided caches in a CPU. GPUs thrive under workloads consisting of (tens of) thousands of independent threads, all doing the same work and exhibiting highly regular memory access patterns.

For achieving optimal performance on a GPU, programmers often specialize code for a particular problem size and decomposition of work over the available resources. This is because the cost of dynamic choice within the threads running on the GPU is high. The choices that leads to the best performance may also differ between different GPUs. The cost of experimentation with program decomposition is high in a language such as CUDA, where changing a decision of work-to-thread mapping made early may mean a complete rewrite of the application code.

Obsidian is an embedded language for design space exploration. The idea is to raise the level of abstraction enough to enable a faster turn around time when experimenting with decompositions of work onto the GPU resources. Using Obsidian it is possible to write parameterized, higher level, descriptions of algorithms and to generate specialized CUDA code for each parameter setting. Thus once the high level description is written, the variants are generated by the a small tweak of a parameter.

In paper [1], we outline the general ideas and goal behind Obsidian without going in-depth on any details. In Paper [2], we combine parameterized Obsidian programs with an auto-tuning system to do the parameter exploration automatically. For a long and very in-depth description of Obsidian you can refer to our JFP paper [3]. This is not required reading.

References:

[1] Bo Joel Svensson, Mary Sheeran, Ryan R. Newton Design Exploration through Code-generating DSLs

[2] Michael Vollmer, Bo Joel Svensson, Eric Holk, Ryan R. Newton Meta-programming and auto-tuning in the search for high performance GPU code

[3] Bo Joel Svensson, Ryan R. Newton, Mary Sheeran A language for hierarchical data parallel design-space exploration on GPUs

Slides:

The slides

To learn more:

A Udacity course about GPU programming that also gives access to GPUs for running CUDA code (for doing assignments). We have not tried this. Let us know what you think of it!

Data Parallel Programming II

We may continue with the General Purpose Computations on GPU topic from the previous lecture.

We will most likely return to non-GPU data parallel programming by briefly presenting some details of (and some non-idiomatic programming in) Repa (a library for data parallel programming in Haskell).

And some wrap up.

Material:

Chapter 5 of Marlow’s book is about Repa
The Repa home page
The third Repa paper: Guiding Parallel Array Fusion with Indexed Types

Slides:

The slides

Databases in the New World

No-SQL databases have become very popular for the kind of scalable applications that Erlang is used for. In this lecture, we introduce the mother of them all, Amazon’s Dynamo, and one of its descendants – Riak, implemented in Erlang by Basho Technologies. We discuss scalability, the CAP theorem, eventual consistency, consistent hashing and the ring, and the mechanisms used to detect, tolerate, and repair inconsistency.

Reading:

The key reference is the Dynamo paper.
For amusing and informative background reading, check out The Network is Reliable (yeah right).

Parallel Functional Programming – Lecture content	DAT280 / DIT261, LP4 2016
Home \| Schedule \| Labs \| Lectures \| Exam \| About	Fire \| Forum \| TimeEdit \| Links

Parallel Functional Programming – Lecture content	DAT280 / DIT261, LP4 2016
Home \| Schedule \| Labs \| Lectures \| Exam \| About	Fire \| Forum \| TimeEdit \| Links

Course Introduction

from par and pseq to Strategies

exercise session on parallelising Haskell

The Par Monad

GHC Heap Internals

Parallel Programming in Erlang

Skeletons for Parallel Scientific Computing (David Duke, Leeds Univ.)

Robust Erlang

Haskell in Production at Facebook (Simon Marlow)

Parallel Functional Programming in Java (Peter Sestoft)

Single Assignment C — Functional Programming for HP^3 (Sven-Bodo Scholz)

Map Reduce

The Erlang Virtual Machine (Erik Stenman)

Parallel Functional Programming in Erlang at Klarna (Richard Carlsson)

Cache Complexity and Parallelism

Data Parallel Programming

General Purpose Computations on GPUs

Data Parallel Programming II

Databases in the New World

A Brief History of Time (in Riak) (Russell Brown, Basho)