Approximate Schedule

The following units are not necessarily the same length.

Module 1: A concise introduction to decision theory [Slides]

Course introduction (15 min)

Probability and measure theory refresher (30 min)

This unit will give a gentle introduction to axiomatic measure and probability theory. We shall focus on the general concepts rather than technical details.

Subjective probability and utility [Slides]

While probability is generally seen as a tool to express randomness, the same mathematical machinery can be used to express subjective belief and uncertainty. We shall give an overview of the mathematical construction of subjective probability, and examples of how it can be used to quantify personal beliefs.

We shall also look at preferences, the theory of utility, and its connections to psychological experiments.

Decision problems [Slides]

By combining subjective probability and utility, we arrive at statistical decision problems. These are problems, arising frequently in practice, where we must take a decision so as to maximise expected utility. We shall examine the general properties of such problems and give examples.
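
To make this concrete, here is a minimal Python sketch (my own illustration with made-up numbers, not taken from the course materials) that selects the decision maximising expected utility under a fixed subjective belief over states of the world.

    import numpy as np

    # A minimal sketch (illustrative numbers only): choose the decision with
    # the highest expected utility under a subjective belief over world states.
    belief = np.array([0.2, 0.5, 0.3])             # subjective probability of each state
    utility = np.array([[ 1.0,  1.0, 1.0],         # utility[d, s]: utility of decision d in state s
                        [ 5.0, -1.0, 0.0],
                        [-2.0,  2.0, 3.0]])

    expected_utility = utility @ belief            # one expected utility per decision
    best_decision = int(expected_utility.argmax())
    print(expected_utility, best_decision)         # here: [1.0, 0.5, 1.5], decision 2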

We will also examine alternative notions of preferences and consider the links between utility, regret and game theory.

Conjugate priors and estimation (use as a reference - not covered as a single unit, but in segments throughout the course)

Personal beliefs change over time. Assuming a probabilistic framework, this change can be realised via Bayes' rule, or some other conditioning method. For conjugate priors, this conditioning is extremely simple and efficient. A few examples will be given.

Conditioning can be used to form a more accurate belief about the value of some unknown parameter. Thus, conditioning leads us directly to Bayesian parameter estimation, where our uncertainty about the parameters is expressed as a probability distribution.
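
As one standard example of both ideas (a minimal sketch, not from the course materials): with a Beta prior on the success probability of a Bernoulli source, conditioning on data only increments two counts, and the posterior mean gives a Bayesian point estimate of the unknown parameter.

    # A minimal conjugate-prior sketch: Beta(alpha, beta) is conjugate to the
    # Bernoulli likelihood, so Bayes' rule reduces to incrementing counts.
    def beta_bernoulli_update(alpha, beta, observations):
        """observations: iterable of 0/1 outcomes."""
        for x in observations:
            alpha += x                              # successes update alpha
            beta += 1 - x                           # failures update beta
        return alpha, beta

    alpha, beta = beta_bernoulli_update(1.0, 1.0, [1, 0, 1, 1, 0, 1])   # uniform prior
    print(alpha, beta, alpha / (alpha + beta))      # posterior parameters and posterior mean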

Finally, we are sometimes interested in estimation of parameters with as little prior knowledge as possible. In that case, concentration inequalities can be used to bound the errors of our estimates.
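
For instance, Hoeffding's inequality gives a distribution-free confidence interval for the mean of bounded random variables; the following sketch (my own illustration) computes its width for a given sample size and error probability.

    import math
    import random

    # Hoeffding's inequality for i.i.d. samples in [0, 1]:
    # P(|empirical mean - true mean| >= eps) <= 2 * exp(-2 * n * eps**2).
    def hoeffding_epsilon(n, delta):
        """Half-width of a two-sided interval holding with probability at least 1 - delta."""
        return math.sqrt(math.log(2.0 / delta) / (2.0 * n))

    n, delta = 1000, 0.05
    samples = [random.random() < 0.3 for _ in range(n)]     # Bernoulli(0.3) draws, for illustration
    mean = sum(samples) / n
    print(f"estimate {mean:.3f} +/- {hoeffding_epsilon(n, delta):.3f} with probability >= {1 - delta}")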

Module 2: Markov decision processes

Experiment design, reinforcement learning and Markov decision processes (2-3 days) [Slides]

The more general case is when our decisions are not just whether to continue or stop the experiment, but also what experiment to perform. This unit will give a thorough overview of problems in experiment design, by formalising them as Markov decision processes, where the state of the process is the state of our belief. This links directly to multi-armed bandit problems and reinforcement learning. These problems will also be solved using backwards induction.
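
The following sketch (my own illustration, not the course's code) makes the belief-state idea concrete for a two-armed Bernoulli bandit: each arm carries a Beta belief, the pair of beliefs is the state, and backwards induction over a finite horizon yields the Bayes-optimal value.

    from functools import lru_cache

    # Bayes-optimal value of a 2-armed Bernoulli bandit by backwards induction
    # over belief states. Each arm's unknown success probability has a
    # Beta(a, b) belief; pulling an arm yields reward 1 on success, 0 otherwise.
    @lru_cache(maxsize=None)
    def value(belief, steps_left):
        """belief = ((a1, b1), (a2, b2)); returns expected total reward of optimal play."""
        if steps_left == 0:
            return 0.0
        best = float("-inf")
        for i, (a, b) in enumerate(belief):
            p = a / (a + b)                          # predictive probability of success
            nxt = list(belief)
            nxt[i] = (a + 1, b)                      # belief after observing a success
            v_succ = value(tuple(nxt), steps_left - 1)
            nxt[i] = (a, b + 1)                      # belief after observing a failure
            v_fail = value(tuple(nxt), steps_left - 1)
            best = max(best, p * (1 + v_succ) + (1 - p) * v_fail)
        return best

    print(value(((1, 1), (1, 1)), 10))               # uniform priors, horizon 10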

We shall then consider Markov decision processes as models for dynamic environments. There, the state is the state of the environment itself. We shall also examine basic algorithms for Markov decision processes, including value iteration (a variant of backwards induction), policy iteration and linear programming.
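
As a sketch of the first of these algorithms (my own illustration on a made-up two-state MDP), value iteration repeatedly applies the Bellman optimality operator until the value function converges.

    import numpy as np

    # Value iteration on a small MDP. P[a] is the state-transition matrix under
    # action a, R[s, a] the expected reward, gamma the discount factor.
    def value_iteration(P, R, gamma=0.9, tol=1e-8):
        n_states, n_actions = R.shape
        V = np.zeros(n_states)
        while True:
            Q = R + gamma * np.stack([P[a] @ V for a in range(n_actions)], axis=1)
            V_new = Q.max(axis=1)                    # Bellman optimality backup
            if np.max(np.abs(V_new - V)) < tol:
                return V_new, Q.argmax(axis=1)       # optimal values and a greedy policy
            V = V_new

    # Illustrative two-state, two-action MDP: action 1 tends to reach state 1, which pays 1.
    P = [np.array([[0.9, 0.1], [0.9, 0.1]]),
         np.array([[0.2, 0.8], [0.1, 0.9]])]
    R = np.array([[0.0, 0.0], [1.0, 1.0]])
    print(value_iteration(P, R))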

Module 3: Reinforcement learning I

Reinforcement learning: basic algorithms [Slides]

Here we examine stochastic algorithms for solving Markov decision processes, and their connection to the reinforcement learning problem, where the MDP is unknown.
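
One standard member of this family is tabular Q-learning; the sketch below (my own illustration) assumes a hypothetical environment interface with reset() and step(action) returning (next_state, reward, done), which is not part of the course materials.

    import random
    from collections import defaultdict

    # Tabular Q-learning: a stochastic-approximation update towards the Bellman
    # optimality target, with epsilon-greedy exploration. The `env` interface
    # (reset/step) is an assumption made for this sketch.
    def q_learning(env, actions, episodes=1000, alpha=0.1, gamma=0.99, epsilon=0.1):
        Q = defaultdict(float)                       # Q[(state, action)], zero-initialised
        for _ in range(episodes):
            state, done = env.reset(), False
            while not done:
                if random.random() < epsilon:
                    action = random.choice(actions)
                else:
                    action = max(actions, key=lambda a: Q[(state, a)])
                next_state, reward, done = env.step(action)
                target = reward + (0.0 if done else gamma * max(Q[(next_state, a)] for a in actions))
                Q[(state, action)] += alpha * (target - Q[(state, action)])
                state = next_state
        return Q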

Reinforcement learning: approximations [Slides]

In this unit, we consider the case where an exact representation of the optimal value function or policy is not possible. We look at a number of approximation architectures and see how they connect to the stochastic setting, including approximate policy iteration and actor-critic algorithms.
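
As one simple instance of such an architecture (my own sketch, not the course's code), semi-gradient TD(0) evaluates a fixed policy with a linear value function over a given feature map phi.

    import numpy as np

    # Semi-gradient TD(0) with a linear approximation architecture: V(s) ~ phi(s) @ w.
    # `transitions` is assumed to be a stream of (state, reward, next_state, done)
    # tuples generated by following the policy being evaluated.
    def td0_linear(transitions, phi, dim, alpha=0.01, gamma=0.99):
        w = np.zeros(dim)
        for s, r, s_next, done in transitions:
            v = phi(s) @ w
            v_next = 0.0 if done else phi(s_next) @ w
            td_error = r + gamma * v_next - v
            w += alpha * td_error * phi(s)           # semi-gradient step along the features of s
        return w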

Module 4: Reinforcement learning II

Reinforcement learning: Optimal exploration [Slides]

We take a closer look at reinforcement learning problems from a Bayesian perspective. We employ a model-based approach and show that, when we have a prior over MDPs, the resulting decision problem is itself an MDP over belief states. We then extend the discussion to partially observable Markov decision processes (POMDPs).
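
Solving the belief MDP exactly is rarely tractable; a common model-based relative, sketched below as my own illustration (and not the optimal Bayesian policy itself), is posterior sampling: keep Dirichlet posteriors over the unknown transition probabilities, sample one MDP from the posterior, and plan in the sample.

    import numpy as np

    # Posterior-sampling sketch for an MDP with unknown transitions and known
    # rewards: sample a model from Dirichlet posteriors, then plan in the sample.
    def sample_and_plan(dirichlet_counts, R, gamma=0.95, sweeps=200):
        """dirichlet_counts[s, a, s'] are posterior counts; R[s, a] are known rewards."""
        n_s, n_a, _ = dirichlet_counts.shape
        P = np.array([[np.random.dirichlet(dirichlet_counts[s, a])
                       for a in range(n_a)] for s in range(n_s)])
        V = np.zeros(n_s)
        for _ in range(sweeps):                      # a few value-iteration sweeps
            Q = R + gamma * np.einsum('sap,p->sa', P, V)
            V = Q.max(axis=1)
        return Q.argmax(axis=1)                      # greedy policy for the sampled MDP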

Reinforcement learning: Distribution-free models [Slides]

We revisit the MDP planning problem introduced in "Experiment Design". We give algorithms for both multi-armed bandit problems and the complete reinforcement learning problem, based on statistical confidence bounds. These algorithms exhibit an "optimism under uncertainty" behaviour. We link this to the Bayesian approaches seen in the previous lecture.
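
For the bandit case, UCB1 is the canonical example: the sketch below (my own illustration, where pull(arm) is a hypothetical function returning a reward in [0, 1]) adds a confidence bonus to each empirical mean and pulls the most optimistic arm.

    import math

    # UCB1: optimism under uncertainty via Hoeffding-style confidence bonuses.
    # `pull(arm)` is assumed to return a reward in [0, 1].
    def ucb1(pull, n_arms, horizon):
        counts = [0] * n_arms
        sums = [0.0] * n_arms
        for t in range(1, horizon + 1):
            if t <= n_arms:
                arm = t - 1                          # pull each arm once to initialise
            else:
                arm = max(range(n_arms),
                          key=lambda a: sums[a] / counts[a]
                                        + math.sqrt(2 * math.log(t) / counts[a]))
            r = pull(arm)
            counts[arm] += 1
            sums[arm] += r
        return counts, sums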

Preference elicitation and learning from demonstration (not covered) [Slides] [Paper 1] [Paper 2]

We now view MDPs as expressing the preferences of individuals. We present some Bayesian models for estimating those MDPs from data of individual behaviour. We discuss applications in advertising, cognitive science and learning to play games from demonstration.

Module 5: Adversarial models and games

Prediction with expert advice

Frequently there is no way to construct a plausible model, but we have many rules of thumb to choose from. We take a look at some aspects of the problem of prediction with expert advice in the full-information setting. (Chapters 1 and 2 of the linked book).
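
A central algorithm here is the exponentially weighted average forecaster; the sketch below (my own illustration) only tracks the expert weights, which the forecaster would use to average the experts' predictions each round.

    import math

    # Exponentially weighted average forecaster (full information): each expert's
    # weight decays exponentially in its cumulative loss.
    def hedge(expert_losses, eta=0.5):
        """expert_losses: one list per round, each with a loss in [0, 1] per expert."""
        weights = [1.0] * len(expert_losses[0])
        for losses in expert_losses:
            weights = [w * math.exp(-eta * l) for w, l in zip(weights, losses)]
            total = sum(weights)
            weights = [w / total for w in weights]   # normalise for numerical stability
        return weights

    print(hedge([[0.0, 1.0], [0.0, 1.0], [1.0, 0.0]]))   # expert 0 has the smaller cumulative loss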

Prediction under partial monitoring [Slides]

Here we generalise bandit problems to other problems of prediction under partial monitoring and show how this relates to learning in games.
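
In the bandit special case, only the chosen action's loss is observed; EXP3, sketched below as my own illustration (with get_loss(action) a hypothetical callback returning a loss in [0, 1]), feeds an importance-weighted loss estimate into the exponential-weights update.

    import math
    import random

    # EXP3: exponential weights with importance-weighted loss estimates, for the
    # bandit feedback setting. `get_loss(action)` is assumed to return a loss in [0, 1].
    def exp3(get_loss, n_actions, horizon, eta=0.1):
        weights = [1.0] * n_actions
        for _ in range(horizon):
            total = sum(weights)
            probs = [w / total for w in weights]
            action = random.choices(range(n_actions), weights=probs)[0]
            loss = get_loss(action)
            estimate = loss / probs[action]          # unbiased estimate of that action's loss
            weights[action] *= math.exp(-eta * estimate)
        return weights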

Multi-agent problems

The final chapter looks at multi-agent problems. We mainly cover Bayesian games, which are useful for the case when the other agents are neither completely adversarial nor completely co-operative.
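
As a small illustration (my own sketch, with made-up payoffs), a player who is uncertain about the opponent's type can compute a best response by averaging payoffs over a prior on types and a conjectured strategy for each type.

    import numpy as np

    # Best response in a simple Bayesian game: average player 1's payoff over a
    # prior on the opponent's type and a conjectured pure strategy per type.
    def best_response(payoff, type_prior, opponent_strategy):
        """payoff[t][a1, a2]: player 1's payoff against a type-t opponent."""
        expected = np.zeros(payoff[0].shape[0])
        for t, p in enumerate(type_prior):
            expected += p * payoff[t][:, opponent_strategy[t]]
        return int(expected.argmax()), expected

    payoff = [np.array([[3.0, 0.0], [1.0, 1.0]]),    # against type 0
              np.array([[0.0, 3.0], [1.0, 1.0]])]    # against type 1
    print(best_response(payoff, [0.5, 0.5], [0, 1])) # types equally likely, each plays a different column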

Last 1-2 weeks focused on project discussions

Week 8 (in 2015): Project presentations