Supervised learning and principles of data mining

This part of the course is about supervised learning, and general methodology for data mining, machine learning and data science.

Questions and office hours

You must enroll in the Piazza platform for online questions and answers.

Questions can be answered by students as well as by me. If problems remain unresolved, I am normally available for questions every Tuesday, 10:00-12:00 and 13:00-15:00, at building C. On Wednesday afternoons I am available by appointment only.

Assessment

You must enroll in the Piazza platform to obtain a grade.

Assessment is continuous. It is important to participate in the in-lecture exercises, the lecture feedback, and the QA platform. For short exercises, we'll use socrative.com; please use your full name.

If you are unable to attend a lecture, you are advised to inform me in advance. If you have missed a session, the important questions from it are posted in the QA platform. Answer them.

Course programme

All the course files and examples are here

Introduction to classification problems
Introduction to trees, ID3 and entropy
Practical session: Trees with R
Holdouts for tuning trees
Practical session: Effect of hyperparameters and data size on error
Nearest neighbours
Practical session: Comparison between algorithms
Bayesian inference and the Naive Bayes classifier

Course content

The second part of the course deals with supervised learning problems, such as classification and regression. Just like unsupervised problems, these are problems where we wish to draw a conclusion from a set of training data. In supervised learning problems, the training examples are composed of two parts. The first is a set of features or attributes that describe each example. The second is a label, or target. The supervised learning problem is to use the training data to learn a decision rule that can predict the labels for new, unlabelled examples. The two main types of supervised learning problems are classification and regression.

Classification

The problem is called classification when the labels represent classes; our decision rule then classifies new examples into one of the classes. This is in contrast to the unsupervised problem of clustering, where the groups must be found without any labels. In this course, we will cover the following classification methods:

Decision trees.
Nearest neighbour.
Naive Bayes.
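
To make the idea of a decision rule concrete, here is a minimal sketch of a 1-nearest-neighbour classifier. It is written in Python as an assumption (the practical sessions use R), and the function name and the toy data are hypothetical, chosen only for illustration. The rule predicts, for a new example, the label of the closest training example under Euclidean distance.

```python
import math

def nearest_neighbour_predict(train_features, train_labels, query):
    """Return the label of the training example closest to `query`.

    Distance is Euclidean; ties are broken in favour of the earliest example.
    """
    best_index = min(
        range(len(train_features)),
        key=lambda i: math.dist(train_features[i], query),
    )
    return train_labels[best_index]

# Hypothetical toy training set: two features per example, one class label each.
train_features = [(1.0, 1.0), (1.5, 2.0), (5.0, 5.0), (6.0, 5.5)]
train_labels = ["a", "a", "b", "b"]

# Classify two new, unlabelled examples.
prediction_near_a = nearest_neighbour_predict(train_features, train_labels, (1.2, 1.4))
prediction_near_b = nearest_neighbour_predict(train_features, train_labels, (5.5, 5.0))
```

Here the training data itself acts as the decision rule: nothing is fitted in advance, and all the work happens at prediction time.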

Other methods, such as neural networks and support vector machines, will not be covered in this course. They form part of the advanced data mining course.

Regression

When the labels are continuous numbers or vectors, supervised learning is called regression. As regression will be covered in more detail in other courses in statistics, in this course we will only explain the basic problem.
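
As an illustration of the basic regression problem, the sketch below fits a straight line to one-dimensional data by ordinary least squares. This particular method is an assumption made for the example and is not necessarily the one used in the course; the code is in Python (the practicals use R), and the function name and data are hypothetical.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums (the common 1/n factor cancels).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical training data generated from y = 2x + 1 without noise.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

slope, intercept = fit_line(xs, ys)

# Predict the continuous label for a new, unlabelled example x = 4.
predicted = slope * 4.0 + intercept
```

Unlike in classification, the prediction is a continuous number rather than one of a fixed set of classes.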

Methodology

Beyond specific methods for solving classification problems, a main theme in this course is proper methodology for selecting models and their parameters. This means that we shall go over basic techniques for validating the performance of our models, such as:

Holdout sets
Cross-validation
Bootstrapping

This is a crucial step to ensure that we are not misled by our data analysis.
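
The three validation techniques listed above differ mainly in how they split or resample the indices of the training examples. The following is a minimal sketch under stated assumptions: it is in Python (the practicals use R), the function names are hypothetical, and it only produces index sets. Fitting and scoring a model on each split is left out.

```python
import random

def holdout_split(n_examples, holdout_fraction, rng):
    """Shuffle the indices and set aside a fraction as the holdout set."""
    indices = list(range(n_examples))
    rng.shuffle(indices)
    n_holdout = int(round(n_examples * holdout_fraction))
    return indices[n_holdout:], indices[:n_holdout]

def k_fold_splits(n_examples, k, rng):
    """Return k (training, validation) index pairs; each example is validated once."""
    indices = list(range(n_examples))
    rng.shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for i in range(k):
        validation = folds[i]
        training = [j for f, fold in enumerate(folds) if f != i for j in fold]
        splits.append((training, validation))
    return splits

def bootstrap_sample(n_examples, rng):
    """Draw n_examples indices uniformly at random, with replacement."""
    return [rng.randrange(n_examples) for _ in range(n_examples)]

rng = random.Random(0)  # fixed seed so the splits are reproducible
train_idx, holdout_idx = holdout_split(10, 0.3, rng)
splits = k_fold_splits(10, 5, rng)
boot_idx = bootstrap_sample(10, rng)
```

In each case the model is fitted on one set of indices and evaluated on examples it has not seen, so the measured error is not flattered by the training data.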