This project concerns important topics that are used in all areas of AI, not only NLP:
You will implement some basic classifiers yourself (Naive Bayes and the Perceptron), and use some existing libraries that implement other algorithms, such as k-nearest-neighbors, support vector machines, decision trees, maximum entropy classifiers, or probabilistic topic models.
You will use the following data:
If you want to do more than the basic tasks, you might want to use the following datasets:
You should at least implement the following machine learning algorithms:
You should also test at least one other machine learning algorithm, such as k-nearest-neighbors, support vector machines, decision trees, maximum entropy classifiers, or probabilistic topic models.
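For example, a linear support vector machine over bag-of-words counts takes only a few lines with scikit-learn. This is just one possible choice of library, and the wrapper functions below are an illustrative sketch, not required names:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

def train_svm(train_docs, train_labels):
    # train_docs is a list of raw document strings
    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(train_docs)
    return vectorizer, LinearSVC().fit(X, train_labels)

def predict_svm(model, docs):
    vectorizer, clf = model
    return clf.predict(vectorizer.transform(docs))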
The Amazon data can be used for at least three different kinds of classification tasks:
For each of these tasks you should use your implemented classifiers: Naive Bayes, Perceptron and Averaged Perceptron. You should also use at least one external library of your own choosing.
To get more accurate evaluations, you should use N-fold cross-validation for each task. Also, think about how you want to split your data to be sure that you don't get any strange biases.
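A minimal sketch of N-fold cross-validation, assuming your classifier is wrapped in a training function and a prediction function passed as parameters (the names here are placeholders, not given by the assignment):

import random

def cross_validate(train_fn, predict_fn, docs, labels, n_folds=10, seed=0):
    # Shuffle documents and labels together first, so the folds are not
    # biased by the order of the data (e.g. all positive reviews first).
    pairs = list(zip(docs, labels))
    random.Random(seed).shuffle(pairs)
    fold_size = len(pairs) // n_folds
    scores = []
    for i in range(n_folds):
        test = pairs[i * fold_size:(i + 1) * fold_size]
        train = pairs[:i * fold_size] + pairs[(i + 1) * fold_size:]
        model = train_fn([d for d, _ in train], [l for _, l in train])
        preds = predict_fn(model, [d for d, _ in test])
        scores.append(sum(p == l for p, (_, l) in zip(preds, test)) / len(test))
    return sum(scores) / len(scores)  # mean accuracy over the folds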
Here are some academic papers that describe these kinds of classification problems, and their associated algorithms.
For each of the tasks you perform, you should evaluate the accuracy of your different classifiers. Compute confidence intervals and carry out McNemar tests to see whether the differences are statistically significant.
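The McNemar test only needs the number of test documents where exactly one of two classifiers is correct. A minimal sketch, assuming the predictions and gold labels are plain lists of equal length:

def mcnemar(preds_a, preds_b, gold):
    # n01: A wrong but B right; n10: A right but B wrong
    n01 = sum(a != g and b == g for a, b, g in zip(preds_a, preds_b, gold))
    n10 = sum(a == g and b != g for a, b, g in zip(preds_a, preds_b, gold))
    if n01 + n10 == 0:
        return 0.0, False  # the classifiers never disagree on correctness
    # Chi-squared statistic with continuity correction, 1 degree of freedom
    stat = (abs(n01 - n10) - 1) ** 2 / (n01 + n10)
    # 3.84 is the 5% critical value of the chi-squared distribution (1 df)
    return stat, stat > 3.84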
Print a few documents that are misclassified by the best of the classifiers, and try to work out why they were difficult to classify.
What is the relation between the size of the training set and the performance of the classifier on the test set? Train classifiers on subsets of the training set and evaluate them all on the same test set. Report the results in a plot or a table.
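A sketch of such a learning-curve experiment, reusing the same kind of placeholder training and prediction functions as in the cross-validation sketch above:

def learning_curve(train_fn, predict_fn, train_docs, train_labels,
                   test_docs, test_labels,
                   fractions=(0.1, 0.25, 0.5, 0.75, 1.0)):
    results = []
    for frac in fractions:
        n = int(frac * len(train_docs))
        model = train_fn(train_docs[:n], train_labels[:n])
        preds = predict_fn(model, test_docs)
        acc = sum(p == g for p, g in zip(preds, test_labels)) / len(test_labels)
        results.append((n, acc))  # one (training size, accuracy) point
    return results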
There are several ways you can extend your work; here are some suggestions:
The perceptron learning algorithm has the drawback that its end result is biased toward the final examples it saw. A solution proposed by Freund and Schapire (1999) and also used by Collins (2002) is to return the average weight vector rather than the final weight vector.
It is a bit impractical to keep all versions of the weight vector in memory and then take the average at the end. Instead, we can build the average vector incrementally, updating it while we build the usual weight vector.
Then the perceptron algorithm will be something like this:
1. Let N = the number of iterations through the training set, and T = the size of the training set.
2. Initialize the weight vector w to all zeros, the average weight vector wa to all zeros, and the counter c to 0.
3. Iterate N times through the training set:
   • For each x, y in the training set:
     a) Let guess = classify(x) using our current w
     b) If y is not equal to guess:
        • w := w + f(x, y) - f(x, guess)
        • average_weight := (N*T - c) / (N*T)
        • wa := wa + average_weight * (f(x, y) - f(x, guess))
     c) Increase the counter c
4. Return wa
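Below is a minimal Python sketch of this algorithm for the multiclass case, assuming dense numpy feature vectors and integer class labels; here f(x, y) simply places the feature vector x in the block of weights belonging to class y:

import numpy as np

def train_averaged_perceptron(X, Y, n_classes, n_iter=10):
    T, d = X.shape                 # T = size of the training set
    N = n_iter                     # N = iterations through the training set
    w = np.zeros((n_classes, d))   # current weight vector
    wa = np.zeros((n_classes, d))  # average weight vector
    c = 0                          # counter over all N*T steps
    for _ in range(N):
        for x, y in zip(X, Y):
            guess = int(np.argmax(w @ x))  # classify x with the current w
            if guess != y:
                # w := w + f(x, y) - f(x, guess)
                w[y] += x
                w[guess] -= x
                # Scale the update by the fraction of remaining steps, so
                # that wa ends up as the average of the N*T successive
                # versions of w.
                avg_weight = (N * T - c) / (N * T)
                wa[y] += avg_weight * x
                wa[guess] -= avg_weight * x
            c += 1
    return wa

To classify a new document, pick the class whose row of wa gives the highest score, e.g. int(np.argmax(wa @ x)).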