Assignment 1: Word Sense Disambiguation

In this assignment, you will investigate neural network architectures for word sense disambiguation in English. The main didactic goal of this assignment is for you to explore different solutions to a classification task and to see how your design choices affect the classifier's performance.

Submission: Please submit your solution via the Canvas submission page. If you work in a group, please remember to register in a Canvas group.

Deadline: April 19.

Introduction

Some words can take different meanings – senses – depending on the context. For instance, here are some sentences exemplifying three different senses of the word line.

Yesterday, Walter procrastinated and wrote just two lines of code.
I had to spend more than two hours waiting in line.
In terms of total passengers, the Central line is the second busiest on the Underground.

WordNet is a lexical database that contains lists of senses for English words, including definitions and examples. For instance, if we consider the WordNet entry for line, we can see that the examples above would correspond to three different WordNet senses, identified by the sense keys line%1:10:02:: (line of text), line%1:14:01:: (queue) and line%1:06:03:: (railway line), respectively.

Word sense disambiguation (WSD) is the task of automatically determining the sense of a word occurrence. It can be operationalized as a machine learning problem in a few different ways, depending on the type of data we have available. In our case, we will treat it as a straightforward classification problem: given an occurrence of a word in a text, determine its WordNet sense.

Implementing a supervised word sense disambiguator

You will implement a supervised classification system that is able to select WordNet senses for 30 different English words.

Dataset. The following file contains the training and evaluation data that you need for the assignment. The training data file consists of one training example per line, with tab-separated columns. For instance, the examples above would be formatted as follows:

line%1:10:02::   line.n   8   Yesterday , Walter procrastinated and wrote just two lines of code .
line%1:14:01::   line.n   10  I had to spend more than two hours waiting in line .
line%1:06:03::   line.n   8   In terms of total passengers , the Central line is the second busiest on the Underground .

The columns in the training data correspond to the WordNet sense key, the lemma and its part of speech (e.g. line.n), the position of the target word in the tokenized sentence (counting from 0), and the tokenized text itself.

Your classifier should be trained to predict the WordNet sense, so the first column will be used as the classifier's output. In the file containing the evaluation data, the WordNet sense keys are hidden and will need to be predicted by your classifier.
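
For concreteness, here is a minimal sketch of how the training file could be read in Python. The filename wsd_train.txt is an assumption; adjust it to whatever the downloaded file is called.

    # A minimal sketch for reading the tab-separated training data into a
    # list of dictionaries, following the four-column format described above.
    def read_data(path):
        examples = []
        with open(path, encoding="utf-8") as f:
            for line in f:
                sense_key, lemma, position, text = line.rstrip("\n").split("\t")
                examples.append({
                    "sense_key": sense_key,     # the output label
                    "lemma": lemma,             # e.g. "line.n"
                    "position": int(position),  # index of the target token
                    "tokens": text.split(),     # the pre-tokenized sentence
                })
        return examples

    train_data = read_data("wsd_train.txt")  # hypothetical filename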

Your tasks. Investigate how this task can be implemented as a neural network classifier. In your submission, you should describe the different approaches that you have tried and how well they worked.

There are several different ways that such classifiers can be implemented. You may, for instance, find some inspiration in the paper Word Sense Disambiguation using a Bidirectional LSTM by Kågebäck and Salomonsson (2016), but you don't have to follow their solution. You might also consider simpler solutions, e.g. just treating this as a straightforward document classification problem (using a separate classifier for each lemma) and ignoring the position of the target word.

Possible design choices that you can explore include how to represent the input words (e.g. randomly initialized versus pre-trained embeddings), how to encode the context of the target word (e.g. averaging the word vectors, a recurrent network, or a pre-trained contextual model), whether and how to use the position of the target word, and whether to train a separate classifier for each lemma or a single model for all 30 words.
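
To make the simplest of these options concrete, here is a minimal sketch of a bag-of-embeddings classifier in PyTorch. It is an illustration rather than a recommended solution; it assumes one model per lemma and that tokens and sense keys have already been mapped to integer ids.

    import torch.nn as nn

    # A minimal sketch of the simplest design above: represent the sentence
    # as the average of its word embeddings (ignoring word order and the
    # target position) and feed it to a linear output layer. Vocabulary and
    # label indexing, as well as pre-trained embeddings, are left out.
    class BagOfEmbeddingsWSD(nn.Module):
        def __init__(self, vocab_size, emb_dim, n_senses):
            super().__init__()
            self.embedding = nn.EmbeddingBag(vocab_size, emb_dim, mode="mean")
            self.output = nn.Linear(emb_dim, n_senses)

        def forward(self, token_ids, offsets):
            # token_ids: concatenated token indices for a batch of sentences;
            # offsets: start position of each sentence within token_ids.
            return self.output(self.embedding(token_ids, offsets))

Note that averaging the word vectors discards word order and the position of the target word entirely, which is exactly the kind of design decision the assignment asks you to examine.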

If you implement a dummy classifier that selects the most frequent sense for each lemma (e.g. always line%1:04:01:: for line), you will get an accuracy of around 0.30. This simple classifier is called the most-frequent-sense baseline (MFS).
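
As an illustration of what this baseline computes (the evaluation package described below includes a ready-made script for it), here is a sketch reusing the read_data format from above:

    from collections import Counter

    # A sketch of the most-frequent-sense baseline: for each lemma, always
    # predict the sense key that is most common in the training data.
    def mfs_predictor(train_data):
        counts = {}
        for ex in train_data:
            counts.setdefault(ex["lemma"], Counter())[ex["sense_key"]] += 1
        return {lemma: c.most_common(1)[0][0] for lemma, c in counts.items()}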

Hint. In the document classification examples shown in the lectures, the lecturer was a bit sloppy and trained for a fixed number of epochs. It is probably a better idea to use early stopping or something similar. It may also be a good idea to rerun your experiments several times and consider how much the performance varies between runs.
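
Here is a sketch of what early stopping could look like; train_epoch and evaluate are hypothetical placeholders for your own training and validation routines.

    import copy

    # A sketch of simple early stopping: keep the weights that give the best
    # validation accuracy and stop when there has been no improvement for
    # `patience` consecutive epochs.
    def train_with_early_stopping(model, patience=3, max_epochs=100):
        best_acc, best_state, bad_epochs = 0.0, None, 0
        for epoch in range(max_epochs):
            train_epoch(model)     # one pass over the training data
            acc = evaluate(model)  # accuracy on held-out validation data
            if acc > best_acc:
                best_acc, best_state = acc, copy.deepcopy(model.state_dict())
                bad_epochs = 0
            else:
                bad_epochs += 1
                if bad_epochs >= patience:
                    break
        if best_state is not None:
            model.load_state_dict(best_state)  # restore the best weights
        return best_acc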

Your submission

Rules of the game. You may copy code snippets from the examples shown during the lectures, or from other code that you find online. In that case, you must state explicitly in your code which parts have been copied and where they come from.

You are allowed to use external libraries. This could be utility libraries to simplify PyTorch coding, or something NLP-specific such as NLTK or spaCy.

You are allowed to use additional unlabeled training data. Please ask for permission if you want to use more WordNet-annotated data.

You are allowed to use generic pre-trained neural components, such as word2vec or GloVe word embeddings or contextualized models such as ELMo, BERT, etc.

You are allowed to use data from WordNet, such as the definitions, examples, or sense-to-sense relations. If you want to access WordNet, the NLTK Python API might be useful. (But please note that the method synset_from_sense_key seems to be buggy at the time of writing.)
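
For example, one possible way to get from a sense key to a synset with NLTK is to go through lemma_from_key instead of the possibly buggy synset_from_sense_key:

    import nltk
    from nltk.corpus import wordnet as wn

    nltk.download("wordnet")  # one-time download of the WordNet data

    # Map a sense key to a synset via lemma_from_key. The printed values
    # illustrate the kind of information available for each sense.
    lemma = wn.lemma_from_key("line%1:10:02::")
    synset = lemma.synset()
    print(synset.definition())  # the dictionary-style gloss for this sense
    print(synset.examples())    # example sentences, if any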

You are not allowed to use pre-trained WSD models or any code from WSD systems.

Deliverables. You should submit your code, either as plain program code or a Jupyter notebook (.ipynb). Alternatively, submit a link to a Colab notebook.

You also need to include a brief report that describes the technical approach that you have chosen, as well as your experimental results. This documentation can be a separate file or (preferably) Markdown cells in a Jupyter notebook.

If you use any pre-trained models, please include a download link unless they come packaged with a library.

It is required that the code can be run easily, possibly with some trivial modification to point to the location of the training data. You should include instructions on how to train the model and how to run it on the blind test data.

Evaluating on the test set. When you have developed a model that you are confident about, compute its predictions for the blind test set and print the predictions to an output file (with one output label per line). Download this package, which contains the complete test set for this assignment, unpack it, and run the evaluation script (python3 evaluate.py wsd_test.txt YOUR_OUTPUT). For your reference, the package also includes a script (dummy_baseline.py) that computes the most-frequent-sense baseline and saves its output to a file (dummy_baseline.txt).
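
For instance, a minimal sketch of writing the predictions, assuming they have been collected in a list of sense-key strings (the output filename is an arbitrary choice):

    # Write one predicted sense key per line, in the same order as the
    # examples in the blind test file.
    def write_predictions(predictions, path="my_output.txt"):
        with open(path, "w", encoding="utf-8") as f:
            for sense_key in predictions:
                f.write(sense_key + "\n")

You can then score the resulting file with python3 evaluate.py wsd_test.txt my_output.txt.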

Grading. To pass the assignment, you need to investigate at least two neural architectures that differ in a nontrivial way. At least one of these solutions should achieve an accuracy higher than the MFS baseline.