Word Sense Embedded in Geometric Spaces - From Induction to Applications using Machine Learning

Licentiate seminar

Date & time: December 2, 2016
Location: HC1, Hörsalsvägen 14, Chalmers

Discussion leader

Richard Socher - Chief scientist at Salesforce and lecturer at Stanford University


Main supervisor: Devdatt Dubhashi
Co-supervisors: Richard Johansson, Shalom Lappin

Thesis abstract

Words are not detached individuals but part of an interconnected web of related concepts, and to capture the full complexity of this web they need to be represented in a way that encapsulates all the semantic and syntactic facets of the language. Further, to enable computational processing they need to be expressed in a consistent manner so that common properties, e.g. plurality, are encoded in a similar way for all words sharing that property. In this thesis, dense real-valued vector representations, i.e. word embeddings, are extended and studied for their applicability to natural language processing (NLP).
Word embeddings of two distinct flavors are presented as part of this thesis: sense-aware word representations, where different word senses are represented as distinct objects, and grounded word representations, which are learned using multi-agent deep reinforcement learning to explicitly express properties of the physical world while the agents learn to play Guess Who?. The empirical usefulness of word embeddings is evaluated by employing them in a series of NLP-related applications, namely word sense induction, word sense disambiguation, and automatic document summarisation. The results show great potential for word embeddings: they outperform previous state-of-the-art methods in two out of three applications, and achieve a statistically equivalent result in the third application while using a much simpler model than previous work.

Included papers and my contributions:

Paper I: Neural context embeddings for automatic discovery of word senses [pdf]

  • Main author.
  • Developed the main idea.
  • Wrote ~50% of the text.
  • Implemented ~50% of the experiments.

Paper II: Learning to Play Guess Who? and Inventing a Grounded Language as a Consequence [pdf]

  • Initiated the project.
  • Supervised the main author.
  • Contributed towards the manuscript (abstract, introduction, and conclusions).
  • Contributed towards the technical contribution of the paper.

Paper III: Word Sense Disambiguation using a Bidirectional LSTM [Accepted to Coling workshop]

  • Developed the main idea.
  • Wrote 90% of the text.

Paper IV: Extractive Summarization using Continuous Vector Space Models [pdf]

  • Main author.
  • Developed the main idea.
  • Wrote ~80% of the text.
  • Implemented ~50% of the experiments.

Paper V: Extractive Summarization by Aggregating Multiple Similarities [pdf]

  • Second author.
  • Multiplicative interaction between kernels.
  • Wrote ~20% of the text.
  • Implemented ~20% of the experiments (the parts relating to word embeddings).


Summarisation demo

Future Direction of Research

As the licentiate thesis, to a large extent, represents a milestone on the way to a PhD, some thoughts on current and future work that will lead up to the dissertation are presented next. The general direction being taken is towards sequences of words and emergent properties captured through the interaction between agents. At the time of writing, this translates to the following list of ongoing projects:

  • Symbolic input sequence optimization - Taking an optimization approach to the sequence to sequence decoding problem by utilizing the gradient to do optimization over a one-hot input space.
  • Grounded word embeddings of human language - Connecting the grounded embeddings described in Paper II with existing human language, to learn grounded embeddings of real words.
  • Waveform translation - Realizing that the models behind neural machine translation are independent of the underlying data, we try to connect the spectral voiceprint of the source sentence directly to the voiceprint of the target sentence. Though challenging, this approach has the potential to produce a speech-to-speech translation system far superior to approaches that are constrained by having to transcode the spoken language into text, since a lot of information gets lost in that step.


Learning to Play Guess Who? and Inventing a Grounded Language as a Consequence

Learning your first language is an incredible feat and not easily duplicated. Doing this using nothing but a few pictureless books, a corpus, would likely be impossible even for humans. As an alternative, we propose to use situated interactions between agents as a driving force for communication, and the framework of Deep Recurrent Q-Networks (DRQN) for learning a common language grounded in the provided environment. We task the agents with interactive image search in the form of the game Guess Who?. The images from the game provide a non-trivial environment for the agents to discuss and a natural grounding for the concepts they decide to encode in their communication. Our experiments show that it is possible to learn this task using DRQN and, even more importantly, that the words the agents use correspond to physical attributes present in the images that make up the agents' environment.

Word Sense Disambiguation using a Bidirectional LSTM

In this paper we present a clean, yet effective, model for word sense disambiguation. Our approach leverages a bidirectional long short-term memory network which is shared between all words. This enables the model to share statistical strength and to scale well with vocabulary size. The model is trained end-to-end, directly from the raw text to sense labels, and makes effective use of word order. We evaluate our approach on two standard datasets, using identical hyperparameter settings, which are in turn tuned on a third set of held-out data. We employ no external resources (e.g. knowledge graphs, part-of-speech tagging, etc.), language-specific features, or hand-crafted rules, but still achieve results statistically equivalent to the best state-of-the-art systems, which are subject to no such limitations.

Extractive Summarization by Aggregating Multiple Similarities

News reports, social media streams, blogs, digitized archives and books are part of a plethora of reading sources that people face every day. This raises the question of how to best generate automatic summaries. Many existing methods for extracting summaries rely on comparing the similarity of two sentences in some way. We present new ways of measuring this similarity, based on sentiment analysis and continuous vector space representations, and show that combining these with similarity measures from existing methods helps to create better summaries. The finding is demonstrated with MULTSUM, a novel summarization method that uses ideas from kernel methods to combine sentence similarity measures. Submodular optimization is then used to produce summaries that take several different similarity measures into account. Our method improves over the state-of-the-art on standard benchmark datasets; it is also fast and scales to large document collections, and the results are statistically significant.
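
As a rough illustration of the two ingredients named in the abstract above, the following Python sketch multiplies several sentence similarity matrices elementwise and then picks sentences greedily under a submodular objective. It is not the MULTSUM implementation: the function names, the toy similarity values, and the facility-location style coverage score are all assumptions chosen to keep the example small.

```python
def combine_similarities(matrices):
    """Elementwise product of several n x n sentence similarity matrices."""
    n = len(matrices[0])
    combined = [[1.0] * n for _ in range(n)]
    for matrix in matrices:
        for i in range(n):
            for j in range(n):
                combined[i][j] *= matrix[i][j]
    return combined


def coverage(sim, selected):
    """Facility-location score (an assumed objective, monotone submodular):
    each sentence is credited with its best match in the selected set."""
    if not selected:
        return 0.0
    return sum(max(row[j] for j in selected) for row in sim)


def greedy_summary(sim, budget):
    """Repeatedly add the sentence that increases the coverage score most."""
    selected = []
    candidates = set(range(len(sim)))
    while candidates and len(selected) < budget:
        best = max(sorted(candidates),
                   key=lambda j: coverage(sim, selected + [j]))
        selected.append(best)
        candidates.remove(best)
    return selected


# Hypothetical similarity scores for four sentences: 0 and 1 are about one
# topic, 2 and 3 about another. Two measures rate the same sentence pairs.
lexical = [
    [1.0, 0.8, 0.1, 0.2],
    [0.8, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.9],
    [0.2, 0.1, 0.9, 1.0],
]
embedding = [
    [1.0, 0.9, 0.2, 0.1],
    [0.9, 1.0, 0.1, 0.2],
    [0.2, 0.1, 1.0, 0.7],
    [0.1, 0.2, 0.7, 1.0],
]

combined = combine_similarities([lexical, embedding])
summary = greedy_summary(combined, budget=2)
```

On this toy input the greedy procedure selects one sentence from each of the two topic groups, since a second sentence from an already covered group adds little to the score. The multiplicative combination means a pair of sentences scores high only when every measure agrees that they are similar.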

Visions and open challenges for a knowledge-based culturomics

The concept of culturomics was born out of the availability of massive amounts of textual data and the interest in making sense of cultural and language phenomena over time. Thus far, however, culturomics has only made use of, and shown the great potential of, statistical methods. In this paper, we present a vision for a knowledge-based culturomics that complements traditional culturomics. We discuss the possibilities and challenges of combining knowledge-based methods with statistical methods and address major challenges that arise due to the nature of the data: diversity of sources, changes in language over time, and temporal dynamics of information in general. We address all layers needed for knowledge-based culturomics, from natural language processing and relations to summaries and opinions.

Extractive summarization using continuous vector space models

Automatic summarization can help users extract the most important pieces of information from the vast amount of text digitized into electronic form every day. Central to automatic summarization is the notion of similarity between sentences in text. In this paper we propose the use of continuous vector representations for semantically aware representations of sentences as a basis for measuring similarity. We evaluate different compositions for sentence representation on a standard dataset using the ROUGE evaluation measures. Our experiments show that the evaluated methods improve the performance of a state-of-the-art summarization framework and strongly indicate the benefits of continuous word vector representations for automatic summarization.
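
To make the idea of sentence similarity from continuous word vectors concrete, here is a minimal Python sketch. It assumes the simplest possible composition, summing the word vectors of a sentence, and compares sentences by cosine similarity. The vocabulary, the three-dimensional vector values, and the function names are hypothetical and are not taken from the paper.

```python
import math

# Toy 3-dimensional word vectors. The values are hypothetical, chosen only
# so that animal words and finance words point in different directions.
WORD_VECTORS = {
    "cats": [0.9, 0.1, 0.0],
    "dogs": [0.8, 0.2, 0.0],
    "purr": [0.7, 0.0, 0.3],
    "bark": [0.6, 0.1, 0.3],
    "stocks": [0.0, 0.9, 0.4],
    "fell": [0.1, 0.8, 0.5],
}


def sentence_vector(sentence):
    """Compose a sentence vector by summing the vectors of its known words."""
    dim = len(next(iter(WORD_VECTORS.values())))
    total = [0.0] * dim
    for word in sentence.lower().split():
        vector = WORD_VECTORS.get(word)
        if vector is not None:  # out-of-vocabulary words are skipped
            total = [t + v for t, v in zip(total, vector)]
    return total


def cosine_similarity(u, v):
    """Cosine of the angle between u and v; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def sentence_similarity(a, b):
    """Similarity of two sentences via their composed vectors."""
    return cosine_similarity(sentence_vector(a), sentence_vector(b))


related = sentence_similarity("cats purr", "dogs bark")
unrelated = sentence_similarity("cats purr", "stocks fell")
```

With these toy vectors, `related` comes out much higher than `unrelated` even though the two related sentences share no words, which is the property that makes such representations attractive as a similarity measure for summarization.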

Download paper


Our implementation of submodular optimization is available here and the recursive neural network used in the paper is based on the code made available by Richard Socher on his webpage.

Robust Face Recognition on Adverse 3D Data

The emerging field of high-resolution, mobile, and inexpensive depth cameras promises to revolutionize many parts of computer vision. One area in particular where 3D data has been shown to improve performance is face recognition. Using a combination of local and global pattern matching and a committee of neural networks, this thesis presents a robust 3D face recognition approach that decisively outperforms current methods. The system is evaluated on the Bosphorus database, a challenging benchmarking dataset that includes face scans with both facial expressions and partial occlusions, captured at angles of up to 90° rotation. The proposed system achieves a recognition rate of 98.9%, which is the highest recognition rate ever reported on the Bosphorus database, improving on the state of the art by 5.2%.

Download thesis