Representation learning for Conversational AI

"Representation learning for Conversational AI" is a newly funded collaboration project between KTH and Chalmers. The project is financed by the Swedish AI-program WASP (Wallenberg AI, Autonomous Systems and Software Program), which offers a graduate school with research visits, partner universities, and visiting lecturers.

We are now seeking two PhD students, one of whom will be employed at KTH (supervised by Prof. Gabriel Skantze) and one at Chalmers (supervised by Senior Lecturer Richard Johansson).

Follow this link to read more and apply for the position at KTH.

Follow this link to read more and apply for the position at Chalmers.

Project description

Being able to communicate with machines through spoken conversation, in the same way we naturally communicate with each other, has been a long-standing vision in both science fiction and research labs, and it has been considered a hallmark of human intelligence. In recent years, so-called Conversational AI has started to become a reality. For example, smart speakers and voice assistants have broken all previous benchmarks for fast market penetration of new technology. In the near future, social robots that can interact with people through speech are expected to appear in receptions, schools, manufacturing industries, and people’s homes.

Conversational systems rely on many different components, such as natural language understanding and generation (NLU/NLG), speech recognition (ASR), and speech synthesis (TTS). In the case of social robotics, computer vision (e.g., face tracking) is also important. Thanks to the last decade’s breakthroughs in deep learning, all these areas have seen rapid improvements. However, when it comes to the “core” conversational ability of the system, i.e., the ability to manage the flow of the interaction and generate a meaningful response given a certain conversational context, current solutions are clearly limited. Although generic transformer-based language models, such as GPT-3 (Brown et al., 2020), and chatbots based on such models ‒ such as Google’s Meena (Adiwardana et al., 2020) and Facebook’s Blender (Roller et al., 2020) ‒ have gained significant attention from the media, they have so far only been able to generate superficial “small talk” without a clear purpose.

In order to generate and understand purposeful dialog ‒ such as teaching children, helping elderly people to plan their day, or communicating with humans on how to solve a collaborative task ‒ the system needs to involve some kind of external knowledge source beyond the language model itself. Thus, such systems must be defined in a more modular fashion. The various modules in such systems (NLU, interaction management, action selection, etc.) must then be trained using task-specific data. At the same time, such data is typically only available in very limited quantities. This problem can be addressed using representation learning, where generic models are trained in a self-supervised fashion on larger quantities of data, and then fine-tuned towards the specific domain. When it comes to representation learning for text processing, great progress has been made in the form of language models, such as BERT (Devlin et al., 2018). However, it is not clear how well these models can be applied to the problem of conversational systems, since they are primarily trained on written monologue (such as Wikipedia), and not spoken dialog. There are some fundamental differences between language used in written text and spoken interaction:

The overall aim of this project is to investigate how general representations of spoken conversation can be learned in a self-supervised fashion. These representations can be thought of as encoding the current dialog state. With a rich enough representation of the dialog state, the conversational agent should be able to learn domain-specific conversational skills from far fewer examples. This could open up the path towards, for example, imitation learning in conversational settings.


We will divide the work into four main tasks or work packages.

Task 1: Data consolidation

Since the prediction models developed in the project will be learned directly from human-human interaction data, it is important that we gather and consolidate a wide range of different datasets to work with. We will primarily use generally available corpora representing various properties (Serban et al., 2018), including written, spoken and multi-modal (face-to-face) interaction, and different interaction styles (e.g., unrestricted social interaction, interviews, task-oriented problem-solving). To limit the scope of the project, we will mainly focus on datasets in English and Swedish. One problem we will have to address is the scarcity of spoken dialog data. An interesting recent spoken dialog dataset that we will explore is the Spotify Podcast Dataset, with 50,000 hours of spoken dialog data (Clifton et al., 2020). However, we will have to curate the dataset (possibly also applying better diarization than the one currently used), and develop methods for selecting subsets of the dataset that reflect the different interaction styles. We will also collaborate with Språkbanken Tal at KTH ‒ a center working on the collection of spoken language data.

Task 2: Representation learning

A central task in the project will be to define the right type of pre-training tasks for conversational data. It is important that we take into account the scarcity of spoken dialog data compared to the massive amounts of written language corpora available. Thus, our plan is to start by training transformer-based language models on written text, with training objectives similar to those of standard models such as BERT and GPT. After this, we will use written dialog data (which exists in larger quantities than spoken dialog data) and define new training objectives that are relevant for dialog (as listed in the introduction). For example, whereas contrastive learning through next sentence prediction might be relevant for BERT, the coherence of dialog likely needs to consider at least the two preceding turns, and the turns also need to be complemented with speaker embeddings. Another relevant training objective could be to predict speech acts. After this, we will continue to train the models using spoken dialog data with similar and additional objectives. Here, we will also include the prosody (and potentially other modalities) of the speakers and include training objectives such as predicting the voice activity of the two speakers in a future time window (cf. Skantze, 2017), or Contrastive Predictive Coding, which has so far only been applied to monologue speech (Baevski et al., 2020). Training language representation models is computationally demanding, and we will rely on the recent WASP investments in compute infrastructure such as the Alvis and Berzelius clusters.
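As an illustration of the dialog coherence objective mentioned above, the following minimal sketch shows one way training examples could be constructed: each example pairs the two preceding turns (with speaker labels) with either the true next turn or a distractor turn taken from elsewhere. The function name make_coherence_examples, the dictionary layout, and the negative sampling strategy are assumptions made for this example; they are not the project's actual design.

```python
import random
from typing import Dict, List, Tuple

Turn = Tuple[str, str]  # (speaker label, utterance text)


def make_coherence_examples(
    dialog: List[Turn],
    distractors: List[Turn],
    context_size: int = 2,
    seed: int = 0,
) -> List[Dict]:
    """Build positive/negative next-turn examples (hypothetical format).

    For every position with at least `context_size` preceding turns, emit
    one positive example (the true next turn, label 1) and one negative
    example (a randomly chosen distractor utterance, label 0).
    """
    if not distractors:
        raise ValueError("at least one distractor turn is required")
    rng = random.Random(seed)  # fixed seed keeps the sampling reproducible
    examples: List[Dict] = []
    for i in range(context_size, len(dialog)):
        context = dialog[i - context_size:i]
        examples.append({"context": context, "candidate": dialog[i], "label": 1})
        # Keep the true speaker label on the negative candidate so that the
        # two classes cannot be separated by speaker identity alone.
        speaker = dialog[i][0]
        _, text = rng.choice(distractors)
        examples.append({"context": context, "candidate": (speaker, text), "label": 0})
    return examples


dialog = [
    ("A", "Hi, how can I help you?"),
    ("B", "I would like to plan my day."),
    ("A", "Sure, what do you need to get done?"),
    ("B", "I have a doctor's appointment at ten."),
]
distractors = [("A", "The train leaves at noon."), ("B", "I prefer tea.")]

examples = make_coherence_examples(dialog, distractors)
print(len(examples))  # 4: one positive and one negative for each of two positions
```

In a real setup the speaker labels would be mapped to learned speaker embeddings and the text would be tokenized before being passed to the model; both steps are left out here.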

Task 3: Model analysis

NLP models based on large-scale deep learning are powerful but opaque, and in recent years a wide variety of approaches have been proposed to explain their predictions or to give an intuitive understanding of their inner mechanisms. Well-known methods of this kind include probing methods that test for the presence of linguistic properties in the latent representations of neural models, saliency methods that try to determine which parts of the input or the intermediate representations contributed the most to a model’s prediction, and benchmarks designed to measure various capabilities of a model (Belinkov and Glass, 2019). However, previous research on analysis methods for neural NLP models has focused exclusively on static written language. To understand the dynamic behavior of models designed for spoken interaction, the project will design a suite of new analysis methods that target the relevant aspects of such models.

Task 4: Evaluation on downstream tasks

The usefulness of the learned representations will be evaluated on downstream tasks which are relevant for conversational systems, and where domain-specific few-shot learning is relevant. One example of this is Natural Language Understanding (NLU), in the form of intent classification (Madureira & Schlangen, 2020), which typically relies on pre-trained language models such as BERT, but where we expect our conversational representations to lead to better performance, or to require fewer training examples in few-shot learning settings. Another example is turn-taking prediction, i.e., the prediction of where it is suitable for a conversational agent to take the turn (Skantze, 2021). Yet another example is action selection, where the representation learning could be used for either supervised imitation learning or reinforcement learning (Li et al., 2019). These models will be partly evaluated offline on benchmark datasets, but also tested in conversational systems interacting with users. Large-scale evaluations will be performed using crowdsourcing.
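To make the few-shot setting concrete, the sketch below shows one simple way a frozen representation could be probed for intent classification: each intent is represented by the centroid of a handful of labeled example vectors, and a new utterance is assigned to the intent with the most similar centroid. This nearest-centroid approach, the function names, and the toy two-dimensional vectors are all assumptions chosen for illustration; the project description does not specify an evaluation method at this level of detail.

```python
import math
from typing import Dict, List, Sequence


def centroid(vectors: List[Sequence[float]]) -> List[float]:
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def fit_prototypes(support: Dict[str, List[Sequence[float]]]) -> Dict[str, List[float]]:
    """One prototype (centroid) per intent from a few labeled examples."""
    return {intent: centroid(vecs) for intent, vecs in support.items()}


def classify(vector: Sequence[float], prototypes: Dict[str, List[float]]) -> str:
    """Return the intent whose prototype is most similar to the vector."""
    return max(prototypes, key=lambda intent: cosine(vector, prototypes[intent]))


# Toy stand-ins for utterance representations (hypothetical values; a real
# evaluation would use vectors produced by the pre-trained model).
support = {
    "book_flight": [[1.0, 0.1], [0.9, 0.2]],
    "play_music": [[0.1, 1.0], [0.2, 0.8]],
}
prototypes = fit_prototypes(support)
print(classify([0.8, 0.3], prototypes))  # book_flight
```

Under this kind of probe, a better representation is one that reaches a given accuracy with fewer support examples per intent.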


References

Adiwardana, D., Luong, M.-T., So, D. R., Hall, J., Fiedel, N., Thoppilan, R., Yang, Z., Kulshreshtha, A., Nemade, G., Lu, Y., et al. (2020). Towards a Human-like Open-Domain Chatbot. ArXiv 2001.09977.

Baevski, A., Zhou, H., Mohamed, A., & Auli, M. (2020). wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations. ArXiv 2006.11477.

Belinkov, Y. & Glass, J. (2019). Analysis methods in neural language processing: A survey. Transactions of the Association for Computational Linguistics, 7:49–72.

Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., et al. (2020). Language models are few-shot learners. ArXiv 2005.14165.

Clifton, A., Reddy, S., Yu, Y., Pappu, A., Rezapour, R., Bonab, H., Eskevich, M., Jones, G., et al. (2020). 100,000 Podcasts: A Spoken English Document Corpus. In Proceedings of COLING 2020.

Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. ArXiv 1810.04805.

Ekstedt, E., & Skantze, G. (2020). TurnGPT: a Transformer-based Language Model for Predicting Turn-taking in Spoken Dialog. Findings of the Association for Computational Linguistics: EMNLP 2020, 2981–2990.

Hagström, L. & Johansson, R. (2021). Knowledge Distillation for Swedish NER models: A Search for Performance and Efficiency. To appear in Proceedings of NoDaLiDa 2021.

Li, Z., Kiseleva, J., & de Rijke, M. (2019). Dialogue Generation: From Imitation Learning to Inverse Reinforcement Learning. Proceedings of AAAI, 33(01), 6722–6729.

Madureira, B., & Schlangen, D. (2020). Incremental processing in the age of non-incremental encoders: An empirical assessment of bidirectional models for incremental NLU. In Proceedings of EMNLP, 357–374.

Norlund, T. & Stenbom, A. (2021). Building a Swedish Open-Domain Conversational Language Model. To appear in Proceedings of NoDaLiDa 2021.

Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., et al. (2021). Learning Transferable Visual Models From Natural Language Supervision. ArXiv 2103.00020.

Roddy, M., Skantze, G., & Harte, N. (2018). Multimodal Continuous Turn-Taking Prediction Using Multiscale RNNs. Proceedings of ICMI, 186–190.

Roller, S., Dinan, E., Goyal, N., Ju, D., Williamson, M., Liu, Y., Xu, J., Ott, M., Shuster, K., Smith, E. M., Boureau, Y. L., & Weston, J. (2020). Recipes for building an open-domain chatbot. ArXiv 2004.13637.

Serban, I. V., Lowe, R., Henderson, P., Charlin, L., & Pineau, J. (2018). A survey of available corpora for building data-driven dialogue systems: The journal version. Dialogue and Discourse, 9(1), 1–49.

Skantze, G. (2017). Towards a General, Continuous Model of Turn-taking in Spoken Dialogue using LSTM Recurrent Neural Networks. In Proceedings of SIGDIAL. Saarbrücken, Germany.

Skantze, G. (2021). Turn-taking in Conversational Systems and Human-Robot Interaction: A Review. Computer Speech & Language, 67.