GENIE project: Analysis of gender bias

Analysis of gender bias in texts from Chalmers and other Swedish texts

This is a short project financed by GENIE (Chalmers Gender Initiative for Excellence), running in 2020–2021 and led by Peter Ljunglöf. In the project we will use state-of-the-art natural language processing (NLP) technologies to investigate gender bias in different text genres, both formal and informal. This will be done for both English and Swedish texts, so that we can compare the results with previous research, which has been done almost exclusively on English text.

Project description

The past few years have seen an increasing interest in gendered language and the effects it can have, especially in job recruitment. This can be exemplified by searching the web for "gender neutral language in recruitment": the topmost search results are articles from recruitment companies or newspapers, giving tips on how to write better job ads. However, most research has been done in an English-speaking context (primarily American), and it is not clear which results can be transferred to the Swedish language and society.

It is not trivial to use AI techniques to decrease gender bias. A well-known example of the difficulties is Amazon, which started using neural networks to rank job applications with the intention of decreasing discrimination. Instead, the algorithms learned to downgrade female applicants and female-coded applications. Ultimately, Amazon decided to withdraw its AI system because of its inherent gender bias. (See, e.g., Reuters 2018 for more information.)

In this project we will use state-of-the-art natural language processing (NLP) technologies to investigate gender bias in different text genres, both formal and informal. This will be done for both English and Swedish texts, so that we can compare the results with previous research, which has been done almost exclusively on English text.

Previous work

There has been some research on unconscious gender bias in texts, mainly for English and in the context of job recruiting. Most of this research tries to find words that are coded masculine or feminine, and then uses these word lists to analyse texts such as job ads (Gaucher et al 2011). Some research focuses on the pronouns he/she (Sendén et al 2014; Twenge et al 2012).
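The word-list approach can be illustrated with a minimal sketch. The stems below are illustrative placeholders chosen for this example, not the published lists from Gaucher et al (2011), and the function name count_coded_words is hypothetical. The sketch only counts tokens that begin with a coded stem; a real analysis would use validated lists and normalise for text length.

```python
import re
from collections import Counter

# Illustrative placeholder stems (assumption): NOT the published word lists.
MASCULINE_STEMS = ("compet", "domina", "lead", "ambitio")
FEMININE_STEMS = ("support", "collab", "understand", "interperson")


def count_coded_words(text):
    """Count tokens that start with a masculine- or feminine-coded stem."""
    # Lowercase and keep alphabetic tokens (including Swedish å, ä, ö).
    tokens = re.findall(r"[a-zåäö]+", text.lower())
    counts = Counter()
    for token in tokens:
        # str.startswith accepts a tuple of prefixes.
        if token.startswith(MASCULINE_STEMS):
            counts["masculine"] += 1
        elif token.startswith(FEMININE_STEMS):
            counts["feminine"] += 1
    counts["total"] = len(tokens)
    return counts


ad = "We seek a competitive leader who can support and collaborate with the team."
result = count_coded_words(ad)
print(result["masculine"], result["feminine"], result["total"])
```

Matching on stems rather than full words lets one entry cover several inflected forms, at the cost of occasional false matches.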

Most data-driven research has used traditional statistical techniques, such as counting the number of occurrences of a word in different kinds of texts, or how often two words co-occur in texts. Related research using more advanced NLP techniques, such as word embeddings or language modelling, includes author gender identification (Cheng 2011), gender-neutral language models (Kaneko & Bollegala 2019; Zhao et al 2019), diachronic analyses (Garg et al 2018; Moricz 2019), coreference resolution (Zhao et al 2018), and machine translation (Vanmassenhove 2018). As can be seen from the references, research interest has exploded in the last few years, and this year even saw the first Workshop on Gender Bias in Natural Language Processing.
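As a rough illustration of the co-occurrence counting mentioned above, the following sketch counts how often a target word appears in the same sentence as the pronouns "he" or "she". The example sentences, the target words and the function name pronoun_cooccurrence are made up for this illustration; it is not the method of any of the cited studies.

```python
import re


def pronoun_cooccurrence(sentences, targets, pronouns=("he", "she")):
    """For each target word, count sentences where it co-occurs with each pronoun."""
    counts = {t: {p: 0 for p in pronouns} for t in targets}
    for sentence in sentences:
        # Use a set so each word is counted at most once per sentence.
        tokens = set(re.findall(r"\w+", sentence.lower()))
        for t in targets:
            if t in tokens:
                for p in pronouns:
                    if p in tokens:
                        counts[t][p] += 1
    return counts


# Made-up example sentences (assumption), only to show the mechanics.
sentences = [
    "She is an engineer and he is a nurse.",
    "He said the engineer was late.",
    "The nurse said she would call.",
]
table = pronoun_cooccurrence(sentences, ["engineer", "nurse"])
print(table)
```

On a real corpus the raw counts would typically be turned into an association measure, since frequent pronouns co-occur with almost everything.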

In a Swedish context, there has been almost no research at all using data science or NLP to analyse gendered language. Apart from the occasional exception (such as Moricz 2019, who uses an LSTM neural network to analyse the effects of the #metoo campaign on tweets), most Swedish research has been psychological or linguistic, using surveys, interviews and psycho-linguistic analyses to study attitudes towards gendered and gender-neutral language (see, e.g., the research project "Gender Fair Language"). There is an ongoing discussion in Sweden about gender-neutral language, which can be seen by a simple web search for terms such as "jämlikt språk", "jämställt språk", or "könsneutralt språk", but these discussions do not have much Swedish data on which to base their arguments.

Research questions

Our main research question is whether different words, phrases or text styles are gendered, and if so, in what way. This is a very general question, and here are some more specific questions that we will look into in this project.

Investigations on Swedish texts:

Investigations on texts published by Chalmers:

Comparison between similar Swedish and English texts:

Data for training and evaluation

For the Swedish data we will use existing text corpora from Språkbanken at the University of Gothenburg (GU), which has large text collections from all time periods and genres, including journalistic texts, governmental texts, student essays, novels, social media, online chats, and many more. In total Språkbanken has collected around 13 billion words, which makes it one of the largest non-English text collections in the world.

In addition to the material from Språkbanken, we will collect official and semi-official documents from Chalmers. From these we will create a new corpus resource, which will be categorised in different ways, such as by publication date, document type (e.g., education, research, administrative, student, informal), and department or educational program.

Some of Språkbanken’s corpora are manually annotated for linguistic features such as lemma, word sense, part of speech, morphology, and syntactic structure. The rest of the corpora, comprising the vast majority of data, are automatically annotated by Språkbanken’s annotation pipeline. We will do the same for the texts we collect in the project, so that all the data we work on will be annotated for linguistic features.

For the English data we will use openly available corpora for different genres, so that we can do comparisons with our Swedish experiments and with previous research on gender bias in English text. Since many of the documents that are produced at Chalmers are written in English, the corpus that is created in the project will be bilingual Swedish-English.

Technologies and tools

We will use state-of-the-art NLP technologies such as text similarity measures, sentiment analysis, language models (probabilistic and neural network-based), and distributional models (e.g., word and sense embeddings). Whenever possible we will make use of manually annotated reference data; in other cases we will use unsupervised methods for analysing texts.
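One common way to probe distributional models for gender associations is to compare a word's vector with a "gender direction", here taken as the difference between the vectors for "han" (he) and "hon" (she). The sketch below is only an illustration under stated assumptions: the three-dimensional vectors are made up, the function name gender_score is hypothetical, and the project description does not say that this particular measure will be used.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


# Made-up toy vectors (assumption); real embeddings have hundreds of dimensions.
toy_vectors = {
    "han": [1.0, 0.0, 0.2],             # Swedish "he"
    "hon": [-1.0, 0.0, 0.2],            # Swedish "she"
    "ingenjör": [0.6, 0.8, 0.1],        # "engineer"
    "sjuksköterska": [-0.5, 0.8, 0.1],  # "nurse"
}


def gender_score(word, vectors, male="han", female="hon"):
    """Cosine between a word vector and the male-minus-female direction.

    Positive values lean towards the male pronoun, negative towards the female.
    """
    direction = [m - f for m, f in zip(vectors[male], vectors[female])]
    return cosine(vectors[word], direction)


for word in ("ingenjör", "sjuksköterska"):
    print(word, round(gender_score(word, toy_vectors), 3))
```

With embeddings trained on a real corpus, the same score could be computed for every word in a job ad to see which words lean in which direction.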

Relation between gender equality and scientific excellence

The research that we plan to conduct in the project is novel because very little data-oriented research has been done on gendered language in Swedish texts. This pilot project will give new insights into genderedness in Swedish, as well as comparisons between Swedish and English texts. The results can benefit society by helping people better understand how language can influence perceptions of gender, and how different traits are perceived as connected to a specific gender.

In addition, the project will result in analyses of texts that are produced at Chalmers, such as web texts, job advertisements, and student recruitment material, which will hopefully give valuable insights. For example, the results could be used to formulate guidelines on how to write and analyse texts at Chalmers, e.g., when writing job advertisements and recruiting students.

Because there has been so little research in this area, we hope that this project will open the way to a larger scientific collaboration nationally and internationally, and that we will be able to apply for further funding in this interesting and important scientific area.

References