Publications

Multilingual HateCheck: Functional Tests for Multilingual Hate Speech Detection Models

July, 2022
Hate speech detection models are typically evaluated on held-out test sets. However, this risks painting an incomplete and potentially …

HATE-ITA: Hate Speech Detection in Italian Social Media Text

July, 2022
Online hate speech is a dangerous phenomenon that can (and should) be promptly counteracted properly. While Natural Language Processing …

MilaNLP at SemEval-2022 Task 5: Using Perceiver IO for Detecting Misogynous Memes with Text and Image Modalities

April, 2022
In this paper, we describe the system proposed by the MilaNLP team for the Multimedia Automatic Misogyny Identification (MAMI) …

Language Invariant Properties in Natural Language Processing

April, 2022
Meaning is context-dependent, but many properties of language (should) remain the same even if we transform the context. For example, …

XLM-EMO: Multilingual Emotion Prediction in Social Media Text

April, 2022
Detecting emotion in text allows social and computational scientists to study how people behave and react to online events. However, …

Two Contrasting Data Annotation Paradigms for Subjective NLP Tasks

April, 2022
Labelled data is the foundation of most natural language processing tasks. However, labelling data is difficult and there often are …

Pipelines for Social Bias Testing of Large Language Models

April, 2022
The maturity level of language models is now at a stage in which many companies rely on them to solve various tasks. However, while …

Measuring Harmful Sentence Completion in Language Models for LGBTQIA+ Individuals

April, 2022
Current language technology is ubiquitous and directly influences individuals' lives worldwide. Given the recent trend in AI on …

Benchmarking Post-Hoc Interpretability Approaches for Transformer-based Misogyny Detection

April, 2022
Transformer-based Natural Language Processing models have become the standard for hate speech detection. However, the unconscious use …

Fair and Argumentative Language Modeling for Computational Argumentation

April, 2022
Although much work in NLP has focused on measuring and mitigating stereotypical bias in semantic spaces, research addressing bias in …

DS-TOD: Efficient Domain Specialization for Task Oriented Dialog

April, 2022
Recent work has shown that self-supervised dialog-specific pretraining on large conversational datasets yields substantial gains over …

SAFETYKIT: First Aid for Measuring Safety in Open-domain Conversational Systems

March, 2022
The social impact of natural language processing and its applications has received increasing attention. In this position paper, we …

Entropy-based Attention Regularization Frees Unintended Bias Mitigation from Lists

March, 2022
Natural Language Processing (NLP) models risk overfitting to specific terms in the training data, thereby reducing their performance, …

Text Analysis in Python for Social Scientists – Prediction and Classification

January, 2022
Text contains a wealth of information about a wide variety of sociocultural constructs. Automated prediction methods can infer …

Pre-training is a Hot Topic: Contextualized Document Embeddings Improve Topic Coherence

August, 2021
Topic models extract groups of words from documents, whose interpretation as a topic hopefully allows for a better understanding of the …

On the Gap between Adoption and Understanding in NLP

August, 2021
There are some issues with current research trends in NLP that can hamper the free development of scientific research. We identify five …

Five sources of bias in natural language processing

August, 2021
Recently, there has been an increased interest in demographically grounded bias in natural language processing (NLP) applications. Much …

Exposing the limits of Zero-shot Cross-lingual Hate Speech Detection

August, 2021
Reducing and counter-acting hate speech on Social Media is a significant concern. Most of the proposed automatic methods are conducted …

'We will Reduce Taxes' - Identifying Election Pledges with Language Models

August, 2021
In an election campaign, political parties pledge to implement various projects, should they be elected. But do they follow …

The Importance of Modeling Social Factors of Language: Theory and Practice

June, 2021
Natural language processing (NLP) applications are now more powerful and ubiquitous than ever before. With rapidly developing (neural) …

HONEST: Measuring Hurtful Sentence Completion in Language Models

June, 2021
Language models have revolutionized the field of NLP. However, language models capture and proliferate hurtful stereotypes, especially …

Language in a (Search) Box: Grounding Language Learning in Real-World Human-Machine Interaction

June, 2021
We investigate grounded language learning through real-world data, by modelling teacher-learner dynamics through the natural interactions occurring between users and search engines.

MilaNLP @ WASSA: Does BERT Feel Sad When You Cry?

May, 2021
The paper describes the MilaNLP team’s submission (Bocconi University, Milan) in the WASSA 2021 Shared Task on Empathy Detection and …

FEEL-IT: Emotion and Sentiment Classification for the Italian Language

May, 2021
Sentiment analysis is a common task to understand people’s reactions online. Still, we often need more nuanced information: is …

Beyond Black & White: Leveraging Annotator Disagreement via Soft-Label Multi-Task Learning

May, 2021
Supervised learning assumes that a ground truth label exists. However, the reliability of this ground truth depends on human …

Universal Joy: A Data Set and Results for Classifying Emotions Across Languages

April, 2021
While emotions are universal aspects of human psychology, they are expressed differently across different languages and cultures. We …

BERTective: Language Models and Contextual Information for Deception Detection

April, 2021
Spotting a lie is challenging but has an enormous potential impact on security as well as private and public safety. Several NLP …

Cross-lingual Contextualized Topic Models with Zero-shot Learning

March, 2021
We introduce a novel topic modeling method that can make use of contextualized embeddings (e.g., BERT) to do zero-shot cross-lingual topic modeling.

Text Analysis in Python for Social Scientists – Discovery and Exploration

December, 2020
Text is everywhere, and it is a fantastic resource for social scientists. However, because it is so abundant, and because language is …

“You Sound Just Like Your Father” Commercial Machine Translation Systems Include Stylistic Biases

July, 2020
The main goal of machine translation has been to convey the correct content. Stylistic considerations have been at best secondary. We …

Predictive Biases in Natural Language Processing Models: A Conceptual Framework and Overview

July, 2020
An increasing number of natural language processing papers address the effect of bias on predictions, introducing mitigation techniques …

Visualizing Regional Language Variation Across Europe on Twitter

March, 2020
Geotagged Twitter data allows us to investigate correlations of geographic language variation, both at an interlingual and intralingual …

What the [MASK]? Making Sense of Language-Specific BERT Models

March, 2020
Recently, Natural Language Processing (NLP) has witnessed impressive progress in many areas, due to the advent of novel, pretrained …

Helpful or Hierarchical? Predicting the Communicative Strategies of Chat Participants, and their Impact on Success

March, 2020
When interacting with each other, we motivate, advise, inform, show love or power towards our peers. However, the way we interact may …

Fake opinion detection: how similar are crowdsourced datasets to real data?

January, 2020
Identifying deceptive online reviews is a challenging task for Natural Language Processing (NLP). Collecting corpora for the task is …

A Case for Soft Loss Functions

January, 2020
Recently, Peterson et al. provided evidence of the benefits of using probabilistic soft labels generated from crowd annotations for …

Identifying Linguistic Areas for Geolocation

November, 2019
Geolocating social media posts relies on the assumption that language carries sufficient geographic information. However, locations are …

Hey Siri. Ok Google. Alexa: A topic modeling of user reviews for smart speakers

November, 2019
User reviews provide a significant source of information for companies to understand their market and audience. In order to discover …

Geolocation with Attention-Based Multitask Learning Models

November, 2019
Geolocation, predicting the location of a post based on text and other information, has a huge potential for several social media …

Dense Node Representation for Geolocation

November, 2019
Prior research has shown that geolocation can be substantially improved by including user network information. While effective, it …

Women’s Syntactic Resilience and Men’s Grammatical Luck: Gender-Bias in Part-of-Speech Tagging and Dependency Parsing

July, 2019
Several linguistic studies have shown the prevalence of various lexical and grammatical patterns in texts authored by a person of a …

Peer networks and entrepreneurship: A Pan-African RCT

January, 2019
Can large-scale peer interaction foster entrepreneurship and innovation? We conducted an RCT involving almost 5,000 entrepreneurs from …

Predicting News Headline Popularity with Syntactic and Semantic Knowledge Using Multi-Task Learning

October, 2018
Newspapers need to attract readers with headlines, anticipating their readers’ preferences. These preferences rely on topical, …

Comparing Bayesian Models of Annotation

October, 2018
The analysis of crowdsourced annotations in natural language processing is concerned with identifying (1) gold standard labels, (2) …

Capturing Regional Variation with Distributed Place Representations and Geographic Retrofitting

October, 2018
Dialects are one of the main drivers of language variation, a major challenge for natural language processing tools. In most languages, …

The Social and the Neural Network: How to Make Natural Language Processing about People again

June, 2018
Over the years, natural language processing has increasingly focused on tasks that can be solved by statistical models, but ignored the …