TOLD

TOLD – Thinking Out Loud: A Speech-Based Data Collection Framework
Funded by: Tech Europe Foundation (TEF)

This project is a collaboration with Dr. Giuseppe Attanasio.

Recent advances in language modeling show that the quality of training data matters more than its quantity. Yet collecting meaningful and representative language data remains costly and slow because annotation still relies almost entirely on written text. Voice-based feedback offers a powerful alternative: it elicits richer and more natural descriptions, reflects personal experiences and subjective perspectives, and conveys paralinguistic cues such as prosody and timing that written text cannot capture. Despite this potential, voice is still largely underused in annotation.

TOLD aims to show that a voice-based annotation paradigm can outperform traditional written feedback in NLP. The project's hypothesis is that speaking rather than typing produces more informative and expressive data, enables faster and more efficient collection, and can lead to models that learn more effectively from the resulting annotations.

By shifting data collection from text to voice, TOLD introduces a new way of capturing how people think, react, and interpret information. The project seeks to establish voice as a natural, scalable, and more powerful medium for annotating language data.