2025-150 Scalable Spontaneous Speech Dataset (SSSD)
Abstract
The Scalable Spontaneous Speech Dataset (SSSD) consists of 727 hours of spontaneous English conversations between randomly paired, anonymous participants located in the United States. The conversations cover a variety of topics and were conducted on the Amazon Mechanical Turk crowdsourcing platform. The dataset enables the training of expressive dialogue models with semantically coherent, on-topic turn-taking. A major benefit of this research is that its approach to collecting casual, expressive speech scales to the majority of languages, including those without the large textual resources needed for text-based language model (LM) training.
The dataset opens up the possibility of scalable data collection for building models of casual, expressive, conversational speech in virtually all languages. It addresses a gap left by existing large-scale datasets, which are limited to transcribed data and formal speech. Incorporating data that directly captures the casual and expressive aspects of conversation should improve the authenticity of dialogue models. It also extends coverage to languages that lack the large textual resources many conventional datasets depend on for model training.
Benefit
Most current speech datasets rely on scripted, formal, or read-aloud content and fail to capture the variability and dynamic, expressive nature of genuine human conversation. The Scalable Spontaneous Speech Dataset (SSSD) fills this gap by enabling large-scale collection of casual, spontaneous, emotionally nuanced speech data across diverse languages and dialects via a mobile-first protocol. Developed through a collaboration between Carnegie Mellon University and Meta Inc., this technology lets researchers and developers train expressive, textless AI models that better understand tone, turn-taking, and informal dialogue. Its mobile application-based architecture also allows the platform to scale globally, including to regions with limited language resources.
Market Application
Conversational AI & Assistants
Emotionally Aware Systems
Multilingual Voice Interfaces
Accessibility & Inclusion Tools
Low-Resource and Endangered Language Preservation
Accessibility Tools for Speech Impairments
Social Robotics
Real-Time Human-to-AI Interaction Systems
Speech-centric Educational Platforms
Other Information
Please note:
- You must be 18 years of age or older to request this dataset.
- To access the dataset, you will need to fill out a request form with your name, address, and institutional affiliation.
- Requests are usually processed within 3 business days, but processing may take longer if the form is incomplete or additional screening is needed. If you do not receive a response within 3 or 4 business days, you may contact CMU CTTEC by clicking the Contact button.
- The total dataset is 132 Gigabits in size. Depending on your connection, the download may take hours. Please make sure your device has enough free space before you start.
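For planning the download, the size note above can be turned into a rough estimate. The following is a minimal sketch, assuming the stated figure of 132 gigabits uses decimal units (1 gigabit = 1000 megabits) and ignoring protocol overhead and server-side limits. The function name `download_estimate` and the link speeds are hypothetical, chosen only for illustration.

```python
def download_estimate(size_gigabits: float, link_mbps: float) -> tuple[float, float]:
    """Return (size in gigabytes, transfer time in hours) for a sustained link speed."""
    if link_mbps <= 0:
        raise ValueError("link_mbps must be positive")
    size_gigabytes = size_gigabits / 8          # 8 bits per byte
    seconds = size_gigabits * 1000 / link_mbps  # decimal units: 1 gigabit = 1000 megabits
    return size_gigabytes, seconds / 3600


if __name__ == "__main__":
    # Hypothetical sustained link speeds in megabits per second.
    for mbps in (10, 50, 100):
        gigabytes, hours = download_estimate(132, mbps)
        print(f"{mbps:>4} Mbit/s: {gigabytes:.1f} GB of disk space, about {hours:.1f} h")
```

Under these assumptions, a slow connection puts the transfer in the range of hours, which matches the note above. Actual times will vary with real throughput.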