Event in Puno about the creation of a speech dataset for Quechua communities.
Giving a Voice to Quechua: Breaking Digital Barriers
Breaking Digital Barriers for Under-Resourced Languages (25.01.2026)
Funding year 2024 / Scholarship Call #19 / Project ID: 7335 / Project: Building Knowledge Graphs for Under Resourced Languages

Empowering Quechua Languages Through Open Speech Data

The digital divide has long been a linguistic one. While voice assistants and speech recognition technologies have become an integral part of modern life, they have primarily served a small handful of global languages such as Spanish, English, and Russian. Under-resourced languages, particularly indigenous ones like the Quechua family, face severe data scarcity that hinders their development in the AI era.

Breaking the Data Barrier

Quechua isn't just one language; it is a rich family of over 40 varieties. In our project, we didn't want to settle for a "one-size-fits-all" approach that ignores regional diversity. Therefore, we have supported the integration of various Quechua languages into Common Voice.

Common Voice (https://commonvoice.mozilla.org/) grew from 1,368 hours across 19 languages in its first version to 33,815 hours across 137 languages by version 22. For the Quechua languages, it serves as a vital platform, hosting 191.1 hours of Quechua speech data. Some details:

  • Languages Integrated: Common Voice now includes 17 distinct Quechua languages, each with its own ISO code to ensure linguistic accuracy.

  • Community Contribution: Contributions across the Quechua languages have resulted in a total of 191.1 hours of recorded Quechua speech.

  • Quality Validation: The Quechua datasets hosted on Common Voice have reached a community validation rate of 86%.
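To put the figures above in proportion, here is a minimal Python sketch that derives a few ratios from them. It uses only the numbers quoted in this post; the function name `summarize` is made up for illustration, and the per-language average assumes the 191.1 hours are spread over the 17 integrated languages, which hides how uneven the real distribution may be.

```python
# Figures quoted in the text above.
CV_FIRST_HOURS = 1_368     # Common Voice, first version
CV_V22_HOURS = 33_815      # Common Voice, version 22
QUECHUA_HOURS = 191.1      # total recorded Quechua speech
QUECHUA_LANGUAGES = 17     # Quechua languages integrated


def summarize() -> dict[str, float]:
    """Derive a few simple ratios from the quoted figures."""
    return {
        # How many times larger version 22 is than the first version.
        "growth_factor": CV_V22_HOURS / CV_FIRST_HOURS,
        # Quechua hours as a percentage of all hours in version 22.
        "quechua_share_pct": 100 * QUECHUA_HOURS / CV_V22_HOURS,
        # ASSUMPTION: a plain average; actual hours per language will differ.
        "avg_hours_per_language": QUECHUA_HOURS / QUECHUA_LANGUAGES,
    }


if __name__ == "__main__":
    for name, value in summarize().items():
        print(f"{name}: {value:.2f}")
```

In other words, Common Voice grew roughly 25-fold overall, while Quechua speech still accounts for well under 1% of the hosted hours, about 11 hours per language on average.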

Why This Matters

For under-resourced languages, the lack of linguistic resources is a well-known challenge. Most speech technology requires massive datasets of recorded voices paired with text before natural language processing applications can be built. By using a collaborative framework, Common Voice allows anyone with an internet connection to contribute sentences, record their voice, or validate others' recordings. This decentralized approach bypasses the need for expensive, proprietary data collection.

A New Research Agenda

We propose a rigorous research agenda to ensure these datasets are high-quality and culturally relevant:

  • Orthographic Standards: Ensuring that collected sentences adhere to established spelling and writing standards.
  • Diverse Corpora: Building text datasets across 11 thematic domains (like Healthcare and Finance) to mitigate bias in AI models.
  • Hybrid Collection Models: Overcoming barriers like expensive internet and limited digital literacy by combining online tools with offline recording campaigns.
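The first two agenda items lend themselves to simple automated checks. The Python sketch below is only an illustration under stated assumptions, not the project's actual tooling: the character set, the word limit, and the function names `check_sentence` and `domain_gaps` are all hypothetical, and only the two domains named above are used because the full list of 11 is not given here.

```python
from collections import Counter

# ASSUMPTION: an illustrative three-vowel character set, not the project's
# actual orthographic rule. Loanwords and proper names may need more letters.
ALLOWED_LETTERS = set("aiuchklmnñpqrstwy")
ALLOWED_OTHER = set(" '.,?!")
MAX_WORDS = 14  # hypothetical length limit for a recordable sentence


def check_sentence(sentence: str) -> list[str]:
    """Return the problems found in one candidate sentence (empty if none)."""
    text = sentence.strip().lower()
    if not text:
        return ["empty sentence"]
    problems = []
    if len(text.split()) > MAX_WORDS:
        problems.append("too many words")
    unexpected = sorted(
        {ch for ch in text if ch not in ALLOWED_LETTERS and ch not in ALLOWED_OTHER}
    )
    if unexpected:
        problems.append("unexpected characters: " + "".join(unexpected))
    return problems


def domain_gaps(
    labelled_sentences: list[tuple[str, str]],
    domains: list[str],
    min_per_domain: int,
) -> dict[str, int]:
    """Return the domains that have fewer than min_per_domain sentences."""
    counts = Counter(domain for _, domain in labelled_sentences)
    return {d: counts[d] for d in domains if counts[d] < min_per_domain}


if __name__ == "__main__":
    corpus = [("Imaynalla kashanki?", "Healthcare")]
    for sentence, _ in corpus:
        print(sentence, check_sentence(sentence))
    print(domain_gaps(corpus, ["Healthcare", "Finance"], min_per_domain=1))
```

A check like this can only flag candidates for human review. Deciding what counts as correct spelling remains a matter for speakers and for the established standard of each variety.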

This work is more than a technical achievement; it is a step toward digital empowerment and the preservation of indigenous traditions in a world that increasingly relies on voice and natural language interaction.

Huaman, E., Huaman, W., Huaman, J. L., & Quispe, N. (2025). Quechua speech datasets in Common Voice: The case of Puno Quechua (arXiv:2510.13871v1). arXiv. https://arxiv.org/pdf/2510.13871

Tags: Speech datasets, Under-Resourced Languages, Quechua, Minoritized Languages, digital divide

Elwin Huaman

Elwin Huaman is a Researcher, Project Manager, and an Activist for Under-Resourced Languages. He has experience creating Knowledge Graphs and applying Semantic Web technologies in academia and industry. He has co-authored the book: Knowledge Graphs - Methodology, Tools and Selected Use Cases, and leads the QICHWABASE Knowledge Graph that supports a harmonization process of the language and knowledge of Quechua communities across the world.

Skills: Knowledge Graphs, Natural Language Processing, Machine Learning, Databases, Project Management, Data Sovereignty and Consent, Responsible AI