Funding year 2024 / Scholarship Call #19 / Project ID: 7335 / Project: Building Knowledge Graphs for Under Resourced Languages
This post focuses on the integration of Puno Quechua (qxp), a language that has seen a significant decline in speakers over the last century, into Common Voice. It details the technical and social steps taken to record 12 hours of scripted and spontaneous speech.
In the Puno region of Peru, the Quechua language (specifically Puno Quechua, ISO 639-3: qxp) has faced a steep decline, with the share of speakers dropping from over 87% in 1940 to roughly 38% in 2007. To counter this decline and to support the documentation of the language, we have integrated Puno Quechua into the global Common Voice ecosystem.
The contributed datasets are categorized into two primary types:
- Reading (Scripted) Speech: This portion of the dataset consists of volunteers reading aloud a set of pre-approved, CC0-licensed sentences.
  - Corpus Size: We collected and uploaded 2,065 sentences for the platform, which represents 11.6% of the total Quechua sentences available on Common Voice.
  - Objective: These recordings are made in noise-free environments with clear pronunciation to provide high-quality training data for standard voice applications.
- Spontaneous Speech: This represents a more advanced contribution type designed to capture how the language is used in real-world, casual conversation.
  - Methodology: We uploaded 150 open-ended questions specifically focused on the Agricultural and Food domain for community members to answer freely.
  - Linguistic Features: These recordings capture natural speaking patterns, including accents, hesitations, repetitions, and interjections such as "mm" and "ahh".
  - Code-Switching Mitigation: To minimize the frequent mixing of Quechua and Spanish (code-switching), we centered the questions on culturally vibrant subjects like farming and daily life.
Integrating Puno Quechua (qxp)
The process of bringing Puno Quechua online involved two critical phases:
- Language Onboarding: The team localized the Common Voice interface, translating 466 strings (a 27% translation rate) to allow native speakers to interact with the platform in their own tongue.
- Corpus Collection: We gathered 2,065 CC0-licensed sentences, representing 11.6% of the total Quechua sentences on the platform.
Through these efforts, 12 hours of Puno Quechua speech have been recorded by 14 contributors, with 77% of that data already validated.
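To put these figures in context, the short Python sketch below derives a few back-of-the-envelope estimates from them. The input numbers come from this post; the helper name `implied_total` and the derived values are illustrative only. Because the reported percentages are rounded, the results are approximations, and the validated-hours estimate assumes that the 77% figure applies to recording duration rather than to the number of clips.

```python
# Rough estimates derived from the figures reported in this post.
# Inputs are taken from the text; outputs are approximations because
# the reported percentages are rounded.

SENTENCES_QXP = 2065        # Puno Quechua sentences uploaded
SENTENCE_SHARE = 0.116      # share of all Quechua sentences on the platform
STRINGS_TRANSLATED = 466    # localized interface strings
TRANSLATION_RATE = 0.27     # reported translation rate
HOURS_RECORDED = 12         # total recorded speech, in hours
VALIDATED_SHARE = 0.77      # assumed to apply to duration (an assumption)
CONTRIBUTORS = 14


def implied_total(part: float, share: float) -> float:
    """Return the total implied by a part and its fractional share."""
    if not 0 < share <= 1:
        raise ValueError("share must be in the interval (0, 1]")
    return part / share


total_sentences = implied_total(SENTENCES_QXP, SENTENCE_SHARE)
total_strings = implied_total(STRINGS_TRANSLATED, TRANSLATION_RATE)
validated_hours = HOURS_RECORDED * VALIDATED_SHARE
hours_per_contributor = HOURS_RECORDED / CONTRIBUTORS

print(f"Implied Quechua sentences on the platform: ~{total_sentences:,.0f}")
print(f"Implied interface strings in total:        ~{total_strings:,.0f}")
print(f"Estimated validated speech:                ~{validated_hours:.2f} h")
print(f"Average recorded speech per contributor:   ~{hours_per_contributor:.2f} h")
```

Under these assumptions, the platform would hold roughly 17,800 Quechua sentences in total, the interface would comprise roughly 1,700 strings, and a little over 9 of the 12 recorded hours would already be validated.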
Beyond Reading: Spontaneous Speech
One of the most innovative aspects of this project is the focus on spontaneous speech. While reading sentences is helpful, real-world voice assistants need to understand how people actually talk, including hesitations, interjections (like "mm" or "ahh"), and informal phrasing.
- We uploaded 150 open-ended questions specifically within the Agricultural and Food domain to capture natural, unscripted responses.
- We focused on minimizing "code-switching" (mixing Spanish and Quechua) by centering topics on culturally vibrant subjects like daily life and farming.
Furthermore, we emphasize that technical success is impossible without a robust social framework. This includes:
- Informed Consent: Ensuring processes are culturally sensitive and that communities understand how their data will be used.
- Data Sovereignty: Moving beyond simple regulatory compliance to explore community-centric licenses and benefit-sharing agreements that align with indigenous values.
The Path Forward
The future of Quechua in tech looks bright. The goal is now to expand to all remaining Quechua varieties and enable offline contributions via instant messaging apps. By putting the tools of data creation back into the hands of the community, Puno Quechua is proving that under-resourced does not mean under-valued.
Huaman, E., Huaman, W., Huaman, J. L., & Quispe, N. (2025). Quechua speech datasets in Common Voice: The case of Puno Quechua (arXiv:2510.13871v1). arXiv. https://arxiv.org/pdf/2510.13871