Case Study: Puno Quechua language
Speech dataset collection for Puno Quechua (26.01.2026)
Funding year 2024 / Scholarship Call #19 / Project ID: 7335 / Project: Building Knowledge Graphs for Under Resourced Languages

This case study covers the integration of Puno Quechua (qxp), a language that has seen a significant decline in speakers over the last century, into Common Voice. It details the technical and social steps taken to record 12 hours of scripted and spontaneous speech.

In the Puno region of Peru, the Quechua language (specifically Puno Quechua, ISO 639-3: qxp) has faced a steep decline, with the share of speakers dropping from over 87% in 1940 to roughly 38% in 2007. To counter this decline and to support the documentation of Puno Quechua, we have integrated the language into the global Common Voice ecosystem.

The contributed datasets are categorized into two primary types:

  • Reading (Scripted) Speech: This portion of the dataset consists of volunteers reading aloud a set of pre-approved, CC0-licensed sentences.

    • Corpus Size: We collected and uploaded 2,065 sentences for the platform, which represents 11.6% of the total Quechua sentences available on Common Voice.

    • Objective: These recordings are made in noise-free environments with clear pronunciation to provide high-quality training data for standard voice applications.

  • Spontaneous Speech: This represents a more advanced contribution type designed to capture how the language is used in real-world, casual conversation.

    • Methodology: We uploaded 150 open-ended questions specifically focused on the Agricultural and Food domain for community members to answer freely.

    • Linguistic Features: These recordings capture natural speaking patterns, including accents, hesitations, repetitions, and interjections such as "mm" and "ahh".

    • Code-Switching Mitigation: To minimize the frequent mixing of Quechua and Spanish (code-switching), we centered the questions on culturally vibrant subjects such as farming and daily life.

Integrating Puno Quechua (qxp)

The process of bringing Puno Quechua online involved two critical phases:

  • Language Onboarding: The team localized the Common Voice interface, translating 466 strings (a 27% translation rate) to allow native speakers to interact with the platform in their own tongue.
  • Corpus Collection: We gathered 2,065 CC0-licensed sentences, representing 11.6% of the total Quechua sentences on the platform.

Through these efforts, 12 hours of Puno Quechua speech have been recorded by 14 contributors, with 77% of that data already validated.
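The figures above can be combined into a few derived numbers. The following is a minimal sketch in Python, not part of the original project: the constant names and the helper `implied_total` are hypothetical, and because the reported percentages are rounded, the implied totals are approximations only.

```python
# Figures reported in this case study (percentages are rounded in the text).
SENTENCES_QXP = 2065        # Puno Quechua sentences contributed
SHARE_OF_QUECHUA = 0.116    # 11.6% of all Quechua sentences on Common Voice
STRINGS_TRANSLATED = 466    # localized interface strings
TRANSLATION_RATE = 0.27     # 27% translation rate
HOURS_RECORDED = 12.0       # total recorded speech
VALIDATED_SHARE = 0.77      # 77% of the recorded data is validated
CONTRIBUTORS = 14


def implied_total(part: float, share: float) -> float:
    """Return the whole implied by a part and its fractional share.

    Hypothetical helper for illustration; the result is only as precise
    as the rounded share it is given.
    """
    if not 0 < share <= 1:
        raise ValueError("share must be in (0, 1]")
    return part / share


validated_hours = HOURS_RECORDED * VALIDATED_SHARE
hours_per_contributor = HOURS_RECORDED / CONTRIBUTORS
total_quechua_sentences = implied_total(SENTENCES_QXP, SHARE_OF_QUECHUA)
total_interface_strings = implied_total(STRINGS_TRANSLATED, TRANSLATION_RATE)

print(f"Validated speech:           {validated_hours:.2f} h")
print(f"Average per contributor:    {hours_per_contributor:.2f} h")
print(f"Implied Quechua sentences:  ~{total_quechua_sentences:.0f}")
print(f"Implied interface strings:  ~{total_interface_strings:.0f}")
```

Under these assumptions, roughly 9.2 of the 12 recorded hours are validated. The implied totals are back-of-the-envelope estimates derived from rounded percentages, not figures published by Common Voice.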

Beyond Reading: Spontaneous Speech

One of the most innovative aspects of this project is the focus on spontaneous speech. While reading sentences is helpful, real-world voice assistants need to understand how people actually talk, including hesitations, interjections (like "mm" or "ahh"), and informal phrasing.

  • We uploaded 150 open-ended questions specifically within the Agricultural and Food domain to capture natural, unscripted responses.
  • We focused on minimizing "code-switching" (mixing Spanish and Quechua) by centering topics on culturally vibrant subjects like daily life and farming.
  • Furthermore, we emphasize that technical success is impossible without a robust social framework. This includes:
    • Informed Consent: Ensuring processes are culturally sensitive and that communities understand how their data will be used.
    • Data Sovereignty: Moving beyond simple regulatory compliance to explore community-centric licenses and benefit-sharing agreements that align with indigenous values.

The Path Forward

The future of Quechua in tech looks bright. The goal is now to expand to all remaining Quechua varieties and enable offline contributions via instant messaging apps. By putting the tools of data creation back into the hands of the community, Puno Quechua is proving that under-resourced does not mean under-valued.

Huaman, E., Huaman, W., Huaman, J. L., & Quispe, N. (2025). Quechua speech datasets in Common Voice: The case of Puno Quechua (arXiv:2510.13871v1). arXiv. https://arxiv.org/pdf/2510.13871

Tags: Under-Resourced Languages, Quechua, data provenance, Minoritized Languages

Elwin Huaman

Elwin Huaman is a Researcher, Project Manager, and an Activist for Under-Resourced Languages. He has experience creating Knowledge Graphs and applying Semantic Web technologies in academia and industry. He has co-authored the book: Knowledge Graphs - Methodology, Tools and Selected Use Cases, and leads the QICHWABASE Knowledge Graph that supports a harmonization process of the language and knowledge of Quechua communities across the world.

Skills: Knowledge Graphs, Natural Language Processing, Machine Learning, Databases, Project Management, Data Sovereignty and Consent, Responsible AI