Unlocking the Future of Speech Data for Under-Resourced Languages

Speech data collection event in Puno, Peru

Building Ethical and Powerful Speech AI (21.08.2025)

Funding year 2024 / Scholarship Call #19 / Project ID: 7335 / Project: Building Knowledge Graphs for Under Resourced Languages

Imagine asking a voice assistant for the weather forecast in your mother tongue. For well-established languages, that is already possible. For millions of speakers of minoritized languages across the world, it isn't.

My research tackles a fundamental challenge in AI: how to build intelligent, understanding systems for the world's diverse languages. I specialize in constructing knowledge graphs for under-resourced languages, a process that begins at the most vital stage: acquiring high-quality, ethical data.

Right now, my work is on the ground, collaborating with language communities to create rich speech datasets. This isn't just about recording audio; it's about merging cutting-edge technology with deep cultural understanding and strong ethical principles. Here are some highlights:

The Technical Hurdles: More Than Just Microphones

Language Support & Orthography: The digital world must correctly "see" the language, whether it is written in Latin script or in another writing system. This means ensuring that digital platforms support every language, regardless of how it is officially coded in linguistic databases.
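As a rough illustration of what "seeing" a language correctly can involve, here is a minimal Python sketch that normalizes text to a single Unicode form and flags characters outside an expected alphabet. The alphabet and the apostrophe variants below are assumptions made for the example; they are not taken from this project, and a real pipeline would use the orthography agreed on with the community.

```python
import unicodedata

# Hypothetical alphabet, for illustration only. A real project would take
# this from the orthography standard the community itself uses.
ALLOWED = set("abcdefghijklmnopqrstuvwxyzñ' ")

# Characters that look like an apostrophe but have different code points.
APOSTROPHE_VARIANTS = ("\u2019", "\u02bc")


def normalize_text(text: str) -> str:
    """Return NFC-normalized text with apostrophe look-alikes unified."""
    text = unicodedata.normalize("NFC", text)
    for variant in APOSTROPHE_VARIANTS:
        text = text.replace(variant, "'")
    return text


def unexpected_characters(text: str) -> set:
    """Return the characters that fall outside the assumed alphabet."""
    return {ch for ch in normalize_text(text).lower() if ch not in ALLOWED}
```

Running `unexpected_characters` over a text corpus before recording prompts are generated is one way to catch encoding problems early, before they end up in transcripts.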

Diverse and Relevant Content: An AI is only as good as the data it eats. If a speech dataset only contains formal news broadcasts collected in urban areas, it will fail to understand casual conversation and cultural nuances of rural areas. We need a rich and balanced text corpus that covers all aspects of life, from farming and cooking to rituals and jokes, to avoid building biased AI.
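One simple way to keep an eye on this balance is to track how the corpus is distributed across topics. The sketch below assumes each sentence or recording carries a domain label; the label names and the 10% threshold are hypothetical choices for the example, not values used in this project.

```python
from collections import Counter


def domain_shares(domains):
    """Return the fraction of items that belong to each domain label."""
    counts = Counter(domains)
    total = sum(counts.values())
    return {domain: count / total for domain, count in counts.items()}


def underrepresented(domains, min_share=0.1):
    """Return, in sorted order, the domains whose share is below min_share."""
    shares = domain_shares(domains)
    return sorted(d for d, share in shares.items() if share < min_share)


# Hypothetical labels: a corpus dominated by formal news text.
labels = ["news"] * 18 + ["farming"] + ["cooking"]
gaps = underrepresented(labels)  # domains that need more material
```

A report like this only shows where the gaps are; deciding which domains matter, and what counts as enough, is a question for the community.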

Capturing Real, Spontaneous Speech: Carefully read text is clean and easy to process, but it's not how we talk. Natural speech is full of pauses, repetitions, and false starts, and it often comes with background noise. A further complexity in many under-resourced language communities, where speakers are bilingual, is code-switching: seamlessly blending more than one language within a single conversation.

Bridging the Digital Divide: The Socio-Economic Gap

A major barrier isn't linguistic; it's infrastructural. During speech data collection, researchers identified practical hurdles like expensive internet access and limited digital literacy, which can exclude entire communities from participating.

The future of inclusive data collection combines online platforms with organized offline recording campaigns. This means providing devices, internet access, and offline tools, paired with localized training workshops. This strategy is essential to overcome socio-economic barriers and ensure everyone can have a voice.

The essential ethical foundation goes beyond the technical checklist. True success is built on a robust ethical framework that rests on two principles: putting the community first, and respecting data sovereignty and consent.

Community First: It starts with prioritizing localization and building genuine, long-term relationships with under-resourced language communities, linguists, and local institutions. This isn't a one-off data extraction; it's about sustainable engagement.
Data Sovereignty and Consent: Informed consent must be culturally sensitive and truly understood. Crucially, communities must be engaged in conversations about data sovereignty: who owns the voice data and how it can be used. This means exploring community-centric data licenses and benefit-sharing agreements that respect and align with Indigenous values.
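On the technical side, principles like these can be carried along with the data itself. The following Python sketch shows one possible shape for a per-recording consent record; the class, its field names, and the example uses are all hypothetical and only illustrate the idea, not an agreed standard or this project's actual schema.

```python
from dataclasses import dataclass, field


@dataclass
class ConsentRecord:
    """Hypothetical consent metadata attached to a single recording."""

    recording_id: str
    steward: str  # who the community designates as responsible for the data
    permitted_uses: set = field(default_factory=set)
    withdrawn: bool = False  # consent can be withdrawn at any time

    def allows(self, use: str) -> bool:
        """Return True only if consent is active and covers this use."""
        return not self.withdrawn and use in self.permitted_uses


# Illustrative values only.
record = ConsentRecord("rec-001", "community council", {"research"})
```

Checking `record.allows(...)` before any export or training run keeps the agreed terms next to the audio, instead of in a separate document that is easy to lose track of.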

A Foundation for the Future

By addressing this integrated research agenda, tackling both the technical challenges and the ethical considerations, we can do more than just build speech datasets. We can establish a robust foundation for the future of under-resourced languages in the digital age.

Elwin Huaman

Elwin Huaman is a Researcher, Project Manager, and an Activist for Under-Resourced Languages. He has experience creating Knowledge Graphs and applying Semantic Web technologies in academia and industry. He has co-authored the book: Knowledge Graphs - Methodology, Tools and Selected Use Cases, and leads the QICHWABASE Knowledge Graph that supports a harmonization process of the language and knowledge of Quechua communities across the world.

Skills:

Knowledge Graphs

Natural Language Processing

Machine Learning

Databases

Project Management

Data Sovereignty and Consent

Responsible AI
