
Funding year 2024 / Scholarship Call #19 / Project ID: 7335 / Project: Building Knowledge Graphs for Under Resourced Languages
Imagine asking a voice assistant for the weather forecast in your mother tongue. For well-established languages, that is already possible. For millions of speakers of minoritized languages across the world, it is not.
My research tackles a fundamental challenge in AI: how to build intelligent, understanding systems for the world's diverse languages. I specialize in constructing knowledge graphs for under-resourced languages, a process that begins at the most vital stage: acquiring high-quality, ethical data.
Right now, my work is on the ground, collaborating with language communities to create rich speech datasets. This isn't just about recording audio; it's about merging cutting-edge technology with deep cultural understanding and strong ethical principles. Here are some highlights:
The Technical Hurdles: More Than Just Microphones
Language Support & Orthography: The digital world must correctly "see" the language, e.g., Latin script for some languages and other character sets for others. This means ensuring digital platforms support all languages, regardless of how they are officially coded in linguistic databases.
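One small piece of this can be sketched in code. The same visible character can be stored as different Unicode code point sequences, so text is usually normalized before it is compared or indexed. The sketch below is a minimal illustration under my own assumptions, not a description of any particular platform. The helper names `normalize_text` and `same_word` are hypothetical; `unicodedata.normalize` is part of Python's standard library.

```python
import unicodedata


def normalize_text(text: str) -> str:
    """Return text in Unicode NFC form (composed characters where possible)."""
    return unicodedata.normalize("NFC", text)


def same_word(a: str, b: str) -> bool:
    """Compare two strings after normalization, so encoding variants match."""
    return normalize_text(a) == normalize_text(b)


# "ñ" stored as one code point vs. "n" followed by a combining tilde.
composed = "\u00f1"
decomposed = "n\u0303"
print(composed == decomposed)           # False: raw code points differ
print(same_word(composed, decomposed))  # True: identical after NFC
```

Without a step like this, two recordings of the same word can end up attached to prompts that look identical on screen but do not match in the dataset.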
Diverse and Relevant Content: An AI is only as good as the data it eats. If a speech dataset only contains formal news broadcasts collected in urban areas, it will fail to understand casual conversation and cultural nuances of rural areas. We need a rich and balanced text corpus that covers all aspects of life, from farming and cooking to rituals and jokes, to avoid building biased AI.
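A simple way to keep an eye on balance is to count how many prompts each topic contributes and flag the topics that fall below a chosen share. The following is a rough sketch, assuming each prompt is stored as a `(domain, text)` pair. The function names, the domain labels, and the `min_share` threshold are all hypothetical choices for illustration.

```python
from collections import Counter


def domain_shares(prompts):
    """Map each domain to its fraction of all prompts."""
    counts = Counter(domain for domain, _text in prompts)
    total = sum(counts.values())
    return {domain: n / total for domain, n in counts.items()}


def underrepresented(prompts, expected_domains, min_share=0.1):
    """List expected domains whose share is below min_share (or missing)."""
    shares = domain_shares(prompts)
    return sorted(d for d in expected_domains if shares.get(d, 0.0) < min_share)


# Toy corpus: mostly news, very little everyday content (assumed labels).
prompts = [("news", "placeholder text")] * 8 + [
    ("farming", "placeholder text"),
    ("cooking", "placeholder text"),
]
print(underrepresented(prompts, ["news", "farming", "cooking", "jokes"], 0.15))
```

A check like this only measures topic labels, so it complements rather than replaces review by community members who know which parts of daily life are missing.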
Capturing Real, Spontaneous Speech: Carefully read text is clean and easy to process, but it's not how we talk. Natural speech is full of pauses, repetitions, false starts, and background noise. A further complexity in many under-resourced language communities, where speakers are often bilingual, is code-switching: seamlessly blending more than one language within a single conversation.
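To make code-switching a little more concrete, here is a deliberately naive sketch that tags each token of a transcript by looking it up in small word lists and counts how often the language changes. It assumes such word lists exist; real systems need far more than lexicon lookup. The function names and the toy English/Spanish lexicons are hypothetical and are used purely for illustration.

```python
def tag_tokens(tokens, lexicons):
    """Tag each token with the one language whose lexicon contains it, else 'unk'."""
    tags = []
    for token in tokens:
        matches = [lang for lang, words in lexicons.items() if token.lower() in words]
        # Ambiguous or unseen tokens are left untagged rather than guessed.
        tags.append(matches[0] if len(matches) == 1 else "unk")
    return tags


def count_switches(tags):
    """Count language changes between consecutive tagged tokens, skipping 'unk'."""
    known = [t for t in tags if t != "unk"]
    return sum(1 for prev, cur in zip(known, known[1:]) if prev != cur)


# Toy lexicons (assumed, not a real resource).
lexicons = {
    "en": {"thanks", "for", "the", "and"},
    "es": {"agua", "casa"},
}
tokens = "thanks for the agua and the casa".split()
tags = tag_tokens(tokens, lexicons)
print(list(zip(tokens, tags)))
print(count_switches(tags))
```

Even a crude count like this can help identify which recordings contain mixed-language speech and deserve closer attention during transcription.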
Bridging the Digital Divide: The Socio-Economic Gap
A major barrier isn't linguistic; it's infrastructural. During speech data collection, researchers identified practical hurdles such as expensive internet access and limited digital literacy, which can exclude entire communities from participating.
The future of inclusive data collection combines online platforms with organized offline recording campaigns. This means providing devices, internet access, and offline tools, paired with localized training workshops. This strategy is essential to overcome socio-economic barriers and ensure everyone can have a voice.
The essential ethical foundation goes beyond the technical checklist. True success is built on a robust ethical framework with two pillars: putting the community first, and respecting data sovereignty and consent.
- Community First: It starts with prioritizing localization and building genuine, long-term relationships with under-resourced language communities, linguists, and local institutions. This isn't a one-off data extraction; it's about sustainable engagement.
- Data Sovereignty and Consent: Informed consent must be culturally sensitive and truly understood. Crucially, communities must be engaged in conversations about data sovereignty: who owns the voice data and how it can be used. This means exploring community-centric data licenses and benefit-sharing agreements that respect and align with Indigenous values.
A Foundation for the Future
By addressing this integrated research agenda, tackling both the technical challenges and the ethical considerations, we can do more than just build speech datasets. We can establish a robust foundation for the future of under-resourced languages in the digital age.
Elwin Huaman
