Policy Brief No. 216 — November 2025
AI and Language Data Flaring in Africa: Addressing the Low-Resource Challenge
Ife Adebara
Key Points
→ African languages are under-represented in artificial intelligence (AI) systems due to limited language data, excluding millions from digital participation in their native languages.
→ Factors such as multilingual complexity, foreign-language-dominant policies, weak institutional backing and lack of digital infrastructure contribute to the low-resource classification of African languages.
→ “Language data flaring,” a term paralleling gas flaring, captures the systemic neglect and poor management of African language data, leading to data undercollection, poor storage and limited use in AI.
→ Addressing the gap requires policies that integrate African languages into national digital agendas, support documentation, fund projects and foster inclusive, collaborative AI development.
→ Community-led documentation, open-source tools and growing recognition of linguistic diversity in AI offer promising paths forward.
Introduction
Modern AI systems, built using deep-learning techniques, require massive amounts of data to function effectively and to produce realistic outputs that reflect the patterns and structures within the training data. For language technologies, this data is sourced from news, books, blogs, social media and other digital platforms that host linguistic content. However, only a small number of languages possess sufficient data to support the development of robust AI technologies.

Despite having millions of speakers, most African languages are low resource, meaning they lack the data necessary to build robust AI models. Many Africans are therefore excluded from digital tools, online resources and AI-driven services because these are not available in indigenous languages (Adams et al. 2024). Figure 1 shows the lack of cultural and language diversity of AI in government frameworks, government actions and non-state actors for Africa compared to other regions. According to Pratik Joshi et al. (2020), more than 95 percent of African languages are classified as “left-behind”: languages for which it is nearly impossible to build AI-powered language technologies due to insufficient data. This classification is not based on the number of speakers but rather on the availability