AI’s English Problem—and Why We Should Care
Sushant Kumar, Ananya Mukherjee / Apr 28, 2026
Sushant Kumar and Ananya Mukherjee are external consultants working with Bhashini, India’s sovereign AI language infrastructure.

Visitors during the India AI Impact Summit 2026 at Bharat Mandapam, in New Delhi, Monday, Feb. 16, 2026. (Ravi Choudhary/PTI via AP)
Bhashini’s Delhi office was abuzz in the lead-up to February’s India AI Impact Summit. Building multilingual AI systems from the ground up is no easy task. Bhashini CEO Amitabh Nag often describes the process of creating multilingual language datasets for 22+ Indic languages as “stitching together by brute force.” The phrase captures the scale of the task: assembling vast volumes of speech, text, and translation data across dozens of languages and dialects to create a pipeline for indigenous model development.
Launched in July 2022 under the aegis of the Digital India Corporation in the Ministry of Electronics and Information Technology, Bhashini’s core mission is to break language barriers across India. Since then, it has rapidly developed more than 350 open source AI models and 4,500 language training datasets, and launched a co-creation platform bringing together governments, startups, and academia to ideate and build multilingual AI systems. Its models, covering voice-first solutions and neural machine translation, are already integrated across multiple government services.
Initiatives like Bhashini are a response to a deeper structural challenge in the development of AI: the dominance of English as the primary language of training data.
Most global AI models are trained in English. With an estimated 1.5 to 2 billion English speakers worldwide, this guarantees reach. The reality, however, is vastly different in the Global South. Approximately 10% of India’s 1.4 billion population speaks English. This is the language gap.
The language gap is not restricted to India. Africa, for instance, has at least 75 languages with more than 1 million speakers each. Yet few large language models (LLMs) provide full support for African languages. While ChatGPT claims to support Hausa, its models are only able to correctly identify 20% of Hausa sentences.
Haves and have nots
When access to essential AI tools is limited to a narrow section of the population, typically the urban elite, it risks reinforcing digital divides. Public sector initiatives in India like Bhashini, alongside private sector startups like Sarvam, Krutrim, Gnani, and CoRover, are working to build multilingual AI solutions to this problem. These systems allow users to interact with AI in a language of their choosing.
The development of LLMs is heavily reliant on the availability of language data. Most AI systems work on pattern matching and require vast volumes of training data. This data is abundant for high-resource languages like English, French, Spanish, German, and even Hindi. But Indic languages such as Odia, Gujarati, and Assamese—with close to 50 million, 90 million, and 15 million native speakers respectively—are still sidelined as low-resource languages. Building AI-ready datasets for less digitally prevalent languages means digitizing texts, collecting speech samples, and creating translation materials.
New language capabilities integrated in existing high-resource models are prone to errors. True digital access requires more than just translation. It requires systems that understand the cultural realities of the communities using them. When researchers asked ChatGPT and Google Gemini how many ‘seasons’ there are, these systems replied four. Yet, in West Africa, there are only two major seasons: wet and dry.
Multilingual AI systems built with local languages in mind, by contrast, have the capacity to touch the lives of millions of non-English speakers. Digital tools that have in the past remained out of reach for a tribal woman in a remote village, a doctor in a local hospital, or a farmer in rural India may suddenly become available in a manner that is affordable, light-touch, and easy to use. With language as essential AI infrastructure, not a design feature, the compounding effects of this access could be felt across government services.
Different strategies to build language datasets
Bhashini, as a government initiative, has taken a community-driven approach to language collection. ‘BhashaDaan’ is Bhashini’s crowdsourced language initiative, inviting Indian citizens to donate texts, voice recordings, and linguistic knowledge in their native languages to the Bhashini platform. This data is then used to train Bhashini’s AI models. Civic participation sits at the heart of the BhashaDaan initiative, giving everyday citizens an opportunity to shape the design and development of AI models.
The Mozilla Common Voice initiative takes a similar approach. The initiative is a free, global, open-source, language data creation platform. The platform seeks colloquial contexts, written texts, and speech data from the global community. It features over 180 languages, including many that are oral-first languages.
To create high-quality, rights-respecting language datasets, Bhashini also leverages partnerships with linguistic experts, academic institutions, and publishers, curating datasets through structured collaborations rather than unpermitted web scraping.
Those familiar with debates around responsible AI will recognize the value in beneficiaries having a say in how AI is designed. Most AI systems are still developed behind closed doors by a handful of people in the Global North. Initiatives like BhashaDaan demonstrate a different path—addressing questions of power and resource concentration by empowering local communities in the design process.
Indian private sector companies like Sarvam and Krutrim have tackled the problem from a different angle, focusing on building full-stack AI systems. Sarvam, for example, has built state-of-the-art foundation models alongside conversational AI agents that operate across India’s varied languages.
Countries with diverse populations and high multilingualism are being pushed to think differently about AI and culture. AI, when designed inclusively, has the power to reduce digital divides and extend digital services to last mile users.
This requires treating language not as a design feature, but as essential AI infrastructure.
A new class of efforts emerges
Similar efforts are emerging elsewhere in the Global South. Kalpa Impact’s report “Opening Up Computational Resources for New AI Futures” provides an overview of multilingual data initiatives. The report highlights Lelapa AI, Masakhane, and the African Language Lab, which are building language technologies centered on African linguistic contexts. Lelapa AI’s Vulavula initiative converts call center conversations in local African languages and regional dialects into datasets, which are then used to train models from the ground up rather than adapting existing high-resource systems. The organization centers languages such as Swahili, Yoruba, Afrikaans, Sesotho, and Hausa.
Masakhane is a grassroots research collective focused on natural language processing (NLP) research on African languages, working across Umbundu, Swazi, Kabiya, and Sepedi. The African Language Lab focuses on validation and on developing speech repositories for several African languages.
This common thread—running from language data collection, to model training, to culturally informed context instructions—ensures that indigenous multilingual AI systems are accurate, culturally relevant, and genuinely useful to diverse communities.