Generative AI is trained on just a few of the world’s 7,000 languages. Here’s why that’s a problem – and what’s being done about it

May 17, 2024

Companies are embedding more languages in their AI models.

Image: Unsplash/Solen Feyissa

Madeleine North

Senior Writer, Forum Stories

Generative AI is mainly trained on the English language, leading to bias and, in some cases, errors with serious consequences.
Companies and governments are taking action and creating their own AI models to ensure more of the world’s 7,000 languages are embedded in the technology.
Preserving cultural heritage is one of the suggested actions put forward in the World Economic Forum’s Presidio Recommendations on Responsible Generative AI.

"Ka pai te AI Whakaputanga i ngā reo?"

According to ChatGPT – and hopefully anyone Māori – the above sentence means, “Is Generative AI good at languages?”

The answer: yes and no.

Most large language models (LLMs) are trained mainly on English text. So if you are, say, a student in Odisha, India, using AI to analyze a research paper in your native Odia language, the likes of ChatGPT, Claude and Google Bard may let you down.

This may have serious consequences in some cases. A translator in the US told Reuters Context that four in ten of their Afghan asylum cases derailed in 2023 due to inaccurate AI-driven translation apps.

So what is going on here? There are over 7,000 languages spoken in the world, yet most AI chatbots are trained on around 100 of them. And English, despite being spoken by less than 20% of the world’s population, accounts for almost two-thirds of websites and is the main driver of LLMs, says the Center for Democracy & Technology (CDT).

The English language dominates the internet, and therefore generative AI models too. Image: Reuters Context

Generative AI and its language bias

Inevitably, this linguistic imbalance is leading to issues.

The “insane mistakes” spotted by the asylum application translators included names becoming months, crucial details missing, even immigration sentences being reversed. "The machines themselves are not operating with even a fraction of the quality they need to be able to do casework that's acceptable for someone in a high-stakes situation," Ariel Koren, founder of Respond Crisis Translation, told Reuters Context.

It’s a view shared by CDT’s Gabriel Nicholas and Aliya Bhatia, who point out that, despite the gradual emergence of Multilingual Language Models (MLMs), they “are still usually trained disproportionately on English language text and thus end up transferring values and assumptions encoded in English into other language contexts where they may not belong”. They give the example of the word “dove”, which an MLM might interpret in various languages as being associated with peace, but the Basque equivalent (“uso”) is in fact an insult.

What’s needed is the development of non-English Natural Language Processing (NLP) applications, say experts, to help reduce the language bias in generative AI and “preserve cultural heritage”. The latter is one of 30 suggested actions put forward in the World Economic Forum’s Presidio Recommendations on Responsible Generative AI. “Public and private sector should invest in creating curated datasets and developing language models for underrepresented languages, leveraging the expertise of local communities and researchers and making them available,” it says.

Addressing the AI language bias

There are signs that governments, the tech community and even individuals are taking steps to resolve the AI language issue.

The Indian government is building Bhashini, an AI translation system trained on local languages. India has 22 official languages, but few are currently captured by NLP applications. Indian tech firm Karya is also trying to redress the balance by building datasets for firms like Microsoft and Google to use in AI models. It’s a painstaking process, involving people reading words in their native language into an app.

Launched in the UAE in 2023, Jais AI is an Arabic language model capable of generating high-quality text in Arabic, including regional dialects, says Digital Watch. The developers, G42, next plan to bring out the world’s first Arabic robot assistant.

In New Zealand, local broadcaster Te Hiku Media is harnessing AI to aid the “preservation, promotion and revitalization of te reo Māori,” its chief technology officer told Nvidia, which helped create the automatic speech recognition models it says can transcribe te reo with 92% accuracy.

In a similar endeavour, grassroots organization Masakhane is working to “strengthen and spur NLP research in African languages”. There are around 2,000 languages spoken across Africa, yet they are “barely represented in technology”, it says.

Nigeria's government is also taking action, recently launching its first multilingual LLM. “The LLM will be trained on five low-resource languages and accented English to ensure stronger language representation in existing datasets for the development of artificial intelligence solutions,” Dr 'Bosun Tijani, the Minister of Communications, Innovation and Digital Economy, announced on LinkedIn.

In the Brazilian Amazon, 300 languages are spoken by indigenous people, but only a few of the major ones are recognized by LLMs.

After being unable to communicate with the Amazonian community he was living and working with, Turkish artist Refik Anadol – who co-created the indigenous digital artwork Winds of Yawanawa – turned his frustration into action. Anadol has spearheaded the creation of an open-source AI tool “for any indigenous people” to “preserve their language with technology”, he told the World Economic Forum at this year’s Annual Meeting in Davos.

“How on Earth can we create an AI that doesn’t know the whole of humanity?” he asked.

With one language “disappearing” every fortnight, according to UNESCO, generative AI could prove to be the death knell, or the saviour, of many of them.

License and Republishing

World Economic Forum articles may be republished in accordance with the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International Public License, and in accordance with our Terms of Use.

The views expressed in this article are those of the author alone and not the World Economic Forum.