Egyptian coder Assem Sabry has long wanted an AI model that represents his culture. The problem is he hasn’t been able to find one. “The AI industry in Egypt . . . doesn’t exist,” Sabry says. So he built his own: Horus, named after the ancient Egyptian god of the sky.
Sabry says the goal was to stop “relying on other models, like the American or Chinese models,” and instead ask what a more Egyptian-focused model might look like. To make Horus work, he trained it using GPUs from Google Colab and other cloud providers, alongside open-source datasets. The model, released in early April, drew more than 800 downloads in its first week on Hugging Face.
Sabry is one of a growing number of developers trying to correct a long-standing imbalance in AI. Models are fluent in English and, to a lesser extent, Chinese, but far less capable in most other languages. So-called minority languages are, in reality, spoken by the global majority. Yet thanks to the way models are trained (on massive scrapes of the web), combined with the economics of the tech industry, English remains dominant.
In 2023, researcher Aliya Bhatia and a colleague at the Center for Democracy & Technology published a study arguing that non-English languages were “Lost in Translation” because of the homogenizing pull of web-scraped training data and the commercial incentives shaping Big Tech. In the rush to capitalize on AI, companies prioritized English-language support, in part because training data in other languages was scarce, and did little to close the gap.
For years, the economics have reinforced the problem. Training AI models is expensive, and companies have little incentive to build for smaller language groups without a clear return.
That dynamic has finally begun to shift. The rise of open-source LLMs, along with big AI companies tightening token limits, has opened space for smaller players. “Two years ago, AI wasn’t as good as now, and the LLMs weren’t open-source,” Sabry says. “Now we can really build our AI models from scratch.”
Yet obstacles remain. “Some barriers still exist in terms of compute, in terms of underlying infrastructure, and in terms of funding,” Bhatia notes, and that combination, she says, “remains a huge barrier.” Still, progress is visible.
What’s emerging is less a formal ecosystem than a loose, global patchwork of locally focused models: Switzerland’s Apertus, Latin America’s Latam-GPT, Nigeria’s N-ATLaS, Indonesia’s Sahabat-AI, AI Singapore’s SEA-LION, Vietnam’s GreenMind, Thailand’s OpenThaiGPT, and Europe’s Teuken 7B. Each offers an alternative to the dominant models from OpenAI, Anthropic, and Alibaba.
Some efforts remain grassroots, like Sabry’s. Others have institutional backing. Apertus, for instance, is a collaboration between two Swiss universities and the Swiss National Supercomputing Center, which contributed more than 10 million GPU hours, equivalent to tens of millions of dollars in commercial compute.
Most projects operate far below that scale. Still, the ability to train and deploy local models at a relatively low cost is changing the calculus. A fine-tuned version of Meta’s Llama 3.2, trained on 14,500 pairs of Indian legal-language examples, has logged just over 1,000 downloads since early April. That’s a niche audience, but a meaningful one, and one that would have been difficult to justify economically until recently.
The early uptake suggests a market beyond the mainstream. It also raises a question for the largest AI companies. “What these alternatives offer is a demonstration that it’s possible to build systems that better represent global majority users and languages,” Bhatia says, “as long as major AI companies actually want to take a page out of this book and learn from them.”