Why AI still can’t translate South African languages

Updated May 2026

Try it yourself. Open any AI tool — ChatGPT, Google Translate, DeepL — and ask it to translate a few sentences into isiZulu, Sepedi or Xitsonga. Then show the output to a first-language speaker. The reaction is usually the same: disbelief, then a flat "this is completely wrong."

This isn't a subtle quality gap. The output is not "good enough for a first draft." The words may look like the language. The structure may resemble a sentence. But the meaning is broken — and any speaker can see it immediately.

We've been testing AI translation against South Africa's indigenous languages since these tools became widely available. So has every translator we work with. The results are consistent.


The problem isn't just that there's not enough data. It's that most of the data that exists is bad.

AI language models learn from text. The more high-quality text exists in a language, the better a model can learn its grammar, vocabulary and idiom. For English, that training pool is effectively unlimited. For most of South Africa's indigenous languages, it is a fraction of what any model needs — and the gap is not marginal.

Wikipedia article counts offer a useful proxy for this gap. Wikipedia is one of the primary sources of training data for AI, and the difference in scale is startling:

Language              Wikipedia articles (May 2026)   % of English
English               7,178,523                       100%
Afrikaans             128,874                         1.80%
isiZulu               12,245                          0.17%
Sepedi                8,920                           0.12%
Setswana              4,140                           0.06%
isiXhosa              2,410                           0.03%
Sesotho               1,659                           0.02%
Siswati               1,139                           0.02%
Xitsonga              1,085                           0.02%
Tshivenḓa             894                             0.01%
Southern isiNdebele   180+                            <0.01%

Sourced from our dedicated FAQ page
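For readers who want to check the arithmetic, here is a minimal sketch that recomputes the "% of English" column from the article counts quoted above. The function names are our own illustration, not part of any tool, and Southern isiNdebele is left out because its count is only given approximately ("180+").

```python
# Article counts copied from the table in this post (May 2026 figures).
# Southern isiNdebele is omitted: the post only gives an approximate "180+".
ARTICLE_COUNTS = {
    "English": 7_178_523,
    "Afrikaans": 128_874,
    "isiZulu": 12_245,
    "Sepedi": 8_920,
    "Setswana": 4_140,
    "isiXhosa": 2_410,
    "Sesotho": 1_659,
    "Siswati": 1_139,
    "Xitsonga": 1_085,
    "Tshivenḓa": 894,
}


def share_of_english(counts, baseline="English"):
    """Return each language's article count as a percentage of the baseline."""
    base = counts[baseline]
    return {lang: 100 * n / base for lang, n in counts.items()}


def baseline_articles_per_article(counts, baseline="English"):
    """Return how many baseline-language articles exist per article in each language."""
    base = counts[baseline]
    return {lang: base / n for lang, n in counts.items()}


shares = share_of_english(ARTICLE_COUNTS)
ratios = baseline_articles_per_article(ARTICLE_COUNTS)

# Print one row per language: share of English, then English articles per local article.
for lang in ARTICLE_COUNTS:
    print(f"{lang:<12} {shares[lang]:7.2f}%   1 : {ratios[lang]:,.0f}")
```

Expressing the gap as a ratio can be easier to grasp than a percentage: on these figures, English Wikipedia has several hundred articles for every one in isiZulu. Article counts are only a rough proxy, since they say nothing about article length or quality.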

But raw scarcity is only half the problem. Consider where most of the available isiZulu, Xitsonga or Setswana text actually comes from.

Old religious texts. Government documents from decades past. And a significant volume of translations produced purely as afterthoughts — documents that needed to tick a language requirement, where the budget went to a volume-focused agency rather than a specialist one. Many of these translations were produced without any genuine expertise in the language, by translators under commercial pressure, with no meaningful quality control. The text exists in the right language. The content is frequently wrong.

Add to that the large volumes of social media text that AI scrapes as training data. Social media in any South African language is code-switched, abbreviated and spelled however it needs to be to make the point. That's how informal language lives, and there's nothing wrong with it. But it is entirely the wrong register for a professional, medical or legal document — and the AI has no way to distinguish between the two.

The result is that AI doesn't just have too little data to learn from. It has been trained on data that actively teaches it the wrong things: outdated register, incorrect terminology, vocabulary borrowed from neighbouring languages to fill gaps, and patterns inherited from poor-quality source translations. As AI gets access to more data, if the underlying data quality doesn't improve, the models will simply become more confidently wrong.

What the output actually looks like

Languages like isiZulu and isiXhosa are agglutinative — they build complex meaning through layered word structures governed by noun class systems and concord rules that have no real equivalent in English. When AI attempts these languages, it typically produces output that looks structurally plausible but is linguistically broken underneath: noun class agreement is wrong, verb forms are wrong, and a word from isiZulu will appear mid-sentence in what was supposed to be Setswana. The model had a gap and filled it with whatever was adjacent in its training data.

No speaker can un-see these errors. Whatever the document was trying to say, its credibility is gone the moment a reader encounters them.


What this means for organisations

ISO 18587:2017, the international standard for post-editing machine translation output, requires explicit disclosure whenever machine translation post-editing has been used. No agency can credibly offer ISO 18587-compliant post-editing for South Africa's indigenous languages in 2026. The AI output is not of sufficient quality to post-edit: correcting hallucinated grammar takes longer than translating the source text correctly from the start.

For an organisation that has used AI translation without realising this, the exposure is real. On a medical informed consent form, a mining community engagement document, or a legal notice, a translation that misrepresents meaning to a first-language speaker is not an accurate translation — regardless of what process produced it or what the cover page says.

This isn't an argument against technology. It's an argument for understanding what the technology can and cannot do. For South Africa's indigenous languages in 2026, it cannot produce a translation that a first-language speaker will accept as correct.


For a full breakdown of the specific linguistic hurdles, see our FAQ on AI Translation Accuracy.
