How did this happen?
Firstly, tools like ChatGPT are large language models (LLMs), which work with words by applying statistical patterns learned from large volumes of text drawn from available sources, including books and the internet. Because English dominates these sources – partly for historical reasons such as the British Empire and the economic dominance of the USA in the 20th century – most LLMs are trained predominantly on English text. Chinese follows in a close second place, thanks to China's huge population of native speakers. Sizeable populations also speak English and Chinese as additional languages, so these tools can at least be partly understood by a very large share of the world's population.
In a competitive business world where uptake of a tool is essential for survival, it makes perfect sense to prioritise the two largest languages, as this gives a tool the best chance to thrive. However, it does not allow non-native speakers to get the most from the tool. The text data sets used to train LLMs do include other languages, as they are naturally mixed into the available sources, but the volume is far smaller, so the tools' knowledge of and capability in those languages simply is not there.
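As a rough illustration of that imbalance, the minimal sketch below shows how a data team might measure the language mix of a text collection before relying on it for multilingual use cases. It assumes Python with the open-source langdetect package, and the small corpus list is purely illustrative rather than any real training data set.

```python
from collections import Counter

from langdetect import DetectorFactory, detect
from langdetect.lang_detect_exception import LangDetectException

DetectorFactory.seed = 0  # make language detection deterministic across runs


def language_share(documents):
    """Return the share of documents detected per ISO 639-1 language code."""
    counts = Counter()
    for text in documents:
        try:
            counts[detect(text)] += 1
        except LangDetectException:
            counts["unknown"] += 1  # text too short or ambiguous to classify
    total = sum(counts.values())
    return {lang: round(n / total, 3) for lang, n in counts.items()}


# Illustrative (made-up) corpus, skewed towards English as most web-scraped sets are.
corpus = [
    "The quarterly report shows strong growth across all regions.",
    "Revenue forecasts were revised upwards after the product launch.",
    "O relatório trimestral mostra forte crescimento em todas as regiões.",  # Portuguese
]
print(language_share(corpus))  # e.g. {'en': 0.667, 'pt': 0.333}
```

Even a simple check like this tends to show English dwarfing every other language in general-purpose text collections, which is exactly the imbalance that flows through into the trained models.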
Furthermore, there is the issue of regional differences within languages. The clearest example in English is the difference between British and US English, where spellings, turns of phrase and the frequency of particular words all vary. This can then be expanded to include Australian, Indian and South African English, each of which differs again in its own way. The Spanish spoken in Spain is quite different to the Spanish spoken in Latin America, and within that group the Spanish of Costa Rica is different to the Spanish of Chile.
What can be done?
Most data professionals understand and appreciate the benefits gen AI can provide – which is exactly why they also need to be aware of the biases and limitations of English-centric tools. For example, Brazil has the second largest market and the second largest population in the western hemisphere, but as a Portuguese-speaking nation it sits further down the list of the most common languages in training data. Gen AI tools can be used in Brazil, but if they are not well trained in Portuguese, they will not provide the same quality of insight as they do in English.
If a US-based business has an operation in Laos, it must appreciate that gen AI tools used in US English will deliver better results than the same tools used in Lao, simply because the data sets the LLM learns from are vastly imbalanced.
Investment in LLMs for other languages is key. Data professionals must appreciate that their international counterparts may not be receiving the same quality of data-led insights from AI tools purely because of a language barrier. If data professionals fall into the trap of simply going with the flow of English-centric tools, it will stifle diversity and, eventually, innovation.
Half the battle for data professionals in recent years has been getting people to trust data and believe in its value. If the most popular data tool suddenly cannot accurately render a few sentences in a widely spoken language from a tech-driven country, such as Korean, why should people trust it? By allowing LLMs to thrive in only one or two languages, the risk of undermining trust in data-driven recommendations grows rapidly.
Data leaders must of course be ready to adopt and incorporate gen AI tools into their operations, but they must also be aware of the limitations and problems that can come with them. By understanding where issues are likely to occur for international organisations, data leaders can prepare more accurately and provide support where it is needed.
To better understand your own AI journey, click here to take the free DataIQ AI Indicator Assessment.
Click here to download the DataIQ advisory report on gen AI.