How do we preserve the Latvian language in an era of artificial intelligence and large language models?
Team Tilde, October 14, 2024
Nowadays, artificial intelligence (AI) is becoming an integral part of everyday technologies, and its use is spreading like wildfire. For many people, the first association is with tools such as ChatGPT, which are built on large language models trained on huge amounts of text and other data. The languages of small communities, including Latvian, however, are very often neglected. For example, more than 90% of the data used in training the ChatGPT model is in English, whereas the remainder consists mostly of data in big languages such as German, French, Portuguese, Spanish and Mandarin. This is just one of the reasons why Latvia has to develop its own national large language model capable of ensuring the preservation and development of the Latvian language in the AI era. Hopefully, this was one of the issues discussed at the recent meeting between the President of Latvia, Edgars Rinkēvičs, and the CEO of OpenAI, Sam Altman.
AI solutions are increasingly based on large language model technologies, for example, ChatGPT, Microsoft Copilot, and Gemini. In the long term, this technology is very likely to replace all other technologies currently in use, such as machine translation, speech recognition, text analysis, computer vision, etc. AI could aggregate textual data and images, making it all available in a large language model. This would be the base technology for all future solutions that we can only start to imagine at the moment or see in science fiction films.
The US is dominated by Tech Giants, whereas Europe has chosen another path
Currently, the development of AI tools is dominated by the US Tech Giants: Microsoft, Google, Meta, and Amazon. These companies have access to huge computing power, intellectual capacity and also abundant financial resources. With English being the primary working and data language in the US, the solutions developed by those giants are of high quality, widely applicable, and rapidly spreading in the market. At the same time, these US companies are also keeping a close watch on processes and developments in the rest of the world. They are well aware of the potential of the European market and are ready to swiftly and effectively fill the existing gap with large language models adjusted to European languages. This is suggested by the recent 665 million US dollar deal in which the Tech Giant AMD acquired the Finnish Silo.ai, considered to be the leading large language model developer for the Nordic languages.
The European Union (EU) has chosen a different path. Here, the technologies are not in the hands of the industry giants. The implementation of large language models could be compared to the Industrial Revolution: it will be the future of automation and robotisation, only at a more advanced level. European countries are well aware of this and have acted accordingly, jointly developing several supercomputers that industry players will be able to access through various innovation programmes. As a winner of the European Commission’s Large AI Grand Challenge, the Latvian company Tilde will be one of the first four companies to have the opportunity to use the most powerful European supercomputer, LUMI. It will help Tilde to develop a multilingual large language model, similar to ChatGPT, for Latvian, Lithuanian and other languages of Europe’s small nations. The amount of data used will be so huge that none of the data hubs previously available in the Baltics or elsewhere in Europe would be able to handle the training of such a large language model. This foundational multilingual model will serve as a basis for the further development of national large language models and the adaptation of AI solutions.
The need for political initiative
To preserve the future ability to use and develop AI tools in Latvian and to compete successfully with other world economies, Latvia has to create a national language model. Nearly every European country already has such initiatives in place. For example, the Netherlands has just started to implement a national programme and has granted several million euros in financing for the development of a national language model. Poland also launched a one-year project in November 2023 with a view to building a national language model. Our neighbours, the Lithuanians, finalised procurement for the development and implementation of Lithuania’s national language model at the end of June. The Estonian government has just granted funding to the University of Tartu for the first step of the development, namely, data identification and collection to serve as a basis for the future training of large language models. Big countries such as Germany, France and Spain already have several versions of a national language model in place.
What should Latvia do? Firstly, government initiative, a dedicated budget and the removal of administrative obstacles are required, as data do have certain limitations. They may also contain confidential information and, therefore, have to be anonymised. Secondly, the involvement of academia and other organisations or data custodians, such as the National Library, the Archives, and also the media, is required. Thirdly, the readiness of Tilde, as well as other industry players, to contribute their expertise and the technological solutions they have developed is clearly also essential. The Latvian language, with its vast lexicological, morphological and syntactical diversity, deserves a special approach to AI development. It should also be kept in mind that the development of a national large language model is not just a technology project but also a matter of preserving the culture and language.
Future prospects and Latvia’s advantages
The power and practical benefits of AI technology in both individual and business applications, for example, in data aggregation, response generation and text analysis, have already been demonstrated beyond doubt. This technology significantly boosts the capacity and productivity of human resources, enabling people to focus on higher value-added tasks.
Being a small country, Latvia is in a position to swiftly adopt and implement new technologies. By developing its national large language model, Latvia can create a major technological breakthrough that will preserve the Latvian language in the digital world of the future while also providing economic benefits and increasing global competitiveness.
Therefore, it is essential for Latvia to be aware of this opportunity and take the required steps towards the creation of the national language model, strengthening the position of the Latvian language and culture in an era of artificial intelligence.
Artūrs Vasiļevskis, Chairman of the Board at Tilde