Since the emergence of statistical language models in the 1980s, the field has undergone a remarkable transformation. Over the decades, methodologies for building these AI models have evolved, the volume of processed data has increased, and the models themselves have grown in complexity. The number of parameters – the mathematical coefficients in a model’s formulas – has expanded continuously. By 2022, with the introduction of ChatGPT, the term “large language model” had become widely recognised. Today, it is difficult to find someone who hasn’t heard of it. But what does “large” mean in the context of language models? At one point, a model with 3 billion parameters was considered enormous, yet today such numbers are no longer surprising. OpenAI has not disclosed the exact number of parameters in GPT-4, but estimates suggest it could range from 1 to 10 trillion.
A scientific revolution is underway, with an unprecedented number of publications on large language models. Researchers worldwide are racing to develop increasingly powerful, precise and versatile models. These models must be capable of answering questions, summarising information, translating texts, and handling a variety of language-related tasks. However, a large number of parameters alone does not guarantee success. Each parameter must be tuned to an appropriate mathematical value, which is only possible with sufficient training data. The process begins with designing the model’s architecture – its mathematical formulas and coefficients, which are initially set to random values. As the model processes training data, these coefficients are gradually refined. If the architecture is too large and the dataset too small, the model will be inaccurate and prone to “hallucinations”. Conversely, if the model is too small but the dataset is vast, it will lack the capacity to absorb all the information effectively.
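To make the idea concrete, here is a minimal, purely illustrative sketch in PyTorch – not the training code of any model mentioned here – of a toy model whose coefficients start as random numbers and are nudged toward better values with each batch of training data.

```python
import torch
import torch.nn as nn

# A toy "language model": an embedding table and one linear layer over a tiny
# vocabulary. Real large language models follow the same principle, just with
# billions of such coefficients.
vocab_size, hidden = 100, 16
model = nn.Sequential(
    nn.Embedding(vocab_size, hidden),  # parameters start as random values
    nn.Linear(hidden, vocab_size),     # so do these
)
print(sum(p.numel() for p in model.parameters()), "parameters")

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# One training step: predict the "next word" for a batch of tokens and nudge
# every coefficient in the direction that reduces the error.
tokens = torch.randint(0, vocab_size, (32,))    # stand-in for real text
targets = torch.randint(0, vocab_size, (32,))   # the word that should follow
loss = loss_fn(model(tokens), targets)
loss.backward()     # how should each coefficient change?
optimizer.step()    # apply the change
```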
As user expectations continue to rise, the scarcity of training data is becoming an increasingly pressing issue. Compared to English, German, or Polish, Lithuanian has significantly fewer available texts. Researchers are actively working to address this challenge. For instance, the company Tilde is collaborating with Vytautas Magnus University and Vilnius University to collect more data, which will serve as the foundation for developing more accurate models. Major tech companies such as OpenAI, Meta, and Google DeepMind also face the challenge of limited data for smaller languages. However, their models are multilingual, making them more adaptable. By leveraging knowledge from dominant languages and utilising cross-linguistic connections, multilingual models can better support smaller languages. This is precisely why Tilde is developing its own multilingual model, TildeLM, with a distinct focus on smaller languages like Lithuanian, Latvian and Estonian.

The competition among language models is fierce. Models such as GPT, Mistral, Llama, Gemma, Claude, Bloom and Solar continuously compete for dominance. However, one major challenge remains – there is still little information on how these models perform with smaller languages. For the average user, the high accuracy of these models may seem impressive, but in certain fields, such as medicine or law, even minor errors can have serious consequences. Additionally, many of the most advanced models (e.g., GPT) are proprietary and controlled by private companies that regulate data access. This raises concerns about the security of sensitive data and complicates their application in critical domains.
Are there any alternatives? Yes, open-weight models! Users can download them directly onto their computers and either: 1) fine-tune them with their own data, or 2) use them as they are with carefully crafted prompts. The first approach requires additional training data and significant computational resources, making it inaccessible to many users. Therefore, we focused on evaluating the second approach in our recent study, which has been accepted for presentation at the NoDaLiDa & Baltic HLT conference.
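As a rough illustration of the second approach, the sketch below downloads an open-weight model from the Hugging Face Hub and queries it with a plain prompt – no further training involved. The model name and prompt are illustrative assumptions, not the exact setup used in the study, and gated checkpoints such as Llama require accepting the licence first.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.1-8B-Instruct"  # example open-weight checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Prompt engineering instead of fine-tuning: all the instructions live in the prompt.
messages = [
    {"role": "system", "content": "Answer concisely, in fluent Lithuanian."},
    {"role": "user", "content": "What is a large language model?"},
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```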
The study tested 12 different language models to assess their ability to understand and generate text in Lithuanian, Latvian and Estonian. It included both proprietary models (GPT-3.5 Turbo, GPT-4, and GPT-4o) and open-weight models (Llama 3, 3.1, and 3.2 with 3, 8, and 70 billion parameters; Mistral with 12 billion; Gemma2 with 9 and 27 billion; and Phi with 3 and 14 billion).
The first experiment focused on machine translation accuracy, comparing translations between English and the three languages of the Baltic states. Unsurprisingly, the GPT models performed best, but Gemma2 (27 billion) and Llama 3.1 (70 billion) delivered translation quality comparable to theirs, while the Phi models performed the worst. The results were also compared with DeepL, one of the most advanced machine translation systems, whose translation quality matched that of GPT-4o. This indicates that large language models are now capable of producing translations that rival specialised translation systems.
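For readers curious how such comparisons are typically scored, here is a hedged sketch using the chrF metric from the sacrebleu library; the metric, sentences and outputs are illustrative and not taken from the paper itself.

```python
import sacrebleu

# A reference translation and two hypothetical system outputs for one sentence.
references = [["Kalbos modeliai keičia tai, kaip dirbame su tekstu."]]
system_outputs = {
    "system A": ["Kalbos modeliai keičia tai, kaip mes dirbame su tekstu."],
    "system B": ["Kalbų modelis pakeitė darbą su tekstais."],
}

# chrF compares character n-grams of each output against the reference;
# higher scores mean the output is closer to the human translation.
for name, hypothesis in system_outputs.items():
    score = sacrebleu.corpus_chrf(hypothesis, references)
    print(f"{name}: chrF = {score.score:.1f}")
```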
In another task, the models had to answer multiple-choice questions in Lithuanian, Latvian and Estonian. This required not only comprehension but also the ability to present the correct answer in the appropriate format. Once again, the top-performing models were GPT-4o, Llama 3.1 (70 billion), and Gemma2 (27 billion). However, when compared to English, the accuracy in smaller languages was significantly lower.
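A simple sketch of what such a multiple-choice setup can look like is shown below: the prompt lists lettered options, and the model’s reply is reduced to a single letter so it can be checked automatically. The question, options and parsing rule are illustrative assumptions, not the actual benchmark items.

```python
import re

def build_prompt(question, options):
    """Format a question with lettered options and an explicit answer instruction."""
    lines = [question] + [f"{letter}) {text}" for letter, text in options.items()]
    lines.append("Answer with a single letter (A, B, C or D).")
    return "\n".join(lines)

def parse_answer(model_output):
    """Pull the first standalone letter A-D out of the model's reply, if any."""
    match = re.search(r"\b([ABCD])\b", model_output.upper())
    return match.group(1) if match else None

prompt = build_prompt(
    "Kuris miestas yra Lietuvos sostinė?",  # "Which city is the capital of Lithuania?"
    {"A": "Ryga", "B": "Kaunas", "C": "Vilnius", "D": "Tallinn"},
)
print(prompt)
print(parse_answer("Atsakymas: C) Vilnius"))  # -> "C"
```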
The third experiment assessed how well these models could answer open questions in Lithuanian and Latvian across different domains. The best models achieved 80–90% accuracy. Text generation fluency was also evaluated, and the top three models remained the same here as well. Additionally, we tested monolingual Llama2 models (7 billion and 13 billion parameters) developed by Neurotechnology and specifically adapted for Lithuanian. These models generated exceptionally fluent Lithuanian text, but their accuracy was still significantly lower than that of the large multilingual models.
This research reaffirmed a crucial fact: the quality and diversity of training data are paramount. If we want language models to achieve the same level of accuracy in smaller languages as they do in larger ones, active collaboration within the scientific community is essential. This includes continuous data collection and the development of specialised models. Instead of competing, let’s collaborate!