TildeLM: Transforming AI for a multilingual Europe

We are developing TildeLM, an open foundational LLM (large language model) with over 30 billion parameters covering all European languages, with a focus on Baltic and Eastern European languages. Supported by the European Commission, TildeLM is set to revolutionise the AI landscape, ensuring our region benefits from cutting-edge technology.

THE CHALLENGE

Championing language equity

Most AI models focus on major languages, with over 90% of their training data in English, leaving Baltic and Eastern European languages underrepresented. This imbalance results in lower-quality AI outcomes and limited access to advanced technologies for speakers of these languages. TildeLM addresses this by aiming to represent all supported languages equally throughout training.

THE SOLUTION

Building an open model for Europe

TildeLM is being developed to represent a broad range of European languages, including Bulgarian, Latvian, Ukrainian, and others. This model is more than just a technological achievement; it’s a commitment to creating a resource that is fully open and serves as the foundation for a wide array of AI applications, benefiting over 155 million Europeans.

USE CASES AND APPLICATIONS

Powering meaningful innovations across sectors

National Language Models
Governments can leverage TildeLM to create tailored language models that improve public service accessibility for all citizens.
Research and Development
Researchers can use TildeLM to study languages, enhance translation systems, and create novel language technology applications.
Technological Innovation
Businesses can use TildeLM to advance multilingual AI applications, like virtual assistants, text generation, and speech technologies.
Industry-Specific Solutions
Healthcare and legal industries can use TildeLM for accurate multilingual processing and translation.

COMPUTING RESOURCES

Excellence driven by Europe’s most advanced supercomputer

The development of TildeLM is being accelerated by access to the LUMI supercomputer, awarded as part of the Large AI Grand Challenge. With 2 million GPU hours at our disposal, LUMI’s immense computational power is crucial for efficiently executing this ambitious project.

OUR PROMISE

Committing to open collaboration

We are dedicated to open science principles and ethical data handling, making TildeLM freely available. We believe that collaboration and shared knowledge are key to innovation, and we invite researchers, developers, and data providers to join us in this mission.

Open access

TildeLM will be available for both commercial and non-commercial use under a permissive license and published on Hugging Face and ELRC-SHARE.

Integrity and security

We guarantee that TildeLM is safe and free from harmful or inaccurate content, ensuring its reliability for a variety of public use cases.

Knowledge sharing

We are committed to collaboration and sharing insights, inviting partners to work with us in advancing TildeLM for the benefit of all.

Contribute to a multilingual future

To build a robust multilingual language model with over 30B parameters, we need contributions of language data from across Europe. We welcome involvement from authors, publishers, state libraries, and others who can provide valuable content, with flexible terms to accommodate your needs. This platform is where we share our progress and invite you to be part of this groundbreaking initiative.

Your involvement is essential to ensuring that every language has a voice in the digital age.

Frequently asked questions

What is TildeLM?
The TildeLM project aims to create a multilingual foundational large language model that focuses on underrepresented Baltic and Eastern European languages to promote digital equity and enhance access to advanced AI technologies for these communities.
Why is language equity in LLMs important?
Most AI models are trained predominantly on English data, and this imbalance has efficiency and cost consequences. For instance, lower-resourced languages need longer sequences than English to encode the same amount of information, which makes models less efficient and more expensive to run. Additionally, the English-centricity of these models can introduce undesirable cultural biases. TildeLM will be trained to ensure equity for all supported languages.
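As a rough illustration of the sequence-length point, the sketch below compares how many UTF-8 bytes are needed for three sentences of similar meaning. It rests on stated assumptions: bytes stand in for tokens (as in a byte-level tokenizer before any merges), the sample sentences are approximate translations written for this example, and the names `SAMPLES`, `byte_length`, and `bytes_per_char` are hypothetical. It does not describe TildeLM's actual tokenizer.

```python
# Illustrative sketch only: UTF-8 bytes are used as a stand-in for tokens.
# This is an assumption; TildeLM's real tokenizer is not described here.

# Hypothetical sample sentences with roughly the same meaning.
SAMPLES = {
    "English": "Every language deserves a voice in the digital age.",
    "Latvian": "Katra valoda ir pelnījusi balsi digitālajā laikmetā.",
    "Ukrainian": "Кожна мова заслуговує на голос у цифрову епоху.",
}


def byte_length(text: str) -> int:
    """Return the number of UTF-8 bytes needed to encode the text."""
    return len(text.encode("utf-8"))


def bytes_per_char(text: str) -> float:
    """Return the average number of UTF-8 bytes per character."""
    return byte_length(text) / len(text)


if __name__ == "__main__":
    # Print a small comparison table, one row per language.
    for language, sentence in SAMPLES.items():
        print(
            f"{language:10s} chars={len(sentence):3d} "
            f"bytes={byte_length(sentence):3d} "
            f"bytes/char={bytes_per_char(sentence):.2f}"
        )
```

Plain ASCII English uses one byte per character, while Latvian diacritics and Cyrillic letters use two bytes each, so the same message takes a longer byte sequence. Real subword tokenizers differ in detail, but a vocabulary learned mostly from English text tends to split other languages into more pieces in a similar way.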
What languages does the TildeLM project focus on?
The project targets Eastern European and Baltic languages such as Bulgarian, Croatian, Czech, Estonian, Finnish, Latvian, Lithuanian, Macedonian, Montenegrin, Polish, Serbian, Slovak, Slovene, and Ukrainian. The model will also support more widely used languages such as English, French, German, and Russian in balanced proportions to support translation and related multilingual tasks.
What is the LUMI supercomputer?
The LUMI (Large Unified Modern Infrastructure) supercomputer is the fifth-fastest supercomputer globally and the fastest in Europe. It is part of the EuroHPC Joint Undertaking, a collaborative effort involving the European Union and European countries to create a world-class high-performance computing (HPC) ecosystem in Europe. The LUMI supercomputer is located in Kajaani, Finland.
What is the Large AI Grand Challenge?
The purpose of the Large AI Grand Challenge, funded by the European Commission, is to expand European AI frontiers by harnessing the potential of large-scale AI models. The participants in the competition were innovative startups and SMEs with the technical capacity to develop AI models that boost Europe’s competitiveness in Generative AI. The European Commission has announced the winners of the Large AI Grand Challenge. Four innovative AI companies from Europe, including Tilde, will share a prize of €1 million and 8 million computational hours to advance Europe's leadership in AI development. 
What is Tilde?
Tilde is a leading European language technology innovator and service provider with a mission to promote language diversity in the digital age. Tilde has over 150 employees in three offices located in Riga, Vilnius, and Tallinn. Tilde’s research team comprises nine PhDs and their research associates and has authored over 260 scientific publications. Over the years, Tilde has developed a vast R&D partnership network with leading EU research centres and universities and serves as a language technology research hub for the Baltic region. Tilde’s most recent research and development activities focus on foundational large language models (LLMs), fine-tuning of LLMs for downstream applications, and integration of instruction-tuned LLMs into natural language processing applications (e.g., machine translation, virtual assistants, retrieval-augmented generation systems, processing of spoken language, and summarisation).