We’re all used to language technologies helping us in everyday activities, but what about those times when they fail? In this blog article, our Chief AI Officer, Mārcis Pinnis, explains how we develop language technologies and why they sometimes struggle.
But first… what are language technologies?
A language technology is any solution that analyses, produces, modifies or responds to human text and speech. If you have a smartphone or a computer, then you use language technologies. All our modern gadgets feature language technologies that help us access information faster or be more productive. For instance, smartphones use language technologies to recognize your speech, search documents or the web, and perform optical character recognition (that is, recognize text within a digital image).
How do we develop language technologies?
First, we need to get access to language data that we can use to train models. Without data, we can’t possibly develop anything. Put simply, language data can be any document containing text, or any audio or video file containing speech.
Once we have our language data, the next step is to train models using it. Nowadays, most language technologies are developed using machine learning and artificial neural networks. For instance, our machine translation systems are transformer-based encoder-decoder models that we train from scratch. Our named entity recognition, sentiment analysis, and intent detection models, by contrast, are trained by fine-tuning foundation models for specific downstream tasks.
And finally, we deploy models for use. Depending on customer requirements, the models can be deployed in local infrastructure or the cloud, and made accessible through APIs, third-party tool plugins, or custom-built user interfaces. For instance, our machine translation systems are available to our customers through plugins for various computer-assisted translation tools, through an API, and on the translate.tilde.com platform. The platform lets users translate text snippets, documents and web pages, and provides a simple online computer-assisted translation tool that people outside the translation industry can use easily.
Language is not constant
The problem with this process is that a model starts to become obsolete as soon as it has been trained, because it has not seen any of the data created after its training data was collected. Everyone who has used ChatGPT has probably come across the disclaimer that it only knows about data up to 2021 (or, in more recent models, up to April 2023). The model is not up to date on current language use.
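To make the idea of a model falling behind current language use a little more concrete, here is a minimal sketch in Python. It is not how our systems work internally; it is only a toy illustration. It assumes that a model's knowledge can be roughly approximated by the set of words in its training data, and it measures how many words in a new text were never seen. The function names, the example sentences, and the newer words in them are all made up for this illustration.

```python
import re


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def build_vocabulary(documents: list[str]) -> set[str]:
    """Collect every word that appears in the training documents."""
    vocabulary: set[str] = set()
    for document in documents:
        vocabulary.update(tokenize(document))
    return vocabulary


def unseen_word_rate(text: str, vocabulary: set[str]) -> float:
    """Return the share of tokens in `text` that are missing from `vocabulary`."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    unseen = sum(1 for token in tokens if token not in vocabulary)
    return unseen / len(tokens)


# Hypothetical training data, collected before the model's cutoff date.
training_documents = [
    "the phone can recognize speech and search the web",
    "we translate documents and web pages",
]
vocabulary = build_vocabulary(training_documents)

# Hypothetical texts: one uses only familiar words, one uses newer words.
older_text = "we translate web pages"
newer_text = "we doomscroll and vibecode on the phone"

older_rate = unseen_word_rate(older_text, vocabulary)
newer_rate = unseen_word_rate(newer_text, vocabulary)

print(f"Unseen words in older text: {older_rate:.0%}")
print(f"Unseen words in newer text: {newer_rate:.0%}")
```

In this toy setup, the older text contains no unseen words, while a noticeable share of the newer text is unknown to the vocabulary. Real neural models split words into smaller units and cope with new words more gracefully, but the underlying gap between the training data and current language use is the same.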