Making DeepL API machine translations a little bit better
We cannot expect good translations from a machine when it does not know the context
Update: Reserved words only work if you keep the words in place. At first I substituted them with non-translatable tags, but then DeepL no longer sees the words and cannot make sense of the sentence around them.
Of course nothing compares to a translation by a professional translator. Machine translation is difficult. DeepL tries to improve its translations with machine learning, but it is still far from using the proper context.
Let me give you an example. This website is about Python, Flask, computer networking, etc. DeepL does not know this because we cannot tell it. But it should know! The term 'Flask' may be machine-translated to 'Flasche' (bottle) in German. The phrase 'Python has wheels' may be machine-translated to 'Python hat Räder'. The term 'port 25' may be machine-translated to 'Hafen 25'. All are wrong. DeepL should not have translated these terms, but how can it know this?
There is only one way to get a better translation and that is by supplying a translation context to DeepL. In this case that could be, for example, a (hypothetical) parameter like translate_for_context: 'Python,Flask'. Another way would be to give DeepL some pages of the website and let it determine the context itself. Unfortunately, neither is possible at the moment.
I read about people comparing DeepL with Google Translate. Yes, DeepL seems to be the better context translator (sometimes), but is it always a good idea to try to translate within a context? And what is the context in this case? And does a human translator then still know what the intention of the original text was?
So how can we avoid the above problems and get a better translation, or no translation at all? DeepL suggests that you put words that you do not want translated between tags, e.g. <x> and </x>, and then pass this tag in the ignore_tags parameter. This works, but it takes a lot of effort: in every text we must insert tags around the words we do not want translated. I do not want to pollute my texts with do-not-translate tags.
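As a rough illustration of the tag approach, the sketch below only builds the form fields for a translate request and does not send anything. The ignore_tags parameter is the one mentioned above; that it is combined with tag_handling set to 'xml' is my understanding of the DeepL API and should be checked against the current documentation. The function name build_translate_payload is made up for this example.

```python
from urllib.parse import urlencode


def build_translate_payload(text, target_lang, ignore_tag="x"):
    """Build the form fields for a DeepL translate request (hypothetical helper).

    Assumption: 'tag_handling' and 'ignore_tags' are the parameter names
    DeepL expects; verify this against the current API documentation.
    """
    return {
        "text": text,
        "target_lang": target_lang,
        "tag_handling": "xml",      # tell DeepL the text contains XML-like tags
        "ignore_tags": ignore_tag,  # content inside <x>...</x> is left untranslated
    }


# The words we do not want translated are wrapped by hand here.
payload = build_translate_payload(
    "Open <x>port 25</x> for the <x>Flask</x> app.", "DE"
)

# URL-encoded body, ready to be POSTed together with your API key.
body = urlencode(payload)
print(body)
```

Sending the request (endpoint, authentication) is left out on purpose, as it depends on your DeepL plan.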
This website has blog posts about Python, Flask, etc. So instead of putting tags in all the blog posts, we can create a list of reserved words, meaning: do not translate these words. This is still a lot of work because we must also check each blog post for words we do not want translated. But it seems a reasonable way to get a first version of a translated text. And the work we do now does not have to be repeated for future blog posts.
A list of reserved words
I created a list of reserved words and replaced them with untranslatable tags before feeding the text to the DeepL API. This meant I had to go over the default language (English) texts sentence by sentence, word by word. This way I created a list of some 100 reserved words and terms, including 'many-to-many', 'port 25', 'multistage', 'netmask', etc. I also had to make sure that I did not have different spellings of the same term, like MariaDB, Mariadb, MariaDb. The resulting translations are now much better, but the texts are still far from acceptable. What I really need are lists: a list that contains all keywords for Python, a list that contains all keywords for Flask, a list that contains all computer networking terms. Some of these lists must also have translations, because not every term stays in English in another language. And then I still must exclude words myself because they are related to the blog post itself and are not in any list.
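A minimal sketch of how such a reserved-word list could be applied is shown below. It is not the code used for this website. Following the update at the top of this post, it keeps the words in place and only wraps them in a tag, here <x>, so the tag can be passed to DeepL as an ignore tag. The function name, the short example list and the choice to match case-insensitively and normalize to one spelling are all assumptions for illustration.

```python
import re

# A small excerpt; the real list has some 100 entries.
RESERVED_WORDS = [
    "Flask", "Python", "port 25", "many-to-many",
    "MariaDB", "netmask", "multistage",
]


def protect_reserved_words(text, reserved_words, tag="x"):
    """Wrap every reserved word in <tag>...</tag>, keeping it in place."""
    # Longest first, so a longer term would win over a shorter one it contains.
    ordered = sorted(reserved_words, key=len, reverse=True)
    # Map lowercase spelling to the spelling in the list (Mariadb -> MariaDB).
    canonical = {word.lower(): word for word in ordered}
    pattern = re.compile(
        r"(?<!\w)(" + "|".join(re.escape(w) for w in ordered) + r")(?!\w)",
        re.IGNORECASE,
    )

    def wrap(match):
        word = canonical[match.group(1).lower()]
        return f"<{tag}>{word}</{tag}>"

    return pattern.sub(wrap, text)


print(protect_reserved_words(
    "Flask listens on port 25 and uses Mariadb.", RESERVED_WORDS
))
```

The word-boundary checks prevent 'port 25' from matching inside 'port 250'. Case-insensitive matching can also catch an ordinary word that happens to be in the list, so the result still needs a human check.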
The results of machine translation may be acceptable for everyday language, but they are certainly not good enough for technical texts like the ones on this website. It would be nice if we could supply DeepL with a context (Python, Flask, networking, etc.) or let DeepL determine this context by itself, e.g. by supplying some related web pages. Although my website is more of a showcase, I certainly want the translations to be acceptable. For future blog posts I will keep adding reserved words to my list, but I will also follow the DeepL advice to put tags around words I do not want translated. But not too many.