Converting to a multilanguage database
We add a table with translations for every table that has fields that must be shown in multiple languages.
When I started this project, this website, I kept in mind that it had to support multiple languages. But of course I did not really look into database design because of all the other things I had to learn, add and implement. After a certain moment I could display and select multiple languages but that was for the texts in the code and templates and not for the database content items like blog posts and pages.
To make the content multilanguage I researched and found that there are several ways to do this, each having its pros and cons. If you are new to the subject I recommend looking at the links below. I decided there is only one right way to do this and that is by adding a 'translation table' for every table that contains fields that must be available in multiple languages, or, if you prefer, adding a 'translation object' for every object that contains attributes that must be available in multiple languages. I know that I now will face more complex queries, like fallback to the default language, and using counting, pagination, limit.
Default language, selected language and language fallback
Language fallback is the mechanism where we show the content items in the default language if it is not available for the selected language. Language fallback is a decision you have to make.
Without language fallback you just show the content items in the selected language, if not available then too bad (403). This is relatively easy. If the selected language is German show only the German items, if the selected language is English show only the English items.
With language fallback you show the content items in the selected language, if not available then show them in the default language. The default language is the language that always must be present. For this website I decided to implement language fallback and in the future I would like to be able to not show certain fallback items at all for a selected language.
Adding the languages to the database
An extra table must be added for all multilanguage content items. I already defined languages in the config but now created a language table:
class Language(Base): __tablename__ = 'language' id = Column(Integer, primary_key=True) # enable/disable language is_active = Column(Boolean, default=False, index=True) # name in your own language name = Column(String(100), server_default='', index=True) # short name in your own language short_name = Column(String(100), server_default='', index=True) # name in native language native_name = Column(String(100), server_default='', index=True) # short name in native language native_short_name = Column(String(100), server_default='', index=True) # language_region_code: en_US, de_DE, ... language_region_code = Column(String(16), server_default='', index=True) # lang_code: actual code shown to visitor, en, es, de (but also can be en_US, es_ES, etc.) language_code = Column(String(16), server_default='', index=True)
This also allows for enabling and disabling a language in future. Then I used the content item category so see how this works, I just added the category translation table which is almost a copy of the category table:
class ContentItemCategory(Base): __tablename__ = 'content_item_category' id = Column(Integer, primary_key=True) name = Column(String(50), server_default='') description = Column(String(200), server_default='') # one-to-many relationship with translation # we only use this relationship to append translations content_item_category_translations = relationship( 'ContentItemCategoryTranslation', backref='content_item_category', lazy='dynamic') class ContentItemCategoryTranslation(Base): __tablename__ = 'content_item_category_translation' id = Column(Integer, primary_key=True) # name, slug, description in specified language name = Column(String(100), server_default='') slug = Column(String(100), server_default='') description = Column(String(200), server_default='') # one-to-many relationship with language language_id = Column(Integer, ForeignKey('language.id')) # one-to-many relationship with content_item_category content_item_category_id = Column(Integer, ForeignKey('content_item_category.id'))
After implementing the changes and being able to create and edit categories and translations it was time for some testing. My main concern was how to get the data from this new setup, including how to implement fallback to a default language.
Moving to the translations did not was much work as I kept the initial version of the translation tables almost identical to the tables. For the translation tables I added the language_id, set to the default language, and content_item_id, set to the id field.
INSERT INTO content_item_translation (id, ..., language_id, content_item_id) select id, ..., 1, id from content_item;
The SQL to get the categories does not look to complicated. The translation table is aliased into two tables, the first one, tr_selected, used to query categories for the selected language, and the second one, tr_default, to query categories for the default language. Below, we select the name and slug fields, language_id=3 means German and language_id=1 means English.
If the translation for the selected language is not found then the default language is used.
Important: note that the title field in the model has the empty string as default value. So the query below below works but only if language record was not specified.
SELECT category.id, category.name, IFNULL(tr_selected.name, tr_default.name) name, IFNULL(tr_selected.slug, tr_default.slug) slug FROM content_item_category category LEFT OUTER JOIN content_item_category_translation tr_selected ON category.id = tr_selected.content_item_category_id AND tr_selected.language_id = 3 LEFT OUTER JOIN content_item_category_translation tr_default ON category.id = tr_default.content_item_category_id AND tr_default.language_id = 1 WHERE category.id = 3;
To do this with SQLAlchemy we first we alias the translation table. Next we use func.coalesce replacing the IFNULL, and the remainder looks vary much like the SQL query:
tr_selected, tr_default = aliased(ContentItemCategoryTranslation), aliased(ContentItemCategoryTranslation) r = db.query( func.coalesce(tr_selected.name, tr_default.name), func.coalesce(tr_selected.slug, tr_default.slug)).\ select_from(ContentItemCategory).\ outerjoin(tr_selected, and_((ContentItemCategory.id == tr_selected.content_item_category_id), (tr_selected.language_id == 3)) ).\ outerjoin(tr_default, and_((ContentItemCategory.id == tr_default.content_item_category_id), (tr_default.language_id == 1)) ).all()
Should we select attributes (fields) or objects (records)? Translation table records are only present if a translation has been added. It seems logical to select objects instead of attributes but this is not possible with SQLAlchemy, see also below.
We want the blog posts selected by creation date. Also because we filter the joined sub results, using published = 1, we get many NULL values for title and slug. The reason is of course that the result returns all records of the content_item table. We can remove these NULL records by adding a HAVING statement add the end. Now the result is correct.
But there is a catch. At the moment the number of records in the content_item table is 230. The number of published blog posts is 12. This means that the HAVING part must filter over 200 record, 95%!
The WHERE part was added to filter the content_item_type, published and content_item_parent_id fields. This is done to reduce the number of records for the HAVING part. I also added an ORDER BY clause to show the most recent blog posts first.
SELECT DISTINCT content_item.id, content_item.title, IFNULL(tr_selected.created_on, tr_default.created_on) created_on, IFNULL(tr_selected.title, tr_default.title) tr_title, IFNULL(tr_selected.slug, tr_default.slug) slug FROM content_item_translation tr_where, content_item LEFT OUTER JOIN content_item_translation tr_selected ON ( content_item.id = tr_selected.content_item_id AND tr_selected.published = 1 AND tr_selected.content_item_parent_id = 0 AND tr_selected.language_id = 3) LEFT OUTER JOIN content_item_translation tr_default ON ( content_item.id = tr_default.content_item_id AND tr_default.published = 1 AND tr_default.content_item_parent_id = 0 AND tr_default.language_id = 1) WHERE content_item.id = tr_where.content_item_id AND content_item.content_item_type = 1 AND tr_where.published = 1 AND tr_default.content_item_parent_id = 0 HAVING tr_title IS NOT NULL ORDER BY created_on desc;
The query above is based on an example from the internet. I am still not sure about the HAVING part and also had to add DISTINCT to remove duplicate rows. Anyway, the SQLAlchemy query equivalent of the above SQL query is:
tr_selected, tr_default, tr_where = aliased(ContentItemTranslation), aliased(ContentItemTranslation), aliased(ContentItemTranslation) result_tuples = db.query( ContentItem.id, ContentItem.title, func.coalesce(tr_selected.created_on, tr_default.created_on).label('tr_created_on'), func.coalesce(tr_selected.title, tr_default.title).label('tr_title'), func.coalesce(tr_selected.slug, tr_default.slug).label('tr_slug')).\ select_from(tr_where, ContentItem).\ outerjoin(tr_selected, and_( (ContentItem.id == tr_selected.content_item_id), (tr_selected.published == 1), (tr_selected.content_item_parent_id == 0), (tr_selected.language_id == 3))).\ outerjoin(tr_default, and_( (ContentItem.id == tr_default.content_item_id), (tr_default.published == 1), (tr_default.content_item_parent_id == 0), (tr_default.language_id == 1))).\ filter(and_( (ContentItem.id == tr_where.content_item_id), (ContentItem.content_item_type == 1), (tr_where.published == 1), (tr_where.content_item_parent_id == 0))).\ having(literal_column('tr_title').isnot(None)).\ distinct().\ order_by(desc('tr_created_on')).all()
I had to search the internet why HAVING was working with MariaDB but not working with SQLAlchemy, see also links below. We need the literal_column here, this only seems to work for MySQL / MariaDB (?). I cannot confirm this, I only use MariaDB.
Let us assume that when a language translation is published, everything is ok meaning all fields have been assigned a value. In this case we only need to check for the existence of a translation record. Fortunately the above query is doing this for us. If the translation model defines a title field with a default value of '' (empty string), then the query responds with NULL for the title if the translation record does not exists. Exactly what we want.
The above query returns attributes, not objects. I do not know a way to return objects as the following changes fail because they can only return literal values:
func.coalesce(tr_selected, tr_default) case([(ContentItem.title == None, tr_default)], else_ = tr_selected),
For now I do not know another way then to add all attributes to the query(?).
Another way to implement language fallback
I love simple select queries and the above outerjoin query is complex so let us try to look at another way of doing this. I would like to have full objects returned, a list of (content_item, content_item_translation) tuples, for easy processing in the template. Assume the selected language is German and the default language is English. Our published content items can be as follows:
+-----------------+-----------------------------+-----------------------------+ | content_item.id | content_item_translation.id | content_item_translation.id | | | EN | DE | +-----------------+-----------------------------+-----------------------------+ | 7 | | | | 6 | 6 | | | 5 | 4 | 5 | | 4 | | | | 3 | 2 | 3 | | 2 | 1 | | | 1 | | | +-----------------+-----------------------------+-----------------------------+
Let us assume these are blog posts. Then we have 4 blog posts in English (language_id=1), and 2 in German (language_id=3). Also assume language fallback is always done and the default language items are always present. Then the total number of blog posts is nothing more then counting the number of blog posts for the default language. We need to use both the content_item and the content_item_translation tables.
# select EN titles SELECT ci.id as ci_id, ci_tr.id as ci_tr_id, ci_tr.title as ci_tr_title FROM content_item ci, content_item_translation ci_tr WHERE ci.content_item_type = 1 AND ci_tr.content_item_id = ci.id AND ci_tr.language_id = 1 AND ci_tr.published = 1 AND ci_tr.content_item_parent_id = 0; # count EN titles SELECT count( distinct ci_tr.id ) FROM content_item ci, content_item_translation ci_tr WHERE ci.content_item_type = 1 AND ci_tr.content_item_id = ci.id AND ci_tr.language_id = 1 AND ci_tr.published = 1 AND ci_tr.content_item_parent_id = 0; # get content_item.id list for DE SELECT distinct ci.id FROM content_item ci, content_item_translation ci_tr WHERE ci.content_item_type = 1 AND ci_tr.content_item_id = ci.id AND ci_tr.language_id = 3 AND ci_tr.published = 1 AND ci_tr.content_item_parent_id = 0; # get content_item.id list for EN without DE SELECT distinct ci.id FROM content_item ci, content_item_translation ci_tr WHERE ci.content_item_type = 1 AND (ci_tr.content_item_id = ci.id AND ci_tr.language_id = 1 AND ci_tr.published = 1 AND ci_tr.content_item_parent_id = 0) AND ci.id NOT IN ( SELECT distinct cis.id FROM content_item cis, content_item_translation cis_tr WHERE cis.content_item_type = 1 AND (cis_tr.content_item_id = cis.id AND cis_tr.language_id = 3 AND cis_tr.published = 1 AND cis_tr.content_item_parent_id = 0) );
What follows from the above is that when the selected language is German, fallback must be done when the content_item.id has English but not German translations. Using a union we can get, merge and order the items for German and English. We can use UNION_ALL here because we know the German and English items are separated. At the end we put a ORDER BY to sort the results.
# get german items SELECT ci.id as ci_id, ci_tr.id as ci_tr_id, ci_tr.created_on as ci_tr_created_on, ci_tr.title as ci_tr_title FROM content_item ci, content_item_translation ci_tr WHERE ci.content_item_type = 1 AND ci_tr.content_item_id = ci.id AND ci_tr.language_id = 3 AND ci_tr.published = 1 AND ci_tr.content_item_parent_id = 0 UNION ALL # add english items SELECT ci.id as ci_id, ci_tr.id as ci_tr_id, ci_tr.created_on as ci_tr_created_on, ci_tr.title as ci_tr_title FROM content_item ci, content_item_translation ci_tr WHERE ci.content_item_type = 1 AND ci_tr.content_item_id = ci.id AND ci.id IN ( SELECT distinct cid.id FROM content_item cid, content_item_translation cid_tr WHERE cid.content_item_type = 1 AND (cid_tr.content_item_id = cid.id AND cid_tr.language_id = 1 AND cid_tr.published = 1 AND cid_tr.content_item_parent_id = 0) AND cid.id NOT IN ( SELECT distinct cis.id FROM content_item cis, content_item_translation cis_tr WHERE cis.content_item_type = 1 AND (cis_tr.content_item_id = cis.id AND cis_tr.language_id = 3 AND cis_tr.published = 1 AND cis_tr.content_item_parent_id = 0) ) ) ORDER BY ci_tr_created_on DESC;
We can also add pagination by adding LIMIT 2 OFFSET 1 to the query. This works. Is it better than the outerjoin query? Some tests showed it is at least two times slower. But it is simple which also is a big plus. And it is now possible to get objects from the SQLAlchemy query instead of attributes. Just what I wanted, that's why I went 'object'. To improve performance we can cache the query results, they do not change often. But that can be done later. The equivalent SQLAlchemy query is:
# content_item_type is a constant, 1 = blog post # default language is english (id = 1) # language_id is the id of the selected language ci, ci_tr = aliased(ContentItem), aliased(ContentItemTranslation) s1 = db.query(ci, ci_tr).\ filter(and_(\ (ci.content_item_type == content_item_type), (ci_tr.content_item_id == ci.id), (ci_tr.language_id == language.id), (ci_tr.published == 1), (ci_tr.content_item_parent_id == 0), )) cisub, cisub_tr = aliased(ContentItem), aliased(ContentItemTranslation) s2_subquery = db.query(cisub.id).\ filter(and_(\ (cisub.content_item_type == content_item_type), (cisub_tr.content_item_id == cisub.id), (cisub_tr.language_id == language.id), (cisub_tr.published == 1), (cisub_tr.content_item_parent_id == 0))) s2 = db.query(ci, ci_tr).\ filter(and_(\ (ci.content_item_type == content_item_type), (ci_tr.content_item_id == ci.id), (ci_tr.language_id == 1), (ci_tr.published == 1), (ci_tr.content_item_parent_id == 0), (ci.id.notin_( s2_subquery )))) q = s1.union(s2).order_by(desc(ci.created_on)) content_item_content_item_translation_tuples = q.all()
I am happy because the union query returns tuples with objects that can be passed without manipulation to the Jinja template. And to get the blog posts for pagination we simply add the offset and limit functions to the query. Getting the total number of blog posts is simply getting the number of blog posts for the default language.
There also is something confusing here. You can order by the created_on of the content item or the created on of the content item translation created_on. If you use the latter one, your blog posts can get mixed up. So for the moment I order by the created_on of the blog_post, not of the translated versions.
To have proof that it works I added two German blog post and one Spanish blog post. For the moment I am still in the process of converting, at time of writing this post I only translated the title, subtitle and meta data. Check German and Spanish to see changes.
This post describes a way to add multiple language support for content items like blog posts and pages. The added translation table for every table that needs translations makes things more complicated. I am a little bit worried about memory usage (and performance) when the content items table grows. But we can use query result caching (e.g. 1-5 minutes) as these values do not change often.
When creating queries I often start using SQL, and then later transform this to SQLAlchemy, but I am not the only one.
In one of the (very) next posts I may show a diagram of the models involved. If you have been reading previous posts this will not be surprising. The content-item has a one-to-many relationship with its translation model content-item-translation. The content-item model has many-to-many relationships with the content-item-category and content-item-tag models, both also having translation tables. And the content-item-category model has a one-to-many relationship with its translation model content-item-category-translation. Etc.
Links / credits
Aliasing field names in SQLAlchemy model or underlying SQL table
How to Design a Localization-Ready System
Multi language database, with default fallback
Multilanguage Database Design in MySQL
python sqlalchemy label usage
Schema for a multilanguage database
sqlalchemy IS NOT NULL select
SQLAlchemy reference label in Having clause
Storing multilingual records in the MySQL database
Using Language Identifiers (RFC 3066)
What's the best database structure to keep multilingual data? [duplicate]
- Migrating from Bootstrap 4 to Bootstrap 5
- Using Python's pyOpenSSL to verify SSL certificates downloaded from a host
- Why your website canonical name must be 'www' (or 'app' or something else)
- Flask's SERVER_NAME, subdomains and 404 errors
- Flask, Celery, Redis and Docker
- Celery, Redis and the (in)famous email task example
- Using UUIDs instead of Integer Autoincrement Primary Keys with SQLAlchemy and MariaDb
- Flask with multiple forms on a page posted using AJAX and returning a rendered form result
- Two Flask apps, frontend and admin, on one domain using DispatcherMiddleware
- SLQAlchemy dynamic query building and filtering including soft deletes
- SQLAlchemy server-side datetime calculations
- OWASP Zed Attack Proxy (ZAP) security testing