Converting to a multilanguage database

25 August 2019 Updated 30 August 2019 by Peter

We add a table with translations for every table that has fields that must be shown in multiple languages.

post main image
unsplash.com/@kevnbhagat

When I started this project, this website, I kept in mind that it had to support multiple languages. But of course I did not really look into database design because of all the other things I had to learn, add and implement. After a certain moment I could display and select multiple languages but that was for the texts in the code and templates and not for the database content items like blog posts and pages.

To make the content multilanguage I researched and found that there are several ways to do this, each having its pros and cons. If you are new to the subject I recommend looking at the links below. I decided there is only one right way to do this and that is by adding a 'translation table' for every table that contains fields that must be available in multiple languages, or, if you prefer, adding a 'translation object' for every object that contains attributes that must be available in multiple languages. I know that I now will face more complex queries, like fallback to the default language, and using counting, pagination, limit.

Default language, selected language and language fallback

Language fallback is the mechanism where we show the content items in the default language if it is not available for the selected language. Language fallback is a decision you have to make.

Without language fallback you just show the content items in the selected language, if not available then too bad (403). This is relatively easy. If the selected language is German show only the German items, if the selected language is English show only the English items.

With language fallback you show the content items in the selected language, if not available then show them in the default language. The default language is the language that always must be present. For this website I decided to implement language fallback and in the future I would like to be able to not show certain fallback items at all for a selected language.

Adding the languages to the database

An extra table must be added for all multilanguage content items. I already defined languages in the config but now created a language table:

class Language(Base):

    __tablename__ = 'language'

    id = Column(Integer, primary_key=True)

    # enable/disable language
    is_active = Column(Boolean, default=False, index=True)

    # name in your own language
    name = Column(String(100), server_default='', index=True)

    # short name in your own language
    short_name = Column(String(100), server_default='', index=True)

    # name in native language
    native_name = Column(String(100), server_default='', index=True)

    # short name in native language
    native_short_name = Column(String(100), server_default='', index=True)

    # language_region_code: en_US, de_DE, ...
    language_region_code = Column(String(16), server_default='', index=True)

    # lang_code: actual code shown to visitor, en, es, de (but also can be en_US, es_ES, etc.)
    language_code = Column(String(16), server_default='', index=True)

This also allows for enabling and disabling a language in future. Then I used the content item category so see how this works, I just added the category translation table which is almost a copy of the category table:

class ContentItemCategory(Base):

    __tablename__ = 'content_item_category'

    id = Column(Integer, primary_key=True)

    name = Column(String(50), server_default='')
    description = Column(String(200), server_default='')

    # one-to-many relationship with translation
    # we only use this relationship to append translations
    content_item_category_translations = relationship(
        'ContentItemCategoryTranslation', 
        backref='content_item_category',
        lazy='dynamic')


class ContentItemCategoryTranslation(Base):

    __tablename__ = 'content_item_category_translation'

    id = Column(Integer, primary_key=True)

    # name, slug, description in specified language
    name = Column(String(100), server_default='')
    slug = Column(String(100), server_default='')
    description = Column(String(200), server_default='')

    # one-to-many relationship with language
    language_id = Column(Integer, ForeignKey('language.id'))

    # one-to-many relationship with content_item_category
    content_item_category_id = Column(Integer, ForeignKey('content_item_category.id'))

After implementing the changes and being able to create and edit categories and translations it was time for some testing. My main concern was how to get the data from this new setup, including how to implement fallback to a default language.

Moving to the translations did not was much work as I kept the initial version of the translation tables almost identical to the tables. For the translation tables I added the language_id, set to the default language, and content_item_id, set to the id field.

INSERT INTO content_item_translation (id, ..., language_id, content_item_id) select id, ..., 1, id from content_item;

The SQL to get the categories does not look to complicated. The translation table is aliased into two tables, the first one, tr_selected, used to query categories for the selected language, and the second one, tr_default, to query categories for the default language. Below, we select the name and slug fields, language_id=3 means German and language_id=1 means English.
If the translation for the selected language is not found then the default language is used.

Important: note that the title field in the model has the empty string as default value. So the query below below works but only if language record was not specified.

SELECT
    category.id,
    category.name,
    IFNULL(tr_selected.name, tr_default.name) name,
    IFNULL(tr_selected.slug, tr_default.slug) slug
  FROM content_item_category category
  LEFT OUTER JOIN content_item_category_translation tr_selected
    ON category.id = tr_selected.content_item_category_id AND tr_selected.language_id = 3
  LEFT OUTER JOIN content_item_category_translation tr_default
    ON category.id = tr_default.content_item_category_id AND tr_default.language_id = 1
  WHERE category.id = 3;

To do this with SQLAlchemy we first we alias the translation table. Next we use func.coalesce replacing the IFNULL, and the remainder looks vary much like the SQL query:

tr_selected, tr_default = aliased(ContentItemCategoryTranslation), aliased(ContentItemCategoryTranslation)
r = db.query(
  func.coalesce(tr_selected.name, tr_default.name),
  func.coalesce(tr_selected.slug, tr_default.slug)).\
    select_from(ContentItemCategory).\
    outerjoin(tr_selected, and_((ContentItemCategory.id == tr_selected.content_item_category_id), (tr_selected.language_id == 3))  ).\
    outerjoin(tr_default, and_((ContentItemCategory.id == tr_default.content_item_category_id), (tr_default.language_id == 1))  ).all()

Should we select attributes (fields) or objects (records)? Translation table records are only present if a translation has been added. It seems logical to select objects instead of attributes but this is not possible with SQLAlchemy, see also below.

We want the blog posts selected by creation date. Also because we filter the joined sub results, using published = 1, we get many NULL values for title and slug. The reason is of course that the result returns all records of the content_item table. We can remove these NULL records by adding a HAVING statement add the end. Now the result is correct.
But there is a catch. At the moment the number of records in the content_item table is 230. The number of published blog posts is 12. This means that the HAVING part must filter over 200 record, 95%!

The WHERE part was added to filter the content_item_type, published and content_item_parent_id fields. This is done to reduce the number of records for the HAVING part. I also added an ORDER BY clause to show the most recent blog posts first.

SELECT
    DISTINCT
    content_item.id,
    content_item.title,
    IFNULL(tr_selected.created_on, tr_default.created_on) created_on,
    IFNULL(tr_selected.title, tr_default.title) tr_title,
    IFNULL(tr_selected.slug, tr_default.slug) slug
  FROM content_item_translation tr_where, content_item
  LEFT OUTER JOIN content_item_translation tr_selected
    ON (
      content_item.id = tr_selected.content_item_id 
      AND tr_selected.published = 1 
      AND tr_selected.content_item_parent_id = 0 
      AND tr_selected.language_id = 3)
  LEFT OUTER JOIN content_item_translation tr_default
    ON (
      content_item.id = tr_default.content_item_id
      AND tr_default.published = 1 
      AND tr_default.content_item_parent_id = 0 
      AND tr_default.language_id = 1)
  WHERE
    content_item.id = tr_where.content_item_id
    AND content_item.content_item_type = 1
    AND tr_where.published = 1
    AND tr_default.content_item_parent_id = 0
  HAVING 
    tr_title IS NOT NULL
  ORDER BY created_on desc;

The query above is based on an example from the internet. I am still not sure about the HAVING part and also had to add DISTINCT to remove duplicate rows. Anyway, the SQLAlchemy query equivalent of the above SQL query is:

tr_selected, tr_default, tr_where = aliased(ContentItemTranslation), aliased(ContentItemTranslation), aliased(ContentItemTranslation)
result_tuples = db.query(
    ContentItem.id, 
    ContentItem.title, 
    func.coalesce(tr_selected.created_on, tr_default.created_on).label('tr_created_on'),
    func.coalesce(tr_selected.title, tr_default.title).label('tr_title'),
    func.coalesce(tr_selected.slug, tr_default.slug).label('tr_slug')).\
    select_from(tr_where, ContentItem).\
    outerjoin(tr_selected, and_(
            (ContentItem.id == tr_selected.content_item_id),
            (tr_selected.published == 1),
            (tr_selected.content_item_parent_id == 0),
            (tr_selected.language_id == 3))).\
    outerjoin(tr_default, and_(
            (ContentItem.id == tr_default.content_item_id), 
            (tr_default.published == 1),
            (tr_default.content_item_parent_id == 0),
            (tr_default.language_id == 1))).\
    filter(and_(
        (ContentItem.id == tr_where.content_item_id),
        (ContentItem.content_item_type == 1),
        (tr_where.published == 1),
        (tr_where.content_item_parent_id == 0))).\
    having(literal_column('tr_title').isnot(None)).\
    distinct().\
    order_by(desc('tr_created_on')).all()

I had to search the internet why HAVING was working with MariaDB but not working with SQLAlchemy, see also links below. We need the literal_column here, this only seems to work for MySQL / MariaDB (?). I cannot confirm this, I only use MariaDB.

Let us assume that when a language translation is published, everything is ok meaning all fields have been assigned a value. In this case we only need to check for the existence of a translation record. Fortunately the above query is doing this for us. If the translation model defines a title field with a default value of '' (empty string), then the query responds with NULL for the title if the translation record does not exists. Exactly what we want.

The above query returns attributes, not objects. I do not know a way to return objects as the following changes fail because they can only return literal values:

    func.coalesce(tr_selected, tr_default)

    case([(ContentItem.title == None, tr_default)],
        else_ = tr_selected),

For now I do not know another way then to add all attributes to the query(?).

Another way to implement language fallback

I love simple select queries and the above outerjoin query is complex so let us try to look at another way of doing this. I would like to have full objects returned, a list of (content_item, content_item_translation) tuples, for easy processing in the template. Assume the selected language is German and the default language is English. Our published content items can be as follows:

+-----------------+-----------------------------+-----------------------------+
| content_item.id | content_item_translation.id | content_item_translation.id | 
|                 |            EN               |            DE               |
+-----------------+-----------------------------+-----------------------------+
|      7          |                             |                             |
|      6          |            6                |                             |
|      5          |            4                |            5                |
|      4          |                             |                             |
|      3          |            2                |            3                |
|      2          |            1                |                             |
|      1          |                             |                             |
+-----------------+-----------------------------+-----------------------------+

Let us assume these are blog posts. Then we have 4 blog posts in English (language_id=1), and 2 in German (language_id=3). Also assume language fallback is always done and the default language items are always present. Then the total number of blog posts is nothing more then counting the number of blog posts for the default language. We need to use both the content_item and the content_item_translation tables.

# select EN titles
SELECT 
  ci.id as ci_id, ci_tr.id as ci_tr_id, ci_tr.title as ci_tr_title
  FROM content_item ci, content_item_translation ci_tr
  WHERE
    ci.content_item_type = 1
    AND ci_tr.content_item_id = ci.id
    AND ci_tr.language_id = 1
    AND ci_tr.published = 1
    AND ci_tr.content_item_parent_id = 0;

# count EN titles
SELECT 
  count( distinct ci_tr.id ) 
  FROM content_item ci, content_item_translation ci_tr
  WHERE
    ci.content_item_type = 1
    AND ci_tr.content_item_id = ci.id
    AND ci_tr.language_id = 1
    AND ci_tr.published = 1
    AND ci_tr.content_item_parent_id = 0;

# get content_item.id list for DE
SELECT 
  distinct ci.id
  FROM content_item ci, content_item_translation ci_tr
  WHERE
    ci.content_item_type = 1
    AND ci_tr.content_item_id = ci.id
    AND ci_tr.language_id = 3
    AND ci_tr.published = 1
    AND ci_tr.content_item_parent_id = 0;

# get content_item.id list for EN without DE
SELECT
  distinct ci.id
  FROM content_item ci, content_item_translation ci_tr
  WHERE
    ci.content_item_type = 1
    AND (ci_tr.content_item_id = ci.id
    AND ci_tr.language_id = 1
    AND ci_tr.published = 1
    AND ci_tr.content_item_parent_id = 0)
    AND ci.id NOT IN (
      SELECT
        distinct cis.id
        FROM content_item cis, content_item_translation cis_tr
        WHERE 
          cis.content_item_type = 1
          AND (cis_tr.content_item_id = cis.id
          AND cis_tr.language_id = 3
          AND cis_tr.published = 1
          AND cis_tr.content_item_parent_id = 0)
    );

What follows from the above is that when the selected language is German, fallback must be done when the content_item.id has English but not German translations. Using a union we can get, merge and order the items for German and English. We can use UNION_ALL here because we know the German and English items are separated. At the end we put a ORDER BY to sort the results.

# get german items
SELECT ci.id as ci_id, 
        ci_tr.id as ci_tr_id, 
        ci_tr.created_on as ci_tr_created_on, 
        ci_tr.title as ci_tr_title
  FROM content_item ci, content_item_translation ci_tr
  WHERE
    ci.content_item_type = 1
    AND ci_tr.content_item_id = ci.id
    AND ci_tr.language_id = 3
    AND ci_tr.published = 1
    AND ci_tr.content_item_parent_id = 0

UNION ALL

# add english items
SELECT ci.id as ci_id, 
        ci_tr.id as ci_tr_id, 
        ci_tr.created_on as ci_tr_created_on, 
        ci_tr.title as ci_tr_title
  FROM content_item ci, content_item_translation ci_tr
  WHERE
    ci.content_item_type = 1
    AND ci_tr.content_item_id = ci.id
  AND ci.id IN (
    SELECT
      distinct cid.id
      FROM content_item cid, content_item_translation cid_tr
      WHERE
      cid.content_item_type = 1
      AND (cid_tr.content_item_id = cid.id
      AND cid_tr.language_id = 1
      AND cid_tr.published = 1
      AND cid_tr.content_item_parent_id = 0)
      AND cid.id NOT IN (
        SELECT
        distinct cis.id
        FROM content_item cis, content_item_translation cis_tr
        WHERE 
          cis.content_item_type = 1
          AND (cis_tr.content_item_id = cis.id
          AND cis_tr.language_id = 3
          AND cis_tr.published = 1
          AND cis_tr.content_item_parent_id = 0)
      )
  )

ORDER BY ci_tr_created_on DESC;

We can also add pagination by adding LIMIT 2 OFFSET 1 to the query. This works. Is it better than the outerjoin query? Some tests showed it is at least two times slower. But it is simple which also is a big plus. And it is now possible to get objects from the SQLAlchemy query instead of attributes. Just what I wanted, that's why I went 'object'. To improve performance we can cache the query results, they do not change often. But that can be done later. The equivalent SQLAlchemy query is:

    # content_item_type is a constant, 1 = blog post
    # default language is english (id = 1)
    # language_id is the id of the selected language
    ci, ci_tr = aliased(ContentItem), aliased(ContentItemTranslation)
    s1 = db.query(ci, ci_tr).\
            filter(and_(\
                (ci.content_item_type == content_item_type),
                (ci_tr.content_item_id == ci.id),
                (ci_tr.language_id == language.id),
                (ci_tr.published == 1),
                (ci_tr.content_item_parent_id == 0),
                ))

    cisub, cisub_tr = aliased(ContentItem), aliased(ContentItemTranslation)
    s2_subquery = db.query(cisub.id).\
            filter(and_(\
                (cisub.content_item_type == content_item_type),
                (cisub_tr.content_item_id == cisub.id),
                (cisub_tr.language_id == language.id),
                (cisub_tr.published == 1),
                (cisub_tr.content_item_parent_id == 0)))
    
    s2 = db.query(ci, ci_tr).\
            filter(and_(\
                (ci.content_item_type == content_item_type),
                (ci_tr.content_item_id == ci.id),
                (ci_tr.language_id == 1),
                (ci_tr.published == 1),
                (ci_tr.content_item_parent_id == 0),
                (ci.id.notin_( s2_subquery ))))

    q = s1.union(s2).order_by(desc(ci.created_on))

    content_item_content_item_translation_tuples = q.all()

I am happy because the union query returns tuples with objects that can be passed without manipulation to the Jinja template. And to get the blog posts for pagination we simply add the offset and limit functions to the query. Getting the total number of blog posts is simply getting the number of blog posts for the default language.

There also is something confusing here. You can order by the created_on of the content item or the created on of the content item translation created_on. If you use the latter one, your blog posts can get mixed up. So for the moment I order by the created_on of the blog_post, not of the translated versions.

To have proof that it works I added two German blog post and one Spanish blog post. For the moment I am still in the process of converting, at time of writing this post I only translated the title, subtitle and meta data. Check German and Spanish to see changes.

Summary

This post describes a way to add multiple language support for content items like blog posts and pages. The added translation table for every table that needs translations makes things more complicated. I am a little bit worried about memory usage (and performance) when the content items table grows. But we can use query result caching (e.g. 1-5 minutes) as these values do not change often.

When creating queries I often start using SQL, and then later transform this to SQLAlchemy, but I am not the only one.

In one of the (very) next posts I may show a diagram of the models involved. If you have been reading previous posts this will not be surprising. The content-item has a one-to-many relationship with its translation model content-item-translation. The content-item model has many-to-many relationships with the content-item-category and content-item-tag models, both also having translation tables. And the content-item-category model has a one-to-many relationship with its translation model content-item-category-translation. Etc.

Links / credits

Aliasing field names in SQLAlchemy model or underlying SQL table
https://stackoverflow.com/questions/37420135/aliasing-field-names-in-sqlalchemy-model-or-underlying-sql-table?rq=1

How to Design a Localization-Ready System
https://www.vertabelo.com/blog/technical-articles/data-modeling-for-multiple-languages-how-to-design-a-localization-ready-system

Multi language database, with default fallback
https://stackoverflow.com/questions/26765175/multi-language-database-with-default-fallback

Multilanguage Database Design in MySQL
https://www.apphp.com/tutorials/index.php?page=multilanguage-database-design-in-mysql

python sqlalchemy label usage
https://stackoverflow.com/questions/15555920/python-sqlalchemy-label-usage

Schema for a multilanguage database
https://stackoverflow.com/questions/316780/schema-for-a-multilanguage-database

sqlalchemy IS NOT NULL select
https://stackoverflow.com/questions/21784851/sqlalchemy-is-not-null-select/37196866

SQLAlchemy reference label in Having clause
https://stackoverflow.com/questions/51793704/sqlalchemy-reference-label-in-having-clause

Storing multilingual records in the MySQL database
https://php.vrana.cz/storing-multilingual-records-in-the-mysql-database.php

Using Language Identifiers (RFC 3066)
http://www.i18nguy.com/unicode/language-identifiers.html

What's the best database structure to keep multilingual data? [duplicate]
https://stackoverflow.com/questions/2227985/whats-the-best-database-structure-to-keep-multilingual-data