A complete guide to i18n in Python

This is a start to finish guide showing how to do internationalization (i18n) for a Python application. When I added i18n to handroll, I struggled to find clear advice for supporting other languages. This is one opinionated view explaining how I got there.

Table of Contents:

Leer este articulo en Español.

Overview

To internationalize code, you have to treat user text strings in a certain way. All text strings must be wrapped with a special function call. This special function marks the strings as something that needs translation. Once all the strings are marked, i18n tools can scan your code to make a master list of everything. With the master list available, translators can produce a list of translated strings for each desired language. The translated strings are added back to the Python code and it’s all bundled up into a nice translated final product. Then you may want to test that out.

That’s a lot of stuff, but I’ll explain each part of that process in detail. For the purposes of this guide, all of my example code will refer to the handroll package. If this is your Python code, replace handroll with the name of your top level Python package.

Marking strings

The special function that I mentioned in the overview looks like _('Hello World'). This function comes from the gettext module. Python uses GNU gettext to do translation so let’s look at the handroll code that creates the _ function in handroll/i18n.py.

import gettext
import os

localedir = os.path.join(os.path.abspath(os.path.dirname(__file__)), 'locale')
translate = gettext.translation('handroll', localedir, fallback=True)
_ = translate.gettext

This little chunk of code makes a translation object using a locale directory as the source of translated strings. Your locale directory doesn’t exist yet, but it is where your application will look for translations at run time. From the translation object, I made an _ alias for translate.gettext. This underscore is not particularly special, however, it is the conventional name of this function that is used by the Python community.

I hope there is enough here now to understand what will happen at run time. The important idea is that 'Hello World' acts like a key. When the code executes, that “key” will be used to look in the locale data to find a matching translation. The return value of _ will be the translated string or the original if the translation is missing.

_ and format strings

One weirdness that you will quickly encounter if you use format for your strings is where to close the _ parentheses. Here is an example to make it clear.

_('Close the underscore function first. A {number}').format(number=42)

Extracting the master list

With all strings marked with _(), it is time to extract them into a master list. Gettext refers to this master list as the Portable Object Template (POT) file. To generate the POT file, I turned to a tool called Babel. Babel is designed to help developers handle i18n issues in a wide variety of ways. It’s cool with a lot of features, but for this guide, I’ll focus on its gettext support.

Babel extends a setup.py file with new commands. (Python packaging is out of scope for this guide so if setup.py is completely foreign to you, please check out Distributing Python Modules for more information.)

To generate a POT file, run python setup.py extract_messages. You will need some settings similar to:

[extract_messages]
input_dirs = handroll
output_file = handroll/locale/handroll.pot

Getting translations for other languages

Now that you have generated a POT file, you are ready to get translated versions of your strings. Remember that the T in POT is for template. The translated files are called PO files, and translators use the POT file as a starting point to generate their PO file. Doing translation by manually creating the PO file is possible, but it is tedious. At this point, I turned to a service called Transifex.

Transifex is focused on making translation easy. It’s free for open source projects and really easy to use. I set up a project for handroll, configured Transifex to “watch” the POT file for changes, and translators can easily translate all the strings in the project. Transifex has an API that enabled me to pull PO files back into the repository whenever I needed. The script is a little too long to put in this post, but you can look at it on GitHub.

One important thing that the script does is store PO files in a specific structure in the locale directory. Gettext expects an order like <language>/LC_MESSAGES/handroll.po.

Packaging it together

After the translated PO files are in place, it is time to package your translation data. The setup process includes extra data in MANIFEST.in. Add a line similar to the one below to ensure all locale data is put into the package archive.

recursive-include handroll/locale *

So far I’ve lied a little bit for clarity in explanation. Gettext does not use PO files directly for looking up translations. The truth is that the PO file must be compiled into a binary Machine Object (MO) file (for faster speed). Again, we can turn to Babel for help. Babel has a compile_catalog command that can compile PO into MO. You can run it with python setup.py compile_catalog.

It will need settings like:

[compile_catalog]
domain = handroll
directory = handroll/locale

Because translations will not work without MO files, I extended setup.py so that the sdist command will always run compile_catalog.

from setuptools.command.sdist import sdist


class Sdist(sdist):
    """Custom ``sdist`` command to ensure that mo files are always created."""

    def run(self):
        self.run_command('compile_catalog')
        # sdist is an old style class so super cannot be used.
        sdist.run(self)

If you do this, don’t forget to add cmdclass={'sdist': Sdist} and setup_requires=['Babel'] in your setup call.

At this point, running python setup.py sdist should create a tarball with translations for your project. You’re almost done!

Testing out the package

Testing translations is not easy. If you go the extra mile, you can know that people from all different cultures can enjoy your software. But translation testing is not easy because you need tests that run every string in your code. For handroll, that meant I had to get to 100% test coverage to get those really strange corner cases.

The reason to test is that incorrectly translated strings can destroy your code. If you have a format string like _('Hello {name}!').format(name='Johnny') and a translator makes a mistake like '¡Hola {nombre}!', then that code would break for Spanish users.

>>> '¡Hola {nombre}!'.format(name='Johnny')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: 'nombre'

To test out these strings, you need two things: MO files and the LANGUAGE environment variable. Let’s assume you use the nose test runner. To test out your Spanish support, you can run:

$ LANGUAGE=es nosetests
..................................
----------------------------------------------------------------------
Ran 34 tests in 1.440s

OK

In whatever continuous integration system you use, run your unit tests through each language you support to give yourself some confidence that translations have not broken the software. The handroll tox.ini provides an example to compare against. Tox is awesome for this kind of stuff.

Questions?

In this guide, I did my best to document all that I did to internationalize my project. I hope that all the concrete details help you see how to translate your own project. If something is missing or broken, let me know and I’ll be glad to update the post.