2014-01-14

Using PHP-Gettext to localize your web pages

This is what I am now using for the Rufus Homepage. As usual, it took way too long to find all the pieces needed to solve this specific problem, so I'm going to write a guide that has them all in a single place.

What we want:

  1. A web page that detects the language from the browser, and, if a translation exists, displays that translation. If not, it falls back to the English version.
  2. A menu somewhere, that lets users pick from a list of supported languages, independently of the one set by their browser.
  3. An easy to use process for translators, that relies on the well known tools of the trade (i.e. gettext and Poedit).
  4. All of the above in a single web page, so that we can keep all the common parts together, and don't have to duplicate changes.


Where we start:

  • A web server that we control fully, and that natively supports UTF-8. I'll only say this once: In 2014, if you still don't use UTF-8 everywhere you can, then you don't deserve to host a web page, let alone administer a web server.
  • A single index.html page, in English/UTF-8, that contains pure HTML (possibly with a little sprinkling of JavaScript, but not much else).
Aaaand, that's about it really.


Prerequisites:

Because we have complete control of the server, we're going to use PHP Gettext.
Why? Because it relies on gettext, which is a mature translation framework, with solid support (including a nice GUI translation application for Windows & Mac called Poedit) and also because the performance hit of using PHP Gettext seems to be minimal compared to the alternatives. Finally, using PHP gives us the ability to simply edit our existing HTML and insert PHP code wherever we need a translation, which should make the whole process a breeze.

Thus, the first two items you need to install on your server, if you don't have them already, are PHP (preferably v5 or later) and php-gettext, plus any dependencies those two packages may have.

Then, you will need to install php5-intl, so that we can use the locale_accept_from_http() function call to detect the browser locale of our visitors.

Finally, you must ensure that your server serves ALL the locales you are planning to support, in UTF-8. Especially, issuing locale -a | grep utf8 on your server must return AN AWFUL LOT of entries (on mine, I get more than 150 of them, and that is the way it should be).
If issuing locale -a | grep utf8 | wc -l returns fewer than 100 entries, then, unless you are planning to restrict your site to only a small part of the world, you will need to first sort that out, for instance by installing the locales-all package. This is because gettext will not support a locale that is unknown to the system. For instance, if you don't see fr_CA.utf8 listed in your locale -a, then no matter what you do, even if you have other French locales listed, gettext will not know how to handle browsers that are set to Canadian French. You have been warned!
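
If you'd rather check a specific list of locales than eyeball the locale -a output, something like the following may help. It is only a minimal sketch: missing_utf8_locales() is a hypothetical helper name, the list of locales is an example, and it assumes that setlocale() failing for LC_CTYPE is a good enough hint that the system doesn't know the locale. It uses the full <?php tag so that it runs even where short tags are disabled.

```php
<?php
// Hypothetical helper: returns the entries of $codes that the system
// does not know once the ".utf8" suffix is appended.
function missing_utf8_locales($codes) {
    // Remember the current setting so the check has no lasting side effect.
    $saved = setlocale(LC_CTYPE, '0');
    $missing = array();
    foreach ($codes as $code) {
        // setlocale() returns false when the requested locale is unavailable.
        if (setlocale(LC_CTYPE, $code . '.utf8') === false) {
            $missing[] = $code;
        }
    }
    // Restore whatever was in effect before the check.
    setlocale(LC_CTYPE, $saved);
    return $missing;
}

// Example list only: replace with the locales you plan to support.
print_r(missing_utf8_locales(array('en_US', 'fr_FR', 'fr_CA')));
```

An empty array means every locale in the list is known to the system. Anything listed needs to be generated or installed before gettext can use it.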


Testing PHP gettext support:

At this stage, I will assume that you have php5, php5-intl, php-gettext and possibly other dependencies such as libapache2-mod-php5, gettext and co. installed. If you are using Apache2, you may also have to enable the PHP5 module, by symlinking php5.conf and php5.load in your /etc/apache2/mods-enabled/, and possibly edit php5.conf to allow running PHP scripts in user directories (which is disabled by default).

The first thing we'll do, to check that everything is in order before starting with localization, is simply create an info.php, at the same location where you have your index.html, and that contains the following one liner:
<? phpinfo(); ?>

Now, you should navigate to <your_website>/info.php and confirm that:
  1. You get a whole bunch of PHP information from your server
  2. In this whole set of data, you see a line stating "GetText Support: enabled"
If you don't see both of the above, then you will need to sort your PHP settings before proceeding, as everything that follows relies on having at least the above working. For one, we want to confirm that both PHP and the short script form (<? rather than <?php), which is what we'll use in the code below, are working, and also, get some assurance that gettext is enabled. So make sure to edit your php.ini or conf settings, if you need to sort things out.

Once you have the above simple test going, you should delete that info.php file, as you don't want attackers to know too much about the PHP and server settings you're running under.
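
If you want a narrower check that doesn't expose your whole configuration, a sketch along these lines reports only the two pieces this guide depends on. The labels are mine, and it should be loaded through the web server rather than run from the command line, since the two may not share the same php.ini.

```php
<?php
// Reports whether the two pieces this guide relies on are available.
$checks = array(
    // The gettext extension provides _(), bindtextdomain() and friends.
    'gettext extension'         => extension_loaded('gettext'),
    // The intl extension provides locale_accept_from_http().
    'locale_accept_from_http()' => function_exists('locale_accept_from_http'),
);
foreach ($checks as $name => $ok) {
    echo $name . ': ' . ($ok ? 'available' : 'MISSING') . "\n";
}
```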


Let's get crackin'


With PHP now confirmed working, let's set our translation rolling with PHP-Gettext. For that I'm going to loosely follow this guide. I say loosely, because I found that it was woefully incomplete and left out the most crucial parts.
  1. Start by duplicating your existing index.html as index2.php. This will enable us to work on adding translations to index2.php without interfering with the existing site, until we're happy enough that we can replace index.html altogether. Of course we picked index2.php rather than index.php, to make sure our server doesn't try to serve the file we're testing over the live index.html that's assumed to already exist in that directory.

  2. In index2.php, and provided you want to test a French translation (you don't really have to speak French if you just want to test that things work), somewhere after the initial <html> tag, add the following PHP header:

    <?
    $langs = array(
      'en_US' => array('en', 'English (International)'),
      'fr_FR' => array('fr', 'French (Français)'),
    );
    $locale = "en_US";
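    // Use the browser's preferred locale, when it sends one
    if (isset($_SERVER["HTTP_ACCEPT_LANGUAGE"])) {
      $locale = locale_accept_from_http($_SERVER["HTTP_ACCEPT_LANGUAGE"]);
    }
    // A ?locale= parameter on the URL overrides the browser setting
    if (isset($_GET["locale"])) {
      $locale = $_GET["locale"];
    }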
    $locale = preg_replace("/[^a-zA-Z_]/", "", substr($locale,0,5));
    foreach($langs as $code => $lang) {
      if(substr($locale,0,strlen($lang[0])) == $lang[0]) {
        $locale = $code;
        break;
      }
    }
    // Must append ".utf8" suffix here, else languages such as Azerbaijani won't work
    setlocale(LC_MESSAGES, $locale . ".utf8");
    // Also set the LANGUAGE variable, which may be needed on some systems
    putenv("LANGUAGE=" . $locale);
    bindtextdomain("index", "./locale");
    bind_textdomain_codeset("index", "UTF-8");
    textdomain("index");
    ?>

    What this code does is:
    • Create an array of languages that we will support from the language selection menu (here English and French). You'll notice that this is actually an array of arrays, but more about this later.
    • After setting the default to English, read the preferred locale from the browser, if HTTP_ACCEPT_LANGUAGE is defined (isset(...)), using locale_accept_from_http(). If that locale is not overridden with a ?locale= parameter passed on the URL, it's the one that will be used throughout the rest of the file.
    • Find if a locale parameter was passed on the URL and set the $locale variable to it if that's the case.
    • Sanitize the locale parameter to ensure that it contains only letters or underscores, and is no more than 5 characters long (anything that can be entered by users must be considered potentially harmful and SHOULD BE SANITIZED!).
    • Ensure that if we get a short locale (eg. fr rather than fr_FR), or if we get a locale for a language we support, but for a region that we don't (eg. fr_CA), we convert it to the closest locale_REGION form we support. This is very important, as the browser may only provide us with fr or fr_CA when invoking locale_accept_from_http, and we want to have these locales mapped to fr_FR for subsequent processing.
    • Tell gettext that it should use UTF-8 and look for index.mo in a ./locale/<LOCALE>/LC_MESSAGES/ directory for translations (eg. ./locale/fr/LC_MESSAGES/index.mo).
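
To see what the sanitize-then-match part of that header does with a few inputs, here is the same logic pulled out into a function. This is only a sketch for experimenting: resolve_locale() is a hypothetical name, and the sample inputs are mine. Like the header, it leaves a locale that matches nothing untouched, in which case gettext simply finds no translation and the English text is shown.

```php
<?php
$langs = array(
    'en_US' => array('en', 'English (International)'),
    'fr_FR' => array('fr', 'French (Français)'),
);

// Hypothetical helper wrapping the sanitization and matching loop.
function resolve_locale($requested, $langs) {
    // Same sanitization as the header: at most 5 characters, then
    // anything that is not a letter or an underscore is dropped.
    $requested = preg_replace('/[^a-zA-Z_]/', '', substr((string)$requested, 0, 5));
    foreach ($langs as $code => $lang) {
        // $lang[0] is the short prefix ("fr"), $code the full locale ("fr_FR").
        if (substr($requested, 0, strlen($lang[0])) === $lang[0]) {
            return $code;
        }
    }
    // No supported language matched: keep the sanitized value as is.
    return $requested;
}

foreach (array('fr', 'fr_CA', 'en_GB', 'de_DE') as $input) {
    echo $input . ' -> ' . resolve_locale($input, $langs) . "\n";
}
```

With the two-language array above, fr and fr_CA both resolve to fr_FR, en_GB resolves to en_US, and de_DE is passed through unchanged.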

  3. Somewhere in a div (eg. the one for a right sidebar) add the following code for the language selection menu:

    <select onchange="self.location='?locale='+this.options[this.selectedIndex].value">
    <? foreach($langs as $code => $lang): ?>
      <option <? if(substr($locale,0,strlen($lang[0])) == $lang[0]) echo "selected=\"selected\"";?> value="<?= $code;?>">
      <?= $lang[1]; ?>
    </option>
    <? endforeach; ?>
    </select>

    What this code does is:
    • Create a dropdown with all the languages from our $langs array.
    • Check whether the first characters of our $locale match the short language code from our array, and set the dropdown entry as the selected one if that is the case. This ensures that "French" will be selected in our dropdown, regardless of whether the locale is fr_CA, fr_FR or any of the other fr_XX locales.
    • When a user selects an entry from the dropdown, add a ?locale=en_US or ?locale=fr_FR to the URL, to force the page to be refreshed using that language.
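
If you prefer to keep the markup generation in one place, the option list can also be built by a small function. The following is a sketch rather than what the page above uses: language_options() is a hypothetical name, and the htmlspecialchars() calls are an addition of mine, there in case a language name ever contains characters that are special in HTML.

```php
<?php
$langs = array(
    'en_US' => array('en', 'English (International)'),
    'fr_FR' => array('fr', 'French (Français)'),
);

// Hypothetical helper: builds the <option> elements for the language menu.
function language_options($langs, $locale) {
    $html = '';
    foreach ($langs as $code => $lang) {
        // Same prefix comparison as in the menu code above.
        $match = substr($locale, 0, strlen($lang[0])) === $lang[0];
        $html .= '<option' . ($match ? ' selected="selected"' : '')
               . ' value="' . htmlspecialchars($code, ENT_QUOTES, 'UTF-8') . '">'
               . htmlspecialchars($lang[1], ENT_QUOTES, 'UTF-8')
               . "</option>\n";
    }
    return $html;
}

// A Canadian French visitor should get the French entry preselected.
echo language_options($langs, 'fr_CA');
```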

  4. For every place where you want to translate a string, use something like <?= _("Hello, world");?>, where <?= is the short version of <?php echo and _( is the actual call to gettext. What gettext does then is, find out if a translation exists for the string being passed as parameter and either use that if it exists, or the original untranslated string otherwise.

  5. Of course, you can use the whole gamut of PHP function calls, and say, if you want to insert a variable in your translated string, such as a date, do something like:
    <? printf(_("Last updated %s:"), $last_date);?>.
    Also, if needed, and this is something that is very useful to know, you can insert translator notes using comments (/* ... */) within your PHP, before the _(...) calls. These comments will then be displayed for all translators to see in Poedit (as long as you used the -c option when creating your PO catalog with xgettext).
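
Put together, a translator note and a formatted string might look like the sketch below. The $last_date value and the wording of the note are made up for the example, and the small fallback at the top is only there so that the snippet can be run on a machine without the gettext extension: on a properly set up server it is never used.

```php
<?php
// Fallback for illustration only: lets the sketch run where the gettext
// extension is missing, by returning the untranslated string.
if (!function_exists('_')) {
    function _($msgid) { return $msgid; }
}

// Hypothetical value: on a real page this would come from your own code.
$last_date = '2014.01.14';

/* Translators: %s is replaced by a date, e.g. 2014.01.14 */
$line = sprintf(_("Last updated %s:"), $last_date);
echo $line;
```

Keeping the placeholder inside the translatable string, rather than concatenating the date afterwards, lets translators move it to wherever it belongs in their language.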

  6. Save your index2.php and confirm that you get to see the English strings, the dropdown with 2 entries, as well as ?locale=fr_FR or ?locale=en_US appended to the URL when you select an entry from the dropdown. Of course, since we haven't created any translation for French, the English text still displays when French is selected, as the default of gettext is to use the original if a translation is missing, but we will address that shortly.

  7. Create a ./locale/fr/LC_MESSAGES/ set of subdirectories, at the location where you have your index2.php page.

  8. Now we need to generate the gettext catalog, or POT, which is the file you will have to provide translators with, in order for them to start creating a translation. Now, while Poedit is supposed to be able to process a PHP file to generate a .pot, I couldn't for the life of me figure out how to do just that with the Windows version. Moreover, the .pot creation is really something you want to do on the server anyway, so, to cut a long story short, we're just going to call xgettext, using a script, to produce our .pot on the server. Here is the content of that script:

    #!/bin/sh
    xgettext --package-version=1.0 --from-code=UTF-8 --copyright-holder="Pete Batard" --package-name="Rufus Homepage" --msgid-bugs-address=pete@akeo.ie -L PHP -c -d index -o ./locale/index.pot index2.php
    sed --in-place ./locale/index.pot --expression='s/SOME DESCRIPTIVE TITLE/Rufus Homepage/'
    sed --in-place ./locale/index.pot --expression='1,6s/YEAR/2014/'
    sed --in-place ./locale/index.pot --expression='1,6s/PACKAGE/Rufus/'
    sed --in-place ./locale/index.pot --expression='1,6s/FIRST AUTHOR/Pete Batard/'
    sed --in-place ./locale/index.pot --expression='1,6s/EMAIL@ADDRESS/pete@akeo.ie/'

    Running the above, in the directory where we have our PHP, creates our index.pot under the ./locale/ subdirectory, and fills in some important variables that xgettext mysteriously doesn't seem to provide any means to set. As you can see, we used the -c option so that any notes to translators that we added using PHP comments are carried over.

  9. Now, we're going into the part that is generally meant to be done by a translator: download the index.pot, and open it in Poedit. From there, set your target language (here fr_FR) and translate the various strings (eg. "Hello, world" → "Bonjour, monde"). Save your translation as index.po/index.mo (Poedit will create both files) and upload index.mo in ./locale/fr/LC_MESSAGES/.

  10. Voilà! If you did all of the above properly and select French in the dropdown or use a browser that has French as its preferred language, then you should now see the relevant sections translated. "C'est magique, non?"
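
If the translation does not show up, the first thing worth checking is whether index.mo is where gettext will look for it. The sketch below lists the two locations relevant to this guide and reports whether a readable file exists there. mo_candidates() is a hypothetical helper, and the idea that gettext tries the full locale directory before the short language one is my assumption about typical behaviour, not something this guide verifies.

```php
<?php
// Hypothetical helper: lists where a <domain>.mo for $locale could live.
// The order (full locale first, then short language) is an assumption.
function mo_candidates($locale, $domain = 'index', $base = './locale') {
    // Part of the locale before the underscore, e.g. "fr" for "fr_FR".
    $lang = strtok($locale, '_');
    $paths = array();
    foreach (array($locale, $lang) as $dir) {
        $path = $base . '/' . $dir . '/LC_MESSAGES/' . $domain . '.mo';
        if (!in_array($path, $paths, true)) {
            $paths[] = $path;
        }
    }
    return $paths;
}

foreach (mo_candidates('fr_FR') as $path) {
    echo $path . ': ' . (is_readable($path) ? 'found' : 'not found') . "\n";
}
```

Run it from the directory that holds index2.php, since the paths are relative, just like the one passed to bindtextdomain().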

  11. From there, you will of course need to add PHP for all of the page content that you want to see translated, by enclosing the English text in <?= _(...);?> sections (don't worry about the constant switching between HTML and PHP mode - PHP is designed to be very efficient at doing just that!). Once you're happy, just rename your index2.php to index.php (but make sure to remove your index.html first, or you may run into weird issues), and you are fully ready to get your content localized. To do that, just run the POT creation script again (make sure you edit the script if needed, so that it applies to index.php now), and provide index.pot to your translators. Then wait for them to send you their .mo files, edit the code above to add a new array line for each extra language, and watch in awe as visitors experience your site in that new language. Now, it wasn't that hard after all, was it?


Additional remarks:

Can't we just do away with the double fr_FR and fr in our array?

Unfortunately, no. The short explanation is, even after you place your translation under a /fr/ subdirectory, so that it is used by default when your locale is fr_FR, fr_CA, fr_BE, fr_CH and so on, gettext still can't work with a locale that is just set to fr. This is because, as explained in the Prerequisites, if your system doesn't have an fr or fr.utf8 listed with locale -a, gettext just doesn't know how to handle that language.

Now, the long explanation as to why we couldn't just use a single fr_FR in our $langs array is: we want to smartly set our dropdown to French, even when fr_CA is provided, and we can't do something as simple as just picking the first two characters of the array locale, due to the fact that we will also want to support both pt_PT and pt_BR as well as zh_CN and zh_TW, as separate languages (because that's pretty much what they are). So, if we were to just try to isolate the substring up to the underscore, then if we had zh_CN defined before zh_TW in our array, Traditional Chinese speakers would see the dropdown set to Simplified Chinese and that's not what we want.

Thus, for our dropdown selection comparison, we must provide a value that is the lowest common denominator we want the language to apply to, which can be either a simple fr or es, or a longer pt_BR or zh_CN. But as we explained previously, we can't use that lowest common denominator for locale selection, as gettext might not know how to handle it. And that is why we need to duplicate part of the locale in two places in our array.
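
To make this concrete, here is a sketch of what the array could look like with both kinds of entries, along with the prefix comparison applied to a few locales. The Portuguese and Chinese lines are examples only, and dropdown_entry() is a hypothetical helper that mirrors the comparison used for the dropdown.

```php
<?php
// Illustrative array: short prefixes where one translation covers every
// region, full locales where regional variants are separate languages.
$langs = array(
    'en_US' => array('en', 'English (International)'),
    'fr_FR' => array('fr', 'French (Français)'),
    'pt_BR' => array('pt_BR', 'Portuguese (Brazil)'),
    'pt_PT' => array('pt_PT', 'Portuguese (Portugal)'),
    'zh_CN' => array('zh_CN', 'Chinese (Simplified)'),
    'zh_TW' => array('zh_TW', 'Chinese (Traditional)'),
);

// Hypothetical helper mirroring the prefix comparison used for the dropdown.
function dropdown_entry($locale, $langs) {
    foreach ($langs as $code => $lang) {
        if (substr($locale, 0, strlen($lang[0])) === $lang[0]) {
            return $code;
        }
    }
    return null; // No entry applies to this locale.
}

foreach (array('fr_CA', 'zh_TW', 'pt_PT', 'pt') as $locale) {
    $entry = dropdown_entry($locale, $langs);
    echo $locale . ' -> ' . ($entry === null ? '(no match)' : $entry) . "\n";
}
```

Note that with the longer prefixes, a browser that only sends a bare pt matches neither Portuguese entry. Whether that warrants picking one of them as a catch-all is a choice you will have to make for your own audience.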

<rant>Of course, it would be oh so much simpler if OSes agreed that short locales without a region are perfectly valid entities by default, especially as gettext doesn't seem to have any issue accepting them when looking for .mo files, but hey, that's localization for you: no-one EVER manages to get it right...</rant>

How about a real-life example?

Alright... Since I'm all about Open Source, let me show you exactly how I am applying all of the above to the Rufus Homepage. You can click the following to access the current index.php source for the Rufus site, as well as the locale/ subdirectory. There's also this guide, that I provide to any translator who volunteered to create a translation for the homepage. Hopefully, these will help you fill any blanks, and allow you to provide an awesome multilingual web page!

What about right-to-left languages?

Look at the PHP source and look for the use of the $dir variable.
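
The Rufus source isn't reproduced here, so the following is not what it does, only a sketch of one way a $dir value could be derived and used. text_direction() is a hypothetical helper, and its list of right-to-left language prefixes is illustrative rather than exhaustive.

```php
<?php
// Hypothetical helper: picks a text direction from the language prefix.
// The list of right-to-left prefixes is illustrative, not exhaustive.
function text_direction($locale) {
    $rtl = array('ar', 'fa', 'he', 'ur');
    // Part of the locale before the underscore, e.g. "ar" for "ar_SA".
    return in_array(strtok($locale, '_'), $rtl, true) ? 'rtl' : 'ltr';
}

$dir = text_direction('ar_SA');
// The value can then be used in the markup, for instance on the html tag.
echo '<html dir="' . $dir . '">' . "\n";
```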