Encodings and Locales

Encodings

Character encoding is how computer software translates the keys you press into data that is sent between computer programs, and how data from the program is converted back to what you see on the screen.

Old Style

The “old style” used a model where 1 key press == 1 byte == 1 column on the screen. But one byte limits you to 256 values, so there were many overlapping (incompatible) “code pages”. Using the wrong code page was the reason you didn't see box graphics like you expected, or why you saw box graphics instead of your language's vowels.

Examples of “old style” encodings:

CP437 (very common for scripts)
ISO-8859-1 (very common for non-English European languages)
KOI8-R (very common for Russian)

So a person who wanted to use KOI8-R to talk to Russian friends could not use a script that was encoded in CP437 with box drawing characters, because the two encodings assign different characters to the same byte values.
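The overlap is easy to see from a shell. This is a minimal sketch, assuming your iconv knows the three encodings listed above and your terminal displays UTF-8; decode_byte is a made-up helper name, not part of iconv or EPIC. It decodes the same single byte, 0xC4, under each code page:

```shell
# Decode a single raw byte (given as three octal digits) under a named
# code page and print the resulting character as UTF-8.
# decode_byte is a hypothetical helper name, not part of iconv or EPIC.
decode_byte() {
    printf "\\$1" | iconv -f "$2" -t UTF-8
}

# The same byte, 0xC4 (octal 304), read three different ways:
for codepage in CP437 ISO-8859-1 KOI8-R; do
    printf '%-11s ' "$codepage"
    decode_byte 304 "$codepage"
    printf '\n'
done
```

Under CP437 the byte is the horizontal box drawing line (─), under ISO-8859-1 it is Ä, and under KOI8-R it is д. The byte never changes; only the interpretation does.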

New Style

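The “new style” uses a model where every character maps to a number (a Unicode code point, which fits in a 32 bit integer), and that number is then converted into 1 to 4 bytes (UTF-8). UTF-8 encodes every ASCII character as the same single byte it always was, which maximizes compatibility with the “old style”. Although there are several incompatible ways to convert Unicode to bytes (UTF-16 and UTF-32, for example), UTF-8 is the one used on Unix.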
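To make the variable length concrete, here is a rough sketch that feeds iconv one code point at a time and counts the UTF-8 bytes that come out. It assumes your iconv supports UTF-32BE; utf8_len is a made-up helper name. Modern UTF-8 (RFC 3629) never needs more than 4 bytes per character.

```shell
# Print how many bytes UTF-8 needs for one code point.
# The argument is the code point written as four octal-escaped bytes
# (big-endian UTF-32). utf8_len is a hypothetical helper name.
utf8_len() {
    printf "$1" | iconv -f UTF-32BE -t UTF-8 | wc -c | tr -d ' '
}

utf8_len '\000\000\000\101'   # U+0041 LATIN CAPITAL LETTER A
utf8_len '\000\000\000\344'   # U+00E4 LATIN SMALL LETTER A WITH DIAERESIS
utf8_len '\000\000\040\254'   # U+20AC EURO SIGN
utf8_len '\000\001\366\000'   # U+1F600 GRINNING FACE
```

The four calls print 1, 2, 3 and 4. The ASCII letter stays a single byte, which is why plain English text looks identical in ASCII and in UTF-8.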

Encodings and Iconv

EPIC uses the iconv system to convert between encodings. You can use any encoding that is supported by your system's iconv! On my system, I can see all of the supported encodings with

   iconv --list
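The format of that list varies between iconv implementations, so it can be awkward to search. As an alternative, this sketch just asks iconv to convert nothing at all and looks at the exit status. iconv_supports is a made-up helper name, and NO-SUCH-ENCODING is a deliberately bogus name used to show the failure case.

```shell
# Return success if the system's iconv accepts the named encoding.
# iconv_supports is a hypothetical helper name. It converts empty input,
# so nothing is printed; only the exit status matters.
iconv_supports() {
    iconv -f "$1" -t UTF-8 </dev/null >/dev/null 2>&1
}

for encoding in UTF-8 KOI8-R CP437 NO-SUCH-ENCODING; do
    if iconv_supports "$encoding"; then
        echo "$encoding: supported"
    else
        echo "$encoding: not supported"
    fi
done
```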

Locales

Now maybe you understand what encoding you are using. But you have to tell the software you run about it. In Unix, this is done with “locales”.

You can see the list of locales available on your system with the

    locale -a

command.

A locale looks like language_country.encoding.

Some examples:

Locale Name	Explanation
en_US.ISO8859-15	English - US - ISO-8859-15 (old style, no box drawing)
en_US.UTF-8	English - US - UTF-8 (new style)
fi_FI.ISO8859-1	Finnish - Finland - ISO-8859-1 (old style)
fi_FI.UTF-8	Finnish - Finland - UTF-8 (new style)
ja_JP.SJIS	Japanese - Japan - Shift-JIS (old style)
ru_RU.KOI8-R	Russian - Russia - KOI8-R (old style, Unix)
ru_RU.CP1251	Russian - Russia - CP1251 (old style, Windows)
ru_RU.UTF-8	Russian - Russia - UTF-8 (new style)
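The three parts of a locale name can be pulled apart with ordinary shell parameter expansion. This is only a sketch: split_locale is a made-up helper name, and it assumes the full language_country.encoding form, so short names such as C or POSIX will not split sensibly.

```shell
# Split a locale name of the form language_country.encoding into its
# three parts. split_locale is a hypothetical helper name.
split_locale() {
    name=$1
    lang_country=${name%%.*}      # everything before the first "."
    encoding=${name#*.}           # everything after the first "."
    language=${lang_country%%_*}  # everything before the first "_"
    country=${lang_country#*_}    # everything after the first "_"
    printf '%s %s %s\n' "$language" "$country" "$encoding"
}

split_locale ru_RU.KOI8-R
split_locale fi_FI.UTF-8
```

The first call prints "ru RU KOI8-R" and the second prints "fi FI UTF-8".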

You can set the locale you are using with the LC_ALL environment variable. This should be set in your login scripts (e.g., ~/.profile). For example, I use this:

   export LC_ALL=en_US.UTF-8

Then every program I run knows that I am using UTF-8 as my character encoding.
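To check that the setting actually took effect, you can ask the locale command which encoding the active locale uses. A minimal sketch, assuming the standard locale utility is installed; is_utf8_locale is a made-up helper name:

```shell
# Succeed if the current locale's character encoding is UTF-8.
# is_utf8_locale is a hypothetical helper name; "locale charmap" is the
# standard command that prints the encoding of the active locale.
is_utf8_locale() {
    [ "$(locale charmap 2>/dev/null)" = "UTF-8" ]
}

if is_utf8_locale; then
    echo "This shell is using UTF-8."
else
    echo "This shell is NOT using UTF-8; check LC_ALL."
fi
```

If the locale you named in LC_ALL is not installed (it does not appear in the locale -a list), the system typically falls back to the plain C locale and this check fails.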

Some programs, like GNU Screen, have problems with UTF-8. People have reported good success when they completely shut down the GNU Screen session (not just detach!), set LC_ALL, and then start a new screen session. You could also use tmux, which appears to handle UTF-8 very well.

Some programs, like XTerm, can support either the old or new style, based on menu options. I really should create a good document discussing that, as well as other popular terminal emulators.

Your font also plays a role. It's one thing for the software to know what encoding you're using, but if you use an incorrect font for that encoding, you still won't see what you expect. I need to document what I know about using an appropriate UTF-8 font.
