Character sets and encodings

This topic discusses the differences between characters, glyphs, character sets and encodings, and their impact on implementation.

Character vs. glyph vs. encoding

In fonts, glyphs are the visual representations of characters. Each character can be encoded in various ways: for example, the code value of the character "A" in Latin script depends on the encoding used (ISO 8859-1, Windows code page 1252, UTF-8 or UCS-2). The character "A" looks exactly the same in Cyrillic and Greek scripts, but its code points there are different from the Latin one.

To clarify glyph: for the Latin letter A, the following are all examples of glyphs, each of which may be drawn in a different typeface or style: A, a, A, a, a, A.

To clarify encoding: the capital letter A occurs in the Latin, Cyrillic and Greek writing systems, but it has a different code point for each script, depending on the encoding. For instance, the Unicode code point for this character can be U+0041 (Latin capital letter A), U+0410 (Cyrillic capital letter A) or U+0391 (Greek capital letter Alpha).
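The three code points can be inspected with a short script. This is only a sketch in Python using its standard library, not the Symbian CharConv API; the helper name `describe` is hypothetical and introduced for illustration.

```python
import unicodedata

# Three characters that can share one glyph but are distinct code points.
LOOKALIKES = ["\u0041", "\u0410", "\u0391"]

def describe(ch):
    """Return (code point, Unicode name, UTF-8 bytes as hex) for one character."""
    return (f"U+{ord(ch):04X}", unicodedata.name(ch), ch.encode("utf-8").hex())

for ch in LOOKALIKES:
    print(describe(ch))
```

Although the three characters may be rendered identically, they compare as unequal and produce different byte sequences, which is why a search for the Latin letter does not find the Cyrillic or Greek one.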

The character repertoire varies between languages even when they use the same writing system. A single character may have different encodings depending on the character encoding scheme used by the system. This is especially important in information exchange, because different messaging and browsing applications follow different standards for the default encoding.
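The effect on information exchange can be illustrated as follows. This is a minimal sketch in Python, assuming its standard codecs `cp1252`, `iso8859_15`, `utf-8` and `utf-16-be` as stand-ins for the encodings a sender and receiver might use; the helper `encode_hex` and the variable names are hypothetical.

```python
# The euro sign has a different byte value in each of these encodings.
EURO = "\u20ac"

def encode_hex(text, encoding):
    """Encode text with the given codec and return the bytes as a hex string."""
    return text.encode(encoding).hex()

for name in ("cp1252", "iso8859_15", "utf-8", "utf-16-be"):
    print(name, encode_hex(EURO, name))

# Decoding with the wrong default encoding silently yields different text.
sent = "\u00e9".encode("utf-8")    # e with acute accent, sent as UTF-8
misread = sent.decode("cp1252")    # receiver wrongly assumes code page 1252
print(repr(misread))               # two characters instead of one
```

No error is raised in the second half: the receiver simply displays the wrong characters, so both sides of an exchange need to agree on the encoding or state it explicitly.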

Different encoding schemes

The encodings supported natively by the Symbian platform CharConv component are:

  • UTF-7 (becoming obsolete)

  • UTF-8 (global)

  • Windows code page 1252 (Latin 1)

  • ISO 8859-1 to ISO 8859-15 (all ISO 8859 encodings from 1 to 15)

  • ASCII (7 bit encoding)

  • SMS 7-bit (GSM 03.38)

  • GB 2312 (Simplified Chinese)

  • HZ-GB 2312 (Simplified Chinese)

  • GB 12345 (Simplified Chinese)

  • GBK (Simplified Chinese)

  • Big 5 (Traditional Chinese)

  • Shift-JIS (Japanese)

  • ISO 2022-JP (Japanese)

  • EUC-JP (Japanese)

  • ISO 2022-KR (Korean)

  • KS C 5601 (Korean)

In addition to these natively supported encodings, the Symbian platform has implemented the following CharConv plug-ins (Win cp here refers to Windows code page):

  • ISCII (Hindi)

  • KOI8-R (Russian)

  • KOI8-U (Ukrainian)

  • TIS 620 (Thai)

  • Win cp 874 (Thai)

  • Win cp 932 (Japanese)

  • Win cp 936 (Chinese, PRC)

  • Win cp 950 (Chinese, HK & TW)

  • Win cp 1250 (Eastern European)

  • Win cp 1251 (Cyrillic)

  • Win cp 1253 (Greek)

  • Win cp 1254 (Turkish)

  • Win cp 1255 (Hebrew)

  • Win cp 1256 (Arabic)

  • Win cp 1257 (Baltic)
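Each of these encodings covers a limited character repertoire, so text survives conversion only if the target encoding contains its characters. The sketch below shows this with Python's standard codecs rather than the CharConv plug-ins themselves. The codec names (`koi8_r`, `cp1251`, `cp1253`, `cp874`, `cp932`) are assumed to correspond to the encodings listed above, and the helper `round_trips` is hypothetical.

```python
def round_trips(text, encoding):
    """Return True if text can be encoded and decoded back unchanged."""
    try:
        return text.encode(encoding).decode(encoding) == text
    except UnicodeEncodeError:
        # The encoding's repertoire does not contain some character.
        return False

RUSSIAN = "Привет"
GREEK = "Αθήνα"
THAI = "สวัสดี"
JAPANESE = "こんにちは"

for text, encoding in [
    (RUSSIAN, "koi8_r"), (RUSSIAN, "cp1251"), (RUSSIAN, "cp1253"),
    (GREEK, "cp1253"), (THAI, "cp874"), (JAPANESE, "cp932"),
]:
    print(encoding, round_trips(text, encoding))

# Two encodings for the same script still assign different byte values.
print(RUSSIAN.encode("koi8_r") != RUSSIAN.encode("cp1251"))
```

The Russian sample converts cleanly to KOI8-R and code page 1251 but not to the Greek code page 1253, and its KOI8-R and code page 1251 byte sequences differ even though both cover Cyrillic.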