Glossary

The basis for this glossary is the glossary defined by the Unicode Consortium. Not all of the terms included in this glossary are used in this document, but they are included here to establish a common baseline for terminology when discussing internationalization and localization issues.

Term

Description

Alphabet

A writing system that consists of letters for the writing of both consonants and vowels. Consonants and vowels have equal status as letters in the Latin script based alphabet, but for instance, Arabic can be written without vowels. The Latin alphabet is the most widespread and well-known example of an alphabet. The correspondence between letters and sounds may be either more or less exact; most alphabets do not exhibit a one-to-one correspondence between distinct sounds (phonemes) and distinct letters (graphemes). The term "alphabet" is derived from the first two letters of the Greek script (alpha, beta).

Arabic-Indic digits

Forms of decimal digits used in many parts of the Arabic world (for instance, U+0660 (٠), U+0661 (١), U+0662 (٢), and U+0663 (٣)). Although European digits (1, 2, 3…) derive historically from these forms, they are visually distinct and coded separately. (Arabic digits are sometimes called Indic numerals; however, this nomenclature leads to confusion with the digits currently used with the scripts of India.)

  • Arabic digits are referred to as Arabic-Indic digits in the Unicode Standard.

  • Variant forms of Arabic digits used chiefly in Iran and Pakistan are referred to as Eastern Arabic-Indic digits.
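
As a minimal sketch of the point that these digit forms are coded separately yet carry the same numeric values, the snippet below uses Python's standard unicodedata module. The helper name digit_values is hypothetical and exists only for this illustration.

```python
import unicodedata

def digit_values(text):
    """Return the decimal value of each digit character in text."""
    return [unicodedata.digit(ch) for ch in text]

arabic_indic = "\u0660\u0661\u0662\u0663"          # Arabic-Indic digits 0-3
eastern_arabic_indic = "\u06F0\u06F1\u06F2\u06F3"  # Eastern Arabic-Indic digits 0-3
european = "0123"

for label, digits in (("Arabic-Indic", arabic_indic),
                      ("Eastern Arabic-Indic", eastern_arabic_indic),
                      ("European", european)):
    code_points = " ".join(f"U+{ord(ch):04X}" for ch in digits)
    print(label, code_points, digit_values(digits))
```

All three strings yield the values 0, 1, 2, 3 even though their code points differ, which is why digit handling should rely on character properties rather than on fixed code point ranges.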

API

Application Programming Interface

ASCII

  • The American Standard Code for Information Interchange, a 7-bit coded character set for information interchange. It is the U.S. national variant of ISO/IEC 646, and is formally the U.S. standard ANSI X3.4. It was proposed by ANSI in 1963 and finalized in 1968.

  • The set of 128 Unicode characters from U+0000 to U+007F, including control codes, as well as graphic characters.

  • ASCII has been incorrectly used to refer to various 8-bit character encodings that include ASCII characters in the first 128 positions.
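
The last point can be illustrated with a short Python sketch: a byte value above 0x7F lies outside ASCII, and its meaning depends on which 8-bit encoding is assumed. The function name is_ascii is hypothetical; latin-1 and cp1251 are Python's codec names for ISO 8859-1 and Windows code page 1251.

```python
def is_ascii(text):
    """True if every code point lies in the range U+0000..U+007F."""
    return all(ord(ch) <= 0x7F for ch in text)

print(is_ascii("Hello"))       # only ASCII characters
print(is_ascii("\u00E4iti"))   # starts with U+00E4, which is outside ASCII

# The same byte is a different character in different 8-bit encodings.
byte = b"\xe4"
as_latin1 = byte.decode("latin-1")
as_cp1251 = byte.decode("cp1251")
print(f"U+{ord(as_latin1):04X}", f"U+{ord(as_cp1251):04X}")
```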

Bi-di

Abbreviation of bi-directional, in reference to mixed left-to-right and right-to-left text.

Big Endian

Computer architecture that stores multiple-byte numerical values with the most significant byte (MSB) values first.
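
A small Python sketch of the difference between the two byte orders, using the built-in int.to_bytes method. The sample value 0x12345678 is arbitrary.

```python
value = 0x12345678

big = value.to_bytes(4, byteorder="big")        # most significant byte first
little = value.to_bytes(4, byteorder="little")  # least significant byte first

print(big.hex())     # 12345678
print(little.hex())  # 78563412

# Reading the bytes back requires knowing which order was used.
print(int.from_bytes(little, byteorder="little") == value)
```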

Big5

Chinese encoding, used especially for Traditional Chinese (Hong Kong and Taiwan).

Block

A grouping of related characters within the Unicode encoding space. A block may contain unassigned positions, which are reserved for further purposes.

BOM

Acronym for byte order mark. See “Byte order mark” description below.

BoPoMoFo

An alphabetic script used primarily in Taiwan to write the sounds of Mandarin Chinese and some other dialects. Each symbol corresponds to either the syllable-initial or syllable-final sounds; it is therefore a sub-syllabic script in its primary usage. The name is derived from the names of its first four elements. More properly known as zhuyin zimu or zhuyin fuhao in Mandarin Chinese.

Byte order mark

The Unicode character U+FEFF ZERO WIDTH NO-BREAK SPACE when used to indicate the byte order of a text. A byte order mark is needed to indicate whether the encoded data is Big Endian or Little Endian, e.g. in messaging.
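
As a rough sketch of how a receiver might use the byte order mark, the following Python function checks the leading bytes of a buffer against the BOM constants in the standard codecs module. The function name detect_bom is hypothetical, and the approach is a simplification: data without a BOM needs other means of identification, and UTF-16LE text that begins with a BOM followed by U+0000 would be misread as UTF-32LE.

```python
import codecs

def detect_bom(data):
    """Return the encoding scheme implied by a leading BOM, or None."""
    # UTF-32 is checked before UTF-16 because the UTF-32LE BOM
    # (FF FE 00 00) begins with the UTF-16LE BOM (FF FE).
    candidates = [
        (codecs.BOM_UTF8, "UTF-8"),
        (codecs.BOM_UTF32_BE, "UTF-32BE"),
        (codecs.BOM_UTF32_LE, "UTF-32LE"),
        (codecs.BOM_UTF16_BE, "UTF-16BE"),
        (codecs.BOM_UTF16_LE, "UTF-16LE"),
    ]
    for bom, name in candidates:
        if data.startswith(bom):
            return name
    return None

print(detect_bom(b"\xfe\xff\x00A"))  # UTF-16BE
print(detect_bom(b"\xff\xfeA\x00"))  # UTF-16LE
print(detect_bom(b"A"))              # None
```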

Case

1. Feature of certain alphabets where the letters have two distinct forms. These variants, which may differ markedly in shape and size, are called the uppercase letter (also known as capital or majuscule) and the lowercase letter (also known as small or minuscule). 2. Normative property of characters, consisting of uppercase, lowercase, and title case (Lu, Ll, and Lt).

Case mapping

The association of the uppercase, lowercase, and title case forms of a letter.
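
A brief Python sketch of the three case forms and their mapping, assuming the standard unicodedata module and the built-in string methods. The digraph U+01C4..U+01C6 is chosen because it has distinct uppercase, title case, and lowercase characters.

```python
import unicodedata

# General category values Lu, Lt, and Ll mark the three case forms.
for ch in ("\u01C4", "\u01C5", "\u01C6"):
    print(f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch))

small_dz = "\u01C6"
print(small_dz.upper() == "\u01C4")  # uppercase form
print(small_dz.title() == "\u01C5")  # title case form

# Case mapping is not always one-to-one: U+00DF maps to two letters.
print("stra\u00DFe".upper())  # STRASSE
```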

Character

  • The smallest component of written language that has semantic value; refers to the abstract meaning and/or shape, rather than a specific shape (see also glyph), though in code tables some form of visual representation is essential for the reader’s understanding.

  • Synonym for abstract character.

  • The basic unit of encoding for the Unicode character encoding.

  • The English name for the ideographic written elements of Chinese origin.

Character encoding form

Mapping from a character set definition to the actual code units used to represent the data.

Character encoding scheme

A character encoding form plus byte serialization. There are seven character encoding schemes in Unicode: UTF-8, UTF-16, UTF-16BE, UTF-16LE, UTF-32, UTF-32BE, and UTF-32LE. (BE = Big Endian, LE = Little Endian).
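
To make the seven schemes concrete, the sketch below encodes the single character "A" with each of them, using Python's codec names (an assumption of this example, not terminology from the glossary). Python's utf-16 and utf-32 codecs prepend a byte order mark and use the platform's native byte order, so their exact output is not shown here.

```python
schemes = ["utf-8", "utf-16", "utf-16-be", "utf-16-le",
           "utf-32", "utf-32-be", "utf-32-le"]

# Encode the same character with every scheme and show the resulting bytes.
encoded = {name: "A".encode(name) for name in schemes}
for name, data in encoded.items():
    print(f"{name:10} {len(data)} byte(s): {data.hex()}")
```

The BE and LE variants carry no byte order mark because the byte order is already stated by the scheme name.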

Character repertoire

The collection of characters included in a character set. Character repertoire refers to the set of characters that are distinctive to a certain language. For instance, both English and Finnish are written using Latin script, but their character repertoires are different – in English the alphabet runs from A to Z, while in Finnish it runs from A to Ö.

Character set

A collection of elements used to represent textual information.

CharConv

Symbian platform component handling the character conversion between UCS-2 and other encodings.

CJK

Abbreviation for Chinese, Japanese, and Korean. A variant, CJKV, means Chinese, Japanese, Korean, and Vietnamese. CJK deals with encodings and character repertoires related to Chinese, Japanese, and Korean.

Code page

A coded character set, often referring to a coded character set used by a personal computer. For example, Windows code page has been used for most of the Western European and American languages in Windows environment. Windows code page was also the default encoding of EPOC (the former name of the Symbian platform) up to ER5. In Symbian platform 6.0, this was changed to UCS-2 Little Endian.

Code point

Any value in the Unicode codespace; that is, the range of integers from 0 to 10FFFF (hexadecimal).

Codespace

A range of numerical values available for encoding characters.

Collation

The process of ordering units of textual information. Collation is usually specific to a particular language. Also known as alphabetizing or alphabetic sorting. Unicode Technical Standard #10, "Unicode Collation Algorithm," defines a complete, unambiguous, specified ordering for all characters in the Unicode Standard.
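
The language dependence of collation can be sketched in Python. This is a deliberately simplified illustration and not an implementation of the Unicode Collation Algorithm: the hypothetical key function base_letters_key ignores diacritics, which roughly resembles German dictionary ordering (ä sorted together with a), while plain code point order happens to place ä after z, as Finnish ordering does.

```python
import unicodedata

def base_letters_key(word):
    """Sort key that ignores diacritics by comparing base letters only."""
    decomposed = unicodedata.normalize("NFD", word.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

words = ["zebra", "\u00E4iti", "auto"]

by_code_point = sorted(words)                          # auto, zebra, äiti
by_base_letters = sorted(words, key=base_letters_key)  # äiti, auto, zebra

print(by_code_point)
print(by_base_letters)
```

Production code would normally rely on a locale-aware collation library rather than on hand-written keys like this one.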

Compound word

A word (lexeme) that consists of more than one free morpheme (smallest language unit that carries a semantic interpretation).

Cp

Abbreviation for code page

Diacritic

  • A mark applied or attached to a symbol to create a new symbol that represents a modified or new value.
  • A mark applied to a symbol irrespective of whether it changes the value of that symbol. In the latter case, the diacritic usually represents an independent value (for example, an accent, tone, or some other linguistic information). Also called diacritical mark or diacritical.

Diaeresis

Two horizontal dots over a letter, as in “naïve” or “äiti”. The diaeresis is not distinguished from the umlaut in the Unicode character encoding.
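
The shared encoding described above can be checked with a short Python sketch using the standard unicodedata module: the letters from both example words decompose to a base letter followed by the same combining character, U+0308.

```python
import unicodedata

# U+00EF as in "naïve" and U+00E4 as in "äiti".
for ch in ("\u00EF", "\u00E4"):
    print(unicodedata.name(ch), "->", unicodedata.decomposition(ch))

print(unicodedata.name("\u0308"))  # COMBINING DIAERESIS
```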

Directional property

A property of every graphic character that determines its horizontal ordering as specified in Unicode Standard Annex #9, “The Bidirectional Algorithm.”
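
As an illustrative sketch, the directional property of individual characters can be inspected in Python with unicodedata.bidirectional, which returns the bidirectional class abbreviation (for example L, R, AL, EN, AN, WS). The sample characters and their descriptions are chosen for this example only.

```python
import unicodedata

samples = {
    "A": "Latin letter",
    "\u05D0": "Hebrew letter alef",
    "\u0627": "Arabic letter alef",
    "1": "European digit",
    "\u0660": "Arabic-Indic digit zero",
    " ": "space",
}

# Print the bidirectional class of each sample character.
for ch, description in samples.items():
    print(f"U+{ord(ch):04X} {description}: {unicodedata.bidirectional(ch)}")
```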

Dual form

Form used in some languages to indicate that there are two items or persons. This has an impact on the inflection of adjectives and the conjugation of verbs.

Encoded character

An abstract character together with its associated Unicode scalar value (code point). By itself, an abstract character has no numerical value, but the process of “encoding a character” associates a particular Unicode scalar value with a particular abstract character, thereby resulting in an “encoded character.”

ELOCL.dll

Symbian platform locale module that contains e.g. locale related classes (such as TLocale).

Encoding

A character encoding form plus byte serialization; the mapping from a character set definition to the actual code units used to represent the data.

European digits

Forms of decimal digits first used in Europe and now used worldwide. Historically, these digits were derived from the Arabic digits; they are sometimes called “Arabic numerals,” but this nomenclature leads to confusion with the real Arabic digits.

Extended ASCII

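An 8-bit encoding that extends 7-bit ASCII, keeping the ASCII characters in the first 128 positions. For instance, the encodings of the ISO 8859 family are forms of extended ASCII. Note that the GSM 3.38 default alphabet used in SMS is a 7-bit encoding and is not itself an extension of ASCII.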

Font

A collection of glyphs used for the visual depiction of character data. A font is often associated with a set of parameters (for example, size, posture, weight, and serifness), which, when set to particular values, generates a collection of imageable glyphs.

Full-width

Characters of East Asian character sets whose glyph image extends across the entire character display cell. In legacy character sets, full-width characters are normally encoded in two or three bytes. The Japanese term for full-width characters is zenkaku.

GB 18030-2000

Chinese encoding scheme that covers the entire Unicode codespace, using sequences of one, two, or four bytes.

G11N

Abbreviation for globalization.

GBK

Chinese encoding scheme.

Globalization

  • In the internationalization and localization industry, sometimes used as the top-level term covering both localization and internationalization.

  • Software design that is applicable all over the world without localization or regionalization modifications.

Glyph

An abstract form that represents one or more glyph images.

A synonym for glyph image. In displaying Unicode character data, one or more glyphs may be selected to depict a particular character. These glyphs are selected by a rendering engine during composition and layout processing.

Grapheme

A minimally distinctive unit of writing in the context of a particular writing system. For example, ‹b› and ‹d› are distinct graphemes in English writing systems because there are distinct words such as big and dig. Conversely, a lowercase italiform letter “a” and a lowercase Roman letter “a” are not distinct graphemes because no word is distinguished on the basis of these two different forms.

What a user often thinks of as a character.

GSM 3.38

A standard related to SMS that also defines the character set and encoding used for SMS.

Half-width

Characters of East Asian character sets whose glyph image occupies half of the character display cell. In legacy character sets, half-width characters are normally encoded in a single byte. The Japanese term for half-width characters is hankaku.
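
The full-width and half-width distinction survives in Unicode as the East Asian Width property. The sketch below, assuming Python's standard unicodedata module, prints that property for a few sample characters (F = full-width, H = half-width, W = wide, Na = narrow) and shows that compatibility normalization (NFKC) folds the width variants onto their ordinary counterparts.

```python
import unicodedata

samples = ["A", "\uFF21", "\u30A2", "\uFF71", "\u6F22"]

for ch in samples:
    width = unicodedata.east_asian_width(ch)
    folded = unicodedata.normalize("NFKC", ch)
    print(f"U+{ord(ch):04X} width={width} NFKC=U+{ord(folded):04X}")
```

Here U+FF21 is the full-width form of "A" and U+FF71 is the half-width form of the katakana letter U+30A2.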

Han characters

Ideographic characters of Chinese origin.

Hangul

The name of the script used to write the Korean language.

Han unification

The process of identifying Han characters that are common among the writing systems of Chinese, Japanese, Korean, and Vietnamese.

Hanzi

The Mandarin Chinese name for Han characters.

Hiragana

One of two standard syllabaries associated with the Japanese writing system. Hiragana syllables are typically used in the representation of native Japanese words and grammatical particles.

HKSCS

Hong Kong Supplementary Character Set. Used mainly in Hong Kong. Utilizes some of the Unicode PUA code points also in textual data transfer.

IANA

Internet Assigned Numbers Authority. A standardization organization.

Ideograph

Any symbol that primarily denotes an idea (or meaning) in contrast to a sound (or pronunciation) – for example, a symbol showing a telephone.

An English term commonly used to refer to Han characters, equivalent to the borrowings hànzì, kanji, and hanja.

Indic digits

Forms of decimal digits used in various Indic scripts (for example, Devanagari: U+0966, U+0967, U+0968, and U+0969). Arabic digits (and, eventually, European digits) derive historically from these forms.

I18N

Abbreviation for internationalization

IA5

International Alphabet No. 5. ASCII based encoding and character repertoire. Used in CDMA SMS.

Internationalization

Designing software to be usable around the world. The process of enabling efficient development and creation of software variants in multiple languages from a single source. Internationalization must be done at the operating system and platform levels as well as at the UI level. Internationalization and localization together form a function that results in localized versions of a product.

ISCII

Acronym for Indian Script Code for Information Interchange. Used for instance for Hindi.

ISO

International Organization for Standardization

ISO 8859-n

Encodings and character repertoires defined by ISO. A member of the ISO 8859 family of character set standards. The number at the end of the name signifies the variant of the character set standard (e.g. ISO 8859-1 is the Latin-1 variant of the set).

Kana

The name of a primarily syllabic script used by the Japanese writing system. It comes in two forms, hiragana and katakana. The former is used to write particles, grammatical affixes, and words that have no kanji form; the latter is used primarily to write foreign words.

Kanji

The Japanese name for Han characters; derived from the Chinese word hànzì. Also romanized as kanzi.

Katakana

One of two standard syllabaries associated with the Japanese writing system. Katakana syllables are typically used in representation of borrowed vocabulary (other than that of Chinese origin), sound-symbolic interjections, or phonetic representation of “difficult” kanji characters in Japanese.

Kerning

Changing the space between certain pairs of letters to improve the appearance of the text.

Process of mapping from pairs of glyphs to a positioning offset used to change the space between letters.

L10N

Abbreviation for localization

LAF

Look and Feel specification that defines layouts for the UI.

Language

Spoken or written form of communication.

Letter

An element of an alphabet. In a broad sense, it includes elements of syllabaries and ideographs.

Informative property of characters that are used to write words.

Ligature

A glyph representing a combination of two or more characters. In the Latin script, there are only a few in modern use, such as the ligatures between “f” and “i” or “f” and “l”. Other scripts make use of many ligatures, depending on the font and style. Common in, for example, Arabic and Devanagari scripts.

Little Endian

A computer architecture that stores multiple-byte numerical values with the least significant byte (LSB) values first.

.LOC file

Localization text file that contains the user visible text of an application.

Locale

Software component containing country- and culture-specific data such as time and date formats. In the Symbian platform, a locale is a relationship between a country and a language, containing culture- and language-specific information and data processing rules.

Locale data markup language

The XML specification for the exchange of locale data, defined by Unicode Technical Standard #35, "Locale Data Markup Language (LDML)." See also Common Locale Data Repository.

Localizable

Item that can be localized without changing the source code or without the need of other types of re-engineering.

Localization

The adaptation of software and products to meet the requirements of local markets and different languages. In Symbian platform, localization is a function that enables the creation of language variants by providing a localized user interface usable in the language areas defined in the project.

LOCE32

Architectural component for locale handling in the Symbian platform.

Logical name

Unique identifier used to refer to items in LOC files, resource files and source code.

Logical order

The order in which text is typed on a keyboard. For the most part, logical order corresponds to phonetic order. For instance, Arabic text is stored in logical order, but displayed in visual order.

MIME

Multipurpose Internet Mail Extensions. MIME is a standard that allows the embedding of arbitrary documents and other binary data of known types (images, sound, video, etc.) into e-mail handled by ordinary Internet electronic mail interchange protocols.

Mirrored property

The property of characters whose images are mirrored horizontally in text that is laid out from right to left (versus left to right).

MMP file

Symbian platform project definition file that describes a project to be built.

Nonspacing mark

A combining character whose positioning in presentation is dependent on its base character. It generally does not consume space along the visual baseline in and of itself.

Normalization

A process of removing alternate representations of equivalent sequences from textual data, to convert the data into a form that can be binary-compared for equivalence. In the Unicode Standard, normalization refers specifically to processing to ensure that canonical-equivalent (and/or compatibility-equivalent) strings have unique representations.
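
A minimal Python sketch of why normalization is needed before binary comparison, using unicodedata.normalize from the standard library. The sample strings are chosen for this illustration.

```python
import unicodedata

composed = "\u00E9"     # é as a single code point
decomposed = "e\u0301"  # e followed by COMBINING ACUTE ACCENT

# The two strings are canonically equivalent but not binary-equal.
print(composed == decomposed)  # False

# After normalizing both to the same form, they compare equal.
nfc_equal = (unicodedata.normalize("NFC", composed)
             == unicodedata.normalize("NFC", decomposed))
nfd_equal = (unicodedata.normalize("NFD", composed)
             == unicodedata.normalize("NFD", decomposed))
print(nfc_equal, nfd_equal)  # True True

# Compatibility normalization additionally folds compatibility characters,
# such as the ligature U+FB01, which canonical normalization leaves alone.
ligature = "\uFB01"
print(unicodedata.normalize("NFC", ligature) == ligature)  # True
print(unicodedata.normalize("NFKC", ligature))             # fi
```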

Paragraph direction

The default direction (left or right) of the text of a paragraph. This direction does not change the display order of characters within an Arabic or English word. However, it does change the display order of adjacent Arabic and English words, and the display order of neutral characters, such as punctuation and spaces. For more details, see Unicode Standard Annex #9, “The Bidirectional Algorithm”.

Phoneme

A minimally distinct sound in the context of a particular spoken language. For example, in American English, /p/ and /b/ are distinct phonemes because pat and bat are distinct; however, the two different sounds of /t/ in tick and stick are not distinct in English, even though they are distinct in other languages such as Thai.

PinYin

Standard system for the romanization of Chinese on the basis of Mandarin pronunciation.

Presentation form

Ligature or variant glyph that has been encoded as a character for compatibility. (See also compatibility character).

Private use area

Refers to designated code points in the Unicode Standard or other character encoding standards whose interpretations are not specified in those standards and whose use may be determined by private agreement among cooperating users.

Private-use code point

Code points in the ranges U+E000..U+F8FF, U+F0000..U+FFFFD, and U+100000..U+10FFFD. These code points are designated in the Unicode Standard for private use.
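
The three ranges can be expressed directly as a range check. The following Python sketch uses a hypothetical helper, is_private_use, and cross-checks it against the general category Co (private use) reported by the standard unicodedata module.

```python
import unicodedata

# The private-use ranges listed above, as inclusive (first, last) pairs.
PRIVATE_USE_RANGES = (
    (0xE000, 0xF8FF),
    (0xF0000, 0xFFFFD),
    (0x100000, 0x10FFFD),
)

def is_private_use(code_point):
    """True if code_point falls in one of the private-use ranges."""
    return any(first <= code_point <= last
               for first, last in PRIVATE_USE_RANGES)

for cp in (0x0041, 0xE000, 0xF8FF, 0xF900, 0xFFFFD, 0xFFFFE, 0x10FFFD):
    category = unicodedata.category(chr(cp))
    print(f"U+{cp:04X}", is_private_use(cp), category)
```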

PUA

Abbreviation for Unicode term “Private Use Area”. Refers to designated code points in the Unicode Standard or other character encoding standards whose interpretations are not specified in those standards and whose use may be determined by private agreement among cooperating users.

Radical

A structural component of a Han character conventionally used for indexing. The traditional number of such radicals is 214.

Regionalization

Process of creating and customizing software that is used only in a certain region.

Rendering

The process of selecting and laying out glyphs for the purpose of depicting characters.

The process of making glyphs visible on a display device.

RSS file

Symbian platform resource source file. Resource source files contain data separate from executable code. Their main uses are for defining user interface components and for storing localizable data.

Shift-JIS

A shifted encoding of the Japanese character encoding standard, JIS X 0208, widely deployed in PCs. Most widely used encoding in Japan.

Sorting

See collation.

Supplementary character

A Unicode encoded character having a supplementary code point.

Supplementary code point

A Unicode code point between U+10000 and U+10FFFF.

Surrogate code point

A Unicode code point in the range U+D800 through U+DFFF. Reserved for use by UTF-16, where a pair of surrogate code units (a high surrogate followed by a low surrogate) “stand in” for a supplementary code point.

Surrogate pair

A representation for a single abstract character that consists of a sequence of two 16-bit code units, where the first value of the pair is a high-surrogate code unit, and the second is a low-surrogate code unit.
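
The arithmetic behind a surrogate pair can be sketched as follows. The function names to_surrogate_pair and from_surrogate_pair are hypothetical; the calculation follows the UTF-16 definition, in which 0x10000 is subtracted from the code point and the remaining 20 bits are split into two 10-bit halves.

```python
def to_surrogate_pair(code_point):
    """Split a supplementary code point into (high, low) surrogate code units."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)   # upper 10 bits
    low = 0xDC00 + (offset & 0x3FF)  # lower 10 bits
    return high, low

def from_surrogate_pair(high, low):
    """Combine a high and a low surrogate back into a code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

high, low = to_surrogate_pair(0x1F600)
print(f"{high:04X} {low:04X}")  # D83D DE00

# Python's own UTF-16 encoder produces the same two code units.
print("\U0001F600".encode("utf-16-be").hex())  # d83dde00
```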

Script

A collection of letters and other written signs used to represent textual information in one or more writing systems. For example, Russian is written with a subset of the Cyrillic script; Ukrainian is written with a different subset of the same script. The Japanese writing system uses several scripts – hiragana, katakana, kanji, and romaji.

TBuf

Symbian platform descriptor class that provides a buffer of fixed length for containing, accessing and manipulating TText (general text character) data. For the use of TBuf, see reference document [1].

TLocale

A Symbian platform class that contains locale related data and member functions to get the data.

Tone mark

A diacritic or non-spacing mark that represents a phonemic tone. Tone languages are common in Southeast Asia and Africa. Because tones always accompany vowels (the syllabic nucleus), they are most frequently written using functionally independent marks attached to a vowel symbol. However, some writing systems such as Thai place tone marks on consonant symbols; Chinese does not use tone marks (except when it is written phonemically using Latin characters).

UCS

Acronym for Universal Character Set, which is specified by International Standard ISO/IEC 10646.

UCS-2

ISO/IEC 10646 encoding form: Universal Character Set coded in 2 octets.

UCS-4

ISO/IEC 10646 encoding form: Universal Character Set coded in 4 octets.

UI

User Interface. Interface via which a user can interact with software and peripheral equipment.

Unicode

The universal character encoding, maintained by the Unicode Consortium (http://www.unicode.org/). This encoding standard provides the basis for processing, storage and interchange of text data in any language in all modern software and information technology protocols.

Unicode Character database

A collection of files providing normative and informative Unicode character properties and mappings.

Unicode Encoding Form

A character encoding form that assigns each Unicode scalar value to a unique code unit sequence. The Unicode Standard defines three Unicode encoding forms: UTF-8, UTF-16, and UTF-32.

Unicode Encoding Scheme

A specified byte serialization for a Unicode encoding form, including the specification of the handling of a byte order mark (BOM), if allowed.

Unicode Transformation Format

An ambiguous synonym for either Unicode encoding form or Unicode encoding scheme. The latter terms are now preferred.

UTF-7

Unicode (or UCS) Transformation Format, 7-bit encoding form, specified by RFC 2152. UTF-7 is rarely used nowadays.

UTF-8

The UTF-8 encoding form.

The UTF-8 encoding scheme.

“UCS Transformation Format 8,” defined in Annex D of ISO/IEC 10646:2003, technically equivalent to the definitions in the Unicode Standard. UTF-8 is widely used nowadays.

UTF-8 Encoding Form

The Unicode encoding form which assigns each Unicode scalar value to an unsigned byte sequence of one to four bytes in length.
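
The one-to-four-byte rule can be sketched in Python. The helper utf8_length is hypothetical; it applies the standard UTF-8 range boundaries and is compared against the output of Python's built-in encoder.

```python
def utf8_length(code_point):
    """Number of bytes UTF-8 uses for one Unicode scalar value."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if code_point <= 0x7F:
        return 1
    if code_point <= 0x7FF:
        return 2
    if code_point <= 0xFFFF:
        return 3
    return 4

# One sample character for each of the four sequence lengths.
for ch in ("A", "\u00E4", "\u20AC", "\U0001F600"):
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X}", encoded.hex(), len(encoded), utf8_length(ord(ch)))
```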

UTF-8 Encoding Scheme

The Unicode encoding scheme that serializes a UTF-8 code unit sequence in exactly the same order as the code unit sequence itself.

UTF-16

The UTF-16 encoding form.

The UTF-16 encoding scheme.

“Transformation format for 16 planes of Group 00,” defined in Annex C of ISO/IEC 10646:2003, technically equivalent to the definitions in the Unicode Standard.

UTF-16 Encoding Form

The Unicode encoding form which assigns each Unicode scalar value in the ranges U+0000..U+D7FF and U+E000..U+FFFF to a single unsigned 16-bit code unit with the same numeric value as the Unicode scalar value, and which assigns each Unicode scalar value in the range U+10000..U+10FFFF to a surrogate pair, according to Table 3-4, UTF-16 Bit Distribution.

UTF-16 Encoding Scheme

The Unicode encoding scheme that serializes a UTF-16 code unit sequence as a byte sequence in either Big Endian or Little Endian formats.

UTF-32

The UTF-32 encoding form.

The UTF-32 encoding scheme.

UTF-32 Encoding Form

The Unicode encoding form which assigns each Unicode scalar value to a single unsigned 32-bit code unit with the same numeric value as the Unicode scalar value.

UTF-32 Encoding Scheme

The Unicode encoding scheme that serializes a UTF-32 code unit sequence as a byte sequence in either Big Endian or Little Endian formats.

Visual order

Characters ordered as they are presented for reading. See also logical order.

Writing direction

The direction or orientation of writing characters within lines of text in a writing system. Three directions are common in modern writing systems: left to right, right to left, and top to bottom. Top to bottom rendering is not supported in Symbian platform.

Writing system

A set of rules for using one or more scripts to write a particular language.

Zero width

Characteristic of some spaces or format control characters that do not advance text along the horizontal baseline. (See nonspacing mark.)