Globalization Gotchas

Mark Davis

Unicode Basics
  • Unicode encodes characters, not glyphs:
    • U+0067 → g, whatever the glyph shape (different fonts and styles all map to the same code point)
  • Unicode does not encode characters by language:
    • French, German, English j have the same code point even though all have different pronunciations
    • Chinese 大 (da) has the same code point as Japanese 大 (dai).
  • UTF-8, UTF-16, and UTF-32 are all Unicode.
  • The word character means different things to different people: make clear which one you mean.
    • glyphs, code points, bytes, code units, user-perceived characters (grapheme clusters),…
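To make the distinction concrete, here is a small Python sketch that counts the same text in three of the units listed above. The sample string is an arbitrary choice for illustration. Grapheme clusters are left out because Python's standard library has no segmentation API for them.

```python
# One user-perceived character (x plus a combining diaeresis, two code points)
# followed by one supplementary character (U+1F600).
text = "x\u0308\U0001F600"

code_points = len(text)                            # Python str is indexed by code point
utf8_bytes = len(text.encode("utf-8"))             # UTF-8 code units are bytes
utf16_units = len(text.encode("utf-16-le")) // 2   # UTF-16 code units are 16 bits each

print(code_points, utf8_bytes, utf16_units)        # prints: 3 7 4
```

A user would likely call this text two characters, which matches none of the three counts.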
Unicode in APIs
  • U+0000 to U+10FFFF: Be prepared to handle (at least not corrupt!) any incoming code points
    • A back-level system may get unassigned code points from later versions.
    • Watch for "UCS-2" implementations. They use UTF-16 text, but don't support characters above U+FFFF; they also may accidentally cause isolated surrogates.
  • Some APIs/protocols will count lengths in code points, and others in bytes (or other code units).
    • Make sure you don't mix them up.
  • Don't limit API parameters to a single character (and definitely not to a single code unit!).
    • What users think of as a single character (e.g. ẍ, ch) may be a sequence in Unicode.
  • Use the latest version of Unicode: supports new characters, corrections, more stability guarantees.
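One place where byte and code point lengths get mixed up is truncation to a byte limit. The following is a minimal Python sketch, assuming the limit is expressed in UTF-8 bytes; the function name truncate_utf8 is hypothetical.

```python
def truncate_utf8(text: str, max_bytes: int) -> str:
    """Return the longest prefix of text whose UTF-8 form fits in max_bytes."""
    encoded = text.encode("utf-8")
    if len(encoded) <= max_bytes:
        return text
    # Cutting the byte string may split a multi-byte sequence. Because the
    # input is valid UTF-8, the only possible error is the incomplete tail,
    # which "ignore" drops instead of emitting corrupt output.
    return encoded[:max_bytes].decode("utf-8", errors="ignore")


print(truncate_utf8("a\u00e9", 2))   # the 2-byte e-acute does not fit, so only "a" remains
```

This keeps code points intact but can still split a user-perceived character such as a base letter plus combining mark.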
Choice of Characters
  • Character and block names may be misleading, e.g.,
    • U+034F COMBINING GRAPHEME JOINER doesn't join graphemes.
    • Use U+2060 (word joiner) instead of U+FEFF (zero-width nobreak space) for everything but the BOM function.
  • Never use unassigned code points; those will be used in future versions of Unicode.
    • Only use private-use (PUA) code points or noncharacters, and only if necessary.
    • If you do, minimize the opportunity for collision by picking an unusual range.
Character Conversion
  • Always use "shortest form" UTF-8.
    • It's the Law.
    • And if that isn’t enough, consider security attacks.
  • If a protocol allows a choice of charsets, always tag correctly
    • Not all text is correctly tagged: character detection may be necessary. But remember, it's always a guess!
    • Converting a database of mixed, untagged data is extremely painful.
  • Bad assumptions:
    • Length [bytes] = N * length [code points]
    • 1 character [charset X] = 1 character [Unicode]
      • The ordering may also be different.
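As an illustration of the "shortest form" rule, the sketch below relies on the strict UTF-8 decoder in Python's standard library, which rejects non-shortest (overlong) sequences. The helper name is_well_formed_utf8 is hypothetical.

```python
def is_well_formed_utf8(data: bytes) -> bool:
    """Return True only if data decodes under strict UTF-8 rules."""
    try:
        data.decode("utf-8")   # strict error handling is the default
        return True
    except UnicodeDecodeError:
        return False


# 0xC0 0xAF is an overlong (non-shortest) encoding of "/" (U+002F).
# A lenient decoder that accepted it could let "/" slip past a path filter.
overlong_slash = b"\xc0\xaf"

print(is_well_formed_utf8(b"/"))             # True
print(is_well_formed_utf8(overlong_slash))   # False
```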
Character Conversion II
  • IANA / MIME charset names are ill-defined: vendors often convert same charset different ways.
    • Shift-JIS: 0x5C → U+005C (\) or U+00A5 (¥)
  • Don’t simply omit unconvertible data; to reduce security problems, at least substitute:
    • U+FFFD (when converting to Unicode) or
    • 0x1A (when converting to bytes).
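A rough Python sketch of both substitutions follows. Decoding with errors="replace" is standard behavior and yields U+FFFD. Python's built-in "replace" handler for encoding emits "?" rather than 0x1A, so the sketch registers its own handler; the handler name "substitute_1a" is an assumption made for this example.

```python
import codecs


def _substitute_1a(error):
    # Hypothetical handler: emit one SUB (0x1A) per unconvertible character.
    if isinstance(error, UnicodeEncodeError):
        return ("\x1a" * (error.end - error.start), error.end)
    raise error


codecs.register_error("substitute_1a", _substitute_1a)

# Converting to Unicode: the invalid byte 0xFF becomes U+FFFD.
to_unicode = b"caf\xff".decode("utf-8", errors="replace")

# Converting to bytes: characters outside ASCII become 0x1A.
to_bytes = "caf\u00e9 \u4e2d".encode("ascii", errors="substitute_1a")

print(repr(to_unicode), repr(to_bytes))
```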



  • Use properties such as Alphabetic, not hard-coded lists:
    • isAlphabetic(x) regex: \p{Alphabetic} or [:Alphabetic:]
    • Not (“A” ≤ x ≤ “Z” OR “a” ≤ x ≤ “z”)
  • Some properties aren't what you think; use:
    • White_Space not General_Category=Zs
    • Alphabetic not General_Category=L
    • Lowercase not General_Category=Ll
    • Script=Greek not Block=Greek
  • Characters may change property values between versions of Unicode
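The difference between a hard-coded range and a property lookup can be sketched in Python as below. One caveat: Python's str.isalpha() is defined in terms of General_Category letter classes, not the Alphabetic property, so it is only an approximation of the recommendation above. The sample characters are arbitrary.

```python
import unicodedata


def is_ascii_letter(ch: str) -> bool:
    # The hard-coded range check that the slide warns against.
    return "A" <= ch <= "Z" or "a" <= ch <= "z"


samples = ["z", "\u00e9", "\u03a9", "\u0436"]   # z, e-acute, Greek Omega, Cyrillic zhe

ascii_only = [is_ascii_letter(ch) for ch in samples]
by_property = [ch.isalpha() for ch in samples]

# General_Category=Zs covers only space separators; TAB is a control
# character (Cc), yet it is white space for most practical purposes.
tab_category = unicodedata.category("\t")
space_category = unicodedata.category(" ")

print(ascii_only, by_property, tab_category, space_category)
```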


Identifiers & Tokens
  • When designing syntax, use as a base:
    • Pattern_Syntax for operators / relations
    • Pattern_Whitespace for gaps
    • XID_Start and XID_Continue for identifiers.
    • All backwards compatible across versions
  • Profiles may expand or narrow from the base
  • Watch out for security attacks:
    • “” with a Cyrillic “a”

► See Unicode Security at this conference
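As a small illustration in Python: str.isidentifier() follows Python's own identifier profile, which is based on XID_Start and XID_Continue. The two look-alike strings and the script_hints helper are hypothetical; taking the first word of a character name is only a rough stand-in for the real Script property.

```python
import unicodedata


def script_hints(identifier: str) -> set:
    """Rough heuristic: the first word of each letter's Unicode name."""
    hints = set()
    for ch in identifier:
        if ch.isalpha():
            hints.add(unicodedata.name(ch).split()[0])
    return hints


genuine = "paypal"
spoofed = "p\u0430ypal"   # second letter is U+0430 CYRILLIC SMALL LETTER A

print("donn\u00e9es".isidentifier())   # True: non-ASCII letters are allowed
print("2fast".isidentifier())          # False: a digit is not XID_Start
print(genuine == spoofed)              # False, although the two render alike
print(sorted(script_hints(spoofed)))   # ['CYRILLIC', 'LATIN']
```

Both strings are valid identifiers, which is why mixed-script detection has to be a separate check.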

Comparison (Collation): Searching, Sorting, Matching
  • There are two binary orders:
    • code point order = UTF-8 order = UTF-32 order
    • ≠ UTF-16 order
  • Don’t present users with binary order!
    • No users expect A < Z < a < z < Ç < ä.
    • Apply normalization to get a unique form, so Å = Å.
  • Security Issues: Protocols must precisely define the comparison operations:
    • E.g., LDAP doesn't, so lookup may fail (or falsely succeed!)
    • Aside from producing wrong results, this creates an opening for security attacks.
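The two binary orders, and the effect of normalization on equality, can be checked with a few lines of Python. The code points are picked for illustration: U+FFFD sits near the top of the BMP, and U+10000 is the first supplementary code point.

```python
import unicodedata

bmp_high = "\ufffd"
supplementary = "\U00010000"

# Code point order and UTF-8 byte order agree...
code_point_order = bmp_high < supplementary
utf8_order = bmp_high.encode("utf-8") < supplementary.encode("utf-8")
# ...but in UTF-16 the supplementary character starts with the surrogate
# 0xD800, which sorts below 0xFFFD.
utf16_order = bmp_high.encode("utf-16-be") < supplementary.encode("utf-16-be")

# A + combining ring above versus the precomposed letter.
decomposed = "A\u030a"
precomposed = "\u00c5"
equal_raw = decomposed == precomposed
equal_nfc = (unicodedata.normalize("NFC", decomposed)
             == unicodedata.normalize("NFC", precomposed))

print(code_point_order, utf8_order, utf16_order, equal_raw, equal_nfc)
```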
Language-Sensitive Comparison
  • Use UCA Order as a base to meet user-expectations:
    • a < A < ä < Ç = C◌̧ < z < Z
  • Real language-sensitive order requires tailoring on top of UCA; ordering depends on context and language:
    • china < China < chinas < danish
    • ae < æ < af
    • z < æ (Danish)
    • c < d < ... h < ch < i (Slovak)
  • Follow UCA for substring match offsets – some gotchas here.
  • Don't mix up "stable" and "deterministic" sorting: they are very different.
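The last point can be sketched in Python. A stable sort keeps items that compare equal in their input order; a deterministic comparison adds a tie-breaker so that no two distinct strings compare equal. Here str.casefold stands in for a real collation key, which is an assumption made to keep the example self-contained; it is not UCA.

```python
def stable_sort(words):
    # sorted() is stable: items with equal keys keep their input order,
    # so the result depends on how the input happened to be arranged.
    return sorted(words, key=str.casefold)


def deterministic_sort(words):
    # Code point order breaks ties, so the result is the same for any
    # arrangement of the same input strings.
    return sorted(words, key=lambda w: (w.casefold(), w))


print(stable_sort(["b", "a", "A"]))          # ['a', 'A', 'b']
print(stable_sort(["b", "A", "a"]))          # ['A', 'a', 'b']
print(deterministic_sort(["b", "a", "A"]))   # ['A', 'a', 'b']
print(deterministic_sort(["b", "A", "a"]))   # ['A', 'a', 'b']
```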



Normalization (NFC,…)
  • Standardized normalized forms defined by Unicode.
  • The ordering of accents in a normalization form may not be the typical type-in order.
    • Fonts should handle both orders.
  • Normalization is context-dependent: the result for a string can depend on neighboring characters.
    • Don't assume NFC(x + y) = NFC(x) + NFC(y)
  • People assume that NFC always composes, but some characters decompose in NFC.
  • Trivia: In Unicode 4.1 there are exactly 3 characters that are different in all 4 normalization forms: ϓ, ϔ, ẛ
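These points can be checked with Python's unicodedata module. The sample characters are chosen for illustration: U+0301 is a combining acute accent, and U+0958 (DEVANAGARI LETTER QA) is one of the characters excluded from composition. Results reflect the Unicode version bundled with the Python in use.

```python
import unicodedata


def nfc(s: str) -> str:
    return unicodedata.normalize("NFC", s)


# Concatenation: normalizing the pieces separately is not enough.
x, y = "a", "\u0301"
concat_then_normalize = nfc(x + y)        # composes to U+00E1
normalize_then_concat = nfc(x) + nfc(y)   # stays as two code points

# NFC does not always compose: this character decomposes under NFC.
nfc_of_qa = nfc("\u0958")                 # U+0915 U+093C


def distinct_forms(ch: str) -> int:
    """Count how many of the four normalization forms differ for ch."""
    forms = ("NFC", "NFD", "NFKC", "NFKD")
    return len({unicodedata.normalize(form, ch) for form in forms})


print([distinct_forms(ch) for ch in "\u03d3\u03d4\u1e9b"])   # [4, 4, 4]
```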
Case Conversion
  • Not a simple 1:1 mapping
    • Title case: dz ↔ DZ ↔ Dz
    • Expansion: heiß → HEISS → heiss
    • Context-dependent: ΌΣΟΣ → όσος
    • Language-dependent: istanbul ↔ İSTANBUL
  • Warning: never use language-dependent casing for language-independent structures, like file-system B-Trees.
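Several of these cases show up directly in Python's built-in string methods, which apply the full, language-independent Unicode case mappings. This is a sketch of that one runtime's behavior, not of any locale-aware API.

```python
# Expansion: one character becomes two, and the round trip is lossy.
upper_expansion = "hei\u00df".upper()        # "HEISS"
lower_again = upper_expansion.lower()        # "heiss", not the original

# Context-dependent: the same capital sigma lowercases to a medial sigma
# (U+03C3) inside the word and to a final sigma (U+03C2) at the end.
greek_lower = "\u038c\u03a3\u039f\u03a3".lower()

# Language-dependent: without a Turkish locale, "i" uppercases to a
# dotless "I", so this does not produce the Turkish spelling.
default_upper = "istanbul".upper()

# The dotted capital I (U+0130) lowercases to "i" plus a combining dot.
dotted_lower = "\u0130".lower()

print(upper_expansion, lower_again, greek_lower, default_upper, len(dotted_lower))
```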
Case Conversion II
  • Case folding was not stable.
    • Different results from toCaseFold(S) between two versions
    • Stability now guaranteed in Unicode 5.0
  • Don't use the Lowercase_Letter (Ll) or Uppercase_Letter (Lu) values of General_Category
    • These were constrained to be in a partition.
    • Use the separate binary properties Lowercase and Uppercase instead.
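For caseless matching, case folding is the intended tool; in Python it is exposed as str.casefold(). The sketch below is minimal and assumes the inputs are already normalized; the helper name caseless_equal is hypothetical. It also shows that General_Category has a third, titlecase value alongside Ll and Lu.

```python
import unicodedata


def caseless_equal(a: str, b: str) -> bool:
    # Case folding, unlike lower(), maps sharp s to "ss".
    return a.casefold() == b.casefold()


print(caseless_equal("Stra\u00dfe", "STRASSE"))        # True
print("Stra\u00dfe".lower() == "STRASSE".lower())      # False

# U+01C5 is a titlecase digraph: neither Ll nor Lu.
print(unicodedata.category("\u01c5"))                  # Lt
```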
Lowercase / Uppercase: Form vs Function
  • Lowercase, the binary property:
    • The character is lowercase in form, but not necessarily in function.
  • Functionally Lowercase:
    • isCased(x) & isLowercase(x).
    • See Section 3.13 of TUS.
  • What a user thinks of as a character is often a sequence.
  • Words are not just sequences of letters.
  • Lines don’t just break at spaces
  • All may be language-dependent
  • Transliteration Ελληνικά ↔ Ellēniká ≠ Translation Ελληνικά ↔ Greek
  • Transliteration may vary by language:

Путин ↔ Putin, Poutine, ...

Горбачёв ↔ Gorbachev, Gorbacev, Gorbatchev, Gorbačëv, Gorbachov, Gorbatsov, Gorbatschow, ...

  • Watch for terminology: “lossy” vs “lossless”
    • Lossy transliteration: Ελληνικά → Ellinika → Ελλινικα
    • In ISO terms: “transliteration” = lossless transliteration; “transcription” = lossy transliteration.


Rendering is Contextual
  • Glyphs may change shape
  • Multiple characters → 1 glyph
  • One character → multiple glyphs
  • Processing character-by-character gives the wrong results!

Rendering II
  • Good rendering systems will handle customary type-in order for text plus canonical order.
    • Excellent ones will do any canonically-equivalent order, but those are rare.
  • There may be differences in the customary glyphs for different languages; specify the font or the language where they have to be distinguished
  • Security Issues:
    • Never render a missing glyph as "?".
    • Don't simply overlay diacritics: it can cause security problems.



  • Unicode ≠ Globalization (aka Internationalization, Localizability)
    • Unicode provides the basis for software globalization, but there's more work to be done...
  • Use globalization APIs: Formatting and parsing of dates, times, numbers, currencies; comparison of text; calendar systems; ... are locale-dependent.
    • Where OS facilities are not adequate or cross-platform solutions are needed, use ICU (C, C++, Java)
  • Don't put any translatable strings into your code; separate into resource files.
    • Provide context to translators: is Mark a noun, a verb, or a name…
    • Don’t use the same string in different contexts unless the meaning is identical (including references).
  • Note: User-interface language (menus, dialogs, help system, ...) ≠ data language (body text, spreadsheet cells).
    • Programs need to handle, as data, more languages than appear in the localized UI.
Common Globalization Mistakes
  • Never compile Windows apps as “ANSI” (the default!).
  • Don't simply concatenate strings to make messages:
    • Order of components differs by language: use Java MessageFormat, or structure UI as separate fields.
  • Don't assume icons and symbols mean the same around the world. Don't assume everyone can read the Latin alphabet.
  • Allocate space flexibly: “OK” in English → “Aceptar” in Spanish
    • English is a relatively compact language; others may require more characters (e.g., in database fields) and more screen real estate (in UIs).
  • Beware of discrepancies in “fallback” behavior:
    • Java ResourceBundle (J2SE), JavaServer Pages Standard Tag Library (JSTL), JavaServer Faces (JSF), Apache HTTP,...
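The concatenation problem can be sketched with named placeholders, shown here in Python rather than Java MessageFormat. The template table and the render helper are hypothetical; in a real application the templates would live in resource files, and plural handling would need a proper message-formatting library.

```python
# Hypothetical per-language templates. Each translation is free to put the
# placeholders in whatever order its grammar requires.
TEMPLATES = {
    "en": "{user} deleted {count} files.",
    "de": "{count} Dateien wurden von {user} gel\u00f6scht.",
}


def render(locale: str, **params) -> str:
    return TEMPLATES[locale].format(**params)


print(render("en", user="Mark", count=3))
print(render("de", user="Mark", count=3))
```

Building the same sentence with "+" would hard-code the English word order into the program.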



Neutral Formats
  • Store and transmit neutral-format data wherever possible. Convert that data to the user's preferred formats as "close" to the user as possible.

Type              Example              Recommended Standard
Language/Locale*  en-US (en_US)        RFC 3066 bis / CLDR
Territory         AU                   RFC 3066 bis
Currency          EUR                  ISO 4217
Timezone          Australia/Melbourne  TZDB
Calendar          islamic-civil        CLDR Calendar ID
Custom Date       yyyy-MMM-dd          CLDR Pattern Format
Binary Time       8C80E9E3967A4B0      Windows File Time

  • Locale IDs are extensions of language IDs; use CLDR.
  • Don't assume that everyone in a country always uses that country’s currency. Always use an explicit currency ID (ISO 4217).
    • <RUR, 1.23457×10³> ↔ 1 234,57р. in Russian,
    • but Rub 1,234.57 in English.
  • Don't assume the timezone ID is implied by the user's locale. For the best timezone information, use the TZ database; use CLDR for timezone names.
  • If you heuristically compute territory IDs, timezone IDs, currency IDs, etc. (e.g., from browser settings) make sure the user can override that and pick an explicit value.
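The currency example above can be sketched in Python as follows: the stored value is a neutral pair of ISO 4217 code and decimal amount, and formatting happens only at display time. The CONVENTIONS table is a hand-written stand-in holding just the two conventions from the example; real code would take these from CLDR data, for instance through ICU.

```python
from decimal import Decimal

# Hypothetical, hard-coded conventions for this sketch only.
CONVENTIONS = {
    "ru": {"group": " ", "decimal": ",", "pattern": "{number}{symbol}",
           "symbols": {"RUR": "\u0440."}},
    "en": {"group": ",", "decimal": ".", "pattern": "{symbol} {number}",
           "symbols": {"RUR": "Rub"}},
}


def format_money(amount: Decimal, currency: str, locale: str) -> str:
    conv = CONVENTIONS[locale]
    number = format(amount, ",.2f")   # neutral form, e.g. "1,234.57"
    # Swap in the locale's separators, using NUL as a temporary marker.
    number = (number.replace(",", "\0")
                    .replace(".", conv["decimal"])
                    .replace("\0", conv["group"]))
    # Fall back to the ISO code when no local symbol is known.
    symbol = conv["symbols"].get(currency, currency)
    return conv["pattern"].format(number=number, symbol=symbol)


stored_currency, stored_amount = "RUR", Decimal("1234.57")
print(format_money(stored_amount, stored_currency, "ru"))
print(format_money(stored_amount, stored_currency, "en"))
```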
Unicode Guide
  • Authoritative but lightweight
  • Introduction, overview, and quick reference
  • Main principles of the Unicode Standard
  • Best practices in Software Globalization
Other Resources
  • Unicode Site:
  • An Overview of ICU:
  • Globalizing Software:
  • W3C Internationalization:
  • Microsoft Global Software Development
User Input 
  • If you develop your own text editor, use the OS APIs to handle IMEs (Input Method Engines) for Chinese, Japanese, Korean,...
  • If you are using "type-ahead" to get to a position in a list (e.g., typing "Jo" gets to the first element starting with those characters), allow arbitrary input. This is often easiest with visible fields.
  • If your password field can contain characters that require an IME, a screen pop-up box may reveal the password to onlookers.
  • In MessageFormat, watch for words like can't, since ASCII ' has syntactic meaning. Use a real apostrophe (U+2019) where possible: can’t.
  • In Java's Date and Calendar classes, months are numbered from 0 (February is month number 1!). However, weeks and days are numbered from 1.
  • Java serialized text isn't UTF-8, though it's close. U+0000 and supplementary code points are encoded differently.
  • Java globalization support is pretty outdated: use ICU to supplement it.
  • Java ResourceBundle (J2SE), JavaServer Pages Standard Tag Library (JSTL), JavaServer Faces (JSF), Apache HTTP server, etc. all provide some locale determination mechanism and facility; but they all differ in details.
  • Always encode characters above U+007F with escapes (\uxxxx).
  • There is an HTML mechanism to specify the charset of the Javascript source, but it is not widely implemented.
  • The JDK tool native2ascii can be used to convert the files to use escapes.
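The escaping rule above can be approximated in a few lines of Python. This is an illustrative sketch, not a reimplementation of native2ascii: it assumes \uXXXX escapes count UTF-16 code units, so supplementary characters come out as a surrogate pair. The function name escape_non_ascii is hypothetical.

```python
def escape_non_ascii(text: str) -> str:
    """Replace every character above U+007F with \\uXXXX escapes."""
    out = []
    for ch in text:
        if ord(ch) <= 0x7F:
            out.append(ch)
            continue
        # Go through UTF-16 so that supplementary characters yield
        # two escapes, one per surrogate code unit.
        units = ch.encode("utf-16-be")
        for i in range(0, len(units), 2):
            out.append("\\u{:04x}".format(int.from_bytes(units[i:i + 2], "big")))
    return "".join(out)


print(escape_non_ascii("caf\u00e9"))      # caf\u00e9
print(escape_non_ascii("\U0001F600"))     # \ud83d\ude00
```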