Unicode and Internationalization


Presentation Transcript


  1. Unicode and Internationalization • Draft only meaningful with voiceover! • mark.davis@us.ibm.com

  2. Internationalization  Translation • localized = foreign language not required • displayed text is translated • all native conventions used • dates, times, numbers, etc. • display, editing, GUI • internationalized = localizable w/o code changes mark.davis@us.ibm.com

  3. Internationalization Levels • Different levels of support for different environments • Server Side: low-level, no display • Client Side: high-level, display and editing

  4. Server Side • Strings • storage and manipulation • character set conversion • collation, normalization, char/word boundaries • Locales • Formatting/parsing numbers, currencies, date/times, messages • Message cataloging (resources)

  5. Client Side • Displaying, printing and editing Unicode text • BIDI display (Arabic, Hebrew…) • character shaping (Arabic, Indic, ...) • Inputting text (Japanese) • Full incorporation into the windowing and desktop interface

  6. Unicode • Key to modern internationalization • Enables robust interchange of text data • Encompasses all world characters • Supports legacy data

  7. Unicode Design Principles • Unambiguous • same code unit = same interpretation • Universal • all national standards, new extensions • Unicode 0041 FF21 = “AＡ” = SJIS 41 82 60 • Efficient • no code-switching (as in ISO 2022)
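
A minimal sketch of the 0041 FF21 example above, assuming Python's built-in `shift_jis` codec as a stand-in for the SJIS mapping on the slide. The helper `hex_bytes` is a name made up for this illustration.

```python
# U+0041 is ASCII "A"; U+FF21 is FULLWIDTH LATIN CAPITAL LETTER A.
text = "\u0041\uFF21"


def hex_bytes(data: bytes) -> str:
    """Render bytes as space-separated uppercase hex (illustrative helper)."""
    return " ".join(f"{b:02X}" for b in data)


# Shift-JIS mixes 1-byte and 2-byte codes; UTF-16 uses one 16-bit unit for each here.
sjis = hex_bytes(text.encode("shift_jis"))
utf16 = hex_bytes(text.encode("utf-16-be"))

# The two characters survive a round trip through the legacy encoding.
roundtrip = text.encode("shift_jis").decode("shift_jis")

print(sjis)   # 41 82 60
print(utf16)  # 00 41 FF 21
```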

  8. Skipping Advantages • In the interests of time, we are skipping the details of Unicode advantages, and jumping on to...

  9. Not a Magic Wand • Code required for Client Side, Server Side • Complex languages require special support • Detecting “hotspots” • The rest of the discussion covers items to watch for in XML

  10. Multiple Representations

  11. Endians • Big vs. Little: “a” = 00 61 vs. 61 00 • UTF-16: BOM (FE FF vs. FF FE) • UTF-16BE, UTF-16LE
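
The byte orders above can be checked with Python's standard codecs. This is a sketch only: `decode_utf16` is a hypothetical helper, and treating BOM-less input as big-endian is an assumption made for the example.

```python
import codecs

# "a" is U+0061; the two byte orders swap the bytes of each 16-bit code unit.
be = "a".encode("utf-16-be")  # bytes 00 61
le = "a".encode("utf-16-le")  # bytes 61 00


def decode_utf16(data: bytes, default: str = "utf-16-be") -> str:
    """Pick the byte order from a leading BOM, else fall back to `default`."""
    if data.startswith(codecs.BOM_UTF16_BE):   # FE FF
        return data[2:].decode("utf-16-be")
    if data.startswith(codecs.BOM_UTF16_LE):   # FF FE
        return data[2:].decode("utf-16-le")
    return data.decode(default)                # no BOM: assumed order


print(decode_utf16(codecs.BOM_UTF16_LE + le))
```

Data labeled UTF-16BE or UTF-16LE states its byte order in the label, so the fallback branch is where an explicit label would be passed in.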

  12. Character Conversion • Many legacy sets • Names for sets not standard • IANA most accepted, but not comprehensive • JIS “¥” overloaded with “\” • Private Use Characters • “Best fit” mappings
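
Two of the conversion pitfalls above, sketched with Python's codec registry. The labels and byte values were chosen for this illustration and are not from the slides; `canonical_name` is a hypothetical helper.

```python
import codecs


def canonical_name(label: str) -> str:
    """Resolve an encoding label to the name Python's codec registry uses."""
    return codecs.lookup(label).name


# Several labels in circulation can refer to one and the same character set.
labels = ("latin1", "ISO-8859-1", "latin-1")
resolved = {canonical_name(label) for label in labels}

# The same byte means different things in different legacy sets.
as_cp1252 = b"\x80".decode("cp1252")    # EURO SIGN
as_latin1 = b"\x80".decode("latin-1")   # a C1 control character

# Characters missing from the target set need an explicit policy.
lossy = "\u20ac".encode("latin-1", errors="replace")  # substitutes "?"
try:
    "\u20ac".encode("latin-1")          # the strict default refuses instead
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True
```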

  13. Ambiguous Term: “Character” • UTF-16: “Character” can mean: • Code Units: 16 bits • Code Point: 1 or 2 code units • Graphemes: 1+ code units • Combining sequences • Hangul Jamo • Indic clusters
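
The three senses of "character" give three different lengths for the same string. In this sketch the sample string is chosen for illustration, and `approx_graphemes` is a deliberately crude stand-in: it only skips combining marks and does not implement full grapheme segmentation, so it would miscount Hangul Jamo and Indic clusters.

```python
import unicodedata


def utf16_code_units(s: str) -> int:
    """Number of 16-bit code units in the UTF-16 form of s."""
    return len(s.encode("utf-16-le")) // 2


def code_points(s: str) -> int:
    """Python strings are indexed by code point."""
    return len(s)


def approx_graphemes(s: str) -> int:
    """Rough count: code points that are not combining marks (assumption)."""
    return sum(1 for ch in s if unicodedata.combining(ch) == 0)


# "e" + COMBINING ACUTE ACCENT, then MUSICAL SYMBOL G CLEF (outside the BMP).
sample = "e\u0301\U0001D11E"

print(utf16_code_units(sample), code_points(sample), approx_graphemes(sample))
# 4 3 2
```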

  14. Comparison/Indexing • Index by which sense of character? • Canonical equivalence • Normalization
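
Canonical equivalence can be illustrated with the standard `unicodedata` module. The choice of "é" as the sample and of NFC as the comparison form are assumptions for this sketch; `canonically_equal` is a hypothetical helper.

```python
import unicodedata

composed = "\u00E9"     # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"  # "e" followed by COMBINING ACUTE ACCENT


def canonically_equal(a: str, b: str) -> bool:
    """Compare two strings after bringing both to the same normalization form."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(composed == decomposed)                  # False: different code points
print(canonically_equal(composed, decomposed)) # True: same text
```

Indexes shift as well: the composed form has one code point and the decomposed form has two, so an offset computed before normalization may not be valid afterwards.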

  15. Collation • Large character sets • Incompatible languages, versions • Weak Equivalents: “a” ~ “ä”, “a” ~ “A” • Ignorable characters: “black-bird” • Contracting characters: “ch” • Expanding characters: “ä” • Separate key fields for phonetics
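
A toy multi-level sort key can make the weak equivalents and ignorable characters above concrete. This is not the Unicode Collation Algorithm or any library's collator: `sketch_sort_key`, the `IGNORABLE` set, and the ordering of the levels are all assumptions for illustration, and contractions such as "ch" are not handled.

```python
import unicodedata

IGNORABLE = {"-"}  # assumption: the hyphen is ignored at the primary level


def sketch_sort_key(s: str):
    """Three-level key: base letters, then accents, then case (toy model)."""
    decomposed = unicodedata.normalize("NFD", s)
    letters = [ch for ch in decomposed
               if unicodedata.combining(ch) == 0 and ch not in IGNORABLE]
    primary = "".join(letters).casefold()                 # "a" ~ "ä" ~ "A"
    secondary = "".join(ch for ch in decomposed
                        if unicodedata.combining(ch) != 0)  # accents break ties
    tertiary = tuple(ch.isupper() for ch in letters)      # case breaks the rest
    return (primary, secondary, tertiary)


words = ["b", "\u00E4", "A", "a"]
print(sorted(words))                       # code point order: ['A', 'a', 'b', 'ä']
print(sorted(words, key=sketch_sort_key))  # ['a', 'A', 'ä', 'b']
```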

  16. Case Conversion • May be 1 to many: “ß” → “SS” • May be locale-sensitive: “i” → “İ” • Does not round-trip: “vederLa”
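
The three case-conversion hazards can be observed with Python's built-in string methods, which apply the default (locale-independent) Unicode case mappings. That is itself the point of the second check: a Turkish-aware mapping would need locale information that `str.upper` does not take.

```python
# One-to-many: a single character uppercases to two.
sharp_s_upper = "\u00DF".upper()            # "ß" becomes "SS"

# No round trip: lowercasing "SS" cannot know it came from "ß".
back_again = sharp_s_upper.lower()          # "ss"

# Locale-sensitive: the default mapping gives "I"; Turkish expects "İ" (U+0130).
default_i_upper = "i".upper()

# Mixed case is lost once the text has been uppercased.
word = "vederLa"
round_tripped = word.upper().lower()        # "vederla"

print(sharp_s_upper, back_again, default_i_upper, round_tripped)
```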

  17. Formatting/Parsing • Different separators: • “1’234,56”, “1,234.56” • Different order: • “2/23/99”, “23.2.99” • “Can’t find ” + X vs. X + “ n’existe pas” • Different text: • “$”, “¥”
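
One way to avoid the word-order and separator problems above is to keep whole message patterns and separator pairs as data rather than concatenating strings in code. The tables below are hypothetical and hand-written for this sketch (including the style names and the straight apostrophe); a real application would take them from locale data.

```python
# Whole-sentence patterns: the placeholder can sit anywhere in the text.
MESSAGES = {
    "en": "Can't find {name}",
    "fr": "{name} n'existe pas",
}

# (grouping separator, decimal separator) per hypothetical style.
SEPARATORS = {
    "comma_dot": (",", "."),
    "apostrophe_comma": ("'", ","),
}


def not_found(lang: str, name: str) -> str:
    """Fill the pattern for `lang` instead of concatenating fragments."""
    return MESSAGES[lang].format(name=name)


def format_number(value: float, style: str) -> str:
    """Format with two decimals, then swap in the style's separators."""
    group, decimal = SEPARATORS[style]
    # str.translate maps both characters in one pass, so they cannot collide.
    return f"{value:,.2f}".translate({ord(","): group, ord("."): decimal})


print(not_found("en", "X"), "|", not_found("fr", "X"))
print(format_number(1234.56, "comma_dot"), format_number(1234.56, "apostrophe_comma"))
```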

  18. Display: Orientation • Characters: left-right, right-left, top-bottom, … • Lines: top-bottom, left-right, right-left • Specials: Japanese Ruby, etc. • GUI: scrollbars, menus, etc.
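
Character direction is a property recorded in the Unicode Character Database. The sketch below only looks up each character's bidirectional class with the standard `unicodedata` module; it does not reorder text for display, and the sample string is chosen for illustration.

```python
import unicodedata


def bidi_classes(s: str) -> list:
    """Return the Unicode bidirectional class of each code point in s."""
    return [unicodedata.bidirectional(ch) for ch in s]


# Latin "a", HEBREW LETTER ALEF, ARABIC LETTER ALEF, and the digit "1".
sample = "a\u05D0\u0627" + "1"

print(bidi_classes(sample))  # ['L', 'R', 'AL', 'EN']
```

Here `L` is left-to-right, `R` and `AL` are right-to-left (Hebrew and Arabic letters respectively), and `EN` is a European number, which keeps its left-to-right digit order inside right-to-left text.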

  19. Display: Glyphs ≠ Characters • Shaping: “X” → “Y” • contextual forms • ligatures • Indexing: m characters → n glyphs

  20. Display: Editing • Editing: mapping glyph ↔ char indices • Line breaking • Justification • Hyphenation • Hanging punctuation • Optical alignment • Baseline alignment

  21. Input • Large character sets • Different keyboard mappings • Keys ≠ characters • Typing events contain strings, not just singletons • Input methods • GUI for options • interaction with text editing

  22. Current Issues for W3C • BOM • Use of “character” • Indexing/comparing • Normalization • Versions of Unicode (Euro, ...) • Stateful format codes • Datatypes (date, time, …) • High-level layout (CSS…)

  23. Summary • Internationalization ≠ Translation • Unicode provides foundation • But not a magic wand! • Watch for hotspots • Work with int’l experts
