Unicode 4.0

Unicode 4.0. Mark Davis President, The Unicode Consortium Note: slides differ from proceedings. Overview. New Characters Conformance UAX: Unicode Standard Annexes UCD: Unicode Character Database UTS: Unicode Technical Standards Not part of the Standard, but can claim conformance.


Presentation Transcript


  1. Unicode 4.0 Mark Davis President, The Unicode Consortium Note: slides differ from proceedings

  2. Overview • New Characters • Conformance • UAX: Unicode Standard Annexes • UCD: Unicode Character Database • UTS: Unicode Technical Standards • Not part of the Standard, but can claim conformance

  3. Properties and Behavior • Unicode is not just a list of characters • Properties and behavior are crucial • With them, new characters can work “out of the box” • Some are part of the standard (BIDI, Normalization), others are associated (Collation, Regular Expressions)

  4. New Characters: 1,228 • Modern Scripts • (additions to) Indic, Khmer, Latin, Greek, Arabic, Syriac • (minority scripts) Limbu, Tai Le, Osmanya • Historic Scripts • Linear B, Cypriot, Ugaritic, Shavian, Aegean Numbers • Symbols • Monograms, digrams, tetragrams, other symbols • modifier & combining characters

  5. New Characters (cont.) • Special Characters • additional variation selectors (for future CJK variants), double-diacritics for dictionary use • For a detailed list, see Derived Age in the UCD 4.0, and the beta Charts. • Character repertoire corresponds to ISO/IEC 10646:2003.

  6. Conformance • Substantially improved specification of conformance requirements • Incorporated UTR #17: Character Encoding Model, clearly separating encoding forms and encoding schemes • Tightened definitions of UTF-8, UTF-16, UTF-32 • Separate definition of Unicode String • Clarified conformance status of Unicode Standard Annexes • Formal definitions of properties & algorithms • Provisional properties
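The split between encoding forms (code units) and encoding schemes (serialized bytes) can be illustrated with Python's built-in codecs. This is a minimal sketch, assuming Python 3; the variable names are illustrative and not part of the slides.

```python
# Encoding form vs. encoding scheme, using Python's built-in codecs.
text = "A\U00010000"  # U+0041 followed by one supplementary code point

# UTF-16 encoding form: a sequence of 16-bit code units.
# Here the code units are recovered from the big-endian byte serialization.
be_bytes = text.encode("utf-16-be")
code_units = [
    int.from_bytes(be_bytes[i:i + 2], "big")
    for i in range(0, len(be_bytes), 2)
]

# UTF-16BE and UTF-16LE encoding schemes: the same code units,
# serialized to bytes in two different byte orders.
le_bytes = text.encode("utf-16-le")

print([hex(u) for u in code_units])
print(be_bytes.hex(), le_bytes.hex())
```

The code units are identical in both cases; only the byte serialization differs, which is the distinction the Character Encoding Model draws.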

  7. UTF vs. Unicode String • Important Distinction • UTF • Unique representation for Code Point • All else illegal • C0 80 • D800 0061 • Unicode String • Sequence of code units • Internal Processing, not interchange • Not necessarily valid UTF • C0 A0 • D800 0061
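The byte and code unit sequences on this slide can be checked against Python's strict decoders. A minimal sketch, assuming Python 3; the helper names `is_valid_utf8` and `is_valid_utf16` are hypothetical.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes are well-formed UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def is_valid_utf16(units) -> bool:
    """Return True if the 16-bit code units are well-formed UTF-16."""
    raw = b"".join(u.to_bytes(2, "big") for u in units)
    try:
        raw.decode("utf-16-be")
        return True
    except UnicodeDecodeError:
        return False


# Sequences from the slide: none of these is valid UTF.
overlong_nul = bytes([0xC0, 0x80])
bad_lead = bytes([0xC0, 0xA0])
lone_surrogate_units = [0xD800, 0x0061]

# A Python str behaves like a "Unicode string" for internal processing:
# it can hold a lone surrogate, but it cannot be encoded for interchange.
internal = "\ud800a"
try:
    internal.encode("utf-8")
    interchangeable = True
except UnicodeEncodeError:
    interchangeable = False
```

Here `internal` has length 2 and can be processed in memory, yet strict encoding rejects it, which mirrors the slide's point that a Unicode string is not necessarily valid UTF.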

  8. Conformance (cont.) • Formalized policies for stability of the standard • Clarification of semantics of important characters, including BOM • Revised scope of enclosing combining marks • Revised semantics of ZWJ for cursive scripts • Normalization Corrections • U+2F868; U+2F874; U+2F91F; U+2F95F; U+2F9BF • All corrections subject to strict stability constraints: • For 3.2 repertoire, NFC3.2(X) = NFC4.0(X)

  9. Textual Clarifications • Major changes to Chapters 2, 3, 6, 14 and 15 • Definitive terminology for code points: • graphic, format, control, private-use • = assigned characters • surrogate, noncharacter, reserved • not characters • Substantial improvements to many character block descriptions, especially Indic
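The code point terminology above can be approximated from the General_Category property. The following is a rough sketch using Python's `unicodedata` module; the function names are hypothetical, and the results for "reserved" depend on the Unicode version bundled with the Python build.

```python
import unicodedata


def is_noncharacter(cp: int) -> bool:
    """U+FDD0..U+FDEF plus the last two code points of every plane."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def code_point_type(cp: int) -> str:
    """Classify a code point into one of the seven basic types."""
    if is_noncharacter(cp):
        return "noncharacter"
    gc = unicodedata.category(chr(cp))
    if gc == "Cs":
        return "surrogate"
    if gc == "Cn":
        return "reserved"      # unassigned in this Unicode version
    if gc == "Cc":
        return "control"
    if gc in ("Cf", "Zl", "Zp"):
        return "format"
    if gc == "Co":
        return "private-use"
    return "graphic"           # letters, marks, numbers, punctuation, symbols, spaces


for cp in (0x0041, 0x000A, 0x200D, 0xE000, 0xD800, 0xFFFF):
    print(f"U+{cp:04X}", code_point_type(cp))
```

The first four types correspond to assigned characters; the last three (surrogate, noncharacter, reserved) are code points that are not characters.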

  10. Programming language identifiers • Now backwards-compatible • Once a Unicode identifier, • Always a Unicode identifier • Alternate definition for complete stability • Fix set of allowed characters • Allow all reserved code points • + Complete stability • - “Odd” characters • Also see new UTR on Syntax Characters
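As one concrete illustration, Python's own identifier syntax is based on the Unicode identifier properties (XID_Start and XID_Continue, with `_` additionally allowed at the start), and `str.isidentifier()` exposes that check. This sketch shows Python's rules only; other languages may tailor the definition differently.

```python
# Candidate names, checked against Python's Unicode-based identifier syntax.
candidates = ["caf\u00e9", "\u5909\u6570", "x1", "1x", "a-b", "_tmp"]
results = {name: name.isidentifier() for name in candidates}

for name, ok in results.items():
    print(repr(name), ok)
```

Backwards compatibility means that a string accepted here under one Unicode version should remain accepted under later versions.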

  11. Case mappings now normative (but tailorable) • Clearer definition of string functions: • isUpper(), isLower(), isTitle(), isFold() • toUpper(), toLower(), toTitle(), toFold() • Definition of titlecase uses word boundaries • Note that the Turkic mappings do not maintain canonical equivalence, without additional processing.
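The string functions named on this slide can be sketched on top of Python's default (untailored) case mappings. The wrapper names below are hypothetical, and the sketch assumes the simple reading isX(s) = (toX(s) == s); it ignores normalization and omits titlecase, since Python's `str.title()` does not use the word boundaries the slide refers to.

```python
def to_upper(s: str) -> str:
    return s.upper()        # full default uppercase mapping


def to_lower(s: str) -> str:
    return s.lower()        # full default lowercase mapping


def to_fold(s: str) -> str:
    return s.casefold()     # full default case folding


def is_upper(s: str) -> bool:
    return to_upper(s) == s


def is_lower(s: str) -> bool:
    return to_lower(s) == s


def is_fold(s: str) -> bool:
    return to_fold(s) == s


print(to_upper("stra\u00dfe"))            # one character can map to two
print(to_fold("Stra\u00dfe"))
print([hex(ord(c)) for c in to_lower("\u0130")])  # default mapping of dotted capital I
```

Note that `is_upper` here differs from Python's built-in `str.isupper()`, which also requires at least one cased character. The Turkic mappings mentioned on the slide are a tailoring and are not applied by these default methods.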

  12. UAX #9: BIDI • BIDI: Arabic/Hebrew Display • HTML, all modern word processors, OSs,… • New: • canonical equivalence now preserved • data change, not algorithm • shaping is done after reordering • but not across directional boundaries • clarifications of: • ZWJ, ZWNJ • intermediate level processing
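The algorithm is driven by the Bidi_Class property of each character, which is why a fix can be a data change rather than an algorithm change. A small sketch that only looks up that property with Python's `unicodedata` module; it does not run the reordering algorithm itself.

```python
import unicodedata

# Latin letter, Hebrew alef, Arabic alef, European digit, space.
samples = ["a", "\u05d0", "\u0627", "1", " "]
bidi_classes = {ch: unicodedata.bidirectional(ch) for ch in samples}

for ch, cls in bidi_classes.items():
    print(f"U+{ord(ch):04X}", cls)
```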

  13. UAX #15: Normalization • Unique form for text comparison • W3C Character Model, International Domain Names, Network File System,… • New: • Description of Stable Code Points. • Notation NFC(x) and isNFC(x), in Notation. • Added pointer to UTN #5 Canonical Equivalences in Applications • Rewrote Annex 12: Corrigenda for clarity, and to describe the use of Normalization Corrections. • Added Annex 13: Canonical Equivalence.
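The NFC(x) and isNFC(x) notation maps directly onto Python's `unicodedata.normalize`. A minimal sketch; the helper names `nfc` and `is_nfc` are chosen here to mirror the notation and are not a library API.

```python
import unicodedata


def nfc(x: str) -> str:
    """NFC(x): the canonical composed form of x."""
    return unicodedata.normalize("NFC", x)


def is_nfc(x: str) -> bool:
    """isNFC(x): true when x is already in NFC."""
    return nfc(x) == x


decomposed = "e\u0301"   # e + combining acute accent
composed = "\u00e9"      # precomposed e with acute

print(nfc(decomposed) == composed)
print(is_nfc(decomposed), is_nfc(composed))
```

Because both spellings normalize to the same string, comparing `nfc(a) == nfc(b)` treats canonically equivalent text as equal.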

  14. UAX #14: Line Breaking • Line-Break (word-wrap) all Unicode text • Customizable for different languages • New: • Negative numbers and dates with hyphens will not break across lines • Word-Joiner will link any characters (except hard line breaks) • Behavior of soft hyphen clarified • marks opportunity for breaking, not specific graphic appearance. • Rules for GL relaxed: SP and ZW override • New Property Values: NL, WJ

  15. UAX #29: Text Boundaries • Default “User Character”, Word, Sentence boundaries • Customizable for different languages • Word, sentence: tailoring expected • New: • Extracted from 3.0, but significantly revised • Grapheme cluster (“user character”) • Hangul Syllable or other Base • plus (optionally) any number of NSMs
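The "base plus any number of NSMs" rule can be sketched as follows. This is a deliberately simplified approximation, not the full default algorithm: it treats only General_Category Mn and Me as non-spacing marks and ignores conjoining Hangul jamo, CR LF, and the Grapheme_Extend property. The function name is hypothetical.

```python
import unicodedata


def simple_grapheme_clusters(text: str) -> list:
    """Split text into base characters, each followed by its non-spacing marks."""
    clusters = []
    for ch in text:
        is_nsm = unicodedata.category(ch) in ("Mn", "Me")
        if clusters and is_nsm:
            clusters[-1] += ch      # attach the mark to the preceding base
        else:
            clusters.append(ch)     # start a new cluster
    return clusters


sample = "e\u0301a\u0308\u0323b"
print(len(sample), len(simple_grapheme_clusters(sample)))
```

The sample has six code points but three user-perceived characters, which is the gap that grapheme cluster boundaries are meant to close.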

  16. No Substantive Changes • UAX #11: East Asian Width • Guidelines for choosing character width • UAX #24: Script Names • Default script assignment • Used in regular expressions • Now UAX
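The East Asian Width property from UAX #11 is available through Python's `unicodedata` module. A short lookup sketch; script names (UAX #24) are not exposed by the standard library, so they are not shown.

```python
import unicodedata

# ASCII letter, CJK ideograph, fullwidth letter, halfwidth katakana, Greek alpha.
samples = ["A", "\u4e2d", "\uff21", "\uff71", "\u03b1"]
widths = {ch: unicodedata.east_asian_width(ch) for ch in samples}

for ch, width in widths.items():
    print(f"U+{ord(ch):04X}", width)
```

The values are Na (narrow), W (wide), F (fullwidth), H (halfwidth), and A (ambiguous); ambiguous characters are the ones whose width depends on context.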

  17. Superseded UAXes • Incorporated into and thus superseded by Unicode Version 4.0: • UAX #13: Unicode Newline Guidelines • UAX #19: UTF-32 • UAX #21: Case Mappings • UAX #27: Unicode 3.1 • UAX #28: Unicode 3.2

  18. Unicode Character Database • Crucial Component of Unicode • Documentation coalesced into UCD.html. • New properties and values • Hangul_Syllable_Type, Unicode_Radical_Stroke • CJK numeric values added. • PropertyValueAliases adds block names • UCD fallback props more precisely defined. • for code points not explicitly in data files • New Characters • Appropriate properties assigned

  19. UCD4.0 (cont.) • Modifier letters • The general category of 02B9..02BA, 02C6..02CF changed to general category Lm. • Khmer • Two Khmer characters are deprecated; four others strongly discouraged. • Decimal Digits • Numeric_Type=decimal digit now aligned with General_Category=Nd • Braille • Added script value
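Two of these changes can be observed through Python's `unicodedata` module. This is a sketch against whatever Unicode version the Python build bundles, which is later than 4.0; the helper name `decimal_matches_nd` is hypothetical.

```python
import unicodedata

# Modifier letters: U+02B9..U+02BA and U+02C6..U+02CF.
modifier_letters = [chr(cp) for cp in [0x02B9, 0x02BA, *range(0x02C6, 0x02D0)]]
categories = {unicodedata.category(ch) for ch in modifier_letters}


def decimal_matches_nd(ch: str) -> bool:
    """True when 'has a decimal digit value' agrees with General_Category=Nd."""
    is_nd = unicodedata.category(ch) == "Nd"
    has_decimal = unicodedata.decimal(ch, None) is not None
    return is_nd == has_decimal


# ASCII digit, Arabic-Indic digit, superscript two, Roman numeral eight, letter.
samples = ["7", "\u0663", "\u00b2", "\u2167", "A"]
print(categories)
print([decimal_matches_nd(ch) for ch in samples])
```

Superscript two still has a digit value via `unicodedata.digit`, but no decimal value, consistent with it not being General_Category Nd.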

  20. UCD4.0 (cont. 2) • Case Mapping • Fixed for Turkish, Lithuanian • Default Ignorables • Hangul Filler characters • Soft-Hyphen, CGJ, ZWS • Arabic End of Ayah and Syriac Abbreviation Mark no longer DI, shaping classes fixed. • Grapheme_Extend • removes halfwidth katakana marks, most Mc (except as needed for canonical equivalence)

  21. Unicode Technical Standard • UTS: separate standard • independent conformance requirements • UTR: information and guidelines • Documents may move from UTR status to UTS

  22. UTS #10: Unicode Collation • Significance: • String comparison, matching, searching • Compares all Unicode characters • Handles linguistic features • Accents, Case, Punctuation,… • Contextual weighting,… • Tailor for different languages • Version 4.0.0 due Sept. 2003 • From now on, to be sync'ed in repertoire and version with the Unicode Standard.
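The idea of comparing on several levels (base letters first, then accents, then case) can be sketched with a toy sort key. This is an illustration of multi-level comparison only, assuming nothing beyond Python's standard library; it is not the Unicode Collation Algorithm, uses no collation weight table, and the function name is hypothetical.

```python
import unicodedata


def toy_collation_key(s: str):
    """Toy three-level key: base letters, then accents, then case."""
    decomposed = unicodedata.normalize("NFD", s)
    base = "".join(c for c in decomposed if not unicodedata.combining(c))
    primary = base.casefold()            # ignores accents and case
    secondary = decomposed.casefold()    # accents break primary ties
    tertiary = decomposed.swapcase()     # swapcase so lowercase sorts first
    return (primary, secondary, tertiary)


words = ["b", "\u00e1", "a", "B"]
by_code_point = sorted(words)
by_toy_key = sorted(words, key=toy_collation_key)

print(by_code_point)
print(by_toy_key)
```

Code point order places "B" before "a" and the accented letter last, while the toy key keeps the accented letter next to its base letter. A real implementation would also handle the contextual weighting and language tailoring listed on the slide.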

  23. UTS #18: Regular Exp. • Significance: • Crucial to many applications: web, XML,… • Unicode adds significant requirements • Level 1: Basic Support • Perl • Level 2: Extended Support • Level 3: Tailored Support • New: • Recently approved as UTS (was UTR) • Adds clearer conformance requirements • Flexible list of features • Partial conformance claims
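A small illustration of Unicode-aware matching, using Python's standard `re` module. This makes no claim about which conformance level that module meets; in particular, it does not provide the `\p{...}` property syntax, so only the built-in classes `\w` and `\d` are shown.

```python
import re

# For str patterns, \w and \d match Unicode word characters and decimal digits.
text = "na\u00efve caf\u00e9, \u0391\u03b8\u03ae\u03bd\u03b1"
words = re.findall(r"\w+", text)

arabic_indic = "\u0661\u0662\u0663"
digits_ok = re.fullmatch(r"\d+", arabic_indic) is not None

print(words, digits_ok)
```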

  24. UTS #6: SCSU • Simple Unicode Compression • Added suitability for XML • See also Technical Note on BOCU • Main difference: preserves binary order • x < y => BOCU(x) < BOCU(y)
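The "preserves binary order" property means that comparing encoded bytes gives the same result as comparing code points. Python's standard library has no SCSU or BOCU codec, so this sketch illustrates the property itself using UTF-8 (which has it) and UTF-16 (which does not).

```python
# x precedes y in code point order: U+FF5E < U+10000.
x = "\uff5e"
y = "\U00010000"

code_point_order = x < y

# UTF-8 bytes compare in the same order as the code points.
utf8_order = x.encode("utf-8") < y.encode("utf-8")

# UTF-16 does not: the surrogate pair D800 DC00 sorts below FF5E.
utf16_order = x.encode("utf-16-be") < y.encode("utf-16-be")

print(code_point_order, utf8_order, utf16_order)
```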

  25. New UTRs • Draft UTR #23: Character Properties • Draft Character Property Model • Character Folding • Hiragana-Katakana, Case, … • Programming Language IDs, Syntax characters

  26. Q & A • Other talks here: • Common Locale Data • interchange of language-specific data for sorting, dates, times, currencies • ICU • premier Unicode enablement library • full-featured, x-platform • C, C++, Java

  27. Background Slides

  28. Unicode 3.2 (March, 2002) • New Characters: 1,016 • Symbols • Large collection of mathematical symbols, especially targeted at MathML, recycling symbols, ornamental brackets. • Special Characters • combining grapheme joiner, word joiner, invisible operators for math, variation selectors • Modern Scripts • minority scripts of the Philippines

  29. Conformance • Eliminates irregular UTF-8 • Defines variation sequences • Replaces ZWNBSP with Word Joiner • Clarifies scope of combining marks (further revised in 4.0) • Clarifications of conjoining jamo behavior, hangul syllable structure, decomposables

  30. Textual Clarifications • Combined vowels in Khmer, characters discouraged in Khmer • Use of dingbats

  31. Unicode Standard Annexes • UAX #21: Case Mappings (was UTR)

  32. Unicode Character Database • New properties: • IDS_Binary_Operator, IDS_Trinary_Operator, Radical, Unified_Ideograph, • Default_Ignorable_Code_Point, Deprecated, Soft_Dotted, Logical_Order_Exception, • Grapheme_Base, Grapheme_Extend, Grapheme_Link • DerivedAge • Normalization Corrections • Added Property & Property Value Aliases • Adds StandardizedVariants.html

  33. Related Items • UTS #10: Unicode Collation Algorithm • Ignorable character handling, dual versioning, more conditions on well-formed weights, separate weights for CJK and unassigned characters, non-characters • Note: base version still U3.1 • UTR #26: CESU-8 • Unicode Technical Notes • Updated Character Encoding Stability Policy • Added Public Review process • Updated Glossary

  34. Unicode 3.1 (March, 2001) • New Characters: 44,946 • First supplementaries encoded! • Modern scripts • CJK Ideographs (now totaling 71,039) • Historic scripts • Old Italic, Gothic, Deseret, Byzantine Musical Symbols • Symbols • Mathematical Alphanumeric Symbols, (Western) Musical Symbols

  35. Conformance • Non-shortest-form UTF-8 excluded • Clarification of the stability of the standard, • code units vs. code points, non-characters, normative properties, informative properties, normative references • Revisions of guidelines: • wchar_t, unassigned code points, identifiers • Major revision of Georgian • Use of ZWNJ and ZWJ for ligatures • Language tag characters encoded • but discouraged

  36. Unicode Standard Annexes • UAX #19: UTF-32

  37. Unicode Character Database • Major revision of PropList properties: • White_Space, Bidi_Control, Join_Control, Hex_Digit • Alphabetic, Ideographic, Lowercase, Uppercase, ID_Start, ID_Continue, XID_Start, XID_Continue, Noncharacter_Code_Point • Quotation_Mark, Terminal_Punctuation, Math, Dash, Hyphen, Diacritic, Extender • New properties: Case folding, Scripts • Added DerivedProperties, NormalizationTest

  38. Related Items • Documented Character Encoding Stability Policy • UTS #10: Unicode Collation Algorithm • Merged data files; updated to base version 3.1 • UTR #18: Unicode Regular Expression Guidelines • UTR #20: Unicode in XML and other Markup Languages • UTR #22: Character Mapping Tables • UTR #24: Script Names

  39. Schedule • 2003, April: UCD/UAXes • Final data files available • Implementation can proceed • 2003, September: • Book Available
