
New in Unicode

Mark Davis, John Jenkins


Agenda

  • Unicode 4.1.0

  • UCA 4.1.0

  • Regular Expressions

  • Security Considerations

  • Character Mapping

  • Common Locale Data Repository

  • Expanded Role for Consortium


Unicode 4.1.0

  • Released 2005 March 31

  • New Characters

  • New Unicode Character Database

  • New Specifications


1,273 New Characters

  • Roundtripping for HKSCS and GB 18030

  • Five new currency signs

  • Additional characters for Indic and Korean

  • Eight new scripts


Changes in the Standard

  • Conformance Changes

    • Modifications to Default Case Operations

    • Clarification of Decomposition Mappings

  • Other Changes

    • SPACE not recommended as base for nonspacing marks

    • Use of CGJ to prevent reordering, prevent contractions in sorting/matching (UCA)

    • Positioning of Meteg

    • Rendering of Thai Combining Marks
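The CGJ item above can be illustrated with a minimal sketch using Python's standard `unicodedata` module. The choice of marks (dot above, dot below) is an assumption made only for illustration; the slide does not name specific characters.

```python
import unicodedata

CGJ = "\u034f"        # COMBINING GRAPHEME JOINER, canonical combining class 0
DOT_ABOVE = "\u0307"  # canonical combining class 230
DOT_BELOW = "\u0323"  # canonical combining class 220

plain = "a" + DOT_ABOVE + DOT_BELOW
with_cgj = "a" + DOT_ABOVE + CGJ + DOT_BELOW

# Canonical ordering sorts adjacent marks by combining class,
# so the dot below moves in front of the dot above.
reordered = unicodedata.normalize("NFD", plain)

# CGJ has combining class 0, so marks are not reordered across it.
preserved = unicodedata.normalize("NFD", with_cgj)

print([f"U+{ord(c):04X}" for c in reordered])
print([f"U+{ord(c):04X}" for c in preserved])
```

The same blocking effect is what lets CGJ keep two characters from being treated as a contraction during collation, although that part is not shown here.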


Unicode Character Database

  • Determines the behavior of characters in modern software:

    • Alphabetics, Letters, Numbers, Identifiers, Scripts, …

  • New properties

    • Grapheme_Cluster_Break, Sentence_Break, Word_Break, Pattern_Syntax, and Pattern_White_Space

  • Revised Property Values

    • E.g., Alphabetic ⊃ (Lowercase ∪ Uppercase)

  • Expanded documentation

  • Each release now complete, not delta
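As a rough illustration of how software consults the Unicode Character Database, the sketch below queries a few properties through Python's standard `unicodedata` module. The helper name `describe` and the sample characters are hypothetical. The module exposes only a subset of the database and does not cover the new segmentation or pattern properties listed above.

```python
import unicodedata

def describe(ch: str) -> dict:
    """Collect a few UCD-derived properties for one character."""
    return {
        "code_point": f"U+{ord(ch):04X}",
        "name": unicodedata.name(ch, "<unnamed>"),
        "general_category": unicodedata.category(ch),
        "combining_class": unicodedata.combining(ch),
    }

# The UCD version bundled with this Python build.
print(unicodedata.unidata_version)

# An uppercase letter, a lowercase letter, and a Devanagari digit.
for ch in "A\u00e5\u0967":
    print(describe(ch))
```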


New Specifications

  • UAX #31: Identifier and Pattern Syntax

    • Basis for Backwards-Compatible Identifiers

      • Programming Languages

      • Resources and Services

    • Basis for Stable Syntax characters

      • Whitespace

      • Operators

  • UAX #34: Unicode Named Character Sequences

    • Mechanism for identifying/naming significant sequences

    • Standardized list
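Python is one language whose identifier syntax is based on UAX #31, so its built-in `str.isidentifier()` gives a quick, hedged view of the identifier rules in practice. The candidate strings are made up for illustration, and Python adds a few rules of its own, so this is not a reference implementation of the annex.

```python
# Candidate identifiers, chosen only for illustration.
candidates = ["total", "na\u00efve", "\u53d8\u91cf", "2fast", "a-b", "snake_case"]

# str.isidentifier() checks the string against Python's identifier
# grammar, which builds on the XID_Start / XID_Continue properties.
results = {name: name.isidentifier() for name in candidates}

for name, ok in results.items():
    print(f"{name!r}: {'valid' if ok else 'invalid'}")
```

Note that `-` is rejected because it belongs to the stable syntax characters rather than to identifiers, which is the split the annex formalizes.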


Major Revisions in Annexes

  • UAX #15: Unicode Normalization Forms

    • Correction for Idempotency Problem

    • Enhanced discussion of Hangul

  • UAX #14: Line Breaking Properties

    • Modifications for Hangul

    • Changes because SPACE not recommended as base for nonspacing marks

    • Separated all suggested tailorings into separate section

  • UAX #29: Text Boundaries

    • Using new properties, adding Joiner/Non-Joiner

    • Modifications to Word_Break
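For the UAX #15 items above, the property at stake is idempotency: normalizing an already normalized string must change nothing. The sketch below checks that property on a few sample strings, including a Hangul syllable and its jamo sequence. The samples are assumptions chosen for illustration and are not the specific sequences addressed by the correction.

```python
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")

samples = [
    "\u00c5",              # LATIN CAPITAL LETTER A WITH RING ABOVE
    "A\u030a",             # A + COMBINING RING ABOVE
    "\u212b",              # ANGSTROM SIGN
    "\ud55c",              # precomposed Hangul syllable
    "\u1112\u1161\u11ab",  # the same syllable as conjoining jamo
]

def is_idempotent(text: str, form: str) -> bool:
    """True if applying the normalization form twice equals applying it once."""
    once = unicodedata.normalize(form, text)
    return unicodedata.normalize(form, once) == once

for s in samples:
    print([f"U+{ord(c):04X}" for c in s],
          {form: is_idempotent(s, form) for form in FORMS})

# Hangul syllables decompose and recompose algorithmically.
hangul_nfd = unicodedata.normalize("NFD", "\ud55c")
hangul_nfc = unicodedata.normalize("NFC", "\u1112\u1161\u11ab")
```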


UTS #10: Unicode Collation Algorithm

  • Basis for language-sensitive sorting, searching, and matching

  • Synchronized with Unicode 4.1.0

  • New:

    • Characters

    • Revised Weights

    • Specification: matching, ignorables, Thai, …
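To give a feel for what multi-level, language-sensitive comparison means, here is a toy sort key. It is not the UCA: the real algorithm takes its weights from a collation element table, whereas this sketch derives a primary level from base letters, a secondary level from combining marks, and a tertiary level from case. The function name `toy_sort_key` and the word list are hypothetical.

```python
import unicodedata

def toy_sort_key(text: str):
    """Three-level key: base letters, then accents, then case (toy weights)."""
    decomposed = unicodedata.normalize("NFD", text)
    base = [c for c in decomposed if not unicodedata.combining(c)]
    marks = [c for c in decomposed if unicodedata.combining(c)]
    primary = "".join(base).casefold()           # ignore accents and case
    secondary = tuple(ord(m) for m in marks)     # arbitrary accent weights
    tertiary = tuple(c.isupper() for c in base)  # lowercase before uppercase
    return (primary, secondary, tertiary)

words = ["zebra", "\u00c4pfel", "apple", "Zoo"]

by_code_point = sorted(words)
by_toy_key = sorted(words, key=toy_sort_key)

print(by_code_point)
print(by_toy_key)
```

Code point order puts the uppercase and accented words at the extremes, while the layered key interleaves them the way a reader would expect.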


UTS #18: Unicode Regular Expressions

  • Regular expressions used widely in programs for matching patterns (e.g., wildcards)

  • Unicode expands the scope drastically

  • Explicit Conformance Clauses

  • POSIX-Conformance
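A small sketch of how Unicode widens the scope of familiar regular expression classes, using Python's standard `re` module. The sample text is an assumption. Python's `re` is used only as a convenient engine, with no claim about its UTS #18 conformance level, and it does not support `\p{…}` property escapes.

```python
import re

# Latin with diacritics, Han, and a mixed Bengali/Oriya digit pair.
text = "na\u00efve caf\u00e9 \u6771\u4eac \u09ea\u0b68"

# For str patterns, \w and \d follow Unicode character properties by default.
unicode_words = re.findall(r"\w+", text)
unicode_digits = re.findall(r"\d", text)

# With re.ASCII the same classes shrink back to [A-Za-z0-9_] and [0-9].
ascii_words = re.findall(r"\w+", text, flags=re.ASCII)
ascii_digits = re.findall(r"\d", text, flags=re.ASCII)

print(unicode_words, unicode_digits)
print(ascii_words, ascii_digits)
```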


UTR #36: Unicode Security Considerations

  • Incorrect usage of Unicode can expose programs or systems to possible security attacks! Examples:

  • Numbers: ৪୨ = 42!

    • Bengali {০ ১ ২ ৩ ৪ ৫ ৬ ৭ ৮ ৯}, Oriya {୦ ୧ ୨ ୩ ୪ ୫ ୬ ୭ ୮ ୯}.

  • Domain Names:
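The mixed-digit example on this slide can be reproduced directly: Python's `int()` accepts any Unicode decimal digits, so a Bengali four followed by an Oriya two parses as 42 even though it may look like a different number. The sketch below adds a hypothetical guard, `parse_int_strict`, that rejects digit strings drawn from more than one script. Detecting the script from the first word of the character name is a heuristic assumed here for brevity, since the standard library does not expose the Script property.

```python
import unicodedata

def digit_scripts(text: str) -> set:
    """Name prefixes (a stand-in for scripts) of the decimal digits in text."""
    scripts = set()
    for ch in text:
        if unicodedata.category(ch) == "Nd":
            # e.g. "BENGALI DIGIT FOUR" -> "BENGALI" (heuristic, not Script)
            scripts.add(unicodedata.name(ch).split()[0])
    return scripts

def parse_int_strict(text: str) -> int:
    """Parse an integer, refusing digits that mix scripts."""
    found = digit_scripts(text)
    if len(found) > 1:
        raise ValueError(f"digits from multiple scripts: {sorted(found)}")
    return int(text)

mixed = "\u09ea\u0b68"  # BENGALI DIGIT FOUR + ORIYA DIGIT TWO

print(int(mixed))            # the permissive parse accepts the mixture
print(digit_scripts(mixed))

try:
    parse_int_strict(mixed)
    rejected = False
except ValueError:
    rejected = True
print(rejected)
```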


Character Mapping ML

  • XML format for the interchange of mapping data for character encodings and aliases.

  • Promoted to Unicode Technical Standard, with a new Conformance section (2).

  • Added explicit text about multi-character mappings.


Common Locale Data Repository

  • Common, necessary software locale data for world languages

  • XML format for effective interchange

  • Sample data:

    • Dates

      • Δευτέρα, 05 Σεπτεμβρίου 2005

      • Montag, 5. September 2005

    • Currency amounts

      • 1 234,57руб.

      • ¥1,234.57

    • Language names

      • Arabic – arabski

      • Bulgarian – bułgarski

      • Czech – czeski

    • Territory names

      • Africa – 非洲

      • Central America – 中美洲

      • Eastern Africa – 东非

      • Northern Africa – 北非

    • Currency symbols

      • AED – د.إ.‏

      • BHD – د.ب.

      • DZD – د.ج.‏

      • EGP – ج.م.‏

      • EUR – €

    • Collation

      • Z < Å
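To make the "XML format for effective interchange" point concrete, the sketch below parses a small hand-written fragment in the style of the Locale Data Markup Language and extracts localized language names. The fragment is an illustration built from the Polish names on this slide, not an excerpt from a CLDR release, and the helper name `language_names` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hand-written, LDML-style fragment (illustrative only).
LDML_FRAGMENT = """\
<ldml>
  <identity>
    <language type="pl"/>
  </identity>
  <localeDisplayNames>
    <languages>
      <language type="ar">arabski</language>
      <language type="bg">bułgarski</language>
      <language type="cs">czeski</language>
    </languages>
  </localeDisplayNames>
</ldml>
"""

def language_names(ldml_text: str) -> dict:
    """Map language codes to their display names in the fragment's locale."""
    root = ET.fromstring(ldml_text)
    entries = root.findall("./localeDisplayNames/languages/language")
    return {entry.get("type"): entry.text for entry in entries}

names = language_names(LDML_FRAGMENT)
for code, name in names.items():
    print(code, name)
```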


Typical Locale Data

  • Dates/time formats

  • Number/Currency formats

  • Measurement Systems

  • Collation Specifications (UCA-based)

    • Used for sorting, searching, matching

  • Tailorings of translated names for language, territory, script, timezones, currencies, …

  • ...


Latest Release: CLDR 1.3

  • 296 locales: 96 languages, 130 territories

    • Languages: Afar [Qafar]; Afrikaans; Albanian [shqipe]; Amharic [አማርኛ]; Arabic [‎العربية‎]; Armenian [Հայերէն]; …

    • Territories: Afghanistan [‎افغانستان‎]; Albania [Shqipëria]; Algeria [‎الجزائر‎]; Argentina; Armenia [Հայաստանի Հանրապետութիւն]; Australia; Austria [Österreich]; Azerbaijan [Azərbaycan, Азәрбајҹан]; …

  • Complete set of generated POSIX-format data

    • Plus tool to generate versions tuned for different platforms.

  • Expanded locale data

    • Timezone localizations

    • Including UN M.49 continents and regions

    • Many other revisions and additions of data

  • New Tests & Tools


Expanded Role for Consortium

  • Dedicated to the goal that all the world's languages can be used on computers everywhere, from mobile phones to mainframes.

  • Providing the fundamental specifications for full software globalization, full interoperability




Institutional & Supporting Members

(New Membership Categories)



Liaison Members

Center of Computer and Information Development (CCID), Beijing, China

High Council of Informatics (HCI), Iran

Information and Communication Technology Agency of Sri Lanka (ICTA)

The International Forum for Information Technology in Tamil (INFITT)

The Internet Engineering Task Force (IETF)

ISO/IEC JTC1/SC2 and WG2

Linguistic Society of America (LSA)

National Endowment for the Humanities (NEH)

National Information Standards Organization (NISO)

NSAI/ICTSCC/SC4: Irish standardization: Codes, Character Sets, and Internationalization

OpenI18N.org: The Free Standards Group Open Internationalization Initiative

Research Institute for ILCAA, Tokyo University of Foreign Studies

Research Institute for the Languages of Finland (RILF)

Special Libraries Association (SLA)

Technical Committee on Information Technology (TCVN/TC1), Hanoi, Viet Nam

United Nations Group of Experts on Geographical Names (UNGEGN)

World Wide Web Consortium - W3C I18N Core Working Group



Unicode Technical Committee Beijing, China

  • Multiple Globalization Standards

    • The Unicode Standard, including UAXes

    • Unicode Technical Standards: Collation, …

    • Unicode Technical Notes: Best Practices, Background Information

  • Quarterly F2F Meetings

  • Email Discussion


CLDR Technical Committee

  • Meetings

    • Short, frequent: Telecon + Instant Messaging

    • Email Discussion

  • Data

    • All additions / revisions in bug database

    • Anyone can file; committee assesses, vets


Why Join?

  • Support the technology

    • That enables your success in international, technical, and emerging markets.

  • Protect your investment

    • The stability you need

    • The extensions you require

    • The developments you call for: security, …

  • Demonstrate your leadership

    • For the goal that all the world's languages can be used on computers everywhere, from mobile phones to mainframes.

