Language Tags the next generation - PowerPoint PPT Presentation

Presentation Transcript

  1. Language Tags: the next generation • Internationalization and Unicode Conference #32

  2. Presenters • Addison Phillips, Lab126 • Mark Davis, Google

  3. Languages, Language Tags, and Locales (oh my!) • Identifying language (and locale)—the challenge • ISO 639 • IETF BCP 47 • RFC 4646, RFC 4647 • RFC 4646bis • Challenges for users

  4. Human Language as Metadata • Some data is just data, but some data is human-readable text. • Text processing depends on language: • spelling, stemming, tokenization, word/line/sentence boundaries, thesauri, terminology, morphological analysis, font and stylistic traditions, collation. • IT systems depend on language negotiation: • localization, message selection, user interface, presentation, number/date/time/etc. formatting, list presentation

  5. Human Language • IN this book a number of dialects are used, to wit: the Missouri negro dialect; the extremest form of the backwoods Southwestern dialect; the ordinary "Pike County" dialect; and four modified varieties of this last. The shadings have not been done in a haphazard fashion, or by guesswork; but painstakingly, and with the trustworthy guidance and support of personal familiarity with these several forms of speech. I make this explanation for the reason that without it many readers would suppose that all these characters were trying to talk alike and not succeeding. (Mark Twain, Adventures of Huckleberry Finn) • "Ole Missus, de house is plum' jam full o' people, en dey's jes a-spi'lin' to see de gen'lemen!" (Mark Twain, Pudd'nhead Wilson)

  6. Identifying Languages • Languages don’t form nice hierarchies • “splitters” vs “lumpers” • dialects, subdialects, regional and stylistic differences, patois • Differing communities with different needs • terminology, librarians, computer systems, translators, etc.

  7. In the Beginning (ca. 1980 CE) Received Wisdom from the Dark Ages • Locales: • japanese, french, german, C • ENU, FRA, JPN • ja_JP.PCK • AMERICAN_AMERICA.WE8ISO8859P1 • Languages… … looked a lot like locales (and vice versa)

  8. ISO 639 • Defines language identifier codes • Multiple parts: • ISO 639-1 (alpha2 codes; 26² = 676 possible) (136 codes) • ISO 639-2 (alpha3 codes; 26³ = 17576 possible) (about 500) • ISO 639-3 (alpha3 codes) (about 7000) • ISO 639-4 (principles for encoding) • ISO 639-5 (language families) • ISO 639-6 (alpha4 codes) (under development)

  9. Impact of ISO 639-3 • ISO 639-2 and 639-3 share a codespace • all 639-2 codes are also 639-3 codes • Macrolanguages

  10. Human Language • en • "Ole Missus, de house is plum' jam full o' people, en dey's jes a-spi'lin' to see de gen'lemen!" (Mark Twain, Pudd'nhead Wilson)

  11. Parallel Efforts • ISO 639: ISO 639-1 (early 1980s), ISO 639-2 (alpha3), ISO 639-3 (2007) • IETF BCP 47: RFC 1766 (1995), RFC 3066 (2001), RFC 4646 (2006), RFC 4646bis (2008)

  12. BCP 47 • Internet Engineering Task Force (IETF) “Best Current Practice” (BCP) • Enable presentation, selection, and negotiation of content in protocols and formats • Widely used! XML, HTML, RSS, MIME, SOAP, SMTP, LDAP, CSS, XSL, CCXML, Java, C#, ASP, Perl, Apache, IE, Mozilla…

  13. Adds Granularity • Need to identify language on varying levels of mutual intelligibility and granularity • "Ole Missus, de house is plum' jam full o' people, en dey's jes a-spi'lin' to see de gen'lemen!" (Mark Twain, Pudd'nhead Wilson) • en-US • en

  14. BCP 47 (Historic) Basic Structure • Alphanumeric (ASCII only) subtags • Up to eight characters long • Separated by hyphens • Case not important (i.e. zh = ZH = zH = Zh) • Syntax: 1*8alphanum *( "-" 1*8alphanum )
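
The historic structure above can be checked mechanically. The following is a minimal Python sketch, assuming only the simplified grammar shown on the slide (the RFCs themselves restrict the primary subtag to letters, which is not modeled here); the function names are hypothetical.

```python
import re

# Simplified historic grammar from the slide: 1*8alphanum *( "-" 1*8alphanum )
# Assumption: the RFCs' extra limit on the primary subtag is not modeled.
_HISTORIC_TAG = re.compile(r"[A-Za-z0-9]{1,8}(?:-[A-Za-z0-9]{1,8})*")


def is_well_formed_historic(tag: str) -> bool:
    """Return True if `tag` fits the simplified historic syntax."""
    return _HISTORIC_TAG.fullmatch(tag) is not None


def same_tag(a: str, b: str) -> bool:
    """Compare two tags ignoring case, since case carries no meaning."""
    return a.lower() == b.lower()
```

Note that this only tests shape: "abcd1234-5678efgh-boont" passes even though none of its subtags mean anything.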

  15. RFC 1766 • zh-TW: zh = ISO 639-1 (alpha2), TW = ISO 3166 (alpha2) • i-klingon: Registered value

  16. RFC 3066 • sco-GB: ISO 639-2 (alpha3 codes) • But use alpha2 codes when they exist: eng-GB ✗ (use en-GB)

  17. What’s a Locale? • “a concept or identifier used by programmers to represent a particular collection of cultural, regional, or linguistic preferences.” • java.util.Locale • .NET Culture • LANG (setlocale in C, C++) • NLS_LANG in Oracle • … and so on…

  18. Locales? Huh? Theatre Center News: The date of the last version of this document was 2003年3月20. A copy can be obtained for $50,0 or 1.234,57 грн. We would like to acknowledge contributions by the following authors (in alphabetical order): Alaa Ghoneim, Behdad Esfahbod, Ahmed Talaat, Eric Mader, Asmus Freytag, Avery Bishop, and Doug Felt.

  19. Locales and Languages • locale ≊ language + [other stuff] • Language needs to specify written form U+224A (“≊”) = ALMOST EQUAL OR EQUAL TO

  20. Locale Identifiers • Different ideas: • “Accept-Locale” vs. Accept-Language • URIs/URNs, etc. • CLDR/LDML • And Requirements: • Operating environments and harmonization • App Servers • Web Services • New Solution? Cost of Adoption: • UTF-8 to the browser: 8 long years

  21. IUC23, March 2003: Locales and Language Tags meet • “We really need locale identifiers.” • “Language tags are being (ab)used as locale identifiers anyway…” • “Not going to need a big new thing…” • “Yeah, we’ll write an RFC … we can do this really fast…”

  22. Problems with BCP 47 (circa RFC 3066) • Script Variation: • zh-Hant/zh-Hans • (sr-Cyrl/sr-Latn, az-Arab/az-Latn/az-Cyrl, etc.) • Obsolescence of registrations: • art-lojban (now jbo), i-klingon (now tlh) • Instability in underlying standards: • sr-CS (CS used to be Czechoslovakia, was reassigned to Serbia and Montenegro, and now is no longer that either) • Lack of a single authoritative, stable source

  23. And More Problems • Little support for registered values in software • Reassignment of values by ISO 3166 • Lack of consistent tag formation (Chinese dialects?) • Standards not readily available, bad references • Bad implementation assumptions • the rules: 1*8alphanum *( "-" 1*8alphanum ), example: “abcd1234-5678efgh-boont” • badly interpreted as: 2*3ALPHA [ "-" 2ALPHA ], example: only stuff like “en-US” or “frr-CH” • Many registrations to cover small variations • 8 German registrations to cover two variations

  24. LTRU and RFC 4646 • Defines a generative syntax • machine readable • future proof, extensible • Defines a single source (IANA Language Subtag Registry) • Stable subtags, no conflicts • Machine readable • Defines when to use subtags • (sometimes)

  25. Anatomy of a Language Tag • sl-Latn-IT-rozaj-1994-r-foovia-x-mine • sl: ISO 639-1/2 (alpha2/3) • Latn: ISO 15924 script codes (alpha 4) • IT: ISO 3166 (alpha2) or UN M49 • rozaj-1994: Registered variants • r-foovia: Extensions (none at present) • x-mine: Private Use
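
Because each kind of subtag is recognizable by its length and content, a tag can be split into its parts without consulting the registry. The sketch below is a simplified Python illustration of that idea, not a full RFC 4646 parser: it assumes a well-formed tag, ignores extended language subtags and grandfathered tags, and does no registry validation. `ParsedTag` and `parse_tag` are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class ParsedTag:
    language: str = ""
    script: str = ""
    region: str = ""
    variants: list = field(default_factory=list)
    extensions: dict = field(default_factory=dict)  # singleton -> subtags
    private_use: list = field(default_factory=list)


def parse_tag(tag: str) -> ParsedTag:
    """Split a tag into parts by subtag length and content (simplified)."""
    subtags = tag.lower().split("-")
    out = ParsedTag(language=subtags[0])
    i = 1
    # Script: exactly four letters, directly after the language.
    if i < len(subtags) and len(subtags[i]) == 4 and subtags[i].isalpha():
        out.script = subtags[i].title()
        i += 1
    # Region: two letters (ISO 3166) or three digits (UN M.49).
    if i < len(subtags) and (
        (len(subtags[i]) == 2 and subtags[i].isalpha())
        or (len(subtags[i]) == 3 and subtags[i].isdigit())
    ):
        out.region = subtags[i].upper()
        i += 1
    # Variants: 5-8 characters, or 4 characters starting with a digit.
    while i < len(subtags) and (
        5 <= len(subtags[i]) <= 8
        or (len(subtags[i]) == 4 and subtags[i][0].isdigit())
    ):
        out.variants.append(subtags[i])
        i += 1
    # Extensions and private use: introduced by a single-character subtag.
    while i < len(subtags):
        singleton = subtags[i]
        if len(singleton) != 1:
            raise ValueError(f"unexpected subtag {singleton!r} in {tag!r}")
        i += 1
        if singleton == "x":
            out.private_use = subtags[i:]
            break
        start = i
        while i < len(subtags) and len(subtags[i]) > 1:
            i += 1
        out.extensions[singleton] = subtags[start:i]
    return out
```

Calling `parse_tag("sl-Latn-IT-rozaj-1994-r-foovia-x-mine")` yields the language `sl`, script `Latn`, region `IT`, variants `rozaj` and `1994`, one extension under the singleton `r`, and the private-use subtag `mine`.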

  26. More Examples • fr, de, nl, en, ja • fr-FR, fr-CA, de-DE, de-CH… • es-419 (Spanish for Americas) • en-US (English for USA) • de-CH-1996 (Old tags are all valid) • sl-rozaj-1994 (Multiple variants) • zh-t-wadegile (Extensions)

  27. Solves the Script problem • zh-Hant (!= zh-TW) • zh-Hans (!= zh-CN) • Azerbaijani (az): Arab, Cyrl, Latn • Serbian (sr): Cyrl, Latn • Yiddish (yi): Hebr, Latn • Mongolian (mn): Cyrl, Latn, Hani • Belarusian (be): Cyrl, Latn • Etc.

  28. Benefits • Subtag registry in one place: one source, machine-readable • Subtags identified by length/content • Extensible • Compatible with RFC 3066 tags • Stable: subtags are forever

  29. Tag Choice • “Tag Content Wisely” • use the shortest tag reasonable • use as many subtags as necessary to disambiguate • don’t invent things; use the registry • map deprecated values to modern equivalents • Suppress-Script • avoid scripts when they add no additional information (Suppress-Script in the registry indicates this for some languages in some cases.)
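
The Suppress-Script advice can be applied automatically when generating tags. This is a minimal Python sketch under stated assumptions: the `SUPPRESS_SCRIPT` table is a small illustrative excerpt (a real implementation would load the field from the IANA Language Subtag Registry), and `suppress_script` is a hypothetical helper name.

```python
# Illustrative excerpt of Suppress-Script data; an assumption for this sketch.
# A real implementation would read these values from the registry.
SUPPRESS_SCRIPT = {"en": "Latn", "fr": "Latn", "ru": "Cyrl", "ja": "Jpan"}


def suppress_script(tag: str) -> str:
    """Drop the script subtag when it equals the language's Suppress-Script."""
    subtags = tag.split("-")
    if len(subtags) < 2:
        return tag
    language, candidate = subtags[0].lower(), subtags[1]
    if (
        len(candidate) == 4
        and candidate.isalpha()
        and SUPPRESS_SCRIPT.get(language, "").lower() == candidate.lower()
    ):
        # The script adds no information here, so use the shorter tag.
        return "-".join([subtags[0]] + subtags[2:])
    return tag
```

With this table, "en-Latn-US" shortens to "en-US", while "zh-Hant-TW" and "sr-Latn" are left alone because the script subtag carries real information for those languages.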

  30. Specialized Subtags • zxx (non-linguistic, not applicable) • und (undetermined) • mis (uncoded) • mul (multiple) • Zxxx (not written) • Collection codes

  31. Unicode Language Identifiers (CLDR) • Adds some region codes: • ZZ • QU • etc. • Provides for canonicalization • Restricts syntax: • no grandfathered codes • no extlang

  32. Problems • Matching • Does “en-US” match “en-Latn-US”? • Tag Choices • Users have more to choose from. • Implementations • More to do, more to think about • (easier to parse, process, support the good stuff)

  33. Tag Matching (RFC 4647) • Uses “Language Ranges” in a “Language Priority List” to select sets of content according to the language tag • Three Schemes • Basic Filtering • Extended Filtering • Lookup • See also: “Unicode in Google” talk for “distance matching” (later today)

  34. Tags are not Tokens! • Many technologies would like language tags (attributes, etc.) to be atomic, but language tags have structure • <span class="foo" xml:lang="en-US" /> • .foo:lang(en) { color: red; } • Accept-Language: zh;q=1.0, de-DE;q=0.8
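
An Accept-Language value like the one above is itself structured: a comma-separated list of language ranges, each with an optional quality weight. The following Python sketch turns such a header into a language priority list. It is a simplified illustration (malformed q-values are treated as 0 and other parameters are ignored), and `parse_accept_language` is a hypothetical name.

```python
def parse_accept_language(header: str) -> list:
    """Return (range, quality) pairs, highest quality first."""
    entries = []
    for item in header.split(","):
        parts = [p.strip() for p in item.split(";")]
        language_range = parts[0]
        if not language_range:
            continue
        quality = 1.0  # the default weight when no q parameter is given
        for param in parts[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    quality = float(value)
                except ValueError:
                    quality = 0.0
        entries.append((language_range, quality))
    # sorted() is stable, so equal weights keep their header order.
    return sorted(entries, key=lambda entry: -entry[1])
```

The resulting ranges still need structure-aware matching against available tags; comparing them as opaque strings would miss "en-US" content for an "en" request.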

  35. Filtering • Ranges specify the least specific item • “en” matches “en”, “en-US”, “en-Brai”, “en-boont” • Basic matching uses plain prefixes • “en-US” matches “en-US” or “en-US-boont” but not “en-Latn-US” • Extended matching can match “inside bits” • “en-*-US”
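
The two filtering schemes can be sketched in a few lines of Python. This follows the matching rules described in RFC 4647 as summarized on the slide; the function names are hypothetical and the inputs are assumed to be well-formed ranges and tags.

```python
def basic_filter(language_range: str, tag: str) -> bool:
    """Basic filtering: case-insensitive prefix match on whole subtags."""
    rng, tag = language_range.lower(), tag.lower()
    return rng == "*" or tag == rng or tag.startswith(rng + "-")


def extended_filter(language_range: str, tag: str) -> bool:
    """Extended filtering: the range may skip over 'inside bits'."""
    rng = language_range.lower().split("-")
    subtags = tag.lower().split("-")
    # The first subtags must match (or the range starts with a wildcard).
    if rng[0] != "*" and rng[0] != subtags[0]:
        return False
    r, t = 1, 1
    while r < len(rng):
        if rng[r] == "*":            # wildcard: consumes nothing by itself
            r += 1
        elif t >= len(subtags):      # range wants more than the tag has
            return False
        elif rng[r] == subtags[t]:   # subtags agree: advance both
            r += 1
            t += 1
        elif len(subtags[t]) == 1:   # never skip past a singleton
            return False
        else:                        # skip an unmatched tag subtag
            t += 1
    return True
```

For example, `basic_filter("en-US", "en-Latn-US")` is false because "en-US" is not a prefix of the tag, while `extended_filter("en-*-US", "en-Latn-US")` is true because the script subtag is skipped.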

  36. Lookup • Range specifies the most specific tag in a match. • Returns exactly one item. • “en-US” might return either “en” or “en-US” but not “en-US-boont” • Mirrors the locale fallback mechanism and many language negotiation schemes.
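
Lookup works by progressively truncating the range from the end until something available matches. A minimal Python sketch, assuming the truncation rule from RFC 4647 (a single-character subtag left at the end is removed along with the subtag after it); wildcard ranges are not handled, and `fallback_chain` and `lookup` are hypothetical names.

```python
def fallback_chain(language_range: str) -> list:
    """List progressively shorter forms of a range, most specific first."""
    subtags = language_range.split("-")
    chain = []
    while subtags:
        chain.append("-".join(subtags))
        subtags.pop()
        # A single-character subtag cannot end a tag, so drop it as well.
        while subtags and len(subtags[-1]) == 1:
            subtags.pop()
    return chain


def lookup(priority_list, available, default=None):
    """Return the single best available tag for a language priority list."""
    by_lower = {tag.lower(): tag for tag in available}
    for language_range in priority_list:
        for candidate in fallback_chain(language_range):
            if candidate.lower() in by_lower:
                return by_lower[candidate.lower()]
    return default
```

Here `fallback_chain("zh-Hans-SG")` gives `["zh-Hans-SG", "zh-Hans", "zh"]`, and `lookup(["en-US"], ["en", "en-US-boont"])` returns "en": lookup never returns a tag more specific than the range.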

  37. Global Binary Resources Lookup and Language Negotiation • Resources “fall back” to find the best match • Falling back: zh-Hans-SG (Chinese, Simplified script, Singapore) → zh-Hans (Chinese, Simplified script) → zh (Chinese) → (root) • See also: “Unicode in Google” talk (later today)

  38. What Do I Do (Content Author)? • Not much. • Existing tags are all still valid: tagging is mostly unchanged. • Resist temptation to (ab)use the private use subtags. • Unless your language has script variations: • Tag content with the appropriate script subtag(s) • Script subtags only apply to a small number of languages: “zh”, “sr”, “uz”, “az”, “mn”, and a very small number of others.

  39. What Do I Do (Programmer)? • Check code for compliance with RFC 4646 • Decide on well-formed or validating • Implement suppress-script • Change to using the registry • Bother infrastructure folks (Java, MS, Mozilla, etc) to implement the standard

  40. I need a new subtag… • Register new subtags with • only primary language or variant subtags • read RFC 4646 for instructions • two-week review period with expert approval

  41. LTRU Milestone Dates • RFC 4646 • Registry went live in December 2005 • RFC 4647 • (Anticipated) RFC 4646bis • This includes ISO 639-3 support, extended language subtags, and possibly ISO 639-6

  42. RFC 4646bis (Internet-Draft) • Currently taking shape • Adds about 7000 additional primary language subtags from ISO 639-3 • Extended language subtags for Chinese and other languages being debated … and some cleanup work on processes and procedures

  43. Macrolanguages and Extlang: The Big Debate • zh-Hant-HK: Chinese, Traditional Script, Hong Kong SAR • yue-Hant-HK: Cantonese, Traditional Script, Hong Kong SAR • or do we do… zh-yue-Hant-HK (with extlang): Chinese, Cantonese, Traditional Script, Hong Kong SAR

  44. Current Solution • zh-yue-Hant-HK: permitted, but deprecated in favor of the “no extlang” form • yue-Hant-HK
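
Mapping the extlang form to the preferred form is a small rewrite. The Python sketch below assumes a lookup table from extended language subtag to the macrolanguage prefix it may follow; `EXTLANG_PREFIX` is an illustrative excerpt and `drop_extlang_prefix` a hypothetical name, with real data expected to come from the IANA Language Subtag Registry.

```python
# Illustrative excerpt: extended language subtag -> macrolanguage prefix.
# An assumption for this sketch; real data would come from the registry.
EXTLANG_PREFIX = {"yue": "zh", "cmn": "zh", "hak": "zh"}


def drop_extlang_prefix(tag: str) -> str:
    """Rewrite 'zh-yue-...' to the preferred 'yue-...' (no extlang) form."""
    subtags = tag.split("-")
    if (
        len(subtags) >= 2
        and EXTLANG_PREFIX.get(subtags[1].lower()) == subtags[0].lower()
    ):
        # The extlang subtag is a language code in its own right.
        return "-".join(subtags[1:])
    return tag
```

So "zh-yue-Hant-HK" becomes "yue-Hant-HK", while "zh-Hant-HK" is unchanged because "Hant" is a script, not an extended language subtag.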

  45. Things to Do (languages) • Get involved in LTRU • Get involved in W3C Internationalization Activity • Get involved with Unicode and CLDR • Write implementations • Work on adoption of BCP 47: understand the impact

  46. Things to Read • Tag and Registry RFC • Matching RFC • 4646bis Draft • References • LTRU Mailing List

  47. Ideas and Questions