
Unicode in

Unicode in. 2008Q3 Mark Davis, Vladimir Weinstein, Andy Heninger. Standard SW Globalization. Data Handling Date, Time, Number Formatting Collation Locales/Languages Timezones & Calendars,… General Internationalization Using character properties instead of hard-coded lists




  1. Unicode in 2008Q3 Mark Davis, Vladimir Weinstein, Andy Heninger

  2. Standard SW Globalization • Data Handling • Date, Time, Number Formatting • Collation • Locales/Languages • Timezones & Calendars,… • General Internationalization • Using character properties instead of hard-coded lists • Separation of code from localizable data (≈resource bundles) • Avoiding string concatenation, dealing with truncation …

  3. Where was the problem? (pause) [Diagram: Server, View, Index, Data, DB, Server, Upload, dump]

  4. More places than you might think • Ensure client app is Unicode (Windows: don’t use ANSI) • Prevent encoding mismatches (charset before web form params) • Allow full Unicode identifiers (file names,…) • Ensure uniform segmentation (Word ≠ [0-9a-zA-Z]+) • Watch for hidden assumptions • Cp1252 corrupting bytes • Title requirement of 3+ chars is OK for English, but not Chinese (狗) [Diagram: Server, View, Index, Data, DB, Server, Upload, dump, marked with ✔ and ❶ to ❺]
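The `Word ≠ [0-9a-zA-Z]+` point can be illustrated with a small Python sketch. It compares a hard-coded ASCII pattern with Python's Unicode-aware `\w` class. The function names and sample string are illustrative assumptions, and `\w` is only a rough stand-in for real word segmentation.

```python
import re

ASCII_WORD = re.compile(r"[0-9a-zA-Z]+")  # hard-coded character list
UNICODE_WORD = re.compile(r"\w+")         # driven by Unicode character properties

def ascii_words(text):
    """Split text with the hard-coded ASCII pattern."""
    return ASCII_WORD.findall(text)

def unicode_words(text):
    """Split text with the Unicode-aware pattern."""
    return UNICODE_WORD.findall(text)

# "Saša rôle 狗", written with escapes so the example does not depend on file encoding.
sample = "Sa\u0161a r\u00f4le \u72d7"
print(ascii_words(sample))    # non-ASCII letters break the words apart
print(unicode_words(sample))  # three words, including the one-character Chinese word
```

The same sample shows why a "3+ characters" title rule is a hidden assumption: `len("\u72d7")` is 1, yet 狗 is a complete word.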

  5. Just a few extra challenges… • Massive amounts of data • Much web cruft to deal with • Very short release cycles • Many product × language/locale pairs (next slide)

  6. Locale × Product Versions http://googleblog.blogspot.com/2008/07/hitting-40-languages.html

  7. Translation • Professional Vendors, Contractors, Volunteers

  8. Translation Strategies • Normal Translation Memory • Multiple, very short release cycles • Weeks, not months • Product Alternatives for new features • Delay release until completely translated • Disable new features until translated • Accept some English strings in new features

  9. Int’l Strategy: Unicode Zone [Diagram labels: Non-Unicode, Converters, Validation, Unicode, Unicode Zone]

  10. Both forms of Unicode • UTF-8: C++, Python • Mixture of char*, STL string, new robust class • UTF-8 is particularly good storage for the web (more later) • UTF-16: Java, Windows, JavaScript, Mac • Libraries / Data • ICU, Joda Time, internal libraries • Unicode Character Database, Unicode Locales (CLDR) • TZDB, ISO 4217 (currencies) – time sensitive • Update to new versions (e.g. Unicode 5.1) ASAP
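As a rough illustration of how the two forms differ in size, the Python sketch below counts code points, UTF-8 bytes, and UTF-16 code units for a few sample characters. The helper name `code_units` and the samples are assumptions made for this example.

```python
def code_units(text):
    """Return (code points, UTF-8 bytes, UTF-16 code units) for text."""
    utf8_len = len(text.encode("utf-8"))
    # "utf-16-le" writes no byte order mark; each code unit is 2 bytes.
    utf16_len = len(text.encode("utf-16-le")) // 2
    return len(text), utf8_len, utf16_len

# ASCII letter, Latin letter with accent, CJK ideograph, supplementary-plane emoji.
for sample in ["a", "\u00e9", "\u72d7", "\U0001F600"]:
    print(repr(sample), code_units(sample))
```

ASCII stays at one byte in UTF-8, which is part of why it suits markup-heavy web pages, while characters outside the Basic Multilingual Plane need two UTF-16 code units.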

  11. Unicode identifiers • Stable identifiers for Language/Locale, Script, Region, Currency, Timezone, based on BCP47, ISO 4217, TZDB • Required: unique, stable • CS = Czechoslovakia? Serbia & Montenegro? • Serbia = CS? = RS? • London is in UK? GB? • Google Valid: Canonical (US, iw); Noncanonical (SU, he) (deprecated / not preferred) • Google Disallowed*: Private Use (XA); Unassigned (BB); Ill-Formed (B1); Variants (i-tao, en-SCOUSE)

  12. User’s Locale / Language • Needed to improve quality • Locale = Language + (possibly) other info • Known if user is Signed In • Heuristics where not Signed In. • IP Address • Accept-Language • Country from Accept-Language • Domain,…
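One of the heuristics listed here, Accept-Language, can be sketched in Python as follows. This is a minimal parser that assumes the usual `tag;q=weight` header syntax. The function name is hypothetical, and a production parser would need to be stricter.

```python
def parse_accept_language(header):
    """Return the language tags from an Accept-Language header, best first."""
    ranked = []
    for position, item in enumerate(header.split(",")):
        parts = [part.strip() for part in item.split(";")]
        tag = parts[0]
        if not tag:
            continue
        quality = 1.0  # the default weight when no q= parameter is given
        for param in parts[1:]:
            if param.startswith("q="):
                try:
                    quality = float(param[2:])
                except ValueError:
                    quality = 0.0  # treat a malformed weight as "not wanted"
        # Sort by descending quality, then by original position in the header.
        ranked.append((-quality, position, tag))
    return [tag for _, _, tag in sorted(ranked)]

print(parse_accept_language("da, en-gb;q=0.8, en;q=0.7"))
```

The region subtag of the first tag (for example `gb` in `en-gb`) is what the "Country from Accept-Language" heuristic would look at.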

  13. Normalizing Languages/Locales • Based on Unicode locale data (CLDR) • zh, und-CN, und-Hans,… ≃ zh • zh-TW, zh-Hant,… ≃ zh-TW • en, und-Latn, und-US,… ≃ en • en-GB, en-Latn-GB,… ≃ en-GB • he-IL, iw-IL, he-Hebr, he,… ≃ iw
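The equivalences on this slide can be mimicked with a tiny lookup table, as in the Python sketch below. The table holds only the slide's own examples and the function names are hypothetical. A real system would presumably derive the mappings from CLDR data rather than hard-code them.

```python
# Illustrative table built from the slide's examples only.
EQUIVALENTS = {
    "zh": "zh", "und-CN": "zh", "und-Hans": "zh",
    "zh-TW": "zh-TW", "zh-Hant": "zh-TW",
    "en": "en", "und-Latn": "en", "und-US": "en",
    "en-GB": "en-GB", "en-Latn-GB": "en-GB",
    "he-IL": "iw", "iw-IL": "iw", "he-Hebr": "iw", "he": "iw",
}

def normalize_case(tag):
    """Apply conventional casing: language lower, Script title, REGION upper."""
    subtags = tag.replace("_", "-").split("-")
    result = [subtags[0].lower()]
    for subtag in subtags[1:]:
        if len(subtag) == 4 and subtag.isalpha():
            result.append(subtag.title())   # script, e.g. Hant
        elif len(subtag) == 2 and subtag.isalpha():
            result.append(subtag.upper())   # region, e.g. TW
        else:
            result.append(subtag.lower())
    return "-".join(result)

def normalize_locale(tag):
    """Map a tag to its normalized form, or return it re-cased if unknown."""
    cased = normalize_case(tag)
    return EQUIVALENTS.get(cased, cased)

print(normalize_locale("ZH_tw"), normalize_locale("he-il"), normalize_locale("und-hans"))
```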

  14. Matching Languages/Locales • Input: User’s requested languages, our supported languages • Output: “best” supported language • Need better match than truncation • A “distance” metric on normalized languages • Language, then script, then country • Plus special information: hr vs bs, no vs nn, ro vs mo, tl vs fil

  15. Web Cruft • Problems • Bad input: charset, language,… • Inaccurate detection • Difficulties in segmentation / morphology • These are non-trivial • Pages with conversion errors or unassigned (non-existent) characters: ≈4% • Multiply that by billions and billions of pages…

  16. You didn’t know there was going to be a test… • How many pages are on the web? • What’s the most frequent character? Script? (next slides) …

  17. Most Web Data

  18. Data in Different Scripts

  19. Bad Source • Original page has corrupted data • Doubly-encoded UTF-8 • Random illegal control codes, unassigned chars • Forms input data of unknown/wrong encoding • Mixtures of different charsets, from: • Random pasting in non-Unicode enabled tools • Page composition (e.g. server-side includes), mixing charsets • Indic font encodings
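Doubly-encoded UTF-8 arises when UTF-8 bytes are misread as a legacy charset and then encoded as UTF-8 again. The Python sketch below undoes one such round, assuming the legacy charset was windows-1252. The function name is hypothetical, and the approach is a heuristic that can misfire on text that legitimately contains sequences like "Ã´".

```python
def repair_double_utf8(text, legacy="cp1252"):
    """Undo one round of 'UTF-8 bytes read as a legacy charset'.

    Returns the text unchanged if it does not look doubly encoded.
    """
    try:
        raw = text.encode(legacy)       # recover the original byte sequence
        repaired = raw.decode("utf-8")  # strict: raises on ill-formed UTF-8
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text
    return repaired

# Simulate the corruption: "rôle" stored as UTF-8, then misread as cp1252.
broken = "r\u00f4le".encode("utf-8").decode("cp1252")
print(broken, "->", repair_double_utf8(broken))
```

Python's `cp1252` codec leaves a few byte values undefined, so some corrupted strings cannot be round-tripped this way and are returned unchanged.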

  20. Bad Server • Server mis-identifies the type or encoding of the page in the HTTP protocol. • Example: JPEGs served up as text • Server overrides page with wrong charset • If you don’t do special detection, you get random junk • Interpreting a JPEG as windows-1252: not altogether productive…

  21. Charset Tagging Trends http://googleblog.blogspot.com/2008/05/moving-to-unicode-51.html

  22. Encoding Detection • Pages are so often untagged & mis-tagged: • Both at HTTP and HTML levels • And what happens if they differ? • We have to heuristically detect the “real” character encoding • Need to do better than the browser • In the browser, the user can adjust a bad guess • UTF-8 source is the safest, but still must be verified • Bad codes: {charset}, _charset, EnTW, …
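The "must be verified" point can be sketched in Python with strict decoding. The order of attempts (strict UTF-8 first, then the declared label, then a windows-1252 fallback) is an assumption made for this example, not the detection method described in the talk, which is heuristic and far more involved.

```python
def detect_encoding(data, declared=None, fallback="cp1252"):
    """Return (encoding, text), verifying each candidate by strict decoding."""
    candidates = ["utf-8"]
    if declared:
        candidates.append(declared)
    for name in candidates:
        try:
            return name, data.decode(name)  # strict mode rejects ill-formed bytes
        except (UnicodeDecodeError, LookupError):
            # Ill-formed for this charset, or an unknown label such as "_charset".
            continue
    return fallback, data.decode(fallback, errors="replace")

print(detect_encoding("r\u00f4le".encode("utf-8"), declared="iso-8859-1"))
print(detect_encoding(b"r\xf4le", declared="_charset"))
```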

  23. Attacks! • Cross-site scripting (XSS) • Don’t treat ill-formed UTF-8 as space (or syntax) • <p id=abc�onMouseOver=evilDoers()… • Don’t swallow valid characters after ill-formed • …q="�>onMouseOver=… • Don’t allow UTF-7, UTF-16 as output encodings • Browsers often mis-detect, and allow XSS.
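The two "don't" rules about ill-formed UTF-8 can be checked with Python's built-in decoder, as in this minimal sketch. The function names and byte strings are made up for the example. With `errors="replace"`, each ill-formed sequence becomes U+FFFD instead of vanishing or turning into a space, and the valid character after it is kept.

```python
def decode_untrusted(data):
    """Decode bytes as UTF-8, turning each ill-formed sequence into U+FFFD."""
    return data.decode("utf-8", errors="replace")

def is_well_formed_utf8(data):
    """Return True only if the bytes are strictly valid UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# 0xC0 is never valid in UTF-8; it must not act as a separator before "onMouseOver".
print(repr(decode_untrusted(b"<p id=abc\xc0onMouseOver=x")))
# 0xE4 starts a 3-byte sequence; the following ">" must survive decoding.
print(repr(decode_untrusted(b'q="\xe4>onMouseOver=x')))
```

Using `errors="ignore"` instead would silently delete the bad bytes and join the surrounding text, which is one way such input slips past filters.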

  24. Spamming/Spoofing • IDNA Spoofing: “paypal.com” • Spamming: need to detect equivalences • http://spamsource.cn • http://spamsource.cn fullwidth dot • http://bücher.de • http://xn--bcher-kva.de • http://b%C3%BCcher.de
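Detecting that these URL spellings name the same host can be sketched in Python by reducing each to one canonical ASCII form. The function name is hypothetical, and the built-in `idna` codec implements the older IDNA 2003 rules, so treat this as an approximation rather than a complete canonicalizer.

```python
import unicodedata
from urllib.parse import unquote, urlsplit

def canonical_host(url):
    """Return an ASCII (Punycode) host name usable as an equivalence key."""
    host = urlsplit(url).hostname or ""
    host = unquote(host)                        # %C3%BC becomes ü
    host = unicodedata.normalize("NFKC", host)  # fullwidth dot becomes "."
    host = host.lower()
    return host.encode("idna").decode("ascii")  # ü becomes its xn-- form

for url in [
    "http://b\u00fccher.de",
    "http://xn--bcher-kva.de",
    "http://b%C3%BCcher.de",
    "http://spamsource\uff0ecn",  # written with U+FF0E FULLWIDTH FULL STOP
]:
    print(canonical_host(url))
```

This handles encoding-level equivalences only. Lookalike letters from other scripts, as in the spoofed "paypal.com" case, need separate confusable detection.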

  25. Language Detection • Pages are so often untagged & mis-tagged: • Both at HTTP and HTML levels • So, we have to heuristically determine the “real” language • Unfortunately, detecting language is more complicated than encoding • Mixtures of languages on same page • Need to detect short strings, out of context, without encoding • Needs to happen after entity expansion: &#xxx; → Y • Fortunately, misdetecting language is way less problematic than encoding • Bad codes: en-securidEnglishxlChinesezsuseses en-us."en-us "es-es-tsundefined espa�olutf-8
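Why detection needs to happen after entity expansion can be shown with a crude Python sketch. It expands character references with `html.unescape` and then counts letters by the first word of their Unicode character name, a very rough stand-in for script detection that is an assumption of this example, not the detector used in the talk.

```python
import html
import unicodedata
from collections import Counter

def script_counts(raw_text, expand=True):
    """Count letters per rough script label, optionally expanding entities first."""
    text = html.unescape(raw_text) if expand else raw_text
    counts = Counter()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "UNKNOWN")
            counts[name.split(" ")[0]] += 1  # e.g. "LATIN", "CJK", "THAI"
    return counts

raw = "&#x72D7;&#x72D7; ok"
print(script_counts(raw, expand=False))  # the entity text itself looks like Latin letters
print(script_counts(raw))                # after expansion, the CJK characters are visible
```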

  26. Non-English Languages

  27. Language Tags & Detection

  28. If Lang Tags Normalized…

  29. Tagged vs Detected

  30. Bad HTML • It's easy to parse valid HTML correctly • But invalid HTML is not uncommon • We need to be as good at doing bad HTML as the browsers are • That is, what the user sees in IE or Firefox is what needs to be indexed • Illegal characters (controls) sneak in as character entities: &#x1E;
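Since control characters can arrive through character references such as `&#x1E;`, one defensive step is to filter them after the text has been expanded. The Python sketch below drops characters in Unicode general category `Cc` except common whitespace. The function name and the choice of allowed characters are assumptions for illustration.

```python
import unicodedata

ALLOWED_CONTROLS = {"\t", "\n", "\r"}

def strip_controls(text):
    """Remove control characters (category Cc) other than tab, LF, and CR."""
    return "".join(
        ch for ch in text
        if ch in ALLOWED_CONTROLS or unicodedata.category(ch) != "Cc"
    )

print(repr(strip_controls("a\x1eb\n")))
```

Whether to drop such characters or replace them should follow what browsers display, in line with the slide's point that the indexed text must match what the user sees.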

  31. Segmentation Challenges • Indexing & query: breaking text into words • ユニコードとは何か → ユニコード · とは · 何か • Problems if wrong: • Source segmented as: |AB|C| • User searches for “BC”: not found • Can segment/query multiple ways
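The |AB|C| problem, and one way of "segmenting multiple ways", can be sketched in Python. Overlapping character bigrams are used here as a generic fallback alongside the segmenter's tokens. That technique is an assumption of this example, not something the slides attribute to any particular system.

```python
def segment_tokens(segments):
    """Index tokens exactly as the segmenter produced them."""
    return set(segments)

def bigram_tokens(text):
    """Index every overlapping pair of non-space characters."""
    chars = [ch for ch in text if not ch.isspace()]
    return {"".join(chars[i:i + 2]) for i in range(len(chars) - 1)}

source = "ABC"
by_segment = segment_tokens(["AB", "C"])  # source segmented as |AB|C|
by_bigram = bigram_tokens(source)

print("BC" in by_segment)                 # the query "BC" misses
print("BC" in (by_segment | by_bigram))   # the combined index finds it
```

The cost is a larger index and more false matches, so the two token sets are usually weighted differently rather than treated as equal.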

  32. Thai Segmentation • คอมพิวเตอร์ จะ เกี่ยวข้อง กับ เรื่อง ของ ตัวเลข • Before segmentation (2007-03): 10 hits • After segmentation: → 300,000+ hits! • Spaces in query still make difference • คอมพิวเตอร์จะเกี่ยวข้องกับเรื่องของตัวเลข acts as a complete phrase, equals: • “คอมพิวเตอร์จะเกี่ยวข้องกับเรื่องของตัวเลข”

  33. Morphology Challenges • Varies by language • Stopwords, phrases: a, the,… • Diacriticals: sasa → saša, sasha • Decompounding: Abiball → abiball OR abi ball • “Forms” of a word: go → gone, went, … • Synonyms: car shopping → auto shopping • …

  34. Correcting User Typing • Users may be on keyboard without accents, or expect transliteration • Types “Sasha” or “Sasa” or “Саша” for “Saša” • Misspellings

  35. Character folding • Avoid spurious input differences • “financial” (fi lig., PDF) • Normalize with: • NFC + subset of NFKC + UCA + others • Suppress display • “➠”
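A simplified folding function in Python might look like the sketch below. It applies full NFKC, case folding, and removal of combining marks, which is cruder than the "NFC + subset of NFKC + UCA + others" recipe on the slide. The function name is hypothetical, and stripping diacritics is not appropriate for every language.

```python
import unicodedata

def fold(text):
    """Fold compatibility forms, case, and diacritics for loose matching."""
    text = unicodedata.normalize("NFKC", text)  # e.g. the "fi" ligature becomes "fi"
    text = text.casefold()
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped)

print(fold("\ufb01nancial"))  # input starts with U+FB01 LATIN SMALL LIGATURE FI
print(fold("Sa\u0161a"))      # "Saša"
```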

  36. SW Globalization at Mark Davis

  37. Q&A

  38. In Action • Indexing stores canonicalized originals • … Fishing … ro◌̂les→ • … fishing … rôles • Query expanded to variants • fish → fish|fishing • rôle → role|rôle|roles|rôles • Expansions may be language-dependent
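The index-side canonicalization and query-side expansion can be sketched together in Python. The variant table holds only the slide's two examples and the function names are hypothetical. Real expansions would come from language-specific morphology data.

```python
import unicodedata

def canonicalize(text):
    """Compose combining sequences (NFC) and lowercase, as for indexed text."""
    return unicodedata.normalize("NFC", text).lower()

# Illustrative variant table using the slide's examples ("rôle" written with an escape).
VARIANTS = {
    "fish": ["fish", "fishing"],
    "r\u00f4le": ["role", "r\u00f4le", "roles", "r\u00f4les"],
}

def expand_query(term):
    """Return the list of variants to search for; unknown terms map to themselves."""
    key = canonicalize(term)
    return VARIANTS.get(key, [key])

# "ro" + U+0302 COMBINING CIRCUMFLEX ACCENT + "les" composes to "rôles".
print(canonicalize("Fishing ro\u0302les"))
print(expand_query("Ro\u0302le"))
```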

  39. Freeform Parsing
