Unicode Transforms in ICU • Mark Davis, Chief SW Globalization Architect, IBM
What is ICU? • The Premier Unicode-Enablement Library • Open-Source: non-viral license • Full-featured, cross-platform • C, C++, Java APIs • Collation, Charset Conversion, Resources, Boundaries, Calendars, Transforms (case, norm., translit., …), Format/Parse (dates, times, msgs, nums., curr., …), Unicode strings/props • Unicode Conformant • http://oss.software.ibm.com/icu/ 21st International Unicode Conference
ICU Transforms • Powerful, flexible mechanism • Uppercase, Lowercase, Titlecase, Full/Halfwidth • Normalization • Hex, Character Names • Script to Script conversion… • Supports Styled Text, not just Plain Text • Chaining, Filters, Buffering • Customizable
Transform Examples • “Any-Uppercase” a → A • “Any-Hex/Java” a → \u0061 • “Latin-Greek” a → α
Filters • “[aeiou] Latin-Greek” • “Latin” is the source • “[aeiou]” is a filter that restricts the transform to the lowercase English vowels • “Greek” is the target • “[^\u0000-\u007E] Any-Hex” • “A δ is…” → “A \u03B4 is\u2026”
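The second filter example can be modelled outside ICU. The following Python sketch does not call the ICU API; `hex_escape` is a hypothetical helper that imitates “[^\u0000-\u007E] Any-Hex”, assuming the uppercase \uXXXX output shown on the slide and input limited to the Basic Multilingual Plane.

```python
def hex_escape(text):
    """Escape every character outside U+0000..U+007E as \\uXXXX (BMP only)."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp <= 0x7E:
            out.append(ch)               # excluded by the filter: copied unchanged
        else:
            out.append("\\u%04X" % cp)   # selected by the filter: hex-escaped
    return "".join(out)


print(hex_escape("A δ is…"))  # A \u03B4 is\u2026
```

Characters that the filter does not select pass through untouched, which is the whole effect of a filter on a transform.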
UnicodeSet Filters • Ranges: [ABC a-z] • Union: [[:Lu:] [:P:]] • Intersection: [[:Lu:] & [\u0000-\u01FF]] • Set Difference: [[:Lu:] - [\u0000-\u01FF]] • Complement: [^aeiou] • Properties • Uppercase letters: [:Lu:] • Punctuation: [:P:] • Script: [:Greek:] • Other Unicode properties in ICU 2.2
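As a rough illustration of the set algebra, the operations above map onto ordinary set operations. This Python sketch is not the UnicodeSet API; it assumes a small universe of U+0000..U+03FF so the sets stay tiny, and it uses the general categories from Python's `unicodedata` module as a stand-in for [:Lu:] and [:P:].

```python
import unicodedata

# Assumption: restrict the universe to U+0000..U+03FF to keep the sets small.
UNIVERSE = [chr(cp) for cp in range(0x0400)]

uppercase = {ch for ch in UNIVERSE if unicodedata.category(ch) == "Lu"}             # [:Lu:]
punctuation = {ch for ch in UNIVERSE if unicodedata.category(ch).startswith("P")}   # [:P:]
low_range = {ch for ch in UNIVERSE if ord(ch) <= 0x01FF}                            # [\u0000-\u01FF]

union = uppercase | punctuation          # [[:Lu:] [:P:]]
intersection = uppercase & low_range     # [[:Lu:] & [\u0000-\u01FF]]
difference = uppercase - low_range       # [[:Lu:] - [\u0000-\u01FF]]
complement = set(UNIVERSE) - set("aeiou")  # [^aeiou], within the assumed universe

print("A" in intersection, "Δ" in difference, "e" in complement)
```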
Example Filter • “[:Lu:] Latin-Katakana; Latin-Hiragana” • Converts all uppercase Latin characters to Katakana, • Then converts all other Latin characters to Hiragana.
Chaining Transforms • “Hiragana-Latin; Any-Title” • たけだ, まさゆき • takeda, masayuki • Takeda, Masayuki • Any number of transforms in chain
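Chaining amounts to function composition, applied left to right. The Python sketch below is only a model: `KANA` is a toy table covering just the slide's example (not ICU's Hiragana-Latin rules), and `chain`, `hiragana_to_latin`, and `to_title` are hypothetical names.

```python
from functools import reduce

# Toy table for the slide's example only; an assumption, not ICU data.
KANA = {"た": "ta", "け": "ke", "だ": "da", "ま": "ma", "さ": "sa", "ゆ": "yu", "き": "ki"}


def hiragana_to_latin(text):
    return "".join(KANA.get(ch, ch) for ch in text)


def to_title(text):
    return text.title()


def chain(*transforms):
    """Compose transforms left to right, like "A; B" in a compound ID."""
    return lambda text: reduce(lambda acc, transform: transform(acc), transforms, text)


hiragana_latin_title = chain(hiragana_to_latin, to_title)
print(hiragana_latin_title("たけだ, まさゆき"))  # Takeda, Masayuki
```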
Filtering plus Chaining • “NFD; [:M:] Remove; NFC” • Decompose • Remove accents (Marks) • Recompose
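The same three steps can be reproduced with Python's standard `unicodedata` module. This is a sketch of the idea rather than the ICU transform itself; `remove_marks` is a hypothetical name, and the [:M:] filter is approximated by general categories that start with "M".

```python
import unicodedata


def remove_marks(text):
    decomposed = unicodedata.normalize("NFD", text)        # NFD: decompose
    kept = "".join(ch for ch in decomposed
                   if not unicodedata.category(ch).startswith("M"))  # [:M:] Remove
    return unicodedata.normalize("NFC", kept)              # NFC: recompose


print(remove_marks("Roútsē, Ánna"))  # Routse, Anna
```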
Script ↔ Script Examples
• 김, 국삼 ↔ Gim, Gugsam
• 김, 명희 ↔ Gim, Myeonghyi
• 정, 병호 ↔ Jeong, Byeongho
• たけだ, まさゆき ↔ Takeda, Masayuki
• ますだ, よしひこ ↔ Masuda, Yoshihiko
• やまもと, のぼる ↔ Yamamoto, Noboru
• Ρούτση, Άννα ↔ Roútsē, Ánna
• Καλούδης, Χρήστος ↔ Kaloúdēs, Chrêstos
• Θεοδωράτου, Ελένη ↔ Theodōrátou, Elénē
Script ↔ Script Conversions • General conversions: Greek-Latin • Source-Target Reversible: φ → ph → φ • Not Target-Source Reversible: f → φ → ph • Variants • By Language: Greek-German • By Standard: Greek-Latin/UNGEGN • Can build your own
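The asymmetry between the two round trips can be seen with a pair of tiny rule tables. This Python sketch is illustrative only: the tables contain just the slide's φ/ph/f example, and `apply_rules` is a hypothetical helper, not ICU's rule engine.

```python
# Assumed one-rule tables covering only the slide's example.
GREEK_TO_LATIN = [("φ", "ph")]
LATIN_TO_GREEK = [("ph", "φ"), ("f", "φ")]


def apply_rules(text, rules):
    out, i = [], 0
    while i < len(text):
        for source, target in rules:
            if text.startswith(source, i):
                out.append(target)
                i += len(source)
                break
        else:
            out.append(text[i])   # no rule applies: copy the character
            i += 1
    return "".join(out)


# Greek source round-trips; a Latin "f" does not.
print(apply_rules(apply_rules("φ", GREEK_TO_LATIN), LATIN_TO_GREEK))  # φ
print(apply_rules(apply_rules("f", LATIN_TO_GREEK), GREEK_TO_LATIN))  # ph
```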
Styled Text • Preserves individual styles on letters, where possible • απα → apa
When Buffering • Conversions are not performed if they may extend over boundaries • After “p” is typed, the match could still be p, ph, or ps, so the “p” stays unconverted until the next key
Key → Result
• a → α
• p → αp
• a → απα
• p → απαp
• h → απαφ
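The key-by-key table can be reproduced with a small model of buffered input. This Python sketch is an assumption-laden toy, not ICU's keyboard handling: the four rules (including p → π and ps → ψ) are inferred from the slide, and `type_keys` is a hypothetical name.

```python
# Assumed rule table, longest patterns first.
RULES = [("ph", "φ"), ("ps", "ψ"), ("p", "π"), ("a", "α")]


def type_keys(keys, rules=RULES):
    """Return the displayed text after each keystroke."""
    done, pending, shown = "", "", []
    for key in keys:
        pending += key
        while pending:
            # Clipped match: a longer rule could still match once more keys arrive.
            if any(src.startswith(pending) and len(src) > len(pending)
                   for src, _ in rules):
                break
            for src, dst in rules:
                if pending.startswith(src):
                    done, pending = done + dst, pending[len(src):]
                    break
            else:
                done, pending = done + pending[0], pending[1:]
        shown.append(done + pending)
    return shown


print(type_keys("apaph"))
```

The pending text is only converted once no longer rule can still match it, which is why the second column shows a Latin “p” until the following key arrives.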
Custom Rules • Similar to Regular Expressions • Variables • Property matches • Contextual matches • Rearrangement • $1, $2… • Quantifiers: • *, +, ?
Differences from Regular Expressions • More Powerful… • Buffered/Keyboard • Styled Text • Ordered Rules • Cursor Backup • Less Powerful… • Only greedy quantifiers • No backup: so no (X | Y) • No “input-side back references”
Example of Custom Rules • “UnixQuotes-RealQuotes”
\`\` > “ ; # two graves → left quote
\'\' > ” ; # two generic apostrophes → right quote
• Example (SJ Mercury News online): ``expertise'' → “expertise”
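For comparison, the effect of these two rules on plain text can be sketched in a few lines of Python. This is not the ICU rule engine; `real_quotes` is a hypothetical helper that applies the same two literal replacements.

```python
def real_quotes(text):
    # `` becomes U+201C (left double quote); '' becomes U+201D (right double quote).
    return text.replace("``", "\u201C").replace("''", "\u201D")


print(real_quotes("``expertise''"))  # “expertise”
```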
Rule Ordering • Find first rule that matches at start • If no match, or (isBuffered & clipped-Match) • advance start by 1 • Else if match, • Substitute text • Move start as specified • Continue until start reaches limit
Rule Ordering Example
• Transliterator rules: xy > c ; yx > d ;
• Regular expressions: s/xy/c/g then s/yx/d/g
• Input: xyx-yxy-xyx
• Transliterator result: cx-dy-cx
• Regular expression result: cx-yc-cx
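The difference between the two results follows from the matching strategy: the transliterator tries every rule at each position in a single pass, while the regular expressions run one complete pass per rule. A minimal Python sketch of both, assuming unbuffered input and plain literal rules (`transliterate` and `regex_passes` are hypothetical names):

```python
import re

RULES = [("xy", "c"), ("yx", "d")]


def transliterate(text, rules):
    out, start = [], 0
    while start < len(text):
        for source, target in rules:        # first matching rule wins
            if text.startswith(source, start):
                out.append(target)          # substitute text
                start += len(source)        # move start past the match
                break
        else:
            out.append(text[start])         # no rule matches: advance start by 1
            start += 1
    return "".join(out)


def regex_passes(text, rules):
    for source, target in rules:            # one global pass per rule
        text = re.sub(re.escape(source), target, text)
    return text


print(transliterate("xyx-yxy-xyx", RULES))  # cx-dy-cx
print(regex_passes("xyx-yxy-xyx", RULES))   # cx-yc-cx
```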
Context • Rules: • γ } [ Γ Κ Χ Ξ γ κ χ ξ ] > n; • γ > g; • Meaning: • Convert gamma into n if it is followed by Γ, Κ, Χ, Ξ, γ, κ, χ, or ξ • Otherwise convert it into g
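The two rules amount to a one-character lookahead. The Python sketch below models only these gamma rules and leaves every other character untouched; `convert_gamma` is a hypothetical name, and the lookahead reads the original text because the following context has not yet been transformed when a rule fires.

```python
FOLLOWERS = set("ΓΚΧΞγκχξ")


def convert_gamma(text):
    out = []
    for i, ch in enumerate(text):
        if ch == "γ":
            following = text[i + 1] if i + 1 < len(text) else ""
            # γ } [ΓΚΧΞγκχξ] > n ;  otherwise  γ > g ;
            out.append("n" if following in FOLLOWERS else "g")
        else:
            out.append(ch)
    return "".join(out)


print(convert_gamma("γγ"))  # ng
```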
Cursor Backup • Allows text to be revisited • Reduces rule-count • Example Rules • BY > ビ | ~Y ; • ~YO > ョ ;
Trace (| marks the cursor)
• Start: |BYO
• After rule 1: ビ|~YO
• After rule 2: ビョ|
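The trace can be reproduced by storing, for each rule, the output that goes before the cursor and the output that goes after it. This Python sketch is a simplified model, not ICU's implementation; the three-part rule tuples and the `transliterate` function are assumptions made for illustration.

```python
# Each rule: (source, output before the cursor, output after the cursor).
RULES = [
    ("BY", "ビ", "~Y"),   # BY > ビ | ~Y ;
    ("~YO", "ョ", ""),    # ~YO > ョ ;
]


def transliterate(text, rules):
    start = 0
    while start < len(text):
        for source, before, after in rules:
            if text.startswith(source, start):
                text = text[:start] + before + after + text[start + len(source):]
                start += len(before)   # cursor stops before `after`, so it is revisited
                break
        else:
            start += 1                 # no rule matches here: advance by 1
    return text


print(transliterate("BYO", RULES))  # ビョ
```

Because the intermediate “~Y” is fed back into matching, one rule for ~YO serves every consonant that produces ~Y, which is how cursor backup reduces the rule count.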
Demonstration • Public Demo • http://oss.software.ibm.com/icu/demo • (local copy, samples)
More Information • Paths below are relative to http://oss.software.ibm.com/… • User Guide: /icu/userguide/ • C: /icu/apiref/utrans_h.html • C++: /icu/apiref/ • Java API: /icu4j/doc/com/ibm/text/ • Latest Version of these slides • http://www.macchiato.com
ICU Transforms • Powerful, flexible mechanism • Uppercase, Lowercase, Titlecase, Full/Halfwidth • Normalization • Hex, Character Names • Script to Script conversion… • Supports Styled Text, not just Plain Text • Chaining & Filters • Customizable
Q & A
Backup Slides • Not used in the presentation, except in response to questions
Buffered Usage • No conversion for a clipped match • Fill buffer: …t…t • Transliterate: …τ…t (the final t is a clipped match, so it may be left over) • Copy left-overs to start • Fill rest of buffer: th… • Transliterate: θ…
Styled Text Handling • Transforms operate on Replaceable, an interface/abstract class defined by ICU • In ICU4C, UnicodeString is a Replaceable subclass (with no out-of-band data, so no styles) • ICU4J defines ReplaceableString, a Replaceable subclass, also with no styles • Clients must define their own Replaceable subclass that implements their styled text.
Transliteration Sources • Søren Binks • http://homepage.mac.com/sirbinks/translit.html • UNGEGN • http://www.eki.ee/wgrs/ • …
API: Information • Like other ICU APIs, can get each of the available Transform IDs: • count = Transliterator::countAvailableIDs(); • myID = Transliterator::getAvailableID(n); • And get a localizable name for each: • Transliterator::getDisplayName(myID, france, nameForUser); • Note: these are C++ APIs; C and Java are also available.
API: Creation • Use an ID to create: • myTrans = Transliterator::createInstance("Latin-Greek", UTRANS_FORWARD, status);
API: Simple usage • Convert entire string • myTrans->transliterate(myString);
More Control • Specify Context • Use with Styled Text • [Figure: the string abcdefghijklmnopqrstuvwxyz annotated with the positions contextStart, start, limit, and contextLimit]