
Bits of Unicode

Data structures for a large character set. Mark Davis, IBM Emerging Technologies.


Presentation Transcript


  1. Bits of Unicode • Data structures for a large character set • Mark Davis, IBM Emerging Technologies

  2. ☢ Caution ☢ • “Characters” is ambiguous; it can mean: • Graphemes: “x̣” (also “ch”, …) • Code points: 0078 0323 • Code units: 0078 0323 (or UTF-8: 78 CC A3) • For programmers: • Unicode associates codepoints (or sequences of codepoints) with properties • See UTR#17

  3. The Problem • Programs often have to do <key,value> lookups • Look up properties by codepoint • Map codepoints to values • Test codepoints for inclusion in set • e.g. value == true/false • Easy with 256 codepoints: just use array

  4. Size Matters • Not so easy with Unicode! • Unicode 3.0 • subset (except PUA) • up to FFFF₁₆ = 65,535₁₀ • Unicode 3.1 • full range • up to 10FFFF₁₆ = 1,114,111₁₀

  5. Array Lookup • With ASCII: simple, fast, compact • codepoint ➠ bit: 32 bytes • codepoint ➠ short: ½ K • With Unicode: simple, fast, huge (esp. v3.1) • codepoint ➠ bit: 136 K • codepoint ➠ short: 2.2 M
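
The sizes on this slide follow from simple arithmetic, sketched below. The helper name `flat_array_bytes` is invented for illustration, and the 256-entry "ASCII" case is assumed to mean a one-byte character set, matching the slide's 32-byte figure.

```python
# Size of a flat per-codepoint array, in bytes (illustrative helper).
def flat_array_bytes(codepoints: int, bits_per_entry: int) -> int:
    return (codepoints * bits_per_entry + 7) // 8

ONE_BYTE_SET = 0x100      # 256 codepoints
FULL_UNICODE = 0x110000   # 0..10FFFF, i.e. 1,114,112 codepoints

for name, count in (("one-byte set", ONE_BYTE_SET), ("Unicode 3.1", FULL_UNICODE)):
    print(name,
          "bit array:", flat_array_bytes(count, 1), "bytes;",
          "short array:", flat_array_bytes(count, 16), "bytes")
```

For the full range this gives 139,264 bytes (136 × 1024) for a bit array and 2,228,224 bytes for an array of 16-bit values, in line with the slide's rounded 136 K and 2.2 M.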

  6. Further complications • Mappings, tests, properties often must be for sequences of codepoints. • Human languages don’t just use single codepoints. • “ch” in Spanish, Slovak; etc.

  7. First step: Avoidance • Properties from libraries often suffice • Test for (Character.getType(c) == Nd) instead of a long list of codepoints • Easier • Automatically updated with new versions • Data structures from libraries often suffice • Java Hashtable • ICU (Java or C++) CompactArray • JavaScript properties • Consult http://www.unicode.org
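
The same "ask the library" idea can be sketched in Python, whose standard `unicodedata` module exposes the general category. This is an analogue of the slide's Java `Character.getType` test, not the slide's own code; the function name `is_decimal_digit` is made up here.

```python
import unicodedata

def is_decimal_digit(ch: str) -> bool:
    """True if the library's property data classifies ch as Nd (decimal digit)."""
    return unicodedata.category(ch) == "Nd"

# Works for digits in any script without listing codepoints by hand.
print(is_decimal_digit("7"), is_decimal_digit("\u0663"), is_decimal_digit("x"))
```

Because the category comes from the library's tables, the test picks up new digits whenever the runtime's Unicode data is updated.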

  8. Data structures: criteria • Speed • Read (static) • Write (dynamic) • Startup • Memory footprint • RAM • Disk • Multi-threading

  9. Hashtables • Advantages • Easy to use out-of-the-box • Reasonably fast • General • Disadvantages • High overhead • Discrete (no range lookup) • Much slower than array lookup

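  10. Overhead: char1 ➠ char2 • [Figure: one hashtable entry holds next, hash, key, and value fields plus object overhead; key and value each reference a separate object (char1, char2) with its own overhead]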

  11. Trie • Advantages • Nearly as fast as array lookup • Much smaller than arrays or Hashtables • Take advantage of repetition • Disadvantages • Not suited for rapidly changing data • Best for static, preformed data

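  12. Trie structure • [Figure: the codepoint is split into bit fields M1 and M2; M1 selects an entry in Index, which locates a block in Data; M2 selects the value within that block]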

  13. Trie code • 5 operations: Shift, Lookup, Mask, Add, Lookup • v = data[index[c >> S1] + (c & M2)] • [Figure: codepoint split into fields M1 and M2; S1 is the shift that isolates M1]
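
A minimal sketch of a single-index trie, assuming 256-codepoint blocks (`S1 = 8`) and a builder that shares identical blocks. The names `build_trie` and `trie_get` and the sample digit data are illustrative assumptions, not ICU's CompactArray.

```python
S1 = 8                    # assumed shift: 256 codepoints per data block
M2 = (1 << S1) - 1        # mask for the offset within a block

def build_trie(values):
    """Split values into blocks and store each distinct block only once."""
    index, data, seen = [], [], {}
    for start in range(0, len(values), 1 << S1):
        block = tuple(values[start:start + (1 << S1)])
        if block not in seen:
            seen[block] = len(data)      # offset of this block in data
            data.extend(block)
        index.append(seen[block])
    return index, data

def trie_get(index, data, c):
    # Shift, lookup, mask, add, lookup: the five operations on the slide.
    return data[index[c >> S1] + (c & M2)]

# Sample property: 1 for ASCII and Arabic-Indic digits, else 0.
values = [0] * 0x110000
for c in list(range(0x30, 0x3A)) + list(range(0x660, 0x66A)):
    values[c] = 1
index, data = build_trie(values)
print(len(index), len(data), trie_get(index, data, 0x665))
```

With this sample data only three distinct blocks survive, so the 1,114,112-entry array shrinks to a 4,352-entry index plus 768 data entries.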

  14. Trie: double indexed • Double, for more compaction: • Slightly slower than single index • Smaller chunks of data, so more compaction

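  15. Trie: double indexed • [Figure: the codepoint is split into bit fields M1, M2, and M3; M1 selects an entry in Index1, which locates a block in Index2; M2 selects an entry in that block, which locates a block in Data; M3 selects the value]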

  16. Trie code: double indexed • b1 = index1[c >> S1] • b2 = index2[b1 + ((c >> S2) & M2)] • v = data[b2 + (c & M3)] • [Figure: codepoint split into fields M1, M2, and M3; S1 and S2 are the shifts that isolate M1 and M2]
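
The double-indexed lookup can be sketched the same way. The field widths (`S1 = 10`, `S2 = 4`), the helper `_dedupe`, and the sample data are assumptions chosen for illustration.

```python
S1, S2 = 10, 4                     # assumed shifts: 6-bit middle field, 4-bit low field
M2 = (1 << (S1 - S2)) - 1
M3 = (1 << S2) - 1

def _dedupe(items, size):
    """Cut items into blocks of `size`; return (offset per block, pooled blocks)."""
    offsets, pool, seen = [], [], {}
    for start in range(0, len(items), size):
        block = tuple(items[start:start + size])
        if block not in seen:
            seen[block] = len(pool)
            pool.extend(block)
        offsets.append(seen[block])
    return offsets, pool

def build_trie2(values):
    block_offsets, data = _dedupe(values, 1 << S2)            # compact the data
    index1, index2 = _dedupe(block_offsets, 1 << (S1 - S2))   # compact the index
    return index1, index2, data

def trie2_get(index1, index2, data, c):
    b1 = index1[c >> S1]
    b2 = index2[b1 + ((c >> S2) & M2)]
    return data[b2 + (c & M3)]

values = [0] * 0x110000
for c in list(range(0x30, 0x3A)) + list(range(0x660, 0x66A)):
    values[c] = 1
index1, index2, data = build_trie2(values)
print(len(index1), len(index2), len(data))
```

Smaller data blocks find more repetition (here both digit runs collapse into one 16-entry block), at the cost of one extra lookup per access.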

  17. Inversion List • Compaction of set of codepoints • Advantages • Simple • Very compact • Faster write than trie • Very fast boolean operations • Disadvantages • Slower read than trie or hashtable

  18. Inversion List Structure • Structure • Index (optional) • List of codepoints in ascending order • Example Set [ 0020-0061, 0135, 19A3-201B ] • Inversion list (index ➠ codepoint): 0 ➠ 0020 (in), 1 ➠ 0062 (out), 2 ➠ 0135 (in), 3 ➠ 0136 (out), 4 ➠ 19A3 (in), 5 ➠ 201C (out)

  19. Inversion List Example • Find smallest i such that c < data[i] • If no such i, i = length • Then c ∈ List ↔ odd(i) • Examples: • In: 0023, 0135 • Out: 001A, 0136, A357 • data = [0020, 0062, 0135, 0136, 19A3, 201C]
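
The lookup rule above maps directly onto a standard binary search. The sketch below uses Python's `bisect_right`, which returns exactly "the smallest i such that c < data[i], or the length if there is none"; the function name `contains` is illustrative.

```python
from bisect import bisect_right

# The slide's example set [0020-0061, 0135, 19A3-201B] as an inversion list.
DATA = [0x0020, 0x0062, 0x0135, 0x0136, 0x19A3, 0x201C]

def contains(data, c):
    i = bisect_right(data, c)   # smallest i with c < data[i], else len(data)
    return i % 2 == 1           # odd index means "inside a range"

for c in (0x0023, 0x0135, 0x001A, 0x0136, 0xA357):
    print(f"{c:04X}", "in" if contains(DATA, c) else "out")
```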

  20. Inversion List Operations • Fast Boolean Operations • Example: Negation ➠ • Before: 0 ➠ 0020, 1 ➠ 0062, 2 ➠ 0135, 3 ➠ 0136, 4 ➠ 19A3, 5 ➠ 201C • After: 0 ➠ 0000, 1 ➠ 0020, 2 ➠ 0062, 3 ➠ 0135, 4 ➠ 0136, 5 ➠ 19A3, 6 ➠ 201C
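
Negation is cheap because it only shifts the parity of every boundary: prepending 0000 flips "in" and "out" everywhere. A small sketch follows, with the assumption (not shown on the slide) that a list already starting at 0000 is negated by dropping that entry instead.

```python
DATA = [0x0020, 0x0062, 0x0135, 0x0136, 0x19A3, 0x201C]

def negate(data):
    """Return the inversion list of the complement set."""
    if data and data[0] == 0:
        return data[1:]       # set started at 0000: complement starts later
    return [0] + data         # otherwise the complement starts at 0000

negated = negate(DATA)
print([f"{c:04X}" for c in negated])
```

Applying `negate` twice returns the original list, and no range data is copied or recomputed beyond the one boundary.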

  21. Inversion List: Binary Search • From Programming Pearls • Completely unrolled, precalculated parameters • int index = startIndex; • if (x >= data[auxStart]) { index += auxStart; } • switch (power) { • case 21: if (x < data[t = index-0x10000]) index = t; • case 20: if (x < data[t = index-0x8000]) index = t; • …
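
The idea behind the unrolled search is to probe at fixed power-of-two steps, so the loop can be replaced by a straight run of comparisons. The sketch below keeps the loop for readability and is a reconstruction in the spirit of the Programming Pearls technique, not a transcription of the slide's code; `upper_bound` and its variable names are assumptions. `bisect_right` is imported only to cross-check the result.

```python
from bisect import bisect_right  # used only as a reference for checking

def upper_bound(data, x):
    """Smallest i with x < data[i], or len(data); probes at power-of-two steps."""
    n = len(data)
    if n == 0:
        return 0
    step = 1 << (n.bit_length() - 1)   # largest power of two <= n
    low = -1                           # invariant: every data[j] with j <= low is <= x
    if data[step - 1] <= x:
        low = n - step                 # first probe absorbs the non-power-of-two excess
    step >>= 1
    while step:                        # an unrolled version lists these steps explicitly
        if data[low + step] <= x:
            low += step
        step >>= 1
    return low + 1

DATA = [0x0020, 0x0062, 0x0135, 0x0136, 0x19A3, 0x201C]
print(upper_bound(DATA, 0x0023), bisect_right(DATA, 0x0023))
```

Since the step sequence depends only on the list length, a static list lets every step and starting offset be computed once, ahead of time.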

  22. Inversion Map • Inversion List plus • Associated Values • Lookup index just as in Inversion List • Take corresponding value • Inversion list: 0 ➠ 0020, 1 ➠ 0062, 2 ➠ 0135, 3 ➠ 0136, 4 ➠ 19A3, 5 ➠ 201C • [Figure: a second array holds one value per range, indices 0 to 6; the sample values appear as 0, 5, 3, 9, 8, 3, 0]
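
An inversion map reuses the inversion-list search and then indexes a parallel array. In the sketch below, the boundaries are the slide's, while the pairing of values to ranges is an assumption, since the figure's layout did not survive in the transcript.

```python
from bisect import bisect_right

BOUNDS = [0x0020, 0x0062, 0x0135, 0x0136, 0x19A3, 0x201C]
# One value per range: below 0020, 0020-0061, 0062-0134, ... (assumed pairing).
VALUES = [0, 5, 3, 9, 8, 3, 0]

def lookup(c):
    """Find the range index as in an inversion list, then take its value."""
    return VALUES[bisect_right(BOUNDS, c)]

print(lookup(0x0023), lookup(0x0135), lookup(0xA357))
```

The value array always has one more entry than the boundary list, because n boundaries split the codepoint space into n + 1 ranges.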

  23. Key ➠ String Value • Problem • Often almost all values are 1 codepoint • But, must map to strings in a few cases • Don’t want overhead for strings always • Solution • Exception values indicate extra processing • Can use same solution for UTF-16 code units

  24. Example • Get a character ch • Find its value v • If v is in [D800..E000], may be string • check v2 = valueException[v - D800] • if v2 not null, process it, continue • Process v
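
The steps above can be sketched as follows. A plain dict stands in for the real codepoint lookup (a trie in practice), the sample mappings are invented, and the slide's range [D800..E000] is read here as D800 inclusive to E000 exclusive; all of these are assumptions.

```python
EXC_BASE, EXC_LIMIT = 0xD800, 0xE000

# Hypothetical data: most values are single codepoints; two are exception slots.
main_map = {ord("a"): ord("A"), 0x00DF: 0xD800, 0xFB03: 0xD801}
value_exception = ["SS", "FFI"]          # indexed by v - D800

def map_char(ch):
    v = main_map.get(ord(ch), ord(ch))   # find the value v for ch
    if EXC_BASE <= v < EXC_LIMIT:        # value may stand for a string
        slot = v - EXC_BASE
        v2 = value_exception[slot] if slot < len(value_exception) else None
        if v2 is not None:
            return v2                    # process the string value
    return chr(v)                        # ordinary single-codepoint value

print(map_char("a"), map_char("\u00df"), map_char("\ufb03"))
```

Only the rare string-valued entries pay for string storage; every other codepoint keeps a plain integer value.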

  25. String Key ➠ Value • Problem • Often almost all keys are 1 codepoint • Must have string keys in a few cases • Don’t want overhead for strings always • Solution • Exception values indicate possible follow-on codepoints • Can use same solution for UTF-16 code units • Use key closure!

  26. Closure • If (X + Y) is a key, then X is a key • Before: s ➠ x, sh ➠ y, shch ➠ z, c ➠ w • After: s ➠ x, sh ➠ y, shch ➠ z, c ➠ w, shc ➠ yw

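  27. Why Closure? • [Figure: matching the input “s h c h a …” one codepoint at a time yields x, y, yw, z in turn; at “a” the lookup is not found, so the last value found is used]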
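
A sketch of why closure matters, using the slide's sample keys. With a closed key set, a matcher can extend the candidate key one codepoint at a time and stop at the first miss. The function names are illustrative, and filling a missing prefix with the conversion of that prefix on its own is a simplifying assumption that suits this data but not every key set.

```python
def greedy(table, text):
    """Reference conversion: at each position try the longest key first."""
    out, pos, longest = [], 0, max(map(len, table))
    while pos < len(text):
        for n in range(min(longest, len(text) - pos), 0, -1):
            if text[pos:pos + n] in table:
                out.append(table[text[pos:pos + n]])
                pos += n
                break
        else:
            out.append(text[pos])        # unmapped codepoint passes through
            pos += 1
    return "".join(out)

def close_keys(table):
    """Add every missing proper prefix of every key (key closure)."""
    closed = dict(table)
    for key in table:
        for n in range(1, len(key)):
            closed.setdefault(key[:n], greedy(table, key[:n]))
    return closed

def convert(table, text):
    """Incremental matcher: extend until not found, then use the last value."""
    out, pos = [], 0
    while pos < len(text):
        last, end = None, pos + 1
        while end <= len(text) and text[pos:end] in table:
            last = (table[text[pos:end]], end)
            end += 1
        if last is None:
            out.append(text[pos])
            pos += 1
        else:
            out.append(last[0])
            pos = last[1]
    return "".join(out)

table = {"s": "x", "sh": "y", "shch": "z", "c": "w"}
closed = close_keys(table)
print(closed["shc"], convert(closed, "shcha"), convert(table, "shcha"))
```

Without the added key shc, the incremental matcher gives up after sh and never reaches shch; with it, the walk continues through shc to the full match.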

  28. Bitpacking • Squeeze information into value • Example: Character Properties • category: 5 bits • bidi: 4 bits (+ exceptions) • canonical category: 6 bits + expansion • compressCanon = (bits >> SHIFT) & MASK; • canon = expansionArray[compressCanon];
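
A sketch of the packing scheme using the slide's field widths. The field order, the function names, and the contents of the expansion array are assumptions for illustration; the real ICU layout may differ.

```python
CATEGORY_BITS, BIDI_BITS, CANON_BITS = 5, 4, 6
BIDI_SHIFT = CATEGORY_BITS                   # bidi sits above category
CANON_SHIFT = CATEGORY_BITS + BIDI_BITS      # compressed canonical class on top
CATEGORY_MASK = (1 << CATEGORY_BITS) - 1
BIDI_MASK = (1 << BIDI_BITS) - 1
CANON_MASK = (1 << CANON_BITS) - 1

# Hypothetical expansion table: small index -> actual combining class.
EXPANSION_ARRAY = [0, 1, 7, 8, 9, 202, 216, 220, 230]

def pack(category, bidi, compress_canon):
    if not (0 <= category <= CATEGORY_MASK and 0 <= bidi <= BIDI_MASK
            and 0 <= compress_canon <= CANON_MASK):
        raise ValueError("field out of range")
    return category | (bidi << BIDI_SHIFT) | (compress_canon << CANON_SHIFT)

def unpack(bits):
    category = bits & CATEGORY_MASK
    bidi = (bits >> BIDI_SHIFT) & BIDI_MASK
    compress_canon = (bits >> CANON_SHIFT) & CANON_MASK
    return category, bidi, EXPANSION_ARRAY[compress_canon]

bits = pack(12, 3, 8)
print(hex(bits), unpack(bits))
```

All three properties fit in 15 bits, so one 16-bit trie value per codepoint carries them together.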

  29. Statetables • Classic: • entry = stateTable[ state, ch ]; • state = entry.state; • doSomethingWith( entry.action ); • until (state < 0);

  30. Statetables • Unicode: • type = trie[ch]; • entry = stateTable[ state, type ]; • state = entry.state; • doSomethingWith( entry.action ); • until (state < 0); • Also, String Key ➠ Value
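
A sketch of the Unicode variant: each character is first reduced to a small type, and the state table is indexed by (state, type) rather than by raw codepoint. Here `unicodedata.category` stands in for the trie lookup, the table recognizes an optionally signed run of decimal digits, and the entry's action is dropped for brevity; the table, the type names, and `match_integer` are all invented for illustration.

```python
import unicodedata

DIGIT, SIGN, OTHER = 0, 1, 2           # character types (columns of the table)
STOP_OK, STOP_FAIL = -1, -2            # negative states end the loop

def char_type(ch):
    """Stand-in for type = trie[ch]."""
    if ch in "+-":
        return SIGN
    return DIGIT if unicodedata.category(ch) == "Nd" else OTHER

STATE_TABLE = [
    # DIGIT  SIGN       OTHER
    [2,      1,         STOP_FAIL],    # state 0: start
    [2,      STOP_FAIL, STOP_FAIL],    # state 1: after a sign
    [2,      STOP_OK,   STOP_OK],      # state 2: inside the digits
]

def match_integer(text, pos=0):
    """Length of the signed integer starting at pos, or 0 if there is none."""
    state, i = 0, pos
    while True:
        t = char_type(text[i]) if i < len(text) else OTHER   # end acts as OTHER
        state = STATE_TABLE[state][t]
        if state < 0:
            break
        i += 1
    return i - pos if state == STOP_OK else 0

print(match_integer("-42x"), match_integer("\u0663\u0664 "), match_integer("+x"))
```

The table has three columns instead of more than a million, which is what makes state tables practical for a large character set.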

  31. Sample Data Structures: ICU • Trie: CompactArray • Customized for each datatype • Automatic expansion • Compact after setting • Character Properties • use CompactArray, Bitpacking • Inversion List: UnicodeSet • Boolean Operations

  32. Sample Usage #1: ICU • Collation • Trie lookup • Expanding character: String Key ➠ Value • Contracting character: Key ➠ String Value • Break Iterators • For grapheme, word, line, sentence break • Statetable

  33. Sample Usage #2: ICU • Transliteration • Requires • Mapping codepoints in context to others • Rearranging codepoints • Controlling the choice of mapping • Character Properties • Inversion List • Exception values

  34. Sample Usage #3: ICU • Character Conversion • From Unicode to bytes • Trie • From bytes to Unicode • Arrays for simple maps • Statetables for complex maps • recognizes valid / invalid mappings • provides compaction • Complications • Invalid vs. Valid mapped vs. Valid unmapped • Fallbacks

  35. References • Unicode Open Source — ICU • http://oss.software.ibm.com/icu • ICU4j: Java API • ICU4c: C and C++ APIs • Other references — see Mark’s website: • http://www.macchiato.com

  36. Q & A
