folded trie efficient data structure for all of unicode l.
Download
Skip this Video
Loading SlideShow in 5 Seconds..
Folded Trie: Efficient Data Structure for All of Unicode PowerPoint Presentation
Download Presentation
Folded Trie: Efficient Data Structure for All of Unicode

Loading in 2 Seconds...

play fullscreen
1 / 21

Folded Trie: Efficient Data Structure for All of Unicode - PowerPoint PPT Presentation


  • 543 Views
  • Uploaded on

Folded Trie: Efficient Data Structure for All of Unicode. Vladimir Weinstein vweinste@us.ibm.com. Globalization Center of Competency, San Jose, CA. Introduction. A lot of data for each code point Need appropriate data structures

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

PowerPoint Slideshow about 'Folded Trie: Efficient Data Structure for All of Unicode' - milla


An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
folded trie efficient data structure for all of unicode

Folded Trie: Efficient Data Structure for All of Unicode

Vladimir Weinstein

vweinste@us.ibm.com

Globalization Center of Competency, San Jose, CA

introduction
Introduction
  • A lot of data for each code point
  • Need appropriate data structures
  • Unicode version 3.1 introduced code points into supplementary space – addressable range grew to more than a million
  • Repetitive data
  • Sparsely populated range, especially the supplementary space
data structures
Data Structures
  • Arrays
    • Advantages: very fast access time, fast write time
    • Disadvantage: Unacceptable memory consumption
  • Hash tables
    • Advantages: Easy to use, Reasonably fast, General
    • Disadvantages: High overhead, complicated sequential access, slower than array lookup, data within ranges is not shared
data structures continued
Data Structures (continued)
  • Inversion Maps
    • Advantages: simple, very compact, fast boolean operations
    • Disadvantages: worse access time than arrays and possibly hash tables
  • For more details see “Bits of Unicode” at http://www.macchiato.com/slides/Bits_of_Unicode.ppt
tries
Tries
  • A trie is a structure with one or more indexes and one data storage.
  • Name comes from “Information Retrieval”
  • Shares repetitive data
  • Good compaction
  • Not appropriate for frequently changing data
single index trie
Single-Index Trie
  • A trie structure with an index array and a data array.
  • Advantages
    • Excellent size
    • Very good access performance (two array accesses, shift, mask and addition)
  • Disadvantages
    • Not appropriate for frequently changing data
    • Index array gets too big when dealing with supplementary code points
single index trie diagram

UPPER_WIDTH

LOWER_WIDTH

LOWER_MASK

BMP code point

Upper

Lower

15

0

Data Array

Index

0

Block

Data

0

Block

Single-Index Trie Diagram
double index trie
Double-Index Trie
  • Two index arrays and a data block
  • Compared to single-index trie:
    • Provides better compression of the index array
    • Worse performance, but still very fast
    • Feasible for supplementary code points
double index trie diagram

UPPER_WIDTH

MIDDLE_WIDTH

LOWER_WIDTH

MIDDLE_MASK

LOWER_MASK

Code point

Lower

Upper

Middle

0

20

Data

Index 1

Index 2

0

Index1

0

Index2

Block

Data

Double-Index Trie Diagram
folded trie
Folded Trie
  • Fast access for BMP code points
  • Slower access for supplementary code points, but far less frequent
  • Compacts supplementary index
  • Needs additional build time processing
  • Fast address with UTF-16 code units
    • no need to construct code point
folded trie supplementary access diagram

Lead Surrogate

15

0

1

Same for the surrogate block

110110..

3

Has data for surrogate block?

2

Folded Trie

Data

No

Trail Surrogate

Yes

0

15

9

110111..

Lead Surrogate Data

4

4

5

Pseudo Code Point

6

Index + Data

Folded Trie – Supplementary Access Diagram
  • BMP code points access same as with single-index

Final Data

icu implementation utrie
ICU Implementation: UTrie
  • ICU implementation is called UTrie
  • Stores either 16 bit or 32 bit wide data (extensible in the future)
  • Up to 256K different data elements
  • Can be frozen and reused as memory mapped image for fast startup
  • Using UTrie requires custom code

More about ICU at the end of presentation

range enumeration

start-1

Element 1

start

Element 2

Element 2

Element 2

Element 2

Element 2

Element 2

limit-1

Element 3

limit

Range Enumeration
  • Allows enumerating over a set of contiguous maximal ranges of same data elements
  • Elements can be preprocessed by additional callback
  • Saves time when processing the whole Unicode range by efficiently walking the trie structure
latin 1 fast path
Latin-1 Fast Path
  • Build time option
  • Allows direct array access for the Latin-1 range (0x00-0xFF)
  • Latin-1 range is not compressed if this option is used
  • Appropriate when access for Latin-1 range is critical
    • collation
example normalization data
Example: Normalization Data
  • Normalization data is stored using UTries
  • For example, main data has the following format

31

15

7

6

5

3

0

Extra data index

Combining class

BCK

FWD

QC_MAYBE

QC_NO

  • Can be either:
  • index to variable length data
  • first part of supplementary lookup value
  • Special handling indicator (Hangul, Jamo)

Combines back

Values for normalization quick check

Combines forward

  • Variable-length data contains composition and decomposition info
example character properties data

Data

Index

Example: Character Properties Data
  • The result of UTrie lookup is an index
  • Double indexing allows for even better compression, since many code points have the same property value
  • UTrie data width is 16 bit (thousands of data entries), while the property data width is 32 bits (few hundred unique data words).

Folded Trie

Property data

32 bits

16 bits

international components for unicode
International Components for Unicode
  • International Components for Unicode(ICU) is a library that provides robust and full-featured Unicode support
  • Several library services use the common UTrie implementation
  • Wide variety of supported platforms
  • open source (X license – non-viral)
  • C/C++ and Java versions
  • http://oss.software.ibm.com/icu/
conclusion
Conclusion
  • UTrie data structure provides good compression with fast access
  • The main constraint for usage is the nature of the data that needs to be stored
  • Designed for repetitive and sparse data
folding and surrogate access
Folding and Surrogate Access
  • Folding process compacts the index for supplementaries and moves it right above the BMP index
  • Access in ICU4C:
    • Define a C callback, invoked when special lead surrogate is detected
    • Manually detect special lead surrogates
  • In ICU4J, provide a subclass with a method that detects special lead surrogates
summary
Summary
  • Introduction: Storing Unicode data
  • Types of data structures
  • Tries
  • Single-index trie
  • Double-index trie
  • Folded trie
  • Usage of folded trie in normalization
  • Usage of folded trie for character properties