
Beyond Text Representation

Beyond Text Representation: Building on Unicode to Implement a Multilingual Text Analysis Framework. Thomas Hampp – IBM Germany Content Management Development. Basic Text Analysis Tasks: code page conversion and text representation; segmentation (tokens, sentences, paragraphs)


Presentation Transcript


  1. Beyond Text Representation Building on Unicode to Implement a Multilingual Text Analysis Framework Thomas Hampp – IBM Germany Content Management Development

  2. Basic Text Analysis Tasks • Code page conversion and text representation • Segmentation (tokens, sentences, paragraphs) • Morphological analysis / dictionary lookup • Compound word decomposition • Spell Checking/Spell Aid • … 18th International Unicode Conference

  3. Advanced Text Analysis Tasks • Summarization • Categorization/Clustering • Extraction of names, terms or relations • Information extraction • Parsing All tasks should be provided for all languages

  4. A Library for Text Analysis • The same text analysis tasks are needed in different multilingual contexts/systems • The same software library should be used in all contexts/systems to perform the analysis • The library should work in a language-neutral fashion • The text analysis tasks required for a given context/system should be an input parameter for the library

  5. Two Problems and One Solution • The realization of such a library faces two kinds of challenges: • A. Implementing the actual language-specific analysis tasks • B. Encapsulating the language-specific processing by representing input and output in a language-neutral fashion • Unicode plays a major role in solving problem B

  6. A Software Design for a Text Analysis Library • Single API towards the application • Separated but combinable language-specific processing modules • Central representation system for linguistic information • Centralized flow of control driven by linguistic analysis targets
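
The idea of a control flow driven by analysis targets can be sketched as follows. This is only an illustration under assumptions, not the library's actual API: the names `Module`, `Engine`, `produces`, `needs` and `plan` are hypothetical. Each module declares which result type it produces and which it consumes, and the engine derives an execution order from the targets the application requests.

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical description of one language-specific processing module.
struct Module {
    std::string name;                // module identifier
    std::string produces;            // result type this module creates
    std::vector<std::string> needs;  // result types it consumes
};

// Hypothetical engine: plans which modules to run for the requested targets.
class Engine {
public:
    void add(const Module& m) { by_product_[m.produces] = m; }

    // Returns module names in an order that satisfies all dependencies.
    std::vector<std::string> plan(const std::vector<std::string>& targets) const {
        std::vector<std::string> order;
        std::set<std::string> done;
        for (const auto& t : targets) schedule(t, done, order);
        return order;
    }

private:
    void schedule(const std::string& target, std::set<std::string>& done,
                  std::vector<std::string>& order) const {
        if (done.count(target)) return;         // already planned
        auto it = by_product_.find(target);
        if (it == by_product_.end()) return;    // unknown targets are ignored in this sketch
        done.insert(target);                    // mark first so cyclic declarations terminate
        for (const auto& n : it->second.needs) schedule(n, done, order);
        order.push_back(it->second.name);       // run after everything it needs
    }

    std::map<std::string, Module> by_product_;
};
```

With a tokenizer producing `tokens` and a lemmatizer that needs `tokens` to produce `lemmas`, requesting only `lemmas` would plan the tokenizer first and the lemmatizer second; modules that no requested target depends on are never scheduled.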

  7. Implementation • Implemented as a C++ DLL/shared library • Provides an extensive object-oriented API for applications and plugins • Uses Unicode (ICU based) for all text content • Ported to 9 platforms (therefore no platform-dependent solutions are acceptable) • Strong focus on performance because of its use in search/indexing • Supports 30+ languages and 90+ code pages

  8. Enter Unicode • Used as internal character representation format (character set) • Converters from/to over 90 external code pages had to be written/integrated • A decision had to be made on the Unicode encoding format: we chose UTF-16

  9. The Pros of UTF-16 • We started out without knowledge of surrogate issues • False assumption: Fixed length encoding • Good balance between size and straightforward representation • Efficient interoperability with Windows, Java, XML4C APIs etc.

  10. The Cons of UTF-16 • Not a fixed length encoding because of surrogates • Cannot be passed to legacy functions (C library, OS APIs) • Character classification functions have to work on pointers for surrogates • Wastes some space with western languages
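
To make the surrogate issue concrete, here is a minimal sketch of code-point iteration over UTF-16, written with the standard `std::u16string` rather than the library's own types. The function name `to_code_points` is hypothetical, and passing unpaired surrogates through unchanged is an assumption of this sketch, not a documented policy of the framework.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Decode UTF-16 code units into code points, combining surrogate pairs.
// Unpaired surrogates are passed through unchanged (a choice made for this sketch).
std::vector<char32_t> to_code_points(const std::u16string& s) {
    std::vector<char32_t> out;
    for (std::size_t i = 0; i < s.size(); ++i) {
        char32_t c = s[i];
        const bool is_high = c >= 0xD800 && c <= 0xDBFF;
        if (is_high && i + 1 < s.size()) {
            const char32_t low = s[i + 1];
            if (low >= 0xDC00 && low <= 0xDFFF) {
                // Combine the pair into one supplementary code point.
                c = 0x10000 + ((c - 0xD800) << 10) + (low - 0xDC00);
                ++i;  // the low surrogate has been consumed
            }
        }
        out.push_back(c);
    }
    return out;
}
```

The string `u"a\U0001D11E"` occupies three code units but decodes to two code points, which is why a classification function that takes a single 16-bit value cannot see supplementary characters and has to be handed a position in the string instead.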

  11. ANSI C/C++ Compatibility • ANSI C++ does define a type wchar_t for “wide” character representation (and a matching wide string class wstring) • Unfortunately size and encoding of wchar_t are not standardized • So we combined the ANSI C++ basic_string template class with the Unicode character data type from ICU to create a C++ and Unicode conformant string class
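
The combination described on this slide might look roughly like the sketch below. It uses the standard `char16_t` as a stand-in for ICU's 16-bit character type so that it compiles without ICU; the alias names and the helper `from_latin1` are hypothetical and not taken from the library.

```cpp
#include <string>

// Stand-in for ICU's character type; assumed here to be a 16-bit code unit.
using UnicodeChar = char16_t;

// A Unicode string class obtained by instantiating the standard template.
using UnicodeString16 = std::basic_string<UnicodeChar>;

// Widen ISO-8859-1 bytes to UTF-16. This works as a plain cast because
// ISO-8859-1 maps one-to-one onto U+0000..U+00FF.
UnicodeString16 from_latin1(const std::string& bytes) {
    UnicodeString16 out;
    out.reserve(bytes.size());
    for (unsigned char b : bytes) {
        out.push_back(static_cast<UnicodeChar>(b));
    }
    return out;
}
```

Because the result is an ordinary `basic_string` instantiation, it keeps the familiar interface (`size`, `find`, `substr`, iterators) while having a fixed code unit width on every platform, which `wstring` does not guarantee.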

  12. Impact Beyond Character Representation • Tokenization • Finite state processing • Dictionary formats • “Environmental” issues • Development tools support

  13. Impact: Tokenization • Tokenization needs access to character properties • Most but not all relevant properties are provided by the Unicode Character Database • For application-defined properties there is no longer a fast and simple 256-entry character property lookup • Approach limited to western scripts
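
One common way to recover a fast lookup for application-defined properties over 16-bit code units is a two-stage table. The sketch below is an assumption about how this could be done, not a description of the framework; `PropertyTable` and the flag names are hypothetical.

```cpp
#include <array>
#include <cstdint>
#include <memory>

// Hypothetical application-defined property flags.
constexpr std::uint8_t kTokenBreak = 1;
constexpr std::uint8_t kHyphenLike = 2;

// Two-stage lookup: the high byte selects a 256-entry block, the low byte
// selects the entry. Blocks are allocated only for ranges that are used.
class PropertyTable {
    using Block = std::array<std::uint8_t, 256>;

public:
    void set(char16_t c, std::uint8_t flags) {
        auto& block = blocks_[c >> 8];
        if (!block) block = std::make_unique<Block>();  // value-initialized to zeros
        (*block)[c & 0xFF] = flags;
    }

    std::uint8_t get(char16_t c) const {
        const auto& block = blocks_[c >> 8];
        return block ? (*block)[c & 0xFF] : 0;  // missing block: no properties set
    }

private:
    std::array<std::unique_ptr<Block>, 256> blocks_;
};
```

A lookup costs two array accesses instead of one, and memory grows only with the number of 256-character ranges that actually carry properties. Supplementary characters (surrogate pairs) are outside the scope of this sketch.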

  14. Impact: Finite State Processing • Finite state character processing in C usually works with transition tables encoded as arrays • This is easy to implement and very fast in execution • To cover the full range of all Unicode characters, more sophisticated transition tables are required
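
One way to keep array-based transition tables small is to map each character to a character class first and index the table by class instead of by character. The sketch below illustrates that technique under assumptions: the states, classes and the range-based `classify` function are hypothetical and deliberately crude, and a real system would derive classes from the Unicode Character Database.

```cpp
#include <cstddef>
#include <string>

enum CharClass { kOther = 0, kLetter = 1, kDigit = 2, kNumClasses = 3 };
enum State { kOutside = 0, kInWord = 1, kInNumber = 2, kNumStates = 3 };

// Crude range-based classification, for illustration only.
CharClass classify(char16_t c) {
    if (c >= u'0' && c <= u'9') return kDigit;
    if ((c >= u'a' && c <= u'z') || (c >= u'A' && c <= u'Z')) return kLetter;
    // Latin-1 and Latin Extended letters, excluding the multiplication and division signs.
    if (c >= 0x00C0 && c <= 0x024F && c != 0x00D7 && c != 0x00F7) return kLetter;
    // The whole Cyrillic block is treated as letters in this sketch.
    if (c >= 0x0400 && c <= 0x04FF) return kLetter;
    return kOther;
}

// Transition table indexed by [state][character class] rather than by character.
const State kTransitions[kNumStates][kNumClasses] = {
    /* kOutside  */ {kOutside, kInWord, kInNumber},
    /* kInWord   */ {kOutside, kInWord, kInWord},
    /* kInNumber */ {kOutside, kInWord, kInNumber},
};

// Count tokens: a token starts whenever the automaton leaves kOutside.
std::size_t count_tokens(const std::u16string& text) {
    std::size_t tokens = 0;
    State state = kOutside;
    for (char16_t c : text) {
        const State next = kTransitions[state][classify(c)];
        if (state == kOutside && next != kOutside) ++tokens;
        state = next;
    }
    return tokens;
}
```

The table has 3 x 3 entries here instead of 3 x 65,536, at the price of one classification step per character; that step could itself be a table lookup rather than the range checks used above.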

  15. Impact: Dictionaries • Dictionaries tend to be large • As much of them as possible has to be loaded in memory for performance reasons • For multilingual (server) applications multiple dictionaries will be in memory • Therefore dictionary size matters a great deal • Doubling dictionary size might not be a viable option

  16. Impact: “Environmental” Issues • There is always a residue of single-byte string data (from message catalog, command line, library calls etc.) which sometimes has to be mixed with Unicode string data • Interfaces for console, messages, logs etc. are mostly single byte • Configuration files should be platform-neutral, easily editable and support the full Unicode character set

  17. Impact: Development Tools Support • Only specialized editors can handle Unicode text • Most debuggers don’t display Unicode • Source code string constants are hard to maintain • Message catalog compilers on some platforms are not Unicode enabled

  18. A Word About Unicode Normalization Forms • For reasons of efficient interoperability a fixed Unicode normalization form had to be specified • Early normalization is performance critical • Since round-trip convertibility was not a design goal, Unicode Normalization Form KC (compatibility decomposition followed by canonical composition) has been chosen • Normalization and code page conversion can and should be done in one step
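
Doing normalization and code page conversion in one step can be pictured as a converter whose mapping table already emits normalized output. The sketch below does this for ISO-8859-1 input and a handful of compatibility mappings; the function name is hypothetical, and the table is illustrative and incomplete (for example, the spacing diacritics are left out). A real converter would generate its tables from the Unicode Character Database.

```cpp
#include <string>

// Convert ISO-8859-1 bytes to UTF-16 and apply some NFKC compatibility
// mappings in the same pass. Only a subset of mappings is listed here.
std::u16string latin1_to_nfkc_utf16(const std::string& bytes) {
    std::u16string out;
    out.reserve(bytes.size());
    for (unsigned char b : bytes) {
        switch (b) {
            case 0xA0: out += u' '; break;               // NO-BREAK SPACE -> SPACE
            case 0xAA: out += u'a'; break;               // FEMININE ORDINAL INDICATOR -> a
            case 0xB2: out += u'2'; break;               // SUPERSCRIPT TWO -> 2
            case 0xB3: out += u'3'; break;               // SUPERSCRIPT THREE -> 3
            case 0xB5: out += u'\u03BC'; break;          // MICRO SIGN -> GREEK SMALL LETTER MU
            case 0xB9: out += u'1'; break;               // SUPERSCRIPT ONE -> 1
            case 0xBA: out += u'o'; break;               // MASCULINE ORDINAL INDICATOR -> o
            case 0xBC: out += u"1\u2044" u"4"; break;    // ONE QUARTER -> 1, FRACTION SLASH, 4
            case 0xBD: out += u"1\u2044" u"2"; break;    // ONE HALF -> 1, FRACTION SLASH, 2
            case 0xBE: out += u"3\u2044" u"4"; break;    // THREE QUARTERS -> 3, FRACTION SLASH, 4
            default:   out += static_cast<char16_t>(b);  // all other bytes map one-to-one
        }
    }
    return out;
}
```

The input is traversed once and no intermediate unnormalized string is built. The example also shows why this form gives up round-trip convertibility: after conversion, `m²` and `m2` are indistinguishable.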

  19. Benefits of Unicode Use • No more code page troubles within the boundaries of the application • Very often algorithms can be established for groups of languages • Multilanguage document collections and even mixed language documents are no problem to represent • Easy and efficient Java (JNI) integration

  20. Summing Up: Building on Unicode… • …solves only the basic character representation problem for multilingual text analysis • …sets a solid foundation for a multilingual system • …enables algorithms to be reused for groups of languages • …can have impact on the system far beyond the character representation level • …has been worth the trouble
