Developing a multilingual text analysis engine does using unicode solve all the issues l.jpg
Sponsored Links
This presentation is the property of its rightful owner.
1 / 39

Developing a multilingual text analysis engine - does using Unicode solve all the issues? PowerPoint PPT Presentation


  • 74 Views
  • Uploaded on
  • Presentation posted in: General

Developing a multilingual text analysis engine - does using Unicode solve all the issues?. Dr. Brian O’Donovan, IBM Ireland, Sept. 2002. Agenda. What is IBM LanguageWare? Where/How is it Used? Our Experiences of Conversion to Unicode Benefits Accruing Challenges to be Overcome

Download Presentation

Developing a multilingual text analysis engine - does using Unicode solve all the issues?

An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Presentation Transcript


Developing a multilingual text analysis engine - does using Unicode solve all the issues?

Dr. Brian O’Donovan,

IBM Ireland,

Sept. 2002


Agenda

  • What is IBM LanguageWare?

  • Where/How is it Used?

  • Our Experiences of Conversion to Unicode

    • Benefits Accruing

    • Challenges to be Overcome

  • Word Identification problems

  • Future work


What is IBM LanguageWare?

  • A suite of tools to assist in linguistic analysis

  • Initially developed by IBM USA, but now developed by a Globally distributed team

    • Used in internal and external products

    • Available to OEMs as a toolkit


Where are We?

  • Developers

    • North Carolina, USA

    • Dublin, Ireland

    • Helsinki, Finland

    • Taipei, Taiwan

    • Yamato, Japan

    • Seoul, Korea

  • Major Customers

    • Various Locations, USA

    • Böblingen, Germany


Afrikaans

Arabic

Catalan

Simplified Chinese

Traditional Chinese

Czech

Danish

Dutch (x2)

English

x4 regions & x2 domains

Finnish

French (x2)

German (x3)

Greek

Hebrew

Hungarian

Icelandic

Italian

Japanese

Korean

Norwegian (x2)

Polish

Portuguese (x2)

Russian

Spanish

Swedish

Tamil

Thai

Turkish

28 (or 30+) Languages Supported


Lexical

Analysis Hierarchy

Summarization

Grammar check

Words = spell check,

morphology, etc.


Where/How is it Used?

  • Word Processors

    • spell aid, grammar checker

  • Search Engines & Data Mining

    • Extracting Lemmatized form

    • Query Expansion (e.g. Synonyms)

    • Assisting in Taxonomy Generation

  • Translation Memory

    • Identification of linguistic units

  • NLP tools use as first stage of analysis

  • Machine Translation (limited)

  • Speech recognition (not currently)


Conversion to Unicode

  • Previous version used different code pages for each language

    • In fact a collection of different engines

  • New version has a single engine for all languages using Unicode (not all languages have been converted yet)

    • Provided many benefits

    • Raised some challenges


Benefits of Unicode

  • No more conversion tables

    • Standardized on utf-16

    • Big-Endian Dictionaries

    • Converted to platform byte order on load

  • Possible to deal properly with multilingual text

  • Able to utilize ICU utilities

  • Can edit/view all test cases on one machine


Issues with Unicode

  • Lots of work

  • Converting dictionaries is not trivial

  • Changing code would be huge (we were rewriting anyway)

  • Large code page

  • Some ambiguous representations & orthographic rules

  • ICU Tokenisation not always to our liking


How we dealt with large code page

  • Our architecture is based upon Finite State Transducers (FSTs)

  • In FSTs you frequently need to build tables with entries for every possible next character

    • for single byte code page => 256 entries

    • for double byte code page => 64K entries

    • If not careful, dictionaries might grow to 256 times their previous size

  • We deal with this by building a character transition table per dictionary

    • Each UNICODE character appearing in the dictionary is assigned a code by which it is referenced within the dictionary

      • This is not a private code page

      • The table is dynamically created when building the dictionary and will map differently each time

  • When looking up dictionary each character is first mapped through the table

    • If any characters are not in mapping table => we won't find a match in the dictionary

  • New characters are automatically added to the mapping table once words with the characters are added to the dictionary

  • Typically European languages require fewer than 100 characters

    • e.g. punctuation symbols do not appear in the dictionary


FSA Node

English Dictionary

a

b

c

Z


FSA Node

Japanese Dictionary


Representation Issues

  • Even with utf-16 as a universal standard representation issues can arise

  • The letter ë is represented as 0x00EB

    • But could be e=0x0065 followed by umlaut=0x0308

  • Arabic letter Heh (ه) = E5 in Windows code page

    • In UNICODE 0x0647

    • But it has different forms e.g. ههه

    • Unicode defines 4 more code points for presentation forms

    • isolated=0xFEE9, final=0xFEEA, initial=0xFEEB and medial=0xFEEC


Arabic Shape Bitmaps (Heh)

  • Heh varies with position

    • Isolated

    • Initial

    • Medial

    • Final


Arabic Shape Bitmaps (Seen)

  • Seen varies much less

    • Isolated

    • Initial

    • Medial

    • Final


Orthographic Variation

  • A word may appear differently in a sample text than in the dictionary.

    • All languages have different rules about what is allowed/required.

  • Some languages have simple rules

    • e.g. English casing rules

      • a lowercase dictionary word can appear in text as titlecase or uppercase

      • a titlecase dictionary word could appear in text as uppercase

      • an uppercase or mixed case dictionary word must appear in text exactly the same.


Other languages have more complex rules

  • In France letters lose their accent when capitalized

    • e.g. é is capitalized as E not É

  • but this rule does not apply in French Canada

    • So être becomes Etre in Paris but Être in Montreal

  • In German capitalization can change the letter count

    • 'ß' is sometimes capitalized as 'SS'

  • In English we optionally drop accents

    • e.g. the name Zoë can be written Zoe in English

    • But in German dropped umlauts are represented by a following e

      • e.g. Böblingen can be written Boeblingen


Vowel dropping in Semitic languages

  • Semitic languages such as Arabic and Hebrew, have an orthographic rule that short vowels may be dropped.

  • Imagine that Arabic has the words hat, hit, hut

  • They will be written in dictionary as

    • Hat = هَت

    • Hut = هُت

    • Hit = هِت

  • In any piece of text the string هت could be any of these words


Arabic examples in Large font

  • Hat = هَت

  • Hut = هُت

  • Hit = هِت

  • Ht = هت


Inflection

Variants

walk

walked

walking

Surface Forms

walk

Walk

WALK

walked

Walked

WALKED

walking

Walking

WALKING

Root Word

walk

Model of How Orthographic & Representation variation handled

Orthography

& Representation

Inflection


Inflection

Variants

passé

Surface Forms

passé

Passe

passe‌´

Passé

Passe‌‌´

Passe

PASSÉ

PASSE

PASSE‌‌́´

Root Word

passé

Model of How Orthographic & Representation variation handled

Orthography

& Representation

Inflection


Recognize talk

l

k

a

t

4

3

2

0

1


Recognize

walk or talk

w

l

k

a

t

4

3

2

0

1


Recognize

walk, walked, walking

talk, talked or talking

d

6

w

5

e

l

k

a

t

4

3

2

0

1

i

g

n

9

7

8


Recognize all variants of walk or talk

d

6

w

5

e

l

k

a

t

4

3

2

0

1

i

g

W

n

9

7

a

8

T

D

15

14

E

K

L

A

13

12

11

I

10

G

N

18

16

17


Recognize all variants of wálk or tálk

d

6

w

5

e

k

l

á

4

t

2

3

1

0

i

a

g

n

19

9

7

8

á

W

T

D

15

a

14

E

K

L

13

12

11

I

G

Á

N

18

10

16

17


Word Breaking

  • ICU break iterators can only provide us with a first pass of the word segmentation

    • Significant improvements in ICU 2.2

  • Sometimes the word segmentation is not obvious

    • multi word expressions

    • compound words

    • languages with no spaces


Multiword Expressions

  • Word break is not always obvious

    • French pommes de terre

      • as 3 words = apples of the ground

      • as 1 word = potatoes

    • French l'intelligence artificielle

      • really 2 words le and intelligence artificielle

  • Sometimes the word break is ambigous

    • English red tape

      • as 2 words = tape with red color

      • as 1 word = synonym of bureaucracy


Decompounding

  • German and related languages allow speakers to generate their own compound words by combining component words

    • typically a series of nouns and/or adjectives

  • ICU considers these one word

    • but to analyse them properly we need to break it into its constituent words and look these up in the dictionary

  • Can be computationally expensive

  • We can also sometimes get ambiguous results


Ambiguous Decompounding - example 1

  • Wachstube

    • Option 1 = wax tube

      • Wachs = wax

      • Tube = tube

    • Option 2 = guard room

      • Wache = guard

      • Stube = room

    • Option 3 = awake room

      • Wach = awake

      • Stube = room


Ambiguous Decompounding - example 2

  • Hochschullehrer = University Lecturer

  • 3 component words

    • Hoch = High

    • Schule = school

    • Lehrer = teacher

  • Could index under

    • Hochschullehrer = Univesity Lecturer

    • Hochschule = High School & Lehrer = Teacher

    • Hoch = Higher & Shullehrer = school teacher

  • However Schullehrer (=schoolteacher) is a lexicalised compound

    • The two words are so often used together that native speakers no consider them to be a single word

    • Hoch & Schullehrer is only valid decomposition

    • In fact most German Speakers would regard Hochschullehrer as a single lexical word because it is commonly used


Ambiguous Decompounding - example 2

  • Neuroschaltungsverstärkung

    • Neuro = neuronal

    • Schaltungs = circuit

    • verstärkung = amplification

  • Could be Neuroschaltungs|verstärkung

    • interpreted as an amplification of a "neuronal circuit"

  • Could also be Neuro|schaltungsverstärkung

    • interpreted as a kind of circuit amplification which is done by neuronal technology


Languages with no spaces

  • Some languages (e.g. Chinese, Japanese, Thai) do not put spaces between words

    • Therefore, it is difficult to figure out where the word breaks are

    • Sometimes there are multiple possible word segmentations.

    • Sometimes the choice of segmentation can change the meaning

  • For these languages we use ICU break iterator to find "unambiguous word break" (e.g. punctuation symbol, line break, character type change)

    • Looking in the dictionary we find all possible word combinations within this text sequence

    • Using statistical techniques we figure out which word sequence is most likely


此 = this (ci)

路 = road (lu)

不 = no/not (bu)

通 = through (tong)

此路不通 = cul-de-sac (cilubutong)

行 = walk (xing)

得 = get (de)

不得 = forbidden (bu-de)

在 = be (ci)

在此 = here (zaici)

小 = small (xiao)

便 = convenience (bian)

小便 = urinate (xiaobian)

此路不通行

此路不通行不得在此小便

Interp 1

此 /路 /不 /通 //行不得 /在此 /小便

No through way for pedestrians. Urination forbidden here.

Interp 2

此路不通

行不 / 得

在此 / 小便

Cul-de-sac. Walking Forbidden. Urinate here.

Chinese Segmentation Example

此路不通行不得在此小便


此 = this (ci)

路 = road (lu)

不 = no/not (bu)

通 = through (tong)

此路不通 = cul-de-sac (cilubutong)

行 = walk (xing)

Interp 1

此 /路 /不 /通 //行

This road not through walk

No through way for pedestrians.

Chinese Segmentation Example

此路不通行


此 = this (ci)

路 = road (lu)

不 = no/not (bu)

通 = through (tong)

此路不通 = cul-de-sac (cilubutong)

行 = walk (xing)

得 = get (de)

不得 = forbidden (bu-de)

在 = be (ci)

在此 = here (zaici)

小 = small (xiao)

便 = convenience (bian)

小便 = urinate (xiaobian)

Interp 1

此 /路 /不 /通 //行不得 /在此 /小便

This road not through walkForbidden here urinate

No through way for pedestrians. Urination forbidden here.

Chinese Segmentation Example

此路不通行/不得在此小便


此 = this (ci)

路 = road (lu)

不 = no/not (bu)

通 = through (tong)

此路不通 = cul-de-sac (cilubutong)

行 = walk (xing)

得 = get (de)

不得 = forbidden (bu-de)

在 = be (ci)

在此 = here (zaici)

小 = small (xiao)

便 = convenience (bian)

小便 = urinate (xiaobian)

Interp 1

此 /路 /不 /通 //行不得 /在此 /小便

No through way for walkers. Urination forbidden here.

Interp 2

此路不通 = cul-de-sac

行不 / 得 = walk not

在此 / 小便 = here urinate

Cul-de-sac. Walking Forbidden. Urinate here.

Chinese Segmentation Example

此路不通行不得在此小便


Future Challenges

  • Complete move of all languages to new architecture

  • Improving quality and breadth of linguistic data

    • more languages

    • more words (better user dictionary support)

    • richer relationships (e.g. part of, type of etc.)

  • Increasing Accuracy of analysis

    • Part of speech disambiguation

    • Ranking of parse results

  • Increasing performance speed

    • Latest version exceeds 2.5 Giga Char/hour on standard PC

    • Some customers say it is not fast enough, but this only allows 1.5 micro seconds/char

  • Available through ICU API


  • Login