Developing a multilingual text analysis engine does using unicode solve all the issues l.jpg
This presentation is the property of its rightful owner.
Sponsored Links
1 / 39

Developing a multilingual text analysis engine - does using Unicode solve all the issues? PowerPoint PPT Presentation


  • 67 Views
  • Uploaded on
  • Presentation posted in: General

Developing a multilingual text analysis engine - does using Unicode solve all the issues?. Dr. Brian O’Donovan, IBM Ireland, Sept. 2002. Agenda. What is IBM LanguageWare? Where/How is it Used? Our Experiences of Conversion to Unicode Benefits Accruing Challenges to be Overcome

Download Presentation

Developing a multilingual text analysis engine - does using Unicode solve all the issues?

An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Presentation Transcript


Developing a multilingual text analysis engine does using unicode solve all the issues l.jpg

Developing a multilingual text analysis engine - does using Unicode solve all the issues?

Dr. Brian O’Donovan,

IBM Ireland,

Sept. 2002


Agenda l.jpg

Agenda

  • What is IBM LanguageWare?

  • Where/How is it Used?

  • Our Experiences of Conversion to Unicode

    • Benefits Accruing

    • Challenges to be Overcome

  • Word Identification problems

  • Future work


What is ibm languageware l.jpg

What is IBM LanguageWare?

  • A suite of tools to assist in linguistic analysis

  • Initially developed by IBM USA, but now developed by a Globally distributed team

    • Used in internal and external products

    • Available to OEMs as a toolkit


Where are we l.jpg

Where are We?

  • Developers

    • North Carolina, USA

    • Dublin, Ireland

    • Helsinki, Finland

    • Taipei, Taiwan

    • Yamato, Japan

    • Seoul, Korea

  • Major Customers

    • Various Locations, USA

    • Böblingen, Germany


28 or 30 languages supported l.jpg

Afrikaans

Arabic

Catalan

Simplified Chinese

Traditional Chinese

Czech

Danish

Dutch (x2)

English

x4 regions & x2 domains

Finnish

French (x2)

German (x3)

Greek

Hebrew

Hungarian

Icelandic

Italian

Japanese

Korean

Norwegian (x2)

Polish

Portuguese (x2)

Russian

Spanish

Swedish

Tamil

Thai

Turkish

28 (or 30+) Languages Supported


Analysis hierarchy l.jpg

Lexical

Analysis Hierarchy

Summarization

Grammar check

Words = spell check,

morphology, etc.


Where how is it used l.jpg

Where/How is it Used?

  • Word Processors

    • spell aid, grammar checker

  • Search Engines & Data Mining

    • Extracting Lemmatized form

    • Query Expansion (e.g. Synonyms)

    • Assisting in Taxonomy Generation

  • Translation Memory

    • Identification of linguistic units

  • NLP tools use as first stage of analysis

  • Machine Translation (limited)

  • Speech recognition (not currently)


Conversion to unicode l.jpg

Conversion to Unicode

  • Previous version used different code pages for each language

    • In fact a collection of different engines

  • New version has a single engine for all languages using Unicode (not all languages have been converted yet)

    • Provided many benefits

    • Raised some challenges


Benefits of unicode l.jpg

Benefits of Unicode

  • No more conversion tables

    • Standardized on utf-16

    • Big-Endian Dictionaries

    • Converted to platform byte order on load

  • Possible to deal properly with multilingual text

  • Able to utilize ICU utilities

  • Can edit/view all test cases on one machine


Issues with unicode l.jpg

Issues with Unicode

  • Lots of work

  • Converting dictionaries is not trivial

  • Changing code would be huge (we were rewriting anyway)

  • Large code page

  • Some ambiguous representations & orthographic rules

  • ICU Tokenisation not always to our liking


How we dealt with large code page l.jpg

How we dealt with large code page

  • Our architecture is based upon Finite State Transducers (FSTs)

  • In FSTs you frequently need to build tables with entries for every possible next character

    • for single byte code page => 256 entries

    • for double byte code page => 64K entries

    • If not careful, dictionaries might grow to 256 times their previous size

  • We deal with this by building a character transition table per dictionary

    • Each UNICODE character appearing in the dictionary is assigned a code by which it is referenced within the dictionary

      • This is not a private code page

      • The table is dynamically created when building the dictionary and will map differently each time

  • When looking up dictionary each character is first mapped through the table

    • If any characters are not in mapping table => we won't find a match in the dictionary

  • New characters are automatically added to the mapping table once words with the characters are added to the dictionary

  • Typically European languages require fewer than 100 characters

    • e.g. punctuation symbols do not appear in the dictionary


Slide12 l.jpg

FSA Node

English Dictionary

a

b

c

Z


Slide13 l.jpg

FSA Node

Japanese Dictionary


Representation issues l.jpg

Representation Issues

  • Even with utf-16 as a universal standard representation issues can arise

  • The letter ë is represented as 0x00EB

    • But could be e=0x0065 followed by umlaut=0x0308

  • Arabic letter Heh (ه) = E5 in Windows code page

    • In UNICODE 0x0647

    • But it has different forms e.g. ههه

    • Unicode defines 4 more code points for presentation forms

    • isolated=0xFEE9, final=0xFEEA, initial=0xFEEB and medial=0xFEEC


Arabic shape bitmaps heh l.jpg

Arabic Shape Bitmaps (Heh)

  • Heh varies with position

    • Isolated

    • Initial

    • Medial

    • Final


Arabic shape bitmaps seen l.jpg

Arabic Shape Bitmaps (Seen)

  • Seen varies much less

    • Isolated

    • Initial

    • Medial

    • Final


Orthographic variation l.jpg

Orthographic Variation

  • A word may appear differently in a sample text than in the dictionary.

    • All languages have different rules about what is allowed/required.

  • Some languages have simple rules

    • e.g. English casing rules

      • a lowercase dictionary word can appear in text as titlecase or uppercase

      • a titlecase dictionary word could appear in text as uppercase

      • an uppercase or mixed case dictionary word must appear in text exactly the same.


Other languages have more complex rules l.jpg

Other languages have more complex rules

  • In France letters lose their accent when capitalized

    • e.g. é is capitalized as E not É

  • but this rule does not apply in French Canada

    • So être becomes Etre in Paris but Être in Montreal

  • In German capitalization can change the letter count

    • 'ß' is sometimes capitalized as 'SS'

  • In English we optionally drop accents

    • e.g. the name Zoë can be written Zoe in English

    • But in German dropped umlauts are represented by a following e

      • e.g. Böblingen can be written Boeblingen


Vowel dropping in semitic languages l.jpg

Vowel dropping in Semitic languages

  • Semitic languages such as Arabic and Hebrew, have an orthographic rule that short vowels may be dropped.

  • Imagine that Arabic has the words hat, hit, hut

  • They will be written in dictionary as

    • Hat = هَت

    • Hut = هُت

    • Hit = هِت

  • In any piece of text the string هت could be any of these words


Arabic examples in large font l.jpg

Arabic examples in Large font

  • Hat = هَت

  • Hut = هُت

  • Hit = هِت

  • Ht = هت


Model of how orthographic representation variation handled l.jpg

Inflection

Variants

walk

walked

walking

Surface Forms

walk

Walk

WALK

walked

Walked

WALKED

walking

Walking

WALKING

Root Word

walk

Model of How Orthographic & Representation variation handled

Orthography

& Representation

Inflection


Model of how orthographic representation variation handled22 l.jpg

Inflection

Variants

passé

Surface Forms

passé

Passe

passe‌´

Passé

Passe‌‌´

Passe

PASSÉ

PASSE

PASSE‌‌́´

Root Word

passé

Model of How Orthographic & Representation variation handled

Orthography

& Representation

Inflection


Slide23 l.jpg

Recognize talk

l

k

a

t

4

3

2

0

1


Slide24 l.jpg

Recognize

walk or talk

w

l

k

a

t

4

3

2

0

1


Slide25 l.jpg

Recognize

walk, walked, walking

talk, talked or talking

d

6

w

5

e

l

k

a

t

4

3

2

0

1

i

g

n

9

7

8


Slide26 l.jpg

Recognize all variants of walk or talk

d

6

w

5

e

l

k

a

t

4

3

2

0

1

i

g

W

n

9

7

a

8

T

D

15

14

E

K

L

A

13

12

11

I

10

G

N

18

16

17


Slide27 l.jpg

Recognize all variants of wálk or tálk

d

6

w

5

e

k

l

á

4

t

2

3

1

0

i

a

g

n

19

9

7

8

á

W

T

D

15

a

14

E

K

L

13

12

11

I

G

Á

N

18

10

16

17


Word breaking l.jpg

Word Breaking

  • ICU break iterators can only provide us with a first pass of the word segmentation

    • Significant improvements in ICU 2.2

  • Sometimes the word segmentation is not obvious

    • multi word expressions

    • compound words

    • languages with no spaces


Multiword expressions l.jpg

Multiword Expressions

  • Word break is not always obvious

    • French pommes de terre

      • as 3 words = apples of the ground

      • as 1 word = potatoes

    • French l'intelligence artificielle

      • really 2 words le and intelligence artificielle

  • Sometimes the word break is ambigous

    • English red tape

      • as 2 words = tape with red color

      • as 1 word = synonym of bureaucracy


Decompounding l.jpg

Decompounding

  • German and related languages allow speakers to generate their own compound words by combining component words

    • typically a series of nouns and/or adjectives

  • ICU considers these one word

    • but to analyse them properly we need to break it into its constituent words and look these up in the dictionary

  • Can be computationally expensive

  • We can also sometimes get ambiguous results


Ambiguous decompounding ex ample 1 l.jpg

Ambiguous Decompounding - example 1

  • Wachstube

    • Option 1 = wax tube

      • Wachs = wax

      • Tube = tube

    • Option 2 = guard room

      • Wache = guard

      • Stube = room

    • Option 3 = awake room

      • Wach = awake

      • Stube = room


Ambiguous decompounding ex ample 2 l.jpg

Ambiguous Decompounding - example 2

  • Hochschullehrer = University Lecturer

  • 3 component words

    • Hoch = High

    • Schule = school

    • Lehrer = teacher

  • Could index under

    • Hochschullehrer = Univesity Lecturer

    • Hochschule = High School & Lehrer = Teacher

    • Hoch = Higher & Shullehrer = school teacher

  • However Schullehrer (=schoolteacher) is a lexicalised compound

    • The two words are so often used together that native speakers no consider them to be a single word

    • Hoch & Schullehrer is only valid decomposition

    • In fact most German Speakers would regard Hochschullehrer as a single lexical word because it is commonly used


Ambiguous decompounding ex ample 233 l.jpg

Ambiguous Decompounding - example 2

  • Neuroschaltungsverstärkung

    • Neuro = neuronal

    • Schaltungs = circuit

    • verstärkung = amplification

  • Could be Neuroschaltungs|verstärkung

    • interpreted as an amplification of a "neuronal circuit"

  • Could also be Neuro|schaltungsverstärkung

    • interpreted as a kind of circuit amplification which is done by neuronal technology


Languages with no spaces l.jpg

Languages with no spaces

  • Some languages (e.g. Chinese, Japanese, Thai) do not put spaces between words

    • Therefore, it is difficult to figure out where the word breaks are

    • Sometimes there are multiple possible word segmentations.

    • Sometimes the choice of segmentation can change the meaning

  • For these languages we use ICU break iterator to find "unambiguous word break" (e.g. punctuation symbol, line break, character type change)

    • Looking in the dictionary we find all possible word combinations within this text sequence

    • Using statistical techniques we figure out which word sequence is most likely


Chinese segmentation example l.jpg

此 = this (ci)

路 = road (lu)

不 = no/not (bu)

通 = through (tong)

此路不通 = cul-de-sac (cilubutong)

行 = walk (xing)

得 = get (de)

不得 = forbidden (bu-de)

在 = be (ci)

在此 = here (zaici)

小 = small (xiao)

便 = convenience (bian)

小便 = urinate (xiaobian)

此路不通行

此路不通行不得在此小便

Interp 1

此 /路 /不 /通 //行不得 /在此 /小便

No through way for pedestrians. Urination forbidden here.

Interp 2

此路不通

行不 / 得

在此 / 小便

Cul-de-sac. Walking Forbidden. Urinate here.

Chinese Segmentation Example

此路不通行不得在此小便


Chinese segmentation example36 l.jpg

此 = this (ci)

路 = road (lu)

不 = no/not (bu)

通 = through (tong)

此路不通 = cul-de-sac (cilubutong)

行 = walk (xing)

Interp 1

此 /路 /不 /通 //行

This road not through walk

No through way for pedestrians.

Chinese Segmentation Example

此路不通行


Chinese segmentation example37 l.jpg

此 = this (ci)

路 = road (lu)

不 = no/not (bu)

通 = through (tong)

此路不通 = cul-de-sac (cilubutong)

行 = walk (xing)

得 = get (de)

不得 = forbidden (bu-de)

在 = be (ci)

在此 = here (zaici)

小 = small (xiao)

便 = convenience (bian)

小便 = urinate (xiaobian)

Interp 1

此 /路 /不 /通 //行不得 /在此 /小便

This road not through walkForbidden here urinate

No through way for pedestrians. Urination forbidden here.

Chinese Segmentation Example

此路不通行/不得在此小便


Chinese segmentation example38 l.jpg

此 = this (ci)

路 = road (lu)

不 = no/not (bu)

通 = through (tong)

此路不通 = cul-de-sac (cilubutong)

行 = walk (xing)

得 = get (de)

不得 = forbidden (bu-de)

在 = be (ci)

在此 = here (zaici)

小 = small (xiao)

便 = convenience (bian)

小便 = urinate (xiaobian)

Interp 1

此 /路 /不 /通 //行不得 /在此 /小便

No through way for walkers. Urination forbidden here.

Interp 2

此路不通 = cul-de-sac

行不 / 得 = walk not

在此 / 小便 = here urinate

Cul-de-sac. Walking Forbidden. Urinate here.

Chinese Segmentation Example

此路不通行不得在此小便


Future challenges l.jpg

Future Challenges

  • Complete move of all languages to new architecture

  • Improving quality and breadth of linguistic data

    • more languages

    • more words (better user dictionary support)

    • richer relationships (e.g. part of, type of etc.)

  • Increasing Accuracy of analysis

    • Part of speech disambiguation

    • Ranking of parse results

  • Increasing performance speed

    • Latest version exceeds 2.5 Giga Char/hour on standard PC

    • Some customers say it is not fast enough, but this only allows 1.5 micro seconds/char

  • Available through ICU API


  • Login