
Developing a multilingual text analysis engine - does using Unicode solve all the issues?

Dr. Brian O’Donovan,

IBM Ireland,

Sept. 2002

Agenda
  • What is IBM LanguageWare?
  • Where/How is it Used?
  • Our Experiences of Conversion to Unicode
    • Benefits Accruing
    • Challenges to be Overcome
  • Word Identification problems
  • Future work
What is IBM LanguageWare?
  • A suite of tools to assist in linguistic analysis
  • Initially developed by IBM USA, but now developed by a globally distributed team
    • Used in internal and external products
    • Available to OEMs as a toolkit
Where are We?
  • Developers
    • North Carolina, USA
    • Dublin, Ireland
    • Helsinki, Finland
    • Taipei, Taiwan
    • Yamato, Japan
    • Seoul, Korea
  • Major Customers
    • Various Locations, USA
    • Böblingen, Germany
28 (or 30+) Languages Supported
  • Afrikaans, Arabic, Catalan, Simplified Chinese, Traditional Chinese, Czech, Danish, Dutch (x2)
  • English (x4 regions & x2 domains)
  • Finnish, French (x2), German (x3), Greek, Hebrew, Hungarian, Icelandic, Italian, Japanese, Korean
  • Norwegian (x2), Polish, Portuguese (x2), Russian, Spanish, Swedish, Tamil, Thai, Turkish
Lexical Analysis Hierarchy

[Diagram: a hierarchy of analysis levels, from Words (spell check, morphology, etc.) up through Grammar check to Summarization]

Where/How is it Used?
  • Word Processors
    • spell aid, grammar checker
  • Search Engines & Data Mining
    • Extracting Lemmatized form
    • Query Expansion (e.g. Synonyms)
    • Assisting in Taxonomy Generation
  • Translation Memory
    • Identification of linguistic units
  • NLP tools use it as a first stage of analysis
  • Machine Translation (limited)
  • Speech recognition (not currently)
Conversion to Unicode
  • Previous version used different code pages for each language
    • In fact a collection of different engines
  • New version has a single engine for all languages using Unicode (not all languages have been converted yet)
    • Provided many benefits
    • Raised some challenges
Benefits of Unicode
  • No more conversion tables
    • Standardized on UTF-16
    • Big-Endian Dictionaries
    • Converted to platform byte order on load
  • Possible to deal properly with multilingual text
  • Able to utilize ICU utilities
  • Can edit/view all test cases on one machine
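The big-endian dictionary format mentioned above can be sketched in a few lines; the NUL-separated on-disk layout and the function name here are illustrative assumptions, not LanguageWare's actual format. Decoding with an explicit byte order makes the platform's native endianness irrelevant:

```python
def load_utf16be_entries(raw: bytes) -> list[str]:
    """Decode a big-endian UTF-16 dictionary blob into strings.

    Entries are NUL-separated in this hypothetical on-disk layout;
    the explicit 'utf-16-be' codec handles the byte-order conversion
    regardless of the loading platform's endianness.
    """
    text = raw.decode("utf-16-be")
    return [w for w in text.split("\x00") if w]

# A tiny "dictionary file": two entries, big-endian, NUL-separated.
blob = "walk\x00talk\x00".encode("utf-16-be")
words = load_utf16be_entries(blob)
```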
Issues with Unicode
  • Lots of work
  • Converting dictionaries is not trivial
  • Changing code would be huge (we were rewriting anyway)
  • Large code page
  • Some ambiguous representations & orthographic rules
  • ICU Tokenisation not always to our liking
How We Dealt with the Large Code Page
  • Our architecture is based upon Finite State Transducers (FSTs)
  • In FSTs you frequently need to build tables with entries for every possible next character
    • for single byte code page => 256 entries
    • for double byte code page => 64K entries
    • If not careful, dictionaries might grow to 256 times their previous size
  • We deal with this by building a character transition table per dictionary
    • Each UNICODE character appearing in the dictionary is assigned a code by which it is referenced within the dictionary
      • This is not a private code page
      • The table is dynamically created when building the dictionary and will map differently each time
  • When looking up dictionary each character is first mapped through the table
    • If any characters are not in mapping table => we won't find a match in the dictionary
  • New characters are automatically added to the mapping table once words with the characters are added to the dictionary
  • Typically European languages require fewer than 100 characters
    • e.g. punctuation symbols do not appear in the dictionary
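A minimal sketch of the per-dictionary character transition table described above (class and method names are hypothetical; the real implementation sits inside LanguageWare's FST machinery):

```python
class CharMap:
    """Per-dictionary character transition table: each Unicode character
    seen while building the dictionary is assigned a small dense code,
    so FST transition tables stay near alphabet size (often under 100
    entries for European languages) instead of 64K entries."""

    def __init__(self):
        self.codes = {}  # character -> small integer code

    def add_word(self, word: str) -> list[int]:
        # New characters get the next free code automatically,
        # so the mapping differs every time a dictionary is built.
        return [self.codes.setdefault(ch, len(self.codes)) for ch in word]

    def lookup(self, word: str):
        # A character absent from the map cannot occur in any
        # dictionary entry, so the lookup fails immediately.
        out = []
        for ch in word:
            if ch not in self.codes:
                return None
            out.append(self.codes[ch])
        return out

cm = CharMap()
cm.add_word("walk")   # assigns codes to w, a, l, k
cm.add_word("talk")   # reuses a, l, k; only 't' is new
```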
FSA Node

[Diagram: an FSA node in an English dictionary, with outgoing transitions for characters a, b, c, …, Z]

[Diagram: the corresponding FSA node in a Japanese dictionary, which would need far more outgoing transitions without the mapping table]
Representation Issues
  • Even with UTF-16 as a universal standard, representation issues can arise
  • The letter ë is represented as 0x00EB
    • But could also be e = 0x0065 followed by combining diaeresis = 0x0308
  • Arabic letter Heh (ه) = 0xE5 in the Windows Arabic code page
    • In UNICODE 0x0647
    • But it has different forms e.g. ههه
    • Unicode defines 4 more code points for presentation forms
    • isolated=0xFEE9, final=0xFEEA, initial=0xFEEB and medial=0xFEEC
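Both of these representation issues can be demonstrated with Python's standard unicodedata module: canonical (NFC) normalization folds the decomposed ë back to its precomposed form, and compatibility (NFKC) normalization folds the Arabic presentation forms back to the base letter:

```python
import unicodedata

# ë: precomposed U+00EB vs 'e' plus combining diaeresis U+0308.
precomposed = "\u00eb"
decomposed = "e\u0308"
# The two sequences render identically but compare unequal as code points.
nfc = unicodedata.normalize("NFC", decomposed)

# Arabic Heh: the isolated presentation form U+FEE9 has a
# compatibility decomposition to the base letter U+0647.
heh_isolated = "\uFEE9"
heh_base = unicodedata.normalize("NFKC", heh_isolated)
```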
Arabic Shape Bitmaps (Heh)
  • Heh varies with position
    • Isolated
    • Initial
    • Medial
    • Final
Arabic Shape Bitmaps (Seen)
  • Seen varies much less
    • Isolated
    • Initial
    • Medial
    • Final
Orthographic Variation
  • A word may appear differently in a sample text than in the dictionary.
    • All languages have different rules about what is allowed/required.
  • Some languages have simple rules
    • e.g. English casing rules
      • a lowercase dictionary word can appear in text as titlecase or uppercase
      • a titlecase dictionary word could appear in text as uppercase
      • an uppercase or mixed case dictionary word must appear in text exactly the same.
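The English casing rules above could be sketched like this (a simplified illustration, not LanguageWare's actual matcher):

```python
def case_match(dict_word: str, text_word: str) -> bool:
    """English-style casing rule sketch:
    - a lowercase dictionary word also matches titlecase or uppercase text
    - a titlecase dictionary word also matches uppercase text
    - uppercase or mixed-case dictionary words must match exactly
    """
    if dict_word == text_word:
        return True
    if dict_word.islower():
        return text_word in (dict_word.title(), dict_word.upper())
    if dict_word.istitle():
        return text_word == dict_word.upper()
    return False
```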
Other languages have more complex rules
  • In France, letters lose their accent when capitalized
    • e.g. é is capitalized as E not É
  • but this rule does not apply in French Canada
    • So être becomes Etre in Paris but Être in Montreal
  • In German capitalization can change the letter count
    • 'ß' is sometimes capitalized as 'SS'
  • In English we optionally drop accents
    • e.g. the name Zoë can be written Zoe in English
    • But in German dropped umlauts are represented by a following e
      • e.g. Böblingen can be written Boeblingen
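A rough sketch of generating such accent variants; the function names are invented, and the real rules are language-specific and even region-specific (the French vs. Canadian French split above):

```python
import unicodedata

# German: dropped umlauts are written as a following e; ß may become ss.
GERMAN_FOLD = {"ä": "ae", "ö": "oe", "ü": "ue",
               "Ä": "Ae", "Ö": "Oe", "Ü": "Ue", "ß": "ss"}

def german_variant(word: str) -> str:
    # Böblingen -> Boeblingen
    return "".join(GERMAN_FOLD.get(ch, ch) for ch in word)

def english_variant(word: str) -> str:
    # Zoë -> Zoe: English simply drops the accent, i.e. strips the
    # combining marks left after canonical decomposition (NFD).
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))
```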
Vowel dropping in Semitic languages
  • Semitic languages, such as Arabic and Hebrew, have an orthographic rule that short vowels may be dropped.
  • Imagine that Arabic has the words hat, hit, hut
  • They will be written in dictionary as
    • Hat = هَت
    • Hut = هُت
    • Hit = هِت
  • In any piece of text the string هت could be any of these words
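The resulting lookup problem can be illustrated by indexing dictionary entries under their vowel-stripped skeletons, so an unvowelled token maps to every stored reading. This is a toy model: real Arabic uses more marks (sukun, shadda, tanwin) and far richer morphology:

```python
# The three short-vowel marks (harakat) used in the example above:
# fatha U+064E, damma U+064F, kasra U+0650.
HARAKAT = {"\u064E", "\u064F", "\u0650"}

def strip_harakat(word: str) -> str:
    """Remove short-vowel marks, leaving the consonantal skeleton."""
    return "".join(ch for ch in word if ch not in HARAKAT)

# Fully vowelled dictionary forms for the imaginary words hat/hut/hit.
dictionary = {"ه\u064Eت": "hat", "ه\u064Fت": "hut", "ه\u0650ت": "hit"}

# Index every entry under its skeleton: the bare text string هت
# is ambiguous between all three words.
skeletons = {}
for vowelled, gloss in dictionary.items():
    skeletons.setdefault(strip_harakat(vowelled), []).append(gloss)
```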
Arabic Examples in Large Font
  • Hat = هَت
  • Hut = هُت
  • Hit = هِت
  • Ht = هت
Model of How Orthographic & Representation Variation Is Handled

  • Root Word: walk
  • Inflection produces the Inflection Variants: walk, walked, walking
  • Orthography & Representation produce the Surface Forms:
    • walk, Walk, WALK
    • walked, Walked, WALKED
    • walking, Walking, WALKING
Model of How Orthographic & Representation Variation Is Handled

  • Root Word: passé
  • Inflection produces the Inflection Variant: passé
  • Orthography & Representation produce the Surface Forms:
    • passé, Passé, PASSÉ (precomposed é)
    • passe + combining acute, Passe + combining acute, PASSE + combining acute (decomposed representation)
    • Passe, PASSE (accent dropped on capitalization)
Recognize walk or talk

[Diagram: FSA with states 0-4; from state 0, transitions w or t lead to state 1, then a, l, k lead through states 2 and 3 to accepting state 4]
Recognize walk, walked, walking, talk, talked or talking

[Diagram: FSA with states 0-9; the walk/talk machine is extended at state 4 with an e-d branch (states 5-6) and an i-n-g branch (states 7-9)]
Recognize all variants of walk or talk

[Diagram: FSA with states 0-18; a parallel uppercase branch (W/T, A, L, K, with E-D and I-N-G continuations through states 10-18) sits alongside the lowercase branch so that the case variants are accepted]
Recognize all variants of wálk or tálk

[Diagram: the same FSA extended with parallel transitions for á and a (and Á/A in the uppercase branch, via state 19), since the accent may be retained or dropped]
Word Breaking
  • ICU break iterators can only provide us with a first pass of the word segmentation
    • Significant improvements in ICU 2.2
  • Sometimes the word segmentation is not obvious
    • multi word expressions
    • compound words
    • languages with no spaces
Multiword Expressions
  • Word break is not always obvious
    • French pommes de terre
      • as 3 words = apples of the ground
      • as 1 word = potatoes
    • French l'intelligence artificielle
      • really 2 words le and intelligence artificielle
  • Sometimes the word break is ambiguous
    • English red tape
      • as 2 words = tape with red color
      • as 1 word = synonym of bureaucracy
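A greedy longest-match grouping of known multiword expressions might look like the sketch below (toy lexicon; resolving the genuinely ambiguous cases such as red tape needs context that this sketch ignores, so it simply prefers the longest listed match):

```python
# Toy multiword lexicon, stored as tuples of component tokens.
MULTIWORD = {("pommes", "de", "terre"), ("intelligence", "artificielle"),
             ("red", "tape")}
MAX_LEN = max(len(m) for m in MULTIWORD)

def greedy_multiword(tokens):
    """Group tokens into units, always taking the longest multiword
    expression that starts at the current position."""
    out, i = [], 0
    while i < len(tokens):
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            cand = tuple(tokens[i:i + n])
            if n > 1 and cand in MULTIWORD:
                out.append(" ".join(cand))
                i += n
                break
        else:
            out.append(tokens[i])   # no multiword match: emit one token
            i += 1
    return out
```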
Decompounding
  • German and related languages allow speakers to generate their own compound words by combining component words
    • typically a series of nouns and/or adjectives
  • ICU considers these one word
    • but to analyse them properly we need to break them into their constituent words and look these up in the dictionary
  • Can be computationally expensive
  • We can also sometimes get ambiguous results
Ambiguous Decompounding - example 1
  • Wachstube
    • Option 1 = wax tube
      • Wachs = wax
      • Tube = tube
    • Option 2 = guard room
      • Wache = guard
      • Stube = room
    • Option 3 = awake room
      • Wach = awake
      • Stube = room
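The Wachstube ambiguity can be reproduced with a naive exhaustive decompounder (toy lowercase lexicon; real German decompounding must also handle linking elements, e.g. the e dropped from Wache, which is why only the wach reading appears here):

```python
def decompound(word, lexicon):
    """Return every way to split `word` into a sequence of lexicon words."""
    word = word.lower()
    results = []

    def split(rest, parts):
        if not rest:
            results.append(parts)  # consumed the whole word: valid split
            return
        for i in range(1, len(rest) + 1):
            if rest[:i] in lexicon:
                split(rest[i:], parts + [rest[:i]])

    split(word, [])
    return results

lexicon = {"wachs", "tube", "wach", "stube"}
splits = decompound("Wachstube", lexicon)
```

Even this tiny lexicon yields two readings, which is why decompounding can be computationally expensive and ambiguous.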
Ambiguous Decompounding - example 2
  • Hochschullehrer = University Lecturer
  • 3 component words
    • Hoch = High
    • Schule = school
    • Lehrer = teacher
  • Could index under
    • Hochschullehrer = University Lecturer
    • Hochschule = High School & Lehrer = Teacher
    • Hoch = High & Schullehrer = schoolteacher
  • However Schullehrer (=schoolteacher) is a lexicalised compound
    • The two words are so often used together that native speakers now consider them to be a single word
    • Hoch & Schullehrer is the only valid decomposition
    • In fact most German speakers would regard Hochschullehrer as a single lexical word because it is so commonly used
Ambiguous Decompounding - example 3
  • Neuroschaltungsverstärkung
    • Neuro = neuronal
    • Schaltungs = circuit
    • Verstärkung = amplification
  • Could be Neuroschaltungs|verstärkung
    • interpreted as an amplification of a "neuronal circuit"
  • Could also be Neuro|schaltungsverstärkung
    • interpreted as a kind of circuit amplification which is done by neuronal technology
Languages with no spaces
  • Some languages (e.g. Chinese, Japanese, Thai) do not put spaces between words
    • Therefore, it is difficult to figure out where the word breaks are
    • Sometimes there are multiple possible word segmentations.
    • Sometimes the choice of segmentation can change the meaning
  • For these languages we use ICU break iterator to find "unambiguous word break" (e.g. punctuation symbol, line break, character type change)
    • Looking in the dictionary we find all possible word combinations within this text sequence
    • Using statistical techniques we figure out which word sequence is most likely
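The statistical step can be sketched as a unigram Viterbi search over dictionary matches: among all segmentations built from dictionary words, pick the one whose words' log-probabilities sum highest. The probabilities below are invented purely for illustration:

```python
import math

def segment(text, probs):
    """Unigram Viterbi sketch: best[i] holds the best (score, words)
    for the prefix text[:i], or None if that prefix is unreachable."""
    best = [(0.0, [])]
    for i in range(1, len(text) + 1):
        candidates = []
        for j in range(i):
            word = text[j:i]
            if word in probs and best[j] is not None:
                score = best[j][0] + math.log(probs[word])
                candidates.append((score, best[j][1] + [word]))
        best.append(max(candidates) if candidates else None)
    return best[-1][1] if best[-1] else None

# Toy model: the two-character words are far more probable than
# reading each character separately.
probs = {"在": 0.05, "此": 0.05, "在此": 0.2,
         "小": 0.02, "便": 0.02, "小便": 0.2}
```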
Chinese Segmentation Example

此路不通行不得在此小便

  • Glossary
    • 此 = this (ci), 路 = road (lu), 不 = no/not (bu), 通 = through (tong), 此路不通 = cul-de-sac (cilubutong)
    • 行 = walk (xing), 得 = get (de), 不得 = forbidden (bu-de)
    • 在 = be (zai), 在此 = here (zaici)
    • 小 = small (xiao), 便 = convenience (bian), 小便 = urinate (xiaobian)
  • Interpretation 1: 此 / 路 / 不 / 通 / 行不得 / 在此 / 小便
    • No through way for pedestrians. Urination forbidden here.
  • Interpretation 2: 此路不通 / 行不 / 得 / 在此 / 小便
    • Cul-de-sac. Walking forbidden. Urinate here.

Chinese Segmentation Example

此路不通行

  • Glossary
    • 此 = this (ci), 路 = road (lu), 不 = no/not (bu), 通 = through (tong), 此路不通 = cul-de-sac (cilubutong)
    • 行 = walk (xing)
  • Interpretation: 此 / 路 / 不 / 通 / 行
    • Literally: this road not through walk
    • No through way for pedestrians.

Chinese Segmentation Example

此路不通行 / 不得在此小便

  • Glossary
    • 此 = this (ci), 路 = road (lu), 不 = no/not (bu), 通 = through (tong), 此路不通 = cul-de-sac (cilubutong)
    • 行 = walk (xing), 得 = get (de), 不得 = forbidden (bu-de)
    • 在 = be (zai), 在此 = here (zaici)
    • 小 = small (xiao), 便 = convenience (bian), 小便 = urinate (xiaobian)
  • Interpretation: 此 / 路 / 不 / 通 / 行不得 / 在此 / 小便
    • Literally: this road not through walk / forbidden here urinate
    • No through way for pedestrians. Urination forbidden here.

Chinese Segmentation Example

此路不通行不得在此小便

  • Glossary
    • 此 = this (ci), 路 = road (lu), 不 = no/not (bu), 通 = through (tong), 此路不通 = cul-de-sac (cilubutong)
    • 行 = walk (xing), 得 = get (de), 不得 = forbidden (bu-de)
    • 在 = be (zai), 在此 = here (zaici)
    • 小 = small (xiao), 便 = convenience (bian), 小便 = urinate (xiaobian)
  • Interpretation 1: 此 / 路 / 不 / 通 / 行不得 / 在此 / 小便
    • No through way for walkers. Urination forbidden here.
  • Interpretation 2: 此路不通 = cul-de-sac / 行不 / 得 = walk not / 在此 / 小便 = here urinate
    • Cul-de-sac. Walking forbidden. Urinate here.

Future Challenges
  • Complete move of all languages to new architecture
  • Improving quality and breadth of linguistic data
    • more languages
    • more words (better user dictionary support)
    • richer relationships (e.g. part of, type of etc.)
  • Increasing Accuracy of analysis
    • Part of speech disambiguation
    • Ranking of parse results
  • Increasing performance speed
    • Latest version exceeds 2.5 billion characters/hour on a standard PC
    • Some customers say it is not fast enough, but even this speed allows only about 1.5 microseconds/char
  • Available through ICU API
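The per-character budget behind the throughput figure above works out as follows:

```python
# 2.5 billion characters per hour leaves roughly 1.4-1.5 microseconds
# of processing time per character.
chars_per_hour = 2.5e9
seconds_per_char = 3600 / chars_per_hour          # seconds available per character
microseconds_per_char = seconds_per_char * 1e6    # ~1.44 microseconds
```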