
COOC

Practicum / Software Project, SS 2000

Final Report

Tanja von den Berg, Tilman Jäger, Kerstin Klöckner,

Stephan Lesch, Holger Neis, Norbert Pfleger,

Diana Raileanu, Hubert Schlarb

Supervisors: Jan Alexandersson, Paul Buitelaar


Contents

  • Intro

  • Theoretical foundations

  • At the outset

  • Project aspects

    • Preprocessing

    • Training

    • Application

    • Evaluation

  • Outlook

Intro

  • Word Sense Disambiguation (WSD) as preparation for semantic analysis of text documents

  • Application areas: translation systems, info retrieval systems, document classification, etc.

  • Machine learning approaches:

    - supervised (semantically tagged corpora)

    - unsupervised (untagged corpora)

  • COOC: the first unsupervised, corpus-based approach for German

Theoretical Foundations

WSD (Word Sense Disambiguation) in context:

E.g.: bank - place to sit vs. financial institution

I'm going to the bank to get some money.

COOC: cooccurrence of words in a given context

GermaNet: WordNet for German

WordNet:

  - lexical and semantic database

  - semantic net, ontology

  - lexical and conceptual relations (antonymy, hyponymy)

Theoretical Foundations (II)

Method:

  - knowledge sources (WordNet, thesaurus)

  - the possibility of finding relations between words and meanings

supervised:

  - requires already disambiguated data

  - requires large amounts of data

unsupervised:

  - requires even more data

  - data need not be disambiguated

Theoretical Foundations (III)

  • Examples of unsupervised methods:

    • Lesk (1986): comparison among dictionary entries

    • Yarowsky (1992):

      - Roget's Thesaurus, Grolier's Encyclopedia

      - collections of contexts for a thesaurus category

      - identification of characteristic words

    • Resnik (1997):

      - Penn Treebank corpus, POS-tagged, syntactically annotated

      - selectional preference (predicate arguments)

At the outset

  • Approach of Seligman (1994):

    • Japanese dialogues (direction finding, hotel reservations in spontaneous speech)

    • thesaurus with 4 fixed abstraction levels

    • explicit semantic smoothing

  • COOC project:

    • Tiger corpus (Frankfurter Rundschau)

    • GermaNet with varying number of abstraction levels (up to 26)

    • implicit semantic smoothing

Flow diagram

Preprocessing

  • Conversion of the training corpus (plain text) into the COOC format

  • Statistics on GermaNet categories

Resources

  • Tiger corpus (1,051,446 tokens) - German newspaper text from the Frankfurter Rundschau

  • TnT tagger (Brants 2000) - statistical part-of-speech tagger

  • MMorph (Petitpierre & Russell, 1995) - morphological analysis tool

  • GermaNet - lexico-semantic network for German (about 25,000 nouns, 6,000 verbs, 3,500 adjectives)

COOC-Format

Philip Glass wurde auf seinen weltweiten Tourneen mit Kassetten und Tonbändern überschüttet.
(Philip Glass was showered with cassettes and audio tapes during his worldwide tours.)

...

166 seinen NA PPOSAT

167 weltweiten weltweit ADJA [ 113815 113669 111763 111559 ]

...

172 Tonbändern Tonband NN [ 75749 ... 1749365 ] ... [ 75749 ... 144863 ]

173 überschüttet überschütten VVPP [ 353400 ... 226602 ] [ 353400 ... 2266023 ]

...
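A minimal sketch of reading one COOC-format line, assuming the fields are: a position ID, the token, its lemma, a POS tag, and zero or more bracketed lists of GermaNet offsets (presumably one list per reading). The names `CoocToken` and `parse_cooc_line` are hypothetical and not taken from the project.

```python
import re
from dataclasses import dataclass, field


@dataclass
class CoocToken:
    position: int                 # running ID of the token in the corpus
    token: str                    # surface form
    lemma: str                    # lemma ("NA" where none is available)
    pos: str                      # POS tag, e.g. NN, ADJA, VVPP
    readings: list = field(default_factory=list)  # one list of offsets per bracket


_BRACKET = re.compile(r"\[([^\]]*)\]")


def parse_cooc_line(line: str) -> CoocToken:
    """Parse a fully written-out line (no '...' elisions) into a CoocToken."""
    head = line.split("[", 1)[0].split()
    position, token, lemma, pos = head[:4]
    readings = [[int(n) for n in group.split()] for group in _BRACKET.findall(line)]
    return CoocToken(int(position), token, lemma, pos, readings)


example = parse_cooc_line("167 weltweiten weltweit ADJA [ 113815 113669 111763 111559 ]")
print(example.lemma, example.readings)
```

A word without bracketed lists, such as `166 seinen NA PPOSAT`, simply yields an empty `readings` list.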

GermaNet Hierarchy

Statistics on GermaNet Categories

  • Omission of higher-frequency categories

  • Reduction of computational complexity

  • Format: Frequency ID (offset) Synset

  • Example:

    70725 1749365 Objekt_0

    43450 369009 Situation_0

    ...........

    2 843903 Kofferraum_0

    1 695036 Intellekt_0_Genius_0
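As a rough illustration, the statistics file could be read and filtered along these lines. The function names and the frequency thresholds are assumptions made for the example; only the `Frequency ID Synset` layout and the four sample rows come from the slide.

```python
def parse_category_stats(lines):
    """Map each GermaNet offset to (frequency, synset name)."""
    stats = {}
    for line in lines:
        freq, offset, synset = line.split(maxsplit=2)
        stats[int(offset)] = (int(freq), synset.strip())
    return stats


def categories_in_interval(stats, min_freq, max_freq):
    """Keep only categories whose frequency lies in [min_freq, max_freq]."""
    return {off for off, (freq, _) in stats.items() if min_freq <= freq <= max_freq}


sample = [
    "70725 1749365 Objekt_0",
    "43450 369009 Situation_0",
    "2 843903 Kofferraum_0",
    "1 695036 Intellekt_0_Genius_0",
]
stats = parse_category_stats(sample)
# Hypothetical interval: drop the very frequent Objekt_0 and the singleton.
kept = categories_in_interval(stats, min_freq=2, max_freq=50000)
print(sorted(kept))
```

Dropping very frequent, very general categories such as `Objekt_0` is what reduces the number of category pairs that have to be counted later.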

Segmentation...

...at sentence boundaries:

Landesbank schlägt Verträge zwischen Stadt und privaten Investoren vor

Überall wird gebuddelt und gemauert.

Hamburg erlebt den größten Geschäftsbau-Boom.

Jährlich hinzukommen rund 300 000 Quadratmeter an Büroräumen.

...or e.g. after every 3 significant words:

Landesbank schlägt Verträge zwischen Stadt und privaten

Investoren vor Überall wird gebuddelt und gemauert. Hamburg erlebt

den größten Geschäftsbau-Boom. Jährlich hinzukommen rund

300 000 Quadratmeter an Büroräumen.
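Segmentation after every n significant words might look like the sketch below. It assumes that a segment is closed immediately after its n-th significant word and that significance is decided by a caller-supplied predicate; the word set used in the example is invented for illustration and is not a GermaNet lookup.

```python
def segment_by_significant(tokens, is_significant, n=3):
    """Split a token stream into segments holding n significant words each."""
    segments, current, seen = [], [], 0
    for tok in tokens:
        current.append(tok)
        if is_significant(tok):
            seen += 1
            if seen == n:          # close the segment after the n-th hit
                segments.append(current)
                current, seen = [], 0
    if current:                    # trailing tokens form a shorter last segment
        segments.append(current)
    return segments


words = "Überall wird gebuddelt und gemauert Hamburg erlebt den größten Geschäftsbau-Boom".split()
# Hypothetical set of words that would carry GermaNet categories.
significant = {"gebuddelt", "gemauert", "Hamburg", "erlebt", "größten", "Geschäftsbau-Boom"}
for seg in segment_by_significant(words, significant.__contains__, n=3):
    print(" ".join(seg))
```

Unlike sentence segmentation, this keeps the amount of usable context per segment roughly constant, at the cost of cutting across sentence boundaries.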

Windows

Text window: n segments with current segment in the middle

wider scope than n-grams

[Figure: segments S(i) to S(i+4) with windows W(t), W(t+1), W(t+2); n = 3]
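The window idea can be sketched as a generator that pairs each segment with its neighbours. Truncating the window at the start and end of the text is an assumption of this sketch; the slides do not say how boundaries are handled.

```python
def windows(segments, n=3):
    """Yield (current_segment, context_segments) for a window of n segments.

    The current segment sits in the middle; the context is made up of the
    other segments of the window. Windows are cut short at the text edges.
    """
    half = n // 2
    for i, current in enumerate(segments):
        lo = max(0, i - half)
        hi = min(len(segments), i + half + 1)
        context = [segments[j] for j in range(lo, hi) if j != i]
        yield current, context


segments = [["S0"], ["S1"], ["S2"], ["S3"], ["S4"]]
for current, context in windows(segments, n=3):
    print(current, context)
```

Because the unit is a segment rather than a single word, a window of three segments covers far more context than a trigram would.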

Training: unsupervised

Compare "Peter goes by train" with "Diana goes by bike":

"train" and "bike" should both be VEHICLES, but each word carries different ambiguities

Statistics

  • For a pair of categories:

    • conditional probability

    • mutual information

  • Effect: correct category combinations emerge statistically
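The sketch below shows one way such pair statistics could be collected without sense labels: every category of the current word is counted against every category of a context word, and the two measures are then read off the counts. It uses the textbook definitions, P(c0 | ci) = count(c0, ci) / count(ci) and pointwise mutual information log2(P(c0, ci) / (P(c0) P(ci))); whether COOC estimates them in exactly this way is not stated on the slide. The class name and the category labels are hypothetical.

```python
import math
from collections import Counter
from itertools import product


class PairStats:
    """Co-occurrence counts for (current category, context category) pairs."""

    def __init__(self):
        self.pair = Counter()   # count(c0, ci)
        self.cur = Counter()    # how often c0 was seen as current category
        self.ctx = Counter()    # how often ci was seen as context category
        self.total = 0          # number of counted pairs

    def observe(self, cur_cats, ctx_cats):
        # Unsupervised: all categories of both words are counted.
        for c0, ci in product(set(cur_cats), set(ctx_cats)):
            self.pair[(c0, ci)] += 1
            self.cur[c0] += 1
            self.ctx[ci] += 1
            self.total += 1

    def cond_prob(self, c0, ci):
        return self.pair[(c0, ci)] / self.ctx[ci] if self.ctx[ci] else 0.0

    def mutual_information(self, c0, ci):
        joint = self.pair[(c0, ci)]
        if joint == 0:
            return float("-inf")
        return math.log2(joint * self.total / (self.cur[c0] * self.ctx[ci]))


pair_stats = PairStats()
# "goes by train" / "goes by bike": invented category labels.
pair_stats.observe({"VEHICLE", "SEQUENCE"}, {"MOTION"})
pair_stats.observe({"VEHICLE", "EXERCISE"}, {"MOTION"})
print(pair_stats.cond_prob("VEHICLE", "MOTION"), pair_stats.cond_prob("SEQUENCE", "MOTION"))
```

In this toy run the shared category VEHICLE collects twice the evidence of either spurious category, which is the effect the slide describes: the right combinations surface through repetition across contexts.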

Training: Parameters

  • Segmentation methods

  • Window width

  • Limiting calculation time and space requirements:

    • exclusion of certain POS combinations

    • only categories in certain frequency intervals

    • only pairs with frequency > minimum
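These parameters could be bundled roughly as follows. All field names and default values are assumptions made for this sketch; only the kinds of restriction are taken from the list above. The convention that segment size 0 means sentence segmentation follows the results table of the evaluation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingParams:
    segment_size: int = 0                       # 0 = segment at sentence boundaries
    window_width: int = 3                       # number of segments per window
    excluded_pos_pairs: frozenset = frozenset() # POS combinations to skip
    min_category_freq: int = 1                  # lower end of the frequency interval
    max_category_freq: int = 10**9              # upper end of the frequency interval
    min_pair_freq: int = 0                      # pairs must occur more often than this


def keep_pair(params, pos_pair, freq_c0, freq_ci, pair_freq):
    """Decide whether a category pair is kept under the given parameters."""
    if pos_pair in params.excluded_pos_pairs:
        return False
    for freq in (freq_c0, freq_ci):
        if not params.min_category_freq <= freq <= params.max_category_freq:
            return False
    return pair_freq > params.min_pair_freq


params = TrainingParams(
    excluded_pos_pairs=frozenset({("ADJA", "ADJA")}),  # hypothetical exclusion
    min_category_freq=2,
    max_category_freq=50000,
    min_pair_freq=1,
)
print(keep_pair(params, ("NN", "VVFIN"), 100, 200, 5))
```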

Application

  • Actual disambiguation process

    • input: sentences/text in COOC format, containing ambiguous words

    • output: disambiguated sentences/words

    • requires training results

To proceed

  • Connection to the training database

    • selection of parameters (window and segment size) of the training database

  • Text processing

    • construction of the initial windows

    • disambiguation of the current segment

    • results are written to the output data

To proceed (II)

[Figure: two rows of segments S(i) to S(i+4)]

  • Window handling:

    • the middle (current) segment is then disambiguated word by word

    • at the last segment, the window is moved one segment to the right

To proceed (III)

  • Handling the words in the middle (current) segment

    • distinguish significant vs. insignificant words (with and without GermaNet categories)

    • for significant words, the most probable meaning is computed and output

    • insignificant words are written unchanged to the output data
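The per-word step can be sketched as below. The pair score is passed in as a function because the exact way the trained statistics are combined is not reproduced in this transcript; summing the scores over all context categories is an assumption of the sketch, and the category labels and score table are invented.

```python
def disambiguate_word(readings, context_categories, score):
    """Return the index of the best-scoring reading, or None.

    readings: list of category lists, one per possible meaning.
    context_categories: categories of the words in the surrounding window.
    score: function (c0, ci) -> float, e.g. built from trained pair statistics.
    """
    if not readings:
        return None  # insignificant word: no GermaNet categories, passed through

    def reading_score(reading):
        return sum(score(c0, ci) for c0 in reading for ci in context_categories)

    # On ties, max() returns the first reading with the highest score.
    return max(range(len(readings)), key=lambda i: reading_score(readings[i]))


# Toy score table for "bank" in the context of "money" (all labels hypothetical).
toy_scores = {("INSTITUTION", "POSSESSION"): 0.6, ("FURNITURE", "POSSESSION"): 0.1}
score = lambda c0, ci: toy_scores.get((c0, ci), 0.0)
best = disambiguate_word([["FURNITURE"], ["INSTITUTION"]], ["POSSESSION"], score)
print(best)
```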

Probability of the Appearance of a Category in Context

  • where:

    • MI: mutual information

    • PR: conditional probability

    • c0: current category

    • ci: context category

Calculation of the most probable meaning

  • where:

    • PC: probability of the appearance of a category given a context

Example: Disambiguation

Folklore, Rock, Klassik und Jazz zu vermischen reicht ihnen nicht, sie nutzen die Elektronik und sind sogar dazu übergegangen, Instrumente selbst zu bauen.

(Mixing folk, rock, classical, and jazz is not enough for them; they use electronics and have even started to build instruments themselves.)

3002 Rock Rock NN 2 Rock_0

3004 Klassik Klassik NN 1 Klassik_0

3008 vermischen vermischen VVINF 1 vermengen_0_vermischen_0

3009 reicht reichen VVFIN 7 reichen_0

3014 nutzen nutzen VVFIN 2 nutzen_2_nützen_2

3016 Elektronik Elektronik NN 1 Elektronik_0

3023 Instrumente Instrument NN 2 Musikinstrument_0_Instrument_2

3026 bauen bauen VVINF 4 bauen_3

3002 Rock Rock NN [ 39981 ... 3228 ] [ 39981 ... 3228 ]

3004 Klassik Klassik NN [ 221503 ... 221266 ]

3008 vermischen vermischen VVINF [ 643704 643048 ]

3009 reicht reichen VVFIN [ 21538 ] [ 339847 307402 ] [ 581324 ... 568361 ] [ 581324 ... 862674 ] [ 581324 ... 912753 ] [ 586102 585849 ] [ 588150 ... 586261 ]

3016 Elektronik Elektronik NN [ 405356 ... 383322 ]

3023 Instrumente Instrument NN [ 5357 3228 ] [ 142311 ... 3228 ]

3026 bauen bauen VVINF [ 650176 647379 ] [ 742021 ... 734399 ] [ 743571 ... 734399 ] [ 743710 735354 734399 ]

Evaluation: Comparison

Test corpus

1017 Komponisten Komponist NN 1 Komponist_0_Komponistin_0

2010 Möglichkeiten Möglichkeit NN 2 Möglichkeit_2_Eventualität_0

14011 verfügbar verfügbar ADJD 0

14014 machen machen VVINF 6 betätigen_0_treiben_0_machen_0

24006 wirkt wirken VVFIN 6 wirken_2

Evaluation corpus (Negra/Lexsem corpus)

1017 Komponisten Komponist NN Komponist_0_Komponistin_0

2010 Möglichkeiten Möglichkeit NN Möglichkeit_2_Chance_0_Gelegenheit_0

14011 verfügbar verfügbar ADJD unknown

14014 machen machen VVINF unspec

24006 wirkt wirken VVFIN wirken_2


Meanings in the test corpus

2346 words annotated, with an average of 3.1 meanings per word;

1366 of these are ambiguous, with an average of 4.6 meanings each

Results (3 Segments/Window)

Segment size               0 (sentence)       2       5       7      10      15

count                              1882
trivial                             773
hit count                           703     586     688     718     724     720
incorrect                           347     523     421     349     357     366
not disambiguated                    59     210      52      42      28      23
Precision (all) [32.3%]           80.97   81.28   79.84   81.03   80.74   80.31
Precision (amb.) [21.7%]          66.95   52.84   62.04   67.30   66.98   66.30
Recall                            96.51   88.51   96.88   97.41   98.15   98.41
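The two precision rows appear to follow from the counts above: precision over all words as (trivial + hits) / (count - not disambiguated), and precision over ambiguous words as hits / (hits + incorrect). These formulas are inferred from the numbers rather than stated on the slide, so the sketch below is a consistency check and not the project's evaluation script. The recall row is not recomputed, since its denominator is not given.

```python
def precision_all(trivial, hits, count, not_disambiguated):
    """Share of correct words among all words that received a meaning (percent)."""
    return 100.0 * (trivial + hits) / (count - not_disambiguated)


def precision_ambiguous(hits, incorrect):
    """Share of correct decisions among ambiguous words that were decided (percent)."""
    return 100.0 * hits / (hits + incorrect)


# Counts for sentence segmentation (segment size 0) from the table.
COUNT, TRIVIAL = 1882, 773
p_all = precision_all(TRIVIAL, hits=703, count=COUNT, not_disambiguated=59)
p_amb = precision_ambiguous(hits=703, incorrect=347)
print(f"precision (all): {p_all:.2f}  precision (amb.): {p_amb:.2f}")
```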

Summary

  • COOC:

  • is the first unsupervised, corpus-based method of disambiguating semantically ambiguous words for German

  • goes beyond n-gram statistics

  • uses plain text, GermaNet, MMorph and a POS tagger

  • is a tool for unsupervised learning, semantic tagging, and evaluation

  • first evaluation gives 67.3% precision on ambiguous words (81.0% over all words) and 97.4% recall

Outlook

  • Use of GermaNet 2 (but still need a hand-labeled evaluation corpus)

  • Repeat experiment with WordNet and Penn Treebank Corpus

  • Several experiments to determine optimal parameters

  • Two theses:

    • lexical disambiguation

    • general predictions
