Word sense disambiguation and annotation transfer in parallel texts


Dan TUFIŞ (1,2), Radu ION (1), Verginica MITITELU (1)

(1) RACAI – Research Institute for Artificial Intelligence, Bucharest

(2) University „A.I. Cuza”, Iaşi

{tufis, radu, vergi}@racai.ro

Outline of the talk
  • WSD: monolingual / multilingual approaches
  • WSD based on parallel texts and aligned wordnets
  • Brief account on Romanian wordnet (part of BalkaNet)
  • WSDtool+ main processing steps
      • Word alignment (COWAL)
      • Sense labelling (four sense inventories: synset IDs, SUMO/MILO concepts, IRST Domains, Explanatory Dictionary of Romanian)
      • Clustering
      • Annotation generation
  • Annotation import in aligned and WSDed corpora
  • Conclusions and further work
WSD based on parallel texts and aligned wordnets

Our approach:

A mixture of unsupervised and KB approaches with multiple processing steps:

  • word alignment in parallel corpora (COWAL) and translation equivalents extraction (translation model);
  • sense labeling using the aligned BalkaNet wordnets, with four sense inventories:
    • Princeton WordNet (PWN) sense inventory;
    • SUMO/MILO ontology;
    • IRST Domains classes;
    • EXPD labels (Explanatory Dictionary of Romanian).

This step is based on the aligned wordnets and covers ~75% of the word occurrences in a corpus.

  • sense clustering based on translation equivalents extracted from parallel corpora (takes care of the cases left uncovered by the previous step);
  • generation of the WSD annotation in the parallel corpus.
Brief account on Romanian wordnet

Conceptually dense: for any RO-synset aligned to an ILI code (EN-synset), all the hierarchical antecedents of the EN-synset are also linked to synsets present in the RO-wordnet.

Virtually, more synsets and more literals (monosemous literals; “coverage heuristics”)
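The conceptual-density property can be checked mechanically. The sketch below is only an illustration: the function name `missing_ancestors`, the set of ILI codes covered by the Romanian wordnet, and the hypernym table are all hypothetical inputs, not part of the tools described in the talk.

```python
def missing_ancestors(ro_ili_codes, en_hypernyms):
    """Return the ILI codes that break conceptual density.

    ro_ili_codes: set of ILI codes that have a Romanian synset (assumed input).
    en_hypernyms: dict mapping an ILI code to its direct hypernym ILI codes
                  in the English wordnet (assumed input).
    The wordnet is conceptually dense when the returned set is empty.
    """
    missing = set()
    for ili in ro_ili_codes:
        stack = list(en_hypernyms.get(ili, ()))
        seen = set()
        while stack:
            ancestor = stack.pop()
            if ancestor in seen:
                continue
            seen.add(ancestor)
            if ancestor not in ro_ili_codes:
                missing.add(ancestor)
            # keep climbing towards the root of the hierarchy
            stack.extend(en_hypernyms.get(ancestor, ()))
    return missing
```

An empty result means every ancestor of every aligned synset is itself aligned.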

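Brief account on Romanian wordnet: Structure of a synset

[Figure: structure of a RoWN synset entry. Recoverable labels: “PWN id”; the Romanian gloss “Parte a unui organism animal sau vegetal care îndeplineşte anumite funcţii” (“Part of an animal or plant organism that performs certain functions”); “Dan Tufis”.]
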
WSD main steps: 1. Word Alignment and Filtering of the Translation Equivalents
  • The word alignment system (COWAL) produces N-M cross-POS alignment links with an average F-measure of more than 80% (F=82.52%, AER=17.48%).
  • Only the translation links that preserve the major POS (V, N, A, R) are retained. In this case the F-measure exceeds 92% (F=92.04%, AER=7.96%).
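A minimal sketch of the filtering step and of the two reported measures follows. The `Link` record and both function names are hypothetical, and COWAL's real data structures are not shown on the slides. The score assumes a single gold set of links (no sure/possible distinction), in which case AER = 1 - F, which agrees with the figures above (82.52 + 17.48 = 100, 92.04 + 7.96 = 100).

```python
from typing import NamedTuple

MAJOR_POS = {"V", "N", "A", "R"}  # verb, noun, adjective, adverb


class Link(NamedTuple):
    src: str      # English lemma
    src_pos: str  # POS of the English lemma
    tgt: str      # Romanian lemma
    tgt_pos: str  # POS of the Romanian lemma


def keep_pos_preserving(links):
    """Retain only links whose two ends share the same major POS."""
    return [l for l in links if l.src_pos == l.tgt_pos and l.src_pos in MAJOR_POS]


def f_measure_and_aer(predicted, gold):
    """F-measure of predicted links against one gold set, and AER = 1 - F."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    if correct == 0:
        return 0.0, 1.0
    precision = correct / len(predicted)
    recall = correct / len(gold)
    f = 2 * precision * recall / (precision + recall)
    return f, 1.0 - f
```

Dropping cross-POS links trades recall of translation pairs for precision, which is the behaviour the two bullet points describe.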
WSD main steps: 2. Sense Labeling
  • Aligned wordnets (lexical ontologies)
  • Conceptual knowledge structuring (upper and mid-level ontologies): SUMO/MILO
  • Domain taxonomies (UDC, the librarian’s taxonomy): IRST-DOMAINS
  • Explanatory Dictionary of Romanian (3 sense levels)
  • Coverage heuristic: if one of the words in a translation pair is not a member of any synset in its language’s wordnet, but the other word is present in the aligned wordnet and is moreover monosemous, the first word gets the sense of the monosemous word. If one of the languages is English, any other language can benefit from this heuristic (approx. 80% of the PWN literals are monosemous):
    • Ex: hilarious <-> hilar => ENG20-01221243-a
    • burp <-> râgâi => ENG20-00003374-v
    • prospicience <-> clarviziune => ENG20-05469664-n
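The coverage heuristic can be sketched in a few lines. The function name and the representation of a word's senses as a set of ILI-style identifiers (empty when the word is absent from its wordnet) are assumptions made for illustration only.

```python
def coverage_heuristic(senses_a, senses_b):
    """Return the sense shared by a translation pair under the coverage heuristic.

    senses_a, senses_b: sets of sense ids for the two words of the pair;
    an empty set means the word is in no synset of its wordnet.
    Returns the inherited sense id, or None when the heuristic does not apply.
    """
    if not senses_a and len(senses_b) == 1:
        return next(iter(senses_b))  # word A inherits the only sense of B
    if not senses_b and len(senses_a) == 1:
        return next(iter(senses_a))  # word B inherits the only sense of A
    return None


# Assuming "hilar" is the uncovered word and "hilarious" is monosemous in PWN:
sense = coverage_heuristic({"ENG20-01221243-a"}, set())
```

The heuristic never fires when both words are covered or when the covered word is polysemous, so it cannot override the regular labeling step.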

Example (I): lamp <-> lampă

PWN2.0 (lamp) = {03500372-n, 03500773-n}

RoWN (lampă) = {03500773-n, 03500872-n}

ILI = 03500773-n (the only ID common to both sets) =>

SUMO (03500773-n) = +Device

DOMAINS (03500773-n) = furniture

Example (II): lamp <-> felinar

PWN2.0 (lamp) = {03500372-n, 03500773-n}

RoWN (felinar) = {003505057-n}

δ (03500372-n, 003505057-n) = 0.5

δ (03500773-n, 003505057-n) = 0.125

ILI = 03500372-n (the PWN sense with the higher δ score) =>

SUMO (03500372-n) = IlluminationDevice

DOMAINS (03500372-n) = factotum
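The two examples suggest a simple decision procedure, sketched here under stated assumptions. If the sense sets of the two translation equivalents share exactly one ID, that ID is the label. Otherwise, the pair of IDs with the best δ score is chosen. The function name `label_pair` is hypothetical, and treating δ as a similarity (larger means closer) is an assumption, since the slides do not define it.

```python
def label_pair(en_senses, ro_senses, similarity=None):
    """Pick sense ids for a translation pair.

    en_senses, ro_senses: sets of synset ids of the two words.
    similarity: optional dict {(en_id, ro_id): score}; assumed to be a
                similarity, so the pair with the largest score wins.
    Returns (en_id, ro_id), or None when the pair stays ambiguous.
    """
    common = sorted(set(en_senses) & set(ro_senses))
    if len(common) == 1:
        return common[0], common[0]
    if not common and similarity:
        scored = [(e, r) for e in en_senses for r in ro_senses if (e, r) in similarity]
        if scored:
            return max(scored, key=lambda pair: similarity[pair])
    return None


# Example (I): the intersection contains a single id.
example_1 = label_pair({"03500372-n", "03500773-n"}, {"03500773-n", "03500872-n"})
```

Pairs that remain ambiguous (several common IDs, or no usable score) return `None` and are left to the clustering step.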


3. Word Sense Clustering

[Figure: dendrogram produced by the clustering step; each leaf is tagged with a number in parentheses, here (1), (2), (3), (4) and (6).]

An agglomerative, hierarchical algorithm using a vector space model, Euclidean distance and cardinality-weighted computation of the centroid (the “center of weight” of a new class).

The undecidable problem of how many classes to build gets hints from the work already done in steps 1 and 2.
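The clustering description can be made concrete with a small sketch. It is not the WSDtool implementation. The function names are hypothetical, each occurrence is assumed to be a numeric vector (for instance, one component per observed translation equivalent), and the number of classes is assumed to be supplied from outside, standing in for the hints from steps 1 and 2.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def agglomerate(vectors, n_classes):
    """Merge the two closest clusters until n_classes remain.

    Returns a list of clusters, each a sorted list of input indices.
    """
    # every cluster is (centroid, member indices)
    clusters = [(list(v), [i]) for i, v in enumerate(vectors)]
    while len(clusters) > max(n_classes, 1):
        i, j = min(
            ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
            key=lambda ab: euclidean(clusters[ab[0]][0], clusters[ab[1]][0]),
        )
        (ci, mi), (cj, mj) = clusters[i], clusters[j]
        ni, nj = len(mi), len(mj)
        # cardinality-weighted centroid: the "center of weight" of the new class
        centroid = [(ni * x + nj * y) / (ni + nj) for x, y in zip(ci, cj)]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append((centroid, mi + mj))
    return [sorted(members) for _, members in clusters]
```

Weighting by cardinality keeps the centroid equal to the mean of all member vectors, so a large class is not pulled towards a single newly merged occurrence.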

4. WSD annotation in the parallel corpus (“1984”)

Evaluation (I)
  • “Lexical sample” evaluation with 1-tag annotation (with k-tag annotation, the performance would essentially equal that of the filtered word alignment, i.e. 92.04%).
  • 216 ambiguous English words (at least two senses per POS), with 2081 occurrences in “1984”, were semantically disambiguated by three experts in terms of the PWN2.0 sense inventory. The experts negotiated all cases of disagreement, resulting in the Gold Standard annotation (GS).
    • This is the “lexical sample/lexical choice” evaluation type of SENSEVAL (much harder than “all words”, which also includes monosemous words and homographs).
  • For each PWN2.0 sense number, the GS was deterministically enriched with the SUMO category and the DOMAINS label.
  • Thus, we had three sense inventories in the GS and could evaluate the system’s WSD accuracy in terms of each of them.
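The three-inventory evaluation can be illustrated as follows. This is a sketch with hypothetical names: `sumo_of` and `domain_of` stand for the deterministic mappings from a PWN2.0 sense ID to its SUMO category and DOMAINS label, and occurrences are identified by arbitrary keys.

```python
def accuracy_by_inventory(system, gold, sumo_of, domain_of):
    """Score system sense ids against gold ids under three inventories.

    system, gold: dicts {occurrence key: PWN sense id}.
    sumo_of, domain_of: dicts {PWN sense id: coarser label} (assumed inputs).
    Untagged occurrences count as errors.
    """
    views = {
        "pwn": lambda sense_id: sense_id,
        "sumo": sumo_of.get,
        "domains": domain_of.get,
    }
    hits = dict.fromkeys(views, 0)
    for occurrence, gold_id in gold.items():
        system_id = system.get(occurrence)
        if system_id is None:
            continue
        for name, view in views.items():
            label = view(system_id)
            if label is not None and label == view(gold_id):
                hits[name] += 1
    total = len(gold)
    return {name: (n / total if total else 0.0) for name, n in hits.items()}
```

Two distinct PWN senses may map to the same SUMO category or DOMAINS label, so a coarser inventory can forgive a fine-grained error.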
Evaluation (II)
  • Automatic WSD was performed in three ways:
      • using only the RO-EN aligned BalkaNet wordnets (AWN)
      • combining AWN with clustering (AWN+C)
      • combining AWN+C with a simple most-frequent-sense heuristic (AWN+C+MFS)
  • Out of the 2081 total occurrences, 61 (of 34 words) could not receive a sense tag, either because the target literal was wrongly aligned by the translation equivalents extractor module of the WSDtool, or because it was not translated or wrongly translated by the human translator. In these cases we used MFS, a simple heuristic assigning the most frequent sense label (42 occurrences were correctly tagged).
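A sketch of the MFS fallback is given below. The slide does not say how the most frequent sense is obtained. Here it is assumed, purely for illustration, to be the label most often assigned to the same lemma among occurrences that were already tagged. Both function names are hypothetical.

```python
from collections import Counter


def most_frequent_sense(tagged):
    """Map each lemma to its most frequent label in (lemma, sense) pairs."""
    counts = {}
    for lemma, sense in tagged:
        counts.setdefault(lemma, Counter())[sense] += 1
    return {lemma: c.most_common(1)[0][0] for lemma, c in counts.items()}


def apply_mfs(lemmas, labels, mfs):
    """Fill the gaps left by AWN+C with the MFS label.

    lemmas: list of lemmas, one per occurrence.
    labels: dict {position: sense id} produced by the earlier steps.
    mfs: dict {lemma: sense id}, e.g. from most_frequent_sense().
    """
    return [labels.get(i) or mfs.get(lemma) for i, lemma in enumerate(lemmas)]
```

On the slide’s figures, the fallback concerns 61 of 2081 occurrences, roughly 2.9% of the test set.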
Evaluation (III)

WSD based on WN2.0+RoWN (PWN2.0 id)

Annotation import in aligned and WSDed corpora

Aim: to test whether, and to what degree, syntactic relations (as they are lexicalized in a corpus) can be transferred automatically from a resource-rich language (English) into a language with fewer resources (Romanian), using parallel corpora.

  • Parallel corpus: George Orwell’s 1984
    • XCES XML encoded
    • Sentence aligned (only 1-1 alignments retained)
    • Tokenized
    • Morpho-syntactically annotated
    • Chunked
    • Word aligned
    • The English version is parsed with an FDG parser (Tapanainen and Järvinen, 1997)
  • A set of generic rules for transferring syntactic relations from English to Romanian (hand-written, language-pair dependent)
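The core of the transfer, before any language-specific rule is applied, amounts to projecting each English dependency through the word alignment. The sketch below assumes 1-1 word links and a triple representation of relations. Both the function name and the data layout are hypothetical and much simpler than the hand-written rule set mentioned above.

```python
def project_relations(en_relations, alignment):
    """Project English dependency relations onto Romanian tokens.

    en_relations: list of (head index, relation label, dependent index)
                  over the English sentence.
    alignment: dict {English token index: Romanian token index}, 1-1 links only.
    Returns (projected relations, relations that could not be projected).
    """
    projected, skipped = [], []
    for head, relation, dependent in en_relations:
        if head in alignment and dependent in alignment:
            projected.append((alignment[head], relation, alignment[dependent]))
        else:
            # one end has no Romanian counterpart: left to rules or manual work
            skipped.append((head, relation, dependent))
    return projected, skipped
```

A relation projected this way is correct only when the two languages express it with the same structure. Otherwise it has to be amended or dropped.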
Transfer cases
  • Perfect transfer
  • Transfer with amendments
  • Language specific phenomena
  • Impossibility of transfer
2. Transfer with amendments (I)

Active-passive inversion

‘Lucrul’ is not the object of ‘sugerat’ but its subject.

2. Transfer with amendments (II)

In English, the dummy anticipatory ‘it’ is the subject and ‘book’ is the complement; in Romanian, ‘carte’ (‘book’) is the subject.

4. Impossibility of transfer (I)
  • Equivalent verbs with different syntactic behaviour: ‘like’ – ‘plăcea’

For details, see Verginica Mititelu, Radu Ion: Cross-Language Transfer of Syntactic Relations Using Parallel Corpora, 2005.

Conclusions (I)
  • WSD results from different experiments are hard to compare when sense inventories of different granularity are used (PWN2.0: 115424 meanings vs. SUMO/MILO: 2066 categories vs. IRST DOMAINS: 163).
  • Considering the fine granularity of the WSD annotation, our results are superior to those reported by the few researchers who used the same sense inventory (e.g. G. Rigau and his colleagues). This is not surprising: most WSD experiments were carried out in monolingual environments, whereas word alignment reveals the mental lexicon used by professional human translators in parallel texts and as such is an invaluable knowledge source.
Conclusions (II)
  • One of the greatest advantages of applying such methods to parallel data is that they can automatically sense-tag corpora in several languages at once, not just one.
  • The automatic procedure for transferring syntactic relations (as shown in the Romanian experiment) is reliable, provided that all necessary resources are available with the required level of annotation.
  • Language-specific structures and grammatical phenomena require pre- and post-processing of the data.
  • Given that syntactic relations can be imported, we started experiments on importing FrameNet annotations from English into Romanian.
Further work
  • Further development of the Romanian wordnet
    • about 15,000 “hard-synsets”
    • about 70,000 “easy synsets” (monosemous and/or mono-literals)
  • Development of new word-aligned multilingual corpora (including AqC21), improving current translation models and building new ones; (see the XCES sample of the EN-RO AqC: 80 docs, tagged, lemmatised, chunk-parsed, word-aligned, WSDed, approx. 1MB)
  • Extending the annotation import experiment and improving the rule-set governing the transfer:
    • Framenet project
    • Dependency grammar for Romanian
  • Development of a multilingual SMT system (we started En-Ro experiments with very encouraging results!)
Recent papers with details for this talk
  • Dan Tufiş, Verginica Mititelu, Luigi Bozianu, Cătălin Mihăilă: Romanian WordNet: New Developments and Applications. In Piek Vossen and Christiane Fellbaum (eds.) Proceedings of the 3rd International Wordnet Conference, South Jeju Island, Korea, January 2006, 10 pages
  • Dan Tufiş, Radu Ion, Alexandru Ceauşu, Dan Stefănescu: Combined Aligners. In Proceedings of the ACL2005 Workshop on “Building and Using Parallel Corpora: Data-driven Machine Translation and Beyond”, Ann Arbor, Michigan, June 2005, Association for Computational Linguistics, pp. 107-110
  • Verginica Mititelu, Radu Ion: Cross-Language Transfer of Syntactic Relations Using Parallel Corpora. In Diana Inkpen & Carlo Strapparava (eds.) Proceedings of the Workshop on Cross-Language Knowledge Induction, Cluj-Napoca, 2-4 August 2005, pp. 46-51, ISBN 973-703-139-9
  • Dan Tufiş, Radu Ion: Evaluating the word sense disambiguation accuracy with three different sense inventories. In Proceedings of the Natural Language Understanding and Cognitive Systems Symposium, Miami, Florida, May 2005, pp. 118-127, ISBN 972-8865-23-6
  • Dan Tufiş, Radu Ion, Nancy Ide: Fine-Grained Word Sense Disambiguation Based on Parallel Corpora, Word Alignment, Word Clustering and Aligned Wordnets. In Proceedings of the 20th International Conference on Computational Linguistics, COLING2004, Geneva, 2004, pp. 1312-1318, ISBN 1-9324432-48-5
  • Dan Tufiş: Word Sense Disambiguation: A Case Study on the Granularity of Sense Distinctions. In WSEAS Transactions on Information Science and Applications, vol. 2, no. 2, February 2005, pp. 183-188, ISSN 1790-0032
  • Dan Tufiş, Eduard Barbu: A Methodology and Associated Tools for Building Interlingual Wordnets. In Proceedings of the 5th LREC Conference, Lisbon, 2004, pp. 1067-1070
  • Dan Tufiş, Dan Cristea, S. Stamou: BalkaNet: Aims, Methods, Results and Perspectives. A General Overview. In Romanian Journal on Information Science and Technology, Dan Tufiş (ed.) Special Issue on BalkaNet, Romanian Academy, vol. 7, no. 2-3, 2004, pp. 9-34, ISSN 1453-8245
  • Dan Tufiş, Eduard Barbu, Verginica Mititelu, Radu Ion, Luigi Bozianu: The Romanian Wordnet. In Romanian Journal on Information Science and Technology, Dan Tufiş (ed.) Special Issue on BalkaNet, Romanian Academy, vol. 7, no. 2-3, 2004, pp. 105-122, ISSN 1453-8245
  • Dan Tufiş, Ana Maria Barbu, Radu Ion: Extracting Multilingual Lexicons from Parallel Corpora. In Computers and the Humanities, vol. 38, no. 2, May 2004, pp. 163-189, ISSN 0010-4817