Word sense disambiguation and annotation transfer in parallel texts

Dan TUFIŞ (1,2), Radu ION (1), Verginica MITITELU (1)

(1) RACAI – Research Institute for Artificial Intelligence, Bucharest

(2) University „A.I. Cuza”, Iaşi

{tufis, radu, vergi}@racai.ro

Outline of the talk

  • WSD: monolingual / multilingual approaches

  • WSD based on parallel texts and aligned wordnets

  • Brief account of the Romanian wordnet (part of BalkaNet)

  • WSDtool+ main processing steps

    • Word Alignment (COWAL),

    • Sense labelling (four sense inventories: synset IDs, SUMO/MILO concepts, IRST Domains, Explanatory Dictionary of Romanian)

    • Clustering

    • Annotation generation

  • Annotation import in aligned and WSDed corpora

  • Conclusions and further work

    WSD based on parallel texts and aligned wordnets

    Our approach:

    A mixture of unsupervised and knowledge-based (KB) approaches with multiple processing steps:

    • word alignment in parallel corpora (COWAL) and extraction of translation equivalents (translation model);

    • sense labeling based on the aligned wordnets (BalkaNet), which covers ~75% of the word occurrences in a corpus, using:

      • Princeton word sense inventory;

      • SUMO/MILO ontology;

      • IRST Domains classes;

      • EXPD labels;

    • sense clustering based on translation equivalents extracted from parallel corpora (takes care of the cases left uncovered by the previous step);

    • generation of the WSD annotation in the parallel corpus.

    Brief account of the Romanian wordnet

    Conceptually dense: for any RO-synset aligned to an ILI code (EN-synset), all the hierarchical antecedents of the EN-synset are also linked to synsets present in the RO-wordnet.

    Virtually, more synsets and more literals (monosemous literals; “coverage heuristics”)

    Brief account of the Romanian wordnet: structure of a synset

    [Figure: XML encoding of a RoWN synset, with callouts marking the PWN id, EXPD id, IRST id and SUMO id. Only the following fragment survives:]

    <DEF>Parte a unui organism animal sau vegetal care îndeplineşte anumite funcţii</DEF> (English: “Part of an animal or plant organism that performs certain functions”)

    <STAMP>Dan Tufis</STAMP>

    WSD MAIN STEPS 1. Word Alignment and Filtering of the Translation Equivalents

    • The word alignment system (COWAL) produces N-M cross-POS alignment links with an average F-measure above 80% (F=82.52%, AER=17.48%).

    • Only the translation links that preserve a major POS (V, N, A, R) are retained. For these links F is above 92% (F=92.04%, AER=7.96%).
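Each pair of figures on this slide sums to 100%. That is what one would expect when AER is computed against a single set of gold links, because then AER = 1 - F. The following is a minimal Python sketch of both the scoring and the POS filter, assuming alignment links are (source index, target index) pairs. The function names and the toy data are illustrative and are not part of COWAL.

```python
MAJOR_POS = {"V", "N", "A", "R"}


def alignment_scores(predicted, gold):
    """Return precision, recall, F-measure and AER for two sets of links."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    f = 2 * precision * recall / (precision + recall) if correct else 0.0
    # With one gold set (no sure/possible distinction):
    # AER = 1 - 2|A ∩ G| / (|A| + |G|), which equals 1 - F.
    total = len(predicted) + len(gold)
    aer = 1.0 - 2 * correct / total if total else 0.0
    return precision, recall, f, aer


def keep_pos_preserving(links, src_pos, tgt_pos):
    """Keep only links whose two ends carry the same major POS."""
    return {(s, t) for s, t in links
            if src_pos[s] == tgt_pos[t] and src_pos[s] in MAJOR_POS}


# Toy data (hypothetical): token index -> POS, plus gold and predicted links.
src_pos = {0: "N", 1: "V", 2: "A", 3: "D"}
tgt_pos = {0: "N", 1: "V", 2: "A", 3: "N"}
gold = {(0, 0), (1, 1), (2, 2), (3, 3)}
predicted = {(0, 0), (1, 1), (2, 3)}

precision, recall, f, aer = alignment_scores(predicted, gold)
filtered = keep_pos_preserving(predicted, src_pos, tgt_pos)
```

Filtering trades recall for precision: the cross-POS link (2, 3) is dropped, leaving only the links that the later sense-labelling step can use.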

    WSD MAIN STEPS 2. Sense Labeling

    • Aligned wordnets (lexical ontologies)

    • Conceptual knowledge structuring (upper and mid-level ontologies): SUMO/MILO

    • Domain taxonomies (UDC-librarian’s taxonomy): IRST-DOMAINS

    • Explanatory Dictionary of Romanian (3 sense levels)

    • Coverage heuristic: if one of the words in a translation pair is not a member of any synset in the wordnet of its language, but the other word is present in the aligned wordnet and is monosemous, the first word gets the sense of the monosemous word. If one of the languages is English, any other language can benefit from this heuristic (approx. 80% of the PWN literals are monosemous):

      • Ex: hilarious <-> hilar => ENG20-01221243-a

        burp <-> râgâi => ENG20-00003374-v

        prospicience <-> clarviziune => ENG20-05469664-n
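The coverage heuristic can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the WSDtool implementation: each wordnet is modelled as a plain dictionary from literal to a set of synset IDs, and the function name `transfer_sense` and the toy dictionaries are hypothetical. Only the `hilarious` ID comes from the slide; the two `lamp` IDs are those listed for PWN2.0 in the next example.

```python
def transfer_sense(word_a, word_b, wordnet_a, wordnet_b):
    """Return the synset ID shared by a translation pair, or None.

    The heuristic fires only when exactly one word is missing from its
    wordnet and the other word is monosemous (has a single synset).
    """
    senses_a = wordnet_a.get(word_a, set())
    senses_b = wordnet_b.get(word_b, set())
    if not senses_a and len(senses_b) == 1:
        return next(iter(senses_b))
    if not senses_b and len(senses_a) == 1:
        return next(iter(senses_a))
    return None


# Toy wordnets (hypothetical contents).
EN_WN = {
    "hilarious": {"ENG20-01221243-a"},
    "lamp": {"ENG20-03500372-n", "ENG20-03500773-n"},
}
RO_WN = {}  # neither Romanian word is covered in this toy example

monosemous_case = transfer_sense("hilarious", "hilar", EN_WN, RO_WN)
polysemous_case = transfer_sense("lamp", "felinar", EN_WN, RO_WN)
```

In the second call the English word has two senses, so the heuristic declines to choose and returns `None`.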

    Example (I): <lamp lampă>

    [Figure label: EQ-SYN]

    PWN2.0 (lamp) = {03500372-n, 03500773-n}

    RoWN (lampă) = {03500773-n, 03500872-n}


    ILI = 03500773-n => <lamp(2) lampă(1)>

    SUMO (03500773-n) = +Device

    DOMAINS (03500773-n) = furniture
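Example (I) amounts to intersecting the two sets of ILI codes and then looking up the other sense inventories for the surviving code. A minimal Python sketch using only the values shown on the slide; the function name `common_ili` and the dictionary layout are assumptions made for illustration.

```python
def common_ili(src_synsets, tgt_synsets):
    """Return the ILI codes shared by the two words of a translation pair."""
    return set(src_synsets) & set(tgt_synsets)


PWN_LAMP = {"03500372-n", "03500773-n"}    # PWN2.0 (lamp)
ROWN_LAMPA = {"03500773-n", "03500872-n"}  # RoWN (lampă)
SUMO = {"03500773-n": "+Device"}
DOMAINS = {"03500773-n": "furniture"}

shared = common_ili(PWN_LAMP, ROWN_LAMPA)
labels = [(ili, SUMO.get(ili), DOMAINS.get(ili)) for ili in sorted(shared)]
```

When the intersection holds exactly one code, as here, the pair is disambiguated in both languages at once.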

    Example (II): <lamp felinar>

    [Figure label: WN1]

    PWN2.0 (lamp) = {03500372-n, 03500773-n}

    RoWN (felinar) = {003505057-n}

    δ (03500372-n, 003505057-n) = 0.5

    δ (03500373-n, 003505057-n) = 0.125

    ILI = 03500773-n => <lamp(1) felinar(1.1)>

    SUMO (03500773-n) = IlluminationDevice

    DOMAINS (03500773-n) = factotum

    WSD MAIN STEPS 3. Word Sense Clustering

    [Figure: dendrogram produced by the clustering; its leaves are labelled (1), (2), (3), (4) and (6).]

    An agglomerative, hierarchical algorithm using a vector space model, Euclidean distance and a cardinality-weighted computation of the centroid (the “center of weight” of a new class).

    The undecidable problem of how many classes there should be gets hints from the work already done in steps 1 and 2.
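The clustering step can be illustrated with a small pure-Python sketch. It assumes that each occurrence of the target word is represented by a numeric vector (for instance, one dimension per translation equivalent), and that the number of classes is passed in as a parameter. The slides do not give these details, so the function names, the vector layout and the stopping rule are all assumptions.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def merge_centroid(c1, n1, c2, n2):
    """Cardinality-weighted centroid of two clusters with n1 and n2 members."""
    total = n1 + n2
    return [(n1 * a + n2 * b) / total for a, b in zip(c1, c2)]


def agglomerate(vectors, n_clusters):
    """Repeatedly merge the two closest clusters until n_clusters remain."""
    # Each cluster is a pair: (member indices, centroid).
    clusters = [([i], list(v)) for i, v in enumerate(vectors)]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = euclidean(clusters[i][1], clusters[j][1])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        (m1, c1), (m2, c2) = clusters[i], clusters[j]
        merged = (m1 + m2, merge_centroid(c1, len(m1), c2, len(m2)))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return [sorted(members) for members, _ in clusters]


# Toy occurrence vectors (hypothetical).
vectors = [[1, 0, 0], [1, 0, 0.1], [0, 1, 0], [0, 1, 0.2], [0.9, 0, 0]]
result = agglomerate(vectors, n_clusters=2)
```

Weighting the centroid by cluster size keeps a large class from being pulled toward a single newly merged occurrence.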

    4. WSD annotation in the parallel corpus (“1984”)

    <tu id="Ozz20">

    <seg lang="en">

    <s id="Oen.">

    <w lemma="the" ana="Dd">The</w>

    <w lemma="patrol" ana="Ncnp" sn="3" oc="Group" dom="military">

    <w lemma="do" ana="Vais">did</w>

    <w lemma="not" ana="Rmp" sn="1" oc="not" dom="factotum">

    <w lemma="matter" ana="Vmn" sn="1" oc="SubjAssesAttr" dom="factotum">

    <w lemma="however" ana="Rmp" sn="1" oc="SubjAssesAttr|PastFn" dom="factotum">

    <seg lang="ro">

    <s id="Oro.">

    <w lemma="şi" ana="Crssp">Şi</w>

    <w lemma="totuşi" ana="Rgp" sn="1" oc="SubjAssesAttr|PastFn" dom="factotum">

    <w lemma="patrulă" ana="Ncfpry" sn="1.1.x" oc="Group" dom="military">

    <w lemma="nu" ana="Qz" sn="1.x" oc="not" dom="factotum">

    <w lemma="conta" ana="Vmii3p" sn="2.x" oc="SubjAssesAttr" dom="factotum">
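Annotations of this kind can be read back with Python's standard `xml.etree.ElementTree` module. The sketch below assumes that `sn`, `oc` and `dom` hold the sense number, the ontology category and the domain label respectively, which is how they appear to be used above. Because the transcript shows only opening tags for most words, the closing tags and the surface forms `patrols` and `patrulele` in the sample are assumptions added to make it well-formed.

```python
import xml.etree.ElementTree as ET

# Shortened, well-formed sample; surface forms and closing tags are assumed.
SAMPLE = """<tu id="Ozz20">
  <seg lang="en"><s id="Oen.">
    <w lemma="the" ana="Dd">The</w>
    <w lemma="patrol" ana="Ncnp" sn="3" oc="Group" dom="military">patrols</w>
    <w lemma="do" ana="Vais">did</w>
  </s></seg>
  <seg lang="ro"><s id="Oro.">
    <w lemma="patrulă" ana="Ncfpry" sn="1.1.x" oc="Group" dom="military">patrulele</w>
  </s></seg>
</tu>"""


def sense_annotations(xml_text):
    """Return (language, lemma, sn, oc, dom) for every sense-tagged word."""
    root = ET.fromstring(xml_text)
    rows = []
    for seg in root.iter("seg"):
        for w in seg.iter("w"):
            if "sn" in w.attrib:  # skip words without a sense tag
                rows.append((seg.get("lang"), w.get("lemma"),
                             w.get("sn"), w.get("oc"), w.get("dom")))
    return rows


rows = sense_annotations(SAMPLE)
```

In this sample the English and Romanian words share the same `oc` and `dom` values while their `sn` values differ, since each language has its own sense numbering.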



    Evaluation (I)

    • “Lexical sample” and 1-tag annotation evaluation (with k-tag, the performance would essentially be that of the filtered word alignment, i.e. 92.04%)

    • 216 ambiguous English words (at least two senses per POS) with 2081 occurrences in “1984” were semantically disambiguated by three experts in terms of the PWN2.0 sense inventory. The experts negotiated all the disagreement cases, which resulted in the Gold Standard annotation (GS)

      • this is the “lexical sample/lexical choice” evaluation type of SENSEVAL (much harder than “all words”, which includes monosemous words and homographs as well)

    • For each PWN2.0 sense number, the GS was deterministically enriched with the SUMO category and the DOMAINS label.

    • Thus, we had three sense inventories in the GS and could evaluate the system’s WSD accuracy in terms of each of them.
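Because the SUMO category and the DOMAINS label are functions of the PWN sense, the same system output can be scored against all three inventories. A minimal Python sketch of that idea; the sense IDs `s1` to `s3`, their labels and the function names are invented for illustration and are not taken from the Gold Standard.

```python
def enrich(senses, sumo, domains):
    """Attach the SUMO category and DOMAINS label implied by each sense ID."""
    return [{"pwn": s, "sumo": sumo[s], "dom": domains[s]} for s in senses]


def accuracy(system, gold, inventory):
    """Share of occurrences whose label agrees in the given inventory."""
    hits = sum(1 for s, g in zip(system, gold) if s[inventory] == g[inventory])
    return hits / len(gold)


# Hypothetical mappings from fine-grained senses to coarser labels.
SUMO = {"s1": "Device", "s2": "Device", "s3": "Group"}
DOMAINS = {"s1": "furniture", "s2": "factotum", "s3": "military"}

gold = enrich(["s1", "s3", "s2"], SUMO, DOMAINS)
system = enrich(["s2", "s3", "s2"], SUMO, DOMAINS)

scores = {inv: accuracy(system, gold, inv) for inv in ("pwn", "sumo", "dom")}
```

A wrong fine-grained sense can still map to the right coarse label, so accuracy on a coarser inventory can never be lower than accuracy on the PWN senses from which it is derived.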

    Evaluation (II)

    • Automatic WSD was performed in three ways:

      • using only the RO-EN aligned BalkaNet wordnets (AWN)

      • combining AWN with clustering (AWN+C)

      • combining AWN+C with a simple heuristic (AWN+C+MFS)

    • Out of the 2081 total occurrences, 61 (34 words) could not receive a sense tag because the target literal was wrongly aligned by the translation equivalents extractor module of the WSDtool, or because it was not translated or was wrongly translated by the human translator. In these cases we used MFS, a simple heuristic that assigns the most frequent sense label (42 occurrences were correctly tagged).
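The three configurations form a back-off chain: use the aligned-wordnet label if there is one, otherwise the clustering label, otherwise the most frequent sense. A short Python sketch of that chain follows. The slides do not say where the sense frequencies come from, so estimating them from already tagged occurrences is an assumption, as are the function names and the toy data.

```python
from collections import Counter


def most_frequent_sense(tagged):
    """Map each lemma to its most frequent sense in (lemma, sense) pairs."""
    counts = {}
    for lemma, sense in tagged:
        counts.setdefault(lemma, Counter())[sense] += 1
    return {lemma: c.most_common(1)[0][0] for lemma, c in counts.items()}


def tag_with_backoff(lemma, awn_sense, cluster_sense, mfs):
    """Return (sense, source) from the first step that produced a label."""
    if awn_sense is not None:
        return awn_sense, "AWN"
    if cluster_sense is not None:
        return cluster_sense, "C"
    return mfs.get(lemma), "MFS"


# Hypothetical tagged occurrences used to estimate sense frequencies.
mfs = most_frequent_sense([("lamp", "2"), ("lamp", "2"), ("lamp", "1")])

from_awn = tag_with_backoff("lamp", "1", None, mfs)
from_mfs = tag_with_backoff("lamp", None, None, mfs)
```

Returning the source alongside the sense makes it easy to report how many occurrences each step handled.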

    Evaluation (III)

    WSD based on WN2.0+RoWN (PWN2.0 id)

    Evaluation depends on the sense inventories

    PWN2.0+RoWN (AWN+C+MFS)

    Annotation import in aligned and WSDed corpora

    AIM:

    Test whether it is possible (and to what degree) to automatically transfer syntactic relations (as they are lexicalized in a corpus) from a resource-rich language (English) into another language with fewer resources (Romanian), using parallel corpora.

    Resources

    • Parallel corpus: George Orwell’s 1984

      • XCES XML encoded

      • Sentence aligned (only 1-1 alignments retained)

      • Tokenized

      • Morpho-syntactically annotated

      • Chunked

      • Word aligned

      • The English version - parsed with an FDG parser (Tapanainen and Järvinen, 1997)

    • A set of generic rules for transferring syntactic relations from English to Romanian (hand-written, language-pair dependent)
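Given these resources, the simplest form of transfer copies each English relation across the word alignment whenever both of its words are aligned. A minimal Python sketch, assuming relations are (head, dependent, label) triples over token indices and the alignment is a 1-1 mapping from English to Romanian indices. The representation, the labels `subj` and `obj`, and the function name are illustrative and do not reproduce the authors' rule set.

```python
def project_relations(en_relations, alignment):
    """Copy (head, dependent, label) triples across 1-1 alignment links.

    Relations with an unaligned head or dependent are returned separately
    so that later rules can inspect them.
    """
    projected, skipped = [], []
    for head, dep, label in en_relations:
        if head in alignment and dep in alignment:
            projected.append((alignment[head], alignment[dep], label))
        else:
            skipped.append((head, dep, label))
    return projected, skipped


# Toy input (hypothetical): English token 3 has no Romanian counterpart.
en_relations = [(2, 1, "subj"), (2, 3, "obj")]
alignment = {1: 0, 2: 1}

projected, skipped = project_relations(en_relations, alignment)
```

Hand-written rules would then act on the projected triples, for example to relabel a relation where the two languages differ.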

    Transfer cases

    • Perfect transfer

    • Transfer with amendments

    • Language specific phenomena

    • Impossibility of transfer

    1. Perfect Transfer

    2. Transfer with amendments (I)

    Active-Passive inversion

    ‘Lucrul’ (‘the thing’) is not the object of ‘sugerat’ (‘suggested’) but its subject

    2. Transfer with amendments (II)

    The dummy anticipatory ‘it’ is the subject and ‘book’ is the complement. In Romanian, however, ‘carte’ (‘book’) is the subject.

    3. Language Specific Phenomena

    • Pro-drop phenomenon

    4. Impossibility of transfer (I)

    • Equivalent verbs with different syntactic behaviour: ‘like’ – ‘plăcea’

    See for details:

    Verginica Mititelu, Radu Ion:

    Cross-Language Transfer of Syntactic Relations Using Parallel Corpora, 2005

    Conclusions (I)

    • WSD results are hard to compare across experiments when sense inventories of different granularity are used (PWN2.0: 115424 meanings vs. SUMO-MILO: 2066 categories vs. IRST DOMAINS: 163)

    • Considering the fine granularity of the WSD annotation, our results are superior to those reported by the few researchers who used the same sense inventory (e.g. G. Rigau and his colleagues). This is not surprising: most WSD experiments were carried out in monolingual environments, whereas word alignment reveals the mental lexicon used by professional human translators in parallel texts and as such is an invaluable knowledge source.

    Conclusions (II)

    • One of the greatest advantages of applying such methods to parallel data is that they can automatically sense-tag corpora in several languages at once, not just one.

    • The automatic procedure for transferring syntactic relations (as shown in the Romanian experiment) is reliable, provided that all necessary resources are present with the required level of annotation.

    • Language-specific structures and grammatical phenomena require pre- and post-processing of the data.

    • Given that syntactic relations can be imported, we started experiments on importing FrameNet annotations from English into Romanian.

    Further work

    • Further development of the Romanian wordnet

      • about 15,000 “hard-synsets”

      • about 70,000 “easy synsets” (monosemous and/or mono-literals)

    • Development of new word-aligned multilingual corpora (including AqC21), improving current translation models and building new ones; (see the XCES sample of the EN-RO AqC: 80 docs, tagged, lemmatised, chunk-parsed, word-aligned, WSDed, approx. 1MB)

    • Extending the annotation import experiment and improving the rule-set governing the transfer:

      • Framenet project

      • Dependency grammar for Romanian

    • Development of a multilingual SMT system (we started En-Ro experiments with very encouraging results!)

    Recent papers with details for this talk

    • Dan Tufiş, Verginica Mititelu, Luigi Bozianu, Cătălin Mihăilă: Romanian WordNet: New Developments and Applications. In Piek Vossen and Christiane Fellbaum (eds.) Proceedings of the 3rd International Wordnet Conference, Jeju Island, Korea, South Jeju, January 2006, 10 pages

    • Dan Tufiş, Radu Ion, Alexandru Ceauşu, Dan Stefănescu: Combined Aligners. In Proceedings of the ACL2005 Workshop on “Building and Using Parallel Corpora: Data-driven Machine Translation and Beyond”, Ann Arbor, Michigan, June 2005, Association for Computational Linguistics, pp. 107-110

    • Verginica Mititelu, Radu Ion: Cross-Language Transfer of Syntactic Relations Using Parallel Corpora. In Diana Inkpen & Carlo Strapparava (eds.) Proceedings of the Workshop on Cross-Language Knowledge Induction, 2-4 August, Cluj-Napoca, 2005, pp. 46-51, ISBN 973-703-139-9

    • Dan Tufiş, Radu Ion: Evaluating the word sense disambiguation accuracy with three different sense inventories. In Proceedings of the Natural Language Understanding and Cognitive Systems Symposium, Miami, Florida, May 2005, pp. 118-127, ISBN 972-8865-23-6

    • Dan Tufiş, Radu Ion, Nancy Ide: Fine-Grained Word Sense Disambiguation Based on Parallel Corpora, Word Alignment, Word Clustering and Aligned Wordnets. In Proceedings of the 20th International Conference on Computational Linguistics, COLING2004, Geneva, 2004, pp. 1312-1318, ISBN 1-9324432-48-5

    • Dan Tufiş: Word Sense Disambiguation: A Case Study on the Granularity of Sense Distinctions. In WSEAS Transactions on Information Science and Applications, vol. 2, no. 2, February 2005, pp. 183-188, ISSN 1790-0032

    • Dan Tufiş, Eduard Barbu: A Methodology and Associated Tools for Building Interlingual Wordnets. In Proceedings of the 5th LREC Conference, Lisbon, 2004, pp. 1067-1070

    • Dan Tufiş, Dan Cristea, S. Stamou: BalkaNet: Aims, Methods, Results and Perspectives. A General Overview. In Romanian Journal on Information Science and Technology, Dan Tufiş (ed.) Special Issue on BalkaNet, Romanian Academy, vol. 7, no. 2-3, 2004, pp. 9-34, ISSN 1453-8245

    • Dan Tufiş, Eduard Barbu, Verginica Mititelu, Radu Ion, Luigi Bozianu: The Romanian Wordnet. In Romanian Journal on Information Science and Technology, Dan Tufiş (ed.) Special Issue on BalkaNet, Romanian Academy, vol. 7, no. 2-3, 2004, pp. 105-122, ISSN 1453-8245

    • Dan Tufiş, Ana Maria Barbu, Radu Ion: Extracting Multilingual Lexicons from Parallel Corpora. Computers and the Humanities, Volume 38, Issue 2, May 2004, pp. 163-189, ISSN 0010-4817

    Thank you!