Extending WordNet with syntagmatic information Luisa Bentivogli, Emanuele Pianta ITC-irst


  1. Extending WordNet with syntagmatic information Luisa Bentivogli, Emanuele Pianta ITC-irst 2nd GWC, January 20th-23rd 2004 - Brno

  2. Overview
     • WordNet: paradigmatic vs syntagmatic information
     • Recurrent Free Phrases
     • Encoding RFP through Phrasets and Syntagmatic Relations
     • Getting RFPs in bilingual dictionaries and corpora
     • Conclusions

  3. Paradigmatic vs Syntagmatic
     [Diagram] Sentence: “An international conference took place in Brno”

  4. Paradigmatic vs Syntagmatic
     [Diagram] Sentence: “An international conference took place in Brno”
     Alternative words: Czech Republic, meeting, Prague, national, symposium

  5. Paradigmatic vs Syntagmatic
     [Diagram] Sentence: “An international conference took place in Brno”
     Paradigmatic relations (in absentia): Czech Republic, meeting, Prague, national, symposium

  6. Paradigmatic vs Syntagmatic
     [Diagram] Sentence: “An international conference took place in Brno”
     Paradigmatic relations (in absentia): Czech Republic, meeting, Prague, national, symposium
     Labels on the sentence: multiword expression

  7. Paradigmatic vs Syntagmatic
     [Diagram] Sentence: “An international conference took place in Brno”
     Paradigmatic relations (in absentia): Czech Republic, meeting, Prague, national, symposium
     Labels on the sentence: multiword expression, semantic restriction

  8. Paradigmatic vs Syntagmatic
     [Diagram] Sentence: “An international conference took place in Brno”
     Paradigmatic relations (in absentia): Czech Republic, meeting, Prague, national, symposium
     Labels on the sentence: multiword expression, free phrase, semantic restriction

  9. Paradigmatic vs Syntagmatic
     [Diagram] Sentence: “An international conference took place in Brno”
     Paradigmatic relations (in absentia): Czech Republic, meeting, Prague, national, symposium
     Syntagmatic relations (in presentia): multiword expression, free phrase, semantic restriction

  10. Why is syntagmatic info useful?
      • From a lexicographic point of view
          • See examples of usage in dictionaries (and WN itself)
          • Often a very short phrase
          • Sometimes more useful than definitions
      • From a computational point of view
          • statistics-oriented, corpus-based methods
          • crucial role of co-occurrence information
          • co-occurrence of words vs meanings

  11. Lexical units in WordNet
      • Criterion for inclusion in synsets: only lexicalized concepts
      • What counts as a lexical unit
          • Simple words: {tree}
          • Idioms
              • non-compositional meaning
              • {rollercoaster, big_dipper, ...}
          • Restricted collocations
              • compositional, reduced substitution, no literal translation
              • {criminal_record, record} (Italian: precedenti penali)
          • Named entities: {Praha, capital_of_the_Czech_Republic, …}

  12. Problems with inclusion criteria - 1
      • Artificial nodes: synsets with no lexical unit
          • {social_group} - {gruppo_sociale}
      • Free combinations of words (Benson et al., 1986)
          • DEF: a combination of words following only the general rules of syntax
      • Restricted collocations:
          • reduced substitution, no literal translation, but compositional
          • ex: circulatory system (*blood system, *circulation system)
          • are they lexical units?
          • should we include them in synsets?
      • Can we “keep” the information currently contained in artificial nodes and restricted collocations without violating the criterion for inclusion in synsets?

  13. Problems with inclusion criteria - 2
      • A considerable number of expressions which are systematically used to express a concept are excluded from (Multi)WordNet as they are not lexical units
      • Ex: “andare in bicicletta” [to bike]
          • andare: to move by walking or using a means of locomotion
          • in bicicletta: by bike
      • Ex: “punta di freccia” [arrowhead]

  14. Introducing Recurrent Free Phrases
      • Recurrent free phrase (RFP): a free combination of words which is recurrently used to express a concept
      • 1. Syntactically constrained: N|V|A|P phrases (cf. restricted collocations)
      • 2. High frequency (“governo italiano”, Italian government)
      • 3. High degree of association (“prima volta”, first time)
      • 4. Salience:
          • intuition of the native-speaker lexicographer that a certain expression picks out a concept which is perceived as relevant and somehow unitary
          • not necessarily related to frequency and word association
          • “vertice internazionale”, international summit (high salience)
          • “coscia destra”, right thigh

  15. The salience criterion
      • Hypothesis:
          • Related to the amount of world knowledge that is attached to a certain phrase
          • Such knowledge cannot be inferred from the meanings composing the phrase
      • Example:
          • right hand (more salient)
          • right thigh

  16. Recurrent Free Phrases for NLP
      • Knowledge-based word alignment of parallel corpora
          • EX: cornfield ~ campo di grano
      • Word Sense Disambiguation
          • campo: 12 senses in MWN
          • grano: 9 senses
          • both unambiguous in “campo di grano”
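
A rough illustration of the WSD point on slide 16, using only the sense counts quoted there. The dictionary and function names are hypothetical and are not part of MultiWordNet; the single reading for the whole phrase is an assumption taken from the slide's claim that both words are unambiguous in “campo di grano”.

```python
# Sense counts quoted on the slide (MWN); the data layout is illustrative only.
SENSE_COUNTS = {"campo": 12, "grano": 9}


def sense_combinations(words, sense_counts):
    """Joint sense assignments if each word is disambiguated independently."""
    total = 1
    for word in words:
        total *= sense_counts[word]
    return total


independent = sense_combinations(["campo", "grano"], SENSE_COUNTS)
as_phrase = 1  # assumption: the RFP "campo di grano" maps to one concept (cornfield)
print(independent, as_phrase)
```

Treating the two words separately leaves 12 × 9 candidate sense pairs, while recognising the phrase as a unit leaves one.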

  17. Criteria for RFP selection
      • RFPs expressing a concept which is not lexicalized in one language but is lexicalized in another (lexical gaps)
          • EX: andare in bicicletta [to bike]
      • RFPs synonymous with a lexical unit in the same language
          • EX: strofinaccio dei piatti / canovaccio [dishcloth]
      • RFPs that are frequent, cohesive and salient within a corpus taken as the reference corpus
          • EX: vertice internazionale [international summit]
      • RFPs whose components are highly polysemous
          • EX: campo di grano [cornfield]

  18. MultiWordNet
      • MultiWordNet: Italian/English lexical database
      • Princeton WordNet building criteria
      • Strict alignment (see expand model)
      • Explicit treatment of lexical gaps
      • Italian (44,000 words) and Hebrew (University of Haifa, just started)
      • Cf. Spanish WordNet (EuroWordNet)

  19. Introducing Phrasets
      • Phraset: a set of synonymous recurrent free phrases
      ENG-synset {cornfield}; ITA-synset {GAP}; ITA-phraset {campo_di_grano}
      ENG-synset {toilet_roll}; ITA-synset {GAP}; ITA-phraset {rotolo_di_carta_igienica}
      ENG-synset {dishcloth}; ITA-synset {canovaccio}; ITA-phraset {strofinaccio_dei_piatti, strofinaccio_da_cucina}
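
The three aligned entries on slide 19 could be held in a structure like the one below. This is a minimal sketch: the class name, field names and the choice of an empty list for {GAP} are assumptions made for illustration, not the actual MultiWordNet schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class AlignedEntry:
    """An English synset aligned with an Italian synset and phraset (hypothetical layout)."""
    eng_synset: List[str]
    ita_synset: List[str] = field(default_factory=list)   # empty list stands for {GAP}
    ita_phraset: List[str] = field(default_factory=list)  # synonymous RFPs

    @property
    def is_gap(self) -> bool:
        """True when Italian has no lexical unit for the concept."""
        return not self.ita_synset

    def italian_expressions(self) -> List[str]:
        """Lexical units first, then recurrent free phrases."""
        return self.ita_synset + self.ita_phraset


# The three examples from the slide.
ENTRIES = [
    AlignedEntry(["cornfield"], [], ["campo_di_grano"]),
    AlignedEntry(["toilet_roll"], [], ["rotolo_di_carta_igienica"]),
    AlignedEntry(["dishcloth"], ["canovaccio"],
                 ["strofinaccio_dei_piatti", "strofinaccio_da_cucina"]),
]

for entry in ENTRIES:
    print(entry.eng_synset, "GAP" if entry.is_gap else entry.ita_synset, entry.ita_phraset)
```

Keeping the phraset separate from the synset preserves the inclusion criterion for synsets: a gap stays a gap even when a phraset supplies a usable Italian expression.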

  20. RFPs vs definitions
      RFPs are not definitions
      E-synset {tree -- a tall perennial woody plant having a main trunk …}; I-synset {albero -- ogni pianta perenne con fusto legnoso ramificato} [any perennial plant with a branched woody stem]; I-phraset {}
      E-synset {paperboy}; I-synset {GAP -- ragazzo che recapita i giornali} [boy who delivers newspapers]; I-phraset {ragazzo_dei_giornali}
      E-synset {straphanger}; I-synset {GAP -- chi viaggia in piedi su mezzi pubblici reggendosi ad un sostegno} [someone who travels standing on public transport, holding on to a support]; I-phraset {}

  21. Synsets vs Phrasets
      [Diagram] Free combinations of words; Recurrent Free Phrases (Phrasets); Restricted collocations, Named entities, Idioms, Simple words (Synsets)

  22. Syntagmatic Relations in WN
      • MEANING project: using the “involve” semantic relation to encode deep selectional restrictions
      • Can RFPs be encoded through semantic relations?

  23. Encoding “campagna antifumo” - 1
      Through phrasets
      [Diagram] Synset: {campagna}, Phraset: {} (campaign) is the hypernym of Synset: {GAP}, Phraset: {campagna_antifumo} (campaign against smoking)

  24. Encoding “campagna antifumo” - 2
      Through a semantic relation
      [Diagram] Synset: {campagna} and Synset: {antifumo} linked by a has_constraint relation (campaign against smoking)

  25. Pros and cons of using semantic relations for encoding RFPs
      • Smart and concise, but what about
          • trigram RFPs?
          • synonymous RFPs?
          • RFPs that are translation equivalents of lexical units?
          • restrictions on word order and word morphology?

  26. Taking the best of both encodings
      • Phrasets and lexical syntagmatic relations
      [Diagram] campo (field) has hypernym appezzamento (parcel); frumento, grano (corn) has hypernym cereale (cereal); GAP -- campo di grano (cornfield) has hypernym campo and composed-of links to campo and grano
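
One way to read the combined encoding on slide 26 is as a small set of labelled edges. The sketch below is an interpretation of the flattened diagram: the edge directions, the node identifiers and the helper function are assumptions, and only the relation names hypernym and composed-of come from the slide.

```python
# (source, relation, target) triples; node names are illustrative identifiers.
RELATIONS = [
    ("campo", "hypernym", "appezzamento"),        # field -> parcel
    ("frumento/grano", "hypernym", "cereale"),    # corn -> cereal
    ("campo_di_grano", "hypernym", "campo"),      # phraset of the GAP synset -> field
    ("campo_di_grano", "composed-of", "campo"),
    ("campo_di_grano", "composed-of", "frumento/grano"),
]


def targets(source, relation, relations=RELATIONS):
    """All nodes reached from `source` through edges labelled `relation`."""
    return [t for s, r, t in relations if s == source and r == relation]


print(targets("campo_di_grano", "composed-of"))
print(targets("campo_di_grano", "hypernym"))
```

The phraset keeps the phrase as a unit (word order, morphology, synonyms), while the composed-of edges record which synsets its component words belong to.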

  27. RFPs in Bilingual Dictionaries
      • Collins bilingual dictionary (medium size)
      • Italian translation equivalents (Bentivogli and Pianta, 2000)
          • 92.2% correspond to lexical units
          • 7.8% correspond to free combinations of words (lexical gaps)
      • Manual check of 300 lexical gaps
          • 67% correspond to RFPs
      => More than half of the synsets which are gaps in Italian potentially have an associated phraset
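
The figures on slide 27 can be combined in a quick back-of-the-envelope calculation. It assumes that the 67% rate observed on the 300 checked gaps carries over to all lexical gaps, which the slide suggests but does not state as a measured result; the variable names are illustrative.

```python
gap_share = 0.078          # translation equivalents that are free combinations (lexical gaps)
rfp_share_of_gaps = 0.67   # checked gaps that turned out to be RFPs
sample_size = 300          # gaps checked manually

# Assumption: the sample rate extrapolates to all lexical gaps.
rfp_share_overall = gap_share * rfp_share_of_gaps
rfps_in_sample = round(sample_size * rfp_share_of_gaps)

print(f"RFPs among all translation equivalents: {rfp_share_overall:.1%}")
print(f"RFPs in the checked sample: {rfps_in_sample}")
```

Under that assumption roughly 5% of all Italian translation equivalents would be RFPs, and about 201 of the 300 checked gaps were.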

  28. RFPs in corpora
      • Correlation between RFPs and frequency?
      • Analysis of a 32M-word corpus (Repubblica, 2000-2001)
      • Standard n-gram analysis package (NSP)
      • All bigrams including at least one stopword excluded
      • 118,464 bigrams occurring more than 3 times
      • Highest rank: 5,914 occurrences (“New York”)
      • Rank 4: 31,453 bigrams
      • 497 distinct ranks (frequency classes)
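
The filtering steps on slide 28 (drop bigrams containing a stopword, keep those occurring more than 3 times) can be sketched in a few lines. This is a stand-in written with Python's standard library, not the NSP package the authors used; the stopword list, the toy token sequence and the function name are all hypothetical.

```python
from collections import Counter

# Tiny illustrative stopword list; a real run would need a full Italian list.
STOPWORDS = {"di", "il", "la", "a", "in", "e", "per"}


def frequent_bigrams(tokens, stopwords=STOPWORDS, min_count=4):
    """Count adjacent word pairs with no stopword, keeping those seen at least min_count times."""
    counts = Counter(
        (w1, w2)
        for w1, w2 in zip(tokens, tokens[1:])
        if w1.lower() not in stopwords and w2.lower() not in stopwords
    )
    return {bigram: n for bigram, n in counts.items() if n >= min_count}


# Toy input: the same phrase repeated four times, plus one rare bigram.
tokens = ("il vertice internazionale di New York " * 4 + "la prima volta").split()
result = frequent_bigrams(tokens)
print(result)
```

The sketch pairs tokens across sentence boundaries, which a real pipeline would probably want to avoid by counting sentence by sentence.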

  29. RFPs in corpora cont.
      • Lower ranks are systematically and densely populated
      • Higher ranks are sparsely and poorly populated
      • Rank groups (frequency range, number of bigrams)
          • A: 5,914-509 (100)
          • B: 505-257 (257)
          • C: 256-129 (731)
          • D: 128-65 (1,965)
          • E: 64-33 (4,525)
          • F: 32-17 (10,477)
          • G: 16-9 (22,167)
          • H: 8-5 (46,798)
          • I: 4 (31,453)
      • Manual check of 100 random bigrams from each rank group
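
The rank groups on slide 29 amount to binning bigrams by frequency and then drawing a fixed-size random sample from each bin. A minimal sketch follows; the lower bounds are read off the slide, while the function names, the fixed seed and the treatment of frequencies not listed on the slide (for example 506 to 508, which fall into B here) are assumptions.

```python
import random

# (label, lowest frequency in the group), highest group first, as listed on the slide.
RANK_GROUPS = [("A", 509), ("B", 257), ("C", 129), ("D", 65), ("E", 33),
               ("F", 17), ("G", 9), ("H", 5), ("I", 4)]


def rank_group(freq):
    """Return the label of the rank group for a bigram frequency, or None below 4."""
    for label, lower_bound in RANK_GROUPS:
        if freq >= lower_bound:
            return label
    return None


def sample_for_check(bigrams, k=100, seed=0):
    """Draw up to k bigrams at random from one rank group for manual checking."""
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    return rng.sample(bigrams, min(k, len(bigrams)))


print(rank_group(5914), rank_group(505), rank_group(128), rank_group(4), rank_group(3))
```

Sampling the same number of bigrams from every group keeps the sparsely populated high-frequency groups from being swamped by the much larger low-frequency ones.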

  30. RFPs in corpora cont.
      • Manual check of 100 random bigrams from each rank group
      • NB: similar results on trigrams

  31. Correlation between num. of RFPs and frequency in a reference corpus

  32. Future work
      • Better characterization and classification
      • Correlation with association measures
      • Evaluating RFP for WSD

  33. Conclusions
      • WordNet is poor in syntagmatic information
      • We introduced Recurrent Free Phrases, Phrasets and syntagmatic lexical relations
      • RFP: a free combination of words recurrently used to express a concept
      • Criteria for their selection
      • Bilingual dictionaries contain many RFPs
      • Corpora: no clear correlation with frequency
      • Useful for:
          • lexicographic work
          • Word Sense Disambiguation
