
Lexical Semantics and Ontologies Tutorial at the ACL/HCSnet 2006 Advanced Program in Natural Language Processing



Presentation Transcript


  1. Lexical Semantics and Ontologies • Tutorial at the ACL/HCSnet 2006 Advanced Program in Natural Language Processing • Paul Buitelaar, Language Technology Lab & Competence Center Semantic Web, DFKI GmbH, Saarbrücken, Germany

  2. Overview • Day 1: Words and Meanings • Human language as a system • How do words relate to each other • Day 2: Words and Object Descriptions • Human language as a means of representation • How do words represent objects in the/a world

  3. Day 1 - Introduction • Words and Meanings • Synsets and Senses • Lexical Semantics in WordNet • Related Senses • Generative Lexicon and CoreLex • Domains and Senses • Tuning WordNet to a Domain

  4. Words and Meanings • Lexical Semantics in WordNet • Generative Lexicon and CoreLex • Tuning WordNet to a Domain

  5. WordNet • Lexical Semantic Resource • Semantic Lexicon • Maps words to meanings (senses) • Lexical Database • Machine readable (has a formal structure) • Freely available • http://wordnet.princeton.edu/

  6. WordNet - Origins • “In 1985 a group of psychologists and linguists at Princeton University undertook to develop a lexical database … The initial idea was to provide an aid to use in searching dictionaries conceptually, rather than merely alphabetically … WordNet … instantiates hypotheses based on results of psycholinguistic research … expose such hypotheses to the full range of the common vocabulary” • “In anomic aphasia, there is a specific inability to name objects. When confronted with an apple, say, patients may be unable to utter ‘apple,’ even though they will reject such suggestions as shoe or banana, and will recognize that apple is correct when it is provided.” (Caramazza/Berndt 1978) • Miller, George A., Richard Beckwith, Christiane Fellbaum, Derek Gross and Katherine J. Miller. “Introduction to WordNet: an on-line lexical database.” In: International Journal of Lexicography 3 (4), 1990, pp. 235-244.

  7. Synsets • WordNet is organized around word meaning (not word forms as with traditional lexicons) • Word meaning is represented by “synsets” • Synset is a “Set of Synonyms” • Example • {board, plank} • Piece of lumber • {board, committee} • Group of people
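
The synset idea on this slide can be sketched as a small lookup table. This is a toy illustration only: the two synsets come from the slide, while the keys `lumber` and `committee` and the function name `senses` are hypothetical labels chosen here, not WordNet identifiers.

```python
# Toy synsets copied from the slide; the keys are hypothetical labels.
SYNSETS = {
    "lumber": frozenset({"board", "plank"}),         # piece of lumber
    "committee": frozenset({"board", "committee"}),  # group of people
}


def senses(word):
    """Return the labels of every synset that contains the word."""
    return sorted(label for label, members in SYNSETS.items() if word in members)


print(senses("board"))  # one entry per sense of "board"
print(senses("plank"))
```

A word that occurs in several synsets gets one sense per synset, which is why the lookup goes from synsets to words rather than from word forms to definitions.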

  8. Synset Hierarchy • Synsets are organized in hierarchies • Defines: • generalization (hypernymy) • specialization (hyponymy) • Example (most general to most specific): {entity} > … > {whole, unit} > {building material} > {lumber, timber} > {board, plank} • Moving up the chain is hypernymy, moving down is hyponymy
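
A minimal sketch of walking such a hierarchy, assuming a plain child-to-parent table. The synset names are taken from the slide; the `"..."` entry is a placeholder for the levels the slide leaves out, and the function names are hypothetical.

```python
# Child synset -> parent synset, as on the slide. The "..." entry stands for
# the levels the slide elides; it is a placeholder, not a real synset.
HYPERNYM = {
    "board, plank": "lumber, timber",
    "lumber, timber": "building material",
    "building material": "whole, unit",
    "whole, unit": "...",
    "...": "entity",
}


def hypernym_chain(synset):
    """Walk upwards from a synset to the root, most specific first."""
    chain = [synset]
    while chain[-1] in HYPERNYM:
        chain.append(HYPERNYM[chain[-1]])
    return chain


def is_hyponym_of(specific, general):
    """True if `general` lies strictly above `specific` in the hierarchy."""
    return general in hypernym_chain(specific)[1:]


print(" < ".join(hypernym_chain("board, plank")))
```

Hypernymy and hyponymy are the same link read in opposite directions, so a single parent table is enough for both tests.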

  9. Hierarchies (WordNet 1.7)

  10. Hierarchy Example (WordNet 2.1)

  11. Synsets and Senses • Synsets represent word meaning • Words that occur in several synsets have a corresponding number of meanings (senses) • Example

  12. WordNet 2.1

  13. (Other) WordNet Relations • Synonymy • Similar in meaning • Hypernymy/Hyponymy • Generalization and Specialization • Meronymy • Part-of • e.g. study, bathroom, ... are meronyms of house • Antonymy • Opposite in meaning • e.g. warm is an antonym of cold

  14. Words and Meanings • Lexical Semantics in WordNet • Generative Lexicon and CoreLex • Tuning WordNet to a Domain

  15. Systematic Polysemy • Homonymy • bank • embankment: We walked along the bank of the Charles river. • institution: Did he have an account at the HBU bank? • Systematic Polysemy • school • group (of people): The school went for an outing. • (learning) process: School starts at 8.30. • organization: The school was founded in 1910. • building: The school has a new roof.

  16. Semantic or Pragmatic? • [Figure: two panels, Semantic Analysis and Pragmatic Analysis, each linking the lexical item school (Lexical Items of the Language) to Obj1, Obj2, Obj3 and Obj4 (Objects in the World)]

  17. Underspecified Discourse Referents • Anaphora Resolution • [A long book heavily weighted with military technicalities]NP:event-physical_object-content, in this edition it is neither so long [event] nor so technical [content] as it was originally. • Metonymy • The Boston office called • office > person • person part-of office • Bridging • Peter bought a car. The engine runs well. • engine part-of car • The Boston office called. They asked for a new price. • office > person

  18. Generative Lexicon Theory • Type Coercion • I began the book • book > event • event ‘has-relation-with’ book • read is-a event • multifaceted representation of lexical semantics • reflecting systematic / regular / logical polysemy

  19. Generative Lexicon Theory • Qualia Structure (Pustejovsky 1995) • Formal: inheritance (is-a / hyponymy) • book formal: artifact, communication, … • Constitutive: modification (part-of / meronymy) • book constitutive: section, … • Telic: purpose (“what is the object used for”) • book telic: read, … • Agentive: causality (“how did the object come about”) • book agentive: write, …
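
The qualia structure for book can be written down as a small record, which also gives a rough picture of the type coercion in "I began the book". This is a sketch under assumptions: the class `Qualia` and the function `coerce_to_event` are hypothetical names, and treating the telic and agentive roles as the available events is a simplification of Generative Lexicon theory, not its formal definition.

```python
from dataclasses import dataclass, field


@dataclass
class Qualia:
    """The four qualia roles, each holding a list of values."""
    formal: list = field(default_factory=list)        # is-a / hyponymy
    constitutive: list = field(default_factory=list)  # part-of / meronymy
    telic: list = field(default_factory=list)         # what it is used for
    agentive: list = field(default_factory=list)      # how it came about


# Values for "book" as listed on the slide (the slide's "…" is not filled in).
BOOK = Qualia(
    formal=["artifact", "communication"],
    constitutive=["section"],
    telic=["read"],
    agentive=["write"],
)


def coerce_to_event(qualia):
    """Events a verb such as 'begin' could pick up from a noun (assumption)."""
    return qualia.telic + qualia.agentive


print(coerce_to_event(BOOK))
```

Under this reading, "I began the book" is interpreted through an event stored with the noun, which matches the slide's "read is-a event".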

  20. CoreLex (Buitelaar 1998) • Automatic Qualia Structure Acquisition • CoreLex is an attempt to automatically acquire underspecified lexical semantic representations that reflect systematic polysemy • These representations can be viewed as shallow Qualia Structures • Sense Distribution in WordNet • Systematic polysemy can be empirically studied in WordNet by observing sense distributions >> If more than two words share the same sense distribution (i.e. have the same set of senses), then this may indicate a pattern of systematic polysemy (adapted from Apresjan 1973)
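
The sense-distribution criterion can be sketched as a grouping step. This is a minimal illustration, not the CoreLex implementation: the word-to-type table is a hand-picked toy sample based on class names and members shown elsewhere in these slides, and `systematic_classes` and `min_words` are hypothetical names.

```python
from collections import defaultdict

# Toy sample: each word mapped to the set of basic types of its senses.
WORD_BASIC_TYPES = {
    "book": {"artifact", "communication"},
    "novel": {"artifact", "communication"},
    "pamphlet": {"artifact", "communication"},
    "leopard": {"animal", "natural_object"},
    "ermine": {"animal", "natural_object"},
    "almond": {"natural_object", "plant"},
}


def systematic_classes(word_types, min_words=3):
    """Group words by identical sense distribution.

    A distribution shared by more than two words (min_words=3) is kept as a
    candidate systematic polysemous class.
    """
    groups = defaultdict(list)
    for word, types in word_types.items():
        groups[" ".join(sorted(types))].append(word)
    return {name: sorted(words) for name, words in groups.items()
            if len(words) >= min_words}


print(systematic_classes(WORD_BASIC_TYPES))
```

Lowering `min_words` admits smaller groups, at the risk of treating accidental overlaps as systematic patterns.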

  21. Systematic Polysemous Classes • Senses of book • 1. {publication} => artifact • 2. {product, production} => artifact • 3. {fact} => communication • 4. {dramatic_composition, dramatic_work} => communication • 5. {record} => communication • 6. {section, subdivision} => communication • 7. {journal} => artifact • Systematic Polysemous Class “artifact communication”: amulet annals armband arrow article ballad bauble beacon bible birdcall blank blinker boilerplate book bunk cachet canto catalog catalogue chart chevron clout compact compendium convertible copperplate copy cordon corker ... guillotine homophony horoscope indicator journal laurels lay ledger loophole marker memorial nonsense novel obbligato obelisk obligato overture pamphlet pastoral paternoster pedal pennant phrase platform portrait prescription print puzzle radiogram rasp recap riddle rondeau … statement stave stripe talisman taw text tocsin token transcription trophy trumpery wand well whistle wire wrapper yardstick

  22. From WordNet to CoreLex • [Figure: Noun 1 … Noun n are mapped to Basic Types, which combine into Systematic Polysemous Class 1 … Systematic Polysemous Class n]

  23. Other Examples • “animal natural_object”: alligator broadtail chamois ermine lapin leopard muskrat ... • “natural_object plant”: algarroba almond anise baneberry butternut candlenut cardamon ... • “action artifact group_social”: artillery assembly band church concourse dance gathering institution ... • “action attribute event psychological”: appearance concentration decision deviation difference impulse outrage … • “possession quantity_definite”: cent centime dividend gross penny real shilling

  24. CoreLex vs. WordNet

  25. Representation and Interpretation • “Dotted Types” (Pustejovsky) • Lexical types are either simple (human, artifact, ...) or complex (information AND physical_object) • Complex types can be represented with a “dotted type”, e.g. information · physical_object • In (Cooper 2005) a dotted type is interpreted as a record type (example: a delicious lunch can take forever)

  26. Related Work • Apresjan 1973 • Regular Polysemy. • Nunberg & Zaenen 1992 • Systematic polysemy in lexicology and lexicography. • Bill Dolan 1994 • Word Sense Ambiguation: Clustering Related Senses. • Copestake & Briscoe 1996 • Semi-productive polysemy and sense extension. • Peters, Peters & Vossen 1998 • Automatic Sense Clustering in EuroWordNet. • Tomuro 1998 • Semi-Automatic Induction of Systematic Polysemy from WordNet.

  27. Words and Meanings • Lexical Semantics in WordNet • Generative Lexicon and CoreLex • Tuning WordNet to a Domain

  28. Reducing Ambiguity • WordNet has too many senses … • Reduce Ambiguity • Cluster related senses (CoreLex) • Tune WordNet to an application domain

  29. Domains and Senses Domains determine Sense Selection, e.g. • English: cell • prison cell in the Politics/Law domain • living cell in the Biomedical domain • English: tissue • living tissue in the Biomedical domain • cloth in the Fashion domain • German: Probe • test in the Biomedical domain • rehearsal in the Theater domain >> Compute Domain-Specific Sense

  30. Approaches • Subject Codes • Domain codes are in the dictionary • Topic Signatures • Compute (domain-specific) context models from dictionary definitions, domain corpora, web resources • Tuning of WordNet to a domain • Top Down: Cucchiarelli & Velardi, 1998 • Bottom Up: Buitelaar & Sacaleanu, 2001 • Related recent work: McCarthy et al, 2004; Chan & Ng, 2005; Mohammad & Hirst, 2006

  31. Subject Codes • Subject Codes (as used in LDOCE) indicate a domain in which a word is used in a particular sense • Examples (2600 codes) • Sub-Field Codes • MDZP (Medicine:Physiology) • Code Combinations • MLCO (Meteorology+Building) e.g. lightning conductor • MLUF (Meteorology+Europe+France) e.g. Mistral

  32. Adding Subject Codes to WordNet • Grouping Synsets together across POS • MEDICINE • Nouns: doctor#1, hospital#1 • Verbs: operate#7 • Grouping Synsets together across Sub-Hierarchies • SPORT • life_form#1: athlete#1 • physical_object#1: game_equipment#1 • act#2: sport#1 • location#1: playing_field#1 • Magnini B. & Cavaglià G. Integrating Subject Field Codes into WordNet. In: Proceedings LREC 2000

  33. WordNet DOMAINS Bernardo Magnini, Carlo Strapparava, Giovanni Pezzuli, and Alfio Gliozzo. Using domain information for word sense disambiguation. In: Proceedings of the SENSEVAL2 workshop 2001.

  34. WSD with Subject Codes • Match the set of words in the context of the ambiguous word against the sets of words (“neighborhoods”) in the definitions and sample sentences of all senses that share a Subject Code • Example neighborhoods for bank: Economics; Medicine and Biology • Guthrie J. A. & Guthrie I. & Wilks Y. & Aidinejad H. Subject Dependent Co-Occurrence and Word Sense Disambiguation. In: Proceedings of ACL 1991.
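
The matching step can be sketched as a set-overlap count. This is an illustration under assumptions, not the method of the cited paper: the neighborhood words below are invented for the example (real neighborhoods would be collected from dictionary definitions and sample sentences), and `disambiguate` is a hypothetical name.

```python
# Invented neighborhoods for two subject codes of "bank" (illustration only).
NEIGHBORHOODS = {
    "Economics": {"money", "account", "deposit", "loan", "interest"},
    "Medicine and Biology": {"blood", "organ", "donor", "tissue", "supply"},
}


def disambiguate(context_words, neighborhoods):
    """Pick the subject code whose neighborhood overlaps the context most."""
    context = {word.lower() for word in context_words}
    return max(neighborhoods, key=lambda code: len(context & neighborhoods[code]))


print(disambiguate("he opened an account to deposit money".split(), NEIGHBORHOODS))
print(disambiguate("the blood bank needs a donor".split(), NEIGHBORHOODS))
```

With equal overlaps `max` returns the first code in the table, so a real system would need an explicit tie-breaking rule.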

  35. Topic Signatures from the Web • Construct Topic Signatures for WordNet synsets/senses • Retrieve document collections from the web and use queries constructed for each WordNet sense, e.g. ( boy AND ( altar boy OR ball boy OR … OR male person ) AND NOT (man OR … OR broth of a boy OR son OR … OR mama’s boy OR black ) ) Agirre E. & Ansa O. & Hovy E. & Martinez D. Enriching very large ontologies using the WWW In: Proc. of the Ontology Learning Workshop ECAI 2000
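
The query pattern on this slide can be sketched as a string template. The function name `build_query` and its parameters are hypothetical, and the cue lists below are shortened samples; the full lists on the slide are elided with "…" and are not reconstructed here.

```python
def build_query(word, sense_cues, other_sense_cues):
    """Build a Boolean query for one sense of a word.

    sense_cues: words or phrases tied to the target sense.
    other_sense_cues: words or phrases tied to the word's other senses.
    """
    positive = " OR ".join(sense_cues)
    negative = " OR ".join(other_sense_cues)
    return f"( {word} AND ( {positive} ) AND NOT ( {negative} ) )"


print(build_query("boy", ["altar boy", "ball boy", "male person"], ["man", "son"]))
```

The documents returned for each query then serve as the collection from which a topic signature for that sense is computed.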

  36. Top Down Tuning – Cucchiarelli & Velardi • Automatically find the best set of (WordNet) senses that: • “… represent at best the semantics of the domain” • “[has the] … ‘right’ level of abstraction, so as to mediate between over-ambiguity and generality” • “… [is] balanced …, i.e. words should be evenly distributed among categories” Alessandro Cucchiarelli, Paola Velardi Finding a domain-appropriate sense inventory for semantically tagging a corpus. Natural Language Engineering 4/4, p.325-344, Dec. 1998.

  37. Methods Used • Create alternative sets of balanced categories by use of an adapted version of the Hearst/Schütze algorithm • Apply a scoring function to find the best set, with parameters: • Generality • Highest possible level of generalization with a small number of categories is preferred • Discrimination Power • Different senses lead to different categories • (Domain) Coverage • Words in the domain corpus that are represented by the selected categories • Average Ambiguity • Ambiguity reduction is measured by the inverse of the average ambiguity of all words

  38. Balanced Categories - Hearst/Schütze • Reduce WordNet noun hierarchy to a set of 726 disjoint categories, each consisting of a relatively large number of synsets and of an average size, with as small a variance as possible • Group categories together into a set of 106 super-categories according to mutual co-occurrence in a training corpus • Measure the frequency of categories on domain corpora, e.g. United States Constitution, Genesis • Hearst M. & Schütze H. Customizing a Lexicon to Better Suit a Computational Task. In: Proceedings ACL SIGLEX Workshop 1993

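  39. Generality • Generality of category set $C_i$: $1/DM(C_i)$, where $DM(C_i)$ is the average distance between the categories of $C_i$ and the topmost synsets • Example: $C_i = \{C_{i1}, C_{i2}\}$; $C_{i1}$ reaches topmost synsets at distances 4 and 3, averaging $(4 + 3)/2 = 3.5$; $C_{i2}$ reaches one topmost synset at distance 3, averaging $3/1 = 3$; hence $DM(C_i) = (3.5 + 3)/2 = 3.25$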
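
The worked example can be checked with a few lines of code. This sketch assumes each category comes with a list of its distances to the topmost synsets; the names `dm`, `generality` and `EXAMPLE` are hypothetical.

```python
def mean(values):
    return sum(values) / len(values)


def dm(category_distances):
    """Average, over categories, of each category's mean distance to the top."""
    return mean([mean(distances) for distances in category_distances.values()])


def generality(category_distances):
    """Generality of a category set: the inverse of DM."""
    return 1.0 / dm(category_distances)


# Distances from the slide: Ci1 has paths of length 4 and 3, Ci2 one of length 3.
EXAMPLE = {"Ci1": [4, 3], "Ci2": [3]}

print(dm(EXAMPLE))          # 3.25, as on the slide
print(generality(EXAMPLE))
```

Category sets that sit closer to the top of the hierarchy have a smaller DM and therefore a higher generality score.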

  40. Discrimination Power • Discrimination power of category set $C_i$: $(N_c(C_i) - N_{pc}(C_i)) / N_c(C_i)$, where $N_c(C_i)$ is the number of words that reach at least one category of $C_i$ and $N_{pc}(C_i)$ is the number of words that have at least two senses that reach the same category $c_{ij}$ of $C_i$ • [Figure: domain words w1, w2, w3 linked through their senses and general synsets to the categories of $C_i = \{C_{i1}, C_{i2}, C_{i3}, C_{i4}\}$]
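
A minimal sketch of the discrimination power formula, assuming each word is given as a list of senses and each sense as the set of categories it reaches. The example assignment of w1, w2 and w3 to categories is invented for illustration, since the slide's figure is not available; `discrimination_power` is a hypothetical name.

```python
from collections import Counter


def discrimination_power(word_sense_categories):
    """(Nc - Npc) / Nc for a mapping word -> list of per-sense category sets."""
    nc = 0   # words that reach at least one category
    npc = 0  # words with two or more senses reaching the same category
    for sense_categories in word_sense_categories.values():
        counts = Counter(c for cats in sense_categories for c in set(cats))
        if counts:
            nc += 1
            if max(counts.values()) >= 2:
                npc += 1
    return (nc - npc) / nc if nc else 0.0


# Invented example: w2 has two senses that collapse into the same category.
EXAMPLE = {
    "w1": [{"Ci1"}, {"Ci2"}],
    "w2": [{"Ci3"}, {"Ci3"}],
    "w3": [{"Ci4"}, set()],
}

print(discrimination_power(EXAMPLE))
```

A score of 1 means no word has two senses landing in the same category, so the category set keeps all sense distinctions apart.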

  41. Coverage & Average Ambiguity • Coverage of category set $C_i$: $N_c(C_i)/W$, where $N_c(C_i)$ is the number of words that reach at least one category in $C_i$ • Inverse of average ambiguity of category set $C_i$: $1/A(C_i)$, where, for each word $w$ among the $N_c(C_i)$ words that reach at least one category in $C_i$, $C_{wj}(C_i)$ is the number of categories in $C_i$ reached by $w$
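
Both measures can be sketched from a word-to-categories table. Two points are assumptions made here because the slide does not spell them out: $W$ is taken to be the total number of words in the domain vocabulary, and $A(C_i)$ is taken to be the mean number of categories reached per covered word. The function names and the example data are hypothetical.

```python
def coverage(word_categories, vocabulary_size):
    """Nc / W, with W assumed to be the size of the domain vocabulary."""
    nc = sum(1 for cats in word_categories.values() if cats)
    return nc / vocabulary_size


def inverse_average_ambiguity(word_categories):
    """1 / A, with A assumed to be the mean number of categories per covered word."""
    reached = [len(cats) for cats in word_categories.values() if cats]
    if not reached:
        return 0.0
    return 1.0 / (sum(reached) / len(reached))


# Invented example: two of four vocabulary words reach a category.
EXAMPLE = {"stock": {"c1", "c2"}, "bond": {"c1"}, "xyz": set()}

print(coverage(EXAMPLE, vocabulary_size=4))
print(inverse_average_ambiguity(EXAMPLE))
```

Taking the inverse means that a category set leaving fewer categories per word scores higher, in line with the goal of reducing ambiguity.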

  42. Best Category Set (WSJ) Top Down categories for the financial domain, based on the Wall Street Journal

  43. Sense Selection with WSJ Set • Senses for stock kept by domain tuning on the Wall Street Journal • Senses for stock discarded by domain tuning on the Wall Street Journal

  44. Bottom Up Tuning – Buitelaar & Sacaleanu • Ranking of WordNet synsets according to a domain-specific corpus • Compute term relevance against reference corpus • Compute synset relevance according to term relevance (where term = synonym in synset) • Ranking can be used in WSD (similar to usage of ‘most frequent heuristic’) Paul Buitelaar, Bogdan Sacaleanu Ranking and Selecting Synsets by Domain Relevance In: Proceedings of WordNet and Other Lexical Resources: Applications, Extensions and Customizations, NAACL 2001 Workshop, June 3/4 2001

  45. TFIDF • A word is more important if it appears several times in a target document • A word is more important if it appears in fewer documents • tf(w): term frequency (number of word occurrences in a document) • df(w): document frequency (number of documents containing the word) • N: number of all documents • tfIdf(w): relative importance of the word in the document
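
The slide's formula did not survive in the transcript, so the sketch below assumes the common variant tfIdf(w) = tf(w) · log(N / df(w)). The toy corpus and the function name `tf_idf` are hypothetical.

```python
import math


def tf_idf(word, document, corpus):
    """tf(w) * log(N / df(w)); this weighting variant is an assumption."""
    tf = document.count(word)                        # occurrences in the document
    df = sum(1 for doc in corpus if word in doc)     # documents containing the word
    if tf == 0 or df == 0:
        return 0.0
    return tf * math.log(len(corpus) / df)


# Toy corpus of three tokenized documents.
DOCS = [
    ["cell", "tissue", "cell", "the"],
    ["the", "prison", "law"],
    ["the", "stage", "rehearsal"],
]

print(tf_idf("cell", DOCS[0], DOCS))  # frequent here, rare elsewhere
print(tf_idf("the", DOCS[0], DOCS))   # occurs in every document
```

A word that occurs in every document gets a weight of zero, which reflects the slide's second point about document frequency.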

  46. Term and Synset Relevance • Term Relevance • Relevance Score of Synset Members where t represents the term, d the domain, N is the total number of domains • Synset Relevance • Cumulated Relevance Score for a Synset
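
The formulas on this slide are missing from the transcript. The sketch below assumes a tf-idf style score over domains, relevance(t, d) = log(tf(t, d)) · log(N / df(t)), and sums it over synset members for the cumulated synset score. That form is an assumption, not a quotation of the paper; the frequency counts and all names are invented for illustration.

```python
import math

# Invented term counts per domain corpus (illustration only).
TERM_FREQ = {
    "medical": {"Zelle": 50, "Körperzelle": 10},
    "law": {"Gefängniszelle": 20},
    "theater": {"Probe": 30},
}


def term_relevance(term, domain, term_freq):
    """log(tf) * log(N / df) over domains; the exact form is an assumption."""
    tf = term_freq.get(domain, {}).get(term, 0)
    if tf == 0:
        return 0.0
    df = sum(1 for counts in term_freq.values() if counts.get(term, 0) > 0)
    return math.log(tf) * math.log(len(term_freq) / df)


def synset_relevance(synset, domain, term_freq):
    """Cumulated relevance of all synonyms in the synset."""
    return sum(term_relevance(term, domain, term_freq) for term in synset)


PRISON_CELL = ["Gefängniszelle", "Zelle"]
LIVING_CELL = ["Zelle"]

for domain in ("medical", "law"):
    print(domain,
          synset_relevance(PRISON_CELL, domain, TERM_FREQ),
          synset_relevance(LIVING_CELL, domain, TERM_FREQ))
```

With these counts the two synsets tie in the medical domain, because only the shared term Zelle occurs there. A plain cumulated score cannot separate them in that case.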

  47. Extended Synset Relevance • Lexical Coverage • Take the length of the synset into account: [Gefängniszelle, Zelle] ("prison cell") vs. [Zelle] ("living cell") • Hyponyms • Take hyponyms into account: [Zelle, Gefängniszelle, Todeszelle] vs. [Zelle, Körperzelle, Pflanzenzelle]
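
One way to picture the two extensions is to add hyponym terms to the sum and to scale the result by the share of terms that actually occur in the domain. This weighting is an assumption made for illustration and is not taken from the paper; the relevance scores are invented and `extended_synset_relevance` is a hypothetical name.

```python
def extended_synset_relevance(synset, hyponyms, relevance):
    """Sum term relevance over synset members and hyponyms, scaled by coverage.

    relevance: precomputed term -> relevance score for one domain.
    Coverage here is the fraction of terms with a non-zero score (assumption).
    """
    terms = list(dict.fromkeys(list(synset) + list(hyponyms)))  # drop duplicates
    if not terms:
        return 0.0
    total = sum(relevance.get(term, 0.0) for term in terms)
    covered = sum(1 for term in terms if relevance.get(term, 0.0) > 0)
    return total * covered / len(terms)


# Invented relevance scores for a medical domain corpus.
MEDICAL = {"Zelle": 1.6, "Körperzelle": 2.5, "Pflanzenzelle": 1.1}

prison = extended_synset_relevance(["Gefängniszelle", "Zelle"], ["Todeszelle"], MEDICAL)
living = extended_synset_relevance(["Zelle"], ["Körperzelle", "Pflanzenzelle"], MEDICAL)
print(prison, living)
```

Here the "living cell" synset wins in the medical domain because its hyponyms occur in the corpus, while the extra terms of the "prison cell" synset do not.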

  48. Experiment – Medical Domain

  49. Related Recent Work • Diana McCarthy, Rob Koeling, Julie Weeds, and John Carroll • Finding predominant senses in untagged text. In Proc. of ACL 2004. • Chan, Yee Seng and Ng, Hwee Tou (2005) • Word Sense Disambiguation with Distribution Estimation. Proc. of IJCAI 2005. • Mohammad, Saif and Hirst, Graeme. • Determining word sense dominance using a thesaurus. Proc. of EACL 2006.

  50. Day 2 - Introduction • Words and Object Descriptions • Semantics on the Semantic Web • Semantic Web, Ontologies and Natural Language Processing • The Lexical Semantic Web • Knowledge Representation as Word Meaning • A Lexicon Model for Ontologies • Enriching Ontologies with Linguistic Information
