
Definition Clustering, Sense Naming & Lexical Augmentation

Mathieu LAFOURCADE (lafourcade@lirmm.fr), Fabien JALABERT (jalabert@lirmm.fr)


Presentation Transcript


  1. Mathieu LAFOURCADE (lafourcade@lirmm.fr), Fabien JALABERT (jalabert@lirmm.fr): Definition Clustering, Sense Naming & Lexical Augmentation

  2. Study context 1/2
      • Natural Language Processing
      • Lexical Semantics
        - WSD (word sense disambiguation)
        - Document indexing
      • Dictionary construction and vectorization
        → problem: extracting the definition meta-language
        - example: ‘cannibale’ = ‘qui mange l’Homme en parlant de l’Homme’ (‘one who eats humans, said of a human’)
        - themes: homme (human), manger (to eat), rhétorique (rhetoric)
      • Multi-source approach → noise reduction
        - problem: the atomic element is a definition, not a sense
      • Objectives
        - clustering definitions to obtain senses
        - naming these senses

  3. Study context 2/2
      [Figure: for a term T, the definitions taken from Sources 1 to 3 (the multi-source base) are clustered into senses (Sense 1, Sense 2, Sense 3). Each sense then receives a name (Sense 1 – Name, Sense 2 – Name, …), giving the ‘acception’ or sense base, which is re-injected as a new lexical source.]

  4. Summary
      • Model, Construction, Organization
      • Definition Clustering
      • Sense Naming
      • Lexical Augmentation
      • Results

  5. Conceptual Vector Model 1/2 [Salton; Deerwester]
      • An idea = a vector
      • A vector component = a primitive as defined in a thesaurus
      • Thesaurus Larousse: 873 concepts
      • Concepts are inter-related → generator space
      • A definition → a vector [Chauché; Lafourcade]
      Most activated primitives for ‘frégate’ (frigate): (oiseau 6134) (transports maritimes et fluviaux 5644) (arme 4891) …

  6. Conceptual Vector Model 2/2
      Thematic distance = angle between two vectors x and y
      • Terms thematically close to ‘frégate’: (destroyer 0.2246) (youyou 0.2267) (voilier 0.2268) (contre-torpilleur 0.2274) (chlamydère 0.2276) (oiseau-jardinier 0.2295) (trois-mâts 0.233) …
      • Terms thematically close to ‘frégate/oiseau/’: (oiseau-jardinier 0.1237) (plumeur 0.1319) (goglu 0.136) (travailleur 0.136) (chlamydère 0.1385) (penne 0.141) (Galliformes 0.1422) (agami 0.1428) …
      • Terms thematically close to ‘frégate/bateau/’: (démâtage 0.1604) (dégréer 0.1676) (naval 0.1718) (bateau-piège 0.1774) (bateau-vanne 0.1821) (batelet 0.1824) …
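
The thematic distance of this slide can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `angular_distance` is ours, the vectors have three components instead of the 873 thesaurus concepts, and the `boat` and `bird` vectors are invented toy data. Only the three ‘frégate’ activations come from the previous slide.

```python
import math

def angular_distance(x, y):
    """Angle (in radians) between two conceptual vectors of equal length."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("the angle is undefined for a null vector")
    # Clamp to [-1, 1] to absorb floating-point rounding before acos.
    cosine = max(-1.0, min(1.0, dot / (norm_x * norm_y)))
    return math.acos(cosine)

# Components: (oiseau, transports maritimes et fluviaux, arme).
fregate = [6134, 5644, 4891]   # activations quoted on the slide
boat = [0, 9000, 3000]         # invented toy vector
bird = [9000, 0, 0]            # invented toy vector

closer_to_boat = angular_distance(fregate, boat) < angular_distance(fregate, bird)
```

With these toy values the undifferentiated ‘frégate’ vector lies nearer the boat vector than the bird vector, which mirrors the mixed neighbour list shown for the unannotated term.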

  7. Definition Vector Computation (SYGMART) [Chauché]
      [Figure: SYGMART parse tree with numbered nodes. An ambiguous root node (PHAMBG) dominates two sentence analyses (PH), each split into GN, GV and final punctuation. Leaf lemmas shown: le, petit, briser, brise, glace, glacer.]

  8. Multi-Agent Organization [Lecerf; Schwab]
      • Double loop
      • Learning agents: Sygmart, computation of vectors from definitions, synonymy, antonymy, …
      • Endogenous loop: inside an agent
      • Exogenous loop: with the other agents (society)

  9. Clustering
      Objective: grouping definitions into senses

  10. Clustering 1/5: Strategy
      • Deep analysis, several criteria
      • No training (but enhancement through the exogenous loop)
      • Frontier between senses and definitions
      • Centroid approach
      • Heuristics (preferences)
        - number of clusters = maximum number of definitions found in a dictionary
        - two definitions from the same source → two different clusters

  11. Clustering 2/5: Difficulty
      Example: ‘botte’

  12. Clustering 3/5: Algorithm 1/2
      • Source-by-source iteration
      • until a minimum-value distribution is obtained
        → minimum-value assignment of each source's definitions to clusters
      • From a distance matrix: Hungarian method, O(n^3) [Kuhn; Ford, Fulkerson]
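
The assignment step named on this slide can be sketched as follows. This is a textbook O(n^3) formulation of the Hungarian method (row and column potentials with augmenting paths), not the authors' code. The function name `hungarian`, the reading of rows as the definitions of one source and columns as clusters, and the toy distance matrix are all assumptions made for illustration.

```python
def hungarian(cost):
    """Minimum-cost assignment of each row to a distinct column.

    cost is an n x m matrix (list of lists) with n <= m.
    Returns a list a where row i is assigned to column a[i].
    """
    n, m = len(cost), len(cost[0])
    inf = float("inf")
    u = [0.0] * (n + 1)      # row potentials
    v = [0.0] * (m + 1)      # column potentials
    match = [0] * (m + 1)    # match[j] = row assigned to column j (0 = free)
    way = [0] * (m + 1)      # previous column on the augmenting path
    for i in range(1, n + 1):
        match[0] = i
        j0 = 0
        minv = [inf] * (m + 1)
        used = [False] * (m + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = match[j0], inf, 0
            for j in range(1, m + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(m + 1):
                if used[j]:
                    u[match[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if match[j0] == 0:   # reached a free column: path found
                break
        while j0 != 0:           # flip the matching along the path
            j1 = way[j0]
            match[j0] = match[j1]
            j0 = j1
    assignment = [None] * n
    for j in range(1, m + 1):
        if match[j]:
            assignment[match[j] - 1] = j - 1
    return assignment

# Toy distances: 3 definitions of one source (rows) vs. 3 clusters (columns).
distances = [[0.2, 0.9, 0.7],
             [0.8, 0.3, 0.6],
             [0.5, 0.4, 0.1]]
best = hungarian(distances)
```

Because each row receives a distinct column, two definitions of the same source can never land in the same cluster, which matches the heuristic stated on the strategy slide. For production use, `scipy.optimize.linear_sum_assignment` solves the same problem.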

  13. Clustering 4/5: Algorithm 2/2
      • For each criterion
        - one evaluation
        - one distance matrix
      • Criteria
        - Comparing the lexical contents of definitions (with term frequency, co-occurrences, etc.)
        - Angular distance
        - Symbolic markers
          · morphology
          · etymology (‘avocat’: ‘ahuacatl’ / ‘advocatus’)
          · use (‘vieux’, ‘ancien’, ‘poétique’, …)
          · language level (‘argot’, ‘familier’, …)
          · domain (‘médecine’, ‘zoologie’, …)

  14. Clustering 5/5: Results
      • Correct results in many cases: 90 % for nouns, 70 % for verbs; adjectives remain to be evaluated
      • Problems with very strong polysemy
        → vagueness, continuity in meanings
        → support verbs: ‘prendre’, …
      • Increasing the number of clusters is under study (example: ‘botte’)
      • Next step: we would like to designate meanings

  15. Sense Naming
      Objective: to give the system some capacity to “talk about a sense”

  16. Sense Naming 1/10: Properties
      • Dictionary independent
      • Interface (man-system & system-system)
      • A new lexical source → looping :-)
      • Semantic annotation
        - La frégate/vaisseau/ naviguait à travers les océans (The frigate/vessel/ sailed across the oceans)
        - La frégate/oiseau/ planait à travers les nues en poussant son cri incomparable (The frigate/bird/ glided through the clouds, uttering its incomparable cry)

  17. Sense Naming 2/10: Procedure
      • Extraction
      • Validation and dispatching of polysem bags → bijection
      • Evaluation of candidates: ordering them and extracting the most appropriate ones

  18. Sense Naming 3/10: Extraction
      • Extraction attached to a meaning
        - Morpho-syntactic analysis of the definition
        - Extraction of markers: « anc. », « méd. », …
        - Extraction from unstructured or semi-structured data (XML, …)
        ‘frégate’ : [nf] [ancien] Au XVe s., grande barque demi-pontée gréant deux voiles latines sur antenne et assurant la liaison entre les ports et les escadres de galère. [Club Internet]
      • Extraction from polysem bags
        - Word list (such as the synonym list of Université de Caen) [Ploux, Victorri]
        ex: ‘botte’ = chaussure, bottillon, coup, attaque, amas, bouquet, …

  19. Sense Naming 4/10: Validation
      Bijection → being able to re-associate the proper meaning
        ƒ : (term, sense) → (term, annotation)
        ƒ⁻¹ : (term, annotation) → (term, sense)
      • A candidate associated with a sense should be closer to its own sense than to any other
      • Unattached candidates are associated with the closest meaning
      • A candidate should not be present in a concurrent definition
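
The three validation rules can be sketched as a filter over candidate names. This is a minimal illustration, not the system's implementation: the function names, the data layout (a candidate is a triple of word, vector and attached sense, with `None` for unattached candidates), the choice to apply the third rule to every candidate, and the 2-component toy vectors are all assumptions.

```python
import math

def angle(x, y):
    """Angular distance between two non-null vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norms = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return math.acos(max(-1.0, min(1.0, dot / norms)))

def validate_candidates(candidates, sense_vectors, sense_words):
    """Return {sense: [accepted candidate words]}.

    candidates    : list of (word, vector, sense or None)
    sense_vectors : {sense: conceptual vector of the sense}
    sense_words   : {sense: set of words used in that sense's definitions}
    """
    accepted = {sense: [] for sense in sense_vectors}
    for word, vector, sense in candidates:
        closest = min(sense_vectors, key=lambda s: angle(vector, sense_vectors[s]))
        if sense is None:
            sense = closest        # rule 2: dispatch unattached candidates
        elif sense != closest:
            continue               # rule 1: closer to another sense, reject
        # Rule 3: reject a word that also occurs in a concurrent definition.
        if any(word in words for other, words in sense_words.items() if other != sense):
            continue
        accepted[sense].append(word)
    return accepted

# Invented toy data: two senses on two axes (bird, ship).
senses = {"oiseau": [1.0, 0.0], "navire": [0.0, 1.0]}
words = {"oiseau": {"oiseau", "mer"}, "navire": {"navire", "mer"}}
cands = [
    ("oiseau", [0.9, 0.1], "oiseau"),   # kept: closest to its own sense
    ("navire", [0.2, 0.8], None),       # unattached: dispatched to 'navire'
    ("mer", [0.1, 0.9], "navire"),      # rejected: appears in both definitions
    ("voile", [0.8, 0.2], "navire"),    # rejected: closer to 'oiseau'
]
result = validate_candidates(cands, senses, words)
```

After this filtering every retained annotation points back to exactly one sense, which is what makes the inverse mapping ƒ⁻¹ usable.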

  20. Sense Naming 5/10: Evaluation
      • Extraction grade
      • Evaluating the capacity to disambiguate (to distinguish a sense from all others)
      • Evaluating the capacity to associate
      • Cognitive cost reduction [Prince]

  21. Sense Naming 6/10: Extraction grade
      ‘frégate’ : [nf] [ancien] Au XVe s., grande barque demi-pontée gréant deux voiles latines sur antenne et assurant la liaison entre les ports et les escadres de galère. [Club Internet]
      [Figure: syntactic analysis of this definition, with nodes GV, Sujet, COD and CC over the segments ‘au XVe’, ‘grande barque demi-pontée’, ‘gréant’, ‘deux voiles latines’, ‘sur antenne’, …]

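  22. Sense Naming 7/10: Disambiguation capacity 1/2
      [Figure: distances between the senses of the term ‘frégate’ (nodes t.11, t.12) and the senses of the candidate ‘vaisseau’ (nodes w.1, w.2, w.3). Sense labels shown: (navire), (oiseau), (navire ancien), (sanguin), (navire moderne). Edge values shown: 0.85, 0.8, 0.95, 1.2, and d1 = 0.3, d2 = 0.4, d3 = 0.2.]
      • Absolute margin: Ma = |d1 - d2| = 0.1
      • Relative margin: Mr = Ma / d1 = 0.1 / 0.3 = 0.33
      • Risk of ‘non-sens’ (nonsense): Rns = d3 / Mr = 0.2 / 0.33 = 0.6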

  23. Sense Naming 8/10: Disambiguation capacity 2/2
      [Figure: the ‘vaisseau’ / ‘frégate’ graph of the previous slide next to a second graph for the candidate ‘voilier’ and the term ‘frégate’. Sense labels shown across the two graphs: (oiseau), (navire), (navire ancien), (navire moderne), (sanguin). Remaining edge values shown: 0.7, 0.72, 0.72, 0.3.]
      • ‘vaisseau’: d1 = 0.3, d2 = 0.4, d3 = 0.2; Ma = |d1 - d2| = 0.1; Mr = 0.1 / 0.3 = 0.33; Rns = 0.2 / 0.33 = 0.6
      • ‘voilier’: d1 = 0.25, d2 = 0.29, d3 = 0.65; Ma = |d1 - d2| = 0.04; Mr = 0.04 / 0.25 = 0.16; Rns = 0.65 / 0.16 ≈ 4
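
The arithmetic of the two disambiguation slides can be checked with a short sketch. The function name `disambiguation_scores` is ours, and we assume the absolute margin is the absolute difference of d1 and d2, since the slides report a positive value. What d1, d2 and d3 measure is taken from the figures and not redefined here.

```python
def disambiguation_scores(d1, d2, d3):
    """Return (absolute margin, relative margin, risk of 'non-sens')."""
    absolute_margin = abs(d1 - d2)           # Ma (assumed absolute difference)
    relative_margin = absolute_margin / d1   # Mr = Ma / d1
    risk = d3 / relative_margin              # Rns = d3 / Mr
    return absolute_margin, relative_margin, risk

# Values read from the slides.
vaisseau = disambiguation_scores(0.3, 0.4, 0.2)
voilier = disambiguation_scores(0.25, 0.29, 0.65)
```

For ‘voilier’ the unrounded risk is slightly above 4. The slide's figure of 4 is what you get when the intermediate values are rounded first. On these numbers ‘vaisseau’ carries a much lower risk of ‘non-sens’ than ‘voilier’ as a name for the ship sense of ‘frégate’.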

  24. Sense Naming 9/10: Cognitive cost survey [Church; Daille; Véronis]
      Done for 13 terms totalling 38 definitions → 134 answers
      • collocations (botte de paille, …)
      • co-occurrences (Tintin → Milou)
      • synonyms and hypernyms (manger → se nourrir, mouche → insecte → animal)
      • domain / context for technical terms (médecine, architecture, agriculture, sport, …)

  25. Sense Naming 10/10: Results [Mel’cuk; Schwab]
      • the multi-criteria approach seems well suited
        - easily extensible
        - high precision
      • enhancement needed for
        - meta-language processing
        - criteria implementation (associative memory, lexical functions)
        - synthesis grammar (botte/secret/ vs. botte/secrète/)
      • Example: ‘botte’
      • Useful for multilingual lexical databases

  26. Lexical Augmentation
      Multilingual lexical database: some terms are not lexicalized in some languages
      Objective: lexicalize these terms

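  27. Lexical Augmentation 1/2: Papillon project [Boitet; Mangeot-Lerebours; Sérasset; Lepage]
      [Figure: three columns, ENGLISH, ACCEPTIONS and FRANÇAIS, with English and French terms linked through acceptions. Entries shown on the English and acception side: giblets, offal, offal.1, offal.2, beef offal, pork offal, refuse, scrap. Entries shown on the French side: abats, abats de volaille, abats de bœuf, abats de porc, déchet.]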

  28. Lexical Augmentation 2/2: Procedure
      • Extraction from definitions and sense names (dictionary glosses)
        - abats = {‘porc’, ‘volaille’, ‘bœuf’, …}
      • Patterns
        - ‘abats de volaille’, ‘abats en volaille’, …
      • Pattern validation with co-occurrences
        - relative number of hits in Google
      • Difficulties
        - ‘dog meat’ → ‘viande pour chien’ / ‘viande de chien’?
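
The pattern step of this procedure can be sketched as follows. This is an illustration under assumptions: both function names are hypothetical, the preposition list simply collects the ones visible in the slide's examples (‘de’, ‘en’, ‘pour’), and the hit counts are invented numbers passed in as a dictionary. No search-engine API is called.

```python
def candidate_patterns(head, modifier, prepositions=("de", "en", "pour")):
    """Build candidate compounds such as 'abats de volaille'."""
    return [f"{head} {prep} {modifier}" for prep in prepositions]

def rank_by_relative_hits(patterns, hit_counts):
    """Rank patterns by their share of the total number of hits.

    hit_counts maps a pattern to a raw hit count obtained elsewhere.
    Returns (pattern, share) pairs sorted by decreasing share,
    or an empty list when no pattern has any hit.
    """
    total = sum(hit_counts.get(p, 0) for p in patterns)
    if total == 0:
        return []
    scored = [(p, hit_counts.get(p, 0) / total) for p in patterns]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

patterns = candidate_patterns("abats", "volaille")
invented_hits = {"abats de volaille": 900, "abats en volaille": 100}
ranking = rank_by_relative_hits(patterns, invented_hits)
```

Relative frequency alone does not settle the ‘dog meat’ difficulty on the slide: ‘viande pour chien’ and ‘viande de chien’ are both well-formed but name different things, so a frequent pattern is not necessarily the right translation.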

  29. Conclusion
      • Clustering
        - promising results
        - manual evaluation on 100 difficult terms: 70 % of proper clusters, 30 % of incorrect assignments → locutions
        - problem: increasing the number of clusters → maturing of the basic clusters
      • Sense Naming → complementary with conceptual vectors
        - good precision
        - manual evaluation: 90 % of pertinent terms
        - automatic evaluation: 70 % (angular distance)
      • Towards a synthesis grammar
        - botte/secret/ → botte/secrète/
      • Future work
        - more criteria (associative memory, more lexical functions)
        - enhance definition analysis (meta-language)

  30. Contribution
      • Theoretical
        - formalization of the ‘disambiguation capacity’ and of the ‘risk of nonsense’
        - formalization of annotation in lexical semantics
        - proposal of a generic similarity measure between definitions
      • Practical
        - implementation as agents
        - clustering, naming (Web services)
        - lexical augmentation (in progress)
      • Dissemination
        - a poster at RECITAL’2003 (Batz-sur-Mer, 10-14 June 2003)
        - a paper at Papillon’2003 (Sapporo, 2-6 July 2003)
        - submission to RFIA’2004

  31. Thank you
