Arabic Wordnet as a free resource: past, present and the future


Presentation Transcript


1. Citala 2009. Arabic Wordnet as a free resource: past, present and the future. Horacio Rodríguez

2. Index of the talk: Introduction; Ontologies; Wordnets; Building wordnets; Arabic WordNet; Semi-automatic extensions of AWN; Linking AWN with complementary resources.

3. Introduction. Semantic components used in NLP applications: ontologies and large-scale knowledge bases. Need (or convenience) of developing wide-coverage, domain-independent lexico-conceptual ontologies: WordNet.

4. Ontologies. Ontologies have recently become a core resource for many knowledge-based applications: Semantic Web, e-commerce, Information Retrieval, Information Integration, NLP.

5. Ontologies. Ontologies represent static domain knowledge, allowing efficient use by multiple knowledge agents. Acquiring domain knowledge for building ontologies is highly costly and time-consuming. For this reason, many methods and techniques have been developed to reduce this effort.

6. Ontologies. What an ontology is: (Gruber, 1993) an ontology is an explicit specification of a conceptualization; (Studer et al., 1998) an ontology is a formal, explicit specification of a shared conceptualization.

7. Ontologies. What an ontology is: A conceptualization is an abstract, simplified view of the world represented for some purpose. An ontology is a description (formal specification) of a set of concepts and relationships for enabling knowledge sharing and reuse (to perform logical commitments). An ontological commitment is an agreement to use a vocabulary in a way that is consistent with the theory specified by the ontology.

8. Ontologies

9. Ontologies. Types of ontologies. By level of generality: top-level, generic and domain ontologies. By level of detail of the domain theory: domain models reduced to a taxonomic hierarchy (shallow, lightweight ontologies); models including relations between concepts; complex models including axioms and constraints (heavyweight ontologies).

10. Ontologies: lexico-conceptual ontologies. Some authors simply reject this term: an ontology is by definition conceptual and thus language-independent (or better, language-neutral). Other authors admit that some conceptualizations differ across languages, thus leading to different ontologies. (Barbu and Barbu-Mititelu, 2005) classify these differences as accidental, systematic and cultural.

11. Ontologies: lexico-conceptual ontologies. Some concepts, for instance, are lexicalized in one language but not in another (i.e. no single word or multiword can be used to refer uniquely to the concept). Splitting a lexico-conceptual ontology into two linked components: the conceptual core (the true ontology) and the lexical coverage (a lexicon).

12. Ontologies

13. Ontologies. The mapping between lexical items (words or multiwords) and concepts can be complex. Due to polysemy, most lexical items can be mapped to more than one concept. Due to synonymy, more than one word can be mapped to the same concept. Usually the mapping is split into two steps: from words to word-senses (i.e. different word meanings) and from word-senses to concepts.
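
The two-step mapping described on this slide can be sketched as a pair of lookup tables. This is only an illustration: the words, sense identifiers and concept names below are invented for the example and do not come from any real wordnet.

```python
# Toy two-step mapping: words -> word-senses -> concepts.
# All identifiers are hypothetical and chosen only for illustration.
WORD_TO_SENSES = {
    "bank": ["bank#1", "bank#2"],      # polysemy: one word, two senses
    "car": ["car#1"],
    "automobile": ["automobile#1"],
}

SENSE_TO_CONCEPT = {
    "bank#1": "FINANCIAL_INSTITUTION",
    "bank#2": "RIVER_SIDE",
    "car#1": "MOTOR_VEHICLE",          # synonymy: two senses, one concept
    "automobile#1": "MOTOR_VEHICLE",
}


def concepts_of(word):
    """Return the concepts reachable from a word through its senses."""
    senses = WORD_TO_SENSES.get(word, [])
    return sorted({SENSE_TO_CONCEPT[s] for s in senses})


def words_of(concept):
    """Return the words that have at least one sense mapped to the concept."""
    return sorted(
        word
        for word, senses in WORD_TO_SENSES.items()
        if any(SENSE_TO_CONCEPT[s] == concept for s in senses)
    )


print(concepts_of("bank"))         # the polysemous word reaches two concepts
print(words_of("MOTOR_VEHICLE"))   # the concept is reached from two words
```

Keeping the sense level explicit is what lets the lexicon and the conceptual core be maintained separately, as slide 11 suggests.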

14. Ontologies. Ontology Management: Ontology Building, including Ontology Learning; Ontology Merging, including Ontology Alignment and Mapping; Ontology Enrichment, including Ontology Refining and Instantiation.

15. Ontologies. Some examples of useful ontologies, domain-restricted: Protege Ontology Library (http://protegewiki.stanford.edu/index.php/Protege_Ontology_Library); OntoSelect (http://olp.dfki.de/ontoselect).

16. Ontologies. Some examples of useful ontologies, domain-independent (generic). Suggested Upper Merged Ontology (SUMO) (Niles and Pease, 2001, 2003): about 1,000 concepts, 4,000 assertions and 600 rules; attached to WN. Cyc (from "encyclopedia") (Guha and Lenat): started in 1984; OpenCyc (40% of Cyc), free; ResearchCyc (80% of Cyc), free for research; 1.6M facts and 120k concepts.

17. Ontologies. Some examples of useful ontologies, domain-independent (generic). ConceptNet (Liu and Singh, 2004), http://web.media.mit.edu/~hugo/conceptnet/ Two versions: concise (200,000 assertions) and full (1.6 million assertions). Commonsense knowledge in ConceptNet encompasses the spatial, physical, social, temporal and psychological aspects of everyday life. It is generated automatically from the 700,000 sentences of the Open Mind Common Sense Project.

18. Ontologies. Building lexico-conceptual ontologies. A substantial part of the CO is language-neutral. If an LO for a source language is available, an LO for a target language can be derived for a subset of the CO, working basically at the lexicon and mapping level and using available bilingual resources as knowledge sources.

19. Ontologies

20. Ontologies

21. Ontologies. Building lexico-conceptual ontologies. This derivation process is far from simple. For an LO, the mapping words → word-senses → concepts is complex (and controversial): (Kilgarriff, 1997) argues against the ontological status of word-senses; (Edmonds and Hirst, 2002) greatly reduce the cases of absolute synonymy and propose, instead, modeling near-synonymy for a fine-grained mapping between words and concepts.

22. Wordnets. Princeton's English WordNet (Miller et al., 1990), (Fellbaum, 1998). Semantic information: more than 123,000 words organised in 117,000 synsets (WN 3.0); more than 235,000 relations between synsets. Freely available: http://wordnet.princeton.edu/

23. Wordnets. Princeton's English WordNet. Lexicalised concepts (words, compounds, multiwords). Synset: synonym set (of words). Large semantic net connecting synsets: synonymy, antonymy, hyperonymy, hyponymy, meronymy, implication, causation, ... Structure: noun hierarchy, depth ~12; verb hierarchy, depth ~3; adjectives/adverbs not in a hierarchy, but in a star structure.

24. Wordnets

25. Wordnets. Beyond WN: EuroWordNet (Vossen, 1998). EU-funded project. Integrated local wordnets in several languages: English (Sheffield), Dutch (Amsterdam), Italian (Pisa), Spanish (UB, UPC, UNED). http://www.hum.uva.nl/~ewn/

26. Wordnets

27. Wordnets. Top Concept Ontology of EuroWordNet. Hierarchy of language-independent concepts. Semantic distinctions: object, place, ... abstract (not lexical). Connected to the ILI. Three types of concepts: first order: entities; second order: static or dynamic situations; third order: abstract propositions. Orthogonal (multiple) assignments.

28. Wordnets. Beyond WN. EWN2: German (GermaNet), French, Czech, Swedish, Estonian. ITEM, CREL: Spanish, Catalan, Basque (UB, UPC). EuroTerm, Jur-Wordnet: extending EWN in particular domains. BalkaNet: extending EWN for the Balkan languages. HowNet: Chinese WN.

29. Wordnets. Macro ontologies based on WN: MCR, YAGO, Omega.

30. Wordnets. MCR (Multilingual Central Repository): Meaning and Know projects.

31. Wordnets

32. Wordnets

33. Wordnets. YAGO: A Core of Semantic Knowledge Unifying WordNet and Wikipedia (Suchanek et al., 2007).

34. Wordnets. Omega Ontology, http://omega.isi.edu (Philpot et al., 2005).

35. Wordnets

36. Wordnets

37. Building wordnets. Merge approach: taxonomy construction (monolingual MRDs, i.e. machine-readable dictionaries); mapping taxonomies (bilingual MRDs). Expand approach: translation of synsets (bilingual MRDs); manual revision.
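
As a rough illustration of the expand approach, the sketch below translates the members of each source synset through a bilingual dictionary and ranks the target words by how many synset members support them, leaving the final decision to manual revision. The synset identifiers, the dictionary entries and the voting scheme are all assumptions made for this example; the slides do not specify how translations are combined.

```python
from collections import Counter

# Hypothetical toy data: English synsets and an English -> Spanish dictionary.
SOURCE_SYNSETS = {
    "car.n.01": ["car", "automobile", "motorcar"],
    "bank.n.01": ["bank", "depository"],
}

BILINGUAL = {
    "car": ["coche", "carro"],
    "automobile": ["coche", "automóvil"],
    "motorcar": ["automóvil"],
    "bank": ["banco", "orilla"],
}


def expand(synsets, bilingual):
    """Propose target-language words for each source synset.

    Each synset member votes once for each of its translations; the
    proposals are returned best first, ready for manual revision.
    """
    proposals = {}
    for synset_id, members in synsets.items():
        votes = Counter()
        for member in members:
            for translation in set(bilingual.get(member, [])):
                votes[translation] += 1
        # Sort by descending votes, then alphabetically for a stable order.
        proposals[synset_id] = sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))
    return proposals


for synset_id, candidates in expand(SOURCE_SYNSETS, BILINGUAL).items():
    print(synset_id, candidates)
```

A translation supported by several members of the same synset is a safer candidate than one coming from a single polysemous member such as "bank".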

38. Building wordnets: EWN. Building Base Concepts (BCs): supposed to be the concepts that play the most important role in different languages. Two main criteria: a high position in the semantic hierarchy (abstract); having many relations to other concepts (hub). About 1,000 synsets. Vertical expansion: filling gaps and ensuring good overlap.
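
The two Base Concept criteria can be sketched as a simple filter over a hierarchy. The tiny hierarchy, the relation counts and the two thresholds below are invented for illustration and are not the values used in EuroWordNet.

```python
# Hypothetical toy hierarchy (child -> parent) and relation counts per synset.
HYPERNYM = {
    "entity": None,
    "object": "entity",
    "animal": "object",
    "dog": "animal",
    "poodle": "dog",
}

RELATION_COUNT = {"entity": 40, "object": 35, "animal": 20, "dog": 6, "poodle": 1}


def depth(synset):
    """Number of hypernym steps from the synset up to the root."""
    steps = 0
    while HYPERNYM[synset] is not None:
        synset = HYPERNYM[synset]
        steps += 1
    return steps


def base_concepts(max_depth, min_relations):
    """Select synsets that are both high in the hierarchy and well connected."""
    return sorted(
        s
        for s in HYPERNYM
        if depth(s) <= max_depth and RELATION_COUNT[s] >= min_relations
    )


print(base_concepts(max_depth=2, min_relations=10))
```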

39. Building wordnets: EWN. Spanish WN: automatic extension with human validation. Combination of 17 heuristic methods: 1) simple rule; 2) pairwise combination; 3) logistic regression combination.

40. Building wordnets

41. Building wordnets

42. Building wordnets

43. Arabic WordNet. Funded by the USA REFLEX program (2005-2007). Partners: universities: Princeton, Manchester, UPC (Barcelona), UB (Barcelona); companies: Articulate Software, Irion.

44. Arabic WordNet: papers. Introducing the Arabic WordNet Project (Black et al., 2006); Building a WordNet for Arabic (Elkateb et al., 2006); Arabic WordNet: Current State and Future Extensions (Rodríguez et al., 2008); Arabic WordNet: Semi-automatic Extensions using Bayesian Inference (Rodríguez et al., 2008).

45. Arabic WordNet: objectives. 10,000 synsets, including some amount of domain-specific data; linked to PWN 2.0, finally to PWN 3.0; linked to SUMO; + 1,000 NEs; manually built (or revised); vowelized entries; including the root of each entry.

46. Arabic WordNet. Criteria for selecting synsets to be covered. Connectivity: as densely connected as possible; most of them connected to English WN counterparts; the overall topology of both wordnets is expected to be similar. Relevance: frequent and salient concepts. Generality: synsets on the highest levels of WN.

47. Arabic WordNet. Approach described at the 3rd GWC (Elkateb et al., 2006). Manually built. 2 lexicographic interfaces: Manchester, Barcelona. Guided by automatically generated suggestions of <Arabic word, English synset> pairs coming from bilingual resources.

48. Arabic WordNet: approach. BCs: covering of EWN & BalkaNet Base Concepts. Filling gaps. Building Arabic-specific synsets. Covering domain-specific synsets. Adding NEs. (Semi-)automatic extensions: heuristic-based; Bayesian networks.

49. Arabic WordNet: resources used. LOGOS database of Arabic verbs: contains 944 fully conjugated Arabic verbs. Bilingual (Arabic-English) dictionaries: NMSU bilingual Arabic-English lexicon: Salmoné University of Barcelona Effel. Corpora: Arabic GigaWord Corpus (from LDC); UN (2000-2002) bilingual Arabic-English Corpus (from LDC).

50. Arabic WordNet. Item: conceptual entities, including synsets, ontology classes and instances. Word: word senses. Form: entity that contains lexical information (not merely inflectional variation): roots, broken plural forms. Link: relates two items and has a type such as equivalence, subsuming, etc.; links interconnect sense items, e.g. a PWN synset to an AWN synset, a synset to a SUMO concept, etc.
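
The four entity types on this slide can be pictured as plain records. The sketch below is an assumption-laden illustration: the field names, identifiers and the example entry are hypothetical and are not taken from the actual AWN database schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    item_id: str   # hypothetical identifier
    kind: str      # "synset", "class" or "instance"
    source: str    # e.g. "AWN", "PWN" or "SUMO"


@dataclass(frozen=True)
class Word:
    word_id: str
    item_id: str   # the item (synset) this word sense belongs to
    value: str


@dataclass(frozen=True)
class Form:
    word_id: str
    kind: str      # "root" or "broken_plural"
    value: str


@dataclass(frozen=True)
class Link:
    kind: str      # e.g. "equivalence" or "subsuming"
    source_id: str
    target_id: str


def linked_items(item_id, links, kind=None):
    """Return the ids of items linked from item_id, optionally by link type."""
    return [
        link.target_id
        for link in links
        if link.source_id == item_id and (kind is None or link.kind == kind)
    ]


# Illustrative entry: an Arabic synset for "book" with root and broken plural.
items = [
    Item("awn:kitab", "synset", "AWN"),
    Item("pwn:book.n.01", "synset", "PWN"),
    Item("sumo:Book", "class", "SUMO"),
]
words = [Word("w1", "awn:kitab", "كتاب")]
forms = [Form("w1", "root", "كتب"), Form("w1", "broken_plural", "كتب")]
links = [
    Link("equivalence", "awn:kitab", "pwn:book.n.01"),
    Link("subsuming", "awn:kitab", "sumo:Book"),
]

print(linked_items("awn:kitab", links))
print(linked_items("awn:kitab", links, kind="equivalence"))
```

Storing links as separate typed records is what allows the same synset to point both at a PWN counterpart and at a SUMO concept.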

51. Arabic WordNet: problems found. Arabic-specific synsets; linking to SUMO; NEs; selecting domain-specific synsets.

52. Arabic WordNet. Current (final? we hope not!) figures. Up-to-date statistics: http://www.lsi.upc.edu/~mbertran/arabic/awn/query/sug_statistics.php

53. Arabic WordNet: software. Lexicographer's Web Interface: http://www.lsi.upc.edu/~mbertran/arabic/awn/update/synset_browse.php User's Web Interface: http://www.lsi.upc.edu/~mbertran/arabic/awn/index.html The Arabic Word Spotter: http://www.lsi.upc.edu/~mbertran/arabic/wwwWn7/ AWN browser: http://sourceforge.net/projects/awnbrowser/ AWN to SUMO mapping, including automatic generation of Arabic paraphrases of SUMO formal axioms.

54. Arabic WordNet: ongoing research. (Semi-)automatic methods for enriching AWN: heuristic-based approach (GWC 2008); Bayesian networks (LREC 2008). Automatically obtaining and linking NEs using Wikipedia as knowledge source: NEs from Wikipedia (Citala 2009, this conference).

55. Arabic WordNet. (Semi-)automatic methods for enriching AWN: key idea. In Arabic, many words sharing a common root have related meanings and can be derived from a base verbal form by means of a reduced set of lexical rules.

56. Semi-automatic Extensions of AWN

57. Semi-automatic Extensions of AWN. Lexical rules: regular verbal derivative forms; regular nominal and adjectival derivative forms: masdar (verbal noun), masculine and feminine active and passive participles; inflected verbal forms.

58. Semi-automatic Extensions of AWN. Procedure for generating a set of likely <Arabic word, English synset, score> tuples: 1) produce an initial list of candidate word forms; 2) filter out the less likely candidates from this list; 3) generate an initial list of attachments; 4) score the reliability of these candidates; 5) manually review the best-scored candidates and include the valid associations in AWN.
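
The five steps above can be sketched as a small pipeline. Everything concrete in this sketch is an assumption: the function names, the transliterated toy forms, the translations, the scores and the threshold are placeholders, and the real derivation rules and scoring methods are not reproduced here.

```python
def candidate_tuples(base_verbs, derive, is_plausible, translate, synsets_of,
                     score, threshold):
    """Return (form, synset, score) tuples worth sending to manual review."""
    # 1) Produce an initial list of candidate word forms.
    forms = [form for verb in base_verbs for form in derive(verb)]
    # 2) Filter out the less likely candidates.
    forms = [form for form in forms if is_plausible(form)]
    # 3) Generate attachments: form -> English translations -> synsets.
    attachments = {
        (form, synset)
        for form in forms
        for english in translate(form)
        for synset in synsets_of(english)
    }
    # 4) Score the reliability of each attachment.
    scored = [(form, synset, score(form, synset)) for form, synset in attachments]
    # 5) Keep the best-scored candidates, best first, for manual review.
    kept = [t for t in scored if t[2] >= threshold]
    return sorted(kept, key=lambda t: (-t[2], t[0], t[1]))


# Hypothetical toy stand-ins for the real resources.
DERIVED = {"darasa": ["darasa", "mudarris", "dirasa", "darsuus"]}
ATTESTED = {"darasa", "mudarris", "dirasa"}
TRANSLATIONS = {"darasa": ["study", "learn"], "mudarris": ["teacher"], "dirasa": ["study"]}
SYNSETS = {"study": ["study.v.01", "study.n.01"], "learn": ["learn.v.01"],
           "teacher": ["teacher.n.01"]}
TOY_SCORES = {
    ("darasa", "study.v.01"): 0.8, ("darasa", "learn.v.01"): 0.6,
    ("darasa", "study.n.01"): 0.1, ("mudarris", "teacher.n.01"): 0.9,
    ("dirasa", "study.n.01"): 0.7, ("dirasa", "study.v.01"): 0.2,
}

review_list = candidate_tuples(
    base_verbs=["darasa"],
    derive=lambda verb: DERIVED.get(verb, []),
    is_plausible=lambda form: form in ATTESTED,
    translate=lambda form: TRANSLATIONS.get(form, []),
    synsets_of=lambda english: SYNSETS.get(english, []),
    score=lambda form, synset: TOY_SCORES.get((form, synset), 0.0),
    threshold=0.5,
)
for row in review_list:
    print(row)
```

Passing each step in as a function keeps the skeleton independent of how candidates are actually derived, filtered or scored.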

59. Semi-automatic Extensions of AWN. Score the reliability of the candidates: build a graph representing the words, synsets and their associations (synset-synset associations: explicit in WN 2.0; path-based); apply a set of heuristic rules that directly use the structure of the graph (GWC 2008); apply Bayesian inference (LREC 2008).

60. Empirical Evaluation. 10 verbs randomly selected from AWN + ???

61. Empirical Evaluation: results

62. Results. Using HEU + BN (threshold 0.07): precision 0.71 (65 accepted candidates out of 92 proposed); average 65/11 ≈ 6 accepted candidates per verb. Extrapolating the results to the set of AWN verbs (> 2,500) leads to 15,000 new synsets from 20,000 candidates.
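
The figures on this slide can be checked with a few lines of arithmetic. The sketch assumes that the denominator 11 is the number of verbs evaluated and uses 2,500 as the number of AWN verbs, although the slide only gives that as a lower bound.

```python
accepted = 65          # accepted candidates
proposed = 92          # proposed candidates
verbs_evaluated = 11   # assumption: the "11" in "65/11" counts evaluated verbs
awn_verbs = 2500       # lower bound given on the slide ("> 2,500")

precision = accepted / proposed                  # about 0.7065
accepted_per_verb = accepted / verbs_evaluated   # about 5.9
proposed_per_verb = proposed / verbs_evaluated   # about 8.4

# The slide rounds the per-verb average to 6 before extrapolating.
projected_accepted = round(accepted_per_verb) * awn_verbs
projected_proposed = proposed_per_verb * awn_verbs

print(f"precision          = {precision:.2f}")
print(f"accepted per verb  = {accepted_per_verb:.1f}")
print(f"projected accepted = {projected_accepted}")
print(f"projected proposed = {projected_proposed:.0f}")
```

With these inputs the projected number of proposed candidates comes out slightly above 20,000, which matches the order of magnitude quoted on the slide.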

63. Conclusions. We now have a useful lexico-semantic resource for dealing with semantic needs in Arabic NLP. AWN has a limited but carefully selected coverage. We need to extend it.

64. Future work. Extend AWN following the proposed lines. Build APIs to make it easier to use. Link AWN with other already available resources: Wikipedia, Cyc, GeoNames, ...

65. Thank you for your attention
