1. Citala 2009 1 Arabic Wordnet as a free resource: past, present and the future Horacio Rodríguez
2. Citala 2009 2 Index of the talk Introduction
Ontologies
Wordnets
Building wordnets
Arabic WordNet
Semi-automatic extensions of AWN
Linking AWN with complementary resources
3. Citala 2009 3 Introduction semantic components used in NLP applications:
ontologies
large-scale knowledge-bases.
Need (or convenience) of developing
wide-coverage
domain-independent
lexico-conceptual ontologies
WordNet
4. Citala 2009 4 Ontologies Ontologies have recently become a core resource for many knowledge-based applications:
Semantic Web
e-commerce
Information Retrieval
Information Integration
NLP
5. Citala 2009 5 Ontologies Ontologies represent static domain knowledge in a form that multiple knowledge agents can use efficiently
Acquiring domain knowledge for building ontologies is costly and time-consuming.
For this reason, many methods and techniques have been developed to reduce this effort
6. Citala 2009 6 Ontologies What an ontology is:
(Gruber, 1993)
an ontology is an explicit specification of a conceptualization
(Studer et al, 1998)
an ontology is a formal explicit specification of a shared conceptualization
7. Citala 2009 7 Ontologies What an ontology is:
A conceptualization is an abstract, simplified view of the world represented for some purpose
An ontology is a description (formal specification) of a set of concepts and relationships for enabling knowledge sharing and reuse (to perform logical commitments)
An ontological commitment is an agreement to use a vocabulary in a way that is consistent with the theory specified by the ontology
8. Citala 2009 8 Ontologies
9. Citala 2009 9 Ontologies Types of Ontologies
level of generality
top-level, generic, domain ontologies
level of detail of the domain theory
domain models reduced to a taxonomic hierarchy (shallow, lightweight ontologies)
including relations between concepts
complex models including axioms and constraints (heavyweight ontologies)
10. Citala 2009 10 Ontologies lexico-conceptual ontologies
Some authors simply reject this term: an ontology is by definition conceptual and thus language-independent (or better, language-neutral)
Other authors admit that some conceptualizations are different in different languages, thus leading to different ontologies
(Barbu and Barbu-Mititelu, 2005) classify these differences as accidental, systematic and cultural.
11. Citala 2009 11 Ontologies lexico-conceptual ontologies
Some concepts, for instance, are lexicalized in one language but not in another (i.e. no single word or multiword can be used to refer uniquely to the concept).
Splitting a lexico-conceptual ontology into two linked components: the conceptual core (the true ontology) and the lexical coverage (a lexicon)
12. Citala 2009 12 Ontologies
13. Citala 2009 13 Ontologies The mapping between lexical items (words or multiwords) and concepts can be complex.
Due to polysemy, most lexical items can be mapped into more than one concept.
Due to synonymy, more than one word can be mapped to a concept.
Usually the mapping is split into two steps
from words into word-senses (i.e. different word meanings)
and from word-senses into concepts.
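The two-step mapping on this slide can be sketched with two small lookup tables. This is a minimal illustration only: the words, sense labels and concept names below are hypothetical toy data, not taken from any real wordnet.

```python
# Hypothetical toy data illustrating the two-step mapping.
# Step 1: word -> word-senses (polysemy: "board" has two senses).
WORD_TO_SENSES = {
    "board": ["board#1", "board#2"],
    "plank": ["plank#1"],
}
# Step 2: word-sense -> concept (synonymy: two senses share one concept).
SENSE_TO_CONCEPT = {
    "board#1": "WOODEN_PLANK",
    "board#2": "COMMITTEE",
    "plank#1": "WOODEN_PLANK",
}


def concepts_for(word):
    """Apply step 1 (word -> senses), then step 2 (sense -> concept)."""
    return [SENSE_TO_CONCEPT[sense] for sense in WORD_TO_SENSES.get(word, [])]


def words_for(concept):
    """Inverse lookup: every word with at least one sense mapped to the concept."""
    return sorted(
        word
        for word, senses in WORD_TO_SENSES.items()
        if any(SENSE_TO_CONCEPT[sense] == concept for sense in senses)
    )
```

Here `concepts_for("board")` returns two concepts (polysemy), while `words_for("WOODEN_PLANK")` returns two words (synonymy).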
14. Citala 2009 14 Ontologies Ontology Management
Ontology Building
including Ontology Learning
Ontology Merging
including Ontology Alignment and Mapping
Ontology Enrichment
including Ontology Refining and Instantiation
15. Citala 2009 15 Ontologies Some examples of useful ontologies
Domain restricted
Protege Ontology Library
http://protegewiki.stanford.edu/index.php/Protege_Ontology_Library
OntoSelect
http://olp.dfki.de/ontoselect
16. Citala 2009 16 Ontologies Some examples of useful ontologies
Domain independent (generic)
Suggested Upper Merged Ontology (SUMO)
Niles, Pease, 2001, 2003
≈ 1,000 concepts, 4,000 assertions, and 600 rules
attached to WN
CYC
from enCYClopedia
Guha, Lenat
started in 1984
OpenCyc (40% Cyc) free
ResearchCyc (80% Cyc) free for research
1.6M facts and 120k concepts
17. Citala 2009 17 Ontologies Some examples of useful ontologies
Domain independent (generic)
ConceptNet
Liu, Singh, 2004
http://web.media.mit.edu/~hugo/conceptnet/
two versions:
concise (200,000 assertions)
full (1.6 million assertions)
Commonsense knowledge in ConceptNet encompasses the spatial, physical, social, temporal, and psychological aspects of everyday life.
generated automatically from the 700,000 sentences of the Open Mind Common Sense Project
18. Citala 2009 18 Ontologies Building lexico-conceptual ontologies
A substantial part of the CO is language-neutral
If an LO for a source language is available, an LO for a target language can be derived for a subset of the CO, working mainly at the lexicon and mapping levels and using available bilingual resources as knowledge sources
19. Citala 2009 19 Ontologies
20. Citala 2009 20 Ontologies
21. Citala 2009 21 Ontologies Building lexico-conceptual ontologies
This derivation process is far from simple.
for an LO, the mapping words → word-senses → concepts is complex (and controversial)
(Kilgarriff, 1997) argues against the ontological status of word-senses
(Edmonds and Hirst, 2002) greatly restrict the cases of absolute synonymy and propose instead modeling near-synonymy for fine-grained mapping between words and concepts.
22. Citala 2009 22 Wordnets Princeton's English WordNet
(Miller et al, 1990), (Fellbaum, 1998)
Semantic Information
more than 123,000 words organised in 117,000 synsets (WN3.0)
more than 235,000 relations between synsets
Freely available: http://wordnet.princeton.edu/
23. Citala 2009 23 Wordnets Princeton's English WordNet
Lexicalised concepts (words, compounds, multiwords)
Synset: synonym set (of words)
Large semantic net connecting synsets
synonymy, antonymy, hyperonymy, hyponymy, meronymy, implication, causation ...
Structure
Noun hierarchy depth ~12
Verb hierarchy depth ~3
Adjective/adverb not in hierarchy, but in star structure
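As an illustration of the structure described here, a wordnet can be viewed as synsets (sets of synonymous words) linked by hypernymy. The sketch below uses a hypothetical four-node noun chain, not real WordNet data, and measures hierarchy depth as the number of hypernym links up to the root.

```python
# Hypothetical miniature noun hierarchy: synset id -> hypernym id (None at the root).
HYPERNYM = {
    "entity": None,
    "animal": "entity",
    "canine": "animal",
    "dog": "canine",
}
# Each synset is a set of synonymous words (toy content).
SYNSET_WORDS = {
    "entity": {"entity"},
    "animal": {"animal", "beast"},
    "canine": {"canine", "canid"},
    "dog": {"dog", "domestic dog"},
}


def hypernym_chain(synset_id):
    """Follow hypernym links from a synset up to the root."""
    chain = []
    while synset_id is not None:
        chain.append(synset_id)
        synset_id = HYPERNYM[synset_id]
    return chain


def depth(synset_id):
    """Depth = number of hypernym links between the synset and the root."""
    return len(hypernym_chain(synset_id)) - 1
```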
24. Citala 2009 24 Wordnets
25. Citala 2009 25 Wordnets Beyond WN
EuroWordNet
(Vossen 98)
EU-funded project
Integrated local wordnets in several languages
English Sheffield
Dutch Amsterdam
Italian Pisa
Spanish UB, UPC, UNED.
http://www.hum.uva.nl/~ewn/
26. Citala 2009 26 Wordnets
27. Citala 2009 27 Wordnets Top Concept Ontology of EuroWordNet
Hierarchy of language independent concepts
Semantic distinctions: object, place, ...
abstract (not lexical)
Connected to the ILI
Three types of concepts:
First order: entities
Second order: static or dynamic situations
Third order: abstract propositions
Orthogonal (multiple) assignments
28. Citala 2009 28 Wordnets Beyond WN
EWN2
German (GermaNet), French, Czech, Swedish, Estonian
ITEM, CREL
Spanish, Catalan, Basque (UB, UPC)
EuroTerm, Jur-Wordnet
Extending EWN in particular domain
Balkanet
Extending EWN for the Balkan languages
Hownet
Chinese WN
29. Citala 2009 29 Wordnets Macro Ontologies based on WN
MCR
Yago
Omega
30. Citala 2009 30 Wordnets MCR (Multilingual Central Repository)
Meaning and Know projects
31. Citala 2009 31 Wordnets
32. Citala 2009 32 Wordnets
33. Citala 2009 33 Wordnets YAGO: A Core of Semantic Knowledge Unifying WordNet and Wikipedia
Suchanek et al 2007
34. Citala 2009 34 Wordnets Omega Ontology
http://omega.isi.edu
Andrew Philpot et al (2005)
35. Citala 2009 35 Wordnets
36. Citala 2009 36 Wordnets
37. Citala 2009 37 Building wordnets Merge approach
Taxonomy construction: monolingual MRDs
Mapping taxonomies: bilingual MRDs
Expand approach
Translation of synsets: bilingual MRDs
Manual revision
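The expand approach can be sketched as follows: each source-language synset is translated word by word through a bilingual dictionary, and the resulting target words are proposed for manual revision. This is only a sketch under assumptions: the synset ids, the dictionary entries (transliterated Arabic) and the ranking by number of supporting translations are hypothetical and are not claimed to be the procedure used in any actual project.

```python
# Hypothetical toy inputs for the expand approach.
SOURCE_SYNSETS = {
    "syn_house": ["house", "home"],
    "syn_book": ["book", "volume"],
}
# Stand-in for a bilingual MRD (English -> transliterated Arabic).
BILINGUAL = {
    "house": ["bayt", "dar"],
    "home": ["bayt"],
    "book": ["kitab"],
}


def expand(source_synsets, bilingual):
    """Propose target-language words for each source synset.

    Candidates are ranked by how many source words translate to them
    (an assumed heuristic); ties are broken alphabetically.
    """
    proposals = {}
    for synset_id, words in source_synsets.items():
        counts = {}
        for word in words:
            for target in bilingual.get(word, []):
                counts[target] = counts.get(target, 0) + 1
        proposals[synset_id] = sorted(counts, key=lambda t: (-counts[t], t))
    return proposals
```

The output is a list of suggestions per synset; in the expand approach these still go through manual revision.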
38. Citala 2009 38 Building wordnets EWN
Building Base Concepts (BC)
supposed to be the concepts that play the most important role in different languages.
Two main criteria:
A high position in the semantic hierarchy (abstract)
Having many relations to other concepts (hub)
≈ 1,000 synsets
Vertical expansion filling gaps and assuring good overlapping
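The two Base Concept criteria (high position in the hierarchy, many relations) can be illustrated with a simple filter. The statistics and both thresholds below are hypothetical; the slide does not give the actual cut-offs used in EWN.

```python
# Hypothetical toy statistics: synset -> (depth from the root, number of relations).
STATS = {
    "entity": (0, 40),
    "object": (1, 35),
    "animal": (2, 28),
    "dog": (4, 6),
    "abstraction": (1, 3),
}


def select_base_concepts(stats, max_depth=2, min_relations=20):
    """Keep synsets that are high in the hierarchy AND act as hubs.

    max_depth and min_relations are assumed thresholds for illustration.
    """
    return sorted(
        name
        for name, (depth, relations) in stats.items()
        if depth <= max_depth and relations >= min_relations
    )
```

With the toy data, "dog" is rejected for being too deep and "abstraction" for having too few relations.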
39. Citala 2009 39 Building wordnets EWN
Spanish WN
automatic extension with human validation
Combination of 17 heuristic methods
1) simple rule
2) pair wise combination
3) Logistic Regression combination
40. Citala 2009 40 Building wordnets
41. Citala 2009 41 Building wordnets
42. Citala 2009 42 Building wordnets
43. Citala 2009 43 Arabic WordNet Funded by the USA REFLEX program (2005-2007)
Partners:
Universities
Princeton
Manchester
UPC (Barcelona)
UB (Barcelona)
Companies
Articulate Software
Irion
44. Citala 2009 44 Arabic WordNet papers
Introducing the Arabic WordNet Project
Black et al, 2006
Building a WordNet for Arabic
Elkateb et al, 2006
Arabic WordNet: Current State and Future Extensions
Rodríguez et al, 2008
Arabic WordNet: Semi-automatic Extensions using Bayesian Inference
Rodríguez et al, 2008
45. Citala 2009 45 Arabic WordNet Objectives
10,000 synsets including some amount of domain specific data
linked to PWN 2.0
finally to PWN 3.0
linked to SUMO
+ 1,000 NE
manually built (or revised)
vowelized entries
including root of each entry
46. Citala 2009 46 Arabic WordNet Criteria for selecting synsets to be covered
Connectivity
as densely connected as possible
Most of them connected to English WN counterparts
the overall topology of both wordnets is expected to be similar.
Relevance
Frequent and salient concepts
Generality
Synsets on the highest levels of WN
47. Citala 2009 47 Arabic WordNet Approach
described in 3rd GWC (Elkateb et al, 2006)
Manually built
2 lexicographic interfaces
Manchester, Barcelona
guided by automatically generated suggestions of <Arabic word, English synset> pairs coming from bilingual resources.
48. Citala 2009 48 Arabic WordNet Approach
BCs
Covering of EWN & Balkanet Base Concepts
Filling gaps
Building Arabic specific synsets
Covering domain specific synsets
Adding NEs.
(Semi) automatic extensions
heuristic based
Bayesian networks
49. Citala 2009 49 Arabic WordNet Resources used
LOGOS database of Arabic verbs:
contains 944 fully conjugated Arabic verbs
Bilingual (Arabic-English) dictionaries
NMSU bilingual Arabic-English lexicon:
Salmoné
University of Barcelona
Effel
Corpora
Arabic GigaWord Corpus (from LDC)
UN (2000-2002) bilingual Arabic-English Corpus (from LDC).
50. Citala 2009 50 Arabic WordNet Item
conceptual entities, including synsets, ontology classes and instances.
Word
word senses
Form
entity that contains lexical information (not merely inflectional variation)
roots
broken plural forms
Link
relates two items, and has a type such as equivalence, subsuming, etc.
interconnect sense items, e.g., a PWN synset to an AWN synset, a synset to a SUMO concept, etc.
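The four entity types on this slide can be sketched as plain records. This is an assumed layout for illustration: all field names, identifiers and example values are hypothetical and do not reproduce the actual AWN database schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    """Conceptual entity: a synset, an ontology class or an instance."""
    item_id: str
    kind: str


@dataclass(frozen=True)
class Word:
    """A word sense, attached to exactly one Item."""
    word_id: str
    item_id: str
    value: str


@dataclass(frozen=True)
class Form:
    """Lexical information about a Word, such as a root or a broken plural."""
    word_id: str
    kind: str
    value: str


@dataclass(frozen=True)
class Link:
    """Typed relation between two Items (e.g. equivalence, subsuming)."""
    source_id: str
    target_id: str
    link_type: str


# Hypothetical example: one Arabic synset linked to a PWN synset and a SUMO class.
ITEMS = [
    Item("awn:kitab_n1", "synset"),
    Item("pwn:book_n1", "synset"),
    Item("sumo:Book", "ontology_class"),
]
WORDS = [Word("w1", "awn:kitab_n1", "kitab")]
FORMS = [Form("w1", "root", "ktb"), Form("w1", "broken_plural", "kutub")]
LINKS = [
    Link("awn:kitab_n1", "pwn:book_n1", "equivalence"),
    Link("awn:kitab_n1", "sumo:Book", "equivalence"),
]


def linked_items(item_id, links):
    """Return the ids of all Items reachable from item_id through one Link."""
    return [link.target_id for link in links if link.source_id == item_id]
```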
51. Citala 2009 51 Arabic WordNet Problems found
Arabic specific synsets
linking to SUMO
NEs
Selecting domain specific synsets
52. Citala 2009 52 Arabic WordNet Current (final? we hope not!) figures
up to date statistics:
http://www.lsi.upc.edu/~mbertran/arabic/awn/query/sug_statistics.php.
53. Citala 2009 53 Arabic WordNet Software
Lexicographer's Web Interface
http://www.lsi.upc.edu/~mbertran/arabic/awn/update/synset_browse.php
User's Web Interface
http://www.lsi.upc.edu/~mbertran/arabic/awn/index.html
The Arabic Word Spotter
http://www.lsi.upc.edu/~mbertran/arabic/wwwWn7/
AWN browser
http://sourceforge.net/projects/awnbrowser/
AWN to SUMO mapping including automatic generation of Arabic paraphrases of SUMO formal axioms
54. Citala 2009 54 Arabic WordNet Ongoing research
(Semi) automatic methods for enriching AWN
Heuristic-based approach
GWC 2008
Bayesian Networks
LREC 2008
Automatically obtaining & linking NEs using Wikipedia as Knowledge Source
NEs from Wikipedia
citala 2009 (this conference)
55. Citala 2009 55 Arabic WordNet (Semi) automatic methods for enriching AWN
key idea
In Arabic, many words sharing a common root have related meanings and can be derived from a base verbal form by means of a small set of lexical rules
56. Citala 2009 56 Semi-automatic Extensions of AWN
57. Citala 2009 57 Semi-automatic Extensions of AWN Lexical rules
regular verbal derivative forms
regular nominal and adjectival derivative forms
masdar (verbal noun)
masculine and feminine active and passive participles
inflected verbal forms
58. Citala 2009 58 Semi-automatic Extensions of AWN Procedure for generating a set of likely <Arabic word, English synset, score>:
produce an initial list of candidate word forms
filter out the less likely candidates from this list
generate an initial list of attachments
score the reliability of these candidates
manually review the best scored candidates and include the valid associations in AWN.
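The five steps above can be sketched end to end on toy data. Only the overall shape follows the slide; everything else is an assumption made for illustration: the transliterated patterns (digits 1, 2, 3 stand for the root consonants), the attested-form filter, the dictionary, the synset ids and the scoring formula (share of a form's English translations that support a synset) are all hypothetical.

```python
# Hypothetical toy resources.
PATTERNS = ["1a2a3a", "1aa2i3", "ma12uu3"]   # verb, active and passive participle
ATTESTED = {"kataba", "kaatib"}              # stand-in for a lexicon/corpus filter
BILINGUAL = {
    "kataba": ["write", "compose"],
    "kaatib": ["writer", "author"],
}
EN_SYNSETS = {
    "write": ["write_v1", "write_v2"],
    "compose": ["write_v1"],
    "writer": ["writer_n1"],
    "author": ["writer_n1"],
}


def derive(root):
    """Step 1: produce candidate word forms from a three-consonant root."""
    c1, c2, c3 = root
    return [p.replace("1", c1).replace("2", c2).replace("3", c3) for p in PATTERNS]


def score_candidates(root):
    """Steps 2-4: filter forms, attach them to synsets, score each attachment."""
    forms = [f for f in derive(root) if f in ATTESTED]          # step 2
    scored = []
    for form in forms:
        translations = BILINGUAL.get(form, [])
        support = {}
        for english in translations:                            # step 3
            for synset in EN_SYNSETS.get(english, []):
                support[synset] = support.get(synset, 0) + 1
        for synset, n in support.items():                       # step 4
            scored.append((form, synset, n / len(translations)))
    # Best-scored candidates first; these would go to manual review (step 5).
    return sorted(scored, key=lambda t: (-t[2], t[0], t[1]))
```

For the root "ktb", the unattested form "maktuub" is filtered out and the remaining attachments are ranked for the lexicographer.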
59. Citala 2009 59 Semi-automatic Extensions of AWN Score the reliability of the candidates
build a graph representing the words, synsets and their associations
associations synset-synset:
explicit in WN2.0
path-based
apply a set of heuristic rules that directly use the structure of the graph
GWC 2008
apply Bayesian inference
LREC 2008
60. Citala 2009 60 Empirical Evaluation 10 verbs randomly selected from AWN + ???
61. Citala 2009 61 Empirical Evaluation Results
62. Citala 2009 62 Results Using HEU + BN (threshold 0.07)
precision 0.71
65 accepted candidates from 92 proposed
average 65/11 ≈ 6
extrapolating the results to the set of AWN verbs (>2,500) leads to 15,000 new synsets from 20,000 candidates
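The arithmetic behind these figures can be checked directly. The sketch below uses only the numbers on the slide; taking exactly 2,500 verbs for the ">2,500" figure and rounding per-verb averages before extrapolating are assumptions.

```python
# Figures from the slide.
accepted = 65          # candidates accepted after manual review
proposed = 92          # candidates proposed by HEU + BN
verbs_evaluated = 11   # denominator used on the slide (65/11)
awn_verbs = 2500       # assumed lower bound for ">2,500" AWN verbs

precision = accepted / proposed                     # about 0.71
accepted_per_verb = accepted / verbs_evaluated      # about 6
proposed_per_verb = proposed / verbs_evaluated      # about 8

# Extrapolation with rounded per-verb averages.
projected_accepted = round(accepted_per_verb) * awn_verbs
projected_proposed = round(proposed_per_verb) * awn_verbs

print(f"precision = {precision:.2f}")
print(f"projected: {projected_accepted} accepted from {projected_proposed} candidates")
```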
63. Citala 2009 63 Conclusions We now have a useful lexico-semantic resource for dealing with semantic needs in Arabic NLP
AWN has a limited but carefully selected coverage
We need to extend it
64. Citala 2009 64 Future work Extend AWN following the proposed lines
Build APIs to make AWN easier to use
Link AWN with other already available resources:
Wikipedia
Cyc
Geonames
...
65. Citala 2009 65 Thank you for your attention