
Building Methodology


HarrisCezar

Presentation Transcript


  1. Building Methodology © Arabic WordNet

  2. Methodologies developed in a number of projects
     • EuroWordNet:
       • English, Dutch, German, French, Spanish, Italian, Czech, Estonian
       • 10,000 to 50,000 synsets
     • BalkaNet:
       • Romanian, Bulgarian, Turkish, Slovenian, Greek, Serbian
       • 10,000 synsets

  3. Main strategies for building wordnets
     • Expand approach: translate WordNet synsets to another language and take over the structure
       • easier and more efficient method
       • compatible structure with WordNet
       • vocabulary and structure are close to WordNet but also biased by it
     • Merge approach: create an independent wordnet in another language and align it with WordNet by generating the appropriate translations
       • more complex and labor intensive
       • different structure from WordNet
       • language-specific patterns can be maintained
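The expand approach can be sketched in a few lines. This is only an illustration under stated assumptions: the synset identifiers, the toy hypernym map, and the English-to-Dutch lexicon are invented for the example and do not come from the slides.

```python
from typing import Dict, List, Optional, Tuple

# Toy English data, invented for illustration only.
ENGLISH_SYNSETS: Dict[str, List[str]] = {
    "building.n": ["building"],
    "house.n": ["house"],
    "church.n": ["church"],
}
ENGLISH_HYPERNYM: Dict[str, Optional[str]] = {
    "building.n": None,
    "house.n": "building.n",
    "church.n": "building.n",
}
# Hypothetical bilingual lexicon (English -> Dutch); "church" is deliberately missing.
LEXICON: Dict[str, List[str]] = {
    "building": ["gebouw"],
    "house": ["huis"],
}


def expand(
    synsets: Dict[str, List[str]],
    hypernym: Dict[str, Optional[str]],
    lexicon: Dict[str, List[str]],
) -> Tuple[Dict[str, List[str]], Dict[str, str]]:
    """Translate each synset and take over the source hypernym structure."""
    target: Dict[str, List[str]] = {}
    for synset_id, words in synsets.items():
        translations = [t for w in words for t in lexicon.get(w, [])]
        if translations:  # untranslated synsets are left for manual work
            target[synset_id] = translations
    # Keep a hypernym link only when both ends were translated.
    links = {s: hypernym[s] for s in target if hypernym.get(s) in target}
    return target, links


target_synsets, target_links = expand(ENGLISH_SYNSETS, ENGLISH_HYPERNYM, LEXICON)
print(target_synsets)
print(target_links)
```

Because the structure is copied rather than built independently, the result stays compatible with WordNet but inherits its bias, which is the trade-off the slide describes.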

  4. General criteria for choosing an approach
     • The purpose of the resource: machine translation, cross-lingual information retrieval, deep semantic analysis, domain applications
     • Available resources for the specific language
     • Properties of the language
     • Maximize the overlap with wordnets for other languages
     • Maximize semantic consistency within and across wordnets
     • Focus the manual effort where it is needed most
     • Exploit automatic techniques as much as possible

  5. Top-down methodology
     • Develop a core wordnet (5,000 synsets):
       • all the semantic building blocks, or foundation, needed to define the relations for all other, more specific synsets, e.g. building -> house, church, school
       • provide a formal and explicit semantics
     • Validate the core wordnet:
       • does it include the most frequent words?
       • are semantic constraints violated?
     • Extend the core wordnet (5,000 synsets or more):
       • automatic techniques for more specific concepts with high-confidence results
       • add other levels of hyponymy
       • add specific domains
       • add ‘easy’ derivational words
       • add ‘easy’ translation equivalences
     • Validate the complete wordnet
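The two validation questions for the core wordnet could be checked roughly as follows. This is a minimal sketch: the frequency counts are invented, the function names are hypothetical, and a cycle in the hypernym chain is used as just one possible example of a violated semantic constraint (the slides do not say which constraints were checked).

```python
from typing import Dict, List, Optional, Set


def missing_frequent_words(
    wordnet_words: Set[str], frequency: Dict[str, int], top_n: int
) -> List[str]:
    """Return the top_n most frequent words that the wordnet does not cover."""
    ranked = sorted(frequency, key=lambda w: (-frequency[w], w))[:top_n]
    return [w for w in ranked if w not in wordnet_words]


def has_hypernym_cycle(hypernym: Dict[str, Optional[str]]) -> bool:
    """Detect a cycle, assuming each synset has at most one hypernym."""
    for start in hypernym:
        seen: Set[str] = set()
        node: Optional[str] = start
        while node is not None:
            if node in seen:
                return True
            seen.add(node)
            node = hypernym.get(node)
    return False


# Invented corpus counts, for illustration only.
FREQUENCY = {"house": 50, "school": 40, "church": 10, "gin": 1}
CORE_WORDS = {"house", "church"}

print(missing_frequent_words(CORE_WORDS, FREQUENCY, top_n=3))
print(has_hypernym_cycle({"house": "building", "building": None}))
```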

  6. Developing a core wordnet
     • Define a set of concepts (so-called Base Concepts) that play an important role in wordnets:
       • high position in the hierarchy
       • high degree of connectivity
       • represented as English WordNet synsets
     • Common base concepts: shared by various wordnets in different languages
     • Local base concepts: not shared
     • EuroWordNet: 1024 synsets, shared by 2 or more languages
     • BalkaNet: 5000 synsets (including the 1024)
     • Common semantic framework for all Base Concepts, in the form of a Top-Ontology
     • Manually translate all Base Concepts (English WordNet synsets) to synsets in the local languages (this was applied for 13 wordnets)
     • Manually build and verify the hypernym relations for the Base Concepts
     • All 13 wordnets are developed from a similar semantic core closely related to the English WordNet
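The "shared by 2 or more languages" criterion for common base concepts can be illustrated with a simple count. This is a sketch only: the language codes and concept labels are invented, and the real selection also weighed hierarchy position and connectivity, which are not modelled here.

```python
from collections import Counter
from typing import Dict, Set


def common_base_concepts(
    local_bcs: Dict[str, Set[str]], min_languages: int = 2
) -> Set[str]:
    """Return the concepts selected as base concepts in at least min_languages."""
    counts = Counter(c for concepts in local_bcs.values() for c in concepts)
    return {c for c, n in counts.items() if n >= min_languages}


# Hypothetical local base concept selections, keyed by language code.
LOCAL_BCS = {
    "nl": {"building", "gin"},
    "es": {"building", "fish"},
    "it": {"fish", "building"},
}

print(sorted(common_base_concepts(LOCAL_BCS)))
```

Concepts that fall below the threshold (here "gin") would remain local base concepts.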

  7. Top-down methodology [Diagram. Recoverable labels: Top-Ontology (63 TCs); 1024 CBCs; hyperonyms; CBC representatives; local BCs; WMs related via non-hyponymy; first-level hyponyms; remaining hyponyms; remaining WordNet 1.5 synsets; Inter-Lingual-Index]

  8. Global Wordnet Association (http://www.globalwordnet.org)
     • EuroWordNet: English, German, Spanish, French, Italian, Dutch, Czech, Estonian
     • BalkaNet: Romanian, Bulgarian, Turkish, Slovenian, Greek, Serbian
     • Other languages: Arabic, Polish, Welsh, Chinese, 20 Indian Languages, Brazilian Portuguese, Hebrew, Latvian, Persian, Kurdish, Avestan, Baluchi, Hungarian, Danish, Swedish, Portuguese, Korean, Russian, Basque, Catalan, Thai

  9. Core wordnet and top-down methodology [Diagram. Recoverable labels: core wordnet (5000 synsets); 1000 synsets; 5000 synsets; EuroWordNet and BalkaNet Base Concepts (CBC, SBC, ABC); hypernyms; SUMO ontology; Arabic word frequency; English-Arabic lexicon (teach - darrasa); WordNet synsets 1045678-v {teach} and 1045678-v {darrasa}; next-level hyponyms; Arabic roots & derivation rules; WordNet Domains; more hyponyms; domain “chemics”; named entities; easy translations; Arabic Wordnet; English Wordnet]

  10. Advantages of the approach
     • Well-defined semantics that can be inherited down to more specific concepts
     • Apply consistency checks
     • Automatic techniques can use semantic basis
     • Most frequent concepts and words are covered
     • High overlap and compatibility with other wordnets
     • Manual effort is focussed on the most difficult concepts and words
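The first advantage, semantics inherited down to more specific concepts, can be pictured as a walk up the hypernym chain. The sketch below assumes a single hypernym per synset; the feature labels and the function name are hypothetical and are not taken from the actual Top-Ontology.

```python
from typing import Dict, Optional, Set


def inherited_features(
    synset: str,
    hypernym: Dict[str, Optional[str]],
    features: Dict[str, Set[str]],
) -> Set[str]:
    """Collect the features of a synset and of all its hypernyms."""
    collected: Set[str] = set()
    visited: Set[str] = set()
    node: Optional[str] = synset
    while node is not None and node not in visited:  # guard against cycles
        visited.add(node)
        collected |= features.get(node, set())
        node = hypernym.get(node)
    return collected


# Toy hierarchy and hypothetical top-level features.
HYPERNYM = {"house": "building", "building": "artifact", "artifact": None}
FEATURES = {"artifact": {"Object", "Artifact"}, "building": {"Building"}}

print(sorted(inherited_features("house", HYPERNYM, FEATURES)))
```

Only the upper concepts need explicit features; "house" receives all of them without any manual annotation.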

  11. Distribution over the top ontology clusters

  12. Overview of equivalence relations to the ILI

     | Relation          | POS  | Sources : Targets   | Example                                                       |
     | eq_synonym        | same | 1 : 1               | auto : voiture car                                            |
     | eq_near_synonym   | any  | many : many         | apparaat, machine, toestel : apparatus, machine, device       |
     | eq_hyperonym      | same | many : 1 (usually)  | citroenjenever : gin                                          |
     | eq_hyponym        | same | (usually) 1 : many  | dedo : toe, finger                                            |
     | eq_metonymy       | same | many/1 : 1          | universiteit, universiteitsgebouw : university                |
     | eq_diathesis      | same | many/1 : 1          | raken (cause), raken : hit                                    |
     | eq_generalization | same | many/1 : 1          | schoonmaken : clean                                           |
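The part-of-speech column of this overview lends itself to a small consistency check. The sketch below is an assumption-laden illustration: the `EquivalenceLink` class, its field names, and the POS codes are invented, and the "usually" qualifiers on cardinality are not modelled.

```python
from dataclasses import dataclass

# POS constraint per relation, as listed in the overview above.
POS_RULE = {
    "eq_synonym": "same",
    "eq_near_synonym": "any",
    "eq_hyperonym": "same",
    "eq_hyponym": "same",
    "eq_metonymy": "same",
    "eq_diathesis": "same",
    "eq_generalization": "same",
}


@dataclass(frozen=True)
class EquivalenceLink:
    """Hypothetical record linking a local word to an ILI entry."""
    relation: str
    source_word: str
    source_pos: str
    target_ili: str
    target_pos: str


def is_valid(link: EquivalenceLink) -> bool:
    """Check that the relation is known and its POS constraint holds."""
    rule = POS_RULE.get(link.relation)
    if rule is None:
        return False
    return rule == "any" or link.source_pos == link.target_pos


gin_link = EquivalenceLink("eq_hyperonym", "citroenjenever", "n", "gin", "n")
print(is_valid(gin_link))
```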

  13. Filling gaps in the ILI
     • Types of gaps:
       • genuine, cultural gaps for things not known in English culture, e.g. citroenjenever, which is a kind of gin made out of lemon skin
         • Non-productive
         • Non-compositional
       • pragmatic gaps, in the sense that the concept is known but is not expressed by a single lexicalized form in English, e.g. container, borrower, cajera (female cashier)
         • Productive
         • Compositional
     • Universality of gaps: concepts occurring in at least 2 languages

  14. Productive and Predictable Lexicalizations exhaustively linked to the ILI [Diagram. Recoverable labels: ILI records beat, kill, stamp and kick, connected by hypernym links to {doodslaan V} NL, {totschlagen V} DE, {doodstampen V} NL, {tottrampeln V} DE and {doodschoppen V} NL; {cajera N} ES and {casière} NL, each with hypernym cashier and in_state female; {alevín N} ES, with hypernym fish and in_state young]

  15. Top-down methodology [Diagram. Recoverable labels: 1000 synsets; 5000 synsets; EuroWordNet and BalkaNet Base Concepts (SBC, CBC, ABC); hypernyms; SUMO ontology; Arabic word frequency; English-Arabic lexicon; next-level hyponyms; Arabic roots & derivation rules; WordNet synsets; WordNet Domains; more hyponyms; domain “chemics”; named entities; easy translations; Arabic Wordnet; English Wordnet]
