Building methodology

© Arabic WordNet


Methodologies developed in a number of projects

  • EuroWordNet:

    • English, Dutch, German, French, Spanish, Italian, Czech, Estonian

    • 10,000 up to 50,000 synsets

  • BalkaNet:

    • Romanian, Bulgarian, Turkish, Slovenian, Greek, Serbian

    • 10,000 synsets


Main strategies for building wordnets

  • Expand approach: translate WordNet synsets to another language and take over the structure

    • easier and more efficient method

    • compatible structure with WordNet

    • vocabulary and structure are close to WordNet but also biased by it

  • Merge approach: create an independent wordnet in another language and align it with WordNet by generating the appropriate translations

    • more complex and labor intensive

    • different structure from WordNet

    • language specific patterns can be maintained
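
The expand approach can be sketched in a few lines. This is a minimal illustration, not the tooling used by any of the projects: the synset ids, the tiny English hierarchy, and the English-to-Dutch lexicon are all made-up toy data, and `expand` is a hypothetical helper name.

```python
# Toy English wordnet: synset id -> (lemmas, hypernym id or None).
# Ids, lemmas, and translations are illustrative, not real WordNet data.
english_wordnet = {
    "building.n": (["building"], None),
    "house.n": (["house"], "building.n"),
    "school.n": (["school"], "building.n"),
    "church.n": (["church"], "building.n"),
}

# Hypothetical bilingual lexicon (English -> Dutch); "church" is missing on purpose.
bilingual_lexicon = {
    "building": ["gebouw"],
    "house": ["huis"],
    "school": ["school"],
}

def expand(source_wordnet, lexicon):
    """Translate each synset and take over the source hypernym structure."""
    target = {}
    for synset_id, (lemmas, hypernym) in source_wordnet.items():
        translations = [t for lemma in lemmas for t in lexicon.get(lemma, [])]
        if translations:
            target[synset_id] = (translations, hypernym)
    # Drop hypernym links that point to a synset without a translation.
    return {
        sid: (lemmas, hyper if hyper in target else None)
        for sid, (lemmas, hyper) in target.items()
    }

dutch_wordnet = expand(english_wordnet, bilingual_lexicon)

# Synsets left without a translation still need manual work.
untranslated = sorted(set(english_wordnet) - set(dutch_wordnet))
```

The resulting structure mirrors the English hierarchy exactly, which is where both the compatibility and the bias of the expand approach come from. A merge approach would instead build the target hierarchy independently and only add the links to WordNet afterwards.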


General criteria for choosing an approach:

  • The purpose of the resource: machine translation, cross-lingual information retrieval, deep semantic analysis, domain applications

  • Available resources for the specific language

  • Properties of the language

  • Maximize the overlap with wordnets for other languages

  • Maximize semantic consistency within and across wordnets

  • Maximally focus the manual effort where needed

  • Maximally exploit automatic techniques


Top-down methodology

  • Develop a core wordnet (5,000 synsets):

    • all the semantic building blocks or foundation to define the relations for all other more specific synsets, e.g. building -> house, church, school

    • provide a formal and explicit semantics

  • Validate the core wordnet:

    • does it include the most frequent words?

    • are semantic constraints violated?

  • Extend the core wordnet (5,000 synsets or more):

    • automatic techniques for more specific concepts with high-confidence results

    • add other levels of hyponymy

    • add specific domains

    • add ‘easy’ derivational words

    • add ‘easy’ translation equivalence

  • Validate the complete wordnet
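
The two validation questions for the core wordnet can be illustrated with simple checks. The sketch below assumes a toy representation in which every synset has at most one hypernym, and it treats a hypernym cycle as one example of a violated semantic constraint; the function names and data are hypothetical.

```python
def missing_frequent_words(wordnet_lemmas, frequent_words):
    """Return the frequent words that the wordnet does not cover yet."""
    return [word for word in frequent_words if word not in wordnet_lemmas]

def find_hypernym_cycle(hypernym_of):
    """Return a synset that lies on a hypernym cycle, or None if there is none.

    Assumes each synset has at most one hypernym (child -> parent mapping).
    """
    for start in hypernym_of:
        seen = set()
        node = start
        while node is not None:
            if node in seen:
                return node
            seen.add(node)
            node = hypernym_of.get(node)
    return None

# Illustrative data only.
core_lemmas = {"gebouw", "huis", "school"}
frequency_list = ["huis", "kerk", "school"]
valid_hierarchy = {"house": "building", "building": "artifact", "artifact": None}
broken_hierarchy = {"house": "building", "building": "house"}

missing = missing_frequent_words(core_lemmas, frequency_list)
cycle_in_valid = find_hypernym_cycle(valid_hierarchy)
cycle_in_broken = find_hypernym_cycle(broken_hierarchy)
```

Checks of this kind are cheap to run after every extension step, so the manual effort can go to the cases they flag.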


Developing a core wordnet

  • Define a set of concepts (so-called Base Concepts) that play an important role in wordnets:

    • high position in the hierarchy

    • high degree of connectivity

    • represented as English WordNet synsets

    • Common base concepts: shared by various wordnets in different languages

    • Local base concepts: not shared

  • EuroWordNet: 1024 synsets, shared by 2 or more languages

  • BalkaNet: 5000 synsets (including 1024)

  • Common semantic framework for all Base Concepts, in the form of a Top-Ontology

  • Manually translate all Base Concepts (English WordNet synsets) to synsets in the local languages (this was applied for 13 wordnets)

  • Manually build and verify the hypernym relations for the Base Concepts

  • All 13 wordnets are developed from a similar semantic core closely related to the English WordNet
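
The two selection criteria for Base Concepts, a high position in the hierarchy and a high degree of connectivity, can be approximated as follows. This is only a sketch: the hierarchy is toy data, connectivity is reduced to counting hypernym and hyponym links, and the thresholds `max_depth` and `min_relations` are assumptions rather than values used by EuroWordNet or BalkaNet.

```python
from collections import Counter

# Toy hypernym links (child -> parent); illustrative only.
hypernym_of = {
    "entity": None,
    "object": "entity",
    "building": "object",
    "house": "building",
    "church": "building",
    "school": "building",
    "animal": "object",
    "dog": "animal",
}

# Number of direct hyponyms per synset.
hyponym_count = Counter(p for p in hypernym_of.values() if p is not None)

def depth(synset):
    """Number of hypernym steps from the synset up to the root."""
    steps = 0
    while hypernym_of[synset] is not None:
        synset = hypernym_of[synset]
        steps += 1
    return steps

def connectivity(synset):
    """Direct hyponyms plus the hypernym link, as a crude relation count."""
    return hyponym_count[synset] + (hypernym_of[synset] is not None)

def select_base_concepts(max_depth=2, min_relations=3):
    """Keep synsets that are both high in the hierarchy and well connected."""
    return sorted(
        s for s in hypernym_of
        if depth(s) <= max_depth and connectivity(s) >= min_relations
    )

base_concepts = select_base_concepts()
```

With this toy data the selection keeps "building" and "object": both are near the top and have several relations, while leaves such as "house" and the sparsely connected root are left out.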


Top-down methodology

[Diagram. Recoverable labels: Top-Ontology (63 TCs); Inter-Lingual-Index with 1024 CBCs and the remaining WordNet1.5 synsets; for each language wordnet: hyperonyms, CBC representatives, Local BCs, WMs related via non-hyponymy, first-level hyponyms, remaining hyponyms.]


Global Wordnet Association

EuroWordNet

BalkaNet

  • Arabic

  • Polish

  • Welsh

  • Chinese

  • 20 Indian Languages

  • Brazilian Portuguese

  • Hebrew

  • Latvian

  • Persian

  • Kurdish

  • Avestan

  • Baluchi

  • Hungarian

  • Romanian

  • Bulgarian

  • Turkish

  • Slovenian

  • Greek

  • Serbian

  • English

  • German

  • Spanish

  • French

  • Italian

  • Dutch

  • Czech

  • Estonian

  • Danish

  • Swedish

  • Portuguese

  • Korean

  • Russian

  • Basque

  • Catalan

  • Thai

http://www.globalwordnet.org


Top-down methodology

[Diagram: the Arabic Wordnet built next to the English Wordnet. Recoverable labels: 1000 synsets from the Base Concepts (CBC, SBC, ABC; EuroWordNet and BalkaNet); core wordnet = 5000 synsets; SUMO ontology; hypernyms; Arabic word frequency; English-Arabic lexicon (teach - darrasa); WordNet synsets, e.g. 1045678-v {teach} linked to 1045678-v {darrasa}; next-level hyponyms; Arabic roots & derivation rules; WordNet Domains; more hyponyms; domain “chemics”; named entities; easy translations.]


Advantages of the approach

  • Well-defined semantics that can be inherited down to more specific concepts

    • Apply consistency checks

    • Automatic techniques can use semantic basis

  • Most frequent concepts and words are covered

  • High overlap and compatibility with other wordnets

  • Manual effort is focussed on the most difficult concepts and words


Distribution over the top ontology clusters


Overview of equivalence relations to the ILI

Relation           | POS  | Sources : Targets  | Example
eq_synonym         | same | 1 : 1              | auto : voiture / car
eq_near_synonym    | any  | many : many        | apparaat, machine, toestel : apparatus, machine, device
eq_hyperonym       | same | many : 1 (usually) | citroenjenever : gin
eq_hyponym         | same | (usually) 1 : many | dedo : toe, finger
eq_metonymy        | same | many/1 : 1         | universiteit, universiteitsgebouw : university
eq_diathesis       | same | many/1 : 1         | raken (cause), raken : hit
eq_generalization  | same | many/1 : 1         | schoonmaken : clean
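
The constraints in this overview lend themselves to automatic consistency checks. The sketch below is one possible encoding, not the EuroWordNet database format: `EquivalenceLink`, `SAME_POS`, and `check_links` are hypothetical names, ILI records are stood in for by English lemmas, and the 1 : 1 check for eq_synonym is meant to be run on the links of a single language wordnet.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class EquivalenceLink:
    relation: str    # e.g. "eq_synonym"
    source: str      # synset in the local wordnet
    source_pos: str
    target: str      # ILI record, shown here as an English lemma
    target_pos: str

# Relations for which the overview requires the same part of speech.
SAME_POS = {
    "eq_synonym", "eq_hyperonym", "eq_hyponym",
    "eq_metonymy", "eq_diathesis", "eq_generalization",
}

def check_links(links):
    """Return a list of constraint violations for one language wordnet."""
    problems = []
    synonym_targets = defaultdict(set)
    synonym_sources = defaultdict(set)
    for link in links:
        if link.relation in SAME_POS and link.source_pos != link.target_pos:
            problems.append(f"POS mismatch: {link.source} -> {link.target}")
        if link.relation == "eq_synonym":
            synonym_targets[link.source].add(link.target)
            synonym_sources[link.target].add(link.source)
    # eq_synonym is 1 : 1, so neither side may occur in two such links.
    for source, targets in synonym_targets.items():
        if len(targets) > 1:
            problems.append(f"{source} has {len(targets)} eq_synonym targets")
    for target, sources in synonym_sources.items():
        if len(sources) > 1:
            problems.append(f"{target} has {len(sources)} eq_synonym sources")
    return problems

links = [
    EquivalenceLink("eq_synonym", "auto", "n", "car", "n"),
    EquivalenceLink("eq_hyponym", "dedo", "n", "toe", "n"),
    EquivalenceLink("eq_hyponym", "dedo", "n", "finger", "n"),
    EquivalenceLink("eq_hyperonym", "citroenjenever", "n", "gin", "n"),
]
clean_report = check_links(links)

# A second eq_synonym target for "auto" (made up for the example) is flagged.
bad_report = check_links(
    links + [EquivalenceLink("eq_synonym", "auto", "n", "truck", "n")]
)
```

Relations such as eq_hyponym and eq_near_synonym allow several targets, so they are only checked for part of speech where the overview requires it.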


Filling gaps in the ILI

Types of GAPS

  • genuine, cultural gaps for things not known in English culture, e.g. citroenjenever, a kind of gin made from lemon peel

    • Non-productive

    • Non-compositional

  • pragmatic gaps, where the concept is known but is not expressed by a single lexicalized form in English, e.g. container, borrower, cajera (female cashier)

    • Productive

    • Compositional

  • Universality of gaps: Concepts occurring in at least 2 languages
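
The universality criterion can be read as a simple count over languages. The sketch below assumes that each site proposes gap candidates under a shared label; the labels and the German entry are invented for illustration, and `universal_gaps` is a hypothetical helper.

```python
from collections import defaultdict

# Gap candidates proposed per language; labels and entries are illustrative.
gap_candidates = {
    "nl": {"female cashier", "lemon gin"},
    "es": {"female cashier", "young fish"},
    "de": {"young fish"},
}

def universal_gaps(candidates, min_languages=2):
    """Return the concepts proposed as gaps in at least `min_languages` languages."""
    languages_by_concept = defaultdict(set)
    for language, concepts in candidates.items():
        for concept in concepts:
            languages_by_concept[concept].add(language)
    return sorted(
        concept
        for concept, languages in languages_by_concept.items()
        if len(languages) >= min_languages
    )

shared_gaps = universal_gaps(gap_candidates)
```

Here "lemon gin" is proposed by one language only, so under this reading it would stay a local concept rather than a candidate for the ILI.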


Productive and Predictable Lexicalizations exhaustively linked to the ILI

  • {doodslaan V} NL, {totschlagen V} DE: hypernym beat, hypernym kill

  • {doodstampen V} NL, {tottrampeln V} DE: hypernym kill, hypernym stamp

  • {doodschoppen V} NL: hypernym kick

  • {cajera N} ES, {casière} NL: hypernym cashier, in_state female

  • {alevín N} ES: hypernym fish, in_state young


Top-down methodology

[Diagram, repeated overview of the Arabic Wordnet built next to the English Wordnet. Recoverable labels: 1000 synsets from the Base Concepts (SBC, CBC, ABC; EuroWordNet and BalkaNet); 5000 synsets; SUMO ontology; hypernyms; Arabic word frequency; English-Arabic lexicon; next-level hyponyms; Arabic roots & derivation rules; WordNet synsets; WordNet Domains; more hyponyms; domain “chemics”; named entities; easy translations.]

