PowerPoint Slideshow about 'Building Methodology' - HarrisCezar


Building Methodology

© Arabic WordNet


Methodologies developed in a number of projects

  • EuroWordNet:

    • English, Dutch, German, French, Spanish, Italian, Czech, Estonian

    • 10,000 up to 50,000 synsets

  • BalkaNet:

    • Romanian, Bulgarian, Turkish, Slovenian, Greek, Serbian

    • 10,000 synsets


Main strategies for building wordnets

  • Expand approach: translate WordNet synsets to another language and take over the structure

    • easier and more efficient method

    • compatible structure with WordNet

    • vocabulary and structure are close to WordNet but also biased by it

  • Merge approach: create an independent wordnet in another language and align it with WordNet by generating the appropriate translations

    • more complex and labor intensive

    • different structure from WordNet

    • language specific patterns can be maintained
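
The expand approach can be sketched in a few lines. This is only an illustration under stated assumptions: the synset identifiers, the English-Dutch lexicon and the function name `expand` are all made up for the example and do not come from EuroWordNet or BalkaNet.

```python
# Toy sketch of the expand approach. All identifiers and data are
# hypothetical; real projects use the full WordNet and curated lexicons.

# English synsets: id -> list of English words
english_synsets = {
    "building-n": ["building", "edifice"],
    "house-n": ["house"],
    "church-n": ["church"],
}

# Hypernym links between English synsets: child id -> parent id
english_hypernyms = {"house-n": "building-n", "church-n": "building-n"}

# Bilingual lexicon (English -> Dutch), deliberately incomplete
lexicon = {"building": ["gebouw"], "house": ["huis"]}


def expand(synsets, hypernyms, lexicon):
    """Translate each synset and take over the source structure."""
    target = {}
    for sid, words in synsets.items():
        translations = sorted({t for w in words for t in lexicon.get(w, [])})
        if translations:  # synsets without any translation are left out
            target[sid] = translations
    # Keep only the links whose two ends were both translated.
    links = {c: p for c, p in hypernyms.items() if c in target and p in target}
    return target, links


dutch_synsets, dutch_hypernyms = expand(english_synsets, english_hypernyms, lexicon)
print(dutch_synsets)
print(dutch_hypernyms)
```

In this toy run the synset for "church" is dropped because the lexicon has no entry for it, so the result can only contain what WordNet and the lexicon already cover. A merge approach would instead start from an independently built hierarchy and add the links to WordNet afterwards.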


General criteria for approach:

  • The purpose of the resource: machine translation, cross-lingual information retrieval, deep semantic analysis, domain applications

  • Available resources for the specific language

  • Properties of the language

  • Maximize the overlap with wordnets for other languages

  • Maximize semantic consistency within and across wordnets

  • Maximally focus the manual effort where needed

  • Maximally exploit automatic techniques


Top-down methodology

  • Develop a core wordnet (5,000 synsets):

    • all the semantic building blocks or foundation to define the relations for all other more specific synsets, e.g. building -> house, church, school

    • provide a formal and explicit semantics

  • Validate the core wordnet:

    • does it include the most frequent words?

    • are semantic constraints violated?

  • Extend the core wordnet (5,000 synsets or more):

    • automatic techniques for more specific concepts with high-confidence results

    • add other levels of hyponymy

    • add specific domains

    • add ‘easy’ derivational words

    • add ‘easy’ translation equivalence

  • Validate the complete wordnet
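
The two validation questions above (are the most frequent words included, are semantic constraints violated) could be checked roughly as follows. This is a sketch on toy data: the frequency counts, the function names and the choice of "no cycles in the hypernym links" as the example constraint are assumptions, not the checks used in the projects.

```python
# Hypothetical validation checks for a core wordnet (toy data).

def missing_frequent_words(frequency, covered_words, top_n):
    """Return the top_n most frequent words that no synset contains."""
    ranked = sorted(frequency, key=lambda w: (-frequency[w], w))
    return [w for w in ranked[:top_n] if w not in covered_words]


def has_hypernym_cycle(hypernyms):
    """Detect a cycle in child -> parent links (one parent per child here)."""
    for start in hypernyms:
        seen = {start}
        node = hypernyms.get(start)
        while node is not None:
            if node in seen:
                return True
            seen.add(node)
            node = hypernyms.get(node)
    return False


frequency = {"house": 120, "school": 95, "teach": 80, "gin": 3}
covered = {"house", "teach", "building"}
hypernyms = {"house": "building", "church": "building"}

missing = missing_frequent_words(frequency, covered, top_n=3)
print(missing)                        # frequent words still to be added
print(has_hypernym_cycle(hypernyms))  # True would signal a broken hierarchy
```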


Developing a core wordnet

  • Define a set of concepts (so-called Base Concepts) that play an important role in wordnets:

    • high position in the hierarchy

    • high degree of connectivity

    • represented as English WordNet synsets

    • Common base concepts: shared by various wordnets in different languages

    • Local base concepts: not shared

  • EuroWordNet: 1024 synsets, shared by 2 or more languages

  • BalkaNet: 5000 synsets (including 1024)

  • Common semantic framework for all Base Concepts, in the form of a Top-Ontology

  • Manually translate all Base Concepts (English WordNet synsets) into synsets in the local languages (applied for 13 wordnets)

  • Manually build and verify the hypernym relations for the Base Concepts

  • All 13 wordnets are developed from a similar semantic core closely related to the English WordNet
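
The two selection criteria for Base Concepts (high position in the hierarchy, high degree of connectivity) can be illustrated with a small ranking function. The thresholds `max_depth` and `min_links`, the function names and the toy hierarchy are assumptions made for this sketch; the slides do not state the exact cut-offs the projects used.

```python
# Toy hypernym links (child -> parent); hypothetical data.
hypernyms = {
    "object": "entity", "building": "object", "animal": "object",
    "house": "building", "church": "building", "school": "building",
    "dog": "animal",
}


def depth(synset, hypernyms):
    """Number of hypernym steps from the synset up to the top node."""
    d = 0
    while synset in hypernyms:
        synset = hypernyms[synset]
        d += 1
    return d


def base_concept_candidates(hypernyms, max_depth, min_links):
    """Synsets that sit high in the hierarchy and have many relations."""
    nodes = set(hypernyms) | set(hypernyms.values())
    links = {n: 0 for n in nodes}
    for child, parent in hypernyms.items():
        links[child] += 1   # link up to the parent
        links[parent] += 1  # link down to the child
    return sorted(n for n in nodes
                  if depth(n, hypernyms) <= max_depth and links[n] >= min_links)


candidates = base_concept_candidates(hypernyms, max_depth=2, min_links=3)
print(candidates)
```

Here only hypernym links are counted. A fuller version would presumably count every relation type when measuring connectivity.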


Top-down methodology

[Diagram. Labels: Top-Ontology (63 TCs); 1024 CBCs; Hyperonyms; CBC Representatives; Local BCs; First Level Hyponyms; Remaining Hyponyms; WMs related via non-hyponymy; Remaining WordNet 1.5 Synsets; Inter-Lingual-Index]


Global Wordnet Association

EuroWordNet

BalkaNet

  • Arabic

  • Polish

  • Welsh

  • Chinese

  • 20 Indian Languages

  • Brazilian Portuguese

  • Hebrew

  • Latvian

  • Persian

  • Kurdish

  • Avestan

  • Baluchi

  • Hungarian

  • Romanian

  • Bulgarian

  • Turkish

  • Slovenian

  • Greek

  • Serbian

  • English

  • German

  • Spanish

  • French

  • Italian

  • Dutch

  • Czech

  • Estonian

  • Danish

  • Swedish

  • Portuguese

  • Korean

  • Russian

  • Basque

  • Catalan

  • Thai

http://www.globalwordnet.org


Top-down methodology

[Diagram relating the English Wordnet to the Arabic Wordnet. Labels: Core wordnet (5000 synsets); 1000 Synsets; 5000 Synsets; SUMO Ontology; Hypernyms; Arabic word frequency; English Arabic Lexicon (teach - darrasa); CBC; SBC; ABC; EuroWordNet / BalkaNet Base Concepts; WordNet Synsets (1045678-v {teach}, 1045678-v {darrasa}); Next Level Hyponyms; Arabic roots & derivation rules; WordNet Domains; More Hyponyms; Domain “chemics”; Named Entities; Easy Translations]


Advantages of the approach

  • Well-defined semantics that can be inherited down to more specific concepts

    • Apply consistency checks

    • Automatic techniques can use semantic basis

  • Most frequent concepts and words are covered

  • High overlap and compatibility with other wordnets

  • Manual effort is focussed on the most difficult concepts and words
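
How inherited semantics supports consistency checks can be shown with a small sketch. The feature names ("Artifact", "Living"), the incompatibility rule and the function names are hypothetical and chosen only for illustration; the hypernym links are assumed to be free of cycles.

```python
# Hypothetical top-level features attached to a few core synsets.
features = {"building": {"Artifact"}, "animal": {"Living"}}
hypernyms = {"house": "building", "church": "building",
             "dog": "animal", "doghouse": "house"}
# Assumed constraint: no synset may be both an artifact and a living thing.
incompatible = [frozenset({"Artifact", "Living"})]


def inherited_features(synset, hypernyms, features):
    """Collect the features of the synset and of all its hypernyms."""
    result = set()
    node = synset
    while node is not None:
        result |= features.get(node, set())
        node = hypernyms.get(node)
    return result


def violations(synsets, hypernyms, features, incompatible):
    """Synsets whose own plus inherited features break a constraint."""
    bad = []
    for s in synsets:
        inherited = inherited_features(s, hypernyms, features)
        if any(pair <= inherited for pair in incompatible):
            bad.append(s)
    return bad


print(inherited_features("doghouse", hypernyms, features))

# Simulate an editing mistake on a specific synset and catch it.
wrong = dict(features, doghouse={"Living"})
print(violations(["house", "doghouse", "dog"], hypernyms, wrong, incompatible))
```

Because the features are stated once on the core synsets, a mistake on a more specific synset shows up as a conflict with what it inherits.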



Overview of equivalence relations to the ILI

Relation            POS    Sources : Targets     Example (local words)                ILI record(s)
eq_synonym          same   1 : 1                 auto : voiture                       car
eq_near_synonym     any    many : many           apparaat, machine, toestel           apparatus, machine, device
eq_hyperonym        same   many : 1 (usually)    citroenjenever                       gin
eq_hyponym          same   (usually) 1 : many    dedo                                 toe, finger
eq_metonymy         same   many/1 : 1            universiteit, universiteitsgebouw    university
eq_diathesis        same   many/1 : 1            raken (cause), raken                 hit
eq_generalization   same   many/1 : 1            schoonmaken                          clean
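
A possible way to store such equivalence links and check the POS column of the table is sketched below. The class name `EqLink`, its fields and the single-letter POS codes are assumptions for the example; only the relation names and the same/any rule are taken from the table.

```python
from dataclasses import dataclass

# POS constraint per relation, copied from the POS column of the table.
POS_RULE = {
    "eq_synonym": "same",
    "eq_near_synonym": "any",
    "eq_hyperonym": "same",
    "eq_hyponym": "same",
    "eq_metonymy": "same",
    "eq_diathesis": "same",
    "eq_generalization": "same",
}


@dataclass(frozen=True)
class EqLink:
    relation: str
    source: str      # local synset label
    source_pos: str  # hypothetical POS code, e.g. "n" or "v"
    target: str      # ILI record label
    target_pos: str


def pos_ok(link):
    """True if the link respects the POS rule of its relation."""
    rule = POS_RULE[link.relation]
    return rule == "any" or link.source_pos == link.target_pos


links = [
    EqLink("eq_hyperonym", "citroenjenever", "n", "gin", "n"),
    EqLink("eq_hyponym", "dedo", "n", "toe", "n"),
    EqLink("eq_hyponym", "dedo", "n", "finger", "n"),
    # Deliberately wrong POS on the target, to show the check firing.
    EqLink("eq_synonym", "schoonmaken", "v", "clean", "a"),
]
flagged = [link.source for link in links if not pos_ok(link)]
print(flagged)
```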


Filling gaps in the ILI

Types of GAPS

  • genuine, cultural gaps: things not known in English culture, e.g. citroenjenever, a kind of gin made with lemon peel

    • Non-productive

    • Non-compositional

  • pragmatic gaps: the concept is known but is not expressed by a single lexicalized form in English, e.g. container, borrower, cajera (female cashier)

    • Productive

    • Compositional

  • Universality of gaps: Concepts occurring in at least 2 languages
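
The universality criterion (a gap concept must occur in at least 2 languages) amounts to a simple count. The sketch below uses made-up proposals and a hypothetical function name; matching concepts by an identical gloss string is a simplification for the example.

```python
from collections import defaultdict

# Hypothetical gap proposals: (language, concept gloss). Toy data only.
proposals = [
    ("es", "female cashier"),
    ("nl", "female cashier"),
    ("nl", "gin made with lemon peel"),
    ("es", "young fish"),
]


def universal_gaps(proposals, min_languages=2):
    """Concepts proposed by at least min_languages different languages."""
    langs = defaultdict(set)
    for lang, concept in proposals:
        langs[concept].add(lang)
    return sorted(c for c, ls in langs.items() if len(ls) >= min_languages)


shared = universal_gaps(proposals)
print(shared)
```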


Productive and Predictable Lexicalizations exhaustively linked to the ILI

[Diagram: local synsets linked to several ILI records at once. Relations shown: hypernym, in_state. ILI records shown: beat, kill, stamp, kick, cashier, female, fish, young]

  • {doodslaan V} NL, {totschlagen V} DE, {doodstampen V} NL, {tottrampeln V} DE, {doodschoppen V} NL: linked by hypernym to the ILI records beat, kill, stamp and kick

  • {cajera N} ES, {casière} NL: hypernym cashier, in_state female

  • {alevín N} ES: hypernym fish, in_state young


Top-down methodology

[Diagram relating the English Wordnet to the Arabic Wordnet. Labels: SUMO Ontology; Hypernyms; Arabic word frequency; English Arabic Lexicon; 1000 Synsets; 5000 Synsets; SBC; CBC; ABC; EuroWordNet / BalkaNet Base Concepts; Next Level Hyponyms; Arabic roots & derivation rules; WordNet Synsets; WordNet Domains; More Hyponyms; Domain “chemics”; Named Entities; Easy Translations]