Wordnet, EuroWordNet, Global Wordnet

Piek Vossen

[email protected]

http://www.globalwordnet.org

Overview
  • Princeton WordNet (1980 - ongoing)
  • EuroWordNet (1996 - 1999)
    • The database design
    • The general building strategy
    • Towards a universal index of meaning
  • Global WordNet Association (2001 - ongoing)
    • Other wordnets
    • BalkaNet (2001 - 2004)
    • IndoWordnet (2002 - ongoing)
    • Meaning (2002 - 2005)
WordNet1.5
  • Developed at Princeton by George Miller and his team as a model of the mental lexicon.
  • Semantic network in which concepts are defined in terms of relations to other concepts.
  • Structure:
      • organized around the notion of synsets (sets of synonymous words)
      • basic semantic relations between these synsets
  • Initially no glosses
  • Main revision after tagging the Brown corpus with word meanings: SemCor.
  • http://www.cogsci.princeton.edu/~wn/w3wn.html
EuroWordNet
  • The development of a multilingual database with wordnets for several European languages
  • Funded by the European Commission, DG XIII, Luxembourg as projects LE2-4003 and LE4-8328
  • March 1996 - September 1999
  • 2.5 Million EURO.
  • URL: http://www.hum.uva.nl/~ewn
Objectives of EuroWordNet
  • Languages covered:
    • EuroWordNet-1 (LE2-4003): English, Dutch, Spanish, Italian
    • EuroWordNet-2 (LE4-8328): German, French, Czech, Estonian.
  • Size of vocabulary:
    • EuroWordNet-1: 30,000 concepts - 50,000 word meanings.
    • EuroWordNet-2: 15,000 concepts - 25,000 word meanings.
  • Type of vocabulary:
    • the most frequent words of the languages
    • all concepts needed to relate more specific concepts
The basic principles of EuroWordNet
  • the structure of the Princeton WordNet
  • the design of the EuroWordNet database
  • wordnets as language-specific structures
  • the language-internal relations
  • the multilingual relations
Specific features of EuroWordNet
  • it contains semantic lexicons for languages other than English.
  • each wordnet reflects the relations as a language-internal system, maintaining cultural and linguistic differences in the wordnets.
  • it contains multilingual relations from each wordnet to English meanings, which makes it possible to compare the wordnets, tracking down inconsistencies and cross-linguistic differences.
  • each wordnet is linked to a language independent top-ontology and to domain labels.
Autonomous & Language-Specific

[Figure: the WordNet1.5 and Dutch Wordnet hierarchies for "object" shown side by side.]
  • WordNet1.5 nodes: object; artifact, artefact (a man-made object); natural object (an object occurring naturally); instrumentality; device; implement; container; tool; instrument; block; body; box; spoon; bag
  • Dutch Wordnet nodes: voorwerp {object}; blok {block}; lichaam {body}; werktuig {tool}; bak {box}; lepel {spoon}; tas {bag}
Differences in structure
  • Artificial Classes versus Lexicalized Classes:
  • instrumentality; natural object
  • Lexicalization differences of classes:
  • container and artifact (object) are not lexicalized in Dutch
  • What is the purpose of different hierarchies?
  • Should we include all lexicalized classes from all (8) languages?
Linguistic versus Conceptual Ontologies
  • Conceptual ontology:
  • A particular level or structuring may be required to achieve a better control or performance, or a more compact and coherent structure.
    • introduce artificial levels for concepts which are not lexicalized in a language (e.g. instrumentality, hand tool),
    • neglect levels which are lexicalized but not relevant for the purpose of the ontology (e.g. tableware, silverware, merchandise).
  • What properties can we infer for spoons?
  • spoon -> container; artifact; hand tool; object; made of metal or plastic; for eating, pouring or cooking
Linguistic versus Conceptual Ontologies

Linguistic ontology:

Exactly reflects the relations between all the lexicalized words and expressions in a language. It therefore captures valuable information about the lexical capacity of languages: what is the available fund of words and expressions in a language.

What words can be used to name spoons?

spoon -> object, tableware, silverware, merchandise, cutlery,

Separate Wordnets and Ontologies

[Figure: language-specific wordnets linked to a language-neutral ontology.]
  • Language-Specific Wordnets:
    • WordNet1.5: object, container, box
    • Dutch Wordnet: voorwerp, doos
  • Language-Neutral Ontology:
    • Reference Ontology Classes: BOX: ContainerProduct; SolidTangibleThing
    • EuroWordNet Top-Ontology: Form: Cubic; Function: Contain; Origin: Artifact; Composition: Whole

Wordnets versus ontologies

Wordnets:

autonomous language-specific lexicalization patterns in a relational network.

Usage: to predict substitution in text for information retrieval, text generation, machine translation, word-sense-disambiguation.

Ontologies:

data structure with formally defined concepts.

Usage: making semantic inferences.

Wordnets as Linguistic Ontologies

Classical Substitution Principle:

Any word that is used to refer to something can be replaced by its synonyms, hyperonyms and hyponyms:

horse → stallion, mare, pony, mammal, animal, being.

It cannot be referred to by co-hyponyms and co-hyponyms of its hyperonyms:

horse X cat, dog, camel, fish, plant, person, object.

Conceptual Distance Measurement:

Number of hierarchical nodes between words is a measurement of closeness, where the level and the local density of nodes are additional factors.
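
The substitution principle and the distance measure can be sketched on a toy hierarchy. The hyperonym links below are an assumption made for illustration (only the example words come from the slide), and the measure is plain link counting: the level and local-density factors mentioned above are not modelled.

```python
# Toy hyperonym links (child -> parent); an illustrative assumption, not WordNet data.
HYPERONYM = {
    "stallion": "horse",
    "mare": "horse",
    "pony": "horse",
    "horse": "mammal",
    "cat": "mammal",
    "dog": "mammal",
    "mammal": "animal",
    "fish": "animal",
    "animal": "being",
    "plant": "being",
}


def path_to_top(word):
    """Return the word followed by all of its hyperonyms, most specific first."""
    path = [word]
    while path[-1] in HYPERONYM:
        path.append(HYPERONYM[path[-1]])
    return path


def conceptual_distance(a, b):
    """Count the hyperonym links on the path between a and b via their lowest shared node."""
    up_a = path_to_top(a)
    up_b = path_to_top(b)
    for steps_a, node in enumerate(up_a):
        if node in up_b:
            return steps_a + up_b.index(node)
    return None  # no common hyperonym


def can_substitute(word, other):
    """Substitution holds only when both words lie on one hyperonym chain."""
    return other in path_to_top(word) or word in path_to_top(other)
```

In this sketch `can_substitute("horse", "animal")` holds because animal is a hyperonym of horse, while `can_substitute("horse", "cat")` fails because the two are co-hyponyms.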

Linguistic Principles for deriving relations
  • 1. Substitution tests (Cruse 1986):
    • 1 a. It is a fiddle therefore it is a violin.
      • b. It is a violin therefore it is a fiddle.
    • 2 a. It is a dog therefore it is an animal.
      • b. *It is an animal therefore it is a dog.
    • 3 a. to kill (/ a murder) causes to die (/ death)
      • to kill (/ a murder) has to die (/ death) as a consequence
      • b. *to die / death causes to kill
      • *to die / death has to kill as a consequence
Linguistic Principles for deriving relations
  • 2. Principle of Economy (Dik 1978):
    • If a word W1 (animal) is the hyperonym of W2 (mammal) and W2 is the hyperonym of W3 (dog) then W3 (dog) should not be linked to W1 (animal) but to W2 (mammal).
  • 3. Principle of Compatibility
    • If a word W1 is related to W2 via relation R1, W1 and W2 cannot be related via relation Rn, where Rn is defined as a distinct relation from R1.
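
The Principle of Economy can be checked mechanically. The following is a minimal sketch, assuming hyperonym links are stored as (word, hyperonym) pairs; the function name and the data layout are hypothetical and not part of the EuroWordNet tools.

```python
def redundant_hyperonym_links(links):
    """Return direct (word, hyperonym) links that are already implied by a longer chain."""
    parents = {}
    for child, parent in links:
        parents.setdefault(child, set()).add(parent)

    def ancestors(word, seen=None):
        # Collect every hyperonym reachable from `word`; `seen` also guards against cycles.
        seen = set() if seen is None else seen
        for parent in parents.get(word, ()):
            if parent not in seen:
                seen.add(parent)
                ancestors(parent, seen)
        return seen

    redundant = []
    for child, parent in links:
        # The link is redundant if `parent` is reachable through another direct hyperonym.
        for other in parents[child] - {parent}:
            if parent in ancestors(other):
                redundant.append((child, parent))
                break
    return redundant


# The slide's example: dog should be linked to mammal, not directly to animal.
links = [("mammal", "animal"), ("dog", "mammal"), ("dog", "animal")]
```

Applied to the slide's example, the sketch flags the direct dog/animal link, since animal is already reachable via mammal.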
Architecture of the EuroWordNet Data Base

[Figure: language-specific wordnets connected through the Inter-Lingual-Index.]
  • Domains: Traffic; Air; Road
  • Ontology: 2OrderEntity; Location; Dynamic
  • Inter-Lingual-Index: ILI-record {drive}
  • Lexical Items Tables (one per language):
    • Dutch: bewegen, gaan, rijden, berijden
    • English: move, go, ride, drive
    • Spanish: mover, transitar, conducir, cabalgar, jinetear
    • Italian: andare, muoversi, cavalcare, guidare
  • Link types:
    • I = Language Independent link
    • II = Link from Language Specific to Inter-Lingual-Index
    • III = Language Dependent Link

Language Internal Relations
  • WN 1.5 starting point
  • The ‘synset’ as a weak notion of synonymy:
        • “two expressions are synonymous in a linguistic context C if the substitution of one for the other in C does not alter the truth value.” (Miller et al. 1993)
  • Relations between synsets:

      Relation     POS-combination          Example
      ANTONYMY     adjective-to-adjective
                   verb-to-verb             open / close
      HYPONYMY     noun-to-noun             car / vehicle
                   verb-to-verb             walk / move
      MERONYMY     noun-to-noun             head / nose
      ENTAILMENT   verb-to-verb             buy / pay
      CAUSE        verb-to-verb             kill / die
Differences EuroWordNet/WordNet1.5
  • Added Features to relations
  • Cross-Part-Of-Speech relations
  • New relations to differentiate shallow hierarchies
  • New interpretations of relations
EWN Relationship Labels
  • Disjunction/Conjunction of multiple relations of the same type
  • WordNet1.5
    • door 1 -- (a swinging or sliding barrier that will close the entrance to a room or building; "he knocked on the door"; "he slammed the door as he left") PART OF: doorway, door, entree, entry, portal, room access
    • door 6 -- (a swinging or sliding barrier that will close off access into a car; "she forgot to lock the doors of her car") PART OF: car, auto, automobile, machine, motorcar.
EWN Relationship Labels

{airplane} HAS_MERO_PART: conj1 {door}
           HAS_MERO_PART: conj2 disj1 {jet engine}
           HAS_MERO_PART: conj2 disj2 {propeller}

{door} HAS_HOLO_PART: disj1 {car}
       HAS_HOLO_PART: disj2 {room}
       HAS_HOLO_PART: disj3 {entrance}

{dog} HAS_HYPERONYM: conj1 {mammal}
      HAS_HYPERONYM: conj2 {pet}

{albino} HAS_HYPERONYM: disj1 {plant}
         HAS_HYPERONYM: disj2 {animal}

Default Interpretation: non-exclusive disjunction
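
One way to read these labels is that relations sharing a conj index are alternatives, and all conj groups must hold together. The sketch below encodes the airplane example under that reading. The tuple layout and function names are assumptions for illustration, not the EuroWordNet database format.

```python
from collections import defaultdict

# (source, relation, conj index, disj index, target); None means the label is absent.
AIRPLANE_PARTS = [
    ("airplane", "HAS_MERO_PART", 1, None, "door"),
    ("airplane", "HAS_MERO_PART", 2, 1, "jet engine"),
    ("airplane", "HAS_MERO_PART", 2, 2, "propeller"),
]


def group_by_conjunct(relations):
    """Map each conj index to the list of alternative targets it allows."""
    groups = defaultdict(list)
    for _source, _relation, conj, _disj, target in relations:
        groups[conj].append(target)
    return dict(groups)


def is_consistent(relations, observed_targets):
    """True if every conjunct has at least one of its alternatives observed."""
    groups = group_by_conjunct(relations)
    return all(
        any(target in observed_targets for target in alternatives)
        for alternatives in groups.values()
    )
```

Because the disjunction is non-exclusive, an airplane with a door, a jet engine and a propeller is also accepted by `is_consistent`.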

EWN Relationship Labels
  • Disjunction/Conjunction of multiple relations of the same type
  • {dog}
  • HAS_HYPONYM: dis1 {poodle}
  • HAS_HYPONYM: dis1 {labrador}
  • HAS_HYPONYM: {sheep dog} (Orthogonal)
  • HAS_HYPONYM: {watch dog} (Orthogonal)
  • Default Interpretation: non-exclusive disjunction
EWN Relationship Labels
  • Factive/Non-factive CAUSES (Lyons 1977)
    • factive (default interpretation):
  • “to kill causes to die”:
      • {kill} CAUSES {die}
    • non-factive: E1 probably or likely causes event E2 or E1 is intended to cause some event E2:
    • “to search may cause to find”.
  • {search} CAUSES {find} non-factive
EWN Relationship Labels

Reversed

In the database every relation must have a reverse counter-part but there is a difference between relations which are explicitly coded as reverse and automatically reversed relations:

{finger} HAS_HOLONYM {hand}

{hand} HAS_MERONYM {finger}

{paper-clip} HAS_MER_MADE_OF {metal}

{metal} HAS_HOL_MADE_OF {paper-clip} reversed

Negation

{monkey} HAS_MERO_PART {tail}

{ape} HAS_MERO_PART {tail} not
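
The requirement that every relation has a reverse counterpart can be sketched as a completion step. This assumes the "reversed" label marks counterparts that were generated automatically rather than coded by hand. The `Relation` type and `add_missing_reverses` are hypothetical names, and only the relation names come from the slide.

```python
from typing import NamedTuple

# Relation name -> name of its reverse counterpart (names as used on the slide).
REVERSE_OF = {
    "HAS_HOLONYM": "HAS_MERONYM",
    "HAS_MERONYM": "HAS_HOLONYM",
    "HAS_MER_MADE_OF": "HAS_HOL_MADE_OF",
    "HAS_HOL_MADE_OF": "HAS_MER_MADE_OF",
}


class Relation(NamedTuple):
    source: str
    name: str
    target: str
    auto_reversed: bool = False  # the "reversed" label
    negated: bool = False        # the "not" label


def add_missing_reverses(relations):
    """Return the relations plus an automatically reversed one wherever no reverse is coded."""
    coded = {(r.source, r.name, r.target) for r in relations}
    result = list(relations)
    for r in relations:
        key = (r.target, REVERSE_OF[r.name], r.source)
        if key not in coded:
            result.append(Relation(*key, auto_reversed=True, negated=r.negated))
            coded.add(key)
    return result
```

With the finger/hand pair coded in both directions and only the paper-clip side of the made-of relation coded, the sketch adds a single relation from metal to paper-clip and marks it as reversed.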

Cross-Part-Of-Speech relations
  • WordNet1.5: nouns and verbs are not interrelated by basic semantic relations such as hyponymy and synonymy:
    • adornment 2 change of state-- (the act of changing something)
    • adorn 1 change, alter-- (cause to change; make different)
  • EuroWordNet: words of different parts of speech can be inter-linked with explicit xpos-synonymy, xpos-antonymy and xpos-hyponymy relations:
    • {adorn V} XPOS_NEAR_SYNONYM {adornment N}
Cross-Part-Of-Speech relations

The advantages of such explicit cross-part-of-speech relations are:

  • similar words with different parts of speech are grouped together.
  • the same information can be coded in an NP or in a sentence. By unifying higher-order nouns and verbs in the same ontology it will be possible to match expressions with very different syntactic structures but comparable content.
  • by merging verbs and abstract nouns we can more easily link mismatches across languages that involve a part-of-speech shift. Dutch nouns such as “afsluiting”, “gehuil” are translated with the English verbs “close” and “cry”, respectively.
Entailment in WordNet

WordNet1.5: Entailment indicates the direction of the implication or entailment:

a. + Temporal Inclusion (the two situations partially or totally overlap)

a.1 co-extensiveness (e. g., to limp/to walk) hyponymy/troponymy

a.2 proper inclusion (e.g., to snore/to sleep) entailment

b. - Temporal Exclusion (the two situations are temporally disjoint)

b.1 backward presupposition (e.g., to succeed/to try) entailment

b.2 cause (e.g., to give/to have)

Subevents in EuroWordNet

EuroWordNet

Direction of the entailment is expressed by the labels factive and reversed:

{to succeed} is_caused_by {to try} factive

{to try} causes {to succeed} non-factive

Proper inclusion is described by the has_subevent/ is_subevent_of relation in combination with the label reversed:

{to snore} is_subevent_of {to sleep}

{to sleep} has_subevent {to snore} reversed

{to buy} has_subevent {to pay}

{to pay} is_subevent_of {to buy} reversed

The interpretation of the CAUSE relation
  • WordNet1.5: The causal relation only holds between verbs and it should only apply to temporally disjoint situations:
  • EuroWordNet: the causal relation will also be applied across different parts of speech:
    • {to kill} v causes {death} n
    • {death} n is_caused_by {to kill} v reversed
    • {to kill} v causes {dead} a
    • {dead} a is_caused_by {to kill} v reversed
    • {murder} n causes {death} n
    • {death} n is_caused_by {murder} n reversed
The interpretation of the CAUSE relation
  • Various temporal relationships between the (dynamic/non-dynamic) situations may hold:
    • Temporally disjoint: there is no time point when dS1 takes place and also S2 (which is caused by dS1) (e.g. to shoot/to hit);
    • Temporally overlapping: there is at least one time point when both dS1 and S2 take place, and there is at least one time point when dS1 takes place and S2 (which is caused by dS1) does not yet take place (e.g. to teach/to learn);
    • Temporally co-extensive: whenever dS1 takes place also S2 (which is caused by dS1) takes place and there is no time point when dS1 takes place and S2 does not take place, and vice versa (e.g. to feed/to eat).
Role relations

In the case of many verbs and nouns the most salient relation is not the hyperonym but the relation between the event and the involved participants. These relations are expressed as follows:

{hammer} ROLE_INSTRUMENT {to hammer}

{to hammer} INVOLVED_INSTRUMENT {hammer} reversed

{school} ROLE_LOCATION {to teach}

{to teach} INVOLVED_LOCATION {school} reversed

These relations are typically used when other relations, mainly hyponymy, do not clarify the position of the concept in the network, but the word is still closely related to another word.

Co_Role relations

guitar player HAS_HYPERONYM player
              CO_AGENT_INSTRUMENT guitar

player HAS_HYPERONYM person
       ROLE_AGENT to play music
       CO_AGENT_INSTRUMENT musical instrument

to play music HAS_HYPERONYM to make
              ROLE_INSTRUMENT musical instrument

guitar HAS_HYPERONYM musical instrument
       CO_INSTRUMENT_AGENT guitar player

ice saw HAS_HYPERONYM saw
        CO_INSTRUMENT_PATIENT ice

saw HAS_HYPERONYM saw
    ROLE_INSTRUMENT to saw

ice CO_PATIENT_INSTRUMENT ice saw REVERSED

Co_Role relations

Examples of the other relations are:

criminal CO_AGENT_PATIENT victim

novel writer/ poet CO_AGENT_RESULT novel/ poem

dough CO_PATIENT_RESULT pastry/ bread

photographic camera CO_INSTRUMENT_RESULT photo


BE_IN_STATE and STATE_OF

Example: the poor are the ones to whom the state poor applies

Effect: poor N HAS_HYPERONYM person N

poor N BE_IN_STATE poor A

poor A STATE_OF poor N reversed

IN_MANNER and MANNER_OF

Example: to slurp is to eat in a noisy manner

Effect: slurp V HAS_HYPERONYM eat V

slurp V IN_MANNER noisily Adverb

noisily Adverb MANNER_OF slurp V reversed

Overview of the Language Internal relations in EuroWordnet
  • Same Part of Speech relations:
    • NEAR_SYNONYMY: apparatus - machine
    • HYPERONYMY/HYPONYMY: car - vehicle
    • ANTONYMY: open - close
    • HOLONYMY/MERONYMY: head - nose
  • Cross-Part-of-Speech relations:
    • XPOS_NEAR_SYNONYMY: dead - death; to adorn - adornment
    • XPOS_HYPERONYMY/HYPONYMY: to love - emotion
    • XPOS_ANTONYMY: to live - dead
    • CAUSE: die - death
    • SUBEVENT: buy - pay; sleep - snore
    • ROLE/INVOLVED: write - pencil; hammer - hammer
    • STATE: the poor - poor
    • MANNER: to slurp - noisily
    • BELONG_TO_CLASS: Rome - city
Thematic networks

[Figure: a thematic network of Dutch medical vocabulary.]
  • Nodes: organisme (organism); wezen (being); persoon (person); arts (doctor); zieke (sick person, patient); ziekte (disease); maagaandoening (stomach disease); orgaan (organ); maag (stomach); genezen (to get well); behandelen (treat); opereren (operate); scalpel
  • Link labels: Causes; Patient; Agent; Instrument; Part of; Involves

The Multilingual Design
  • Inter-Lingual-Index: unstructured fund of concepts to provide an efficient mapping across the languages;
  • Index-records are mainly based on WordNet1.5 synsets and consist of synonyms, glosses and source references;
  • Various types of complex equivalence relations are distinguished;
  • Equivalence relations from synsets to index records: not on a word-to-word basis;
  • Indirect matching of synsets linked to the same index items;
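
Indirect matching through the index can be sketched as a lookup over shared index records. The links below are toy data: which synset points to which ILI-record is an assumption for illustration, and the function names are hypothetical.

```python
from collections import defaultdict

# (language, synset, equivalence relation, ILI-record); toy data for illustration.
EQUIVALENCE_LINKS = [
    ("NL", "rijden", "EQ_SYNONYM", "drive"),
    ("ES", "conducir", "EQ_SYNONYM", "drive"),
    ("IT", "guidare", "EQ_SYNONYM", "drive"),
    ("NL", "gaan", "EQ_SYNONYM", "go"),
    ("IT", "andare", "EQ_SYNONYM", "go"),
]


def build_index(links):
    """Group synsets by the ILI-record they point to."""
    index = defaultdict(list)
    for language, synset, _relation, ili_record in links:
        index[ili_record].append((language, synset))
    return index


def equivalents(language, synset, links, target_language):
    """Find target-language synsets that share an ILI-record with the given synset."""
    index = build_index(links)
    found = []
    for lang, syn, _relation, ili_record in links:
        if (lang, syn) == (language, synset):
            found.extend(s for l, s in index[ili_record] if l == target_language)
    return found
```

No wordnet is linked directly to another here: the Dutch and Spanish synsets are matched only because both point to the same index record.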
EWN Interlingual Relations
  • EQ_SYNONYM: there is a direct match between a synset and an ILI-record.
  • EQ_NEAR_SYNONYM: a synset matches multiple ILI-records simultaneously.
  • HAS_EQ_HYPERONYM: a synset is more specific than any available ILI-record.
  • HAS_EQ_HYPONYM: a synset can only be linked to more specific ILI-records.
  • other relations:

CAUSES/IS_CAUSED_BY, EQ_SUBEVENT/EQ_ROLE, EQ_IS_STATE_OF/EQ_BE_IN_STATE

Equivalent Near Synonym
  • 1. Multiple Targets
    • One sense for Dutch schoonmaken (to clean) which simultaneously matches with at least 4 senses of clean in WordNet1.5:
    • {make clean by removing dirt, filth, or unwanted substances from}
    • {remove unwanted substances from, such as feathers or pits, as of chickens or fruit}
    • {remove in making clean; "Clean the spots off the rug"}
    • {remove unwanted substances from - (as in chemistry)}
    • The Dutch synset schoonmaken will thus be linked with an eq_near_synonym relation to all these senses of clean.
Equivalent Near Synonym
  • 2. Multiple Source meanings
    • Synsets inter-linked by a near_synonym relation can be linked to the same target ILI-record(s), either with an eq_synonym or an eq_near_synonym relation:
    • Dutch wordnet:
    • toestel near_synonym apparaat
    • ILI-records: {machine}; {device}; {apparatus}; {tool}
Equivalent Hyponymy

has_eq_hyperonym

Typically used for gaps in WordNet1.5 or in English:

  • genuine, cultural gaps for things not known in English culture, e.g. citroenjenever, which is a kind of gin made out of lemon skin,
  • pragmatic, in the sense that the concept is known but is not expressed by a single lexicalized form in English, e.g.: Dutch hoofd only refers to human head and Dutch kop only refers to animal head, whereas English uses head for both.

has_eq_hyponym

Used when WordNet1.5 only provides narrower terms. In this case there can only be a pragmatic difference, not a genuine cultural gap, e.g.: Spanish dedo can be used to refer to both finger and toe.

Complex mappings across languages

[Figure: synsets from four wordnets mapped to Inter-Lingual-Index records.]
  • Wordnets: GB-Net (toe, finger, head); IT-Net (dito); ES-Net (dedo); NL-Net (hoofd, kop)
  • Index records: toe {part of foot}; finger {part of hand}; dedo, dito {finger or toe}; head {part of body}; hoofd {human head}; kop {animal head}
  • Link types: normal equivalence; eq_has_hyponym; eq_has_hyperonym

Overall Building Process

[Figure: flow chart of the building process. Recoverable labels:]
  • Sources: Machine Readable Dictionaries; Wordnets, Taxonomies, Corpora; loaded in local databases
  • Step labels: Ia; Ib; Ic; II; III
  • Activities: specification of selection criteria; encoding of language-internal and equivalence relations; improve and extend the wordnet fragments; adjust coverage, improve encoding; load wordnet in the EuroWordNet Database; comparing and restructuring the wordnet; verification by users; demonstration in Information Retrieval
  • Intermediate results: subset of word meanings; wordnet fragment with links to WordNet1.5 in local database; wordnet fragment in EuroWordNet database; verification report

Main Methods
  • Expand approach: translate WordNet1.5 synsets to another language and take over the structure
    • easier and more efficient method
    • compatible structure with WordNet1.5
    • structure is close to WordNet1.5 but also biased by it
  • Merge approach: create an independent wordnet in another language and align the separate hierarchies by generating the appropriate translations
    • more complex and labour intensive
    • different structure from WordNet1.5
    • language-specific patterns can be maintained
Methods for extracting language-internal relations
  • editors and database for manually encoding relations;
  • comparison with WordNet1.5 structure;
  • definition patterns in monolingual dictionaries;
  • co-occurrences in corpora;
  • morphology;
  • bilingual dictionaries;
  • lexical semantic substitution tests

Methods for extracting equivalence relations

  • extract monosemous translations of English synsets, e.g. a Spanish word has only one translation to an English word, which has only one sense, and vice versa;
  • disambiguation of multiple ambivalent translations by measuring the conceptual distance between the senses of these translations in the WordNet1.5 hierarchy (Rigau and Aguirre, 95);
  • disambiguation of ambivalent translations by measuring the conceptual distance directly in the WordNet1.5 hierarchy between alternative translations and the translations of the direct semantic context in the source wordnet;
  • disambiguation of ambivalent translations by measuring the overlap in top-concepts inherited in the source wordnet and inherited for the different senses of translations in WordNet1.5.
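
The first method, extracting monosemous translations, can be sketched as a filter over a bilingual dictionary. The dictionary entries and sense counts below are invented toy values, not data from the project or from WordNet1.5, and the function name is hypothetical.

```python
def monosemous_links(bilingual, sense_count):
    """Pair a source word with an English word when each is the other's only
    translation and the English word has exactly one sense."""
    # Invert the dictionary: English word -> source words that translate to it.
    back = {}
    for source, english_words in bilingual.items():
        for english in english_words:
            back.setdefault(english, set()).add(source)

    links = []
    for source, english_words in bilingual.items():
        if len(english_words) != 1:
            continue  # the source word has several translations
        (english,) = english_words
        if back[english] == {source} and sense_count.get(english) == 1:
            links.append((source, english))
    return links


# Toy Spanish-English entries and toy sense counts (assumptions for illustration).
bilingual = {
    "cuchara": ["spoon"],
    "dedo": ["finger", "toe"],
    "caja": ["box"],
    "estuche": ["box"],
}
sense_count = {"spoon": 1, "finger": 1, "toe": 1, "box": 2}
```

Only the pair that is unambiguous in both directions survives. The remaining words would go to the distance-based disambiguation methods listed above.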
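Aligning wordnets

[Figure: aligning a Dutch hierarchy with the WordNet1.5 hierarchy.]
  • Upper nodes: object; artifact object; natural object; instrument
  • Dutch / English pairs: muziekinstrument / musical instrument; orgel / organ; hammond orgel / hammond organ
  • The label "organ" appears three times in the figure, and one candidate link is marked "?"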
Inheriting Semantic Features

hart 1: orgaan 1 (Living Part) > deel 2 (Part) > iets 1 > LEAF

-----------------------------------------------------------------------------------------------------

heart 1: playing card 1 > card 1 (Artifact Function Object) > paper 6 (Artifact Solid) > material 5 (Substance) > matter 1 > inanimate object 1 > entity 1 > LEAF

heart 2: disposition 2 (Dynamic Experience Mental) > nature 1 > trait 1 (Property) > attribute 1 (Property) > abstraction 1 > LEAF

heart 3: bravery 1 > spirit 1 > character 1 > trait 1 (Property) > attribute 1 (Property) > abstraction 1 > LEAF

heart 4: internal organ 1 > organ 4 (Living Part) > body part 1 (Living Part) > part 10 > entity 1 > LEAF

Conflicting Starting points
  • 1. There should be a maximum of flexibility:
      • the wordnets should be able to reflect language-specific relations and patterns
      • the wordnets should be built relatively independently because each site has different starting points:
        • different tools, database and resources (Machine Readable Dictionaries)
        • differences in the languages
  • 2. The wordnets have to be compatible in terms of coverage and relations to be useful for multilingual information retrieval and translation tools and to be able to compare the wordnets.
Measures to achieve maximal compatibility
  • The results are loaded into a common Multilingual Database (Polaris):
    • consistency checks and types of incompatibility
    • specific comparison options to measure consistency and overlap in coverage
  • User-guides for building wordnets in each language:
    • the steps to encode the relations for a word meaning.
    • common tests and criteria for all the relations.
    • overview of problems and solutions.
  • A set of common Base-Concepts which are shared by all the sites, having:
    • most relations and the most-important positions in the wordnets
    • the most meanings and poor definitions
  • Classification of the common Base Concepts in terms of a Top-Ontology of 63 basic Semantic Distinctions
  • Top-Down Approach, where first the Base Concepts and their direct context are (manually) encoded and next the wordnets are (semi-automatically) extended top-down to include more specific concepts that depend on these Base Concepts.
Top-Ontology and Base Concepts
  • Top-Ontology with 63 higher-level concepts
  • Existing Ontologies:
      • WordNet1.5 top-levels
      • Aktions-Art models (Vendler, Verkuyl)
      • Acquilex and Sift ontologies (EC-projects)
      • Qualia-structure (Pustejovsky)
      • Upper-Model, MikroKosmos, Cyc, Ad Hoc ANSI-Committee on ontologies
    • The ontology was adapted to represent the variety of concepts in the set of Common Base Concepts, across the 4 languages:
      • homogeneous Base-Concept Clusters
      • average size of Base Concept Cluster
      • apply to both nouns and verbs
  • Set of 1024 common Base Concepts making up the core of the separate wordnets.
Base Concepts
  • Procedure:
  • Each site determined the set of word meanings with most relations (up to 15% of all relations) and high positions in the hierarchy.
  • This set was extended with all meanings used to define the first selection.
  • The local selection was translated to WordNet1.5 equivalences: 4 lists of WordNet1.5 synsets (between 450 – 2000 synsets per selection).
  • These sets of WordNet1.5 translations have been compared.
  • Concepts selected by all sites:
  • 30 synsets (24 noun synsets, 6 verb synsets).
  • Explanations:
  • The individual selections are not representative enough.
  • There are major differences in the way meanings are classified, which have an effect on the frequency of the relations.
  • The translations of the selection to WordNet1.5 synsets are not reliable.
  • The resources cover very different vocabularies
Concepts selected by at least two sites: intersections of pairs

           NOUNS                        VERBS
           NL     ES    IT    GB/WN     NL    ES    IT    GB/WN
  NL       1027   103   182   333       323   36    42    86
  ES       103    523   45    284       36    128   18    43
  IT       182    45    334   167       42    18    104   39
  GB/WN    333    284   167   1296      86    43    39    236

Total Set of shared Base Concepts: Union of intersection pairs

                      Nouns   Verbs   Total
  1stOrderEntities    491             491
  2ndOrderEntities    272     228     500
  3rdOrderEntities    33              33
  Total               796     228     1024
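
The selection rule behind these tables (keep a concept if at least two sites chose it) is a union of pairwise intersections. A minimal sketch follows. The four toy selections are invented for illustration and are unrelated to the actual counts above.

```python
from itertools import combinations


def shared_base_concepts(selections):
    """Union of the pairwise intersections: concepts picked by at least two sites."""
    shared = set()
    for (_site_a, a), (_site_b, b) in combinations(selections.items(), 2):
        shared |= a & b
    return shared


def selected_by_all(selections):
    """Concepts picked by every site."""
    return set.intersection(*selections.values())


# Toy selections per site (assumed data).
selections = {
    "NL": {"object", "move", "body"},
    "ES": {"object", "move", "idea"},
    "IT": {"object", "body", "plant"},
    "GB/WN": {"object", "idea", "go"},
}
```

The toy data mirrors the situation described on the slides: very few concepts are selected by all sites, while the union of pairwise intersections is considerably larger.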

Table 4: Number of Common BCs represented in the local wordnets

         Related to CBCs   Eq_synonym relations   Eq_near_synonym relations   CBCs without direct equivalent
  AMS    992               725                    269                         97
  FUE    1012              1009                   0                           15
  PSA    878               759                    191                         9

Table 5: BC4 Gaps in at least two wordnets (10 synsets)

  body covering#1            mental object#1; cognitive content#1; content#2
  body substance#1           natural object#1
  social control#1           place of business#1; business establishment#1
  change of magnitude#1      plant organ#1
  contractile organ#1        Plant part#1
  psychological feature#1    spatial property#1; spatiality#1


Table 6: Local senses with complex equivalence relations to CBCs

  • NL ES IT
  • Eq_has_hyperonym 61 40 4
  • eq_has_hyponym 34 14 20
  • Eq_has_holonym 2 0
  • Eq_has_meronym 3 2
  • Eq_involved 3
  • Eq_is_caused_by 3
  • Eq_is_state_of 1
  • Example of complex relation
      • CBC: cause to feel unwell#1, Verb
      • Closest Dutch concept: {onwel#1}, Adjective (sick)
  • Equivalence relation: eq_is_caused_by
Adaptation of Base Concepts in EuroWordNet-2
  • A similar selection of fundamental concepts has been made in EuroWordNet-2
  • The selected concepts have been compared among German, French, Czech and Estonian and with the EuroWordNet-1 selection
  • The EuroWordNet-1 set has been extended to 1310 Base Concepts
  • A distinction has been made between Hard and Soft Base Concepts
    • Hard: represented by only a single Index-record
    • Soft: represented by several close Index-records
  • The final set has been used as starting point in EuroWordNet-2
Starting points for the Top-Ontology
  • The ontology should support the building and encoding of semantic networks as linguistic ontologies: networks of lexicalized words and expressions in a language.
  • The classification of the Base Concepts in terms of the Top Ontology should apply to all the involved languages.
  • Enforce uniformity and compatibility of the different wordnets, by providing a common framework. Divide the Base Concepts (BCs) into coherent clusters to enable contrastive-analysis and discussion of closely related word meanings
  • Customize the database by assigning features to the top-concepts, irrespective of language-specific structures.
  • Provide an anchor point for connecting other ontologies to the Inter-Lingual-Index, such as CYC, MikroKosmos, the Upper-Model, by linking them to the corresponding ILI-records.
Principles for deciding on the distinctions
  • Starting point is that the wordnets are linguistic ontologies:
  • Semantic classifications common in linguistic paradigms: Aktionsart models [Vendler 1967, Verkuyl 1972, Verkuyl 1989, Pustejovsky 1991], entity-orders [Lyons 1977], Aristotle’s Qualia-structure [Pustejovsky 1995].
  • Ontologies developed in previous EC-projects, which had a similar basis and are well-known in the project consortium: Acquilex (BRA 3030, 7315), Sift (LE-62030) [Vossen and Bon 1996].
  • The ontology should be capable of reflecting the diversity of the set of common BCs, across the 4 languages. In this sense the classification of the common BCs in terms of the top-concepts should result in:
    • Homogeneous Base Concept Clusters: classifications in WordNet1.5 and the other wordnets.
    • Average-sized Base Concept Clusters: not extremely large or small.
Other important characteristics:
  • The distinctions apply to nouns, verbs and adjectives alike, because these can be related in the language-specific wordnets via a xpos_synonymy relation, and the ILI-records can be related to any part-of-speech.
  • The top-concepts are hierarchically ordered by means of a subsumption relation but there can only be one super-type linked to each top-concept: multiple inheritance between top-concepts is not allowed.
  • In addition to the subsumption relation top-concepts can have an opposition-relation to indicate that certain distinctions are disjunct, whereas others may overlap.
  • There may be multiple relations from ILI-records to top-concepts: the Base Concepts can be cross-classified in terms of multiple top-concepts (as long as these have no opposition-relation between them): i.e. multiple inheritance from Top-Concept to Base Concept is allowed.
  • Result: the TCs function as cross-classifying features rather than conceptual classes.
    • Meanings for bodyparts are not linked to a single class BodyPart but to two features: Living and Part.
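
The constraints listed above can be illustrated with a small sketch. This is not the EuroWordNet implementation: the class `TopOntology`, its method names and the tiny sample hierarchy are assumptions made for this example, and the opposition between Natural and Artifact is assumed purely for illustration. Only the concept names and the Living + Part classification of body parts come from the slides.

```python
class TopOntology:
    """Toy model of top-concepts (TCs): single inheritance plus opposition."""

    def __init__(self):
        self.parent = {}      # TC -> its single super-type (None for a root)
        self.opposed = set()  # frozensets holding pairs of disjoint TCs

    def add(self, tc, parent=None):
        if tc in self.parent:
            # A second super-type would be multiple inheritance between TCs.
            raise ValueError(f"{tc} already has a place in the hierarchy")
        self.parent[tc] = parent

    def oppose(self, a, b):
        self.opposed.add(frozenset((a, b)))

    def ancestors(self, tc):
        """Return tc followed by all of its super-types up to the root."""
        chain = []
        while tc is not None:
            chain.append(tc)
            tc = self.parent[tc]
        return chain

    def can_cross_classify(self, tcs):
        """True if no two features (or their super-types) are opposed."""
        expanded = [set(self.ancestors(tc)) for tc in tcs]
        for i, left in enumerate(expanded):
            for right in expanded[i + 1:]:
                for a in left:
                    for b in right:
                        if frozenset((a, b)) in self.opposed:
                            return False
        return True


onto = TopOntology()
onto.add("Origin")
onto.add("Natural", "Origin")
onto.add("Living", "Natural")
onto.add("Artifact", "Origin")
onto.add("Composition")
onto.add("Part", "Composition")
onto.oppose("Natural", "Artifact")  # hypothetical opposition, for illustration

body_part_ok = onto.can_cross_classify(["Living", "Part"])
living_artifact_ok = onto.can_cross_classify(["Living", "Artifact"])
print(body_part_ok, living_artifact_ok)
```

In this sketch a Base Concept is simply a list of feature names, which matches the idea that the TCs act as cross-classifying features rather than as conceptual classes.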
The EuroWordNet Top-Ontology: 63 concepts (excluding the top)
  • First Level [Lyons 1977]:
  • 1stOrderEntity (491 BC synsets, all nouns)
    • Any concrete entity (publicly) perceivable by the senses and located at any point in time, in a three-dimensional space.
  • 2ndOrderEntity (500 BC synsets, 272 nouns and 228 verbs)
    • Any Static Situation (property, relation) or Dynamic Situation, which cannot be grasped, heard, seen, felt as an independent physical thing. They can be located in time and occur or take place rather than exist; e.g. continue, occur, apply.
  • 3rdOrderEntity (33 BC synsets, all nouns)
    • An unobservable proposition that exists independently of time and space. They can be true or false rather than real. They can be asserted or denied, remembered or forgotten. E.g. idea, thought, information, theory, plan.
Test to distinguish 1st, 2nd and 3rd OrderEntities
  • Third-order entities cannot occur, have no temporal duration and therefore fail on both tests:
  • a The same person was here again today
  • b The same thing happened/occurred again today
  • *? The idea, fact, expectation, etc. was here/occurred/took place
  • A positive test for a 3rdOrderEntity is based on the properties that can be predicated:
  • ok The idea, fact, expectation, etc. is true, is denied, is forgotten
  • The first division of the ontology is disjoint: BCs cannot be classified as combinations of these TCs. This distinction cuts across the different parts of speech in that:
  • 1stOrderEntities are always (concrete) nouns.
  • 2ndOrderEntities can be nouns, verbs and adjectives, where adjectives are always non-dynamic (refer to states and situations not involving a change of state).
  • 3rdOrderEntities are always (abstract) nouns.
Base Concepts classified as 3rdOrderEntities
  • theory; idea; structure; evidence; procedure; doctrine; policy; data point; content; plan of action; concept; plan; communication; knowledge base; cognitive content; know-how; category; information; abstract; info;
1stOrderEntity (1)
  • Origin (0): the way in which an entity has come about
    • Natural (21)
      • Living (30)
        • Plant (18)
        • Human (106)
        • Creature (2)
        • Animal (123)
    • Artifact (144)
  • Function (0): the typical activity or role that is associated with an entity
    • Vehicle (8); Occupation (23); Covering (8); Garment (3); Software (4); Furniture (6); Place (45); Container (12); Comestible (32); Instrument (18); Building (13)
    • Representation (12): MoneyRepresentation (10); LanguageRepresentation (34); ImageRepresentation (9)
  • Form (0): amorphous or fixed shape
    • Substance (32)
      • Solid (63)
      • Liquid (13)
      • Gas (1)
    • Object (62)
  • Composition (0): group of self-contained wholes or as a part of such a whole
    • Part (86)
    • Group (63)


Conjunctive classes of 1stOrderEntities

  • Frequent combinations (number of Base Concepts, followed by the combination)
    • 5 Comestible;Solid;Artifact
    • 5 Container;Part;Solid;Living
    • 5 Furniture;Object;Artifact
    • 5 Instrument;Artifact
    • 5 Living
    • 5 Plant
    • 6 Liquid
    • 6 Object;Artifact
    • 6 Part;Living
    • 6 Place;Part;Solid
    • 7 Building;Object;Artifact
    • 7 Group
    • 7 LanguageRepresentation
    • 7 Vehicle;Object;Artifact
    • 10 Instrument;Object;Artifact
    • 12 Part
    • 14 Place
    • 14 Place;Part
    • 15 Substance
    • 19 LanguageRepresentation;Artifact
    • 20 Occupation;Object;Human
    • 22 Object;Animal;Function
    • 38 Group;Human
    • 42 Object;Human

Conjunctive classes of 1stOrderEntities
  • Low-frequency combinations
    • fruit: Comestible (Function); Object (Form); Part (Composition); Plant (Natural, Origin)
    • life: Group (Composition); Living (Natural, Origin)
    • cell: Part (Composition); Living (Natural, Origin)
    • skin: Covering (Function); Solid (Form); Part (Composition); Living (Natural, Origin)
    • arms: Instrument (Function); Group (Composition); Object (Form); Artifact (Origin)
1stOrderEntities classified as Function only

barrier 1; belonging 2; building material 1; causal agency 1; commodity 1; consumer goods 1; creation 3; curative 1; decoration 2; device 4; fastener 1; force 6; force 7; form 5; impediment 1; medicament 1; piece of work 1; possession 1; protection 4; remains 2; restraint 2; support 6; support 7; supporting structure 1; thing 3

2ndOrderEntity (0)
  • SituationType (6): the event-structure in terms of which a situation can be characterized as a conceptual unit over time (disjoint features)
    • Dynamic (134), e.g. "he sat down quickly", "a quick meeting"
      • BoundedEvent (183)
      • UnboundedEvent (48)
    • Static (28), e.g. "?he sits quickly"
      • Property (61)
      • Relation (38)
  • SituationComponent (0): the most salient semantic component(s) that characterize(s) a situation (conjunctive features)
    • Cause (67); Communication (50); Condition (62); Physical (140)
    • Agentive (170); Existence (27); Experience (43); Possession (23)
    • Phenomenal (17); Location (76); Manner (21); Purpose (137)
    • Stimulating (25); Mental (90); Modal (10); Quantity (39)
    • Social (102); Time (24); Usage (8)

Conjunctive classes of 2ndOrderEntities

Static

5 Property;Physical;Condition

5 Property;Stimulating;Physical

5 Relation

5 Relation;Social

6 Static;Quantity

7 Property;Condition

8 Relation;Location

9 Property

10 Relation;Physical;Location:

adjoin 1; aim 4; blank space 1; course 7; direction 8; distance 1; elbow room 1; path 3; spatial property 1; spatial relation 1

Conjunctive classes of 2ndOrderEntities

Dynamic

5 BoundedEvent;Cause;Physical

5 BoundedEvent;Cause;Physical;Location

5 BoundedEvent;Time

5 Dynamic

5 Dynamic;Location

5 Dynamic;Phenomenal

5 Dynamic;Phenomenal;Physical

6 BoundedEvent;Agentive

6 BoundedEvent;Location

6 BoundedEvent;Physical;Location

6 Dynamic;Agentive;Communication

6 Dynamic;Cause

8 BoundedEvent;Agentive;Mental;Purpose

8 BoundedEvent;Quantity;Time

9 BoundedEvent;Cause

9 Dynamic;Experience;Mental:

experience 7; find 3; affect 5; arouse 5; excite 2; cognition 1; desire 2; disposition 2; disposition 4; disturbance 7; emotion 1; feeling 1; humor 3; pleasance 1; process 4; look 8; phenomenon 1; cause to appear 1; perception 2; sensation 1; feel 12; experience 8; trouble 3; reality 1

Top-Down Building Procedure
  • 1) Construction of a core wordnet from the common set of Base Concepts
  • Find Representatives in the local language for the Common Base Concepts (1310 synsets)
  • Add local Base Concepts that are not selected as Common Base Concepts
  • Specify the hyperonyms of the local and common Base Concepts
  • 2) Extend the Core Wordnets
  • Add the first level of hyponyms to the core wordnets
  • Add other hyponyms which have many sub-hyponyms
  • Add other types of relations: XPOS, roles, meronymy, subevents, causes.
  • 3) Verify the Selection
  • Corpus frequency: Parole lexicons and corpora
  • Top-Concept clustering
  • Intersection of ILI-records
  • Overlap in ILI-chains
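
Steps 1 and 2 of the procedure can be sketched roughly as follows. This is only an illustration under simplifying assumptions: every synset has a single hyperonym, the function names `descendants` and `build_selection` and the threshold `min_sub_hyponyms` are invented for the example, and the toy hierarchy is not taken from any wordnet.

```python
from collections import defaultdict


def descendants(hyponyms, synset):
    """Collect all direct and indirect hyponyms of a synset."""
    seen, stack = set(), list(hyponyms.get(synset, ()))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(hyponyms.get(node, ()))
    return seen


def build_selection(base_concepts, hypernym_of, min_sub_hyponyms=2):
    # Invert the child -> hyperonym map into hyperonym -> children.
    hyponyms = defaultdict(list)
    for child, parent in hypernym_of.items():
        hyponyms[parent].append(child)

    # Step 1: the core holds the Base Concepts and their hyperonym chains.
    core = set()
    for bc in base_concepts:
        node = bc
        while node is not None and node not in core:
            core.add(node)
            node = hypernym_of.get(node)

    # Step 2a: add the first level of hyponyms below the core.
    first_level = {h for c in core for h in hyponyms.get(c, ())} - core

    # Step 2b: add deeper hyponyms that have many sub-hyponyms themselves.
    selected = core | first_level
    extra = set()
    for synset in first_level:
        for d in descendants(hyponyms, synset):
            if d not in selected and len(descendants(hyponyms, d)) >= min_sub_hyponyms:
                extra.add(d)
    return core, first_level, extra


toy_hypernyms = {
    "animal": "entity", "dog": "animal", "cat": "animal",
    "poodle": "dog", "terrier": "dog",
    "toy poodle": "poodle", "miniature poodle": "poodle",
    "fox terrier": "terrier",
}
core, first_level, extra = build_selection(["animal"], toy_hypernyms)
print(sorted(core), sorted(first_level), sorted(extra))
```

The verification in step 3 (corpus frequency, top-concept clustering, ILI overlap) is not modelled here.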
Top-Down Building

[Figure: top-down building diagram. Labels: Top-Ontology (63 TCs); Inter-Lingual-Index (1310 CBCs, 149 new ILIs, remaining WordNet1.5 synsets); and, for each of two wordnets: hyperonyms, CBC representatives, local BCs, first-level hyponyms, remaining hyponyms, WMs related via non-hyponymy.]


Comparison of wordnets
  • In depth comparison of major semantic fields
  • Comparison of the intersection of the associated ILI-records
  • Distribution of the associated ILI-records over the different top ontology clusters
  • Comparison of the hyponymy relations in the wordnets, projected on the associated ILI-records
Intersection of the associated ILI-records

Nouns: 62780 ILI-records in total (WN, IT, NL, ES); 32520 in (IT, NL, ES)
Verbs: 12215 ILI-records in total (WN, IT, NL, ES); 7455 in (IT, NL, ES)

                 Nouns                                        Verbs
                 frequency  % of (WN,IT,NL,ES)  % of (IT,NL,ES)   frequency  % of (WN,IT,NL,ES)  % of (IT,NL,ES)
ES               24596      39.2%               75.6%             4654       38.1%               62.4%
IT               14272      22.7%               43.9%             4673       38.3%               62.7%
NL               21259      33.9%               65.4%             6416       52.5%               86.1%
∩ (ES, IT)       10907      17.4%               33.5%             3272       26.8%               43.9%
∩ (ES, NL)       14773      23.5%               45.4%             3870       31.7%               51.9%
∩ (IT, NL)       9862       15.7%               30.3%             3950       32.3%               53.0%
∩ (ES, IT, NL)   8183       13.0%               25.2%             3051       25.0%               40.9%
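
A table of this kind can be derived from the sets of ILI-records referenced by each wordnet. The sketch below is an assumption about how such a computation might look, not the project's tooling: `intersection_table` is a name invented for the example, and the sample sets are toy data rather than real ILI identifiers.

```python
from itertools import combinations


def intersection_table(ili_by_language):
    """Size and share (of the union) for each language and each intersection."""
    union = set().union(*ili_by_language.values())
    languages = sorted(ili_by_language)
    rows = {}
    for size in range(1, len(languages) + 1):
        for combo in combinations(languages, size):
            common = set.intersection(*(ili_by_language[l] for l in combo))
            rows[combo] = (len(common), round(100.0 * len(common) / len(union), 1))
    return rows


# Toy data: integers stand in for ILI-record identifiers.
sample = {"ES": {1, 2, 3, 4}, "IT": {2, 3, 5}, "NL": {3, 4, 5, 6}}
rows = intersection_table(sample)
for combo, (count, share) in rows.items():
    print(" ∩ ".join(combo), count, f"{share}%")
```

Percentages relative to the full index (including WordNet) would use the size of that index as the denominator instead of the union of the three wordnets.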
Comparison of the hyponymy relations, projected on the associated ILI-records

To be able to compare hyponymy chains, each word sense in a chain is replaced by the ILI-record linked to its synset, which gives the following result (chain listed from hyperonym to hyponym):

  • veranderen (change): 00064108
  • bewegen (move, intransitive): 01046072
  • bewegen (move, reflexive): 01046072
  • voortbewegen (move location): 01046072
  • verplaatsen (move from A to B): 01055491
  • stijgen (move to a higher position): 01094615
  • opstijgen (take off): 00257753
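
The projection can be sketched in a few lines. The pairing of each Dutch sense with an ILI number follows the order in which they appear on the slide; the function names and the second chain (`OTHER_CHAIN`) are invented here purely to show how two projected chains might be compared.

```python
from itertools import groupby

# Dutch chain from the slide, paired position by position with its ILI-records.
DUTCH_CHAIN = [
    ("veranderen (change)", "00064108"),
    ("bewegen (move intransitive)", "01046072"),
    ("bewegen (move reflexive)", "01046072"),
    ("voortbewegen (move location)", "01046072"),
    ("verplaatsen (move from A to B)", "01055491"),
    ("stijgen (move to a higher position)", "01094615"),
    ("opstijgen (take off)", "00257753"),
]


def project(chain):
    """Replace each word sense by its ILI-record, merging consecutive repeats."""
    ili_sequence = [ili for _, ili in chain]
    return [ili for ili, _ in groupby(ili_sequence)]


def shared_records(projected_a, projected_b):
    """ILI-records of chain A that also occur in chain B, in A's order."""
    in_b = set(projected_b)
    return [ili for ili in projected_a if ili in in_b]


dutch_projected = project(DUTCH_CHAIN)

# Hypothetical projected chain from another wordnet, for illustration only.
OTHER_CHAIN = ["00064108", "01046072", "00257753"]
overlap = shared_records(dutch_projected, OTHER_CHAIN)
print(dutch_projected)
print(overlap)
```

Three Dutch senses map to the same record 01046072, so the seven-step Dutch chain collapses to five distinct ILI-records.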

Towards an efficient, condensed and universal index of sense-distinctions
  • Independently of the wordnet structures in each language, we can manipulate the mapping across languages via the ILI.
  • We can use the information of all the languages to correct incompleteness and inconsistencies of the individual resources
  • Ultimately, we should try to find a minimal and sufficient set of concepts to provide an efficient mapping.
Characteristics of the Inter-Lingual-Index
  • The Inter-lingual-Index (ILI) is an unstructured fund of concepts with the sole purpose of providing an efficient mapping of senses across languages.
  • Requirements:
  • 1. efficient level of granularity
    • ILI: {break} “He broke the glass”; {break; cause to break}; {break; damage} “inflict damage upon”
    • Wordnets: breken (Dutch), romper (Spanish), rompere (Italian)
  • 2. superset of concepts that occur across languages
    • ILI {cashier}: eq_hyperonym for caissière (Dutch) and cajera (Spanish)
    • ILI {female cashier}: eq_synonym for caissière (Dutch) and cajera (Spanish)
A Minimal and Efficient set of concepts
  • Globalizing the sense-differentiation:
    • create metonymic clusters
    • abstract from contextual specialization and grammatical perspectives
    • abstract from part-of-speech realization
    • abstract from productive and predictable meanings
  • Extending the Inter-Lingual-Index to become the superset of concepts occurring in two or more wordnets only if:
    • concepts are unpredictable and unproductive
    • concepts cannot be linked exhaustively and uniquely to the ILI
Under-specified concepts: Metonymic clusters

[Figure: the ILI cluster "club" is connected by metonym# links to the ILI-records "club: organization" and "club: building". The synsets {vereniging}NL, {club}EN and {club; verenigingsgebouw}NL are attached to these records by eq_synonym and eq_metonym relations.]


Under-specified concepts: Generalization and Diathesis clusters

[Figure: the ILI cluster "break" is connected by diathesis# links to the ILI-records "break: inchoative" and "break: causative". The synsets {breken; kapotgaan}NL, {breken; kapotmaken}NL, {rompere}IT and {rompersi}IT are attached to these records by eq_synonym and eq_diathesis relations.]


Under-specified for POS

[Figure: the ILI-records "depart" and "departure" are connected by xpos# links. The synsets {vertrekkenV}NL, {departV}EN, {departureN}EN and {vertrekN}NL are attached to these records by eq_synonym and eq_xpos_synonym relations.]


Overview of equivalence relations to the ILI

Relation            POS             Sources : Targets    Example
eq_synonym          same            1 : 1                auto : voiture car
eq_near_synonym     any             many : many          apparaat, machine, toestel : apparatus, machine, device
eq_hyperonym        same            many : 1 (usually)   citroenjenever : gin
eq_hyponym          same (usually)  1 : many             dedo : toe, finger
eq_metonymy         same            many/1 : 1           universiteit, universiteitsgebouw : university
eq_diathesis        same            many/1 : 1           raken (cause), raken : hit
eq_generalization   same            many/1 : 1           schoonmaken : clean

Progress on restructuring the ILI

Clusters added manually and automatically based on:

  • structural properties of WN1.5
  • mapping to other sources: Levin’s classes, WN1.6
  • cross-lingual mapping

         clusters   words   word senses   synsets
Nouns    1703       1398    3205          2895
Verbs    2905       1799    5134          3839

New ILIs from other wordnets have not yet been added. We estimated that hardly any new ILIs are needed for verbs; for nouns, about 30% of the non-translated concepts would need one (2,000 synsets based on Dutch).

Effects of ILI-clusters

Intersection of ILI-references for Dutch, Spanish, Italian and English

Nouns 2895 clustered synsets (4,6% of 62780 WN1.5 noun synsets)

intersection increased from 7736 (23,8%) to 8183 (25,2%) out of the union of 32520 synsets

Verbs 3839 clustered synsets (31,4% of 12215 WN1.5 verb synsets)

intersection increased from 1632 (21,9%) to 3051 (40,9%) out of the union of 7455 synsets
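
The percentages on this slide can be re-derived from the raw counts. The snippet below is only an arithmetic check; the helper `share` and the constant names are invented for the example.

```python
def share(part, whole):
    """Percentage of `part` in `whole`, rounded to one decimal."""
    return round(100.0 * part / whole, 1)


WN15_NOUNS, WN15_VERBS = 62780, 12215    # WN1.5 synsets per part of speech
UNION_NOUNS, UNION_VERBS = 32520, 7455   # union of synsets across the wordnets

clustered = (share(2895, WN15_NOUNS), share(3839, WN15_VERBS))
nouns_before, nouns_after = share(7736, UNION_NOUNS), share(8183, UNION_NOUNS)
verbs_before, verbs_after = share(1632, UNION_VERBS), share(3051, UNION_VERBS)

# Gain in percentage points, computed from the raw counts to avoid rounding drift.
noun_gain = share(8183 - 7736, UNION_NOUNS)
verb_gain = share(3051 - 1632, UNION_VERBS)

print(clustered, nouns_before, nouns_after, verbs_before, verbs_after)
print(noun_gain, verb_gain)
```

The check makes the contrast explicit: clustering raises the noun intersection by about 1.4 percentage points and the verb intersection by about 19.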

Superset of all concepts.
  • Procedure:
    • Initially, the ILI will only contain WordNet1.5 synsets.
    • A site that cannot find a proper equivalent among the available ILI-concepts will link the meaning to another ILI-record using a so-called complex-equivalence relation and will generate a potential new ILI-record:
      • Dutch meaning: klunen; Definition: to walk on skates; Complex-equivalence: has_eq_hyperonym; Target concept: walk
    • After a building phase, all potentially new ILI-records are collected and verified for overlap by one site.
    • A proposal for updating the ILI is distributed to all sites and has to be verified.
    • The ILI is updated and all sites have to reconsider the equivalence relations for all meanings that can potentially be linked to the new ILI-records.
Filling gaps in the ILI

Types of GAPS

  • genuine, cultural gaps for things not known in English culture, e.g. citroenjenever, which is a kind of gin made out of lemon skin,
    • Non-productive
    • Non-compositional
  • pragmatic, in the sense that the concept is known but is not expressed by a single lexicalized form in English, e.g.: container, borrower, cajera (female cashier)
    • Productive
    • Compositional
  • Universality of gaps: Concepts occurring in at least 2 languages
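
The universality criterion (a gap must occur in at least two languages) can be sketched as a simple filter over the proposals collected from the sites. Everything here is hypothetical: the function name, the `(language, concept)` tuple format and the placeholder concept labels are assumptions, and matching proposals across languages by a shared label is a simplification of the manual verification step.

```python
from collections import defaultdict


def universal_gaps(proposals, min_languages=2):
    """Keep candidate ILI-records proposed for at least `min_languages` languages."""
    languages_per_concept = defaultdict(set)
    for language, concept in proposals:
        languages_per_concept[concept].add(language)
    return sorted(
        concept
        for concept, languages in languages_per_concept.items()
        if len(languages) >= min_languages
    )


# Placeholder proposals: (language code, label of the candidate concept).
proposals = [
    ("NL", "gap-1"), ("ES", "gap-1"),
    ("NL", "gap-2"),
    ("IT", "gap-3"), ("IT", "gap-3"),  # same language twice still counts once
]
accepted = universal_gaps(proposals)
print(accepted)
```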
Productive and Predictable Lexicalizations exhaustively linked to the ILI

[Figure: productive lexicalizations linked to several existing ILI-records at once. ILI-records shown: beat, kill, stamp, kick, cashier, female, fish, young. Synsets shown: {doodslaanV}NL, {totschlagenV}DE, {doodstampenV}NL, {tottrampelnV}DE, {doodschoppenV}NL, {caissière}NL, {cajeraN}ES, {alevínN}ES. Links are labelled eq_has_hyperonym and eq_in_state.]


Towards an efficient, condensed and universal index of sense-distinctions

[Figure labels: WordNet1.5; Metonymy/Generalization clusters; Universal core meanings; POS-independent; Non-predictable; 90,000 concepts; Productive derivations and compounds linked exhaustively; Universal systematic polysemy and level of granularity; Language and domain specific lexicalizations that do not occur in a large variety of languages; Language specific realizations in grammatical forms.]
The EuroWordNet database

1.) The actual wordnets in Flaim database format: an indexing and compression format of Novell.

2.) Polaris (Louw 1997): Re-implementation of the Novell ConceptNet toolkit (Díez-Orzas et al 1995) adapted to the EuroWordNet architecture.

  • import and export wordnets or wordnet selections from/to ASCII files.
  • resolve links for imported concepts.
  • edit and add concepts, variants and relations in the wordnets.
  • access the ILI and ontologies and switch between the wordnets and ontologies via the ILI.
  • extract, import and export clusters of senses based on relations.
  • project synsets or clusters from one wordnet to another wordnet
  • compare clusters of synsets.
  • import new or adapted ILI-records.
  • update ILI-references to updated ILI.

3.) Periscope (Cuypers and Adriaens 1997): a graphical interface for viewing the EuroWordNet database.

Global Wordnet Association
http://www.globalwordnet.org
  • provide a standardized framework to link, compare and build complete wordnets for all the European languages and dialects.
  • initialize the development of wordnets in non-European languages
  • develop more specific definitions, tests and procedures for evaluating and developing wordnets.
  • extend the specification of EuroWordNet to lexical units which are not yet covered (adjectives/adverbs, lexicalized phrases and multi-words).
  • develop (axiomatized) ontologies for Domains and World-Knowledge that can be shared by all languages via the ILI.
  • develop an efficient ILI for linking, sharing, consistency checking and cross-language technology applications. This ILI could function as a gold-standard of sense-distinctions.
  • organize an (annual/bi-annual) workshop or conference.
2nd Global Wordnet Conference
  • Location: Masaryk University, Brno (Czech Republic),
  • January 20 - 23, 2004.
  • http://www.fi.muni.cz/gwc2004/
Other wordnet initiatives
  • Welsh
  • Basque, Catalan
  • Chinese
  • BalkaNet
  • IndoWordnet
  • Meaning
  • Danish
  • Norwegian
  • Swedish
  • Portuguese
  • Arabic
  • Korean
  • Russian
BalkaNet
  • Funded by the European Union as project IST-2000-29388.
  • 3-year project: 2001 - 2004
  • Follows a strict EuroWordNet approach:
    • Expanded set of base concepts
    • Top-down building approach
  • EWN database extended with:
    • Greek, Romanian, Serbian, Turkish, Bulgarian, Czech
  • Development of new wordnet database system: VisDic
  • http://www.ceid.upatras.gr/Balkanet/.
IndoWordnet
  • Current Wordnet development in India:
    • Hindi and Marathi at IIT Bombay,
    • Tamil at Anna University-K.B Chandrashekhar Research Centre (AU-KBC) Chennai and Tamil University Tanjavur,
    • Gujarati at MS University Baroda, Oriya at Utkal University Bhubaneswar and Bengali at IIT Kharagpur.
  • The Hindi WordNet is at an advanced stage of development with about 11000 semantically linked synsets and with associated software and user interface.
IndoWordnet
  • By the end of 2003, a WordNet of 5000 synsets will be created for each Indian language, covering about the 2000 most frequent content words in that language. The frequency-sorted wordlist available from the CIIL will be used.
  • Language specific WordNets developed by the following institutions:
    • CIIL, Mysore: Kannada, Kashmiri, Punjabi, Urdu, Himachali, Malayalam.
    • IIT Bombay: Hindi, Marathi and Konkani
    • AU-KBC Chennai and Tamil University Tanjavur: Tamil and Malayalam
    • University of Hyderabad: Telugu
    • University of Baroda: Gujarati
    • Utkal University Bhubaneswar: Oriya
    • IIT Kharagpur: Bengali
  • Research groups still have to be identified for building the WordNets of Assamese, Nepali and the languages of the North East.

Meaning

Developing Multilingual Web-scale Language Technologies

http://www.lsi.upc.es/~nlp/meaning/

Meaning Objectives
  • Funded by the European Union as project IST-2001-34460
  • 3-year project: April 2002 - April 2005
  • Large-scale (Lexical) Knowledge Bases
    • Automatic enrichment of EWN
    • Mixed approach (KB + ML)
    • Applied to Q/A, CLIR
  • Problem
    • structural and lexical ambiguity
Meaning Approach
  • Automatic collection of sense examples (Leacock et al. 98, Mihalcea and Moldovan 99)
  • Large-scale WSD (Boosting, SVM, transductive methods)
  • Large-scale Knowledge Acquisition (McCarthy 01, Agirre & Martinez 02)
Meaning Architecture

[Figure: a Multilingual Central Repository surrounded by the English, Italian, Spanish, Basque and Catalan EWNs, each shown together with a web corpus for the same language. The connecting arrows are labelled WSD, ACQ, UPLOAD and PORT.]



Meaning

WP6: Word Sense Disambiguation

  • A combination of unsupervised Knowledge-based and supervised Machine Learning techniques that will provide a high-precision system that is able to tag running text with word senses
  • A system that acquires a huge number of examples per word from the web
  • The use of sophisticated linguistic information, such as syntactic relations, semantic classes, selectional restrictions, subcategorization information, domain, etc.
  • Efficient margin-based Machine Learning algorithms.
  • Novel algorithms that combine tagged examples with huge amounts of untagged examples in order to increase the precision of the system.