
Using WordNet Predicates for Multilingual Named Entity Recognition

Matteo Negri and Bernardo Magnini

ITC-irst

Centro per la Ricerca Scientifica e Tecnologica, Trento - Italy

[negri,magnini]@itc.it

GWC’04 - Brno (Czech Republic), January 23 2004


Outline

  • Named Entity Recognition (NER)

  • Rule-based approach using WordNet information

    • WordNet Predicates (language independent)

    • Internal evidence: Word_Instances

    • External evidence: Word_Classes

  • System architecture

  • Experiments and results on English and Italian

  • Future work




Named Entity Recognition (NER)

  • Given a written text, identify and categorize:

    • Entity names (e.g. persons, organizations, location names)

    • Temporal expressions (e.g. dates and time)

    • Numerical expressions (e.g. monetary values and percentages)

  • NER is crucial for Information Extraction, Question Answering and Information Retrieval

    • Up to 10% of a newswire text may consist of proper names, dates, times, etc.


NER for Question Answering

Q1848: What was the name of the plane that dropped the Atomic Bomb on Hiroshima?

Tibbets piloted the Boeing B-29 Superfortress Enola Gay, which dropped the atomic bomb on Hiroshima on Aug. 6, 1945, causing an estimated 66,000 to 240,000 deaths. He named the plane after his mother, Enola Gay Tibbets.

Entity types marked in the passage: PERSON, DATE, LOCATION, OTHER


Named Entity Hierarchy

  • ENTITY

    • NAMEX: PERSON, ORGANIZATION, LOCATION

    • TIMEX: DATE, TIME, DURATION

    • MEASURE: MONEY, CARDINAL, PERCENT

    • OTHER


Motivations

  • Investigate how far we can go with NER using WordNet as the main source of semantic knowledge for one language

  • Isolate the language-independent knowledge relevant to the NER task

  • Experiment with a multilingual approach that takes advantage of aligned wordnets (e.g. English/Italian)


Knowledge-Based NER

  • Combination of a wide range of knowledge sources

    • lexical, syntactic, and semantic features of the input text

    • world knowledge (e.g. gazetteers)

    • discourse level information (e.g. co-reference resolution)


Rule-Based approach

  • Input (the slide numbers its tokens 1 2 3 4): Rome is the capital of Italy

  • Output: <LOCATION> Rome </LOCATION> is the capital of Italy
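The tagged output above can be produced from token spans. The following is a minimal sketch, assuming non-overlapping spans given as (start, end, tag) token offsets; the function name `render_tagged` and the span format are illustrative assumptions, not the authors' implementation.

```python
def render_tagged(tokens, spans):
    """Insert <TAG> ... </TAG> markers around token spans.

    spans: list of (start, end, tag) with end exclusive (hypothetical format).
    Assumes spans do not overlap.
    """
    opens = {start: tag for start, end, tag in spans}
    closes = {end: tag for start, end, tag in spans}
    out = []
    for i, token in enumerate(tokens):
        if i in opens:
            out.append(f"<{opens[i]}>")
        out.append(token)
        if i + 1 in closes:
            out.append(f"</{closes[i + 1]}>")
    return " ".join(out)


tokens = "Rome is the capital of Italy".split()
tagged = render_tagged(tokens, [(0, 1, "LOCATION")])
```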


WordNet Predicates (1) (WN-preds)

  • WN-preds are defined over a set of WordNet synsets which express a certain concept

  • Example: location-p is defined over the synsets Location#1, Solid_ground#1, Mandate#2, Geological_formation#1, Road#1 and Body_of_water#1, so it covers a word such as lake#1

  • Other predicates shown: person-p, measure-p


WordNet Predicates (2)

  • Input

    • A word w and a language L

  • Output

    • A boolean value (TRUE or FALSE)

    • TRUE if there exists at least one sense of w which is subsumed by at least one of the synsets defining the predicate

  • Example: location-p[<“lake”>,<English>] returns TRUE, because there exists a sense of “lake” (lake#1) which is subsumed by one of the synsets that define the predicate (i.e. body_of_water#1)
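A minimal sketch of how such a predicate could be evaluated. It assumes small toy tables in place of a real WordNet: `HYPERNYMS`, `SENSES`, and the helpers `is_subsumed` and `make_wn_pred` are hypothetical names introduced only for illustration, and the synset links shown are not taken from the actual resource.

```python
# Toy stand-in for WordNet (assumption: real synsets and links differ).
HYPERNYMS = {
    "lake#1": ["body_of_water#1"],
    "body_of_water#1": ["thing#1"],
    "astronomer#1": ["scientist#1"],
    "scientist#1": ["person#1"],
}
SENSES = {
    ("lake", "English"): ["lake#1"],
    ("astronomer", "English"): ["astronomer#1"],
}


def is_subsumed(synset, targets):
    """True if synset is one of targets or has an ancestor among them."""
    stack, seen = [synset], set()
    while stack:
        current = stack.pop()
        if current in targets:
            return True
        if current in seen:
            continue
        seen.add(current)
        stack.extend(HYPERNYMS.get(current, []))
    return False


def make_wn_pred(defining_synsets):
    """Build a predicate(word, language) from the synsets that define it."""
    targets = frozenset(defining_synsets)

    def pred(word, language):
        senses = SENSES.get((word, language), [])
        return any(is_subsumed(sense, targets) for sense in senses)

    return pred


location_p = make_wn_pred(["location#1", "body_of_water#1", "solid_ground#1"])
person_p = make_wn_pred(["person#1"])
```

With these toy tables, `location_p("lake", "English")` holds because lake#1 has body_of_water#1 among its ancestors, while a word with no listed senses never satisfies any predicate.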


WordNet Predicates (3)

  • WN-preds have been created for the following NE categories:

    • PERSON:person-name-p (person#1, spiritual-being#1)

      person-class-p (person#1, spiritual-being#1)

      first-name-p (person#1, spiritual-being#1)

      person-product-p (artifact#1)

    • LOCATION: location-name-p (location#1, road#1, mandate#1, body_of_water#1, solid_ground#1, geological_formation#1)

      location-class-p (location#1, road#1, mandate#1, body_of_water#1, solid_ground#1, geological_formation#1)

      movement-verb-p (locomote#1)

    • ORGANIZATION: org-name-p (organization#1)

      org-class-p (organization#1)

      org-representative-p (trainer#1, top_dog, spokesperson#1)

    • MEASURE: measure-unit-p (measure#1,

      number-p (digit#1, large_integer#1, common_fraction#1)

    • MONEY: money-p (monetary_unit#1,coin#1)

    • DATE: date-p (time_period#1)


WordNet Predicates (4)

  • The definition of a wordnet-predicate is language-independent.

  • With aligned wordnets, WN-preds can be easily parametrized with respect to a given language without changing the predicate definition

    • E.g. (Location-p lake English)

      (Location-p lago Italian)
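A minimal sketch of the language parametrization, assuming that an aligned wordnet maps the lemmas of each language onto shared synset identifiers. The tables `LEMMA_TO_SYNSETS` and `SHARED_HYPERNYMS` are toy, hypothetical data; only the lake/lago pair comes from the slide.

```python
# Assumption: lemmas of every language point to the same synset IDs.
SHARED_HYPERNYMS = {"lake#1": "body_of_water#1"}  # child -> parent (toy data)
LEMMA_TO_SYNSETS = {
    "English": {"lake": ["lake#1"]},
    "Italian": {"lago": ["lake#1"]},
}
# The predicate definition itself mentions no language.
LOCATION_SYNSETS = {"location#1", "body_of_water#1"}


def location_p(word, language):
    """Language is only used to look up the senses of the word."""
    for synset in LEMMA_TO_SYNSETS.get(language, {}).get(word, []):
        current = synset
        while current is not None:
            if current in LOCATION_SYNSETS:
                return True
            current = SHARED_HYPERNYMS.get(current)
    return False
```

The same `LOCATION_SYNSETS` set serves both languages; only the lemma lookup changes with the language argument.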


Knowledge-Based NER

  • Two kinds of information are usually distinguished in Named Entity Recognition (McDonald, 1996):

    • Internal Evidence: provided by the candidate string itself (e.g. Rome)

    • Drawbacks:

      • Size of reliable gazetteers

      • Maintenance (gazetteers are never “exhaustive”)

      • Overlap among the lists (“Washington”: person or location?)

      • Limited availability for languages other than English

    • External Evidence: provided by the context in which the string appears (e.g. capital)


Mining Evidence from WordNet

  • Both Internal Evidence (IE) and External Evidence (EE) can be mined from WordNet

    • Low coverage of Internal Evidence (e.g. person names)

    • High coverage of trigger words

  • Approach: distinguishing between Word_Instances (e.g. “Nile#1”) and Word_Classes (e.g. “river#1”)

  • Problem: in WordNet such a distinction is not explicit!


Word Classes and Word Instances I

[Figure: fragment of the WordNet hierarchy below “person”, showing the nodes intellectual, Italian, scientist, physicist, astronomer, Kepler and Galileo_Galilei]

  • In WordNet, the hyponyms of the synset “person#1” are a mixture of concepts (e.g. “astronomer”, “physicist”, etc.) and individuals (e.g. “Galileo Galilei”, “Kepler”, etc.)


Word Classes and Word Instances (1)

[Figure: fragment of the WordNet hierarchy below “person” (nodes intellectual, Italian, scientist, physicist, astronomer, Kepler and Galileo_Galilei), with one region labeled IE (Word_Instances) and another labeled EE (Word_Classes)]

  • NOTE: in WordNet, the hyponyms of the synset “person#1” are a mixture of concepts (e.g. “astronomer”, “physicist”, etc.) and individuals (e.g. “Galileo Galilei”, “Kepler”, etc.)


Word Classes and Word Instances (2)

  • Semi-automatic procedure to distinguish Word_Instances and Word_Classes in WordNet

  • 3 steps:

    • 1) collect all the hyponyms of several high-level synsets (e.g. “person#1”, “social_group#1”, “location#1”, “measure#1”, etc.)

    • 2) separate capitalized words from lower case words:

      capitalized words → Word_Instances

      lower case words → Word_Classes

    • 3) apply a manual filter, since capitalization alone is not reliable: “Italian” is capitalized but is not an Instance!
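Steps 2 and 3 of the procedure can be sketched as follows. This is an illustration under assumptions: the hyponym list of step 1 is taken as given, and the function name `split_instances_and_classes` and its `manual_exceptions` parameter are hypothetical stand-ins for the manual filtering described on the slide.

```python
def split_instances_and_classes(hyponym_lemmas, manual_exceptions=frozenset()):
    """Split lemmas by capitalization, then apply the manual filter.

    manual_exceptions: capitalized lemmas a human has marked as classes.
    """
    instances, classes = [], []
    for lemma in hyponym_lemmas:
        if lemma[:1].isupper() and lemma not in manual_exceptions:
            instances.append(lemma)
        else:
            classes.append(lemma)
    return instances, classes


# Lemmas taken from the person#1 example in the slides.
lemmas = ["astronomer", "Kepler", "Galileo_Galilei", "Italian", "physicist"]
instances, classes = split_instances_and_classes(
    lemmas, manual_exceptions={"Italian"}
)
```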


Distribution of Word Classes and Word Instances in MultiWordNet


System Architecture (NERD)

  • Preprocessing

    • tokenization

    • POS tagging

    • multiword recognition

  • Basic rules application

    • 400 language-specific basic rules, both for English and Italian, are applied to find and tag all the possible NEs present in the input text

  • Composition rules application

    • higher level language-independent rules for handling ambiguities between possible multiple tags and for co-reference resolution


Basic Rules I

  • English basic rule for capturing IE

    • Example: “Galileo invented the telescope”

  • NOTE: the WN-pred person-name-p is satisfied by any of the 1202 English Instances of the category PERSON


Basic Rules II

  • Italian basic rule for capturing IE

    • Example: “il telescopio fu inventato da Galileo”

  • NOTE: here, the WN-pred person-name-p is satisfied by any of the 1550 Instances (1202 for English + 348 for Italian) of the category PERSON


Basic Rules III

  • Basic rule for capturing EE (via trigger words)

    • Example: “Roma è la capitale italiana”

  • NOTE: the WN-pred location-p is satisfied by any of the 979 Italian Classes of the category LOCATION
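The rule shown on the slide is not preserved in this transcript, so the following is only a sketch of what a trigger-word rule might look like. It assumes a capitalized token is tagged LOCATION when a location class word appears within a small window to its right. The set `ITALIAN_LOCATION_CLASSES`, the window size, and the function names are all hypothetical.

```python
# Toy stand-in for the Italian LOCATION Classes (assumption, not the real list).
ITALIAN_LOCATION_CLASSES = {"capitale", "città", "fiume"}


def location_class_p(word, language):
    """Hypothetical predicate: is the word a location trigger word?"""
    return language == "Italian" and word.lower() in ITALIAN_LOCATION_CLASSES


def apply_trigger_rule(tokens, language, window=3):
    """Tag a capitalized token as LOCATION if a trigger word follows closely."""
    tags = [None] * len(tokens)
    for i, token in enumerate(tokens):
        if not token[:1].isupper():
            continue
        context = tokens[i + 1 : i + 1 + window]
        if any(location_class_p(word, language) for word in context):
            tags[i] = "LOCATION"
    return tags


tokens = "Roma è la capitale italiana".split()
tags = apply_trigger_rule(tokens, "Italian")
```

Here “Roma” receives the LOCATION tag because the trigger word “capitale” falls inside its three-token window.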


Basic Rules IV

  • Basic rule for capturing EE (via sentence structure)

    • Example: “Bowman, who was appointed by Reagan …”

  • NOTE: External Evidence can also be captured from the context in the absence of particular word senses
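A structural cue of this kind can be approximated with a pattern over the surface text. The sketch below assumes that a capitalized name followed by “, who” signals a PERSON; the regular expression and the function name are illustrative assumptions and do not reproduce the rule on the slide.

```python
import re

# Capitalized word sequence immediately followed by ", who" (assumed cue).
WHO_CLAUSE = re.compile(r"\b([A-Z][a-z]+(?: [A-Z][a-z]+)*), who\b")


def find_persons_by_who_clause(text):
    """Return the capitalized strings that precede a ', who' relative clause."""
    return [match.group(1) for match in WHO_CLAUSE.finditer(text)]


persons = find_persons_by_who_clause("Bowman, who was appointed by Reagan")
```

No word sense is consulted here: the relative pronoun alone provides the evidence.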


Composition Rules

  • Input: tagged text with all the possible Named Entities

  • Output: a tagged text, where:

    • overlaps and inclusions between tags are removed

    • co-references are resolved


Composition Rules II

  • Composition rule for handling tag inclusions

    • Example: “... 200 miles from New York...”

    • Tags involved: A = MEASURE, B = CARDINAL
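A minimal sketch of an inclusion rule, assuming that the MEASURE tag covers “200 miles” and the CARDINAL tag covers only “200”, and that the shorter, included tag is the one to drop. The span format (start, end, tag) with token offsets and the function name are hypothetical.

```python
def remove_included_tags(spans):
    """Drop every span that lies inside a strictly longer span."""
    kept = []
    for start, end, tag in spans:
        included = any(
            o_start <= start and end <= o_end and (o_end - o_start) > (end - start)
            for o_start, o_end, _ in spans
        )
        if not included:
            kept.append((start, end, tag))
    return kept


# Tokens: 200 miles from New York (offsets 0..5, end exclusive)
spans = [(0, 2, "MEASURE"), (0, 1, "CARDINAL"), (3, 5, "LOCATION")]
resolved = remove_included_tags(spans)
```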


Composition Rules III

  • Composition rule for co-reference resolution

    • Example: “…with Judge Pasco Bowman. Bowman was ...”
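One simple way to approximate this kind of co-reference is to let an untagged mention inherit the tag of an already tagged entity that contains it as a token. This is a sketch under that assumption; the function name `propagate_tags` and the data layout are hypothetical.

```python
def propagate_tags(tagged_entities, untagged_mentions):
    """Give each untagged mention the tag of the first entity containing it.

    tagged_entities: list of (text, tag) pairs already recognized.
    """
    resolved = {}
    for mention in untagged_mentions:
        for text, tag in tagged_entities:
            if mention in text.split():
                resolved[mention] = tag
                break
    return resolved


# "Pasco Bowman" is assumed to have been tagged PERSON via the trigger "Judge".
links = propagate_tags([("Pasco Bowman", "PERSON")], ["Bowman"])
```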


Experiment

  • DARPA/NIST HUB4 competition test corpora and scoring software

  • Categories: PERSON, LOCATION, ORGANIZATION

  • Reference tagged corpora

    • English: 365 Kb of newswire texts

    • Italian: 77 Kb of transcripts from two Italian broadcast news shows (~7000 words, 322 NEs)

  • F-measure, Precision and Recall computed comparing reference corpora with automatically tagged ones

    • type, content, and extension of each NE are considered
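The three measures can be illustrated with a simplified computation. This sketch counts an entity as correct only when its span and type both match the reference exactly, which is an assumption made for brevity: the official scoring software evaluates type, content and extent separately and is not reproduced here.

```python
def precision_recall_f1(reference, predicted):
    """Exact-match precision, recall and balanced F-measure over entity sets."""
    reference, predicted = set(reference), set(predicted)
    correct = len(reference & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(reference) if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Entities as (start, end, type); the values are made up for illustration.
reference = {(0, 1, "LOCATION"), (5, 6, "LOCATION"), (8, 10, "PERSON")}
predicted = {(0, 1, "LOCATION"), (8, 10, "ORGANIZATION")}
precision, recall, f1 = precision_recall_f1(reference, predicted)
```

In this made-up case one of two predictions is correct and one of three reference entities is found, so precision is 1/2, recall is 1/3 and the F-measure is 0.4.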


Results


Conclusion and Future Work

  • We presented an NE recognition system based on information represented in WordNet

  • Language-independent predicates for NE have been defined

  • Results on two languages show that the approach performs on par with state-of-the-art rule-based systems

  • The system has been successfully integrated in a QA system

  • Future work:

    • move to WN 2.0

    • integrate gazetteers

    • use SUMO concepts
