Automatic event extraction from text on the base of linguistic and semantic annotation

Thierry Declerck

DFKI – Language Technology Lab

JRC 2005/05/10


Events …

  • Involve entities and relations between them

  • Imply a change of state

    • Example: The striker of Liverpool shot a wonderful goal in the 87th minute.

      • 1 event (goal-shot)

      • 2 entities (person and team)

      • 1 change of state (the scoring)
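
The analysis of the example sentence can be written down as a small data structure. This is only a sketch: the class and field names (`Entity`, `Event`, `state_change`) are assumptions made for illustration and are not taken from the presentation.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Entity:
    name: str
    kind: str  # e.g. "person" or "team"


@dataclass
class Event:
    kind: str                                   # e.g. "goal-shot"
    entities: List[Entity] = field(default_factory=list)
    minute: Optional[int] = None
    state_change: Optional[str] = None          # e.g. "the scoring"


# The striker is not named in the sentence, so the person is identified by role.
goal = Event(
    kind="goal-shot",
    entities=[
        Entity("striker of Liverpool", "person"),
        Entity("Liverpool", "team"),
    ],
    minute=87,
    state_change="the scoring",
)
```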

Events in textual documents

  • Various types of text

    • Structured: Example and Example_2

      • Processing requires pattern-matching techniques; very little linguistic knowledge is needed

    • Semi-structured: Example

      • Requires a mixture of pattern matching and more linguistic knowledge

    • Unstructured: Example

      • Requires a mixture of layout analysis and linguistic knowledge

  • All types of text require a domain-specific knowledge base (ontology) for event extraction

Domain Knowledge

  • Domain knowledge can be organised in terminologies, thesauri, taxonomies or ontologies. Example of a (non-formal) multilingual ontology for the soccer domain.

  • More on ontology engineering in the talk by Borislav

Automatic Event Extraction from Text is

  • A combination of human language technology (HLT) and semantic web technologies (ontologies)

  • Also possible on the basis of purely statistical methods (with minimal linguistic knowledge), but we concentrate here on the HLT-based approach

What is Human Language Technology?

Linguistic Analysis

Language technology tools are needed to support the upgrade of the current web to the Semantic Web (SW) by providing an automatic analysis of the linguistic structure of textual documents. Free-text documents that undergo linguistic analysis become available as semi-structured documents, from which meaningful units can be extracted automatically (information extraction) and organized through clustering or classification (text mining). Here we focus on the following linguistic analysis steps that underlie the extraction tasks: tokenization, morphological analysis, part-of-speech tagging, chunking, dependency structure analysis, semantic tagging.

Tokenisation

Tokenisation deals with the detection of the word units in a text and with the detection of sentence boundaries.

The markets acknowledge the measures taken on the 24th of September by the CEO of XYZ Corp.
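
A rough sketch of both tasks, assuming a regular-expression tokenizer and a small hand-made abbreviation list (`ABBREVIATIONS` is illustrative, not a resource from the talk). The example sentence shows why the task is not trivial: the final period of "Corp." marks an abbreviation and ends the sentence at the same time.

```python
import re

# Abbreviations whose trailing period belongs to the token (illustrative list).
ABBREVIATIONS = {"Corp.", "Inc.", "Dr.", "Mr."}

# A word (optionally with inner hyphens/apostrophes and a trailing period),
# or any single non-space, non-word character.
TOKEN_RE = re.compile(r"\w+(?:[-']\w+)*\.?|[^\w\s]")


def tokenize(text):
    tokens = []
    for match in TOKEN_RE.finditer(text):
        tok = match.group()
        if tok.endswith(".") and len(tok) > 1 and tok not in ABBREVIATIONS:
            # Ordinary word followed by a sentence-final period: split it off.
            tokens.extend([tok[:-1], "."])
        else:
            tokens.append(tok)
    return tokens


def split_sentences(tokens):
    sentences, current = [], []
    for tok in tokens:
        current.append(tok)
        if tok in {".", "!", "?"}:
            sentences.append(current)
            current = []
    if current:  # text ended without a separate end mark, e.g. after "Corp."
        sentences.append(current)
    return sentences
```

With this sketch the abbreviation keeps its period, so "Corp." stays a single token and the sentence is closed only because the text ends there.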

Morphological Analysis

Morphological analysis is concerned with the inflectional, derivational, and compounding processes in word formation in order to determine properties such as stem and inflectional information. Together with part-of-speech (PoS) information this process delivers the morpho-syntactic properties of a word.

When processing the German word Häusern (houses), the analysis should deliver the following morphological information:

[PoS=N NUM=PL CASE=DAT GEN=NEUT STEM=HAUS]
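
A minimal sketch of how such an analysis could be handled in code, assuming a full-form lexicon lookup (`FULL_FORM_LEXICON` is a hypothetical one-entry stand-in; a real analyser would compute the features from stem and affixes). The bracketed notation is turned into attribute-value pairs.

```python
# Hypothetical full-form lexicon holding the analysis from the slide.
FULL_FORM_LEXICON = {
    "Häusern": "[PoS=N NUM=PL CASE=DAT GEN=NEUT STEM=HAUS]",
}


def parse_features(bracketed):
    """Turn '[PoS=N NUM=PL ...]' into a dict of attribute-value pairs."""
    body = bracketed.strip().strip("[]")
    return dict(pair.split("=", 1) for pair in body.split())


def analyse(word):
    """Return the feature dict for a known word form, or None if unknown."""
    entry = FULL_FORM_LEXICON.get(word)
    return parse_features(entry) if entry else None
```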

Part-of-Speech Tagging

Part-of-Speech (PoS) tagging is the process of determining the correct syntactic class (a part-of-speech, e.g. noun, verb, etc.) for a particular word given its current context. The word “works” in the following sentences will be either a verb or a noun:

He works[N,V] the whole day for nothing.

His works[N,V] have all been sold abroad.

PoS tagging involves disambiguation between multiple part-of-speech tags, as well as guessing the correct part-of-speech tag for unknown words on the basis of context information.
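
The disambiguation of "works" can be illustrated with a toy rule-based tagger. This is a sketch under strong assumptions: the lexicon, the tag names and the two context rules (noun after a determiner, verb after a pronoun) are invented for the two example sentences and are not the tagger used in the talk.

```python
# Toy lexicon: each word form maps to its possible tags.
LEXICON = {
    "he": {"PRON"}, "his": {"DET"}, "works": {"N", "V"}, "the": {"DET"},
    "whole": {"ADJ"}, "day": {"N"}, "for": {"PREP"}, "nothing": {"PRON"},
    "have": {"V"}, "all": {"PRON"}, "been": {"V"}, "sold": {"V"},
    "abroad": {"ADV"},
}


def tag(tokens):
    tags = []
    for tok in tokens:
        # Unknown words are guessed to be nouns (a common fallback).
        candidates = LEXICON.get(tok.lower(), {"N"})
        if len(candidates) == 1:
            tags.append(next(iter(candidates)))
            continue
        prev = tags[-1] if tags else None
        if prev == "DET" and "N" in candidates:
            tags.append("N")      # determiner + ambiguous word: noun reading
        elif prev == "PRON" and "V" in candidates:
            tags.append("V")      # pronoun + ambiguous word: verb reading
        else:
            tags.append(sorted(candidates)[0])
    return tags
```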

Chunking

Chunks are sequences of words which are grouped on the basis of linguistic properties, such as nominal, prepositional, adjectival and adverbial phrases and verb groups.

[NP His works] [VG have] [NP all] [VG been sold] [AdvP abroad].
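
The bracketing above can be reproduced with a simple pattern-based chunker over PoS tags. The sketch assumes the sentence is already tagged; the tag set, the one-letter codes and the three chunk patterns are illustrative choices.

```python
import re

# Each tag is mapped to one character so chunk patterns can be plain regexes.
TAG_CODES = {"DET": "D", "ADJ": "A", "N": "N", "PRON": "P", "V": "V", "ADV": "R"}

CHUNK_PATTERNS = [
    ("NP", r"D?A*N+|P"),   # optional determiner, adjectives, noun(s); or a pronoun
    ("VG", r"V+"),         # one or more verbs
    ("AdvP", r"R+"),       # one or more adverbs
]
CHUNK_RE = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in CHUNK_PATTERNS))


def chunk(tagged):
    """Group (word, tag) pairs into (chunk_type, words) pairs."""
    codes = "".join(TAG_CODES.get(tag, "x") for _, tag in tagged)
    chunks = []
    for m in CHUNK_RE.finditer(codes):
        words = [word for word, _ in tagged[m.start():m.end()]]
        chunks.append((m.lastgroup, words))
    return chunks


tagged_sentence = [
    ("His", "DET"), ("works", "N"), ("have", "V"), ("all", "PRON"),
    ("been", "V"), ("sold", "V"), ("abroad", "ADV"),
]
```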

Named Entities Detection

Related to chunking is the recognition of so-called named entities (names of institutions and companies, date expressions, etc.). The extraction of named entities is mostly based on a strategy that combines lookup in gazetteers (lists of companies, cities, etc.) with the definition of regular expression patterns. Named entity recognition can be included as part of the linguistic chunking procedure and the following sentence fragment:

“…the secretary-general of the United Nations, Kofi Annan,…”

will be annotated as a nominal phrase, including two named entities: United Nations with named entity class: organization, and Kofi Annan with named entity class: person.
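
The gazetteer-plus-pattern strategy can be sketched as follows. The two gazetteer entries and the date pattern are assumptions chosen to cover the examples in these slides; real gazetteers contain many thousands of entries.

```python
import re

# Tiny illustrative gazetteer: surface form -> named entity class.
GAZETTEER = {
    "United Nations": "organization",
    "Kofi Annan": "person",
}

# Regular expression for date expressions such as "24th of September".
DATE_RE = re.compile(
    r"\b\d{1,2}(?:st|nd|rd|th) of (?:January|February|March|April|May|June|"
    r"July|August|September|October|November|December)\b"
)


def find_named_entities(text):
    """Return (surface form, class) pairs in order of appearance."""
    found = []
    for name, entity_class in GAZETTEER.items():
        for m in re.finditer(re.escape(name), text):
            found.append((m.start(), m.group(), entity_class))
    for m in DATE_RE.finditer(text):
        found.append((m.start(), m.group(), "date"))
    return [(surface, cls) for _, surface, cls in sorted(found)]
```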

Dependency Structure Analysis

A dependency structure consists of two or more linguistic units, one of which immediately dominates the others in a syntax tree. The detection of such structures is generally not provided by chunking but builds on top of it.

There are two main types of dependencies that are relevant for our purposes: on the one hand, the internal dependency structure of phrasal units or chunks, and on the other hand, the so-called grammatical functions (like subject and direct object).

Internal Dependency Structure

To describe this structure, linguistic analysis uses the terms head, complements and modifiers, where the head is the dominating node in the syntax tree of a phrase (chunk), complements are necessary qualifiers thereof, and modifiers are optional qualifiers. Consider the following example:

“The shot by Christian Ziege goes over the goal.”

The prepositional phrase “by Christian Ziege” (containing the named entity Christian Ziege) depends on (and modifies) the head noun “shot”.

Grammatical Functions

Grammatical functions determine the role (function) of each linguistic chunk in the sentence and make it possible to identify the actors involved in certain events. For example, in the following sentence the syntactic (and also the semantic) subject is the NP constituent “The shot by Christian Ziege”:

“The shot by Christian Ziege goes over the goal.”

This nominal phrase depends on (and complements) the verb “goes”, whereas the noun “shot” is the head of the NP (it is the shot that goes over the goal, not Christian Ziege!).
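
The analysis of this sentence can be sketched as a small dependency tree. The `Node` class and the relation labels are assumptions for illustration; in particular, the label given to "over the goal" is a guess, since the slides do not classify it.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    text: str
    relation: str = "root"                       # relation to the governing node
    dependents: List["Node"] = field(default_factory=list)


# "The shot by Christian Ziege goes over the goal."
sentence = Node("goes", dependents=[
    Node("shot", "subject", [
        Node("The", "determiner"),
        Node("by Christian Ziege", "modifier"),  # PP modifying the head noun
    ]),
    Node("over the goal", "complement"),         # label assumed, see lead-in
])


def subject_head(verb):
    """Return the head word of the verb's subject, or None if there is none."""
    for dep in verb.dependents:
        if dep.relation == "subject":
            return dep.text
    return None
```

Asking for the subject head yields the noun "shot", not the named entity inside the modifying prepositional phrase, which is the distinction the slide stresses.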

Semantic Tagging

Automatic semantic annotation has developed within language technology in recent years in connection with more integrated tasks like information extraction, which require a certain level of semantic analysis. Semantic tagging consists in the annotation of each content word in a document with a semantic category. Semantic categories are assigned on the basis of a semantic resource such as WordNet for English, or EuroWordNet, which links words between many European languages through a common inter-lingua of concepts.

Semantic Resources

  • Semantic resources are captured in dictionaries, thesauri, and semantic networks, all of which express, either implicitly or explicitly, an ontology of the world in general or of more specific domains, such as medicine.

  • They can be roughly distinguished into the following three groups:

  • Thesauri: Semantic resources that group together similar words or terms according to a standard set of relations, including broader term, narrower term, sibling, etc. (like Roget)

  • Semantic Lexicons: Semantic resources that group together words (or more complex lexical items) according to lexical semantic relations like synonymy, hyponymy, meronymy, and antonymy (like WordNet)

  • Semantic Networks: Semantic resources that group together objects denoted by natural language expressions (terms) according to a set of relations that originate in the nature of the domain of application (like UMLS in the medical domain)

The MeSH Thesaurus

MeSH (Medical Subject Headings) is a thesaurus for indexing articles and books in the medical domain, which may then be used for searching MeSH-indexed databases. MeSH provides for each term a number of term variants that refer to the same concept. It currently includes a vocabulary of over 250,000 terms. The following is a sample entry for the term gene library (MH is the term itself, ENTRY are term variants):

MH = Gene Library

ENTRY = Bank, Gene

ENTRY = Banks, Gene

ENTRY = DNA Libraries

ENTRY = Gene Bank

etc.
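
A sketch of how the term variants could be used for indexing: every variant is mapped back to its main heading, so that different wordings in a text lead to the same concept. The parsing assumes the simple `KEY = value` layout shown above; the function names are hypothetical.

```python
def parse_mesh_entry(lines):
    """Split 'MH = ...' / 'ENTRY = ...' lines into (heading, variants)."""
    heading, variants = None, []
    for line in lines:
        key, _, value = line.partition("=")
        key, value = key.strip(), value.strip()
        if key == "MH":
            heading = value
        elif key == "ENTRY":
            variants.append(value)
    return heading, variants


def build_index(entries):
    """Map every lower-cased term or variant to its main heading."""
    index = {}
    for heading, variants in entries:
        for term in [heading, *variants]:
            index[term.lower()] = heading
    return index


gene_library = parse_mesh_entry([
    "MH = Gene Library",
    "ENTRY = Bank, Gene",
    "ENTRY = Banks, Gene",
    "ENTRY = DNA Libraries",
    "ENTRY = Gene Bank",
])
index = build_index([gene_library])
```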

The WordNet Semantic Lexicon

WordNet has primarily been designed as a computational account of the human capacity for linguistic categorization and covers an extensive set of semantic classes (called synsets). Synsets are collections of synonyms, grouping together lexical items according to meaning similarity. Strictly speaking, synsets are not made up of lexical items, but rather of lexical meanings (i.e. senses).

WordNet: An Example

The word 'tree' has two meanings that roughly correspond to the classes of plants and that of diagrams, each with their own hierarchy of classes that are included in more general super-classes:

09396070 tree 0

09395329 woody_plant 0 ligneous_plant 0

09378438 vascular_plant 0 tracheophyte 0

00008864 plant 0 flora 0 plant_life 0

00002086 life_form 0 organism 0 being 0 living_thing 0

00001740 entity 0 something 0

10025462 tree 0 tree_diagram 0

09987563 plane_figure 0 two-dimensional_figure 0

09987377 figure 0

00015185 shape 0 form 0

00018604 attribute 0

00013018 abstraction 0
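
The two hierarchies can be used for semantic tagging by checking which sense of a word falls under a given super-class. The sketch below hard-codes the two chains listed above (first lemma of each synset only); the sense labels "tree (plant)" and "tree (diagram)" are invented for readability, and no WordNet library is used.

```python
# Hypernym chains copied from the listing above, most specific class first.
HYPERNYM_CHAINS = {
    "tree (plant)": [
        "tree", "woody_plant", "vascular_plant", "plant", "life_form", "entity",
    ],
    "tree (diagram)": [
        "tree", "plane_figure", "figure", "shape", "attribute", "abstraction",
    ],
}


def senses_under(chains, super_class):
    """Return the senses whose hypernym chain contains the given super-class."""
    return [sense for sense, chain in chains.items() if super_class in chain]
```

A context that calls for a kind of plant would therefore select the first sense, while a context about shapes would select the second.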

What is the Semantic Web?

  • “The Semantic Web is a new initiative to transform the web into a structure that supports more intelligent querying and browsing, both by machines and by humans. This transformation is to be supported through the generation and use of metadata constructed via web annotation tools using user-defined ontologies that can be related to one another.”

    Somewhere on the web

[Figure: Semantic Web architecture. Components shown: End User, Community Portal, Agents, Ontology Articulation Toolkit, Ontology Construction Tool, Ontologies, Inference Engine, Web-Page Annotation Tool, Annotated Web Pages, Metadata Repository. Based on www.semanticweb.org]

Extracting Events from Structured Documents

  • Detecting Metadata in our Example:

    • Type of game: N/A

    • Teams involved: England - Deutschland

    • Players: Deutschland: Kahn (2) - Matthaeus (3) - Babbel (3,5),

    • Final (and intermediate) score: 1:0 (0:0)

    • Referee: Schiedsrichter: Collina, Pierluigi (Viareggio)

    • Date: N/A

    • Etc…

Extracting Events from Structured Documents (2)

  • Detecting Events in our Example:

    • Substitution: Eingewechselt: 61. Gerrard fuer Owen,

    • Goal: Tore: 1:0 Shearer (53., Kopfball, Vorarbeit Beckham)

    • Cards: Gelbe Karten: Beckham - Babbel, Jeremies
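
The three event types above can be picked out of the German match report with plain pattern matching. This is a sketch: the regular expressions are written for exactly the line formats shown here, and the dictionary keys (`player_in`, `assist`, etc.) are assumed names, not the annotation scheme of the talk.

```python
import re

SUBSTITUTION_RE = re.compile(
    r"(?P<minute>\d+)\.\s+(?P<player_in>\w+)\s+fuer\s+(?P<player_out>\w+)"
)
GOAL_RE = re.compile(
    r"(?P<score>\d+:\d+)\s+(?P<scorer>\w+)\s+"
    r"\((?P<minute>\d+)\.,\s*(?P<how>\w+),\s*Vorarbeit\s+(?P<assist>\w+)\)"
)


def extract_events(report):
    """Return one dict per event found in the labelled report lines."""
    events = []
    for line in report.splitlines():
        label, _, rest = line.partition(":")
        label = label.strip()
        if label == "Eingewechselt":          # substitutions
            for m in SUBSTITUTION_RE.finditer(rest):
                events.append({"type": "substitution", **m.groupdict()})
        elif label == "Tore":                 # goals
            for m in GOAL_RE.finditer(rest):
                events.append({"type": "goal", **m.groupdict()})
        elif label == "Gelbe Karten":         # yellow cards
            # The " - " between the two teams is ignored in this sketch.
            for player in re.findall(r"\w+", rest):
                events.append({"type": "yellow_card", "player": player})
    return events


report = (
    "Eingewechselt: 61. Gerrard fuer Owen,\n"
    "Tore: 1:0 Shearer (53., Kopfball, Vorarbeit Beckham)\n"
    "Gelbe Karten: Beckham - Babbel, Jeremies"
)
events = extract_events(report)
```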

Results in XML

  • Automatically extracted events (and entities and relations) from structured text, on the basis of patterns (DTD) of typical expressions and the soccer ontology. Example and Example_2

  • Since various results are available in XML files, those results can be merged automatically, guided by the ontology. Example. This supports incremental and dynamic extraction.
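
A much simplified sketch of such a merge, using Python's standard `xml.etree.ElementTree`. The element and attribute names (`match`, `event`, `type`, `minute`) are hypothetical, and events are matched by type and minute only; in the approach described in the talk the ontology decides which descriptions belong together.

```python
import xml.etree.ElementTree as ET


def merge_events(xml_a, xml_b):
    """Merge <event> elements that share the same type and minute."""
    merged = {}
    for doc in (xml_a, xml_b):
        for ev in ET.fromstring(doc).iter("event"):
            key = (ev.get("type", ""), ev.get("minute", ""))
            # Later sources add attributes to what is already known.
            merged.setdefault(key, {}).update(ev.attrib)
    root = ET.Element("match")
    for key in sorted(merged):
        ET.SubElement(root, "event", merged[key])
    return root


source_a = '<match><event type="goal" minute="53" scorer="Shearer"/></match>'
source_b = (
    '<match>'
    '<event type="goal" minute="53" assist="Beckham"/>'
    '<event type="substitution" minute="61" player_in="Gerrard"/>'
    '</match>'
)
merged_root = merge_events(source_a, source_b)
```

The goal reported by both sources ends up as one event that carries the scorer from the first file and the assist from the second, which is the incremental behaviour the slide refers to.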

Extracting Events from Semi-Structured Documents

  • Linguistic processing is needed to provide a basic structure of the document, which enables domain-specific annotation. Example.

Extracting Events from Semi-Structured Documents (2)

  • The results of the semantic annotation of the structured documents are used as well, supporting incremental extraction: Example.

Current Development

  • Extracting information from multilingual balance sheets (WINS eTen project), extending this to unstructured text and extracting relations and events from annexes to balance sheets (upcoming Project MUSING).

  • Detecting positive/negative mentions of entities in news documents (project Direct-Info on Media Monitoring). Example.

Further Challenge for HLT

  • Use HLT not only for the semantic annotation of web pages (or other documents), but also to support ontology extraction/learning from the web (or other documents)

Example of semantic relation extraction in bio-medicine

  • [Rheumatoid arthritis][is characterized][by progressive synovial inflammation and joint destruction][.]
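
Given the chunked sentence, a relation can be read off with a simple pattern over the chunks. This is a sketch for this one sentence pattern only; the relation name `characterized_by` and the splitting of the coordinated phrase at "and" are assumptions.

```python
def extract_relation(chunks):
    """Return (subject, relation, object) triples for 'X is characterized by Y and Z'."""
    if len(chunks) < 3:
        return []
    subject, predicate, obj = chunks[0], chunks[1], chunks[2]
    if predicate == "is characterized" and obj.startswith("by "):
        # Split the coordinated object into its conjuncts.
        features = [part.strip() for part in obj[len("by "):].split(" and ")]
        return [(subject, "characterized_by", feature) for feature in features]
    return []


chunks = [
    "Rheumatoid arthritis",
    "is characterized",
    "by progressive synovial inflammation and joint destruction",
    ".",
]
relations = extract_relation(chunks)
```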

Open issues for HLT and SW

  • Achieving better coordination to improve semantic annotation results

  • Development and use of standards for interrelated linguistic and semantic annotation (see eContent Project LIRICS for standards for language resources)

Interoperable Standards?

Thank you!
