Where the Web Went Wrong

http://gate.ac.uk/ http://nlp.shef.ac.uk/

Hamish Cunningham

Dept. Computer Science, University of Sheffield

Graz, May 2004


Contents

The Web, presentation, and syndication

A Semantic Web for eCulture

annoy half the audience

annoy the other half

eCulture, metadata and human language

motivation

Information Extraction: quantified language computing

MUMIS, GATE, ...

Cultural memory is not a luxury



Syndication and Mediation

The web promotes diversity, but also fragmentation

Original web: separate content and presentation (“this is a header”, not “set in 20 point bold font”)

Now: many incompatible/inaccessible interfaces

Memory Institutions (museums, libraries, archives) need to:

pool their impact: syndication in networked communities

support repurposable content

Therefore data must be presentation independent

Candidate technologies: DC, CIDOC, XML, RSS, RDF, OWL (“semantic web”)...
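
To make the point about presentation-independent data concrete, here is a minimal Python sketch. The record and its field values are hypothetical placeholders, and the function names are illustrative only; the one real-world detail is the Dublin Core element namespace. The same record is serialised once as XML for syndication and once as plain text for display, without the data itself carrying any layout.

```python
import xml.etree.ElementTree as ET

# Dublin Core element set namespace (a published, stable identifier).
DC_NS = "http://purl.org/dc/elements/1.1/"

# Hypothetical catalogue record: content only, no fonts or layout.
RECORD = {
    "title": "Example object",
    "creator": "Example institution",
    "date": "2004",
}


def to_dublin_core(rec):
    """Serialise the record as XML using Dublin Core element names."""
    ET.register_namespace("dc", DC_NS)
    root = ET.Element("record")
    for field, value in rec.items():
        child = ET.SubElement(root, f"{{{DC_NS}}}{field}")
        child.text = value
    return ET.tostring(root, encoding="unicode")


def to_plain_text(rec):
    """Render the same record for a text-only interface."""
    return "\n".join(f"{field.capitalize()}: {value}" for field, value in rec.items())


if __name__ == "__main__":
    print(to_dublin_core(RECORD))
    print(to_plain_text(RECORD))
```

Either rendering can change, or a third can be added, without touching the record.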



Semantic Web (1)

Memory Institutions (museums, libraries, archives) host massively diverse content

“Fortunately, the differences are primarily at the level of data structure and syntax. Significant conceptual overlaps exist between the descriptive schema used by memory institutions; elemental concepts such as objects, people, places, events, and the interrelationships between them are almost universal.” (Building semantic bridges between museums, libraries and archives: The CIDOC Conceptual Reference Model, T. Gill, April 2004)

Therefore we can add a semantic metadata layer to provide generalised inter-institution resource location

Syndication and mediation for free!



Semantic Web (2): good news and bad news

The good news: SW focus of AI and metadata work

The bad news: AI always fails

How does the machine tell the difference between “Mother Teresa is a saint” and “Tony Blair is a saint”? (Or, who tells Google which statement is important?)

Other web users do, by linking (also cf. Amazon)

Two solutions to the AI problem:

allow curators and users to build their own (simple specific models can succeed, but the cost may be too high)

use recommender systems to make the user a curator’s assistant (researchers and students may barter for access)

Any route to searchable content!



IT context: the Knowledge Economy and Human Language

Gartner, December 2002:

taxonomic and hierarchical knowledge mapping and indexing will be prevalent in almost all information-rich applications

through 2012 more than 95% of human-to-computer information input will involve textual language

A contradiction:

to deal with the information deluge we need formal knowledge in semantics-based systems

our archived history is in informal and ambiguous natural language

The challenge: to reconcile these two phenomena



HLT: Closing the Loop

[Diagram. Nodes: Human Language; Controlled Language; Formal Knowledge (ontologies and instance bases); Semantic Web; Semantic Grid; Semantic Web Services. Link labels: (M)NLG, OIE, (A)IE, CLIE.]

KEY

MNLG: Multilingual Natural Language Generation

OIE: Ontology-aware Information Extraction

AIE: Adaptive IE

CLIE: Controlled Language IE


Information Extraction

Information Extraction (IE) pulls facts and structured information from the content of large text collections.

Contrast IE and Information Retrieval

NLP history: from NLU to IE

Progress driven by quantitative measures

MUC: Message Understanding Conferences

ACE: Automatic Content Extraction



IE Example

“The shiny red rocket was fired on Tuesday. It is the brainchild of Dr. Big Head. Dr. Head is a staff scientist at We Build Rockets Inc.”

  • NE (Named Entity recognition): "rocket", "Tuesday", "Dr. Head", "We Build Rockets"

  • CO (Coreference resolution): "it" = rocket; "Dr. Head" = "Dr. Big Head"

  • TE (Template Element construction): the rocket is "shiny red" and Head's "brainchild".

  • TR (Template Relation construction): Dr. Head works for We Build Rockets Inc.

  • ST (Scenario Template production): rocket launch event with various participants
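
The five outputs listed for the rocket text can be pictured as plain data structures. The following Python sketch is only an illustration: the class, the field names, the `kind` labels and the `resolve` helper are assumptions made for this example, not the format used by MUC or by any particular IE system.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    name: str                                          # NE: the entity itself
    kind: str                                          # assumed type label
    aliases: list = field(default_factory=list)        # CO: coreferring mentions
    descriptors: list = field(default_factory=list)    # TE: descriptive attributes


ENTITIES = [
    Entity("rocket", "artefact", aliases=["it"],
           descriptors=["shiny red", "brainchild of Dr. Big Head"]),
    Entity("Dr. Big Head", "person", aliases=["Dr. Head"]),
    Entity("We Build Rockets Inc.", "organisation", aliases=["We Build Rockets"]),
    Entity("Tuesday", "date"),
]

# TR: relations between entities (relation name is illustrative).
RELATIONS = [("Dr. Big Head", "employee_of", "We Build Rockets Inc.")]

# ST: one event template tying the participants together.
SCENARIO = {
    "event": "rocket launch",
    "date": "Tuesday",
    "participants": ["rocket", "Dr. Big Head", "We Build Rockets Inc."],
}


def resolve(mention):
    """Map a mention to its entity via the name or a coreference alias."""
    wanted = mention.lower()
    for entity in ENTITIES:
        if wanted == entity.name.lower() or wanted in (a.lower() for a in entity.aliases):
            return entity
    return None


if __name__ == "__main__":
    print(resolve("Dr. Head"))
    print(resolve("it"))
```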


Performance levels

(Extensive quantitative evaluation since early ’90s; mainly on text, ASR; now also video OCR)

Vary according to text type, domain, scenario, language

NE: up to 97% (tested in English, Spanish, Japanese, Chinese, others)

CO: 60-70% resolution

TE: 80%

TR: 75-80%

ST: 60% (but: human level may be only 80%)
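
The slide does not say which measure lies behind these percentages. MUC-style evaluations commonly report precision, recall and their harmonic mean (F-measure), so the sketch below shows that calculation under the assumption that system output and a human-annotated gold standard are both sets of (start, end, type) spans. The spans in the example are made up for illustration.

```python
def score(gold, predicted):
    """Return (precision, recall, f1) for two collections of annotations."""
    gold, predicted = set(gold), set(predicted)
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    # Hypothetical spans: three of the four system answers match the gold standard.
    gold = {(0, 12, "Person"), (17, 28, "Person"), (40, 46, "Location"), (50, 57, "Date")}
    system = {(0, 12, "Person"), (17, 28, "Person"), (40, 46, "Location"), (60, 66, "Date")}
    print(score(gold, system))
```

This exact-match version gives no credit for partially overlapping spans; real evaluations often add a partial-match category.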



Ontology-based IE

“XYZ was established on 03 November 1978 in London. It opened a plant in Bulgaria in …”

[Diagram: Ontology & KB. Nodes: Location, Company, City, Country, XYZ, London, UK, Bulgaria, “03/11/1978”. Edge labels: type, partOf, HQ, establOn.]
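
One way to read the ontology and knowledge-base diagram is as a set of subject-predicate-object triples. The Python sketch below assumes which node each labelled edge connects (the transcript preserves only the labels), and the helper functions are hypothetical names for this example.

```python
# Instance-level triples, as one plausible reading of the diagram (an assumption).
TRIPLES = [
    ("XYZ", "type", "Company"),
    ("London", "type", "City"),
    ("UK", "type", "Country"),
    ("Bulgaria", "type", "Country"),
    ("XYZ", "HQ", "London"),
    ("London", "partOf", "UK"),
    ("XYZ", "establOn", "03/11/1978"),
]


def objects(subject, predicate):
    """All objects o such that (subject, predicate, o) is in the knowledge base."""
    return [o for s, p, o in TRIPLES if s == subject and p == predicate]


def hq_country(company):
    """Follow HQ, then partOf, to find the country of a company's headquarters."""
    for city in objects(company, "HQ"):
        for place in objects(city, "partOf"):
            if "Country" in objects(place, "type"):
                return place
    return None


if __name__ == "__main__":
    print(objects("XYZ", "establOn"))
    print(hq_country("XYZ"))
```

The second function answers a question the source text never states directly (which country XYZ is headquartered in), which is the kind of inference an ontology makes possible.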


Classes, instances & metadata

[Diagram labels: Entity, Person, Job-title, president, minister, chancellor, G.Brown]

“Gordon Brown met George Bush during his two day visit.”

<metadata>
  <DOC-ID>http://… 1.html</DOC-ID>
  <Annotation>
    <s_offset> 0 </s_offset>
    <e_offset> 12 </e_offset>
    <string>Gordon Brown</string>
    <class>…#Person</class>
    <inst>…#Person12345</inst>
  </Annotation>
  <Annotation>
    <s_offset> 18 </s_offset>
    <e_offset> 32 </e_offset>
    <string>George Bush</string>
    <class>…#Person</class>
    <inst>…#Person67890</inst>
  </Annotation>
</metadata>
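
As a rough illustration of how such stand-off metadata might be consumed, the Python sketch below parses a copy of the XML from the slide with the standard library and returns one dictionary per annotation. The elided values (`…`) are kept exactly as they appear, and the function name and dictionary keys are assumptions for this example.

```python
import xml.etree.ElementTree as ET

# Copy of the slide's metadata; elided parts ("…") are left as in the source.
METADATA = """<metadata>
  <DOC-ID>http://… 1.html</DOC-ID>
  <Annotation>
    <s_offset> 0 </s_offset>
    <e_offset> 12 </e_offset>
    <string>Gordon Brown</string>
    <class>…#Person</class>
    <inst>…#Person12345</inst>
  </Annotation>
  <Annotation>
    <s_offset> 18 </s_offset>
    <e_offset> 32 </e_offset>
    <string>George Bush</string>
    <class>…#Person</class>
    <inst>…#Person67890</inst>
  </Annotation>
</metadata>"""


def read_annotations(xml_text):
    """Return the annotations as dictionaries with integer character offsets."""
    root = ET.fromstring(xml_text)
    annotations = []
    for node in root.iter("Annotation"):
        annotations.append({
            "start": int(node.findtext("s_offset")),   # int() tolerates the padding spaces
            "end": int(node.findtext("e_offset")),
            "string": node.findtext("string"),
            "class": node.findtext("class"),
            "instance": node.findtext("inst"),
        })
    return annotations


if __name__ == "__main__":
    for annotation in read_annotations(METADATA):
        print(annotation)
```

Because the annotations point into the document by offset rather than wrapping the text in tags, several independent annotation layers can refer to the same unmodified source.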

[Diagram panels: “Classes+instances before” and “Classes+instances after”, with the additional label “Bush”]


An example: the MUMIS project

Multimedia Indexing and Searching Environment

Composite index of a multimedia programme from multiple sources in different languages

ASR, video processing, Information Extraction (Dutch, English, German), merging, user interface

University of Twente/CTIT, University of Sheffield, University of Nijmegen, DFKI, MPI, ESTEAM AB, VDA

An important experimental result: multiple sources for same events can improve extraction quality

PrestoSpace applications in news and sports archiving
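
The experimental result about multiple sources can be illustrated with a simple majority vote over the attributes that different sources report for the same event. This Python sketch is not the MUMIS merging method, which the slides do not describe; it is a stand-in technique, and the three source records are hypothetical.

```python
from collections import Counter


def merge_events(extractions):
    """Merge per-source descriptions of one event by majority vote per attribute.

    extractions: list of dicts, one per source; missing values may be None.
    """
    merged = {}
    keys = {key for extraction in extractions for key in extraction}
    for key in sorted(keys):
        votes = Counter(
            extraction[key] for extraction in extractions
            if extraction.get(key) is not None
        )
        if votes:
            merged[key] = votes.most_common(1)[0][0]
    return merged


if __name__ == "__main__":
    # Hypothetical output of IE over three reports of the same match event.
    sources = [
        {"type": "goal", "minute": 20, "scorer": "Beckham"},
        {"type": "goal", "minute": 20, "scorer": None},      # scorer not extracted
        {"type": "goal", "minute": 21, "scorer": "Beckham"},  # minute misrecognised
    ]
    print(merge_events(sources))
```

Each source on its own is incomplete or slightly wrong; combined, the gaps and the outlier are voted away.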



Semantic Query

Not “goal Beckham”

(includes e.g. missed goals, or “this was not a goal”)

Instead: “goal events with scorer David Beckham”
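
The difference between the two kinds of query can be sketched in a few lines of Python. The sentences, the event records and both function names are hypothetical; the structured records stand for what an IE system would have extracted from the commentary.

```python
# Hypothetical commentary sentences and the events extracted from them.
SENTENCES = [
    "Beckham scores a goal from the free kick.",
    "Beckham shoots wide; this was not a goal.",
]

EVENTS = [
    {"type": "goal", "scorer": "David Beckham"},
    {"type": "missed_shot", "player": "David Beckham"},
]


def keyword_search(sentences, *terms):
    """Return sentences containing every term, ignoring case."""
    return [s for s in sentences if all(t.lower() in s.lower() for t in terms)]


def semantic_query(events, **constraints):
    """Return events whose attributes match every constraint exactly."""
    return [e for e in events if all(e.get(k) == v for k, v in constraints.items())]


if __name__ == "__main__":
    # The keyword query also matches the sentence about the missed shot.
    print(keyword_search(SENTENCES, "goal", "Beckham"))
    # The structured query returns only the goal event.
    print(semantic_query(EVENTS, type="goal", scorer="David Beckham"))
```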



The results: England win!


GATE, a General Architecture for Text Engineering is...

An architecture: a macro-level organisational picture for LE software systems.

A framework: for programmers, GATE is an object-oriented class library that implements the architecture.

A development environment: for language engineers, a graphical development environment.

GATE comes with...

Free components, and wrappers for other people’s stuff

Tools for: evaluation; visualise/edit; persistence; IR; IE; dialogue; ontologies; etc.

Free software (LGPL) at http://gate.ac.uk/download/

Used by thousands of people at hundreds of sites



A bit of a nuisance (GATE users)

GATE team projects. Past:

Conceptual indexing: MUMIS: automatic semantic indices for sports video

MUSE, cross-genre entity finder

HSL, Health-and-safety IE

Old Bailey: collaboration with HRI on 17th century court reports

Multiflora: plant taxonomy text analysis for biodiversity research e-science

ACE/ TIDES: Arabic, Chinese NE

JHU summer w/s on semtagging

EMILLE: S. Asian languages corpus

hTechSight: chemical eng. K. portal

Present:

Advanced Knowledge Technologies: €12m UK five site collaborative project

SEKT Semantic Knowledge Technology

PrestoSpace MM Preservation/Access

KnowledgeWeb Semantic Web

Future:

New eContent project LIRICS

Thousands of users at hundreds of sites. A representative sample:

the American National Corpus project

the Perseus Digital Library project, Tufts University, US

Longman Pearson publishing, UK

Merck KGaA, Germany

Canon Europe, UK

Knight Ridder, US

BBN (leading HLT research lab), US

SMEs inc. Sirma AI Ltd., Bulgaria

Stanford, Imperial College London, the University of Manchester, UMIST, the University of Karlsruhe, Vassar College, the University of Southern California and a large number of other UK, US and EU universities

UK and EU projects inc. MyGrid, CLEF, dotkom, AMITIES, Cub Reporter, EMILLE, Poesia...



GATE: infrastructure for semantic metadata extraction

Combines learning and rule-based methods (new work on mixed-initiative learning)

Allows combination of IE and IR

Enables use of large-scale linguistic resources for IE, such as WordNet

Supports ontologies as part of IE applications - Ontology-Based IE

Supports languages from Hindi to Chinese, Italian to German



PrestoSpace Semantics Architecture

[Architecture diagram. Recoverable labels: AV Signals; ASR, etc.; Signal md, Transcriptions; Text Sources; IE; EN; IT; ...; Annotations; Merging; Final Annotations; Ontology-Based Metadata; Multilingual Conceptual Q & A. Many boxes are labelled “Formal Text”.]


Memory is not a luxury

C21st: all the C20th mistakes but bigger & better?

If you don’t know where you’ve been, how can you know where you’re going?

Archives: ammunition in the war on ignorance

Ammunition is useless if you can’t find it: new technology must make our history accessible to all, for all our futures



Links

This talk:

http://gate.ac.uk/sale/talks/eculture-graz-may2004.ppt

Related projects:
