Where the Web Went Wrong
Hamish Cunningham
Dept. Computer Science, University of Sheffield
http://gate.ac.uk/ http://nlp.shef.ac.uk/
Graz, May 2004
The Web, presentation, and syndication
A Semantic Web for eCulture
annoy half the audience
annoy the other half
eCulture, metadata and human language
Information Extraction: quantified language computing
MUMIS, GATE, ...
Cultural memory is not a luxury
The web promotes diversity, but also fragmentation
Original web: separate content and presentation (“this is a header”, not “set in 20 point bold font”)
Now: many incompatible/inaccessible interfaces
Memory Institutions (museums, libraries, archives) need to:
pool their impact: syndication in networked communities
support repurposable content
Therefore data must be presentation independent
Candidate technologies: DC, CIDOC, XML, RSS, RDF, OWL (“semantic web”)...
Memory Institutions (museums, libraries, archives) host massively diverse content
“Fortunately, the differences are primarily at the level of data structure and syntax. Significant conceptual overlaps exist between the descriptive schema used by memory institutions; elemental concepts such as objects, people, places, events, and the interrelationships between them are almost universal.”
(Building semantic bridges between museums, libraries and archives: The CIDOC Conceptual Reference Model, T. Gill, April 2004)
Therefore we can add a semantic metadata layer to provide generalised inter-institution resource location
Syndication and mediation for free!
The good news: SW focus of AI and metadata work
The bad news: AI always fails
How does the machine tell the difference between “Mother Teresa is a saint” and “Tony Blair is a saint”? (Or, who tells Google which statement is important?)
Other web users do, by linking (also cf. Amazon)
Two solutions to the AI problem:
allow curators and users to build their own (simple specific models can succeed, but the cost may be too high)
use recommender systems to make the user a curator’s assistant (researchers and students may barter for access)
Any route to searchable content!
Gartner, December 2002:
taxonomic and hierarchical knowledge mapping and indexing will be prevalent in almost all information-rich applications
through 2012 more than 95% of human-to-computer information input will involve textual language
to deal with the information deluge we need formal knowledge in semantics-based systems
our archived history is in informal and ambiguous natural language
The challenge: to reconcile these two phenomena
HLT: Closing the Loop
MNLG: Multilingual Natural Language Generation
OIE: Ontology-aware Information Extraction
AIE: Adaptive IE
CLIE: Controlled Language IE
Formal Knowledge (ontologies and instance bases)
Information Extraction (IE) pulls facts and structured information from the content of large text collections.
Contrast IE and Information Retrieval
NLP history: from NLU to IE
Progress driven by quantitative measures
MUC: Message Understanding Conferences
ACE: Automatic Content Extraction
“The shiny red rocket was fired on Tuesday. It is the brainchild of Dr. Big Head. Dr. Head is a staff scientist at We Build Rockets Inc.”
ST: rocket launch event with various participants
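To make the idea of pulling structured facts out of this text concrete, here is a minimal sketch. It uses three toy regular expressions as stand-ins for the gazetteers, grammars and learned models a real IE system would use; the pattern set and the `extract` function are illustrative assumptions, not how GATE or any MUC system works.

```python
import re

text = (
    "The shiny red rocket was fired on Tuesday. "
    "It is the brainchild of Dr. Big Head. "
    "Dr. Head is a staff scientist at We Build Rockets Inc."
)

# Toy patterns (assumptions for illustration only).
PATTERNS = {
    # A title followed by one or more capitalised words.
    "Person": re.compile(r"Dr\.(?: [A-Z][a-z]+)+"),
    # One or more capitalised words followed by a company suffix.
    "Organization": re.compile(r"(?:[A-Z][a-z]+ )+Inc\."),
    # A weekday name.
    "Date": re.compile(r"\b(?:Mon|Tues|Wednes|Thurs|Fri|Satur|Sun)day\b"),
}


def extract(text):
    """Return (type, start, end, string) tuples sorted by start offset."""
    found = []
    for entity_type, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            found.append((entity_type, match.start(), match.end(), match.group()))
    return sorted(found, key=lambda item: item[1])


entities = extract(text)
for entity_type, start, end, string in entities:
    print(f"{entity_type:13} {start:4} {end:4} {string}")
```

Note that the sketch only finds the entity mentions. Deciding that “Dr. Big Head” and “Dr. Head” are the same person is the coreference step, and assembling the launch event is the scenario-template step.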
(Extensive quantitative evaluation since early ’90s; mainly on text, ASR; now also video OCR)
Vary according to text type, domain, scenario, language
NE: up to 97% (tested in English, Spanish, Japanese, Chinese, others)
CO: 60-70% resolution
ST: 60% (but: human level may be only 80%)
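Scores of this kind are typically derived from precision and recall against a human-annotated gold standard. The following is a minimal sketch, assuming annotations are compared as exact (start, end, type) triples. The `score` function and the sample annotations are made up for illustration and are not taken from MUC or ACE scoring software, which also handle partial matches.

```python
def score(gold, predicted):
    """Return (precision, recall, f1) for two collections of annotations.

    Each annotation is a (start, end, type) tuple; only exact matches count.
    """
    gold, predicted = set(gold), set(predicted)
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical gold-standard and system annotations (offsets are made up).
gold = {(0, 20, "Artifact"), (31, 38, "Date"), (65, 77, "Person"), (109, 128, "Organization")}
predicted = {(0, 20, "Artifact"), (31, 38, "Date"), (65, 73, "Person")}

precision, recall, f1 = score(gold, predicted)
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
```

Here the system finds two of the four gold annotations and proposes one with the wrong boundary, so it is penalised on both precision and recall.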
XYZ was established on 03 November 1978 in London. It opened a plant in Bulgaria in …
Ontology & KB
Classes, instances & metadata
“Gordon Brown met George Bush during his two day visit.”
<s_offset> 0 </s_offset>
<e_offset> 12 </e_offset>
<s_offset> 18 </s_offset>
<e_offset> 32 </e_offset>
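The offset pairs above illustrate stand-off annotation: the text is left untouched, and each annotation records where its span starts and ends. The following is a minimal sketch, assuming character offsets with an exclusive end. The `Annotation` class and `annotate` function are hypothetical names, not GATE's API, and the offsets printed are for the quoted sentence on its own rather than a reproduction of the values on the slide.

```python
from dataclasses import dataclass


@dataclass
class Annotation:
    start: int   # offset of the first character of the span
    end: int     # offset one past the last character of the span
    label: str   # semantic class, e.g. "Person"


def annotate(text, entities):
    """Locate each (surface string, label) pair in text and record its offsets."""
    annotations = []
    for surface, label in entities:
        start = text.find(surface)
        if start != -1:  # skip strings that do not occur in the text
            annotations.append(Annotation(start, start + len(surface), label))
    return annotations


text = "Gordon Brown met George Bush during his two day visit."
annotations = annotate(text, [("Gordon Brown", "Person"), ("George Bush", "Person")])
for a in annotations:
    print(a.start, a.end, a.label, repr(text[a.start:a.end]))
```

Because the annotations point into the text instead of being embedded in it, several overlapping layers (tokens, entities, links to ontology instances) can be attached to the same document.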
Multimedia Indexing and Searching Environment
Composite index of a multimedia programme from multiple sources in different languages
ASR, video processing, Information Extraction (Dutch, English, German), merging, user interface
University of Twente/CTIT, University of Sheffield, University of Nijmegen, DFKI, MPI, ESTEAM AB, VDA
An important experimental result: multiple sources for same events can improve extraction quality
PrestoSpace applications in news and sports archiving
Not “goal Beckham”
(includes e.g. missed goals, or “this was not a goal”)
Instead: “goal events with scorer David Beckham”
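The contrast can be sketched in a few lines. The transcript lines, the `Event` record and both query functions below are hypothetical, made up to illustrate the difference between matching keywords and querying extracted events; they are not MUMIS data or code.

```python
from dataclasses import dataclass


@dataclass
class Event:
    kind: str     # e.g. "goal" or "missed_goal"
    player: str   # normalised player name
    minute: int   # match minute of the event


# Hypothetical commentary lines and the events an IE system might extract from them.
transcript = [
    "Beckham shoots, but this was not a goal",
    "Goal! Beckham scores from the free kick",
    "Owen with the goal after a pass from Beckham",
]
events = [
    Event("missed_goal", "David Beckham", 12),
    Event("goal", "David Beckham", 34),
    Event("goal", "Michael Owen", 58),
]


def keyword_search(lines, *terms):
    """Return the lines that contain every term, ignoring case."""
    return [line for line in lines if all(t.lower() in line.lower() for t in terms)]


def goals_by(events, scorer):
    """Return goal events whose scorer is the given player."""
    return [e for e in events if e.kind == "goal" and e.player == scorer]


keyword_hits = keyword_search(transcript, "goal", "Beckham")
semantic_hits = goals_by(events, "David Beckham")
print(len(keyword_hits), "keyword hits;", len(semantic_hits), "goal event")
```

The keyword query returns all three lines, including the miss and the goal scored by someone else, while the event query returns only the goal Beckham actually scored.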
An architecture: a macro-level organisational picture for LE software systems.
A framework: for programmers, GATE is an object-oriented class library that implements the architecture.
A development environment: for language engineers, a graphical development environment.
GATE comes with...
Free components, and wrappers for other people’s stuff
Tools for: evaluation; visualise/edit; persistence; IR; IE; dialogue; ontologies; etc.
Free software (LGPL) at http://gate.ac.uk/download/
Used by thousands of people at hundreds of sites
GATE team projects. Past:
Conceptual indexing: MUMIS: automatic semantic indices for sports video
MUSE, cross-genre entity finder
HSL, Health-and-safety IE
Old Bailey: collaboration with HRI on 17th century court reports
Multiflora: plant taxonomy text analysis for biodiversity research e-science
ACE/TIDES: Arabic, Chinese NE
JHU summer w/s on semtagging
EMILLE: S. Asian languages corpus
hTechSight: chemical eng. K. portal
Advanced Knowledge Technologies: €12m UK five site collaborative project
SEKT Semantic Knowledge Technology
PrestoSpace MM Preservation/Access
KnowledgeWeb Semantic Web
New eContent project LIRICS
Thousands of users at hundreds of sites. A representative sample:
the American National Corpus project
the Perseus Digital Library project, Tufts University, US
Longman Pearson publishing, UK
Merck KGaA, Germany
Canon Europe, UK
Knight Ridder, US
BBN (leading HLT research lab), US
SMEs inc. Sirma AI Ltd., Bulgaria
Stanford, Imperial College, London, the University of Manchester, UMIST, the University of Karlsruhe, Vassar College, the University of Southern California and a large number of other UK, US and EU Universities
UK and EU projects inc. MyGrid, CLEF, dotkom, AMITIES, Cub Reporter, EMILLE, Poesia...
Combines learning and rule-based methods (new work on mixed-initiative learning)
Allows combination of IE and IR
Enables use of large-scale linguistic resources for IE, such as WordNet
Supports ontologies as part of IE applications - Ontology-Based IE
Supports languages from Hindi to Chinese, Italian to German
Q & A
Signal md, Transcriptions
C21st: all the C20th mistakes but bigger & better?
If you don’t know where you’ve been, how can you know where you’re going?
Archives: ammunition in the war on ignorance
Ammunition is useless if you can’t find it: new technology must make our history accessible to all, for all our futures