Extracting Metadata for Spatially-Aware Information Retrieval on the Internet

Clough, Paul

University of Sheffield, UK

Presented By

Mayank Singh


Overview:

  • The importance of the experiment.

  • Introduction to SPIRIT and GATE.

  • Techniques employed – Geo Parsing and Geo Coding.

  • Pros

  • Cons

  • What it leads to.


The importance of the experiment:

  • A novel system.

  • Geospatial information extraction from Web documents.

  • Annotating the retrieved documents with spatial data.

  • Using the annotated documents to power a working GIR system.


How does it work (summary)

Extracting geospatial references from a document involves:

  • Identifying geographic references (geo-parsing)

  • Assigning them spatial co-ordinates (geo-coding)

  • Factors influencing the above: speed, reliability, flexibility and multilingualism.


Introduction to SPIRIT

  • Spatially-Aware Information Retrieval on the Internet

  • The main aim of the project is to create tools and techniques to help people find information that relates to specified geographical locations.


A 1 TB crawl of about 9 million web documents focused on the UK, Germany, France and Switzerland. Supported by an ontology of places.

Relevance ranking of web documents caters to the need for:

  • Documents referring to some place of interest

  • Digital geospatial resources


GATE

A Java suite for Natural Language Processing tasks, particularly useful and widely used in the area of Information Extraction. ANNIE (A Nearly-New Information Extraction system), the GATE component employed by SPIRIT, is the highlight of this experiment.


ANNIE

  • Tokenizer

  • Gazetteer

  • Sentence splitter

  • Part-of-speech tagger

  • Named-Entity transducer


Spatial Markup

Sources of spatial markup:

  • OS – Ordnance Survey (UK, point)

  • TGN – Getty Thesaurus of Geographic Names (global, point)

  • SABE – Seamless Administrative Boundaries of Europe (Europe, polygon)


Geo-Parsing

  • Named-Entity Recognition – lists + rules

  • List lookup alone is inefficient

  • Approach: gazetteer lookup first, then contextual evidence to decide whether a match really is a place.

  • JAPE (Java Annotation Patterns Engine) – rules defined in terms of the entities identified within GATE.

  • Rules are language-independent (using the Systran system)
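
The lookup-then-context idea above can be sketched in a few lines of Python. This is a minimal illustration only, not the SPIRIT or JAPE implementation: the gazetteer entries, the cue words and the single rule (capitalised token preceded by a locative cue) are all assumptions made for the example.

```python
import re

# Hypothetical toy gazetteer; a real system would load OS, TGN or SABE entries.
GAZETTEER = {"sheffield", "reading", "bath", "york"}

# Hypothetical context cues suggesting that the next word is a place.
PLACE_CUES = {"in", "near", "at", "from", "to"}


def tokenize(text):
    """Split text into alphabetic tokens (a crude stand-in for a real tokenizer)."""
    return re.findall(r"[A-Za-z]+", text)


def geo_parse(text):
    """Return tokens that match the gazetteer AND pass a simple context rule."""
    tokens = tokenize(text)
    hits = []
    for i, tok in enumerate(tokens):
        # Step 1: gazetteer lookup.
        if tok.lower() not in GAZETTEER:
            continue
        # Step 2: contextual evidence (assumed rule): the token is
        # capitalised and is preceded by a locative cue word.
        prev = tokens[i - 1].lower() if i > 0 else ""
        if tok[0].isupper() and prev in PLACE_CUES:
            hits.append(tok)
    return hits


if __name__ == "__main__":
    print(geo_parse("She enjoys reading in Bath near York."))
```

Here the lowercase "reading" matches the gazetteer but fails the context rule, which is the kind of false hit that lookup alone would produce.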


Hurdles faced

  • Filtering out commonly used words, especially those used in a non-geographical sense.

  • Using a person-name list to resolve ambiguity between place names and personal names.
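
A rough sketch of these two filters, assuming candidate place names have already been found by gazetteer lookup. The word lists and the "preceded by a first name" rule are hypothetical simplifications chosen for the example, not the lists or rules used in the original system.

```python
# Hypothetical lists; real systems would use much larger resources.
COMMON_WORDS = {"bath", "reading", "sale", "deal"}
PERSON_FIRST_NAMES = {"jack", "victoria", "lincoln"}


def filter_candidates(tokens, candidate_positions):
    """Keep only the candidate positions that survive both filters."""
    kept = []
    for i in candidate_positions:
        tok = tokens[i]
        # Filter 1: a lowercase common word is assumed to be used
        # in a non-geographical sense.
        if tok.islower() and tok in COMMON_WORDS:
            continue
        # Filter 2: a candidate directly after a known first name is
        # assumed to be a surname (e.g. "Jack London").
        if i > 0 and tokens[i - 1].lower() in PERSON_FIRST_NAMES:
            continue
        kept.append(i)
    return kept


if __name__ == "__main__":
    tokens = ["Jack", "London", "took", "a", "bath", "in", "London"]
    print(filter_candidates(tokens, [1, 4, 6]))
```

Only the final "London" survives in this example; the first is treated as part of a person name and "bath" as an ordinary word.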


Geo-Coding

  • Gazetteer lookup to assign co-ordinates

  • Removing ambiguity in place names using the feature hierarchy and feature type provided by OS.

  • Actual grounding is done with SABE and OS.

  • TGN is used to resolve global ambiguity.
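
One way to picture disambiguation by feature hierarchy and feature type is the sketch below. It is an assumed heuristic for illustration, not the method from the paper: candidates whose parent region is mentioned in the document are preferred, and ties are broken by a hypothetical type ranking. The gazetteer entries and their approximate co-ordinates are illustrative, not taken from OS, SABE or TGN.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    name: str
    feature_type: str
    parent: str      # enclosing region in the feature hierarchy
    lat: float
    lon: float


# Illustrative entries with approximate co-ordinates (hypothetical gazetteer).
GAZETTEER = [
    Entry("Newport", "city", "Wales", 51.58, -3.00),
    Entry("Newport", "town", "Isle of Wight", 50.70, -1.29),
    Entry("Newport", "village", "Essex", 51.98, 0.21),
]

# Assumed preference order: larger feature types win when nothing else decides.
TYPE_RANK = {"city": 0, "town": 1, "village": 2}


def geo_code(name, context_places=()):
    """Pick one gazetteer entry for `name`, or None if there is no match."""
    candidates = [e for e in GAZETTEER if e.name == name]
    if not candidates:
        return None
    context = set(context_places)
    # Sort key: entries whose parent appears in the context come first
    # (False sorts before True), then the feature-type rank.
    return min(
        candidates,
        key=lambda e: (
            e.parent not in context,
            TYPE_RANK.get(e.feature_type, len(TYPE_RANK)),
        ),
    )


if __name__ == "__main__":
    print(geo_code("Newport"))
    print(geo_code("Newport", context_places=["Essex"]))
```

With no context the city is chosen; when "Essex" appears elsewhere in the document, the village in Essex is chosen instead.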


Experimental Setup

  • Total annotated collection of about 8.8 million pages

  • 22 of the top 50 domains are from Europe

  • About 1.6 million documents containing 5-10 unique footprints were selected. A further 10% were chosen from these, and then only those from the UK (130)

  • All geographic names (1864) were manually identified and stored as a benchmark


Geo-parsing Results

SPIRIT + SABE + OS:

  • Correct – 1340

  • Missing – 479

  • False Hits – 596

  • Precision – 0.6966

  • Recall – 0.7820

  • F1 – 0.7184
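
For reference, the standard definitions of precision, recall and F1 in terms of the three counts above can be sketched as follows. Pooling the counts this way gives roughly 0.69, 0.74 and 0.71, which differs from the figures on the slide; one possible explanation, stated here only as an assumption, is that the reported scores were averaged per document rather than computed from pooled counts.

```python
def precision_recall_f1(correct, missing, false_hits):
    """Standard pooled precision, recall and F1 from raw counts."""
    precision = correct / (correct + false_hits)   # share of hits that are right
    recall = correct / (correct + missing)         # share of true names found
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    p, r, f = precision_recall_f1(correct=1340, missing=479, false_hits=596)
    print(f"precision={p:.4f} recall={r:.4f} f1={f:.4f}")
```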


Geo-Coding Results

  • TGN ineffective due to its global scope – 1021 found, 68% ambiguous.

  • UK SABE good – 942 found, 11% ambiguous.

  • 1137 places were assigned a UID correctly, that is, not only the correct geographic sense but the correct resource order too.


Conclusions

  • Promising, with a success rate of 89%.

  • Geo-parsing can be improved by enhancing gazetteer matching methods and filtering out non-geographic entries

  • Geo-coding can be improved by finding better methods for combining geographic resources.


Pros

  • Novel system and high success rate.

  • Towards a geospatial search engine.

  • Spatial markup resources in abundance.


Cons

  • Ambiguity (geographical)

  • Matching the correct geographical sense.

  • Large overhead required to build such systems.

  • Inherent NLP problems.


What it all leads to

  • Creating a geographical ontology to assist in GIR (Challenges and Resources for Evaluating Geographical IR - Bruno Martins, Mário J. Silva and Marcirio Silveira Chaves, Faculdade de Ciências da Universidade de Lisboa, 1749-016 Lisboa, Portugal)

  • More focused local and topical search (Urban Web Crawling - Dirk Ahlers, OFFIS Institute for Information Technology, Oldenburg, Germany; Susanne Boll, University of Oldenburg, Germany)


References

  • Extracting Metadata for Spatially-Aware Information Retrieval on the Internet - Clough, Paul, University of Sheffield, UK

  • GATE - http://gate.ac.uk/overview.html

  • SPIRIT - http://www.geo-spirit.org/project_full.html

  • Challenges and Resources for Evaluating Geographical IR - Bruno Martins, Mário J. Silva and Marcirio Silveira Chaves, Faculdade de Ciências da Universidade de Lisboa, 1749-016 Lisboa, Portugal

  • Urban Web Crawling - Dirk Ahlers, OFFIS Institute for Information Technology, Oldenburg, Germany; Susanne Boll, University of Oldenburg, Germany