A model for geographic knowledge extraction on web documents
SECOGIS – ER 2009 Gramado – RS- Brazil, 13th November 2009. A Model for Geographic Knowledge Extraction on Web Documents. Cláudio E. C. Campelo and Cláudio de Souza Baptista University of Campina Grande Computer Science Department Information Systems Laboratory

SECOGIS – ER 2009

Gramado – RS- Brazil, 13th November 2009

A Model for Geographic Knowledge Extraction on Web Documents

Cláudio E. C. Campelo and

Cláudio de Souza Baptista

University of Campina Grande

Computer Science Department

Information Systems Laboratory

http://www.lsi.dsc.ufcg.edu.br

Cláudio Baptista, UFCG http://lsi.dsc.ufcg.edu.br


Agenda

  • Introduction

  • Main Challenges

  • Detection of Geographic References

  • The Geographic Scope

  • GeoSEn Prototype

    • Architecture

    • GUI

  • Experiments

  • Conclusion and Future Work

Introduction

  • Web: need for searching using the geographic context;

  • Traditional search engines: search based on keywords only;

  • Example:

    • A Web document: “...With the arrival of the industry in Gramado, one thousand new jobs for Java programmers will be created...”;

    • User query: “Java programmer jobs Brazil”;

  • A keyword-only engine will not retrieve this document for that query, because the text never mentions “Brazil” even though Gramado is a Brazilian city!
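
The gap in this example can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not GeoSEn code: the `PARENT` containment table, the crude plural stripping in `tokens`, and the two matching functions.

```python
# Hypothetical containment table: each place points to the place that contains it.
PARENT = {
    "gramado": "rio grande do sul",
    "rio grande do sul": "south",
    "south": "brazil",
}

def tokens(text):
    """Lowercase a text, drop punctuation and strip a trailing plural 's'."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in text.lower())
    return {w[:-1] if w.endswith("s") and len(w) > 3 else w for w in cleaned.split()}

def keyword_match(query, document):
    """Keyword-only retrieval: every query term must occur in the document."""
    return tokens(query) <= tokens(document)

def expand_places(document):
    """Places mentioned in the document plus every place that contains them."""
    found = set()
    for word in tokens(document):
        place = word if word in PARENT else None
        while place is not None:
            found.add(place)
            place = PARENT.get(place)
    return found

def geo_match(query, document):
    """Match place names against the expanded places, other terms by keyword."""
    doc_tokens = tokens(document)
    places = expand_places(document)
    known_places = set(PARENT) | set(PARENT.values())
    for term in tokens(query):
        if term in known_places:
            if term not in places:
                return False
        elif term not in doc_tokens:
            return False
    return True

doc = ("With the arrival of the industry in Gramado, one thousand new jobs "
       "for Java programmers will be created")
query = "Java programmer jobs Brazil"
print(keyword_match(query, doc), geo_match(query, doc))
```

The keyword-only function rejects the document because "Brazil" is absent, while the geography-aware one accepts it because Gramado expands upward to Brazil.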

Introduction

  • What is the Geographic Context of Web documents?

    • The place where the information was created?

    • The places mentioned in the document content?

    • The places where the people most interested in the information are located?

    • etc…

  • Many documents have such a context:

  • A study in Portugal that considered only occurrences of the names of Portuguese cities (308 in total) found:

    • About 4 million pages analyzed;

    • 2.2 references per document;

    • 4% of the submitted queries referred to one of those cities.

Main Challenges

  • Detection of geographic references in the documents;

  • Modeling of geographic scope of documents;

  • Relevance ranking according to geographic context;

  • Efficient indexing techniques that cope with both the textual and the spatial dimensions;

  • User interfaces that make both dimensions easy to work with.

Detection of Geographic References

  • Aim: to identify document features which may be mapped to a geographic place name;

  • Challenge: eliminating ambiguities, e.g.:

    • Places named after things (e.g. Gramado, Canela);

    • Places named after people (e.g. Garibaldi);

    • Places with the same name and the same type (e.g. Cachoeirinha-PE and Cachoeirinha-RS);

    • Places with the same name and different types (e.g. the city of Rio de Janeiro and the state of Rio de Janeiro);

    • Places and gentilics with the same name (e.g. the city of Paulista-PE and “paulista”, a person born in São Paulo).

Detection of Geographic References

  • Another example of ambiguity:

    • São Paulo as a State

    • São Paulo as a City

    • São Paulo as a football team

    • São Paulo as the name of a hospital

    • São Paulo as the Saint!

Detection of Geographic References

  • Parts of a page examined: page content, page title, URL;

  • Types of detected places: all levels of the spatial hierarchy (from city to region);

  • Types of detected references: place names, postal codes, telephone area codes, gentilics.

Definitions

  • Confidence Rate (CR): the probability that a given reference is a valid place name.

  • Confidence Factor (CF): a measure associated with each feature analyzed during the detection of a geographic reference.

[Diagram: one CR is related to N CFs (1:N)]
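
The slides do not state how the N confidence factors are combined into one confidence rate. The sketch below is therefore only one plausible reading, assuming each CF lies in [0, 1] and counts as independent evidence (a noisy-OR combination); the function name `confidence_rate` is hypothetical.

```python
def confidence_rate(factors):
    """Combine N confidence factors (each in [0, 1]) into one confidence rate.

    Assumption: the factors are independent pieces of evidence, so the rate
    is 1 minus the product of the remaining doubts (noisy-OR). The actual
    rule used by the authors is not given in the slides.
    """
    remaining_doubt = 1.0
    for cf in factors:
        if not 0.0 <= cf <= 1.0:
            raise ValueError(f"confidence factor out of range: {cf}")
        remaining_doubt *= 1.0 - cf
    return 1.0 - remaining_doubt

# Two moderately confident features reinforce each other.
print(confidence_rate([0.5, 0.5]))
```

With this rule a single strong factor dominates, and adding weak factors can only raise the rate, never lower it.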

Confidence Factor

  • CFST – analyzes the occurrence of special terms (STs) associated with geographic references;

    • Examples of STs include: “in” (e.g. “in Gramado”); “city” (e.g. “city of São Paulo”); “ZIP” (e.g. “ZIP: 58109-000”);

    • Storage of special terms:

      • Term;

      • Type of geographic reference (zip code, telephone area code, place name, etc.);

      • Type of place (city, state, region);

      • Minimum distance (DMIN);

      • Maximum distance (DMAX);

      • Maximum confidence grade (CMAX).
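
The stored fields can be pictured as a small record. In the sketch below the field names follow the list above, but the scoring rule in `cf_st` (full CMAX at DMIN, decaying linearly with distance up to DMAX) and the sample values are assumptions for illustration; the slides do not say how DMIN, DMAX and CMAX are used.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SpecialTerm:
    term: str            # e.g. "city"
    reference_type: str  # e.g. "place name", "zip code", "telephone area code"
    place_type: str      # e.g. "city", "state", "region"
    dmin: int            # minimum distance (in words) to the candidate reference
    dmax: int            # maximum distance (in words) to the candidate reference
    cmax: float          # maximum confidence grade

def cf_st(special, distance):
    """Confidence contributed by a special term found `distance` words away.

    Assumed rule: CMAX at DMIN, decreasing linearly, 0 outside [DMIN, DMAX].
    """
    if distance < special.dmin or distance > special.dmax:
        return 0.0
    span = special.dmax - special.dmin + 1
    return special.cmax * (1 - (distance - special.dmin) / span)

# Illustrative values only: "city" must appear 1 to 3 words from the name.
city = SpecialTerm("city", "place name", "city", dmin=1, dmax=3, cmax=0.9)
for d in range(0, 5):
    print(d, round(cf_st(city, d), 2))
```

Keeping the type of reference and type of place on the record lets a match such as "city of São Paulo" also vote for the city reading rather than the state reading.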

Confidence Factor

  • CFTS – uses a traditional search engine to estimate the probability that a term is a geographic reference;

Confidence Factor

  • CFCROSS – analyzes the occurrence of cross references based on topological relationships (inside, contains, etc.);

  • CFFMT – evaluates the syntax used to describe the geographic references:

    • Abbreviation of place names (R. de Janeiro, RJ);

    • The use of uppercase in the place names;

    • Telephone format, e.g. (083)-999-3456;

    • Postal code format, e.g. 58.104-867.
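
The format checks behind CFFMT could be approximated with regular expressions. The patterns below are assumptions derived only from the examples in these slides (they are not the authors' rules), and `format_evidence` is a hypothetical helper name.

```python
import re

# Postal code as in the slide examples: "58109-000" or "58.104-867".
POSTAL_CODE = re.compile(r"\b\d{2}\.?\d{3}-\d{3}\b")
# Telephone with a parenthesised area code, e.g. "(083)-999-3456".
TELEPHONE = re.compile(r"\(\s*(\d{2,3})\s*\)-?\s*\d{3,4}-\d{4}")
# Two-letter uppercase abbreviation such as "RJ".
STATE_ABBREVIATION = re.compile(r"\b[A-Z]{2}\b")

def format_evidence(text):
    """Collect substrings whose syntax suggests a geographic reference."""
    return {
        "postal_codes": POSTAL_CODE.findall(text),
        # findall returns only the captured group, i.e. the area code.
        "area_codes": TELEPHONE.findall(text),
        "abbreviations": STATE_ABBREVIATION.findall(text),
    }

sample = "Office in R. de Janeiro, RJ. Postal code 58.104-867, phone (083)-999-3456."
print(format_evidence(sample))
```

A syntactic match is only evidence, not proof: any two uppercase letters pass the abbreviation pattern, so a real system would still check them against a gazetteer.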

Modeling of the Geographic Scope

  • A document may be associated with one or more places;

  • A geographic scope may include places that are not mentioned directly in the document (geographic expansion);

  • Each place in the scope has an associated relevance value.
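
A scope with geographic expansion can be sketched as a mapping from place to relevance. The hierarchy fragment uses place names from the GeoSEn example in this deck, but the halving `decay` and the rule of keeping the highest value per place are assumptions for illustration; the slides do not give the actual relevance formula.

```python
# Hypothetical fragment of a place hierarchy: child -> containing place.
PARENT = {
    "Gramado": "Gramado-Canela",
    "Gramado-Canela": "Metropolitana de Porto Alegre",
    "Metropolitana de Porto Alegre": "Rio Grande do Sul",
    "Rio Grande do Sul": "South",
}

def geographic_scope(detected, decay=0.5):
    """Map each place in the scope to a relevance value.

    `detected` maps places found in the document to their relevance.
    Each containing place receives the child's relevance times `decay`
    (assumed rule); if a place is reached more than once, the highest
    value is kept.
    """
    scope = {}
    for place, relevance in detected.items():
        while place is not None:
            scope[place] = max(scope.get(place, 0.0), relevance)
            place = PARENT.get(place)
            relevance *= decay
    return scope

# The document mentions only Gramado, yet its scope reaches the South region.
for place, value in geographic_scope({"Gramado": 0.8}).items():
    print(f"{place}: {value:.2f}")
```

Because the relevance values depend only on the document, they can be computed at indexing time rather than at query time.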

Geographic Dispersion Rate

  • Another factor used in the composition of the geographic relevance value;

  • Hypothesis: dispersed references may characterize regions that share common features (e.g. cultural, economic, social).

GeoSEn – an overview

  • Geographic Search Engine:

    • Indexes a subset of the Brazilian Web;

    • Deals with 6,291 places in Brazil, organized in a five-level hierarchy, from region down to municipality:

      • Region: e.g. South

      • State: e.g. Rio Grande do Sul

      • MesoRegion: e.g. Metropolitana de Porto Alegre

      • MicroRegion: e.g. Gramado-Canela

      • Municipality: e.g. Gramado

GeoSEn - Architecture

Query Example

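  • Example of a query using a user-defined area of interest; it returns the outermost places inside that area:

    -- Places that lie inside the area of interest ...
    SELECT plc1.id
    FROM places plc1
    WHERE within(plc1.geometry, specified_geometry)
      -- ... and are not contained in another place that is also inside it.
      AND NOT EXISTS (
        SELECT plc2.id
        FROM places plc2
        WHERE plc2.id <> plc1.id  -- skip the place itself, assuming within(g, g) is true
          AND within(plc2.geometry, specified_geometry)
          AND within(plc1.geometry, plc2.geometry))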
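
The logic of this query can be traced in plain Python. The sketch below is an illustration only: it replaces real geometries with axis-aligned boxes, and the place ids and coordinates are made up.

```python
from typing import NamedTuple

class Box(NamedTuple):
    """Axis-aligned rectangle standing in for a real geometry."""
    min_x: float
    min_y: float
    max_x: float
    max_y: float

def within(inner, outer):
    """True when `inner` lies entirely inside `outer` (boundaries included)."""
    return (outer.min_x <= inner.min_x and outer.min_y <= inner.min_y
            and inner.max_x <= outer.max_x and inner.max_y <= outer.max_y)

def outermost_places(places, area):
    """Ids of places inside `area` that are not inside another place inside `area`."""
    inside = {pid: geom for pid, geom in places.items() if within(geom, area)}
    return sorted(
        pid for pid, geom in inside.items()
        if not any(other != pid and within(geom, other_geom)
                   for other, other_geom in inside.items())
    )

# Made-up data: one state containing city_a; city_c lies outside the state.
places = {
    "state": Box(0, 0, 10, 10),
    "city_a": Box(1, 1, 2, 2),
    "city_b": Box(20, 20, 21, 21),
    "city_c": Box(11, 1, 12, 2),
}
print(outermost_places(places, Box(-1, -1, 13, 11)))
print(outermost_places(places, Box(0.5, 0.5, 13, 11)))
```

When the area covers the whole state, the state stands in for its cities; when the area cuts through the state, the cities that fit inside are returned individually.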

Experiments

  • Experiments using 66,531 indexed documents;

  • 5 classes: .edu, .gov, blogs, tourism, arts;

  • Detection of terms:

    • Documents from the Web manually analyzed;

    • Documents with strong ambiguities created for the test bed;

Conclusion

  • We have presented a heuristic-based approach to implementing a GIR (Geographic Information Retrieval) system;

  • The techniques presented may be combined with others already known;

  • Precomputed relevance values may be used to simplify the search process.

Future Work

  • Retrieval of georeferenced images and videos;

  • Recognition of other kinds of places;

  • Integration of other data sources;

  • Evaluation using large data set collections.

Thank you very much!

Questions?