A model for geographic knowledge extraction on web documents

SECOGIS – ER 2009 Gramado – RS- Brazil, 13th November 2009. A Model for Geographic Knowledge Extraction on Web Documents. Cláudio E. C. Campelo and Cláudio de Souza Baptista University of Campina Grande Computer Science Department Information Systems Laboratory


SECOGIS – ER 2009

Gramado – RS – Brazil, 13th November 2009

A Model for Geographic Knowledge Extraction on Web Documents

Cláudio E. C. Campelo and

Cláudio de Souza Baptista

University of Campina Grande

Computer Science Department

Information Systems Laboratory

http://www.lsi.dsc.ufcg.edu.br

Cláudio Baptista, UFCG http://lsi.dsc.ufcg.edu.br


Agenda

  • Introduction

  • Main Challenges

  • Detection of Geographic References

  • The Geographic Scope

  • GeoSEn Prototype

    • Architecture

    • GUI

  • Experiments

  • Conclusion and Future Work

Introduction

  • Web: users need to search by geographic context;

  • Traditional search engines: search based on keywords only;

  • Example:

    • A Web document: “...With the arrival of the industry in Gramado, one thousand new jobs for Java programmers will be created...”;

    • User query: “Java programmer jobs Brazil”;

  • This document will not be retrieved by the query above, because it mentions Gramado but never the word “Brazil”!
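
The mismatch in this example can be sketched in a few lines. This is only an illustration, not GeoSEn's implementation: the one-entry `ANCESTORS` table, the prefix matching used as a stand-in for stemming, and all function names are assumptions made for the example.

```python
# Hypothetical mini-gazetteer: each place maps to the places that contain it.
ANCESTORS = {"gramado": ["rio grande do sul", "brazil"]}

DOCUMENT = ("With the arrival of the industry in Gramado, one thousand "
            "new jobs for Java programmers will be created")


def tokens(text: str) -> set:
    """Lowercase the text and split it into words, ignoring commas."""
    return set(text.lower().replace(",", " ").split())


def keyword_match(document: str, query: str) -> bool:
    """Keyword-only search: every query term must prefix some document word.

    Prefix matching is a crude stand-in for stemming (programmer/programmers).
    """
    words = tokens(document)
    return all(any(w.startswith(t) for w in words) for t in tokens(query))


def geographic_scope(document: str) -> set:
    """Places mentioned in the document plus the places that contain them."""
    scope = set()
    for word in tokens(document):
        if word in ANCESTORS:
            scope.add(word)
            scope.update(ANCESTORS[word])
    return scope


def geo_match(document: str, keywords: str, place: str) -> bool:
    """Match the keywords against the text and the place against the scope."""
    return (keyword_match(document, keywords)
            and place.lower() in geographic_scope(document))


keyword_only = keyword_match(DOCUMENT, "Java programmer jobs Brazil")
with_scope = geo_match(DOCUMENT, "Java programmer jobs", "Brazil")
print(keyword_only, with_scope)  # False True
```

Treating the place as a separate dimension of the query, rather than as one more keyword, is what lets the document be found.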

Introduction

  • What is the Geographic Context of Web documents?

    • The place where the information was created?

    • The places mentioned in the document content?

    • The places where the people most interested in the information are located?

    • etc…

  • Many documents have such a context:

  • A study in Portugal that considered only occurrences of the names of Portuguese cities (308 in total) found:

    • About 4 million pages analyzed;

    • 2.2 references per document on average;

    • 4% of the submitted queries referred to one of those cities.

Main Challenges

  • Detection of geographic references in the documents;

  • Modeling of geographic scope of documents;

  • Relevance ranking according to geographic context;

  • Efficient indexing techniques that cope with both the textual and the spatial dimensions;

  • User interfaces that make it easy to work with both dimensions.

Detection of Geographic References

  • Aim: to identify document features which may be mapped to a geographic place name;

  • Challenge: eliminating ambiguities, e.g.:

    • Places named after things (e.g. Gramado, Canela);

    • Places named after people (e.g. Garibaldi);

    • Places with the same name and the same type (e.g. Cachoeirinha-PE and Cachoeirinha-RS);

    • Places with the same name and different types (e.g. the city of Rio de Janeiro and the state of Rio de Janeiro);

    • Places and gentilics with the same name (e.g. the city of Paulista-PE and “paulista”, a person born in São Paulo).
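
A gazetteer lookup makes these ambiguities concrete: a term is ambiguous whenever it yields more than one interpretation. The sketch below is illustrative only; the rows, the `GENTILICS` table and the `candidates` function are assumptions that mirror the examples on this slide, not GeoSEn's actual data model.

```python
from collections import defaultdict

# Hypothetical gazetteer rows: (name, type of place, enclosing place).
GAZETTEER = [
    ("Cachoeirinha", "city", "PE"),
    ("Cachoeirinha", "city", "RS"),
    ("Rio de Janeiro", "city", "RJ"),
    ("Rio de Janeiro", "state", "Southeast"),
    ("Paulista", "city", "PE"),
]

# Hypothetical gentilic table: gentilic -> place it refers to.
GENTILICS = {"paulista": "São Paulo"}

# Index the gazetteer by lowercased name so lookups ignore case.
INDEX = defaultdict(list)
for name, kind, parent in GAZETTEER:
    INDEX[name.lower()].append((kind, parent))


def candidates(term: str) -> list:
    """Return every interpretation of a term as (category, detail) pairs."""
    key = term.lower()
    found = list(INDEX.get(key, []))
    if key in GENTILICS:
        found.append(("gentilic", GENTILICS[key]))
    return found


for term in ("Cachoeirinha", "Rio de Janeiro", "Paulista", "Gramado"):
    print(term, candidates(term))
```

An empty list means the term is unknown to this toy gazetteer; a list with several entries is exactly the case the confidence heuristics have to resolve.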

Detection of Geographic References

  • Another example of ambiguity:

    • São Paulo as a State

    • São Paulo as a City

    • São Paulo as a football team

    • São Paulo as the name of a hospital

    • São Paulo as the Saint!

Detection of Geographic References

  • Parts of the page examined: page content, page title, URL;

  • Types of places detected: all levels of the spatial hierarchy (from city to region);

  • Types of references detected: place names, postal codes, telephone area codes, gentilics.

Definitions

  • Confidence Rate (CR): the probability that a given reference is a valid place name.

  • Confidence Factor (CF): a measure associated with each feature analyzed during the detection of a geographic reference.

[Diagram: one CR is related to N CFs (1:N)]
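
The transcript does not say how the N confidence factors are combined into a single CR. As one possible reading, the sketch below uses a weighted average; the weights, the sample values and the `confidence_rate` function are all assumptions, while the factor names (CFST, CFTS, CFCROSS, CFFMT) come from the slides that follow.

```python
def confidence_rate(factors: dict, weights: dict) -> float:
    """Combine confidence factors (each in [0, 1]) into one CR in [0, 1].

    Assumption: CR is the weighted average of the available factors.
    """
    total_weight = sum(weights[name] for name in factors)
    if total_weight == 0:
        return 0.0
    weighted_sum = sum(factors[name] * weights[name] for name in factors)
    return weighted_sum / total_weight


# Hypothetical weights and factor values for one candidate reference.
weights = {"CFST": 2.0, "CFTS": 1.0, "CFCROSS": 1.0, "CFFMT": 1.0}
factors = {"CFST": 0.9, "CFTS": 0.5, "CFCROSS": 1.0, "CFFMT": 0.6}

cr = confidence_rate(factors, weights)
print(round(cr, 2))  # 0.78
```

A weighted average keeps CR in the same [0, 1] range as its inputs, which fits its description as a probability.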

Confidence Factor

  • CFST – analyzes the occurrence of special terms (STs) associated with geographic references;

    • Examples of STs include: “in” (e.g. “in Gramado”); “city” (e.g. “city of São Paulo”); “ZIP” (e.g. “ZIP: 58109-000”);

    • Storage of special terms:

      • Term;

      • Type of geographic reference (zip code, telephone area code, place name, etc.);

      • Type of place (city, state, region);

      • Minimum distance (DMIN);

      • Maximum distance (DMAX);

      • Maximum confidence grade (CMAX).
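
The stored fields suggest that the confidence contributed by a special term depends on how far it is from the candidate reference. The sketch below is one hypothetical interpretation: distance is counted in tokens, the confidence is CMAX at DMIN and decays linearly up to DMAX, and it is zero outside that range. The record layout follows the slide; the scoring rule and the sample values are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpecialTerm:
    term: str            # e.g. "city"
    reference_type: str  # zip code, telephone area code, place name, ...
    place_type: str      # city, state, region
    dmin: int            # minimum distance (in tokens) to the candidate
    dmax: int            # maximum distance (in tokens) to the candidate
    cmax: float          # maximum confidence grade


def cfst(special_term: SpecialTerm, distance: int) -> float:
    """Confidence contributed by a special term found `distance` tokens away.

    Assumption: full CMAX at DMIN, linear decay up to DMAX, zero elsewhere.
    """
    if distance < special_term.dmin or distance > special_term.dmax:
        return 0.0
    span = special_term.dmax - special_term.dmin + 1
    decay = (distance - special_term.dmin) / span
    return special_term.cmax * (1.0 - decay)


# Hypothetical entry for "city", as in "city of São Paulo".
city = SpecialTerm("city", "place name", "city", dmin=1, dmax=3, cmax=0.9)

for distance in range(0, 5):
    print(distance, round(cfst(city, distance), 2))
```

In “city of São Paulo” the place name starts two tokens after “city”, so under these assumptions it would receive a grade between zero and CMAX.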

Confidence Factor

  • CFTS – estimates, with the help of a traditional search engine, the probability of a term being a geographic reference;

Confidence Factor

  • CFCROSS – analyzes the occurrence of cross references based on topological relationships (inside, contains, etc.);

  • CFFMT – evaluates the syntax used to describe the geographic references:

    • Abbreviation of place names (R. de Janeiro, RJ);

    • The use of uppercase in the place names;

    • Telephone format, e.g. (083)-999-3456;

    • Postal code format, e.g. 58.104-867.
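
Format checks of this kind are commonly written as regular expressions. The patterns below are a minimal sketch inferred only from the examples on these slides (58109-000, 58.104-867, (083)-999-3456); they are assumptions, not the rules GeoSEn applies, and the connector list is likewise hypothetical.

```python
import re

# Postal code: five digits (optionally split as 2.3), a hyphen, three digits.
POSTAL_CODE = re.compile(r"\b\d{2}\.?\d{3}-\d{3}\b")

# Telephone area code: two digits, optionally prefixed by 0, in parentheses.
AREA_CODE = re.compile(r"\(\s*(0?\d{2})\s*\)")

# Lowercase connectors that do not count against a capitalized place name.
CONNECTORS = {"de", "do", "da", "dos", "das"}


def find_postal_codes(text: str) -> list:
    """Return all substrings that look like postal codes."""
    return POSTAL_CODE.findall(text)


def find_area_codes(text: str) -> list:
    """Return the digits of every parenthesized telephone area code."""
    return AREA_CODE.findall(text)


def is_capitalized(name: str) -> bool:
    """True if every word except connectors starts with an uppercase letter."""
    words = [w for w in name.split() if w not in CONNECTORS]
    return bool(words) and all(w[0].isupper() for w in words)


print(find_postal_codes("ZIP: 58109-000 or 58.104-867"))
print(find_area_codes("Call ( 083)-999-3456"))
print(is_capitalized("Rio de Janeiro"), is_capitalized("rio de janeiro"))
```

Each check could then feed a CFFMT value, for instance a higher grade for a correctly capitalized name than for a lowercase one.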

Modeling of the Geographic Scope

  • A document may be associated with one or more places;

  • A geographic scope may include places that are not mentioned directly in the document (geographic expansion);

  • Each place in the scope has an associated relevance value.
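
Geographic expansion can be sketched as a walk up the place hierarchy, giving each enclosing place a share of the relevance of the place actually mentioned. The parent chain below reuses the hierarchy examples from the GeoSEn overview slide; the halving of relevance at each level and the `expand_scope` function are assumptions for illustration.

```python
# Child -> parent links, following the hierarchy examples in the slides.
PARENT = {
    "Gramado": "Gramado-Canela",
    "Gramado-Canela": "Metropolitana de Porto Alegre",
    "Metropolitana de Porto Alegre": "Rio Grande do Sul",
    "Rio Grande do Sul": "South",
}


def expand_scope(mentions: dict, decay: float = 0.5) -> dict:
    """Expand mentioned places to their ancestors with decaying relevance.

    `mentions` maps each mentioned place to its relevance. Assumption: each
    step up the hierarchy multiplies the relevance by `decay`; when several
    mentions reach the same ancestor, the highest value is kept.
    """
    scope = {}
    for place, relevance in mentions.items():
        current, value = place, relevance
        while current is not None:
            scope[current] = max(scope.get(current, 0.0), value)
            current = PARENT.get(current)
            value *= decay
    return scope


scope = expand_scope({"Gramado": 0.8})
for place, relevance in scope.items():
    print(f"{place}: {relevance:.2f}")
```

With such a scope, a query restricted to Rio Grande do Sul can retrieve a document that only mentions Gramado, at a lower relevance than a document that names the state itself.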

Geographic Dispersion Rate

  • Another factor used in composing the geographic relevance value;

  • Hypothesis: dispersed references may characterize regions that share common features (e.g. cultural, economic, social).

GeoSEn – an overview

  • Geographic Search Engine:

    • Indexes a subset of the Brazilian Web;

    • Deals with 6,291 places in Brazil, organized in a five-level hierarchy, from city to region:

      • Region: e.g. South

      • State: e.g. Rio Grande do Sul

      • MesoRegion: e.g. Metropolitana de Porto Alegre

      • MicroRegion: e.g. Gramado-Canela

      • Municipality: e.g. Gramado

GeoSEn - Architecture

[Figure: GeoSEn architecture diagram, not preserved in this transcript]

Query Example

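  • Example of a query using a user-defined area of interest:

    -- Select the top-most places that lie inside the area of interest
    -- (specified_geometry stands for the user-defined area).
    SELECT plc1.id
    FROM places plc1
    WHERE within(plc1.geometry, specified_geometry)
      AND NOT EXISTS (
        -- another place inside the area that contains plc1
        SELECT plc2.id
        FROM places plc2
        WHERE within(plc2.geometry, specified_geometry)
          AND plc2.id <> plc1.id  -- skip plc1 itself: within(g, g) is true under OGC semantics
          AND within(plc1.geometry, plc2.geometry))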
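
The intent of this query, keeping only the places inside the area that are not themselves contained in another place inside the area, can be checked with a small in-memory sketch. Axis-aligned boxes stand in for real geometries, and the `Box` type, the sample places and `topmost_places` are assumptions made for the example. A place is never compared with itself, since a geometry is trivially within itself.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Axis-aligned rectangle used as a stand-in for a real geometry."""
    xmin: float
    ymin: float
    xmax: float
    ymax: float


def within(a: Box, b: Box) -> bool:
    """True if box `a` lies entirely inside box `b` (borders included)."""
    return (a.xmin >= b.xmin and a.ymin >= b.ymin
            and a.xmax <= b.xmax and a.ymax <= b.ymax)


def topmost_places(places: dict, area: Box) -> list:
    """Ids of places inside `area` not contained in another such place."""
    inside = {pid: geom for pid, geom in places.items() if within(geom, area)}
    return sorted(
        pid for pid, geom in inside.items()
        if not any(other != pid and within(geom, other_geom)
                   for other, other_geom in inside.items())
    )


# Hypothetical places: a state, a city inside it, a city outside it,
# and a place outside the area of interest.
places = {
    "state": Box(0, 0, 10, 10),
    "city_a": Box(1, 1, 2, 2),
    "city_b": Box(12, 1, 13, 2),
    "far": Box(30, 30, 31, 31),
}
area = Box(-1, -1, 20, 20)

result = topmost_places(places, area)
print(result)  # ['city_b', 'state']
```

Here `city_a` is dropped because the state that contains it is already in the result, which keeps the list of matched places as coarse as the selected area allows.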

Experiments

  • Experiments using 66,531 indexed documents;

  • 5 classes: .edu, .gov, blogs, tourism, arts;

  • Detection of terms:

    • Documents from the Web manually analyzed;

    • Documents with strong ambiguities created for the test bed;

Conclusion

  • We have presented a heuristic-based approach to implementing a GIR system.

  • The techniques presented may be combined with other known techniques.

  • Precomputed relevance values may be used to simplify the search process.

Future Work

  • Retrieval of georeferenced images and videos;

  • Recognition of other kinds of places;

  • Integration of other data sources;

  • Evaluation using large data collections.

Thank you very much!

Questions?
