
Integration of Term, Spatial and Temporal References for IR



Presentation Transcript


  1. Integration of Term, Spatial and Temporal References for IR: A Prototype System

  2. Outline
     • Motivation
     • Problem Statement
     • Prior Work
     • Work Effort
     • Overall System Architecture
     • TREC 2004 Documents
     • MySQL
     • MetaCarta
     • Temporal Tagging
     • Future Work
     • Thanks
     Integration of Term, Spatial and Temporal References for IR Edward M. DeVilliers, December 7, 2006

  3. Motivation
     • Term, spatial, and temporal references have not been brought together for IR
       • Term: standard IR
       • Term-Spatial: Geographic IR
       • Spatial-Temporal: Moving Object Databases
     • Several unanswered questions
       • Occurrence frequency of spatial and temporal terms (like Heaps' Law?)
       • Resolving spatial ambiguity, based on time
         • e.g. Which political boundary defines the area covered by a specific European country at time x?
       • Combined term/spatial/temporal indexing
       • F-measure improvement using spatial and temporal references?
     http://users.erols.com/mwhite28/20centry.htm (Historical Atlas of the Twentieth Century by Matthew White)

  4. Problem Statement
     • There are many open questions concerning the combination of term, spatial and temporal references for information retrieval.
     • Before these questions can be answered, the three areas first need to be brought together as a system. Various independent components need to be combined so that they can process documents, analogous to how commercial IR systems do today, while also performing spatial and temporal analysis.
     • The resulting system needs to be accessible to the IR research community and provide a foundation for future work.

  5. Prior Work
     • There is no prior work in combining term, spatial and temporal references for IR
     • Prior work is divided into pairs of the three reference types
     Term/GeoSpatial
     • Database support for text and spatial storage, indexing, and querying
       • MySQL
       • PostgreSQL
     • Geoparsers
       • MetaCarta
       • Google API
     Spatial/Temporal
     • Moving Object Databases (MOD)
       • Mainly academic
       • 1-to-1 spatial-temporal
     • Spatio-Temporal Access Methods
     Temporal Processing
     • Temporal Databases
       • TQL2
     • NLP Temporal Taggers
       • Various formats for tagging
         • TimeML
         • Timex
       • Corpus focused on training, not IR
       • Various tools, but mainly manual processes

  6. Overall System Architecture
     • The GeoSpatial/Temporal Information Retrieval (GTIR) prototype system is based on:
       • Open source components
       • Internet services
     • TREC 2004 standard corpus provides the basic documents
     • Perl is the main programming environment, as the temporal/NLP tools are also written in Perl
     • Components:
       • GTIR Processor – glueware
       • MySQL – Text and spatial IR storage and indexing
       • MetaCarta – Geocoding of TREC documents
       • SVM-Tagger – Part-Of-Speech (POS) Natural Language Processing (NLP) – preprocessing for GUTime
       • GUTime – Temporal tagging of time references in TREC documents

  7. Project Steps
     • Designed and created MySQL tables using MySQL Administrator
     • Created individual functions
       • Processed TREC files into individual documents
       • Database inserts
       • Geo-parsing calls
       • POS Tagger calls
       • Temporal Tagger calls
     • Merged Perl functions into GTIRProc
     • Loaded MySQL 5.0
     • Loaded Perl Environment
       • ActivePerl
       • DzSoft Perl Editor
       • Installed LWP module
       • Installed DBI
       • Installed DBD::MySQL
       • Installed SVMTagger
       • Installed GUTime
     • Loaded TREC 2004 files on HD
     • Examined input and output file formats

  8. Development Tools/Environment
     • Perl Programming Env.
       • DzSoft Perl Editor
       • ActivePerl 5.8.1
       • DBI module
       • DBD::MySQL module
       • LWP module
       • SVMTagger module
       • GUTime module
     • Operating Environment
       • Dell Latitude D610
       • Pentium M 2.13 GHz
       • 1 GB RAM
       • 60 GB HD (NTFS)
       • Windows XP Pro SP2
       • MySQL 5.0

  9. TREC 2004
     • Used as a document corpus to judge IR systems
       • Set of standard and “track”-specific corpora
       • Set of known information needs (topics)
       • Known relevance judgments per topic
     • Four types of documents
       • Foreign Broadcast Information Service
       • Federal Register
       • LA Times
       • Financial Times
     • Why TREC?
       • Foundation for future work – Judge the effects of spatial and temporal references on relevance judgments by an IR system
       • Example: Examine query operations – expansion to spatial and temporal references, term-to-spatial or term-to-temporal thesaurus, etc.

  10. TREC 2004 Document Structure
     • Analyzed TREC standard corpus document structure
       • Needed for DB design
     • Internally tagged using SGML
     • Different tag structure per document type, except for:
       • <DOC></DOC>
       • <DOCNO>: unique per document
     • Hard to extract the date, the doc title, or the subject
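The two tags shared by all document types are enough to split a TREC file into individual documents. The following is a minimal sketch of that step, written in Python purely for illustration (the prototype itself is Perl). The function name `split_trec_file` and the sample document numbers are hypothetical, and the regular expressions assume well-formed `<DOC>` and `<DOCNO>` pairs.

```python
import re

# Assumed shape of a TREC file: documents delimited by <DOC>...</DOC>,
# each carrying one <DOCNO>...</DOCNO> identifier.
DOC_RE = re.compile(r"<DOC>(.*?)</DOC>", re.DOTALL)
DOCNO_RE = re.compile(r"<DOCNO>\s*(.*?)\s*</DOCNO>", re.DOTALL)


def split_trec_file(text):
    """Return a list of (docno, body) pairs found in one TREC file's text."""
    docs = []
    for match in DOC_RE.finditer(text):
        body = match.group(1).strip()
        docno = DOCNO_RE.search(body)
        # Skip blocks without a DOCNO rather than guessing an identifier.
        if docno:
            docs.append((docno.group(1), body))
    return docs


# Hypothetical sample input; real TREC files carry many more tags per type.
sample = """<DOC>
<DOCNO> FT911-1 </DOCNO>
<TEXT>First document.</TEXT>
</DOC>
<DOC>
<DOCNO> LA010189-0001 </DOCNO>
<TEXT>Second document.</TEXT>
</DOC>"""

docs = split_trec_file(sample)
```

Fields such as date or title would still need per-type handling, since their tags differ between the four document sources.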

  11. MySQL Database
     • Characteristics
       • Open Source
       • Large user base
     • Why a Database
       • Store docs already extracted from files
       • Look to combine IR and Database capabilities
     • Why MySQL
       • Strong support for Perl programming
       • Small admin requirement
       • Already provides full-text and spatial index and search
       • Pluggable storage engine architecture
       • All elements modifiable for future work

  12. MySQL Database - Coding
     • Used both the Perl DBI and DBD::MySQL (DataBase Driver) modules
     • Installed via Perl Package Manager (PPM)
     • Code Example: Establish DB connection

         use strict;
         use warnings;
         use DBI;

         # Data source name: MySQL driver, database "gtirstore" on the local host
         my $datasource = "DBI:mysql:database=gtirstore;host=localhost";
         my $user       = "devillie";
         # Single quotes keep the trailing "$" literal (no variable interpolation)
         my $password   = 'demo123$';

         # RaiseError makes DBI die on errors instead of returning undef
         my $dbh = DBI->connect($datasource, $user, $password, { RaiseError => 1 });

         # Application processing code

         $dbh->disconnect();

  13. MySQL Database – Table Definitions

         CREATE TABLE `gtirstore`.`DOCTEXT` (
           `NUM` INTEGER UNSIGNED NOT NULL AUTO_INCREMENT,
           `DOCFULLTEXT` MEDIUMTEXT NOT NULL,
           `DOCSTEMTEXT` MEDIUMTEXT,
           PRIMARY KEY(`NUM`),
           FULLTEXT `Index_Text`(`DOCFULLTEXT`)
         ) ENGINE = MYISAM CHARACTER SET latin1 COLLATE latin1_general_ci;

         CREATE TABLE `gtirstore`.`DOCGEOREF` (
           `TAGNO` INTEGER UNSIGNED NOT NULL,
           `NUM` INTEGER UNSIGNED NOT NULL,
           `LOCATION` POINT NOT NULL,
           `CONFIDENSE` FLOAT NOT NULL,
           PRIMARY KEY(`TAGNO`, `NUM`),
           SPATIAL `Geo_Index` USING RTREE(`LOCATION`)
         ) ENGINE = MYISAM CHARACTER SET latin1 COLLATE latin1_general_ci;

         CREATE TABLE `gtirstore`.`DOCINFO` (
           `NUM` INTEGER UNSIGNED NOT NULL AUTO_INCREMENT,
           `SOURCE` VARCHAR(255) NOT NULL,
           `NUMWORDS` INTEGER UNSIGNED,
           `NUMSTEMS` INTEGER UNSIGNED,
           `DOCNO` VARCHAR(45) NOT NULL,
           `DOCDATE` DATE NOT NULL,
           `DOCTYPE` VARCHAR(45) NOT NULL,
           `DOCTITLE` VARCHAR(128) NOT NULL,
           `DOCSIZE` INTEGER UNSIGNED NOT NULL,
           PRIMARY KEY(`NUM`)
         ) ENGINE = MYISAM

     • Tables created using the MySQL Administrator
     • Must use MyISAM storage engine
       • Only engine to support full-text and spatial R-Tree indexing
       • Does not support row-level locking or referential integrity checks
     [Screenshots: DOCGEOREF, DOCTEXT, and DOCINFO tables in MySQL Administrator]
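To illustrate how the full-text and spatial indexes could be used together, here is a hedged Python sketch that only builds a query string and its parameters; it does not connect to a database. `MATCH ... AGAINST`, `MBRContains`, and `GeomFromText` are MySQL 5.0 functions, but the join on `NUM`, the longitude-as-x ordering of `LOCATION`, and the helper names are assumptions not stated on the slides.

```python
def bbox_to_wkt(min_lon, min_lat, max_lon, max_lat):
    """Build a closed WKT polygon ring for a bounding box (x = longitude, y = latitude)."""
    corners = [
        (min_lon, min_lat),
        (max_lon, min_lat),
        (max_lon, max_lat),
        (min_lon, max_lat),
        (min_lon, min_lat),  # repeat the first corner to close the ring
    ]
    ring = ",".join(f"{x} {y}" for x, y in corners)
    return f"POLYGON(({ring}))"


def build_term_spatial_query(terms, bbox):
    """Return (sql, params) for documents matching terms inside a bounding box.

    Assumption: DOCGEOREF.NUM refers to DOCTEXT.NUM, and LOCATION points
    were stored as POINT(longitude latitude).
    """
    sql = (
        "SELECT DISTINCT t.NUM "
        "FROM DOCTEXT t JOIN DOCGEOREF g ON g.NUM = t.NUM "
        "WHERE MATCH(t.DOCFULLTEXT) AGAINST (%s) "
        "AND MBRContains(GeomFromText(%s), g.LOCATION)"
    )
    return sql, (terms, bbox_to_wkt(*bbox))


# Hypothetical query: documents mentioning "macedonia" with a geo reference
# inside a box of roughly 20-23 degrees east, 40-43 degrees north.
sql, params = build_term_spatial_query("macedonia", (20, 40, 23, 43))
```

The `%s` placeholders follow the parameter style used by common Python MySQL drivers; passing the values separately avoids building SQL from raw user input.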

  14. MetaCarta Geo-Parsing/Geo-Coding
     • MetaCarta provides a geo-parsing and geo-coding capability via HTTP GET/POST, and returns XML output of the highest-probability matches for all geographic references in the input.
     • How is it different from other geo-coders, like Google API?
       • Works at the document level
       • NLP – does not require a set format
       • Does its own disambiguation
         • Stratford, CA or Stratford, England?

  15. MetaCarta GeoTag

         <GeoTag TagType="GEO" Confidence="0.758494">
           <TextExtent AnchorStart="379" AnchorEnd="425" Anchor="the Former Yugoslav Republic of Macedonia">
             <TextComponent AnchorStart="379" AnchorEnd="425" Anchor="the Former Yugoslav Republic of Macedonia"/>
           </TextExtent>
           <Disjunct Weight="1.000000" FeatureType="DOT" GazetteerID="MetaCarta Gazetteer v3.7.0">
             <Conjunct Country="MK" CountryConfidence="0.758494" Class="A" Type="PCLI">
               <Dot Latitude="41.9603" Longitude="21.6214" DotWeight="1.0"/>
             </Conjunct>
           </Disjunct>
         </GeoTag>

     • One GeoTag per spatial reference.
     • Zero to many GeoTags per document.
     • Interested in storing the highlighted areas of the GeoTag in the MySQL database
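A GeoTag like the one above can be reduced to the few values a table such as DOCGEOREF needs. The sketch below uses Python's standard `xml.etree.ElementTree` for illustration only. Which attributes were highlighted on the original slide is not recoverable from the transcript, so the choice of confidence, anchor text, and coordinates is an assumption, as is the function name `parse_geotag`.

```python
import xml.etree.ElementTree as ET

# The GeoTag from the slide, reproduced as test input.
GEOTAG_XML = """<GeoTag TagType="GEO" Confidence="0.758494">
  <TextExtent AnchorStart="379" AnchorEnd="425" Anchor="the Former Yugoslav Republic of Macedonia">
    <TextComponent AnchorStart="379" AnchorEnd="425" Anchor="the Former Yugoslav Republic of Macedonia"/>
  </TextExtent>
  <Disjunct Weight="1.000000" FeatureType="DOT" GazetteerID="MetaCarta Gazetteer v3.7.0">
    <Conjunct Country="MK" CountryConfidence="0.758494" Class="A" Type="PCLI">
      <Dot Latitude="41.9603" Longitude="21.6214" DotWeight="1.0"/>
    </Conjunct>
  </Disjunct>
</GeoTag>"""


def parse_geotag(xml_text):
    """Extract confidence, anchor text, and the first Dot coordinate from one GeoTag."""
    tag = ET.fromstring(xml_text)
    extent = tag.find("TextExtent")
    # Assumption: the first Disjunct/Conjunct/Dot is the best match to keep.
    dot = tag.find("./Disjunct/Conjunct/Dot")
    return {
        "confidence": float(tag.get("Confidence")),
        "anchor": extent.get("Anchor"),
        "latitude": float(dot.get("Latitude")),
        "longitude": float(dot.get("Longitude")),
    }


record = parse_geotag(GEOTAG_XML)
```

A full response would wrap many GeoTags in an outer document; iterating over them with `findall` would follow the same pattern.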

  16. TREC 2004 Temporal Tagging
     • Intent: Find document temporal references and anchor them to a specific time representation, akin to geo-parsing.
     • Temporal tagging requires a two-step NLP process
       • Part-of-Speech (POS) tagging
       • Temporal tagging based on POS-tagged input
     • Tools were chosen for both based on availability of source code and programming language (Perl)
     • Each type of tool has its own type of corpus for training.
     • The accuracy of the tools is taken “as-is”
     [Diagram labels: GTIRProc – process by file; parse doc into tokenized sentences. SVM-Tagger – output tagged sentence; collect sentences into a doc. GUTime – temporal tag a doc with separated sentences and POS-tagged words. Parse tagged doc for temporal tags; store tags as text in MySQL 5.0 (MyISAM Engine).]
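The last stage of this pipeline, parsing the tagged document for temporal tags, might look like the following Python sketch. It assumes the tagger emits TIMEX2-style elements with a `VAL` attribute wrapped around `<lex>` tokens; that output format, the sample sentence, and the function name are assumptions for illustration, not taken from the slides.

```python
import re

# Assumed output format: <TIMEX2 VAL="...">...</TIMEX2> around POS-tagged tokens.
TIMEX_RE = re.compile(r"<TIMEX2\b([^>]*)>(.*?)</TIMEX2>", re.DOTALL | re.IGNORECASE)
VAL_RE = re.compile(r'\bVAL="([^"]*)"', re.IGNORECASE)
TAG_RE = re.compile(r"<[^>]+>")


def extract_temporal_tags(tagged_text):
    """Return (phrase, normalized_value) pairs for each temporal expression."""
    results = []
    for attrs, inner in TIMEX_RE.findall(tagged_text):
        val = VAL_RE.search(attrs)
        # Strip nested markup such as <lex> and collapse whitespace.
        phrase = " ".join(TAG_RE.sub(" ", inner).split())
        results.append((phrase, val.group(1) if val else None))
    return results


# Hypothetical tagged sentence.
sample = (
    '<s><lex pos="NNS">Talks</lex> <lex pos="VBD">resumed</lex> '
    '<TIMEX2 VAL="2006-12-07"><lex pos="NNP">December</lex> '
    '<lex pos="CD">7</lex></TIMEX2></s>'
)

temporal_refs = extract_temporal_tags(sample)
```

Each pair could then be stored as text alongside the document number, in the same way the geo references are stored.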

  17. Challenges
     • MetaCarta Geo-Parsing/Geo-Coding Access
       • Normally restricted to 100 calls/day/IP address
       • Required manual transfer (email) with MetaCarta
       • Reduced sample size (35.0 MB from 1.91 GB)
       • Restructured code to do batch processing
     • Issues with parsing TREC and MetaCarta files
       • Perl XML::Parser did not work! Syntax error – line 187
       • Required custom parsing code
     • POS and Temporal Tagging Tools
       • Hard to integrate together
       • Non-standard input/output requirements between tools
         • <lex pos=NN> word </lex> vs word PP
         • Token vs sentence vs document input
       • Lack of documentation
       • Perl is hard to read
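The format mismatch between the two tagging tools is the kind of gap the glue code has to bridge. As a rough Python sketch, assuming the POS tagger emits one `word TAG` pair per line and the temporal tagger expects `<lex pos="TAG">word</lex>` elements inside a sentence wrapper, a converter could look like this. Both formats, the `<s>` wrapper, and the function name are assumptions for illustration.

```python
from xml.sax.saxutils import escape, quoteattr


def tokens_to_lex(tagged_lines):
    """Convert 'word TAG' lines into one <s>...</s> sentence of <lex> elements."""
    parts = []
    for line in tagged_lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        word, pos = fields
        # Escape markup characters so words like "AT&T" stay well-formed.
        parts.append(f"<lex pos={quoteattr(pos)}>{escape(word)}</lex>")
    return "<s>" + " ".join(parts) + "</s>"


# Hypothetical tagger output for one sentence.
sentence = tokens_to_lex(["Talks NNS", "", "resumed VBD"])
```

Escaping at this step also sidesteps one class of XML parse errors when the result is read back later.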

  18. Contributions
     • Created an initial design using temporal NLP tools as well as full-text indexing and spatial geo-parsing/geo-coding for IR
     • Foundation for further research in combined term, spatial and temporal IR
       • Open source tools
       • TREC 2004 corpus indexed for spatial and temporal references
       • MySQL is a vehicle for widespread use of additional capability from future work
     • Proposed several areas for continued work in adding temporal support to the GIR field

  19. Future Work
     • Near-term
       • Complete the indexing of all TREC 2004 documents
       • Conduct and expand analysis to other corpora
         • E.g. TREC Web corpus – 30 GB
       • User Interface design
       • Port to higher-performing environment than Perl (Java, C++)
     • Mid-term
       • Relevance determination using both spatial and temporal references
       • Combined indexing for term, spatial and temporal references
       • Integrated spatial-temporal NLP analysis
       • Geoparsers/coders with a temporal dimension

  20. Future Work (Continued)
     • Technologies to watch
       • MySQL and other database incorporation of temporal and IR capabilities
         • Full-Text indexing – including stopword and stemming processes
         • TQL2 – Temporal Query Language
       • NLP temporal tagging and anchoring
       • Geo-parsing/Geo-coding

  21. Thanks
     Thanks to John Frank, Chief Technology Officer (CTO) and Co-Founder of MetaCarta, for help in processing the TREC 2004 files for this project, and for his technical guidance in using the MetaCarta Geoparsing Service.
