Querying the Web for Genealogical Information

Troy Walker
Spring Research Conference 2003
Research funded by NSF

Genealogical Information on the Web
• Hundreds of thousands of sites
  • Some professional (Ancestry.com, Familysearch.org)
  • Mostly hobbyist (Cyndislist.com)
• Search engines
  • “Walker genealogy” on Google: 199,000 results
  • 1 page/minute = 5 months to go through
• Why not enlist the help of a computer?
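As a quick check of the reading-time estimate on this slide, here is a minimal sketch of the arithmetic. The result count and reading rate come from the slide. The round-the-clock reading and the 30-day month are assumptions made only for this illustration.

```python
RESULTS = 199_000        # Google hits for "Walker genealogy" (from the slide)
MINUTES_PER_PAGE = 1     # reading rate (from the slide)


def reading_time_days(results: int, minutes_per_page: float) -> float:
    """Days of nonstop reading needed to look at every result."""
    return results * minutes_per_page / (60 * 24)


days = reading_time_days(RESULTS, MINUTES_PER_PAGE)
months = days / 30       # assumption: 30-day months
print(f"{days:.0f} days, about {months:.1f} months")
```

Under these assumptions the total comes out a little under five months, which agrees with the slide's rounded figure.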

Problems
• No standard way of presenting data
  • Text formatted with HTML tags
  • Tables
  • Forms to access information
• Each site has its own idea of what genealogical information is (differing schemas)

Proposed solution
• Based on Ontos and other work done at the BYU Data Extraction Group
• Able to extract from:
  • Semi-structured or unstructured text
  • Tables
  • Forms
• Scalable and robust to changes in pages
• Built for genealogy but easily adaptable to other domains

Text

Tables

Forms

System Overview
[Diagram. Components: User Query, URL Database, Document Retriever, Document Structure Recognizer, Unstructured or Semi-Structured Text Engine, Table Engine, Form Engine, Data Extraction Engine, Mapping Information, Result Filter. Legend: to be implemented, to be improved, to be integrated.]

User Query
• Form generated from ontology
• Query by example
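One way to picture a form generated from an ontology and then queried by example is sketched below. Everything here is hypothetical: the `ObjectSet` class, the four-field mini-ontology, and both function names are assumptions for illustration, not the system's actual ontology or code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ObjectSet:
    """One concept in the ontology (hypothetical, heavily simplified)."""
    name: str
    value_type: str = "text"


# Assumed mini-ontology; the real genealogy ontology is not shown in the slides.
GENEALOGY_ONTOLOGY = [
    ObjectSet("Name"),
    ObjectSet("BirthDate", "date"),
    ObjectSet("BirthPlace"),
    ObjectSet("DeathDate", "date"),
]


def build_query_form(ontology):
    """Generate one blank form field per ontology object set."""
    return [{"field": o.name, "type": o.value_type, "value": ""} for o in ontology]


def query_from_example(form, example):
    """Turn a partially filled-in example into a query, dropping blank fields."""
    known = {f["field"] for f in form}
    unknown = set(example) - known
    if unknown:
        raise KeyError(f"not in ontology: {sorted(unknown)}")
    return {field: value for field, value in example.items() if value}


form = build_query_form(GENEALOGY_ONTOLOGY)
query = query_from_example(form, {"Name": "Walker", "BirthPlace": ""})
```

Because the form is derived from the ontology, swapping in a different ontology changes the form without touching the query code.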

URL Database and Document Retriever
• Contains genealogy URLs
• Searching every URL would take too much time
• Filter down to the likely URLs
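The filtering step could look something like the sketch below. The slides do not say how likely URLs are chosen, so the keyword-overlap scoring, the `likely_urls` function, and the example.org entries are all assumptions.

```python
# Hypothetical URL database: each genealogy URL is stored with index keywords.
URL_DATABASE = {
    "http://example.org/walker-family": {"walker", "utah"},
    "http://example.org/smith-records": {"smith", "ohio"},
    "http://example.org/utah-pioneers": {"utah", "pioneer"},
}


def likely_urls(query_terms, url_db, min_overlap=1):
    """Return URLs sharing at least min_overlap keywords with the query,
    best matches first (ties broken alphabetically)."""
    terms = {t.lower() for t in query_terms}
    scored = [(len(terms & keywords), url) for url, keywords in url_db.items()]
    ranked = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [url for score, url in ranked if score >= min_overlap]


candidates = likely_urls(["Walker", "Utah"], URL_DATABASE)
```

Only the returned candidates would then be fetched by the document retriever, which is what keeps the search from visiting every stored URL.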

Method Selector
• Analyze page
• Select appropriate method
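A rough sketch of how a page might be analyzed to pick a method, using Python's standard `html.parser`. The tag-counting heuristic, the `min_rows` threshold, and the names `TagCounter` and `select_method` are assumptions for illustration. The slides do not describe the actual selection criteria.

```python
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Counts the start tags that hint at a page's structure."""

    def __init__(self):
        super().__init__()
        self.counts = {"form": 0, "tr": 0}

    def handle_starttag(self, tag, attrs):
        if tag in self.counts:
            self.counts[tag] += 1


def select_method(html: str, min_rows: int = 3) -> str:
    """Pick 'form', 'table', or 'text' for a page (assumed heuristic)."""
    counter = TagCounter()
    counter.feed(html)
    counter.close()
    if counter.counts["form"] > 0:
        return "form"
    if counter.counts["tr"] >= min_rows:   # enough rows to look like data
        return "table"
    return "text"


form_page = "<form action='/search'><input name='surname'></form>"
table_page = "<table>" + "<tr><td>Walker</td><td>1850</td></tr>" * 3 + "</table>"
text_page = "<p>John Walker was born in 1850.</p>"
```

Here `select_method(form_page)` yields `"form"`, `select_method(table_page)` yields `"table"`, and `select_method(text_page)` falls through to `"text"`.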

Preprocessing Engines
• Text
  • Improved record-separation
  • Ability to handle single-record pages
• Table
• Forms

Extraction Engine
• Ontos
• Cache schema matches
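Caching schema matches might be sketched as follows. This is not Ontos code: the `SchemaMatchCache` class, the synonym table standing in for the real matcher, and the choice of cache key (site host plus column headers) are all assumptions.

```python
from urllib.parse import urlparse

# Hypothetical stand-in for a real schema matcher: header text -> ontology field.
SYNONYMS = {
    "name": "Name",
    "born": "BirthDate",
    "birth date": "BirthDate",
    "died": "DeathDate",
}


class SchemaMatchCache:
    """Remembers header-to-ontology mappings so each site layout is matched once."""

    def __init__(self):
        self._cache = {}
        self.misses = 0   # how many times the matcher actually ran

    def match(self, url, headers):
        key = (urlparse(url).netloc, tuple(headers))
        if key not in self._cache:
            self.misses += 1
            # Unrecognized headers map to None.
            self._cache[key] = {h: SYNONYMS.get(h.strip().lower()) for h in headers}
        return self._cache[key]


cache = SchemaMatchCache()
first = cache.match("http://example.org/page1", ["Name", "Born"])
second = cache.match("http://example.org/page2", ["Name", "Born"])
```

The second call reuses the mapping computed for the first page, since both pages come from the same host with the same headers.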

Result Filter
• Filters objects relevant to query
• Presents to user
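A minimal sketch of the filtering step, assuming extracted objects arrive as dictionaries keyed by ontology field and that relevance means a case-insensitive substring match on every field in the query. Both assumptions, the sample records, and the function names are illustrative only.

```python
def matches(record, query):
    """True if every query field appears (case-insensitively) in the record."""
    return all(
        wanted.lower() in record.get(field, "").lower()
        for field, wanted in query.items()
    )


def filter_results(records, query):
    """Keep only the extracted records relevant to the query."""
    return [r for r in records if matches(r, query)]


# Made-up extracted records for illustration.
records = [
    {"Name": "John Walker", "BirthPlace": "Provo, Utah"},
    {"Name": "Mary Smith", "BirthPlace": "Ohio"},
    {"Name": "Ann Walker"},
]
walkers = filter_results(records, {"Name": "walker"})
utah_walkers = filter_results(records, {"Name": "walker", "BirthPlace": "utah"})
```

A record missing a queried field is treated as a non-match, so `utah_walkers` keeps only the first record while `walkers` keeps the first and third.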

Conclusion
• Integrates and builds on previous DEG work
• Extracts from:
  • Semi-structured or unstructured text
  • Tables
  • Forms
• Scalable: only searches probable pages
• Robust to changes in pages
• Ontology-based: easily adapted to other domains
