A Relational Approach to Incrementally Extracting and Querying Structure in Unstructured Data

A Relational Approach to Incrementally Extracting and Querying Structure in Unstructured Data Eric Chu, Akanksha Baid, Ting Chen, AnHai Doan, Jeffrey Naughton Computer Sciences Department, University of Wisconsin-Madison 33rd International Conference on Very Large Data Bases, September 23-27 2007, University of Vienna, Austria 2008. 01. 18. Summarized by Chulki Lee, IDS Lab., Seoul National University Presented by Chulki Lee, IDS Lab., Seoul National University

I want to know how cold Madison gets during winter

After searching with keywordand browsing… I found a wiki page, “Madison, Wisconsin”

There is a temperate table in the page… but how can I query over this?

In fact… it’s the structure implicit in unstructured documents!

Goal: Exploit the Structure! • Improve keyword search and browsing • Structured queries • “Which universities are in places that are very cold in winter?” • Queries that keyword search and browsing are not good at answering

Challenges Manage this evolving structure incrementally How to enable users to query using as much structure as is currently known?

Proposal A Relational Workbench A way to store an expanding set of documents and attributes Tools to incrementally process the data A way to exploit structure in queries

Advantages Data always available for querying Supports incremental data processing Can pose increasingly sophisticated queries over time Exploit strength of a RDBMS

Relational Workbench Data Storage Data Processing Case Study: Wikipedia

Data Storage – Wide Table • Problems: keeps evolving, heterogeneous… • No good schema! • So.. use single table! • One document per row • New attribute? create new column in the table! • No overhead for storing nulls with interpreted storage or column-oriented database

Data Storage – Wide Table (more) Allow attributes to have internal structure

Data Storage – Mapping Table • Store mappings for different attributes that correspond to the same real-world concept • Query evaluation: • Look up mapping table • Rewrite query to include matching attributes

Data Processing • Workbench does not decide how to process data • Provides three basic operators: • Extract • “address” => “city”, “state”, “zip code” • Integrate • “address” = “sent-to” • Cluster • DBAs decide what operators to use and set the parameters

Data Processing (more) Operator Interaction Example Improving clustering via iteration • Extract section names from Wikipedia pages • the page without a section name cannot be clustered properly! • Cluster pages based on section names • Extract and cluster pages based on the other attributes

Case Study: Wikipedia (1/2) Dataset: American cities, universities, male tennis players • Stage 1: Initial Loading • 5 columns: PageId, PageText, RevisionId, ContributorName, LastModificationDate • Can do keyword search over PageText immediately! • Stage 2: Extracting Sections • 1,253 new columns - Explosion of new columns! • Integrate found many aliases – 350 of all attributes belonged to 1 of 14 attribute groups

Case Study: Wikipedia (2/2) • Stage 3: Attribute Clustering • Grouped together attributes that were either both null or both non-null in a row • Views on clusters • ~25 ms for each cluster • 44s sec for wide table • Stage 4: Extracting Wiki Tables • we can extract temperature_wiki!

Now we can query! SELECT AVG(Low_F) FROM temperature_wiki as T WHERE T.city = ‘Madison Wisconsin’ AND Month = 1 OR Month = 2 OR Month = 12;

Conclusion • Relational workbench to incrementally extract and query structure from unstructured data • Wide table • Mapping table • Operators Many problems ahead!

A Relational Approach to Incrementally Extracting and Querying Structure in Unstructured Data