1 / 19

A Relational Approach to Incrementally Extracting and Querying Structure in Unstructured Data

A Relational Approach to Incrementally Extracting and Querying Structure in Unstructured Data. Eric Chu, Akanksha Baid, Ting Chen, AnHai Doan, Jeffrey Naughton Computer Sciences Department, University of Wisconsin-Madison

thom
Download Presentation

A Relational Approach to Incrementally Extracting and Querying Structure in Unstructured Data

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. A Relational Approach to Incrementally Extracting and Querying Structure in Unstructured Data Eric Chu, Akanksha Baid, Ting Chen, AnHai Doan, Jeffrey Naughton Computer Sciences Department, University of Wisconsin-Madison 33rd International Conference on Very Large Data Bases, September 23-27 2007, University of Vienna, Austria 2008. 01. 18. Summarized by Chulki Lee, IDS Lab., Seoul National University Presented by Chulki Lee, IDS Lab., Seoul National University

  2. I want to know how cold Madison gets during winter

  3. After searching with keywordand browsing… I found a wiki page, “Madison, Wisconsin”

  4. There is a temperate table in the page… but how can I query over this?

  5. In fact… it’s the structure implicit in unstructured documents!

  6. Goal: Exploit the Structure! • Improve keyword search and browsing • Structured queries • “Which universities are in places that are very cold in winter?” • Queries that keyword search and browsing are not good at answering

  7. Challenges Manage this evolving structure incrementally How to enable users to query using as much structure as is currently known?

  8. Proposal A Relational Workbench A way to store an expanding set of documents and attributes Tools to incrementally process the data A way to exploit structure in queries

  9. Advantages Data always available for querying Supports incremental data processing Can pose increasingly sophisticated queries over time Exploit strength of a RDBMS

  10. Relational Workbench Data Storage Data Processing Case Study: Wikipedia

  11. Data Storage – Wide Table • Problems: keeps evolving, heterogeneous… • No good schema! • So.. use single table! • One document per row • New attribute? create new column in the table! • No overhead for storing nulls with interpreted storage or column-oriented database

  12. Data Storage – Wide Table (more) Allow attributes to have internal structure

  13. Data Storage – Mapping Table • Store mappings for different attributes that correspond to the same real-world concept • Query evaluation: • Look up mapping table • Rewrite query to include matching attributes

  14. Data Processing • Workbench does not decide how to process data • Provides three basic operators: • Extract • “address” => “city”, “state”, “zip code” • Integrate • “address” = “sent-to” • Cluster • DBAs decide what operators to use and set the parameters

  15. Data Processing (more) Operator Interaction Example Improving clustering via iteration • Extract section names from Wikipedia pages • the page without a section name cannot be clustered properly! • Cluster pages based on section names • Extract and cluster pages based on the other attributes

  16. Case Study: Wikipedia (1/2) Dataset: American cities, universities, male tennis players • Stage 1: Initial Loading • 5 columns: PageId, PageText, RevisionId, ContributorName, LastModificationDate • Can do keyword search over PageText immediately! • Stage 2: Extracting Sections • 1,253 new columns - Explosion of new columns! • Integrate found many aliases – 350 of all attributes belonged to 1 of 14 attribute groups

  17. Case Study: Wikipedia (2/2) • Stage 3: Attribute Clustering • Grouped together attributes that were either both null or both non-null in a row • Views on clusters • ~25 ms for each cluster • 44s sec for wide table • Stage 4: Extracting Wiki Tables • we can extract temperature_wiki!

  18. Now we can query! SELECT AVG(Low_F) FROM temperature_wiki as T WHERE T.city = ‘Madison Wisconsin’ AND Month = 1 OR Month = 2 OR Month = 12;

  19. Conclusion • Relational workbench to incrementally extract and query structure from unstructured data • Wide table • Mapping table • Operators Many problems ahead!

More Related