1 / 32

Spreadsheets: The Good Parts

Managing Spreadsheets Michael Cafarella Zhe Shirley Chen, Jun Chen, Junfeng Zhang, Dan Prevo University of Michigan New England Database Summit February 1, 2013. Spreadsheets: The Good Parts. A “ Swiss Army Knife ” for data: storing, sharing, transforming Sophisticated users who are not DBAs

akira
Download Presentation

Spreadsheets: The Good Parts

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Managing SpreadsheetsMichael CafarellaZhe Shirley Chen, Jun Chen, Junfeng Zhang, Dan PrevoUniversity of MichiganNew England Database SummitFebruary 1, 2013

  2. Spreadsheets: The Good Parts • A “Swiss Army Knife” for data: storing, sharing, transforming • Sophisticated users who are not DBAs • Contain lots of data, found nowhere else • Everyone uses them; almost wholly ignored by DB community • Thanks, Jeremy!

  3. Spreadsheets: The Awful Parts • Users toss in data, worry about schemas later (well, never) • Spreadsheets designed for humans, not query processors • No explicit schemas: • Poor data integrity(Zeeberget al, 2004) • Integration very hard • Tumor suppresor gene Deleted In Esophogeal Cancer 1 • aka, DEC1 • aka, (according to Excel) 01-DEC

  4. Spreadsheets: The Awful Parts • Users toss in data, worry about schemas later (well, never) • Spreadsheets designed for humans, not query processors • No explicit schemas: • Poor data integrity(Zeeberget al, 2004) • Integration very hard

  5. A Data Tragedy • Spreadsheets build, then entomb, our best, most expensive, data • >400,000 just from ClueWeb09 • From gov’ts, WTO, many other sources • How many inside firewall? • Application vision: Ad-hoc integration & analysis for any dataset • Challenge: recover relations from any spreadsheet, w/little human effort

  6. Closeup • One hierarchy error yields many bad tuples • Too many datasets to process manually Desired tuple:

  7. Agenda • Spreadsheets: An Overview • Extracting Data • Hierarchy Extraction • Manual Repairs • Experimental Results • Demo • Related and Future Work

  8. Agenda • Spreadsheets: An Overview • Extracting Data • Hierarchy Extraction • Manual Repairs • Experimental Results • Demo • Related and Future Work

  9. Extracting Tuples • Extract frame, attribute hierarchy trees • Map values to attributes; create tuples • Apply manual repairs, repeat • How many repairs for 100% accuracy? • Yields tuples, not relations • We won’t discuss: relation assembly

  10. 1. Frame Detection • Key assumption: inputs are data frames • Locate metadata in top/left regions • Locate data in center block

  11. Closeup

  12. 1. Frame Detection • Key assumption: inputs are data frames • Locate metadata in top/left regions • Locate data in center block • ~72% of spreadsheets fit; others not relational • Each non-empty row labeled one of TITLE, HEADER, DATA, FOOTNOTE • Reconstruct regions from labels • Infer labels with linear-chain Conditional Random Field (Lafferty et al, 2001) • Layout features: has bold cell? Merged cell? • Text features: contains ‘table’, ‘total’? Indented text? Numeric cells? Year cells?

  13. 2. Hierarchy Extraction

  14. Closeup

  15. 2. Hierarchy Extraction • One task for TOP, one for LEFT • Create boolean random var for each candidate parent relationship • Build conditional random field to obtain best variable assignment

  16. 2. Hierarchy Extraction

  17. 2. Hierarchy Extraction • CRFs use potential functions to incorporate features • Node potentials represent single parent/child match • Share style? Near each other? WS-separated? • Edge potentials tie pairs of parent/child decisions • Share style pairs? Share text? Indented similiarly? • Spreadsheet potentials ensure a legal tree • One-parent potential: -∞ weight for multiple parents • Directional potential: -∞ weight when parent edges go in opposite directions • Run Loopy Belief Propagation for node + edge; post-inference test and repair for spreadsheet • Real sheets yielded 1K-8K variables; inference <0.13 sec • Approach adapted from (Pimplikar, Sarwagi, 2012)

  18. 3. Manual Repair • User reviews, repairs extraction • Goal: reduce user burden • Extractor makes repeated mistakes, either within spreadsheet or within corpus • Headache for user to repeat fixes • Our sol’n: after each repair, add repair potentials to CRF • Links user-repaired nodes to a set of nodes throughout CRF • Incorporates info on node similarity • Edges are generated heuristically • After each repair, re-run inference

  19. Agenda • Spreadsheets: An Overview • Extracting Data • Hierarchy Extraction • Manual Repairs • Experimental Results • Demo • Related and Future Work

  20. Experiments • General survey of spreadsheet use • Evaluate: • Standalone extraction accuracy • Manual repair effectiveness • Test sets: • SAUS: 1,322 files from 2010 Statistical Abstract of the United States • WEB: 410,554 files from 51,252 domains, crawled from ClueWeb09

  21. Spreadsheets in the Wild • Very common for Web-published gov’t data

  22. Spreadsheets in the Wild

  23. Standalone Extraction • 100 random H-Sheets from SAUS, WEB • Three metrics • Pairs: parent/child pairs labeled correctly (F1) • Tuples: relational tuples labeled correctly (F1) • Sheets: % of sheets labeled 100% correctly • Two methods • Baseline uses just formatting, position • Hierarchy uses our approach

  24. Standalone Extraction

  25. Manual Repair: Effectiveness • Gather 10 topic areas from SAUS, WEB • Expert provides ground-truth hierarchies • Extract; repeatedly repair and recompute

  26. Manual Repair: Ordering • Good ordering: errors steadily decrease • Bad: extended periods of slow decrease

  27. End-To-End Extraction • What is overall utility of our extractor? • Final metric: Correct tuples per manual repair

  28. Agenda • Spreadsheets: An Overview • Extracting Data • Hierarchy Extraction • Manual Repairs • Experimental Results • Demo • Related and Future Work

  29. Demo Details • Ran SAUS corpus through extractor • Simple ad hoc integration analysis tool on top of extracted data • Early version of relation reconstruction • Early version of data ranking, join finding

  30. Related Work • Spreadsheet as interface(Witkowski et al, 2003), (Liu et al, 2009) • Spreadsheet extraction • User-provided rules(Ahmad et al, 2003), (Hung et al, 2011) • No explicit user rules (Abraham and Erwig, 2007), (Cunha et al, 2009) • Ad hoc integration for found data (Cafarella et al, 2009), (Pimplikar and Sarawagi, 2012), (Yakout et al, 2012) • Semi-automatic data programming • Wrangler (Guo, et al, 2011)

  31. Conclusions and Future Work • Spreadsheet extraction opens new datasets • Manual repair ensures accuracy, low user burden • Ongoing and Future Work • Relation assembly • Data relevance ranking • Join finding

More Related