Data Integration for the Relational Web • Michael J. Cafarella, Alon Halevy, Nodira Khoussainova (work done while at Google, Inc.) • Presenter: Michael J. Cafarella, University of Michigan • VLDB, August 27, 2009
Web Challenge • Try to create a database of all “VLDB program committee members”
Data Integration for Web • Can we combine tables to create new data sources? • Existing mashup and data integration tools ignore the realities of Web data: • A lot of useful data is not in XML • Users cannot know all sources in advance • Integrations are transient • Data semantics are semi-tied to the source page
Octopus
• Our system uses data from:
  • WebTables [WebDB08, “Uncovering…”, Cafarella et al] [VLDB08, “WebTables: Exploring…”, Cafarella et al]
  • Harvesting Relational Tables from Lists [VLDB09, “Harvesting Relational Tables from Lists…”, Elmeleegy et al]
• [Figure: pipeline with stages Crawl Web, Extract Tables, Integrate Tables, Obtain Database; “Octopus” label]
• Our test system has over 200M source tables
• Lots of table/list-extraction work, e.g.,
  • [VLDB09, “Answering Table Augmentation…”, Gupta & Sarawagi]
  • [JAIR08, “Creating relational data…”, Michelson & Knoblock]
  • [WWW07, “Towards domain-independent…”, Gatterbauer et al]
  • [WWW02, “A machine learning based…”, Wang & Hu]
Outline • Introduction • Data Sources • Octopus Operators • SEARCH • CONTEXT • EXTEND • Algorithms & Experiments • Conclusions
Octopus • Provides a “workbench” of data integration operators to build the target database • Most operators are not correct/incorrect, but high/low quality • Some prosaic operators: project, select, … • Three original operators: • SEARCH • CONTEXT • EXTEND • Under the covers, each operator recovers a different aspect of an implicit GLAV source description
Operator #1 - SEARCH • SEARCH(“VLDB program committee members”)
Operator #2 - CONTEXT • Recover relevant data from a table's source page and add it to the table • [Figure: CONTEXT() applied to each of two tables]
Prosaic Operator - Union • Combine datasets • [Figure: Union() applied to the tables]
Operator #3 - EXTEND • Add a column to the data • Similar to “join”, but the join target is a topic, e.g., “publications” • Example: EXTEND(“publications”, col=0)
Straightforward Sequence • SEARCH(“VLDB program committee members”) • CONTEXT on each of two result tables • Union • EXTEND • User integrated data sources with 4 operations • No wrappers; data was never intended for reuse • User never visited source web pages
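The four-step sequence can be mimicked on toy data. The sketch below is not the Octopus implementation: the in-memory `TABLES` list, the `page_year` field, the `PUBLICATIONS` lookup, and all four function bodies are assumptions chosen only to show how the operators compose.

```python
# Toy stand-ins for the four operators; every name and value here is
# hypothetical and only mirrors the shape of the sequence on the slide.

TABLES = [
    {"topic": "VLDB program committee members", "page_year": "2008",
     "rows": [["Alice Smith"], ["Bob Jones"]]},
    {"topic": "VLDB program committee members", "page_year": "2009",
     "rows": [["Carol White"], ["Bob Jones"]]},
    {"topic": "city mayors", "page_year": "2009",
     "rows": [["Springfield"]]},
]

PUBLICATIONS = {"Alice Smith": "Paper A", "Bob Jones": "Paper B"}


def search(query):
    """Return tables whose topic contains every query word (toy relevance)."""
    words = set(query.lower().split())
    return [t for t in TABLES if words <= set(t["topic"].lower().split())]


def context(table):
    """Append a value taken from the source page (here a year) to each row."""
    return [row + [table["page_year"]] for row in table["rows"]]


def union(*row_sets):
    """Concatenate row sets, dropping exact duplicate rows, keeping order."""
    seen, out = set(), []
    for rows in row_sets:
        for row in rows:
            key = tuple(row)
            if key not in seen:
                seen.add(key)
                out.append(row)
    return out


def extend(rows, col, lookup):
    """Add one column by looking up each row's value in `col` (None if absent)."""
    return [row + [lookup.get(row[col])] for row in rows]


hits = search("VLDB program committee members")      # 1. SEARCH
combined = union(*(context(t) for t in hits))         # 2. CONTEXT, 3. Union
result = extend(combined, col=0, lookup=PUBLICATIONS)  # 4. EXTEND
for row in result:
    print(row)
```

Rows with no match in the lookup keep a `None` in the new column, which reflects the slide's point that operator output is higher or lower quality rather than strictly correct or incorrect.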
Experiments • ~50 queries, suggested and evaluated by Amazon Mechanical Turk
SEARCH Algorithms - Ranking • SimpleRank - search engine ranking • SCPRank - symmetric conditional probability between query, table data • Similar to Pointwise Mutual Information • [Lopes, DaSilva, 1999], multiword units
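As a rough illustration of the measure behind SCPRank, symmetric conditional probability is commonly defined as SCP(x, y) = p(x, y)² / (p(x) · p(y)). The sketch below estimates it from document co-occurrence counts in a tiny corpus. The corpus, the substring matching, and the choice to average over table cells in `table_score` are assumptions made for the example, not details taken from the slides.

```python
def scp(docs, x, y):
    """Symmetric conditional probability p(x, y)^2 / (p(x) * p(y)),
    estimated from how many documents contain x, y, and both."""
    n = len(docs)
    has_x = sum(1 for d in docs if x in d)
    has_y = sum(1 for d in docs if y in d)
    has_both = sum(1 for d in docs if x in d and y in d)
    if has_x == 0 or has_y == 0:
        return 0.0
    p_x, p_y, p_xy = has_x / n, has_y / n, has_both / n
    return p_xy ** 2 / (p_x * p_y)


def table_score(docs, query, cells):
    """Hypothetical table score: mean SCP between the query and each cell."""
    if not cells:
        return 0.0
    return sum(scp(docs, query, c) for c in cells) / len(cells)


# Hypothetical lower-cased corpus standing in for a Web index.
docs = [
    "vldb program committee: alice smith, bob jones",
    "alice smith joins vldb board",
    "bob jones wins chess title",
    "city mayors list",
]

print(scp(docs, "vldb", "alice smith"))
print(scp(docs, "vldb", "bob jones"))
print(table_score(docs, "vldb", ["alice smith", "bob jones"]))
```

A cell that appears almost only alongside the query scores near 1, while a cell that is common elsewhere is discounted, which is the same intuition as Pointwise Mutual Information.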
SEARCH Algorithms - Ranking • See paper for clustering results
CONTEXT Algorithms • Input: table and source page • Output: data values to add to table • SignificantTerms sorts terms in source page by “importance” (tf-idf)
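A minimal sketch of the tf-idf idea behind SignificantTerms follows. The tokenizer, the smoothed idf formula, the background corpus, and the `significant_terms` function are all assumptions for illustration; the slides state only that terms on the source page are sorted by tf-idf “importance”.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def significant_terms(page, corpus, k=3):
    """Rank the terms of `page` by tf-idf against `corpus`; return the top k."""
    tf = Counter(tokenize(page))
    doc_sets = [set(tokenize(doc)) for doc in corpus]
    n = len(corpus)
    scores = {}
    for term, count in tf.items():
        df = sum(1 for s in doc_sets if term in s)
        idf = math.log((1 + n) / (1 + df)) + 1.0  # smoothed idf (an assumption)
        scores[term] = count * idf
    # Sort by descending score; break ties alphabetically for determinism.
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


# Hypothetical background pages and source page.
corpus = [
    "program committee members of the conference",
    "list of the committee members",
    "members of the city council",
]
page = "VLDB 2009 program committee members VLDB 2009 Lyon"

print(significant_terms(page, corpus))
```

Terms that are frequent on the page but rare elsewhere (here the conference name and year) rise to the top; these are the kind of values CONTEXT could add to the table as new columns.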
Related View Partners • Looks for different “views” of same data
EXTEND Algorithms
• Input: src table, src column, dst topic
  • EXTEND(t, col=0, “publications”)
• JoinTest
  • Tests a single table for join-compatibility
  • “City mayors”: yes
  • “VLDB publications”: no
  • Rank all tables by relevance to query topic
  • Select tables that are joinable to query column
• MultiJoin
  • Finds a join-target tuple for each src tuple
  • “City mayors”: maybe
  • “VLDB publications”: yes
  • For each cell in src column, perform topic search
  • Cluster resulting tables, rank by column coverage
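The contrast between the two strategies can be sketched on toy data. This is an illustration under stated assumptions, not the authors' algorithms: `join_test` uses a simple value-overlap threshold as its compatibility check, `multi_join` replaces the per-cell topic search with a caller-supplied `search_fn`, and the clustering step is omitted entirely.

```python
def normalize(value):
    """Lower-case and collapse whitespace so near-identical cells match."""
    return " ".join(value.lower().split())


def join_test(src_values, candidate_rows, key_col=0, threshold=0.5):
    """Hypothetical join-compatibility check for ONE candidate table:
    the share of source values found in the candidate's key column."""
    src = [normalize(v) for v in src_values]
    if not src:
        return False, 0.0
    keys = {normalize(row[key_col]) for row in candidate_rows}
    coverage = sum(v in keys for v in src) / len(src)
    return coverage >= threshold, coverage


def multi_join(src_values, search_fn):
    """Run one search per source cell and collect that cell's extensions."""
    return {v: search_fn(v) for v in src_values}


def column_coverage(extensions):
    """Fraction of source cells that received at least one extension."""
    if not extensions:
        return 0.0
    return sum(1 for hits in extensions.values() if hits) / len(extensions)


# Hypothetical data: one table can extend most cities with a mayor ...
cities = ["Ann Arbor", "Lyon", "Seattle"]
mayors = [["ann arbor", "Mayor A"], ["Seattle", "Mayor B"], ["Boston", "Mayor C"]]
ok, cov = join_test(cities, mayors)

# ... while publications need a separate lookup per author.
PUBS = {"alice smith": ["Paper A", "Paper B"], "bob jones": ["Paper C"]}
authors = ["Alice Smith", "Bob Jones", "Carol White"]
ext = multi_join(authors, lambda v: PUBS.get(normalize(v), []))

print(ok, cov)
print(ext, column_coverage(ext))
```

The single-table check suits topics such as “city mayors”, where one table can cover the whole column, while the per-cell search suits “VLDB publications”, where each source tuple has its own set of join targets.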
EXTEND Early Experiments
• JoinTest
  • 3 of 7 source tables
  • 60% of source tuples
  • Single extension for each extended tuple
• MultiJoin
  • All 7 source tables
  • 33% of source tuples
  • Avg 45.5 extensions for each extended tuple
  • 113 NYC mayors
  • 12 albums by Led Zeppelin
Related Work • Octopus relies on info extraction work • Substantial work in data integration • Mashup Tools • Yahoo! Pipes • Marmite - [Wong and Hong, 2007] • Karma - [Tuchinda, et al., 2007] • CIMPLE - [DeRose, et al., 2007] • Potter’s Wheel - [Raman and Hellerstein, 2001]
Octopus Contributions • Basic operators that enable Web data integration with very small user burden • Realistic and useful implementations for all three operators • Future work: • Efficient large-scale implementation