
WebTables: Exploring the Power of Tables on the Web




Presentation Transcript


  1. WebTables: Exploring the Power of Tables on the Web. Michael J. Cafarella, University of Washington (presently with University of Michigan); Alon Halevy, Google; Daisy Zhe Wang, UC Berkeley; Eugene Wu, MIT; Yang Zhang, MIT. Proceedings of VLDB '08, Auckland, New Zealand. Presented by: Udit Joshi

  2. Introduction • The Web: a corpus of unstructured documents • Relational data is often encountered • 14.1 billion HTML tables extracted by the crawl • Non-relational tables filtered out • Corpus of 154M (about 1%) high-quality relations • Searching and ranking • Leveraging the statistical information

  3. A typical use of the table tag to describe relational data

  4. Contribution • Ample user demand for structured data and visualization • Around 30 million queries from Google's 1-day log • Extracting a corpus of high-quality relations (previous work) • Determining effective relation-ranking methods for search • Analyzing and leveraging this corpus

  5. Outline

  6. Data Model • Relation Extraction • Attribute Correlation Statistics Database (ACSDb)

  7. Relation Recovery • Crawl based on the <table> tag • Filter out non-relational data • Relation extraction pipeline

  8. Use of Table Tag to Describe Relational Data

  9. Deep Web • Tables behind HTML forms • http://factfinder.census.gov/, http://www.cars.com/ • Most deep-web data is not crawlable • The amount of data in the Deep Web is huge • Google's Deep Web Crawl project uses 'surfacing' • Precomputes a set of relevant form submissions • A search query for “citibank atm 94043” returns a parameterized URL: http://locations.citibank.com/citibankV2/Index.aspx?zip=94022 • 40% of the corpus comes from deep-web sources

  10. Relational Recovery • Two stages in the extraction system: • Relational filtering (for “good” relations) • Metadata detection (in the top row of the table) • HTML parser run over the page crawl • 14.1B instances of the <table> tag • Script disregards tables used for layout, forms, calendars, etc.

  11. Relational Filtering • Human judgment needed • Two independent judges produced the training data • Tables scored from 1–5 • Qualifying score > 4

  12. Relational Filtering • A machine-learning classification problem • Pair human classifications with a set of automatically extracted table features • Forms a supervised training set for the statistical learner (a toy sketch follows below) [Figure: statistics that help distinguish relational tables from layout tables.]
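As a rough illustration of this step, the sketch below trains a binary classifier over simple table-layout statistics. The feature set and numbers are hypothetical stand-ins, not the paper's actual features or classifier:

    from sklearn.linear_model import LogisticRegression

    # Hypothetical layout features per table:
    # [num_rows, num_cols, mean_cell_length, frac_numeric_cols]
    X_train = [
        [120, 5, 12.0, 0.40],   # judged relational by the human raters
        [3,   2, 85.0, 0.00],   # judged non-relational (page layout)
        [45,  4, 10.5, 0.50],   # relational
        [2,   6, 60.0, 0.10],   # non-relational
    ]
    y_train = [1, 0, 1, 0]      # 1 = relational, 0 = not

    # The paired (features, human label) examples form the supervised
    # training set; the learned model then filters the full crawl.
    classifier = LogisticRegression().fit(X_train, y_train)
    print(classifier.predict([[80, 3, 9.0, 0.66]]))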

  13. Metadata Detection • Only per-attribute labels needed • Used to improve ranking quality, data visualization, and construction of the ACSDb [Figure: features used to detect the header row in a table.]

  14. Relation Extractor's Performance [Figure: precision/recall of the extractor at different tuning points, annotated: high recall, low precision, equal weight.]

  15. Data Model • Relation Extraction • Attribute Correlation Statistics Database (ACSDb)

  16. Attribute Correlation Statistics Database (ACSDb) • A simple collection of statistics about schema attributes • Derived from the corpus of HTML tables • Example counts: combo_make_model_year = 13; single_make = 3068 • Available as a single file for download • 5.4M unique attribute names, 2.6M unique schemas • Source: http://www.eecs.umich.edu/~michjc/acsdb.html

  17. Recovered Relations → ACSDb (schema → count): {make, model, year} → 2; {make, model, year, color} → 1; {name, addr, city, state, zip} → 1; {name, size, last-modified} → 1 • The ACSDb is used for computing attribute probabilities • p(“make”) = 3/5 • p(“zip”) = 1/5 • p(“addr” | “name”) = 1/2
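A minimal Python sketch of these probability computations over the toy counts above (the acsdb dict and helper names are illustrative):

    acsdb = {                       # schema (frozenset of attrs) -> count
        frozenset(["make", "model", "year"]): 2,
        frozenset(["make", "model", "year", "color"]): 1,
        frozenset(["name", "addr", "city", "state", "zip"]): 1,
        frozenset(["name", "size", "last-modified"]): 1,
    }
    total = sum(acsdb.values())

    def p(attr):
        # fraction of schema occurrences containing attr
        return sum(c for s, c in acsdb.items() if attr in s) / total

    def p_cond(a, given):
        # p(a | given): restrict to schemas containing `given`
        denom = sum(c for s, c in acsdb.items() if given in s)
        num = sum(c for s, c in acsdb.items() if a in s and given in s)
        return num / denom

    print(p("make"))               # 0.6  (= 3/5)
    print(p("zip"))                # 0.2  (= 1/5)
    print(p_cond("addr", "name"))  # 0.5  (= 1/2)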

  18. Structure of Corpus • A corpus R of databases • Each database R ∈ R is a single relation • The source URL Ru and offset Ri within the page define R • The schema Rs is an ordered list of attributes, e.g. Rs = [Grand Prix, Date, Winning Driver, …] • Rt is the list of tuples; each tuple t has size ≤ |Rs|
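A minimal sketch of this data model as a Python structure (field names mirror the slide's notation and are purely illustrative):

    from dataclasses import dataclass
    from typing import List

    @dataclass
    class Relation:
        u: str                 # source page URL (Ru)
        i: int                 # table offset within the page (Ri)
        s: List[str]           # schema Rs: ordered attribute labels
        t: List[List[str]]     # tuples Rt; each row has at most len(s) cells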

  19. Extracting ACSDb from Corpus
Function createACS(R):
  A = {}
  seenDomains = {}
  for all R ∈ R do
    if getDomain(R.u) ∉ seenDomains[R.S] then
      seenDomains[R.S].add(getDomain(R.u))
      A[R.S] = A[R.S] + 1
    end if
  end for
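A runnable Python rendering of the pseudocode above, assuming each relation carries its source URL and schema. Counting each schema at most once per source domain keeps a single site from inflating the statistics:

    from collections import defaultdict
    from urllib.parse import urlparse

    def create_acs(relations):
        # relations: iterable of (url, schema) pairs; schema is a tuple of attrs
        acs = defaultdict(int)           # schema -> count
        seen_domains = defaultdict(set)  # schema -> domains already counted
        for url, schema in relations:
            domain = urlparse(url).netloc
            if domain not in seen_domains[schema]:
                seen_domains[schema].add(domain)
                acs[schema] += 1
        return acs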

  20. Distribution of frequency-ordered unique schemas in the ACSDb [Figure: a small number of schemas appear very frequently.]

  21. Relational Search • Challenges • Ranking Algorithms

  22. Relational Search • Search-engine-style keyword queries • Query-appropriate visualizations • Structured operations supported over search results • Good search relevance is key

  23. Relational Search [Figure: a keyword query returns a ranked list of databases, along with a possible visualization.]

  24. Relation Ranking Challenges • Relations are a mixture of “structure” and “content” • They lack the incoming hyperlink anchor text used in traditional IR • PageRank-style metrics are unsuitable • A standard inverted index is unsuitable

  25. Relation Ranking Challenges • No domain-specific schema graph • How to apply word-frequency scoring to embedded tables • Must factor in relation-specific features: schema elements, presence of keys, size of relation, # of NULLs

  26. Relational Search • Challenges • Ranking Algorithms

  27. Naïve Rank • Takes query q and a top-k parameter as input • Sends the query to a search engine • Fetches the top-k pages and extracts tables from each page • Stops even if fewer than k tables are returned
1: Function naiveRank(q, k):
2: let U = urls from web search for query q
3: for i = 0 to k do
4:   emit getRelations(U[i])
5: end for

  28. Filter Rank • A slight improvement • Ensures k relations are extracted
1: Function filterRank(q, k):
2: let U = ranked urls from web search for query q
3: let numEmitted = 0
4: for all u ∈ U do
5:   for all r ∈ getRelations(u) do
6:     if numEmitted >= k then
7:       return
8:     end if
9:     emit r; numEmitted++
10:   end for
11: end for

  29. Feature Rank • No reliance on an existing search engine • Uses several features to score each extracted relation in the corpus • Feature scores combined using Linear Regression Estimation (LRE) • LRE trained on a thousand (query, relation) pairs • Judged by two judges on a scale of 1–5 • Results sorted by score

  30. Feature Rank • Query-independent features: # rows, # cols, has-header?, # of NULLs in table • Query-dependent features: document-search rank of source page, # hits on header, # hits on leftmost column (often a semantic key capturing the table's subject matter), # hits on second-to-leftmost column, # hits on table body
1: Function featureRank(q, k):
2: let R = set of all relations extracted from corpus
3: let score(r ∈ R) = combination of per-relation features
4: sort r ∈ R by score(r)
5: for i = 0 to k do
6:   emit R[i]
7: end for

  31. Schema Rank • Uses an ACSDb-based schema coherency score • A coherent schema implies a tighter relation • High: {make, model} • Low: {make, zipcode} • Pointwise Mutual Information (PMI) determines how strongly two items are related • Positive: strongly correlated; negative: negatively correlated; zero: independent • The coherency score for a schema S is the average pairwise PMI score over all pairs of attributes in the schema
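PMI here is the standard quantity log p(a, b) / (p(a) p(b)), computable directly from ACSDb counts. A sketch, with acsdb and total as in the earlier sketch:

    import math

    def pmi(a, b, acsdb, total):
        # Positive when a and b co-occur more often than independence
        # predicts, negative when less often, zero when independent.
        p_a  = sum(c for s, c in acsdb.items() if a in s) / total
        p_b  = sum(c for s, c in acsdb.items() if b in s) / total
        p_ab = sum(c for s, c in acsdb.items() if a in s and b in s) / total
        if p_ab == 0:
            return float("-inf")   # never co-occur: strongest negative signal
        return math.log(p_ab / (p_a * p_b))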

  32. Schema Rank • Coherency score: average pairwise PMI over the schema's attributes
1: Function cohere(R):
2: totalPMI = 0
3: for all a ∈ attrs(R), b ∈ attrs(R), a ≠ b do
4:   totalPMI += PMI(a, b)
5: end for
6: return totalPMI / (|R| ∗ (|R| − 1))

  33. Indexing • A standard inverted index maps term -> (docid, offset) • WebTables data exists in two dimensions • term -> (tableid, (x, y) offsets) is better suited to the ranking function • Supports queries with spatial operators like samerow and samecol • Example: Paris and France on the same row; Paris, London, and Madrid in the same column
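A minimal sketch of such a two-dimensional index with a samerow predicate (the table ids and coordinates are illustrative):

    from collections import defaultdict

    index = defaultdict(list)            # term -> list of (tableid, x, y)

    def add_cell(term, tableid, x, y):
        index[term].append((tableid, x, y))

    def samerow(t1, t2):
        # Tables and rows where both terms share the same y coordinate.
        rows1 = {(tid, y) for tid, _, y in index[t1]}
        rows2 = {(tid, y) for tid, _, y in index[t2]}
        return rows1 & rows2

    add_cell("Paris", 7, 0, 3)
    add_cell("France", 7, 1, 3)
    print(samerow("Paris", "France"))    # {(7, 3)}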

  34. WebTables Search System [Figure: system architecture; the index is split across multiple servers.]

  35. ACSDb Applications • Schema Auto Complete • Attribute Synonym-Finding • Join Graph Traversal

  36. Schema Auto-Complete • To assist novice database designers • User enters one or more domain-specific attributes (example: “make”) • System guesses suggestions appropriate to the target domain (example: “model”, “year”, “price”, “mileage”)

  37. Schema Auto-Complete • Greedily maximize p(S − I | I), the probability of the suggested attributes S − I given the user's input I • Probability values computed from the ACSDb • Attributes are added to S from the overall attribute set A • Threshold t set to 0.01 (a sketch of the greedy loop follows below)
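A hedged sketch of the greedy loop this implies: repeatedly add the attribute with the highest probability conditioned on the schema so far, stopping at threshold t. The cond_prob helper is assumed to come from ACSDb statistics; this greedy conditional form is a simplification of the paper's p(S − I | I) objective:

    def autocomplete(input_attrs, all_attrs, cond_prob, t=0.01):
        # cond_prob(a, S): probability of attribute a given attribute set S,
        # estimated from ACSDb co-occurrence counts (assumed available).
        S = set(input_attrs)
        while True:
            candidates = all_attrs - S
            if not candidates:
                break
            best = max(candidates, key=lambda a: cond_prob(a, S))
            if cond_prob(best, S) < t:
                break
            S.add(best)
        return S - set(input_attrs)    # the suggested additions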

  38. ACSDb Applications • Schema Auto Complete • Attribute Synonym-Finding • Join Graph Traversal

  39. Attribute Synonym-Finding • Traditionally done using thesauri • Thesauri do not support non-natural-language strings, e.g. tel-# • Input: a set of context attributes C • Output: a list of attribute pairs P likely to be synonymous in schemas that contain C • Example: for attribute “artist”, the output is “song/track”

  40. Attribute Synonym-Finding • For synonymous attributes a, b: p(a, b) = 0, since the two never appear together in the same schema • If p(a, b) = 0 and p(a)p(b) is large, the syn score is high • Synonyms appear in similar contexts C: for a third attribute z ∈ A, p(z | a, C) ≈ p(z | b, C) • If a and b always “replace” each other, the denominator ≈ 0; otherwise the denominator is large (see the sketch below)
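Putting both signals together, the syn score has roughly this shape; a sketch, with p and cond_prob assumed to come from ACSDb statistics and eps a small smoothing constant:

    def syn(a, b, context, all_attrs, p, cond_prob, eps=0.01):
        # Numerator: large when a and b are individually common
        # (SynFind only scores pairs that never co-occur, so p(a,b) = 0).
        # Denominator: small when a and b predict every other attribute z
        # the same way, i.e. they appear in interchangeable contexts.
        num = p(a) * p(b)
        den = eps + sum((cond_prob(z, {a} | context) -
                         cond_prob(z, {b} | context)) ** 2
                        for z in all_attrs if z not in (a, b))
        return num / den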

  41. Attribute Synonym-Finding
1: Function SynFind(C, t):
2: R = []
3: A = all attributes that appear in ACSDb with C
4: for a ∈ A, b ∈ A, s.t. a ≠ b do
5:   if (a, b) ∉ ACSDb then   // candidates must never co-occur
6:     // Score candidate pair with syn function
7:     if syn(a, b) > t then
8:       R.append(a, b)
9:     end if
10:  end if
11: end for
12: sort R in descending syn order
13: return R

  42. ACSDb Applications • Schema Auto Complete • Attribute Synonym-Finding • Join Graph Traversal

  43. Join Graph Traversal • Assists a schema designer • Join graph (N, L): a node for every unique schema, and an undirected join link between any two schemas sharing a label • The raw join graph is cluttered • Cluster together similar schema neighbors

  44. Join Neighbor Similarity • Measures whether a shared attribute D plays a similar role in schemas X and Y • Similar to the coherency score, except the probability inputs to the PMI function are conditioned on the presence of D • Schemas that cohere well are clustered together • Used as a distance metric to cluster schemas sharing an attribute with S • The user can then choose from fewer outgoing links
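Concretely, this can be written (paraphrasing the definition implied above; notation is assumed) as an average of PMI-style terms conditioned on the shared attribute D:

    neighborSim(X, Y, D) = (1 / (|X| |Y|)) * Σ over a ∈ X, b ∈ Y of
                           log( p(a, b | D) / (p(a | D) p(b | D)) )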

  45. Join Graph Traversal
// input: ACSDb A, focal schema F
// output: join graph (N, L) connecting any two schemas with shared attributes
1: Function ConstructJoinGraph(A, F):
2: N = {}
3: L = {}
4: for (S, c) ∈ A do   // schema S with count c
5:   N.add(S)           // add node
6: end for
7: for (S, c) ∈ A do
8:   for attr ∈ F do
9:     if attr ∈ S then
10:      L.add((attr, F, S))   // add link
11:    end if
12:  end for
13: end for
14: return N, L

  46. Experimental Results

  47. Fraction of High-Scoring Relevant Tables in Top-k • Ranking: the four algorithms were compared on a test dataset by two judges • Judges rated (query, relation) pairs from 1–5 • 1000 pairs over 30 queries • Queries chosen by hand • Metric: the fraction of the top-k results that are relevant (score ≥ 4), measured at increasing k

  48. Schema Auto-Completion • Example target schemas: file system contents, baseball at-bats

  49. Rate of Attribute Recall for 10 Expert-Generated Test Schemas • The output schema is almost always coherent • The goal is to recall the most relevant attributes • Six humans created a schema for each case • Attributes suggested by ≥ 2 of the humans were retained • Ambiguous inputs can pull suggestions toward the wrong domain (e.g., file-system vs. address-book attributes) • Allowing 3 tries yields incremental improvement

  50. Synonym Finding • Metric: fraction of correct synonyms in the top-k ranked list from the synonym finder • A human judge determines accuracy
