
Flint: exploiting redundant information to wring out value from Web data


Presentation Transcript


  1. Flint: exploiting redundant information to wring out value from Web data • Lorenzo Blanco, Mirko Bronzi, Valter Crescenzi, Paolo Merialdo, Paolo Papotti • Roma Tre University - Rome, Italy

  2. Motivations • The opportunity: an increasing number of web sites with structured information • The problem: • Current technologies are limited in exploiting the data offered by these sources • Web semantics technologies are too complex and costly • Challenges: • development of unsupervised, scalable techniques to extract and integrate data from fairly structured large corpora available on the Web [DB Claremont Report 2008]

  3. Structured Data at Work: Search Engines

  4. Structured Data at Work: Search Engines

  5. Structured Data at Work: Search Engines

  6. Introduction • Notable approaches for massive extraction of Web data concentrate on information organized according to specific patterns that occur on the Web • WebTables [Cafarella et al VLDB2008] and ListExtract [Elmeleegy et al VLDB2009] focus on data published in HTML tables and lists • Information extraction systems (e.g. TextRunner [Banko-Etzioni ACL2008]) exploit lexical-syntactic patterns to extract collections of facts (e.g., x is the capital of y) • Even a small fraction of the Web implies an impressive amount of data: given a Web fragment of 14 billion pages, 1.1% of them are good tables -> 154 million [Cafarella et al VLDB2008]

  7. Observation • Many sources publish data about one object of a real-world entity for each page • Collections of pages can be thought of as HTML encodings of a relation

  8. Observation • Learned while looking for pages to evaluate RoadRunner • I was frustrated … RoadRunner was built to infer arbitrary nested structures (list of list of list …), but pages were much simpler • Pages with complex structures were usually designed to support the navigation to detail pages

  9. Information redundancy on the Web • For many disparate entities (e.g. stock quotes, people, products, movies, books, etc.) many web sites follow this publishing strategy • These sites can be considered as sources that provide redundant information. The redundancy occurs: • at the schema level: the same attributes are published by more than one source (e.g. volume, min/max/avg price, market cap. for stock quotes) • at the extensional level: several objects are published by more than one source (e.g. many web sites publish data about the same stock quotes)

  10. Abstract Generative Process • "Hidden Relation" R0 • Each source is a view over R0: • SY = λY(eY(σY(πY(R0)))) • SG = λG(eG(σG(πG(R0)))) • SR = λR(eR(σR(πR(R0))))

  11. Abstract Generative Process • "Hidden Relation" R0 • Example view: σticker LIKE "nasdaq:%"(πticker, price, volume(R0)) • Each source is generated by: • π projection • σ selection • e error (e.g. approximations, mistakes, formatting) • λ template encoding

  12. Abstract Generative Process • "Hidden Relation" R0 • Example with errors: eY(σticker LIKE "nasdaq:%"(πticker, price, volume(R0))) • eY(): round(volume, 1000), price perturbed with Gaussian noise Nσ • Each source is generated by: • π projection • σ selection • e error (e.g. approximations, mistakes, formatting) • λ template encoding

  13. Abstract Generative Process • "Hidden Relation" R0 • Complete example: SY = λY(eY(σticker LIKE "nasdaq:%"(πticker, price, volume(R0)))) • eY(): round(volume, 1000), price perturbed with Gaussian noise Nσ • Each source is generated by: • π projection • σ selection • e error (e.g. approximations, mistakes, formatting) • λ template encoding
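To make the generative model of slides 10–13 concrete, here is a minimal Python sketch, assuming a toy hidden relation R0 over (ticker, price, volume); the operator implementations and the noise model are illustrative assumptions, not Flint's actual code.

```python
import random

# Toy hidden relation R0: (ticker, price, volume) tuples.
R0 = [
    ("nasdaq:AAPL", 171.3, 12400512),
    ("nasdaq:MSFT", 312.9, 9882107),
    ("nyse:IBM", 139.1, 3551420),
]

def project(rows, idx):        # pi: keep a subset of the attributes
    return [tuple(r[i] for i in idx) for r in rows]

def select(rows, pred):        # sigma: keep a subset of the objects
    return [r for r in rows if pred(r)]

def err(rows):                 # e: approximations and mistakes
    # round(volume, 1000) and Gaussian noise on the price, as on the slide
    return [(t, round(p + random.gauss(0, 0.05), 2), round(v, -3))
            for t, p, v in rows]

def encode(rows):              # lambda: HTML template encoding
    return ["<tr><td>%s</td><td>%.2f</td><td>%d</td></tr>" % r for r in rows]

# SY = lambdaY(eY(sigma[ticker LIKE "nasdaq:%"](pi[ticker, price, volume](R0))))
S_Y = encode(err(select(project(R0, (0, 1, 2)),
                        lambda r: r[0].startswith("nasdaq:"))))
print("\n".join(S_Y))
```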

  14. Problem: Invert the Process • [figure: reconstructing the "Hidden Relation" R0 from the sources]

  15. The Flint approach • Exploit the redundancy of information • to discover the sources [Blanco et al WIDM08, WWW11] • to generate the wrappers, to match data from different sources, to infer labels for the extracted data [Blanco et al EDBT08, WebDB10, VLDS11] • to evaluate the quality of the data and the accuracy of the sources [Blanco et al Caise2010, Wicow11]

  16. The Flint approach • Exploit the redundancy of information • to discover the sources [Blanco et al WIDM08, WWW11] • to generate the wrappers, to match data from different sources, to infer labels for the extracted data [Blanco et al EDBT08, WebDB10, VLDS11] • to evaluate the quality of the data and the accuracy of the sources [Blanco et al Caise2010, Wicow11]

  17. The Flint approach (intuition) • [figure: redundant sources SG, SY, SR]

  18. The Flint approach (intuition) • [figure: redundant sources SG, SY, SR]

  19. Integration and Extraction • integration problem • extraction problem • how they can be tackled contextually • We start by considering the web sources as relational views over the hidden relation

  20. Integration Problem • Given a set of sources S1, S2, … Sk, each Si publishes a view of the hidden relation • Problem: create a set of mappings, where each mapping is a set of attributes with the same semantics

  21. Integration Problem • [figure: views SG, SY, SR over the hidden relation]

  22. Integration Algorithm • Intuition: we match attributes from different sources to build aggregations of attributes with the same semantics • Assumption: alignment (record linkage) over a bunch of tuples • To identify attributes with the same semantics, we rely on instance-based matching • Noise implies possible discrepancies in the values! • [figure: sources SA, SB with attribute distance d(a1, b1) = 0.08] • A minimal matching sketch follows below
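As a hedged illustration of the instance-based matching, here is a minimal Python sketch; it assumes records are already aligned by record linkage and values are numeric, and the mean relative difference below is an illustrative distance, not necessarily the one Flint uses.

```python
def attr_distance(a_vals, b_vals):
    """Distance between two aligned attribute columns; 0 means identical."""
    diffs = [abs(a - b) / max(abs(a), abs(b), 1e-9)
             for a, b in zip(a_vals, b_vals)]
    return sum(diffs) / len(diffs)

# Two sources publishing "volume" for the same aligned stock quotes;
# the small discrepancies come from the error function e (rounding).
volume_SA = [12400512, 9882107, 3551420]
volume_SB = [12400000, 9882000, 3551000]      # rounded to thousands
print(attr_distance(volume_SA, volume_SB))    # small value, near 0
```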

  23. Integration Algorithm • Every attribute is a node • [figure: attribute nodes from sources SA, SB, SC]

  24. Integration Algorithm • Every attribute is matched against all other attributes

  25. Integration Algorithm • Edges are ranked w.r.t. the distance (due to the discrepancies). We start with the best match

  26. Integration Algorithm • We drop useless edges

  27. Integration Algorithm • We take the next edge in the rank and drop useless edges

  28. Integration Algorithm

  29. Integration Algorithm

  30. Integration Algorithm • Clustering algorithm to solve the problem • AbstractIntegration is O(n²) over the total number of attributes in the sources • But we are dealing with clean relational views... are these the relations we get from wrappers? • A greedy sketch of the clustering follows below
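A greedy Python sketch of slides 22–30, under stated assumptions: attributes are (source, name) pairs, dist is an instance-based distance such as the one above, an edge counts as "useless" when it would place two attributes of the same source in one mapping, and the threshold is a made-up parameter.

```python
from itertools import combinations

def integrate(attrs, dist, threshold=0.2):
    """attrs: list of (source_id, attr_name) pairs; dist: distance function.
    Returns mappings: sets of attributes presumed to share the same semantics."""
    # Every attribute is a node; every pair gets a ranked edge: O(n^2).
    edges = sorted(((dist(a, b), a, b) for a, b in combinations(attrs, 2)),
                   key=lambda e: e[0])
    cluster = {a: {a} for a in attrs}          # start from singleton mappings
    for d, a, b in edges:                      # best match first
        if d > threshold:
            break                              # remaining edges are useless
        ca, cb = cluster[a], cluster[b]
        if ca is cb:
            continue                           # already in the same mapping
        if {s for s, _ in ca} & {s for s, _ in cb}:
            continue   # drop: would mix two attributes of one source
        merged = ca | cb
        for x in merged:
            cluster[x] = merged
    return list({id(c): c for c in cluster.values()}.values())
```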

  31. Extraction Problem • A source Si is a collection of pages Si = {p1, p2, …, pn} • each page publishes data about one object of a real-world entity • Two different types of values can appear in a page: • target values: data from the hidden relation • noise values: irrelevant data (e.g., advertising, template, layout, etc.)

  32. Extraction Problem • A wrapper wi is a set of extraction rules wi = {erA1, …, erAn} • [figure: two pages with extraction rules er1–er4]

  33. Extraction Problem • A wrapper wi is a set of extraction rules wi = {erA1, …, erAn} • Limits of unsupervised wrapper inference: • extraction of noise data (e.g. er3) • some extraction rules may be imprecise (e.g. er4) • [figure: two pages with extraction rules er1–er4]

  34. Extraction Problem • An extraction rule is: • correct if for every page it extracts a target value of the same conceptual attribute • weak if it mixes target values with different semantics, or target values with noise values

  35. Extraction Problem • Problem: Given a set of sources S = {S1, S2, …, Sn}, produce a set of wrappers W* = {w1, w2, …, wn}, such that wi contains all and only the correct rules for Si • We leverage the redundant information among different sources to identify and filter out the weak rules • In a redundant environment, extracted data do not match by chance!

  36. Overlapping Rules • To increase the probability of getting the correct rules, we need a wrapper with more extraction rules • [figure: DOM tree over pages P1, P2, P3 with candidate rules er1 = 2.1%, 42ML, 3.0%, ...; er2 = 2.1%, 1.3%, 3.0%, ...; er3 = 33ML, 42ML, 1ML, ...; er4 = 5, 5, 6, ...]

  37. Overlapping Rules • Two extraction rules from the same wrapper overlap if they extract the same occurrence of the same string from one page • [figure: er1 = 2.1%, 42ML, 3.0%, ... and er2 = 2.1%, 1.3%, 3.0%, ... overlap on the occurrence 2.1%]

  38. Overlapping Rules • Given a set of overlapping rules, one is correct, the others are weak • Idea: match all of them against rules from other sources: (i) the correct rule is the one with the best matching score, (ii) drop the others • [figure: rules of sources S1 and S2]

  39. Overlapping Rules • Given a set of overlapping rules, one is correct, the others are weak • Idea: match all of them against rules from other sources: (i) the correct rule is the one with the best matching score, (ii) drop the others • [figure: matching scores between the rules of S1 and S2: 0.5, 0, 1, 0.5] • A minimal filtering sketch follows below
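A minimal Python sketch of the weak-rule filtering, assuming each rule is represented by the values it extracts; the Jaccard similarity and the sample values (taken from the slides) are illustrative, not Flint's exact scoring.

```python
def jaccard(rule_a, rule_b):
    """Toy similarity between two rules, over their extracted values."""
    a, b = set(rule_a), set(rule_b)
    return len(a & b) / len(a | b)

def best_overlapping_rule(overlapping, other_source_rules, similarity):
    """Among overlapping rules of one wrapper, keep the one whose values
    best match some rule from another source; drop the others as weak."""
    def score(rule):
        return max(similarity(rule, other) for other in other_source_rules)
    return max(overlapping, key=score)

er1 = ["2.1%", "42ML", "3.0%"]       # mixes percentages with volumes: weak
er2 = ["2.1%", "1.3%", "3.0%"]       # consistent percentages: correct
other = [["2.1%", "1.3%", "3.0%"]]   # matching rule from a second source
print(best_overlapping_rule([er1, er2], other, jaccard) is er2)   # True
```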

  40. Integration Algorithm • It is correct • It is O(n²) over the total number of attributes in the sources

  41. Extraction and Integration Alg. • Lemma: AbstractExtraction is correct • AbstractExtraction is O(n²) over the total number of extraction rules

  42. Extraction and Integration Alg. • Greedy best-effort algorithm for integration and extraction [Blanco et al. WebDb2010, WWW2011] • Promising experimental results

  43. Some Results • R = number of correct extraction rules over the number of sources containing the actual attribute

  44. Adding Labels • Last step: assign a label to each mapping • Candidate labels: the textual template nodes that occur closest to the extracted values • poor performance on a single source • but effective on a large number of sources, because it exploits the redundancy of labels (observed also in [Cafarella et al SIGMOD Record2008]) • A minimal label-voting sketch follows below
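A minimal Python sketch of the label inference, assuming the closest textual template node has already been collected from each source for a given mapping; the normalization and majority vote are illustrative assumptions.

```python
from collections import Counter

def infer_label(candidate_labels):
    """candidate_labels: one closest template node per source; the most
    frequent (normalized) candidate wins, with its support as confidence."""
    normalized = [l.strip().lower().rstrip(":") for l in candidate_labels]
    label, votes = Counter(normalized).most_common(1)[0]
    return label, votes / len(normalized)

# Unreliable per source, but redundancy across sources makes it effective.
print(infer_label(["Volume:", "volume", "Vol.", "Volume", "volume:"]))
# -> ('volume', 0.8)
```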

  45. The Flint approach • Exploit the redundancy of information • to discover the sources [Blanco et al WIDM08, WWW11] • to generate the wrappers, to match data from different sources, to infer labels for the extracted data [Blanco et al EDBT08, WebDB10, VLDS11] • to evaluate the quality of the data and the accuracy of the sources [Blanco et al Caise2010, Wicow11]

  46. Source Discovery • We developed crawling techniques to discover and collect the page collections that form our input sources [Blanco et al WIDM08, WWW11] • Input: a few sample pages • The crawler also associates an identifier with the objects described in the collected pages • A minimal crawling sketch follows below
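As a hedged illustration of structure-driven source discovery, here is a minimal Python sketch; page_links and template_signature are hypothetical helpers (a link fetcher and a function abstracting a page's HTML structure), and Flint's actual crawler is certainly more sophisticated.

```python
from collections import deque

def discover_collection(sample_pages, page_links, template_signature,
                        budget=1000):
    """Breadth-first crawl from a few sample pages, keeping pages that are
    structurally similar to the samples (same publishing template)."""
    target = {template_signature(u) for u in sample_pages}
    seen = set(sample_pages)
    collection = list(sample_pages)
    queue = deque(sample_pages)
    while queue and len(seen) < budget:
        url = queue.popleft()
        for link in page_links(url):
            if link in seen:
                continue
            seen.add(link)
            queue.append(link)
            if template_signature(link) in target:
                collection.append(link)   # same template: part of the source
    return collection
```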

  47. Data Quality and Source Accuracy • Redundancy implies inconsistencies and conflicts, since sources can provide different values for the same attribute of a given object • This is modeled by the error function in the abstract generative process • A concrete example: • on April 21st, 2009, the open trade for the Sun Microsystems Inc. stock quote published by three distinct finance web sites was 9.17, 9.15 and 9.15 • Which one is correct? (probability distribution) • What is the accuracy of the sources? • … is there any source that's copying values?

  48. Data Quality and Source Accuracy • Probabilistic models to evaluate the accuracy of web data: • NAIVE (voting) • ACCU [Yin et al, TKDE08; Wu & Marian, WebDB07; Galland et al, WSDM10] (voting + source accuracy) • DEP [Dong et al, PVLDB09] (voting + source accuracy + copiers) • M-DEP [Blanco et al, Caise10; Dong et al, PVLDB10] (voting + source accuracy + copiers over more attributes) • A minimal truth-voting sketch follows below
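A minimal Python sketch contrasting NAIVE voting with accuracy-weighted voting in the spirit of ACCU; the observed values come from the Sun Microsystems example above, while the source accuracies are made-up priors (real models estimate them iteratively, and DEP/M-DEP additionally detect copiers).

```python
from collections import defaultdict

observations = {            # source -> reported open price (April 21, 2009)
    "siteA": 9.17,
    "siteB": 9.15,
    "siteC": 9.15,
}
accuracy = {"siteA": 0.95, "siteB": 0.7, "siteC": 0.7}   # assumed priors

def vote(obs, weight=lambda source: 1.0):
    """Return a probability distribution over the candidate values."""
    scores = defaultdict(float)
    for source, value in obs.items():
        scores[value] += weight(source)
    total = sum(scores.values())
    return {value: s / total for value, s in scores.items()}

print(vote(observations))                       # NAIVE: 9.15 wins 2 to 1
print(vote(observations, weight=accuracy.get))  # ACCU: 9.17 gains weight
```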

  49. Conclusion • Data do not match by chance • Unexpected attributes discovered • Tolerant to noise (financial data are challenging) • Other projects are exploiting data redundancy (e.g. Nguyen et al VLDB11, Rastogi et al VLDB10, Gupta-Sarawagi WIDM11) • We also plan to leverage schema knowledge • The approach applies to domains where instances are replicated across multiple sites (e.g. not suitable for real estate)
