Data integration for relational web
1 / 24

Data Integration for Relational Web - PowerPoint PPT Presentation

  • Uploaded on
  • Presentation posted in: General

Data Integration for Relational Web. Michael Cafarella Alon HalevyNodira Khoussainova University of Washington Google, incUniversity of Washington. OVERVIEW. INTRODUCTION OCTOPUS AND ITS OPERATORS ALGORITHMS IMPLEMENTATION AT SCALE EXPERIMENTS RELATED WORK CONCLUSIONS

I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.

Download Presentation

Data Integration for Relational Web

An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Presentation Transcript

Data Integration for Relational Web

Michael Cafarella Alon HalevyNodira Khoussainova

University of Washington Google, incUniversity of Washington











  • OCTOPUS, a system that combines search, extraction, data cleaning and integration, and enables the end user to create new data set from those found on the Web.

  • Drawbacks of traditional Data Integration tools:

    • Locating Relevant Data: Difficult due to the large amount of data on Web.

    • System should Integrate well and present user with relevant data.

    • Data Sources Embedded in Web Pages: Data needs to be prepared before processing.

    • Offer Web Specific Data Extraction Tool

    • The semantic of Web data are implicitly tied to the Web Page.

    • Eg: Multiple Tables of VLDB PC members available but the Year of the Conference is in the Web Page text.

    • Able to recover any Relevant and implicit column for each extracted data source

    • Most of the Integration tool are tailored for the query posted against the stable database.

    • We do not concentrate on a certain data source, but we consider transient data source.





    • SEARCH


    • EXTEND


  • Manipulates the data Extracted from Web pages

  • Currently it handles HTML tables and HTML lists.[3] & [9]

  • Can manage any manipulate data obtained from any information extraction technique that emits relational data.

  • Extracted relation is a table T with ‘k’ columns.

  • The table extracted has domain sensitive columns.

    • Eg: Column will contain strings which depict strings or integers which are drawn from same domain. (movie titles).

  • Relation T would also preserves it extraction lineage.

  • Data Available: 3.9B HTML lists, 154M of 14B HTML table contain high quality relational data (slightly over 1% of data available on Web).


  • In Traditional Data integrating Technique

  • Create a mediated schema which can be used for query processing.

    • For eg: To collect data about the programming committee the schema would be PCMEMBER(name, institution, conference, year)

      The query would be reformulated to

  • In OCTOPUS system, the description for data sources are not prepared in advance.

  • Because Data Integration task in OCTOPUS

    • Transient

    • Large Number of Data Sources.

  • Integral Part of data integration is finding the relevant data sources over the Web.

  • Search operator finds relevant data over the Web and then clusters the result.

    • Each member table of the cluster is a concrete table that contributes to the Clusters Schema Relation.

  • The Context operator helps to discover the selection predicates that apply to schematic mapping of source table and mediated table but which are not described explicitly.

    • Context only requires single concrete relation to operate on(linkage).

  • Search and Context operators are sufficient to express semantic mappings for sources to the mediated schema.

  • The Extend operator will help us to express joins between data sources.

    • For eg: In the previous example if the mediated schema is extended with an another attribute Adviser.

  • We will have to join the tables VLDB08Page and VLDB09Page with other relation on the Web that describe the adviser relationship.

  • However the above information may come from many different sources and hence we would have several set of inputs.

  • The OCTOPUS uses the ranking technique like the conventional Web Search Engine to decide on the output.

  • Data Cleaning operators such as Data Transformation, Entity Resolution can also be implemented in OCTOPUS.



  • Takes Extracted Set of Relation S and a Users Query q as input.

  • Returns a Sorted List of Cluster of tables in S, ranked by the relevance to q.

  • Relevance ranking helps to find the useful source relation and is evaluated as the traditional Web Search.

  • Clustering: Finding Relations in S that are similar.

  • Tables in a Single Cluster should be able unionable with few or no modifications.( ie. They should be identical or very similar)

  • The Output of Search is List L of table sets.

  • A single table may appear in multiple clusters C.

  • It sorts the List L for relevancy and diversity of results.



  • Takes a single extracted Relation T as input and modifies to contain additional columns using data derived from T’s Source Web Page.

  • The values generated by Context can be viewed as the selection conditions in semantic mapping created by SEARCH.

  • Data values that hold true for every tuple are generally projected out and added to surrounding text.

  • Hence it makes the implicit data that are embedded in the Web page available explicitly.

  • In the previous example, year is the implicit data which can me made available by CONTEXT operator.



  • Enables the user to add more columns to the table by performing a join.

  • Takes a column “c” of table T as input and a topic keyword “k”.

  • It returns 1or more columns whose values are described by k.

  • The new column added to T does not necessarily come from aa single data source.

  • It gathers data from large number of sources.

  • It can also gather data from table with different label from k or no label at all.



    • Ranking

    • Clustering



  • SEARCH :

    • Rank the Table by relevance to Users Query

    • Cluster other related tables around top ranking Search result.



  • Simple Rank Algorithm:

  • Transmits the users search query to Web Search engine obtains the URL ordering and presents the data according to that order.

  • Drawbacks:

    • Ranks Individual whole page and not the data on that page.

      • Eg: persons home page contains a HTML list that serve as navigation list to other pages.

    • When multiple data sets are present on the web page, SR algorithm relies on in-page ordering. (ie. In the order of its appearance)

    • Any metadata about the HTML lists exists only in the surrounding text and not the table itself.

      • Cannot count hits between the query and a specific tables metadata.

  • SCPRanking Algorithm:

  • Uses symmetric conditional probability to measure correlation between cell in extracted database and query term. It is defined as:

    • How likely the term q and c appear together in a document.

  • SCPRank scores the table and not the cell.

  • It sends the query to the Search Engine, extracting a candidate set of tables.

  • Then it computes per-column scores, each of which is sum of per-cell SCP score in the column.

  • The tables overall score is the max of all of its per-column scores.

  • Finally it sorts the table in the order of their scores and returns a ranked list.

  • Time consuming.

  • Compute score for first ‘r’ rows of every candidate table.

  • Approximating SCP score on a small subset of Web corpus.



  • It computes the dist(t,t’) for every t’ € T-t.

  • Then it applies a similarity score threshold that limits the size of the cluster centered around t.

  • Dist() for Text Cluster: computes tf-idf cosine dist between texts of table a and text of table b.

  • Dist() for Size Cluster: computes column to column similarity score that measures the difference in mean string length between them.

    • The overall table-to-able similarity score for a pair of table is sum of per column score for best column-to-column matching.

  • Dist() for Column Cluster: Its similar to Size Cluster however it computes a tf-idf cosine distance using only the text found in the 2 columns.


  • Significant Term Algorithm:

    • Examines the source page of the extracted table and returns the k terms with the highest tf-idf values and do not appear in the extracted data.

  • Related View Partners:

    • Looks beyond the source page.

    • Operating on the table T, it obtains a large number of candidate related view tables, by using each value in T as parameter for a new Web Search

    • Then filters out tables that are unrelated to t’s source page, by removing all tables that do not contain atleast one value from ST(T)

    • It obtains all the data value in the remaining table and ranks them according to the frequency of occurrence, returns the k highest ranked values.

  • Hybrid Algorithm:

  • It uses the fact that the above 2 algorithm are complimentary in nature.

  • ST finds the context terms that RVP misses and RVP discovers the context terms that ST misses.

  • Hybrid returns the context term that appear in result of either algorithm.


The Algorithm used is shown below:


  • We now evaluate the quality of result generated by each of the operators.

  • The Queries used:


  • Ranking:

  • Clustering:




  • Data Integration on Web called as “MashUp” is increasingly popular area of work.

  • The Yahoo Pipes allows the user to graphically describe the flow of data (structured data only)

  • CIMPLE is data integration system for web use designed to construct community websites.


  • OCTOPUS allows the user to integrate data from many unstructured data source.

  • It offers access to orders of magnitude of data sources, frees the user from having to design or even know about the mediated schema.


  • Login