
Organizing Structured Web Sources by Query Schemas: A Clustering Approach

Bin He. Joint work with: Tao Tao, Kevin Chen-Chuan Chang. Univ. of Illinois at Urbana-Champaign.


Presentation Transcript


  1. Organizing Structured Web Sources by Query Schemas: A Clustering Approach. Bin He. Joint work with: Tao Tao, Kevin Chen-Chuan Chang. Univ. of Illinois at Urbana-Champaign

  2. Background: MetaQuerier – Large-Scale Integration of the Deep Web (Figure labels: Query, Result, MetaQuerier, The Deep Web)

  3. MetaQuerier: System architecture
     • Front-end (Query Execution): Result Compilation, Query Translation, Source Selection
     • Deep Web Repository: Query Interfaces, Query Capabilities, Subject Domains, Unified Interfaces
     • Back-end (Semantics Discovery): Database Crawler, Interface Extraction, Source Organization, Schema Matching
     • The front-end queries Web databases in the deep Web; the back-end finds them

  4. In MetaQuerier, source organization means clustering query interfaces into implicit domains, e.g., Airfares, Automobiles, Books

  5. Interface Extraction [SIGMOD 2004]: Query Interface → Query Schema
     • Example query schema: [Author; {contain}; text], [Title; {contain}; text], …, [Format; {=}; {hardcopy, paperback, …}], …
     • What are the representative features of query interfaces?
     • Is the query schema the feature we are looking for?

  6. Query schemas are appropriate representatives of Web databases: distinctive property
     • (Figure: number of observations vs. attribute index, one panel each for Airfares, Movies, and Hotels)
     • Each domain contains a dominant range of attributes, distinctive from other domains
     • Some attributes are observed in only one domain (anchor attributes), for example ISBN for Books and MPAA Rating for Movies
     • Source organization thus becomes the clustering of query schemas
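The notion of an anchor attribute can be illustrated with a small sketch. The function name `anchor_attributes` and the toy schemas below are hypothetical and chosen only for illustration; the talk does not give an implementation.

```python
from collections import defaultdict

def anchor_attributes(domains):
    """Return, per domain, the attributes that occur in no other domain."""
    # Record the set of domains in which each attribute is observed.
    seen_in = defaultdict(set)
    for domain, schemas in domains.items():
        for schema in schemas:
            for attr in schema:
                seen_in[attr].add(domain)
    # An attribute seen in exactly one domain is an anchor for that domain.
    anchors = defaultdict(set)
    for attr, seen_domains in seen_in.items():
        if len(seen_domains) == 1:
            anchors[next(iter(seen_domains))].add(attr)
    return dict(anchors)

# Toy data (an assumption, not from the talk's dataset).
domains = {
    "Books": [{"author", "title", "ISBN"}, {"title", "price"}],
    "Movies": [{"title", "director", "MPAA rating"}, {"title", "price"}],
}
anchors = anchor_attributes(domains)
```

In this toy data `title` and `price` occur in both domains, so only the remaining attributes qualify as anchors.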

  7. Query schemas can be viewed as categorical data
     • Query schemas as transactions:
       S1: {author, title, subject, ISBN}
       S2: {author, title, category, publisher}
       S3: {make, model, price, zip code}
       S4: {manufacturer, model, price}
       S5: {from, to, departure date, return date, number of passengers}
       S6: {departure city, arrival city, number of adults, number of children}
       ……
     • Thus, we can apply algorithms for clustering categorical data
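A minimal sketch of the transaction view, assuming each schema is stored as a Python set of attribute names (the variable names are illustrative, not from the talk):

```python
from collections import Counter

# Query schemas from the slide, each treated as a transaction (a set of attributes).
schemas = {
    "S1": {"author", "title", "subject", "ISBN"},
    "S2": {"author", "title", "category", "publisher"},
    "S3": {"make", "model", "price", "zip code"},
    "S4": {"manufacturer", "model", "price"},
}

# Count in how many schemas each attribute occurs.
attribute_counts = Counter(attr for attrs in schemas.values() for attr in attrs)

# Attributes shared by two schemas hint that they belong to the same domain.
shared_books = schemas["S1"] & schemas["S2"]
shared_autos = schemas["S3"] & schemas["S4"]
shared_cross = schemas["S1"] & schemas["S3"]
```

Here S1 and S2 overlap on two attributes, as do S3 and S4, while S1 and S3 share none, which is the kind of signal a categorical clustering algorithm can exploit.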

  8. Clustering categorical data: Objective function
     • Clustering needs an objective function to evaluate the quality of clusters
     • Existing objective functions:
       • Likelihood [1998] (model-based clustering)
       • Context Linkage [ROCK 2000]
       • Entropy [COOLCAT 2002]
     • In this paper, we propose a new objective function: Model-Differentiation

  9. Model-Differentiation: A new objective function for model-based clustering
     • Assumption of model-based clustering: each cluster Ci has a generative model Mi that generates its data with probabilistic behavior
     • What is a good clustering result? (our observation)
       • Data in different clusters are very dissimilar
       • Hence, models of different clusters are very dissimilar
       • Hence, a new objective function: maximize the dissimilarity of models
     • To realize this, we need to answer three questions:
       • How to model the data?
       • How to estimate the model, given data?
       • How to measure the dissimilarity of models?

  10. Modeling: Multinomial distribution
      • Each attribute is an independent event
      • A schema is generated by a series of samplings from the model M
      • Vocabulary of M: author (P1), publisher (P2), title (P3), ISBN (P4), city (P5), price (P6), model (P7), …
      • Example: the schema {title, author, ISBN} has probability P1*P3*P4
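The product P1*P3*P4 can be sketched as follows. The probability values are hypothetical (the slide gives only symbols), and `schema_probability` is an illustrative name.

```python
import math

def schema_probability(schema, model):
    """Probability of generating `schema` under `model`, treating each
    attribute as an independent draw. Ordering and the multinomial
    coefficient are ignored, matching the product shown on the slide."""
    return math.prod(model.get(attr, 0.0) for attr in schema)

# Hypothetical probabilities for illustration only; they sum to 1.
model = {"author": 0.3, "publisher": 0.1, "title": 0.25, "ISBN": 0.15,
         "city": 0.05, "price": 0.1, "model": 0.05}

# P(author) * P(title) * P(ISBN) = 0.3 * 0.25 * 0.15
p = schema_probability({"title", "author", "ISBN"}, model)
```

An attribute missing from the model gets probability 0 here, which makes the whole product 0; a practical implementation would likely smooth the estimates, but that is outside what the slide states.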

  11. Model estimation: Given a set of data, how to estimate its model?
      • Maximum likelihood estimation
      • Example: S1 = {title, author, ISBN}, S2 = {author, ISBN, publisher}, S3 = {author, title, price}, S4 = {author, title, price}
      • Vocabulary: author, title, ISBN, price, publisher
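For a multinomial model, the maximum likelihood estimate of each attribute's probability is its relative frequency among all observed attribute occurrences. The sketch below applies that standard estimator to the four schemas on the slide; the function name `estimate_model` is illustrative.

```python
from collections import Counter

def estimate_model(schemas):
    """Maximum likelihood estimate of a multinomial model:
    each attribute's count divided by the total number of occurrences."""
    counts = Counter(attr for schema in schemas for attr in schema)
    total = sum(counts.values())
    return {attr: n / total for attr, n in counts.items()}

schemas = [
    {"title", "author", "ISBN"},       # S1
    {"author", "ISBN", "publisher"},   # S2
    {"author", "title", "price"},      # S3
    {"author", "title", "price"},      # S4
]
model = estimate_model(schemas)
# Counts: author 4, title 3, ISBN 2, price 2, publisher 1 (12 occurrences in total),
# so for example P(author) = 4/12 and P(publisher) = 1/12.
```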

  12. Measuring the dissimilarity of models: Statistical hypothesis testing
      • Multinomial distributions can be tested directly with the χ² test
      • Example: S1 = {title, author, ISBN}, S2 = {author, ISBN, price}, S3 = {make, model, price}
      • Three candidate merges, each comparing the merged model against the remaining one:
        1. Combining S1 and S2: M<1,2> vs. M3
        2. Combining S1 and S3: M<1,3> vs. M2
        3. Combining S2 and S3: M<2,3> vs. M1
      • (Figure: each model plotted as probability over attributes)
      • This inspires a hierarchical agglomerative clustering (HAC) algorithm
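One plausible reading of the three candidate merges is sketched below, using Pearson's χ² statistic for homogeneity on pooled attribute counts. The exact statistic and merge rule used by the authors are not given in the transcript, so this is an assumption, and the function names are illustrative. With counts this small the test is not statistically meaningful; the numbers only show the mechanics.

```python
from collections import Counter

def attribute_counts(schemas):
    """Pool the attribute occurrences of all schemas in a cluster."""
    return Counter(attr for schema in schemas for attr in schema)

def chi_square(cluster_a, cluster_b):
    """Pearson chi-square statistic for homogeneity of two clusters'
    attribute counts; larger values mean more dissimilar models."""
    a, b = attribute_counts(cluster_a), attribute_counts(cluster_b)
    n_a, n_b = sum(a.values()), sum(b.values())
    stat = 0.0
    for attr in set(a) | set(b):
        column_total = a[attr] + b[attr]
        for observed, n in ((a[attr], n_a), (b[attr], n_b)):
            expected = n * column_total / (n_a + n_b)
            stat += (observed - expected) ** 2 / expected
    return stat

s1 = {"title", "author", "ISBN"}
s2 = {"author", "ISBN", "price"}
s3 = {"make", "model", "price"}

# For each candidate merge, compare the merged cluster with the remaining schema.
candidates = {
    ("S1", "S2"): chi_square([s1, s2], [s3]),
    ("S1", "S3"): chi_square([s1, s3], [s2]),
    ("S2", "S3"): chi_square([s2, s3], [s1]),
}
# Pick the merge that leaves the resulting models most dissimilar.
best_pair = max(candidates, key=candidates.get)
```

Under these assumptions, merging S1 and S2 yields the largest statistic, so the two book-like schemas are combined first, which is the behavior an agglomerative step would repeat until the desired number of clusters remains.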

  13. Hypothesis testing needs sufficient observations: Pre-clustering to form small clusters
      • Distinguishable schemas: those with anchor attributes (S1 in the figure)
      • S1 and S2 should be in the same domain and are thus pre-clustered
      • How to decide whether a schema S is “distinguishable”?
      • (Figure labels: S1, S2, Sup(S1); any Si, Sj in Sup(S1))

  14. Post-classification: Handling “loners”
      • Loners: clusters too small for the χ² test after pre-clustering
      • Pipeline: pre-clustering, separate the loners, model clustering, then classify the loners with a naïve Bayesian classifier
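A minimal sketch of the post-classification step, assuming a multinomial naive Bayes classifier with Laplace smoothing. The smoothing parameter `alpha`, the function name `classify_loner`, and the toy clusters are assumptions for illustration; the transcript only names the classifier.

```python
import math
from collections import Counter

def classify_loner(loner, clusters, vocabulary, alpha=1.0):
    """Assign `loner` to the cluster with the highest naive Bayes score.
    `alpha` is Laplace smoothing (an assumption, not from the talk)."""
    total_schemas = sum(len(schemas) for schemas in clusters.values())
    best_label, best_score = None, -math.inf
    for label, schemas in clusters.items():
        counts = Counter(attr for schema in schemas for attr in schema)
        n = sum(counts.values())
        # Log prior: fraction of already clustered schemas in this cluster.
        score = math.log(len(schemas) / total_schemas)
        # Log likelihood of each attribute under the cluster's smoothed model.
        for attr in loner:
            score += math.log((counts[attr] + alpha) / (n + alpha * len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy clusters standing in for the output of model clustering.
clusters = {
    "Books": [{"author", "title", "ISBN"}, {"author", "title", "publisher"}],
    "Automobiles": [{"make", "model", "price"}, {"manufacturer", "model", "price"}],
}
vocabulary = set().union(*(s for schemas in clusters.values() for s in schemas))
label = classify_loner({"title", "price", "ISBN"}, clusters, vocabulary)
```

The loner shares two attributes with the Books cluster and one with Automobiles, so it is assigned to Books.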

  15. Experiments
      • Data
      • Questions to answer:
        • Can schema clustering effectively organize Web databases?
        • Can it build a domain hierarchy correctly?

  16. We also try existing objective functions
      • Three existing objective functions:
        • Likelihood: maximize likelihood
        • Entropy: maximize entropy
        • Context Linkage: minimize cross links
      • To be fair, we keep pre-clustering and post-classification, and change only the clustering step to use each measure

  17. Effectiveness of Clustering
      • 8 domains, 8 clusters
      • Most Web databases are clustered correctly
      • Quantitative analysis: conditional entropy (the smaller, the better)
        • Model-Differentiation: 0.32
        • Likelihood: 0.42
        • Entropy: 0.38
        • Context Linkage: 0.61
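The transcript does not spell out how conditional entropy was computed. The sketch below uses the standard definition H(domain | cluster), the size-weighted average of the entropy of true domains inside each cluster; the base-2 logarithm and the function name are assumptions.

```python
import math
from collections import Counter, defaultdict

def conditional_entropy(true_domains, cluster_ids):
    """H(domain | cluster): 0 when every cluster contains a single domain,
    larger when clusters mix domains."""
    by_cluster = defaultdict(list)
    for domain, cluster_id in zip(true_domains, cluster_ids):
        by_cluster[cluster_id].append(domain)
    total = len(true_domains)
    entropy = 0.0
    for members in by_cluster.values():
        weight = len(members) / total
        for count in Counter(members).values():
            p = count / len(members)
            entropy -= weight * p * math.log2(p)
    return entropy

# Two toy clusterings of four sources (illustrative data only).
perfect = conditional_entropy(["Books", "Books", "Movies", "Movies"], [0, 0, 1, 1])
mixed = conditional_entropy(["Books", "Movies", "Books", "Movies"], [0, 0, 1, 1])
```

A clustering that separates the two domains exactly scores 0, while one that splits each domain evenly across both clusters scores 1 bit, which is why smaller values indicate better clusters.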

  18. Building a domain hierarchy
      • After reaching 8 clusters, continue running the HAC algorithm to merge them
      • The result is consistent with common sense: close concepts are merged first

  19. Conclusions
      • Cluster Web databases using their query schemas
        • First work on clustering Web databases, not pages
        • Query schemas are good representatives
        • Essentially a problem of clustering categorical data
      • A new objective function: Model-Differentiation
        • Realized by statistical hypothesis testing
        • Derives a different similarity measure for HAC

  20. Thank You!
