
Stratified K-means Clustering Over A Deep Web Data Source



Presentation Transcript


  1. Stratified K-means Clustering Over A Deep Web Data Source Tantan Liu, Gagan Agrawal Dept. of Computer Science & Engineering Ohio State University Aug. 14, 2012

  2. Outline • Introduction • Deep Web • Clustering on the deep web • Stratified K-means Clustering • Stratification • Sample Allocation • Conclusion

  3. Deep Web • Data sources hidden from search engines, not directly crawlable • Online query interface vs. database • The database is accessible only through the online interface • Input attributes vs. output attributes • An example of the Deep Web

  4. Data Mining over the Deep Web • Goal: a high-level summary of the data • Scenario 1: a user wants to relocate to a county • What is the summary of the county's residences? • Age, price, square footage • The county property assessor's website only allows simple queries

  5. Challenges • The database cannot be accessed directly • A sampling method is needed for deep web mining • Obtaining data is time-consuming • An efficient sampling method must achieve high accuracy with low sampling cost

  6. An Example of Deep Web for Real-Estate

  7. K-means clustering over a deep web data source • Goal: estimate k centers for the underlying clusters, so that the k centers estimated from the sample are close to the k true centers in the whole population.
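As a concrete illustration of this goal, here is a minimal sketch (all data are synthetic and hypothetical): a basic k-means run on a small sample of a two-cluster population recovers centers close to the true population centers.

```python
import math
import random

def kmeans(points, k, iters=20):
    """Minimal Lloyd's k-means with farthest-point initialization."""
    centers = [points[0]]
    while len(centers) < k:
        # Next center: the point farthest from all chosen centers.
        centers.append(max(points,
                           key=lambda p: min(math.dist(p, c) for c in centers)))
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Assign each point to its nearest center.
            j = min(range(k), key=lambda i: math.dist(p, centers[i]))
            clusters[j].append(p)
        for i, cl in enumerate(clusters):
            if cl:  # recompute each center as the mean of its cluster
                centers[i] = tuple(sum(x) / len(cl) for x in zip(*cl))
    return centers

# Synthetic stand-in for the hidden population: two well-separated clusters.
rng = random.Random(1)
population = [(rng.gauss(0, 0.3), rng.gauss(0, 0.3)) for _ in range(500)] \
           + [(rng.gauss(5, 0.3), rng.gauss(5, 0.3)) for _ in range(500)]

# A small sample plays the role of records retrieved via the query interface.
sample = rng.sample(population, 100)
est = sorted(kmeans(sample, 2))
print(est)  # centers land near the true centers (0, 0) and (5, 5)
```

The deep web setting adds the constraint that `sample` cannot be drawn uniformly at random, which motivates the stratified design on the following slides.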

  8. Overview of the Method: Stratification, then Sample Allocation

  9. Stratification on the Deep Web • Partition the entire population into strata • Stratify on the query space of input attributes • Goal: homogeneous query subspaces • Radius of a query subspace: measures how spread out its data are • Rule: choose the input attribute that most decreases the radius of a node • For each candidate input attribute, compute the resulting decrease in radius
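The splitting rule above can be sketched as follows. This is an illustrative reconstruction, assuming the radius of a subspace is the mean distance of its pilot records' output vectors from their mean; the record data and attribute names (`beds`, `city`) are hypothetical.

```python
import math

def radius(outputs):
    """Mean distance of output vectors to their mean (pilot-sample estimate)."""
    if not outputs:
        return 0.0
    dim = len(outputs[0])
    mean = [sum(o[d] for o in outputs) / len(outputs) for d in range(dim)]
    return sum(math.dist(o, mean) for o in outputs) / len(outputs)

def best_split(pilot, input_attrs):
    """Pick the input attribute whose partition most decreases the
    size-weighted radius of the resulting child nodes."""
    base = radius([out for _, out in pilot])
    best = None
    for attr in input_attrs:
        groups = {}
        for inp, out in pilot:
            groups.setdefault(inp[attr], []).append(out)
        child = sum(len(g) * radius(g) for g in groups.values()) / len(pilot)
        decrease = base - child
        if best is None or decrease > best[1]:
            best = (attr, decrease)
    return best

# Hypothetical pilot sample: ({input attributes}, (output vector)).
pilot = [
    ({"beds": 2, "city": "A"}, (100.0, 900.0)),
    ({"beds": 2, "city": "B"}, (105.0, 950.0)),
    ({"beds": 4, "city": "A"}, (300.0, 2000.0)),
    ({"beds": 4, "city": "B"}, (310.0, 2100.0)),
]
attr, gain = best_split(pilot, ["beds", "city"])
print(attr)  # "beds": it separates the output vectors far better than "city"
```

Applying this rule recursively yields the tree of homogeneous query subspaces used as strata.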

  10. Partition on Space of Output Attributes

  11. Sample Allocation Methods • We have created c*k partitions and c*k subspaces • A pilot sample is drawn • Running c*k-means clustering on the pilot sample generates the c*k partitions • Representative sampling • Goal: good estimates of the statistics of the c*k subspaces • Centers • Proportions

  12. Representative Sampling: Centers • Center of a subspace: the mean vector of all data points belonging to the subspace • Let the sample be S = {DR1, DR2, …, DRn} • For the i-th subspace, the center is estimated as the mean of the sampled records in that subspace: center_i = (1/|S_i|) Σ_{DR ∈ S_i} DR, where S_i ⊆ S is the set of sampled records falling in the i-th subspace
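A minimal sketch of this estimator (hypothetical data; 2-D output vectors assumed for simplicity):

```python
def subspace_centers(sample, assign, c_k):
    """Estimate the center of each of the c*k subspaces as the mean
    vector of the sampled records assigned to it."""
    sums = [[0.0, 0.0] for _ in range(c_k)]  # 2-D outputs for simplicity
    counts = [0] * c_k
    for rec, i in zip(sample, assign):
        counts[i] += 1
        for d in range(2):
            sums[i][d] += rec[d]
    return [tuple(s / counts[i] for s in sums[i]) if counts[i] else None
            for i in range(c_k)]

sample = [(1.0, 2.0), (3.0, 4.0), (10.0, 10.0)]
assign = [0, 0, 1]  # subspace index of each sampled record
print(subspace_centers(sample, assign, 2))  # [(2.0, 3.0), (10.0, 10.0)]
```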

  13. Distance Function • Measures the distance between the c*k estimated centers and the true centers • Uses the Euclidean distance • Integrated variance, computed from the pilot sample • Its value depends on the number of samples drawn from each stratum

  14. Optimized Sample Allocation • Goal: minimize the distance function under a fixed total sample size • Solved using Lagrange multipliers • Strata with large variance receive more samples • When data are spread over a wide area, more samples are needed to represent the population
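The slide does not reproduce the Lagrange-multiplier solution, but allocations of this kind classically take the Neyman form n_i ∝ N_i·σ_i; the sketch below assumes that form and is illustrative only.

```python
def allocate(total_n, stratum_sizes, stratum_stds):
    """Closed-form allocation in the classical Neyman form,
    n_i proportional to N_i * sigma_i: strata whose data are spread
    over a wide area receive more of the sampling budget."""
    weights = [N * s for N, s in zip(stratum_sizes, stratum_stds)]
    total = sum(weights)
    return [round(total_n * w / total) for w in weights]

# Stratum 1 is tight; stratum 2 has 4x the standard deviation.
print(allocate(100, stratum_sizes=[1000, 1000], stratum_stds=[1.0, 4.0]))
# -> [20, 80]
```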

  15. Active-Learning-Based Sampling Method • In machine learning • Passive learning: data are chosen randomly • Active learning: specific data are selected to help build a better model, since obtaining data is costly and/or time-consuming • When stratum i is chosen, the estimated decrease of the distance function is computed • Iterative sampling process • At each iteration, the stratum with the largest decrease of the distance function is selected for sampling • The integrated variance is then updated
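A sketch of the iterative process, assuming an integrated variance of the form Σ_i W_i²σ_i²/n_i, so that one extra draw in stratum i decreases it by W_i²σ_i²/(n_i(n_i+1)); the exact objective on the slide may differ.

```python
def active_allocation(budget, weights, variances, n0=1):
    """Greedy allocation: repeatedly sample the stratum whose next draw
    gives the largest estimated decrease in the integrated variance
    sum_i W_i^2 * sigma_i^2 / n_i (variances come from the pilot sample)."""
    n = [n0] * len(weights)  # every stratum starts with a pilot draw
    for _ in range(budget):
        # Decrease from one more draw in stratum i:
        # W_i^2 sigma_i^2 (1/n_i - 1/(n_i+1)) = W_i^2 sigma_i^2 / (n_i (n_i+1))
        gains = [w * w * v / (ni * (ni + 1))
                 for w, v, ni in zip(weights, variances, n)]
        i = max(range(len(gains)), key=gains.__getitem__)
        n[i] += 1
    return n

# Two equally weighted strata; the second has 16x the variance, so it
# receives most of the budget.
print(active_allocation(18, weights=[0.5, 0.5], variances=[1.0, 16.0]))
```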

  16. Representative Sampling: Proportions • Proportion of a subspace: the fraction of data records belonging to the subspace • Depends on the proportion of the subspace within each stratum, e.g. within the j-th stratum • Risk function: the distance between the estimated fractions and their true values • Iterative sampling process • At each iteration, the stratum with the largest decrease of the risk function is chosen for sampling • The parameters are then updated
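A sketch of the stratum-weighted proportion estimate, assuming p_i = Σ_j (N_j/N)·p_ij with p_ij the sample fraction of subspace i within stratum j (data hypothetical):

```python
def subspace_proportions(stratum_weights, stratum_samples, c_k):
    """Estimate each subspace's population proportion by combining
    per-stratum sample fractions, weighted by relative stratum size:
    p_i = sum_j (N_j / N) * p_ij."""
    props = [0.0] * c_k
    for w, labels in zip(stratum_weights, stratum_samples):
        for i in range(c_k):
            props[i] += w * labels.count(i) / len(labels)
    return props

# Two strata covering 70% / 30% of the population; each inner list holds
# the subspace label of every record sampled from that stratum.
print(subspace_proportions([0.7, 0.3], [[0, 0, 1, 1], [1, 1, 1, 0]], 2))
```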

  17. Stratified K-means Clustering • Weight for data records in the i-th stratum: W_i = N_i / n_i, where N_i is the size of the stratum's population and n_i is the size of its sample • Similar to k-means clustering, but each record is weighted • Center for the i-th cluster: the weighted mean of the records assigned to it
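A sketch of the weighted center update, assuming each record carries the weight W_i = N_i/n_i of its stratum (data hypothetical):

```python
def weighted_centers(sample, weights, assign, k):
    """Cluster-center update for stratified k-means: each sampled record
    carries the weight W_i = N_i / n_i of its stratum, and a cluster
    center is the weighted mean of the records assigned to it."""
    dim = len(sample[0])
    sums = [[0.0] * dim for _ in range(k)]
    wsum = [0.0] * k
    for p, w, c in zip(sample, weights, assign):
        wsum[c] += w
        for d in range(dim):
            sums[c][d] += w * p[d]
    return [tuple(s / wsum[c] for s in sums[c]) for c in range(k)]

# Three records: the first comes from a stratum that is under-sampled
# relative to its population size, so it carries a larger weight.
sample  = [(0.0, 0.0), (2.0, 2.0), (10.0, 10.0)]
weights = [3.0, 1.0, 2.0]   # N_i / n_i for each record's stratum
assign  = [0, 0, 1]
print(weighted_centers(sample, weights, assign, 2))
# -> [(0.5, 0.5), (10.0, 10.0)]
```

The weighting corrects for unequal sampling rates across strata, so the estimated centers reflect the population rather than the sample.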

  18. Experimental Results • Data set: • Yahoo! data set • Data on used cars • 8,000 data records • Metric: Average Distance (AvgDist)

  19. Representative Sampling: Yahoo! Data Set • Benefit of stratification • Compared with rand, the decreases in AvgDist are 7.2%, 13.2%, 15.0% and 16.8% • Benefit of representative sampling • Compared with rand_st, the decreases in AvgDist are 6.6%, 8.5% and 10.5% • Center-based sampling methods perform better • The optimized sampling method performs better in the long run

  20. Conclusion • Clustering over a deep web data source is challenging • We proposed a stratified k-means clustering method for the deep web • Representative sampling for • Centers • Proportions • The experimental results show the efficiency of our approach
