
Index Structures for Querying the Deep Web

Index Structures for Querying the Deep Web. Jian Qiu, Feng Shao, Jayavel Shanmugasundaram (Cornell University); Misha Zatsman (Google).


Presentation Transcript


  1. Index Structures for Querying the Deep Web • Jian Qiu, Feng Shao, Jayavel Shanmugasundaram (Cornell University) • Misha Zatsman (Google)

  2. Deep Web • Keyword queries • Static web pages • Surface web

  3. Deep Web • Keyword queries • Static web pages • Surface web • Deep web: 400-500 times the size of the surface web! • www.ebay.com • eBay databases • Amazon databases • CNN databases • Cars.com databases • …

  4. Deep Glue • Structured queries • Query results • Deep web: 400-500 times the size of the surface web! • eBay database • Amazon database • CNN databases • Cars.com database • …

  5. Deep Glue System • Query: Find textbooks with price < $50 • Result: Database Concepts @ half.com… • Indexer • Query Engine • Index structures (our focus) • Superset of relevant data sources • Internet • Half.com databases …

  6. Index structure for deep web: Challenges • Deal with structured data • Underlying databases are structured • Surface web typically unstructured • Deal with large volumes • Orders of magnitude larger than the size of surface web

  7. Our approach • Understand the structure/typing of the data • Support equality and range queries • Heavily compress the index • Achieve a factor of 10 compression • Tradeoff between compression factor and the number of false positives • Compression factor 10 with only ~10 false positives for 1000 data sources.

  8. Outline • Query model • Index Structures • Experimental Evaluation • Related work and conclusion

  9. Assumptions • Data sources are classified into domains • Online car dealers, online auctions, online travel agents, … • Data sources in the same domain use the same logical relational schema • Indexing attributes • Price, date, make, model, ISBN, … • Indexed by the Deep Glue system • Indexing data can be obtained via • Crawling the deep web [Raghavan 01] • Previously agreed-upon protocols [Froogle]

  10. Query Model • Support equality and range queries • Currently on a single indexing attribute • Schema: Car(Id, Make, Model, Year, Price) • Queries: • Find all year-2003 cars: year = 2003 • Find all cars that cost less than $1000: price < 1000
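
The two example queries on this slide can be written as predicates over a single indexing attribute. The following is a minimal Python sketch; the `Query` class and the helpers `equals` and `less_than` are hypothetical names chosen for illustration, not the Deep Glue system's interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Query:
    """Equality or range predicate on one indexing attribute (hypothetical)."""
    attribute: str
    low: float = float("-inf")
    high: float = float("inf")
    low_inclusive: bool = True
    high_inclusive: bool = True

    def matches(self, value):
        # A value matches when it lies inside the (possibly open) interval.
        above = value > self.low or (self.low_inclusive and value == self.low)
        below = value < self.high or (self.high_inclusive and value == self.high)
        return above and below


def equals(attribute, v):
    # Equality is the degenerate closed range [v, v].
    return Query(attribute, low=v, high=v)


def less_than(attribute, v):
    # Open upper bound: attribute < v.
    return Query(attribute, high=v, high_inclusive=False)


year_2003 = equals("year", 2003)   # "Find all year-2003 cars"
cheap = less_than("price", 1000)   # "Find all cars that cost less than $1000"
```

Treating equality as a one-point range lets a single lookup path serve both query types, which is one plausible way to read "support equality and range queries".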

  11. Outline • Query model • Index Structures • Experimental Evaluation • Related work and conclusion

  12. Overview • Uncompressed Index • Compressed Index, still support equality and range queries • Value Clustered Index (VCI) • DataSource Clustered Index (DCI) • Value DataSource Clustered Index (VDCI) • Histogram Based Index (HBI)

  13. Uncompressed Index (UI) • For each distinct value v of an indexing attribute, stores the list of data sources • UI: B+tree on value, pointing to data source lists • Data sources: d1: ebay.com, d2: amazon.com, …
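
As a rough illustration of the uncompressed index, the sketch below keeps one set of data sources per distinct value. It assumes a sorted key list with binary search as a stand-in for the B+tree; the class name `UncompressedIndex` and the sample pairs are made up for the example.

```python
import bisect


class UncompressedIndex:
    """value -> set of data sources; a sorted key list stands in for the B+tree."""

    def __init__(self, pairs):
        postings = {}
        for value, source in pairs:
            postings.setdefault(value, set()).add(source)
        self._postings = postings
        self._keys = sorted(postings)

    def lookup(self, value):
        # Equality query: exact list of data sources for this value.
        return set(self._postings.get(value, ()))

    def range(self, low, high):
        # Range query: data sources holding any value v with low <= v <= high.
        lo = bisect.bisect_left(self._keys, low)
        hi = bisect.bisect_right(self._keys, high)
        result = set()
        for key in self._keys[lo:hi]:
            result |= self._postings[key]
        return result


# Hypothetical (value, data source) pairs.
ui = UncompressedIndex([(1, "d2"), (1, "d3"), (2, "d1"), (3, "d1"), (3, "d4")])
```

Because every (value, data source) pair is stored, lookups return no false positives; the cost is space, which motivates the compressed variants.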

  14. Problems • A huge number of values and data sources in the deep web! • Indexing every indexing attribute requires space • Need to compress UI! • Use gzip? • Have to uncompress the index → index lookup too expensive! • Need new compression techniques

  15. Overview • Uncompressed Index • Compressed Index, still support equality and range queries • Value Clustered Index (VCI) • DataSource Clustered Index (DCI) • Value DataSource Clustered Index (VDCI) • Histogram Based Index (HBI)

  16. Value Clustered Index (VCI) • Intuition: • “Closely related” values are stored in “closely related” data sources • E.g., ISBNs of antique books are found at online book retailers specializing in antique books • Cluster “closely related” values • Store one list of data sources per cluster, not per value

  17. VCI Example • Cluster 1: {1, 6} • Cluster 2: {2, 5} • Cluster 3: {3, 4} • VCI structures: B+tree on value; union of data sources per cluster • False positives • E.g., value 1 → data source d1 • Tradeoff between space and accuracy • Extremes: mapping all values into one cluster vs. mapping each distinct value into a separate cluster
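
The clustering idea can be sketched as follows, using the three value clusters from the slide. The per-value data source sets in `ds` are hypothetical (the slide's figure is not in the transcript); they are chosen so that value 1 yields d1 as a false positive, as in the slide.

```python
class ValueClusteredIndex:
    """Stores one union of data sources per value cluster (illustrative sketch)."""

    def __init__(self, ds, clusters):
        # ds: value -> set of data sources; clusters: list of sets of values.
        self._cluster_of = {}
        self._sources = []
        for cid, values in enumerate(clusters):
            union = set()
            for v in values:
                self._cluster_of[v] = cid
                union |= ds.get(v, set())
            self._sources.append(union)

    def lookup(self, value):
        # Returns a superset of the true data sources for the value.
        cid = self._cluster_of.get(value)
        return set() if cid is None else set(self._sources[cid])


# Hypothetical data: value 1 is only in d2, but shares a cluster with value 6.
ds = {1: {"d2"}, 6: {"d1", "d2"}, 2: {"d3"}, 5: {"d3"}, 3: {"d4"}, 4: {"d4", "d5"}}
vci = ValueClusteredIndex(ds, [{1, 6}, {2, 5}, {3, 4}])
false_positives = vci.lookup(1) - ds[1]
```

With one cluster per value the index degenerates to the uncompressed index; with a single cluster every lookup returns all data sources.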

  18. VCI Implementation • Use an existing scalable clustering algorithm • Scales to large data sets: BIRCH framework [Zhang96] • Minimize the number of false positives • Specify the parameters for BIRCH • Centroid, the mid-point of a cluster • Radius, a measure of quality for a cluster • Distance between clusters • [Figure labels: cluster1, cluster2, centroid, radius, distance]

  19. VCI formulae • For a cluster with the set of values V, where ds(v) is the set of data sources for value v: • centroid(V): the data sources associated with the cluster • radius(V): the sum of the number of false positives • distance(V1, V2): the additional number of false positives when merging two clusters
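
The slide's equations did not survive in this transcript, so the sketch below is only one reading of its annotations, not the authors' definitions: the centroid is taken to be the union of ds(v) over the cluster, the radius the total number of false positives that union causes, and the distance the increase in radius when two clusters merge. The data in `ds` is hypothetical.

```python
def centroid(values, ds):
    # Assumed: data sources associated with the cluster = union of ds(v).
    result = set()
    for v in values:
        result |= ds[v]
    return result


def radius(values, ds):
    # Assumed: sum over values of the false positives the centroid introduces.
    c = centroid(values, ds)
    return sum(len(c - ds[v]) for v in values)


def distance(v1, v2, ds):
    # Assumed: additional false positives incurred by merging the two clusters.
    return radius(v1 | v2, ds) - radius(v1, ds) - radius(v2, ds)


# Hypothetical data.
ds = {1: {"d2"}, 6: {"d1", "d2"}, 2: {"d3"}}
merge_cost = distance({1, 6}, {2}, ds)
```

Under this reading, a clustering algorithm that merges the closest clusters first is greedily minimizing the number of false positives added per merge.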

  20. Overview • Uncompressed Index • Compressed Index: • Value Clustered Index (VCI) • DataSource Clustered Index (DCI) • Value DataSource Clustered Index (VDCI) • Histogram Based Index (HBI)

  21. DataSource Clustered Index (DCI) • Intuition: “closely related” data sources may have “closely related” sets of values • Amazon and B&N have similar sets of ISBNs • In the data graph, VCI clusters rows and DCI clusters columns • Cluster 1: {d2, d3, d6} • Cluster 2: {d4, d5} • Cluster 3: {d1, d7, d8} • Table structures are similar to VCI • See paper for other details • [Figure axes: value, data source]
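
A possible shape for DCI, using the slide's three data source clusters: each value maps to the clusters that contain at least one of its data sources, and a lookup expands those clusters to their members. The slide defers details to the paper, so this lookup rule, the class name, and the data in `ds` are all assumptions.

```python
class DataSourceClusteredIndex:
    """value -> clusters of data sources (illustrative sketch, not the paper's layout)."""

    def __init__(self, ds, source_clusters):
        self._members = [set(c) for c in source_clusters]
        cluster_of = {s: cid for cid, c in enumerate(self._members) for s in c}
        # For each value, remember only the ids of clusters it touches.
        self._clusters_for = {
            v: {cluster_of[s] for s in sources} for v, sources in ds.items()
        }

    def lookup(self, value):
        # Expanding a cluster returns all its members, hence possible false positives.
        result = set()
        for cid in self._clusters_for.get(value, ()):
            result |= self._members[cid]
        return result


# Hypothetical data over the slide's clusters.
ds = {1: {"d2", "d3"}, 2: {"d4"}, 3: {"d1", "d6"}}
dci = DataSourceClusteredIndex(
    ds, [{"d2", "d3", "d6"}, {"d4", "d5"}, {"d1", "d7", "d8"}]
)
```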

  22. Overview • Uncompressed Index • Compressed Index: • Value Clustered Index (VCI) • DataSource Clustered Index (DCI) • Value DataSource Clustered Index (VDCI) • Histogram Based Index (HBI)

  23. Value-DataSource Clustered Index (VDCI) • VCI, DCI: cluster in 1 dimension • VDCI: clusters in 2 dimensions, generalizes VCI/DCI • Cluster: a set of values and a set of data sources • Cluster 1: { {2,3}, {d2,d3,d4} } • Cluster 2: { {4,5}, {d4,d5,d6} } • Cluster 3: { {1,2}, {d6,d7,d8} } • Data source d4 is in two clusters • Value 2 is in two clusters • Table structures are similar to VCI • See paper for other details • [Figure axes: value, data source]
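
The two-dimensional clusters on this slide can be queried as sketched below. The cluster contents are taken from the slide; the lookup rule (union the data source sets of every cluster whose value set contains the queried value) is an assumption, and `vdci_lookup` is a hypothetical name.

```python
# Each cluster pairs a set of values with a set of data sources (from the slide).
clusters = [
    ({2, 3}, {"d2", "d3", "d4"}),
    ({4, 5}, {"d4", "d5", "d6"}),
    ({1, 2}, {"d6", "d7", "d8"}),
]


def vdci_lookup(value, clusters):
    # A value may belong to several clusters, so collect sources from all of them.
    result = set()
    for values, sources in clusters:
        if value in values:
            result |= sources
    return result


sources_for_2 = vdci_lookup(2, clusters)
```

Value 2 appears in clusters 1 and 3, so its candidate set spans both; value 4 appears only in cluster 2.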

  24. Overview • Uncompressed Index • Compressed Index: • Value Clustered Index (VCI) • DataSource Clustered Index (DCI) • Value DataSource Clustered Index (VDCI) • Histogram Based Index (HBI)

  25. Histogram Based Index (HBI) • VCI/VDCI don’t consider the ordering among values • Range queries need it • HBI groups adjacent values in the same cluster • Also need to ensure accuracy • Use a threshold to determine the boundary of a cluster • Threshold: average number of false positives in a cluster

  26. HBI Example • Threshold: 2 • Cluster adjacent values • Cluster 1: {1} • Cluster 2: {2,3,4} • Cluster 3: {5,6} • [Figure axes: value, data source]
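
One way to realize threshold-based grouping of adjacent values is a greedy left-to-right scan, sketched below. The greedy rule is an assumption (the slides do not spell out the algorithm), and the data in `ds` is hypothetical, chosen so that a threshold of 2 reproduces the slide's clusters {1}, {2,3,4}, {5,6}.

```python
def avg_false_positives(values, ds):
    # Average, over the bucket's values, of sources in the union but not in ds(v).
    union = set().union(*(ds[v] for v in values))
    return sum(len(union - ds[v]) for v in values) / len(values)


def build_buckets(ds, threshold):
    # Scan values in order; close the bucket when adding the next value
    # would push the average number of false positives above the threshold.
    buckets, current = [], []
    for v in sorted(ds):
        if current and avg_false_positives(current + [v], ds) > threshold:
            buckets.append(current)
            current = []
        current.append(v)
    if current:
        buckets.append(current)
    return buckets


def range_lookup(buckets, ds, low, high):
    # Union the sources of every bucket that overlaps [low, high].
    result = set()
    for bucket in buckets:
        if bucket[0] <= high and bucket[-1] >= low:
            result |= set().union(*(ds[v] for v in bucket))
    return result


# Hypothetical data.
ds = {
    1: {"d1", "d2", "d3"},
    2: {"d4", "d5", "d6"},
    3: {"d4", "d5"},
    4: {"d5", "d6", "d7"},
    5: {"d1", "d8"},
    6: {"d1", "d2", "d8"},
}
buckets = build_buckets(ds, threshold=2)
```

Because each bucket covers a contiguous run of values, a range query only has to visit the buckets whose boundaries overlap the range.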

  27. Outline • Query model • Index Structures • Experimental Evaluation • Related work and conclusion

  28. Experimental setup • Synthetic data • 1000 data sources, 100,000 values, 4,000,000 (value,data source) pairs • Other parameters are in the paper • Metrics • Index creation time • Compression factor • False positives • Setup • 2.8GHz Pentium IV, 1GB memory, 80GB disk • C++
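
The compression factor and false positive metrics can be illustrated on a toy value clustering. The definitions below are assumptions for the sketch: size is counted in stored data source entries rather than bytes, and false positives are averaged per equality query over all values. The function names and data are hypothetical.

```python
def compression_factor(ds, clusters):
    # Uncompressed: one entry per (value, data source) pair.
    uncompressed = sum(len(sources) for sources in ds.values())
    # Clustered: one entry per (cluster, data source) pair.
    compressed = sum(len(set().union(*(ds[v] for v in c))) for c in clusters)
    return uncompressed / compressed


def mean_false_positives(ds, clusters):
    # Average number of extra data sources returned per equality query.
    # Assumes the clusters partition the values in ds.
    total = 0
    for c in clusters:
        union = set().union(*(ds[v] for v in c))
        total += sum(len(union - ds[v]) for v in c)
    return total / len(ds)


# Hypothetical data: three similar values clustered together, one on its own.
ds = {1: {"d1", "d2"}, 2: {"d1", "d2"}, 3: {"d1", "d2", "d3"}, 4: {"d4"}}
clusters = [{1, 2, 3}, {4}]
factor = compression_factor(ds, clusters)
fp = mean_false_positives(ds, clusters)
```

Coarser clusters raise the compression factor and the false positive count together, which is the tradeoff the experiments measure.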

  29. Index creation time

  30. Equality queries (1000 data sources)

  31. Range Queries (1000 data sources)

  32. Outline • Query model • Index Structures • Experimental Evaluation • Related work and conclusion

  33. Related work • Distributed database & information integration • Niagara system [Naughton01] • GlOSS [Gravano99] • … • Database/Inverted list compression • Query Optimization in Compressed Databases [Chen 01] • Compressing the Relations and Index [Goldstein 98] • Improved Query Performance with Variant Indices [O’Neill 97] • Implementation and Performance of Compressed Databases [Westmann 00] • Size Reduction of Inverted Files [Weiss 90] • …

  34. Conclusion • Space-efficient index structures for querying the deep web • Support equality and range queries • A factor of 10 compression with a small loss in precision • Future work • Combine cluster-based and histogram-based indexes • Multi-attribute queries • Joins • Incremental index maintenance

  35. Questions?

  36. Experimental setup • Other parameters: • Number of groups • Data sources in the same group use the same distribution to generate their values • Default 20 • Group mode • How many groups a data source belongs to • Default 1 • Value correlation • How the order in the value space maps to the value ordering over which the Gaussian distribution is used • Default 0.2
