Index Structures for Querying the Deep Web

Jian Qiu, Feng Shao, Jayavel Shanmugasundaram (Cornell University) • Misha Zatsman (Google)


Deep Web • Surface web: static web pages, reached by keyword queries • Deep web: databases behind sites such as www.ebay.com • Ebay databases, Amazon databases, CNN databases, Cars.com databases, … • 400-500 times the size of surface web!

Deep Glue • Structured queries in, query results out • Deep web: 400-500 times the size of surface web! • Ebay database, Amazon database, CNN databases, Cars.com database, …

Deep Glue System • Example query: Find textbooks with price < $50 • Example result: Database Concepts @ half.com… • Components: Indexer, Query Engine, Index structures (our focus) • The index returns a superset of relevant data sources • Data sources on the Internet: Half.com databases, …

Index structure for deep web: Challenges • Deal with structured data • Underlying databases are structured • Surface web typically unstructured • Deal with large volumes • Orders of magnitude larger than the size of surface web

Our approach • Understand the structure/typing of the data • Support equality and range queries • Heavily compress the index • Achieve a factor of 10 compression • Tradeoff between compression factor and the number of false positives • Compression factor 10 with only ~10 false positives for 1000 data sources.

Outline • Query model • Index Structures • Experimental Evaluation • Related work and conclusion

Assumptions • Data sources are classified into domains • Online car dealers, online auctions, online travel agents, … • Data sources in the same domain use the same logical relational schema • Indexing attributes • Price, date, make, model, isbn, … • Indexed by Deep Glue system • Indexing data can be obtained via • Crawling the deep web [Raghavan 01] • Previously agreed-upon protocols [Froogle]

Query Model • Support equality and range queries • currently on a single indexing attribute • Schema: Car(Id,Make,Model,Year,Price) • Queries: • Find all year 2003 cars, year = 2003 • Find all cars that cost less than $1000, price < 1000

Overview • Uncompressed Index • Compressed Index, still support equality and range queries • Value Clustered Index (VCI) • DataSource Clustered Index (DCI) • Value DataSource Clustered Index (VDCI) • Histogram Based Index (HBI)

Uncompressed Index (UI) • For each distinct value v of an indexing attribute, stores the list of data sources that contain v • UI structure: B+tree on value, each entry pointing to a list of data sources • d1: ebay.com, d2: amazon.com, …
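A minimal sketch of the uncompressed index, assuming a sorted in-memory list as a stand-in for the B+tree on value. The class name `UncompressedIndex`, its method names, and the sample pairs are hypothetical and only illustrate how equality and range lookups would work.

```python
from bisect import bisect_left, bisect_right


class UncompressedIndex:
    """Maps each distinct value to the set of data sources that contain it.

    A sorted list of values stands in for the B+tree (an assumption of
    this sketch, not the structure used in the actual system).
    """

    def __init__(self, pairs):
        postings = {}
        for value, source in pairs:
            postings.setdefault(value, set()).add(source)
        self.values = sorted(postings)
        self.sources = [postings[v] for v in self.values]

    def equality(self, v):
        """Data sources that contain exactly the value v."""
        i = bisect_left(self.values, v)
        if i < len(self.values) and self.values[i] == v:
            return set(self.sources[i])
        return set()

    def range_query(self, lo, hi):
        """Data sources that contain at least one value in [lo, hi]."""
        i = bisect_left(self.values, lo)
        j = bisect_right(self.values, hi)
        result = set()
        for s in self.sources[i:j]:
            result |= s
        return result


# Hypothetical (year, data source) pairs.
ui = UncompressedIndex([(2003, "d1"), (2003, "d2"), (1999, "d2"), (2005, "d3")])
year_2003 = ui.equality(2003)
since_2000 = ui.range_query(2000, 2005)
```

Because every distinct value keeps its own list, the index is exact but its size grows with the number of (value, data source) pairs, which is the problem the compressed variants address.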

Problems • A huge number of values and data sources in deep web!! • Indexing every indexing attribute requires space • Need to compress UI! • Use gzip? • Have to uncompress the index → index lookup too expensive! • Need new compression techniques

Overview • Uncompressed Index • Compressed Index, still support equality and range queries • Value Clustered Index (VCI) • DataSource Clustered Index (DCI) • Value DataSource Clustered Index (VDCI) • Histogram Based Index (HBI)

Value Clustered Index (VCI) • Intuition: • “closely related” values are stored in “closely related” data sources • e.g., ISBN numbers of antique books appear in the online book retailers specializing in antique books • Cluster “closely related” values • Store the list of data sources only once per cluster

VCI Example • Cluster 1: {1, 6} • Cluster 2: {2, 5} • Cluster 3: {3, 4} • Each cluster stores the union of the data sources of its values • VCI structures: B+tree on value • False positives • value 1 → data source d1 • Tradeoff between space and accuracy • Mapping all values in one cluster • Mapping each distinct value into a separate cluster • [Figure: value × data source matrix]
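The example can be sketched as follows, assuming the three value clusters from the slide and a made-up `ds` mapping (the original value × data source matrix is not recoverable from the transcript). The function names `build_vci` and `vci_lookup` are hypothetical.

```python
def build_vci(ds, clusters):
    """Store one data-source list per value cluster (the union over its values)."""
    value_to_cluster = {}
    cluster_sources = []
    for cid, values in enumerate(clusters):
        union = set()
        for v in values:
            value_to_cluster[v] = cid
            union |= ds[v]
        cluster_sources.append(union)
    return value_to_cluster, cluster_sources


def vci_lookup(index, v):
    """Return the data sources of v's cluster: a superset of ds[v]."""
    value_to_cluster, cluster_sources = index
    cid = value_to_cluster.get(v)
    return set() if cid is None else set(cluster_sources[cid])


# Hypothetical data: value -> data sources that really contain it.
ds = {1: {"d2"}, 6: {"d1", "d2"}, 2: {"d3"}, 5: {"d3"}, 3: {"d4"}, 4: {"d5"}}
clusters = [{1, 6}, {2, 5}, {3, 4}]

index = build_vci(ds, clusters)
answer = vci_lookup(index, 1)
false_positives = answer - ds[1]  # d1 is returned for value 1 only because of value 6
```

With one cluster per distinct value the lookup is exact and nothing is saved; with a single cluster the index is tiny but every lookup returns every data source.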

VCI Implementation • Use existing scalable algorithm • Scales to large data sets: Birch Framework [Zhang96] • Minimize the number of false positives • Specify the parameters for Birch • Centroid, the mid-point of a cluster • Radius, a measure of quality for a cluster • Distance between clusters • [Figure: cluster1 and cluster2, annotated with centroid, radius, and distance]

VCI formulae • For a cluster with the set of values V, where ds(v) is the set of data sources for value v • centroid(V): data sources associated with the cluster • radius(V): sum of number of false positives • distance(V1, V2): additional number of false positives when merging two clusters
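The formulas themselves did not survive in the transcript, so the following is only one plausible reading of the three labels, not the paper's definitions: the centroid is taken as the union of ds(v) over the cluster, the radius as the total number of data sources wrongly returned across the cluster's values, and the distance as the increase in that total caused by a merge. The sample `ds` is hypothetical.

```python
def centroid(V, ds):
    """Assumed reading: all data sources associated with the cluster."""
    out = set()
    for v in V:
        out |= ds[v]
    return out


def radius(V, ds):
    """Assumed reading: false positives summed over every value in the cluster."""
    c = centroid(V, ds)
    return sum(len(c - ds[v]) for v in V)


def distance(V1, V2, ds):
    """Assumed reading: extra false positives introduced by merging V1 and V2."""
    return radius(V1 | V2, ds) - radius(V1, ds) - radius(V2, ds)


# Hypothetical data: value -> data sources that really contain it.
ds = {1: {"d1", "d2"}, 2: {"d2"}, 3: {"d3"}}
r_12 = radius({1, 2}, ds)           # value 2 wrongly gets d1
d_merge = distance({1, 2}, {3}, ds)
```

Under this reading, a clustering algorithm that merges the closest clusters first is greedily choosing the merge that adds the fewest false positives.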

Overview • Uncompressed Index • Compressed Index: • Value Clustered Index (VCI) • DataSource Clustered Index (DCI) • Value DataSource Clustered Index (VDCI) • Histogram Based Index (HBI)

DataSource Clustered Index (DCI) • Intuition: “closely related” data sources may have “closely related” sets of values • Amazon and b&n have similar sets of ISBN numbers • In the data graph, VCI clusters rows and DCI clusters columns • Cluster 1: {d2,d3,d6} • Cluster 2: {d4,d5} • Cluster 3: {d1,d7,d8} • Table structures are similar to VCI • See paper for other details • [Figure: data graph with axes value and data source]
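A rough sketch of the column-clustering idea, assuming the three data source clusters from the slide. The slides defer the table layout to the paper, so the representation here (each value maps to cluster ids, and a lookup expands each id to all its members) is an assumption, as are the function names and the sample `ds`.

```python
def build_dci(ds, source_clusters):
    """For each value, keep only the ids of the data-source clusters it touches."""
    source_to_cluster = {
        s: cid for cid, members in enumerate(source_clusters) for s in members
    }
    return {
        v: {source_to_cluster[s] for s in sources} for v, sources in ds.items()
    }


def dci_lookup(value_to_clusters, source_clusters, v):
    """Expand every matching cluster to all of its member data sources."""
    result = set()
    for cid in value_to_clusters.get(v, ()):
        result |= source_clusters[cid]
    return result


source_clusters = [{"d2", "d3", "d6"}, {"d4", "d5"}, {"d1", "d7", "d8"}]
# Hypothetical data: value -> data sources that really contain it.
ds = {1: {"d2", "d3"}, 2: {"d4"}, 3: {"d1", "d6"}}

dci = build_dci(ds, source_clusters)
answer = dci_lookup(dci, source_clusters, 1)
false_positives = answer - ds[1]  # d6 shares a cluster with d2 and d3
```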

Value-DataSource Clustered Index (VDCI) • VCI, DCI: cluster in 1 dimension • VDCI: clusters in 2 dimensions, generalizes VCI/DCI • Cluster: a set of values and a set of data sources • Cluster 1: { {2,3}, {d2,d3,d4} } • Cluster 2: { {4,5}, {d4,d5,d6} } • Cluster 3: { {1,2}, {d6,d7,d8} } • Data source d4 is in two clusters • Value 2 is in two clusters • Table structures are similar to VCI • See paper for other details • [Figure: data graph with axes value and data source]
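A small sketch of a lookup over the three two-dimensional clusters listed above. It assumes that a value's answer is the union of the data source sets of every cluster containing that value; the linear scan and the name `vdci_lookup` are simplifications of this sketch, not the table structures described in the paper.

```python
def vdci_lookup(clusters, v):
    """Union the data-source sets of every (values, sources) cluster containing v."""
    result = set()
    for values, sources in clusters:
        if v in values:
            result |= sources
    return result


# The clusters from the slide, as (set of values, set of data sources) pairs.
clusters = [
    ({2, 3}, {"d2", "d3", "d4"}),
    ({4, 5}, {"d4", "d5", "d6"}),
    ({1, 2}, {"d6", "d7", "d8"}),
]

# Value 2 belongs to Cluster 1 and Cluster 3, so both source sets are returned.
answer = vdci_lookup(clusters, 2)
```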

Histogram Based Index (HBI) • VCI/VDCI don’t consider the ordering among values • Range queries need this ordering • HBI groups adjacent values in the same cluster • Also need to ensure accuracy • Use a threshold to determine the boundary of a cluster • Threshold: average number of false positives in a cluster

HBI Example • Threshold: 2 • Cluster adjacent values • Cluster 1: {1} • Cluster 2: {2,3,4} • Cluster 3: {5,6} • [Figure: value × data source matrix]
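A greedy sketch of histogram-style bucketing, under two assumptions not stated in the slides: values are scanned in sorted order, and a bucket is closed when adding the next value would push the average number of false positives per value above the threshold. The `ds` mapping is made up and was chosen so that a threshold of 2 yields the same bucket boundaries as the example; `build_hbi` and `hbi_range` are hypothetical names.

```python
def build_hbi(ds, threshold):
    """Group adjacent values into buckets of the form (lo, hi, data sources)."""
    buckets = []
    cur_vals, cur_union = [], set()
    for v in sorted(ds):
        cand_vals = cur_vals + [v]
        cand_union = cur_union | ds[v]
        # False positives if the bucket were extended to include v.
        fp = sum(len(cand_union - ds[x]) for x in cand_vals)
        if cur_vals and fp / len(cand_vals) > threshold:
            buckets.append((cur_vals[0], cur_vals[-1], cur_union))
            cur_vals, cur_union = [v], set(ds[v])
        else:
            cur_vals, cur_union = cand_vals, cand_union
    if cur_vals:
        buckets.append((cur_vals[0], cur_vals[-1], cur_union))
    return buckets


def hbi_range(buckets, lo, hi):
    """Union the data sources of every bucket that overlaps [lo, hi]."""
    result = set()
    for b_lo, b_hi, sources in buckets:
        if b_lo <= hi and lo <= b_hi:
            result |= sources
    return result


# Hypothetical data: value -> data sources that really contain it.
ds = {
    1: {"d1", "d2", "d3", "d4"},
    2: {"d5", "d6"},
    3: {"d5", "d6", "d7"},
    4: {"d6", "d7"},
    5: {"d1", "d2", "d8"},
    6: {"d1", "d8"},
}

buckets = build_hbi(ds, threshold=2)
boundaries = [(lo, hi) for lo, hi, _ in buckets]
extra = hbi_range(buckets, 4, 5) - (ds[4] | ds[5])  # false positives for 4 <= value <= 5
```

Because each bucket covers a contiguous value range, a range query touches only the buckets that overlap it, at the cost of the false positives contributed by the partially covered buckets at either end.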

Experimental setup • Synthetic data • 1000 data sources, 100,000 values, 4,000,000 (value,data source) pairs • Other parameters are in the paper • Metrics • Index creation time • Compression factor • False positives • Setup • 2.8GHz Pentium IV, 1GB memory, 80GB disk • C++

Index creation time

Equality queries (1000 data sources)

Range Queries (1000 data sources)

Related work • Distributed database & information integration • Niagara system [Naughton01] • GlOSS [Gravano99] • … • Database/Inverted list compression • Query Optimization in Compressed Databases [Chen 01] • Compressing Relations and Indexes [Goldstein 98] • Improved Query Performance with Variant Indexes [O’Neil 97] • Implementation and Performance of Compressed Databases [Westmann 00] • Size Reduction of Inverted Files [Weiss 90] • …

Conclusion • Space-efficient index structures for querying the deep web • Support equality and range queries • A factor of 10 compression with a small loss in precision • Future work • Combine cluster-based and histogram-based indexes • Multi-attribute queries • Joins • Incremental index maintenance

Questions?

Experimental setup • Other parameters: • Number of groups • The data sources in the same group use the same distribution to generate the values • Default 20 • Group mode • How many groups a data source belongs to • Default 1 • Value correlation • How the order in the value space maps to the value ordering over which the Gaussian distribution is used • Default 0.2
