1 / 20

An Improved Indexing Scheme for Range Queries

An Improved Indexing Scheme for Range Queries. Yvonne Yao Adviser: Professor Huiping Guo. Database-as-a-Service. Business organizations handle a large amount of data (TB) Cost of managing and maintaining these data onsite is high DAS DBMSs outsourcing

Download Presentation

An Improved Indexing Scheme for Range Queries

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. An Improved Indexing Scheme for Range Queries Yvonne Yao Adviser: Professor Huiping Guo

  2. Database-as-a-Service • Business organizations handle a large amount of data (TB) • Cost of managing and maintaining these data onsite is high • DAS • DBMSs outsourcing • Clients rely on service providers for data management and maintenance • Cost is a lot lowered. • But…

  3. Database-as-a-Service • Security of data is not guaranteed • Service providers are untrusted • Store only an encrypted form of data onto the remote server • Only users with the correct key(s) can have access • How then can we query the encrypted data? • Retrieve and decrypt the entire table, and apply SQL statements on it. Too expensive! • A more realistic approach was discovered

  4. Database-as-a-Service

  5. Bucketization • Various approaches to build meta-data: B+-tree based, hash-based, and bucket-based • What is bucketization? • Partition of attribute data into several buckets • Each bucket is identified by an ID • Bucket IDs are stored, along with encrypted data, on the remote server • Client keeps partition information as meta-data • General bucketization approach • Equi-width • Equi-depth

  6. Example 1

  7. Example 1

  8. Example 1 • User query: SELECT * FROM grades WHERE gpa < 3.0 • Qserver: SELECT * FROM egrades WHERE gpaID = ‘Bucket_1’ OR gpaID = ‘Bucket_2’ OR gpaID = ‘Bucket_3’ • Size of superset is 29, of which 7 of them are false positives

  9. Query Optimal Bucketization • General idea: minimizing the bucket cost of each bucket • Input: <D = (V, F), M> • V = {v1, v2, v3, …, vn} where v1 < v2 < v3 < … <vn • F = Frequency of each value • M = Number of buckets to fill • Output: a matrix indicating the boundary of each bucket

  10. Query Optimal Bucketization • QOB • Finds optimum solutions to two smaller sub-problems • one contains the leftmost M-1 buckets covering the (n-i) smallest points • Another contains the rightmost single bucket covering the remaining i points V = {v1, v2, v3, v4, v5, v6, …, vn-3, vn-2, vn-1, vn} n-i points go to last i points go to M-1 buckets last bucket

  11. Example 2

  12. Example 2 • Qserver: SELECT * FROM egrades WHERE gpaID = ‘Bucket_1’ OR gpaID = ‘Bucket_2’ OR gpaID = ‘Bucket_3’ • Same as the general bucketization method • In most cases, QOB can outperform the conventional bucketization strategy, but not always

  13. Deviation Bucketization • Built upon QOB, takes the same parameters • Has two levels of buckets • First level: same as those produced by QOB • Second level: bucketization of deviation values, the difference between the value itself to the average of the bucket • Each first-level-bucket has at most M second level buckets • QOB has at most M buckets, while DB has at most M2 buckets

  14. Deviation Bucketization • DB • Run QOB (D, M) • Construct First-Level-Buckets from boundary matrix • For each First-Level-Bucket • Initialize empty datasets vi’ and fi’ • For each vi in the bucket • vi’ = vi’∪vi’ – avg() • fi’ = fi’ ∪ 1 • Create a new dataset di = (vi’, fi’) • Run QOB(di, M)

  15. Example 3

  16. Example 3 • Qserver: SELECT * FROM egrades WHERE gpaID = ‘Bucket_1’ OR gpaID = ‘Bucket_2’ OR gpaID = ‘Bucket_3_1’ OR gpaID = ‘Bucket_3_2’ • In this case, no false positives are returned • Generally, false positives will still be returned, just the number of them will be greatly reduced

  17. Experiments • Two datasets • Synthetic dataset: 105 integers from [0, 999] • Real dataset: 103 data points from the Aspect column of the Forest CoverType databasein UCI’s KDD Archive • Two sets of queries • Qsyn • Qreal

  18. Experiment 1

  19. Experiment 2

  20. Thank You

More Related