
Lab of Web And Mobile Data Management ( WAMDM ) Youzhong MA



Presentation Transcript


  1. Index for Cloud Data Management. Lab of Web And Mobile Data Management (WAMDM). Youzhong MA

  2. Outline • Motivating Applications • Existing Technologies • Conclusions & Future work

  3. Motivating Application • Big Data in a Private Cloud (Cloud System) • Table: Product • select sum(number) from Product where product.name = 'beer' and product.price <= 10$ and product.price >= 5$ • Queries on multiple attributes and on non-rowkey columns are quite common!

  4. Motivating Application: Mobile Coupon Distribution • Distribution Policy • Area • # of coupons [Figure labels: Current Location, Coupon, Mobile Coupon Distributor]

  5. Motivating Application: Mobile Coupon Distribution • 125,000,000 subscribers in Japan • System Scalability • Large amounts of Data • High Throughput • Efficient Complex Queries • Multi-Dimensional Query • Nearest Neighbors Query • Distribution Policy • Area • # of coupons [Figure labels: Current Location, Coupon]

  6. Outline • Motivating Applications • Existing Technologies • Conclusions & Future work

  7. Existing Technologies at a reasonable price

  8. Solutions-overview Local Index + Global Index CAS NEC

  9. Efficient B-tree Based Indexing for Cloud Data Processing. S. Wu, D. Jiang, B. C. Ooi, and K.-L. Wu. PVLDB'10

  10. Efficient B-tree Based Indexing for Cloud Data Processing • Motivation • Design a scalable, high-throughput indexing scheme that supports efficient queries over huge volumes of data in the cloud • Keep maintenance cost low while still supporting parallel search

  11. System Architecture [Figure labels: BATON overlay network, publish, Local Index]

  12. Challenges • How to select the local B+-tree nodes to publish in the global index? • How to organize the global index? • How to maximize the throughput?

  13. Selecting local B+-tree nodes • Cost modeling • Query cost • routing cost (1: search in global index) • local search cost (2: search in local index) • Update cost • Cost parameters: cost of sending an index message, cost of random I/O

  14. Adaptive indexing strategy • Index expand • Index collapse Local Index

  15. BATON: Balanced Tree Overlay Network • A distributed tree structure for P2P systems • Supporting range search

  16. Index Construction • Assign a range to each node • For each node n • The range of its left sub-tree is less than that of n • The range of its right sub-tree is larger than that of n
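The range-assignment rule on this slide can be sketched with an ordinary binary tree. This is a simplified illustration under stated assumptions, not BATON itself: the names `Node`, `assign_ranges`, and `lookup` are hypothetical, ranges are split evenly, and lookups start at the root, whereas the real overlay routes between peers.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Node:
    name: str
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    key_range: Optional[Tuple[int, int]] = None  # half-open interval [lo, hi)


def in_order(node: Optional[Node]) -> List[Node]:
    """Left subtree, then the node, then the right subtree."""
    if node is None:
        return []
    return in_order(node.left) + [node] + in_order(node.right)


def assign_ranges(root: Node, lo: int, hi: int) -> None:
    """Give each node a contiguous slice of [lo, hi) in in-order sequence.

    After this, everything in a node's left subtree covers smaller keys
    and everything in its right subtree covers larger keys.
    """
    nodes = in_order(root)
    width = (hi - lo) // len(nodes)
    for i, node in enumerate(nodes):
        start = lo + i * width
        end = hi if i == len(nodes) - 1 else start + width
        node.key_range = (start, end)


def lookup(root: Node, key: int) -> Optional[Node]:
    """Walk down the tree using the ordering property (assumed routing)."""
    node = root
    while node is not None:
        lo, hi = node.key_range
        if key < lo:
            node = node.left
        elif key >= hi:
            node = node.right
        else:
            return node
    return None


root = Node("b", left=Node("a"), right=Node("c"))
assign_ranges(root, 0, 90)
print(lookup(root, 75).name)
```

Because every node, internal or leaf, owns a slice of the key space, a range search can be answered by visiting the nodes whose slices overlap the queried range.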

  17. Publish local B+-tree node to BATON

  18. Maximizing the throughput • Eventually consistent model • Lazy update • If the update does not affect the key range of a local B+-tree, the stale index will not affect the correctness of query processing. • Eager update • Updates in the left-most and right-most nodes
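The lazy/eager distinction can be sketched as a single check on the published key range. This is a minimal illustration, not the paper's implementation: `PublishedNode` and `apply_insert` are hypothetical names, and the assumption is that an insert requires an eager global-index update only when its key falls outside the range that was published.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PublishedNode:
    """A local B+-tree node as it is known to the global index (assumed shape)."""
    lo: int
    hi: int
    pending: List[int] = field(default_factory=list)  # updates not yet pushed


def apply_insert(node: PublishedNode, key: int) -> str:
    """Return 'lazy' if the global index can stay stale, else 'eager'."""
    if node.lo <= key <= node.hi:
        # The published range still covers the key, so queries routed
        # through the stale global entry still reach the right node.
        node.pending.append(key)
        return "lazy"
    # The key extends the range at the left or right edge, so the global
    # index would route queries incorrectly until it is refreshed.
    node.lo = min(node.lo, key)
    node.hi = max(node.hi, key)
    return "eager"


node = PublishedNode(lo=10, hi=50)
print(apply_insert(node, 20), apply_insert(node, 60))
```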

  19. Pros and cons • Pros • Supports efficient point queries and range queries on non-rowkey attributes • Proposes an adaptive indexing strategy based on a cost model of overlay routing • Cons • Cannot support multi-dimensional queries

  20. Multi-dimensional index [X. Zhang, CloudDB'09]

  21. Multi-dimensional index [J. Wang, SIGMOD'10] [G. Chen, VLDB'11]

  22. MD-HBase: A Scalable Multi-dimensional Data Infrastructure for Location Aware Services. Shoji Nishimura, Sudipto Das. MDM'11

  23. Contributions • Using linearization to implement a scalable multi-dimensional index structure layered over a range-partitioned key-value store • Implementing a K-d tree and a Quad tree using this design

  24. Ordered Key-Value Stores • Sorted by key • Good at 1-D Range Query • But, our target is multi-dimensional… (Longitude, Latitude, Time) [Figure: an index over buckets of sorted key-value pairs, key00/value00 … keynn/valuenn]

  25. Naïve Solution: Linearization • Projects n-D space to 1-D space • Apply a Z-ordering curve… • Simple, but problematic… [Figure: buckets of key-value pairs, key00/value00 … keynn/valuenn, ordered along the curve]
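Linearization with a Z-ordering curve amounts to interleaving the bits of the coordinates. The sketch below is an illustration for two dimensions, with hypothetical function names `z_encode` and `z_decode`; it assumes the first coordinate contributes the higher bit of each pair, which matches the mapping of (10, 00) to 1000 used later in the deck.

```python
def z_encode(x: int, y: int, bits: int = 2) -> int:
    """Interleave the bits of x and y (x first) into one Z-order key."""
    z = 0
    for i in range(bits - 1, -1, -1):
        z = (z << 1) | ((x >> i) & 1)
        z = (z << 1) | ((y >> i) & 1)
    return z


def z_decode(z: int, bits: int = 2) -> tuple:
    """Split a Z-order key back into its (x, y) coordinates."""
    x = y = 0
    for i in range(bits - 1, -1, -1):
        x = (x << 1) | ((z >> (2 * i + 1)) & 1)
        y = (y << 1) | ((z >> (2 * i)) & 1)
    return x, y


# (x, y) = (0b10, 0b00) maps to key 0b1000, and (0b11, 0b11) to 0b1111.
print(format(z_encode(0b10, 0b00), "04b"), format(z_encode(0b11, 0b11), "04b"))
```

Points that are close in the 2-D space often, but not always, receive nearby keys, which is why a 1-D range scan over the keys can include points outside the queried rectangle.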

  26. Problem: False positive scans • MD-query on linearized space • Translate an MD-query to a linearized range query. • Ex. Query from 2 to 9. • Scan the queried linearized range. • Filter out points outside the queried area. • Ex. blue-hatched area (4 to 7) • Requires the boundary information of the original space.
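The false-positive problem can be reproduced on a 4×4 grid. The sketch below is an illustration under assumptions: it uses 2 bits per dimension with the first coordinate as the higher bit of each pair, and the names `z_decode` and `scan_and_filter` are hypothetical. Under that bit order, keys 2 and 9 decode to (1, 0) and (2, 1), so the queried area is taken to be x in [1, 2], y in [0, 1].

```python
def z_decode(z: int, bits: int = 2) -> tuple:
    """Split a Z-order key into (x, y); x holds the higher bit of each pair."""
    x = y = 0
    for i in range(bits - 1, -1, -1):
        x = (x << 1) | ((z >> (2 * i + 1)) & 1)
        y = (y << 1) | ((z >> (2 * i)) & 1)
    return x, y


def scan_and_filter(x_range, y_range, z_lo, z_hi):
    """Scan the linearized range, then filter against the real rectangle."""
    hits, false_positives = [], []
    for z in range(z_lo, z_hi + 1):
        x, y = z_decode(z)
        inside = x_range[0] <= x <= x_range[1] and y_range[0] <= y <= y_range[1]
        (hits if inside else false_positives).append(z)
    return hits, false_positives


hits, false_positives = scan_and_filter((1, 2), (0, 1), z_lo=2, z_hi=9)
print(hits, false_positives)
```

Half of the scanned keys in this small example lie outside the queried area and are read only to be discarded.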

  27. MD-HBase • Build a Multi-dimensional Index Layer on top of an Ordered Key-Value Store (ex. BigTable, HBase, …) [Figure: MD-HBase Multi-Dimensional Index above the Single Dimensional Index of the Ordered Key-Value Store]

  28. Space Partition by the K-d tree • Binary Z-ordering space (bitwise interleaving) • Partitioned space by the K-d tree • How do we represent these subspaces? [Figure: two 4×4 grids with both axes labelled 00, 01, 10, 11]

  29. Key Idea: The longest common prefix naming scheme • Subspaces represented as the longest common prefix of keys! • Ex. subspace 1***: * → 0 gives 1000 = (10, 00), the left-bottom corner; * → 1 gives 1111 = (11, 11), the right-top corner • Remarkable Property • Preserves boundary information of the original space [Figure: 4×4 grid with axes 00, 01, 10, 11 showing subspaces 000* and 1***]
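The boundary-preserving property can be checked directly: padding a prefix with 0s yields the key of one corner and padding it with 1s yields the opposite corner. The sketch below is an illustration; `subspace_corners` is a hypothetical helper, and it assumes 4-bit keys in which the even-position bits belong to the first coordinate.

```python
def subspace_corners(prefix: str, total_bits: int = 4):
    """Recover the bounding box of the subspace named by a key prefix.

    Replacing every wildcard with 0 gives the left-bottom corner and
    replacing every wildcard with 1 gives the right-top corner.
    """
    pad = total_bits - len(prefix)
    lo_key = prefix + "0" * pad
    hi_key = prefix + "1" * pad

    def split(key: str):
        # De-interleave: even positions form x, odd positions form y.
        return int(key[0::2], 2), int(key[1::2], 2)

    return split(lo_key), split(hi_key)


# Subspace 1*** spans keys 1000..1111, i.e. corners (10, 00) and (11, 11).
print(subspace_corners("1"))
# Subspace 000* spans keys 0000..0001.
print(subspace_corners("000"))
```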

  30. Build an index with the longest common prefix of keys • Index entries: 000*, 001*, 01**, 1*** • Buckets: allocate per subspace [Figure: 4×4 grid with axes 00, 01, 10, 11 partitioned into subspaces 000*, 001*, 01**, 1***]

  31. Multi-dimensional Range Query • Scan 0010-1001 on the index • Subspace Pruning • Reconstruct the boundary info. & check whether intersecting the queried area [Figure: index entries 000*, 001*, 01**, 10**, 11** over a 4×4 grid with axes 00, 01, 10, 11, annotated Scan and Filter]
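Subspace pruning can be sketched by combining the index scan with the boundary reconstruction. This is an illustration under assumptions, not MD-HBase code: `corners`, `intersects`, and `plan_query` are hypothetical names, keys are 4 bits with the first coordinate in the even positions, and the queried area is taken to be x in [1, 2], y in [0, 1], which linearizes to the range 0010-1001.

```python
def corners(prefix: str, total_bits: int = 4):
    """Left-bottom and right-top corners of the subspace named by prefix."""
    pad = total_bits - len(prefix)
    lo_key, hi_key = prefix + "0" * pad, prefix + "1" * pad
    split = lambda key: (int(key[0::2], 2), int(key[1::2], 2))
    return split(lo_key), split(hi_key)


def intersects(lo, hi, x_range, y_range) -> bool:
    """Axis-aligned rectangle overlap test."""
    return (lo[0] <= x_range[1] and hi[0] >= x_range[0]
            and lo[1] <= y_range[1] and hi[1] >= y_range[0])


def plan_query(index, x_range, y_range, z_lo, z_hi, total_bits: int = 4):
    """Return (subspaces to scan, subspaces pruned) for a range query."""
    to_scan, pruned = [], []
    for prefix in index:
        pad = total_bits - len(prefix)
        first = int(prefix + "0" * pad, 2)
        last = int(prefix + "1" * pad, 2)
        if last < z_lo or first > z_hi:
            continue  # not touched by the scan over the index at all
        lo, hi = corners(prefix, total_bits)
        (to_scan if intersects(lo, hi, x_range, y_range) else pruned).append(prefix)
    return to_scan, pruned


index = ["000", "001", "01", "10", "11"]  # 000*, 001*, 01**, 10**, 11**
print(plan_query(index, (1, 2), (0, 1), z_lo=0b0010, z_hi=0b1001))
```

In this example the subspace 01** lies inside the linearized range but is skipped without reading its bucket, because its reconstructed bounding box does not overlap the queried area.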

  32. Variations of Storage Layer • Table Share Model • Use a single table, maintain bucket boundaries • Most space-efficient • Table per Bucket Model • Allocate a table per bucket • Most flexible mapping • One-to-one, one-to-many, many-to-one • Bucket split is expensive • Copy all points to the new buckets. • Region per Bucket Model • Allocate a region per bucket • Most efficient bucket split • Requires modification of HBase

  33. Experimental Results: Multi-dimensional Range Query • Dataset: 400,000,000 points • Queries: select objects within MD ranges, varying the selectivity • Cluster size: 16 nodes • MD-HBase responds 10~100 times faster than the others, with response time proportional to selectivity.

  34. Experimental Results: Insert • Dataset: spatially skewed data • MD-HBase shows good scalability without significant overhead.

  35. Conclusions • Designed a scalable multi-dimensional data store. • Mapping multi-dimension to single dimension • Key Idea: indexing the longest common prefix of keys • Demonstrated scalable insert throughput and excellent query performance. • Range Query: 10-100 times faster than existing technologies. • Insert: 220K inserts/sec on a 16-node cluster without overhead

  36. CCIndex: A Complemental Clustering Index on Distributed Ordered Tables for Multi-dimensional Range Queries. Y. Zou, J. Liu, S. Wang. NPC'10

  37. Introduction • Motivation • Building indexes in DOTs to support multi-dimensional range queries • High performance, low space overhead, high reliability • DOT • Distributed Ordered Table • BigTable, HBase • Observations • Usually 3 to 5 replicas in DOTs • The number of indexes is usually less than 5 • Random read is significantly slower than scan

  38. Basic idea: Complemental Clustering Index • CCIT: convert slow random reads to fast sequential scans • CCT: for fast data recovery
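The idea of turning random reads into a sequential scan can be sketched with an in-memory sorted table. This is a loose illustration under assumptions, not the CCIndex implementation: `ClusteredIndexTable` is a hypothetical class, and it assumes the index table is keyed by (indexed value, rowkey) and stores the full record, so that a range on the indexed column is one contiguous run of rows.

```python
import bisect


class ClusteredIndexTable:
    """Rows sorted by (indexed column value, rowkey), with full records stored."""

    def __init__(self, column: str):
        self.column = column
        self.keys = []   # sorted list of (value, rowkey)
        self.rows = {}   # (value, rowkey) -> full record

    def put(self, rowkey: str, row: dict) -> None:
        # Updates and deletes of existing rows are not handled in this sketch.
        key = (row[self.column], rowkey)
        bisect.insort(self.keys, key)
        self.rows[key] = dict(row)

    def range_scan(self, lo, hi) -> list:
        """Read one contiguous run of rows; no lookups in the base table."""
        start = bisect.bisect_left(self.keys, (lo,))
        result = []
        for key in self.keys[start:]:
            if key[0] > hi:
                break
            result.append(self.rows[key])
        return result


by_price = ClusteredIndexTable("price")
by_price.put("r1", {"name": "beer", "price": 3, "number": 4})
by_price.put("r2", {"name": "beer", "price": 7, "number": 10})
by_price.put("r3", {"name": "wine", "price": 12, "number": 2})
matches = [r for r in by_price.range_scan(5, 10) if r["name"] == "beer"]
print(sum(r["number"] for r in matches))
```

Storing the whole record in each index table costs extra space per indexed column, which is the trade-off the later space overhead slide addresses.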

  39. Challenges • Performance • Reliability • Space overhead

  40. Performance • Query optimization based on the region-to-server mapping information • HBase 0.20.1 • 16 nodes • 90 million records

  41. Reliability: Fault tolerance • Get other index value from CCTs • Query the CCITs to recover data • Replicate CCTs

  42. Space overhead • N: the number of index columns • X-axis • Ratio of record length to index column length • Y-axis • Overhead ratio

  43. Conclusions • Proposed CCIndex to support multi-dimensional range queries in DOTs • Not suitable for more than 5 index columns • Write operations are slower than on the original table

  44. Outline • Motivating Applications • Existing Technologies • Conclusions & Future work

  45. Conclusions • Index for non-rowkey in cloud data management system • Solutions • Local index + global index • Linearization • Secondary index • Key issues • Index reliability • Query result correctness • Index maintenance • …

  46. Future work • Study the architecture of HDFS and HBase in detail • Test the existing index solutions in Cloud • Index framework and index structure

  47. References • M. K. Aguilera, W. Golab, and M. A. Shah. A practical scalable distributed B-tree. PVLDB, 1(1):598–609, 2008. • Y. Zou, J. Liu, and S. Wang. CCIndex: A complemental clustering index on distributed ordered tables for multi-dimensional range queries. In NPC, 2010. • S. Wu and K.-L. Wu. An indexing framework for efficient retrieval on the cloud. IEEE Data Eng. Bull., 32:75–82, 2009. • J. Wang, S. Wu, H. Gao, J. Li, and B. C. Ooi. Indexing multi-dimensional data in a cloud system. In SIGMOD, 2010. • S. Wu, D. Jiang, B. C. Ooi, and K.-L. Wu. Efficient B-tree based indexing for cloud data processing. PVLDB, 3(1):1207–1218, 2010. • X. Zhang, J. Ai, Z. Wang, J. Lu, and X. Meng. An efficient multidimensional index for cloud data management. In CloudDB, pages 17–24, 2009. • S. Nishimura and S. Das. MD-HBase: A scalable multi-dimensional data infrastructure for location aware services. In MDM, 2011.

  48. Thank you
