
Jiaheng Lu Renmin University of China 2009-08-25

Institute of Software, Chinese Academy of Sciences. Renmin University of China. Cloud-based Data Management: Challenges & Opportunities. Jiaheng Lu, Renmin University of China, 2009-08-25. Research experience and interests. National University of Singapore, PhD: XML query processing and XML keyword search. University of California, Irvine, Postdoc.


Presentation Transcript


  1. Institute of Software, Chinese Academy of Sciences; Renmin University of China. Cloud-based Data Management: Challenges & Opportunities. Jiaheng Lu, Renmin University of China, 2009-08-25

  2. Research experience and interests • National University of Singapore PhD • XML query processing and XML keyword search • University of California, Irvine Postdoc • Approximate string processing • Data integration and data cleaning • Renmin University of China • Cloud data management • XML data management

  3. Outline • Motivation: cloud data management • Database Future and Challenges: • Large-scale data management & transaction processing • Cloud-based data indexing and query optimization • Recent research work: • An efficient multi-dimensional index for cloud data management • CIKM Workshop CloudDB 2009

  4. Motivation: Internet Chatter

  5. Blog Wisdom • “If you want vast, on-demand scalability, you need a non-relational database.” Because scalability requirements: • Can change very quickly and, • Can grow very rapidly. • Difficult to manage with a single in-house RDBMS server. • Although RDBMSs scale well: • When limited to a single node. • Overwhelming complexity to scale across multiple server nodes.

  6. Current State • Most enterprise solutions are based on RDBMS technology. • Significant Operational Challenges: • Provisioning for Peak Demand • Resource under-utilization • Capacity planning: too many variables • Storage management: a massive challenge • System upgrades: extremely time-consuming

  7. Internet Search Data Analytics: A Case Study • Data analytics: • Parsed web logs ingested into an RDBMS store. • Hourly and daily summarization for custom reporting. • Operational nightmare: • Keeping the live reporting system on at all costs and at all times. • Timely completion of hourly summarization. • Constant tension between ad-hoc workload and reporting workload. • Data-driven feedback to live products. • Temporal depth of detailed data

  8. Internet Search Data Analytics: A Case Study • Various solutions explored: • Data warehousing appliance for fast summarization. • Parallel RDBMS technology for fast ad-hoc queries. • Business intelligence products (data cubes) for fast and intuitive reporting and analysis. • None of the solutions was completely satisfactory: • Plans to migrate low-level data to a file-based system to overcome database scalability bottlenecks

  9. Paradigm Shift in Computing

  10. WEB is replacing the Desktop

  11. What is Cloud Computing? • Old idea: Software as a service (SaaS) • Def: delivering applications over the internet • Recently: “[Hardware, infrastructure, Platform] as a service” • Poorly defined so we avoid all “X as a service” • Utility Computing: pay-as-you-go computing • Illusion of infinite resources • No up-front cost • Fine-grained billing (e.g. hourly)

  12. Why Now? • Experience with very large datacenters • Unprecedented economies of scale • Other factors • Pervasive broadband internet • Pay-as-you-go billing model

  13. Cloud Computing Spectrum • Instruction Set VM (Amazon EC2, 3Tera) • Framework VM • Google AppEngine, Force.com

  14. Cloud Killer Apps • Mobile and web applications • Extensions of desktop software • Matlab, Mathematica • Batch processing/MapReduce

  15. Economics of Cloud Users • Pay by use instead of provisioning for peak

  16. Economics of Cloud Users • Risk of over-provisioning: underutilization

  17. Economics of Cloud Users • Heavy penalty for under-provisioning
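
The trade-off on slides 15 to 17 can be made concrete with a small calculation. The sketch below is illustrative only: the hourly demand figures, the price per server-hour, the fixed capacity of 30, and the function names are all assumptions, not numbers from the talk.

```python
def provisioned_cost(demand, capacity, price_per_server_hour):
    """Cost of owning a fixed capacity for every hour, used or not."""
    return capacity * price_per_server_hour * len(demand)


def pay_per_use_cost(demand, price_per_server_hour):
    """Cost when billed only for the servers actually used each hour."""
    return sum(demand) * price_per_server_hour


def unserved_demand(demand, capacity):
    """Server-hours of demand that a fixed capacity cannot serve."""
    return sum(max(0, d - capacity) for d in demand)


# Hypothetical demand (servers needed per hour) with one short peak.
demand = [10, 10, 20, 80, 100, 30, 10, 10]
price = 1.0  # assumed price per server-hour, identical for both models

# Provisioning for peak: pay for max(demand) servers every hour.
peak_cost = provisioned_cost(demand, max(demand), price)
# Pay by use: pay only for the server-hours consumed.
cloud_cost = pay_per_use_cost(demand, price)
# Fraction of the peak-provisioned capacity that is actually used.
utilization = sum(demand) / (max(demand) * len(demand))
# Under-provisioning at 30 servers: demand that is turned away.
lost = unserved_demand(demand, capacity=30)
```

Using the same price for both models is a simplification; a rented server-hour usually costs more than an owned one, so the comparison depends on how low the utilization of owned capacity really is.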

  18. Economics of Cloud Providers • 5-7X economies of scale [Hamilton 2008] • Extra benefits • Amazon: utilize off-peak capacity • Microsoft: sell .NET tools • Google: reuse existing infrastructure

  19. Engineering Definition • Providing services on virtual machines allocated on top of a large physical machine pool.

  20. Business Definition • A method to address scalability and availability concerns for large scale applications.

  21. Data Management in the Cloud?

  22. Cloud Computing Implications on DBMSs • Where do Databases fit in this paradigm? • Generational reality: • Animoto.com • Started with 50 servers on Amazon EC2 • Growth of 25,000 users/hour • Need to scale to 3,500 servers in 2 days. • Many similar stories: • RightScale • Joyent • …

  23. Clouded Data? • Reality Number Ⅰ: • Unlimited processing assumption • Interactive page views: • By targeting a large number of SQL queries against MySQL • Still expect sub-millisecond object retrieval • Reality Number Ⅱ: • Why can’t the database tier be replicated in the same way as the Web server and App server tiers? → These are the major challenges for data management in the cloud.

  24. The Vision • R&D challenges at the macro level: • Where and how does the DBMS fit into this model? • R&D challenges at the micro level: • Specific technology components that must be developed to enable the migration of enterprise data into the cloud.

  25. Data and Networks: Attempt Ⅰ • Distributed Database (1980s): • Idealized view: unified access to distributed data • Prohibitively expensive: global synchronization • Remained a laboratory prototype: • Associated technology widely in-use: 2PC
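
As a reminder of what 2PC (two-phase commit) does, here is a minimal in-memory sketch. The `Participant` class and `two_phase_commit` function are hypothetical names chosen for illustration; logging, timeouts and crash recovery, which are the costly parts in practice, are left out.

```python
class Participant:
    """A hypothetical resource manager taking part in a transaction."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "init"

    def prepare(self):
        # Phase 1: vote yes only if the local work can be committed.
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self):
        self.state = "committed"

    def abort(self):
        self.state = "aborted"


def two_phase_commit(participants):
    """Return True if the transaction committed everywhere, else False."""
    # Phase 1 (voting): ask every participant to prepare.
    votes = [p.prepare() for p in participants]
    # Phase 2 (decision): commit only on a unanimous yes.
    if all(votes):
        for p in participants:
            p.commit()
        return True
    for p in participants:
        p.abort()
    return False
```

Every participant has to answer before anyone may commit, which is the global synchronization the slide calls prohibitively expensive at scale.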

  26. Data and Networks: Attempt Ⅱ

  27. Data and Networks: Pragmatics

  28. Database on S3: SIGMOD’08 • Amazon’s Simple Storage Service(S3): • Updates may not preserve initiation order • No “force” writes • Eventual guarantee • Proposed solution: • Pending Update Queue • Checkpoint protocol to ensure consistent ordering • ACID: only Atomicity + Durability
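
The pending-update-queue idea can be sketched as follows. This is a heavily simplified illustration, not the protocol from the SIGMOD’08 paper: a dict stands in for S3, a list stands in for the queue, and the class name, method names and sequence numbers are assumptions.

```python
class PendingUpdateStore:
    """Toy store: writes are queued, and a checkpoint applies them in order."""

    def __init__(self):
        self.pages = {}      # stands in for objects stored on S3
        self.pending = []    # stands in for the pending update queue
        self._next_seq = 0   # assumed sequence number to fix an order

    def write(self, key, value):
        """Log an update instead of overwriting the stored page directly."""
        self.pending.append((self._next_seq, key, value))
        self._next_seq += 1

    def checkpoint(self):
        """Apply all pending updates in sequence order, then clear the queue."""
        for _, key, value in sorted(self.pending, key=lambda u: u[0]):
            self.pages[key] = value
        applied = len(self.pending)
        self.pending.clear()
        return applied

    def read(self, key):
        """Read the last checkpointed value; pending updates are not visible."""
        return self.pages.get(key)
```

A reader between checkpoints sees older data, which matches the slide's point that only atomicity and durability are kept, not full ACID.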

  29. Unbundling Txns in the Cloud • Research results: • CIDR’09 proposal to unbundle Transactions Management for Cloud Infrastructures • Attempts to refit the DBMS engine in the cloud storage and computing

  30. Analytical Processing

  31. Architectural and System Impacts • Current state: • MapReduce Paradigm for data analysis • What is missing: • Auxiliary structures and indexes for associative access to data (i.e., attribute-based access) • Caveat: inherent inconsistency and approximation • Future projection: • Eventual merger of databases (ODSs) and data warehouses as we have learned to use and implement them.

  32. Underlying Principles: CIDR’2009 • Business data may not always reflect the state of the world or the business: • Inherent lack of perfect information • Secondary data need not be updated with primary data: • Inherent latency • Transactions/Events may temporarily violate integrity constraints: • Referential integrity may need to be compromised

  33. Data Security & Privacy • Data privacy remains a show-stopper in the context of database outsourcing. • Encryption-based solutions are too expensive and are projected to remain so for the foreseeable future: • Private Information Retrieval (Sion’2008) • Other approaches: • Information-theoretic approaches that use data partitioning for security (Emekci’2007) • Hardware-based solutions for information security

  34. Self management and self tuning in cloud-based data management • Self management and self tuning • Query optimization on thousands of nodes

  35. Remarks • Data Management for Cloud Computing poses a fundamental challenge to database researchers: • Scalability • Reliability • Data Consistency • Radically different approaches and solutions are warranted to overcome this challenge: • Need to understand the nature of new applications

  36. References • Life Beyond Distributed Transactions: An Apostate’s Opinion, P. Helland, CIDR’07 • Building a Database on S3, M. Brantner, D. Florescu, D. Graf, D. Kossmann, T. Kraska, SIGMOD’08 • Unbundling Transaction Services in the Cloud, D. Lomet, A. Fekete, G. Weikum, M. Zwilling, CIDR’09 • Principles of Inconsistency, S. Finkelstein, R. Brendle, D. Jacobs, CIDR’09 • VLDB Database School (China) 2009 http://www.sei.ecnu.edu.cn/~vldbschool2009/VLDBSchool2009English.htm

  37. An Efficient Multi-Dimensional Index for Cloud Data Management CIKM workshop CloudDB09

  38. Outline INTRODUCTION MULTI-DIMENSIONAL INDEX WITH KDTREE AND RTREE Extended Nodes partition Node partition Cost Estimation Strategy EVALUATION

  39. Cloud Computing Google File System Yahoo PNUTS

  40. Distributed Cloud base? • BigTable How to query on other attributes besides primary key? • HBase
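
The question on this slide can be illustrated with a toy key-value table. Stores in the BigTable/HBase style look rows up by primary key, so a query on any other attribute falls back to a full scan unless a secondary index is kept. The class below is a hypothetical in-memory sketch and does not use or mirror the HBase API.

```python
from collections import defaultdict


class KeyValueTable:
    """A toy row store keyed by primary key, with one secondary index."""

    def __init__(self, indexed_attribute):
        self.rows = {}
        self.indexed_attribute = indexed_attribute
        self.index = defaultdict(set)  # attribute value -> primary keys

    def put(self, key, row):
        old = self.rows.get(key)
        if old is not None and self.indexed_attribute in old:
            # Remove the stale index entry before overwriting the row.
            self.index[old[self.indexed_attribute]].discard(key)
        self.rows[key] = row
        if self.indexed_attribute in row:
            self.index[row[self.indexed_attribute]].add(key)

    def get(self, key):
        """Primary-key lookup, the access path such stores support directly."""
        return self.rows.get(key)

    def find_by_attribute(self, value):
        """Return primary keys whose indexed attribute equals value."""
        return sorted(self.index.get(value, ()))

    def scan_by_attribute(self, value):
        """Same answer without the index: a full scan over all rows."""
        return sorted(k for k, r in self.rows.items()
                      if r.get(self.indexed_attribute) == value)
```

In a real cloud store both the rows and the index would be spread over many nodes, which is where the distributed index designs on the following slides come in.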

  41. Distributed Index: Single Dimension? S. Wu and K.-L. Wu, “An indexing framework for efficient retrieval on the cloud,” IEEE Data Eng. Bull., vol. 32, pp. 75–82, 2009. H.-C. Yang and D. S. Parker, “Traverse: Simplified indexing on large map-reduce-merge clusters,” in Proceedings of DASFAA 2009, Brisbane, Australia, April 2009, pp. 308–322. M. K. Aguilera, W. Golab, and M. A. Shah, “A practical scalable distributed B-tree,” in Proceedings of VLDB’08, Auckland, New Zealand, August 2008, pp. 598–609.

  42. Outline INTRODUCTION MULTI-DIMENSIONAL INDEX WITH KDTREE AND RTREE Extended Nodes partition Node partition Cost Estimation Strategy EVALUATION

  43. Framework of Request Processing in Cloud

  44. R-Tree An R-tree is a tree data structure that is similar to a B-tree, but is used for spatial access methods.

  45. KD-Tree A kd-tree (short for k-dimensional tree) is a space-partitioning data structure for organizing points in a k-dimensional space.
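
A minimal sketch of the structure, assuming points are tuples of equal length and the tree is stored as nested dicts. The function names are chosen here for illustration; the talk does not show an implementation.

```python
def build_kdtree(points, depth=0):
    """Build a kd-tree as nested dicts by splitting on the median point."""
    if not points:
        return None
    axis = depth % len(points[0])  # cycle through the dimensions
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {
        "point": points[mid],
        "axis": axis,
        "left": build_kdtree(points[:mid], depth + 1),
        "right": build_kdtree(points[mid + 1:], depth + 1),
    }


def range_search(node, low, high):
    """Return all points p with low[i] <= p[i] <= high[i] on every axis."""
    if node is None:
        return []
    point, axis = node["point"], node["axis"]
    found = []
    if all(l <= c <= h for l, c, h in zip(low, point, high)):
        found.append(point)
    # Left subtree holds values <= the split value; right holds values >= it.
    if low[axis] <= point[axis]:
        found += range_search(node["left"], low, high)
    if point[axis] <= high[axis]:
        found += range_search(node["right"], low, high)
    return found
```

For example, `range_search(build_kdtree(pts), (3, 1), (8, 5))` returns the points of `pts` inside that rectangle while skipping subtrees that lie entirely outside it.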

  46. R-Tree & KD-Tree: RKDTree [Figure: a master node above five slave nodes, labeled with the ranges 6800~9000, 3400~8900; 2000~40000, 3400~8900; 6300~7000, 599~1400; 0~2000, 500~1200; 800~3500, 300~1300]
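
One way to read this slide: the master keeps a bounding range per slave and forwards a range query only to the slaves whose range overlaps it. The sketch below uses the five ranges from the slide, but the slave names, the pairing of ranges with slaves, and the linear scan (where the master would presumably use an R-tree) are assumptions.

```python
def overlaps(a, b):
    """True if two boxes, given as one (lo, hi) pair per dimension, intersect."""
    return all(a_lo <= b_hi and b_lo <= a_hi
               for (a_lo, a_hi), (b_lo, b_hi) in zip(a, b))


# Ranges copied from the slide; the slave names are assumed.
slave_ranges = {
    "slave1": ((6800, 9000), (3400, 8900)),
    "slave2": ((2000, 40000), (3400, 8900)),
    "slave3": ((6300, 7000), (599, 1400)),
    "slave4": ((0, 2000), (500, 1200)),
    "slave5": ((800, 3500), (300, 1300)),
}


def route(query, ranges):
    """Return the names of the slaves whose range overlaps the query box."""
    return sorted(name for name, box in ranges.items() if overlaps(query, box))
```

Each selected slave would then answer the query from its local index, for example a kd-tree over its own data.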

  47. Outline INTRODUCTION MULTI-DIMENSIONAL INDEX WITH KDTREE AND RTREE Extended Nodes partition Node partition Cost Estimation Strategy EVALUATION

  48. Nodes partition for data summary • Random cutting: Pick several random values on the attribute and cut at those points. The random method may give great performance, but it may also give poor performance. • Equal cutting: Cut the attribute into several equal intervals. This method is relatively stable since no extreme case will happen. • Clustering-based cutting: Cut the attribute by clustering the values on the attribute and cutting between clusters. This method can be expected to perform better, but its time cost is also clearly higher. The time complexity of a clustering algorithm is typically O(n log n) or even higher.
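
The three cutting strategies can be compared on a single attribute with a short sketch. The function names and the parameter `k` (number of cut points) are assumptions, and `clustering_cuts` uses a simple widest-gap heuristic as a stand-in for whatever clustering algorithm the authors used.

```python
import random


def random_cuts(values, k, seed=None):
    """Pick k cut points uniformly at random between min and max."""
    rng = random.Random(seed)
    lo, hi = min(values), max(values)
    return sorted(rng.uniform(lo, hi) for _ in range(k))


def equal_cuts(values, k):
    """Pick k cut points that split [min, max] into k + 1 equal intervals."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / (k + 1)
    return [lo + step * i for i in range(1, k + 1)]


def clustering_cuts(values, k):
    """Cut at the midpoints of the k widest gaps between sorted values.

    A gap-based stand-in for a clustering algorithm; the sort makes it
    O(n log n).
    """
    ordered = sorted(set(values))
    gaps = [(b - a, (a + b) / 2) for a, b in zip(ordered, ordered[1:])]
    widest = sorted(gaps, reverse=True)[:k]
    return sorted(mid for _, mid in widest)
```

On values such as `[1, 2, 3, 50, 51, 52, 100, 101]`, equal cutting ignores where the data sits, while the gap-based version places its cuts between the three groups.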

  49. Nodes partition Random cutting Equal cutting Clustering-based cutting
