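Cloud Computing and Cloud Data Management Jiaheng Lu Renmin University of China www.jiahenglu.net
Outline • Overview of cloud computing • Google cloud computing technologies: GFS, Bigtable, and MapReduce • Yahoo cloud computing technologies and Hadoop • Challenges of cloud data management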
Why do we use cloud computing? Case 1: Write a file and save it. If the computer goes down, the file is lost. Files stored in the cloud are never lost.
Why do we use cloud computing? Case 2: Use IE --- download, install, use Use QQ --- download, install, use Use C++ --- download, install, use …… Get the service from the cloud
What is the cloud and cloud computing? Cloud: on-demand resources or services over the Internet, with the scale and reliability of a data center.
What is the cloud and cloud computing? Cloud computing is a style of computing in which dynamically scalable and often virtualized resources are provided as a service over the Internet. Users need not have knowledge of, expertise in, or control over the technology infrastructure in the "cloud" that supports them.
Characteristics of cloud computing • Virtual: software, databases, Web servers, operating systems, storage, and networking are delivered as virtual servers. • On demand: add and remove processors, memory, network bandwidth, and storage as needed.
Types of cloud service SaaS Software as a Service PaaS Platform as a Service IaaS Infrastructure as a Service
SaaS Software delivery model No hardware or software to manage Service delivered through a browser Customers use the service on demand Instant Scalability
SaaS Examples Your current CRM package cannot handle the load, or you simply don’t want to host it in-house. Use a SaaS provider such as Salesforce.com. Your email is hosted on an Exchange server in your office and it is very slow. Outsource this using Hosted Exchange.
PaaS Platform delivery model Platforms are built upon Infrastructure, which is expensive Estimating demand is not a science! Platform management is not fun!
PaaS Examples You need to host a large file (5Mb) on your website and make it available to 35,000 users for only two months. Use CloudFront from Amazon. You want to start storage services on your network for a large number of files and you do not have the storage capacity. Use Amazon S3.
IaaS Computer infrastructure delivery model A platform virtualization environment Computing resources, such as storing and processing capacity. Virtualization taken a step further
IaaS Examples You want to run a batch job but you don’t have the infrastructure necessary to run it in a timely manner. Use Amazon EC2. You want to host a website, but only for a few days. Use Flexiscale.
The 21st Century Vision Of Computing Leonard Kleinrock, one of the chief scientists of the original Advanced Research Projects Agency Network (ARPANET) project which seeded the Internet, said: “As of now, computer networks are still in their infancy, but as they grow up and become sophisticated, we will probably see the spread of ‘computer utilities’ which, like present electric and telephone utilities, will service individual homes and offices across the country.”
The 21st Century Vision Of Computing Sun Microsystems co-founder Bill Joy
Definitions Cluster Grid Cloud utility
Definitions: Utility computing is the packaging of computing resources, such as computation and storage, as a metered service similar to a traditional public utility.
Definitions: A computer cluster is a group of linked computers, working together closely so that in many respects they form a single computer.
Definitions: Grid computing is the application of several computers to a single problem at the same time, usually a scientific or technical problem that requires a great number of computer processing cycles or access to large amounts of data.
Definitions: Cloud computing is a style of computing in which dynamically scalable and often virtualized resources are provided as a service over the Internet.
Grid Computing & Cloud Computing • They have much in common: intention, architecture, and technology • They differ in: programming model, business model, compute model, applications, and virtualization.
Grid Computing & Cloud Computing • the problems are mostly the same • manage large facilities; • define methods by which consumers discover, request and use resources provided by the central facilities; • implement the often highly parallel computations that execute on those resources.
Grid Computing & Cloud Computing • Virtualization • Grid: does not rely on virtualization as much as Clouds do; each individual organization maintains full control of its resources • Cloud: virtualization is an indispensable ingredient for almost every Cloud
Any questions or comments? 2014/10/23
Outline • Overview of cloud computing • Google cloud computing technologies: GFS, Bigtable, and MapReduce • Yahoo cloud computing technologies and Hadoop • Challenges of cloud data management
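Cloud Systems [Figure: overview of cloud data systems. Systems shown: BigTable, HBase, HyperTable, Hive, HadoopDB, GreenPlum, CouchDB, Voldemort, PNUTS, SQL Azure. Category labels: BigTable-like, MapReduce, DBMS-based. Venue labels: OSDI’06, VLDB’09, VLDB’09, VLDB’08.]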
The Google File System (GFS) • A scalable distributed file system for large distributed data intensive applications • Multiple GFS clusters are currently deployed. • The largest ones have: • 1000+ storage nodes • 300+ TeraBytes of disk storage • heavily accessed by hundreds of clients on distinct machines
Introduction • Shares many of the same goals as previous distributed file systems • performance, scalability, reliability, etc. • GFS design has been driven by four key observations of Google application workloads and technological environment
Intro: Observations 1 • 1. Component failures are the norm • constant monitoring, error detection, fault tolerance and automatic recovery are integral to the system • 2. Huge files (by traditional standards) • Multi-GB files are common • I/O operations and block sizes must be revisited
Intro: Observations 2 • 3. Most files are mutated by appending new data • This is the focus of performance optimization and atomicity guarantees • 4. Co-designing the applications and APIs benefits overall system by increasing flexibility
The Design • Cluster consists of a single master and multiple chunkservers and is accessed by multiple clients
The Master • Maintains all file system metadata. • namespace, access control info, file-to-chunk mappings, chunk (including replica) locations, etc. • Periodically communicates with chunkservers in HeartBeat messages to give instructions and check state
The Master • Helps make sophisticated chunk placement and replication decisions, using global knowledge • For reading and writing, client contacts Master to get chunk locations, then deals directly with chunkservers • Master is not a bottleneck for reads/writes
Chunkservers • Files are broken into chunks. Each chunk has an immutable, globally unique 64-bit chunk handle. • handle is assigned by the master at chunk creation • Chunk size is 64 MB • Each chunk is replicated on 3 (default) servers
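The fixed chunk size makes it easy to translate a byte offset in a file into a chunk index. The following is a minimal sketch of that arithmetic, assuming the 64 MB chunk size and default of 3 replicas stated above; the function names are hypothetical and are not part of any GFS API.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, the chunk size stated on the slide


def chunk_index(byte_offset: int) -> int:
    """Index of the chunk that holds the given byte offset within a file."""
    return byte_offset // CHUNK_SIZE


def num_chunks(file_size: int) -> int:
    """Number of chunks needed for a file (ceiling division)."""
    return -(-file_size // CHUNK_SIZE)


def chunks_touched(byte_offset: int, length: int) -> range:
    """Indexes of all chunks touched by a read of `length` bytes."""
    if length <= 0:
        return range(0)
    first = chunk_index(byte_offset)
    last = chunk_index(byte_offset + length - 1)
    return range(first, last + 1)


def replicated_bytes(file_size: int, replicas: int = 3) -> int:
    """Raw bytes stored across the cluster for one file (hypothetical helper)."""
    return file_size * replicas
```

Under these assumptions a 1 GB file occupies 16 chunks, and a read that straddles a chunk boundary touches two chunks, so it may involve two different sets of chunkservers.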
Clients • Linked to apps using the file system API. • Communicates with master and chunkservers for reading and writing • Master interactions only for metadata • Chunkserver interactions for data • Only caches metadata information • Data is too large to cache.
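The split between metadata traffic (to the master) and data traffic (to chunkservers) can be sketched as a toy in-process model. All class and method names below are hypothetical, the read is limited to a single chunk for brevity, and replica selection is simply "take the first one"; it only illustrates that the client caches chunk locations and never asks the master for file data.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, as stated for GFS chunks


class Chunkserver:
    """Toy chunkserver: stores chunk contents keyed by chunk handle."""

    def __init__(self):
        self.chunks = {}  # chunk handle -> bytes

    def read(self, handle, start, length):
        return self.chunks[handle][start:start + length]


class Master:
    """Toy master: holds metadata only, never file data."""

    def __init__(self):
        self.file_chunks = {}  # file name -> list of chunk handles
        self.locations = {}    # chunk handle -> list of Chunkserver objects
        self.lookups = 0       # counts metadata requests, for illustration

    def lookup(self, name, index):
        self.lookups += 1
        handle = self.file_chunks[name][index]
        return handle, self.locations[handle]


class Client:
    """Toy client: caches metadata, fetches data from chunkservers."""

    def __init__(self, master, chunk_size=CHUNK_SIZE):
        self.master = master
        self.chunk_size = chunk_size
        self.cache = {}  # (file name, chunk index) -> (handle, replicas)

    def read(self, name, offset, length):
        # Assumption: the requested range lies within a single chunk.
        index, start = divmod(offset, self.chunk_size)
        key = (name, index)
        if key not in self.cache:
            self.cache[key] = self.master.lookup(name, index)
        handle, replicas = self.cache[key]
        return replicas[0].read(handle, start, length)
```

Repeated reads of the same chunk reuse the cached location, so the master sees one lookup per chunk rather than one per read.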
Chunk Locations • Master does not keep a persistent record of locations of chunks and replicas. • Instead, it polls chunkservers for this information at startup and whenever chunkservers join or leave. • Stays up to date by controlling placement of new chunks and through HeartBeat messages (when monitoring chunkservers)
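One way to picture a location table that is never persisted is as an in-memory map rebuilt purely from chunkserver reports. The sketch below is an illustration under that assumption; `LocationTable` and its methods are hypothetical names, and a real HeartBeat exchange carries more than a list of handles.

```python
from collections import defaultdict


class LocationTable:
    """In-memory map of chunk handle -> set of chunkserver ids (not persisted)."""

    def __init__(self):
        self.locations = defaultdict(set)

    def report(self, server_id, handles):
        """Apply a full report from one chunkserver (startup, join, or HeartBeat)."""
        self.remove_server(server_id)  # the new report replaces the old one
        for handle in handles:
            self.locations[handle].add(server_id)

    def remove_server(self, server_id):
        """Forget everything known about a chunkserver that left or failed."""
        for handle in list(self.locations):
            self.locations[handle].discard(server_id)
            if not self.locations[handle]:
                del self.locations[handle]

    def replicas(self, handle):
        """Sorted ids of the chunkservers currently known to hold the chunk."""
        return sorted(self.locations.get(handle, ()))
```

Because the table is derived entirely from what chunkservers report, a restarted master can start with an empty table and fill it by polling, with no risk of a stale on-disk copy disagreeing with what the chunkservers actually hold.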
Operation Log • Record of all critical metadata changes • Stored on Master and replicated on other machines • Defines order of concurrent operations • Also used to recover the file system state
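Recovery from an operation log amounts to replaying the logged records in order against an empty state. The following is a minimal sketch of that idea, assuming a made-up record format of tuples such as `("create", name)`; the class names are hypothetical, and replication of the log to other machines is not modeled.

```python
class MetadataStore:
    """Toy metadata state: file name -> list of chunk handles."""

    def __init__(self):
        self.files = {}

    def apply(self, record):
        op = record[0]
        if op == "create":
            self.files[record[1]] = []
        elif op == "add_chunk":
            self.files[record[1]].append(record[2])
        elif op == "delete":
            self.files.pop(record[1], None)
        else:
            raise ValueError(f"unknown operation: {op!r}")


class LoggedMaster:
    """Appends each metadata change to the log before applying it."""

    def __init__(self):
        self.log = []  # stands in for the durable, replicated log
        self.state = MetadataStore()

    def mutate(self, record):
        self.log.append(record)   # log first, so the change survives a crash
        self.state.apply(record)  # then update the in-memory state


def recover(log):
    """Rebuild the metadata state by replaying the log in order."""
    state = MetadataStore()
    for record in log:
        state.apply(record)
    return state
```

The position of a record in the log is also what fixes the order of concurrent operations: whichever record is appended first is applied first, both during normal operation and during replay.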
System Interactions: Leases and Mutation Order • Leases maintain a mutation order across all chunk replicas • Master grants a lease to a replica, called the primary • The primary chooses the serial mutation order, and all replicas follow this order • Minimizes management overhead for the Master
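The effect of a primary choosing the serial order can be sketched with serial numbers: the primary stamps each mutation, and every replica applies mutations strictly by stamp, regardless of arrival order. This is an illustrative model only; the class names are hypothetical, and lease granting and expiry by the master are not modeled.

```python
class Replica:
    """Toy replica that applies mutations strictly in serial-number order."""

    def __init__(self):
        self.data = []        # mutations applied so far, in order
        self.pending = {}     # serial number -> mutation not yet applicable
        self.next_serial = 0  # next serial number this replica may apply

    def apply(self, serial, mutation):
        self.pending[serial] = mutation
        # Drain every mutation that is now next in the serial order.
        while self.next_serial in self.pending:
            self.data.append(self.pending.pop(self.next_serial))
            self.next_serial += 1


class Primary(Replica):
    """The replica holding the lease: it alone assigns serial numbers."""

    def __init__(self):
        super().__init__()
        self.serial = 0

    def order(self, mutation):
        """Assign the next serial number, apply locally, and return the number."""
        serial = self.serial
        self.serial += 1
        self.apply(serial, mutation)
        return serial
```

In this model a secondary that receives mutation 1 before mutation 0 simply buffers it, so all replicas end up with the same sequence without the master taking part in each write.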
Atomic Record Append • Client specifies only the data to write; GFS chooses the offset, appends the data to each replica at least once, and returns the offset to the client • Heavily used by Google’s distributed applications. • No need for a distributed lock manager • GFS chooses the offset, not the client
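The key point, that the system rather than the client picks the offset, can be sketched as follows. This is a simplified single-process model with hypothetical names; it ignores chunk boundaries, failures, and retries, so it does not show the duplicates that an "at least once" guarantee permits after a retry.

```python
class ChunkReplica:
    """Toy replica of one chunk, stored as a growable byte buffer."""

    def __init__(self):
        self.data = bytearray()

    def write_at(self, offset, record):
        # Pad with zero bytes if this replica is behind the chosen offset.
        if len(self.data) < offset:
            self.data.extend(b"\0" * (offset - len(self.data)))
        self.data[offset:offset + len(record)] = record


def record_append(primary, secondaries, record):
    """Append `record` at an offset chosen by the primary; return that offset."""
    offset = len(primary.data)  # chosen by the system, not by the client
    primary.write_at(offset, record)
    for replica in secondaries:
        replica.write_at(offset, record)  # same offset on every replica
    return offset
```

Since clients never propose offsets, two writers appending to the same file cannot collide on a position, which is why no distributed lock manager is needed in this scheme.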