Presentation Transcript

    PNUTS: Building and running a cloud database system
    Brian Cooper

    Slide 2:Overview

    Building a cloud service
    How PNUTS works
    Advanced features
    Lessons learned

    Slide 3:Yahoo!

    Yahoo! has almost 100 properties
        Mail, Messenger, Finance, Shopping, Sports, OMG!
        20 properties are #1 or #2
    Yahoo! is #1 in time spent online in U.S. (10.5%)
    164 million unique U.S. visitors in January
        79 percent of U.S. online audience
    598 million unique worldwide visitors in January
        48 percent of global online audience
    This is where we make our money!
        Users coming to Yahoo! sites and spending time
    We are focusing on the audience side of Yahoo!
        Not the search engine
    (Jan. 2010, source: ComScore)

    Slide 4:CLOUD COMPUTING

    Slide 5:Why?

    Two competing needs
        Accelerating innovation
            Focus on building your application, not the infrastructure
        Increasing availability
            Without infinite hardware and system operators
    How will cloud services help?
        Cloud services will perform the heavy lifting of scaling & high-availability
        Focus on horizontal cloud services
            Platforms to support multiple vertical applications

    Slide 6:Requirements for Cloud Services

    Multi-tenancy
        Support for multiple, organizationally distant customers
    Horizontal scaling
        Add cloud capacity incrementally and transparently as needed by tenants
    Elasticity
        Tenants can request and receive resources on-demand, paying for usage
    Security & Account management
        Accounts/IDs, authentication, access control; isolate tenants; data security
    Availability & Operability
        High availability and reliability over commodity hardware
        Easy to operate, with few operators; automated monitoring & metering

    Slide 7:Cloud Data Management Systems

    Large data analysis (Hadoop)
        Scan oriented workloads
        Focus on sequential disk I/O
        $ per CPU cycle
    Structured record storage (PNUTS)
        CRUD
        Point lookups and short scans
        Index organized table and random I/Os
        $ per latency
    Blob storage (MObStor)
        Object retrieval and streaming
        Scalable file storage
        $ per GB

    Slide 8:What Makes a Cloud Data Service?

    DBA to the world!
        Many apps
        Each with hundreds or thousands of client processes
        Must automanage: cannot manually tweak knobs
        Must autobalance: load will constantly shift
    Massive scalability
        Scaling up via shared or specialized hardware is infeasible
        Scale out with commodity hardware: 10,000 or 100,000 servers
    Failures are the common case
        Must continue to operate in the face of servers down
        Must autoscale: plug in new servers and let them go
    These capabilities must be baked in from the start

    Slide 9:WHAT IS PNUTS?

    Brian Sonja Jimi Brandon Kurt

    Slide 10: Example: social network updates

    Slide 11: Example: social network updates

    16  Mike     <ph..
    6   Jimi     <ph..
    8   Mary     <re..
    12  Sonja    <ph..
    15  Brandon  <po..
    17  Bob      <re..

    <photo>
      <title>Flower</title>
      <url>www.flickr.com</url>
    </photo>

    (caveat: not necessarily how our Y! Updates product actually works)

    Slide 12: The world has changed

    Can trade away standard DBMS features:
        Complicated queries
        Strong transactions
    But I must have my scalability, flexibility and availability!

    Slide 13:The PNUTS Solution

    Record-orientation: Optimized for low-latency record access
    Scale out: Add machines to scale throughput
    Asynchrony: Avoid expensive synchronous operations
    Consistency model: Hide complexity of asynchronous replication
    Flexible access: Hashed or ordered, indexes, views; flexible schemas
    Cloud deployment model: Hosted, managed service
    [VLDB 08]

    Slide 14:PNots

    Not a SQL database
        Simple queries, simple transaction model
    Not a parallel processing engine
        Though it can play well with MapReduce
    Not a filesystem
        Record storage, not blob storage
    Not peer-to-peer
        We own the servers and can save some complexity
        Servers organized into natural groups (datacenters)

    Slide 15:Data Model

    Slide 16:Query Model

    Simple call API
        Get
        Set
        Delete
        Scan
        Getrange
        Scan and Getrange with predicate
    Web service (RESTful) API
        Encode data as JSON
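
    To make the call API concrete, here is a minimal single-node sketch of a table that offers the five calls listed on this slide. It is not the PNUTS client library: the class name RecordTable, the method signatures, and the half-open key range used by getrange are all assumptions made for illustration.

```python
from bisect import bisect_left, insort
from typing import Callable, Iterator, Optional

Record = dict[str, str]  # flexible schema: any set of named fields
Predicate = Callable[[str, Record], bool]


class RecordTable:
    """Toy in-memory table (hypothetical) with get/set/delete/scan/getrange."""

    def __init__(self) -> None:
        self._rows: dict[str, Record] = {}
        self._keys: list[str] = []  # kept sorted so getrange can bisect

    def set(self, key: str, record: Record) -> None:
        if key not in self._rows:
            insort(self._keys, key)
        self._rows[key] = dict(record)

    def get(self, key: str) -> Optional[Record]:
        return self._rows.get(key)

    def delete(self, key: str) -> bool:
        if key not in self._rows:
            return False
        del self._rows[key]
        self._keys.pop(bisect_left(self._keys, key))
        return True

    def scan(self, predicate: Optional[Predicate] = None) -> Iterator[tuple[str, Record]]:
        """Yield every record, optionally filtered by a predicate."""
        for key in list(self._keys):
            record = self._rows[key]
            if predicate is None or predicate(key, record):
                yield key, record

    def getrange(self, start: str, end: str,
                 predicate: Optional[Predicate] = None) -> Iterator[tuple[str, Record]]:
        """Yield records with start <= key < end, in key order."""
        lo = bisect_left(self._keys, start)
        hi = bisect_left(self._keys, end)
        for key in self._keys[lo:hi]:
            record = self._rows[key]
            if predicate is None or predicate(key, record):
                yield key, record


table = RecordTable()
table.set("apple", {"price": "1"})
table.set("banana", {"price": "2"})
table.set("kiwi", {"price": "8"})
in_range = [key for key, _ in table.getrange("b", "l")]
cheap = [key for key, _ in table.scan(lambda k, r: int(r["price"]) < 5)]
```

    Keeping the keys sorted is what makes a range call cheap here; a hash-organized table could still serve get, set and delete but would have to visit everything to answer a range.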

    Slide 17: Representing sparse data

    $ curl http://pnuts.yahoo.com/PNUTSWebService/V1/get/userTable/yahoo
    {"record": {
        "status": {"code": 200, "message": "OK"},
        "metadata": {
            "seq_id": "5",
            "modtime": 1234231551,
            "disk_size": 89},
        "fields": {
            "addr": {"value": "700 First Ave"},
            "city": {"value": "Sunnyvale"},
            "state": {"value": "CA"}
        }
    }}
    (some details changed to protect the innocent)
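
    A client would typically unwrap this response into a plain mapping of field names to values. The sketch below parses the JSON shown on the slide with the standard json module; the helper name parse_record is hypothetical, and the response layout is taken from the slide only (which itself notes that some details were changed).

```python
import json

# Response body copied from the slide.
RESPONSE = """
{"record": {
  "status": {"code": 200, "message": "OK"},
  "metadata": {"seq_id": "5", "modtime": 1234231551, "disk_size": 89},
  "fields": {
    "addr": {"value": "700 First Ave"},
    "city": {"value": "Sunnyvale"},
    "state": {"value": "CA"}
  }
}}
"""


def parse_record(body: str) -> tuple[int, dict[str, str]]:
    """Return (sequence id, {field name: value}) from a get response."""
    record = json.loads(body)["record"]
    status = record["status"]
    if status["code"] != 200:
        raise RuntimeError(f"get failed: {status['code']} {status['message']}")
    fields = {name: cell["value"] for name, cell in record["fields"].items()}
    return int(record["metadata"]["seq_id"]), fields


seq_id, fields = parse_record(RESPONSE)
```

    Because only the fields that exist are listed, a sparse record costs nothing for absent columns: fields.get("zip") simply returns None.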

    Slide 18:DISTRIBUTION

    Storage units

    Slide 19:Architecture

    Slide 20:Tablet Splitting and Balancing

    Each storage unit has many tablets (horizontal partitions of the table)
    Tablets may grow over time
    Overfull tablets split
    Storage unit may become a hotspot
    Shed load by moving tablets to other servers
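
    The splitting step can be sketched as follows. The slide does not say where a tablet is split or what counts as overfull, so the median-key split point, the row-count threshold, and the Tablet type are assumptions for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Tablet:
    low: str   # inclusive lower key bound
    high: str  # exclusive upper key bound
    rows: dict[str, str] = field(default_factory=dict)


def split_if_overfull(tablet: Tablet, max_rows: int) -> list[Tablet]:
    """Split a tablet at its median key once it holds more than max_rows rows."""
    if len(tablet.rows) <= max_rows:
        return [tablet]
    keys = sorted(tablet.rows)
    mid = keys[len(keys) // 2]  # assumed split point: the median key
    left = Tablet(tablet.low, mid, {k: v for k, v in tablet.rows.items() if k < mid})
    right = Tablet(mid, tablet.high, {k: v for k, v in tablet.rows.items() if k >= mid})
    return [left, right]


tablet = Tablet("a", "z", {k: k.upper() for k in "abcdef"})
parts = split_if_overfull(tablet, max_rows=4)
```

    After a split, either half can be moved to another storage unit independently, which is what makes shedding load possible.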

    Slide 21:Tablets - Hash Table

    Table columns (values in slide order)
        Name: Apple, Lemon, Grape, Orange, Lime, Strawberry, Kiwi, Avocado, Tomato, Banana
        Description: Grapes are good to eat; Limes are green; Apple is wisdom; Strawberry shortcake; Arrgh! Don't get scurvy!; But at what price?; How much did you pay for this lemon?; Is this a vegetable?; New Zealand; The perfect fruit
        Price: $12, $9, $1, $900, $2, $3, $1, $14, $2, $8
    Tablet boundaries (hash of key): 0x0000, 0x2AF3, 0x911F, 0xFFFF

    Slide 22:Tablets - Ordered Table

    Table columns (values in slide order)
        Name: Apple, Banana, Grape, Orange, Lime, Strawberry, Kiwi, Avocado, Tomato, Lemon
        Description: Grapes are good to eat; Limes are green; Apple is wisdom; Strawberry shortcake; Arrgh! Don't get scurvy!; But at what price?; The perfect fruit; Is this a vegetable?; How much did you pay for this lemon?; New Zealand
        Price: $1, $3, $2, $12, $8, $1, $9, $2, $900, $14
    Tablet boundaries (key): A, H, Q, Z
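
    Both layouts reduce to the same lookup: find the interval that contains either the key's hash or the key itself. The sketch below uses the boundary values shown on the two tablet slides; the hash function (CRC-32 truncated to 16 bits), the function names, and the choice that a boundary value belongs to the tablet on its right are assumptions, not details from the talk.

```python
import zlib
from bisect import bisect_right

# Interior boundaries taken from the slides: three tablets in each layout.
HASH_SPLITS = [0x2AF3, 0x911F]   # hash space runs 0x0000..0xFFFF
ORDERED_SPLITS = ["H", "Q"]      # key space runs A..Z


def key_hash(key: str) -> int:
    """16-bit hash of the key (assumed: CRC-32 truncated to 0x0000..0xFFFF)."""
    return zlib.crc32(key.encode("utf-8")) & 0xFFFF


def tablet_for_hash(hash_value: int) -> int:
    """Index of the tablet whose hash interval contains hash_value."""
    return bisect_right(HASH_SPLITS, hash_value)


def hash_tablet(key: str) -> int:
    """Tablet index for a key in a hash-organized table."""
    return tablet_for_hash(key_hash(key))


def ordered_tablet(key: str) -> int:
    """Tablet index for a key in an ordered table."""
    return bisect_right(ORDERED_SPLITS, key)


ordered_placement = {name: ordered_tablet(name) for name in ("Apple", "Kiwi", "Tomato")}
```

    In the ordered layout, neighbouring keys land in the same tablet, so a range of keys touches few tablets; in the hash layout, neighbouring keys scatter, which spreads load evenly but makes range reads touch every tablet.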

    Slide 23:Accessing Data

    Get key k

    Slide 24:Updates

    Figure: update path for "Write key k". Components: Routers, storage units (SU), Log servers. Messages: Write key k, Sequence # for key k, SUCCESS.

    Slide 25:ASYNCHRONOUS REPLICATION AND CONSISTENCY

    Slide 26:Asynchronous Replication

    Slide 27:Global Replication

    (not necessarily actual Yahoo! datacenters)

    Slide 28:Consistency Model

    Goal: make it easier for applications to reason about updates and cope with asynchrony
    What happens to a record with primary key "Brian"?
    We also support an eventual consistency model
        Applications can choose which kind of table to create

    Figure: timeline of one record. Events: Record inserted, seven Updates, Delete. Versions v. 1 through v. 8 of Generation 1 along the time axis.

    Figure: the same timeline with two stale versions and the current version marked. Operation shown: Read.

    Slide 29:Timeline Model

    Figure: timeline of versions v. 1 through v. 8 (Generation 1), with two stale versions and the current version marked. Operation shown: Read up-to-date.

    Slide 30:Timeline Model

    Figure: timeline of versions v. 1 through v. 8 (Generation 1), with two stale versions and the current version marked. Operation shown: Read = v.6.

    Slide 31:Timeline Model

    Figure: timeline of versions v. 1 through v. 8 (Generation 1), with two stale versions and the current version marked. Operation shown: Write.

    Slide 32:Timeline Model

    Figure: timeline of versions v. 1 through v. 8 (Generation 1), with two stale versions and the current version marked. Operation shown: Write if = v.7, which returns ERROR.
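
    The operations shown across the Timeline Model slides can be sketched on a single record. The method names follow the calls described in the PNUTS paper cited earlier as [VLDB 08] (read-any, read-critical, read-latest, write, test-and-set-write); the TimelineRecord class, the single lagging replica, and the reading of "Read = v.6" as "version 6 or newer" are assumptions made for illustration.

```python
class VersionMismatch(Exception):
    """Raised when a test-and-set write names a version that is not current."""


class TimelineRecord:
    """One record's timeline at its master, plus one lagging replica (toy model)."""

    def __init__(self, value: str) -> None:
        self._versions = [value]   # _versions[i] holds version i + 1
        self.replica_version = 1   # newest version the local replica has applied

    @property
    def current_version(self) -> int:
        return len(self._versions)

    def write(self, value: str) -> int:
        """Append a new version to the timeline and return its number."""
        self._versions.append(value)
        return self.current_version

    def test_and_set_write(self, required_version: int, value: str) -> int:
        """Write only if required_version is still the current version."""
        if required_version != self.current_version:
            raise VersionMismatch(
                f"expected v.{required_version}, current is v.{self.current_version}")
        return self.write(value)

    def replicate_up_to(self, version: int) -> None:
        """Advance the replica; it never moves backwards on the timeline."""
        self.replica_version = max(self.replica_version,
                                   min(version, self.current_version))

    def _at(self, version: int) -> tuple[int, str]:
        return version, self._versions[version - 1]

    def read_any(self) -> tuple[int, str]:
        """Cheapest read: whatever the local replica has, possibly stale."""
        return self._at(self.replica_version)

    def read_latest(self) -> tuple[int, str]:
        """Read the current version from the master."""
        return self._at(self.current_version)

    def read_critical(self, required_version: int) -> tuple[int, str]:
        """Return some version at least as new as required_version."""
        if required_version > self.current_version:
            raise ValueError("required version does not exist yet")
        if self.replica_version >= required_version:
            return self.read_any()
        return self.read_latest()


record = TimelineRecord("v1")
for n in range(2, 9):          # versions v. 2 through v. 8
    record.write(f"v{n}")
record.replicate_up_to(4)      # the local replica is stale at v. 4
```

    With the current version at v. 8, a conditional write that names v. 7 fails, which matches the ERROR on the last figure if v. 8 is indeed the current version there.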

    Slide 33:Timeline Model

    Slide 34:Consistency levels

    Eventual consistency
    Transactions:
        Alice changes status from "Sleeping" to "Awake"
        Alice changes location from "Home" to "Work"
    Region 1: (Alice, Home, Sleeping)
    Region 2: (Alice, Home, Sleeping)
    Final state consistent

    Slide 35:Consistency levels

    Timeline consistency
    Transactions:
        Alice changes status from "Sleeping" to "Awake"
        Alice changes location from "Home" to "Work"
    Region 1: (Alice, Home, Sleeping)
    Region 2: (Alice, Home, Sleeping)
    Final state in both regions: (Alice, Work, Awake)
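
    One way to picture the difference is a replica that applies a record's updates strictly in sequence-number order, buffering anything that arrives early. This is a sketch under assumptions: the Replica class and its buffering strategy are illustrative, not the PNUTS replication protocol, and the sequence numbers are assumed to be assigned in one place.

```python
class Replica:
    """Applies one record's updates strictly in sequence order (toy model)."""

    def __init__(self, state: dict[str, str]) -> None:
        self.state = dict(state)
        self.applied = 0                        # last sequence number applied
        self.pending: dict[int, dict[str, str]] = {}
        self.history = [dict(state)]            # every state readers could see

    def deliver(self, seq: int, changes: dict[str, str]) -> None:
        """Accept an update in any arrival order; apply only when it is next."""
        self.pending[seq] = changes
        while self.applied + 1 in self.pending:
            self.applied += 1
            self.state.update(self.pending.pop(self.applied))
            self.history.append(dict(self.state))


start = {"name": "Alice", "location": "Home", "status": "Sleeping"}
updates = [(1, {"status": "Awake"}), (2, {"location": "Work"})]

region1, region2 = Replica(start), Replica(start)
for seq, changes in updates:             # region 1 receives updates in order
    region1.deliver(seq, changes)
for seq, changes in reversed(updates):   # region 2 receives them out of order
    region2.deliver(seq, changes)
```

    Both regions pass through the same states in the same order, so no reader ever sees Alice at Work while still Sleeping. A replica that applied updates as they arrived would reach the same final state but could expose that intermediate combination.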

    Slide 36:Mastering

    A  42342  E
    B  42521  W
    C  66354  W
    D  12352  E
    E  75656  C
    F  15677  E
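
    A small sketch of per-record mastering, using the table from this slide. It assumes the middle column is the record's value and the last column names the region that masters the record (E, W and C read here as east, west and central); the write function and its return shape are hypothetical.

```python
# key -> [value, master region], copied from the slide's table
RECORDS = {
    "A": ["42342", "E"],
    "B": ["42521", "W"],
    "C": ["66354", "W"],
    "D": ["12352", "E"],
    "E": ["75656", "C"],
    "F": ["15677", "E"],
}


def write(records: dict[str, list[str]], key: str, value: str,
          origin_region: str) -> dict[str, object]:
    """Apply a write at the record's master and report whether it was forwarded."""
    master = records[key][1]
    records[key][0] = value  # the update is ordered and applied at the master
    return {"master": master, "forwarded": master != origin_region}


local = write(RECORDS, "A", "42343", origin_region="E")
remote = write(RECORDS, "B", "42522", origin_region="E")
```

    Because each record has its own master, a write pays the cross-region hop only when it originates away from that record's master, rather than every write going to one fixed region.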

    Slide 37:Coping With Failures

    Figure: the record table from the Mastering slide (keys A through F, masters E, W and C) shown as three replica copies, with one copy marked X.

    Slide 38:ADVANCED FEATURES

    Slide 39: Ordered tables

    Time ranges
    Relationship graphs
    Hierarchical data
    Indexes and views
    Ordered tables provide efficient scanning of clustered subranges

    Slide 40:Ordered tables are tricky

    Hotspots!
    Solution: Proactive load balancing
        Move tablets from hot servers to cold servers
        If necessary, split hot tablets
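
    A greedy version of "move tablets from hot servers to cold servers" might look like the sketch below. The policy is an assumption, not the PNUTS balancer: it repeatedly moves the largest tablet that still narrows the gap between the hottest and coldest server, and stops when no such tablet exists, which is the point at which a hot tablet would need to be split instead.

```python
def rebalance(servers: dict[str, dict[str, int]],
              max_moves: int = 100) -> list[tuple[str, str, str]]:
    """Move tablets from the hottest to the coldest server.

    servers maps server name -> {tablet name: load}. Returns the moves made
    as (tablet, from_server, to_server) and updates servers in place.
    """
    moves = []
    for _ in range(max_moves):
        totals = {name: sum(tablets.values()) for name, tablets in servers.items()}
        hot = max(totals, key=totals.get)
        cold = min(totals, key=totals.get)
        gap = totals[hot] - totals[cold]
        # Only tablets lighter than the gap leave both servers better balanced.
        movable = [(load, tablet) for tablet, load in servers[hot].items()
                   if 0 < load < gap]
        if not movable:
            break  # balanced, or the hot tablet is too big to move: split it
        _, tablet = max(movable)
        servers[cold][tablet] = servers[hot].pop(tablet)
        moves.append((tablet, hot, cold))
    return moves


servers = {
    "su1": {"t1": 50, "t2": 30, "t3": 20},
    "su2": {"t4": 10},
    "su3": {"t5": 20},
}
moves = rebalance(servers)
totals_after = {name: sum(tablets.values()) for name, tablets in servers.items()}
```

    Each move strictly reduces the imbalance between the two servers involved, so the loop ends on its own; max_moves is only a safety bound.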

    Slide 41:Parallel scans

    Scan engine Client

    Slide 42: Adaptive server allocation

    Scan engine Client

    Slide 43:Server scheduling

    Scan engine Client 1 Client 2

    Slide 44:Indexes and views

    How to have lots of interesting indexes, without killing performance?
    Solution: Asynchrony!
        Indexes updated asynchronously when base table updated
    Some interesting views can be represented as indexes
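
    The idea of asynchronous index maintenance can be sketched with a queue between the base table and the index. Everything here is illustrative: the IndexedTable class, the in-process queue standing in for a log, and the maintain method standing in for a background maintainer are assumptions, and every record is assumed to carry the indexed field.

```python
from collections import deque
from typing import Optional


class IndexedTable:
    """Base table whose secondary index is updated later, not on the write path."""

    def __init__(self, index_field: str) -> None:
        self.index_field = index_field
        self.base: dict[str, dict[str, str]] = {}
        self.index: dict[str, set[str]] = {}   # field value -> primary keys
        self._log: deque = deque()             # queued (key, old, new) changes

    def set(self, key: str, record: dict[str, str]) -> None:
        """Write the base record and queue the index change; returns at once."""
        old: Optional[dict[str, str]] = self.base.get(key)
        self.base[key] = dict(record)
        self._log.append((key, old, dict(record)))

    def maintain(self) -> int:
        """Apply queued index changes; returns how many were applied."""
        applied = 0
        while self._log:
            key, old, new = self._log.popleft()
            if old is not None:
                self.index.get(old[self.index_field], set()).discard(key)
            self.index.setdefault(new[self.index_field], set()).add(key)
            applied += 1
        return applied

    def lookup(self, value: str) -> list[str]:
        """Primary keys currently indexed under value (may lag the base table)."""
        return sorted(self.index.get(value, set()))


posts = IndexedTable(index_field="author")
posts.set("post1", {"author": "brian", "title": "Flower"})
before = posts.lookup("brian")   # index has not caught up yet
posts.maintain()
after = posts.lookup("brian")
```

    The write returns before the index changes, so writers do not pay for each extra index; the cost is that index reads can briefly lag the base table.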

    Slide 45:View types

    Index
        Remote view table
    Figure labels: Base table, ByAuthor view table

    Slide 46:View types

    Equijoin
        Co-clustered remote view tables
        Each sub-table managed like an index
    Figure labels: PostComments view table, Posts view table, Comments table

    Slide 47:Remote view tables

    A regular table, but updated by the view maintainer instead of a client
    Figure labels: Update, Log server, Log server, SU

    Slide 48:SOME NUMBERS

    Slide 49:Performance comparison

    Setup
        Six server-class machines
            8 cores (2 x quadcore) 2.5 GHz CPUs
            8 GB RAM
            6 x 146GB 15K RPM SAS drives in RAID 1+0
            Gigabit ethernet
            RHEL 4
        Plus extra machines for clients, routers, controllers, etc.
    Workloads
        120 million 1 KB records = 20 GB per server
        Write heavy workload: 50/50 read/update
            Updates write the whole record
        50 client processes usually; up to 300 needed to generate higher throughputs
        Obviously many variations are possible; these are just two points in the space
    Metrics
        Latency versus throughput curves
    Caveats
        Write performance would be improved for Sherpa, Sharded and Cassandra with a dedicated log disk
        We tuned each system as well as we knew how
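
    The "20 GB per server" figure follows from the record count and the six machines. A quick check, assuming 1 KB means 1000 bytes and that the data is spread evenly with no replication or storage overhead counted:

```python
RECORDS = 120_000_000
RECORD_BYTES = 1_000   # assumption: 1 KB taken as 1000 bytes
SERVERS = 6
RAM_GB = 8             # per-server memory from the setup list

total_gb = RECORDS * RECORD_BYTES / 1e9   # whole data set
per_server_gb = total_gb / SERVERS        # even spread across the six machines
data_to_ram = per_server_gb / RAM_GB      # how far the data exceeds memory
```

    That gives 120 GB in total, 20 GB per server, and a data set 2.5 times larger than each server's RAM, so a workload that touches records at random cannot be served from memory alone.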

    Slide 50:Results

    Slide 51:YCSB

    Developing a common benchmark for serving systems
        Yahoo! Cloud Serving Benchmark
    Details coming soon

    Slide 52:CONCLUSIONS

    Slide 53:Lessons learned (1)

    Simpler is better than clever
        Clever approaches are hard to implement, test, debug and maintain
    Incremental is better than big-bang
        How many new things do you want to test at once?
        Why throw away years of hardening?

    Slide 54:Lessons learned (2)

    Non-algorithmic challenges can be hard
        Dealing with network config, legacy software and requirements, the corporate way, multiple stakeholders
    Researchers should get dirty hands
        Being a part of shipping a real system can radically readjust your worldview
        Write some test cases to understand system complexity

    Slide 55:New in 2010!

    SIGMOD and SIGOPS are starting a new conference, to be co-located alternately with SIGMOD and SOSP:
        ACM Symposium on Cloud Computing (SoCC)
    Steering committee: Phil Bernstein, Ken Birman, Joe Hellerstein, John Ousterhout, Raghu Ramakrishnan, Doug Terry, John Wilkes
    PC Chairs: Surajit Chaudhuri & Mendel Rosenblum
    http://research.microsoft.com/socc2010

    Slide 56:Research + product collaboration

    Yahoo! Research
        Raghu Ramakrishnan, Brian Cooper, Utkarsh Srivastava, Adam Silberstein, Erwin Tam
        Excellent interns: Parag Agrawal, Robert Ikeda, Ymir Vigfusson, Arvind Thiagarajan, Jeffrey Terrace, Yang Zhang, Mert Akdere, Prasang Upadhyaya
        Excellent visitors: Arno Jacobsen, Rodrigo Fonseca
    Cloud Computing
        Chuck Neerdaels, P.P.S. Narayan, Toby Negrin
        Plus Dev/QA/Ops teams

    Slide 57:Thanks!

    cooperb@yahoo-inc.com
    research.yahoo.com