An Overview of Cloud Computing
Raghu Ramakrishnan
Chief Scientist, Audience and Cloud Computing
Research Fellow, Yahoo! Research


Reflects many discussions with:

Eric Baldeschwieler, Jay Kistler, Chuck Neerdaels, Shelton Shugar, and Raymie Stata

and joint work with the Sherpa team, in particular:

Brian Cooper, Utkarsh Srivastava, Adam Silberstein and Nick Puz in Y! Research

Chuck Neerdaels, P.P. Suryanarayanan and many others in CCDI

CCDI: Research Collaboration
  • Yahoo! Research
    • Raghu Ramakrishnan
    • Brian Cooper
    • Utkarsh Srivastava
    • Adam Silberstein
    • Nick Puz
    • Rodrigo Fonseca
  • CCDI
    • Chuck Neerdaels
    • P.P.S. Narayan
    • Kevin Athey
    • Toby Negrin
    • Plus Dev/QA teams
SCENARIOS

Pie-in-the-sky

Living in the Clouds
  • We want to start a new website, FredsList.com
  • Our site will provide listings of items for sale, jobs, etc.
  • As time goes on, we’ll add more features
    • And illustrate how more cloud capabilities (and corresponding infrastructure components) are used as needed
      • List of capabilities/components is illustrative, not exhaustive
  • Our cloud provides a “dataset” abstraction
    • FredsList doesn’t worry about the underlying components
Step 1: Listings

FredsList wants to store listings as (key, category, description)

DECLARE DATASET Listings AS
( ID String PRIMARY KEY,
  Category String,
  Description Text )

Sample listings:
  • 5523442, childcare, Nanny available in San Jose
  • 1234323, transportation, For sale: one bicycle, barely used
  • 215534, wanted, Looking for issue 1 of Superman comic book

FredsList.com application, on top of Simple Web Service APIs:
  • Database: Sherpa
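A minimal sketch of what the Listings declaration describes, assuming an in-memory dictionary in place of the hosted database. The class and method names (`Listing`, `ListingsDataset`, `put`, `get`) are hypothetical and are not part of any Yahoo! API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Listing:
    """One row of the Listings dataset: (ID, Category, Description)."""
    id: str
    category: str
    description: str


class ListingsDataset:
    """Hypothetical in-memory stand-in for the dataset, keyed by primary key."""

    def __init__(self) -> None:
        self._rows = {}

    def put(self, listing: Listing) -> None:
        # The primary key is unique, so a second put with the same ID overwrites.
        self._rows[listing.id] = listing

    def get(self, listing_id: str) -> Optional[Listing]:
        return self._rows.get(listing_id)


listings = ListingsDataset()
listings.put(Listing("5523442", "childcare", "Nanny available in San Jose"))
listings.put(Listing("1234323", "transportation", "For sale: one bicycle, barely used"))
listings.put(Listing("215534", "wanted", "Looking for issue 1 of Superman comic book"))
```

The application only sees `put` and `get`; which storage component sits behind them is hidden, which is the point of the dataset abstraction.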

Step 2: Search

FredsList’s customers quickly ask for keyword search

ALTER Listings
SET Description SEARCHABLE

Example queries: “dvd’s”, “bicycle”, “nanny”

FredsList.com application, on top of Simple Web Service APIs:
  • Database: Sherpa
  • Search: Vespa
  • Messaging: YMB
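To illustrate what making Description searchable involves, here is a minimal inverted-index sketch. It is not how Vespa works internally; the tokenizer and the function names (`tokenize`, `build_index`, `search`) are assumptions chosen for the example.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def build_index(descriptions):
    """Map each word to the set of listing IDs whose description contains it."""
    index = defaultdict(set)
    for listing_id, text in descriptions.items():
        for word in tokenize(text):
            index[word].add(listing_id)
    return index


def search(index, query):
    """Return the IDs of listings that contain every word of the query."""
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


DESCRIPTIONS = {
    "5523442": "Nanny available in San Jose",
    "1234323": "For sale: one bicycle, barely used",
    "215534": "Looking for issue 1 of Superman comic book",
}
INDEX = build_index(DESCRIPTIONS)
```

With the sample listings, `search(INDEX, "bicycle")` finds listing 1234323, while a query for “dvd’s” matches nothing.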

Step 3: Photos

FredsList decides to add photos to listings

ALTER Listings
ADD Photo BLOB

FredsList.com application, on top of Simple Web Service APIs:
  • Storage: MObStor
  • Database: Sherpa
  • Search: Vespa
  • Messaging: YMB
  • Foreign key: photo → listing

Step 4: Data Analysis

FredsList wants to analyze its listings to get statistics about category, do geocoding, etc.

ALTER Listings
MAKE ANALYZABLE

Analysis jobs:
  • Hadoop program to generate fancy pages for listings
  • Hadoop program to geocode data
  • Pig query to analyze categories

FredsList.com application, on top of Simple Web Service APIs:
  • Storage: MObStor
  • Compute: Grid
  • Database: Sherpa
  • Search: Vespa
  • Messaging: YMB
  • Foreign key: photo → listing
  • Batch export
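The category statistics mentioned above can be sketched as a map step and a reduce step. This is a single-process illustration of the map-reduce pattern, not Hadoop or Pig code; `map_listing`, `reduce_counts` and `run_job` are hypothetical names.

```python
from collections import defaultdict


def map_listing(listing):
    """Map step: emit (category, 1) for one (id, category, description) record."""
    _listing_id, category, _description = listing
    yield category, 1


def reduce_counts(category, counts):
    """Reduce step: add up all the counts emitted for one category."""
    return category, sum(counts)


def run_job(listings):
    """Group the mapped pairs by key, then reduce each group."""
    groups = defaultdict(list)
    for listing in listings:
        for key, value in map_listing(listing):
            groups[key].append(value)
    return dict(reduce_counts(key, values) for key, values in groups.items())


LISTINGS = [
    ("5523442", "childcare", "Nanny available in San Jose"),
    ("1234323", "transportation", "For sale: one bicycle, barely used"),
    ("215534", "wanted", "Looking for issue 1 of Superman comic book"),
]
```

In a real batch system the grouping step is the distributed shuffle; here it is a dictionary.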

Step 5: Performance

FredsList wants to reduce its data access latency

ALTER Listings
MAKE CACHEABLE

FredsList.com application, on top of Simple Web Service APIs:
  • Storage: MObStor
  • Compute: Grid
  • Database: Sherpa
  • Caching: memcached
  • Search: Vespa
  • Messaging: YMB
  • Foreign key: photo → listing
  • Batch export
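One common way to put a cache in front of a store is the cache-aside pattern, sketched below. The slide does not say which pattern the cloud uses, so this is an assumption; a plain dictionary stands in for memcached and `CachedDataset` is a hypothetical name.

```python
class CachedDataset:
    """Cache-aside reads in front of a slower backing store (both are dicts here)."""

    def __init__(self, backing_store):
        self.store = backing_store
        self.cache = {}
        self.store_reads = 0  # counts how often the backing store is consulted

    def get(self, key):
        if key in self.cache:
            return self.cache[key]
        self.store_reads += 1
        value = self.store.get(key)
        if value is not None:
            self.cache[key] = value
        return value

    def put(self, key, value):
        self.store[key] = value
        # Invalidate rather than update, so the next read refills the cache.
        self.cache.pop(key, None)
```

Repeated reads of the same key hit the store once; a write invalidates the cached copy so that stale data is not served from the cache.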

EYES TO THE SKIES

Motherhood-and-Apple-Pie

Why Clouds?
  • On-demand infrastructure to create a fundamental shift in the OE curve. Lets us:
    • Do things we can’t do
    • Reduce time to market
    • Build more robustly, more efficiently, more globally, more completely, for a given budget
  • Cloud services should do the heavy lifting of scaling & high availability
    • Today, this is done at the app-level, which is not productive
Requirements for Cloud Services
  • Multitenant. A cloud service must support multiple, organizationally distant customers.
  • Elasticity. Tenants should be able to negotiate and receive resources/QoS on-demand.
  • Resource Sharing. Ideally, spare cloud resources should be transparently applied when a tenant’s negotiated QoS is insufficient, e.g., due to spikes.
  • Horizontal scaling. It should be possible to add cloud capacity in small increments; this should be transparent to the tenants of the service.
  • Metering. A cloud service must support accounting that reasonably ascribes operational and capital expenditures to each of the tenants of the service.
  • Security. A cloud service should be secure in that tenants are not made vulnerable because of loopholes in the cloud.
  • Availability. A cloud service should be highly available.
  • Operability. A cloud service should be easy to operate, with few operators. Operating costs should scale linearly or better with the capacity of the service.
Types of Cloud Services
  • Two kinds of cloud services:
    • Horizontal Cloud Services
      • Functionality enabling tenants to build applications or new services on top of the cloud
    • Functional Cloud Services
      • Functionality that is useful in and of itself to tenants. E.g., various SaaS instances, such as Salesforce.com; Google Analytics and Yahoo!’s IndexTools; Yahoo! properties aimed at end-users and small businesses, e.g., Flickr, Groups, Mail, News, Shopping
      • Could be built on top of horizontal cloud services or from scratch
      • Yahoo! has been offering these for a long while (e.g., Mail for SMB, Groups, Flickr, BOSS, Ad exchanges)
Horizontal Cloud Services
  • Horizontal cloud services are foundations on which tenants build applications or new services. They should be:
    • Semantics-free. Must be “generic infrastructure,” and not tied to specific app-logic.
      • May provide the ability to inject application logic through well-defined APIs
    • Broadly applicable. Must be broadly applicable (i.e., it can’t be intended for just one or two properties).
    • Fault-tolerant over commodity hardware. Must be built using inexpensive commodity hardware, and should mask component failures.
  • While each cloud service provides value, the power of the cloud paradigm will depend on a collection of well-chosen, loosely coupled services that collectively make it easy to quickly develop and operate innovative web applications.
What’s in the Horizontal Cloud?

Security

Simple Web Service APIs

Horizontal Cloud Services:
  • Provisioning & Virtualization, e.g., EC2
  • Batch Storage & Processing, e.g., Hadoop & Pig
  • Operational Storage, e.g., S3, MObStor, Sherpa
  • Edge Content Services, e.g., YCS, YCPI
  • Other Services: Messaging, Workflow, virtual DBs & Webserving

Shared Infrastructure:
  • ID & Account Management
  • Metering, Billing, Accounting
  • Monitoring & QoS

Common Approaches to QA, Production Engineering, Performance Engineering, Datacenter Management, and Optimization

Yahoo! CCDI Thrust Areas
  • Fast Provisioning and Machine Virtualization: On demand, deliver a set of hosts imaged with desired software and configured against standard services
    • Multiple hosts may be multiplexed onto the same physical machine.
  • Batch Storage and Processing: Scalable data storage optimized for batch processing, together with computational capabilities
  • Operational Storage: Persistent storage that supports low-latency updates and flexible retrieval
  • Edge Content Services: Support for dealing with network topology, communication protocols, caching, and BCP

Rest of today’s talk

Hadoop: Batch Storage/Analysis

Why is batch processing important?

  • Whether it’s
    • response-prediction for advertising,
    • machine-learned relevance for Search, or
    • content optimization for audience,
  • data-intensive computing is increasingly central to everything Yahoo! does
  • Hadoop is central to addressing this need
  • Hadoop is a case-study in our cloud vision
    • Processes enormous amounts of data
    • Provides horizontal scaling and fault-tolerance for our users
    • Allows those users to focus on their app logic

[Workflow]

High-level query layer (Pig)

Map-Reduce

HDFS


SHERPA

To Help You Scale Your Mountains of Data

The Yahoo! Storage Problem
  • Small records – 100KB or less
  • Structured records - tens, hundreds or thousands of fields
  • Extreme data scale - Tens of TB
  • Extreme request scale - Tens of thousands of requests/sec
  • Low latency globally - 20+ datacenters worldwide
  • High Availability - outages cost $millions
  • Variable usage patterns - as applications and users change

The Sherpa Solution

The next generation global-scale record store

  • Record-orientation: Routing, data storage optimized for low-latency record access
  • Scale out: Add machines to scale throughput (while keeping latency low)
  • Asynchrony: Pub-sub replication to far-flung datacenters to mask propagation delay
  • Consistency model: Reduce complexity of asynchrony for the application programmer
  • Cloud deployment model: Hosted, managed service to reduce app time-to-market and enable on demand scale and elasticity

What is Sherpa?

CREATE TABLE Parts (
  ID VARCHAR,
  StockNumber INT,
  Status VARCHAR
)

Records (shown in three copies on the slide):
  A 42342 E
  B 42521 W
  C 66354 W
  D 12352 E
  E 75656 C
  F 15677 E

  • Structured, flexible schema
  • Geographic replication
  • Parallel database
  • Hosted, managed infrastructure


What Will Sherpa Become?

CREATE TABLE Parts (
  ID VARCHAR,
  StockNumber INT,
  Status VARCHAR
)

Records (shown in three copies on the slide):
  A 42342 E
  B 42521 W
  C 66354 W
  D 12352 E
  E 75656 C
  F 15677 E

  • Indexes and views
  • Geographic replication
  • Parallel database
  • Structured, flexible schema
  • Hosted, managed infrastructure

Sherpa Design Goals

  • Consistency
    • Per-record guarantees
    • Timeline model
    • Option to relax if needed
  • Multiple access paths
    • Hash table, ordered table
    • Primary, secondary access
  • Hosted service
    • Applications plug and play
    • Share operational cost
  • Scalability
    • Thousands of machines
    • Easy to add capacity
    • Restrict query language to avoid costly queries
  • Geographic replication
    • Asynchronous replication around the globe
    • Low-latency local access
  • High availability and fault tolerance
    • Automatically recover from failures
    • Serve reads and writes despite failures

Technology Elements

  • Applications
  • Tabular API
  • PNUTS API
  • PNUTS
    • Query planning and execution
    • Index maintenance
  • Distributed infrastructure for tabular data
    • Data partitioning
    • Update consistency
    • Replication
  • YCA: Authorization
  • YDOT FS
    • Ordered tables
  • YDHT FS
    • Hash tables
  • YMB
    • Pub/sub messaging
  • Zookeeper
    • Consistency service

Data Manipulation

  • Per-record operations
    • Get
    • Set
    • Delete
  • Multi-record operations
    • Multiget
    • Scan
    • Getrange
  • Web service (RESTful) API
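The operations listed above can be sketched against an in-memory table. This only illustrates the semantics of each call; it is not the Sherpa web service API, and the class name `RecordTable`, the method signatures and the half-open range convention are assumptions.

```python
import bisect


class RecordTable:
    """Hypothetical in-memory table supporting the per-record and multi-record calls."""

    def __init__(self):
        self._rows = {}
        self._keys = []  # kept sorted so that scan and getrange return key order

    # Per-record operations
    def set(self, key, value):
        if key not in self._rows:
            bisect.insort(self._keys, key)
        self._rows[key] = value

    def get(self, key):
        return self._rows.get(key)

    def delete(self, key):
        if key in self._rows:
            del self._rows[key]
            self._keys.remove(key)

    # Multi-record operations
    def multiget(self, keys):
        return {key: self._rows.get(key) for key in keys}

    def scan(self):
        return [(key, self._rows[key]) for key in self._keys]

    def getrange(self, lo, hi):
        """Records with lo <= key < hi, in key order (half-open range assumed)."""
        start = bisect.bisect_left(self._keys, lo)
        end = bisect.bisect_left(self._keys, hi)
        return [(key, self._rows[key]) for key in self._keys[start:end]]
```

Note that `getrange` only makes sense on an ordered table; on a hash-organized table the keys have no useful order.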

Tablets—Hash Table

Rows are grouped into tablets by the hash of the key; the hexadecimal values mark the tablet boundaries.

Name | Description | Price

0x0000
Grape | Grapes are good to eat | $12
Lime | Limes are green | $9
Apple | Apple is wisdom | $1
Strawberry | Strawberry shortcake | $900
0x2AF3
Orange | Arrgh! Don’t get scurvy! | $2
Avocado | But at what price? | $3
Lemon | How much did you pay for this lemon? | $1
Tomato | Is this a vegetable? | $14
0x911F
Banana | The perfect fruit | $2
Kiwi | New Zealand | $8
0xFFFF
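A minimal sketch of how a key could be mapped to a tablet using the boundaries 0x2AF3 and 0x911F from the slide. The hash function is an assumption (MD5 truncated to 16 bits is an arbitrary stand-in), as is the choice that a boundary value belongs to the tablet above it, so the tablet chosen for a given fruit here need not match the slide.

```python
import bisect
import hashlib

# Interior split points from the slide; three tablets cover 0x0000..0xFFFF.
BOUNDARIES = [0x2AF3, 0x911F]


def hash16(key: str) -> int:
    """Map a key to a 16-bit hash value (MD5 is only a stand-in hash)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:2], "big")


def tablet_for_hash(h: int) -> int:
    """Return the index (0, 1 or 2) of the tablet whose interval contains h."""
    return bisect.bisect_right(BOUNDARIES, h)


def tablet_for_key(key: str) -> int:
    return tablet_for_hash(hash16(key))
```

Because neighbouring keys hash to unrelated values, this layout spreads load evenly but cannot serve range scans efficiently, which is what the ordered table is for.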

Tablets—Ordered Table

Rows are grouped into tablets by key range; the letters mark the tablet boundaries.

Name | Description | Price

A
Apple | Apple is wisdom | $1
Avocado | But at what price? | $3
Banana | The perfect fruit | $2
Grape | Grapes are good to eat | $12
H
Kiwi | New Zealand | $8
Lemon | How much did you pay for this lemon? | $1
Lime | Limes are green | $9
Orange | Arrgh! Don’t get scurvy! | $2
Q
Strawberry | Strawberry shortcake | $900
Tomato | Is this a vegetable? | $14
Z

Detailed Architecture

Remote regions

Local region

Clients

REST API

Routers

YMB

Tablet controller

Storage units

Tablet Splitting and Balancing

Storage unit

Tablet

Each storage unit has many tablets (horizontal partitions of the table)

Storage unit may become a hotspot

Tablets may grow over time

Overfull tablets split

Shed load by moving tablets to other servers
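The two mechanisms named above, splitting an overfull tablet and moving tablets off a busy storage unit, can be sketched as follows. The threshold, the median split point and the "move one tablet from busiest to least busy" policy are all assumptions made for illustration; the slide does not specify them.

```python
from dataclasses import dataclass, field

MAX_RECORDS = 4  # hypothetical size limit for a tablet


@dataclass
class Tablet:
    records: dict = field(default_factory=dict)


def split_if_overfull(tablet, limit=MAX_RECORDS):
    """Split a tablet at its median key when it exceeds the limit."""
    if len(tablet.records) <= limit:
        return [tablet]
    keys = sorted(tablet.records)
    mid = len(keys) // 2
    return [
        Tablet({k: tablet.records[k] for k in keys[:mid]}),
        Tablet({k: tablet.records[k] for k in keys[mid:]}),
    ]


def unit_load(tablets):
    """Load of a storage unit, measured here simply as its record count."""
    return sum(len(t.records) for t in tablets)


def shed_load(units):
    """Move one tablet from the busiest storage unit to the least busy one."""
    busiest = max(units, key=lambda name: unit_load(units[name]))
    idlest = min(units, key=lambda name: unit_load(units[name]))
    if busiest != idlest and len(units[busiest]) > 1:
        units[idlest].append(units[busiest].pop())
    return units
```

Splitting keeps tablets small enough to move cheaply; moving whole tablets is what relieves a hotspot.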

Accessing Data

Diagram (steps 1 to 4): a “Get key k” request is passed along to the storage unit (SU) that holds the key, and “Record for key k” is passed back.

Bulk Read

Diagram (steps 1 and 2): the key set {k1, k2, … kn} goes to a scatter/gather server, which issues Get k1, Get k2, Get k3, … against the storage units (SU).
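A small sketch of the scatter/gather idea: one Get is issued per key and the results are collected into a single response. Running the Gets on a thread pool is an assumption, as are the names `locate`, `get` and `multiget`; the three dictionaries stand in for storage units.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical storage units, each holding a slice of the table.
STORAGE_UNITS = [
    {"k1": "record 1"},
    {"k2": "record 2"},
    {"k3": "record 3"},
]


def locate(key):
    """Return the storage unit holding key (a linear search stands in for routing)."""
    for unit in STORAGE_UNITS:
        if key in unit:
            return unit
    return None


def get(key):
    """Fetch one record; a missing key yields None."""
    unit = locate(key)
    return key, (unit[key] if unit is not None else None)


def multiget(keys):
    """Scatter one Get per key across a thread pool, then gather the results."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        return dict(pool.map(get, keys))
```

The client makes a single call and the fan-out happens on the server side, so the round trips to individual storage units are hidden from the application.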


Range Queries in YDOT
  • Clustered, ordered retrieval of records

Router’s interval map:
  • Keys before Canteloupe: Storage unit 1
  • Canteloupe up to Lime: Storage unit 3
  • Lime up to Strawberry: Storage unit 2
  • Strawberry and after: Storage unit 1

Example: the router splits the query Grapefruit…Pear? into Grapefruit…Lime? and Lime…Pear?

Records in key order: Apple, Avocado, Banana, Blueberry, Canteloupe, Grape, Kiwi, Lemon, Lime, Mango, Orange, Strawberry, Tomato, Watermelon

Records by storage unit:
  • Storage unit 1: Apple, Avocado, Banana, Blueberry, Strawberry, Tomato, Watermelon
  • Storage unit 2: Lime, Mango, Orange
  • Storage unit 3: Canteloupe, Grape, Kiwi, Lemon
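The range-splitting step can be sketched with a sorted list of split keys. The split keys and owners below are read off the router table on the slide (spelled as on the slide); the function name `split_range` and the half-open [lo, hi) convention are assumptions.

```python
import bisect

# Split keys and the storage unit owning each of the four resulting intervals.
SPLIT_KEYS = ["Canteloupe", "Lime", "Strawberry"]
OWNERS = [1, 3, 2, 1]


def split_range(lo, hi):
    """Split [lo, hi) into per-storage-unit sub-ranges, in key order.

    Assumes lo < hi. Each result is (sub_lo, sub_hi, storage_unit).
    """
    first = bisect.bisect_right(SPLIT_KEYS, lo)
    last = bisect.bisect_left(SPLIT_KEYS, hi)
    bounds = [lo] + SPLIT_KEYS[first:last] + [hi]
    return [
        (bounds[i], bounds[i + 1], OWNERS[first + i])
        for i in range(len(bounds) - 1)
    ]
```

For the query on the slide, `split_range("Grapefruit", "Pear")` yields one sub-range ending at Lime for storage unit 3 and one starting at Lime for storage unit 2.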

Updates

Diagram (steps numbered 1 to 8 on the slide):
  • Components: Routers, Message brokers, storage units (SU)
  • Messages: Write key k; Sequence # for key k; SUCCESS
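A loose sketch of an update path that uses the elements labeled on the slide: a write reaches a storage unit, gets a per-key sequence number, and is handed to a message broker for later delivery to other copies. The exact ordering of the eight steps is not recoverable from the transcript, so the flow below is an assumption, and all class names are hypothetical.

```python
from collections import deque


class Replica:
    """A copy of the table: key -> (sequence number, value)."""

    def __init__(self):
        self.data = {}

    def apply(self, key, seq, value):
        # Ignore anything older than what this copy already holds.
        if key not in self.data or self.data[key][0] < seq:
            self.data[key] = (seq, value)


class MessageBroker:
    """Hypothetical stand-in for the pub/sub layer: a FIFO of pending updates."""

    def __init__(self, subscribers):
        self.subscribers = subscribers
        self.pending = deque()

    def publish(self, key, seq, value):
        self.pending.append((key, seq, value))

    def deliver_all(self):
        while self.pending:
            key, seq, value = self.pending.popleft()
            for replica in self.subscribers:
                replica.apply(key, seq, value)


class MasterUnit(Replica):
    """The storage unit that orders writes for its keys."""

    def __init__(self, broker):
        super().__init__()
        self.broker = broker

    def write(self, key, value):
        """Assign the next sequence number, apply locally, publish, return the number."""
        seq = self.data[key][0] + 1 if key in self.data else 1
        self.apply(key, seq, value)
        self.broker.publish(key, seq, value)
        return seq
```

The writer gets its sequence number back as soon as the update is published; remote copies catch up when the broker delivers, which is where the asynchrony comes from.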


Consistency Model

Goal: make it easier for applications to reason about updates and cope with asynchrony

What happens to a record with primary key “Brian”?

Timeline (Generation 1): the record is inserted (v. 1), updated seven times (v. 2 through v. 8), and then deleted.

Consistency Model

Read

Timeline (Generation 1): v. 1 through v. 8 over time, with stale versions and the current version marked.

Consistency Model

Read up-to-date

Timeline (Generation 1): v. 1 through v. 8 over time, with stale versions and the current version marked.

Consistency Model

Read ≥ v.6

Timeline (Generation 1): v. 1 through v. 8 over time, with stale versions and the current version marked.

Consistency Model

Write

Timeline (Generation 1): v. 1 through v. 8 over time, with stale versions and the current version marked.

Consistency Model

Write if = v.7: ERROR

Timeline (Generation 1): v. 1 through v. 8 over time, with stale versions and the current version marked.

Consistency Model

Mechanism: per record mastership

Write if = v.7: ERROR

Timeline (Generation 1): v. 1 through v. 8 over time, with stale versions and the current version marked.
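The operations shown on the consistency slides (Read, Read up-to-date, Read ≥ v, Write, Write if = v) can be sketched for a single record with one master copy and one possibly stale replica. The method names are hypothetical, and treating v. 8 as the current version in the usage note is an assumption consistent with the ERROR shown for “Write if = v.7”.

```python
class VersionMismatch(Exception):
    """Raised when a requested version cannot be satisfied."""


class Record:
    """One record with a master copy and a local replica that may lag behind."""

    def __init__(self, value):
        self.master = (1, value)   # (version, value) at the record's master
        self.replica = (1, value)  # local copy, refreshed only by sync_replica

    def sync_replica(self):
        self.replica = self.master

    def read(self):
        """Read: may return a stale version from the local replica."""
        return self.replica

    def read_up_to_date(self):
        """Read up-to-date: always returns the current version."""
        return self.master

    def read_at_least(self, version):
        """Read >= v: the replica is acceptable only if it is new enough."""
        if self.replica[0] >= version:
            return self.replica
        if self.master[0] >= version:
            return self.master
        raise VersionMismatch(f"no version >= {version} exists")

    def write(self, value):
        """Write: all writes go through the master, producing the next version."""
        self.master = (self.master[0] + 1, value)
        return self.master[0]

    def write_if(self, expected_version, value):
        """Write if = v: succeeds only when v is still the current version."""
        if self.master[0] != expected_version:
            raise VersionMismatch(f"current version is {self.master[0]}")
        return self.write(value)
```

After seven updates the master holds v. 8, so `write_if(7, ...)` raises the error and `write_if(8, ...)` succeeds. Routing every write for a record through one master is what keeps the versions on a single timeline.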

Mastering

Diagram: the six records below appear in three copies on the slide, together with a “Tablet master” label.
  A 42342 E
  B 42521 W
  C 66354 W
  D 12352 E
  E 75656 C
  F 15677 E

Bulk Insert/Update/Replace
  • Client feeds records to bulk manager
  • Bulk loader transfers records to SUs (storage units) in batches
    • Bypass routers and message brokers
    • Efficient import into storage unit

Client

Bulk manager

Source Data

Bulk Load in YDOT
  • YDOT bulk inserts can cause performance hotspots
  • Solution: preallocate tablets
Index Maintenance
  • How to have lots of interesting indexes, without killing performance?
  • Solution: Asynchrony!
    • Indexes updated asynchronously when base table updated

Planned functionality
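A minimal sketch of asynchronous index maintenance: the base table is updated immediately, and the matching index change is queued and applied later. The slide describes this as planned functionality, so everything here is illustrative; the `Table` class, the single index on a `category` field and the queue are assumptions.

```python
from collections import defaultdict, deque


class Table:
    """Base table with one secondary index that is maintained lazily."""

    def __init__(self):
        self.rows = {}                 # primary key -> row dict
        self.index = defaultdict(set)  # category value -> set of primary keys
        self.pending = deque()         # index changes not yet applied

    def put(self, key, row):
        """Update the base table now; queue the index change for later."""
        old = self.rows.get(key)
        self.rows[key] = row
        self.pending.append((key, old, row))

    def apply_index_updates(self):
        """Drain the queue, as a background worker might do."""
        while self.pending:
            key, old, new = self.pending.popleft()
            if old is not None:
                self.index[old["category"]].discard(key)
            self.index[new["category"]].add(key)

    def lookup(self, category):
        """Index lookup; may miss rows whose index update is still pending."""
        return sorted(self.index.get(category, ()))
```

Writes stay fast because they never wait on the index, at the cost of index lookups briefly lagging behind the base table.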

MObStor

Yahoo!’s next-generation globally replicated, virtualized media object storage service

Better provisioning, easy migration, replication, better BCP, and performance

New features (Evergreen URLs, CDN integration, REST API, …)

The object metadata problem addressed using Sherpa, though MObStor is focused on blob storage.

The World Has Changed
  • Web applications need:
    • Scalability!
      • Preferably elastic
    • Geographic distribution
    • High availability
    • Reliable storage
  • Web applications can do without:
    • Complicated queries
    • Strong transactions
Web Data Management
  • Structured record storage (PNUTS)
    • CRUD
    • Point lookups and short scans
    • Index organized table and random I/Os
    • $ per latency
  • Large data analysis (Hadoop)
    • Scan oriented workloads
    • Focus on sequential disk I/O
    • $ per cpu cycle
  • Blob storage (SAN/NAS)
    • Object retrieval and streaming
    • Scalable file storage
    • $ per GB

Types of Record Stores
  • Query expressiveness (from simple to feature rich)
    • S3: Object retrieval
    • PNUTS: Retrieval from single table of objects/records
    • Oracle: SQL

Types of Record Stores
  • Consistency model (from best effort to strong guarantees)
    • S3: Eventual consistency
    • PNUTS: Timeline consistency
    • Oracle: ACID
  • Program centric consistency
  • Object-centric consistency

Types of Record Stores
  • Elasticity (ability to add resources on demand)
    • Stores compared: PNUTS, S3, Oracle
    • Scale: Not scalable; Limited (via data distribution); Elastic
    • VLSD (Very Large Scale Distribution/Replication)

Data Stores Comparison

Versus PNUTS:
  • User-partitioned SQL stores (Microsoft Azure SDS, Amazon SimpleDB)
    • More expressive queries
    • Users must control partitioning
    • Limited elasticity
  • Multi-tenant application databases (Salesforce.com, Oracle on Demand)
    • Highly optimized for complex workloads
    • Limited flexibility to evolving applications
    • Inherit limitations of underlying data management system
  • Mutable object stores (Amazon S3)
    • Object storage versus record management
Application Design Space

Chart axes: “Get a few things” versus “Scan everything”; “Files” versus “Records”

Systems placed on the chart: Sherpa, MObStor, YMDB, MySQL, Oracle, Filer, BigTable, Hadoop, Everest

Alternatives Matrix

Matrix columns: Consistency model, Structured access, Global low latency, SQL/ACID, Availability, Operability, Updates, Elastic

Matrix rows: Sherpa, Y! UDB, MySQL, Oracle, HDFS, BigTable, Dynamo, Cassandra

(Cell values are not preserved in this transcript.)

Further Reading

  • Efficient Bulk Insertion into a Distributed Ordered Table (SIGMOD 2008)
    • Adam Silberstein, Brian Cooper, Utkarsh Srivastava, Erik Vee, Ramana Yerneni, Raghu Ramakrishnan
  • PNUTS: Yahoo!’s Hosted Data Serving Platform (VLDB 2008)
    • Brian Cooper, Raghu Ramakrishnan, Utkarsh Srivastava, Adam Silberstein, Phil Bohannon, Hans-Arno Jacobsen, Nick Puz, Daniel Weaver, Ramana Yerneni

Opening Up Yahoo! Search

Phase 1

Phase 2

BOSS takes Yahoo!’s open strategy to the next level by providing Yahoo! Search infrastructure and technology to developers and companies to help them build their own search experiences.

Giving site owners and developers control over the appearance of Yahoo! Search results.

Search Results of the Future

yelp.com

Gawker

babycenter

New York Times

epicurious

LinkedIn

answers.com

webmd

BOSS Offerings

BOSS offers two options for companies and developers and has partnered with top technology universities to drive search experimentation, innovation and research into next generation search.

  • API
    • A self-service, web services model for developers and start-ups to quickly build and deploy new search experiences.
  • CUSTOM
    • Working with 3rd parties to build a more relevant, brand/site specific web search experience.
    • This option is jointly built by Yahoo! and select partners.
  • ACADEMIC
    • Working with the following universities to allow for wide-scale research in the search field:
      • University of Illinois Urbana Champaign
      • Carnegie Mellon University
      • Stanford University
      • Purdue University
      • MIT
      • Indian Institute of Technology Bombay
      • University of Massachusetts

(Slide courtesy Prabhakar Raghavan)