MOAT: A Multi-Object Assignment Toolkit

Haifeng Yu

Intel Research Pittsburgh / CMU

Joint work with:

Phillip B. Gibbons

Intel Research Pittsburgh


Background

  • Availability has become a principal design goal:

    • 0.1% improvement → $2M / year for Amazon and eBay [internetweek.com]

    • One major focus of 8 OSDI’04 papers (out of 27)

  • Two orthogonal efforts:

    • Lower-level system components robustness

      • Example: disk, individual machine, Internet routing

    • Higher-level redundancy

      • Example: data replication

  • This talk focuses on higher-level redundancy

Haifeng Yu, Intel Research Pittsburgh / CMU


High Availability via Replication

  • Large amount of data accessed by many users:

    • Distributed file systems

    • Network monitoring (PIER, SDIMS, IRISLOG)

    • Index databases for search engine (Google, p2p)

    • Scientific / medical databases

  • Data replicated across multiple machines

    • Object: The unit for replication

      • File, file block, database table, database tuple, inverted index for a certain keyword

Multi-object Accesses

  • Many accesses request multiple objects

    • Compile a project

    • Writing a paper in LaTeX

    • Asking for aggregates of network conditions

    • Search for web pages containing multiple keywords

  • Availability of single object can be misleading:

    • An access requesting 1,000 objects can observe up to 1,000 times higher unavailability

    • There’s more subtlety.....
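The "up to 1,000 times" figure can be sanity-checked with a minimal sketch. It assumes, purely for illustration, that the requested objects fail independently and that the access needs all of them; the function names are hypothetical and not part of MOAT. Correlated objects behave differently, which is the subtlety the slide alludes to.

```python
def multi_object_unavailability(u: float, n: int) -> float:
    """Unavailability of an access needing all n objects,
    assuming (for illustration) that objects fail independently,
    each with unavailability u."""
    return 1.0 - (1.0 - u) ** n


def union_bound(u: float, n: int) -> float:
    """Upper bound n * u, which holds with or without independence."""
    return min(1.0, n * u)


if __name__ == "__main__":
    u, n = 1e-6, 1000
    print(multi_object_unavailability(u, n), union_bound(u, n))
```

For small u the exact value sits just below the bound n * u, which is where the factor of n comes from.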

A Simple Example

[Figure: two candidate assignments of the eight replicas to four machines. One places {A, B}, {A, B}, {C, D}, {C, D}; the other places {A, C}, {A, B}, {C, D}, {B, D}.]

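  • Compile a small project with four files; each file has two replicas: A, A, B, B, C, C, D, D

  • Four machines fail independently with the same probability; each holds two files

  • Which assignment gives better availability?

    • When all four files are needed, placing {A, B} on two machines and {C, D} on the other two is better than spreading the replicas as {A, C}, {A, B}, {C, D}, {B, D}

Assignment matters because objects are now correlated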
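The comparison can be checked by brute force. The sketch below enumerates all 16 up/down states of the four machines and sums the probability of the states in which fewer than t distinct files are reachable. The function name, the constant names, and the threshold parameter t are illustrative additions; t = 4 corresponds to the compile example, where every file is needed.

```python
from itertools import product


def unavailability(assignment, t, p):
    """Exact probability that fewer than t distinct objects are reachable.

    assignment: list of sets, one per machine, naming the objects it holds.
    p: probability that a machine is down (independent across machines).
    """
    total = 0.0
    for state in product([True, False], repeat=len(assignment)):
        prob = 1.0
        reachable = set()
        for alive, held in zip(state, assignment):
            prob *= (1.0 - p) if alive else p
            if alive:
                reachable |= held
        if len(reachable) < t:
            total += prob
    return total


# The two assignments from the example.
PARTITIONED = [{"A", "B"}, {"A", "B"}, {"C", "D"}, {"C", "D"}]
MIXED = [{"A", "C"}, {"A", "B"}, {"C", "D"}, {"B", "D"}]

if __name__ == "__main__":
    for t in (4, 3):
        print(t, unavailability(PARTITIONED, t, 0.1), unavailability(MIXED, t, 0.1))
```

With p = 0.1 the partitioned assignment is less often unavailable when all four files are required, while the mixed assignment does better once three files suffice.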

A Simple Example - Continued

[Figure: the same two assignments, {A, B}, {A, B}, {C, D}, {C, D} and {A, C}, {A, B}, {C, D}, {B, D}.]

  • Suppose the user is happy even if only three objects are available (e.g., when computing an average)

    • Now the assignment {A, C}, {A, B}, {C, D}, {B, D} is better than {A, B}, {A, B}, {C, D}, {C, D}

  • Assignment makes a difference

    • Even if we are using the same machines (same amount of redundancy/resource)

    • The difference can easily be multiple nines of availability

Goal and Contributions

  • MOAT (Multi-Object Assignment Toolkit):

    • Goal: High availability for multi-object accesses

    • Key issue: Replica assignment

  • Contributions:

    • First to observe the importance of replica assignment

    • Strong theoretical results regarding best and worst assignments

    • Practical designs to approximate optimal assignments

    • MOAT toolkit implementation for replica assignments

Outline

  • Motivation and MOAT contributions

  • System model and case studies of existing systems

  • Theoretical results

  • Designs for approximating optimal assignments

  • Designs for mixed accesses

  • Conclusions

Assumptions for This Talk

  • Assume:

    • Replication (no erasure coding)

    • Crash failures (no Byzantine failures)

    • Eventual consistency (no quorum or voting)

    • Most of our results hold without these assumptions

  • Assume same replication degree for all objects

    • We have results for different replication degrees as well

  • Talk to me if interested in the more complete story...

MOAT Architecture Overview

[Figure: layered architecture. Applications (file system, p2p DB, search engine, network monitoring) sit on top of MOAT and use its Data API (obj create / delete / read / write) and Control API (assignment policy). MOAT handles replication / repair / load balancing / naming / assignment. Below MOAT is the storage system: raw data on distributed machines or disks.]


System Model

[Figure: example assignment with four machines holding {A, B}, {C, D}, {A, B}, {C, D}.]

  • Basic system model:

    • N objects, each with k replicas

    • Load balancing among all machines

    • Machines fail independently with the same probability

  • An assignment is a mapping replica → machine, for all Nk replicas
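One possible way to represent such a mapping in code is a dictionary from object name to the list of machines holding its replicas. The sketch below is an illustration, not MOAT's data structure; the function name is hypothetical, and the check that replicas of one object sit on distinct machines is an added assumption.

```python
from collections import Counter


def load_imbalance(assignment, num_machines, k):
    """Validate an assignment and return max load minus min load.

    assignment: dict mapping object -> list of k machine ids in
    range(num_machines), one entry per replica.
    """
    load = Counter()
    for obj, machines in assignment.items():
        if len(machines) != k:
            raise ValueError(f"{obj!r} does not have exactly {k} replicas")
        # Assumption: two replicas of one object never share a machine.
        if len(set(machines)) != k:
            raise ValueError(f"{obj!r} has two replicas on one machine")
        for m in machines:
            if not 0 <= m < num_machines:
                raise ValueError(f"unknown machine id {m}")
            load[m] += 1
    loads = [load[m] for m in range(num_machines)]
    return max(loads) - min(loads)


if __name__ == "__main__":
    example = {"A": [0, 1], "B": [0, 1], "C": [2, 3], "D": [2, 3]}
    print(load_imbalance(example, num_machines=4, k=2))
```

A return value of 0 means every machine holds the same number of replicas, which is the load-balancing condition in the model.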

Some Simple Assignments

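  • PTN: partition assignment

    • Used in most Coda deployments in practice [Satyanarayanan et al.’90]

[Figure: PTN for k = 2. Machines are paired, and both machines of a pair hold the same objects, e.g. {A, B, C} on one pair and {D, E, F} on the next.]

  • RAND: place each replica on a randomly chosen machine

    • Similar to the Google File System [Ghemawat et al.’03]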
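The two assignments can be sketched as follows. This is an illustration under stated assumptions, not the code of Coda or the Google File System: machines are numbered 0..M-1, PTN fills consecutive machine groups with consecutive blocks of objects, and this RAND variant does not enforce exact load balance. Both function names are hypothetical.

```python
import random


def ptn_assignment(objects, num_machines, k):
    """Partition machines into groups of k; every object in a block of
    objects is replicated on all machines of one group."""
    if num_machines % k:
        raise ValueError("num_machines must be a multiple of k")
    groups = [list(range(g, g + k)) for g in range(0, num_machines, k)]
    per_group = -(-len(objects) // len(groups))  # ceiling division
    return {obj: groups[i // per_group] for i, obj in enumerate(objects)}


def rand_assignment(objects, num_machines, k, seed=0):
    """Place the k replicas of each object on k distinct random machines."""
    rng = random.Random(seed)
    return {obj: rng.sample(range(num_machines), k) for obj in objects}


if __name__ == "__main__":
    objs = list("ABCDEF")
    print(ptn_assignment(objs, num_machines=4, k=2))
    print(rand_assignment(objs, num_machines=4, k=2))
```

With six objects, four machines, and k = 2, the PTN sketch puts A, B, C on machines 0 and 1 and D, E, F on machines 2 and 3.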

Assignment in Chord [Stoica et al.’01]

  • DHTs:

    • Hash machine IP to get machine id

  • Assignment in Chord:

    • Sliding window

    • Neither PTN nor RAND

[Figure: Chord ring with machine ids 080, 090, 098, 101, 104, 120; hash(A) = 95; replicas of objects A, B, C placed on the ring.]
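A minimal sketch of the sliding-window placement, assuming it means that an object's k replicas go to the first machine whose id is at or after the object's hash and to the next k - 1 machines clockwise. That reading, and the function name, are assumptions; the ids and the hash value are the ones shown on the slide.

```python
from bisect import bisect_left


def chord_replicas(key, machine_ids, k):
    """Return the k machines that hold an object hashed to `key`:
    its successor on the ring and the next k - 1 machines clockwise."""
    ring = sorted(machine_ids)
    start = bisect_left(ring, key) % len(ring)  # wrap past the largest id
    return [ring[(start + i) % len(ring)] for i in range(k)]


if __name__ == "__main__":
    ring = [120, 80, 104, 90, 101, 98]
    print(chord_replicas(95, ring, k=2))
```

Because neighbouring keys share some but not all of their machines, the windows overlap, which is why the result is neither a partition nor a random placement.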

Assignment in CAN [Ratnasamy et al.’01]

  • Hash object k times

    • CAN uses a similar approach

  • Similar to RAND

    • But machines may hold slightly different numbers of objects

[Figure: ring with machine ids 080, 090, 098, 101, 104, 120; hash1(A) = 95 determines where the first replica of A is placed.]

[Figure, next animation steps: hash2(A) = 119 determines where the second replica of A is placed; then hash1(B) = 84 and hash2(B) = 100 determine where the two replicas of B are placed.]
Which assignment should we use?

  • MOAT goal: improve availability of multi-object accesses

  • If an access requests n (n ≤ N) objects, what if only x are available?

  • Threshold-based success definition:

    • If x ≥ t, the user is happy → Available

    • If x < t, confidence is too low → Unavailable

  • Availability for an access is defined as:

    • Prob[≥ t objects available out of n requested objects]
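The threshold definition translates directly into a success test plus a Monte Carlo estimate of the probability. The sketch below is illustrative only: the function names, the trial count, and the fixed seed are assumptions, and machines are assumed to fail independently with probability p as in the system model.

```python
import random


def access_succeeds(assignment, failed, requested, t):
    """True if at least t requested objects have a replica on a live machine."""
    available = sum(
        1 for obj in requested
        if any(m not in failed for m in assignment[obj])
    )
    return available >= t


def estimate_availability(assignment, machines, requested, t, p,
                          trials=10_000, seed=0):
    """Monte Carlo estimate of Prob[>= t of the requested objects available]."""
    rng = random.Random(seed)
    successes = 0
    for _ in range(trials):
        failed = {m for m in machines if rng.random() < p}
        if access_succeeds(assignment, failed, requested, t):
            successes += 1
    return successes / trials


if __name__ == "__main__":
    a = {"A": [0, 1], "B": [0, 1], "C": [2, 3], "D": [2, 3]}
    print(estimate_availability(a, range(4), list("ABCD"), t=3, p=0.2))
```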

Examples of t

  • t = n

    • File systems

    • Search for terrorist images in image database

  • t close to n

    • Query for top-10 most-loaded machines on PlanetLab

  • t not close to n

    • Sample with confidence

Formal Results

  • For an access requesting all N objects

  • Theorem: Among all assignments, when t = N:

    • PTN is best (within constant)

    • RAND is worst (within constant)

    • Difference is about c-fold (c is the number of objects per machine)

  • Theorem: Among all assignments, when t = c+1 < N:

    • PTN is worst

    • RAND is best (within constant)

    • Difference is even larger
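A back-of-the-envelope argument, not the proof from the talk, suggests where the factor c in the t = N case comes from. It assumes M machines that each fail independently with a small probability p, so that c = Nk/M, and it uses only a union bound for RAND, which is close to tight when N p^k is small and replica sets are mostly distinct.

```latex
\begin{align*}
  % PTN: the access fails iff all k machines of some group fail; there are M/k groups.
  U_{\mathrm{PTN}}  &= 1 - \bigl(1 - p^{k}\bigr)^{M/k} \approx \frac{M}{k}\, p^{k} \\
  % RAND: each object has its own replica set; union bound over the N objects.
  U_{\mathrm{RAND}} &\le N p^{k} \\
  % Ratio of the two leading terms.
  \frac{N p^{k}}{(M/k)\, p^{k}} &= \frac{Nk}{M} = c
\end{align*}
```

For 40,000 objects with 4 replicas each on 400 machines, c = 40,000 · 4 / 400 = 400.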

Numerical Examples (from Simulation)

40,000 objects, 4 replicas each, 400 machines, fail prob = 0.2

[Figure: unavailability versus threshold for PTN, Chord, and RAND (CAN), with the unavailability of a single object marked for reference. Annotation: c times difference if p is small, where c is # obj/machine.]

A Spectrum of Assignments

40,000 objects, 4 replicas each, 400 machines, fail prob = 0.2

[Figure: unavailability versus threshold for a spectrum of assignments between PTN and RAND (CAN).]

More Formal Arguments

  • Tradeoff is fundamental:

    • Impossible to achieve the best of RAND and PTN

  • Previous results apply only to accesses requesting all N objects

    • Similar results hold for accesses requesting n (n ≤ N) objects

    • But each machine may not be filled to capacity:

      • For PTN, use as few machines as possible

      • For RAND, use as many machines as possible

  • I have more....talk to me if you are interested

Access Requesting 500 Objects

40,000 objects, 4 replicas each, 400 machines, fail prob = 0.2

[Figure: unavailability versus threshold for RAND (CAN), Chord, and PTN.]

Design of Replica Assignment

  • Trivial in a static / centralized environment

  • Challenging in dynamic environment:

    • We may not have global knowledge with many objects and many machines

  • Basic solution: Consistent hashing

    • But some re-design is necessary

Approximating RAND

  • Multi-hash DHT:

    • Hash the object k times

    • As in CAN

[Figure: ring with machine ids 080, 090, 098, 101, 104, 120 holding replicas of A and B; hash1(B) = 84, hash2(B) = 100.]
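A minimal sketch of multi-hash placement on the ring from the slides. It assumes each hash value is mapped to the first machine whose id is at or after it, wrapping around; the salted SHA-256 hash, the id space of 128, and the function names are illustrative choices, not part of CAN or MOAT.

```python
from bisect import bisect_left
import hashlib


def successor(ring, point):
    """First machine id at or after `point` on a sorted ring, with wrap-around."""
    return ring[bisect_left(ring, point) % len(ring)]


def place_at_points(points, machine_ids):
    """Map already-computed hash values to machines."""
    ring = sorted(machine_ids)
    return [successor(ring, pt) for pt in points]


def multi_hash_replicas(obj, machine_ids, k, id_space=128):
    """Hash the object k times with different salts and place one replica per
    hash. Two hashes may land on the same machine; a real system would have to
    handle that case."""
    points = []
    for i in range(1, k + 1):
        digest = hashlib.sha256(f"{i}:{obj}".encode()).digest()
        points.append(int.from_bytes(digest[:8], "big") % id_space)
    return place_at_points(points, machine_ids)


if __name__ == "__main__":
    ring = [120, 80, 104, 90, 101, 98]
    print(place_at_points([95, 119], ring))  # hash1(A), hash2(A) from the slides
    print(place_at_points([84, 100], ring))  # hash1(B), hash2(B) from the slides
```

Since every object draws its k machines independently, the resulting placement behaves much like RAND, apart from small load differences between machines.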

Approximating PTN

  • Chord does not achieve PTN

[Figure: Chord ring with machine ids 080, 090, 098, 101, 104, 120 and hash(A) = 95; replicas of A, B, C placed on the ring.]

  • Group DHT:

    • (Arbitrarily) group machines into groups of size k

[Figure: Group DHT ring in which each id (090, 101, 120) is shared by a group of two machines; hash(A) = 95; replicas of A, B, C placed on the groups.]
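The Group DHT idea can be sketched as follows: machines are chunked into groups of size k, each group owns one id on the ring, and an object is replicated on every machine of the group that succeeds its hash. The function names, the way leftover machines are ignored, and the machine labels are assumptions made for illustration.

```python
from bisect import bisect_left


def form_groups(machines, k):
    """Arbitrarily chunk machines into groups of size k.
    Leftover machines (fewer than k) stay ungrouped in this sketch."""
    usable = len(machines) - len(machines) % k
    return [machines[i:i + k] for i in range(0, usable, k)]


def group_dht_replicas(key, group_ids, groups):
    """Return the machines of the group whose ring id succeeds `key`.
    group_ids[i] is the ring id owned by groups[i]."""
    ring = sorted(zip(group_ids, groups))
    ids = [gid for gid, _ in ring]
    index = bisect_left(ids, key) % len(ids)  # wrap past the largest id
    return ring[index][1]


if __name__ == "__main__":
    groups = form_groups(["m1", "m2", "m3", "m4", "m5", "m6"], k=2)
    print(group_dht_replicas(95, [90, 101, 120], groups))
```

Every object that hashes to the same group lands on exactly the same k machines, which is the partition structure PTN requires.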

Node Join and Leave in Group DHT

  • Maintain r rendezvous points in the DHT

    • Diminishing Chord [Karger et al.’04] / ReDir [Karp et al.’04]

  • A new node reports to a random rendezvous point

  • If a group can be formed, it joins the DHT

  • Two options upon node leave:

    • Dismiss the group and delete it from the DHT

    • The group waits to recruit a new node

  • Groups use the rendezvous point to decide
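The join and leave steps can be sketched as a toy, single-process model. Everything here is hypothetical: the class and function names, the in-memory list standing in for the DHT, and the choice to model only the "dismiss" option, with survivors reporting back to a rendezvous point. The real protocol is distributed and is not specified on the slide.

```python
import random


class RendezvousPoint:
    """Collects unattached nodes until k of them can form a group."""

    def __init__(self, k):
        self.k = k
        self.waiting = []

    def report(self, node):
        """Register a node; return a new group once k nodes are waiting."""
        self.waiting.append(node)
        if len(self.waiting) >= self.k:
            group = self.waiting[:self.k]
            self.waiting = self.waiting[self.k:]
            return group
        return None


def join(node, points, dht_groups, rng):
    """A new node reports to a random rendezvous point; a formed group joins."""
    group = rng.choice(points).report(node)
    if group is not None:
        dht_groups.append(group)
    return group


def leave_and_dismiss(node, dht_groups, point):
    """Dismiss option: delete the node's group; survivors report back."""
    for group in dht_groups:
        if node in group:
            dht_groups.remove(group)
            for survivor in group:
                if survivor != node:
                    formed = point.report(survivor)
                    if formed is not None:
                        dht_groups.append(formed)
            return


if __name__ == "__main__":
    rng = random.Random(0)
    points, dht = [RendezvousPoint(k=2)], []
    for n in ["n1", "n2", "n3"]:
        join(n, points, dht, rng)
    print(dht, points[0].waiting)
```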

Complexity Analysis

Mixture of Queries

  • Previous designs only cover a single access requesting all N objects

    • PTN if t close to N

    • RAND if t far from N

  • But there are other accesses

    • Requesting n (n < N) objects with threshold t

  • How does t change with n?

    • Infinite possibilities

    • We focus on 4 large categories

Four Application Scenarios

Strict accesses: t n

Loose accesses: t< n

Loose for both small and large n

  • Goal:

    • Approach RAND for both small and large n

  • Design:

    • Multi-hash DHT

[Figure: multi-hash DHT ring with machine ids 080, 090, 098, 101, 104, 120 holding replicas of A and B; hash1(B) = 84, hash2(B) = 100.]

Loose for small n; Strict for large n

  • Goal:

    • Approach RAND for small n

    • Approach PTN for large n

  • Design:

    • Group DHT

[Figure: Group DHT ring with ids 090, 101, 120; replicas of A, B, C placed on machine groups.]

Strict for both small and large n

  • Goal:

    • Approach PTN for both small and large n

  • Assume accesses are tree accesses

  • Design:

    • Group DHT with item-balancing [Karger et al.’04]

[Figure: Group DHT ring with ids 090, 101, 120 and A = 95; replicas of A, B, C placed on machine groups.]

Strict for small n; Loose for large n

  • Goal:

    • Approach PTN for n < R

    • Approach RAND for n >> R

  • Design:

    • Multi-hash DHT

    • But cluster objects into clusters of constant size R

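[Figure: multi-hash DHT ring with machine ids 080, 090, 098, 101, 104, 120; objects A and B form one cluster with hash1(AB) = 84 and hash2(AB) = 100, so A and B are placed together.]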
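A minimal sketch of the clustered variant: objects are cut into clusters of size R, and the cluster rather than the individual object is hashed k times, so all members of a cluster share the same machines. Forming clusters from consecutive objects in a list, the salted SHA-256 hash, the id space of 128, and all function names are illustrative assumptions; object names are assumed to be strings.

```python
from bisect import bisect_left
import hashlib


def successor(ring, point):
    """First machine id at or after `point` on a sorted ring, with wrap-around."""
    return ring[bisect_left(ring, point) % len(ring)]


def cluster_point(cluster, i, id_space=128):
    """i-th hash of a cluster (an illustrative salted SHA-256)."""
    digest = hashlib.sha256(f"{i}:{'|'.join(cluster)}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % id_space


def clustered_multi_hash(objects, R, machine_ids, k, point_fn=cluster_point):
    """Assign every object of a size-R cluster to the same k hashed machines."""
    ring = sorted(machine_ids)
    assignment = {}
    for start in range(0, len(objects), R):
        cluster = tuple(objects[start:start + R])
        replicas = [successor(ring, point_fn(cluster, i)) for i in range(1, k + 1)]
        for obj in cluster:
            assignment[obj] = replicas
    return assignment


if __name__ == "__main__":
    ring = [120, 80, 104, 90, 101, 98]
    # Hash values taken from the slide: hash1(AB) = 84, hash2(AB) = 100.
    slide_points = lambda cluster, i: {1: 84, 2: 100}[i]
    print(clustered_multi_hash(["A", "B"], 2, ring, 2, slide_points))
```

An access that stays within one cluster sees a PTN-like placement, while an access spanning many clusters sees clusters scattered as in RAND.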

Simulation Results for Strict Accesses

Here an access needs all n objects to be successful

400 machines, fail prob = 0.2, 40,000 objects, 4 replicas per object

[Figure: unavailability versus the number (n) of objects requested by an access.]

Simulation Results for Loose Accesses

Here an access needs only t = n - 150 objects to be successful

400 machines, fail prob = 0.2, 40,000 objects, 4 replicas per object

[Figure: unavailability versus the number (n) of objects requested by an access.]

Current Status

  • Waiting for paper deadlines

  • Finishing implementing MOAT

  • Evaluation on IrisLog trace and file system traces

Related Work

  • Multi-object accesses rarely addressed

    • CFS [Dabek et al.’01] focuses on individual file blocks

    • Chain replication [Renesse et al.’04] considers single data object

    • A long list .....

  • Replica assignment largely ignored

    • Different DHTs (e.g., Chord, Pastry, CAN) use dramatically different replica assignment: Effects not understood / studied

  • Replica placement [Douceur et al.’01, Li et al.’99, Qiu et al.’01, Venkataramani et al.’01, Yu et al.’04] is well studied:

    • Typically for machines in different locations in the network

    • Machines are heterogeneous

    • These approaches do not apply to replica assignment

Conclusions

  • Availability becoming key design goal

    • Multi-object access availability dramatically different from single-object availability

  • MOAT Contributions:

    • First to observe the importance of replica assignment

    • Strong theoretical results regarding the best and worst assignments

    • Practical designs to approximate optimal assignments

    • MOAT toolkit implementation

My Other Recent Work

  • Om [NSDI’04]:

    • Consistent and automatic replica regeneration

    • Regenerate from any single replica rather than a majority

  • Signed quorum systems [PODC’04]:

    • Constant quorum size at the cost of small prob of inconsistency

  • Node failure characteristics in WAN [WORLDS’04]:

    • Answer subtle questions regarding real-world failure properties

Erasure Coding

  • Encode the object into k fragments and any m (m < k) out of k fragments can reconstruct the object

  • RAID techniques are special cases

  • Replication is a special case where m = 1
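The availability of a single erasure-coded object follows from the definition: it is reachable when at least m of its k fragments sit on live machines. The sketch below assumes, for illustration, that the k fragments are on distinct machines that fail independently with probability p; the function name is hypothetical.

```python
from math import comb


def object_availability(k, m, p):
    """Probability that at least m of k fragments are on live machines,
    assuming each fragment's machine fails independently with probability p."""
    return sum(
        comb(k, i) * (1.0 - p) ** i * p ** (k - i)
        for i in range(m, k + 1)
    )


if __name__ == "__main__":
    print(object_availability(k=2, m=1, p=0.2))  # plain replication, two copies
    print(object_availability(k=8, m=4, p=0.2))  # any 4 out of 8 fragments
```

Setting m = 1 reduces the sum to 1 - p^k, the familiar availability of k-way replication.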

Example Revisited

[Figure: the two assignments from the simple example, {A, B}, {A, B}, {C, D}, {C, D} and {A, C}, {A, B}, {C, D}, {B, D}.]

  • Need four files to compile:

    • The assignment {A, B}, {A, B}, {C, D}, {C, D} is better than {A, C}, {A, B}, {C, D}, {B, D}

Can we treat A, B, C, D as a single object and use erasure coding, so that all files can be reconstructed from any 4 out of 8 fragments?

  • Erasure coding is hard to apply across a large amount of data

    • Updating any portion of the data requires updating k - m + 1 fragments ≈ the size of the original data

    • We cannot use erasure coding across 1,000 files

Threshold Semantics and Erasure Coding

In short, they are different, orthogonal concepts

Numerical Examples (from Simulation)

40,000 objects, 4 replicas each, 400 machines, fail prob = 0.2

[Figure: unavailability versus threshold for Chord, PTN, CRAND (100), CRAND (10), and RAND (CAN). Annotation: c times difference if p is small, where c is # obj/machine.]