MOAT: A Multi-Object Assignment Toolkit


Haifeng Yu

Intel Research Pittsburgh / CMU

Joint work with:

Phillip B. Gibbons

Intel Research Pittsburgh

Background
  • Availability has become a principal design goal:
    • 0.1% improvement → $2M / year for Amazon and eBay [internetweek.com]

    • One major focus of 8 OSDI’04 papers (out of 27)
  • Two orthogonal efforts:
    • Lower-level system components robustness
      • Example: disk, individual machine, Internet routing
    • Higher-level redundancy
      • Example: data replication
  • This talk focuses on higher-level redundancy

Haifeng Yu, Intel Research Pittsburgh / CMU

High Availability via Replication
  • Large amount of data accessed by many users:
    • Distributed file systems
    • Network monitoring (PIER, SDIMS, IRISLOG)
    • Index databases for search engine (Google, p2p)
    • Scientific / medical databases
  • Data replicated across multiple machines
    • Object: The unit for replication
      • File, file block, database table, database tuple, inverted index for a certain keyword

Multi-object Accesses
  • Many accesses request multiple objects
    • Compile a project
    • Writing a paper in LaTeX
    • Asking for aggregates of network conditions
    • Search for web pages containing multiple keywords
  • Availability of single object can be misleading:
    • An access requesting 1,000 objects can observe up to 1,000 times higher unavailability
    • There’s more subtlety.....
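The "up to 1,000 times" figure can be checked with a short calculation. The sketch below assumes, for illustration only, that the requested objects fail independently of each other and that one object with k replicas is unavailable with probability p^k; the function names are hypothetical and not part of MOAT.

```python
def object_unavailability(p, k):
    """Probability that all k replicas of one object are down (independent machines)."""
    return p ** k


def access_unavailability(u, n):
    """Probability that at least one of n independently failing objects is unavailable."""
    return 1.0 - (1.0 - u) ** n


u = object_unavailability(p=0.01, k=3)
access = access_unavailability(u, n=1000)
ratio = access / u  # bounded above by n (union bound)
print(f"single object: {u:.3e}, 1,000-object access: {access:.3e}, ratio: {ratio:.1f}")
```

The ratio approaches n when the per-object unavailability is small. Once objects share machines, the independence assumption no longer holds, which is the subtlety the following slides explore.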

A Simple Example
  • Compile a small project with four files; each file has two replicas: A, A, B, B, C, C, D, D
  • Four machines fail independently with the same probability; each holds two files
  • Which assignment gives better availability:
    • Assignment 1: {A, B}, {A, B}, {C, D}, {C, D}
    • or Assignment 2: {A, C}, {A, B}, {C, D}, {B, D}
    • Better: Assignment 1 (when all four files are needed)

Assignment matters because objects are now correlated
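With only four machines, the comparison can be checked by enumerating all 16 failure patterns. This is a minimal sketch under the slide's assumptions (independent failures with the same probability p); the machine layouts are the two from the example, while the function name and the choice p = 0.1 are illustrative.

```python
from itertools import product

# Each machine is listed with the set of files it stores.
ASSIGNMENT_1 = [{"A", "B"}, {"A", "B"}, {"C", "D"}, {"C", "D"}]
ASSIGNMENT_2 = [{"A", "C"}, {"A", "B"}, {"C", "D"}, {"B", "D"}]
FILES = {"A", "B", "C", "D"}


def unavailability(assignment, t, p):
    """Exact probability that fewer than t files are available."""
    total = 0.0
    # Enumerate every failure pattern; True means the machine has failed.
    for pattern in product([False, True], repeat=len(assignment)):
        prob = 1.0
        alive = set()
        for failed, files in zip(pattern, assignment):
            prob *= p if failed else (1.0 - p)
            if not failed:
                alive |= files
        if len(alive & FILES) < t:
            total += prob
    return total


for t in (4, 3):
    u1 = unavailability(ASSIGNMENT_1, t, 0.1)
    u2 = unavailability(ASSIGNMENT_2, t, 0.1)
    print(f"t = {t}: assignment 1 -> {u1:.6f}, assignment 2 -> {u2:.6f}")
```

Calling the function with t = 4 covers the case where every file is needed, and t = 3 covers the case where one missing file is tolerated; the preferred assignment differs between the two.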

A Simple Example - Continued
  • Suppose the user is happy even if only three objects are available (e.g., when computing an average)
    • Assignment 1: {A, B}, {A, B}, {C, D}, {C, D}
    • or Assignment 2: {A, C}, {A, B}, {C, D}, {B, D}
    • Better: Assignment 2

  • Assignment makes a difference
    • Even if we are using the same machines (same amount of redundancy/resource)
    • Easily have multiple-nine difference
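The size of the gap can be worked out by hand for the four-machine example when three of the four files suffice. The derivation below assumes each machine fails independently with probability p. The labels are introduced here for illustration: U_1 is the unavailability when the machines hold {A, B}, {A, B}, {C, D}, {C, D}, and U_2 is the unavailability when they hold {A, C}, {A, B}, {C, D}, {B, D}.

```latex
\begin{align*}
U_1 &= 1 - (1 - p^2)^2 = 2p^2 - p^4 \\
U_2 &= 4p^3(1 - p) + p^4 = 4p^3 - 3p^4
\end{align*}
```

In the first layout, two files are lost as soon as both machines of either pair fail. In the second layout, two files are lost only when at least three machines fail. For p = 0.01 this gives roughly 2 × 10⁻⁴ versus 4 × 10⁻⁶, a factor of about 50, and the ratio grows as p shrinks.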

Goal and Contributions
  • MOAT (Multi-Object Assignment Toolkit):
    • Goal: High availability for multi-object accesses
    • Key issue: Replica assignment
  • Contributions:
    • First to observe the importance of replica assignment
    • Strong theoretical results regarding best and worst assignments
    • Practical designs to approximate optimal assignments
    • MOAT toolkit implementation for replica assignments

Outline
  • Motivation and MOAT contributions 
  • System model and case studies of existing systems
  • Theoretical results
  • Designs for approximating optimal assignments
  • Designs for mixed accesses
  • Conclusions

Assumptions for This Talk
  • Assume:
    • Replication (no erasure coding)
    • Crash failures (no Byzantine failures)
    • Eventual consistency (no quorum or voting)
    • Most of our results hold without these assumptions
  • Assume same replication degree for all objects
    • We have results for different replication degrees as well
  • Talk to me if interested in the more complete story...

MOAT Architecture Overview

[Figure: MOAT as a storage system layer between applications and raw data. Applications (file system, p2p DB, search engine, network monitoring) use a Data API (obj create / delete / read / write) and a Control API (assignment policy). MOAT handles replication / repair / load balancing / naming / assignment on top of raw data on distributed machines or disks.]

System Model
  • Basic system model:
    • N objects, each with k replicas
    • Load balancing among all machines
    • Machines fail independently with the same probability
  • An assignment is a mapping: replica → machine, for all Nk replicas

[Figure: example assignment on four machines holding {A, B}, {C, D}, {A, B}, {C, D}.]

Some Simple Assignments
  • PTN: partition assignment
    • Used in most practical deployments of Coda [Satyanarayanan et al.’90]
    • Example for k = 2: machines hold {A, B, C}, {D, E, F}, ..., {A, B, C}, {D, E, F}, ...
  • RAND: pick a random machine for each replica
    • Similar to the Google File System [Ghemawat et al.’03]
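The two assignments can be written down as small generator functions. This is a sketch under simplifying assumptions, not MOAT's implementation: machines are numbered 0..M-1, PTN requires M to be a multiple of k, and RAND here does not enforce load balancing. All names are hypothetical.

```python
import random


def ptn_assignment(num_objects, num_machines, k):
    """Partition machines into groups of k; every object lives on all machines of one group."""
    if num_machines % k != 0:
        raise ValueError("this sketch needs num_machines to be a multiple of k")
    num_groups = num_machines // k
    per_group = -(-num_objects // num_groups)  # ceiling division
    return {
        obj: [(obj // per_group) * k + i for i in range(k)]
        for obj in range(num_objects)
    }


def rand_assignment(num_objects, num_machines, k, seed=0):
    """Place each object on k distinct machines chosen uniformly at random."""
    rng = random.Random(seed)
    return {obj: rng.sample(range(num_machines), k) for obj in range(num_objects)}


ptn = ptn_assignment(num_objects=6, num_machines=4, k=2)
rnd = rand_assignment(num_objects=6, num_machines=4, k=2)
print("PTN :", ptn)
print("RAND:", rnd)
```

With six objects, four machines, and k = 2, the PTN sketch reproduces the structure of the slide's example: objects 0 to 2 share one pair of machines and objects 3 to 5 share the other pair.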

Assignment in Chord [Stoica et al.’01]
  • DHTs:
    • Hash machine IP to get machine id
  • Assignment in Chord:
    • Sliding window
    • Neither PTN nor RAND

[Figure: ring of machines with ids 080, 090, 098, 101, 104, 120; hash(A) = 95, and the replicas of A, B, and C fall on overlapping runs of consecutive machines.]
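A sliding-window placement of this kind can be sketched as follows. The sketch assumes the common Chord-style convention that an object's replicas go to the first k machines whose ids follow the object's hash on the ring; the machine ids and hash(A) = 95 come from the slide, while the function name and k = 2 are illustrative.

```python
from bisect import bisect_left

MACHINE_IDS = [120, 80, 104, 90, 101, 98]


def chord_replicas(key_hash, machine_ids, k):
    """Return the k consecutive machines starting at the successor of key_hash."""
    ring = sorted(machine_ids)
    # First machine whose id is >= key_hash, wrapping around the ring.
    start = bisect_left(ring, key_hash) % len(ring)
    return [ring[(start + i) % len(ring)] for i in range(k)]


print(chord_replicas(95, MACHINE_IDS, k=2))   # successor of 95 and the next machine
print(chord_replicas(121, MACHINE_IDS, k=2))  # wraps around to the start of the ring
```

Because neighbouring hashes map to overlapping windows, machines end up sharing some objects but not all, which is why this placement is neither PTN nor RAND.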

Assignment in CAN [Ratnasamy et al.’01]
  • Hash object k times
    • CAN uses a similar approach
  • Similar to RAND
    • But machines may have slightly different numbers of objects

[Figure: ring of machines with ids 080, 090, 098, 101, 104, 120. hash1(A) = 95 and hash2(A) = 119 place the two replicas of A; hash1(B) = 84 and hash2(B) = 100 place the two replicas of B.]
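Hashing an object k times can be sketched on the same ring. The sketch assumes each hash value is mapped to its successor machine, as in the figure; `salted_hash` is a hypothetical stand-in for k independent hash functions and is not taken from CAN or MOAT.

```python
import hashlib
from bisect import bisect_left

MACHINE_IDS = [80, 90, 98, 101, 104, 120]


def successor(h, machine_ids):
    """First machine whose id is >= h, wrapping around the ring."""
    ring = sorted(machine_ids)
    return ring[bisect_left(ring, h) % len(ring)]


def salted_hash(name, i, ring_size=128):
    """Hypothetical i-th hash function: SHA-1 of a salted name, reduced to the ring."""
    digest = hashlib.sha1(f"{i}:{name}".encode()).hexdigest()
    return int(digest, 16) % ring_size


def multi_hash_replicas(hashes, machine_ids):
    """One replica per hash value, each placed on that value's successor machine."""
    return [successor(h, machine_ids) for h in hashes]


print(multi_hash_replicas([95, 119], MACHINE_IDS))  # hash values of A from the slide
print(multi_hash_replicas([84, 100], MACHINE_IDS))  # hash values of B from the slide
print(multi_hash_replicas([salted_hash("A", i) for i in range(2)], MACHINE_IDS))
```

Two hash values of the same object can land on the same machine; a real design would need to handle that case, which this sketch does not.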

Which assignment should we use?
  • MOAT goal: improve availability of multi-object accesses
  • If an access requests n (n ≤ N) objects, what if only x are available?
  • Threshold-based success definition:
    • If x ≥ t, user happy → Available
    • If x < t, too low confidence → Unavailable
  • Availability for an access defined as:
    • Prob[≥ t objects available out of n requested objects]
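The threshold definition translates directly into a predicate over a set of failed machines. In this sketch an assignment is a dictionary from object name to the machines holding its replicas; the representation and function names are assumptions made for illustration.

```python
def available_count(assignment, failed, requested):
    """Number of requested objects with at least one replica on a live machine."""
    return sum(
        1
        for obj in requested
        if any(machine not in failed for machine in assignment[obj])
    )


def access_succeeds(assignment, failed, requested, t):
    """Threshold-based success: the access is available if x >= t."""
    return available_count(assignment, failed, requested) >= t


assignment = {"A": [0, 1], "B": [0, 1], "C": [2, 3], "D": [2, 3]}
failed = {0, 1}  # both machines holding A and B are down
print(access_succeeds(assignment, failed, ["A", "B", "C", "D"], t=4))
print(access_succeeds(assignment, failed, ["A", "B", "C", "D"], t=2))
```

Availability for an access is then the probability, over failure patterns, that this predicate holds.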

Examples of t
  • t = n
    • File systems
    • Search for terrorist images in image database
  • t close to n
    • Query for top-10 most-loaded machines on PlanetLab
  • t not close to n
    • Sample with confidence

Outline
  • Motivation and MOAT contributions 
  • System model and case studies of existing systems 
  • Theoretical results
  • Designs for approximating optimal assignments
  • Designs for mixed accesses
  • Conclusions

Formal Results
  • For access requesting N objects
  • Theorem: Among all assignments, when t = N:
    • PTN is best (within constant)
    • RAND is worst (within constant)
    • Difference is about a factor of c (c is # objects / machine)
  • Theorem: Among all assignments, when t = c+1 < N:
    • PTN is worst
    • RAND is best (within constant)
    • Difference is even larger
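The t = N case can be explored with a small Monte Carlo simulation. This is a sketch, not the simulator behind the talk: the parameters (60 objects, 12 machines, k = 2, p = 0.1) are chosen only to run quickly, RAND here is a single random draw without load balancing, and all names are hypothetical.

```python
import random


def ptn_assignment(num_objects, num_machines, k):
    """Partition machines into groups of k; each object is stored on one whole group."""
    num_groups = num_machines // k
    per_group = -(-num_objects // num_groups)  # ceiling division
    return [[(obj // per_group) * k + i for i in range(k)] for obj in range(num_objects)]


def rand_assignment(num_objects, num_machines, k, rng):
    """Each object is stored on k distinct machines picked uniformly at random."""
    return [rng.sample(range(num_machines), k) for _ in range(num_objects)]


def estimate_unavailability(assignment, num_machines, p, t, trials, rng):
    """Fraction of sampled failure patterns in which fewer than t objects are available."""
    misses = 0
    for _ in range(trials):
        failed = {m for m in range(num_machines) if rng.random() < p}
        up = sum(1 for machines in assignment if any(m not in failed for m in machines))
        if up < t:
            misses += 1
    return misses / trials


rng = random.Random(1)
N, M, K, P = 60, 12, 2, 0.1
ptn_u = estimate_unavailability(ptn_assignment(N, M, K), M, P, N, 20000, rng)
rand_u = estimate_unavailability(rand_assignment(N, M, K, rng), M, P, N, 20000, rng)
print(f"t = N: PTN ~ {ptn_u:.4f}, RAND ~ {rand_u:.4f}")
```

Passing a smaller t to `estimate_unavailability` lets the same sketch probe the other end of the spectrum.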

Numerical Examples (from Simulation)

40,000 objects, 4 replicas each, 400 machines, fail prob = 0.2

[Figure: unavailability vs. threshold for PTN, Chord, and RAND (CAN), with the unavailability of a single object marked. Annotation: c times difference if p is small, where c is # obj/machine.]

A Spectrum of Assignments

40,000 objects, 4 replicas each, 400 machines, fail prob = 0.2

[Figure: unavailability vs. threshold for a spectrum of assignments between PTN and RAND (CAN).]

More Formal Arguments
  • Tradeoff is fundamental:
    • Impossible to achieve the best of RAND and PTN
  • Previous results only for access requesting N objects
    • Similar results hold for accesses requesting n (n ≤ N) objects
    • But each machine may not be filled to capacity:
      • For PTN, use as few machines as possible
      • For RAND, use as many machines as possible
  • I have more....talk to me if you are interested

Access Requesting 500 Objects

40,000 objects, 4 replicas each, 400 machines, fail prob = 0.2

[Figure: unavailability vs. threshold for RAND (CAN), Chord, and PTN.]

Outline
  • Motivation and MOAT contributions 
  • System model and case studies of existing systems 
  • Theoretical results 
  • Designs for approximating optimal assignments
  • Designs for mixed accesses
  • Conclusions

Design of Replica Assignment
  • Trivial in a static / centralized environment
  • Challenging in dynamic environment:
    • We may not have global knowledge with many objects and many machines
  • Basic solution: Consistent hashing
    • But some re-design is necessary

Approximating RAND
  • Multi-hash DHT:
    • Hash the object k times
    • As in CAN

[Figure: ring of machines with ids 080, 090, 098, 101, 104, 120; hash1(B) = 84 and hash2(B) = 100 place the two replicas of B, next to the two replicas of A.]

Approximating PTN
  • Chord does not achieve PTN
    • [Figure: Chord ring with machine ids 080, 090, 098, 101, 104, 120 and hash(A) = 95; replicas of A, B, and C occupy overlapping runs of machines.]
  • Group DHT:
    • (Arbitrarily) group machines into groups of size k
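A group-based placement can be sketched as below. The slide allows arbitrary grouping; this sketch assumes, for illustration, that machines adjacent on the ring are grouped together and that a group takes the largest machine id among its members as its id. Machines left over when the count is not a multiple of k stay ungrouped. All names are hypothetical.

```python
from bisect import bisect_left


def form_groups(machine_ids, k):
    """Group ring-adjacent machines into groups of size k, keyed by group id."""
    ring = sorted(machine_ids)
    usable = len(ring) - len(ring) % k  # leftover machines are not grouped
    groups = [ring[i:i + k] for i in range(0, usable, k)]
    return {max(group): group for group in groups}  # group id is an assumption


def group_dht_replicas(key_hash, groups):
    """All k replicas of an object go to the machines of its successor group."""
    ids = sorted(groups)
    return groups[ids[bisect_left(ids, key_hash) % len(ids)]]


groups = form_groups([120, 80, 104, 90, 101, 98], k=2)
print(groups)
print(group_dht_replicas(95, groups))  # hash(A) = 95 from the slide
```

Because every object hashed to a group is stored on exactly the same k machines, the resulting placement has the partition structure of PTN.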

[Figure: the six machines relabeled with group ids 090, 101, 120 (two machines per group); hash(A) = 95 maps A to one group, whose machines hold the same objects.]

Node Join and Leave in Group DHT
  • Maintain r rendezvous points in DHT
    • Diminishing Chord [Karger et al.’04] / ReDir [Karp et al.’04]
  • New node reports to a random rendezvous point
  • If group can be formed, join DHT
  • Two options upon node leave:
    • Dismiss group and delete the group from DHT
    • The group waits to recruit a new node
  • Groups use rendezvous point to decide
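The join path can be illustrated with a toy model. This is only a sketch of the idea on the slide: rendezvous points are plain in-memory objects rather than DHT entries, node departure is not modelled, and the class and function names are hypothetical.

```python
import random


class RendezvousPoint:
    """Collects newly arrived nodes until k of them can form a group."""

    def __init__(self, k):
        self.k = k
        self.waiting = []

    def report(self, node):
        """Register a node; return a full group of k nodes if one can be formed, else None."""
        self.waiting.append(node)
        if len(self.waiting) < self.k:
            return None
        group, self.waiting = self.waiting[: self.k], self.waiting[self.k:]
        return group


def node_join(node, points, rng):
    """A new node reports to a randomly chosen rendezvous point."""
    return rng.choice(points).report(node)


rng = random.Random(0)
points = [RendezvousPoint(k=2) for _ in range(3)]  # r = 3 rendezvous points
groups = [g for n in range(10) if (g := node_join(f"node{n}", points, rng))]
print(groups)  # each formed group would then join the DHT as a unit
```

Nodes still waiting at a rendezvous point have not joined the DHT yet, which keeps every group in the DHT at full size k.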

Complexity Analysis

Outline
  • Motivation and MOAT contributions 
  • System model and case studies of existing systems 
  • Theoretical results 
  • Designs for approximating optimal assignments 
  • Designs for mixed accesses
  • Conclusions

Mixture of Queries
  • Previous design only for single access requesting all N objects
    • PTN if t close to N
    • RAND if t far from N
  • But there are other accesses
    • Requests n (n < N) objects with threshold t
  • How does t change with n ?
    • Infinite possibilities
    • We focus on 4 large categories

Four Application Scenarios

Strict accesses: t ≈ n

Loose accesses: t< n

Loose for both small and large n
  • Goal:
    • Approach RAND for both small and large n
  • Design:
    • Multi-hash DHT

[Figure: ring of machines with ids 080, 090, 098, 101, 104, 120; hash1(B) = 84 and hash2(B) = 100 place the two replicas of B, next to the two replicas of A.]

Loose for small n; Strict for large n
  • Goal:
    • Approach RAND for small n
    • Approach PTN for large n
  • Design:
    • Group DHT

[Figure: machines grouped under group ids 090, 101, 120; objects A, B, and C are replicated within groups.]

Strict for both small and large n
  • Goal:
    • Approach PTN for both small and large n
  • Assume accesses are tree accesses
  • Design:
    • Group DHT with item-balancing [Karger et al.’04]

[Figure: machines grouped under group ids 090, 101, 120; A = 95.]

Strict for small n; Loose for large n
  • Goal:
    • Approaches PTN for n < R
    • Approaches RAND for n >> R
  • Design:
    • Multi-hash DHT
    • But cluster objects into clusters of constant size R
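Clustering before hashing can be sketched as follows. The sketch assumes clusters are formed from consecutive objects in a list and that each cluster label is hashed k times with a salted SHA-1; both choices, the ring size of 128, and all function names are illustrative assumptions.

```python
import hashlib
from bisect import bisect_left


def ring_hash(label, salt, ring_size=128):
    """Hypothetical salted hash of a cluster label onto the ring."""
    digest = hashlib.sha1(f"{salt}:{label}".encode()).hexdigest()
    return int(digest, 16) % ring_size


def successor(h, ring):
    """First machine whose id is >= h, wrapping around the sorted ring."""
    return ring[bisect_left(ring, h) % len(ring)]


def clustered_placement(objects, machine_ids, k, R):
    """Hash each cluster of R consecutive objects k times; members share all machines."""
    ring = sorted(machine_ids)
    placement = {}
    for start in range(0, len(objects), R):
        cluster = objects[start:start + R]
        label = "".join(cluster)
        # Note: two salts may map to the same machine; this sketch does not resolve that.
        machines = [successor(ring_hash(label, i), ring) for i in range(k)]
        for obj in cluster:
            placement[obj] = machines
    return placement


ids = [80, 90, 98, 101, 104, 120]
placement = clustered_placement(["A", "B", "C", "D"], ids, k=2, R=2)
print(placement)
```

Objects within a cluster always fail together, as under PTN, while different clusters are spread over the ring independently, as under RAND.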

[Figure: ring of machines with ids 080, 090, 098, 101, 104, 120; the cluster AB is hashed as a unit, with hash1(AB) = 84 and hash2(AB) = 100, so A and B are stored together on each chosen machine.]

Simulation Results for Strict Accesses

Here an access needs all n objects to be successful

400 machines, fail prob = 0.2, 40,000 objects, 4 replicas / object

[Figure: unavailability vs. number (n) of objects requested by an access.]

Simulation Results for Loose Accesses

Here an access needs only t = n - 150 objects to be successful

400 machines, fail prob = 0.2, 40,000 objects, 4 replicas / object

[Figure: unavailability vs. number (n) of objects requested by an access.]

Current Status
  • Waiting for paper deadlines
  • Finishing implementing MOAT
  • Evaluation on IrisLog trace and file system traces

Related Work
  • Multi-object accesses rarely addressed
    • CFS [Dabek et al.’01] focuses on individual file blocks
    • Chain replication [Renesse et al.’04] considers single data object
    • A long list .....
  • Replica assignment largely ignored
    • Different DHTs (e.g., Chord, Pastry, CAN) use dramatically different replica assignment: Effects not understood / studied
  • Replica placement [Douceur et al.’01, Li et al.’99, Qiu et al.’01, Venkataramani et al.’01, Yu et al.’04] well studied:
    • Typically for machines in different locations in the network
    • Machines are heterogeneous
    • Approaches do not apply to replica assignment

Conclusions
  • Availability becoming key design goal
    • Multi-object access availability dramatically different from single-object availability
  • MOAT Contributions:
    • First to observe the importance of replica assignment
    • Strong theoretical results regarding the best and worst assignments
    • Practical designs to approximate optimal assignments
    • MOAT toolkit implementation

My Other Recent Work
  • Om [NSDI’04]:
    • Consistent and automatic replica regeneration
    • Regenerate from any single replica rather than a majority
  • Signed quorum systems [PODC’04]:
    • Constant quorum size at the cost of small prob of inconsistency
  • Node failure characteristics in WAN [WORLDS’04]:
    • Answer subtle questions regarding real-world failure properties

Erasure Coding
  • Encode the object into k fragments and any m (m < k) out of k fragments can reconstruct the object
  • RAID techniques are special cases
  • Replication is a special case where m = 1
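The availability of a single erasure-coded object follows from the binomial distribution. The sketch below assumes each of the k fragments sits on a distinct machine that fails independently with probability p; the function name is illustrative.

```python
from math import comb


def object_availability(k, m, p):
    """Probability that at least m of k fragments are on live machines."""
    return sum(comb(k, i) * (1 - p) ** i * p ** (k - i) for i in range(m, k + 1))


p = 0.2
print(object_availability(k=4, m=1, p=p))  # replication with 4 replicas
print(object_availability(k=8, m=4, p=p))  # any 4 of 8 fragments suffice
```

Setting m = 1 recovers plain replication, where the object is lost only if all k replicas are down.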

Example Revisited
  • Need four files to compile:
    • Assignment 1: {A, B}, {A, B}, {C, D}, {C, D}
    • or Assignment 2: {A, C}, {A, B}, {C, D}, {B, D}
    • Better: Assignment 1
  • Can we treat A, B, C, D as a single object and use erasure coding, so that all files can be reconstructed from any 4 out of 8 fragments?
  • Erasure coding is hard to apply across a large amount of data
    • Updating any portion of data needs to update k - m + 1 fragments  the size of original data
    • We cannot use erasure coding across 1,000 files

Threshold Semantics and Erasure Coding

In short, they are different, orthogonal concepts

Numerical Examples (from Simulation)

40,000 objects, 4 replicas each, 400 machines, fail prob = 0.2

[Figure: unavailability vs. threshold for Chord, PTN, CRAND (100), CRAND (10), and RAND (CAN). Annotation: c times difference if p is small, where c is # obj/machine.]
