Online Balancing of Range-Partitioned Data with Applications to P2P Systems

Prasanna Ganesan

Mayank Bawa

Hector Garcia-Molina

Stanford University

Motivation
  • Parallel databases use range partitioning
  • Advantages: Inter-query parallelism
    • Data locality → low-cost range queries → high throughput

[Figure: a key range 0-100 partitioned across nodes at boundaries 20, 35, 60, and 80]

The Problem
  • How to achieve load balance?
    • Partition boundaries have to change over time
    • Cost: Data Movement
  • Goal: Guarantee load balance at low cost
    • Assumption: load balance is beneficial!
  • Contributions
    • Online balancing: a self-tuning system
    • Slows down updates by only a small constant factor
Roadmap
  • Model and Definitions
  • Load Balancing Operations
  • The Algorithms
  • Extension to P2P Setting
  • Experimental Results
Model and Definitions (1)
  • Nodes maintain range partition (on a key)
    • Load of a node = # tuples in its partition
    • Load imbalance σ = largest load / smallest load
  • Arbitrary sequence of tuple inserts and deletes
    • Queries not relevant
    • Automatically directed to relevant node
Model and Definitions (2)
  • After each insert/delete:
    • Potentially fix “imbalance” by modifying partitioning
    • Cost = # tuples moved
  • Assume no inserts/deletes during balancing
    • Non-critical simplification
  • Goal: σ < constant always
    • Constant amortized cost per insert/delete
    • Implication: Faster queries, slower updates
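
To make the model concrete, the following is a minimal Python sketch (an in-memory toy with illustrative names, not taken from the paper): nodes own contiguous key ranges, a node's load is its tuple count, and σ is the largest load divided by the smallest.

```python
import bisect

class RangePartition:
    """Toy model: node i owns keys in [boundaries[i], boundaries[i+1])."""
    def __init__(self, boundaries):
        self.boundaries = boundaries      # sorted lower ends of each range
        self.loads = [0] * len(boundaries)

    def node_for(self, key):
        # Tuples (and queries) are routed to the node owning the key.
        return bisect.bisect_right(self.boundaries, key) - 1

    def insert(self, key):
        self.loads[self.node_for(key)] += 1

    def delete(self, key):
        self.loads[self.node_for(key)] -= 1

    def imbalance(self):
        # sigma = largest load / smallest load (infinite if a node is empty)
        lo = min(self.loads)
        return float("inf") if lo == 0 else max(self.loads) / lo

p = RangePartition([0, 20, 35, 60, 80])   # the partition from the figure
for k in (3, 21, 70, 70, 99):
    p.insert(k)
print(p.loads, p.imbalance())             # [1, 1, 0, 2, 1] inf
```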
Load Balancing Operations (1)
  • NbrAdjust: Transfer data between “neighbors”

[Figure: NbrAdjust between neighbors A and B: the boundary moves from 50 to 35, turning [0,50) and [50,100) into [0,35) and [35,100)]
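
A sketch of NbrAdjust on the same toy model (here nodes are dicts holding a set of keys; names are illustrative): moving the shared boundary ships exactly the tuples that change sides, which is the operation's cost.

```python
def nbr_adjust(left, right, new_boundary):
    """Move the boundary between two range-neighbors; return #tuples moved."""
    pool = left["tuples"] | right["tuples"]
    new_left = {t for t in pool if t < new_boundary}
    moved = len(new_left ^ left["tuples"])    # tuples that changed sides
    left["tuples"], right["tuples"] = new_left, pool - new_left
    left["high"] = right["low"] = new_boundary
    return moved

A = {"low": 0, "high": 50, "tuples": set(range(0, 50))}
B = {"low": 50, "high": 100, "tuples": set(range(50, 60))}
print(nbr_adjust(A, B, 35))    # 15: keys 35..49 move from A to B
```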

Is NbrAdjust good enough?
  • Can be highly inefficient
    • Ω(n) amortized cost per insert/delete (n = # nodes)
    • Intuition: skewed inserts can force data to cascade through every node

[Figure: a chain of nodes A-F; under NbrAdjust alone, data cascades from node to node]

Load Balancing Operations (2)
  • Reorder: Hand over your data to a neighbor, then split the load of some other node (see the sketch after the figure)

[Figure: Reorder across nodes A-F: one node hands its range [5,10) to its neighbor (which grows to [0,10)) and re-inserts itself to split another node's range [40,60) into [40,50) and [50,60)]
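
A rough sketch of Reorder on the same toy nodes. The choice of absorbing neighbor, the split point (the median key), and the index bookkeeping are all simplifications of the paper's operation:

```python
import statistics

def reorder(nodes, i, j):
    """Node i hands its data to a range-neighbor, then re-inserts itself
    to take over the upper half of node j's range. Returns #tuples moved."""
    mover = nodes[i]
    nbr = nodes[i + 1] if i + 1 < len(nodes) else nodes[i - 1]
    cost = len(mover["tuples"])               # handed to the neighbor
    nbr["tuples"] |= mover["tuples"]
    nbr["low"] = min(nbr["low"], mover["low"])
    nbr["high"] = max(nbr["high"], mover["high"])
    nodes.pop(i)
    heavy = nodes[j - 1 if j > i else j]      # index shifts after the pop
    split = statistics.median(heavy["tuples"])
    upper = {t for t in heavy["tuples"] if t >= split}
    heavy["tuples"] -= upper                  # heavy keeps the lower half
    nodes.insert(nodes.index(heavy) + 1,
                 {"low": split, "high": heavy["high"], "tuples": upper})
    heavy["high"] = split
    return cost + len(upper)
```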

Roadmap
  • Model and Definitions
  • Load Balancing Operations
  • The Algorithms
  • Experimental Results
  • Extension to P2P Setting
The Doubling Algorithm
  • Geometrically divide loads into levels
    • Level i ⇔ load in (2^i, 2^(i+1)]
    • Will try balancing on level change
  • Two Invariants
    • Neighbors tightly balanced
      • Max 1 level apart
    • All nodes within 3 levels
      • Guarantees σ ≤ 8

[Figure: load scale; level i spans loads (2^i, 2^(i+1)], e.g., level 0 = (1,2], level 1 = (2,4], level 2 = (4,8]]
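
A small sketch of the level machinery and the two invariants. The σ ≤ 8 bound follows because loads confined to at most three consecutive levels differ by less than a factor of 2³ = 8 (the constant is the paper's; the code below is illustrative):

```python
def level(load):
    """Level i covers loads in (2^i, 2^(i+1)]; e.g. level(3) == level(4) == 1."""
    return max(0, (load - 1).bit_length() - 1)

def invariants_hold(loads):
    """The doubling algorithm's two invariants, over per-node loads in
    range order: neighbors within 1 level, all nodes within 3 levels."""
    levels = [level(x) for x in loads]
    neighbors_ok = all(abs(a - b) <= 1 for a, b in zip(levels, levels[1:]))
    spread_ok = max(levels) - min(levels) <= 2   # spans at most 3 levels
    return neighbors_ok and spread_ok            # together give sigma <= 8

print(level(2), level(3), level(8), level(9))    # 0 1 2 3
print(invariants_hold([30, 40, 60, 100, 35]))    # True
```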

The Doubling Algorithm: Case 2
  • Search for a blue (lightly loaded) node
    • If none exists, do nothing!

[Figure: nodes A-F; the overloaded node searches for a lightly loaded node]

[Figure: after the reorder, node E has handed its data to a neighbor and re-inserted itself (order A, B, E, C, D, F) to split the overloaded node's range]

The Doubling Algorithm (3)
  • Similar operations when load goes down a level
    • Try balancing with a neighbor
    • Otherwise, find a red (heavily loaded) node and reorder yourself to split it
  • Costs and Guarantees
    • σ ≤ 8
    • Constant amortized cost per insert/delete
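
Putting the cases together, a hedged sketch of the decision logic after one update. The threshold constants here are illustrative, not the paper's exact trigger conditions:

```python
def level(load):                      # as in the earlier sketch
    return max(0, (load - 1).bit_length() - 1)

def plan_rebalance(loads, i, went_up):
    """Decide how node i reacts after its load changes a level. Returns
    ('nbr_adjust', j), ('reorder', j), or None (do nothing)."""
    my = level(loads[i])
    # Case 1: a neighbor is now >= 2 levels away: fix the shared boundary.
    for j in (i - 1, i + 1):
        if 0 <= j < len(loads) and abs(level(loads[j]) - my) >= 2:
            return ("nbr_adjust", j)
    # Case 2: look for a globally light (blue) or heavy (red) node to Reorder.
    pick = min if went_up else max
    j = pick(range(len(loads)), key=loads.__getitem__)
    if abs(level(loads[j]) - my) >= 3:
        return ("reorder", j)
    return None    # no suitable node: do nothing
```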
From Doubling to Fibbing
  • Change thresholds to Fibonacci numbers
    • σ ≤ 3  4.2
    • Can also use other geometric sequences
    • Costs are still constant

F(i+2) = F(i+1) + F(i)
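
The only change is the level function. A sketch assuming level i covers loads in (F(i), F(i+1)] (the exact Fibonacci indexing is illustrative):

```python
from itertools import count

def fib_level(load):
    """Adjacent thresholds now grow by a factor approaching phi ~ 1.618
    rather than 2, so the same invariants bound sigma near phi^3 ~ 4.2."""
    a, b = 1, 2                  # consecutive Fibonacci numbers
    for i in count():
        if load <= b:
            return i
        a, b = b, a + b

print([fib_level(x) for x in (1, 2, 3, 5, 8, 13)])   # [0, 0, 1, 2, 3, 4]
```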

More Generalizations
  • Improve σ to (1+ε) for any ε > 0 [BG04]
    • Generalize neighbors to c-neighbors
    • Still constant cost: O(1/ε)
  • Dealing with concurrent inserts/deletes
    • Allow multiple balancing actions in parallel
    • The paper argues this is safe
Application to P2P Systems
  • Goal: Construct P2P system supporting efficient range queries
    • Provide asymptotic performance à la DHTs
  • What is a P2P system? A parallel DB with
    • Nodes joining and leaving at will
    • No centralized components
    • Limited communication primitives
  • Enhance load-balancing algorithms to
    • Allow dynamic node joins/leaves
    • Decentralize implementation
Experiments
  • Goal: Study cost of balancing for different workloads
    • Compare to periodic re-balancing algorithms (see paper)
    • Trade-off between cost and imbalance ratio (see paper)
  • Results shown are for the Fibbing algorithm (n = 256)
  • Three-phase workload
    • (1) Inserts (2) Alternating inserts and deletes (3) Deletes
  • Workload 1: Zipf
    • Random draws from a Zipf-like distribution
  • Workload 2: HotSpot
    • Think key = timestamp: inserts concentrate at one end of the domain
  • Workload 3: ShearStress
    • Insert at the most-loaded node, delete from the least-loaded
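
Illustrative generators for the three workloads (shapes and parameters are guesses for the sketch, not the paper's exact settings):

```python
import random

def zipf_keys(n_keys=10_000, skew=1.0):
    """Workload 1: random draws from a Zipf-like distribution."""
    weights = [1 / r ** skew for r in range(1, n_keys + 1)]
    while True:
        yield random.choices(range(n_keys), weights=weights)[0]

def hotspot_keys():
    """Workload 2: key = timestamp, so inserts always land at one end."""
    t = 0
    while True:
        yield t
        t += 1

def shearstress_ops(loads):
    """Workload 3: insert at the most-loaded node, delete from the least."""
    idx = range(len(loads))
    while True:
        yield ("insert", max(idx, key=loads.__getitem__))
        yield ("delete", min(idx, key=loads.__getitem__))
```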
Load Imbalance (Zipf)

[Figure: load imbalance (y-axis, 0 to 4.5) vs. time in thousands of operations (x-axis, 0 to 3000), spanning the growing, steady, and shrinking phases of the Zipf workload]

Related Work
  • Karger & Ruhl [SPAA 04]
    • Dynamic model, weaker guarantees
  • Load balancing in DBs
    • Partitioning static relations, e.g., [GD92,RZML02, SMR00]
    • Migrating fragments across disks, e.g., [SWZ93]
    • Intra-node data structures, e.g., [LKOTM00]
  • Scalable distributed data structures, e.g., Litwin et al.'s SDDS
Conclusions
  • Indeed possible to maintain well-balanced range partitions
    • Range partitions competitive with hashing
  • Generalize to more complex load functions
    • Allow tuples to have dynamic weights
    • Change load definition in algorithms!*
  • Range partitioning is powerful
    • Enables a P2P system supporting range queries
    • Generalizes DHTs with the same asymptotic guarantees

*Lots of caveats apply. Load needs to be evenly divisible. No guarantees offered on costs. This offer not valid with any other offers. Etc., etc.