
Handling Stress

Indranil Gupta

April 25, 2006

CS598IG Spring 2006

Traditional Fault-tolerance in Distributed Systems
  • Node failures, massive failures
  • Intermittent message losses, network outages, network partitions

Stress in Distributed Systems
  • Node failures, massive failures
  • Intermittent message losses, network outages, network partitions
  • Perturbation, churn
  • Static objects, dynamic objects

Stress in Distributed Systems
  • Node failures, massive failures
  • Intermittent message losses, network outages, network partitions
  • Perturbation, churn (Today’s Focus)
  • Static objects, dynamic objects

Papers

We’ll concentrate on one class of distributed systems: peer to peer systems

We’ll study:

  • Characteristics of Churn: How does node availability vary in peer to peer systems?
  • Effect of Churn: How do p2p DHTs behave under churn?
  • Churn-Resistance: Are there designs that are churn-resistant?

Paper 1: Understanding Availability

R. Bhagwan, S. Savage, G. Voelker

University of California, San Diego

Goals
  • Measurement study of peer-to-peer (P2P) file sharing application
    • Overnet (January 2003)
  • Analyze the collected data to characterize availability
    • Host IP address changes
    • Diurnal patterns
    • Interdependence among nodes
Overnet
  • Based on Kademlia, a DHT
  • Each node uses a random self-generated ID
    • The ID remains constant (unlike IP address)
    • Used to collect availability traces
  • Routing works in a similar manner to Gnutella
  • Widely deployed (eDonkey)
  • Overnet protocol and application are closed-source, but have already been reverse engineered.
Experiment Methodology
  • Crawler:
    • Takes a snapshot of all the active hosts by repeatedly requesting 50 randomly generated IDs.
    • The requests lead to discovery of some hosts (through routing requests), which are sent the same 50 IDs, and the process is repeated.
    • Run once every 4 hours to minimize impact
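
A minimal sketch of this crawl loop, assuming a hypothetical send_search(host, target_id) helper that issues an Overnet search for an ID and returns the hosts learned from routing replies:

```python
import random

def random_id():
    # Hypothetical helper: a random 128-bit Kademlia/Overnet-style ID.
    return random.getrandbits(128)

def crawl_snapshot(bootstrap_hosts, send_search, num_ids=50):
    """Discover active hosts by asking every newly found host about the
    same 50 randomly generated IDs, as the slide describes (a sketch,
    not the study's actual crawler)."""
    ids = [random_id() for _ in range(num_ids)]
    seen = set(bootstrap_hosts)
    frontier = list(bootstrap_hosts)
    while frontier:
        host = frontier.pop()
        for target_id in ids:
            # send_search is assumed to return the peers this host
            # reveals while routing a search for target_id.
            for peer in send_search(host, target_id):
                if peer not in seen:
                    seen.add(peer)
                    frontier.append(peer)
    return seen
```

Each call returns one snapshot of the active host population; the study repeated this every 4 hours.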
Experiment Methodology
  • Prober:
    • Probe the list of available IDs to check for availability
      • By sending a request to ID I; request succeeds only if I replies
      • Does not use TCP, avoids problems with NAT and DHCP
    • Used on only 2,400 randomly selected hosts from the initial list
    • Run every 20 minutes
  • All Crawler and Prober trace data from this study is available for your project (ask Indy if you want access)
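
A minimal sketch of the prober loop and the availability estimate it yields, assuming a hypothetical probe(host_id) helper that returns True only if that ID itself replies:

```python
import time
from collections import defaultdict

def run_prober(host_ids, probe, rounds, interval_s=20 * 60):
    """Every 20 minutes, probe each selected ID and record whether it
    replied (a sketch of the methodology, not the study's prober)."""
    log = defaultdict(list)              # host_id -> list of (time, up?)
    for _ in range(rounds):
        now = time.time()
        for host_id in host_ids:
            log[host_id].append((now, probe(host_id)))
        time.sleep(interval_s)
    return log

def availability(log):
    # Fraction of probes each host answered over the whole trace.
    return {h: sum(up for _, up in obs) / len(obs) for h, obs in log.items()}
```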
Experiment Summary
  • Ran for 15 days, from January 14 to January 28, 2003 (with problems on January 21)
  • Each pass of the crawler yielded 40,000 hosts.
  • A single day (6 crawls) yielded between 70,000 and 90,000 unique hosts.
  • 1,468 of the 2,400 randomly selected hosts responded to probes at least once
Host Availability

[Figure: host availability as a function of the measurement time interval. As the time interval increases, availability decreases.]

Diurnal Patterns
  • Normalized to “local time” at each peer, not EST
  • N changes by only 100/day
  • 6.4 joins/host/day
  • 32 hosts/day lost
Are Node Failures Interdependent?

[Figure: 30% of host pairs show 0 difference, and 80% fall within ±0.2. The two quantities should be the same if X and Y are independent.]
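
One way to read this check, sketched below (not necessarily the paper's exact statistic): for each pair of probed hosts, compare the fraction of probe rounds in which both were up against the product of their individual availabilities; independence predicts a difference of zero.

```python
from itertools import combinations

def independence_gaps(log):
    """Gap between the joint up-probability and the product of individual
    availabilities for every host pair. `log` maps host -> list of
    (time, up?) with aligned probe rounds; a zero gap for all pairs is
    what full independence would predict."""
    ups = {h: [bool(u) for _, u in obs] for h, obs in log.items()}
    gaps = {}
    for x, y in combinations(ups, 2):
        n = min(len(ups[x]), len(ups[y]))
        p_x = sum(ups[x][:n]) / n
        p_y = sum(ups[y][:n]) / n
        p_xy = sum(a and b for a, b in zip(ups[x][:n], ups[y][:n])) / n
        gaps[(x, y)] = p_xy - p_x * p_y
    return gaps
```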

Arrival and Departure
  • 20% of nodes each day are new
  • The number of nodes stays at about 85,000
Conclusions and Discussion
  • Each host uses an average of 4 different IP addresses within just 15 days
    • Keeping track of assumptions is important for trace collection studies
  • Availability data looks more optimistic once we account for host IP aliasing
    • But churn is still high
    • How does one design churn-resistant systems?
  • Strong diurnal patterns
    • Design DHTs that are adaptive to time-of-day?
  • No strong correlation among failure probabilities, so the use of redundancy is OK in p2p systems
  • High churn rates
    • How do they affect the internals of structured DHTs?

Paper 2: Comparing the Performance of DHTs under Churn

J. Li, J. Stribling, T.M. Gil, R. Morris, M.F. Kaashoek

MIT

Comparing different DHTs
  • Metrics to measure
    • Cost = number of bytes of messages sent
    • Performance = latency for a query
  • p2psim
    • 1024 nodes (inter-node latencies obtained from DNS servers, avg. 152 ms)
    • lookups issued for random keys at exponentially distributed intervals (avg. 10 min)
    • nodes crash and rejoin at exponentially distributed intervals (avg. 1 hour)
    • experiments run for 6 hours.
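
A rough sketch of how such a workload could be generated; this is an illustration of the exponentially distributed lookup and crash/rejoin intervals above, not p2psim code:

```python
import random

def exp_interval(mean_s):
    # Exponentially distributed inter-event time with the given mean.
    return random.expovariate(1.0 / mean_s)

def generate_events(num_nodes=1024, duration_s=6 * 3600,
                    lookup_mean_s=10 * 60, churn_mean_s=3600):
    """Timeline of (time, node, kind, key) events: per-node lookups for
    random keys and alternating crash/rejoin events, both at
    exponentially distributed intervals (a sketch, not p2psim itself)."""
    events = []
    for node in range(num_nodes):
        t = exp_interval(lookup_mean_s)
        while t < duration_s:
            # Key width (160 bits) is an assumption for illustration.
            events.append((t, node, "lookup", random.getrandbits(160)))
            t += exp_interval(lookup_mean_s)
        t, alive = exp_interval(churn_mean_s), True
        while t < duration_s:
            events.append((t, node, "crash" if alive else "rejoin", None))
            alive = not alive
            t += exp_interval(churn_mean_s)
    events.sort(key=lambda e: e[0])
    return events
```

The sorted event list can then drive any of the DHT implementations being compared.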
“Convex Hull”

  • Upper bound on performance
  • Does this hide the “real” performance?
DHTs considered
  • Tapestry, Pastry, Chord, Kademlia: normal implementations
  • Kelips: slightly different
[Figure: performance vs. cost comparison of Kademlia, Tapestry, Kelips, and Chord.]

Kelips, the strawman
  • Tapestry, Pastry, Chord, Kademlia: normal implementations
  • Kelips: slightly different
    • node IDs are treated as filetuples: routing within each affinity group is through a random walk
    • Not what Kelips was originally intended for; it adds an extra layer of files being inserted and deleted all the time!
      • The original Kelips (if studied) would use bandwidth higher by a constant factor but give much shorter lookup latencies (due to replication)!
[Figure: the same comparison of Kademlia, Tapestry, Kelips, and Chord, annotated with the expected behavior of the original Kelips.]

Tapestry

  • Parameters: base and stabilization interval (stab); a reasonable stab value is 72 s
  • Higher base => shorter paths, but about the same lookup latency; more routing entries => more bandwidth
  • Base low, stab low

Chord

  • Fixed: 72 s stabilization interval for successors/predecessors
  • Varied: stabilization interval for routing entries, and base
  • The base value makes no difference; bases 2 and 8 are enough

Conclusions and Discussion
  • The upper bound of performance for all the DHTs considered is similar
    • Is this enough?
    • Why not average performance curves?
  • Parameter tuning is essential to performance
    • Design DHTs that tune parameters adaptively?
  • Comparing different systems is a tricky task!

Paper 3: A Churn-Resistant Cooperative Web Caching Application

P. Linga, I. Gupta, K.P. Birman

Cornell University

Today’s Web Caching

[Figure: clients (web app + browser cache) reach the web server across the Internet through proxy servers, which hold the shared caches.]
Tomorrow’s Web Caching

[Figure: clients (web app + browser cache) cooperate directly across the Internet with the web server; no proxy servers.]
Cooperative Web caching
  • Hierarchical
    • Harvest, Squid
  • Distributed
    • Cachemesh
  • Peer-to-Peer (cooperative): no proxies
    • BuddyWeb
    • Squirrel: re-insert each web object into underlying Pastry DHT
      • Does not have locality
      • Churn resistance?
Peer-to-Peer caching – Challenges
  • Handling churn (some nodes join/leave the system rapidly)
    • Churn arises because of
      • Workstation load (perturbation)
      • Deletion of web objects (cleanup of/size limit on browser cache)
      • Users logging out
      • Even more likely in general/Grid computing scenarios
  • Locality
    • Goal is to service requests from close nodes
  • Load balancing
    • Load on clients from servicing requests should be uniform
  • Performance
    • Should be comparable to the centralized cache case
A churn-resistant solution
  • Kelips as the underlying index into caches
    • Gossip (epidemic)-based and hence handles churn well
  • Other Advantages (compared to Squirrel)
    • does not re-insert web objects into DHTs
    • pushes application down into DHT layer
    • Request handled by a close node
      • Flexible: Can choose close contacts
      • Low latency
      • Better load balancing
Kelips

Take a collection of “nodes”.

[Figure: nodes 110, 230, 202, and 30.]

Kelips

Map nodes to affinity groups.

[Figure: affinity groups numbered 0, 1, ..., √N - 1; peer membership through consistent hash; roughly √N members per affinity group; nodes 110, 230, 202, and 30 shown in their groups.]
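
A minimal sketch of the idea in this figure, assuming k ≈ √N affinity groups and hashing an ID into one of them; the function name and the use of SHA-1 are illustrative, not the deployed Kelips code:

```python
import hashlib
import math

def affinity_group(node_id: str, n_estimate: int) -> int:
    """Map a node (or object) ID to one of k ~ sqrt(N) affinity groups
    by hashing the ID (a sketch of the slide's consistent-hash mapping)."""
    k = max(1, round(math.sqrt(n_estimate)))
    digest = hashlib.sha1(node_id.encode()).digest()
    return int.from_bytes(digest, "big") % k

# Example: with ~9 nodes there are ~3 groups, numbered 0..2.
for node in ["110", "230", "202", "30"]:
    print(node, "->", affinity_group(node, 9))
```

Object names such as "cnn.com" would be mapped to a group the same way.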

Kelips

110 knows about other members of its affinity group: 230, 30, …

[Figure: 110’s affinity group view holds pointers to 230 and 30.]

Kelips

202 is a “contact” for 110 in group 2.

[Figure: besides its affinity group view, 110 keeps contact pointers to a few members of each other group, e.g., 202 in group 2.]

Kelips

“cnn.com” maps to group 0, so node 91 tells group 0 to “route” inquiries about cnn.com to it.

[Figure: members of affinity group 0 store a resource tuple for cnn.com pointing to node 91.]

A gossip protocol replicates this data cheaply.

Updating and refreshing soft state
  • Gossip protocol
  • Each peer periodically
    • Selects a few peers as gossip targets (from the same affinity group & contacts)
    • Sends them partial soft-state information; the gossip message size stays constant
  • Gossip target selection
    • Topologically aware: use round-trip times
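
A minimal sketch of one such gossip round; the data shapes (views and contacts as lists of node IDs, an RTT map), the fanout, and the sample size are assumptions for illustration:

```python
import random

def pick_gossip_targets(view, contacts, rtt_ms, fanout=3):
    """Choose gossip targets from the affinity group view and the contact
    list, biased toward low round-trip time (topologically aware), plus
    one random pick to keep information mixing (an added assumption)."""
    candidates = sorted(set(view) | set(contacts),
                        key=lambda n: rtt_ms.get(n, float("inf")))
    targets = candidates[:max(0, fanout - 1)]
    if len(candidates) > len(targets):
        targets.append(random.choice(candidates[len(targets):]))
    return targets

def gossip_payload(soft_state, max_items=8):
    # Constant-size message: a bounded random sample of soft-state entries.
    items = list(soft_state.items())
    return dict(random.sample(items, min(max_items, len(items))))
```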
Modifications to Kelips

Required since the application is pushed down into the DHT layer:

  • Modified soft state
    • We don't want multiple filetuples for the same object spreading throughout the object's affinity group
  • New lookup strategy
Soft state

[Figure: a node's soft state: its affinity group view, contacts, and resource tuples, including a directory table for cnn.com.]

Web Object lookup

[Figure: requesting node 102 sends a lookup request for cnn.com to node 110 in the object's affinity group; the request is forwarded to node 160, which sends the object back to 102.]
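
A sketch of the lookup flow in the figure; every name here (group_of, contacts, directories, fetch, the "origin-server" fallback) is an assumption for illustration, not the paper's implementation:

```python
def lookup(url, me, group_of, contacts, directories, fetch):
    """Send the request to a close contact in the URL's affinity group,
    let it consult its directory of nodes caching the object, and fall
    back to the origin web server on a miss."""
    group = group_of(url)
    contact = contacts[me].get(group)          # a close contact in that group
    if contact is not None:
        for holder in directories[contact].get(url, []):
            obj = fetch(holder, url)           # try a cached copy first
            if obj is not None:
                return obj
    return fetch("origin-server", url)         # miss: go to the web server
```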
Soft State Maintenance
  • Directory size is limited (3-4 entries)
  • Localized information dissemination (for individual directory entries)
    • Uses a hops-to-live (htl) field
    • When a client fetches object cnn.com from the server:
      • the resource tuple is given to a (close) contact
      • the contact spreads it to other affinity group nodes via topologically-aware gossip
      • nodes replace their farthest directory entry for cnn.com if the new entry is closer; the new entry is included anyway if the directory is not full
      • the resource tuple carries an htl, decremented whenever the entry is included in a full directory
  • Global behavior:
    • small number of replicas => each resource tuple spreads far and wide
    • large number of replicas => each resource tuple spreads to close-by nodes only; all directory entries point to close-by replicas
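
A minimal sketch of those directory rules; the 4-entry limit and the htl behavior come from the slide, while the function name and entry fields are assumptions:

```python
MAX_ENTRIES = 4   # directory size limit from the slide (3-4 entries)

def consider_tuple(directory, url, replica, distance, htl):
    """Maybe add a (replica, distance, htl) entry for `url` to this
    node's bounded directory: always add if there is room; otherwise
    replace the farthest entry only if the new one is closer, spending
    one hop-to-live for the insertion into a full directory."""
    entries = directory.setdefault(url, [])
    if len(entries) < MAX_ENTRIES:
        entries.append({"replica": replica, "dist": distance, "htl": htl})
        return True
    farthest = max(entries, key=lambda e: e["dist"])
    if distance < farthest["dist"] and htl > 0:
        entries.remove(farthest)
        entries.append({"replica": replica, "dist": distance, "htl": htl - 1})
        return True
    return False
```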
Experiments
  • Simulator written in C
    • Real Kelips implementation: cluster-based results
  • 1000 nodes simulated
  • Topology: GT-ITM transit stub n/w model
    • Kelips nodes are mapped at random to 600 nodes in the transit stub n/w
  • Workload: UCB Home IP traces
  • Churn: Overnet churn traces
    • Only 500 nodes are subjected to churn
Workload traits

[Figure: characteristics of the UCB Home IP workload and the performance of the central cache on it.]

External bandwidth

External b/w and hit ratio comparable to centralized cache

Hit Ratio

Low background b/w enough for good hit ratio

Locality

Low latency because of good locality/close contacts

Load balancing

Requests are uniformly distributed. Hence good load balancing.

Churn – Affinity group view size
  • Overnet traces (hourly) injected into the system at churn-epoch intervals

[Figure: affinity group view size for churn epochs of 200 s and 40 s.]

At higher churn rates, the quality of views deteriorates.

Churn – Hit ratio

Hit ratio goes down gracefully with increasing churn rates

Churn-Resistance

When churn occurs…

  • Membership lists may not stay up to date, but…
  • The application-defined performance metric (hit rate) does not suffer!
  • Due to redundancy and randomization
  • Kelips tends to adapt its contact lists (automatically, through the gossip-based membership protocol already running within)
  • Thus, over time, it favors nodes that are not churning as neighbors
  • Over time, queries are routed only among “good” nodes. And, this adaptivity occurs as emergent behavior.
Conclusions and Discussion
  • Cooperative caching works!
    • Peer to peer systems can be used for web caching.
  • Kelips handles churn well
    • Churn-resistance is emergent behavior
    • Through randomization and redundancy
  • Performance is comparable to centralized web cache
    • Do application-based reliability/performance metrics make more sense than substrate-based ones?
  • Low overhead on participating nodes
  • Locality-aware lookups
  • Added advantages: scalable, self-organizing, fault-tolerant, load-balancing
Lecture Summary and Discussion

For peer to peer systems:

  • Characteristics of Churn: Node availability varies across nodes and over time. Failures are independent.
  • Effect of Churn: Upper bound reasonable, but real performance?
  • Churn-Resistance: Use redundancy and randomization to fight churn. Churn-resistant web caching works.

Research Directions

  • Adaptive DHTs : adaptivity to changing network conditions, changing churn rates, changing time-of-day…
  • Stress-Resistant Protocols: more challenging than mere fault-tolerance
Bigger Direction: Stress-Resistance in Distributed Systems

  • Node failures, massive failures
  • Intermittent message losses, network outages, network partitions
  • Perturbation, churn
  • Varying request rates, flash crowds
  • Static objects, dynamic objects
