Trust-Sensitive Scheduling on the Open Grid

Trust-Sensitive Scheduling on the Open Grid. Jon B. Weissman, with help from Jason Sonnek and Abhishek Chandra. Department of Computer Science, University of Minnesota. Trends in HPDC Workshop, Amsterdam, 2006.

Presentation Transcript


Trust-Sensitive Scheduling on the Open Grid

Jon B. Weissman, with help from Jason Sonnek and Abhishek Chandra

Department of Computer Science

University of Minnesota

Trends in HPDC Workshop

Amsterdam 2006


Background

  • Public donation-based infrastructures are attractive

    • positives: cheap, scalable, fault tolerant (UW-Condor, SETI@home, ...)

    • negatives: “hostile” - uncertain resource availability/connectivity, node behavior, end-user demand => best effort service


Background

  • Such infrastructures have been used for throughput-based applications

    • just make progress, all tasks equal

  • Service applications are more challenging

    • all tasks not equal

    • explicit boundaries between user requests

    • may even have SLAs, QoS, etc.


Service Model

  • Distributed Service

    • request -> set of independent tasks

    • each task mapped to a donated node

    • makespan

    • E.g. BLAST service

      • user request (input sequence) + chunk of DB form a task


BOINC + BLAST

workunit = input_sequence + chunk of DB

generated when a request arrives
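
The service model above can be sketched as follows. This is only an illustration: `Workunit`, `make_workunits`, and the chunk names are hypothetical and are not BOINC's actual API. The makespan is taken here to be the completion time of the slowest task of a request, which is the usual definition rather than something the slides spell out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workunit:
    """One task: the user's input sequence paired with one DB chunk."""
    input_sequence: str
    db_chunk: str


def make_workunits(input_sequence, db_chunks):
    """Hypothetical helper: one independent workunit per DB chunk."""
    return [Workunit(input_sequence, chunk) for chunk in db_chunks]


def makespan(completion_times):
    """A request finishes when its slowest task finishes."""
    return max(completion_times)


units = make_workunits("ACGTTGCA", ["chunk0", "chunk1", "chunk2"])
request_time = makespan([4.0, 9.5, 6.2])  # illustrative per-task times
```

Because the tasks are independent, each workunit can be mapped to a different donated node, and the request's latency is governed by the slowest of them.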


The Challenge

  • Nodes are unreliable

    • timeliness: heterogeneity, bottlenecks, …

    • cheating: hacked, malicious (> 1% of SETI nodes), misconfigured

    • failure

    • churn

  • For a service, this matters


Some Data: Timeliness

Computation Heterogeneity

- both across and within nodes

PlanetLab – lower bound

Communication Heterogeneity

- both across and within nodes


The Problem for Today

  • Deal with node misbehavior

  • Result verification

    • application-specific verifiers – not general

    • redundancy + voting

  • Most approaches assume ad-hoc replication

    • under-replicate: task re-execution (higher latency)

    • over-replicate: wasted resources (lower throughput)

  • Using information about the past behavior of a node, we can intelligently size the amount of redundancy
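
As a rough illustration of "redundancy + voting", the sketch below accepts a result only when a strict majority of the replicas agree. The function name and the strict-majority rule are assumptions for the example; the slides do not say which voting rule is used.

```python
from collections import Counter


def majority_vote(results):
    """Return the result reported by a strict majority, else None.

    None signals that no answer can be accepted, so the task
    would have to be re-executed (the under-replication case).
    """
    if not results:
        return None
    value, count = Counter(results).most_common(1)[0]
    return value if count > len(results) / 2 else None
```

With three replicas, two matching answers are enough; with no agreement the task is retried, which is where the extra latency of under-replication comes from.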


System Model


Problems with ad-hoc replication

Unreliable node

Task x sent to group A

Reliable node

Task y sent to group B


Smart Replication

  • Reputation

    • ratings based on past interactions with clients

    • simple sample-based probability (ri) over window t

    • extend to worker group (assuming no collusion) => likelihood of correctness (LOC)

  • Smarter Redundancy

    • variable-sized worker groups

    • intuition: higher reliability clients => smaller groups
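
One way to read the "simple sample-based probability over window t" is the fraction of correct results among a node's last t verified tasks. The sketch below assumes exactly that; the class name `NodeRating` and the default prior of 0.5 for a node with no history are hypothetical choices, not taken from the slides.

```python
from collections import deque


class NodeRating:
    """Sample-based reliability estimate r_i over the last `window` results."""

    def __init__(self, window=None, prior=0.5):
        # window=None keeps the full history (an "infinite" window).
        self.history = deque(maxlen=window)
        self.prior = prior  # assumed rating for a node with no history

    def record(self, correct):
        """Store whether the node's latest result was judged correct."""
        self.history.append(bool(correct))

    def rating(self):
        """Fraction of correct results in the window, or the prior if empty."""
        if not self.history:
            return self.prior
        return sum(self.history) / len(self.history)
```

A bounded `deque` drops the oldest outcome automatically, so the rating always reflects only the most recent t interactions.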


Terms

  • LOC (Likelihood of Correctness), lg

    • computes the ‘actual’ probability of getting a correct answer from a group of clients (group g)

  • Target LOC (ltarget)

    • the task success-rate that the system tries to ensure while forming client groups

    • related to the statistics of the underlying distribution
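
The slides do not give a formula for the LOC, so the following is a sketch under two stated assumptions: workers answer independently (no collusion), and a result is accepted when a strict majority of the group is correct. Under those assumptions the LOC of a group can be computed from the individual ratings with a small dynamic program.

```python
def likelihood_of_correctness(ratings):
    """Probability that a strict majority of the group answers correctly.

    Assumes independent workers and majority voting; both are
    assumptions of this sketch, not statements from the slides.
    """
    # dist[k] = probability that exactly k of the workers seen so far are correct
    dist = [1.0]
    for r in ratings:
        nxt = [0.0] * (len(dist) + 1)
        for k, p in enumerate(dist):
            nxt[k] += p * (1.0 - r)   # this worker is wrong
            nxt[k + 1] += p * r       # this worker is correct
        dist = nxt
    needed = len(ratings) // 2 + 1    # strict majority
    return sum(dist[needed:])
```

For example, three workers rated 0.9 each give 3 * 0.9^2 * 0.1 + 0.9^3 = 0.972.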


Trust Sensitive Scheduling

  • Guiding metrics

    • throughput r: the number of successfully completed tasks in an interval

    • success rate s: ratio of throughput to number of tasks attempted
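
The two guiding metrics can be computed per interval as below. The function name and the choice to report a success rate of 0.0 when nothing was attempted are assumptions made for the example.

```python
def guiding_metrics(attempted, succeeded):
    """Return (throughput r, success rate s) for one interval."""
    if succeeded > attempted:
        raise ValueError("succeeded cannot exceed attempted")
    r = succeeded                                      # tasks completed correctly
    s = succeeded / attempted if attempted else 0.0    # ratio of r to attempts
    return r, s
```

For instance, 30 successes out of 40 attempted tasks give r = 30 and s = 0.75.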


Scheduling Algorithms

  • First-Fit

    • attempt to form the first group that satisfies ltarget

  • Best-Fit

    • attempt to form a group that best satisfies ltarget

  • Random-Fit

    • attempt to form a random group that satisfies ltarget

  • Fixed-size

    • randomly form fixed-size groups, ignoring client ratings

  • Random and Fixed are our baselines

  • Min group size = 3
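
The slides do not detail how each heuristic searches for a group, so the following is one plausible reading of First-Fit, not the authors' implementation: take nodes in the order they are available and stop at the first group of at least three whose LOC reaches ltarget. The `loc` helper assumes independent workers and strict-majority voting, and all function names are hypothetical.

```python
def loc(ratings):
    """Probability that a strict majority is correct (independence assumed)."""
    dist = [1.0]
    for r in ratings:
        nxt = [0.0] * (len(dist) + 1)
        for k, p in enumerate(dist):
            nxt[k] += p * (1.0 - r)
            nxt[k + 1] += p * r
        dist = nxt
    return sum(dist[len(ratings) // 2 + 1:])


def first_fit(available, l_target, min_size=3):
    """Grow a group in list order until its LOC reaches l_target.

    `available` is a list of (node_id, rating) pairs. Returns the chosen
    node ids, or None if the whole pool cannot reach the target.
    """
    group = []
    for node_id, rating in available:
        group.append((node_id, rating))
        if len(group) >= min_size and loc([r for _, r in group]) >= l_target:
            return [n for n, _ in group]
    return None
```

Under this reading, highly rated nodes reach the target with the minimum group of three, while poorly rated nodes need a larger group, which matches the intuition that higher reliability allows smaller groups.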


Different Groupings

ltarget = 0.5


Evaluation

  • Simulated a wide variety of node reliability distributions

  • Set ltarget to be the success rate of Fixed

    • goal: match the success rate of Fixed (which over-replicates) yet achieve higher throughput

    • if desired, can drive throughput even higher (but success rate would suffer)


Comparison

gain: 25-250%

open question: how much better could we have done?


Non-stationarity

  • Nodes may suddenly shift gears

    • deliberately malicious, virus, detach/rejoin

    • underlying reliability distribution changes

  • Solution

    • window-based rating (reduce window t from infinite to 20)

  • Experiment: “blackout” at round 300 (30% affected)
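
To see why a short window helps after a sudden change, the sketch below rates a single hypothetical node that is correct for 300 rounds and then wrong for 20. The outcome sequence is made up for illustration and is not data from the experiment.

```python
from collections import deque


def rating_after(outcomes, window=None):
    """Fraction of correct results among the last `window` outcomes.

    window=None uses the full history (the "infinite" window).
    """
    recent = deque(outcomes, maxlen=window)
    return sum(recent) / len(recent) if recent else 0.0


# A node that is correct for 300 rounds, then wrong for 20 rounds.
outcomes = [True] * 300 + [False] * 20
full_history = rating_after(outcomes)          # 300 / 320, still looks reliable
windowed = rating_after(outcomes, window=20)   # only the recent failures count
```

With the full history the node still rates 0.9375, while the 20-round window has already dropped it to 0.0, so the scheduler stops trusting it much sooner.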


Role of ltarget

  • Key parameter

  • Too large

    • groups will be too large (low throughput)

  • Too small

    • groups will be too small (low success rate)

  • Adaptively learn it (parameterless)

    • maximizing r * s: “goodput”

    • or could bias toward r or s


Adaptive Algorithm

  • Multi-objective optimization

    • choose target LOC to simultaneously maximize throughput r and success rate s

      • a1 r + a2 s

    • use weighted combination to reduce multiple objectives to a single objective

    • employ hill-climbing and feedback techniques to control dynamic parameter adjustment
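
The slides name hill-climbing with feedback but give no details, so this is a generic sketch rather than the authors' controller. It assumes a hypothetical callback `measure(l_target)` that runs one scheduling interval and returns (r, s), with r normalized to [0, 1] so that the weighted sum a1 r + a2 s is meaningful; the step-halving rule is also an assumption.

```python
def adapt_l_target(measure, l_target=0.5, step=0.05, rounds=50,
                   a1=0.5, a2=0.5):
    """Hill-climb l_target to maximize a1 * r + a2 * s.

    `measure(l_target)` is a hypothetical callback returning (r, s)
    for one interval, with r normalized to [0, 1].
    """
    def objective(t):
        r, s = measure(t)
        return a1 * r + a2 * s

    best = objective(l_target)
    direction = 1.0
    for _ in range(rounds):
        candidate = min(1.0, max(0.0, l_target + direction * step))
        score = objective(candidate)
        if score > best:
            l_target, best = candidate, score   # improvement: keep climbing
        else:
            direction = -direction              # feedback: try the other way
            step *= 0.5                         # and take smaller steps
    return l_target
```

Setting a1 = 1 and a2 = 0 biases the search entirely toward throughput, while equal weights trade the two objectives against each other.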


Adapting ltarget

  • Blackout example


Throughput (a1=1, a2=0)


Current/Future Work

  • Implementation of reputation-based scheduling framework (BOINC and PL)

  • Mechanisms to retain node identities (hence ri) under node churn

    • “node signatures” that capture the characteristics of the node


Current/Future Work (cont’d)

  • Timeliness

    • extending reliability to encompass time

    • a node whose performance is highly variable is less reliable

  • Client collusion

    • detection: group signatures

    • prevention:

      • combine quiz-based tasks with reputation systems

      • form random groupings


Thank you.

