LifeRaft: Data-Driven, Batch Processing for the Exploration of Scientific Databases


BETTER LUCK NEXT TIME!


Problem

[Figure: queries Q1, Q2, Q3, and Q4]


Goals

Eliminate redundant I/O to improve query throughput

  • Batch queries that exhibit data sharing

    • Pre-process queries to identify data sharing

    • Co-schedule queries that access the same data

    • Access contentious data first to maximize sharing

  • Starvation resistance

    • Avoid indefinite queuing times (response time)

    • Enforce some constraints on completion order


Target Applications

  • Data intensive scan queries

    • Executed against a clustered index

    • Clustered and federated databases (e.g. joins that correlate multiple nodes)

  • Peta-scale astronomy (Pan-STARRS)

    • Data are partitioned spatially

    • Many queries scan full DB and last hours or days

  • Cross-match

    • Probabilistic spatial join across multiple databases


Filter and Refine

  • Filter queries

    • Pre-process queries to determine join buckets

    • Buckets B1, …, Bn and queries Q1, …, Qm

    • Workload Wij denotes the objects from Qi that overlap Bj

  • Refinement

    • Read buckets one-at-a-time

    • Sort-merge join (sort by HTM ID)

    • Query specific predicates applied on output tuples
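
The filter and refine steps above can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes each bucket covers a contiguous range of integer HTM IDs, it uses plain equality on HTM ID in place of the probabilistic spatial match, and the names `bucket_of`, `filter_queries`, and `merge_join` are hypothetical.

```python
from bisect import bisect_right
from collections import defaultdict


def bucket_of(htm_id, bucket_starts):
    """Return the index of the bucket whose HTM ID range contains htm_id.

    bucket_starts is the sorted list of the first HTM ID of each bucket
    (an assumed layout, chosen only for this sketch).
    """
    return bisect_right(bucket_starts, htm_id) - 1


def filter_queries(queries, bucket_starts):
    """Filter step: workload[j][i] holds the objects of query i that fall in bucket j."""
    workload = defaultdict(lambda: defaultdict(list))
    for query_id, objects in queries.items():
        for htm_id in objects:
            workload[bucket_of(htm_id, bucket_starts)][query_id].append(htm_id)
    return workload


def merge_join(bucket_rows, query_objects):
    """Refine step: sort both inputs by HTM ID, then merge and emit matching IDs.

    Assumes IDs are unique within each input.
    """
    left, right = sorted(bucket_rows), sorted(query_objects)
    matches, a, b = [], 0, 0
    while a < len(left) and b < len(right):
        if left[a] < right[b]:
            a += 1
        elif left[a] > right[b]:
            b += 1
        else:
            matches.append(left[a])
            a += 1
            b += 1
    return matches


# Two queries over three buckets that start at HTM IDs 0, 10, and 20.
queries = {"Q1": [5, 12, 25], "Q2": [11, 30]}
workload = filter_queries(queries, [0, 10, 20])
```

In this example both queries contribute objects to the second bucket, so a single read of that bucket can serve both of them.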


Workload Throughput Metric

  • Schedule buckets greedily in order of decreasing workload throughput

  • Exploits data regions that experience contention

  • May starve requests

    • Favors buckets experiencing frequent reuse

    • No guarantee a particular bucket or query receives service
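
The greedy policy above can be illustrated with a short Python sketch. The slides do not define the metric, so the form used here is an assumption: the workload throughput of a bucket is taken to be its number of pending objects divided by the estimated cost of serving it (bucket I/O unless the bucket is cached, plus a per-object join cost). The names `workload_throughput` and `greedy_order` are hypothetical.

```python
def workload_throughput(pending_objects, cached, io_cost=1.0, join_cost_per_object=0.0):
    """Objects joined per second if this bucket is served next (assumed form)."""
    if pending_objects == 0:
        return 0.0
    cost = (0.0 if cached else io_cost) + join_cost_per_object * pending_objects
    # A cached bucket with negligible join cost is effectively free to serve.
    return float("inf") if cost == 0 else pending_objects / cost


def greedy_order(pending, cache=frozenset(), io_cost=1.0):
    """Order bucket ids by decreasing workload throughput.

    pending maps bucket id -> number of queued objects across all queries.
    """
    return sorted(
        pending,
        key=lambda b: workload_throughput(pending[b], b in cache, io_cost),
        reverse=True,
    )


pending = {"B1": 30, "B2": 5, "B3": 12}
order = greedy_order(pending)
order_with_cache = greedy_order(pending, cache={"B2"})
```

With nothing cached, the lightly loaded bucket B2 is served last. If new work keeps arriving for B1 and B3, B2 keeps being pushed back, which is the starvation risk noted above.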


Aged Workload Throughput Metric

  • Inspired by disk-drive head scheduling

  • Balance arrival order (low response time) with contention (high throughput)

  • Adaptive trade-offs based on workload saturation

    • Maximize rate at which objects are joined during saturated workloads

    • Enforce completion order (queuing times) to prevent indefinite starvation during low saturation
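
One way to picture the trade-off is a score that blends contention with age through a bias parameter. The linear blend below is an assumption made for illustration, since the transcript does not give the formula. It follows the later slides only in that a bias of 1 means purely age based and a bias of 0 means purely contention based. The names `aged_score` and `pick_next` are hypothetical.

```python
def aged_score(throughput, age, max_throughput, max_age, bias):
    """Blend normalized contention and normalized age; bias is in [0, 1]."""
    t = throughput / max_throughput if max_throughput else 0.0
    a = age / max_age if max_age else 0.0
    return (1.0 - bias) * t + bias * a


def pick_next(buckets, bias):
    """Pick the bucket with the highest blended score.

    buckets maps bucket id -> (workload throughput, age in seconds of the
    oldest request waiting on that bucket).
    """
    max_t = max(t for t, _ in buckets.values())
    max_a = max(a for _, a in buckets.values())
    return max(buckets, key=lambda b: aged_score(*buckets[b], max_t, max_a, bias))


# B1 is heavily contended but recent; B2 is lightly loaded but has waited longest.
buckets = {"B1": (30.0, 1.0), "B2": (5.0, 9.0)}
contention_choice = pick_next(buckets, bias=0.0)
age_choice = pick_next(buckets, bias=1.0)
```

Raising the bias moves the choice from the contended bucket toward the one that has waited longest, so a workload-adaptive scheduler could lower the bias as saturation grows.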


Scheduling Behavior

Sub-divide queries by bucket:

  • Qi – Qi1, Qi2, Qi3

  • Qj – Qj3, Qj4, Qj5, Qj6, Qj7, Qj8

  • Qj – Qj5, Qj6, Qj7, Qj8

Assumptions:

  • Inter-query time of 1 sec

  • I/O for each bucket of 1 sec

  • Cache size of 2

  • Join cost is negligible


Arrival order with no sharing

[Figure: timeline of bucket reads serving the sub-queries of Qi, Qj, and Qk in arrival order, with arrival and end markers for each query]

Completion Times:

Qi – 3 sec

Qj – 8 sec

Qk – 13 sec

Avg – 8 sec

Tp – .2 qry/sec


Age based scheduling (bias 1)

[Figure: timeline of bucket reads under age based scheduling, with sub-queries of Qi, Qj, and Qk sharing reads, and arrival and end markers for each query]

Completion Times:

Qi – 3 sec

Qj – 7 sec

Qk – 7 sec

Avg – 5.6 sec

Tp – .33 qry/sec


Contention based scheduling (bias 0)

[Figure: timeline of bucket reads under contention based scheduling, with sub-queries of Qi, Qj, and Qk sharing reads, and arrival and end markers for each query]

Completion Times:

Qi – 7 sec

Qj – 5 sec

Qk – 6 sec

Avg – 6 sec (5.6 with bias 1)

Tp – .38 qry/sec (.33 with bias 1)
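
The average completion times quoted on the three scheduling slides can be checked with a few lines of Python. The per-query values are copied from the slides; the dictionary keys are just labels for this sketch.

```python
completion_times = {
    "arrival order, no sharing": {"Qi": 3, "Qj": 8, "Qk": 13},
    "age based (bias 1)": {"Qi": 3, "Qj": 7, "Qk": 7},
    "contention based (bias 0)": {"Qi": 7, "Qj": 5, "Qk": 6},
}


def average(times):
    """Mean completion time in seconds over all queries of one schedule."""
    return sum(times.values()) / len(times)


averages = {name: average(times) for name, times in completion_times.items()}

for name, avg in averages.items():
    print(f"{name}: {avg:.2f} sec")
```

The age based average works out to 17/3, roughly 5.67 sec, which the slide shows as 5.6.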


Throughput Performance


Tuning the age bias

  • Throughput performance gap grows while response time gap is insensitive to saturation

  • Increasing age bias is more attractive at low saturation


Parameter tuning using trade-off curves


Discussion

  • Impact of caching strategies

  • Workload overflow

    • Large intermediate join results

    • Migrate pairs of workload and bucket

  • Beyond completion order

    • Higher priority for interactive queries

  • Batch processing in a clustered environment

    P. Agrawal, D. Kifer, and C. Olston. Scheduling Shared Scans of Large Data Files. In VLDB, 2008.


WHAT ABOUT US?


Filter and refine

  • Partition data into buckets


Average Response Time


Outline

  • Motivation

    • Goals for data-driven, batch scheduling

    • Target application (SkyQuery)

  • LifeRaft scheduler

    • Filter and refine queries

    • Throughput maximizing metric

    • Starvation resistance

    • Differences in outcomes

  • Workload adaptive parameter selection

