LifeRaft: Data-Driven, Batch Processing for the Exploration of Scientific Databases

Problem

[Figure: queries Q1, Q2, Q3, and Q4]

Goals

Eliminate redundant I/O to improve query throughput

  • Batch queries that exhibit data sharing
    • Pre-process queries to identify data sharing
    • Co-schedule queries that access the same data
    • Access contentious data first to maximize sharing
  • Starvation resistance
    • Avoid indefinite queuing times (response time)
    • Enforce some constraints on completion order
Target Applications
  • Data intensive scan queries
    • Executed against a clustered index
    • Clustered and federated databases (e.g. joins that correlate multiple nodes)
  • Peta-scale astronomy (Pan-STARRS)
    • Data are partitioned spatially
    • Many queries scan full DB and last hours or days
  • Cross-match
    • Probabilistic spatial join across multiple databases
Filter and Refine
  • Filter queries
    • Pre-process queries to determine join buckets
    • Buckets B1,…,Bn and queries Q1,…, Qm
    • Workload Wij denotes the objects from Qi that overlap Bj
  • Refinement
    • Read buckets one-at-a-time
    • Sort-merge join (sort by HTM ID)
    • Query specific predicates applied on output tuples
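
The two phases above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not the LifeRaft implementation: integer keys stand in for HTM IDs, the join is an exact-match merge join rather than a probabilistic spatial join, and `bucket_of`, `read_bucket`, `catalog`, and `predicates` are hypothetical names introduced for the example.

```python
def merge_join(left, right):
    """Merge step of a sort-merge equi-join; both inputs are sorted, keys unique."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] < right[j]:
            i += 1
        elif left[i] > right[j]:
            j += 1
        else:
            out.append(left[i])
            i += 1
            j += 1
    return out


def filter_queries(queries, bucket_of):
    """Filter: build workload[j][i], the keys of query i that overlap bucket j."""
    workload = {}
    for qid, keys in queries.items():
        for key in keys:
            workload.setdefault(bucket_of(key), {}).setdefault(qid, []).append(key)
    return workload


def refine(workload, read_bucket, predicates):
    """Refine: read each bucket once, join every queued query against it."""
    results = {}
    for bucket_id in sorted(workload):
        rows = sorted(read_bucket(bucket_id))  # a single read serves all queries
        for qid, keys in workload[bucket_id].items():
            matches = merge_join(sorted(keys), rows)
            # Query-specific predicates are applied to the join output.
            results.setdefault(qid, []).extend(k for k in matches if predicates[qid](k))
    return results


# Hypothetical data: buckets cover ranges of 10 consecutive keys.
catalog = {0: [1, 3, 5, 7], 1: [11, 12, 18], 2: [20, 25]}
reads = []

def bucket_of(key):
    return key // 10

def read_bucket(bucket_id):
    reads.append(bucket_id)  # record each I/O
    return catalog.get(bucket_id, [])

queries = {"Q1": [3, 12, 25], "Q2": [5, 11, 12, 19]}
predicates = {"Q1": lambda k: True, "Q2": lambda k: k % 2 == 1}

workload = filter_queries(queries, bucket_of)
results = refine(workload, read_bucket, predicates)
```

In this toy run buckets 0 and 1 are each read once even though both queries touch them, which is the redundant I/O the batching is meant to remove.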
Workload Throughput Metric
  • Schedule buckets greedily in order of decreasing workload throughput
  • Exploits data regions that experience contention
  • May starve requests
    • Favors buckets experiencing frequent reuse
    • No guarantee a particular bucket or query receives service
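
A minimal sketch of the greedy policy follows. The slides do not define the metric, so the formula here is an assumption: a bucket's workload throughput is taken to be the number of queued objects that overlap it divided by the cost of reading it. The names `workload_throughput`, `greedy_schedule`, and `read_cost` are hypothetical.

```python
def workload_throughput(pending, read_cost):
    """Assumed metric: queued objects served per unit of read cost."""
    return sum(len(objs) for objs in pending.values()) / read_cost


def greedy_schedule(workload, read_cost):
    """Serve buckets in order of decreasing workload throughput."""
    remaining = dict(workload)
    order = []
    while remaining:
        # Ties are broken in favour of the lower bucket number.
        best = max(
            remaining,
            key=lambda b: (workload_throughput(remaining[b], read_cost[b]), -b),
        )
        order.append(best)
        del remaining[best]
    return order


# workload[bucket][query] -> objects of that query overlapping the bucket
workload = {
    1: {"Q1": ["a", "b"], "Q2": ["c"]},
    2: {"Q1": ["d"]},
    3: {"Q2": ["e", "f", "g"], "Q3": ["h", "i"]},
}
read_cost = {1: 1.0, 2: 1.0, 3: 1.0}

order = greedy_schedule(workload, read_cost)
```

Here bucket 2 is served last. In a live system where new queries keep arriving for buckets 1 and 3, the same rule could postpone bucket 2 indefinitely, which is the starvation risk noted above.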
Aged Workload Throughput Metric
  • Inspired by disk-drive head scheduling
  • Balance arrival order (low response time) with contention (high throughput)
  • Adaptive trade-offs based on workload saturation
    • Maximize rate at which objects are joined during saturated workloads
    • Enforce completion order (queuing times) to prevent indefinite starvation during low saturation
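
One way to picture the trade-off is a score that blends contention and age through a single bias parameter. The slides do not give the formula, so the linear blend below is an assumption chosen only to match the two extremes named later in the deck (bias 1 is age based, bias 0 is contention based); `aged_score` and `pick_bucket` are hypothetical names.

```python
def aged_score(throughput, age, bias, max_throughput, max_age):
    """Assumed blend: bias=0 ranks by contention only, bias=1 by age only."""
    t = throughput / max_throughput if max_throughput else 0.0
    a = age / max_age if max_age else 0.0
    return (1.0 - bias) * t + bias * a


def pick_bucket(candidates, bias):
    """candidates maps bucket -> (workload throughput, age of oldest request in sec)."""
    max_t = max(t for t, _ in candidates.values())
    max_a = max(a for _, a in candidates.values())
    return max(
        candidates,
        key=lambda b: aged_score(candidates[b][0], candidates[b][1], bias, max_t, max_a),
    )


candidates = {
    "B1": (5.0, 1.0),   # heavily shared, requested recently
    "B2": (1.0, 10.0),  # lightly shared, waiting the longest
    "B3": (3.0, 4.0),
}

contention_choice = pick_bucket(candidates, bias=0.0)
age_choice = pick_bucket(candidates, bias=1.0)
blended_choice = pick_bucket(candidates, bias=0.5)
```

With these numbers a bias of 0 picks the contended bucket B1, while a bias of 1 or 0.5 picks the oldest request, B2.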
Scheduling Behavior

[Figure: queries Qi, Qj, and Qk]

Sub-divide queries by bucket:

  • Assumptions:
    • Inter-query time of 1 sec
    • I/O for each bucket of 1 sec
    • Cache size of 2
    • Join cost is negligible

Qi – Qi1, Qi2, Qi3

Qj – Qj3, Qj4, Qj5, Qj6, Qj7, Qj8

Qj – Qj5, Qj6, Qj7, Qj8
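
As a rough illustration of why sharing matters, the sketch below counts bucket reads for the sub-divisions listed above, with and without sharing. It assumes the third listed sub-division belongs to the third query in the figure (labelled `Qk` here) and it ignores arrival times and the cache, so it does not reproduce the timelines on the following slides.

```python
# Buckets touched by each query, taken from the sub-division listed above.
# Assumption: the third entry is labelled "Qk" for the purposes of this sketch.
queries = {
    "Qi": [1, 2, 3],
    "Qj": [3, 4, 5, 6, 7, 8],
    "Qk": [5, 6, 7, 8],
}
IO_SEC_PER_BUCKET = 1  # from the stated assumption of 1 sec of I/O per bucket

# No sharing: every sub-query triggers its own bucket read.
reads_no_sharing = sum(len(buckets) for buckets in queries.values())

# Perfect sharing: each distinct bucket is read once for all queries.
reads_full_sharing = len(set().union(*queries.values()))

io_time_no_sharing = reads_no_sharing * IO_SEC_PER_BUCKET
io_time_full_sharing = reads_full_sharing * IO_SEC_PER_BUCKET
```

Under these assumptions the read count drops from 13 to 8 when every overlapping sub-query is co-scheduled.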

Arrival order with no sharing

[Figure: timeline of sub-queries and bucket reads, marking the arrival and end of Qi, Qj, and Qk]

Completion Times:
  • Qi – 3 sec
  • Qj – 8 sec
  • Qk – 13 sec
  • Avg – 8 sec
  • Tp – 0.2 qry/sec

Age based scheduling (bias 1)

[Figure: timeline of sub-queries and bucket reads, marking the arrival and end of Qi, Qj, and Qk; sub-queries of different queries that touch the same bucket are serviced together]

Completion Times:
  • Qi – 3 sec
  • Qj – 7 sec
  • Qk – 7 sec
  • Avg – 5.6 sec
  • Tp – 0.33 qry/sec

Contention based scheduling (bias 0)

[Figure: timeline of sub-queries and bucket reads, marking the arrival and end of Qi, Qj, and Qk; sub-queries of different queries that touch the same bucket are serviced together]

Completion Times:
  • Qi – 7 sec
  • Qj – 5 sec
  • Qk – 6 sec
  • Avg – 6 sec (age based: 5.6 sec)
  • Tp – 0.38 qry/sec (age based: 0.33 qry/sec)
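
The averages quoted on the three scheduling slides can be checked with a few lines of arithmetic. The completion times below are copied from the slides; the policy labels used as dictionary keys are just names for this sketch, and the throughput figures are not recomputed because the slides do not state how they were derived.

```python
# Completion times in seconds for (Qi, Qj, Qk), as listed on the slides.
completion = {
    "arrival order, no sharing": (3, 8, 13),
    "age based (bias 1)": (3, 7, 7),
    "contention based (bias 0)": (7, 5, 6),
}

averages = {name: sum(times) / len(times) for name, times in completion.items()}
lowest_average = min(averages, key=averages.get)

for name, avg in averages.items():
    print(f"{name}: average completion {avg:.2f} sec")
```

The age based average is 17/3, about 5.67 sec, which the slide shows as 5.6. Contention based scheduling has the higher reported throughput but makes Qi wait 7 sec instead of 3, which illustrates the response time cost of favouring contention.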

Tuning the age bias
  • Throughput performance gap grows while response time gap is insensitive to saturation
  • Increasing age bias is more attractive at low saturation
Discussion
  • Impact of caching strategies
  • Workload overflow
    • Large intermediate join results
    • Migrate pairs of workload and bucket
  • Beyond completion order
    • Higher priority for interactive queries
  • Batch processing in a clustered environment

P. Agrawal, D. Kifer, and C. Olston. Scheduling Shared Scans of Large Data Files. In VLDB, 2008.

Filter and refine
  • Partition data into buckets
Outline
  • Motivation
    • Goals for data-driven, batch scheduling
    • Target application (SkyQuery)
  • LifeRaft scheduler
    • Filter and refine queries
    • Throughput maximizing metric
    • Starvation resistance
    • Differences in outcomes
  • Workload adaptive parameter selection