
LifeRaft: Data-Driven, Batch Processing for the Exploration of Scientific Databases




BETTER LUCK NEXT TIME!


Problem

[Figure: queries Q1, Q2, Q3, and Q4]


Goals

Eliminate redundant I/O to improve query throughput

  • Batch queries that exhibit data sharing

    • Pre-process queries to identify data sharing

    • Co-schedule queries that access the same data

    • Access contentious data first to maximize sharing

  • Starvation resistance

    • Avoid indefinite queuing times (response time)

    • Enforce some constraints on completion order


Target Applications

  • Data intensive scan queries

    • Executed against a clustered index

    • Clustered and federated databases (e.g. joins that correlate multiple nodes)

  • Peta-scale astronomy (Pan-STARRS)

    • Data are partitioned spatially

    • Many queries scan full DB and last hours or days

  • Cross-match

    • Probabilistic spatial join across multiple databases


Filter and Refine

  • Filter queries

    • Pre-process queries to determine join buckets

    • Buckets B1,…,Bn and queries Q1,…, Qm

    • Workload Wij denotes the objects from Qi that overlap Bj

  • Refinement

    • Read buckets one-at-a-time

    • Sort-merge join (sort by HTM ID)

    • Query specific predicates applied on output tuples
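The two phases above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the LifeRaft implementation: the bucket boundaries (`bucket_starts`), the function names, and the use of exact HTM ID equality in place of the probabilistic spatial match are all hypothetical simplifications.

```python
from bisect import bisect_right
from collections import defaultdict

# Assumption: bucket j holds HTM IDs in [bucket_starts[j], bucket_starts[j + 1]).
bucket_starts = [0, 100, 200, 300]

def filter_queries(queries):
    """Filter step: compute workload[j][i], the sorted objects of query i in bucket j."""
    workload = defaultdict(lambda: defaultdict(list))
    for i, htm_ids in queries.items():
        for htm_id in htm_ids:
            j = bisect_right(bucket_starts, htm_id) - 1  # bucket containing this ID
            workload[j][i].append(htm_id)
    for per_query in workload.values():
        for ids in per_query.values():
            ids.sort()  # sort by HTM ID so the refine step can merge
    return workload

def refine_bucket(bucket_rows, pending):
    """Refine step: read one bucket (sorted by HTM ID) and merge-join every
    query waiting on it. Exact ID equality stands in for the spatial match."""
    matches = defaultdict(list)
    for i, ids in pending.items():
        b = 0
        for htm_id in ids:
            while b < len(bucket_rows) and bucket_rows[b] < htm_id:
                b += 1
            if b < len(bucket_rows) and bucket_rows[b] == htm_id:
                matches[i].append(htm_id)
    return dict(matches)

queries = {"Q1": [5, 150, 120], "Q2": [150, 250]}
workload = filter_queries(queries)
bucket_1_rows = [110, 120, 150, 199]  # contents of bucket 1, sorted by HTM ID
result = refine_bucket(bucket_1_rows, workload[1])
print(result)
```

Reading bucket 1 once serves both Q1 and Q2, which is the data sharing the scheduler tries to exploit. Query-specific predicates would be applied to `result` afterwards.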


Workload Throughput Metric

  • Schedule buckets greedily in order of decreasing workload throughput

  • Exploits data regions that experience contention

  • May starve requests

    • Favors buckets experiencing frequent reuse

    • No guarantee a particular bucket or query receives service
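A minimal sketch of the greedy ordering follows. The slides do not give the formula for the metric, so the definition used here (pending objects for a bucket divided by the cost of reading it) is an assumption, as are the function names and the sample numbers.

```python
def workload_throughput(pending, io_cost):
    """Assumed metric: objects joined per second of I/O if this bucket is read now.

    pending: {query_id: number of objects waiting on this bucket}
    io_cost: seconds needed to read the bucket
    """
    return sum(pending.values()) / io_cost

def greedy_schedule(buckets, io_cost=1.0):
    """Return bucket IDs in order of decreasing workload throughput."""
    return sorted(
        buckets,
        key=lambda j: workload_throughput(buckets[j], io_cost),
        reverse=True,
    )

# Hypothetical pending workloads per bucket.
buckets = {
    "B1": {"Qi": 10, "Qj": 40, "Qk": 30},
    "B2": {"Qi": 25},
    "B3": {"Qi": 5, "Qj": 50},
}
order = greedy_schedule(buckets)
print(order)
```

B2 is wanted by only one query, so it is served last. If new queries keep arriving for B1 and B3, a scheduler that re-ranks after every read could postpone B2 indefinitely, which is the starvation risk noted above.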


Aged Workload Throughput Metric

  • Inspired by disk-drive head scheduling

  • Balance arrival order (low response time) with contention (high throughput)

  • Adaptive trade-offs based on workload saturation

    • Maximize rate at which objects are joined during saturated workloads

    • Enforce completion order (queuing times) to prevent indefinite starvation during low saturation
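One way to picture the trade-off is a linear blend of contention and age controlled by a bias in [0, 1]. The slides do not state the formula, so the blend below is an assumption chosen to match the later slides, where bias 1 is labeled age based and bias 0 contention based. The names and numbers are hypothetical, and a real scheduler would need to normalize the two terms to comparable units.

```python
def aged_score(throughput, age, bias):
    """Assumed blend: bias = 0 ranks purely by contention, bias = 1 purely by age."""
    return (1.0 - bias) * throughput + bias * age

def pick_next(buckets, now, bias):
    """Pick the bucket with the highest aged score.

    buckets: {bucket_id: (workload throughput, arrival time of oldest waiting request)}
    """
    return max(
        buckets,
        key=lambda j: aged_score(buckets[j][0], now - buckets[j][1], bias),
    )

# Hypothetical state at time 10: B2 has waited longest, B1 is most contended.
buckets = {"B1": (80.0, 9.0), "B2": (25.0, 0.0), "B3": (55.0, 4.0)}
print(pick_next(buckets, now=10.0, bias=0.0))
print(pick_next(buckets, now=10.0, bias=1.0))
```

With bias 0 the most contended bucket wins, and with bias 1 the bucket holding the oldest request wins. Intermediate values trade throughput against response time.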


Scheduling Behavior

[Figure: queries Qi, Qj, and Qk]

  • Assumptions:

    • Inter-query time of 1 sec

    • I/O for each bucket of 1 sec

    • Cache size of 2

    • Join cost is negligible

Sub-divide queries by bucket:

Qi – Qi1, Qi2, Qi3

Qj – Qj3, Qj4, Qj5, Qj6, Qj7, Qj8

Qj – Qj5, Qj6, Qj7, Qj8


Arrival order with no sharing

[Figure: timeline of the arrival and end of Qi, Qj, and Qk, the sub-queries run at each step, and the bucket read at each step]

Completion Times:

  • Qi – 3 sec

  • Qj – 8 sec

  • Qk – 13 sec

  • Avg – 8 sec

  • Tp – 0.2 qry/sec


Age based scheduling (bias 1)

[Figure: timeline of the arrival and end of Qi, Qj, and Qk, the sub-queries co-scheduled at each step, and the bucket read at each step]

Completion Times:

  • Qi – 3 sec

  • Qj – 7 sec

  • Qk – 7 sec

  • Avg – 5.6 sec

  • Tp – 0.33 qry/sec


Contention based scheduling (bias 0)

[Figure: timeline of the arrival and end of Qi, Qj, and Qk, the sub-queries co-scheduled at each step, and the bucket read at each step]

Completion Times:

  • Qi – 7 sec

  • Qj – 5 sec

  • Qk – 6 sec

  • Avg – 6 sec (vs. 5.6 sec with bias 1)

  • Tp – 0.38 qry/sec (vs. 0.33 qry/sec with bias 1)
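The averages on the three scheduling slides can be checked directly from the per-query completion times. The snippet below only redoes that arithmetic. The dictionary keys are labels chosen here for illustration, and throughput is not recomputed because the slides do not state how Tp is derived.

```python
# Completion times in seconds, copied from the three scheduling slides.
completion_times = {
    "arrival order, no sharing": {"Qi": 3, "Qj": 8, "Qk": 13},
    "age based (bias 1)": {"Qi": 3, "Qj": 7, "Qk": 7},
    "contention based (bias 0)": {"Qi": 7, "Qj": 5, "Qk": 6},
}

# Average completion time per schedule.
averages = {
    name: sum(times.values()) / len(times)
    for name, times in completion_times.items()
}

for name, avg in averages.items():
    print(f"{name}: {avg:.2f} sec")
```

The age based average is 17/3, about 5.67 sec, which the slides show truncated as 5.6. Contention based scheduling finishes Qj and Qk sooner but delays Qi from 3 sec to 7 sec, so its average (6 sec) is slightly worse than the age based one even though its throughput is higher.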


Throughput Performance


Tuning the age bias

  • Throughput performance gap grows while response time gap is insensitive to saturation

  • Increasing age bias is more attractive at low saturation


Parameter tuning using trade-off curves


Discussion

  • Impact of caching strategies

  • Workload overflow

    • Large intermediate join results

    • Migrate pairs of workload and bucket

  • Beyond completion order

    • Higher priority for interactive queries

  • Batch processing in a clustered environment

    P. Agrawal, D. Kifer, and C. Olston. Scheduling Shared Scans of Large Data Files. In VLDB, 2008.


WHAT ABOUT US?


Filter and refine

  • Partition data into buckets


Average Response Time


Outline

  • Motivation

    • Goals for data-driven, batch scheduling

    • Target application (SkyQuery)

  • LifeRaft scheduler

    • Filter and refine queries

    • Throughput maximizing metric

    • Starvation resistance

    • Differences in outcomes

  • Workload adaptive parameter selection

