
Processing Data-Intensive Queries in Petabyte-Scale Scientific Databases


lyle-robles

Presentation Transcript


  1. Processing Data-Intensive Queries in Petabyte-Scale Scientific Databases

  2. Big Picture • Ensure high throughput for concurrent accesses to peta-scale scientific datasets • Data-intensive analysis queries • Correlate, mine, and extract features • Batch workloads with multiple simultaneous queries • Join data partitioned and distributed across multiple nodes • Scale of exploration limited • I/O: Scanning vast amounts of data over hours or days • Network: Transferring lots of data over large distances

  3. Querying on Global-Scale • SkyQuery database federation for Astronomy • Publicly accessible virtual telescope • Sharing of heterogeneous data • Geographically dispersed (30 across NA, EA, EU) • High network cost for federated join queries • Joins on terabyte datasets between nodes • Queries last minutes producing hundreds of MB in results • Network transfers consume up to 70% of the time • Data volume and geography limit scale

  4. Incorporating Network Structure

  5. Network-Aware Join Scheduling • Capture network heterogeneity • Metric that exploits excess capacity for routing • Decentralized local optimizations • Two-approximate, MST-based solution • Supports parallelism and trade-offs with I/O cost • Ten-fold reduction in network utilization for SkyQuery (ICDE’08)
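As a rough illustration of the MST-based idea on this slide, the sketch below builds a minimum spanning tree over a small graph of database sites, where each edge weight stands for a network transfer cost. The site names, the costs, and the function `minimum_spanning_tree` are hypothetical. The slide does not define the routing metric or the decentralized optimization, so this is plain Prim's algorithm under assumed costs, not the scheduler reported at ICDE’08.

```python
import heapq

def minimum_spanning_tree(costs, start):
    """Prim's algorithm. `costs` maps node -> {neighbor: cost} (undirected)."""
    visited = {start}
    # Heap of candidate edges (cost, from_node, to_node) leaving the tree.
    frontier = [(c, start, v) for v, c in costs[start].items()]
    heapq.heapify(frontier)
    tree = []
    while frontier and len(visited) < len(costs):
        cost, u, v = heapq.heappop(frontier)
        if v in visited:
            continue  # edge leads back into the tree, skip it
        visited.add(v)
        tree.append((u, v, cost))
        for w, c in costs[v].items():
            if w not in visited:
                heapq.heappush(frontier, (c, v, w))
    return tree

# Hypothetical transfer costs between four sites (illustrative numbers only).
costs = {
    "A": {"B": 4, "C": 1, "D": 7},
    "B": {"A": 4, "C": 2, "D": 5},
    "C": {"A": 1, "B": 2, "D": 6},
    "D": {"A": 7, "B": 5, "C": 6},
}
tree = minimum_spanning_tree(costs, "A")
total_cost = sum(cost for _, _, cost in tree)
```

Under these assumed costs, the tree edges could serve as the paths along which intermediate join results are shipped, which keeps traffic off the most expensive links.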

  6. Scanning Peta-Scale Data • Data-intensive scan queries • Executed against a clustered index • Span multiple nodes (partitioned by space/time in cluster) • Incredibly I/O bound • Full DB scans lasting hours or days • Multiple concurrent queries (millions/month) • Significant data reuse between queries [Figure panels: Turbulence, Astronomy]

  7. LifeRaft: Data-Driven Batch Scheduling • Schedule queries greedily based on contention • Contentious regions amortize I/O over more queries • Two-fold improvement in throughput (CIDR’09) [Figure: astronomers submit queries Q1, Q2, Q3; pre-processing and decomposition into HTM sub-query regions; reordering and co-scheduling by LifeRaft scheduling; query results returned. Example query: SELECT ... FROM … WHERE region(‘circle 181.3 -0.76 6.5’) and specclass = 2 and …]
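The greedy, contention-driven ordering named on this slide can be sketched as follows. This is a minimal sketch, assuming each query has already been decomposed into the set of buckets (sub-query regions) it must read. The integer bucket ids, the query names, and the function `contention_order` are hypothetical and are not taken from the LifeRaft implementation.

```python
from collections import defaultdict

def contention_order(queries):
    """Return (bucket, [query ids]) pairs, most contended bucket first.

    `queries` maps a query id to the set of buckets it must read.
    """
    waiting = defaultdict(set)
    for qid, buckets in queries.items():
        for b in buckets:
            waiting[b].add(qid)
    schedule = []
    while waiting:
        # Greedy step: pick the bucket with the most waiting sub-queries,
        # breaking ties by the smaller bucket id for determinism.
        bucket = max(waiting, key=lambda b: (len(waiting[b]), -b))
        schedule.append((bucket, sorted(waiting.pop(bucket))))
    return schedule

# Hypothetical workload: three queries over five buckets.
queries = {"Q1": {1, 2, 3}, "Q2": {3, 4}, "Q3": {3, 4, 5}}
schedule = contention_order(queries)
reads = len(schedule)  # one read per bucket, shared by every waiting query
```

In this toy workload each bucket is read once and serves every query waiting on it, so five reads replace the eight that separate execution of the three queries would need.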

  8. Job-Aware Batch Scheduling • Sequence of queries related to the same experiment • Predict I/O for long-running experiments • Queries may be order dependent • Batch interface for Scientists • Session IDs to explicitly link queries • Pre-declare time/space regions of interest • Pre-package operations • Submit all queries at once • Pre-fetching to improve response time • Bounding box over the data accessed • Extrapolate trajectory of job based on time/space
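The pre-fetching bullets (a bounding box over the data accessed, then extrapolating the job's trajectory) might look roughly like the sketch below. It assumes a job touches 3-D positions at consecutive time steps and that a linear extrapolation of the box corners is good enough. Both assumptions, the sample coordinates, and the helper names `bounding_box` and `extrapolate_box` are hypothetical, since the slide does not specify the prediction method.

```python
def bounding_box(points):
    """Axis-aligned bounding box of (x, y, z) points as (mins, maxs)."""
    xs, ys, zs = zip(*points)
    return (min(xs), min(ys), min(zs)), (max(xs), max(ys), max(zs))

def extrapolate_box(prev_box, curr_box):
    """Linearly extrapolate the next box from the last two boxes."""
    (pmin, pmax), (cmin, cmax) = prev_box, curr_box
    # Each corner moves by the same offset it moved in the previous step.
    next_min = tuple(c + (c - p) for p, c in zip(pmin, cmin))
    next_max = tuple(c + (c - p) for p, c in zip(pmax, cmax))
    return next_min, next_max

# Hypothetical positions touched by one job at two consecutive time steps.
step1 = [(0, 0, 0), (2, 1, 1)]
step2 = [(1, 0, 0), (3, 2, 1)]
box1 = bounding_box(step1)
box2 = bounding_box(step2)
predicted = extrapolate_box(box1, box2)  # region to pre-fetch for the next step
```

A scheduler could pre-fetch the data inside `predicted` while the current time step is still being evaluated.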

  9. Job-Aware Batch Scheduling • Delays evaluation of time-steps that are accessed in the future [Figure: jobs J1 to J5 plotted against time steps T1 to T5, with panels labeled LifeRaft and Job-Aware LifeRaft and the annotation "Revisit"]

  10. Extending Batch Scheduling • Provide starvation resistance • Short interactive queries that focus on small region • Soft constraints on completion order • Hard constraints on response time [Figure: User Perceived Delay (Turbulence, July 22nd); annotation: 4x overhead]

  11. Extending Batch Scheduling • Cooperative LifeRaft • Beyond single node LifeRaft • Coordinate scheduling across multiple nodes • Communicate to refine local decisions • Avoid delaying a query that spans multiple nodes • Heterogeneity in workload allocation and performance

  12. Thank You!

  13. Supplementary Slides

  14. Extending Batch Scheduling • Query buffering • Large intermediate results • May need to page results to disk

  15. A Case for Batch Processing • 70% of queries reuse turbulence simulation results from a dozen timesteps • Varied query sizes ranging from <1s to several hours [Figure axis labels: <1 sec, >1 hr]

  16. Scheduling Behavior • Sub-divide queries by bucket: Qi – Qi1, Qi2, Qi3; Qj – Qj3, Qj4, Qj5, Qj6, Qj7, Qj8; Qj – Qj5, Qj6, Qj7, Qj8 • Assumptions: • Inter-query time of 1 sec • I/O for each bucket of 1 sec • Cache size of 2 • Join cost is negligible [Figure: queries Qi, Qj, Qk]
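To make the benefit of sharing concrete, the sketch below counts bucket reads for the three bucket lists on this slide, once with every sub-query reading its own bucket and once with each distinct bucket read a single time. It is a simplification that ignores arrival times and the cache, so it does not reproduce the completion times on the following slides. The third list, which the slide labels Qj a second time, is keyed as `third` here.

```python
# Bucket lists as given on the slide (sub-query Qi1 reads bucket 1, etc.).
queries = {
    "Qi": [1, 2, 3],
    "Qj": [3, 4, 5, 6, 7, 8],
    "third": [5, 6, 7, 8],  # labeled "Qj" a second time on the slide
}
IO_SECONDS_PER_BUCKET = 1  # assumption stated on the slide

# No sharing: every sub-query triggers its own bucket read.
reads_without_sharing = sum(len(buckets) for buckets in queries.values())

# Full sharing: each distinct bucket is read once for all its sub-queries.
reads_with_sharing = len({b for buckets in queries.values() for b in buckets})

io_time_without = reads_without_sharing * IO_SECONDS_PER_BUCKET
io_time_with = reads_with_sharing * IO_SECONDS_PER_BUCKET
```

Under these simplifications the workload needs 13 seconds of I/O without sharing and 8 seconds with full sharing.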

  17. Arrival order with no sharing … • Completion Times: Qi – 3 sec, Qj – 8 sec, Qk – 13 sec • Avg – 8 sec • Tp – .2 qry/sec [Figure: timeline of sub-queries and the buckets (B1 to B8) they read, with arrival and end markers for Qi, Qj, Qk]

  18. Age based scheduling (bias 1) • Completion Times: Qi – 3 sec, Qj – 7 sec, Qk – 7 sec • Avg – 5.6 sec • Tp – .33 qry/sec [Figure: timeline of co-scheduled sub-queries and the buckets (B1 to B8) they read, with arrival and end markers for Qi, Qj, Qk]

  19. Contention based scheduling (bias 0) • Completion Times: Qi – 7 sec, Qj – 5 sec, Qk – 6 sec • Avg – 6 sec (5.6) • Tp – .38 qry/sec (.33) [Figure: timeline of co-scheduled sub-queries and the buckets (B1 to B8) they read, with arrival and end markers for Qi, Qj, Qk]
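The slides label age-based scheduling as bias 1 and contention-based scheduling as bias 0, which suggests a single knob that blends the two criteria. The sketch below is one possible reading, assuming a linear blend of normalized age and normalized contention. The scoring formula, the function `pick_bucket`, and the sample data are assumptions for illustration, not the metric defined by LifeRaft.

```python
def pick_bucket(pending, now, bias):
    """Pick the next bucket to read.

    `pending` maps bucket id -> arrival times of the waiting sub-queries.
    bias=1 ranks purely by the age of the oldest waiter,
    bias=0 ranks purely by the number of waiters (contention).
    """
    # `or 1` avoids division by zero when every waiter arrived just now.
    max_age = max(now - min(arrivals) for arrivals in pending.values()) or 1
    max_count = max(len(arrivals) for arrivals in pending.values())

    def score(bucket):
        arrivals = pending[bucket]
        age = (now - min(arrivals)) / max_age
        contention = len(arrivals) / max_count
        return bias * age + (1 - bias) * contention

    # Iterating in sorted order sends ties to the smaller bucket id.
    return max(sorted(pending), key=score)

# Hypothetical state: bucket 1 has one old waiter, buckets 4 and 7 have two newer ones.
pending = {1: [0], 4: [1, 2], 7: [1, 2]}
now = 3
by_age = pick_bucket(pending, now, bias=1)
by_contention = pick_bucket(pending, now, bias=0)
```

With this sample state, bias 1 selects the bucket holding the oldest waiting sub-query, while bias 0 selects a bucket with more waiters. Intermediate bias values would trade response time for individual queries against overall throughput.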

  20. Reducing I/O: Adaptive Physical Design • Minimize cost of query execution and transitioning • 40% reduction in I/O
