Processing Data-Intensive Queries in Petabyte-Scale Scientific Databases

Big Picture • Ensure high throughput for concurrent accesses to peta-scale scientific datasets • Data-intensive analysis queries • Correlate, mine, and extract features • Batch workloads with multiple simultaneous queries • Join data partitioned and distributed across multiple nodes • Scale of exploration limited • I/O: Scanning vast amounts of data over hours or days • Network: Transferring lots of data over large distances

Querying at Global Scale • SkyQuery database federation for Astronomy • Publicly accessible virtual telescope • Sharing of heterogeneous data • Geographically dispersed (30 across NA, EA, EU) • High network cost for federated join queries • Joins on terabyte datasets between nodes • Queries last minutes, producing hundreds of MB in results • Network transfers consume up to 70% of the time • Data volume and geography limit scale

Incorporating Network Structure

Network-Aware Join Scheduling • Capture network heterogeneity • Metric that exploits excess capacity for routing • Decentralized local optimizations • Two-approximate, MST-based solution • Supports parallelism and trade-offs with I/O cost • Ten-fold reduction in network utilization for SkyQuery (ICDE’08)
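The MST-based idea on this slide can be illustrated with a minimal sketch. It assumes a hypothetical symmetric link cost between federation nodes (the node names and costs below are invented for illustration) and uses plain Prim's algorithm to pick the cheapest set of links connecting all nodes. It is not the routing metric or the two-approximate algorithm from the ICDE'08 work, only the spanning-tree building block.

```python
import heapq


def minimum_spanning_tree(nodes, link_cost):
    """Prim's algorithm over an undirected graph.

    link_cost maps frozenset({a, b}) -> cost; the cost values are a
    hypothetical stand-in for a network-aware metric.
    Returns the chosen edges as (from_node, to_node, cost) tuples.
    """
    start = nodes[0]
    in_tree = {start}
    edges = []
    heap = []

    def push_links(u):
        # Offer every link from u to a node that is not yet in the tree.
        for w in nodes:
            key = frozenset((u, w))
            if w not in in_tree and key in link_cost:
                heapq.heappush(heap, (link_cost[key], u, w))

    push_links(start)
    while heap and len(in_tree) < len(nodes):
        cost, u, v = heapq.heappop(heap)
        if v in in_tree:
            continue  # stale entry: v was reached through a cheaper link
        in_tree.add(v)
        edges.append((u, v, cost))
        push_links(v)
    return edges


# Hypothetical four-node federation with made-up link costs.
NODES = ["A", "B", "C", "D"]
COSTS = {
    frozenset(("A", "B")): 4,
    frozenset(("A", "C")): 1,
    frozenset(("B", "C")): 2,
    frozenset(("B", "D")): 5,
    frozenset(("C", "D")): 8,
}
tree = minimum_spanning_tree(NODES, COSTS)
print(tree, sum(cost for _, _, cost in tree))
```

A join schedule could then ship intermediate results along the tree edges instead of over arbitrary node pairs; how the real system derives its costs is outside what the slide states.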

Scanning Peta-Scale Data • Data intensive scan queries • Executed against a clustered index • Span multiple nodes (partitioned by space/time in cluster) • Incredibly I/O bound • Full DB scans lasting hours or days • Multiple concurrent queries (millions/month) • Significant data reuse between queries [Figure panels: Turbulence, Astronomy]

LifeRaft: Data-Driven Batch Scheduling • Schedule queries greedily based on contention • Contentious regions amortize I/O over more queries • Two-fold improvement in throughput (CIDR’09) [Figure: astronomers submit queries Q1, Q2, Q3; pre-processing and decomposition into HTM sub-query regions; reordering and co-scheduling by LifeRaft scheduling; query results. Example query: SELECT ... FROM … WHERE region(‘circle 181.3 -0.76 6.5’) and specclass = 2 and …]
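Greedy scheduling by contention can be sketched as follows. This is a simplified illustration, assuming every query has already been decomposed into (query, bucket) sub-queries and that the whole batch is known up front; the function name and the sample data are hypothetical, and the real scheduler also handles arrivals over time.

```python
from collections import defaultdict


def contention_schedule(subqueries):
    """Serve the most contended bucket first.

    subqueries: iterable of (query_id, bucket_id) pairs.
    Returns a list of (bucket_id, [query_ids]) in service order, so each
    bucket is read once and shared by every query waiting on it.
    """
    pending = defaultdict(list)
    for query_id, bucket in subqueries:
        pending[bucket].append(query_id)

    order = []
    while pending:
        # Most waiting queries wins; ties go to the lower bucket id.
        bucket = max(pending, key=lambda b: (len(pending[b]), -b))
        order.append((bucket, pending.pop(bucket)))
    return order


# Hypothetical batch: three queries overlapping on buckets 2 and 3.
batch = [("Q1", 1), ("Q1", 2), ("Q2", 2), ("Q2", 3), ("Q3", 2), ("Q3", 3)]
order = contention_schedule(batch)
print(order)
```

In this toy batch, six sub-queries are answered with three bucket reads, which is the amortization the slide refers to.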

Job-Aware Batch Scheduling • Sequence of queries related to the same experiment • Predict I/O for long-running experiments • Queries may be order dependent • Batch interface for scientists • Session IDs to explicitly link queries • Pre-declare time/space regions of interest • Pre-package operations • Submit all queries at once • Pre-fetching to improve response time • Bounding box over the data accessed • Extrapolate trajectory of job based on time/space
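The pre-fetching bullets (bounding box plus trajectory extrapolation) can be illustrated with a small sketch. It assumes, purely for illustration, that a job's next access continues the same linear drift in time step and in space as its last two accesses; the `Access` type and `predict_next` function are hypothetical names, not part of any interface described in the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Access:
    """Bounding box over the data one query of a job touched."""
    time_step: int
    lo: tuple  # lower corner (x, y, z)
    hi: tuple  # upper corner (x, y, z)


def predict_next(prev: Access, curr: Access) -> Access:
    """Linear extrapolation: repeat the last observed shift in time and space."""
    dt = curr.time_step - prev.time_step
    lo = tuple(c + (c - p) for p, c in zip(prev.lo, curr.lo))
    hi = tuple(c + (c - p) for p, c in zip(prev.hi, curr.hi))
    return Access(curr.time_step + dt, lo, hi)


# Hypothetical job sliding 4 units along x at each successive time step.
first = Access(10, (0, 0, 0), (8, 8, 8))
second = Access(11, (4, 0, 0), (12, 8, 8))
guess = predict_next(first, second)
print(guess)
```

A scheduler could pre-fetch the data inside the predicted box while the current query runs; a wrong guess costs wasted I/O, so a real system would need a policy for how far ahead to trust the extrapolation.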

Job-Aware Batch Scheduling • Delays evaluation of time-steps that are accessed in the future [Figure: jobs J1–J5 and the time steps T1–T5 they access, comparing the order in which LifeRaft and Job-Aware LifeRaft revisit time steps]

Extending Batch Scheduling • Provide starvation resistance • Short interactive queries that focus on small region • Soft constraints on completion order • Hard constraints on response time [Figure: User Perceived Delay (Turbulence, July 22nd), annotated with 4x overhead]
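One way to picture starvation resistance is a score that blends how long a bucket's oldest request has waited with how many requests it would satisfy. The sketch below is an assumption for illustration, not the scheduler's actual formula: `bias` is taken to run from 0 (pure contention) to 1 (pure age), matching the labels used in the supplementary scheduling examples, and both terms are normalized to the range 0 to 1.

```python
def pick_bucket(pending, now, bias):
    """Choose the next bucket to read.

    pending: dict mapping bucket_id -> list of arrival times of the
             sub-queries waiting on that bucket.
    bias:    0.0 favors contention only, 1.0 favors age only
             (hypothetical blending rule).
    """
    max_count = max(len(arrivals) for arrivals in pending.values())
    # Fall back to 1.0 so the division is safe when nothing has waited yet.
    max_age = max(now - min(arrivals) for arrivals in pending.values()) or 1.0

    def score(bucket):
        arrivals = pending[bucket]
        contention = len(arrivals) / max_count
        age = (now - min(arrivals)) / max_age
        return bias * age + (1 - bias) * contention

    # sorted() makes tie-breaking deterministic (lowest bucket id wins).
    return max(sorted(pending), key=score)


# Hypothetical state: bucket 1 has one old request, bucket 2 three recent ones.
waiting = {1: [0.0], 2: [8.0, 9.0, 9.5]}
by_contention = pick_bucket(waiting, now=10.0, bias=0.0)
by_age = pick_bucket(waiting, now=10.0, bias=1.0)
print(by_contention, by_age)
```

With `bias=0.0` the popular bucket 2 is read first and the lone old request keeps waiting; with `bias=1.0` bucket 1 is served first. Intermediate values trade throughput against the delay seen by short interactive queries.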

Extending Batch Scheduling • Cooperative LifeRaft • Beyond single node LifeRaft • Coordinate scheduling across multiple nodes • Communicate to refine local decisions • Avoid delaying a query that spans multiple nodes • Heterogeneity in workload allocation and performance

Thank You!

Supplementary Slides

Extending Batch Scheduling • Query buffering • Large intermediate results • May need to page results to disk

A Case for Batch Processing • 70% of queries reuse turbulence simulation results from a dozen timesteps • Varied query sizes ranging from <1s to several hours [Figure axis: query duration, from <1 sec to >1 hr]

Scheduling Behavior • Sub-divide queries by bucket: Qi – Qi1, Qi2, Qi3; Qj – Qj3, Qj4, Qj5, Qj6, Qj7, Qj8; Qj – Qj5, Qj6, Qj7, Qj8 • Assumptions: • Inter-query time of 1 sec • I/O for each bucket of 1 sec • Cache size of 2 • Join cost is negligible [Figure: queries Qi, Qj, Qk overlaid on buckets]

Arrival order with no sharing • Completion Times: Qi – 3 sec, Qj – 8 sec, Qk – 13 sec • Avg – 8 sec • Tp – .2 qry/sec [Figure: timeline of sub-queries and the buckets (B1–B8) they read, marking the arrival and end of Qi, Qj, and Qk]

Age based scheduling (bias 1) • Completion Times: Qi – 3 sec, Qj – 7 sec, Qk – 7 sec • Avg – 5.6 sec • Tp – .33 qry/sec [Figure: timeline in which sub-queries that read the same bucket, such as Qj4 and Qk4, are served together, marking the arrival and end of Qi, Qj, and Qk]

Contention based scheduling (bias 0) • Completion Times: Qi – 7 sec, Qj – 5 sec, Qk – 6 sec • Avg – 6 sec (5.6) • Tp – .38 qry/sec (.33) [Figure: timeline in which sub-queries that read the same bucket, such as Qj6 and Qk6, are served together, marking the arrival and end of Qi, Qj, and Qk]
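The averages and throughputs quoted for the three schedules can be checked with a few lines of arithmetic. The sketch assumes that each completion time is measured from the query's own arrival, that queries arrive 1 sec apart (the stated inter-query time), and that throughput is the number of queries divided by the time at which the last one finishes. The throughput definition is an inference from the slide numbers, not something the slides state.

```python
# Arrival times implied by an inter-query time of 1 sec.
ARRIVALS = {"Qi": 0, "Qj": 1, "Qk": 2}

# Completion times (sec after arrival) as listed on the slides.
SCHEDULES = {
    "no sharing": {"Qi": 3, "Qj": 8, "Qk": 13},
    "age based (bias 1)": {"Qi": 3, "Qj": 7, "Qk": 7},
    "contention based (bias 0)": {"Qi": 7, "Qj": 5, "Qk": 6},
}


def summarize(completion):
    """Return (average completion time, throughput in queries per second)."""
    avg = sum(completion.values()) / len(completion)
    # Wall-clock moment when the last query finishes (assumed definition).
    makespan = max(ARRIVALS[q] + t for q, t in completion.items())
    return avg, len(completion) / makespan


SUMMARY = {name: summarize(c) for name, c in SCHEDULES.items()}
for name, (avg, tp) in SUMMARY.items():
    print(f"{name}: avg {avg:.2f} sec, throughput {tp:.3f} qry/sec")
```

Under these assumptions the script gives 8.00 sec and 0.200 qry/sec with no sharing, 5.67 sec and 0.333 qry/sec for age-based scheduling, and 6.00 sec and 0.375 qry/sec for contention-based scheduling. These line up with the slide values once rounding is taken into account (5.6 and .38 on the slides).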

Reducing I/O: Adaptive Physical Design • Minimize cost of query execution and transitioning • 40% reduction in I/O
