
Parallel Database Systems Instructor: Dr. Yingshu Li Student: Chunyu Ai


Presentation Transcript


  1. Parallel Database Systems Instructor: Dr. Yingshu Li Student: Chunyu Ai

  2. Main Message • Technology trends give many processors and storage units inexpensively • To analyze large quantities of data, sequential (regular) access patterns are 100x faster • parallelism is 1000x faster (trades time for money) • Relational systems show many parallel algorithms.

  3. Implications of Hardware Trends • Large disc farms will be inexpensive ($10K/TB) • Large RAM databases will be inexpensive ($1K/GB) • Processors will be inexpensive • So the building block will be a processor with large RAM, lots of disc, and lots of network bandwidth (e.g., 1K SPECint CPU, 2 GB RAM, 500 GB disc)

  4. Why Parallel Access To Data? • At 10 MB/s, scanning 1 TB takes 1.2 days • A 1,000-way parallel scan of the same data takes about 1.5 minutes • Parallelism: divide a big problem into many smaller ones to be solved in parallel.
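The slide's arithmetic can be checked directly. A minimal sketch (the 1 TB table size is inferred from the slide's 10 MB/s and 1.2-day figures; the slide rounds the parallel scan down to 1.5 minutes):

```python
# Back-of-envelope scan times: 1 TB of data, 10 MB/s sequential scan
# rate per disc, 1,000-way parallelism.
TB = 10**12
rate = 10 * 10**6                       # 10 MB/s per disc

serial_seconds = TB / rate              # one disc scans the whole table
print(f"serial scan: {serial_seconds / 86400:.1f} days")

parallel_seconds = serial_seconds / 1000  # 1,000 discs, one partition each
print(f"1,000-way scan: {parallel_seconds / 60:.1f} minutes")
```

The serial scan comes out at about 1.2 days and the parallel scan at under two minutes, matching the slide's point that bandwidth parallelism turns a day-long scan into a coffee break.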

  5. Implications of Hardware Trends: Clusters • Future servers are CLUSTERS of processors and discs (e.g., CPU, 5 GB RAM, 50 GB disc per node) • Thesis: Many Little will Win over Few Big

  6. Parallel Database Architectures

  7. Parallelism: Performance is the Goal Goal is to get 'good' performance. • Law 1: a parallel system should be faster than the serial system • Law 2: a parallel system should give near-linear scaleup or near-linear speedup or both.

  8. Speed-Up and Scale-Up • Speedup: a fixed-size problem executing on a small system is given to a system that is N-times larger. • Measured by: speedup = (small system elapsed time) / (large system elapsed time) • Speedup is linear if the ratio equals N. • Scaleup: increase the size of both the problem and the system • an N-times larger system is used to perform an N-times larger job • Measured by: scaleup = (small system, small problem elapsed time) / (big system, big problem elapsed time) • Scaleup is linear if the ratio equals 1.
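The two ratios above can be sketched as two small functions (the function names and example timings are illustrative, not from the slides):

```python
def speedup(small_elapsed, large_elapsed):
    """Same problem on an N-times larger system. Linear if result == N."""
    return small_elapsed / large_elapsed

def scaleup(small_sys_small_prob, big_sys_big_prob):
    """N-times larger problem on an N-times larger system. Linear if result == 1."""
    return small_sys_small_prob / big_sys_big_prob

# A 100-second job drops to 12.5 seconds on an 8-node system:
print(speedup(100, 12.5))    # 8.0 -> linear speedup on 8 nodes
# An 8x bigger job on the 8-node system still takes 100 seconds:
print(scaleup(100, 100))     # 1.0 -> linear scaleup
```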

  9. Kinds of Parallel Execution • Pipeline: any sequential program feeds its output to the next sequential program as it is produced. • Partition: inputs are split N ways, copies of the sequential program run on each partition, and outputs are merged M ways.
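A toy single-process illustration of the two styles, using Python generators (the operator names `scan`/`select`/`project` are illustrative; real systems run the partitions on separate nodes):

```python
def scan(rows):                  # source operator
    yield from rows

def select(rows, pred):          # pipelined operator: emits rows as they arrive
    return (r for r in rows if pred(r))

def project(rows, col):
    return (r[col] for r in rows)

rows = [{"id": i, "val": i * 10} for i in range(6)]

# Pipeline parallelism: scan -> select -> project, no intermediate tables.
pipeline = project(select(scan(rows), lambda r: r["id"] % 2 == 0), "val")
result = list(pipeline)
print(result)                    # [0, 20, 40]

# Partition parallelism: split the input 3 ways, run the same plan on each
# partition, then merge the outputs.
partitions = [rows[i::3] for i in range(3)]
merged = [v for part in partitions
            for v in project(select(scan(part), lambda r: r["id"] % 2 == 0), "val")]
print(sorted(merged))            # [0, 20, 40]
```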

  10. The Drawbacks of Parallelism • Startup: creating processes, opening files, optimization • Interference: device (CPU, disc, bus) and logical (lock, hotspot, server, log, ...) • Communication: sending data among nodes • Skew: if tasks get very small, variance exceeds service time

  11. Parallelism: Speedup & Scaleup • Speedup: same job (100 GB), more hardware, less time • Scaleup: bigger job (100 GB to 1 TB), more hardware, same time • Transaction scaleup: more clients and servers (1K to 10K clients), same response time

  12. Database Systems “Hide” Parallelism • Automate system management via tools • data placement • data organization (indexing) • periodic tasks (dump / recover / reorganize) • Automatic fault tolerance • duplex & failover • transactions • Automatic parallelism • among transactions (locking) • within a transaction (parallel execution)

  13. Automatic Data Partitioning Split a SQL table across a subset of nodes & disks. Partition within the set by: • Range — good for equi-joins, range queries, group-by • Hash — good for equi-joins • Round Robin — good to spread load Shared-disk and shared-memory systems are less sensitive to partitioning; shared-nothing benefits from "good" partitioning.
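The three placement schemes can be sketched as routing functions that map a row to a node (function names and boundary values are illustrative, not from any particular system):

```python
import hashlib

def range_partition(key, boundaries):
    """Range: node i gets keys < boundaries[i]; supports range queries."""
    for i, b in enumerate(boundaries):
        if key < b:
            return i
    return len(boundaries)

def hash_partition(key, n_nodes):
    """Hash: deterministic pseudo-random spread; good for equi-joins."""
    digest = hashlib.sha1(str(key).encode()).hexdigest()
    return int(digest, 16) % n_nodes

def round_robin_partition(row_number, n_nodes):
    """Round robin: spreads load evenly but gives no key locality."""
    return row_number % n_nodes

print(range_partition(25, [10, 20, 30]))   # 2: the 20..29 range bucket
print(round_robin_partition(7, 3))         # 1
```

Note that range partitioning keeps adjacent keys together (so a range query touches few nodes), while hash and round robin deliberately scatter them, which is exactly the trade-off the slide lists.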

  14. Index Partitioning • Hash indices partition by hash • B-tree indices partition as a forest of trees, one tree per range (e.g., 0..9, 10..19, 20..29, ... or A..C, D..F, G..M, N..R, S..Z) • Primary index clusters data

  15. Partitioned Execution Spreads computation and IO among processors Partitioned data gives NATURAL parallelism

  16. N x M way Parallelism • N inputs, M outputs, no bottlenecks • Partitioned data • Partitioned and pipelined data flows

  17. Blocking Operators = Short Pipelines • An operator is blocking if it does not produce any output until it has consumed all its input • Examples: sort, aggregates, hash-join (reads all of one operand) • Blocking operators kill pipeline parallelism and make partition parallelism all the more important. • Example: the database-load template has three blocked phases — tape scan to file, sort (generate runs, then merge runs), and table/index inserts (one sort-merge-insert stream per index).
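A toy illustration of why a blocking operator kills the pipeline: a streaming `select` can emit rows as they arrive, but a sort must drain its entire input before its first output row (the logging scheme here is illustrative):

```python
def tagged_source(n, log):
    for i in range(n):
        log.append(f"produce {i}")
        yield (n - i)                 # descending values, so sorting matters

def streaming_select(rows, log):      # non-blocking: one row in, one row out
    for r in rows:
        log.append(f"select sees {r}")
        yield r

def blocking_sort(rows, log):         # blocking: consumes ALL input first
    materialized = list(rows)
    log.append("sort done")
    yield from sorted(materialized)

log = []
out = blocking_sort(streaming_select(tagged_source(3, log), log), log)
first = next(out)                     # asking for ONE row drives the whole scan
print(log)    # every produce/select entry appears before "sort done"
print(first)  # 1
```

Every upstream event lands in the log before `sort done`, showing that the pipeline upstream of the sort ran to completion before the sort emitted anything.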

  18. Parallel Aggregates • For an aggregate function, need a decomposition strategy: • count(S) = Σ count(s(i)), ditto for sum() • avg(S) = (Σ sum(s(i))) / (Σ count(s(i))) • and so on... • For groups, • sub-aggregate groups close to the source • drop sub-aggregates into a hash river.
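The decomposition above can be sketched directly: each node ships only a small `(count, sum)` pair instead of its raw rows (the example partition values are made up):

```python
# S split across 3 nodes as partitions s(0), s(1), s(2).
partitions = [[3, 5], [2, 8, 4], [6]]

# Each node computes its sub-aggregate locally, close to the source.
sub = [(len(p), sum(p)) for p in partitions]

# The coordinator combines the sub-aggregates:
total_count = sum(c for c, _ in sub)   # count(S) = sum of count(s(i))
total_sum = sum(s for _, s in sub)     # sum(S)   = sum of sum(s(i))
print(total_count)                     # 6
print(total_sum)                       # 28
print(total_sum / total_count)         # global avg without shipping raw rows
```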

  19. Parallel Sort • Sub-sorts generate runs; the river is range or hash partitioned; merge runs (M inputs, N outputs) • Disk and merge are not needed if the sort fits in memory • Sort is the benchmark from hell for shared-nothing machines: net traffic = disk bandwidth, no data filtering at the source
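The run-then-merge structure can be sketched in a few lines (a single-process toy: the data and the 3-value "memory limit" are made up, and `heapq.merge` stands in for the streaming merge phase):

```python
import heapq

data = [9, 1, 7, 3, 8, 2, 6, 4, 5]
run_size = 3                      # pretend only 3 values fit in memory

# Sub-sorts generate sorted runs of whatever fits in memory.
runs = [sorted(data[i:i + run_size]) for i in range(0, len(data), run_size)]
print(runs)                       # [[1, 7, 9], [2, 3, 8], [4, 5, 6]]

# A single merge pass streams the runs together without re-reading them all.
merged = list(heapq.merge(*runs))
print(merged)                     # [1, 2, 3, 4, 5, 6, 7, 8, 9]
```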

  20. Hash Join: Combining Two Tables • Hash the smaller table into N buckets (hope N=1) • If N=1, read the larger table and hash-probe into the smaller • Else, hash the outer to disk, then bucket-by-bucket hash join • Purely sequential data behavior • Always beats sort-merge and nested loops unless the data is clustered • Good for equi-, outer-, and exclusion joins • Lots of papers • Hash reduces skew
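The N=1 case (the whole smaller table fits in one in-memory hash table) can be sketched as a build phase plus a streaming probe phase (the table and column names are illustrative):

```python
from collections import defaultdict

def hash_join(small, large, key_small, key_large):
    buckets = defaultdict(list)          # build: hash the smaller table
    for row in small:
        buckets[row[key_small]].append(row)
    for row in large:                    # probe: stream the larger table once
        for match in buckets.get(row[key_large], []):
            yield {**match, **row}

dept = [{"dno": 1, "dname": "eng"}, {"dno": 2, "dname": "ops"}]
emp = [{"eno": 10, "dno": 1}, {"eno": 11, "dno": 2}, {"eno": 12, "dno": 1}]

joined = list(hash_join(dept, emp, "dno", "dno"))
for r in joined:
    print(r["eno"], r["dname"])
```

Note the "purely sequential data behavior" from the slide: each input table is read exactly once, front to back, and the larger table is never materialized.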

  21. Q&A • Thank you!
