1 / 68

Gamma DBMS Part 1: Physical Database Design

Gamma DBMS Part 1: Physical Database Design. Shahram Ghandeharizadeh Computer Science Department University of Southern California. Outline. Alternative architectures: Shared-disk versus Shared-Nothing Declustering techniques. Shared-Disk Architecture. Emerged in 1980s:

trung
Download Presentation

Gamma DBMS Part 1: Physical Database Design

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Gamma DBMSPart 1: Physical Database Design Shahram Ghandeharizadeh Computer Science Department University of Southern California

  2. Outline • Alternative architectures: • Shared-disk versus Shared-Nothing • Declustering techniques.

  3. Shared-Disk Architecture • Emerged in 1980s: • Many clients share storage and data: data remains available when a client fails. Network Data

  4. Shared-Disk Architecture • Advantages: • Many clients share storage and data. • Redundancy is implemented in one place protecting all clients from disk failure. Network

  5. Shared-Disk Architecture • Advantages: • Many clients share storage and data. • Redundancy is implemented in one place protecting all clients from disk failure. • Centralized backup: The administrator does not care/know how many clients are on the network sharing storage. Network

  6. Shared-Disk Architecture • Advantages: • Many clients share storage and data. • Redundancy is implemented in one place protecting all clients from disk failure. • Centralized backup: The administrator does not care/know how many clients are on the network sharing storage. Data Sharing High Availability Network Data Backup

  7. Network failures • What about network failures? • Two host bus adapters per server, • Each server connected to a different switch.

  8. Storage Area Network (SAN): Block level access, Write to storage is immediate, Specialized hardware including switches, host bus adapters, disk chassis, battery backed caches, etc. Expensive Supports transaction processing systems. Network Attached Storage (NAS): File level access, Write to storage might be delayed, Generic hardware, In-expensive, Not appropriate for transaction processing systems. Shared-Disk Architecture

  9. Concepts and Terminology • Virtualization: • Available storage is represented as one HUGE disk drive, e.g., a SAN with a thousand 1.5 TB disk provides 1 Petabyte of storage, • Available storage is partitioned into Logical Unit Numbers (LUNs), • A LUN is presented to one or more servers, • A LUN appears as a disk drive to a server. • SAN places blocks across physical disks intelligently to balance load. • What to do when a PC fails?

  10. Shared-Nothing • Each node (blade) consisted of one processor, memory, and a disk drive. Network …. CPU 1 CPU N

  11. Shared-Nothing • Each node (blade) may consist of one or several processors, memory, and one or several disk drives. Node M Node 1 Network … … CPU 1 CPU 2 CPU n CPU 1 CPU 2 CPU n …. DRAM 1 DRAM 1 DRAM 2 DRAM 2 … … DRAM D DRAM D

  12. Shared-Nothing • Partition resources to construct logical nodes. With an 8 CPU PC, construct eight logical nodes each with a CPU, fraction of memory, and one disk drive. Network …. CPU 1 CPU nM

  13. Data Declustering • Data is partitioned across the nodes (why?): • Random/round-robin, • Hash partitioning, • Range partitioning. • Each piece of a table is termed a fragment. • Single attribute declustering strategies • Two multi-attribute declustering strategies: • Multi-Attribute GrId deClustering (MAGIC) • Bubba’s Extended Range Declustering (BERD)

  14. Horizontal Declustering Emp salary name age Logical View Physical View

  15. Horizontal Declustering • No partitioning attribute: Random and Round-robin. • Single attribute declustering strategies: • Hash, • Range. Note: the database administrator must choose one attribute as the partitioning attribute.

  16. Hash Declustering Emp salary name age salary is the partitioning attribute. salary % 3 salary salary salary name age name age name age Physical View

  17. Hash Declustering • Selections with equality predicates referencing the partitioning attribute are directed to a single node: • Retrieve Emp where salary = 60K • Equality predicates referencing a non-partitioning attribute and range predicates are directed to all nodes: • Retrieve Emp where age = 20 • Retrieve Emp where salary < 20K SELECT * FROM Emp WHERE salary=60K SELECT * FROM Emp WHERE salary<20K

  18. Range Declustering Emp salary name age salary is the partitioning attribute. 101K-∞ 0-50K 51K-100K salary salary salary name age name age name age Physical View

  19. Range Declustering • Equality and range predicates referencing the partitioning attribute are directed to a subset of nodes: • Retrieve Emp where salary = 60K • Retrieve Emp where salary < 20K • Predicates referencing a non-partitioning attribute are directed to all nodes. In our example, both queries are directed to one node.

  20. Year is 1988! 32 Processor Hypercube Each node consists of: 80386 processor (12 MHz) 2 MB DRAM 333 MB disk A hypercube inter-connect supporting parallel transmission of messages among nodes. An iPSC/2 Intel Hypercube

  21. Software Architecture • Each node stores its fragment on its local disk drive. • Each node may build a B+-tree (clustered/non-clustered) and hash index on its fragment of a relation. • Each node has its own concurrency control and crash recovery mechanism.

  22. Software Architecture

  23. Software Architecture

  24. Software Architecture

  25. Software Architecture

  26. Software Architecture • Processes executing on one node shared memory – identical to today’s threads! • At initialization time, a node would start a fixed number of threads (processes). • All threads listen on a well defined socket, waiting for the Scheduler to dispatch work to them. • A message contains the identity that the operator should assume: • A “switch” statement would enable a thread to become a select, project, hash-join build, hash-join probe, etc… • The message specifies the role of the thread.

  27. A Comparison of Range & Hash • Closed simulation model: • A client generates a range selection predicate: X < age < Y. • The age attribute value is unique with values ranging from 0 to 999,999 (1 million rows). • A client does not generate a new request until its pending request is processed by Gamma and returned. • The system is multi-programmed by increasing the number of clients in the system. • A multi-programming level of 8 means there are 8 clients generating requests to the system (independent of one another). … 32 Node Gamma

  28. A Comparison of Range & Hash • Closed simulation model: • A client generates a range selection predicate: X < age < Y. • The age attribute value is unique with values ranging from 0 to 999,999 (1 million rows). • A client does not generate a new request until its pending request is processed by Gamma and returned. • A 0.01% selection predicate retrieves 100 rows. • With a clustered B+-tree index, the 100 rows are grouped together in a few disk pages. … 32 Node Gamma

  29. A Comparison of Range & Hash • Closed simulation model: • A client generates a range selection predicate: X < age < Y. • The age attribute value is unique with values ranging from 0 to 999,999 (1 million rows). • A client does not generate a new request until its pending request is processed by Gamma and returned. • A 0.01% selection predicate retrieves 100 rows. • With a clustered B+-tree index, the 100 rows are grouped together in a few disk pages. • With range partitioning, the predicate is processed by one node. • With hash partitioning, the predicate is processed by all 32 nodes with the scheduler coordinating the execution of each predicate on a node, and gathering of the results from every node. … 0- 31,249 31,250 – 62,499 968-750 – 1,000,000 32 Node Gamma

  30. Declustering Techniques: Tradeoffs • Range selection predicate using a clustered B+-tree, 0.01% selectivity (10 records) Throughput (Queries/Second) Range Hash/Random/Round-robin Multiprogramming Level

  31. A Comparison of Range & Hash • Closed simulation model: • A client generates a range selection predicate: X < age < Y. • The age attribute value is unique with values ranging from 0 to 999,999 (1 million rows). • A client does not generate a new request until its pending request is processed by Gamma and returned. • A 1% selection predicate retrieves 10,000 rows. • With a clustered B+-tree index, the 10,000 rows are grouped together. … 0- 31,249 31,250 – 62,499 968-750 – 1,000,000 32 Node Gamma

  32. A Comparison of Range & Hash • Closed simulation model: • A client generates a range selection predicate: X < age < Y. • The age attribute value is unique with values ranging from 0 to 999,999 (1 million rows). • A client does not generate a new request until its pending request is processed by Gamma and returned. • A 1% selection predicate retrieves 10,000 rows. • With a clustered B+-tree index, the 10,000 rows are grouped together. • With Range partitioning, the predicate is processed using one or two nodes. • With Hash partitioning, the predicate is processed by all the nodes with the scheduler coordinating the execution of the predicate. … 0- 31,249 31,250 – 62,499 968-750 – 1,000,000

  33. Tradeoffs (Cont…) • Range selection predicate using a clustered B+-tree, 1% selectivity (1000 records) Throughput (Queries/Second) Hash/Random/Round-robin Range Multiprogramming Level

  34. Why Range Performs Poorly? • Note: Range performed poorly because the query (1% selection) imposed a high workload onto a node! • For a query with minimal (0.01% selection) workload requirement, Range is ideal! • Two reasons: • Random generation of selection predicates does NOT mean uniform distribution of workload across nodes. • The number of ranges is the same as the number of nodes causing the tail-end servers to observe a lower load.

  35. 1 2 3 R1 R2 R3 R1 R3 R2 R2 R1 R3 6 Ideal cases R2 R3 R1 R3 R1 R2 R3 R2 R1 {R1, R3} R2 {R1, R3} R2 {R1, R3} R2 27 ways to assign 3 requests to the 3 nodes! Only 6 result in a uniform distribution of requests. R2 {R1, R3} R2 {R1, R3} R2 {R1, R3} {R2, R3} R1 {R2, R3} R1 {R2, R3} R1 R1 {R2, R3} R1 {R2, R3} R1 {R2, R3} {R2, R1} R3 {R2, R1} R3 {R2, R1} R3 R3 {R2, R1} R3 {R2, R1} R3 {R2, R1} {R1, R2, R3} {R1, R2, R3} {R1, R2, R3}

  36. Tradeoffs (Cont…) • Simple range partitioning may lead to load imbalance for queries with high selectivity: • Low performance: increased response time and low system throughput. • Consider a table that maintains the grade of students for different exams, range partitioned on the grade. 60-79 80-100 0-19 20-39 40-59

  37. Tradeoffs (Cont…) • Assume a range predicate overlaps 3 partitions, e.g., • 0 < grade < 45 • 45 < grade < 90 60-79 80-100 0-19 20-39 40-59 60-79 80-100 0-19 20-39 40-59

  38. Tradeoffs (Cont…) • Higher response time because 2 nodes sit idle while 3 nodes process the query (assuming overhead of parallelism is negligible). 60-79 80-100 0-19 20-39 40-59 45 < grade < 90

  39. Tradeoffs (Cont…) • Lower throughput because node 3 becomes a bottleneck. • Assuming even distribution of access to ranges, when node 3 is utilized 100%, nodes 2 and 4 have a 66% utilization, while nodes 1 and 5 are utilized 33%. 60-79 80-100 0-19 20-39 40-59

  40. Hybrid Range Partitioning [VLDB’90] • To minimize the impact of load imbalance, construct more ranges than nodes, e.g., 10 ranges for a 5 node system. • Predicates such as “0 < grade < 45” are now directed to all nodes. • Assuming even distribution of access to ranges where workload consists of predicates utilizing 3 sequential ranges, when node 3 become 100% utilized, nodes 2 and 4 are now utilized 83%, while nodes 1 and 5 are utilized 66%. 0-10 51-60 11-20 61-70 31-40 81-90 41-50 91-100 21-30 71-80

  41. Multi-Attribute Declustering [SIGMOD’92] • Queries with minimal resource requirements should be directed to a few processors. Why? • Overhead of parallelism • Impacts query response time adversely, • Wastes system resources, reducing throughput. • OLTP has come a long way: • Heaviest transaction in TPC-C reads approximately 400 records. • Assuming no disk accesses, a low-end PC processes this transaction < 1 ms. • Transactions should be single sited! Range Round-robin

  42. Multi-Attribute Declustering (E.g.) • Recall the Emp(name, age, salary) table. • Workload consists of two queries, each with a 50% frequency of occurrence: • Query A, range query referencing the age attribute. On average, retrieves 5 tuples. • Retrieve Emp where age > 21 and age < 22. • Query B, range query referencing the salary attribute. On average, retrieves 10 tuples. • Retrieve Emp where salary > 50K and salary < 50.5K • Access methods: • A non-clustered B+-tree index on age • A clustered B+-tree index on salary • Ideally, both queries should be directed to one node.

  43. Multi-Attribute Declustering (E.g. Cont...) • Range decluster Emp using age as the partitioning attribute. • Assuming a system configured with nine nodes, the number of employed nodes is:

  44. MAGIC • Construct a multi-attribute grid directory on the Emp table • Each dimension corresponds to a partitioning attribute. • Each cell represents a fragment of the relation. Salary 0-20 21-25 26-30 31-35 36-40 41-70 10-20 21-25 A g e 26-30 31-35 36-40 41-60

  45. MAGIC (Low Correlation) . . . . . . • Low correlation between salary and age attribute values: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

  46. MAGIC (High Correlation) . . . . • High correlation between salary and age attribute values: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

  47. BERD • Range partition Emp using the salary attribute. • For the age attribute, construct an auxiliary relation containing: • The age attribute value of each record • Node containing that record • Range partition the auxiliary relation using the age attribute value.

  48. BERD Emp salary name age salary is the primary partitioning attribute. 101K-∞ 0-50K 51K-100K salary salary salary name age name age name age Physical View

  49. BERD, Auxiliary relation Auxiliary relation Node age 101K-∞ 0-50K 51K-100K salary salary salary name age name age name age

  50. BERD, Auxiliary relation Auxiliary relation Node age Range partition auxiliary relation using the age attribute. 53-∞ 0-20 21-52 node node node age age age

More Related