680 likes | 829 Views
Gamma DBMS Part 1: Physical Database Design. Shahram Ghandeharizadeh Computer Science Department University of Southern California. Outline. Alternative architectures: Shared-disk versus Shared-Nothing Declustering techniques. Shared-Disk Architecture. Emerged in 1980s:
E N D
Gamma DBMSPart 1: Physical Database Design Shahram Ghandeharizadeh Computer Science Department University of Southern California
Outline • Alternative architectures: • Shared-disk versus Shared-Nothing • Declustering techniques.
Shared-Disk Architecture • Emerged in 1980s: • Many clients share storage and data: data remains available when a client fails. Network Data
Shared-Disk Architecture • Advantages: • Many clients share storage and data. • Redundancy is implemented in one place protecting all clients from disk failure. Network
Shared-Disk Architecture • Advantages: • Many clients share storage and data. • Redundancy is implemented in one place protecting all clients from disk failure. • Centralized backup: The administrator does not care/know how many clients are on the network sharing storage. Network
Shared-Disk Architecture • Advantages: • Many clients share storage and data. • Redundancy is implemented in one place protecting all clients from disk failure. • Centralized backup: The administrator does not care/know how many clients are on the network sharing storage. Data Sharing High Availability Network Data Backup
Network failures • What about network failures? • Two host bus adapters per server, • Each server connected to a different switch.
Storage Area Network (SAN): Block level access, Write to storage is immediate, Specialized hardware including switches, host bus adapters, disk chassis, battery backed caches, etc. Expensive Supports transaction processing systems. Network Attached Storage (NAS): File level access, Write to storage might be delayed, Generic hardware, In-expensive, Not appropriate for transaction processing systems. Shared-Disk Architecture
Concepts and Terminology • Virtualization: • Available storage is represented as one HUGE disk drive, e.g., a SAN with a thousand 1.5 TB disk provides 1 Petabyte of storage, • Available storage is partitioned into Logical Unit Numbers (LUNs), • A LUN is presented to one or more servers, • A LUN appears as a disk drive to a server. • SAN places blocks across physical disks intelligently to balance load. • What to do when a PC fails?
Shared-Nothing • Each node (blade) consisted of one processor, memory, and a disk drive. Network …. CPU 1 CPU N
Shared-Nothing • Each node (blade) may consist of one or several processors, memory, and one or several disk drives. Node M Node 1 Network … … CPU 1 CPU 2 CPU n CPU 1 CPU 2 CPU n …. DRAM 1 DRAM 1 DRAM 2 DRAM 2 … … DRAM D DRAM D
Shared-Nothing • Partition resources to construct logical nodes. With an 8 CPU PC, construct eight logical nodes each with a CPU, fraction of memory, and one disk drive. Network …. CPU 1 CPU nM
Data Declustering • Data is partitioned across the nodes (why?): • Random/round-robin, • Hash partitioning, • Range partitioning. • Each piece of a table is termed a fragment. • Single attribute declustering strategies • Two multi-attribute declustering strategies: • Multi-Attribute GrId deClustering (MAGIC) • Bubba’s Extended Range Declustering (BERD)
Horizontal Declustering Emp salary name age Logical View Physical View
Horizontal Declustering • No partitioning attribute: Random and Round-robin. • Single attribute declustering strategies: • Hash, • Range. Note: the database administrator must choose one attribute as the partitioning attribute.
Hash Declustering Emp salary name age salary is the partitioning attribute. salary % 3 salary salary salary name age name age name age Physical View
Hash Declustering • Selections with equality predicates referencing the partitioning attribute are directed to a single node: • Retrieve Emp where salary = 60K • Equality predicates referencing a non-partitioning attribute and range predicates are directed to all nodes: • Retrieve Emp where age = 20 • Retrieve Emp where salary < 20K SELECT * FROM Emp WHERE salary=60K SELECT * FROM Emp WHERE salary<20K
Range Declustering Emp salary name age salary is the partitioning attribute. 101K-∞ 0-50K 51K-100K salary salary salary name age name age name age Physical View
Range Declustering • Equality and range predicates referencing the partitioning attribute are directed to a subset of nodes: • Retrieve Emp where salary = 60K • Retrieve Emp where salary < 20K • Predicates referencing a non-partitioning attribute are directed to all nodes. In our example, both queries are directed to one node.
Year is 1988! 32 Processor Hypercube Each node consists of: 80386 processor (12 MHz) 2 MB DRAM 333 MB disk A hypercube inter-connect supporting parallel transmission of messages among nodes. An iPSC/2 Intel Hypercube
Software Architecture • Each node stores its fragment on its local disk drive. • Each node may build a B+-tree (clustered/non-clustered) and hash index on its fragment of a relation. • Each node has its own concurrency control and crash recovery mechanism.
Software Architecture • Processes executing on one node shared memory – identical to today’s threads! • At initialization time, a node would start a fixed number of threads (processes). • All threads listen on a well defined socket, waiting for the Scheduler to dispatch work to them. • A message contains the identity that the operator should assume: • A “switch” statement would enable a thread to become a select, project, hash-join build, hash-join probe, etc… • The message specifies the role of the thread.
A Comparison of Range & Hash • Closed simulation model: • A client generates a range selection predicate: X < age < Y. • The age attribute value is unique with values ranging from 0 to 999,999 (1 million rows). • A client does not generate a new request until its pending request is processed by Gamma and returned. • The system is multi-programmed by increasing the number of clients in the system. • A multi-programming level of 8 means there are 8 clients generating requests to the system (independent of one another). … 32 Node Gamma
A Comparison of Range & Hash • Closed simulation model: • A client generates a range selection predicate: X < age < Y. • The age attribute value is unique with values ranging from 0 to 999,999 (1 million rows). • A client does not generate a new request until its pending request is processed by Gamma and returned. • A 0.01% selection predicate retrieves 100 rows. • With a clustered B+-tree index, the 100 rows are grouped together in a few disk pages. … 32 Node Gamma
A Comparison of Range & Hash • Closed simulation model: • A client generates a range selection predicate: X < age < Y. • The age attribute value is unique with values ranging from 0 to 999,999 (1 million rows). • A client does not generate a new request until its pending request is processed by Gamma and returned. • A 0.01% selection predicate retrieves 100 rows. • With a clustered B+-tree index, the 100 rows are grouped together in a few disk pages. • With range partitioning, the predicate is processed by one node. • With hash partitioning, the predicate is processed by all 32 nodes with the scheduler coordinating the execution of each predicate on a node, and gathering of the results from every node. … 0- 31,249 31,250 – 62,499 968-750 – 1,000,000 32 Node Gamma
Declustering Techniques: Tradeoffs • Range selection predicate using a clustered B+-tree, 0.01% selectivity (10 records) Throughput (Queries/Second) Range Hash/Random/Round-robin Multiprogramming Level
A Comparison of Range & Hash • Closed simulation model: • A client generates a range selection predicate: X < age < Y. • The age attribute value is unique with values ranging from 0 to 999,999 (1 million rows). • A client does not generate a new request until its pending request is processed by Gamma and returned. • A 1% selection predicate retrieves 10,000 rows. • With a clustered B+-tree index, the 10,000 rows are grouped together. … 0- 31,249 31,250 – 62,499 968-750 – 1,000,000 32 Node Gamma
A Comparison of Range & Hash • Closed simulation model: • A client generates a range selection predicate: X < age < Y. • The age attribute value is unique with values ranging from 0 to 999,999 (1 million rows). • A client does not generate a new request until its pending request is processed by Gamma and returned. • A 1% selection predicate retrieves 10,000 rows. • With a clustered B+-tree index, the 10,000 rows are grouped together. • With Range partitioning, the predicate is processed using one or two nodes. • With Hash partitioning, the predicate is processed by all the nodes with the scheduler coordinating the execution of the predicate. … 0- 31,249 31,250 – 62,499 968-750 – 1,000,000
Tradeoffs (Cont…) • Range selection predicate using a clustered B+-tree, 1% selectivity (1000 records) Throughput (Queries/Second) Hash/Random/Round-robin Range Multiprogramming Level
Why Range Performs Poorly? • Note: Range performed poorly because the query (1% selection) imposed a high workload onto a node! • For a query with minimal (0.01% selection) workload requirement, Range is ideal! • Two reasons: • Random generation of selection predicates does NOT mean uniform distribution of workload across nodes. • The number of ranges is the same as the number of nodes causing the tail-end servers to observe a lower load.
1 2 3 R1 R2 R3 R1 R3 R2 R2 R1 R3 6 Ideal cases R2 R3 R1 R3 R1 R2 R3 R2 R1 {R1, R3} R2 {R1, R3} R2 {R1, R3} R2 27 ways to assign 3 requests to the 3 nodes! Only 6 result in a uniform distribution of requests. R2 {R1, R3} R2 {R1, R3} R2 {R1, R3} {R2, R3} R1 {R2, R3} R1 {R2, R3} R1 R1 {R2, R3} R1 {R2, R3} R1 {R2, R3} {R2, R1} R3 {R2, R1} R3 {R2, R1} R3 R3 {R2, R1} R3 {R2, R1} R3 {R2, R1} {R1, R2, R3} {R1, R2, R3} {R1, R2, R3}
Tradeoffs (Cont…) • Simple range partitioning may lead to load imbalance for queries with high selectivity: • Low performance: increased response time and low system throughput. • Consider a table that maintains the grade of students for different exams, range partitioned on the grade. 60-79 80-100 0-19 20-39 40-59
Tradeoffs (Cont…) • Assume a range predicate overlaps 3 partitions, e.g., • 0 < grade < 45 • 45 < grade < 90 60-79 80-100 0-19 20-39 40-59 60-79 80-100 0-19 20-39 40-59
Tradeoffs (Cont…) • Higher response time because 2 nodes sit idle while 3 nodes process the query (assuming overhead of parallelism is negligible). 60-79 80-100 0-19 20-39 40-59 45 < grade < 90
Tradeoffs (Cont…) • Lower throughput because node 3 becomes a bottleneck. • Assuming even distribution of access to ranges, when node 3 is utilized 100%, nodes 2 and 4 have a 66% utilization, while nodes 1 and 5 are utilized 33%. 60-79 80-100 0-19 20-39 40-59
Hybrid Range Partitioning [VLDB’90] • To minimize the impact of load imbalance, construct more ranges than nodes, e.g., 10 ranges for a 5 node system. • Predicates such as “0 < grade < 45” are now directed to all nodes. • Assuming even distribution of access to ranges where workload consists of predicates utilizing 3 sequential ranges, when node 3 become 100% utilized, nodes 2 and 4 are now utilized 83%, while nodes 1 and 5 are utilized 66%. 0-10 51-60 11-20 61-70 31-40 81-90 41-50 91-100 21-30 71-80
Multi-Attribute Declustering [SIGMOD’92] • Queries with minimal resource requirements should be directed to a few processors. Why? • Overhead of parallelism • Impacts query response time adversely, • Wastes system resources, reducing throughput. • OLTP has come a long way: • Heaviest transaction in TPC-C reads approximately 400 records. • Assuming no disk accesses, a low-end PC processes this transaction < 1 ms. • Transactions should be single sited! Range Round-robin
Multi-Attribute Declustering (E.g.) • Recall the Emp(name, age, salary) table. • Workload consists of two queries, each with a 50% frequency of occurrence: • Query A, range query referencing the age attribute. On average, retrieves 5 tuples. • Retrieve Emp where age > 21 and age < 22. • Query B, range query referencing the salary attribute. On average, retrieves 10 tuples. • Retrieve Emp where salary > 50K and salary < 50.5K • Access methods: • A non-clustered B+-tree index on age • A clustered B+-tree index on salary • Ideally, both queries should be directed to one node.
Multi-Attribute Declustering (E.g. Cont...) • Range decluster Emp using age as the partitioning attribute. • Assuming a system configured with nine nodes, the number of employed nodes is:
MAGIC • Construct a multi-attribute grid directory on the Emp table • Each dimension corresponds to a partitioning attribute. • Each cell represents a fragment of the relation. Salary 0-20 21-25 26-30 31-35 36-40 41-70 10-20 21-25 A g e 26-30 31-35 36-40 41-60
MAGIC (Low Correlation) . . . . . . • Low correlation between salary and age attribute values: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
MAGIC (High Correlation) . . . . • High correlation between salary and age attribute values: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
BERD • Range partition Emp using the salary attribute. • For the age attribute, construct an auxiliary relation containing: • The age attribute value of each record • Node containing that record • Range partition the auxiliary relation using the age attribute value.
BERD Emp salary name age salary is the primary partitioning attribute. 101K-∞ 0-50K 51K-100K salary salary salary name age name age name age Physical View
BERD, Auxiliary relation Auxiliary relation Node age 101K-∞ 0-50K 51K-100K salary salary salary name age name age name age
BERD, Auxiliary relation Auxiliary relation Node age Range partition auxiliary relation using the age attribute. 53-∞ 0-20 21-52 node node node age age age