The Gamma Database Machine
DeWitt, Ghandeharizadeh, Schneider, Bricker, Hsiao, Rasmussen

Deepak Bastakoty

(With slide material sourced from Ghandeharizadeh and DeWitt)

EECS 584, Fall 2011

Outline

Motivation

Physical System Designs

Data Clustering

Failure Management

Query Processing

Evaluation and Results

Bonus: Current Progress, Hadoop, Clustera...


Motivation

What?

Parallelizing databases:

How do we set up the system and hardware?

How do we distribute the data?

How do we tailor the software?

Where does this rabbit hole lead?

Motivation

Why?

Obtain faster response time

Increase query throughput

Improve robustness to failure

Reduce system cost

Enable massive scalability

(Faster, safer, cheaper, and more of it!)

System Design Schemes
  • Shared Memory
      • DeWitt's DIRECT, etc.
  • Shared Disk
      • Storage networks
  • Shared Nothing
      • Gamma, and nearly everything else
        (Pig Latin, MapReduce setups, BigTable, GFS)

System Design Schemes
  • Shared Memory
    • poor scalability as the number of nodes grows
    • requires custom hardware / systems
System Design Schemes
  • Shared Disk
    • storage networks
    • used to be popular. Why?

Shared Disk

Advantages:
  • Data sharing: many clients share storage and data.
  • High availability: redundancy is implemented in one place, protecting all clients from disk failure.
  • Centralized data backup: the administrator does not need to care or know how many clients are on the network sharing storage.

Shared Disk
  • Storage Area Network (SAN):
    • Block-level access,
    • Writes to storage are immediate,
    • Specialized hardware: switches, host bus adapters, disk chassis, battery-backed caches, etc.,
    • Expensive,
    • Supports transaction processing systems.
  • Network Attached Storage (NAS):
    • File-level access,
    • Writes to storage might be delayed,
    • Generic hardware,
    • Inexpensive,
    • Not appropriate for transaction processing systems.
Shared Nothing
  • Each node has its own processor(s), memory, disk(s)

[Figure: nodes 1..M, each with its own CPUs 1..n, DRAM modules 1..D, and local disks, connected only by the interconnection network.]
Shared Nothing
  • Why ?
    • Low-bandwidth parallel data processing
    • Commodity hardware (cheap!)
    • Minimal sharing = minimal interference
    • Hence: scalability (keep adding nodes!)
    • DeWitt says: even writing code is simpler
No Shared Data Declustering

[Figure: Emp(salary, name, age). Logical view: one relation; physical view: its rows declustered across several disks.]

No Shared Data Declustering
  • Spreading data across disks (a sketch of each scheme follows below):
    • Attribute-less partitioning
      • Random
      • Round-robin
    • Single-attribute schemes
      • Hash declustering
      • Range declustering
    • Multi-attribute schemes possible
      • MAGIC, BERD, etc.
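
To make the schemes concrete, here is a minimal Python sketch of the three simple assignment rules, assuming records are dicts and disks are numbered 0..N-1; the function names are ours, not Gamma's:

    # Illustrative sketch, not Gamma's code: three ways to pick a disk for a record.
    import itertools

    def round_robin_assigner(num_disks):
        # Attribute-less: cycle through disks 0, 1, ..., N-1, 0, 1, ...
        counter = itertools.count()
        return lambda record: next(counter) % num_disks

    def hash_assign(record, attr, num_disks):
        # Hash declustering: disk = hash(partitioning attribute) mod N.
        return hash(record[attr]) % num_disks

    def range_assign(record, attr, boundaries):
        # Range declustering: boundaries[i] is the largest value stored on disk i.
        for disk, upper in enumerate(boundaries):
            if record[attr] <= upper:
                return disk
        return len(boundaries)  # the last disk holds everything above the top boundary

    # Example: for emp = {"name": "Ann", "age": 20, "salary": 60_000}
    #   hash_assign(emp, "salary", 3)                  -> one of disks 0..2
    #   range_assign(emp, "salary", [50_000, 100_000]) -> disk 1 (the 51K-100K range)
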
Hash Declustering

[Figure: Emp(salary, name, age) with salary as the partitioning attribute; each row goes to the node given by salary % 3, yielding three fragments with the same schema.]

Hash Declustering
  • Selections with equality predicates referencing the partitioning attribute are directed to a single node:
    • Retrieve Emp where salary = 60K
  • Equality predicates referencing a non-partitioning attribute, and all range predicates, are directed to every node:
    • Retrieve Emp where age = 20
    • Retrieve Emp where salary < 20K

SELECT * FROM Emp WHERE salary = 60K   (one node)

SELECT * FROM Emp WHERE salary < 20K   (all nodes)
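
As a rough sketch of the routing logic this implies (our names, not Gamma's), assuming the same salary % N scheme:

    # Sketch: which nodes a selection must visit under hash declustering on salary.
    def nodes_for_predicate_hash(attr, op, value, num_nodes, part_attr="salary"):
        # Equality on the partitioning attribute hits exactly one node.
        if attr == part_attr and op == "=":
            return {hash(value) % num_nodes}
        # Range predicates, and predicates on other attributes, hit every node.
        return set(range(num_nodes))

    # nodes_for_predicate_hash("salary", "=", 60_000, 3) -> a single node
    # nodes_for_predicate_hash("salary", "<", 20_000, 3) -> {0, 1, 2}
    # nodes_for_predicate_hash("age", "=", 20, 3)        -> {0, 1, 2}
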

Range Declustering

[Figure: Emp(salary, name, age) with salary as the partitioning attribute; rows are split by salary range across three nodes: 0-50K, 51K-100K, and 101K-∞.]

Range Declustering
  • Equality and range predicates referencing the partitioning attribute are directed to a subset of the nodes:
    • Retrieve Emp where salary = 60K
    • Retrieve Emp where salary < 20K
  • Predicates referencing a non-partitioning attribute are directed to all nodes.

In this example, each of the two queries is directed to a single node.
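
A matching sketch for range declustering, assuming the 0-50K / 51K-100K / 101K-∞ split above (the helper and its names are ours):

    # Sketch: which nodes a selection must visit under range declustering on salary.
    import bisect

    def nodes_for_predicate_range(attr, op, value, boundaries, part_attr="salary"):
        # boundaries[i] is the largest salary on node i; the last node is unbounded.
        num_nodes = len(boundaries) + 1
        if attr != part_attr:
            return set(range(num_nodes))                    # all nodes
        if op == "=":
            return {bisect.bisect_left(boundaries, value)}  # exactly one node
        if op == "<":
            return set(range(bisect.bisect_left(boundaries, value) + 1))  # a prefix
        return set(range(num_nodes))

    # nodes_for_predicate_range("salary", "=", 60_000, [50_000, 100_000]) -> {1}
    # nodes_for_predicate_range("salary", "<", 20_000, [50_000, 100_000]) -> {0}
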

Declustering Tradeoffs
  • Range selection predicate using a clustered B+-tree, 0.01% selectivity (10 records)

[Plot: throughput (queries/second) vs. multiprogramming level; Range stays above Hash/Random/Round-robin.]

Declustering Tradeoffs
  • Range selection predicate using a clustered B+-tree, 1% selectivity (1000 records)

[Plot: throughput (queries/second) vs. multiprogramming level; Hash/Random/Round-robin stays above Range.]

Declustering Tradeoffs

Why the difference?

When the selection was small (0.01%), Range sent each query to a single node, so concurrent queries spread out across the nodes and throughput stayed high.

When the selection grew (1%), Range concentrated each query's work on one or a few nodes, creating hotspots, while Hash spread every query's load across all nodes.

Failure Management

Key questions:

Robustness (how much damage can we recover from?)

Availability (how likely is data to remain available? Is hot recovery possible?)

MTTR (Mean Time To Recovery)

Consider two declustering schemes:

Interleaved Declustering (Teradata)

Chained Declustering (Gamma)

Interleaved Declustering

A partitioned table has a primary and a backup copy.

The primary copy is constructed using one of the partitioning techniques.

The secondary copy is constructed by:

Dividing the nodes into clusters (cluster size 4 here), and

Partitioning each primary fragment (e.g., R0) across the remaining nodes of its cluster (nodes 1, 2, and 3), yielding sub-fragments r0.0, r0.1, and r0.2.
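
A tiny sketch of that placement rule, assuming nodes are numbered consecutively and clusters have size 4 as in the slide (the function is ours):

    # Sketch: backup placement under interleaved declustering.
    def interleaved_backup_nodes(primary_node, cluster_size=4):
        # Primary fragment R_i is split into (cluster_size - 1) sub-fragments,
        # one stored on each of the other nodes of the same cluster.
        start = (primary_node // cluster_size) * cluster_size
        return [n for n in range(start, start + cluster_size) if n != primary_node]

    # interleaved_backup_nodes(0) -> [1, 2, 3]   (holding r0.0, r0.1, r0.2)
    # interleaved_backup_nodes(5) -> [4, 6, 7]
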

Interleaved Declustering

On failure, the query load is re-directed to the backup nodes in the cluster.

MTTR involves:

replacing the failed node,

reconstructing the failed primary fragment from its backups, and

reconstructing the backup sub-fragments stored on the failed node.

A second failure before recovery completes can make data unavailable.

A larger cluster size improves load balancing after a failure but increases the risk of data becoming unavailable.

Chained Declustering (Gamma)

Nodes are divided into disjoint groups called relation clusters.

A relation is assigned to one relation cluster, and its records are declustered across the nodes of that relation cluster using a partitioning strategy (range, hash).

Given a primary fragment Ri, its backup copy is assigned to node (i+1) mod M, where M is the number of nodes in the relation cluster.
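
A minimal sketch of the resulting layout within one relation cluster (our naming):

    # Sketch: chained declustering layout for a relation cluster of M nodes.
    def chained_placement(m):
        # Backup of R_i lives on node (i+1) mod M, so node i also holds r_(i-1).
        return {i: {"primary": f"R{i}", "backup": f"r{(i - 1) % m}"} for i in range(m)}

    # chained_placement(4) ->
    #   {0: {'primary': 'R0', 'backup': 'r3'},
    #    1: {'primary': 'R1', 'backup': 'r0'},
    #    2: {'primary': 'R2', 'backup': 'r1'},
    #    3: {'primary': 'R3', 'backup': 'r2'}}
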

Chained Declustering (Gamma)

During normal operation:

Read requests are directed to the fragments of the primary copy, and

Write requests update both the primary and backup copies.

Chained Declustering (Gamma)

In the presence of a failure:

Both primary and backup fragments are used for read operations,

Objective: balance the load and avoid bottlenecks!

Write requests still update both primary and backup copies.

Note:

The read load of R1 (on the failed node 1) is pushed to node 2 (which holds r1) in its entirety.

Each node then pushes a fraction of its read requests to the others, so node 1's failure costs each surviving node only a 1/8 load increase.

Chained Declustering (Gamma)

MTTR involves:

Replacing node 1 with a new node,

Reconstructing R1 (from r1 on node 2) on node 1, and

Reconstructing the backup copy of R0 (i.e., r0) on node 1.

Note:

Once node 1 becomes operational, primary copies are again used to process read requests.

Chained Declustering (Gamma)

Not every pair of node failures in a relation cluster results in data unavailability:

  • Two adjacent nodes must fail for data to become unavailable.
Query Processing
  • General idea: divide and conquer
Parallelizing Hash Join
  • Consider hash join (sketched in code below):
    • A join of tables A and B on attribute j (A.j = B.j) consists of two phases:
      • Build phase: build a main-memory hash table on table A using join attribute j as the key, e.g., build a hash table on the Toy department's rows using dno as the key.
      • Probe phase: scan table B one record at a time and use its attribute j to probe the hash table built on table A, e.g., probe the hash table with the rows of the Emp table.
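
A minimal single-node sketch of the two phases, assuming both tables fit in memory (the table and attribute names follow the slides; the code is ours):

    # Sketch: in-memory hash join of Dept (build) and Emp (probe) on dno.
    def hash_join(dept_rows, emp_rows):
        # Build phase: main-memory hash table on the inner table, keyed by dno.
        table = {}
        for d in dept_rows:
            table.setdefault(d["dno"], []).append(d)
        # Probe phase: stream the outer table and emit a row for every match.
        for e in emp_rows:
            for d in table.get(e["dno"], []):
                yield {**d, **e}

    # dept = [{"dno": 1, "dname": "Toy"}, {"dno": 2, "dname": "Shoe"}]
    # emp  = [{"ename": "Ann", "dno": 1}, {"ename": "Bob", "dno": 2}]
    # list(hash_join(dept, emp)) -> the two joined rows
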
Parallelism and Hash-Join

[Figure: R join S, where R is the inner (build) table.]
Example Join of Emp and Dept
  • Emp join Dept (using dno)

[Figure: sample Dept and Emp tables.]
Hash-Join: Build
  • Read rows of the Dept table one at a time and place them in a main-memory hash table at position dno % 7.
Hash-Join: Probe
  • Read rows of the Emp table, probe the hash table (dno % 7), and produce result rows when a match is found. Repeat until all rows are processed.
Hash-Join (slide for exam review only)

Problem: the table used to build the hash table is larger than memory.

A divide-and-conquer approach:

1. Use the inner table (Dept) to construct n memory buckets, where each bucket is a hash table.

2. Every time memory is exhausted, spill a fixed number of buckets to disk.

3. The build phase terminates with a set of in-memory buckets and a set of disk-resident buckets.

4. Read the outer relation (Emp) and probe the in-memory buckets for joining records. Records that map onto disk-resident buckets are streamed and stored to disk.

5. Discard the in-memory buckets to free memory space.

6. While disk-resident buckets of the inner relation exist:

   a. Read as many (say i) of the inner relation's disk-resident buckets into memory as possible.

   b. Read the corresponding buckets of the outer relation (Emp) to probe the in-memory buckets for joining records.

   c. Discard the in-memory buckets to free memory space.

   d. Delete the i buckets of the inner and outer relations.
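
A much-simplified sketch of this loop, using plain lists to stand in for disk-resident bucket files and a row count as the memory budget (all names are ours; a real system spills to actual files):

    # Sketch: hash join when the build table overflows memory (simplified).
    def bucketed_hash_join(dept_rows, emp_rows, n_buckets=4, memory_budget=100):
        bucket_of = lambda row: hash(row["dno"]) % n_buckets

        # Build: partition Dept into buckets, spilling whole buckets when over budget.
        in_mem = {b: [] for b in range(n_buckets)}
        dept_disk = {}                                # spilled ("disk-resident") buckets
        for d in dept_rows:
            b = bucket_of(d)
            (dept_disk[b] if b in dept_disk else in_mem[b]).append(d)
            while in_mem and sum(map(len, in_mem.values())) > memory_budget:
                victim = max(in_mem, key=lambda k: len(in_mem[k]))  # spill the largest
                dept_disk[victim] = in_mem.pop(victim)

        def join(dept_bucket, emp_bucket):
            table = {}
            for d in dept_bucket:
                table.setdefault(d["dno"], []).append(d)
            return [{**d, **e} for e in emp_bucket for d in table.get(e["dno"], [])]

        # Probe: Emp rows either probe an in-memory bucket now or are set aside.
        results, emp_disk = [], {b: [] for b in dept_disk}
        for e in emp_rows:
            b = bucket_of(e)
            if b in dept_disk:
                emp_disk[b].append(e)                 # handled in the while loop below
            else:
                results += join(in_mem[b], [e])
        in_mem.clear()                                # discard to free memory

        # "While loop": reload each spilled Dept bucket and probe it with its Emp rows.
        for b, dept_bucket in dept_disk.items():
            results += join(dept_bucket, emp_disk[b])
        return results
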
Hash-Join: Build
  • Two buckets of the Dept table (hashed by dno % 7): one in memory, the second disk-resident.

Hash-Join: Probe
  • Read the Emp table and probe the hash table (dno % 7) for joining records when dno = 1. With dno = 2, stream the rows to disk.

Hash-Join: While Loop
  • Read the disk-resident bucket of Dept (dno % 7) into memory.
  • Probe it with the disk-resident bucket of the Emp table to produce the remaining two joining records.

Parallelism and Hash-Join
  • Each node may perform its hash-join independently when:
    • The join attribute is the declustering attribute of the tables participating in the join, and
    • The participating tables are declustered across the same number of nodes using the same declustering strategy.
    • The system may still re-partition the tables (see the next bullet) if the aggregate memory of a larger set of nodes exceeds that of the nodes the tables are declustered across.
  • Otherwise, the data must be re-partitioned on the join attribute to perform the join correctly (a sketch follows below).
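
A minimal sketch of that re-partitioning step, assuming a send(node, row) interconnect primitive (not a real Gamma API):

    # Sketch: re-partition a table on the join attribute before a parallel hash join.
    def repartition(local_rows, join_attr, num_nodes, send):
        # Rows with equal join-attribute values land on the same node,
        # so each node can then run an ordinary local build/probe hash join.
        for row in local_rows:
            send(hash(row[join_attr]) % num_nodes, row)

After both R and S are re-partitioned this way, every node joins only the rows it received.
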
Parallelism and Hash-Join

[Figure: R join S with re-partitioning, where R is the inner table.]
System Evaluation
  • Two Key Metrics
    • Speed-up
      • Given a system with 1 node, does adding n nodes speed it up with a factor of n ?
    • Scale-up
      • Given a system with 1 node, does the response time remain the same with n nodes ?
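
In the usual formulation (our notation, not the slides'):

    speedup(n) = elapsed_time(1 node, problem P) / elapsed_time(n nodes, problem P)

    scaleup(n) = elapsed_time(1 node, problem P) / elapsed_time(n nodes, problem n×P)

Linear speed-up means speedup(n) = n; linear scale-up means scaleup(n) stays at 1.
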
Modern Systems are Gamma too?
  • Ghandeharizadeh's 2009 summary:
    • All of the systems below, like Gamma, are "shared nothing" and run on thousands of nodes:
      • Pig: Yahoo's Pig Latin
      • Hadoop: open-source implementation of Google's Map/Reduce framework
      • BigTable: Google's Bigtable data model
      • GFS: the Google File System

Old is the New New?
  • DeWitt's Clustera:
    • DBMS-centric cluster management system
    • Evolved from Gamma
    • Like MapReduce, but with
      • Full DBMS support
      • SQL optimization
    • Outperforms Hadoop!

[Figure: Clustera positioned on axes of data complexity vs. job complexity, spanning Map/Reduce, Parallel SQL, and Condor.]

Questions?