
The Gamma Database Machine
DeWitt, Ghandeharizadeh, Schneider, Bricker, Hsiao, Rasmussen

Deepak Bastakoty

(With slide material sourced from Ghandeharizadeh and DeWitt)

EECS 584, Fall 2011


Outline

Motivation

Physical System Designs

Data Clustering

Failure Management

Query Processing

Evaluation and Results

Bonus: Current Progress, Hadoop, Clustera





Motivation

What?

Parallelizing databases:

How to set up the system and hardware?

How to distribute the data?

How to tailor the software?

Where does this rabbit hole lead?





Motivation

Why?

Obtain faster response time

Increase query throughput

Improve robustness to failure

Reduce system cost

Enable massive scalability

(Faster, safer, cheaper, and more of it!)





System Design Schemes

  • Shared Memory

    • DeWitt’s DIRECT, etc.

  • Shared Disk

    • Storage Networks

  • Shared Nothing

    • Gamma and nearly everything since

      (Pig Latin, MapReduce setups, BigTable, GFS)


    System Design Schemes

    • Shared Memory

      • Poor scalability (in the number of nodes)

      • Custom hardware/systems


    System Design Schemes

    • Shared Disk

      • Storage networks

      • Used to be popular

      • Why?



    Shared Disk

    Advantages:

    • Many clients share storage and data.

    • Redundancy is implemented in one place, protecting all clients from disk failure.

    • Centralized backup: the administrator does not need to know how many clients on the network are sharing the storage.

    (Figure: the shared-disk benefits: data sharing, high availability, data backup.)


    Shared Disk

    • Storage Area Network (SAN):

      • Block level access,

      • Write to storage is immediate,

      • Specialized hardware including switches, host bus adapters, disk chassis, battery backed caches, etc.

      • Expensive

      • Supports transaction processing systems.

    • Network Attached Storage (NAS):

      • File level access,

      • Write to storage might be delayed,

      • Generic hardware,

      • Inexpensive,

      • Not appropriate for transaction processing systems.


    Shared Nothing

    • Each node has its own processor(s), memory, disk(s)

    (Figure: nodes 1 through M, each with its own CPUs 1..n, DRAM modules 1..D, and disks, connected only by the interconnection network.)


    Shared Nothing

    • Why?

      • Parallel data processing needs only low-bandwidth communication

      • Commodity hardware (cheap!)

      • Minimal sharing = minimal interference

      • Hence: scalability (keep adding nodes!)

      • DeWitt says: even writing code is simpler






    No Shared Data Declustering

    (Figure: the logical view of a relation with attributes name, age, and salary vs. its physical view declustered across several disks.)


    No Shared Data Declustering

    • Spreading the data across the disks (a small sketch follows this list):

      • Attribute-less partitioning

        • Random

        • Round-Robin

      • Single-attribute schemes

        • Hash declustering

        • Range declustering

      • Multiple-attribute schemes are possible

        • MAGIC, BERD, etc.
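
    A minimal sketch of the single-attribute options above. The 3-node setup, the salary attribute, and the range boundaries are illustrative assumptions, not code from the paper:

        NUM_NODES = 3

        def round_robin(record_index, num_nodes=NUM_NODES):
            # Attribute-less: the i-th loaded record goes to node i mod N.
            return record_index % num_nodes

        def hash_decluster(salary, num_nodes=NUM_NODES):
            # Single-attribute: hash the partitioning attribute (salary here).
            return salary % num_nodes

        RANGES = [(0, 50_000), (50_001, 100_000), (100_001, float("inf"))]

        def range_decluster(salary, ranges=RANGES):
            # Single-attribute: each node owns a contiguous salary range.
            for node, (low, high) in enumerate(ranges):
                if low <= salary <= high:
                    return node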




    Hash Declustering

    salary is the partitioning attribute: each record is sent to node (salary % 3).

    (Figure: the relation's name, age, and salary columns split into three fragments, one per node.)


    Hash Declustering

    • Selections with equality predicates referencing the partitioning attribute are directed to a single node:

      • Retrieve Emp where salary = 60K

    • Equality predicates referencing a non-partitioning attribute and range predicates are directed to all nodes:

      • Retrieve Emp where age = 20

      • Retrieve Emp where salary < 20K

    SELECT * FROM Emp WHERE salary = 60K    -- one node

    SELECT * FROM Emp WHERE salary < 20K    -- all nodes
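
    A hedged sketch of how a selection predicate maps to nodes under hash declustering, assuming the 3-node, salary % 3 setup above (illustrative, not Gamma's planner code):

        NUM_NODES = 3

        def nodes_for_predicate(attribute, op, value):
            # Equality on the partitioning attribute (salary) maps to exactly one node.
            if attribute == "salary" and op == "=":
                return {value % NUM_NODES}
            # Range predicates and predicates on non-partitioning attributes go to all nodes.
            return set(range(NUM_NODES))

        print(nodes_for_predicate("salary", "=", 60_000))  # one node
        print(nodes_for_predicate("salary", "<", 20_000))  # all nodes
        print(nodes_for_predicate("age", "=", 20))         # all nodes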


    Range Declustering

    salary is the partitioning attribute: ranges 0-50K, 51K-100K, and 101K-∞ are assigned to the three nodes.

    (Figure: the relation's name, age, and salary columns split into three fragments by salary range.)


    Range Declustering

    • Equality and range predicates referencing the partitioning attribute are directed to a subset of nodes:

      • Retrieve Emp where salary = 60K

      • Retrieve Emp where salary < 20K

    • Predicates referencing a non-partitioning attribute are directed to all nodes.

    In the example, both queries are directed to one node.
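
    The analogous sketch for range declustering, using the same illustrative 0-50K / 51K-100K / 101K-∞ split; predicates on salary touch only the subset of nodes whose ranges qualify:

        RANGES = [(0, 50_000), (50_001, 100_000), (100_001, float("inf"))]

        def nodes_for_predicate(attribute, op, value):
            if attribute != "salary":                  # non-partitioning attribute: all nodes
                return set(range(len(RANGES)))
            if op == "=":                              # equality: the node owning that value
                return {n for n, (low, high) in enumerate(RANGES) if low <= value <= high}
            if op == "<":                              # range: nodes whose range starts below the value
                return {n for n, (low, _) in enumerate(RANGES) if low < value}
            return set(range(len(RANGES)))

        print(nodes_for_predicate("salary", "=", 60_000))  # one node
        print(nodes_for_predicate("salary", "<", 20_000))  # one node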


    Declustering Tradeoffs

    • Range selection predicate using a clustered B+-tree

    • 0.01% selectivity

      (10 records)

    (Graph: throughput in queries/second vs. multiprogramming level; Range outperforms Hash/Random/Round-robin.)


    Declustering Tradeoffs

    • Range selection predicate using a clustered B+-tree

    • 1% selectivity

      (1000 records)

    (Graph: throughput in queries/second vs. multiprogramming level; Hash/Random/Round-robin outperform Range.)




    Declustering Tradeoffs

    Why the difference?

    When the selection was small, Range localized each query to one or two nodes, so concurrent queries spread the load; it was ideal.

    When the selection grew, Range piled each query's work onto one or a few nodes, while Hash spread the load across all of them.





    Failure Management

    Key Questions

    Robustness (how much damage is recoverable?)

    Availability (how likely is the data to be available? is hot recovery possible?)

    MTTR (mean time to recovery)

    Consider two declustering schemes:

    Interleaved Declustering (Teradata)

    Chained Declustering (Gamma DBM)



    Interleaved Declustering

    A partitioned table has a primary and a backup copy.

    The primary copy is constructed using one of the partitioning techniques.

    The secondary copy is constructed by:

    Dividing the nodes into clusters (cluster size 4 here), and

    Partitioning each primary fragment (e.g., R0) across the remaining nodes of its cluster (nodes 1, 2, and 3), yielding backup sub-fragments r0.0, r0.1, and r0.2.
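
    A small sketch of where the pieces land under interleaved declustering, with cluster size 4 as in the slide (names and numbering are illustrative):

        CLUSTER_SIZE = 4

        def interleaved_backup_placement(primary_node):
            # Primary fragment R_i lives on node i; its backup is split into
            # CLUSTER_SIZE - 1 sub-fragments spread over the other nodes of the cluster.
            start = (primary_node // CLUSTER_SIZE) * CLUSTER_SIZE
            others = [n for n in range(start, start + CLUSTER_SIZE) if n != primary_node]
            return {f"r{primary_node}.{j}": node for j, node in enumerate(others)}

        print(interleaved_backup_placement(0))  # {'r0.0': 1, 'r0.1': 2, 'r0.2': 3}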



    Interleaved Declustering

    On failure, the query load is redirected to the backup nodes in the cluster.

    MTTR:

    Replace the failed node,

    Reconstruct the failed primary from the backups,

    Reconstruct the backups that were stored on the failed node.

    A second failure before recovery completes can make data unavailable.

    A larger cluster size improves load balancing after a failure but increases the risk of data becoming unavailable.


    Chained Declustering (Gamma)

    Nodes are divided into disjoint groups called relation clusters.

    A relation is assigned to one relation cluster and its records are declustered across the nodes of that relation cluster using a partitioning strategy (Range, Hash).

    Given a primary fragment Ri, its backup copy is assigned to node (i+1) mod M (M is the number of nodes in the relation cluster).
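
    A corresponding sketch for chained declustering: primary fragment Ri sits on node i, its backup on node (i+1) mod M (the cluster size below is illustrative):

        M = 4  # nodes in the relation cluster (illustrative size)

        def chained_placement(i):
            # Primary fragment R_i on node i, backup copy r_i on the next node in the chain.
            return {"primary": i, "backup": (i + 1) % M}

        for i in range(M):
            print(f"R{i}:", chained_placement(i))
        # R0 -> primary 0, backup 1; ...; R3 -> primary 3, backup 0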


    Chained Declustering (Gamma)

    During normal operation:

    Read requests are directed to the fragments of primary copy,

    Write requests update both primary and backup copies.


    Chained Declustering (Gamma)

    In the presence of a failure:

    Both primary and backup fragments are used for read operations,

    Objective: Balance the load and avoid bottlenecks!

    Write requests update both primary and backup copies.

    Note:

    The read load of R1 (on node 1) is pushed to node 2 in its entirety.

    A fraction of each surviving node's read requests is then pushed along the chain, so node 1's failure adds only about a 1/8 load increase to each node.
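
    A hedged sketch of the rebalancing arithmetic. The chain length M = 9 is chosen so the increase works out to the slide's 1/8 (eight surviving nodes); the exact fractions are an illustration, not Gamma's bookkeeping:

        def read_fractions_after_failure(M, failed):
            # Fraction of read load served by each surviving node after one failure.
            # The node k steps past the failure keeps k/(M-1) of its own primary and
            # serves whatever the previous node shed, through its backup copy.
            plan = {}
            carried = 1.0                   # the failed node's entire primary moves to its backup
            for step in range(1, M):
                node = (failed + step) % M
                own = step / (M - 1)
                plan[node] = {"via_backup": carried, "own_primary": own}
                carried = 1.0 - own         # the remainder is shed to the next node in the chain
            return plan

        for node, load in read_fractions_after_failure(M=9, failed=1).items():
            print(node, load, "total =", round(load["via_backup"] + load["own_primary"], 3))
        # every surviving node ends up at 9/8 of a node's load, i.e. a 1/8 increase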


    Chained Declustering (Gamma)

    MTTR involves:

    Replace node 1 with a new node,

    Reconstruct R1 (from r1 on node 2) on node 1,

    Reconstruct backup copy of R0 (i.e., r0) on node 1.

    Note:

    Once Node 1 becomes operational, primary copies are used to process read requests.


    Chained Declustering (Gamma)

    The failure of any two non-adjacent nodes in a relation cluster does not result in data unavailability.

    • Two adjacent nodes must fail for data to become unavailable.




    Query Processing

    • General idea: Divide and Conquer


    Parallelizing Hash Join

    • Consider Hash Join

      • A join of tables A and B on attribute j (A.j = B.j) consists of two phases:

        • Build phase: build a main-memory hash table on table A using the join attribute j, e.g., build a hash table on the Dept table using dno as the key of the hash table.

        • Probe phase: scan table B one record at a time and use its attribute j to probe the hash table built on table A, e.g., probe with the rows of the Emp table. (A sketch follows below.)
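
    A minimal in-memory sketch of the two phases. Table and column names follow the Emp/Dept example used on the next slides; this is illustrative Python, not Gamma's implementation:

        def hash_join(dept_rows, emp_rows):
            # Build phase: in-memory hash table on the inner table (Dept), keyed by dno.
            table = {}
            for dept in dept_rows:
                table.setdefault(dept["dno"], []).append(dept)

            # Probe phase: scan the outer table (Emp) and look up each row's dno.
            results = []
            for emp in emp_rows:
                for dept in table.get(emp["dno"], []):
                    results.append({**emp, **dept})
            return results

        dept = [{"dno": 1, "dname": "Toy"}, {"dno": 2, "dname": "Shoe"}]
        emp = [{"name": "Alice", "dno": 1}, {"name": "Bob", "dno": 2}, {"name": "Carol", "dno": 1}]
        print(hash_join(dept, emp))  # three joined rows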


    Parallelism and Hash-Join

    • R join S where R is the inner table.


    Example Join of Emp and Dept

    • Emp join Dept (using dno)

    (Figure: sample Dept and Emp tables.)


    Hash-Join: Build

    • Read the rows of the Dept table one at a time and place them in a main-memory hash table using hash function dno % 7.


    Hash-Join: Probe

    Read the rows of the Emp table and probe the hash table (dno % 7).


    Hash-Join: Probe

    Read the rows of the Emp table, probe the hash table (dno % 7), and produce a result whenever a match is found. Repeat until all rows are processed.


    Hash-Join (slide for exam etc. review only)

    Problem: the table used to build the hash table is larger than memory.

    A divide-and-conquer approach:

    Use the inner table (Dept) to construct n memory buckets where each bucket is a hash table.

    Every time memory is exhausted, spill a fixed number of buckets to the disk.

    The build phase terminates with a set of in-memory buckets and a set of disk-resident buckets.

    Read the outer relation (Emp) and probe the in-memory buckets for joining records. For those records that map onto the disk-resident buckets, stream and store them to disk.

    Discard the in-memory buckets to free memory space.

    While disk-resident buckets of inner-relation exist:

    Read as many (say i) of the disk-resident buckets of the inner-relation into memory as possible.

    Read the corresponding buckets of the outer relation (Emp) to probe the in-memory buckets for joining records.

    Discard the in-memory buckets to free memory space.

    Delete the i buckets of the inner and outer relations.

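
    A simplified sketch of the bucket-by-bucket idea above. The real algorithm keeps some buckets in memory and spills the rest only when memory fills; here every bucket is just a Python list, which keeps the illustration short:

        def partitioned_hash_join(dept_rows, emp_rows, num_buckets=2):
            bucket_of = lambda row: row["dno"] % num_buckets

            # Split the inner (Dept) and outer (Emp) tables into corresponding buckets.
            dept_buckets = [[] for _ in range(num_buckets)]
            emp_buckets = [[] for _ in range(num_buckets)]
            for row in dept_rows:
                dept_buckets[bucket_of(row)].append(row)
            for row in emp_rows:
                emp_buckets[bucket_of(row)].append(row)

            # Join one bucket pair at a time: build a hash table on the Dept bucket,
            # probe it with the matching Emp bucket, then discard it to free memory.
            results = []
            for d_bucket, e_bucket in zip(dept_buckets, emp_buckets):
                table = {}
                for dept in d_bucket:
                    table.setdefault(dept["dno"], []).append(dept)
                for emp in e_bucket:
                    for dept in table.get(emp["dno"], []):
                        results.append({**emp, **dept})
            return results

    Only rows that hash to the same bucket can possibly join, so each bucket pair can be processed with a hash table that fits in memory.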


    Hash-Join: Build

    • Two buckets of the Dept table: one in memory, the second disk-resident.

    dno % 7


    Hash-Join: Probe

    Read the Emp table and probe the hash table for joining records when dno = 1; rows with dno = 2 are streamed to disk (dno % 7).


    Hash-Join: Probe

    The rows of the Emp table with dno = 1 probe the hash table and produce 3 joining records (dno % 7).

    Hash-Join: While loop

    Read the disk-resident bucket of Dept into memory.

    Probe it with the disk-resident bucket of the Emp table to produce the remaining two joining records (dno % 7).


    Parallelism and Hash-Join

    • Each node may perform hash-join independently when:

      • The join attribute is the declustering attribute of the tables participating in the join operation.

      • The participating tables are declustered across the same number of nodes using the same declustering strategy.

      • Even when these hold, the system may re-partition the tables (see the next bullet) if its aggregate memory exceeds the memory of the nodes the tables are currently declustered across.

    • Otherwise, the data must be re-partitioned to perform the join operation correctly.
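
    A hedged sketch of the re-partitioning step (node count, attribute name, and the use of Python's hash are illustrative): every node applies the same split function to the join attribute of its local rows and routes each row to the node that owns that hash value.

        NUM_NODES = 4

        def destination(row, join_attr="dno"):
            # The same split function runs on every node, so matching Emp and Dept
            # rows always land on the same destination node.
            return hash(row[join_attr]) % NUM_NODES

        def repartition(local_rows):
            # Each node batches its local rows by destination and ships the batches.
            outgoing = {node: [] for node in range(NUM_NODES)}
            for row in local_rows:
                outgoing[destination(row)].append(row)
            return outgoing

    In Gamma this routing is specified by split tables attached to each operator's output stream.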


    Parallelism and Hash-Join

    • R join S where R is the inner table.




    System Evaluation

    • Two Key Metrics

      • Speed-up

        • For a fixed problem, does a system with n nodes run n times faster than a system with 1 node?

      • Scale-up

        • If the number of nodes and the problem size both grow by a factor of n, does the response time stay the same?
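
    Stated as ratios (a standard formulation rather than anything specific to the paper):

        def speedup(time_on_1_node, time_on_n_nodes):
            # Same problem, n times the hardware; the ideal value is n.
            return time_on_1_node / time_on_n_nodes

        def scaleup(time_small_problem_1_node, time_n_times_problem_n_nodes):
            # Problem size and hardware both grow n-fold; the ideal value is 1.
            return time_small_problem_1_node / time_n_times_problem_n_nodes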


    Ideal Parallelism


    Speedup (Selection)


    Scaleup (Selection)


    Speedup (Join)


    Scaleup (Join)




    Modern Systems are Gamma too?

    • Ghandeharizadeh’s 2009 summary:

      • All of the systems below, like Gamma, are "shared nothing" and run on 1000s of nodes

    • Pig: Yahoo’s Pig Latin

    • Hadoop: Google’s Map/Reduce framework

    • BigTable: Google’s Bigtable data model

    • GFS: Google File System


    Old is the New New?

    • DeWitt’s Clustera:

      • DBMS-centric cluster management system

      • Evolved from Gamma

      • Like MapReduce, but with

        • Full DBMS support

        • SQL optimization

      • Outperforms Hadoop!

    (Figure: Map/Reduce, Parallel SQL, and Condor positioned along axes of data complexity vs. job complexity.)


    Old is the New New?


    Outline

    Motivation

    Physical System Designs

    Data Clustering

    Failure Management

    Query Processing

    Overview of Results

    Bonus: Current Progress, Hadoop, Clustera

    Questions?


