
Developing Scalable High Performance Petabyte Distributed Databases

CHEP '98

Andrew Hanushevsky

SLAC Computing Services

Produced under contract DE-AC03-76SF00515 between Stanford University and the Department of Energy



BaBar & The B-Factory

  • High precision investigation of B-meson decays

    • Cosmic ray tracking starts October 1998

    • Experiment starts April 1999

  • 500 physicists collaborating from >70 sites in 10 countries

    • USA, Canada, China, France, Germany, Italy, Norway, Russia, UK, Taiwan

  • The experiment produces large quantities of data

    • 200 - 400 TB/year for 10 years

      • Data stored as objects using Objectivity

  • Heavy computational load

    • 5,000 SpecInt95’s

      • 526 Sun Ultra 10’s or 312 Alpha PW600’s

        • Work will be distributed across the collaboration



Handling The Data & Computation

[Architecture diagram: a network switch connects HPSS (RS/6000-F50's running AIX 4.2; Sun ES10000 with Veritas FS/VM), a Compute Farm (Sun Ultra 2's, Solaris 2.6), an AMS Farm (Sun ES4500's with Veritas FS/VM, Solaris 2.5), and external collaborators.]


High Performance Storage System

[Diagram: an application talks to the HPSS servers over a control network; the servers schedule multiple movers, which transfer data between disk and tape over a separate data network.]

  • HPSS server components:

    • Bitfile Server

    • Name Server

    • Storage Servers

    • Physical Volume Library

    • Physical Volume Repositories

    • Storage System Manager

    • Migration/Purge Server

    • Metadata Manager

    • Log Daemon

    • Log Client

    • Startup Daemon

    • Encina/SFS

    • DCE



Advanced Multithreaded Server

  • Client/Server Application

    • Serves “pages” (512 to 64K byte blocks)

      • Similar to other remote filesystem interfaces (e.g., NFS)

    • Objectivity client can read and write database “pages” via AMS

      • Pages range from 512 bytes to 64K in powers of 2 (e.g., 1K, 2K, 4K, etc.); a small validity-check sketch follows this slide

    • Enables Data Replication Option (DRO)

    • Enables Fault Tolerant Option (FTO)

[Diagram: a client speaks the ams protocol to the AMS server, which reads and writes database pages on its local disk via the ufs protocol.]
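As a small worked example of the page-size rule above, here is a minimal C++ check that a requested page size is a power of two between 512 bytes and 64K. The function is purely illustrative and is not part of AMS or Objectivity.

#include <cstddef>

// Illustrative only: checks the page-size rule quoted above
// (512 bytes to 64K, powers of two). Not an actual AMS or Objectivity API.
bool isValidPageSize(std::size_t bytes) {
    const std::size_t kMin = 512;          // smallest page
    const std::size_t kMax = 64 * 1024;    // largest page (64K)
    if (bytes < kMin || bytes > kMax) return false;
    return (bytes & (bytes - 1)) == 0;     // power-of-two test
}

// Example: isValidPageSize(8192) is true; isValidPageSize(3000) is false.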



Veritas File System & Volume Manager

  • Volume Manager

    • Concatenates disk devices to form very large capacity logical devices

      • Also provides software RAID-0, -1, and -5, plus dynamic I/O multipathing

  • File System

    • High performance journaled file system for fast recovery

      • Maximizes device speed/size performance (30+ MB/Sec for h/w RAID-5)

      • Supports 1TB+ files and file systems

[Diagram: the Veritas File System layered on the Volume Manager, which aggregates multiple RAID devices into a single logical device.]



Together Alone ….

  • Veritas Volume Manager + Veritas File System

    • Excellent I/O performance (10 - 30 MB/Sec) but

      • Insufficient capacity (1TB) and online cost too high

  • AMS

    • Efficient database protocol and highly flexible but

      • Limited security, low scalability, tied to local filesystem

  • HPSS

    • Highly scalable, excellent I/O performance for large files but

      • High latency for small block transfers (such as those issued by Objectivity/DB)

  • Need to combine these three systems synergistically, but

    • Want to keep them independent so that any one can be replaced



The Extensible AMS

[Diagram: the AMS sits on top of the oofs interface; a glue layer maps oofs calls onto a system-specific interface, with ooss, vfs (ufs), hpss, and security implementations plugged in underneath.]



An Object Oriented Interface

class oofsDesc {
   // General File System Methods
};

class oofsDir {
   // Directory-Specific Methods
};

class oofsFile {
   // File-Specific Methods
};



The oofs Interface

  • Provides a standard interface for AMS to get at a filesystem

    • Any filesystem that can implement the following functions can be used (a minimal C++ sketch follows this list):

      • close, getsize, remove

      • closedir, open, rename

      • exists, opendir, sync

      • getmode, read, truncate

      • getsectoken, readdir, write

    • Includes all current POSIX-like filesystems

    • The oofs interface is linked with AMS to create an executable

    • Normally transparent to client applications

      • Timing may not be transparent
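To make the interface concrete, here is a minimal C++ sketch of what an oofs-style file class could look like. The class name oofsFile and the operation names come from the slides; every signature, parameter, and default shown here is an assumption for illustration, not the actual SLAC/Objectivity interface.

#include <sys/types.h>
#include <cstddef>

// Hypothetical sketch only: an abstract oofs file. Each backend (ufs, Veritas
// vxfs, HPSS, ...) would subclass it, and the chosen implementation is linked
// with the AMS to produce the server executable.
class oofsFile {
public:
    virtual ~oofsFile() {}

    // Open (or create) the file; 'opaqueInfo' is an assumed slot for the
    // opaque hints tunneled from the client (see the later hints slide).
    virtual int open(const char *path, int flags, mode_t mode,
                     const char *opaqueInfo = 0) = 0;

    // Read/write 'count' bytes at byte offset 'offset'.
    virtual ssize_t read(void *buffer, std::size_t count, off_t offset) = 0;
    virtual ssize_t write(const void *buffer, std::size_t count, off_t offset) = 0;

    virtual int   close() = 0;
    virtual int   sync() = 0;                 // flush buffered data
    virtual int   truncate(off_t newSize) = 0;
    virtual off_t getsize() = 0;              // current file size in bytes
    virtual int   getmode() = 0;              // POSIX-style permission bits
};

// Which class hosts each operation is a guess: the namespace and directory
// functions (exists, remove, rename, opendir, readdir, closedir, getsectoken)
// would presumably live on the companion oofsDesc and oofsDir classes.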



The HPSS Interface

  • HPSS implements a “POSIX” filesystem

    • The HPSS API library provides sufficient oofs functionality (a sketch of the read() mapping follows this list):

      • close() → hpss_Close()

      • closedir() → hpss_Closedir()

      • exists() → hpss_Stat()

      • getmode() → hpss_Stat()

      • getsectoken() → not applicable

      • getsize() → hpss_Fstat()

      • open() → hpss_Open() [+ hpss_Create()]

      • opendir() → hpss_Opendir()

      • read() → hpss_SetFileOffset() + hpss_Read()

      • readdir() → hpss_Readdir()

      • remove() → hpss_Unlink()

      • rename() → hpss_Rename()

      • sync() → not applicable

      • truncate() → hpss_Ftruncate()

      • write() → hpss_SetFileOffset() + hpss_Write()
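For illustration, the following sketch shows how an HPSS-backed read() could follow the read() → hpss_SetFileOffset() + hpss_Read() mapping above. The hpss_* names come from the slide, but their real signatures differ; stand-in stubs built on POSIX lseek/read are included only so the sketch is self-contained.

#include <sys/types.h>
#include <cstddef>
#include <unistd.h>

// Stand-in stubs: the real functions come from the HPSS client API library
// and have different signatures (assumption). They are emulated with POSIX
// calls here purely so the sketch compiles.
static int hpss_SetFileOffset(int fd, off_t offset) {
    return lseek(fd, offset, SEEK_SET) < 0 ? -1 : 0;
}
static ssize_t hpss_Read(int fd, void *buffer, std::size_t count) {
    return ::read(fd, buffer, count);
}

// Hypothetical HPSS-backed oofs file (class name and structure assumed).
class hpssOofsFile {
public:
    explicit hpssOofsFile(int fd) : hpssFD(fd) {}

    // Mirrors the slide's mapping: position the HPSS file, then read.
    ssize_t read(void *buffer, std::size_t count, off_t offset) {
        if (hpss_SetFileOffset(hpssFD, offset) < 0) return -1;  // seek failed
        return hpss_Read(hpssFD, buffer, count);
    }

private:
    int hpssFD;   // descriptor obtained from hpss_Open()
};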



Additional Issues

  • Security

  • Performance

    • Access patterns (e.g., random vs sequential)

  • HPSS staging latency

  • Scalability

[Diagram: the application and the AMS exchange security credentials; behind the oofs interface sit the ooss, vfs, and hpss backends and the security module.]



Object Based Security Model

  • Protocol Independent Client Authentication Model

    • Public or private key

      • PGP, RSA, Kerberos, etc.

        • Can be negotiated at run-time

    • Provides for server authentication

  • AMS Client must call a special routine to enable security

    • oofs_Register_Security()

      • The supplied routine is responsible for creating the oofsSecurity object (see the sketch after this list)

  • Client Objectivity Kernel creates security objects as needed

    • Security objects supply context-sensitive authentication credentials

  • Works only with Extensible AMS via oofs interface
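As an illustration of how registration might look from the client side, here is a hedged C++ sketch. Only oofs_Register_Security() and the oofsSecurity object are named on the slides; the factory-callback shape, all signatures, and the Kerberos example are assumptions, with a stub registration function included so the sketch stands alone.

// Hypothetical shape of the security object (assumed, not the real interface).
class oofsSecurity {
public:
    virtual ~oofsSecurity() {}
    // Return context-sensitive credentials for the server being contacted.
    virtual const char *credentials(const char *serverHost) = 0;
};

// Hypothetical Kerberos-flavored implementation (placeholder logic only).
class KerberosSecurity : public oofsSecurity {
public:
    const char *credentials(const char *) { return "krb5-service-ticket"; }
};

// Assumed registration hook plus a stub body so the sketch is self-contained;
// the real routine lives in the AMS/Objectivity client library.
typedef oofsSecurity *(*oofsSecurityFactory)();
static oofsSecurityFactory registeredFactory = 0;
void oofs_Register_Security(oofsSecurityFactory factory) { registeredFactory = factory; }

static oofsSecurity *makeSecurity() { return new KerberosSecurity; }

// The client calls this once, before opening any database, so the Objectivity
// kernel can create security objects as connections are established.
void enableAmsSecurity() { oofs_Register_Security(makeSecurity); }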



Supplying Performance Hints

  • Need additional information for optimum performance

    • Different from Objectivity clustering hints

      • Database clustering

      • Processing mode (sequential/random)

      • Desired service levels

  • Information is Objectivity independent

    • Need a mechanism to tunnel opaque information

  • Client supplies hints via oofs_set_info() call

    • Information relayed to AMS in a transparent way

    • AMS relays information to underlying file system via oofs()
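A minimal sketch of what supplying hints could look like follows. oofs_set_info() is the only name taken from the slide; its argument format, signature, and the stub body here are assumptions chosen just to show the opaque-tunneling idea.

#include <string>

// Stub standing in for the client-side hook; the AMS is assumed to relay the
// string, uninterpreted, down to the oofs layer of the server.
static std::string pendingHints;
void oofs_set_info(const char *opaqueHints) { pendingHints = opaqueHints; }

// Example: before a sequential scan of a database, hint the access pattern
// and desired service level (the key=value format is illustrative only).
void prepareForSequentialScan() {
    oofs_set_info("access=sequential;service=batch;cluster=event-store");
}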



Dealing With Latency

  • Hierarchical filesystems may have high latency bursts

    • Mounting a tape file

  • Need mechanism to notify client of expected delay

    • Prevents request timeout

    • Prevents retransmission storms

    • Also allows server to degrade gracefully

      • Can delay clients when overloaded

  • Defer Request Protocol

    • Certain oofs() requests can tell client of expected delay

      • For example, open()

    • Client waits indicated amount of time and tries again
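The following client-side sketch illustrates the retry behavior just described. The protocol idea (the server answers with an expected delay, the client sleeps and retries) is from the slide; the reply structure, function names, and stub are assumptions.

#include <unistd.h>

// Assumed wire-level reply for an open request.
struct OpenReply {
    int fd;             // >= 0 when the file is open and ready
    int deferSeconds;   // > 0 means "staging in progress, retry after this delay"
};

// Assumed request primitive; stubbed below so the sketch is self-contained.
OpenReply amsOpen(const char *path);

// Honor the Defer Request Protocol: wait the indicated time and try again,
// instead of timing out or flooding the server with retransmissions.
int openWithDefer(const char *path, int maxTries) {
    for (int attempt = 0; attempt < maxTries; ++attempt) {
        OpenReply reply = amsOpen(path);
        if (reply.fd >= 0) return reply.fd;        // staged (e.g., tape mounted)
        if (reply.deferSeconds <= 0) return -1;    // hard error: give up
        sleep(reply.deferSeconds);                 // server-suggested back-off
    }
    return -1;
}

OpenReply amsOpen(const char *path) {              // stub for illustration
    (void)path;
    OpenReply r = { -1, 5 };
    return r;
}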



Balancing The Load I

  • Dynamically distributed databases

    • Single machine can’t manage over a terabyte of disk cache

    • No good way to statically partition the database

  • Dynamically varying database access paths

    • As load increases, add more copies

      • Copies accessed in parallel

    • As load decreases, remove copies to free up disk space

  • Objectivity catalog independence

    • Copies managed outside of Objectivity

      • Minimizes impact on administration



Balancing The Load II

  • Request Redirect Protocol

    • oofs() routines supply an alternate AMS location (see the sketch after this list)

  • oofs routines responsible for update synchronization

    • Typically, read/only access provided on copies

    • Only one read/write copy conveniently supported

      • Client must declare intention to update prior to access

      • Lazy synchronization possible

  • Good mechanism for largely read/only databases

  • Load balancing provided by an AMS collective

    • Has one distinguished member recorded in the catalogue
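A hedged sketch of how a client might follow such a redirect is shown below. The redirect idea and the "distinguished member" role come from the slides; every type, function name, and the hop limit are assumptions.

#include <string>

// Assumed reply from the AMS named in the Objectivity catalogue.
struct AmsReply {
    bool        redirected;   // true: another collective member should serve this request
    std::string altHost;      // alternate AMS host
    int         altPort;      // alternate AMS port
};

// Assumed request primitive; stubbed at the end so the sketch compiles alone.
AmsReply sendOpen(const std::string &host, int port, const std::string &path);

// Contact the distinguished member first and follow redirects to whichever
// copy the collective chooses; the hop limit guards against redirect loops.
bool openViaCollective(std::string host, int port, const std::string &path) {
    for (int hops = 0; hops < 4; ++hops) {
        AmsReply reply = sendOpen(host, port, path);
        if (!reply.redirected) return true;   // this member is serving the file
        host = reply.altHost;                 // retry against the suggested member
        port = reply.altPort;
    }
    return false;
}

AmsReply sendOpen(const std::string &, int, const std::string &) {   // stub
    AmsReply r; r.redirected = false; r.altPort = 0; return r;
}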



The AMS Collective

[Diagram: two AMS collectives, each a set of effectively interchangeable AMS servers; a client contacts a collective's distinguished member, which may redirect the request to another member.]



Overall Effects

  • Extensible AMS

    • Allows use of any type of filesystem via oofs layer

  • Generic Authentication Protocol

    • Allows proper client identification

  • Opaque Information Protocol

    • Allows passing of hints to improve filesystem performance

  • Defer Request Protocol

    • Accommodates hierarchical filesystems

  • Redirection Protocol

    • Accommodates terabyte+ filesystems

    • Provides for dynamic load balancing



Dynamic Load Balancing Hierarchical Secure AMS

[Diagram: a client dynamically selects among several AMS servers, each with its own vfs disk cache; behind them, a shared HPSS system feeds Redwood tape drives.]


Summary

  • AMS is capable of high performance

    • Ultimate performance limited by disk speeds

      • Should be able to deliver average of 20 MB/Sec per disk

  • The oofs interface + other protocols greatly enhance performance, scalability, usability, and security

  • SLAC will be using this combination to store physics data

    • The BaBar experiment will produce a database of over 2 PB in 10 years

      • 2,000,000,000,000,000 = 2×10^15 bytes ≈ 200,000 3590 tapes
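As a quick consistency check on that figure: 2×10^15 bytes spread over 200,000 tapes works out to 10^10 bytes, or 10 GB per cartridge, which matches the nominal native capacity of a 3590 tape (the capacity figure is general knowledge, not from the slides).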

