
Smart Storage and Linux: An EMC Perspective

Ric Wheeler

[email protected]



Why Smart Storage?

  • Central control of critical data

    • One central resource to fail-over in disaster planning

    • Banks, trading floors, and airlines want zero downtime

  • Smart storage is shared by all hosts & OS’es

    • Amortize the costs of high availability and disaster planning over all of your hosts

    • Use different OS’es for different jobs (UNIX for the web, IBM mainframes for data processing)

    • Zero-time “transfer” from host to host when both are connected

    • Enables cluster file systems



Data Center Storage Systems

  • Change the way you think of storage

    • Shared Connectivity Model

    • “Magic” Disks

    • Scales to new capacity

    • Storage that runs for years at a time

  • Symmetrix case study

    • Symmetrix 8000 Architecture

    • Symmetrix Applications

  • Data center class operating systems



Traditional Model of Connectivity

  • Direct Connect

    • Disk attached directly to host

    • Private - OS controls access and provides security

    • Storage I/O traffic only

      • Separate system used to support network I/O (networking, web browsing, NFS, etc)



Shared Models of Connectivity

  • VMS Cluster

    • Shared disk & partitions

    • Same OS on each node

    • Scales to dozens of nodes

  • IBM Mainframes

    • Shared disk & partitions

    • Same OS on each node

    • Handful of nodes

  • Network Disks

    • Shared disk/private partition

    • Same OS

    • Raw/block access via network

    • Handful of nodes


New Models of Connectivity

[Diagram: hosts running FreeBSD, VMS, Linux, Solaris, IRIX, DGUX, NT, HPUX, and MVS all attached to a single shared storage system]

  • Every host in a data center could be connected to the same storage system

    • Heterogeneous OS & data format (CKD & FBA)

    • Management challenge: No central authority to provide access control



Magic Disks

  • Instant copy

    • Devices, files or data bases

  • Remote data mirroring

    • Metropolitan area

    • 100’s of kilometers

  • 1000’s of virtual disks

    • Dynamic load balancing

  • Behind the scenes backup

    • No host involved



Scalable Storage Systems

  • Current systems support

    • 10’s of terabytes

    • Dozens of SCSI, fibre channel, ESCON channels per host

    • Highly available (years of run time)

    • Online code upgrades

    • Potentially 100’s of hosts connected to the same device

  • Support for chaining storage boxes together locally or remotely



Longevity

  • Data should be forever

    • Storage needs to overcome network failures, power failures, blizzards, asteroid strikes …

    • Some boxes have run for over 5 years without a reboot or halt of operations

  • Storage features

    • No single point of failure inside the box

    • At least 2 connections to a host

    • Online code upgrades and patches

    • Call home on error, ability to fix field problems without disruptions

    • Remote data mirroring for real disasters



Symmetrix Architecture

  • 32 “directors” based on PowerPC 750 CPUs

  • Up to 32 GB of central “cache” for user data

  • Support for SCSI, Fibre Channel, ESCON, …

  • 384 drives (over 28 TB with 73 GB units)



Symmetrix Basic Architecture



Data Flow through a Symm



Read Performance



Prefetch is Key

  • Read hit gets RAM speed, read miss is spindle speed

  • What helps cached storage array performance?

    • Contiguous allocation of files (extent-based file systems) preserves the logical-to-physical mapping

    • Hints from the host could help prediction

  • What might hurt performance?

    • Clustering small, unrelated writes into contiguous blocks (foils prefetch on later read of data)

    • Truly random read IO’s
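
The effect of prefetch can be seen in a toy simulation. The sketch below is illustrative only (not EMC code); block numbers stand in for logical tracks and PREFETCH_DEPTH is an invented knob. Sequential reads of contiguously allocated data mostly hit the cache, while truly random reads almost never do.

```python
# Toy model of a cached storage array with sequential prefetch (illustrative
# only; block numbers stand in for logical tracks, PREFETCH_DEPTH is invented).
import random

PREFETCH_DEPTH = 8        # tracks read ahead when a stream misses
cache = set()

def read(block):
    """Return 'hit' (RAM speed) or 'miss' (spindle speed), prefetching ahead."""
    if block in cache:
        return "hit"
    cache.update(range(block, block + PREFETCH_DEPTH + 1))   # fetch + read ahead
    return "miss"

def hit_rate(blocks):
    cache.clear()
    results = [read(b) for b in blocks]
    return results.count("hit") / len(results)

sequential = list(range(10_000))                                   # contiguous file
scattered = [random.randrange(10_000_000) for _ in range(10_000)]  # random read I/O
print(f"sequential hit rate: {hit_rate(sequential):.0%}")   # high (~89% here)
print(f"random hit rate:     {hit_rate(scattered):.0%}")    # near zero
```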



Symmetrix Applications

  • Instant copy

    • TimeFinder

  • Remote data copy

    • SRDF (Symmetrix Remote Data Facility)

  • Serverless Backup and Restore

    • Fastrax

  • Mainframe & UNIX data sharing

    • IFS


Business Continuance Problem: “Normal” Daily Operations Cycle

[Timeline: the online day runs until about 2 am, a backup/DSS window runs until about 6 am (roughly 4 hours of data inaccessibility, the “race to sunrise”), and then the online day resumes]



TimeFinder

  • Creation and control of a copy of any active application volume

  • Capability to allow the new copy to be used by another application or system

    • Continuous availability of production data during backups, decision support, batch queries, DW loading, Year 2000 testing, application testing, etc.

  • Ability to create multiple copies of a single application volume

  • Non-disruptive re-synchronization when second application is complete

[Diagram: a production application volume (e.g. Sales) is paired with a business continuance volume (BCV); the BCV is a copy of real production data used for backups, decision support, data warehousing, Euro conversion, etc.]



Business Continuance Volumes

  • A Business Continuance Volume (BCV) is created and controlled at the logical volume level

  • Physical drive sizes can be different, logical size must be identical

  • Several ACTIVE copies of data at once per Symmetrix


Using TimeFinder

[Diagram: a production volume with mirrors M1 and M2 and an attached BCV]

  • Establish BCV

  • Stop transactions to clear buffers

  • Split BCV

  • Start transactions

  • Execute against BCVs

  • Re-establish BCV
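
In pseudocode form, the sequence above is a quiesce/split/resume protocol. The sketch below is purely conceptual; `symm`, `app`, and `bcv_job` are hypothetical placeholder objects, not SYMAPI or SYMCLI calls.

```python
# Conceptual outline of the TimeFinder cycle listed above. All names here
# (symm, app, bcv_job) are hypothetical placeholders, not SYMAPI functions.

def timefinder_cycle(symm, app, bcv_job):
    symm.establish("PROD", "BCV")    # BCV syncs with the production volume
    symm.wait_until_synchronized()

    app.quiesce()                    # stop transactions so buffers are flushed
    symm.split("PROD", "BCV")        # BCV now holds a consistent point-in-time copy
    app.resume()                     # production work restarts almost immediately

    bcv_job.run(volume="BCV")        # backup / decision support against the copy

    symm.reestablish("PROD", "BCV")  # resync copies only tracks changed since the split
```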


Re-Establishing a BCV Pair

[Diagram: a split BCV pair, with updated tracks marked on both the PROD and BCV sides, before and after the re-establish]

  • BCV pair “PROD” and “BCV” have been split

  • Tracks on “PROD” updated after split

  • Tracks on “BCV” updated after split

  • Symmetrix keeps a table of these “invalid” tracks after the split

  • When the BCV pair is re-established, the “invalid” tracks are written from “PROD” to “BCV”

  • Synch complete



Restore a BCV Pair

[Diagram: a split BCV pair, with updated tracks marked on both the PROD and BCV sides, before and after the restore]

  • BCV pair “PROD” and “BCV” have been split

  • Tracks on “PROD” updated after split

  • Tracks on “BCV” updated after split

  • Symmetrix keeps a table of these “invalid” tracks after the split

  • When the BCV pair is restored, the “invalid” tracks are written from “BCV” to “PROD”

  • Synch complete

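
A small model makes the invalid-track bookkeeping behind re-establish and restore concrete. This is an illustrative sketch, not EMC code; the track granularity and copy directions follow the two slides above.

```python
# Toy model of the invalid-track table described above (illustrative only).
# After a split, writes on either side mark tracks invalid relative to the
# other side; re-establish copies PROD -> BCV, restore copies BCV -> PROD.

class BCVPair:
    def __init__(self, tracks):
        self.prod = dict.fromkeys(range(tracks), "original")
        self.bcv = dict(self.prod)
        self.invalid_prod_side = set()   # tracks updated on PROD since the split
        self.invalid_bcv_side = set()    # tracks updated on BCV since the split

    def write_prod(self, track, data):
        self.prod[track] = data
        self.invalid_prod_side.add(track)

    def write_bcv(self, track, data):
        self.bcv[track] = data
        self.invalid_bcv_side.add(track)

    def _resync(self, source, target):
        for t in self.invalid_prod_side | self.invalid_bcv_side:
            target[t] = source[t]        # only the invalid tracks move
        self.invalid_prod_side.clear()
        self.invalid_bcv_side.clear()

    def reestablish(self):
        self._resync(self.prod, self.bcv)   # PROD's view wins

    def restore(self):
        self._resync(self.bcv, self.prod)   # BCV's view wins


pair = BCVPair(tracks=8)
pair.write_prod(2, "prod update")    # production keeps changing after the split
pair.write_bcv(5, "bcv update")      # e.g. a test run scribbles on the BCV
pair.reestablish()                   # only tracks 2 and 5 are copied PROD -> BCV
assert pair.bcv == pair.prod
```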


Make as Many Copies as Needed

[Diagram: a production volume with mirrors M1 and M2 and three BCVs (BCV 1, BCV 2, BCV 3) split in turn, e.g. at 4 PM, 5 PM, and 6 PM]

  • Establish BCV 1

  • Split BCV 1

  • Establish BCV 2

  • Split BCV 2

  • Establish BCV 3




The Purpose of SRDF

  • Local data copies are not enough

  • Maximalist

    • Provide a remote copy of the data that will be as usable after a disaster as the primary copy would have been.

  • Minimalist

    • Provide a means for generating periodic physical backups of the data.



Synchronous Data Mirroring

  • Write is received from the host into the cache of the source

  • I/O is transmitted to the cache of the target

  • ACK is provided by the target back to the cache of the source

  • Ending status is presented to the host

  • Symmetrix systems destage writes to disk

  • Useful for disaster recovery



Semi-Synchronous Mirroring

  • An I/O write is received from the host/server into the cache of the source

  • Ending status is presented to the host/server.

  • I/O is transmitted to the cache of the target

  • ACK is sent by the target back to the cache of the source

  • Each Symmetrix system destages writes to disk

  • Useful for adaptive copy
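
The only real difference between the two modes is when ending status goes back to the host, which the sketch below lines up step by step (illustrative pseudocode; the host, source, and target objects are hypothetical stand-ins).

```python
# Ordering of the two SRDF modes described above (illustrative only; the
# host, source, and target objects are hypothetical stand-ins).

def synchronous_write(host, source, target, io):
    source.cache(io)         # 1. write lands in the source Symmetrix cache
    target.cache(io)         # 2. I/O transmitted to the target cache
    target.ack(source)       # 3. target acknowledges the source
    host.ending_status(io)   # 4. only now does the host see completion
    # both systems destage the write to disk in the background

def semi_synchronous_write(host, source, target, io):
    source.cache(io)         # 1. write lands in the source Symmetrix cache
    host.ending_status(io)   # 2. host sees completion immediately
    target.cache(io)         # 3. I/O transmitted to the target cache
    target.ack(source)       # 4. target acknowledges the source
    # both systems destage the write to disk in the background
```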



Backup / Restore of Big Data

  • Exploding amounts of data cause backups to run too long

    • How long does it take you to backup 1 TB of data?

    • Shrinking backup window and constant pressure for continuous application up-time

  • Avoid using production environment for backup

    • No server CPU or I/O channels

    • No involvement of regular network

  • Performance must scale to match customer’s growth

  • Heterogeneous host support
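
As a rough feel for the 1 TB question above, dividing the data size by an assumed sustained throughput gives the backup window. The throughput figures below are illustrative assumptions, not measurements.

```python
# Back-of-the-envelope backup time for 1 TB at a few assumed sustained
# throughputs (illustrative numbers only).
TB = 10**12   # bytes

for mb_per_s in (25, 100, 400):
    hours = TB / (mb_per_s * 10**6) / 3600
    print(f"{mb_per_s:>4} MB/s -> {hours:4.1f} hours")
# 25 MB/s -> 11.1 hours; 100 MB/s -> 2.8 hours; 400 MB/s -> 0.7 hours
```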


Fastrax Overview

[Diagram: UNIX and Linux hosts at two locations run Fastrax-enabled backup/restore applications through SYMAPI; Symmetrix standard volumes and BCVs (STD1/BCV1, STD2/BCV2) and an SRDF R1/R2 pair feed the Fastrax data engine over Fibre Channel point-to-point links to SCSI-attached tape libraries]


Host to Tape Data Flow

[Diagram: the host initiates the job, while the data itself flows from the Symmetrix through the Fastrax data engine to the tape library]


Fastrax Performance

[Diagram: multiple Symmetrix RAF directors paired with Fastrax data movers (DMs) over SRDF links]

  • Performance scales with the number of data movers in the Fastrax box & number of tape devices

  • Restore runs as fast as backup

  • No performance impact on host during restore or backup



Moving Data from Mainframes to UNIX



InfoMover File System

  • Transparent availability of MVS data to Unix hosts

    • MVS datasets available as native Unix files

  • Sharing a single copy of MVS datasets

  • Uses MVS security and locking

    • Standard MVS access methods for locking + security


IFS Implementation

  • Mainframe

    • IBM MVS / OS390

  • Open Systems

    • IBM AIX

    • HP HP-UX

    • Sun Solaris

[Diagram: MVS data on a Symmetrix with ESP; the mainframe attaches via ESCON or parallel channels, the open-systems hosts via FWD SCSI, Ultra SCSI, or Fibre Channel. Minimal network overhead: no data transfer over the network!]



Symmetrix API’s



Symmetrix API Overview

  • SYMAPI Core Library

    • Used by “Thin” and Full Clients

  • SYMAPI Mapping Library

  • SYMCLI Command Line Interface



Symmetrix API’s

  • SYMAPI are the high level functions

    • Used by EMC’s ISV partners (Oracle, Veritas, etc) and by EMC applications

  • SYMCLI is the “Command Line Interface” which invokes SYMAPI

    • Used by end customers and some ISV applications.


Basic Architecture

User access to the Solutions Enabler is via SymCli or a storage management application

[Diagram: other storage management applications and the Symmetrix Command Line Interpreter (SymCli) both sit on top of the Symmetrix Application Programming Interface (SymAPI)]


Client-Server Architecture

[Diagram: a client host and a thin client host each run storage management applications on top of a SymAPI client (full or thin); both connect to the SymAPI server and SymAPI library on the server host]

  • Symapi Server runs on the host computer connected to the Symmetrix storage controller

  • Symapi client runs on one or more host computers


SymmAPI Components

  • Initialization

  • InfoSharing

  • Gatekeepers

  • Calypso Controls

  • Discover and Update

  • Optimizer Controls

  • Configuration

  • DeltaMark Functions

  • Device Groups

  • SRDF Functions

  • Statistics

  • TimeFinder Functions

  • Mapping Functions

  • Base Controls


Data Object Resolve

[Resolve stack: RDBMS data file → file system → logical volume → host physical device → Symmetrix device → extents]



File System Mapping

  • File System mapping information includes:

    • File System attributes and host physical location.

    • Directory attributes and contents.

    • File attributes and host physical extent information, including inode information and fragment size.

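
The resolve chain from the two slides above can be pictured as a file mapped down to extents on a Symmetrix device. The sketch below is a conceptual illustration; the class names, fields, and sample values are invented and are not the actual SYMAPI mapping structures.

```python
# Conceptual sketch of file-to-extent resolution (hypothetical structures,
# not the SYMAPI mapping library).
from dataclasses import dataclass

@dataclass
class Extent:
    symm_device: str     # Symmetrix device holding this piece of the file
    start_block: int
    block_count: int

@dataclass
class FileMapping:
    path: str
    inode: int
    fragment_size: int
    extents: list        # physical extents, in file order

# RDBMS data file -> file system -> logical volume -> Symmetrix device extents
fs_map = {
    "/oracle/data01.dbf": FileMapping(
        path="/oracle/data01.dbf", inode=1187, fragment_size=8192,
        extents=[Extent("symm-0042", start_block=0, block_count=262144),
                 Extent("symm-0043", start_block=262144, block_count=131072)]),
}

def resolve(path):
    """Look up the host file and return its Symmetrix extents."""
    return fs_map[path].extents

print(resolve("/oracle/data01.dbf"))
```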



Data Center Hosts



Solaris & Sun Starfire

  • Hardware

    • Up to 62 IO Channels

    • 64 CPU’s

    • 64 GB of RAM

    • 60 TB of disk

    • Supports multiple domains

  • Starfire & Symmetrix

    • ~20% use more than 32 IO channels

    • Most use 4 to 8 IO channels per domain

    • Oracle instance usually above 1 TB



HPUX & HP 9000 Superdome

  • Hardware

    • 192 IO Channels

    • 64 CPUs

    • 128 GB RAM

    • 1 PB of storage

  • Superdome and Symm

    • 16 LUNs per target

    • Want us to support more than 4000 logical volumes!



Solaris and Fujitsu GP7000F M1000

  • Hardware

    • 6-48 I/O slots

    • 4-32 CPU’s

    • Cross-Bar Switch

    • 32 GB RAM

    • 64-bit PCI bus

    • Up to 70TB of storage



Solaris and Fujitsu GP7000F M2000

  • Hardware

    • 12-192 I/O slots

    • 8-128 CPU’s

    • Cross-Bar Switch

    • 256 GB RAM

    • 64-bit PCI bus

    • Up to 70TB of storage



AIX 5L & IBM RS/6000 SP

  • Hardware

    • Scale to 512 Nodes (over 8000 CPUs)

    • 32 TB RAM

    • 473 TB Internal Storage Capacity

    • High Speed Interconnect 1GB/sec per channel with SP Switch2

    • Partitioned Workloads

    • Thousands of IO Channels



IBM RS/6000 pSeries 680 AIX 5L

  • Hardware

    • 24 CPUs

      • 64-bit RS64 IV

    • 600MHz

    • 96 GB RAM

    • 873.3 GB Internal Storage Capacity

    • 53 PCI slots

      • 33 32-bit, 20 64-bit



Really Big Data

  • IBM (Sequent) NUMA

    • 16 NUMA “Quads”

      • 4 way/ 450 MHz CPUs

      • 2 GB Memory

      • 4 x 100MB/s FC-SW

  • Oracle 8.1.5 with up to 42 TB (mirrored) DB

  • EMC Symmetrix

    • 20 Small Symm 4’s

    • 2 Medium Symm 4’s



Windows 2000 on IA32

  • Usually lots of small (1u or 2u) boxes share a Symmetrix

    • 4 to 8 IO channels per box

    • Qualified up to 1 TB per meta volume (although usually deployed with ½ TB or less)

  • Management is a challenge

  • Will 2000 on IA64 handle big data better?



Linux Data Center Wish List



Lots of Devices

  • Customers can use hundreds of targets and LUN’s (logical volumes)

    • 128 SCSI devices per system is too few

  • Better naming system to track lots of disks

  • Persistence for “not ready” devices in the name space would help some of our features

  • devfs solves some of this

    • Rational naming scheme

    • Potential for tons of disk devices (need SCSI driver work as well)



Support for Dynamic Data

  • What happens when the LV changes under a running file system? Adding new logical volumes?

    • Happens with TimeFinder, RDF, Fastrax

    • Does this require remounting, reloading drivers, or rebooting?

    • API’s can be used to give “heads up” before events

  • Must be able to invalidate

    • Data, name and attribute caches for individual files or logical volumes

  • Support for dynamically loaded, layered drivers

  • Dynamic allocation of devices

    • Especially important for LUN’s

    • Add & remove devices as fibre channel fabric changes



Keep it Open

  • Open source is good for us

    • We can fix it or support it if you don’t want to

    • No need to reverse engineer some closed source FS/LVM

  • Leverage storage API’s

    • Add hooks to Linux file systems, LVM’s, sys admin tools

  • Make Linux manageable

    • Good management tools are crucial in large data centers



New Technology Opportunities

  • Linux can explore new technologies faster than most

  • iSCSI

    • SCSI over TCP for remote data copy?

    • SCSI over TCP for host storage connection?

    • High speed/zero-copy TCP is important to storage here!

  • Infiniband

    • Initially targeted at PCI replacement

    • High speed, high performance cluster infrastructure for file systems, LVM’s, etc

      • Multi gigabit/sec links (2.5 Gbit/sec up to 30 Gbit/sec)

    • Support for IB as a storage connection?

  • Cluster file systems


Linux at EMC

[Diagram: EMC Symmetrix Enterprise Storage, the EMC Connectrix enterprise Fibre Channel switch, and centralized monitoring and management]

  • Full support for Linux in SymAPI, RDF, TimeFinder, etc

  • Working with partners in the application space and the OS space to support Linux

    • Oracle Open World Demo of Oracle on Linux with over 20 Symms (could reach 1PB of storage!)



MOSIX and Linux Cluster File Systems



Our Problem: Code Builds

  • Over 70 OS developers

    • Each developer builds 15 variations of the OS

    • Each variation compiles over a million lines of code

    • Full build uses gigabytes of space, with 100k temporary files

  • User sandboxes stored in home directory over NFS

  • Full build took around 2 hours

  • 2 users could build at once



Our Original Environment

  • Software

    • GNU tool chain

    • CVS for source control

    • Platform Computing’s Load Sharing Facility

    • Solaris on build nodes

  • Hardware

    • EMC NFS server (Celerra) with EMC Symmetrix back end

    • 26 SUN Ultra-2 (dual 300 MHz CPU) boxes

    • FDDI ring used for interconnect



EMC's LSF Cluster



LSF Architecture

  • Distributed process scheduling and remote execution

    • No kernel modifications

    • Prefers to use static placement for load balancing

    • Applications need to link special library

    • License server controls cluster access

  • Master node in cluster

    • Manages load information

    • Makes scheduling decisions for all nodes

  • Uses modified GNU Make (lsmake)



MOSIX Architecture

  • Provide transparent, dynamic migration

    • Processes can migrate at any time

    • No user intervention required

    • Process thinks it is still running on its creation node

  • Dynamic load balancing

    • Use decentralized algorithm to continually level load in the cluster

    • Based on number of CPU's, speed of CPU's, RAM, etc

  • Worked great for distributed builds in 1989



MOSIX Mechanism

  • Each process has a unique home node

    • UHN is the node of the process’s creation

    • Process appears to be running at its UHN

    • After migration, it is invisible to other users on its new node

  • UHN runs a deputy

    • Encapsulates system state for migrated process

    • Acts as a proxy for some location-sensitive system calls after migration

    • Significant performance hit for IO over NFS, for example
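
The deputy idea can be illustrated with a toy sketch (not MOSIX source; the class names are invented). Location-sensitive system calls from a migrated process round-trip back to the deputy on its unique home node, which is why NFS-heavy work suffers after migration.

```python
# Toy illustration of the deputy mechanism described above (not MOSIX source).
# Location-sensitive system calls from a migrated process are shipped back to
# a deputy on its unique home node (UHN), which is why NFS-heavy work suffers.

class Deputy:
    """Stub left on the home node; performs I/O on behalf of the migrant."""
    def __init__(self, home_node):
        self.home_node = home_node

    def do_syscall(self, name, *args):
        # In MOSIX this is a kernel-level proxy; here it just reports the fact.
        return f"{name}{args} executed on {self.home_node}"

class MigratedProcess:
    def __init__(self, pid, home_node, current_node):
        self.pid = pid
        self.current_node = current_node
        self.deputy = Deputy(home_node)    # stays behind on the UHN

    def read_file(self, path):
        # Location-sensitive: round-trips over the network to the deputy.
        return self.deputy.do_syscall("read", path)

    def compute(self, x):
        # CPU-bound work runs locally on the node the process migrated to.
        return x * x

p = MigratedProcess(pid=4242, home_node="nodeA", current_node="nodeB")
print(p.compute(7))                          # done on nodeB
print(p.read_file("/home/build/Makefile"))   # proxied through nodeA
```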


MOSIX Migration

[Diagram: the local process's deputy remains at user level on node A while the remote part runs on node B; the kernels communicate over the link layer, and NFS access stays on node A]



MOSIX Enhancements

  • MOSIX added static placement and remote execution

    • Leverage the load balancing infrastructure for placement decisions

    • Avoid creation of deputies

  • Lock remotely spawned processes down just in case

  • Fix several NFS caching related bugs

  • Modify some of our makefile rules


MOSIX Remote Execution

[Diagram: same layout as the migration figure, but NFS is reachable from both node A and node B, so the remotely executed process does its I/O locally]



EMC MOSIX cluster

  • EMC’s original MOSIX cluster

    • Compute nodes changed from LSF to MOSIX

    • Network changed from FDDI to 100 megabit ethernet.

  • The MOSIX cluster immediately moved the bottleneck from the cluster to the network and I/O systems.

  • Performance was great, but we can do better!



Latest Hardware Changes

  • Network upgrades

    • New switch deployed

    • Nodes to switch use 100 megabit ethernet

    • Switch to NFS server uses gigabit ethernet

  • NFS upgrades

    • 50 gigabyte, striped file systems per user (compared to 9 gigabyte non-striped file systems)

    • Fast/wide differential SCSI between server and storage

  • Cluster upgrades

    • Added 28 more compute nodes

    • Added 4 “submittal” nodes


EMC MOSIX Cluster

[Diagram: 52 VA Linux 3500s connect over 100 Mbit Ethernet to a Cisco 6506 switch, which connects over Gigabit Ethernet to 2 EMC Celerra NFS servers; the Celerras attach over SCSI to 4 Symmetrix systems]



Performance

  • Running Red Hat 6.0 with 2.2.10 kernel (MOSIX and NFS patches applied)

  • Builds are now around 15-20 minutes (down from 1-1.5 hours)

  • Over 35 concurrent builds at once



Build Submissions


Cluster File System & MOSIX

[Diagram: the 52 VA Linux 3500s keep 100 Mbit Ethernet to the Cisco 6506 and Gigabit Ethernet to 1 EMC Celerra NFS server, and also attach directly to the 4 Symmetrix systems over Fibre Channel through a Connectrix switch]



DFSA Overview

  • DFSA provides the structure to allow migrated processes to always do local IO

  • MFS (MOSIX File System) created

    • No caching per node, write through

    • Serverless - all nodes can export/import files

    • Prototype for DFSA testing

    • Works like non-caching NFS



DFSA Requirements

  • One active inode/buffer in the cluster for each file

  • Time-stamps are cluster-wide, increasing

  • Some new FS operations

    • Identify: encapsulate dentry info

    • Compare: are two files the same?

    • Create: produce a new file from SB/ID info

  • Some new inode operations

    • Checkpath: verify path to file is unique

    • Dotdot: give true parent directory
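
Expressed as an interface, the extra hooks look roughly like the sketch below. This is an illustrative Python rendering only; the real DFSA interface is a set of kernel file-system and inode operations, not Python methods.

```python
# Illustrative rendering of the DFSA hooks listed above (the real interface
# consists of kernel FS/inode operations, not Python methods).
from abc import ABC, abstractmethod

class DFSACapableFS(ABC):
    # New file-system operations
    @abstractmethod
    def identify(self, dentry):
        """Encapsulate enough dentry info to re-create the file elsewhere."""

    @abstractmethod
    def compare(self, handle_a, handle_b):
        """Return True if the two handles refer to the same file."""

    @abstractmethod
    def create(self, superblock_info, identify_info):
        """Produce a new file handle from superblock/ID info."""

    # New inode operations
    @abstractmethod
    def checkpath(self, inode):
        """Verify that the path to the file is unique."""

    @abstractmethod
    def dotdot(self, inode):
        """Return the true parent directory of the inode."""
```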



Information

  • MOSIX:

    • http://www.mosix.org/

  • GFS:

    • http://www.globalfilesystem.org/

  • Migration Information

    • Process Migration, Milojicic, et al. To appear in ACM Computing Surveys, 2000.

    • Mobility: Processes, Computers and Agents, Milojicic, Douglis and Wheeler, ACM Press.

