

SunFire 15K Enterprise-Grade Server, March 14, 2003

Implementation Review



Overview

  • Introduction of SunFire 15K architecture and concepts

    • Hardware RAS features

    • Dynamic Domains

  • System requirements

    • Number of domains

    • Resources for each domain

    • Expansion

  • Process RAS features

  • Risk factors and risk mitigation

  • Current Status

  • Schedule


[Figure: two labels, each reading "12 fans each"]

Sun Fire 15K:

“Highly redundant, symmetric multi-processing server with a shared memory architecture”

Features:

  • 1 to 18 CPU/Memory boards

  • 4 CPUs/board @ 1050MHz

  • 1 to 8GB/CPU = up to 576GB memory

  • 1 to 18 I/O boards, 4 PCI slots/board

  • Can trade I/O boards for extra CPUs

  • Partition into 1 to 18 dynamic domains
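The 576GB figure follows from the other numbers in the feature list. A minimal sketch of that arithmetic, using only the values quoted on the slide (the variable names are illustrative):

```python
# Maximum configuration implied by the feature list above.
MAX_CPU_BOARDS = 18          # 1 to 18 CPU/Memory boards
CPUS_PER_BOARD = 4           # 4 CPUs per board
MAX_GB_PER_CPU = 8           # 1 to 8 GB of memory per CPU
MAX_IO_BOARDS = 18           # 1 to 18 I/O boards
PCI_SLOTS_PER_IO_BOARD = 4   # 4 PCI slots per I/O board

max_cpus = MAX_CPU_BOARDS * CPUS_PER_BOARD
max_memory_gb = max_cpus * MAX_GB_PER_CPU
max_pci_slots = MAX_IO_BOARDS * PCI_SLOTS_PER_IO_BOARD

print(f"CPUs: {max_cpus}, memory: {max_memory_gb} GB, PCI slots: {max_pci_slots}")
```

With every board fully populated this gives 72 CPUs and 576GB of memory.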


SunFire 15K: High RAS Features

  • Reliability:

    • Fully redundant CPU/Memory and I/O boards, PCI cards

    • Dual System Controllers

    • Dual System Clock

    • Dual Grid power, redundant power supplies

    • Redundant fans

    • Environmental monitoring

  • Serviceability:

    • Hot Swap CPU/Memory boards

    • Hot Swap I/O boards, PCI components

    • Hot Swap System controller

    • Hot Swap power supplies

    • Hot Swap fans

    • Full Remote Diagnostics


Dynamic Domains

  • CPU/Memory Boards and I/O Boards can be re-assigned on-the-fly to other domains when needed. E.g.:

    $ moveboard SB2 -d BESSIEB

  • By hand:

    • Reallocate resources from operational and development domains to I&T domain for full load performance testing

    • Take boards off-line for maintenance – hot swap

  • Automatically by programs or monitoring software:

    • At times of peak loads, reallocate resources from Development and I&T domains to operational domain
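The automatic case could be driven by a small planning script. The sketch below is a dry run only: it decides which boards to move and prints command strings in the same form as the example on this slide. The domain names (`OPS`, `IT`, `DEV`), board labels, and minimum board counts are hypothetical, and the exact `moveboard` syntax should be checked against the system controller documentation before anything is executed.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Domain:
    name: str          # hypothetical domain name
    boards: List[str]  # boards currently assigned, e.g. "SB7"
    min_boards: int    # boards the domain must keep to stay usable


def plan_peak_reallocation(donors, target, boards_needed):
    """Return command strings that move spare donor boards to `target`.

    Donors are drained in the order given; each donor keeps its first
    `min_boards` boards. Nothing is executed here.
    """
    commands = []
    for donor in donors:
        for board in donor.boards[donor.min_boards:]:
            if len(commands) == boards_needed:
                return commands
            # Same form as the slide's example: moveboard <board> -d <domain>
            commands.append(f"moveboard {board} -d {target}")
    return commands


donors = [
    Domain("IT", ["SB7", "SB8", "SB9", "SB10", "SB11"], min_boards=1),
    Domain("DEV", ["SB0", "SB1"], min_boards=1),
]
for command in plan_peak_reallocation(donors, target="OPS", boards_needed=5):
    print(command)
```

Draining the I&T domain before the Development domain is a design choice of this sketch; the slide only says that both can give up resources at peak load.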


Data Processing Software Systems

  • Pre-archive processing & Ingest:

    • Science data receipt and processing: Science Pipelines (OPUS)

    • Engineering data receipt and processing (EDPS)

    • Archive Ingest

  • Distribution:

    • Archive distribution (DADS)

    • On-the-fly reprocessing (OTFR)

  • Calibration:

    • Calibration pipeline and database

  • Database servers:

    • Pipeline Processing, Ingest/Distribution DB “CATLOG”

    • Archive Catalog Browsing DB “ZEPPO”

  • Will not support user interfaces: StarView, Web, APT


Number of Domains

  • High level requirements:

    • Separate Development, Integration & Test and Operational environments

    • Protect Ingest from Distribution

    • Respond to user community

  • Other requirements:

    • Separate Pipeline computing from Database servers

    • Separate DB for external users (ZEPPO) from internal operational DB (CATLOG)

    • Isolate OS and COTS testing, patching

  • Maximum number of domains: 3 × (2 + 2) + 1 = 13

    • BUT: more domains = more fragmentation = less flexibility

    • Must balance flexibility with need for isolation
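One reading of the maximum, sketched below: three environments, each split into two pipeline functions and two databases, plus one domain for OS and COTS testing. The slide gives only the formula, so the labels attached to each factor here are an interpretation of the requirements listed above.

```python
from itertools import product

# Interpretation of 3 x (2 + 2) + 1; the labels are assumptions.
environments = ["Development", "Integration & Test", "Operational"]
functions = ["Ingest", "Distribution", "CATLOG", "ZEPPO"]

# One fully isolated domain per (environment, function) pair ...
domains = [f"{env} / {fn}" for env, fn in product(environments, functions)]
# ... plus one domain for OS and COTS testing and patching.
domains.append("OS & COTS testing")

print(len(domains))
```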


Number of Domains (cont.)

  • For Databases, combine development and Integration & Test domains for CATLOG and ZEPPO (saves 3 domains)

  • Combine Pre-archive processing & Ingest with Distribution (saves 3 domains):

    • Use similar processing pipelines

    • Protect Ingest from Distribution by dynamically adding resources when needed

    • Protect Ingest from Distribution by binding Ingest processes to dedicated CPUs and memory resources

    • Protect Ingest from Distribution using new features in DADS 10.*

    • Closely monitor performance and fall back to 2 separate domains as contingency plan

  • Number of domains = 13 – 3 – 3 = 7
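The reduction from 13 to 7 can be checked by applying the two merges to the fully separated layout. This is a sketch under the same interpretation of the formula (three environments, two pipeline functions, two databases); the merged domain names are illustrative, not the names used on the system.

```python
from itertools import product

ENVS = ["Development", "Integration & Test", "Operational"]
PIPELINES = ["Ingest", "Distribution"]
DATABASES = ["CATLOG", "ZEPPO"]


def merged_name(env, function):
    """Map a fully separated (environment, function) pair to its merged domain."""
    if function in PIPELINES:
        # Ingest and Distribution share one domain per environment (saves 3).
        return f"{env} pipelines"
    if env == "Operational":
        # The operational databases stay isolated from each other.
        return f"Operational {function}"
    # Development and I&T for both databases share one domain (saves 3).
    return "Database development/test"


separated = list(product(ENVS, PIPELINES + DATABASES))
merged = sorted({merged_name(env, fn) for env, fn in separated} | {"OS & COTS testing"})

print(len(separated) + 1, "->", len(merged))
for name in merged:
    print(" ", name)
```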


The 7 Domains


Domain Resources

  • CPU/Memory Boards:

    • 4 CPUs / board

    • CPUs run at 1.05GHz

    • 1-8GB of memory / CPU = 4-32GB / board

  • I/O Boards:

    • Provide external connections to SAN, network, disks

    • 4 PCI slots / board

    • 2 slots @ 33MHz

    • 2 slots @ 66MHz


1: Development domain

  • Supports Development Teams for:

    • OPUS (EDPS, OTFR, etc.)

    • DADS 10.*

    • IRAF/STSDAS

    • Calibration pipelines

    • Calibration reference data

  • Today (combined with Testing), excluding desktops:

    • Tru64, Solaris: ~9-13 CPUs, <1GB/CPU, 500MHz

  • Domain requirements:

    • 2 CPU Boards, 8 CPUs, 4GB/CPU

    • 1 I/O Board (not mission critical)


2: DADS/OPUS/OTFR domain

  • Compare Pre-archive pipelines, Ingest and Distribution Performance Requirements with current performance

  • Use outcome to scale current resources to domain requirements, accounting for faster CPUs and the new hardware architecture

  • Account for new software architecture of DADS 10.*

  • Account for lack of modeling by adding a safety margin

  • Account for projected growth:

    • Short term: Distribution (ACS)

    • Intermediate: New Algorithms

    • Longer: Pre-archive pipelines, Ingest (SM4)

    • Overall increased use of 20%/year
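To give a feel for what 20%/year means for sizing, the sketch below projects a relative load over a few years. It assumes the growth compounds annually, which the slide does not state; the function names are illustrative.

```python
import math

ANNUAL_GROWTH = 0.20  # "Overall increased use of 20%/year"


def projected_load(baseline, years, rate=ANNUAL_GROWTH):
    """Load after `years`, assuming growth compounds annually (an assumption)."""
    return baseline * (1.0 + rate) ** years


def years_to_double(rate=ANNUAL_GROWTH):
    """Years until the load doubles under the same compounding assumption."""
    return math.log(2.0) / math.log(1.0 + rate)


for year in range(4):
    print(year, round(projected_load(1.0, year), 3))
print("doubling time (years):", round(years_to_double(), 1))
```

Under this assumption the load roughly doubles in a little under four years, which is one way to judge how much headroom a domain needs.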


2: DADS/OPUS/OTFR domain (cont.)

  • Today:

    • Baseline pre-archive processing and Ingest performance is within requirements

      • Remember: failures addressed by Architecture

    • Baseline Distribution & OTFR performance is barely within requirements

    • Current systems are maxed out.


2: DADS/OPUS/OTFR domain (cont.)

  • Today:

    • Tru64 cluster: 12 CPUs @ 500MHz, 1GB/CPU

    • 1 Sun 280R: 2 CPUs @ 750MHz (EDPS)

    • 3 OpenVMS systems: 1 CPU @ 250MHz, 0.5-1.5GB

  • Domain CPU/Memory requirement:

    • 9CPUs @ 1GHz, 4GB/CPU

  • New software architecture requirements (DADS 10.*)

    • 6CPUs @ 1GHz, 4GB/CPU

  • Short term growth, ACS + 20%

    • 3CPUs @ 1GHz, 4GB/CPU

  • Margin

    • 2CPUs @ 1GHz, 4GB/CPU
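The four CPU figures above add up to the domain total. A minimal sketch of the sum and the resulting board count, using the slide's numbers and 4 CPUs per board (the dictionary keys are paraphrased labels):

```python
import math

CPUS_PER_BOARD = 4
GB_PER_CPU = 4

# CPU counts as listed on the slide, all at 1GHz with 4GB/CPU.
cpu_requirements = {
    "current workload on new hardware": 9,
    "new software architecture (DADS 10.*)": 6,
    "short-term growth (ACS + 20%)": 3,
    "margin": 2,
}

total_cpus = sum(cpu_requirements.values())
boards = math.ceil(total_cpus / CPUS_PER_BOARD)
memory_gb = total_cpus * GB_PER_CPU

print(f"{total_cpus} CPUs -> {boards} CPU boards, {memory_gb} GB memory")
```

This gives 20 CPUs on 5 CPU boards.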


2: DADS/OPUS/OTFR domain (cont.)

  • Total Domain CPU/Memory Requirements:

    • 5 CPU Boards, 20CPUs @1GHz, 4GB/CPU

  • Total Domain I/O requirements:

    • Operational, so redundant: 2 I/O Boards

    • Can be multiplexed if necessary for performance

  • Remember: Dynamic domains:

    • We can re-assign resources on-the-fly, esp. from I&T domain to handle peak loads, longer term fluctuations


3: Integration & Test domain

  • Realistic end-to-end load and performance testing

    • Identical to operational DADS/OPUS/OTFR domain

    • Today: non-existent

  • Domain Requirements:

    • 5 CPU Boards, 20 CPUs @ 1GHz, 4GB/CPU

  • Remember: Dynamic domains

    • Full-load performance tests happen regularly, but not daily

    • Full-load performance tests are highly controlled, discrete, and scheduled events

    • I&T resources can be re-assigned to e.g. DADS/OPUS/OTFR domain when not needed


4,5,6: Database domains

More details in afternoon “Databases” presentation

  • Operational DB, CATLOG

    • Today: 4CPUs @ 300MHz, 0.5GB total

    • Anticipate increased load because of faster pipelines, new instruments

    • Domain Requirements:

      • 2 CPU Boards, 8 CPUs @ 1GHz, 2GB/CPU

      • 2 I/O Boards (redundancy)

  • Archive Catalog Browsing DB, ZEPPO

    • Today: 2CPUs @ 300MHz, 0.6GB total

    • Domain Requirements:

      • 1 CPU Board, 4 CPUs @ 1GHz, 2GB/CPU

      • 2 I/O Boards (redundancy)

  • Development, test

    • Today: 2*2CPUs @ 200MHz, 1GB total

    • Domain Requirements

      • 1 CPU Board, 4 CPUs @ 1GHz, 2GB/CPU

      • 1 I/O Board


7: OS & COTS testing, patches

  • Test next version of OS

  • Test patches, COTS upgrades, system procedures

  • Today: n.a. or scattered

  • Domain Requirements:

    • 1 CPU Board, 4 CPUs @ 1GHz, 4GB/CPU

    • 1 I/O board (not mission critical)

  • Remember: Dynamic domains

    • It is possible to shut down this domain when not needed

    • Reassign resources e.g. to DADS/OPUS/OTFR domain
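With all seven domains specified, the CPU board and memory requirements can be tallied against the 18-board chassis limit from the feature list. The sketch below uses only the per-domain figures quoted on the preceding slides; I/O boards are left out because the slides do not state a count for the I&T domain. The short domain labels are illustrative.

```python
CPUS_PER_BOARD = 4
MAX_CPU_BOARDS = 18  # chassis limit from the feature list

# (domain, CPU boards, GB per CPU) as given in the domain requirements.
domains = [
    ("1 Development", 2, 4),
    ("2 DADS/OPUS/OTFR", 5, 4),
    ("3 Integration & Test", 5, 4),
    ("4 CATLOG", 2, 2),
    ("5 ZEPPO", 1, 2),
    ("6 Database development/test", 1, 2),
    ("7 OS & COTS testing", 1, 4),
]

total_boards = sum(boards for _, boards, _ in domains)
total_cpus = total_boards * CPUS_PER_BOARD
total_memory_gb = sum(boards * CPUS_PER_BOARD * gb for _, boards, gb in domains)
free_board_slots = MAX_CPU_BOARDS - total_boards

print(f"CPU boards: {total_boards} of {MAX_CPU_BOARDS} ({free_board_slots} free)")
print(f"CPUs: {total_cpus}, memory: {total_memory_gb} GB")
```

The tally comes to 17 of 18 CPU boards, 68 CPUs, and 240GB of memory, leaving one free CPU/Memory board slot.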


SunFire 15K Nominal Domain Layout


SunFire 15K Peak-load domain layout


Future growth

  • Today (contingencies):

    • Add 1 CPU/Memory Board, 4CPUs

    • Add 8 I/O Boards or 16 “MaxCPU” CPUs

    • Add 300GB of RAM

    • Upgrade to 1.2GHz CPUs

  • One to two years:

    • Double number of CPUs: 8 CPUs / board

    • Increased CPU clock speed

  • All within the box


Process RAS Features

  • STScI Administration and Software Configuration RAS Features

    • Sun Management Center: ease of management, monitoring and capturing of system performance metrics

    • Use dynamic server domains to keep the science flowing

      • Ability to prioritize processing in the event of a problem


Risk factors and mitigation

  • Schedule slippage risk mitigation:

    • Contract imposes penalty for late delivery

    • Decouple Database migration from milestones

    • Can keep old equipment past end of project

    • Loaner system to get head start

  • Technical risk mitigation:

    • Use loaner to detect issues, find solutions early

    • Extensive staff training included in contract to mitigate new technology risks

  • Operational risks and mitigations are discussed in later presentations


Current Status

  • Order placed Feb 3; expected time of arrival Mar 4

  • Loaner up and running with two domains

  • Training started

  • Completed site survey and preparation (power, floor, environment)

  • Started interviewing operations staff, engineers, support staff and scientists to refine use model (later presentation)


High level Schedule

  • Initial domain design, March 12

  • System setup & integration, March 24

    • Physical setup, power

    • Network

    • Sun’s Application Readiness Process

  • System benchmarks

  • Domain configuration, May 1

    • OS Install, patch, institutionalize

    • Test backup/recovery, SMC, basic reporting

    • 3rd party software

    • Documentation, review

    • Clone other domains


High level Schedule (cont.)

  • Full system tests

    • Run benchmarks to establish system baseline

  • Develop procedures

    • System, account management

    • Backup/restore, disaster recovery

    • Train support staff

  • Hand over 3 Development, I&T and Operational domains to ESS, May 22

  • Database domain configuration

    • Customize OS, 3rd party applications

  • Hand over 3 DB domains to ESS, June 6


Schedule
