Explore the architecture, RAS features, dynamic domains, and software systems of the SunFire 15K server for robust enterprise deployment.
SunFire 15K Enterprise-Grade Server: Implementation Review, March 14, 2003
Overview
• Introduction to the SunFire 15K architecture and concepts
• Hardware RAS features
• Dynamic domains
  • System requirements
  • Number of domains
  • Resources for each domain
  • Expansion
• Process RAS features
• Risk factors and risk mitigation
• Current status
• Schedule
Sun Fire 15K: "Highly redundant, symmetric multi-processing server with a shared memory architecture"
Features:
• 1 to 18 CPU/Memory boards
• 4 CPUs per board, at 1050 MHz
• 1 to 8 GB per CPU, up to 576 GB of total memory
• 1 to 18 I/O boards, 4 PCI slots per board
• I/O boards can be traded for extra CPU/Memory boards
• Partitionable into 1 to 18 dynamic domains
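The headline capacity figures above can be cross-checked with a quick calculation (board counts and per-CPU memory are taken from this slide; the variable names are just illustrative):

```python
# Maximum Sun Fire 15K configuration, per the figures on this slide.
CPU_BOARDS_MAX = 18      # 1 to 18 CPU/Memory boards
CPUS_PER_BOARD = 4       # 4 CPUs per board at 1050 MHz
GB_PER_CPU_MAX = 8       # 1 to 8 GB per CPU

total_cpus = CPU_BOARDS_MAX * CPUS_PER_BOARD   # 72 CPUs fully populated
total_memory_gb = total_cpus * GB_PER_CPU_MAX  # 576 GB, matching the slide

print(total_cpus, total_memory_gb)  # 72 576
```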
SunFire 15K: High RAS Features
Reliability:
• Fully redundant CPU/Memory boards, I/O boards, and PCI cards
• Dual system controllers
• Dual system clock
• Dual-grid power, redundant power supplies
• Redundant fans
• Environmental monitoring
Serviceability:
• Hot-swap CPU/Memory boards
• Hot-swap I/O boards and PCI components
• Hot-swap system controller
• Hot-swap power supplies
• Hot-swap fans
• Full remote diagnostics
Dynamic Domains
CPU/Memory boards and I/O boards can be reassigned on the fly to other domains as needed, e.g.:

  $ moveboard SB2 -d BESSIEB

By hand:
• Reallocate resources from the operational and development domains to the I&T domain for full-load performance testing
• Take boards offline for maintenance (hot swap)
Automatically, by programs or monitoring software:
• At times of peak load, reallocate resources from the Development and I&T domains to the operational domain
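The automatic case above could be driven by a small policy layer that watches domain load and emits `moveboard` requests. A minimal sketch follows; the thresholds, domain names, and `plan_moves` helper are all hypothetical, and the real mechanism remains the `moveboard` command shown above:

```python
# Illustrative monitoring-driven reallocation policy (hypothetical
# thresholds and names; the actual move is done with `moveboard`).
def plan_moves(load_by_domain, donor_boards, target, threshold=0.9):
    """Return (board, target) move requests when the target domain is
    overloaded and a donor domain has spare capacity."""
    moves = []
    if load_by_domain[target] > threshold:          # target at peak load
        for donor, boards in donor_boards.items():
            if load_by_domain.get(donor, 0.0) < 0.5 and boards:
                # would translate to: moveboard <board> -d <target>
                moves.append((boards[0], target))
    return moves

# Example: operational domain at peak load, I&T domain nearly idle.
moves = plan_moves({"OPS": 0.95, "IT": 0.2}, {"IT": ["SB2"]}, "OPS")
print(moves)  # [('SB2', 'OPS')]
```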
Data Processing Software Systems
• Pre-archive processing & ingest:
  • Science data receipt and processing: science pipelines (OPUS)
  • Engineering data receipt and processing (EDPS)
  • Archive ingest
• Distribution:
  • Archive distribution (DADS)
  • On-the-fly reprocessing (OTFR)
• Calibration:
  • Calibration pipeline and database
• Database servers:
  • Pipeline processing, ingest/distribution DB "CATLOG"
  • Archive catalog browsing DB "ZEPPO"
• Will not support user interfaces: StarView, Web, APT
Number of Domains
High-level requirements:
• Separate Development, Integration & Test, and Operational environments
• Protect Ingest from Distribution
• Respond to the user community
Other requirements:
• Separate pipeline computing from database servers
• Separate the DB for external users (ZEPPO) from the internal operational DB (CATLOG)
• Isolate OS and COTS testing and patching
Maximum number of domains: 3 * (2 + 2) + 1 = 13
BUT: more domains = more fragmentation = less flexibility
Must balance flexibility with the need for isolation
Number of Domains (cont.)
• For databases, combine the development and Integration & Test domains for CATLOG and ZEPPO (saves 3 domains)
• Combine pre-archive processing & ingest with distribution (saves 3 domains):
  • They use similar processing pipelines
  • Protect Ingest from Distribution by dynamically adding resources when needed
  • Protect Ingest from Distribution by binding Ingest processes to dedicated CPU and memory resources
  • Protect Ingest from Distribution using new features in DADS 10.*
  • Closely monitor performance and fall back to 2 separate domains as a contingency plan
• Number of domains = 13 - 3 - 3 = 7
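The domain arithmetic on these two slides can be written out explicitly. The grouping into two pipeline domains and two database domains per environment is my reading of the 3 * (2 + 2) + 1 formula; the savings figures come straight from the slide:

```python
# Maximum and final domain counts, as computed on the slides.
ENVIRONMENTS = 3       # Development, Integration & Test, Operational
PIPELINE_DOMAINS = 2   # Ingest vs. Distribution, kept separate
DB_DOMAINS = 2         # CATLOG vs. ZEPPO, kept separate
OS_TEST = 1            # OS & COTS testing domain

max_domains = ENVIRONMENTS * (PIPELINE_DOMAINS + DB_DOMAINS) + OS_TEST  # 13

saved_by_db_merge = 3        # merge dev and I&T database domains
saved_by_pipeline_merge = 3  # merge pre-archive/ingest with distribution
final_domains = max_domains - saved_by_db_merge - saved_by_pipeline_merge

print(max_domains, final_domains)  # 13 7
```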
The 7 Domains
Domain Resources
• CPU/Memory boards:
  • 4 CPUs per board
  • CPUs run at 1.05 GHz
  • 1 to 8 GB of memory per CPU (4 to 32 GB per board)
• I/O boards:
  • Provide external connections to SAN, network, and disks
  • 4 PCI slots per board: 2 slots at 33 MHz, 2 slots at 66 MHz
1: Development domain
• Supports development teams for:
  • OPUS (EDPS, OTFR, etc.)
  • DADS 10.*
  • IRAF/STSDAS
  • Calibration pipelines
  • Calibration reference data
• Today (combined with testing, excluding desktops):
  • Tru64, Solaris: ~9 to 13 CPUs, <1 GB/CPU, 500 MHz
• Domain requirements:
  • 2 CPU boards, 8 CPUs, 4 GB/CPU
  • 1 I/O board (not mission critical)
2: DADS/OPUS/OTFR domain
• Compare pre-archive pipeline, Ingest, and Distribution performance requirements with current performance
• Use the outcome to scale current resources to domain requirements, accounting for faster CPUs and the new architecture
• Account for the new software architecture of DADS 10.*
• Account for the lack of modeling with a safety margin
• Account for projected growth:
  • Short term: Distribution (ACS)
  • Intermediate: new algorithms
  • Longer term: pre-archive pipelines and Ingest (SM4)
  • Overall usage growth of 20% per year
2: DADS/OPUS/OTFR domain (cont.)
Today:
• Baseline pre-archive processing and Ingest performance is within requirements
  • Remember: failures are addressed by the architecture
• Baseline Distribution & OTFR performance is barely within requirements
• Current systems are maxed out
2: DADS/OPUS/OTFR domain (cont.)
• Today:
  • Tru64 cluster: 12 CPUs at 500 MHz, 1 GB/CPU
  • 1 Sun 280R: 2 CPUs at 750 MHz (EDPS)
  • 3 OpenVMS systems: 1 CPU at 250 MHz, 0.5 to 1.5 GB
• Domain CPU/memory requirement:
  • 9 CPUs at 1 GHz, 4 GB/CPU
• New software architecture requirements (DADS 10.*):
  • 6 CPUs at 1 GHz, 4 GB/CPU
• Short-term growth, ACS + 20%:
  • 3 CPUs at 1 GHz, 4 GB/CPU
• Margin:
  • 2 CPUs at 1 GHz, 4 GB/CPU
2: DADS/OPUS/OTFR domain (cont.)
• Total domain CPU/memory requirements:
  • 5 CPU boards, 20 CPUs at 1 GHz, 4 GB/CPU
• Total domain I/O requirements:
  • Operational, so redundant: 2 I/O boards
  • Can be multiplexed if necessary for performance
• Remember: dynamic domains
  • We can reassign resources on the fly, especially from the I&T domain, to handle peak loads and longer-term fluctuations
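The sizing of this domain is just the sum of the four line items from the previous slide, rounded up to whole CPU/Memory boards. A quick check (labels are mine, figures are from the slides):

```python
# DADS/OPUS/OTFR domain sizing, summed from the preceding slide.
requirements_cpus = {
    "baseline pipelines":      9,  # CPUs at 1 GHz
    "DADS 10.* architecture":  6,
    "short-term growth (ACS)": 3,
    "margin":                  2,
}
total_cpus = sum(requirements_cpus.values())  # 20 CPUs
cpus_per_board = 4
boards_needed = total_cpus // cpus_per_board  # 5 CPU/Memory boards
memory_gb = total_cpus * 4                    # at 4 GB per CPU

print(total_cpus, boards_needed, memory_gb)  # 20 5 80
```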
3: Integration & Test domain
• Realistic end-to-end load and performance testing
• Identical to the operational DADS/OPUS/OTFR domain
• Today: non-existent
• Domain requirements:
  • 5 CPU boards, 20 CPUs at 1 GHz, 4 GB/CPU
• Remember: dynamic domains
  • Full-load performance tests happen regularly, but not daily
  • Full-load performance tests are highly controlled, discrete, scheduled events
  • I&T resources can be reassigned, e.g. to the DADS/OPUS/OTFR domain, when not needed
4, 5, 6: Database domains (more details in the afternoon "Databases" presentation)
• Operational DB, CATLOG:
  • Today: 4 CPUs at 300 MHz, 0.5 GB total
  • Anticipate increased load because of faster pipelines and new instruments
  • Domain requirements: 2 CPU boards, 8 CPUs at 1 GHz, 2 GB/CPU; 2 I/O boards (redundancy)
• Archive catalog browsing DB, ZEPPO:
  • Today: 2 CPUs at 300 MHz, 0.6 GB total
  • Domain requirements: 1 CPU board, 4 CPUs at 1 GHz, 2 GB/CPU; 2 I/O boards (redundancy)
• Development and test:
  • Today: 2 × 2 CPUs at 200 MHz, 1 GB total
  • Domain requirements: 1 CPU board, 4 CPUs at 1 GHz, 2 GB/CPU; 1 I/O board
7: OS & COTS testing, patches
• Test the next version of the OS
• Test patches, COTS upgrades, and system procedures
• Today: not available, or scattered
• Domain requirements:
  • 1 CPU board, 4 CPUs at 1 GHz, 4 GB/CPU
  • 1 I/O board (not mission critical)
• Remember: dynamic domains
  • This domain can be shut down when not needed
  • Its resources can be reassigned, e.g. to the DADS/OPUS/OTFR domain
SunFire 15K Nominal Domain Layout
SunFire 15K Peak-Load Domain Layout
Future growth
• Today (contingencies):
  • Add 1 CPU/Memory board (4 CPUs)
  • Add 8 I/O boards or 16 "MaxCPU" CPUs
  • Add 300 GB of RAM
  • Upgrade to 1.2 GHz CPUs
• One to two years:
  • Double the number of CPUs: 8 CPUs per board
  • Increased CPU clock speed
• All within the box
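The "add 1 CPU/Memory board" contingency is consistent with adding up the per-domain board counts from the earlier slides against the 18-board chassis maximum. A sketch of that tally (the per-domain figures are from the requirement slides; the domain labels are mine):

```python
# CPU/Memory boards assigned across the seven domains versus the
# 18-board chassis maximum.
boards_by_domain = {
    "Development":        2,
    "DADS/OPUS/OTFR":     5,
    "Integration & Test": 5,
    "CATLOG DB":          2,
    "ZEPPO DB":           1,
    "DB dev/test":        1,
    "OS & COTS test":     1,
}
allocated = sum(boards_by_domain.values())  # 17 boards assigned
spare_slots = 18 - allocated                # 1 board of headroom

print(allocated, spare_slots)  # 17 1
```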
Process RAS Features
• STScI administration and software configuration RAS features
• Sun Management Center: ease of management, monitoring, and capture of system performance metrics
• Use dynamic server domains to keep the science flowing
• Ability to prioritize processing in the event of a problem
Risk factors and mitigation
• Schedule slippage risk mitigation:
  • Contract imposes a penalty for late delivery
  • Decouple database migration from milestones
  • Can keep old equipment past the end of the project
  • Loaner system to get a head start
• Technical risk mitigation:
  • Use the loaner to detect issues and find solutions early
  • Extensive staff training included in the contract to mitigate new-technology risks
• Operational risks and mitigations are discussed in later presentations
Current Status
• Order placed Feb 3; expected time of arrival Mar 4
• Loaner up and running with two domains
• Training started
• Completed site survey and preparation (power, floor, environment)
• Started interviewing operations staff, engineers, support staff, and scientists to refine the use model (later presentation)
High level Schedule
• Initial domain design: March 12
• System setup & integration: March 24
  • Physical setup, power
  • Network
  • Sun's Application Readiness Process
  • System benchmarks
• Domain configuration: May 1
  • OS install, patch, institutionalize
  • Test backup/recovery, SMC, basic reporting
  • 3rd-party software
  • Documentation, review
  • Clone other domains
High level Schedule (cont.)
• Full system tests
  • Run benchmarks to establish a system baseline
• Develop procedures
  • System and account management
  • Backup/restore, disaster recovery
  • Train support staff
• Hand over the 3 Development, I&T, and Operational domains to ESS: May 22
• Database domain configuration
  • Customize OS, 3rd-party applications
• Hand over the 3 DB domains to ESS: June 6
Schedule