
Gordon: NSF Flash-based System for Data-intensive Science






Presentation Transcript


  1. Gordon: NSF Flash-based System for Data-intensive Science. Mahidhar Tatineni, 37th HPC User Forum, Seattle, WA, Sept. 14, 2010. PIs: Michael L. Norman, Allan Snavely, San Diego Supercomputer Center

  2. What is Gordon?
  • A “data-intensive” supercomputer based on SSD flash memory and virtual shared memory software
  • Emphasizes memory capacity and IOPS over FLOPS
  • A system designed to accelerate access to the massive databases being generated in all fields of science, engineering, medicine, and social science
  • The NSF’s most recent Track 2 award to the San Diego Supercomputer Center (SDSC)
  • Coming summer 2011

  3. Michael L. Norman, Principal Investigator and Director, SDSC; Allan Snavely, Co-Principal Investigator and Project Scientist

  4. Why Gordon?
  • Growth of digital data is exponential: a “data tsunami”
  • Driven by advances in digital detectors, networking, and storage technologies
  • Making sense of it all is the new imperative:
    • data analysis workflows
    • data mining
    • visual analytics
    • multiple-database queries
    • data-driven applications

  5. Cosmological Dark Energy Surveys: the accelerating universe

  6. Gordon is designed specifically for data-intensive HPC applications: those involving “very large data-sets or very large input-output requirements” (NSF Track 2D RFP). Two data-intensive application classes are important and growing:
  • Data mining: “the process of extracting hidden patterns from data… with the amount of data doubling every three years, data mining is becoming an increasingly important tool to transform this data into information.” (Wikipedia)
  • Data-intensive predictive science: solution of scientific problems via simulations that generate large amounts of data

  7. Red Shift: data keeps moving further away from the CPU with every turn of Moore’s Law. (Disk access time data due to Dean Klein of Micron.)

  8. The Memory Hierarchy of a Typical HPC Cluster (diagram: shared memory programming, message passing programming, a latency gap, disk I/O)

  9. The Memory Hierarchy of Gordon (diagram: shared memory programming, disk I/O)

  10. Gordon Architecture: “Supernode”
  • 32 Appro Extreme-X compute nodes, each with dual-processor Intel Sandy Bridge, 240 GFLOPS, and 64 GB RAM
  • 2 Appro Extreme-X I/O nodes, each with Intel SSD drives: 4 TB and 560,000 IOPS per node
  • ScaleMP vSMP virtual shared memory: 2 TB RAM aggregate, 8 TB SSD aggregate
  (diagram: compute nodes and I/O nodes joined by vSMP memory virtualization)
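
A back-of-the-envelope check of the aggregates quoted on this slide, using only the per-node figures listed above; a minimal sketch, not a capability statement beyond those numbers.

```python
# Back-of-the-envelope check of the supernode aggregates quoted on this slide,
# using only the per-node figures above (240 GFLOPS and 64 GB per compute
# node; 4 TB of SSD and 560,000 IOPS per I/O node).
compute_nodes, io_nodes = 32, 2
gflops_per_node, ram_gb_per_node = 240, 64
ssd_tb_per_io_node, iops_per_io_node = 4, 560_000

print(f"peak compute  : {compute_nodes * gflops_per_node / 1000:.2f} TFLOPS")
print(f"aggregate RAM : {compute_nodes * ram_gb_per_node / 1024:.1f} TB")   # 2 TB, as quoted
print(f"aggregate SSD : {io_nodes * ssd_tb_per_io_node} TB")                # 8 TB, as quoted
print(f"aggregate IOPS: {io_nodes * iops_per_io_node:,}")
```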

  11. Gordon Architecture: Full Machine
  • 32 supernodes = 1024 compute nodes
  • Dual-rail QDR InfiniBand network, 3D torus (4x4x4)
  • 4 PB rotating-disk parallel file system, >100 GB/s
  (diagram: 32 supernodes and disk arrays on the torus network)
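
The slide lists a 4x4x4 3D torus interconnect. Below is a minimal sketch of that torus neighbor relation; how the 32 supernodes and storage attach to the torus positions is not specified here, so only the wraparound topology itself is illustrated.

```python
# Minimal sketch of a 4x4x4 3D torus neighbor relation, the interconnect
# topology listed on this slide. How the 32 supernodes attach to the torus
# positions is not specified here, so only the wraparound topology is shown.
from itertools import product

DIM = 4
positions = list(product(range(DIM), repeat=3))       # 4*4*4 = 64 torus positions

def neighbors(coord):
    """Six nearest neighbors with periodic (torus) wraparound."""
    x, y, z = coord
    steps = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
    return [((x + dx) % DIM, (y + dy) % DIM, (z + dz) % DIM) for dx, dy, dz in steps]

print(len(positions), "torus positions, each with", len(neighbors((0, 0, 0))), "links")
print("full machine:", 32 * 32, "compute nodes in 32 supernodes")
```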

  12. Gordon Aggregate Capabilities

  13. Dash: a working prototype of Gordon (available as a TeraGrid resource)

  14. DASH Architecture (diagram): 16 compute nodes, each with 2 Nehalem quad-core CPUs and 48 GB DDR3 memory, connected via InfiniBand to an I/O node (CPU & memory plus RAID controllers/HBAs, i.e. Host Bus Adapters) holding 16 Intel® X25-E 64 GB flash drives

  15. DASH Architecture: just a cluster? (same diagram: 16 compute nodes and an I/O node on InfiniBand)

  16. DASH Architecture (diagram): four supernodes, each a vSMP system with 128 cores, 768 GB DRAM, and 1 TB of flash drives, built from 16 compute nodes plus an I/O node on InfiniBand

  17. I/O Node: Original Configuration. Achieved only 15% of the upper bound after exhaustive tuning; the embedded processor is the bottleneck. (diagram: software RAID 0 across two hardware RAID 0 controllers, each with 8 × 64 GB flash drives)

  18. I/O Node: New Configuration. Achieved up to 80% of the upper bound. Peripheral hardware designed for spinning disks cannot satisfy flash drives! (diagram: software RAID 0 across four HBAs, each with 4 × 64 GB flash drives)
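
A rough sketch of the upper-bound comparison on slides 17–18. The aggregate upper bound is assumed here to be per-drive random-read IOPS times the number of drives, using roughly 35,000 IOPS per drive (the commonly quoted Intel X25-E figure; an assumption, not a number from these slides); the 15% and 80% fractions are the measured results reported above.

```python
# Rough sketch of the upper-bound comparison on slides 17-18. The upper bound
# is taken to be per-drive random-read IOPS times the number of drives, using
# ~35,000 IOPS/drive (the commonly quoted Intel X25-E figure; an assumption
# here). The 15% and 80% fractions are the measured results from these slides.
drives = 16
iops_per_drive = 35_000
upper_bound = drives * iops_per_drive            # 560,000 IOPS aggregate

for config, fraction in [("original: 2 hardware RAID controllers", 0.15),
                         ("new: 4 HBAs + software RAID 0", 0.80)]:
    print(f"{config:40s} ~{fraction * upper_bound:>9,.0f} of {upper_bound:,} IOPS")
```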

  19. I/O System Configuration
  • CPU: 2 Intel® Nehalem quad-core, 2.4 GHz
  • Memory: 48 GB DDR3
  • Flash drives: 16 Intel® X25-E 64 GB SLC SATA
  • Benchmarks: IOR and XDD
  • File system: XFS

  20. DASH I/O Node Testing*
  • Distributed shared memory + flash drives on the prototype system, DASH
  • Design space exploration with 16,800 tests: stripe sizes, stripe widths, file systems, I/O schedulers, queue depths
  • Revealed hardware and software issues: the RAID controller processor, MSI per-vector masking, file systems that cannot handle high IOPS, I/O schedulers designed for spinning disks
  • Tuning during testing improved performance by about 9x
  * Detailed info in the TeraGrid 2010 paper “DASH-IO: an Empirical Study of Flash-based IO for HPC,” Jiahua He, Jeffrey Bennett, and Allan Snavely
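
An illustrative sketch of how a parameter sweep over the factors listed above multiplies into thousands of tests. The concrete parameter values below are placeholders, not the actual DASH test matrix; see the TeraGrid 2010 paper for the real 16,800-test study.

```python
# Illustrative sketch of a design-space sweep like the one described above.
# The parameter values below are placeholders, not the actual DASH test
# matrix; see the TeraGrid 2010 paper for the real 16,800-test study.
from itertools import product

stripe_sizes  = [64, 128, 256, 512, 1024]        # KB
stripe_widths = [2, 4, 8, 16]
file_systems  = ["xfs", "ext3", "ext4"]
io_schedulers = ["noop", "deadline", "cfq", "anticipatory"]
queue_depths  = [1, 4, 16, 64]

configs = list(product(stripe_sizes, stripe_widths, file_systems,
                       io_schedulers, queue_depths))
print(len(configs), "configurations before benchmark variants and repeats")
# Each configuration would then be exercised with an I/O benchmark such as
# IOR or XDD, and the best-performing combination retained.
```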

  21. The I/O Tuning Story

  22. DASH vSMP Node Tests
  • STREAM memory benchmark
  • GAMESS standard benchmark problem
  • Early user examples:
    • Genomic sequence assembly code (Velvet), successfully run on Dash with 178 GB of memory used
    • CMMAP cloud data analysis with data transposition in memory; early tests run with ~130 GB of data
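
For orientation, a triad-style memory-bandwidth estimate in the spirit of the STREAM benchmark mentioned above. The official STREAM code is C/OpenMP; this single-process NumPy sketch only illustrates the kernel it measures and is not what was run on the vSMP node.

```python
# Triad-style memory-bandwidth estimate in the spirit of the STREAM benchmark
# mentioned above. The official STREAM code is C/OpenMP; this single-process
# NumPy sketch only illustrates the kernel it measures (a = b + scalar * c).
import time
import numpy as np

N = 20_000_000                        # ~160 MB per float64 array
b = np.random.rand(N)
c = np.random.rand(N)
a = np.empty_like(b)
scalar = 3.0

t0 = time.perf_counter()
np.multiply(c, scalar, out=a)         # a = scalar * c   (reads c, writes a)
np.add(a, b, out=a)                   # a = a + b        (reads a and b, writes a)
dt = time.perf_counter() - t0

bytes_moved = 5 * N * 8               # c read, a written, a read, b read, a written
print(f"triad-style bandwidth ~ {bytes_moved / dt / 1e9:.1f} GB/s")
```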

  23. DASH: Applications Using Flash as Scratch Space
  • Several HPC applications have a substantial per-core (local) I/O component, primarily scratch files; examples are Abaqus, NASTRAN, and QChem.
  • Standard Abaqus test cases (S2A1, S4B) were run on Dash to compare performance between local hard disk and SSDs.
  • *Lustre is not optimal for such I/O; the S4B test took 18,920 s.

  24. Dash wins the Storage Challenge at SC09

  25. Three Application Benchmarks
  • Massive graph traversal: semantic web search
  • Astronomical database queries: identification of transient objects
  • Biological networks database queries: protein interactions

  26. Graph Networks Search (BFS): breadth-first search on a 200 GB graph network
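
A minimal in-memory breadth-first-search sketch. The benchmark walked a ~200 GB graph whose adjacency data lived on flash; the traversal logic is the same, only the neighbor lookups become random, out-of-core reads, which is what the flash tier accelerates.

```python
# Minimal breadth-first-search sketch. The benchmark walked a ~200 GB graph
# whose adjacency data lived on flash; the traversal logic itself is the same,
# only the neighbor lookups become random (out-of-core) reads.
from collections import deque

def bfs(adj, source):
    """Return the BFS distance from `source` for every reachable vertex.
    `adj` maps each vertex to an iterable of its neighbors."""
    dist = {source: 0}
    frontier = deque([source])
    while frontier:
        u = frontier.popleft()
        for v in adj.get(u, ()):
            if v not in dist:           # first visit fixes the BFS level
                dist[v] = dist[u] + 1
                frontier.append(v)
    return dist

# Toy usage on a small graph:
adj = {0: [1, 2], 1: [3], 2: [3], 3: [4]}
print(bfs(adj, 0))                      # {0: 0, 1: 1, 2: 1, 3: 2, 4: 3}
```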

  27. NIH Biological Networks Pathway Analysis
  • Interaction networks or graphs occur in many disciplines, e.g. epidemiology, phylogenetics, and systems biology
  • Performance is limited by the latency of a database query and the aggregate amount of fast storage available
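
A sketch of why query latency dominates this workload: a pathway or neighborhood expansion issues many small queries, so total time scales with query count times per-query latency. The `interactions(src, dst)` table below is a hypothetical stand-in, not the actual biological-networks schema used in the benchmark.

```python
# Sketch of why query latency dominates a pathway/neighbor expansion: each hop
# issues many small queries, so total time scales as (query count) x (latency).
# The interactions(src, dst) table is a hypothetical stand-in, not the actual
# biological-networks schema used in the benchmark.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE interactions (src TEXT, dst TEXT)")
db.executemany("INSERT INTO interactions VALUES (?, ?)",
               [("A", "B"), ("A", "C"), ("B", "D"), ("C", "D"), ("D", "E")])

def expand(seed, hops):
    """Breadth-first neighbor expansion via repeated small point queries."""
    seen, frontier = {seed}, {seed}
    for _ in range(hops):
        nxt = set()
        for node in frontier:            # one small query per frontier member
            rows = db.execute("SELECT dst FROM interactions WHERE src = ?", (node,))
            nxt.update(dst for (dst,) in rows)
        frontier = nxt - seen
        seen |= frontier
    return seen

print(sorted(expand("A", hops=2)))       # ['A', 'B', 'C', 'D']
```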

  28. Biological Networks Query timing

  29. Palomar Transient Factory (PTF)
  • Nightly wide-field surveys using the Palomar Schmidt telescope
  • Image data sent to LBL for archive/analysis
  • 100 new transients every minute
  • Large, random queries across multiple databases for IDs

  30. PTF-DB Transient Search: random queries requesting very small chunks of data about the candidate observations
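
A sketch of the access pattern described here: many random reads of small chunks (one per candidate) versus a single sequential scan. Random small-block access is exactly what the flash tier accelerates most (see the conclusions slide); the small cached file below only illustrates the pattern, since on a dataset far larger than RAM the random case is where flash beats spinning disk.

```python
# Sketch of the access pattern behind the PTF transient search: many random
# reads of small chunks (one per candidate ID) versus one sequential scan.
# On this small cached file both cases are fast; on a dataset far larger than
# RAM, the random case is where flash beats spinning disk by the widest margin.
import os, random, tempfile, time

CHUNK = 4096
N_CHUNKS = 25_000                                # ~100 MB scratch file

with tempfile.NamedTemporaryFile(delete=False) as f:
    path = f.name
    f.write(os.urandom(CHUNK * N_CHUNKS))

with open(path, "rb") as f:
    t0 = time.perf_counter()
    while f.read(1 << 20):                       # sequential 1 MB reads
        pass
    seq = time.perf_counter() - t0

    t0 = time.perf_counter()
    for _ in range(5000):                        # random 4 KB reads
        f.seek(random.randrange(N_CHUNKS) * CHUNK)
        f.read(CHUNK)
    rnd = time.perf_counter() - t0

print(f"sequential scan: {seq:.3f} s   5000 random 4 KB reads: {rnd:.3f} s")
os.remove(path)
```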

  31. Trestles – Coming Soon!
  • Trestles will work with and span the deployments of SDSC’s recently introduced Dash system and the larger Gordon data-intensive system.
  • To be configured by SDSC and Appro, Trestles is based on quad-socket, 8-core AMD Magny-Cours compute nodes connected via a QDR InfiniBand fabric.
  • Trestles will have 324 nodes, 10,368 processor cores, a peak speed of 100 teraflop/s, and 38 terabytes of flash memory.
  • Each of the 324 nodes will have 32 cores, 64 GB of DDR3 memory, and 120 GB of flash memory.
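
A quick consistency check of the Trestles totals, derived only from the per-node figures quoted above.

```python
# Quick consistency check of the Trestles figures quoted above.
nodes = 324
cores_per_node = 32
flash_gb_per_node = 120

print(nodes * cores_per_node, "processor cores")                     # 10,368
print(round(nodes * flash_gb_per_node / 1000, 1), "TB of flash (quoted as 38 TB)")
```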

  32. Conclusions
  • Gordon’s architecture is customized for data-intensive applications, but built entirely from commodity parts
  • Basically a Linux cluster with large RAM per core, a large amount of flash SSD, and virtual shared memory software
  • 10 TB of storage accessible from a single core, with shared-memory parallelism for higher performance
  • The Dash prototype accelerates real applications by 2-100x relative to disk, depending on memory access patterns; random I/O is accelerated the most

  33. Dash – STREAM on vSMP (test results)
