Gordon: NSF Flash-based System for Data-intensive Science

Mahidhar Tatineni

37th HPC User Forum

Seattle, WA Sept. 14, 2010

PIs: Michael L. Norman, Allan Snavely

San Diego Supercomputer Center

What is Gordon?
  • A “data-intensive” supercomputer based on SSD flash memory and virtual shared-memory software
    • Emphasizes memory and IOPS over FLOPS
  • A system designed to accelerate access to the massive databases being generated in all fields of science, engineering, medicine, and social science
  • The NSF’s most recent Track 2 award to the San Diego Supercomputer Center (SDSC)
  • Coming Summer 2011

Michael L. Norman, Principal Investigator (Director, SDSC)

Allan Snavely, Co-Principal Investigator (Project Scientist)

Why Gordon?
  • Growth of digital data is exponential
    • “data tsunami”
  • Driven by advances in digital detectors, networking, and storage technologies
  • Making sense of it all is the new imperative
    • data analysis workflows
    • data mining
    • visual analytics
    • multiple-database queries
    • data-driven applications

Gordon is designed specifically for data-intensive HPC applications

Such applications involve “very large data-sets or very large input-output requirements” (NSF Track 2D RFP). Two data-intensive application classes are important and growing:
  • Data Mining: “the process of extracting hidden patterns from data… with the amount of data doubling every three years, data mining is becoming an increasingly important tool to transform this data into information.” (Wikipedia)
  • Data-Intensive Predictive Science: solution of scientific problems via simulations that generate large amounts of data
Red Shift: Data keeps moving further away from the CPU with every turn of Moore’s Law

[Figure: disk access time trend relative to CPU performance; data courtesy of Dean Klein, Micron]

The Memory Hierarchy of a Typical HPC Cluster

[Figure: memory hierarchy of a typical cluster, with shared-memory programming within a node, message-passing programming across nodes, and a latency gap between DRAM and disk I/O]

The Memory Hierarchy of Gordon

[Figure: Gordon's memory hierarchy, in which flash SSD sits between shared-memory programming (DRAM, extended by vSMP) and disk I/O, closing the latency gap]
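To make the latency gap concrete, here is a rough sketch with order-of-magnitude access times; these figures are illustrative assumptions, not numbers from the slides, but they show where flash sits between DRAM and spinning disk.

```python
# Rough, order-of-magnitude access latencies (assumed illustrative figures,
# not values from the slides) showing the gap flash SSDs fill between DRAM
# and spinning disk.
LATENCY_NS = {
    "L1 cache":  1,             # ~1 ns
    "DRAM":      100,           # ~100 ns
    "flash SSD": 100_000,       # ~100 microseconds
    "hard disk": 10_000_000,    # ~10 milliseconds
}

base = LATENCY_NS["L1 cache"]
for tier, ns in LATENCY_NS.items():
    print(f"{tier:9s} {ns:>12,d} ns  ({ns // base:,d}x L1 cache)")
```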

Gordon Architecture: “Supernode”
  • 32 Appro Extreme-X compute nodes
    • Dual processor Intel Sandy Bridge
      • 240 GFLOPS
      • 64 GB
  • 2 Appro Extreme-X IO nodes
    • Intel SSD drives
      • 4 TB each
      • 560,000 IOPS
  • ScaleMP vSMP virtual shared memory
    • 2 TB RAM aggregate
    • 8 TB SSD aggregate

[Figure: supernode schematic showing an I/O node with 4 TB of SSD and compute nodes (240 GF, 64 GB RAM each) bound together by vSMP memory virtualization]
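The aggregate figures above follow directly from the per-node numbers; a quick back-of-the-envelope check, using only values quoted on the slide:

```python
# Back-of-the-envelope check of the supernode aggregates (all inputs are the
# per-node figures quoted on the slide).
compute_nodes   = 32
gflops_per_node = 240        # GFLOPS per compute node
ram_per_node_gb = 64         # GB per compute node
io_nodes        = 2
ssd_per_io_tb   = 4          # TB of SSD per I/O node
iops_per_io     = 560_000    # IOPS per I/O node

print("peak compute  :", compute_nodes * gflops_per_node / 1000, "TFLOPS")  # 7.68
print("aggregate RAM :", compute_nodes * ram_per_node_gb / 1024, "TB")      # 2.0
print("aggregate SSD :", io_nodes * ssd_per_io_tb, "TB")                    # 8
print("aggregate IOPS:", io_nodes * iops_per_io)                            # 1,120,000
```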

Gordon Architecture: Full Machine
  • 32 supernodes = 1024 compute nodes
  • Dual-rail QDR InfiniBand network
    • 3D torus (4x4x4)
  • 4 PB rotating disk parallel file system
    • >100 GB/s

[Figure: full-machine layout of 32 supernodes (SN) on the dual-rail QDR InfiniBand 3D torus, with attached disk storage (D) for the parallel file system]
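The 4x4x4 3D torus gives every switch position six links, with wraparound in each dimension. The sketch below only illustrates the addressing; the slide does not specify how the 32 supernodes attach to the 64 torus positions or how the cabling is laid out.

```python
# Minimal 4x4x4 3D-torus addressing sketch: each position has six neighbors,
# with wraparound in every dimension. Illustrative only; actual cabling and
# supernode attachment are not described on the slide.
DIM = 4

def torus_neighbors(x, y, z, dim=DIM):
    """Return the six torus neighbors of position (x, y, z)."""
    steps = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
    return [((x + dx) % dim, (y + dy) % dim, (z + dz) % dim) for dx, dy, dz in steps]

print(torus_neighbors(0, 0, 0))       # negative steps wrap around to coordinate 3
print(len(torus_neighbors(1, 2, 3)))  # always 6 links per position
```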

DASH Architecture

[Figure: DASH hardware layout. Compute nodes (two Nehalem quad-core CPUs and 48 GB of DDR3 memory each) and an I/O node are connected over InfiniBand; the I/O node's CPU and memory drive RAID controllers/HBAs (Host Bus Adapters) attached to 16 Intel® X25-E 64 GB flash drives.]

DASH Architecture

Just a cluster?

[Figure: the same hardware viewed as a conventional cluster: compute nodes and an I/O node with 16 flash drives, connected by InfiniBand]

DASH Architecture

[Figure: with vSMP, the cluster is aggregated into four supernodes, each presenting 128 cores, 768 GB of DRAM, and 1 TB of flash (the 16 drives behind its I/O node) as a single system image over InfiniBand]

I/O Node: Original Configuration

Achieved only 15% of the upper bound after exhaustive tuning; the RAID controller's embedded processor is the bottleneck.

[Figure: original I/O-node layout, with software RAID 0 layered over two hardware RAID 0 sets; each RAID controller drives 8 of the 16 64 GB flash drives]

I/O Node: New Configuration

Achieved up to 80% of the upper bound; peripheral hardware designed for spinning disks cannot keep up with flash drives.

[Figure: new I/O-node layout, with software RAID 0 striped directly across four HBAs, each attached to 4 of the 16 64 GB flash drives; no hardware RAID in the path]
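To put the 15% and 80% figures in perspective, here is a rough calculation using assumed per-drive Intel X25-E spec-sheet numbers (roughly 250 MB/s sequential read and 35,000 4 KB random-read IOPS per drive; the slides themselves do not quote these values).

```python
# Rough upper-bound arithmetic for 16 flash drives, using assumed Intel X25-E
# spec-sheet figures (~250 MB/s sequential read, ~35,000 random-read IOPS per
# drive). The 15% / 80% fractions are the achieved shares reported above.
drives         = 16
mb_s_per_drive = 250
iops_per_drive = 35_000

upper_mb_s = drives * mb_s_per_drive     # ~4,000 MB/s aggregate
upper_iops = drives * iops_per_drive     # ~560,000 IOPS aggregate

for label, frac in [("original (hardware RAID) config", 0.15),
                    ("new (HBA-only) config", 0.80)]:
    print(f"{label}: ~{frac * upper_mb_s:,.0f} MB/s, ~{frac * upper_iops:,.0f} IOPS")
```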

I/O System Configuration
  • CPU: 2 Intel® Nehalem quad-core, 2.4 GHz
  • Memory: 48 GB DDR3
  • Flash drives: 16 Intel® X25-E 64 GB SLC SATA
  • Benchmarks: IOR and XDD
  • File system: XFS
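The actual measurements used IOR and XDD; the sketch below only illustrates the style of test (timed 4 KiB reads at random offsets against a file on the XFS flash array). The path and counts are placeholders, and OS page caching will inflate the result unless the file is much larger than RAM.

```python
# Minimal random-read sketch in the spirit of an IOR/XDD run: time 4 KiB reads
# at random offsets within a test file. PATH and N_READS are placeholders;
# the page cache will inflate results unless the file far exceeds RAM.
import os
import random
import time

PATH    = "/flash/testfile"   # hypothetical file on the XFS-formatted flash array
BLOCK   = 4096                # 4 KiB request size
N_READS = 100_000

fd = os.open(PATH, os.O_RDONLY)
size = os.fstat(fd).st_size
start = time.time()
for _ in range(N_READS):
    os.pread(fd, BLOCK, random.randrange(0, size - BLOCK))
elapsed = time.time() - start
os.close(fd)

print(f"~{N_READS / elapsed:,.0f} random 4 KiB read IOPS")
```
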
DASH I/O Node Testing*
  • Distributed shared memory + flash drives
  • Prototype system: DASH
  • Design space exploration with 16,800 tests (see the sweep sketch after the citation below)
    • Stripe sizes
    • Stripe widths
    • File systems
    • I/O schedulers
    • Queue depths
  • Revealed hardware and software issues
    • RAID controller processor
    • MSI per-vector masking
    • File system cannot handle high IOPS
    • I/O schedulers designed for spinning disks
  • Tuning during testing improved performance by about 9x.

* Detailed info in the TeraGrid 2010 paper: Jiahua He, Jeffrey Bennett, and Allan Snavely, “DASH-IO: an Empirical Study of Flash-based IO for HPC.”
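The sweep sketch referenced above: the test matrix is simply the Cartesian product of the tuning parameters. The parameter values below are illustrative placeholders, not the actual combinations that produced the 16,800 tests.

```python
# Sketch of a design-space sweep over I/O tuning parameters. The value lists
# are illustrative placeholders, not the actual matrix behind the 16,800 tests.
import itertools

stripe_sizes_kib = [64, 128, 256, 512, 1024]
stripe_widths    = [2, 4, 8, 16]            # number of drives in the stripe
filesystems      = ["xfs", "ext3"]
io_schedulers    = ["noop", "deadline", "cfq"]
queue_depths     = [1, 4, 16, 64]

configs = list(itertools.product(stripe_sizes_kib, stripe_widths,
                                 filesystems, io_schedulers, queue_depths))
print(len(configs), "configurations to benchmark")
for size_kib, width, fs, sched, qd in configs[:3]:
    print(f"stripe={size_kib}KiB width={width} fs={fs} scheduler={sched} qd={qd}")
```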

DASH vSMP Node Tests
  • STREAM memory benchmark
  • GAMESS standard benchmark problem
  • Early user examples
    • Genomic sequence assembly code (Velvet), successfully run on Dash using 178 GB of memory.
    • CMMAP cloud data analysis with data transposition in memory; early tests run with ~130 GB of data.
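The vSMP memory tests used the standard STREAM benchmark; the sketch below only shows what the triad kernel measures (sustained bandwidth for a[i] = b[i] + scalar*c[i]), with NumPy standing in for the C benchmark.

```python
# STREAM-triad-style sketch: measure sustained memory bandwidth for
# a[i] = b[i] + scalar * c[i]. NumPy stands in for the real STREAM benchmark,
# and its temporaries add some traffic beyond the nominal byte count.
import time
import numpy as np

n = 20_000_000                      # three float64 arrays of ~160 MB each
b = np.random.rand(n)
c = np.random.rand(n)
a = np.empty_like(b)
scalar = 3.0

start = time.time()
a[:] = b + scalar * c               # the triad kernel
elapsed = time.time() - start

bytes_moved = 3 * n * 8             # nominal triad traffic: read b, read c, write a
print(f"triad bandwidth ~ {bytes_moved / elapsed / 1e9:.1f} GB/s")
```
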
DASH – Applications using flash as scratch space
  • Several HPC applications have a substantial per-core (local) I/O component, primarily scratch files; examples include Abaqus, NASTRAN, and QChem.
  • Standard Abaqus test cases (S2A1, S4B) were run on Dash to compare performance between local hard disk and SSDs.

*Lustre is not optimal for such I/O; the S4B test took 18,920 s.
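The real comparison ran the Abaqus S2A1 and S4B cases against local hard disk and SSD scratch directories; the sketch below only illustrates the idea of timing scratch-style write/read traffic on two mount points. Both paths are hypothetical.

```python
# Rough sketch of comparing scratch-file throughput on two mount points
# (e.g. local hard disk vs. SSD). Paths are hypothetical; the re-read will be
# served largely from the page cache unless the file exceeds available RAM.
import os
import time

def scratch_rate_mb_s(directory, mib=1024, chunk=1 << 20):
    """Write then re-read `mib` MiB of scratch data; return combined MB/s."""
    path = os.path.join(directory, "scratch.tmp")
    buf = os.urandom(chunk)
    start = time.time()
    with open(path, "wb") as f:
        for _ in range(mib):
            f.write(buf)
        f.flush()
        os.fsync(f.fileno())
    with open(path, "rb") as f:
        while f.read(chunk):
            pass
    elapsed = time.time() - start
    os.remove(path)
    return 2 * mib / elapsed

for mount in ("/scratch-hdd", "/scratch-ssd"):   # hypothetical mount points
    print(mount, f"~{scratch_rate_mb_s(mount):,.0f} MB/s")
```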

3 Application Benchmarks
  • Massive graph traversal
    • Semantic web search
  • Astronomical database queries
    • Identification of transient objects
  • Biological networks database queries
    • Protein interactions
Graph Networks Search (BFS)
  • Breadth-first search on a 200 GB graph network
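A minimal BFS sketch shows the access pattern this benchmark stresses: each frontier expansion issues many small, effectively random lookups into the graph, which is exactly the kind of I/O that flash-backed memory accelerates. The toy graph is illustrative only.

```python
# Minimal breadth-first search over an adjacency-list graph. Each frontier
# expansion performs many small, effectively random lookups, which is the
# access pattern the 200 GB benchmark stresses. The toy graph is illustrative.
from collections import deque

def bfs_levels(adj, source):
    """Return {vertex: hop distance from source} for a dict-of-lists graph."""
    dist = {source: 0}
    frontier = deque([source])
    while frontier:
        u = frontier.popleft()
        for v in adj[u]:              # random-access lookups into the graph
            if v not in dist:
                dist[v] = dist[u] + 1
                frontier.append(v)
    return dist

toy_graph = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2, 4], 4: [3]}
print(bfs_levels(toy_graph, 0))       # {0: 0, 1: 1, 2: 1, 3: 2, 4: 3}
```
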
NIH Biological Networks Pathway Analysis
  • Interaction networks or graphs occur in many disciplines, e.g. epidemiology, phylogenetics and systems biology
  • Performance is limited by database query latency and the aggregate amount of fast storage available

Palomar Transient Factory (PTF)
  • Nightly wide-field surveys using Palomar Schmidt telescope
  • Image data sent to LBL for archive/analysis
  • 100 new transients every minute
  • Large, random queries across multiple databases for IDs
PTF-DB Transient Search

Random queries requesting very small chunks of data about the candidate observations

Trestles – Coming Soon!
  • Trestles will work with and span the deployments of SDSC’s recently introduced Dash system and the larger Gordon data-intensive system.
  • To be configured by SDSC and Appro, Trestles is based on quad-socket, 8-core AMD Magny-Cours compute nodes connected via a QDR InfiniBand fabric.
  • Trestles will have 324 nodes, 10,368 processor cores, a peak speed of 100 teraflop/s, and 38 terabytes of flash memory.
  • Each of the 324 nodes will have 32 cores, 64 GB of DDR3 memory, and 120 GB of flash memory.
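The totals are consistent with the per-node figures; a quick check using only the numbers quoted above:

```python
# Consistency check of the Trestles totals from the per-node figures above.
nodes          = 324
cores_per_node = 32
dram_gb        = 64
flash_gb       = 120

print("cores     :", nodes * cores_per_node)               # 10,368
print("flash (TB):", round(nodes * flash_gb / 1024, 1))    # about 38 TB, as quoted
print("DRAM  (TB):", round(nodes * dram_gb / 1024, 1))     # about 20 TB
```
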
Conclusions
  • The Gordon architecture is customized for data-intensive applications, but built entirely from commodity parts
  • Basically a Linux cluster with
    • Large RAM memory per core
    • Large amount of flash SSD
    • Virtual shared memory software
    • 10 TB of storage accessible from a single core
    • Shared-memory parallelism for higher performance
  • The Dash prototype accelerates real applications by 2-100x relative to disk, depending on memory access patterns
    • Random I/O is accelerated the most