
High-Performance Computing Solutions and Success Stories for the Life Sciences and Meteorology Industries

凌巍才

HPC Product Technology Consultant

Dell (China) Co., Ltd.




Agenda

  • Life sciences HPC solutions

    • GPU acceleration solutions

    • High-performance storage solutions

  • WRF V3.3 (a meteorology application): testing and tuning on the Dell R720 server

    • gcc compiler

    • Intel compiler

  • Success stories




Life Sciences HPC GPU Solutions



In the life sciences, many users adopt GPU-accelerated solutions.




CPU + GPU Computing




HPCC GPU Heterogeneous Platform




Dell Server Options with GPU Support (2012, 12th-Generation Servers)

Internal Solutions

External Solutions (PowerEdge C)




GPU Expansion Chassis (External GPU Solution): Dell PowerEdge C410x

PCIe expansion chassis connecting 1-8 hosts to 1-16 PCIe devices

Great for: HPC including universities, oil & gas, biomed research, design, simulation, mapping, visualization, rendering, and gaming

  • 3U chassis, 19” wide, 143 pounds

  • PCI express modules: 10 front, 6 rear

  • PCI form factors: HH/HL and FH/HL

  • Up to 225W per module

  • PCIe inputs: 8 PCIe x16 iPass ports

  • PCI fan out options: x16 to 1 slot, x16 to 2 slot, x16 to 3 slot, x16 to 4 slot

  • GPUs supported: NVIDIA M1060, M2050, M2070 (TBD)

  • Thermals: high-efficiency 92mm fans; N + 1 fan redundancy

  • Management: On-board BMC; IPMI 2.0; dedicated management port

  • Power supplies: 4 x 1400W hot-plug, high efficiency PSUs; N+1 power redundancy

  • Services vary by region: IT Consulting, Server and Storage Deployment, Rack Integration (US only), Support Services
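
Because the chassis exposes a standard IPMI 2.0 BMC on a dedicated management port, it can be queried with stock tooling. A minimal sketch (the BMC address and root/calvin credentials are placeholder assumptions, not from this deck):

ipmitool -I lanplus -H 192.168.0.120 -U root -P calvin chassis status    # power and fault state
ipmitool -I lanplus -H 192.168.0.120 -U root -P calvin sensor list       # fan, thermal, and PSU sensors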




PowerEdge C410x PCIe Module

  • Serviceable PCIe module (taco) capable of supporting any half-height, half-length (HH/HL) or full-height/half-length (FH/HL) cards

  • FH/FL cards supported with extended PCIe module

  • Future-proofing for next generations of NVIDIA and AMD ATI GPU cards

[Diagram: PCIe module showing the power connector for the GPGPU card, an LED, the board-to-board connector carrying x16 PCIe signals and power, and the GPU card]




PowerEdge C410x Configurations

  • Enables HPC applications to optimize the cost/performance equation off a single x16 connection

[Diagram: each host's HIC connects through an iPass cable to a PCIe switch in the C410x, which fans out x16 links to the GPUs]

  • 1 GPU per x16 link: 8 GPUs in 7U; 7U = (1) C410x + (2) C6100

  • 2 GPUs per x16 link: 16 GPUs in 7U; 7U = (1) C410x + (2) C6100

  • 3 GPUs per x16 link: 12 GPUs in 5U; 5U = (1) C410x + (1) C6100

  • 4 GPUs per x16 link: 16 GPUs in 5U; 5U = (1) C410x + (1) C6100

GPU/U ratios assume a PowerEdge C6100 host with 4 servers per 2U chassis




Flexibility of the PowerEdge C410x

  • GPU:host ratios up to 8:1 are possible with dual x16 HICs per host

[Diagram: two hosts, each with two x16 HICs, connect via iPass cables to the C410x PCIe switches, giving each host access to up to 8 GPUs]




PowerEdge C6100 Configurations “2:1 Sandwich”

Details

  • Two C6100

    • 8 system boards

      • 2S Westmere, 12 DIMM slots, QDR IB, up to 6 drives per host

      • Single port x16 HIC (iPASS)

  • Single C410x

    • 16 GPUs (fully populated)

  • PCIe x8 per GPU

    • Total space = 7U

[Diagram: one C410x between two C6100 chassis]

Summary

C6100 “2:1 Sandwich”

  • One Dell C410x (16 GPUs)

  • Two C6100 (8 nodes)

  • One x16 slot per node, connected to 2 GPUs

  • 7U total; 16 GPUs total; 8 nodes total (2 GPUs per board)

Note: this configuration is equivalent to using the C6100 with the NVIDIA S2050, but it is denser.




PowerEdge C6100 Configurations “4:1 Sandwich”

Details


  • One C6100

    • 4 system boards

      • 2S Westmere, 12 DIMM slots, QDR IB, up to 6 drives per host

      • Single port x16 HIC (iPASS)

  • Single C410x

    • 16 GPUs (fully populated)

  • PCIe x4 per GPU

    • Total space = 5U

[Diagram: one C410x atop one C6100]

Summary

C6100 “4:1 Sandwich”

  • One Dell C410x (16 GPUs)

  • One C6100 (4 nodes)

  • One x16 slot per node, connected to 4 GPUs

  • 5U total; 16 GPUs total; 4 nodes total (4 GPUs per board)




PowerEdge C6100 Configurations “8:1 Sandwich” (Possible Future Development)

Details


  • One C6100

    • 4 system boards

      • 2S Westmere, 12 DIMM slots, QDR IB, up to 6 drives per host

      • Single port x16 HIC (iPASS)

  • Two C410x

    • 32 GPUs (fully populated)

  • PCIe x2 per GPU

    • Total space = 8U

    • See later table for metrics

[Diagram: two C410x with one C6100]

Summary

C6100 “8:1 Sandwich”

  • Two Dell C410x (32 GPUs)

  • One C6100 (4 nodes)

  • One x16 slot per node, connected to 8 GPUs

  • 8U total; 32 GPUs total; 4 nodes total (8 GPUs per board)




PowerEdge C6145 Configurations “8:1 Sandwich”

5U of Rack Space

Details

  • One C6145

    • 2 system boards

      • 4S MagnyCours, 32 DIMM slots, QDR IB, up to 12 drives per host

      • 3 x Single port x16 HIC (iPASS) + 1 x Single port onboard x16 HIC (iPASS)

  • One C410x

    • 16 GPUs (fully populated)

  • PCIe x4-x8 per GPU

    • Total space = 5U

[Diagram: one C6145 with one C410x]

Summary

C6145 “8:1 Sandwich”

  • One Dell C410x (16 GPUs)

  • One C6145 (2 nodes)

  • Two to four HIC slots per node, connecting to 8 GPUs per node

  • 5U total; 16 GPUs total; 2 nodes total (8 GPUs per board)




PowerEdge C6145 Configurations “16:1 Sandwich”

8U of Rack Space

Details

  • One C6145

    • 2 system boards

      • 4S MagnyCours, 32 DIMM slots, QDR IB, up to 12 drives per host

      • 3 x Single port x16 HIC (iPASS) + 1 x Single port onboard x16 HIC (iPASS)

  • Two C410x

    • 32 GPUs (fully populated)

  • PCIe x4 per GPU

    • Total space = 8U

[Diagram: one C6145 between two C410x]

Summary

C6145 “16:1 Sandwich”

  • Two Dell C410x (32 GPUs)

  • One C6145 (2 nodes)

  • Four HIC slots per node, connecting to 16 GPUs per node

  • 8U total; 32 GPUs total; 2 nodes total (16 GPUs per board)




PowerEdge C410x Block Diagram

  • GPUs: 16

  • Level-2 switches: 4

  • Level-1 switches: 8

  • Host connections: 8



C410x BMC Console Configuration Interface



GPU Expansion Chassis: Supported Server List

HIC/C410x Support Matrix

  • Dell external GPU solution support

    • Hardware Interface Card (HIC) in PCIe slot connects to external GPU(s) in C410x

    • Dell ‘slot validates’ NVIDIA interface cards to verify power, thermals, etc.



  • Life sciences application benchmark: GPU-HMMER

[Chart: GPU speedups of 1.8x, 2.7x, 2.8x, and 2.9x]




  • GPU:Host scaling: GPU-HMMER

[Chart: speedups of 1.8x, 3.6x, 7.2x, and 3.6x as the GPU:host ratio varies]




  • GPU:Host scaling: NAMD

[Chart: speedups of 4.7x, 8.2x, 15.2x, and 9.5x as the GPU:host ratio varies]




  • GPU:Host scaling: LAMMPS JL-Cut

[Chart: speedups of 8.5x, 13.5x, 14.4x, and 14.0x as the GPU:host ratio varies]




Life Sciences Storage Solutions



Life Sciences Compute and Data Capacity Growth Rates



[Diagram: clients connect over the network to a Meta Data Server (MDS) and a row of OSS nodes]

The Lustre Parallel File System

  • Key Lustre Components:

    • Clients (compute nodes)

      • “Users” of the file system where applications run

      • The Dell HPC Cluster

    • Meta Data Server (MDS)

      • Holds meta-data information

    • Object Storage Server (OSS)

      • Provides back-end storage for the users’ files

      • Additional OSS units increase throughput linearly
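
A minimal client-side sketch, assuming an InfiniBand fabric (the MDS hostname and file-system name are hypothetical):

mount -t lustre mds01@o2ib:/lustrefs /mnt/lustre    # mount the parallel file system on a compute node
lfs df -h /mnt/lustre                               # list the MDT and every OST; more OSSs mean more OSTs and throughput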





InfiniBand (IPoIB) NFS Performance: Sequential Read

  • Peaks:

    • NSS Small: 1 node doing IO (fairly level until 4 nodes)

    • NSS Medium: 4 nodes doing IO (not much drop-off)

    • NSS Large: 8 nodes doing IO (good performance over range)



InfiniBand (IPoIB) NFS Performance: Sequential Write

  • Peaks:

    • NSS Small: 1 node doing IO (steady drop off to 16 nodes)

    • NSS Medium: 2 nodes doing IO (good performance for up to 8 nodes)

    • NSS Large: 4 nodes doing IO (good performance over range)
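
With IPoIB, an NSS export mounts as an ordinary NFS share addressed through the server's InfiniBand interface. A hedged sketch (hostname, export path, and mount options are assumptions):

mount -t nfs -o vers=3,proto=tcp,rsize=1048576,wsize=1048576 nss-ib0:/export/nss /mnt/nss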





WRF V3.3 Application Testing and Tuning



Dell Test Environment

  • Dell R720

    • CPU: 2x Intel Sandy Bridge E5-2650

    • Memory: 8x 8 GB (64 GB total)

    • Hard disk: 2x 300 GB 15K rpm (RAID 0)

  • BIOS settings

    • HT disabled

    • Memory optimized

    • High Performance enabled (max power)

  • OS

    • Red Hat Enterprise Linux 6.3




gcc Test

  • gcc, gfortran, g++

  • zlib 1.2.5

  • HDF5 1.8.8

  • netCDF 4

  • WRF V3.3
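
The list above implies a bottom-up build order; a condensed sketch (install prefix, directory names, and configure options are assumptions):

export PREFIX=/opt/wrf-deps                                    # hypothetical install prefix
cd zlib-1.2.5 && ./configure --prefix=$PREFIX && make install && cd ..
cd hdf5-1.8.8 && ./configure --prefix=$PREFIX --with-zlib=$PREFIX && make install && cd ..
cd netcdf-4 && CPPFLAGS=-I$PREFIX/include LDFLAGS=-L$PREFIX/lib ./configure --prefix=$PREFIX && make install && cd ..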




Test Results

  • WRF output: simulated period Nov 30, 2011 to Dec 5, 2011; wall time 13h 9m 53s

    • wrf.exe starts at: Sun Apr 29 09:35:36 CST 2012 …

    • wrf: SUCCESS COMPLETE WRF

    • wrf.exe completed at: Sun Apr 29 22:45:29 CST 2012




Configuration File

# Settings for x86_64 Linux, gfortran compiler with gcc (smpar)
DMPARALLEL       = 1
OMPCPP           = -D_OPENMP
OMP              = -fopenmp
OMPCC            = -fopenmp
SFC              = gfortran
SCC              = gcc
CCOMP            = gcc
DM_FC            = mpif90 -f90=$(SFC)
DM_CC            = mpicc -cc=$(SCC)
FC               = $(SFC)
CC               = $(SCC) -DFSEEKO64_OK
LD               = $(FC)
RWORDSIZE        = $(NATIVE_RWORDSIZE)
PROMOTION        = # -fdefault-real-8 # uncomment manually
ARCH_LOCAL       = -DNONSTANDARD_SYSTEM_SUBR
CFLAGS_LOCAL     = -w -O3 -c -DLANDREAD_STUB
LDFLAGS_LOCAL    =
CPLUSPLUSLIB     =
ESMF_LDFLAG      = $(CPLUSPLUSLIB)
FCOPTIM          = -O3 -ftree-vectorize -ftree-loop-linear -funroll-loops
FCREDUCEDOPT     = $(FCOPTIM)
FCNOOPT          = -O0
FCDEBUG          = # -g $(FCNOOPT)
FORMAT_FIXED     = -ffixed-form
FORMAT_FREE      = -ffree-form -ffree-line-length-none
FCSUFFIX         =
BYTESWAPIO       = -fconvert=big-endian -frecord-marker=4
FCBASEOPTS_NO_G  = -w $(FORMAT_FREE) $(BYTESWAPIO)
FCBASEOPTS       = $(FCBASEOPTS_NO_G) $(FCDEBUG)
MODULE_SRCH_FLAG =
TRADFLAG         = -traditional
CPP              = /lib/cpp -C -P
AR               = ar
ARFLAGS          = ru
M4               = m4 -G
RANLIB           = ranlib
CC_TOOLS         = $(SCC)
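
This stanza is the stock "x86_64 Linux, gfortran compiler with gcc (smpar)" option written out by WRF's own configure script; a sketch of the surrounding build steps (the NETCDF path is an assumption):

export NETCDF=/opt/wrf-deps               # hypothetical path to the netCDF install
./configure                               # select the gfortran/gcc smpar option; writes configure.wrf
./compile em_real 2>&1 | tee compile.log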




wrf.out

….

WRF NUMBER OF TILES FROM OMP_GET_MAX_THREADS = 16

WRF TILE 1 IS 1 IE 250 JS 1 JE 10

WRF TILE 2 IS 1 IE 250 JS 11 JE 20

WRF TILE 3 IS 1 IE 250 JS 21 JE 30

WRF TILE 4 IS 1 IE 250 JS 31 JE 39

WRF TILE 5 IS 1 IE 250 JS 40 JE 48

WRF TILE 6 IS 1 IE 250 JS 49 JE 57

WRF TILE 7 IS 1 IE 250 JS 58 JE 66

WRF TILE 8 IS 1 IE 250 JS 67 JE 75

WRF TILE 9 IS 1 IE 250 JS 76 JE 84

WRF TILE 10 IS 1 IE 250 JS 85 JE 93

WRF TILE 11 IS 1 IE 250 JS 94 JE 102

WRF TILE 12 IS 1 IE 250 JS 103 JE 111

WRF TILE 13 IS 1 IE 250 JS 112 JE 120

WRF TILE 14 IS 1 IE 250 JS 121 JE 130

WRF TILE 15 IS 1 IE 250 JS 131 JE 140

WRF TILE 16 IS 1 IE 250 JS 141 JE 150

WRF NUMBER OF TILES = 16

…..
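
The 16 tiles match the 16 OpenMP threads available on the two 8-core E5-2650s with HT disabled. A sketch of reproducing this with the smpar build (the explicit numtiles override is optional):

export OMP_NUM_THREADS=16      # one thread per core on the R720; WRF then reports 16 tiles
./wrf.exe
# Alternatively, pin the tile count in namelist.input under &domains:
#   numtiles = 16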




System Resource Analysis (CPU)

  • CPU: (mpstat -P ALL)

Linux 2.6.32-257.el6.x86_64 (r720)      04/29/2012      _x86_64_        (16 CPU)

04:06:40 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle
04:06:40 PM  all   85.27    0.00    2.62    0.01    0.00    0.00    0.00    0.00   12.10
04:06:40 PM    0   85.71    0.00    2.58    0.01    0.00    0.00    0.00    0.00   11.69
04:06:40 PM    1   85.05    0.00    2.77    0.05    0.00    0.04    0.00    0.00   12.09
04:06:40 PM    2   85.26    0.00    2.69    0.00    0.00    0.00    0.00    0.00   12.05
04:06:40 PM    3   85.24    0.00    2.65    0.01    0.00    0.00    0.00    0.00   12.10
04:06:40 PM    4   87.36    0.00    1.90    0.00    0.00    0.00    0.00    0.00   10.73
04:06:40 PM    5   84.97    0.00    2.70    0.00    0.00    0.00    0.00    0.00   12.33
04:06:40 PM    6   85.23    0.00    2.64    0.00    0.00    0.00    0.00    0.00   12.13
04:06:40 PM    7   84.97    0.00    2.71    0.00    0.00    0.00    0.00    0.00   12.32
04:06:40 PM    8   85.33    0.00    2.60    0.00    0.00    0.00    0.00    0.00   12.06
04:06:40 PM    9   85.32    0.00    2.57    0.00    0.00    0.00    0.00    0.00   12.11
04:06:40 PM   10   84.88    0.00    2.77    0.00    0.00    0.00    0.00    0.00   12.35
04:06:40 PM   11   84.93    0.00    2.69    0.00    0.00    0.00    0.00    0.00   12.38
04:06:40 PM   12   85.16    0.00    2.62    0.00    0.00    0.00    0.00    0.00   12.21
04:06:40 PM   13   85.00    0.00    2.69    0.00    0.00    0.00    0.00    0.00   12.31
04:06:40 PM   14   84.91    0.00    2.75    0.00    0.00    0.00    0.00    0.00   12.34
04:06:40 PM   15   85.02    0.00    2.65    0.00    0.00    0.00    0.00    0.00   12.33
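
A sketch of capturing such statistics for the duration of a run (the 60-second sampling interval is an assumption):

mpstat -P ALL 60 > mpstat.log &    # per-CPU utilization samples
iostat 60 > iostat.log &           # block-device throughput samples
./wrf.exe
pkill mpstat; pkill iostat         # stop the samplers once WRF finishes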




System Resource Analysis (Memory)

  • Memory: (free)

             total       used       free     shared    buffers     cached

Mem:      65895488   32823072   33072416          0      38220   26885024

-/+ buffers/cache:    5899828   59995660

Swap:     66027512          0   66027512




System Resource Analysis (I/O, HDD)

I/O: (iostat)

Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn

sda               9.01       125.71      2063.47    3096354   50823660

dm-0              0.64        12.63         1.99     311170      49016

dm-1              0.01         0.10         0.00       2576          0

dm-2            258.17       112.05      2061.48    2759698   50774616

HDD: (df)

Filesystem           1K-blocks      Used Available Use% Mounted on

/dev/mapper/vg_r720-lv_root

                      51606140   5002372  43982328  11% /

tmpfs                 32947744        88  32947656   1% /dev/shm

/dev/sda1               495844     37433    432811   8% /boot

/dev/mapper/vg_r720-lv_home

                     458559680  58258760 377007380  14% /home




Intel Test






Intel links

  • http://software.intel.com/en-us/articles/building-the-wrf-with-intel-compilers-on-linux-and-improving-performance-on-intel-architecture/

  • http://software.intel.com/en-us/articles/wrf-and-wps-v311-installation-bkm-with-inter-compilers-and-intelr-mpi/

  • http://www.hpcadvisorycouncil.com/pdf/WRF_Best_Practices.pdf




Intel Compiler Flags
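
As an illustration only (these values are assumptions, not the slide's flag table), an Intel build typically swaps the compilers and vector flags in configure.wrf along these lines:

SFC          = ifort        # assumption: Intel Fortran in place of gfortran
SCC          = icc          # assumption: Intel C in place of gcc
FCOPTIM      = -O3 -xAVX    # AVX code generation for the Sandy Bridge E5-2650
CFLAGS_LOCAL = -w -O3 -ip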




Intel Tuning

http://software.intel.com/en-us/articles/performance-hints-for-wrf-on-intel-architecture/

  • 1. Reducing MPI overhead (combined into the launch sketch after this list):

    • -genv I_MPI_PIN_DOMAIN omp

    • -genv KMP_AFFINITY=compact

    • -perhost

  • 2. Improving cache and memory bandwidth utilization:

    • numtiles = X

  • 3. Using Intel® Math Kernel Library (MKL) DFT for polar filters:

    • Depending on the workload, Intel® MKL DFT may provide up to a 3x speedup in simulation speed

  • 4. Speeding up computations by reducing precision:

    • -fp-model fast=2 -no-prec-div -no-prec-sqrt
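
Hint 1 translates into a launch line like the following sketch (rank and thread counts are assumptions for a single 16-core R720):

mpirun -perhost 2 -np 2 \
       -genv I_MPI_PIN_DOMAIN omp \
       -genv KMP_AFFINITY compact \
       -genv OMP_NUM_THREADS 8 \
       ./wrf.exe                  # 2 ranks x 8 threads = 16 cores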




Success Stories



BGI (Beijing Genomics Institute)



Tsinghua University School of Life Sciences



Success References in Life Science

  • Domestic (China)

    • Beijing Genome Institute (BGI)

    • Tsinghua University Life Institute

    • Beijing Normal University

    • Jiang Su Tai Cang Life Institute

    • The 4th Military Medical University

  • International

    • David H. Murdock Research Institute

    • Virginia Bioinformatics Institute

    • University of Florida speeds up memory-intensive gene research

    • UCSF

    • National Center for Supercomputing Applications




Thank you!


