
Benchmarks of a Weather Forecasting Research Model


Daniel B. Weber, Ph.D.

Research Scientist

CAPS/University of Oklahoma

****CONFIDENTIAL****

August 3, 2001

- 20% increase in compute time for the 2-processes-per-node configuration on Intel-based systems, due to bus contention
- File system is very slow on Intel-based systems without Fibre Channel
- File system is the weak link (UNM-LL)
- 5.5 MB/s sustained for 480 2-proc/node tests writing 2.1 MB files from 8 separate processors simultaneously
- Data passes through a Linux file server, not the R6000

- The ES-40 Alpha EV-67 (TCS) is 5 times faster computationally than the Intel PIII/733
- The Alpha (TCS) file system is very slow at times; the configuration needs review, but it shows potential for very fast transfer rates
- MPI overhead for a 256-processor TCS job is on the order of 15%, indicating very good network performance

- The ES-45 Alpha EV-67 (TCS) is 1.5 times faster computationally than the ES-40
- It is 4-5 times faster than the Intel PIII-1GHz (using the Intel F90 compiler)

- Two modes:
- Loop Optimization
- MPI optimization

- MPI accounts for roughly 30% of runtime on 450 processors of the Platinum IA-32 NCSA cluster
- Calculations (70+%) are primarily 3-D DO loops

- MPI Optimization:
- Hide communications behind calculations
- Requires hand coding and knowledge of the computational structure; a very time-intensive task
- Maximum gain is limited to the communication cost (30%); realistically we may obtain a 15% improvement

- Loop Optimization for Vector Processors
- Issues:
- Length of the vector pipeline: the longer, the better
- KMA work shows nearly 75% of peak (6 GFLOPS per processor on the SX-5)
- Code was hand-tuned, hundreds of loops

- Loop Optimization for Scalar Processors
- Issues:
- Cheap, fast processors
- Cache reuse is very important
- Rethink the order/layout of the computational structure of ARPS
- Some optimization was included in 1997, removing redundant computations and combining loops (good for both vector and scalar machines)
- CPU utilization is only 10-20% of peak

- New Approach to Loop Optimization
- Combine loops further; the result is fewer loads and stores, which is very important on the new Intel technology
- Cache reuse is critical!
- Force improvements in compiler technology
- Our goal is to generate optimizations that are platform INDEPENDENT
- Example:

- Horizontal Advection - Original Version

```fortran
DO k=2,nz-2            ! compute avgx(u) * difx(u)
  DO j=1,ny-1
    DO i=1,nx-1
      tem2(i,j,k) = (u(i,j,k,2)+u(i+1,j,k,2))*(u(i+1,j,k,2)-u(i,j,k,2))
    END DO
  END DO
END DO

DO k=2,nz-2            ! compute avg2x(u) * dif2x(u)
  DO j=1,ny-1
    DO i=2,nx-1
      tem3(i,j,k) = (u(i-1,j,k,2)+u(i+1,j,k,2))*(u(i+1,j,k,2)-u(i-1,j,k,2))
    END DO
  END DO
END DO

DO k=2,nz-2            ! compute 4/3*avgx(tem2) + 1/3*avg2x(tem3)
  DO j=1,ny-1          ! signs are reversed for the force array
    DO i=3,nx-2
      uforce(i,j,k) = uforce(i,j,k)                          &
                      + tema*(tem3(i+1,j,k)+tem3(i-1,j,k))   &
                      - temb*(tem2(i-1,j,k)+tem2(i,j,k))
    END DO
  END DO
END DO
```

- Horizontal Advection - Modified Version
- Three loops are merged into one large loop that reuses data and reduces loads and stores.

```fortran
DO k=2,nz-2
  DO j=1,ny-1
    DO i=3,nx-2
      uforce(i,j,k) = uforce(i,j,k)                                     &
          + tema*((u(i,j,k,2)+u(i+2,j,k,2))*(u(i+2,j,k,2)-u(i,j,k,2))   &
                 +(u(i-2,j,k,2)+u(i,j,k,2))*(u(i,j,k,2)-u(i-2,j,k,2))) &
          - temb*((u(i,j,k,2)+u(i+1,j,k,2))*(u(i+1,j,k,2)-u(i,j,k,2))   &
                 +(u(i-1,j,k,2)+u(i,j,k,2))*(u(i,j,k,2)-u(i-1,j,k,2)))
    END DO
  END DO
END DO
```

[Benchmark plots comparing original vs. optimized code performance]