
Apex-Map Status Erich Strohmaier and Hongzhang Shan

Presentation Transcript


  1. Apex-Map Status. Erich Strohmaier and Hongzhang Shan

  2. Apex-Map generator

     Benchmark code will be generated based on the following performance parameters:

     • PARALLEL: N / Y
     • PARALLEL LANGUAGE: MPI / SHMEM / UPC / CAF
     • ACCESS PATTERN: RANDOM / STRIDE
     • SPATIAL LOCALITY (L): [1, M]. Default: {1, 4, 16, …, 65536}
     • CONCURRENCY (I): [1, X]. Default: 1024
     • TEMPORAL LOCALITY (a): [0, 1]. Default: {1.0, 0.5, 0.25, 0.1, 0.05, 0.025, 0.01, 0.005, 0.0025, 0.001}
     • MEMORY SIZE (M): Default: 67,108,864 words = 512 MB / process
     • REGISTER PRESSURE (R): [1, X]. Default: 1
     • COMPUTATIONAL INTENSITY (CI): [1, X]. Default: 1
     • ACCESS MODE: FUSED / NESTED
     • RESULTS: SCALAR / ARRAY (left-hand side of statement)
     • REPEAT TIMES: 100
     • WARMUP TIMES: 10
     • CPU MHZ: 1900
     • PLATFORM: BASSI
     • VERSION: 1.6
     • STRIDE: X

     X: any positive integer
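The MEMORY SIZE default can be checked with a little arithmetic. The sketch below assumes that one word is 8 bytes (the slide does not state the word size, but this is the value that makes 67,108,864 words equal 512 MB); the function names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: one "word" is 8 bytes (e.g. a double). */
#define WORD_BYTES 8u

/* Convert a memory size given in words to bytes. */
uint64_t memory_bytes(uint64_t words)
{
    return words * WORD_BYTES;
}

/* Convert bytes to binary megabytes (1 MB = 1024 * 1024 bytes here). */
uint64_t bytes_to_mb(uint64_t bytes)
{
    assert(bytes % (1024u * 1024u) == 0); /* only exact multiples in this sketch */
    return bytes / (1024u * 1024u);
}
```

With these definitions, `bytes_to_mb(memory_bytes(67108864))` evaluates to 512, matching the default on the slide.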

  3. Apex-Map Meets Kernels

  4. NAS CG (one stream)

     Source code:

         DO j = 1, lastrow-firstrow+1
            sum = 0.d0
            DO k = rowstr(j), rowstr(j+1)-1
               sum = sum + a(k)*p(colidx(k))
            ENDDO
            w(j) = sum
         ENDDO

     One-stream approach: use one Apex-Map stream to simulate the performance behavior of NAS CG. The temporal locality currently has to be determined by experiment.
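For readers more comfortable with C, the Fortran loop above is a sparse matrix-vector product over a compressed-row matrix. The following is a minimal C sketch of the same loop, assuming 0-based indices and `size_t` index arrays (the NAS source is 1-based Fortran); the function name `csr_matvec` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* w = A * p for a matrix stored in compressed-row form.
   rowstr has nrows + 1 entries; row j owns entries rowstr[j] .. rowstr[j+1]-1. */
void csr_matvec(size_t nrows, const size_t *rowstr, const size_t *colidx,
                const double *a, const double *p, double *w)
{
    assert(rowstr && colidx && a && p && w);
    for (size_t j = 0; j < nrows; j++) {
        double sum = 0.0;
        for (size_t k = rowstr[j]; k < rowstr[j + 1]; k++) {
            /* a[k] is read with unit stride; p is read indirectly via colidx */
            sum += a[k] * p[colidx[k]];
        }
        w[j] = sum;
    }
}
```

The two reads in the inner loop behave differently: `a[k]` is sequential while `p[colidx[k]]` depends on the sparsity pattern.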

  5. Performance Prediction for CG (using one stream)

     The results indicate that the performance of CG for different data sets can be simulated by Apex-Map using one stream with a temporal locality between 0.01 and 0.03 (exception: data set S on Jacquard).

  6. NAS CG (two streams)

     Source code:

         DO j = 1, lastrow-firstrow+1
            sum = 0.d0
            DO k = rowstr(j), rowstr(j+1)-1
               sum = sum + a(k)*p(colidx(k))
            ENDDO
            w(j) = sum
         ENDDO

     Two-stream approach (a and p are treated differently):

         Perf. of CG = 1 / (1/Perf_stream1 + 1/Perf_stream2)
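The two-stream formula can be written as a small helper. This is only a sketch of the arithmetic on the slide; the function name is hypothetical, and both rates are assumed to be positive and in the same unit. One way to read the formula (an interpretation, not stated on the slide) is that the time per element of each stream adds up, so the combined rate is the reciprocal of the summed reciprocals.

```c
#include <assert.h>

/* Combine the predicted rates of two Apex-Map streams:
   perf = 1 / (1/perf_stream1 + 1/perf_stream2). */
double two_stream_perf(double perf_stream1, double perf_stream2)
{
    assert(perf_stream1 > 0.0 && perf_stream2 > 0.0);
    return 1.0 / (1.0 / perf_stream1 + 1.0 / perf_stream2);
}
```

For example, streams predicted at 3 and 6 units combine to about 2 units, so the result is always lower than the slower stream alone.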

  7. Performance Prediction for CG (using two streams)

     With the two-stream approach, performance matches very well on Jacquard. On Franklin, however, only the large data sets match well.

  8. GUPS

     Source code:

         for (i = 0; i < NUPDATE; i++) {
             ran = (ran << 1) ^ (((s64int) ran < 0) ? POLY : 0);
             Table[ran & (TableSize - 1)] ^= ran;
         }

     Results match well!
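A self-contained version of the GUPS update loop might look like the sketch below. The value of `POLY` is an assumption (0x7, as used in the HPCC RandomAccess reference code), `int64_t` stands in for `s64int`, and the function name is hypothetical. The table size must be a power of two for the mask to work.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: polynomial constant taken from the HPCC RandomAccess code. */
#define POLY 0x0000000000000007ULL

/* Apply nupdate XOR updates to table, starting from seed ran.
   Returns the final value of the pseudo-random sequence. */
uint64_t gups_update(uint64_t *table, uint64_t table_size,
                     uint64_t nupdate, uint64_t ran)
{
    /* table_size must be a non-zero power of two */
    assert(table_size != 0 && (table_size & (table_size - 1)) == 0);
    for (uint64_t i = 0; i < nupdate; i++) {
        /* shift left; fold in POLY when the top bit was set */
        ran = (ran << 1) ^ (((int64_t)ran < 0) ? POLY : 0);
        table[ran & (table_size - 1)] ^= ran;
    }
    return ran;
}
```

Because every update is an XOR, running the same update sequence a second time restores the original table, which gives a simple self-check.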

  9. Matrix-Mul (stride)

     Source code:

         for (i = 0; i < N; i++) {
             for (j = 0; j < K; j++) {
                 tmp = 0;
                 for (k = 0; k < M; k++) {
                     tmp += a[i*M+k] * b[k*K+j];
                 }
                 c[i*K+j] = tmp;
             }
         }

     There are two choices for Apex-Map:

     • Use a random stream
     • Use a stride stream

  10. Performance Prediction for Matrix-Mul (stride)

      1. The stride stream matches well.
      2. There is a big performance gap between MM and Apex-Map using the random stream.

  11. Matrix-Mul (vector)

      Source code:

          for (i = 0; i < N; i++)
              for (k = 0; k < M; k++)
                  for (j = 0; j < K; j++)
                      c[i*K+j] += a[i*M+k] * b[k*K+j];

      On Franklin, performance matches well when the temporal locality is 0.02. On Jacquard, it is not a close match (compiler inefficiency for Apex-Map kernels?).
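The stride and vector variants compute the same product and differ only in loop order, which changes how `b` is traversed. The sketch below puts both side by side; the function names are hypothetical, and the i-k-j version zeroes `c` first, which is an added assumption since the slide's loop accumulates into whatever `c` already holds.

```c
#include <assert.h>
#include <stddef.h>

/* i-j-k order: the inner loop reads b with stride K. */
void matmul_ijk(size_t N, size_t M, size_t K,
                const double *a, const double *b, double *c)
{
    assert(a && b && c);
    for (size_t i = 0; i < N; i++) {
        for (size_t j = 0; j < K; j++) {
            double tmp = 0.0;
            for (size_t k = 0; k < M; k++)
                tmp += a[i * M + k] * b[k * K + j];
            c[i * K + j] = tmp;
        }
    }
}

/* i-k-j order: the inner loop reads b and updates c with unit stride. */
void matmul_ikj(size_t N, size_t M, size_t K,
                const double *a, const double *b, double *c)
{
    assert(a && b && c);
    for (size_t i = 0; i < N * K; i++)
        c[i] = 0.0; /* assumption: start from a zeroed result */
    for (size_t i = 0; i < N; i++)
        for (size_t k = 0; k < M; k++)
            for (size_t j = 0; j < K; j++)
                c[i * K + j] += a[i * M + k] * b[k * K + j];
}
```

Both functions take `a` as N×M, `b` as M×K and `c` as N×K, all stored row-major.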

  12. NBODY

      Source code (loop body):

          SUBVEC(p->position, bod->position, diff)
          DOTPROD(diff, diff, distSq)
          distSq += SOFTSQ
          dist = sqrt(distSq)
          factor = p->mass/dist
          bod->phi -= factor
          factor = factor / distSq
          MULTVEC(diff, factor, extraAcc)
          ADDVEC(bod->acc, extraAcc, bod->acc)

      • FDIV and FSQRT are implemented differently across platforms and will affect the computation of MF/s and Computational Intensity (CI):
        • use a test program to determine the ratio between fdiv, fsqrt and fadd to decide the CI for Apex-Map
        • use the number of loops executed per second as the performance metric instead of MF/s
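The two bullets above can be illustrated with a small calculation. The sketch below is hypothetical throughout: it assumes the vector macros operate on 3 components, which gives 10 additions/subtractions, 6 multiplications, 2 divisions and 1 square root per loop body, and it weights divisions and square roots by cost ratios relative to an add. The ratios themselves would have to come from a measurement on the target platform; none are given on the slide.

```c
#include <assert.h>

/* Operation counts for one loop iteration. */
typedef struct {
    int n_add;  /* additions and subtractions */
    int n_mul;
    int n_div;
    int n_sqrt;
} op_counts;

/* Assumption: 3-component vectors in SUBVEC, DOTPROD, MULTVEC, ADDVEC. */
static const op_counts NBODY_LOOP = {10, 6, 2, 1};

/* Add-equivalent operations per iteration, given measured cost ratios
   fdiv/fadd and fsqrt/fadd. */
double weighted_ops(op_counts c, double div_ratio, double sqrt_ratio)
{
    assert(div_ratio > 0.0 && sqrt_ratio > 0.0);
    return c.n_add + c.n_mul + c.n_div * div_ratio + c.n_sqrt * sqrt_ratio;
}

/* Platform-neutral metric: loop iterations executed per second. */
double loops_per_second(double loops, double seconds)
{
    assert(seconds > 0.0);
    return loops / seconds;
}
```

With both ratios set to 1 the loop body counts as 19 operations; with, say, a division costing 4 adds and a square root costing 6, it counts as 30.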

  13. Performance Prediction for Nbody

      Apex-Map results match Nbody well on Franklin, but differ substantially on Jacquard.

  14. STREAM

      Source code:

          for (i = 0; i < N; i++) c[i] = a[i];
          for (i = 0; i < N; i++) b[i] = s*c[i];
          for (i = 0; i < N; i++) c[i] = a[i]+b[i];
          for (i = 0; i < N; i++) a[i] = b[i]+s*c[i];

      Big performance difference due to:

      1. Static vs. dynamic memory allocation
      2. Kernel implementation details
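The four STREAM loops correspond to the Copy, Scale, Add and Triad kernels. A minimal sketch as standalone C functions follows; the function names are hypothetical and the arrays are passed in as pointers, so the same code works for statically or dynamically allocated arrays.

```c
#include <assert.h>
#include <stddef.h>

/* Copy: c = a */
void stream_copy(double *c, const double *a, size_t n)
{
    assert(c && a);
    for (size_t i = 0; i < n; i++) c[i] = a[i];
}

/* Scale: b = s * c */
void stream_scale(double *b, const double *c, double s, size_t n)
{
    assert(b && c);
    for (size_t i = 0; i < n; i++) b[i] = s * c[i];
}

/* Add: c = a + b */
void stream_add(double *c, const double *a, const double *b, size_t n)
{
    assert(c && a && b);
    for (size_t i = 0; i < n; i++) c[i] = a[i] + b[i];
}

/* Triad: a = b + s * c */
void stream_triad(double *a, const double *b, const double *c,
                  double s, size_t n)
{
    assert(a && b && c);
    for (size_t i = 0; i < n; i++) a[i] = b[i] + s * c[i];
}
```

Run in the slide's order with a = 1, b = 2, c = 0 and s = 3, each element ends at a = 15, b = 3, c = 4.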

  15. STREAM: Static vs. Dynamic

      Static:

              .text
              .align 16
              .globl tuned_STREAM_Copy
          tuned_STREAM_Copy:
          ..Dcfb4:
              subq $8,%rsp
          ..Dcfi4:
          ## lineno: 0
          ..EN5:
          ## lineno: 395
              movl $c+0,%edi
              movl $a+0,%esi
              movl $1048576,%edx
              .p2align 4,,1
              call __c_mcopy8
          ## lineno: 396
              addq $8,%rsp
              ret

      Dynamic:

              .text
              .align 16
              .globl tuned_STREAM_Copy
          tuned_STREAM_Copy:
          ..Dcfb4:
          ## lineno: 0
          ..EN5:
          ## lineno: 402
              xorl %ecx,%ecx
              movl $524288,%edx
              movl $8,%eax
              .align 16
          .LB2164:
          ## lineno: 402
              movq a(%rip),%rsi
              movq c(%rip),%r8
              decl %edx
              movq (%rsi,%rcx),%rdi
              movq %rdi,(%r8,%rcx)
              addq $16,%rcx
              movq (%rsi,%rax),%r9
              movq %r9,(%r8,%rax)
              addq $16,%rax
              testl %edx,%edx
              jg .LB2164
          ## lineno: 403
              ret

      Different code is generated for static and dynamic allocation (may cause a 50% performance difference).

  16. Random Nested (R=1, CI=1)

      Array:

          for (i = 0; i < times; i++) {
              index_length = B / L;
              initIndexArray(index_length);
              CLOCK(time1);
              for (j = 0; j < index_length; j++) {
                  for (k = 0; k < L; k++) {
                      W0[j*L+k] = W0[j*L+k]+c0*(data[ind0[j]+k]);
                  }
              }
              CLOCK(time2);
          }

      Scalar:

          for (i = 0; i < times; i++) {
              index_length = B / L;
              initIndexArray(index_length);
              CLOCK(time1);
              for (j = 0; j < index_length; j++) {
                  for (k = 0; k < L; k++) {
                      W0 = W0+c0*(data[ind0[j]+k]);
                  }
              }
              CLOCK(time2);
          }

      initIndexArray(length):

          for (i = 0; i < length; i++) {
              ind0[i] = getIndex(0) * L;
          }

      Open question: how many loads and stores should be counted?
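The timed part of the nested kernels can be isolated into two small functions, as in the sketch below. The index array is passed in ready-made, so `getIndex` and `CLOCK` from the slide are left out; the function names and the `size_t` index type are assumptions made for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Nested access, scalar result: each index starts a block of L
   consecutive words, all accumulated into one scalar. */
double nested_scalar(const double *data, const size_t *ind0,
                     size_t index_length, size_t L, double c0)
{
    double W0 = 0.0;
    assert(data && ind0 && L >= 1);
    for (size_t j = 0; j < index_length; j++)
        for (size_t k = 0; k < L; k++)
            W0 = W0 + c0 * data[ind0[j] + k];
    return W0;
}

/* Nested access, array result: same reads, but every element is
   accumulated into its own slot of W0 (index_length * L entries). */
void nested_array(double *W0, const double *data, const size_t *ind0,
                  size_t index_length, size_t L, double c0)
{
    assert(W0 && data && ind0 && L >= 1);
    for (size_t j = 0; j < index_length; j++)
        for (size_t k = 0; k < L; k++)
            W0[j * L + k] = W0[j * L + k] + c0 * data[ind0[j] + k];
}
```

At source level the array form reads `W0[j*L+k]` and `data[...]` and writes `W0[j*L+k]` on every inner iteration, while the scalar form only reads `data[...]` if the compiler keeps `W0` in a register; what a given compiler actually emits may differ.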

  17. Random Fused (R=1, CI=1)

      Array:

          for (i = 0; i < times; i++) {
              index_length = B;
              initIndexArray(index_length);
              CLOCK(time1);
              for (j = 0; j < index_length; j++) {
                  W0[j] = W0[j]+c0*(data[ind0[j]]);
              }
              CLOCK(time2);
          }

      Scalar:

          for (i = 0; i < times; i++) {
              index_length = B;
              initIndexArray(index_length);
              CLOCK(time1);
              for (j = 0; j < index_length; j++) {
                  W0 = W0+c0*(data[ind0[j]]);
              }
              CLOCK(time2);
          }

      initIndexArray(length):

          for (i = 0; i < length; i += L) {
              ind0[i] = getIndex(0) * L;
              for (j = 1; (j < L) && (i+j < length); j++) {
                  ind0[i+j] = ind0[i] + j;
              }
          }
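In the fused mode the block structure moves from the timed loop into the index array: every block start is expanded into L consecutive indices up front. The sketch below illustrates this under stated assumptions: `block_start` is a hypothetical array standing in for the successive `getIndex(0) * L` values, and the function names are made up.

```c
#include <assert.h>
#include <stddef.h>

/* Expand block starts into a flat index array: entry i holds a block
   start when i is a multiple of L, and start + offset otherwise. */
void init_index_fused(size_t *ind0, size_t length, size_t L,
                      const size_t *block_start)
{
    assert(ind0 && block_start && L >= 1);
    for (size_t i = 0; i < length; i += L) {
        ind0[i] = block_start[i / L]; /* stands in for getIndex(0) * L */
        for (size_t j = 1; j < L && i + j < length; j++)
            ind0[i + j] = ind0[i] + j;
    }
}

/* Fused access, scalar result: a single loop over the flat index array. */
double fused_scalar(const double *data, const size_t *ind0,
                    size_t index_length, double c0)
{
    double W0 = 0.0;
    assert(data && ind0);
    for (size_t j = 0; j < index_length; j++)
        W0 = W0 + c0 * data[ind0[j]];
    return W0;
}
```

With block starts {4, 0, 10} and L = 2 the flat index array becomes {4, 5, 0, 1, 10, 11}, so the fused loop touches the same words as a nested loop over the three blocks.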

  18. Random Nested Scalar (R)

      R=1:

          for (j = 0; j < index_length; j++) {
              for (k = 0; k < L; k++) {
                  W0 = W0+c0*(data[ind0[j]+k]);
              }
          }

      initIndexArray(length) for R=1:

          for (i = 0; i < length; i++) {
              ind0[i] = getIndex(0) * L;
          }

      R=2:

          for (j = 0; j < index_length; j++) {
              for (k = 0; k < L; k++) {
                  W0 = W0+c0*(data[ind0[j]+k]);
                  W1 = W1+c1*(data[ind1[j]+k]);
              }
          }

      initIndexArray(length) for R=2:

          for (i = 0; i < length; i++) {
              ind0[i] = getIndex(0) * L;
              ind1[i] = getIndex(1) * L;
          }

  19. Random Nested Scalar (CI)

      R=1, CI=1:

          for (j = 0; j < index_length; j++) {
              for (k = 0; k < L; k++) {
                  W0 = W0+c0*(data[ind0[j]+k]);
              }
          }

      R=1, CI=2:

          for (j = 0; j < index_length; j++) {
              for (k = 0; k < L; k++) {
                  W0 = W0+c0*(data[ind0[j]+k]+c0*(data[ind0[j]+k]));
              }
          }

      R=2, CI=4:

          for (j = 0; j < index_length; j++) {
              for (k = 0; k < L; k++) {
                  W0 = W0+c0*(data[ind0[j]+k]+c0*(data[ind1[j]+k]+c0*(data[ind0[j]+k]+c0*(data[ind1[j]+k]))));
                  W1 = W1+c1*(data[ind1[j]+k]+c1*(data[ind0[j]+k]+c1*(data[ind1[j]+k]+c1*(data[ind0[j]+k]))));
              }
          }

  20. Random Nested Scalar (R, CI)

      R=3, CI=3:

          for (i = 0; i < times; i++) {
              index_length = B / L;
              initIndexArray(index_length);
              CLOCK(time1);
              for (j = 0; j < index_length; j++) {
                  for (k = 0; k < L; k++) {
                      W0 = W0+c0*(data[ind0[j]+k]+c0*(data[ind2[j]+k]+c0*(data[ind1[j]+k])));
                      W1 = W1+c1*(data[ind1[j]+k]+c1*(data[ind0[j]+k]+c1*(data[ind2[j]+k])));
                      W2 = W2+c2*(data[ind2[j]+k]+c2*(data[ind1[j]+k]+c2*(data[ind0[j]+k])));
                  }
              }
              CLOCK(time2);
          }
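The statements on the R and CI slides follow a regular pattern: accumulator r has CI nested terms, and term t reads through index array (r - t) mod R. The sketch below emits one such statement as text. It is inferred from the examples shown on these slides and is not the actual Apex-Map generator; the function name and signature are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Write the update statement for accumulator r (0 <= r < R) with CI
   nested terms into buf. Returns the string length, or -1 if buf is
   too small. */
int emit_statement(char *buf, size_t cap, int r, int R, int CI)
{
    assert(buf && R >= 1 && CI >= 1 && r >= 0 && r < R);
    int n = snprintf(buf, cap, "W%d = W%d", r, r);
    if (n < 0 || (size_t)n >= cap) return -1;
    size_t pos = (size_t)n;
    for (int t = 0; t < CI; t++) {
        int s = ((r - t) % R + R) % R; /* index array used by term t */
        n = snprintf(buf + pos, cap - pos, "+c%d*(data[ind%d[j]+k]", r, s);
        if (n < 0 || (size_t)n >= cap - pos) return -1;
        pos += (size_t)n;
    }
    for (int t = 0; t < CI; t++) { /* close the CI nested parentheses */
        if (pos + 1 >= cap) return -1;
        buf[pos++] = ')';
    }
    if (pos + 1 >= cap) return -1;
    buf[pos++] = ';';
    buf[pos] = '\0';
    return (int)pos;
}
```

Calling `emit_statement(buf, sizeof buf, 0, 3, 3)` reproduces the W0 line of the R=3, CI=3 slide; each nested term adds one multiply and one add per loaded word.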

  21. Register Pressure ( R ) Effect

  22. Computational Intensity (CI) Effect

  23. % Peak for Random Nested Scalar (R=1, CI=1)
