Performance Optimization Techniques for High-Performance Computing

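Performance Optimization Techniques for High-Performance Computing
Kim Jeong-ho (김정호), Supercomputing Center, Korea Institute of Science and Technology Information (KISTI)
November 25, 2006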

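Contents
• Introduction
• Computer Architecture
• Compilers and Performance Analysis Techniques
• CPU Optimization
• Memory Optimization
• Case Studies in Applying Optimization Techniques
• Summary and Conclusions
KISTI Supercomputing Center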

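Contents
• Introduction
  • Significance and Need for Performance Optimization
• Computer Architecture
• Compilers and Performance Analysis Techniques
• CPU Optimization
• Memory Optimization
• Case Studies in Applying Optimization Techniques
• Summary and Conclusions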

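Significance and Need for Performance Optimization
• As computer technology advances, numerical computation plays a growing role in scientific and engineering research and development.
• Computers have become far more widely used, and far more important, as computational tools for research and development.
• As the technology advances, computer architectures have become more complex.
• Numerical programs are often written without a basic understanding of how the computer is built and how it operates.
• As a result, computations frequently run at well under 10% of the peak performance the machine can deliver.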

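Significance and Need for Performance Optimization
• Performance optimization: implementing a given algorithm so that it reaches maximum performance on given hardware, based on a basic understanding of the computer's architecture and operation.
• To carry out simulation-based research more efficiently, a basic understanding of the computer architecture and operation relevant to high-performance numerical computing is needed.
• High-performance program = Best algorithm + Best implementation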

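Contents
• Introduction
• Computer Architecture
  • Factors That Determine Computer Performance
  • Processor Architecture
  • Theoretical Peak Performance
  • Memory Architecture
  • Cache Memory
• Compilers and Performance Analysis Techniques
• CPU Optimization
• Memory Optimization
• Case Studies in Applying Optimization Techniques
• Summary and Conclusions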

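Factors That Determine Computer Performance
• Machine cycle
  • The shortest period on which the processor operates, i.e. one clock tick (e.g., a 3.0 GHz Pentium 4)
  • Governed by semiconductor technology
• Instructions per cycle
  • The number of instructions that can be processed in one machine cycle
  • Raised in order to get past the limits of semiconductor technology
  • Within a single processor: pipelining, superscalar execution, SIMD, FMA
  • Across multiple processors: SMP, MPP, clusters (parallel processing)
• Data transfer capability
  • The ability to supply the processor with the data needed for computation
  • Memory to CPU, disk to memory, system to system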

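Processor Architecture
• Pipelining
  • The execution of a single instruction is split into stages that can each proceed independently, so that successive instructions overlap and throughput improves.
• Stages of a single instruction
  • IF (instruction fetch and decode): fetches the instruction from memory into the processor and decodes it
  • RD (read data): fetches the data from memory
  • EX (execution): carries out the instruction
  • WB (write-back): records the result
[Figure: pipelining functional unit, showing the IF, RD, EX and WB stages of successive instructions overlapping over cycles 1 to 9]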
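The benefit of overlapping stages can be estimated with the usual idealized cycle count. The sketch below assumes a stall-free pipeline in which one instruction enters per cycle; the function names and the default of four stages (IF, RD, EX, WB) are illustrative choices, not taken from the slides.

```python
def cycles_unpipelined(n_instructions, n_stages=4):
    """Each instruction must finish all stages before the next one starts."""
    return n_instructions * n_stages


def cycles_pipelined(n_instructions, n_stages=4):
    """Ideal pipeline: the first instruction needs n_stages cycles,
    every further instruction completes one cycle later."""
    if n_instructions == 0:
        return 0
    return n_stages + n_instructions - 1


if __name__ == "__main__":
    for n in (1, 6, 1000):
        print(n, cycles_unpipelined(n), cycles_pipelined(n))
```

For long instruction streams the pipelined count approaches one instruction per cycle, which is the assumption behind the peak performance figures later in the deck.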

Processor Architecture
• Superscalar
  • Several pipelining functional units are included so that multiple instructions can be processed in the same cycle, which shortens execution time further.
  • Adopted in RISC processors (POWER, Alpha, PA-RISC, …)
[Figure: several pipelines running side by side, each passing instructions through IF, RD, EX and WB over cycles 1 to 6]

Processor Architecture
• SIMD (Single Instruction Multiple Data)
  • Applies one instruction to several data items at once to raise arithmetic throughput: data parallelism
  • A concept used mainly in vector supercomputers
  • Adopted in x86 processors for fast processing of multimedia data (Intel: MMX/SSE/SSE2; AMD: 3DNow!)
[Figure: SISD versus SIMD execution of ADD A(1), B(1)]

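Processor Architecture
• FMA (Fused Multiply-Add)
  • Performs a multiplication and an addition in a single instruction, e.g. FMA a,b,c : a = a + b*c
  • Allows two floating-point operations per cycle
  • Can also act as a plain adder or a plain multiplier when needed
  • Operations need to be arranged in a suitable order:
      a = b + c*d + e*f : ((b + c*d) + e*f)      (two FMAs)
      a = b*c + d*e + f : (((b*c) + d*e) + f)    (one multiply, one FMA, one add)
  • POWER, PA-RISC, …
• Out-of-order execution / speculative execution
  • Instructions can be processed in an order different from the one given in the program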

Theoretical Peak Performance
• Peak performance: the maximum performance the processor can theoretically deliver when its arithmetic units are 100% busy
• FLOPS (FLoating-point Operations Per Second)
• With a pipeline 100% busy, each pipeline delivers one result per machine cycle
• With a superscalar design, each pipelining functional unit delivers a result, so the result count scales with the number of units
• With SIMD, the result count scales with the number of data items processed at once

Theoretical Peak Performance
• Peak performance = [clock frequency (machine cycles per second)] x [1, or 2 with FMA] x [number of pipelining functional units, or number of SIMD data members] FLOPS
• Examples
  • IBM POWER4 1.7 GHz
    • Superscalar with 2 pipelining functional units
    • FMA (Fused Multiply-Add)
    • Peak performance = 1.7 x 2 x 2 = 6.8 GFLOPS
  • Intel Pentium 4 3.0 GHz
    • SIMD with 2 double-precision data members
    • Peak performance = 3.0 x 1 x 2 = 6.0 GFLOPS
  • NEC SX-5 (312.5 MHz)
    • 16 adder pipelines + 16 multiplier pipelines
    • Peak performance = 0.3125 x (16+16) = 10.0 GFLOPS
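The formula above is plain arithmetic, so the three examples can be reproduced in a few lines. This is a minimal sketch; the function and parameter names are illustrative, and the NEC SX-5 case is expressed as 32 pipelines that each deliver one operation per cycle.

```python
def peak_gflops(clock_ghz, flops_per_unit_per_cycle, units):
    """Theoretical peak in GFLOPS.

    clock_ghz                clock frequency in GHz
    flops_per_unit_per_cycle 1 for a plain unit, 2 for an FMA unit
    units                    pipelining functional units or SIMD data members
    """
    return clock_ghz * flops_per_unit_per_cycle * units


if __name__ == "__main__":
    print("IBM POWER4 1.7 GHz  :", peak_gflops(1.7, 2, 2))
    print("Intel Pentium 4 3.0 :", peak_gflops(3.0, 1, 2))
    print("NEC SX-5 312.5 MHz  :", peak_gflops(0.3125, 1, 16 + 16))
```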

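Memory Architecture
• Computation requires exchanging data with memory
  • Load: transfers data from memory to the CPU (register)
  • Store: writes data from the CPU (register) to memory
  • Register: temporary storage in the CPU that holds the data needed for arithmetic
• Memory performance factors
  • Memory bandwidth: the data transfer rate between processor and memory (bytes transferred per second)
  • Memory latency: the time memory takes to respond to a data request (load/store) from the CPU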

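Memory Architecture
• Memory data transfer rates have improved more slowly than processor performance.
• As a rule of thumb, a 1 GFLOPS CPU needs 24 Gbyte/sec of memory bandwidth to keep computing.
  • Example: A(I)=B(I)+C(I) performs one floating-point operation per two loads and one store, i.e. 3 x 8 bytes = 24 bytes per operation in double precision.
• Most current CPUs provide 5 Gbyte/sec of memory bandwidth or less (actual transfer rates are much lower).
• The memory data transfer rate is therefore a major bottleneck for system performance.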
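The 24 Gbyte/sec figure can be checked, and turned around to estimate what a bandwidth-limited loop can sustain. The sketch below assumes 8-byte (double-precision) operands and counts every load and store as memory traffic, i.e. no cache reuse; the function names are illustrative.

```python
def required_bandwidth_gbytes(gflops, loads_per_flop, stores_per_flop,
                              bytes_per_word=8):
    """Memory bandwidth (Gbyte/sec) needed to feed `gflops` of arithmetic."""
    return gflops * (loads_per_flop + stores_per_flop) * bytes_per_word


def sustainable_gflops(bandwidth_gbytes, loads_per_flop, stores_per_flop,
                       bytes_per_word=8):
    """Arithmetic rate (GFLOPS) that a given bandwidth can sustain."""
    bytes_per_flop = (loads_per_flop + stores_per_flop) * bytes_per_word
    return bandwidth_gbytes / bytes_per_flop


if __name__ == "__main__":
    # A(I)=B(I)+C(I): 2 loads and 1 store per floating-point operation.
    print(required_bandwidth_gbytes(1.0, 2, 1))  # 24.0
    print(sustainable_gflops(5.0, 2, 1))         # about 0.21
```

Under these assumptions a 5 Gbyte/sec memory system sustains only about 0.2 GFLOPS on this loop, far below the peak figures of several GFLOPS.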

Memory Architecture
• Ways to overcome the performance limits of the memory system
  • Interleaving/pipelining: addresses are assigned cyclically to independent memory banks; pipelined access (fixed stride)
    • Increases memory bandwidth
    • Used heavily in vector supercomputers
  • Cache: frequently used data is kept in a small, fast memory
    • Effective when memory accesses show locality
[Figure: a CPU with registers, L1 cache and L2 cache, and a vector processor with vector registers, both connected to a memory made of Bank0, Bank1, …, Bank1023]

Memory Architecture
• Virtual memory
  • The addresses used by the program (virtual addresses) are decoupled from the actual addresses where the data is stored in memory (physical addresses)
• Page table
  • Map containing the translation information from virtual memory addresses to physical memory addresses
• Translation Lookaside Buffer (TLB)
  • Special cache (buffer) for virtual-to-physical address translations

Cache Memory
• A small, fast memory that holds part of the data in main memory
• Much lower latency and much higher bandwidth than main memory
• Usually organized in two or three levels
• Cache hit (the data is in the cache) / cache miss

Cache Memory
• Cache line
  • The amount of data loaded from or stored to memory in one transfer. Moving several data items at a time reduces the number of memory accesses and so lessens the impact of latency.
• Cache organization
  • Direct mapped (associativity = 1)
    • Each word of main memory can be stored in exactly one location of the cache memory
  • Fully associative
    • A main memory word can be stored in any location in the cache
  • Set associative (associativity = k)
    • Each main memory word can be stored in one of k places in the cache

Cache Memory
• Typical cache configurations
  • AMD Athlon (from "Thunderbird" on)
    • L1: 64 KB, L2: 256 KB
  • Intel Pentium 4
    • L1: 8 KB, 4-way, line size = 64 bytes
    • L2: 256 KB, 8-way, line size = 128 bytes
  • IBM POWER4
    • L1: 32 KB, 2-way
    • L2: 1440 KB, 8-way, shared by 2 CPUs
    • L3: 128 MB, 8-way, shared by the MCM (8 CPUs)

Cache Memory
• Cache organization of the Intel Pentium 4 (L2)
[Figure: the CPU loads from and stores to an 8-way cache in which each way holds 256 lines of 128 bytes. Memory is drawn in 32 KB blocks (0 KB, 32 KB, 64 KB, 96 KB, 128 KB, …, 32*n KB). The array real c(n,m) is stored as c(1,1), c(2,1), …, c(n,1), c(1,2), c(2,2), …, c(n,m), with 32 elements per cache line.]

Cache Memory
• Cache organization of the IBM POWER4 (L1)
[Figure: the CPU loads from and stores to a 2-way cache in which each way holds 128 lines of 128 bytes. Memory is drawn in 16 KB blocks (0 KB, 16 KB, 32 KB, 48 KB, 64 KB, …, 16*n KB). The array real c(n,m) is stored as c(1,1), c(2,1), …, c(n,1), c(1,2), c(2,2), …, c(n,m), with 32 elements per cache line.]
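The block sizes in the two figures follow from the cache geometry: addresses that are one way size apart compete for the same set. The sketch below assumes the simplest set-index function (line number modulo number of sets); real hardware may index with physical addresses or hash the bits, so treat it as an illustration only. The function name is hypothetical.

```python
def cache_set_index(byte_address, line_bytes, n_sets):
    """Set selected by a simple modulo-indexed set-associative cache."""
    return (byte_address // line_bytes) % n_sets


KB = 1024

if __name__ == "__main__":
    # Pentium 4 L2 as drawn: 8 ways x 256 lines x 128 bytes = 256 KB.
    print([cache_set_index(k * 32 * KB, 128, 256) for k in range(5)])
    # POWER4 L1 as drawn: 2 ways x 128 lines x 128 bytes = 32 KB.
    print([cache_set_index(k * 16 * KB, 128, 128) for k in range(5)])
```

In both cases every listed address lands in set 0, so only as many of them as there are ways (8 and 2 respectively) can be cached at the same time.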

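Contents
• Introduction
• Computer Architecture
• Compilers and Performance Analysis Techniques
  • Role and Functions of the Compiler
  • Performance Analysis Techniques
• CPU Optimization
• Memory Optimization
• Case Studies in Applying Optimization Techniques
• Summary and Conclusions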

Role and Functions of the Compiler
• Translates source code into an executable file
  • Steps: preprocessing, compile, link
• Performs basic code optimizations during compilation (controlled by compiler options)
  • Loop transformations (unrolling/interchange/fusion)
  • Reordering operations
  • Inlining / array padding
• Representative Fortran compilers
  • Linux: g77, Absoft Fortran, Intel Fortran
  • Windows: Compaq Visual Fortran, Intel Fortran
  • IBM AIX: XL Fortran 90
  • Plus the Fortran compilers supplied by each hardware vendor

Role and Functions of the Compiler
• Compiler optimization options
  • IBM XL Fortran 90: -O3 -qarch=pwr4 -qtune=pwr4 -qcache=auto
  • NEC SX Fortran 90: -Chopt
  • HP (Compaq) GS320/HPC320 DEC Fortran 90: -O5 -fast

Role and Functions of the Compiler
• Source code listings
  • IBM XL Fortran 90: -qlist -qsource
  • NEC SX Fortran 90: -R1, R2, R3, R4, R5

Performance Analysis Techniques
• Timing
  • Performance can be measured by inserting timing routines around the whole program or around a specific part of it.
  • Useful for measuring how the performance of a specific part changes after a compiler option or the code is changed.
• Examples of timing routines (labelled IBM XL Fortran and NEC SX Fortran on the slide)

    real t1,t2
    t1=rtc()
    ……
    call do_something
    ……
    t2=rtc()
    print*, 'Elapsed time=',t2-t1

    real t1,t2,ta(2)
    t1=etime(ta)
    ……
    call do_something
    ……
    t2=etime(ta)
    print*, 'Elapsed time=',t2-t1
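The same bracket-the-region pattern can be sketched outside Fortran. The version below uses Python's `time.perf_counter` as a stand-in for the vendor timing routines; `timed` and `do_something` are hypothetical names introduced only for this illustration.

```python
import time


def do_something(n):
    """Placeholder workload: sum of squares of 0 .. n-1."""
    return sum(i * i for i in range(n))


def timed(func, *args):
    """Run func(*args) and return (result, elapsed wall-clock seconds)."""
    t1 = time.perf_counter()
    result = func(*args)
    t2 = time.perf_counter()
    return result, t2 - t1


if __name__ == "__main__":
    result, elapsed = timed(do_something, 1_000_000)
    print("Elapsed time =", elapsed)
```

Timing only the region of interest, and repeating the measurement a few times, makes it easier to see the effect of a single option or code change.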

Performance Analysis Techniques
• Performance measurement
  • IBM: hpmcount
      hpmcount exec
  • NEC: F_PROGINF or C_PROGINF (environment variables)
      setenv F_PROGINF DETAIL
      exec
• Sample hpmcount output

    hpmcount (V 2.4.3) summary
    Total execution time (wall clock time): 186.691589 seconds
    ######## Resource Usage Statistics ########
    Total amount of time in user mode           : 600.030000 seconds
    Total amount of time in system mode         : 0.160000 seconds
    Maximum resident set size                   : 56208 Kbytes
    Average shared memory use in text segment   : 150532 Kbytes*sec
    Average unshared memory use in data segment : 33491704 Kbytes*sec
    …………
    Utilization rate                            : 234.223 %
    Load and store operations                   : 25415.127 M
    MIPS                                        : 2514.435
    Instructions per cycle                      : 0.826
    HW Float points instructions per Cycle      : 0.031
    Floating point instructions + FMAs          : 17446.736 M
    Float point instructions + FMA rate         : 93.452 Mflip/s
    FMA percentage                              : 98.156 %
    Computation intensity                       : 0.686
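Two of the derived figures in this output can be reproduced from the raw counters. The formulas below are inferred from the numbers shown (they match to three decimals) rather than taken from hpmcount documentation, so treat them as an assumption; the variable and function names are illustrative.

```python
# Raw values copied from the sample output above.
WALL_SECONDS = 186.691589
FLOP_MILLIONS = 17446.736        # Floating point instructions + FMAs
LOAD_STORE_MILLIONS = 25415.127  # Load and store operations


def flop_rate_mflips(flop_millions, wall_seconds):
    """Floating-point instruction rate in Mflip/s."""
    return flop_millions / wall_seconds


def computation_intensity(flop_millions, load_store_millions):
    """Floating-point instructions per load/store operation."""
    return flop_millions / load_store_millions


if __name__ == "__main__":
    print(round(flop_rate_mflips(FLOP_MILLIONS, WALL_SECONDS), 3))
    print(round(computation_intensity(FLOP_MILLIONS, LOAD_STORE_MILLIONS), 3))
```

A computation intensity below 1 means the code issues more loads and stores than floating-point instructions, which points toward the memory optimizations discussed later.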

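Performance Analysis Techniques
• Profiler
  • An analysis tool that reports the performance (time spent) of each part of a program, per subroutine or per line
  • Shows the call and dependency relationships between subroutines, which makes the program structure easier to analyze
  • The starting point for program performance optimization
• Types of profiler
  • Using prof and gprof
    • Add the -p (prof) or -pg (gprof) option when compiling
    • Run the program
    • Run the profiler
      • prof executable-name
      • gprof executable-name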

Performance Analysis Techniques
• Sample profiler output (gprof)

                                      called/total       parents
    index  %time    self descendents  called+self    name         index
                                      called/total       children

                    0.00       70.99       1/1       .__start [2]
    [1]     99.5    0.00       70.99       1         .main [1]
                    0.00       70.93       1/1       .ant_pml [3]
                    0.04        0.00       1/1       .reset [72]
                    0.01        0.00       1/1       .init [85]
                    0.00        0.01       1/1       .out_ant [86]
                    0.00        0.00       1/1       .in_cnl [147]
                    0.00        0.00       1/1       .in_geo [148]
    -----------------------------------------------
    6.6s                                             <spontaneous>
    [2]     99.5    0.00       70.99                 .__start [2]
                    0.00       70.99       1/1       .main [1]
    -----------------------------------------------
                    0.00       70.93       1/1       .main [1]
    [3]     99.4    0.00       70.93       1         .ant_pml [3]
                    0.00       27.46    4096/4096    .adv_ph [4]
                    0.00       26.43    4096/4096    .adv_pe [5]
                    6.91        0.00    4096/4096    .adv_ie [14]
                    4.74        0.00    4096/4096    .adv_ih [19]

Contents
• Introduction
• Computer Architecture
• Compilers and Performance Analysis Techniques
• CPU Optimization
  • Loop unrolling
  • Fused Multiply-Add (FMA) instructions
  • Exposing Instruction-Level Parallelism (ILP)
  • Special functions
  • Eliminating overheads
• Memory Optimization
• Case Studies in Applying Optimization Techniques
• Summary and Conclusions

Loop unrolling
• The loop body is written out for two or more consecutive values of the loop index
• Sometimes done automatically by the compiler's optimizer
• Simplest effect of loop unrolling: fewer test/jump instructions (fatter loop body, less loop overhead)
• Can reduce the number of memory loads per floating-point operation
• Can bring various other beneficial side effects

Loop unrolling
• Example: DAXPY operation

  Original loop:

    do i=1,N
      a(i)= a(i)+b(i)*c
    enddo

  After loop unrolling (the preconditioning loop handles the case when N is not a multiple of 4):

    ii= mod(N,4)
    do i= 1,ii
      a(i)= a(i)+b(i)*c
    enddo
    do i= 1+ii,N,4
      a(i)= a(i)+b(i)*c
      a(i+1)= a(i+1)+b(i+1)*c
      a(i+2)= a(i+2)+b(i+2)*c
      a(i+3)= a(i+3)+b(i+3)*c
    enddo
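The index bookkeeping of the unrolled DAXPY loop is easy to get wrong, so it helps to check it against the plain loop. The sketch below mirrors the transformation with 0-based Python lists; it demonstrates correctness of the rewrite only, since an interpreter will not show the speedup a compiled loop would. Function names are illustrative.

```python
def daxpy(a, b, c):
    """Reference loop: a(i) = a(i) + b(i)*c."""
    for i in range(len(a)):
        a[i] = a[i] + b[i] * c


def daxpy_unrolled4(a, b, c):
    """Same update, unrolled by 4 with a preconditioning loop."""
    n = len(a)
    ii = n % 4
    for i in range(ii):              # handles the leftover elements first
        a[i] = a[i] + b[i] * c
    for i in range(ii, n, 4):        # n - ii is a multiple of 4
        a[i] = a[i] + b[i] * c
        a[i + 1] = a[i + 1] + b[i + 1] * c
        a[i + 2] = a[i + 2] + b[i + 2] * c
        a[i + 3] = a[i + 3] + b[i + 3] * c


if __name__ == "__main__":
    a1 = [float(i) for i in range(10)]
    a2 = list(a1)
    b = [0.5 * i for i in range(10)]
    daxpy(a1, b, 2.0)
    daxpy_unrolled4(a2, b, 2.0)
    print(a1 == a2)
```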

Loop unrolling
• Improving the flop/load ratio

  Original loop nest. Innermost loop: 2 loads and 2 flops performed, i.e. one load per flop:

    do i= 1,N
      do j= 1,M
        y(i)=y(i)+a(j,i)*x(j)
      enddo
    enddo

  Outer loop unrolled by 4. Innermost loop: 5 loads and 8 flops! This also exposes instruction-level parallelism:

    do i= 1,n,4
      t0=y(i)
      t1=y(i+1)
      t2=y(i+2)
      t3=y(i+3)
      do j= 1,m
        t0=t0+a(j,i)*x(j)
        t1=t1+a(j,i+1)*x(j)
        t2=t2+a(j,i+2)*x(j)
        t3=t3+a(j,i+3)*x(j)
      enddo
      y(i)=t0
      y(i+1)=t1
      y(i+2)=t2
      y(i+3)=t3
    enddo
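A sketch of the same outer-loop unrolling, checked against the plain loop nest. Here `a[i][j]` plays the role of the Fortran `a(j,i)`, and a cleanup loop for row counts that are not a multiple of 4 has been added as an assumption, since the slide's version presumes `n` divisible by 4. Names are illustrative.

```python
def matvec(a, x, y):
    """Reference: y(i) = y(i) + sum_j a(j,i)*x(j)."""
    for i in range(len(a)):
        for j in range(len(x)):
            y[i] = y[i] + a[i][j] * x[j]


def matvec_unrolled4(a, x, y):
    """Outer loop unrolled by 4: x[j] is loaded once and reused 4 times."""
    n, m = len(a), len(x)
    n4 = n - n % 4
    for i in range(0, n4, 4):
        t0, t1, t2, t3 = y[i], y[i + 1], y[i + 2], y[i + 3]
        for j in range(m):
            xj = x[j]                      # 1 load shared by 4 updates
            t0 = t0 + a[i][j] * xj
            t1 = t1 + a[i + 1][j] * xj
            t2 = t2 + a[i + 2][j] * xj
            t3 = t3 + a[i + 3][j] * xj
        y[i], y[i + 1], y[i + 2], y[i + 3] = t0, t1, t2, t3
    for i in range(n4, n):                 # cleanup rows (added assumption)
        t = y[i]
        for j in range(m):
            t = t + a[i][j] * x[j]
        y[i] = t


if __name__ == "__main__":
    a = [[float(i + j) for j in range(3)] for i in range(6)]
    x = [1.0, 2.0, 3.0]
    y1, y2 = [0.0] * 6, [0.0] * 6
    matvec(a, x, y1)
    matvec_unrolled4(a, x, y2)
    print(y1 == y2, 2 / 2, 5 / 8)  # loads per flop: before and after
```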

Fused Multiply-Add (FMA) instructions
• On many CPUs (e.g., IBM Power3/Power4) there is an instruction which multiplies two operands and adds the result to a third
  • a=a+b*c is performed by 1 instruction
• Consider the code
    a= b + c*d + f*g
  versus
    a= c*d + f*g + b
  Evaluated left to right, the first form maps onto two FMAs, while the second needs a multiply, an FMA and a separate add.
• Can reordering be done automatically?

Exposing ILP

  Original code (a single dependent chain of additions):

    program nrm1
    real a(n)
    tt= 0d0
    do j= 1,n
      tt= tt + a(j) * a(j)
    enddo
    print *,tt
    end

  Optimized code (two independent partial sums; assumes n is even):

    program nrm2
    real a(n)
    tt1= 0d0
    tt2= 0d0
    do j= 1,n,2
      tt1= tt1 + a(j)*a(j)
      tt2= tt2 + a(j+1)*a(j+1)
    enddo
    tt= tt1 + tt2
    print *, tt
    end

Exposing ILP
• Superscalar CPUs have a high degree of on-chip parallelism that should be exploited
• The optimized code uses temporary variables to indicate independent instruction streams
• This is more than just loop unrolling!
• Can this be done automatically?
• Change in rounding errors?
• SIMD?
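The rounding question can be explored directly: splitting the sum into two streams changes the order of the additions, so the result may differ in the last bits even though both are valid. A minimal sketch follows, with a tail fix-up for odd lengths added as an assumption; the names are illustrative.

```python
import math


def sum_squares_single(a):
    """One accumulator: every addition depends on the previous one."""
    tt = 0.0
    for v in a:
        tt = tt + v * v
    return tt


def sum_squares_two_streams(a):
    """Two independent accumulators, combined at the end."""
    tt1 = 0.0
    tt2 = 0.0
    n2 = len(a) - len(a) % 2
    for j in range(0, n2, 2):
        tt1 = tt1 + a[j] * a[j]
        tt2 = tt2 + a[j + 1] * a[j + 1]
    if n2 < len(a):                # leftover element when len(a) is odd
        tt1 = tt1 + a[-1] * a[-1]
    return tt1 + tt2


if __name__ == "__main__":
    a = [0.1 * k for k in range(1, 1001)]
    s1 = sum_squares_single(a)
    s2 = sum_squares_two_streams(a)
    print(s1, s2, s1 - s2)  # any difference is rounding, not a bug
```

Because the reordering can change the floating-point result, a compiler may apply it only when given options that relax strict floating-point semantics.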

Exposing ILP
• Software pipelining
  • Arranging instructions in groups that can be executed together in one cycle
  • Again, the idea is to exploit instruction-level parallelism (on-chip parallelism)
  • Often done by optimizing compilers, but not always successfully
  • Closely related to loop unrolling
  • Less important on out-of-order CPUs

  Original loop:

    do i = 1,n
      A(i) = A(i) + x * B(i) + y * C(i)
    enddo

  Software-pipelined loop (the products for iteration i+1 are computed while A(i) is updated):

    se = x * B(1)
    te = y * C(1)
    do i = 1,n-1
      so = x * B(i+1)
      to = y * C(i+1)
      A(i) = A(i) + se + te
      se = so
      te = to
    enddo
    A(n) = A(n) + se + te
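The prologue and epilogue of a software-pipelined loop are where off-by-one errors tend to appear, so a direct comparison with the original loop is useful. The sketch below mirrors the slide's code with 0-based indices; the guard for an empty array is an added assumption, and the function names are illustrative.

```python
def update_plain(A, B, C, x, y):
    """A(i) = A(i) + x*B(i) + y*C(i)."""
    for i in range(len(A)):
        A[i] = A[i] + x * B[i] + y * C[i]


def update_software_pipelined(A, B, C, x, y):
    """Products for element i+1 are formed while element i is updated."""
    n = len(A)
    if n == 0:
        return
    se = x * B[0]                  # prologue: products for the first element
    te = y * C[0]
    for i in range(n - 1):
        so = x * B[i + 1]          # next iteration's products
        to = y * C[i + 1]
        A[i] = A[i] + se + te      # current iteration's update
        se, te = so, to
    A[n - 1] = A[n - 1] + se + te  # epilogue: last element


if __name__ == "__main__":
    B = [1.0, 2.0, 3.0, 4.0]
    C = [0.5, 0.25, 0.125, 2.0]
    A1 = [10.0, 20.0, 30.0, 40.0]
    A2 = list(A1)
    update_plain(A1, B, C, 3.0, 4.0)
    update_software_pipelined(A2, B, C, 3.0, 4.0)
    print(A1 == A2)
```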

Special functions
• Expensive special functions (up to several dozen cycles)
  • / (divide), sqrt, exp, log, sin, cos, pow, …
• Use mathematical identities:
  • log(x) + log(y) = log(x*y)
  • x**3 = x*x*x
• Use special libraries that
  • vectorize when many of the same functions must be evaluated (MASS library for IBM, Vector library in Intel MKL)
  • trade accuracy for speed, when appropriate
• Example: replace a division inside the loop by a multiplication with the reciprocal

  Original loop:

    do i=1,n
      x(i)=x(i)/dx
    enddo

  Optimized loop:

    oodx=1./dx
    do i=1,n
      x(i)=x(i)*oodx
    enddo
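Multiplying by a precomputed reciprocal is not always bit-identical to dividing, which is one reason compilers may not apply it by default. The sketch below, with illustrative function names, compares the two forms and counts how many elements differ in the last bits.

```python
import math


def scale_by_division(x, dx):
    """x(i) = x(i)/dx: one divide per element."""
    return [v / dx for v in x]


def scale_by_reciprocal(x, dx):
    """x(i) = x(i)*oodx: one divide in total, then only multiplies."""
    oodx = 1.0 / dx
    return [v * oodx for v in x]


if __name__ == "__main__":
    x = [float(k) for k in range(1, 101)]
    d = scale_by_division(x, 3.0)
    r = scale_by_reciprocal(x, 3.0)
    differing = sum(1 for p, q in zip(d, r) if p != q)
    print("elements differing in the last bits:", differing)
```

When `dx` is a power of two the reciprocal is exact and both forms agree; otherwise the results agree only to within rounding.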

Eliminating overheads
• if statements …
  • Prohibit some optimizations (e.g., loop unrolling in some cases)
  • Evaluating the condition expression takes time
  • CPU pipeline may be interrupted (dynamic jump prediction)
• Goal: avoid if statements in the innermost loops
• No generally applicable technique exists

  Original loop:

    do i=1,n
      if(i.le.2) then
        x(i)=0
      else
        x(i)=y(i)
      endif
    enddo

  Optimized code (the special cases are moved out of the loop):

    x(1)=0
    x(2)=0
    do i=3,n
      x(i)=y(i)
    enddo
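Moving the special cases out of the loop (often called loop peeling) can be checked against the branching version. The sketch uses 0-based indices, so "i <= 2" becomes "i < 2", and it adds a guard for arrays shorter than two elements, which the slide's version does not handle; names are illustrative.

```python
def fill_with_branch(y):
    """Branch evaluated in every iteration."""
    x = [0.0] * len(y)
    for i in range(len(y)):
        if i < 2:
            x[i] = 0.0
        else:
            x[i] = y[i]
    return x


def fill_peeled(y):
    """First two elements handled before the loop; loop body is branch-free."""
    n = len(y)
    x = [0.0] * n
    for i in range(min(2, n)):     # guard for n < 2 (added assumption)
        x[i] = 0.0
    for i in range(2, n):
        x[i] = y[i]
    return x


if __name__ == "__main__":
    y = [1.0, 2.0, 3.0, 4.0, 5.0]
    print(fill_with_branch(y), fill_peeled(y))
```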

Eliminating overheads
• Eliminating subroutine calling overheads
  • Subroutine calls are expensive (on the order of up to 100 cycles)
  • Passing value arguments (copying data) can be extremely expensive when used inappropriately
  • Passing reference arguments (as in FORTRAN) may be dangerous from the point of view of software correctness
  • Generally, no subroutine calls should be used in tight loops
  • Inlining subroutines: can be done by compilers!
  • Use the inline keyword in C++, macros in C

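Contents
• Introduction
• Computer Architecture
• Compilers and Performance Analysis Techniques
• CPU Optimization
• Memory Optimization
  • Data layout optimizations
  • Data access optimizations
• Case Studies in Applying Optimization Techniques
• Summary and Conclusions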

Data layout optimizations
• Stride-1 access: the innermost loop iterates over the first index
  • Either by choosing the right data layout (array transpose), or
  • By arranging nested loops in the right order (loop interchange): a data access optimization

  Stride-N access:

    real a(N,M), b(N,M)
    do i=1,N
      do j=1,M
        a(i,j)=a(i,j)+b(i,j)
      enddo
    enddo

  Stride-1 access after changing the data layout (transpose):

    real a(M,N), b(M,N)
    do i=1,N
      do j=1,M
        a(j,i)=a(j,i)+b(j,i)
      enddo
    enddo
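The stride labels follow from Fortran's column-major storage, where element (i,j) of an array with `n_rows` rows sits at offset (i-1) + (j-1)*n_rows. The sketch below computes the distance between consecutive inner-loop accesses for both layouts; the helper names are hypothetical.

```python
def column_major_offset(i, j, n_rows):
    """Element offset of (i, j), 1-based, in a column-major array."""
    return (i - 1) + (j - 1) * n_rows


def inner_stride_original(n):
    """real a(N,M), inner loop over j in a(i,j): step in the second index."""
    return column_major_offset(1, 2, n) - column_major_offset(1, 1, n)


def inner_stride_transposed(m):
    """real a(M,N), inner loop over j in a(j,i): step in the first index."""
    return column_major_offset(2, 1, m) - column_major_offset(1, 1, m)


if __name__ == "__main__":
    print("original layout  :", inner_stride_original(1000))    # stride N
    print("transposed layout:", inner_stride_transposed(500))   # stride 1
```

With stride N, each access touches a different cache line once N elements exceed a line, so most of every loaded line goes unused.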

Data layout optimizations
• Stride-1 access
  • Transpose matrix
  • Loop interchange?

  Original:

    real s(N), b(N,M), c(M)
    do i=1,N
      do j=1,M
        s(i)=s(i)+b(i,j)*c(j)
      enddo
    enddo

  With the matrix transposed:

    real s(N), b(M,N), c(M)
    do i=1,N
      do j=1,M
        s(i)=s(i)+b(j,i)*c(j)
      enddo
    enddo

Data layout optimizations
• Stride-1 access
  • Gauss elimination: change the data layout

  Original (the inner loops run over the second index):

    do 10 i=1,n-1
      do 20 j=i+1,n
        f=A(j,i)/A(i,i)
        do 30 k=i+1,n
          A(j,k)=A(j,k)-f*A(i,k)
    30  continue
        do 31 k=1,m
          b(j,k)=b(j,k)-f*b(i,k)
    31  continue
    20 continue
    10 continue

  With the data layout changed (transposed storage; the inner loops run over the first index):

    do 10 i=1,n-1
      do 20 j=i+1,n
        f=A(i,j)/A(i,i)
        do 30 k=i+1,n
          A(k,j)=A(k,j)-f*A(k,i)
    30  continue
        do 31 k=1,m
          b(k,j)=b(k,j)-f*b(k,i)
    31  continue
    20 continue
    10 continue

Data layout optimizations
• Array padding
  • Idea: allocate arrays larger than necessary
    • Change relative memory distances
    • Avoid severe cache thrashing effects
  • Can be done automatically by compilers
• Example (FORTRAN: column-major order):

    double precision u(1024, 1024)
    double precision u(1024+pad, 1024)

  Without padding:

    real A(4096,4096)
    ! j is set by an enclosing loop that is not shown on the slide
    do i = 1,4096
      A(i,j) = A(i,j) + 0.4*(A(i,j-2)+A(i,j-1))+ 0.6*(A(i,j+1)+A(i,j+2))
    enddo

  With padding:

    real A(4096+pad,4096)
    do i = 1,4096
      A(i,j) = A(i,j) + 0.4*(A(i,j-2)+A(i,j-1))+ 0.6*(A(i,j+1)+A(i,j+2))
    enddo
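To see why padding helps, one can compute which cache set each of the five columns touched by the stencil maps to. The sketch below assumes 4-byte `real` elements, a modulo-indexed cache with the POWER4 L1 geometry from the earlier slide (128 sets of 128-byte lines, 2-way), and a pad of 32 elements (one cache line); the pad value and the function names are illustrative choices, not from the slides.

```python
LINE_BYTES = 128
N_SETS = 128
ELEM_BYTES = 4  # Fortran default real


def set_index(i, j, leading_dim):
    """Cache set of A(i,j) for a column-major array starting at address 0."""
    byte_address = ((i - 1) + (j - 1) * leading_dim) * ELEM_BYTES
    return (byte_address // LINE_BYTES) % N_SETS


def stencil_sets(i, j, leading_dim):
    """Sets touched by A(i,j-2) .. A(i,j+2)."""
    return [set_index(i, jj, leading_dim) for jj in range(j - 2, j + 3)]


if __name__ == "__main__":
    print("no padding :", stencil_sets(1, 3, 4096))       # all the same set
    print("pad = 32   :", stencil_sets(1, 3, 4096 + 32))  # five distinct sets
```

Without padding the five columns compete for a single 2-way set, so lines are evicted before they can be reused; with padding they spread over neighbouring sets.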

Data access optimizations
• Loop unrolling (see above)
• Loop interchange
• Loop fusion
• Loop split = loop distribution
• Loop blocking

Data access optimizations
• Loop interchange
  • Changing the data access pattern

  Stride-N access:

    real a(N,M), b(N,M)
    do i=1,N
      do j=1,M
        a(i,j)=a(i,j)+b(i,j)
      enddo
    enddo

  Stride-1 access after changing the data access pattern (loops interchanged):

    real a(N,M), b(N,M)
    do j=1,M
      do i=1,N
        a(i,j)=a(i,j)+b(i,j)
      enddo
    enddo

Data access optimizations
• Loop fusion

  Original: a is loaded into the cache twice (if sufficiently large)

    do i= 1,N
      a(i)= a(i)+b(i)
    enddo
    do i= 1,N
      a(i)= a(i)*c(i)
    enddo

  Fused: a is loaded into the cache only once

    do i= 1,N
      a(i)= (a(i)+b(i))*c(i)
    enddo
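Fusion is legal here because iteration i of the second loop depends only on iteration i of the first. A small sketch comparing the two-pass and fused versions follows; it checks equivalence only and does not measure the cache effect. Function names are illustrative.

```python
def two_loops(a, b, c):
    """Two sweeps over a: each element is touched twice."""
    for i in range(len(a)):
        a[i] = a[i] + b[i]
    for i in range(len(a)):
        a[i] = a[i] * c[i]


def fused_loop(a, b, c):
    """One sweep over a: each element is touched once."""
    for i in range(len(a)):
        a[i] = (a[i] + b[i]) * c[i]


if __name__ == "__main__":
    b = [0.5, 1.5, 2.5, 3.5]
    c = [2.0, 3.0, 4.0, 5.0]
    a1 = [1.0, 2.0, 3.0, 4.0]
    a2 = list(a1)
    two_loops(a1, b, c)
    fused_loop(a2, b, c)
    print(a1 == a2)
```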

Data access optimizations
• Loop split
  • The inverse transformation of loop fusion
  • Divide the work of one loop into two to make the body less complicated
    • Leverage compiler optimizations
    • Enhance instruction cache utilization
