
Levels of Processor/Memory Hierarchy


Presentation Transcript


  1. Levels of Processor/Memory Hierarchy
  • Can be modeled by increasing dimensionality of the data array.
  • Additional dimension for each level of the hierarchy.
  • Envision data as reshaped to reflect the increased dimensionality.
  • The calculus automatically transforms the algorithm to reflect the reshaped data array.
  • Data layout, data movement, and scalarization are automatically generated based on the reshaped data array.

  2. Levels of Processor/Memory Hierarchy (continued)
  • Math and indexing operations in the same expression
  • Framework for design-space search
  • Rigorous and provably correct
  • Extensible to complex architectures

  Mathematics of Arrays: y = conv(x)

  Example: "raising" array dimensionality. The vector x: < 0 1 2 ... 35 > is reshaped into a 12 x 3 array with rows < 0 1 2 >, < 3 4 5 >, ..., < 33 34 35 >. The memory-hierarchy map assigns the reshaped data to P0's main memory, L2 cache, and L1 cache, and the rows are distributed across processors P0, P1, and P2 for parallelism.
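The reshaping in this example can be made concrete with a short sketch (plain nested vectors; the grouping of the 12 rows into three blocks of four for P0, P1, and P2 is an assumption read off the figure):

```cpp
#include <cstddef>
#include <vector>

// Reshape the flat vector <0 1 ... 35> into a 12 x 3 array, then raise the
// dimension again to 3 x 4 x 3: one block of four 3-element rows per processor.
int main() {
    std::vector<int> flat(36);
    for (int i = 0; i < 36; ++i) flat[i] = i;

    // 12 x 3: rows < 0 1 2 >, < 3 4 5 >, ..., < 33 34 35 >
    std::vector<std::vector<int>> rows(12, std::vector<int>(3));
    for (std::size_t r = 0; r < 12; ++r)
        for (std::size_t c = 0; c < 3; ++c)
            rows[r][c] = flat[3 * r + c];

    // 3 x 4 x 3: an added processor dimension (P0, P1, P2), 4 rows each
    // (assumed grouping), modeling the parallelism level of the hierarchy.
    std::vector<std::vector<std::vector<int>>> byProc(3);
    for (std::size_t p = 0; p < 3; ++p)
        byProc[p].assign(rows.begin() + 4 * p, rows.begin() + 4 * (p + 1));
}
```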

  3. Application Domain: Signal Processing

  3-d radar data processing: pulse compression, Doppler filtering, beamforming, detection.

  Composition of monolithic array operations: convolution, matrix multiply.

  Change the algorithm to better match hardware/memory/communication: lift dimension algebraically.
  • The algorithm is input.
  • Architectural information is input (hardware info: memory, processor).

  Model processors (dim = dim + 1); model time-variance (dim = dim + 1); model Level 1 cache (dim = dim + 1). Model all three: dim = dim + 3.

  4. Some Psi Calculus Operations
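The slide's table of operations did not survive extraction; as a minimal sketch, here are plausible C++ definitions of the 1-d operations used on the following slides (iota, take, drop, rev, cat, rotate). The signatures are assumptions for illustration, not the Psi Calculus API itself:

```cpp
#include <cstddef>
#include <vector>

using Vec = std::vector<int>;

// iota(n): < 0 1 ... n-1 >
Vec iota(std::size_t n) {
    Vec v(n);
    for (std::size_t i = 0; i < n; ++i) v[i] = static_cast<int>(i);
    return v;
}

// take(k, x): the first k elements of x
Vec take(std::size_t k, const Vec& x) { return Vec(x.begin(), x.begin() + k); }

// drop(k, x): x without its first k elements
Vec drop(std::size_t k, const Vec& x) { return Vec(x.begin() + k, x.end()); }

// rev(x): x reversed
Vec rev(const Vec& x) { return Vec(x.rbegin(), x.rend()); }

// cat(x, y): concatenation of x and y
Vec cat(const Vec& x, const Vec& y) {
    Vec r(x);
    r.insert(r.end(), y.begin(), y.end());
    return r;
}

// rotate(k, x): x rotated left by k positions
Vec rotate(std::size_t k, const Vec& x) {
    return cat(drop(k % x.size(), x), take(k % x.size(), x));
}
```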

  5. Convolution: Psi Calculus Description

  Definition of y = conv(h, x):
  y[n] = sum_{k=0}^{M-1} h[k] * x'[n+k], for 0 <= n < N+M-1,
  where x has N elements, h has M elements, and x' is x padded by M-1 zeros on either end.

  Algorithm and Psi Calculus description, for x = < 1 2 3 4 >, h = < 5 6 7 >:

  Algorithm step                     Psi Calculus                                                          Result
  Initial step                       x = < 1 2 3 4 >, h = < 5 6 7 >
  Form x'                            x' = cat(reshape(< M-1 >, < 0 >), cat(x, reshape(< M-1 >, < 0 >)))    x' = < 0 0 1 2 3 4 0 0 >
  Rotate x' (N+M-1) times            x'rot = binaryOmega(rotate, 0, iota(N+M-1), 1, x')                    rows < 0 0 1 2 ... >, < 0 1 2 3 ... >, < 1 2 3 4 ... >, ...
  Take the size-of-h part of x'rot   x'final = binaryOmega(take, 0, reshape(< N+M-1 >, < M >), 1, x'rot)   rows < 0 0 1 >, < 0 1 2 >, < 1 2 3 >, ...
  Multiply                           Prod = binaryOmega(*, 1, h, 1, x'final)                               rows < 0 0 7 >, < 0 6 14 >, < 5 12 21 >, ...
  Sum                                Y = unaryOmega(sum, 1, Prod)                                          Y = < 7 20 38 ... >

  Psi Calculus operators compose to form higher-level operations.
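The derivation above can be checked with a plain-loop sketch; binaryOmega/unaryOmega apply an operation along chosen dimensions, and this sketch simply hand-fuses the rotate/take/multiply/sum pipeline into a single index expression:

```cpp
#include <cstddef>
#include <iostream>
#include <vector>

// Sketch of the slide's derivation: y[n] = sum_k h[k] * x'[n+k],
// where x' is x padded with M-1 zeros on either end.
std::vector<int> conv(const std::vector<int>& h, const std::vector<int>& x) {
    const std::size_t M = h.size(), N = x.size(), tz = N + M - 1;

    // Form x' = cat(<0 ... 0>, cat(x, <0 ... 0>)) with M-1 zeros per side.
    std::vector<int> xp(M - 1, 0);
    xp.insert(xp.end(), x.begin(), x.end());
    xp.insert(xp.end(), M - 1, 0);

    // For each output n: the length-M window of the n-th rotation of x',
    // multiplied elementwise by h and summed (rotate/take/*/sum fused).
    std::vector<int> y(tz, 0);
    for (std::size_t n = 0; n < tz; ++n)
        for (std::size_t k = 0; k < M; ++k)
            y[n] += h[k] * xp[n + k];
    return y;
}

int main() {
    // x = < 1 2 3 4 >, h = < 5 6 7 > as on the slide.
    for (int v : conv({5, 6, 7}, {1, 2, 3, 4})) std::cout << v << ' ';
    std::cout << '\n';  // prints: 7 20 38 56 39 20 (slide shows < 7 20 38 ... >)
}
```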

  6. Experimental Platform and Method

  Hardware
  • DY4 CHAMP-AV board, containing 4 MPC7400s and 1 MPC8420
  • MPC7400 (G4): 450 MHz, 32 KB L1 data cache, 2 MB L2 cache, 64 MB memory per processor

  Software
  • VxWorks 5.2 real-time OS
  • GCC 2.95.4 (non-official release): GCC 2.95.3 with patches for VxWorks
  • Optimization flags: -O3 -funroll-loops -fstrict-aliasing

  Method
  • Run many iterations; report average, minimum, and maximum times: from 10,000,000 iterations for small data sizes down to 1,000 for large data sizes
  • All approaches run on the same data; only average times are shown here
  • Only one G4 processor used
  • Use of the VxWorks OS resulted in very low variability in timing, giving a high degree of confidence in the results

  7. Convolution and Dimension Lifting
  • Model the processor and Level 1 cache.
  • Start with 1-d inputs (the input dimension).
  • Envision a 2nd dimension ranging over output values.
  • Envision processors: reshaped into a 3rd dimension; the 2nd dimension is partitioned.
  • Envision cache: reshaped into a 4th dimension; the 1st dimension is partitioned.
  • "psi" reduce to normal form.

  8. Envision 2nd Dimension Ranging over Output Values
  • Let tz = N+M-1, M = th = 4, N = tx.
  [Figure: the padded input < 0 0 0 x0 ... > is viewed as a tz x 4 array of sliding windows; each row is multiplied elementwise by < h3 h2 h1 h0 > and summed (S) to produce one of the tz outputs.]

  9. Envision Processors
  • Reshaped into a 3rd dimension.
  • The 2nd dimension is partitioned. Let p = 2.
  [Figure: the tz x 4 array is split along the output (2nd) dimension into two (tz/2) x 4 blocks, one per processor; each block is multiplied (x) and summed (S) independently.]

  10. Envision Cache
  • Reshaped into a 4th dimension.
  • The 1st dimension is partitioned.
  [Figure: each processor's (tz/2) x 4 block is further split along the window (1st) dimension into cache-sized pieces of width 2; each piece is multiplied (x) and summed (S), and the partial sums are combined.]

  11. ONF for the Convolution Decomposition with Processors & Cache

  Let tz = N+M-1, M = th, N = tx. Generic form: 4-dimensional after "psi" reduction.

  For i0 = 0 to p-1 do:                        (processor loop)
    For i1 = 0 to tz/p - 1 do:                 (time loop)
      sum <- 0
      For icacherow = 0 to M/cache - 1 do:     (cache loop)
        For i3 = 0 to cache - 1 do:
          sum <- sum + h[(M - ((icacherow * cache) + i3)) - 1] * x'[((tz/p * i0) + i1) + (icacherow * cache) + i3]
      y[(tz/p * i0) + i1] <- sum

  sum is calculated for each element of y. Time domain.
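Transcribed into runnable C++ (assuming, as in the slide's generic form, that p divides tz and cache divides M evenly, and that each sum is stored into y as the slide's note indicates):

```cpp
#include <cstddef>
#include <iostream>
#include <vector>

// ONF loop nest from the slide: processor loop (i0), time loop (i1),
// cache loops (icacherow, i3). xp is x padded with M-1 zeros per side.
std::vector<int> conv_onf(const std::vector<int>& h, const std::vector<int>& x,
                          std::size_t p, std::size_t cache) {
    const std::size_t M = h.size(), N = x.size(), tz = N + M - 1;

    std::vector<int> xp(M - 1, 0);
    xp.insert(xp.end(), x.begin(), x.end());
    xp.insert(xp.end(), M - 1, 0);

    std::vector<int> y(tz, 0);
    for (std::size_t i0 = 0; i0 < p; ++i0)                  // processor loop
        for (std::size_t i1 = 0; i1 < tz / p; ++i1) {       // time loop
            int sum = 0;
            for (std::size_t ic = 0; ic < M / cache; ++ic)  // cache loop
                for (std::size_t i3 = 0; i3 < cache; ++i3)
                    sum += h[M - (ic * cache + i3) - 1]     // h traversed reversed
                         * xp[tz / p * i0 + i1 + ic * cache + i3];
            y[tz / p * i0 + i1] = sum;
        }
    return y;
}

int main() {
    // M = 4, N = 3 gives tz = 6, divisible by p = 2; M divisible by cache = 2.
    for (int v : conv_onf({8, 7, 6, 5}, {1, 2, 3}, 2, 2)) std::cout << v << ' ';
    std::cout << '\n';  // prints: 8 23 44 38 28 15, the full convolution
}
```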

  12. Outline
  • Overview
  • Array algebra (MoA) and index calculus (Psi Calculus)
  • Time-domain convolution
  • Other algorithms in radar:
    • Modified Gram-Schmidt QR decomposition: MoA to ONF; experiments
    • Composition of matrix multiplication in beamforming: MoA to DNF; experiments
    • FFT
  • Benefits of using MoA and Psi Calculus

  13. Algorithms in Radar

  [Flow diagram:]
  • Time-domain convolution, conv(x, y): manual description & derivation for 1 processor -> DNF -> lift dimension (processor, L1 cache) and reformulate -> ONF for 1 processor -> mechanize using expression templates.
  • Modified Gram-Schmidt QR(A): DNF -> compiler optimizations from DNF to ONF -> implement DNF/ONF in Fortran 90 -> benchmark at NCSA with LAPACK.
  • Beamforming, A x (B^H x C): MoA & Psi Calculus -> DNF.
  • The DNF is also used to reason about RAW, with thoughts on an abstract machine.

  14. Benefits of Using MoA and Psi Calculus
  • The processor/memory hierarchy can be modeled by reshaping data, using an extra dimension for each level.
  • Composition of monolithic operations can be re-expressed as composition of operations on smaller data granularities, which matches memory-hierarchy levels and avoids materialization of intermediate arrays.
  • The algorithm can be automatically (algebraically) transformed to reflect the array reshapings above.
  • Facilitates programming expressed at a high level, intentional program design and analysis, and portability.
  • This approach is applicable to many other problems in radar.

  15. ONF for the QR Decomposition with Processors & Cache (Modified Gram-Schmidt)
  • Initialization
  • Main loop:
    • Processor loop: compute norm
    • Processor loop: normalize
    • Processor/cache loop: dot product
    • Processor/cache loop: orthogonalize
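The slide gives only the loop skeleton; as an illustrative sketch rather than the derived ONF, here is a plain modified Gram-Schmidt QR whose phases mirror that skeleton, on a column-major n x n matrix:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Modified Gram-Schmidt QR on a column-major n x n matrix (a[j] is column j).
// a is overwritten with Q; r receives the triangular factor.
// Phases follow the slide's skeleton: norm, normalize, dot product, orthogonalize.
void mgs_qr(std::vector<std::vector<double>>& a,
            std::vector<std::vector<double>>& r) {
    const std::size_t n = a.size();
    for (std::size_t j = 0; j < n; ++j) {
        // Compute the norm of column j (processor loop on the slide).
        double norm = 0.0;
        for (std::size_t i = 0; i < n; ++i) norm += a[j][i] * a[j][i];
        r[j][j] = std::sqrt(norm);

        // Normalize column j (processor loop).
        for (std::size_t i = 0; i < n; ++i) a[j][i] /= r[j][j];

        // For each later column: dot product, then orthogonalize
        // (processor/cache loops on the slide).
        for (std::size_t k = j + 1; k < n; ++k) {
            double dot = 0.0;
            for (std::size_t i = 0; i < n; ++i) dot += a[j][i] * a[k][i];
            r[k][j] = dot;  // column-major storage: r[k][j] holds R(j,k)
            for (std::size_t i = 0; i < n; ++i) a[k][i] -= dot * a[j][i];
        }
    }
}
```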

  16. DNF for the Composition of A x (B^H x C)

  Given A, B, X, Z: n-by-n arrays (beamforming). Generic form: 4-dimensional.

  Z = 0
  For i = 0 to n-1 do:
    For j = 0 to n-1 do:
      For k = 0 to n-1 do:
        z[k;] <- z[k;] + A[k;j] * X[j;i] * B[i;]
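A direct transcription of the loop nest, reading z[k;] as row k of Z; the nest accumulates the triple product Z = A * X * B row by row without materializing the intermediate X * B:

```cpp
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// DNF loop nest from the slide: z[k;] <- z[k;] + A[k;j] * X[j;i] * B[i;].
// Z must be zero-initialized first (the slide's "Z = 0").
void triple_product(const Matrix& A, const Matrix& X, const Matrix& B,
                    Matrix& Z) {
    const std::size_t n = A.size();
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            for (std::size_t k = 0; k < n; ++k) {
                const double s = A[k][j] * X[j][i];  // scalar A[k;j] * X[j;i]
                for (std::size_t m = 0; m < n; ++m)  // whole-row update of z[k;]
                    Z[k][m] += s * B[i][m];
            }
}
```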

  17. Typical C++ Operator Overloading

  Example: A = B + C (vector add); 2 temporary vectors created.

  1. Main passes B and C references to operator+.
  2. operator+ creates a temporary result vector.
  3. Results are calculated and stored in the temporary.
  4. A copy of the temporary is returned.
  5. The reference to the result (the temporary's copy) is passed to operator=.
  6. operator= performs the assignment into A.

  Additional memory use: B&, C&, the B+C temporary, and its copy; static memory, and dynamic memory (which also affects execution time).

  Additional execution time: cache misses/page faults, time to create a new vector, time to create a copy of a vector, time to destruct both temporaries.
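A minimal sketch of the naive overload described above (the Vec class is hypothetical; under pre-move-semantics C++, the temporary-and-copy pattern is exactly steps 2 to 4):

```cpp
#include <cstddef>
#include <vector>

// Naive vector class: operator+ materializes a full temporary result.
struct Vec {
    std::vector<double> data;
    explicit Vec(std::size_t n) : data(n) {}
};

// Steps 2-4 on the slide: create a temporary, fill it, return a copy.
Vec operator+(const Vec& b, const Vec& c) {
    Vec temp(b.data.size());                       // step 2: temporary result
    for (std::size_t i = 0; i < b.data.size(); ++i)
        temp.data[i] = b.data[i] + c.data[i];      // step 3: compute into it
    return temp;                                   // step 4: copied out (pre-C++11)
}

int main() {
    Vec B(1000), C(1000);
    Vec A = B + C;  // steps 1, 5, 6: a full-size temporary travels into A
}
```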

  18. C++ Expression Templates and PETE

  Expression templates: for A = B + C, parse trees, not vectors, are created.

  Expression: B + C
  Expression type: BinaryNode<OpAdd, Reference<Vector>, Reference<Vector>>

  1. Main passes B and C references to operator+.
  2. operator+ creates an expression parse tree; the parse tree only contains references.
  3. A copy of the expression parse tree is returned.
  4. The expression-tree reference is passed to operator=.
  5. operator= calculates the result and performs the assignment.

  Reduced memory use: the parse tree only contains references.

  Reduced execution time: better cache use, loop-fusion-style optimization, compile-time expression-tree manipulation.

  PETE, the Portable Expression Template Engine, is available from the Advanced Computing Laboratory at Los Alamos National Laboratory. PETE provides expression-template capability and facilities to help navigate and evaluate parse trees. PETE: http://www.acl.lanl.gov/pete
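A compact sketch of the mechanism; apart from BinaryNode and OpAdd, which appear on the slide, the class names are illustrative rather than PETE's actual API. operator+ returns a lightweight node of references, and assignment evaluates the whole tree elementwise in one fused loop:

```cpp
#include <cstddef>
#include <vector>

struct OpAdd {
    static double apply(double a, double b) { return a + b; }
};

// Parse-tree node: holds references to its operands, no data of its own.
template <typename Op, typename L, typename R>
struct BinaryNode {
    const L& left;
    const R& right;
    double operator[](std::size_t i) const { return Op::apply(left[i], right[i]); }
};

struct Vector {
    std::vector<double> data;
    explicit Vector(std::size_t n) : data(n) {}
    double operator[](std::size_t i) const { return data[i]; }

    // operator= walks the expression tree once: loop fusion, no temporaries.
    template <typename Expr>
    Vector& operator=(const Expr& e) {
        for (std::size_t i = 0; i < data.size(); ++i) data[i] = e[i];
        return *this;
    }
};

// operator+ builds a parse-tree node instead of computing anything.
template <typename L, typename R>
BinaryNode<OpAdd, L, R> operator+(const L& l, const R& r) {
    return {l, r};
}

int main() {
    Vector A(1000), B(1000), C(1000);
    A = B + C;  // RHS type: BinaryNode<OpAdd, Vector, Vector>
}
```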

  19. Implementing Psi Calculus with Expression Templates

  Example: A = take(4, drop(3, rev(B))), with B = < 1 2 3 4 5 6 7 8 9 10 >, giving A = < 7 6 5 4 >.

  Recall: Psi reduction for 1-d arrays always yields one or more expressions of the form
  x[i] = y[stride*i + offset], for l <= i < u.

  1. Form the expression tree: take(4) -> drop(3) -> rev -> B.
  2. Add size information: B (size 10) -> rev (size 10) -> drop (size 7) -> take (size 4).
  3. Apply Psi reduction rules:
     • B:                size 10, A[i] = B[i]
     • rev(B):           size 10, A[i] = B[-i + B.size - 1] = B[-i + 9]
     • drop(3, rev(B)):  size 7,  A[i] = B[-(i+3) + 9] = B[-i + 6]
     • take(4, ...):     size 4,  A[i] = B[-i + 6]
  4. Rewrite as sub-expressions with iterators at the leaves and loop-bounds information at the root: iterator offset = 6, stride = -1, size = 4.

  • Iterators are used for efficiency, rather than recalculating indices for each i.
  • One "for" loop evaluates each sub-expression.
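A minimal sketch of step 3's reduction, with hypothetical helper types: each operator rewrites a (stride, offset, size) descriptor instead of building a loop, so the final assignment is the single strided loop the slide describes:

```cpp
#include <cstddef>
#include <iostream>
#include <vector>

// Result of Psi reduction on a 1-d expression: x[i] = y[stride*i + offset],
// 0 <= i < size. Operators rewrite the descriptor; no data moves.
struct View {
    const std::vector<int>& base;
    long stride, offset;
    std::size_t size;
};

View rev(const std::vector<int>& b) {  // A[i] = B[-i + B.size - 1]
    return {b, -1, static_cast<long>(b.size()) - 1, b.size()};
}
View drop(std::size_t k, View v) {     // advance the start by k strided steps
    return {v.base, v.stride, v.offset + v.stride * static_cast<long>(k),
            v.size - k};
}
View take(std::size_t k, View v) {     // shrink the loop bound only
    return {v.base, v.stride, v.offset, k};
}

int main() {
    std::vector<int> B = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
    View a = take(4, drop(3, rev(B)));  // stride = -1, offset = 6, size = 4

    // One "for" loop evaluates the whole expression: prints 7 6 5 4.
    for (std::size_t i = 0; i < a.size; ++i)
        std::cout << B[a.offset + a.stride * static_cast<long>(i)] << ' ';
    std::cout << '\n';
}
```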
