Loading in 2 Seconds...

Efficient Radar Processing Via Array and Index Algebras

Loading in 2 Seconds...

- By
**studs** - Follow User

- 111 Views
- Uploaded on

Download Presentation
## Efficient Radar Processing Via Array and Index Algebras

**An Image/Link below is provided (as is) to download presentation**

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Presentation Transcript

**Efficient Radar Processing Via Array and Index Algebras**Lenore R. Mullin, Daniel J. Rosenkrantz, and Harry B. Hunt III, Xingmin Luo University at Albany, SUNY NSF CCR 0105536**Outline**• Overview • Motivation • Radar Software Processing: to exceed 1 x 1012 ops/second • The Mapping Problem: • Efficient Use of Memory Hierarchy, Portable, Scalable, … • Radar uses Linear and Multi-linear Operators: • Array Based Operations • Array Operations Require Array Algebra and Index Calculus • Array Algebra: MoA and Index Calculus: Psi Calculus • Reshape to use Processor/Memory Hierarchy Efficiently: Lift Dimension • High-Level Monolithic Operations: Remove Temporaries • Time Domain Convolution • Benefits of Using MoA and Psi Calculus**Levels of Processor/Memory Hierarchy**• Can be Modeled by Increasing Dimensionality of Data Array. • Additional dimension for each level of the hierarchy. • Envision data as reshaped to reflect increased dimensionality. • Calculus automatically transforms algorithm to reflect reshaped data array. • Data, layout, data movement, and scalarization automatically generated based on reshaped data array.**Levels of Processor/Memory Hierarchycontinued**• Math and indexing operations in same expression • Framework for design space search • Rigorous and provably correct • Extensible to complex architectures Mathematics of Arrays y= conv (x) Map Approach Example: “raising” array dimensionality < 0 1 2 > x: < 0 1 2 … 35 > < 3 4 5 > P0 Main Memory < 6 7 8 > < 9 10 11 > L2 Cache < 12 13 14 > L1 Cache < 15 16 17 > P1 Memory Hierarchy Map: < 18 19 20 > < 21 22 23 > < 24 25 26 > < 27 28 29 > P2 Parallelism < 30 31 32 > < 33 34 35 >**Application DomainSignal Processing**Pulse Compression Doppler Filtering Beamforming Detection 3-d Radar Data Processing Composition of Monolithic Array Operations Convolution Matrix Multiply Change algorithmto better match hardware/memory/communication. Lift dimension algebraically • Hardware Info: • - Memory • - Processor Algorithm is Input Architectural Information is Input Model processors(dim=dim+1); Model time-variance(dim=dim+1); Model Level 1 cache(dim=dim+1) Model All Three: dim=dim+3**Current Abstraction Approaches**Some Modern Programming Languages with Monolithic Arrays C++ w/classesfunctions, templates Fortran 95 ZPL MATLAB PETE AST Preprocessor Loop transformationsTheories Grammar Changes CompilerAST OptimizationsGrammar Changes StandardCompilerOptimizations Interpreted Compiled PartialAlgebras Requireshighly skilledprogrammers Blas, Linpack, LAPACK, SCALAPACK ATLAS PVL, Blitz++,MTL Libraries Libraries High Performance Fine Tune Scalable/Portable Classical CompilerTechnology &Optimization Even when operations compose, they don’t compose, X(YZ) without temporary arrays**Outline**• Overview • Array Algebra: MoA and Index Calculus: Psi Calculus • Reshape to use Processor/Memory Hierarchy Efficiently: Lift Dimension • High-Level Monolithic Operations: Remove Temporaries • Time Domain Convolution • Benefits of Using MoA and Psi Calculus**PSI Calculus**• Basic Properties: • Index calculus: Centers around psi function. • Shape polymorphic functions and operators: • Operations are defined using shapes and psi. • Fundamental type is the array modeled as (shape_vector, components). • scalars are 0-dimensional arrays, that is: (empty_vector, scalar value). • Denotational Normal Form(DNF) = reduced form in Cartesian coordinates (independent of data layout: row major, column major, regular sparse, …) • Operational Normal Form(ONF) = reduced form for 1-d memory layout(s).**Psi Reduction**This becomes by “psi” Reduction A=cat(rev(B), rev(C)) A[i]=B[B.size-1-i] if 0≤ i < B.size A[i]=C[C.size+B.size-1-i] if B.size ≤ i < B.size+C.size) ONF has minimum number of reads/writes PSI Calculus rules applied mechanically to produce ONF which is easily translated to optimal loop implementation**Convolution: PSI Calculus Description**< 0 0 1 2 . . . > < 0 1 2 3 . . . > < 1 2 3 4 . . . > < 0 0 1 > < 0 1 2 > < 1 2 3 > < 0 0 7 > < 0 6 14 > < 5 12 21 > Definition of y=conv(h,x) y[n]= where x‘has N elements, h has M elements, 0≤n<N+M-1, andx’ is x padded by M-1 zeros on either end Psi Calculus Algorithm step Algorithm and PSI Calculus Description x= < 1 2 3 4 > h= < 5 6 7 > Initial step x= < 1 2 3 4 > h= < 5 6 7 > Form x’ x’= x’=cat(reshape(<k-1>, <0>), cat(x, reshape(<k-1>,<0>)))= < 0 0 1 . . . 4 0 0 > rotate x’ (N+M-1) times x’ rot= x’ rot=binaryOmega(rotate,0,iota(N+M-1), 1 x’) take the size of h part of x’rot x’ final= x’ final=binaryOmega(take,0,reshape<N+M-1>,<M>,=1,x’ rot Prod= multiply Prod=binaryOmega (*,1, h,1,x’ final) sum Y=unaryOmega (sum, 1, Prod) < 7 20 38 . . . > Y= PSI Calculus operators compose to form higher level operations**Experimental Platform and Method**Hardware • DY4 CHAMP-AV Board • Contains 4 MPC7400’s and 1 MPC 8420 • MPC7400 (G4) • 450 MHz • 32 KB L1 data cache • 2 MB L2 cache • 64 MB memory/processor Software • VxWorks 5.2 • Real-time OS • GCC 2.95.4 (non-official release) • GCC 2.95.3 with patches for VxWorks • Optimization flags: -O3 -funroll-loops -fstrict-aliasing Method • Run many iterations, report average, minimum, maximum time • From 10,000,000 iterations for small data sizes, to 1000 for large data sizes • All approaches run on same data • Only average times shown here • Only one G4 processor used • Use of the VxWorks OS resulted in very low variability in timing • High degree of confidence in results**Experiment: Conv(x,h)**• Cost of temporaries in regular C++ approach more pronounced due to large number of operations • Cost of expression tree manipulation also more pronounced**Convolution and Dimension Lifting**• Model Processor and Level 1 cache. • Start with 1-d inputs(input dimension). • Envision 2nd dimension ranging over output values. • Envision Processors • Reshaped into a 3rd dimension. • The 2nd dimension is partitioned. • Envision Cache • Reshaped into a 4th dimension. • The 1st dimension is partitioned. • “psi” Reduce to Normal Form**Envision 2nd dimension ranging over output values.**• Let tz=N+M-1 • M=th=4 • N=tx 4 x S h3 h2 h1 h0 0 0 0 x0 tz tz**- Envision Processors**• Reshaped into a 3rd dimension. • The 2nd dimension is partitioned. Let p = 2 -4 -4 - -4- x x S S tz tz 2 2 2**Envision Cache**• Reshaped into a 4th dimension • The 1st dimension is partitioned. x x S S 2 2 S S 2 2 x x 2 2 tz/2 tz 2 Tz/2 Tz/2 2**ONF for the Convolution Decomposition with Processors &**Cache Generic form- 4 dimensional after “psi” Reduction • For i0= 0 to p-1 do: • For i11= 0 to tz/p –1 do: • sum 0 • For icacherow= 0 to M/cache -1 do: • For i3 = 0 to cache –1 do: • sum sum + h [(M-((icacherow* cache) + i3))-1]* x’[(((tz/p * i0)+i1) + icacherow* cache) + i3)] Let tz=N+M-1 M=th N=tx P r o c e s s o r l o o p C a c h e l o o p T I m e l o o p sum is calculated for each element of y. Time Domain**Outline**• Overview • Array Algebra: MoA and Index Calculus: Psi Calculus • Time Domain Convolution • Other algorithms in Radar • Modified Gram-Schmidt QR Decompositions • MOA to ONF • Experiments • Composition of Matrix Multiplication in Beamforming • MoA to DNF • Experiments • FFT • Benefits of Using Moa and Psi Calculus**Algorithms in Radar**Mechanize Using Expression Templates Time DomainConvolution (x,y) ONF for 1proc • Lift dimension • - Processor • - L1 cache • reformulate Use toreason about RAW Manual description & derivation for 1 processor DNF DNF ONF Thoughtson an Abstract Machine Modified Gram SchmidtQR (A) ImplementDNF/ONFFortran 90 CompilerOptimizationsDNF to ONF Benchmark at NCSAw/LAPACK A x (BH x C) Beamforming MoA & y Calculus**ONF for the QRDecomposition with Processors & Cache**Initialization ProcessorLoop ComputeNorm MainLo o p ProcessorLoop Normalize ProcessorCache Loop DoTProduct Processor CacheLoop Ortothogonalize Modified Gram Schmidt**DNF for the Composition of A x (BH x C)**Generic form- 4 dimensional • Z=0 • For i=0 to n-1 do: • For j=0 to n-1 do: • For k=0 to n-1 do: • z[k;]z[k;]+A[k;j]xX[j;i]xB[i;] Given A, B, X, Z n by n arrays Beamforming**Mechanizing MoA and Psi Reduction**Index Theory IntroducedAbrams 1972 • MoA & y calculus theory: Mullin ’88 • Prototype compiler: output C, F90, HPF: Mullin and Thibault’94 • HPF compiler: AST manipulations: Mullin, et al ‘96 • SAC: functional C: Mullin and Bodo’96 • C++ classes: Helal, Sameh and Mullin’01 • C++ expression templates: Mullin, Ruttledge, Bond’02 • PVL with the portable expression template engine(PETE) • Parallel and distributed processing • Abstract machine • Automate cost and determine optimizations: minimize search space Fortran C Theory applied to embedded systems C++ Lifting Compiler Optimizations to Application Programmer Interface**On-going research**• we are implementing the psi calculus using expression templates. • we are building on work done at MIT and we are working with MTL library developers (lumsdaine) at Indiana University and STL library developer, musser, at rpi.**Benefits of Using Moa and Psi Calculus**• Processor/Memory Hierarchy can be modeled by reshaping data using an extra dimension for each level. • Composition of monolithic operations can be reexpressed as composition of operations on smaller data granularities • Matches memory hierarchy levels • Avoids materialization of intermediate arrays. • Algorithm can be automatically(algebraically) transformed to reflect array reshapings above. • Facilitates programming expressed at a high level • Facilitates intentional program design and analysis • Facilitates portability • This approach is applicable to many other problems in radar.**Email and Question?**• Lenore R. Mullin, lenore@cs.albany.edu • Daniel J. Rosenkrantz, djr@cs.albany.edu • Harry B. Hunt III, hunt@cs.albany.edu • Xingmin Luo, xluo@cs.albany.edu *The End***Typical C++ Operator Overloading**Example: A=B+C vector add 2 temporary vectors created Main 1. Pass B and C references to operator + Additional Memory Use B&, C& • Static memory • Dynamic memory (also affects execution time) Operator + 2. Create temporary result vector 3. Calculate results, store in temporary 4. Return copy of temporary temp B+C temp Additional Execution Time temp copy 5. Pass results reference to operator= • Cache misses/page faults • Time to create anew vector • Time to create a copy of a vector • Time to destructboth temporaries temp copy & Operator = temp copy A 6. Perform assignment**C++ Expression Templates and PETE**Parse Tree Expression Type Expression + BinaryNode<OpAdd, Reference<Vector>, Reference<Vector > > ExpressionTemplates A=B+C C B Main Parse trees, not vectors, created Parse trees, not vectors, created 1. Pass B and Creferences to operator + Reduced Memory Use B&, C& Operator + • Parse tree only contains references 2. Create expressionparse tree + B& C& 3. Return expressionparse tree copy Reduced Execution Time 4. Pass expression treereference to operator • Better cache use • Loop fusion style optimization • Compile-time expression tree manipulation copy & Operator = 5. Calculate result andperform assignment B+C A • PETE, the Portable Expression Template Engine, is available from theAdvanced Computing Laboratory at Los Alamos National Laboratory • PETE provides: • Expression template capability • Facilities to help navigate and evaluating parse trees PETE: http://www.acl.lanl.gov/pete**Implementing Psi Calculus with Expression Templates**take 4 drop rev 3 B size=4 A[i]=B[-i+6] size=7 A[i] =B[-(i+3)+9] =B[-i+6] size=10 A[i] =B[-i+B.size-1] =B[-i+9] size=10 A[i]=B[i] Example: A=take(4,drop(3,rev(B))) B=<1 2 3 4 5 6 7 8 9 10> A=<7 6 5 4> Recall: Psi Reduction for 1-d arrays always yields one or more expressions of the form: x[i]=y[stride*i+ offset] l ≤ i < u 1. Form expression tree 3. Apply Psi Reduction rules 2. Add size information take size=4 drop 4 size=7 Reduction Size info rev 3 size=10 size=10 B 4. Rewrite as sub-expressions with iterators at the leaves, and loop bounds information at the root • Iterators used for efficiency, rather than recalculating indices for each i • One “for” loop to evaluate each sub-expression iterator: offset=6stride=-1 size=4

Download Presentation

Connecting to Server..