
Case Studies of Porting Two Algorithms to Reconfigurable Processors




Presentation Transcript


  1. Case Studies of Porting Two Algorithms to Reconfigurable Processors Reconfigurable Systems Summer Institute, Wednesday, July 13, 2005 Craig Steffen, National Center for Supercomputing Applications

  2. Reminder: FPGA Computational Strengths • Parallel elements • Integer processing means a small footprint • Processor cache structure: data re-use • Simple problems: maximum utilization • Simple problems to describe

  3. Example Algorithm: Matrix Multiply • Popular and much-used algorithm • Well-known API in place • Advantages: Simple, dividable, inherently parallel, data re-use • Disadvantage: floating point

  4. Matrix Multiply Algorithm • Parallel computations • Multiple data uses • Lends well to MAC units

     [ a b c d ]   [ k ]   [ q ]
     [ e f g h ] × [ m ] = [ r ]
                   [ n ]
                   [ p ]

     q = ak + bm + cn + dp
     r = ek + fm + gn + hp

  5. Matrix Multiply Algorithm (matrix dimensions) A is α×θ, B is θ×δ, so C = A×B is α×δ:

            θ             δ          δ
      [ a b c d ]       [ k ]
    α [ e f g h ]  ×  θ [ m ]  = α [ q ]
                        [ n ]      [ r ]
                        [ p ]

     q = ak + bm + cn + dp
     r = ek + fm + gn + hp

  6. Matrix Multiply Implementation in C

     for (i = 0; i < α; i++) {
         for (j = 0; j < δ; j++) {
             C[i][j] = 0.0;
             for (k = 0; k < θ; k++) {
                 C[i][j] += A[i][k] * B[k][j];
             }
         }
     }

  7. Naïve Implementation Performance • Generated a version that timed matrix multiply with inputs α, δ, θ, and N (iterations) • Going from 40,800,250,40 to 400,800,250,40 caused a 2.5x slowdown (cache issues; data re-use rears its head) • Speed was 500M MACs per second, or 1B floating-point operations per second, on a 2.8 GHz CPU • A real optimized library would run about 6x faster

  8. Matrix Multiply: Block-wise Divisible • Any block of elements may be multiplied as a unit • As long as the general rules are followed, the final result is the same as it would have been • This can be exploited to take advantage of specialized units with preferred operand sizes

     [diagram: a full matrix product decomposed into a sum of block-wise products]
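The tiling described on this slide can be sketched in C. This is a minimal illustration, not code from the talk; the names matmul_blocked, N, and BS are mine. The point is that each BS×BS tile product is an independent unit of work that could be handed to a hardware unit with a preferred operand size.

```c
#define N  4   /* full matrix dimension (illustrative) */
#define BS 2   /* tile size a specialized unit might prefer */

/* Blocked matrix multiply: C += A * B, one BS x BS tile at a time.
   The caller must zero C first (as the slide-6 loop does). Each
   tile-by-tile product is independent; summing the tile products
   gives the same result as the plain triple loop. */
void matmul_blocked(double A[N][N], double B[N][N], double C[N][N]) {
    for (int ib = 0; ib < N; ib += BS)
        for (int jb = 0; jb < N; jb += BS)
            for (int kb = 0; kb < N; kb += BS)
                /* multiply the (ib,kb) tile of A by the (kb,jb)
                   tile of B, accumulating into the (ib,jb) tile */
                for (int i = ib; i < ib + BS; i++)
                    for (int j = jb; j < jb + BS; j++)
                        for (int k = kb; k < kb + BS; k++)
                            C[i][j] += A[i][k] * B[k][j];
}
```

For a fixed (i, j) the k values still arrive in order 0..N-1, so the accumulation order matches the naïve loop exactly.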

  9. 64-bit Floating-Point FPGA Matrix Multiplication • Yong Dou, S. Vassiliadis, G. K. Kuzmanov, G. N. Gaydadjiev • FPGA ’05, February 20-22, 2005, Monterey, California, USA • Contains a 12-stage pipelined MAC block design • Multi-FPGA master-slave multiple simultaneous execution design

  10. Plan for Implementation on SRC MAP Processor • MAP runs at a 100 MHz clock speed • Assuming fully pipelined logic units (MACs in this case), matching the CPU's 500M MACs per second requires 5 MACs running in parallel (disregarding transfer latencies)
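As a back-of-the-envelope check on the 5-MAC figure (my sketch, not from the slides): a fully pipelined unit retires one MAC per clock, so the parallelism needed is just the target rate divided by the clock rate.

```c
/* Parallel MAC units needed to hit a target rate, assuming each
   fully pipelined unit retires one result per clock (rounding up,
   and ignoring transfer latencies as the slide does). */
long macs_needed(long target_macs_per_sec, long clock_hz) {
    return (target_macs_per_sec + clock_hz - 1) / clock_hz;
}
```

With the naïve CPU rate of 500M MACs/s from slide 7 and the MAP's 100 MHz clock, this gives the slide's figure of 5.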

  11. Single Data Use: MAP Starves • The RAM-to-MAP data pipe can feed only 2 MACs • 6 are required to make it worthwhile

     [diagram: MAP processor with two 64-bit input paths, each split into two 32-bit operand streams]

  12. Using Caching: Now Equals CPU Speed • Data re-use in OBM

     [diagram: MAP processor FPGA surrounded by six On-Board Memory banks, with 64-bit paths split into 32-bit operand streams]

  13. More Speedup: Requires Data Re-use in FPGA Block RAM

     [diagram: the same six-bank On-Board Memory layout, with data now also re-used inside the FPGA's Block RAM]

  14. Matrix Multiply Status • On hold for the moment • Need to understand programming and access issues for Block RAM

  15. BLAST: DNA Comparison Code • Basic Local Alignment Search Tool • Biology code for comparing DNA and protein sequences • Multiple modes: DNA-DNA, DNA-protein, protein-DNA, protein-protein • Answers the question: "Is A a subset of B?" where A is short and B is very long • Not an exact match: takes into account DNA combinatorics and protein substitution rules

  16. blastp: compare DNA to Protein • First, translate DNA to Protein

     [diagram: DNA strand with its frame-1 amino-acid sequence]

  17. blastp: compare DNA to Protein • First, translate DNA to Protein

     [diagram: DNA strand with frame-1 and frame-2 amino-acid sequences]

  18. blastp: compare DNA to Protein • First, translate DNA to Protein • Translate all forward frames

     [diagram: DNA strand with frame-1, frame-2, and frame-3 amino-acid sequences]

  19. blastp: compare DNA to Protein • First, translate DNA to Protein • Translate all forward frames • Complete "6-Frame translation"

     [diagram: DNA strand with all six translated amino-acid sequences]
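The 6-frame translation above can be sketched in C. This is a toy illustration only: a real translator uses the full 64-entry codon table, whereas codon_to_aa() here knows just three codons and returns 'X' otherwise. All function names are mine, not from the talk.

```c
#include <string.h>

/* Tiny stand-in for the 64-entry codon table (illustrative only). */
static char codon_to_aa(const char *c) {
    if (strncmp(c, "ATG", 3) == 0) return 'M';
    if (strncmp(c, "AAA", 3) == 0) return 'K';
    if (strncmp(c, "TGG", 3) == 0) return 'W';
    return 'X';
}

static char complement(char b) {
    switch (b) {
    case 'A': return 'T';
    case 'T': return 'A';
    case 'G': return 'C';
    default:  return 'G';   /* 'C' -> 'G' */
    }
}

/* Translate one forward reading frame (0, 1 or 2) of dna into out;
   returns the number of amino acids produced. */
int translate_frame(const char *dna, int frame, char *out) {
    size_t n = strlen(dna);
    int aa = 0;
    for (size_t i = (size_t)frame; i + 3 <= n; i += 3)
        out[aa++] = codon_to_aa(dna + i);
    out[aa] = '\0';
    return aa;
}

/* Frames 4-6 are the same three frames run over the reverse
   complement of the strand, completing the 6-frame translation. */
void reverse_complement(const char *dna, char *out) {
    size_t n = strlen(dna);
    for (size_t i = 0; i < n; i++)
        out[i] = complement(dna[n - 1 - i]);
    out[n] = '\0';
}
```

Calling translate_frame with frames 0-2 on the original strand and on its reverse complement yields the six amino-acid sequences the slides show.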

  20. BLAST Method (per Frame): • Finds small local matches • Tries to expand matches to improve them, by changing matches or inserting gaps, each of which has a weight tied to the probability of one combination mutating into another • Each change causes a change in the "goodness of match" score • Many combinations are attempted until the goodness value peaks • This method is very ill-suited to an FPGA: each step depends on the previous one, with one small comparison loop carrying all the weight

     [diagram: an initial match extended step by step; score: two elements, then 2 elements – 1 gap, then 4 elements – 1 gap]

  21. BLAST: Solve the Same Problem Differently • Component-by-component comparison for multiple offsets • For each offset, record a score based on the number and/or arrangement of matches • After all comparisons are finished, then (perhaps) do iterative matching

     [diagram: query slid across the database sequence; position scores 0, 2, 0, 2, 0]
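The offset-scoring idea can be written down directly. A minimal sketch, assuming a score that is simply the count of exact character matches (the slide allows richer scores based on arrangement); the names offset_score and score_all_offsets are mine.

```c
#include <string.h>

/* Count exact matches between the query and the database window
   starting at the given offset. Every offset, and every element
   comparison within an offset, is independent of the others. */
int offset_score(const char *query, const char *db, int offset) {
    int qlen = (int)strlen(query);
    int score = 0;
    for (int i = 0; i < qlen; i++)
        if (db[offset + i] == query[i])
            score++;
    return score;
}

/* Score every valid offset; out must hold dblen - qlen + 1 ints. */
void score_all_offsets(const char *query, const char *db, int *out) {
    int qlen = (int)strlen(query), dblen = (int)strlen(db);
    for (int off = 0; off + qlen <= dblen; off++)
        out[off] = offset_score(query, db, off);
}
```

Because no iteration depends on any other, both loops can be unrolled into parallel hardware, which is exactly the property the next slide exploits.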

  22. Performance • Problem provided by Matt Hudson of Crop Sciences: a protein encoded somewhere in a certain plant chromosome • The protein is ~1100 amino acids; the chromosome is 31 million bases • BLAST detects two very strong hits and two weak ones, taking 3 CPU-seconds • My algorithm, when coded naively, takes 20 minutes • With reasonable speed-ups, it takes 19 to 40 seconds

  23. FPGA Advantages of this Algorithm: • Each offset is independent • Each element's comparison is independent • The data re-use factor is the length of the short sequence • 6-fold parallelism due to 6-frame translation

     [diagram: position scores 0, 2, 0, 2, 0, as on slide 21]

  24. Implementation: • Do comparisons in parallel • Shift the DNA sequence, push a new element into the pipe • Repeat
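The shift/compare/repeat loop above can be modeled in software. This is my sketch of the idea, not the MAP-C implementation: a QLEN-wide window acts as the shift register, each step pushes in one new database element, all positions are compared "in parallel", and the match bits are summed (the adder tree of the next slide).

```c
#define QLEN 4  /* length of the short query held in the pipe */

/* One pipeline step: shift the window, push in the next database
   element, then compare all QLEN positions against the query and
   sum the match bits. In hardware the compares happen in parallel
   and the sum is an adder tree; here both are modeled as loops. */
int step_score(char window[QLEN], const char *query, char next) {
    for (int i = 0; i < QLEN - 1; i++)
        window[i] = window[i + 1];
    window[QLEN - 1] = next;

    int score = 0;
    for (int i = 0; i < QLEN; i++)
        score += (window[i] == query[i]);
    return score;
}
```

Calling step_score once per database element produces the per-offset scores of slide 21, one per clock in the hardware version.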

  25. Current Status • Element-wise shift is not trivially defined in MAP-C, so a Perl-expanded macro was defined • Must create the parallel comparison and adder tree to finish

     [diagram: adder tree summing per-element comparison results into a total offset score]

  26. Conclusion • Tools are gaining in usefulness and sophistication • The programmer must explicitly deal with memory architectures and data movement • Some things just don’t work as you’re used to thinking about them
