
Distortion Correction ECE 6276 Project Review



Presentation Transcript


  1. Distortion Correction: ECE 6276 Project Review
     Team 5: Basit Memon, Foti Kacani, Jason Haedt, Jin Joo Lee, Peter Karasev

  2. Initial Results

  3. Problems
     - Old code was very slow
     - MATLAB was ported line by line
     - Redundant computations
     - Loops not nested correctly
     - Not able to fully exploit Catapult C features

  4. Target & Test Vectors for Catapult
     - Catapult C targeted the Stratix III FPGA at a clock frequency of 100 MHz
     - The Catapult results that follow used a 320x240 test image

  5. Test Vectors
     - Images distorted in MATLAB so that ground truth exists
     - Flattened into binary streams
     - Identical format for the MATLAB, plain C, and AC Datatypes results

  6. Optimizations after CDR
     - Look-up tables
     - Optimal fixed-point bit sizes
     - Algorithmic changes
     - Streamlined loops (allows for optimal pipelining/unrolling)
     - Math optimizations

  7. (1) Original Power Series with AC Datatypes div()
     - Area: 11734
     - Throughput cycles: 5,145,841 (67 per pixel)
     - The AC Datatypes div() function uses only bit operations and additions
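The slide says div() needs only bit operations and additions. A minimal sketch of that style of divider follows, assuming a plain restoring (shift-and-subtract) scheme on 32-bit unsigned integers. The name `div_shift_subtract` is hypothetical, and this is not the AC Datatypes source.

```cpp
#include <cassert>
#include <cstdint>

// Restoring (shift-and-subtract) unsigned division.
// Produces one quotient bit per iteration using only shifts,
// comparisons, and subtraction. Hypothetical illustration only.
uint32_t div_shift_subtract(uint32_t num, uint32_t den) {
    assert(den != 0);            // division by zero is not handled
    uint64_t rem = 0;            // partial remainder, wide enough for the shift
    uint32_t quo = 0;
    for (int i = 31; i >= 0; --i) {
        // Bring down the next dividend bit, most significant first.
        rem = (rem << 1) | ((num >> i) & 1u);
        if (rem >= den) {        // trial subtraction succeeds
            rem -= den;
            quo |= (1u << i);    // set this quotient bit
        }
    }
    return quo;
}
```

A fixed 32-iteration loop like this is cheap in area but costs one iteration per result bit, which would be consistent with the high per-pixel cycle count reported on this slide.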

  8. (2) Use of Fast Division (iterative Newton's method)
     - Area: 12851.12
     - Throughput cycles: 3,763,441 (49 per pixel); initial was 5,145,841
     - Requires multiplier elements
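Newton's method replaces the bit-serial divide with a few multiply-based refinement steps, which is why this variant needs multipliers. A hedged sketch in double precision follows. The function names, the linear initial guess, and the iteration count are assumptions, and a hardware version would use fixed-point types instead.

```cpp
#include <cassert>
#include <cmath>

// Newton-Raphson reciprocal: x_{n+1} = x_n * (2 - m * x_n).
// The divisor is first normalized to m in [0.5, 1) so that a simple
// linear initial guess is accurate enough for fast convergence.
double reciprocal_newton(double d, int iterations = 3) {
    assert(d > 0.0);                           // sketch handles positive divisors only
    int exp = 0;
    double m = std::frexp(d, &exp);            // d = m * 2^exp, m in [0.5, 1)
    double x = 48.0 / 17.0 - (32.0 / 17.0) * m; // standard linear initial estimate
    for (int i = 0; i < iterations; ++i) {
        x = x * (2.0 - m * x);                 // error roughly squares each step
    }
    return std::ldexp(x, -exp);                // undo the normalization
}

// Division expressed as multiplication by the reciprocal.
double divide_newton(double n, double d) {
    return n * reciprocal_newton(d);
}
```

Each iteration costs two multiplies and one subtract, so the number of iterations trades accuracy against multiplier usage.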

  9. (3) Combined Power Series and Division
     - Area: 17705
     - Throughput cycles: 2,765,041 (36 per pixel); initial was 5,145,841
     - Appears to be an example of loop shrinking using properties of add and multiply
     - Found by writing out the sums and substituting the power series result, as a sum, into the div() iterative loop

  10. (4) Add Approximate Square Root (Taylor series sum)
     - Area: (no value given)
     - Throughput cycles: 1,843,441 (24 per pixel); initial was 5,145,841
     - 2.79x total throughput speedup over the initial design (reported on the slide as "279%")
     - Impractical total increase in area for this solution: the ROM is huge
     - Not able to meet timing with the fast square root
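The per-pixel figures and the overall speedup quoted on slides 7 to 10 can be checked from the raw cycle counts and the 320x240 test image. The helper names below are hypothetical. This is only a sketch of the arithmetic.

```cpp
#include <cassert>
#include <cstdint>

// Pixels in the 320x240 test image used for the Catapult runs.
const uint64_t kPixels = 320ull * 240ull;  // 76,800

// Whole cycles per pixel (integer division drops the small
// fixed overhead left over after the per-pixel work).
uint64_t cycles_per_pixel(uint64_t throughput_cycles) {
    return throughput_cycles / kPixels;
}

// Speedup as a ratio of cycle counts (baseline / optimized).
double speedup(uint64_t baseline_cycles, uint64_t optimized_cycles) {
    assert(optimized_cycles != 0);
    return static_cast<double>(baseline_cycles) /
           static_cast<double>(optimized_cycles);
}
```

With the slide values, `speedup(5145841, 1843441)` is about 2.79, so the "279%" figure is best read as the ratio of the two cycle counts.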

  11. Why the Approximate sqrt ROM Is Difficult
     - With an equal step size in the variable, a 256-entry ROM works everywhere except near the center
     - Getting enough precision with an equal step size requires too many entries (8192)
     - Smaller ROM fails: circle artifact in the middle of the image
     - Conclusion: the AC Datatypes sqrt() is quite good. It solves the output one bit at a time, and only shifts and bit operations are needed. It takes a number of iterations, but if the pixels are pipelined as a large block this doesn't matter much.
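The bit-at-a-time behavior described in the conclusion can be illustrated with the classic digit-by-digit integer square root. This is a generic sketch with a hypothetical function name, not the AC Datatypes sqrt() source.

```cpp
#include <cassert>
#include <cstdint>

// Digit-by-digit (bit-serial) integer square root.
// Resolves one result bit per iteration using only shifts,
// comparisons, additions, and subtractions. Returns floor(sqrt(x)).
uint16_t isqrt_bitwise(uint32_t x) {
    const uint32_t original = x;
    uint32_t root = 0;
    uint32_t bit = 1u << 30;             // largest power of four in 32 bits
    for (int i = 0; i < 16; ++i) {       // fixed trip count: 16 result bits
        if (x >= root + bit) {
            x -= root + bit;             // accept this bit of the root
            root = (root >> 1) + bit;
        } else {
            root >>= 1;                  // reject this bit
        }
        bit >>= 2;
    }
    // Postcondition: root is not larger than the true square root.
    assert(static_cast<uint64_t>(root) * root <= original);
    return static_cast<uint16_t>(root);
}
```

The fixed 16-iteration loop has no data-dependent exit, which makes it a natural candidate for pipelining or unrolling across a block of pixels.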

  12. Memory Size and Storage Optimization
     - Change the LUT to 256x4 (the second dimension is now a power of 2 as well) and tolerate slightly more error in the approximation of the inverse distortion function
     - Use 2D arrays to get rid of the indexing add and multiply
     - Huge area savings in the line-by-line comparison
     [Figure: before/after comparison]
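A sketch of why a 256x4 table indexed as a 2D array avoids the indexing add and multiply: when both dimensions are powers of two, the address is just the row and column bits concatenated. The element type and all names here are assumptions for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

const std::size_t kRows = 256;  // 2^8 entries
const std::size_t kCols = 4;    // 2^2 values per entry

// "Before" style: flat array with a run-time column count,
// which needs a multiply and an add for every access.
int16_t lookup_flat(const int16_t* lut, std::size_t cols,
                    std::size_t row, std::size_t col) {
    return lut[row * cols + col];
}

// "After" style: fixed power-of-two 2D array. The compiler or
// synthesis tool can form the address from the index bits alone.
int16_t lookup_2d(const int16_t (&lut)[kRows][kCols],
                  uint8_t row, uint8_t col) {
    assert(col < kCols);
    return lut[row][col];
}

// The equivalent address computed with a shift and OR only.
std::size_t lut_address(uint8_t row, uint8_t col) {
    return (static_cast<std::size_t>(row) << 2) | (col & 3u);
}
```

With a non-power-of-two column count (for example 3), `row * cols` would need a real multiplier or an adder chain, which is the overhead the slide describes removing.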

  13. Catapult C Results Summary

  14. Catapult C Results Summary (cont…)

  15. Catapult C Results Summary (cont…)
     - Can meet timing up to 150 MHz
     - Optimized for 1 clock cycle per pixel

  16. Catapult C Results Summary (cont…)
     - Not optimal at 168 MHz: notice the negative slack

  17. Catapult C Results Summary (cont…)
     - Optimal results for various images
     - Meets 1280x960 at 150 MHz with minimal area overhead (according to Catapult)
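To put the 150 MHz and 1-cycle-per-pixel figures in context, the achievable frame rate can be estimated as clock frequency divided by cycles per frame. The sketch below assumes the 1-cycle-per-pixel rate also holds at 1280x960 and ignores pipeline fill latency and any blanking. The function name is hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Estimated frames per second for a pixel pipeline.
// Ignores start-up latency and any inter-frame overhead.
double frames_per_second(double clock_hz, unsigned width, unsigned height,
                         double cycles_per_pixel = 1.0) {
    assert(width > 0 && height > 0 && cycles_per_pixel > 0.0);
    const double cycles_per_frame =
        static_cast<double>(width) * height * cycles_per_pixel;
    return clock_hz / cycles_per_frame;
}
```

Under these assumptions, `frames_per_second(150e6, 320, 240)` is roughly 1953 and `frames_per_second(150e6, 1280, 960)` is roughly 122.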

  18. Verification

  19. Conclusions

  20. Future Work
     - Parallelize algorithm
     - Work in blocks of pixels
     - Optimize buffer/memory usage
     - Use streaming buffers
     - Streamline algorithm
     - Allow variable decimation/interpolation to make smoother undistortions

  21. Questions?