
Memory System Support for Image Processing



Presentation Transcript


  1. Memory System Support for Image Processing • Lixin Zhang, John B. Carter, Wilson C. Hsieh, Sally A. McKee • Department of Computer Science, University of Utah • Presented by Lixin Zhang

  2. Characteristics of Image Processing • High data bandwidth needs • Large cache footprints • Lack of data reuse • Non-unit strides • Bottom line: traditional memory systems do not work well for image processing! • But image processing access patterns are often predictable!

  3. Outline • Characteristics of image processing • Basic idea • Two remapping algorithms • Benchmarks • Performance • Conclusion

  4. Basic Idea: An Innovative Memory System (Impulse) • Allow sparsely-stored data to be accessed densely • Load only the data needed by the processor into caches

  5. Approaches • Using “unused” physical space (shadow addresses) to reorganize data • Memory controller (Impulse MC) maps shadow addresses to physical memory • OS/Compiler support [Figure: virtual space, MMU/TLB, physical space (real physical space and shadow address space), Impulse MC, real physical memory]

  6. Simple Example • Sum of diagonal elements of a dense matrix: for(i = 0; i < n; i++) sum += A[i][i]; • Problems • Wasted bus bandwidth • Low cache utilization • Low cache hit ratio [Figure: physical memory, conventional memory controller, cache; wasted bus bandwidth]
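To put a rough number on the wasted bandwidth, the sketch below estimates what fraction of each fetched cache line a fixed-stride walk such as the diagonal loop actually uses. The helper names and the 8-byte element size are assumptions for illustration and do not come from the slides. The 128-byte line matches the L2 line size given in the simulation environment slide.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not from the slides: fraction of each fetched cache
 * line that a fixed-stride walk actually consumes. */
double line_utilization(size_t elem_bytes, size_t stride_bytes, size_t line_bytes)
{
    assert(elem_bytes > 0 && elem_bytes <= stride_bytes && elem_bytes <= line_bytes);
    if (stride_bytes >= line_bytes)
        return (double)elem_bytes / (double)line_bytes;   /* one element per line */
    return (double)elem_bytes / (double)stride_bytes;     /* several elements share a line */
}

/* Byte stride between A[i][i] and A[i+1][i+1] in a row-major n x n matrix. */
size_t diagonal_stride_bytes(size_t n, size_t elem_bytes)
{
    return (n + 1) * elem_bytes;
}
```

Under these assumptions, a 1024 x 1024 matrix of 8-byte elements has a diagonal stride of 8200 bytes, so each 128-byte line fetched contributes only 8 useful bytes (1/16 of the data moved over the bus).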

  7. Using Impulse • Strided remapping: diagonal = Impulse_remap(A,n,...); for(i = 0; i < n; i++) sum += diagonal[i]; • Benefits • No wasted bus bandwidth • Better cache utilization • Higher cache hit ratio [Figure: physical memory, Impulse memory controller, shadow address, cache]
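The effect of a strided remapping can be imitated in plain software, which may help make the idea concrete. The sketch below is only a stand-in: it copies the strided elements into a dense scratch buffer, whereas Impulse performs the gather in the memory controller without any processor-side copy. The function names are hypothetical and are not the Impulse API.

```c
#include <assert.h>
#include <stddef.h>

/* Software stand-in for a strided remapping: dense[i] = base[i * stride].
 * Impulse would do this gather in the memory controller instead. */
void emulate_strided_remap(const double *base, size_t stride, size_t count,
                           double *dense)
{
    assert(base != NULL && dense != NULL);
    for (size_t i = 0; i < count; i++)
        dense[i] = base[i * stride];
}

/* Sum the diagonal of a row-major n x n matrix through the dense alias.
 * scratch must have room for n elements. */
double diagonal_sum(const double *a, size_t n, double *scratch)
{
    emulate_strided_remap(a, n + 1, n, scratch);  /* diagonal stride is n + 1 elements */
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += scratch[i];                        /* unit-stride accesses */
    return sum;
}
```

The inner loop of `diagonal_sum` has the same shape as the remapped loop on the slide: it walks a dense array with unit stride.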

  8. Address Translations [Figure: conventional system: virtual memory (A), MMU/TLB, physical memory; Impulse system: virtual memory (diagonal), MMU/TLB, shadow memory, pseudo-virtual memory, physical memory]

  9. Impulse MC Internals • Shadow descriptor: stores remapping information • ALU: shadow addresses ==> pseudo-virtual addresses • Page table: pseudo-virtual addresses ==> real addresses
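The three components listed above can be read as a two-step translation pipeline. The sketch below models that pipeline for a strided remapping. The descriptor fields, the 4 KB page size, and the flat page-table array are all assumptions made for illustration; the slides do not specify the actual descriptor layout or page size.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT 12u                          /* assumed 4 KB pages */
#define PAGE_MASK  ((UINT64_C(1) << PAGE_SHIFT) - 1)

/* Hypothetical shadow descriptor for a strided remapping. */
typedef struct {
    uint64_t shadow_base;   /* first shadow address of the dense alias */
    uint64_t pv_base;       /* pseudo-virtual address of the first element */
    uint64_t elem_bytes;    /* size of one element */
    uint64_t stride_bytes;  /* distance between consecutive elements */
} shadow_desc;

/* "ALU" step: shadow address ==> pseudo-virtual address. */
uint64_t shadow_to_pv(const shadow_desc *d, uint64_t shadow_addr)
{
    assert(shadow_addr >= d->shadow_base && d->elem_bytes > 0);
    uint64_t off    = shadow_addr - d->shadow_base;
    uint64_t index  = off / d->elem_bytes;      /* which element */
    uint64_t within = off % d->elem_bytes;      /* byte inside the element */
    return d->pv_base + index * d->stride_bytes + within;
}

/* "Page table" step: pseudo-virtual address ==> real physical address.
 * frames[vpn] holds the physical frame number of pseudo-virtual page vpn. */
uint64_t pv_to_phys(const uint64_t *frames, uint64_t pv_addr)
{
    uint64_t vpn = pv_addr >> PAGE_SHIFT;
    return (frames[vpn] << PAGE_SHIFT) | (pv_addr & PAGE_MASK);
}
```

For example, with 8-byte elements and a 4104-byte stride (the diagonal of a 512 x 512 matrix), the third dense element at shadow offset 16 maps to pseudo-virtual offset 8208, which falls in page 2 at byte 16.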

  10. Outline • Characteristics of image processing • Basic idea • Two remapping algorithms • Transpose Remapping • Scatter/gather through an indirection vector • Benchmarks • Performance • Conclusion

  11. Transpose Remapping • Create the transposed version of a matrix • Original code: for(j=0;j<n;j++) for(i=0;i<m;i++) ..A[i][j]..; • With Impulse: TA = Impulse_map(...) for(j=0;j<n;j++) for(i=0;i<m;i++) ..TA[j][i]..; • MC maps TA[j][i] to A[i][j] • Benefit • Unit-stride accesses instead of row-size-stride
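The mapping from TA[j][i] to A[i][j] comes down to index arithmetic, sketched below for row-major storage. The function names are hypothetical. In Impulse the equivalent calculation would happen in the memory controller on addresses rather than in software on indices.

```c
#include <assert.h>
#include <stddef.h>

/* A is m x n (row-major); TA is its n x m transposed alias.
 * Given a linear index k into TA, return the linear index into A
 * that holds the same element, i.e. TA[j][i] -> A[i][j]. */
size_t transpose_source_index(size_t k, size_t m, size_t n)
{
    assert(m > 0 && n > 0 && k < m * n);
    size_t j = k / m;      /* row of TA    = column of A */
    size_t i = k % m;      /* column of TA = row of A    */
    return i * n + j;
}

/* Read TA[j][i] from the original matrix without materializing TA. */
double read_transposed(const double *a, size_t m, size_t n, size_t j, size_t i)
{
    return a[transpose_source_index(j * m + i, m, n)];
}
```

Walking k from 0 to m*n - 1 visits TA with unit stride, while the underlying accesses to A jump by a full row (n elements) each step; that jump is the row-size stride the remapping hides from the processor.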

  12. Scatter/gather through An Indirection Vector • Reorganize an array according to an indirection vector • Original code: for(i=0;i<n;i++) ..A[iv[i]]..; • With Impulse: NA = Impulse_map(...) for(i=0;i<n;i++) ..NA[i]..; • MC maps NA[i] to A[iv[i]] • Benefits: • Sequentially access NA • No need to access iv in the processor
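A software analogue of the gather is shown below as a minimal sketch. It copies A[iv[i]] into a dense array NA, while Impulse would instead resolve each NA[i] access in the memory controller, so neither the copy nor the processor-side reads of iv would be needed. The function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Software stand-in for the gather: na[i] = a[iv[i]] for i in [0, n).
 * a_len is the length of a and is used only to check the indices. */
void emulate_gather(const double *a, size_t a_len,
                    const size_t *iv, size_t n, double *na)
{
    for (size_t i = 0; i < n; i++) {
        assert(iv[i] < a_len);   /* indirection must stay inside the array */
        na[i] = a[iv[i]];
    }
}
```

After the gather, a consumer loop reads `na[0..n-1]` sequentially, which is the access pattern the slide attributes to the remapped array.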

  13. Outline • Characteristics of image processing • Basic idea • Two remapping algorithms • Benchmarks • Volume rendering • Image rotation • Image filtering • Performance • Conclusion

  14. Volume Rendering • Algorithm • Brute-force ray tracing, orthographic tracer, 4x4x4 macro cells • Optimization • Pre-compute voxel sequences being visited • Impulse • Map voxels on a ray to contiguous shadow addresses
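Pre-computing the voxel sequence for a ray amounts to building an indirection vector, which can then drive a scatter/gather remapping. The sketch below does this for the simplest case, a ray parallel to the z-axis. It assumes a plain x-fastest linear layout, which is an assumption for illustration only: the benchmark uses 4x4x4 macro cells, so its real layout is likely different. The function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Fill iv[0..nz-1] with the linear indices of the voxels visited by a ray
 * that enters at (x, y) and travels along z, for an nx x ny x nz volume
 * stored with x varying fastest, then y, then z. */
void build_ray_indices(size_t nx, size_t ny, size_t nz,
                       size_t x, size_t y, size_t *iv)
{
    assert(x < nx && y < ny);
    for (size_t z = 0; z < nz; z++)
        iv[z] = (z * ny + y) * nx + x;
}
```

In this layout consecutive voxels on the ray are nx * ny elements apart, so without remapping each voxel falls on a different cache line for any realistic volume size.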

  15. Image Rotation • Algorithm • Separable image warp • Three-shear image rotation: horizontal, vertical, horizontal • Optimization • Tile the second shear operation • Impulse • Create transposed versions for the second shear operation
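For reference, the usual way to write a rotation as horizontal, vertical, and horizontal shears is the factorization below. The slides do not state which shear coefficients the benchmark uses, so treating it as this standard decomposition is an assumption.

```latex
\begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix}
=
\begin{pmatrix} 1 & -\tan(\theta/2) \\ 0 & 1 \end{pmatrix}
\begin{pmatrix} 1 & 0 \\ \sin\theta & 1 \end{pmatrix}
\begin{pmatrix} 1 & -\tan(\theta/2) \\ 0 & 1 \end{pmatrix}
```

The outer two factors shift pixels along rows, while the middle factor shifts them along columns. In a row-major image the middle (vertical) shear is therefore the pass with row-size strides, which is consistent with the second shear being the one that is tiled or remapped through a transpose.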

  16. Image Filtering • Algorithm • Binomial filter: applying a two-dimensional mask • Decomposed into a pair of linear filters: first along rows, then along columns • Impulse • Create transposed versions of both the input and output images for the column pass [Figure: mask of order 3]
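The row-then-column decomposition can be sketched for the smallest case, the order-3 binomial mask [1 2 1]/4, whose two-dimensional form is the outer product of that vector with itself. The code below is an illustrative sketch on a single-channel float image with clamped borders. The function names, the border policy, and the pixel type are assumptions, and the benchmark's much larger mask is not reproduced here.

```c
#include <assert.h>
#include <stddef.h>

/* One 1-D pass of the [1 2 1]/4 mask over count samples spaced stride apart.
 * Borders are handled by clamping to the nearest valid sample. */
static void binomial3_pass(const float *in, float *out, size_t count, size_t stride)
{
    for (size_t k = 0; k < count; k++) {
        size_t prev = (k == 0) ? 0 : k - 1;
        size_t next = (k + 1 == count) ? k : k + 1;
        out[k * stride] = (in[prev * stride] + 2.0f * in[k * stride]
                           + in[next * stride]) * 0.25f;
    }
}

/* Separable filter on a w x h row-major image: rows first, then columns.
 * tmp and out must each hold w * h floats and must not alias in. */
void binomial3_filter(const float *in, float *tmp, float *out, size_t w, size_t h)
{
    assert(w > 0 && h > 0);
    for (size_t y = 0; y < h; y++)
        binomial3_pass(in + y * w, tmp + y * w, w, 1);   /* unit stride */
    for (size_t x = 0; x < w; x++)
        binomial3_pass(tmp + x, out + x, h, w);          /* stride of one row */
}
```

The second loop is the column pass: every access is a full row apart, which is the access pattern that the transposed versions of the input and output images turn into unit-stride walks.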

  17. Performance of Volume Rendering • The rays are parallel to the x-axis, 1kx1kx1k volume • Note: time is in millions of cycles; TLB misses are in millions.

  18. Performance of Volume Rendering • The rays are perpendicular to the x-axis

  19. Performance of Image Rotation • Rotate a 1kx1k color image through one radian, 32x32 tiles

  20. Performance of Image Filtering • Order-121 binomial filter on a 1024x1024 color image

  21. Conclusion • Impulse memory system • Reorganizes data in shadow space • MC maps it back to DRAM • Improves performance of the memory system • Improved image processing benchmarks • Volume rendering by 226% • Image rotation by 19% • Image filtering by 44.7% • Looking for applications with • Poor cache/TLB behavior • And probably predictable access patterns

  22. Simulation Environment • 120MHz HP PA-RISC 1.1 processor • 120MHz HP Runway Bus • L1 Cache • 32 Kbytes, 32-byte line, direct-mapped, virtual-indexed, physically-tagged, 1-cycle latency, write-through • L2 Cache • 256 Kbytes, 128-byte line, 2-way associative, physically-indexed, physically-tagged, 8-cycle latency, write-allocate, write-back • MCache • 8 Kbytes for non-remapped data; 512 bytes for each remapped data structure

  23. MC-based Prefetching • Basic idea: prefetch data from DRAMs to MC • Hide DRAM latency • MCache: a small SRAM at the MC • A buffer for non-remapped data • A small buffer for each remapped data structure • Simple scheme: • Sequential prefetch for non-remapped data • Configurable-strided prefetch for remapped data
