
Evaluation of Multi-core Architectures for Image Processing Algorithms


Presentation Transcript


  1. Evaluation of Multi-core Architectures for Image Processing Algorithms Master's Thesis Presentation by Trupti Patil July 22, 2009

  2. Overview • Motivation • Contribution & scope • Background • Platforms • Algorithms • Experimental Results • Conclusion

  3. Motivation • Fast processing response is a major requirement in many image processing applications. • Image processing algorithms can be computationally expensive. • Data needs to be processed in parallel and optimized for real-time execution. • The recent introduction of massively parallel computer architectures promises significant acceleration. • Some of these architectures have not yet been actively explored.

  4. Overview • Motivation • Contribution & scope • Background • Platforms • Algorithms • Experimental Results • Conclusion

  5. Contribution & scope of the thesis • This thesis adapts and optimizes three image processing and computer vision algorithms for four multi-core architectures. • Execution timings are measured on each platform. • Obtained timings are compared against available corresponding previous work (intra-class) and across architecture types (inter-class). • Appropriate deductions are made based on the results.

  6. Overview • Motivation • Contribution & scope • Background • Platforms • Algorithms • Implementation • Conclusion

  7. Background • Need for Parallelization • SIMD Optimization • The need for faster execution time • Related work • Canny edge detection on CellBE [Gupta et al.] and on GPU [Luo et al.] • KLT tracking implementation on GPU [Sinha et al., Zach et al.]

  8. Overview • Motivation • Contribution & scope • Background • Platforms • Algorithms • Implementation • Experimental Results • Conclusion

  9. Hardware & Software Platforms

  10. Intel NetBurst & Core Microarchitectures • Improved performance/watt. • SSSE3 support for effective utilization of the XMM registers. • Supports SSE4. • Scales up to quad-core. • Can execute legacy IA-32 and SIMD applications at a higher clock rate. • HT allows simultaneous multithreading • Has two logical processors on each physical processor. • Support for up to SSE3.

  11. Cell Broadband Engine (CBE) [Figure: structural diagram of the Cell Broadband Engine: a PPE (PPU with L1 instruction cache, L1 data cache, and L2 cache) and eight SPEs (each an SPU with a Local Store (LS) and Memory Flow Controller (MFC)), connected by the Element Interconnect Bus (EIB) to main memory, a graphics device, and I/O devices.]

  12. Cell processor overview • One Power-based PPE, with VMX • 32/32kB I/D L1, and 512kB L2 • dual issue, in order PPU, 2 HW threads • Eight SPEs, with up to 16x SIMD • dual issue, in order SPU • 128 registers (128b wide) • 256 kB local store (LS) • 2x 16B/cycle DMA, 16 outstanding req. • Element Interconnect Bus (EIB) • 4 rings, 16B wide (at 1:2 clock) • 96B/cycle peak, 16B/cycle to memory • 2x 16B/cycle BIF and I/O • External communication • Dual XDR memory controller (MIC) • Two configurable bus interfaces (BIC) • Classical I/O interface • SMP coherent interface

  13. Graphics Processing Unit (GPU) [Figure: data flow in the GPU: Application → Vertex Processor → Assemble & Rasterize → Fragment Processor (fed by Textures) → Frame-buffer Operations → Frame Buffer.]

  14. Nvidia GeForce 8 Series GPU [Figure: graphics pipeline in the NVIDIA GeForce 8 Series GPU.]

  15. Compute Unified Device Architecture (CUDA) • Computing engine in Nvidia GPUs. • Turns the GPU into a highly multithreaded coprocessor (compute device). • Provides both a low-level and a higher-level API. • Has several advantages over programming GPUs through graphics APIs (e.g., OpenGL).

  16. Overview • Motivation • Contribution & scope • Background • Platforms • Algorithms • Experimental Results • Conclusion

  17. Algorithm 1: Gaussian Smoothing • Gaussian smoothing is a filtering kernel. • Removes small-scale texture and noise over a given spatial extent. • 1-D Gaussian kernel: G(x) = (1/(σ√(2π))) exp(−x²/(2σ²)) • 2-D Gaussian kernel: G(x, y) = (1/(2πσ²)) exp(−(x² + y²)/(2σ²)), which is separable into two 1-D passes.
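The separability noted above means a 2-D smoothing can be done as a row pass followed by a column pass with the same 1-D kernel. A minimal Python/NumPy sketch (illustrative only, not code from the thesis; the function names are my own):

```python
import numpy as np

def gaussian_kernel_1d(sigma, radius):
    # Sampled 1-D Gaussian, normalized so the weights sum to 1
    x = np.arange(-radius, radius + 1)
    k = np.exp(-x**2 / (2.0 * sigma**2))
    return k / k.sum()

def smooth(image, sigma=1.0, radius=2):
    # Separable smoothing: convolve every row, then every column,
    # with the same 1-D kernel (equivalent to one 2-D pass)
    k = gaussian_kernel_1d(sigma, radius)
    rows = np.apply_along_axis(lambda r: np.convolve(r, k, mode='same'), 1, image)
    return np.apply_along_axis(lambda c: np.convolve(c, k, mode='same'), 0, rows)
```

The two-pass form needs 2·(2r+1) multiplies per pixel instead of (2r+1)², which is why separability matters on the SIMD platforms discussed here.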

  18. Gaussian Smoothing (example)

  19. Algorithm 2: Canny Edge Detection • Edge detection is a common operation in image processing. • Edges are discontinuities in image gray levels and have strong intensity contrast. • Canny edge detection is an optimal edge-detection algorithm. • Illustrated ahead with an example.
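Two of Canny's core stages, gradient computation (Sobel) and double thresholding, can be sketched in a few lines of Python. This is a partial illustration under my own naming, not the thesis implementation; non-maximum suppression and hysteresis linking are omitted:

```python
import numpy as np

def conv2(img, k):
    # Minimal valid-mode 2-D convolution (kernel flipped per convolution)
    h, w = k.shape
    kf = k[::-1, ::-1]
    out = np.zeros((img.shape[0] - h + 1, img.shape[1] - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + h, j:j + w] * kf)
    return out

def gradient_magnitude(img):
    # Sobel x/y derivatives, then per-pixel gradient magnitude
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], float)
    gx = conv2(img, kx)
    gy = conv2(img, kx.T)
    return np.hypot(gx, gy)

def double_threshold(mag, lo, hi):
    # Classify pixels into strong edges and weak candidates
    strong = mag >= hi
    weak = (mag >= lo) & ~strong
    return strong, weak
```

On a vertical step edge the magnitude peaks along the edge column and is zero in the flat regions, which is exactly what the later suppression and hysteresis stages refine.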

  20. Canny Edge Detection (example)

  21. Algorithm 3: KLT Tracking • First proposed by Lucas and Kanade; extended by Tomasi and Kanade, and by Shi and Tomasi. • First, determine which feature(s) to track (feature selection). • Second, track the selected feature(s) across the image sequence. • Rests on three assumptions: temporal persistence, spatial coherence, and brightness constancy.
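Under the brightness-constancy and spatial-coherence assumptions above, one KLT/Lucas-Kanade update over a window reduces to solving a 2×2 linear system built from the spatial gradients Ix, Iy and the temporal difference It. A minimal sketch (my own naming, not the thesis code):

```python
import numpy as np

def lk_step(Ix, Iy, It):
    # One Lucas-Kanade iteration over a window:
    # solve the 2x2 normal equations G d = -b for the displacement d,
    # where G accumulates gradient products and b gradient-times-It sums
    G = np.array([[np.sum(Ix * Ix), np.sum(Ix * Iy)],
                  [np.sum(Ix * Iy), np.sum(Iy * Iy)]])
    b = np.array([np.sum(Ix * It), np.sum(Iy * It)])
    return np.linalg.solve(G, -b)
```

The Shi-Tomasi feature-selection step mentioned above keeps exactly those windows where G is well conditioned (both eigenvalues large), so this solve is stable at tracked features.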

  22. Algorithm 3: KLT Tracking

  23. Overview • Motivation • Contribution & scope • Background • Platforms • Algorithms • Results • Conclusion

  24. Gaussian Smoothing: Results Lenna Mandrill

  25. Results: Gaussian Smoothing

  26. Canny edge detection: Results Lenna Mandrill

  27. Results: Canny edge detection

  28. Results: Canny Edge Detection Comparison with other implementations on Cell Comparison with other implementations on GPU

  29. Results: KLT Tracking

  30. Results: KLT Tracking Comparison with other implementations on GPU • No known implementations yet.

  31. Overview • Motivation • Contribution & scope • Background • Platforms • Algorithms • Results • Conclusion & Extension

  32. Conclusion & Future work • The GPU remains ahead of the other architectures and is the most suited for image processing applications. • Further optimizing the PS3 implementation could narrow the gap between its timings and the GPU's. We could provide: • Support for faster color Canny. • Support for kernel widths larger than 5. • Better management of GPU thread alignment when not a multiple of 16. • Inclusion of Intel Xeon & Larrabee as potential architectures.

  33. Questions?

  34. Additional Slides

  35. CBE Architecture • Contains a traditional microprocessor, the PowerPC Processor Element (PPE), which controls tasks. • 64-bit PPC: 32 KB L1 instruction cache, 32 KB L1 data cache, and 512 KB L2 cache. • The PPE controls 8 synergistic processor elements (SPEs) operating as SIMD units. • Each SPE has an SPU and a memory flow controller (MFC) for data-intensive tasks. • SPU (RISC) with 128 128-bit SIMD registers and a 256 KB local store (LS). • PPE, SPEs, MIC, and BIC are connected by the Element Interconnect Bus (EIB) for data movement: a ring bus of four 16-byte channels providing a sustained bandwidth of 204.8 GB/s. • The MFC connection to Rambus XDR memory and the BIC interface to I/O devices via RapidIO provide 25.6 GB/s of data bandwidth.

  36. CBE: What makes it fast? • Huge inter-SPE bandwidth • 205 GB/s sustained throughput. • Fast main memory • 25.6 GB/s bandwidth to Rambus XDR memory. • Predictable DMA latency and throughput • DMA traffic has negligible impact on SPE local-store bandwidth • Easy to overlap data movement with computation. • High-performance, low-power SPE cores.
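The "overlap data movement with computation" point is usually realized on the SPEs with double buffering: while one local-store buffer is being processed, the MFC DMAs the next chunk into the other. A toy Python model of the schedule (the real version issues asynchronous MFC DMA commands; this sketch and its names are illustrative only):

```python
def process(stream, chunk, compute):
    # Toy double-buffering schedule: buffers[i % 2] is computed on
    # while buffers[(i + 1) % 2] receives the next chunk ("DMA-in")
    chunks = [stream[i:i + chunk] for i in range(0, len(stream), chunk)]
    results = []
    if not chunks:
        return results
    buffers = [None, None]
    buffers[0] = chunks[0]                    # initial transfer
    for i in range(len(chunks)):
        if i + 1 < len(chunks):
            buffers[(i + 1) % 2] = chunks[i + 1]  # prefetch next chunk
        results.extend(compute(buffers[i % 2]))   # compute current chunk
    return results
```

With real asynchronous DMA, the prefetch and the compute on the other buffer run concurrently, so transfer latency is hidden whenever compute time per chunk exceeds DMA time.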

  37. Nvidia GeForce (continued) • The GPU has K multiprocessors (MPs) • Each MP has L scalar processors (SPs) • Each MP performs block processing in batches • A block is processed by only one MP • Each block is split into SIMD groups of threads (warps) • A warp is executed physically in parallel • A scheduler switches between warps • A warp contains threads with increasing, consecutive thread IDs • Currently the warp size is 32 threads.

  38. CUDA: Programming model • A grid consists of thread blocks • Each thread executes the kernel • Grid and block dimensions are specified by the application, up to GPU limits • 1-D/2-D/3-D grid layout • Thread and block IDs are unique. [Figure: a grid of thread blocks, Block (0,0) through Block (3,1); one block expanded into a 2-D array of threads, Thread (0,0) through Thread (5,7), grouped into Warp 1 and Warp 2.]
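The ID scheme above can be made concrete with plain arithmetic: CUDA linearizes a 2-D thread index as ty·blockDim.x + tx, forms a global index from block and thread IDs, and groups consecutive linear IDs into warps of 32. A small Python illustration (mirroring CUDA's indexing rules, not CUDA code itself):

```python
WARP_SIZE = 32  # current warp size, as stated on the previous slide

def linear_thread_id(tx, ty, block_dim_x):
    # CUDA linearizes a 2-D thread index as ty * blockDim.x + tx
    return ty * block_dim_x + tx

def global_thread_index(block_idx, block_dim, thread_idx):
    # Global 1-D index: blockIdx.x * blockDim.x + threadIdx.x
    return block_idx * block_dim + thread_idx

def warp_of(linear_id):
    # Threads with consecutive linear IDs fall into the same warp
    return linear_id // WARP_SIZE
```

For the 6x8 block in the figure, threads with linear IDs 0-31 form Warp 1 and IDs 32-47 form Warp 2, matching the grouping shown on the slide.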

  39. CUDA: Memory model • Shared memory (R/W): for sharing data within a block • Texture memory: spatially cached • Constant memory: 64 KB, cached • Global memory: not cached; accesses should be coalesced • Explicit GPU memory allocation/de-allocation • Copying between CPU and GPU memory is slow.
