1 / 24

Exploring the Design Space of a Parallel Object Recognition System

Exploring the Design Space of a Parallel Object Recognition System. Bor-Yiing Su, subrian@eecs.berkeley.edu Kurt Keutzer, keutzer@eecs.berkeley.edu and the PALLAS group http://parlab.eecs.berkeley.edu/wiki/pallas Parallel Computing Lab, University of California, Berkeley.

zamir
Download Presentation

Exploring the Design Space of a Parallel Object Recognition System

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Exploring the Design Space of a Parallel Object Recognition System Bor-Yiing Su, subrian@eecs.berkeley.edu Kurt Keutzer, keutzer@eecs.berkeley.edu and the PALLAS group http://parlab.eecs.berkeley.edu/wiki/pallas Parallel Computing Lab, University of California, Berkeley

  2. Successes of the Pallas Group • MRI: Murphy M, Keutzer K, Vasanawala S, Lustig M (2010) Clinically Feasible Reconstruction for L1-SPIRiT Parallel Imaging and Compressed Sensing MRI. ISMRM 2010. • SVM: Catanzaro B, Sundaram N, Keutzer K (2008) Fast support vector machine training and classification on graphics processors. ICML 2008, pp 104-111. • Contour: Catanzaro B, Su B, Sundaram N, Lee Y, Murphy M, Keutzer K (2009) Efficient, high quality image contour detector. ICCV 2009, pp. 2381-2388. • Object Recognition: Su B, Brutch T, Keutzer K (2011) A parallel region based object recognition system. WACV 2011. • Optical Flow: Sundaram N, Brox T, Keutzer K (2010) Dense Point Trajectories by GPU-accelerated Large Displacement Optical Flow. ECCV 2010. • Speech: Chong J, Gonina E, You K, Keutzer K (2010) Exploring Recognition Network Representations for Efficient Speech Inference on Highly Parallel Platforms. International Speech Communication Association, pp. 1489-1492. • Value-at-risk: Dixon M, Chong J, Keutzer K (2009) Acceleration of market value-at-risk estimation. Workshop on High Performance Computing in Finance at Super Computing.

  3. Outline • Design Space • An Object Recognition System • Exploring the Design Space of the Object Recognition system • Conclusion

  4. Three Layers of Design Space Full Lanczos Solver ? Selective No with CW test Graph Partition ? BFS Graph Traversal Parallel Task Queue 2 x 8 Blocking Dimensions ? 4 x 4 8 x 2 • The design space of parallel applications is composed of three layers

  5. Statements Algorithm Layer Parallelization Strategy Layer Platform Layer Design Space Exploring the design space is necessary to achieve high performance on a hardware platform of choice Take advantage of domain knowledge is necessary to understand trade-offs among different parallelization methods and achieve peak performance

  6. Outline • Design Space • An Object Recognition System • Exploring the Design Space of the Object Recognition system • Future Work

  7. Object Recognition System Trained Categories Bottles Swans Giraffes Mugs Outputs Image Queries Object Recognition System

  8. Targeting Object Recognition System Training Flow Classification Flow • C. Gu, J. Lim, P. Arbelǽz, and J. Malik. Recognition using regions. In CVPR, 2009

  9. Outline • Design Space • An Object Recognition System • Exploring the Design Space of the Object Recognition system • Lanczos Eigensolver • Breadth First Search (BFS) Graph Traversal Kernel • Pair-wise Distance Kernel • Overall Performance • Demo • Conclusion

  10. Lanczos Eigensolver • The LanczosEigensolver is the most time consuming part in the contour detection computation • By linking pixels to its neighbors, we can construct an affinity matrix among the pixels • We can find good contours by performing the normalized cut spectral graph partitioning algorithm on the graph • The normalized cut can be solved by finding the eigenvectors of the affinity matrix

  11. Exploring the Algorithm Layer Full Reorthogonalization No Reorthogonalization Selective Reorthogonalization • Three different re-orthogonalization methods • Full re-orthogonalization • Selective re-orthogonalization • No re-orthogonalization with Cullum - Willoughby test

  12. Experimental Results • Explored Design Space • Three different re-orthogonalization methods • Conclusion: • Use the no re-orthogonalization with Cullum- Willoughby test method on a GPU in our system

  13. BFS Graph Traversal Kernel • The image segmentation component heavily relies on the BFS graph traversal kernel • Image Graph: • Nodes represent image pixels • Edges represent neighboring relationships • BFS graph traversal kernel: propagate information from some pixels to other pixels

  14. Exploring the Algorithm Layer • Reverse algorithm: Each node checks whether it can be updated by one or more neighboring nodes • Structured grid computation • Direct algorithm: Each source node propagates information to its neighbors • Traditional BFS graph traversal algorithm

  15. Exploring the Parallelization Strategy Layer ... Graph Partition Task Queue • Two strategies can be used to parallelize the traditional BFS graph traversal algorithm • Graph partition • Parallel task queue

  16. Associating Design Space Exploration with Input Data Properties Computation Time (s) • Conclusion: • Use the structured grid method on a GPU in our system Maximum Traversal Distance • Explored Design Space • Parallel Task queue on Intel Core i7 using OpenMP with 8 threads • Graph partition on Intel Core i7 using OpenMP with 8 threads • Structured grid on Nvidia GTX 480

  17. Pair-wise χ2 Distance Kernel Definition of the χ2 distance: • In both the training stage and the classification stage, we need to compute the pair-wise distance between two region sets • Similar regions have shorter distance • Different regions have longer distance • It is a matrix matrix multiplication computation • Replacing dot product into χ2 distance

  18. Exploring the Design Space • Platform Layer • Cache Mechanisms • No Cache • Hardware Controlled Cache (Texture memory on GPU) • Software Controlled Cache (Shared memory on GPU) • Outer χ2 distance • Algorithm Layer • Inner χ2 distance

  19. Associating Design Space Exploration with Input Data Properties • Conclusion: • Use the outer χ2distance method with software controlled cache in the training stage • Use the inner χ2distance method with software controlled cache in the classification stage • Explored Design Space • The combination of two algorithms and three cache mechanisms on Nvidia GTX 480

  20. Overall Performance: Speedups Training Classification

  21. Overall Performance: Detection Accuracy Detection Rate Detection Rate FPPI FPPI Serial Parallel

  22. Demo

  23. Conclusion Bor-Yiing Su, Tasneem Brutch, Kurt Keutzer, "A Parallel Region Based Object Recognition System," in IEEE Workshop on Applications of Computer Vision (WACV 2011), Hawaii, January 2011 Bryan Catanzaro, Bor-Yiing Su, Narayanan Sundaram, Yunsup Lee, Mark Murphy, and Kurt Keutzer, "Efficient, High-Quality Image Contour Detection," in International Conference on Computer Vision (ICCV-2009), pp. 2381-2388, Kyoto Japan, September 2009. Exploring the design space is necessary to achieve high performance on a hardware platform of choice Take advantage of domain knowledge is necessary to understand trade-offs among different parallelization methods and achieve peak performance We have developed a parallel object recognition system with comparable detection accuracy while achieving 110x-120x times speedup Work presented at ICCV 2009 and WACV 2011

  24. Thank You

More Related