
Presentation Transcript


  1. Massively Parallel Near-Linear Scalability Algorithms with Application to Unstructured Video Analysis. Robert Farber and Harold Trease, Pacific Northwest National Laboratory. Acknowledgments: Adam Wynne (PNNL), Lynn Trease (PNNL), Tim Carlson (PNNL), Ryan Mooney (now at google.com).

  2. Image/Video Analysis and Applications: "Have we seen this person's face before?"
  • Goals: image/video content analysis at 1 million frames per second of processing capability (~1 TByte/sec)
  • Streaming, unstructured video represents high-volume, low-information-content data
  • Huge volumes of archival data
  • Requirement: scalable algorithms to transform unstructured data into large sparse graphs for analysis
  • This talk focuses on the Principal Component Analysis (PCA) of video signatures
  • The framework is generally applicable to other problems!
  • Video analysis has many applications: face recognition (and object recognition), social networks, and many others

  3. First Task: Isolate the Faces. Four panels: (1) original frame, (2) RGB-to-HSI conversion, (3) Sobel edge detection, (4) only skin pixels. The bottom row contains frames of skin-pixel patches that identify the three faces in this frame.
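The transcript doesn't include the pipeline's code, but the RGB-to-HSI conversion it names is standard. Below is a minimal C sketch of steps 2 and 4: converting a pixel to HSI and testing it against a skin range. The threshold values and function names are illustrative assumptions, not the talk's actual parameters.

```c
#include <math.h>
#include <stdbool.h>

/* Convert one RGB pixel (components in [0,1]) to HSI.
 * H is returned in degrees [0,360); S and I lie in [0,1]. */
static void rgb_to_hsi(double r, double g, double b,
                       double *h, double *s, double *i)
{
    double sum = r + g + b;
    double min = fmin(r, fmin(g, b));

    *i = sum / 3.0;
    *s = (sum > 0.0) ? 1.0 - 3.0 * min / sum : 0.0;

    double num = 0.5 * ((r - g) + (r - b));
    double den = sqrt((r - g) * (r - g) + (r - b) * (g - b));
    double theta = (den > 0.0) ? acos(num / den) : 0.0;  /* radians */

    *h = (b <= g) ? theta : 2.0 * M_PI - theta;
    *h *= 180.0 / M_PI;  /* convert to degrees */
}

/* Hypothetical skin test: hue near red/orange with moderate saturation.
 * The exact thresholds used in the talk are not given; these values
 * are placeholders for illustration only. */
static bool is_skin_pixel(double r, double g, double b)
{
    double h, s, i;
    rgb_to_hsi(r, g, b, &h, &s, &i);
    return (h < 50.0 || h > 340.0) && s > 0.1 && s < 0.6 && i > 0.2;
}
```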

  4. Signatures Workflow. Pipeline: archival data (YouTube, huge!) and streaming input (10k cameras = ~300,000 fps = ~300 GB/sec) → split into frames and calculate entropic measures → algorithms (PCA, NLPCA, MDS, clustering, others).
  • Frames/faces are separable
  • Faces form trajectories
  • Face DB
  • Derive social networks

  5. Signatures
  • The first steps are embarrassingly parallel:
  • Split the video into separate frames
  • Calculate the signature of each frame and write it to a file (a sketch follows)
  Workflow: archival data (YouTube, huge!) and streaming input (10k cameras = ~300,000 fps = ~300 GB/sec) → split into frames and calculate entropic measures
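The slides reduce each frame to "entropic measures" without defining them; one plausible component is the Shannon entropy of the frame's intensity histogram. A minimal C sketch, assuming 8-bit grayscale input:

```c
#include <math.h>
#include <stdint.h>
#include <stddef.h>

/* Shannon entropy (in bits) of an 8-bit grayscale frame's histogram.
 * The transcript only says frames are reduced to "entropic measures";
 * a histogram entropy like this is one plausible signature component. */
static double frame_entropy(const uint8_t *pixels, size_t n)
{
    size_t hist[256] = {0};
    for (size_t i = 0; i < n; ++i)
        hist[pixels[i]]++;

    double h = 0.0;
    for (int v = 0; v < 256; ++v) {
        if (hist[v] == 0) continue;          /* 0 * log(0) := 0 */
        double p = (double)hist[v] / (double)n;
        h -= p * log2(p);
    }
    return h;
}
```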

  6. Signatures Workflow (continued): split into frames and calculate entropic measures, then feed the signatures to the analysis algorithms (PCA, NLPCA, MDS, clustering, others).
  • Frames/faces are separable
  • Faces form trajectories
  • Face DB
  • Derive social networks

  7. Working with Large Data Sets (think BIG: 10^8 signatures and greater)
  • Formulate PCA (NLPCA, MDS, and others) as an objective function
  • Use your favorite solver (e.g., conjugate gradient)
  • Map to massively parallel hardware (SIMD, MIMD, SPMD, etc.): Ranger, NVIDIA GPUs, others
  Massive parallelism is needed to handle large data sets:
  • 10,000 video cameras = ~300,000 fps = ~300 GB/sec
  • Consider all of YouTube as a video archive
  • Our Supercomputing 2005 data set = 2.2M frames
  • Our test YouTube dataset consisted of over 22M frames

  8. Formulate PCA as an Objective Function
  • Calculate the PCA by passing information through a bottleneck layer in a linear feed-forward neural network
  • Oja, Erkki (November 1982). "Simplified neuron model as a principal component analyzer". Journal of Mathematical Biology 15 (3): 267-273.
  • Sanger, Terence D. (1989). "Optimal unsupervised learning in a single-layer linear feedforward neural network". Neural Networks 2 (6): 459-473.
  • Use your favorite solver (conjugate gradient, ...)
  • William H. Press, Saul A. Teukolsky, William T. Vetterling, and Brian P. Flannery. "Numerical Recipes in C: The Art of Scientific Computing". Cambridge University Press, 1993.
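The transcript doesn't write the objective out, but the construction the Oja and Sanger references describe is the linear bottleneck ("autoencoder") reconstruction error, whose minimizer spans the leading principal subspace:

```latex
% Linear bottleneck objective: x_i are the frame-signature vectors,
% W_1 projects to the k-dimensional bottleneck, W_2 reconstructs.
E(W_1, W_2) \;=\; \sum_{i=1}^{M} \left\| x_i - W_2 W_1 x_i \right\|^2,
\qquad W_1 \in \mathbb{R}^{k \times n},\; W_2 \in \mathbb{R}^{n \times k},\; k \ll n .
```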

  9. Pass the information through a bottleneck. (Figure: a linear feed-forward network with a narrow bottleneck layer between input and output.)

  10. Map to Massively Parallel Hardware
  • Large data sets require a parallel data load to deliver the necessary bandwidth
  • Use Lustre because it scales: PNNL achieved 136 GB/s sustained read, 86 GB/s sustained write
  • Broadcast the filename plus data size and file offset to each MPI client
  • Each client opens the data file, seeks to its location, and reads the appropriate data (sketched below)
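A minimal C/MPI sketch of the load pattern the slide describes: broadcast the filename and extent, then every rank opens the file itself, seeks to its own offset, and reads its share. The function name and even-chunking scheme are illustrative assumptions; error handling and the real record layout are omitted.

```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

/* Illustrative parallel load: rank 0 supplies fname_in and total_bytes_in;
 * every rank receives them by broadcast and reads its own slice. */
static char *load_my_chunk(const char *fname_in, long total_bytes_in,
                           int rank, int nproc, long *my_bytes)
{
    char fname[4096];
    long total_bytes = total_bytes_in;

    if (rank == 0)
        snprintf(fname, sizeof fname, "%s", fname_in);
    /* Broadcast filename and data size to all MPI clients. */
    MPI_Bcast(fname, sizeof fname, MPI_CHAR, 0, MPI_COMM_WORLD);
    MPI_Bcast(&total_bytes, 1, MPI_LONG, 0, MPI_COMM_WORLD);

    long chunk  = total_bytes / nproc;
    long offset = (long)rank * chunk;
    if (rank == nproc - 1)              /* last rank takes the remainder */
        chunk = total_bytes - offset;

    /* Each client opens the file, seeks to its location, and reads. */
    FILE *fp = fopen(fname, "rb");
    fseek(fp, offset, SEEK_SET);
    char *buf = malloc(chunk);
    fread(buf, 1, chunk, fp);
    fclose(fp);

    *my_bytes = chunk;
    return buf;
}
```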

  11. Evaluate the objective function in a massively parallel manner. (Diagram, reconstructed:)
  • An optimization routine (Powell, conjugate gradient, or another method) proposes parameters; Energy = func(P1, P2, ..., PN)
  • Step 1: broadcast the parameters P1, P2, ..., PN to all cores (scales by P)
  • Step 2: each core calculates partial energies over its own share of the examples (examples 1..N on core 1, N+1..2N on core 2, and so on; scales by data/Nproc)
  • Step 3: sum the partial energies with a reduction, O(log2(Nproc))
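A minimal C/MPI sketch of one objective-function evaluation in this three-step pattern. The local_energy callback stands in for the per-example error computation (here, the PCA reconstruction error over this rank's examples), which the diagram leaves abstract.

```c
#include <mpi.h>

/* One evaluation of the objective: broadcast, partial sums, reduce.
 * `params` holds the optimizer's current parameter vector on rank 0. */
double parallel_energy(double *params, int nparams,
                       const float *my_examples, long my_count,
                       double (*local_energy)(const double *, int,
                                              const float *, long))
{
    /* Step 1: broadcast the current parameters to every rank. */
    MPI_Bcast(params, nparams, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    /* Step 2: each rank evaluates its partial sum of the errors. */
    double partial = local_energy(params, nparams, my_examples, my_count);

    /* Step 3: O(log2(Nproc)) reduction to the global energy. */
    double total = 0.0;
    MPI_Allreduce(&partial, &total, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    return total;
}
```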

  12. Report the Effective Rate
  • Every evaluation of the objective function requires:
  • Broadcasting a new set of parameters
  • Calculating the partial sum of the errors on each node
  • Obtaining the global sum of the partial sums of the errors
  • T_reduce is highly network dependent: low bandwidth and/or high latency is bad!
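Written out as a cost model (this decomposition is my reading of the slide, not stated in the transcript), with alpha the per-message latency, beta the inverse bandwidth, and m the message size:

```latex
% Rough per-evaluation time implied by the slide's three steps.
T_{\mathrm{eval}} \;\approx\; T_{\mathrm{bcast}}
  \;+\; \frac{T_{\mathrm{data}}}{N_{\mathrm{proc}}}
  \;+\; T_{\mathrm{reduce}},
\qquad
T_{\mathrm{reduce}} \;\sim\; \log_2(N_{\mathrm{proc}})\,(\alpha + \beta m).
```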

  13. Very efficient and near-linear scaling on Ranger. Note: 32k- and 64k-core runs will be performed when possible.

  14. The reduce operation does affect scaling.

  15. Objective function performance scaling by data size on Ranger (synthetic benchmark with no communications)
  • Without prefetch: achieved 8 GF/s per core using SSE; interesting performance segregation
  • With prefetch: achieved nearly 8 GF/s per core; bizarre jump at 800k examples

  16. Most of the time (>90%) is spent in the objective function when solving the PCA problem. Note: data sizes were kept constant per node, which meant each trial trained on different data.

  17. The mapping works with other problems (and architectures)
  • The SIMD version has been used by Farber since the early 1980s on 64k-processor Connection Machines (and other SIMD, MIMD, SPMD, vector, and cluster architectures)
  • R. M. Farber, "Efficiently Modeling Neural Networks on Massively Parallel Computers", Los Alamos National Laboratory Technical Report LA-UR-92-3568.
  • Kurt Thearling, "Massively Parallel Architectures and Algorithms for Time Series Analysis", in 1993 Lectures in Complex Systems, edited by L. Nadel and D. Stein, Addison-Wesley, 1995.
  • Alexander Singer, "Implementations of Artificial Neural Networks on the Connection Machine", Technical Report RL90-2, Thinking Machines Corporation, 245 First Street, Cambridge, MA 02142, January 1990.
  • Many different applications aside from PCA: Independent Component Analysis, k-means, Fourier approximation, Expectation Maximization, logistic regression, Gaussian Discriminant Analysis, locally weighted linear regression, Naïve Bayes, Support Vector Machines, and others

  18. PCA components form trajectories in 3-space
  • Separable trajectories: we can build a face DB!
  • Different faces form separate tracks
  • The same faces are continuous across cameras
  • Multiple faces extracted from individual frames: we can infer social networks!

  19. Preliminary Results Using PCA
  • Public ground-truth datasets are scarce; this is work in progress
  • PCA was the first step (funding limited); NLPCA, MDS, and other methods promise to increase accuracy
  • Using Euclidean distance between points as a recognition metric (see the sketch after this slide):
  • 99.9% accuracy in one data set
  • 2 false positives in a 2k database of known faces
  • Each face in the database was compared against the entire database as a self-consistency check
  • Social networks have been created and are being evaluated; again, ground-truth data is scarce
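A minimal C sketch of the recognition metric as described: brute-force nearest neighbor by Euclidean distance in PCA space. The match threshold and the linear scan are illustrative assumptions; the transcript specifies neither.

```c
#include <stddef.h>

/* Squared Euclidean distance between two k-dimensional PCA signatures. */
static double dist2(const double *a, const double *b, int k)
{
    double d = 0.0;
    for (int i = 0; i < k; ++i) {
        double t = a[i] - b[i];
        d += t * t;
    }
    return d;
}

/* Brute-force nearest neighbor in the face DB; returns its index, or
 * -1 if nothing falls within `threshold` of the query signature. */
static long match_face(const double *query, const double *db,
                       size_t nfaces, int k, double threshold)
{
    long best = -1;
    double best_d = threshold * threshold;  /* compare squared distances */
    for (size_t i = 0; i < nfaces; ++i) {
        double d = dist2(query, db + i * (size_t)k, k);
        if (d < best_d) { best_d = d; best = (long)i; }
    }
    return best;
}
```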

  20. Summary: High-Performance Video Analysis
  • Streaming video → face database (SC05 videos)
  • Building social-network graphs from face data: face DB → social network
  • Partitioning face-based graphs to discover relationships

  21. Two video examples (in conjunction with Blogosphere text analysis by Michelle Gregory and Andrew Cowell)
  • 351 videos, ~3.6 million frames, ~4.4 TBytes (each point is a video frame, each color is a different video; coordinates are the PCA projection of an N-d feature vector into 3-D)
  • 512 YouTube videos, ~22.6 million frames, ~5.2 TBytes

  22. Connecting the points and forming the sparse graph connectivity for analysis
  • Delaunay/Voronoi mesh: shows the mesh connections, where "points" are connected by "edges", which together form a graph (see the sketch below)
  • Adjacency matrix: each row represents a frame; columns represent connected frames
  • The clusters and social network define how one frame (face) relates to another
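A minimal C sketch of the mesh-to-adjacency step, assuming the Delaunay mesher emits an undirected edge list. A dense matrix is used only for clarity; at the data sizes quoted earlier the real code would need a sparse format such as CSR.

```c
#include <stdlib.h>

/* Build a dense boolean adjacency matrix from a Delaunay mesh edge
 * list, as on the slide: row i marks the frames connected to frame i. */
static char *adjacency_from_edges(const long (*edges)[2], size_t nedges,
                                  size_t nframes)
{
    char *adj = calloc(nframes * nframes, 1);
    for (size_t e = 0; e < nedges; ++e) {
        long i = edges[e][0], j = edges[e][1];
        adj[i * (long)nframes + j] = 1;   /* mesh edges are undirected, */
        adj[j * (long)nframes + i] = 1;   /* so mark both directions    */
    }
    return adj;
}
```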

  23. Graph Partitioning (using Voronoi/Delaunay mesh-connected graphs). Figures: adjacency before partitioning, adjacency after partitioning, Delaunay/Voronoi mesh, point distribution, partitioned mesh.

  24. Classification, Characterization, and Clustering of High-Dimensional Data.
