
Presentation Transcript


  1. Analysis Tools for Data Enabled Science: IndexedHBase. Judy Qiu, xqiu@indiana.edu, http://SALSAhpc.indiana.edu. School of Informatics and Computing, Indiana University. Summer Workshop on Algorithms and Cyberinfrastructure for Large Scale Optimization/AI, August 9, 2013.

  2. Big Data Challenge (Source: Helen Sun, Oracle Big Data)

  3. Learning from Big Data: converting raw data into knowledge discovery.
  • Exponential data growth
  • Continuous analysis of streaming data
  • A variety of algorithms and data structures
  • Multi-/many-core and GPU architectures
  • Thousands of cores in clusters and millions in data centers
  • Cost and time trade-offs
  • Parallelism is a must to process data in a meaningful length of time

  4. Programming Runtimes. High-level programming models such as MapReduce adopt a data-centered design: computation starts from the data, and the runtime supports moving computation to the data. This shows promising results for data-intensive computing (Google, Yahoo, Amazon, Microsoft, ...). Challenges: traditional MapReduce and classical parallel runtimes cannot solve iterative algorithms efficiently. Hadoop repeatedly reads data from HDFS, with no optimization for (in-memory) data caching or (collective) intermediate data transfers; MPI has no natural support for fault tolerance, and its programming interface is complicated. (Slide figure: a spectrum of runtimes ranging from "achieve higher throughput" to "perform computations efficiently": Pig Latin/Hive, workflows (Swift, Falkon), Hadoop MapReduce, PaaS worker roles, classic cloud queues/workers, MPI/PVM/HPF, DAGMan/BOINC, Chapel/X10.) A sketch of the resulting iteration-by-iteration job submission on Hadoop follows.
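
To make the iteration problem concrete, here is a minimal sketch, not code from the talk, of a driver for an iterative algorithm on vanilla Hadoop: every pass submits a new job, so the loop-invariant input (the hypothetical path /data/points) is re-read from HDFS each time and nothing is cached between iterations. The identity Mapper and Reducer stand in for the real per-iteration computation.

```java
// Sketch only: a vanilla-Hadoop driver for an iterative algorithm.
// Every iteration submits a new job, so the static input is re-read
// from HDFS each time and no intermediate data is cached in memory.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class NaiveIterativeDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    for (int iter = 0; iter < 10; iter++) {
      // variable data (e.g. the current model/centroids) has to be re-distributed
      // from scratch every iteration; here it is just a hypothetical HDFS path
      conf.set("model.path", "/model/iteration-" + iter);
      Job job = Job.getInstance(conf, "iteration-" + iter);
      job.setJarByClass(NaiveIterativeDriver.class);
      job.setMapperClass(Mapper.class);    // identity map: placeholder for the real per-iteration compute
      job.setReducerClass(Reducer.class);  // identity reduce: placeholder for the real aggregation
      FileInputFormat.addInputPath(job, new Path("/data/points"));  // loop-invariant data, re-read from HDFS every pass
      FileOutputFormat.setOutputPath(job, new Path("/model/iteration-" + (iter + 1)));
      if (!job.waitForCompletion(true)) {
        System.exit(1);
      }
    }
  }
}
```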

  5. Applications & Different Interconnection Patterns. (Slide figure: three classes of interconnection patterns, from map-only pipelines with no communication, through map-reduce with iterations and collective communication over intermediate data, the domain of MapReduce and its iterative extensions, to tightly coupled MPI applications exchanging partial results Pij.)

  6. Data Analysis Tools. Twister, the speedy elephant: MapReduce optimized for iterative computations. Abstractions:
  • Portability: HPC (Java) and Azure Cloud (C#)
  • Map-Collective: communication patterns optimized for large intermediate data transfer
  • In-memory: cacheable map/reduce tasks
  • Data flow: iterative, with loop-invariant and variable data
  • Thread: lightweight, with local aggregation

  7. Programming Model for Iterative MapReduce. The main program runs a loop, while(..) { runMapReduce(..) }, around the MapReduce invocation:
  • Configure(): loop-invariant data is loaded only once into cacheable (in-memory) map/reduce tasks, while variable data is passed anew each iteration (data flow vs. δ flow).
  • Map(Key, Value) and Reduce(Key, List<Value>): use a faster intermediate data-transfer mechanism.
  • Combine(Map<Key, Value>): a combine operation collects all reduce outputs back in the main program.
  A self-contained sketch of this control flow follows.
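
Below is a small, self-contained sketch of this control flow in plain Java (a single JVM, not the actual Twister API; all class and method names are illustrative): configure() loads and caches the loop-invariant points once, each pass of the loop maps over the cached data with the current variable data (the centroids), reduces per key, and a combine step collects all reduce outputs back in the main program.

```java
// Minimal, self-contained sketch of the iterative MapReduce programming model.
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class IterativeMapReduceSketch {
  static List<double[]> cachedPoints;                       // loop-invariant data, loaded only once

  static void configure(int nPoints, int dim) {             // Configure(): cache the invariant data
    Random r = new Random(42);
    cachedPoints = Stream.generate(() -> r.doubles(dim).toArray())
                         .limit(nPoints).collect(Collectors.toList());
  }

  // Map(Key, Value): assign every cached point to its nearest centroid
  static Map<Integer, List<double[]>> map(double[][] centroids) {
    return cachedPoints.stream().collect(Collectors.groupingBy(p -> nearest(p, centroids)));
  }

  // Reduce(Key, List<Value>): average the points assigned to one centroid
  static double[] reduce(List<double[]> points) {
    double[] mean = new double[points.get(0).length];
    for (double[] p : points) for (int d = 0; d < mean.length; d++) mean[d] += p[d];
    for (int d = 0; d < mean.length; d++) mean[d] /= points.size();
    return mean;
  }

  static int nearest(double[] p, double[][] centroids) {
    int best = 0;
    double bestDist = Double.MAX_VALUE;
    for (int c = 0; c < centroids.length; c++) {
      double dist = 0;
      for (int d = 0; d < p.length; d++) dist += (p[d] - centroids[c][d]) * (p[d] - centroids[c][d]);
      if (dist < bestDist) { bestDist = dist; best = c; }
    }
    return best;
  }

  public static void main(String[] args) {
    configure(10_000, 8);
    double[][] centroids = cachedPoints.subList(0, 4).toArray(new double[0][]);  // variable data
    for (int iter = 0; iter < 10; iter++) {                  // while(..) { runMapReduce(..) }
      Map<Integer, List<double[]>> groups = map(centroids);
      // Combine(Map<Key, Value>): gather all reduce outputs into the next centroids
      for (Map.Entry<Integer, List<double[]>> e : groups.entrySet())
        centroids[e.getKey()] = reduce(e.getValue());
    }
    System.out.println("first centroid: " + Arrays.toString(centroids[0]));
  }
}
```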

  8. Map-Collective Communication Model. We generalize the MapReduce concept to Map-Collective, noting that large collectives are a distinguishing feature of data-intensive and data-mining applications. Collectives generalize Reduce to include all large-scale linked communication-compute patterns; MapReduce already includes a step in the collective direction with sort, shuffle, and merge as well as basic reduction. Patterns (supported by H-Collectives) and example applications:
  • Map-ReduceScatter: PageRank, belief propagation
  • Map-AllReduce: KMeans clustering, MDS-StressCalc
  • MapReduce: word count, grep
  • MapReduce-MergeBroadcast: KMeans clustering, PageRank
  • Map-AllGather: MDS-BCCalc

  9. Case Studies: Data Analysis Algorithms. Support a suite of parallel data-analysis capabilities:
  • Clustering using image data
  • Parallel inverted indexing used for HBase
  • Matrix algebra as needed: matrix multiplication, equation solving, eigenvector/eigenvalue calculation

  10. Iterative Computations: K-means and matrix multiplication. (Slide figures: performance of K-means; parallel overhead of matrix multiplication.)

  11. PageRank. (Slide figure: map tasks (M) hold a partial adjacency matrix and the current compressed page ranks, reduce tasks (R) produce partial updates, and a combine step (C) partially merges the updates across iterations.)
  • Well-known PageRank algorithm [1]
  • Used the ClueWeb09 data set [2] (1 TB in size) from CMU
  • Hadoop loads the web graph in every iteration, while Twister keeps the graph in memory
  • The Pregel approach seems natural for graph-based problems
  [1] PageRank Algorithm, http://en.wikipedia.org/wiki/PageRank
  [2] ClueWeb09 Data Set, http://boston.lti.cs.cmu.edu/Data/clueweb09/
  A toy sketch of the iteration pattern follows.
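
As a toy illustration of the iteration pattern (plain Java, unrelated to the ClueWeb09 runs), the sketch below keeps the adjacency lists, the loop-invariant data, in memory across iterations while only the rank vector changes; this is exactly the data Twister caches and Hadoop reloads every pass.

```java
// Toy power-iteration PageRank: the adjacency lists (loop-invariant data) stay
// in memory across iterations; only the rank vector (variable data) is updated.
import java.util.Arrays;

public class PageRankSketch {
  public static void main(String[] args) {
    // outLinks[i] lists the pages that page i links to (a tiny 4-page graph)
    int[][] outLinks = { {1, 2}, {2}, {0}, {0, 2} };
    int n = outLinks.length;
    double d = 0.85;                                   // damping factor
    double[] rank = new double[n];
    Arrays.fill(rank, 1.0 / n);

    for (int iter = 0; iter < 30; iter++) {
      double[] next = new double[n];
      Arrays.fill(next, (1.0 - d) / n);
      for (int page = 0; page < n; page++) {           // "map": scatter rank along out-links
        double share = d * rank[page] / outLinks[page].length;
        for (int target : outLinks[page]) {
          next[target] += share;                       // "reduce": accumulate partial updates
        }
      }
      rank = next;                                     // variable data for the next iteration
    }
    System.out.println(Arrays.toString(rank));
  }
}
```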

  12. Data Intensive Kmeans Clustering (collaboration with Prof. David Crandall). (Slide figure: the pipeline from images to clusters, extracting patches and HOG features, with sample image IDs such as 99000070, 99000076, and 99000432.)
  • Image classification: 7 million images; 512 features per image; 1 million clusters; 10K map tasks; 64 GB of broadcast data (1 GB of data transfer per map-task node); 20 TB of intermediate data in shuffling.
  • Two stages: feature extraction, then clustering.

  13. High Dimensional Image Data
  • The K-means clustering algorithm is used to cluster images with similar features.
  • Each image is characterized as a data point (vector) with between 512 and 2048 dimensions; each value (feature) ranges from 0 to 255.
  • In a full execution of the image clustering application, we successfully cluster 7.42 million vectors into 1 million cluster centers. 10,000 map tasks are created on 125 nodes; each node runs 80 tasks and each task caches 742 vectors.
  • For 1 million centroids, the broadcast data size is about 512 MB. The shuffled data is 20 TB, while the data size after local aggregation is about 250 GB.
  • Since the total memory on the 125 nodes is 2 TB (roughly 16 GB per node, against roughly 160 GB of shuffled data per node), we cannot even execute the program unless local aggregation is performed.

  14. Image Clustering Control Flow in Twister, with the new local aggregation feature in Map-Collective to drastically reduce intermediate data size. (Slide figure: the driver broadcasts to the workers; on each worker the map tasks run, a local aggregation step combines their outputs, the aggregated data is shuffled and reduced, and the results are combined back to the driver.)
  We explore operations such as high-performance broadcasting and shuffling, then add them to the Twister iterative MapReduce framework. There are different algorithms for broadcasting:
  • pipeline (works well for cloud)
  • minimum-spanning tree
  • bidirectional exchange
  • bucket algorithm
  A sketch of the local aggregation step follows.
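
The local aggregation step can be sketched as follows (plain Java, illustrative names only, not Twister code): instead of shuffling one (clusterId, point) record per point, each worker first combines its locally cached points into per-cluster partial sums and counts, so the data leaving a node shrinks from #points x dimension to at most #clusters x dimension.

```java
// Local aggregation before shuffling: each worker turns its locally cached
// points into (clusterId -> partial sum, count) records, so only one compact
// record per cluster leaves the node instead of one record per point.
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class LocalAggregationSketch {

  /** Partial result for one cluster on one worker: running sum and point count. */
  static class Partial {
    double[] sum;
    long count;
    Partial(int dim) { sum = new double[dim]; }
  }

  /** The "map + local aggregation" step, run independently on every worker. */
  static Map<Integer, Partial> aggregateLocally(List<double[]> localPoints, double[][] centroids) {
    Map<Integer, Partial> partials = new HashMap<>();
    for (double[] p : localPoints) {
      int c = nearestCentroid(p, centroids);
      Partial acc = partials.computeIfAbsent(c, k -> new Partial(p.length));
      for (int d = 0; d < p.length; d++) acc.sum[d] += p[d];
      acc.count++;
    }
    return partials;   // this (small) map is what gets shuffled, not the raw points
  }

  static int nearestCentroid(double[] p, double[][] centroids) {
    int best = 0;
    double bestDist = Double.MAX_VALUE;
    for (int c = 0; c < centroids.length; c++) {
      double dist = 0;
      for (int d = 0; d < p.length; d++) dist += (p[d] - centroids[c][d]) * (p[d] - centroids[c][d]);
      if (dist < bestDist) { bestDist = dist; best = c; }
    }
    return best;
  }

  public static void main(String[] args) {
    double[][] centroids = { {0, 0}, {10, 10} };
    List<double[]> localPoints = List.of(new double[]{1, 1}, new double[]{2, 0}, new double[]{9, 11});
    Map<Integer, Partial> out = aggregateLocally(localPoints, centroids);
    out.forEach((c, part) -> System.out.println("cluster " + c + ": count=" + part.count));
  }
}
```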

  15. Broadcast Comparison: Twister vs. MPI. (Slide figures: performance comparison of the Twister chain method against Open MPI MPI_Bcast; against the MPJ broadcasting method, where the MPJ 2 GB point is a prediction only; and the chain method with and without topology awareness.)
  • The new topology-aware chain broadcasting algorithm gives 20% better performance than the best C/C++ MPI methods (and is four times faster than Java MPJ).
  • It is a factor of 5 improvement over the pipeline-based method not optimized for topology, on 150 nodes.

  16. Broadcast Comparison: Local Aggregation. (Slide figures: shuffling with and without local aggregation, and the communication cost per iteration of the image clustering application.)
  • The left figure shows that the shuffling time with local aggregation is only 10% of the original time.
  • The right figure presents the collective communication cost per iteration, which is 169 seconds (less than 3 minutes).

  17. Triangle Inequality and Kmeans
  • The dominant part of the Kmeans algorithm is finding the nearest center to each point: O(#Points * #Clusters * Vector Dimension).
  • Simple algorithms find, for each point x, the minimum over centers c of d(x, c) = distance(point x, center c).
  • But most of the d(x, c) calculations are wasted, as they are much larger than the minimum value.
  • Elkan [1] showed how to use the triangle inequality to speed this up via relations like d(x, c) >= d(x, c-last) - d(c, c-last), where c-last is the position of center c at the last iteration.
  • So compare d(x, c-last) - d(c, c-last) with d(x, c-best), where c-best is the nearest cluster at the last iteration.
  • Complexity is reduced by a factor of the vector dimension, so this is important when clustering high-dimensional spaces such as social imagery with 512 or more features per image. A sketch of the pruning loop follows.
  [1] Charles Elkan, Using the triangle inequality to accelerate k-means, Twentieth International Conference on Machine Learning (ICML), Tom Fawcett and Nina Mishra, editors, August 21-24, 2003, Washington DC, pp. 147-153.
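
Below is a compact sketch of the pruning idea in the spirit of Elkan's method (plain Java, not the authors' Twister implementation): lower[i][c] stores a lower bound on the distance from point i to center c; after the centers move, each bound is loosened by that center's drift, and the exact, full-dimensional distance is recomputed only when the bound cannot already rule the center out.

```java
// Triangle-inequality pruning for k-means (sketch): lower[i][c] is a lower bound on
// d(point i, center c). After center c moves by drift, d(x, c_new) >= d(x, c_old) - drift,
// so the bound stays valid, and any center whose bound already exceeds the best
// distance found so far can be skipped without computing the exact distance.
import java.util.Random;

public class TriangleInequalityKMeans {
  static double dist(double[] a, double[] b) {
    double s = 0;
    for (int d = 0; d < a.length; d++) s += (a[d] - b[d]) * (a[d] - b[d]);
    return Math.sqrt(s);
  }

  public static void main(String[] args) {
    int n = 2000, k = 16, dim = 64, iters = 20;
    Random r = new Random(7);
    double[][] x = new double[n][dim];
    for (double[] row : x) for (int d = 0; d < dim; d++) row[d] = r.nextDouble();
    double[][] centers = new double[k][];
    for (int c = 0; c < k; c++) centers[c] = x[c].clone();

    double[][] lower = new double[n][k];          // lower bounds on d(x_i, center_c)
    int[] assign = new int[n];
    long computed = 0, skipped = 0;

    for (int it = 0; it < iters; it++) {
      // assignment step with triangle-inequality pruning
      for (int i = 0; i < n; i++) {
        double best = Double.MAX_VALUE;
        for (int c = 0; c < k; c++) {
          if (it > 0 && lower[i][c] >= best) { skipped++; continue; }  // bound beats best: skip
          double d = dist(x[i], centers[c]);
          computed++;
          lower[i][c] = d;                        // exact distance is itself a valid lower bound
          if (d < best) { best = d; assign[i] = c; }
        }
      }
      // update step: move centers to the mean of their points, record how far each moved
      double[][] sums = new double[k][dim];
      int[] counts = new int[k];
      for (int i = 0; i < n; i++) {
        counts[assign[i]]++;
        for (int d = 0; d < dim; d++) sums[assign[i]][d] += x[i][d];
      }
      for (int c = 0; c < k; c++) {
        if (counts[c] == 0) continue;             // empty cluster: center (and bounds) unchanged
        double[] moved = new double[dim];
        for (int d = 0; d < dim; d++) moved[d] = sums[c][d] / counts[c];
        double drift = dist(centers[c], moved);
        centers[c] = moved;
        for (int i = 0; i < n; i++)               // loosen bounds: d(x, c_new) >= d(x, c_old) - drift
          lower[i][c] = Math.max(0, lower[i][c] - drift);
      }
    }
    System.out.printf("exact distances computed: %d, skipped by bound: %d%n", computed, skipped);
  }
}
```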

  18. Fast Kmeans Algorithm. For point x(P) and center c, with m(now, c) and m(last, c) the positions of center c in the current and previous iteration:
  d(x(P), m(now, c1)) >= d(x(P), m(last, c1)) - d(m(now, c1), m(last, c1))   (1)
  lower_bound = d(x(P), m(last, c)) - d(m(now, c), m(last, c)); center c can be skipped when lower_bound >= d(x(P), m(last, c_current_best))   (2)
  (Slide graph: the fraction of distances d(x, c) actually calculated in each iteration for a test data set of 200K points, 124 centers, and vector dimension 74.)

  19. Results on the Fast Kmeans Algorithm. Histograms of distance distributions for 3200 clusters and 76,800 points in a 2048-dimensional space: the distances of points to their nearest center are shown as triangles, the distances to other (further away) centers as crosses, and the distances between centers as filled circles.

  20. Data Analysis Architecture. (Slide figure: a layered architecture supporting scientific simulations, data mining, and data analysis. Applications/Algorithms: kernels, genomics, proteomics, information retrieval, polar science, scientific simulation data analysis and management, dissimilarity computation, clustering, multidimensional scaling, generative topographic mapping. Cross-cutting: security, provenance, portal, services and workflow. Programming model: high-level language. Runtime: cross-platform iterative MapReduce (collectives, fault tolerance, scheduling). Storage: distributed file systems, object store, data-parallel file system. Infrastructure: Windows Server HPC bare-system, Amazon Cloud, Azure Cloud, Grid Appliance, Linux HPC bare-system, virtualization. Hardware: CPU nodes, GPU nodes.)
