Parallelizing KMeans using MapReduce: IPKMeans vs PKMeans

Patrick Killeen
School of Computer Science, University of Ottawa, Ottawa, Canada
pkill013@uottawa.ca

Table of Contents

1. Introduction
   1.1 Internet of Things – Big Data
   1.2 KMeans Algorithm
   1.3 Hadoop MapReduce
   1.4 PKMeans
   1.5 IPKMeans
2. My Experimental Results
   2.1 Hadoop Cluster Architecture
   2.2 Results
3. Questions
4. References

1 Introduction

1.1 Internet of Things – Big Data

IoT and its applications
- Network-connected devices
- Sensors and actuators
- Military asset tracking [7]

Challenges: Big Data
- Big Data’s 5 Vs: velocity, veracity, variety, value, and volume

Image credits: [8] [9]

1.2 KMeans Algorithm

Applications
- Opinion mining [13]
- Image pattern recognition [14]
- Stock exchange analysis [12]

Steps
1. Choose k initial random centroids (data points)
2. Label each point with its nearest centroid
3. Recompute the k centroids as the average of each cluster’s points
4. Go back to step 2 until convergence (the centroids have not changed)

Figure 1. Example KMeans clustering result of 3 clusters
Figure 2. Example KMeans centroids convergence of 3 clusters
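The four steps above can be sketched in a few lines of plain Python. This is a minimal single-machine illustration, not code from the presentation; the function names (`squared_distance`, `nearest`, `mean`, `kmeans`) and the `seed` and `max_iter` parameters are assumptions introduced here.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(point, centroids):
    """Index of the centroid closest to the point."""
    return min(range(len(centroids)),
               key=lambda i: squared_distance(point, centroids[i]))

def mean(points):
    """Coordinate-wise average of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def kmeans(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    # Step 1: choose k random data points as the initial centroids.
    centroids = rng.sample(points, k)
    for _ in range(max_iter):
        # Step 2: label each point with its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        # Step 3: recompute each centroid as its cluster's average
        # (an empty cluster keeps its previous centroid).
        new_centroids = [mean(c) if c else centroids[i]
                         for i, c in enumerate(clusters)]
        # Step 4: stop once the centroids no longer change.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids
```

For example, `kmeans([(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)], 2)` returns one centroid near each of the two groups of points.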

1.3 Hadoop MapReduce

Open source [1], based on Google’s work

Hadoop cluster
- Many machines on a rack
- Huge files partitioned/split and stored on many machines

MapReduce
- Slaves perform data analytic jobs on local data with the following software components:
  - Mapper: pre-processing phase
  - Reducer: post-processing phase

Image credit: [10]
Figure 3. Overview of Example Hadoop Cluster
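To make the mapper/reducer split concrete, the sketch below simulates a MapReduce job in memory. It does not use the Hadoop API; `run_mapreduce`, `word_mapper`, and `count_reducer` are hypothetical names, and word counting is used only as a familiar example job.

```python
from collections import defaultdict

def run_mapreduce(splits, mapper, reducer):
    """Simulate a job: map every record, group values by key, reduce each key."""
    intermediate = defaultdict(list)
    # On a real cluster each split is stored on, and mapped by, a different slave.
    for split in splits:
        for record in split:
            for key, value in mapper(record):
                intermediate[key].append(value)
    # Each key's list of values is handed to a reducer.
    return {key: reducer(key, values) for key, values in intermediate.items()}

def word_mapper(line):
    """Mapper: emit (word, 1) for every word in a line of text."""
    for word in line.split():
        yield word.lower(), 1

def count_reducer(key, values):
    """Reducer: sum the counts emitted for one word."""
    return sum(values)

counts = run_mapreduce(
    [["big data", "Hadoop"], ["big cluster"]],  # two file splits
    word_mapper,
    count_reducer,
)
```

Here `counts["big"]` is 2 because the word appears once in each split; the mappers work independently on their own split, and only the grouped intermediate values reach the reducer.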

1.4 PKMeans

PKMeans was proposed by [5]

Mapper
- Labels data points with their nearest centroid
- Any number of mappers

Reducer
- Recomputes a centroid using its cluster average
- Number of reducers = k (number of clusters)

Figure 4. PKMeans example job, 5 mappers, 3 centroids, 3 reducers
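One PKMeans iteration, as described on this slide, might look like the following in-memory sketch. It follows only what the slide states (mappers label points, one reducer per cluster averages them) and is not the implementation from [5]; all function names are assumptions, and the shuffle step is simulated with a dictionary.

```python
from collections import defaultdict

def pkmeans_map(point, centroids):
    """Mapper: emit (index of the nearest centroid, point)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    key = min(range(len(centroids)), key=lambda i: sq_dist(point, centroids[i]))
    return key, point

def pkmeans_reduce(points):
    """Reducer: the new centroid is the average of the points in one cluster."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def pkmeans_iteration(splits, centroids):
    """Run one MapReduce job: any number of splits (mappers), k reducers."""
    grouped = defaultdict(list)
    for split in splits:                      # one mapper per split
        for point in split:
            key, value = pkmeans_map(point, centroids)
            grouped[key].append(value)        # simulated shuffle by cluster index
    # One reducer per cluster; a cluster with no points keeps its old centroid.
    return [pkmeans_reduce(grouped[i]) if i in grouped else centroids[i]
            for i in range(len(centroids))]

def pkmeans(splits, centroids, max_iter=100):
    """Repeat jobs until the centroids stop changing."""
    for _ in range(max_iter):
        new_centroids = pkmeans_iteration(splits, centroids)
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids
```

Each call to `pkmeans_iteration` corresponds to one MapReduce job, so the number of jobs equals the number of KMeans iterations needed to converge.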

1.5 IPKMeans

IPKMeans was proposed by [4]

Phase 1: Data Partitioning
- KDTree
- Create subgroups

Phase 2: Parallel KMeans
- Run KMeans on each subgroup

Phase 3: Centroid Merging
- Pick the best (most central) centroids found in phase 2

Figure 5. IPKMeans phase execution overview
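The slide does not say how the KD-tree is turned into subgroups, so the sketch below covers only a generic k-d-tree-style split for phase 1: points are divided at the median of one coordinate, alternating the axis at each level. The function name `kd_partition` and the `depth` parameter are assumptions, and the partitioning used in [4] may differ.

```python
def kd_partition(points, depth, axis=0):
    """Split points at the median of alternating axes.

    Returns up to 2**depth leaf subgroups that together contain every point.
    """
    if depth == 0 or len(points) <= 1:
        return [points]
    ordered = sorted(points, key=lambda p: p[axis])
    mid = len(ordered) // 2
    next_axis = (axis + 1) % len(points[0])   # cycle through the dimensions
    return (kd_partition(ordered[:mid], depth - 1, next_axis)
            + kd_partition(ordered[mid:], depth - 1, next_axis))

# A 4 x 4 grid split to depth 2 gives four subgroups of four points each.
grid = [(x, y) for x in range(4) for y in range(4)]
subgroups = kd_partition(grid, depth=2)
```

In phase 2, KMeans would then run independently on each subgroup, which is what allows the subgroups to be processed in parallel.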

2 My Experimental Results

2.1 Hadoop Cluster Architecture

Used Bitnami Hadoop VM

Client Node
- Submits jobs
- SSH tunnel

Master Node
- Manages jobs

Service Node
- Job history management

Slave Nodes
- Mapper
- Reducer
- Distributed data storage

For more information on Hadoop configuration, see [2][3].

Figure 6. My OpenStack VM Hadoop Cluster Configuration

2.2 Results

Figure 7. Data set with 3000 points and 3 Gaussian-distributed clusters
Figure 8. Increasing dataset size with 10 nodes and 7 reducers
Figure 9. Varying initial centroids with 3000 points, 7 reducers, and 10 nodes
Figure 10. Varying initial centroids with 84000 points, 7 reducers, and 10 nodes
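A data set like the one in Figure 7 could be generated along the following lines. Only the point count (3000) and the number of Gaussian clusters (3) come from the slide; the two dimensions, the cluster centers, the standard deviation, the even 1000-point split, and the name `make_gaussian_clusters` are all assumptions made for illustration.

```python
import random

def make_gaussian_clusters(centers, points_per_cluster, std=1.0, seed=0):
    """Draw points_per_cluster points around each center from a Gaussian."""
    rng = random.Random(seed)
    return [
        tuple(rng.gauss(coord, std) for coord in center)
        for center in centers
        for _ in range(points_per_cluster)
    ]

# Hypothetical centers: 3 clusters x 1000 points = 3000 points.
dataset = make_gaussian_clusters([(0.0, 0.0), (10.0, 0.0), (5.0, 8.0)], 1000)
```

Fixing the seed keeps the generated data identical between runs, which helps when comparing runs that differ only in their initial centroids.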

3 Questions

Question 1: What is a centroid in the KMeans algorithm?
Question 2: What is Hadoop?
Question 3: What is Big Data?

Image credit: [11]

4 References

[1] Apache Hadoop project. https://hadoop.apache.org/. Accessed: 2018-11-26.
[2] Creating a Hadoop cluster. https://docs.bitnami.com/bch/apps/hadoop/getstarted/hadoop-cluster/. Accessed: 2018-11-26.
[3] Understanding Hadoop clusters and the network. http://bradhedlund.com/2011/09/10/understanding-hadoop-clusters-and-the-network/. Accessed: 2018-11-26.
[4] Shikai Jin, Yuxuan Cui, and Chunli Yu. A new parallelization method for k-means. arXiv preprint arXiv:1608.06347, 2016.
[5] Weizhong Zhao, Huifang Ma, and Qing He. Parallel k-means clustering based on MapReduce. In IEEE International Conference on Cloud Computing, pages 674–679. Springer, 2009.
[7] Stonebraker M, Çetintemel U, Zdonik S. The 8 requirements of real-time stream processing. ACM SIGMOD Rec 2005;34(4):42–7.
[8] https://www.wired.com/2012/06/wireless-power/
[9] https://www.smartdatacollective.com/can-business-intelligence-answer-questions-asked-without-big-data/
[10] https://commons.wikimedia.org/wiki/File:Hadoop_logo.svg
[11] http://images.clipartpanda.com/light-bulbs-Light-Bulb.jpg
[12] Oussama Lachiheb, Mohamed Salah Gouider, and Lamjed Ben Said. An improved MapReduce design of kmeans with iteration reducing for clustering stock exchange very large datasets. In Semantics, Knowledge and Grids (SKG), 2015 11th International Conference on, pages 252–255. IEEE, 2015.
[13] V Priya and K Umamaheswari. Ensemble based parallel k means using map reduce for aspect based summarization. In Proceedings of the International Conference on Informatics and Analytics, page 26. ACM, 2016.
[14] Anil R Surve and Nilesh S Paddune. A survey on Hadoop assisted k-means clustering of hefty volume images. International Journal on Computer Science & Engineering, 6(3):113–117, 2014.
