K-means Clustering

Group 15: Swathi Gurram, Prajakta Purohit

Goal
• To program K-means on Twister (iterative MapReduce) and on Hadoop (MapReduce), and see how the change of framework affects the running time of the implementation.

Survey: Twister
• Configurable, long-running (cacheable) map/reduce tasks
• Pub/sub messaging-based communication and data transfers
• Efficient support for iterative MapReduce computations
• Combine phase to collect all reduce outputs
• Data access via local disks
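To illustrate how cacheable map tasks and a combine phase fit an iterative computation such as K-means, here is a minimal plain-Python sketch. It is not Twister code: `map_task`, `reduce_task`, `combine` and `kmeans_iterative` are hypothetical stand-ins for the phases listed above, and the tiny in-memory `partitions` list is an assumed substitute for data held on local disks.

```python
# Sketch only: plain-Python stand-ins for the map, reduce and combine phases.
# None of these names are Twister API calls.

def nearest(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])))

def map_task(cached_points, centroids):
    """Runs every iteration on a partition that stays cached between iterations.
    Returns per-cluster partial sums and counts: {cluster: (sum_vector, count)}."""
    partial = {}
    for point in cached_points:
        k = nearest(point, centroids)
        sums, count = partial.get(k, ([0.0] * len(point), 0))
        partial[k] = ([s + p for s, p in zip(sums, point)], count + 1)
    return partial

def reduce_task(partials):
    """Merges the partial sums produced by all map tasks."""
    merged = {}
    for partial in partials:
        for k, (sums, count) in partial.items():
            acc_sums, acc_count = merged.get(k, ([0.0] * len(sums), 0))
            merged[k] = ([a + s for a, s in zip(acc_sums, sums)], acc_count + count)
    return merged

def combine(merged, old_centroids):
    """Collects the reduce output into the new centroid list for the driver."""
    new = []
    for k, old in enumerate(old_centroids):
        if k in merged:
            sums, count = merged[k]
            new.append([s / count for s in sums])
        else:
            new.append(list(old))  # an empty cluster keeps its old centroid
    return new

def kmeans_iterative(partitions, centroids, max_iters=20, tol=1e-9):
    """Driver loop: partitions are loaded once; only centroids change per iteration."""
    for _ in range(max_iters):
        partials = [map_task(part, centroids) for part in partitions]
        new_centroids = combine(reduce_task(partials), centroids)
        shift = max(sum((a - b) ** 2 for a, b in zip(n, o))
                    for n, o in zip(new_centroids, centroids))
        centroids = new_centroids
        if shift <= tol:  # stop once no centroid moves
            break
    return centroids

# Two cached partitions of 2-D points and two assumed starting centroids.
partitions = [[[1.0, 1.0], [1.0, 2.0]], [[9.0, 9.0], [10.0, 9.0]]]
result = kmeans_iterative(partitions, [[0.0, 0.0], [10.0, 10.0]])
print(result)
```

In this sketch only the small centroid list moves between iterations, while the point partitions are loaded once and reused. That is the property long-running cached tasks are meant to provide.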

Survey: Hadoop
• A software framework that supports data-intensive distributed applications
• Uses the MapReduce programming model
• Has its own file system, HDFS (Hadoop Distributed File System), which is based on the Google File System and tailored for handling large files
• Manages the distribution of processing and files, breaking files into more manageable chunks for processing
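As a rough illustration of the MapReduce programming model applied to K-means, the sketch below simulates the map, shuffle and reduce steps of a single job in memory. It does not use the Hadoop API: `mapper`, `shuffle`, `reducer` and `run_job` are hypothetical names, and the in-memory `points` list stands in for input that would normally be read from HDFS. It assumes the common arrangement of one job per K-means iteration.

```python
from collections import defaultdict

# Sketch only: simulates the map -> shuffle -> reduce flow of one MapReduce job.
# These functions are illustrative, not Hadoop API calls.

def mapper(point, centroids):
    """Emit (index of nearest centroid, point) for one input record."""
    dists = [sum((p - c) ** 2 for p, c in zip(point, centroid))
             for centroid in centroids]
    return dists.index(min(dists)), point

def shuffle(pairs):
    """Group mapper output by key, as happens between the map and reduce phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reducer(cluster_id, points):
    """Average all points assigned to one cluster to get its new centroid."""
    dim = len(points[0])
    return cluster_id, [sum(p[d] for p in points) / len(points) for d in range(dim)]

def run_job(points, centroids):
    """One job = one K-means iteration over the full input."""
    groups = shuffle(mapper(p, centroids) for p in points)
    new_centroids = [list(c) for c in centroids]  # empty clusters keep old centroid
    for cluster_id, members in groups.items():
        _, new_centroids[cluster_id] = reducer(cluster_id, members)
    return new_centroids

points = [[1.0, 1.0], [1.0, 2.0], [9.0, 9.0], [10.0, 9.0]]
centroids = [[0.0, 0.0], [10.0, 10.0]]  # assumed starting centroids
for _ in range(5):  # the driver submits a fresh job for every iteration
    centroids = run_job(points, centroids)
print(centroids)
```

Each call to `run_job` passes over the whole input again, so the per-iteration cost includes re-reading the data.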

Survey: HaLoop
• A modified version of the Hadoop MapReduce framework
• Provides caching options for loop-invariant data access
• Lets users reuse major building blocks from their applications' Hadoop implementations
• Has intra-job fault-tolerance mechanisms similar to Hadoop's
• Reduces query runtimes by a factor of 1.85 on average compared with Hadoop

K-means Clustering

Twister K-means

Hadoop K-means

Implementation Timeline

Validation methods

Conclusion
• The Twister framework is faster than Hadoop for iterative MapReduce applications.

References
• http://salsahpc.indiana.edu
• http://www.iterativemapreduce.org/samples.html
• http://hadoop.apache.org/
• http://en.wikipedia.org/wiki/Apache_Hadoop
• http://clue.cs.washington.edu/node/14
• http://code.google.com/p/haloop/
• http://www.cs.washington.edu/homes/billhowe/pubs/HaLoop.pdf

Demo

Thank you
