

  1. Clustering Very Large Multi-dimensional Datasets with MapReduce Date: 3/26/2014 Presenter: Yi Hou

  2. Outline • Introduction to the data model of MapReduce (Hadoop implementation) and of HDFS • Motivation • Introduction • Related work • Problem formulation • Assumptions and requirements • Parallel Clustering (ParC) • Sample-and-Ignore (SnI) • Methodology • Cost-based optimization • Experiments • Future work

  3. What is MapReduce • Idea from a Google paper: Jeffrey Dean and Sanjay Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters,” December 2004. • Free variant: Hadoop, developed at Yahoo! in 2006. • MapReduce is a high-level programming model and implementation for large-scale parallel data processing.

  4. Data Model of MapReduce • Two phases: a map phase and a reduce phase. • Each phase takes key-value pairs as input and produces key-value pairs as output. • The map function and reduce function are specified by the programmer. • A Map program consists of • Input: a bag of (inputKey, value) pairs • Output: a bag of (outputKey, value) pairs • Note: the input key can be assigned fairly arbitrarily (the file name, the file ID, or a timestamp, for instance) and should stay relatively small (not terabytes); the outputKey matters because it controls how the map output is shuffled to the corresponding reducer. • Let’s look at an example:

  5. Data Model of MapReduce (word count) • For each document, we call a map function that emits a pair for every word occurrence, for example, (w1, 1) (w2, 1) (w1, 1) (w3, 1)… • The output of each map function is processed by the MapReduce framework in the shuffle phase. The shuffle phase sorts and groups the key-value pairs by key across all Maps, for example, (w1, [1, 1, …, 1]) (w2, [1, 1, …, 1]) (w3, [1, 1, …, 1]), so each Reduce sees a key together with its list of values. • The Reduce finally iterates through each list and generates a final result, for example, (w1, 25), (w2, 77), (w3, 12)… (Example from Hadoop: The Definitive Guide.)
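To make the data flow concrete, here is a minimal, self-contained Python sketch that simulates the three steps (map, shuffle, reduce) for word count in memory. It illustrates only the key-value model described above; it is not the Hadoop API, and the function names are illustrative.

    from collections import defaultdict

    def map_fn(doc_id, text):
        # Emit (word, 1) for every word occurrence: (w1, 1), (w2, 1), ...
        return [(word, 1) for word in text.split()]

    def shuffle(mapped_pairs):
        # Group values by key across all map outputs: (w1, [1, 1, ..., 1])
        groups = defaultdict(list)
        for key, value in mapped_pairs:
            groups[key].append(value)
        return groups

    def reduce_fn(word, counts):
        # Sum the list of counts into the final (word, total) pair.
        return word, sum(counts)

    docs = {"d1": "the cat sat on the mat", "d2": "the dog sat"}
    mapped = [pair for doc_id, text in docs.items() for pair in map_fn(doc_id, text)]
    results = dict(reduce_fn(w, c) for w, c in shuffle(mapped).items())
    print(results)  # e.g. {'the': 3, 'cat': 1, 'sat': 2, ...}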

  6. Data Model of MapReduce • If there are multiple Reduce tasks, the Map tasks partition their output, creating one partition per Reduce task, and the records for any given key all go to the same Reduce. • The partitioning can be controlled by a user-defined partitioning function, but the default hash-based partitioner usually works well. Graph from “Hadoop: The Definitive Guide”, chapter 3.
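The default behavior can be sketched in a few lines; this is an illustrative stand-in for a hash partitioner, not Hadoop's actual implementation.

    def default_partition(key, num_reducers):
        # Hash the key and take it modulo the number of reducers, so every
        # record with the same key lands on the same reducer.
        return hash(key) % num_reducers

    # Example: with 4 reducers, all records keyed "w1" go to the same partition.
    print(default_partition("w1", 4), default_partition("w1", 4))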

  7. HDFS • The Hadoop Distributed File System (HDFS) is a Java-based file system that provides scalable and reliable data storage for Hadoop. • It is the storage area that Hadoop’s mapper functions read from. • It is not the local disk (the paper says “disk”, which may be a typo). • All data to be processed by Hadoop must first be transferred to HDFS. • Reducers, however, do touch local disks, because the intermediate results output by mappers are stored on local disk.

  8. Motivation • Good serial subspace clustering algorithms already exist. • Very large datasets of moderate-to-high dimensionality, e.g., a Twitter crawl: > 12 TB; Yahoo! operational data: 5 petabytes. • Solution: take a good serial clustering algorithm and make it run in parallel. • This paper addresses two challenges in applying MapReduce: • how to minimize the I/O cost (at the mapper side) • how to minimize the network cost among processing nodes (in the shuffle phase) • ---- the Best of both Worlds – BoW method

  9. Introduction • The data consists of the top 10 eigenvectors of the adjacency matrix of the Twitter graph, ~14 GB. • The x-axis represents the number of reducers and the y-axis represents wall-clock time. • There is no overall winner. • BoW selects the better of the two (ParC and SnI).

  10. Related Work • Subspace Clustering • Density based: high-density regions (clusters) are separated by low-density regions. • k-means based: pick k centroids and iteratively assign points to the nearest center, then improve the centers. • These serial methods cannot handle very large data, in either time or space.
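As a reminder of how the k-means-style approach works, here is a minimal sketch of Lloyd-style iterations (assign each point to the nearest centroid, then recompute the centroids). It is illustrative only and ignores the scalability issues noted above.

    import random

    def kmeans(points, k, iterations=10):
        # Pick k initial centroids at random from the data.
        centroids = random.sample(points, k)
        for _ in range(iterations):
            # Assignment step: attach each point to its nearest centroid.
            clusters = [[] for _ in range(k)]
            for p in points:
                nearest = min(range(k),
                              key=lambda i: sum((a - b) ** 2
                                                for a, b in zip(p, centroids[i])))
                clusters[nearest].append(p)
            # Update step: move each centroid to the mean of its cluster.
            for i, cluster in enumerate(clusters):
                if cluster:
                    centroids[i] = tuple(sum(dim) / len(cluster)
                                         for dim in zip(*cluster))
        return centroids

    print(kmeans([(0, 0), (0, 1), (10, 10), (10, 11)], k=2))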

  11. Problem Formulation • The major problems are: • how to minimize the I/O cost (at the mapper side) • how to minimize the network cost among processing nodes (in the shuffle phase) • Parallel Clustering – ParC (minimizes the I/O cost) • Sample-and-Ignore – SnI (minimizes the network cost, at the risk of reading the data from HDFS twice) • Assumption: the serial subspace clustering algorithm should return its clustering results as hyper-rectangles.

  12. Parallel Clustering – ParC • Partition the data (mapper) and shuffle it to the reducers. • For partitioning, file-based data partitioning is used. • Each reducer processes the data assigned to its key. • Run clustering and return hyper-rectangle results. • Each cluster is described by a hyper-rectangle. • Merge the hyper-rectangle results. • Merge two clusters if they overlap in the d-dimensional space.
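A minimal sketch of the merging rule, assuming each cluster is summarized by a d-dimensional hyper-rectangle given as per-dimension (low, high) bounds; two clusters are merged when their rectangles overlap in every dimension. This illustrates the stated rule only and is not the paper's code.

    def overlaps(rect_a, rect_b):
        # rect = list of (low, high) bounds, one pair per dimension.
        # Two hyper-rectangles overlap only if their intervals intersect
        # in every one of the d dimensions.
        return all(lo_a <= hi_b and lo_b <= hi_a
                   for (lo_a, hi_a), (lo_b, hi_b) in zip(rect_a, rect_b))

    def merge(rect_a, rect_b):
        # The merged cluster is summarized by the bounding hyper-rectangle.
        return [(min(lo_a, lo_b), max(hi_a, hi_b))
                for (lo_a, hi_a), (lo_b, hi_b) in zip(rect_a, rect_b)]

    a = [(0.0, 2.0), (0.0, 2.0)]   # 2-d rectangle
    b = [(1.5, 3.0), (1.0, 4.0)]
    if overlaps(a, b):
        print(merge(a, b))  # [(0.0, 3.0), (0.0, 4.0)]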

  13. Sample-and-Ignore – SnI • Select a small sample of the data. • Find the major clusters. • Shuffle only the data not included in the clusters found in the previous step to the following ParC step. • This significantly reduces the amount of data moved in the shuffle phase, at the risk of scanning the data in HDFS twice, which increases the mappers’ I/O cost. SnI combines a sample-and-ignore preprocessing step with ParC.
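The "ignore" part of SnI can be sketched as a simple filter: after the sampling pass finds the major clusters (as hyper-rectangles), only the points that fall outside all of them are shuffled to the ParC step. A rough illustration, reusing the rectangle representation from the sketch above; names are illustrative.

    def inside(point, rect):
        # A point lies inside a hyper-rectangle if it falls within the
        # (low, high) bounds in every dimension.
        return all(lo <= x <= hi for x, (lo, hi) in zip(point, rect))

    def points_to_shuffle(points, major_cluster_rects):
        # Keep (and later shuffle to ParC) only the points not already
        # covered by a major cluster found on the sample.
        return [p for p in points
                if not any(inside(p, r) for r in major_cluster_rects)]

    major = [[(0.0, 1.0), (0.0, 1.0)]]
    print(points_to_shuffle([(0.5, 0.5), (5.0, 5.0)], major))  # [(5.0, 5.0)]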

  14. Sample-and-Ignore – SnI • Assume 2 reducers are used. • 1(a): with 50% probability, data is shuffled to the reducers. • 1(b): two clusters are found. • 2(a): the two clusters are described by rectangles. • 2(b) and 2(c): data not in the rectangles is shuffled. • 2(d): run clustering in each reducer and merge the results. • 2(e): the final clusters are found!

  15. Parallel Clustering – ParC vs. Sample-and-Ignore – SnI • ParC does a single pass over the data. • All of the data records have to be shipped over the network. • (Minimizes I/O cost) • However, • SnI significantly reduces the network cost and the reducers’ processing. • At the cost of reading the whole dataset twice. • (Minimizes network cost)

  16. Methodology • Cost-based optimization: a hybrid method named BoW (Best of both Worlds) that takes the best of ParC and SnI. • Lemma 1: map cost – the expected cost of the map phase. • Proof: • 1) m mappers are started up at cost start_up_cost(m). • 2) S bytes of data are read by the m mappers in parallel, each with disk speed Ds.
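Written out, the proof sketch suggests an expected map cost of roughly the following form (a reconstruction in the slide's notation, not necessarily the paper's exact equation):

    \text{map\_cost}(m, S) \;\approx\; \text{start\_up\_cost}(m) + \frac{S}{m \cdot D_s}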

  17. Methodology • Cost-based optimization: a hybrid method named BoW (Best of both Worlds) that takes the best of ParC and SnI. • Lemma 2: shuffle cost – the expected cost of the shuffle phase. • Proof: • 1) Dr is the actual ratio of data shuffled. • 2) S*Dr is the real amount of data shuffled. • 3) S*Dr bytes of data are shuffled to r reducers, each with network bandwidth Ns.
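Combining the three steps gives roughly the following expected shuffle cost (again a reconstruction in the slide's notation, not necessarily the paper's exact equation):

    \text{shuffle\_cost}(S, r) \;\approx\; \frac{S \cdot D_r}{r \cdot N_s}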

  18. Methodology • Cost-based optimization: a hybrid method named BoW (Best of both Worlds) that takes the best of ParC and SnI. • Lemma 3: reduce cost – the expected cost of the reduce phase. • Proof: • 1) Start r reducers at cost start_up_cost(r). • 2) The r reducers read data of size s in parallel with disk speed Ds. • 3) Each reducer runs the clustering on data of size s/r at cost plug_in_cost(s/r).
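Putting the three steps together, the expected reduce cost is roughly (a reconstruction in the slide's notation):

    \text{reduce\_cost}(r, s) \;\approx\; \text{start\_up\_cost}(r) + \frac{s}{r \cdot D_s} + \text{plug\_in\_cost}\!\left(\frac{s}{r}\right)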

  19. Methodology • Cost-based optimization: a hybrid method named BoW (Best of both Worlds) that takes the best of ParC and SnI. • Lemma 4: ParC cost – the expected cost of ParC. • Proof: • 1) m mappers process Fs bytes of data in the map phase. • 2) Fs bytes of data are shuffled to r reducers in the shuffle phase. • 3) Fs bytes of data are analyzed in the reduce phase by r reducers. • 4) A single machine merges all the clusters found, whose cost is trivial.
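Combining Lemmas 1–3, and noting that in ParC the whole file of Fs bytes is shuffled, the expected cost has roughly this form (a reconstruction, not necessarily the paper's exact Equation 4):

    \text{cost}_{\text{ParC}} \;\approx\; \text{map\_cost}(m, F_s) \;+\; \frac{F_s}{r \cdot N_s} \;+\; \text{reduce\_cost}(r, F_s)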

  20. Methodology • Cost-based optimization: a hybrid method named BoW (Best of both Worlds) that takes the best of ParC and SnI. • Lemma 5: SnI cost – the expected cost of SnI. • Proof: • 1) m mappers process Fs bytes of data in the map phase, twice. • 2) Fs*Sr bytes of data are shuffled to one reducer in the shuffle phase. • 3) Fs*Sr bytes of data are clustered in one reducer. • 4) Fs*Rr bytes of data are shuffled to r reducers in the shuffle phase. • 5) Fs*Rr bytes of data are clustered in r reducers. • Rr is the ratio of data that does not belong to the major clusters.
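Following the five steps above (two map passes over Fs, a single-reducer pass over the sampled Fs·Sr, and a ParC-style pass over the remaining Fs·Rr), the expected cost is roughly (a reconstruction, not necessarily the paper's exact Equation 5):

    \text{cost}_{\text{SnI}} \;\approx\; 2\,\text{map\_cost}(m, F_s) \;+\; \frac{F_s S_r}{N_s} + \text{reduce\_cost}(1, F_s S_r) \;+\; \frac{F_s R_r}{r \cdot N_s} + \text{reduce\_cost}(r, F_s R_r)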

  21. Methodology
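The BoW rule that ties the lemmas together amounts to a cost-based switch: estimate both costs with the expressions above and run whichever method is predicted to be cheaper. A hypothetical sketch (the names and the pre-computed estimates are illustrative, not the paper's API):

    def choose_method(parc_cost_estimate, sni_cost_estimate):
        # BoW simply picks the strategy with the lower expected cost,
        # computed from the environment parameters (m, r, Ds, Ns, ...).
        return "ParC" if parc_cost_estimate <= sni_cost_estimate else "SnI"

    print(choose_method(parc_cost_estimate=120.0, sni_cost_estimate=95.0))  # SnI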

  22. Experiments • The experiments address three questions: • Q1 How much (if at all) does parallelism affect the clusters’ quality? • Q2 How does the method scale up? • Q3 How accurate are the cost-based optimization equations? • Software: Hadoop, on 2 clusters: • M45: 3200 cores, 400 machines, 1.5 PB storage, 2.5 TB memory. • DISC/Cloud: 512 cores, 64 machines, 1 TB RAM, 256 TB disk storage. • Data: • YahooEig: the top 6 eigenvectors from the adjacency matrix of one of the largest web graphs, ~0.2 TB. • TwitterEig: the top 10 eigenvectors from the adjacency matrix of the Twitter graph, ~0.014 TB. • Synthetic: a group of datasets with sizes varying from 100 thousand up to 100 million 15-dimensional points, containing 10 clusters each. • Clustering algorithm: the MrCC source code. • Data partitioning: the file-based data partitioning strategy.

  23. Experiments • Q1 How much (if at all) does parallelism affect the clusters’ quality? • Ground truth: the clustering results obtained by running the plugged-in algorithm serially on a dataset, without parallelism, which is not practical for the real datasets (because of their size and dimensionality). • Therefore only the synthetic data is evaluated. • Answer: as long as you have enough data, parallelism barely affects the accuracy, even for large numbers of reducers, such as 1024.

  24. Experiments • Q2 How does the method scale up? • BoW achieves near-linear scale-up with respect to the number of reducers. • BoW has the desired scalability, scaling up linearly with the data size.

  25. Experiments • Q3 How accurate are the cost-based optimization equations? • 1(a) and 1(d) show that BoW, in red up-triangles, consistently picks the winning strategy among the two alternatives, ParC and SnI. • 1(b), 1(c), 1(e) and 1(f) show that the theory (using Equations 4 and 5) and the measured real time costs usually agree very well.

  26. Future Work • BoW is based on hard clustering. • Soft clustering is a promising idea: in the merge phase, allow overlapping clusters.

  27. Thank you!
