

  1. Machine Learning (BE Computer 2015 PAT) A.Y. 2018-19 SEM-II Prepared by Mr. Dhomse G.P.

  2. Unit-6 Clustering Techniques Syllabus • Hierarchical Clustering, Expectation Maximization clustering, Agglomerative Clustering, Dendrograms, Agglomerative clustering in Scikit-learn, Connectivity Constraints. Introduction to Recommendation Systems: Naïve User-based systems, Content-based Systems, Model-free collaborative filtering: singular value decomposition, alternating least squares. Fundamentals of Deep Networks: Defining deep learning, common architectural principles of deep networks, building blocks of deep networks

  3. Hierarchical Clustering • Hierarchical clustering, as the name suggests, is an algorithm that builds a hierarchy of clusters. • The algorithm starts with every data point assigned to a cluster of its own. The two nearest clusters are then merged into a single cluster. The algorithm terminates when only one cluster is left. • The results of hierarchical clustering can be shown using a dendrogram, which can be interpreted as follows:

  4. At the bottom, we start with 25 data points, each assigned to its own cluster. The two closest clusters are then merged repeatedly until only one cluster remains at the top. The height in the dendrogram at which two clusters are merged represents the distance between those two clusters in the data space. • The number of clusters that best depicts the different groups can be chosen by inspecting the dendrogram. The best choice of the no. of clusters is the number of vertical lines in the dendrogram cut by a horizontal line that can traverse the maximum vertical distance without intersecting a cluster.

  5. In the above example, the best choice for the number of clusters is 4, because the red horizontal line in the dendrogram below covers the maximum vertical distance AB.

  6. This algorithm has been implemented above using a bottom-up approach. It is also possible to follow a top-down approach, starting with all data points assigned to the same cluster and recursively performing splits until each data point forms its own cluster. • The decision to merge two clusters is taken on the basis of their closeness. There are multiple metrics for deciding the closeness of two clusters: • Euclidean distance: ||a−b||₂ = √(Σᵢ (aᵢ − bᵢ)²) • Squared Euclidean distance: ||a−b||₂² = Σᵢ (aᵢ − bᵢ)² • Manhattan distance: ||a−b||₁ = Σᵢ |aᵢ − bᵢ| • Maximum distance: ||a−b||∞ = maxᵢ |aᵢ − bᵢ| • Mahalanobis distance: √((a−b)ᵀ S⁻¹ (a−b)), where S is the covariance matrix
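A minimal sketch of how these metrics can be computed with SciPy (the example vectors and the random sample used to estimate the covariance matrix for the Mahalanobis distance are illustrative assumptions, not part of the slides):

  import numpy as np
  from scipy.spatial import distance

  a = np.array([1.0, 2.0, 3.0])
  b = np.array([4.0, 0.0, 3.0])

  print(distance.euclidean(a, b))    # ||a-b||2
  print(distance.sqeuclidean(a, b))  # ||a-b||2^2
  print(distance.cityblock(a, b))    # Manhattan (L1)
  print(distance.chebyshev(a, b))    # maximum (L-infinity)

  # Mahalanobis needs the inverse covariance matrix S^-1; here S is estimated
  # from a small random sample purely for illustration.
  data = np.random.RandomState(0).randn(100, 3)
  S_inv = np.linalg.inv(np.cov(data, rowvar=False))
  print(distance.mahalanobis(a, b, S_inv))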

  7. Difference between K Means and Hierarchical clustering • Hierarchical clustering can't handle big data as well as K Means clustering can. This is because the time complexity of K Means is linear, i.e. O(n), while that of hierarchical clustering is quadratic, i.e. O(n²). • In K Means clustering, since we start with a random choice of clusters, the results produced by running the algorithm multiple times may differ, whereas the results of hierarchical clustering are reproducible.

  8. K Means is found to work well when the shape of the clusters is hyper-spherical (like a circle in 2D or a sphere in 3D). • K Means clustering requires prior knowledge of K, i.e. the number of clusters you want to divide your data into. In hierarchical clustering, by contrast, you can stop at whatever number of clusters you find appropriate by interpreting the dendrogram.

  9. Applications of Clustering • Clustering has a large no. of applications spread across various domains. Some of the most popular applications of clustering are: • Recommendation engines • Market segmentation • Social network analysis • Search result grouping • Medical imaging • Image segmentation • Anomaly detection

  10. Hierarchical clustering is based on the general concept of finding a hierarchy of partial clusters, built using either a bottom-up or a top-down approach. More formally, the two approaches are called: • Agglomerative clustering: The process starts from the bottom (each initial cluster is made up of a single element) and proceeds by merging the clusters until a stop criterion is reached. In general, the target is to have a sufficiently small number of clusters at the end of the process.

  11. Divisive clustering: In this case, the initial state is a single cluster containing all samples, and the process proceeds by splitting the intermediate clusters until all elements are separated. The splits are driven by a criterion based on the dissimilarity between elements.

  12. Divisive method - In the divisive, or top-down, clustering method we assign all of the observations to a single cluster and then partition that cluster into the two least similar clusters. • We then proceed recursively on each cluster until there is one cluster for each observation. • There is evidence that divisive algorithms produce more accurate hierarchies than agglomerative algorithms in some circumstances, but they are conceptually more complex.

  13. What is Euclidean Distance? • According to the Euclidean distance formula, the distance between two points in the plane with coordinates (x, y) and (a, b) is given by: • dist((x, y), (a, b)) = √((x − a)² + (y − b)²)

  14. The horizontal distance between the points is 4 and the vertical distance is 3. Let's introduce one more point, (-2, -1). With this small addition we get a right-angled triangle with legs 3 and 4. By the Pythagorean theorem, the square of the hypotenuse is (hypotenuse)² = 3² + 4², which gives the length of the hypotenuse as 5, the same as the distance between the two points according to the distance formula.
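As a quick check of the 3-4-5 example above, a short Python snippet (the concrete points (-2, 2) and (2, -1) are an assumption consistent with the stated horizontal gap of 4 and vertical gap of 3):

  import math

  p, q = (-2, 2), (2, -1)
  d = math.sqrt((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2)
  print(d)  # 5.0, matching the hypotenuse of the 3-4-5 right triangle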

  15. Agglomerative method • In the agglomerative, or bottom-up, clustering method we assign each observation to its own cluster. • Then, we compute the similarity (e.g., distance) between each pair of clusters and join the two most similar clusters. • Finally, we repeat the computation and merging steps until there is only a single cluster left. The related algorithm is shown below.

  16. Before any clustering is performed, it is necessary to determine the proximity matrix containing the distance between each pair of points, computed with a distance function. • Then, the matrix is updated to hold the distance between each pair of clusters. The following three methods differ in how the distance between two clusters is measured. • Advantages • It can produce an ordering of the objects, which may be informative for data display. • Smaller clusters are generated, which may be helpful for discovery.

  17. Disadvantages • No provision can be made for a relocation of objects that may have been 'incorrectly' grouped at an early stage. The result should be examined closely to ensure it makes sense. • Use of different distance metrics for measuring distances between clusters may generate different results. Performing multiple experiments and comparing the results is recommended to support the veracity of the original results.

  18. Single Linkage-In single linkage hierarchical clustering, the distance between two clusters is defined as the shortest distance between two points in each cluster. For example, the distance between clusters “r” and “s” to the left is equal to the length of the arrow between their two closest points.

  19. Complete Linkage- In complete linkage hierarchical clustering, the distance between two clusters is defined as the longest distance between two points in each cluster. For example, the distance between clusters “r” and “s” to the left is equal to the length of the arrow between their two furthest points.

  20. Average Linkage - In average linkage hierarchical clustering, the distance between two clusters is defined as the average distance from each point in one cluster to every point in the other cluster. For example, the distance between clusters “r” and “s” to the left is equal to the average length of the arrows connecting the points of one cluster to the points of the other.

  21. Ward's linkage: In this method, all clusters are considered and the algorithm computes the sum of squared distances within the clusters and merges them so as to minimize it. From a statistical viewpoint, the process of agglomeration leads to a reduction in the variance of each resulting cluster. The measure, i.e. the increase in the total within-cluster sum of squares caused by merging clusters A and B, can be written as: • Δ(A, B) = Σ_{x ∈ A∪B} ||x − μ_{A∪B}||² − Σ_{x ∈ A} ||x − μ_A||² − Σ_{x ∈ B} ||x − μ_B||², where μ denotes the centroid of a cluster. • Ward's linkage supports only the Euclidean distance.

  22. Dendrograms • To better understand the agglomeration process, it's useful to introduce a graphical method called a dendrogram, which shows in a static way how the aggregations are performed, starting from the bottom (where all samples are separated) up to the top (where the linkage is complete). Unfortunately, scikit-learn doesn't provide a dendrogram plot; however, SciPy (which is a mandatory requirement for scikit-learn) provides some useful built-in functions. • Let's start by creating a dummy dataset:

  23. from sklearn.datasets import make_blobs
  >>> nb_samples = 25
  >>> X, Y = make_blobs(n_samples=nb_samples, n_features=2, centers=3, cluster_std=1.5)
  • To avoid excessive complexity in the resulting plot, the number of samples has been kept very low. In the following figure, there's a representation of the dataset:
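The figure itself is not reproduced in this transcript; a similar plot can be obtained (continuing from the snippet above, with matplotlib assumed to be available) as follows:

  import matplotlib.pyplot as plt

  plt.scatter(X[:, 0], X[:, 1], c=Y, s=50)   # X, Y come from make_blobs() above
  plt.xlabel('x0')
  plt.ylabel('x1')
  plt.show()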

  24. Now we can compute the dendrogram. The first step is computing a distance matrix:
  from scipy.spatial.distance import pdist
  >>> Xdist = pdist(X, metric='euclidean')
  • We have chosen a Euclidean metric, which is the most suitable in this case. At this point, it's necessary to decide which linkage we want. Let's take Ward's; however, all known methods are supported:
  from scipy.cluster.hierarchy import linkage
  >>> Xl = linkage(Xdist, method='ward')
  • Now, it's possible to create and visualize a dendrogram:
  from scipy.cluster.hierarchy import dendrogram
  >>> Xd = dendrogram(Xl)
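Note that dendrogram() draws on the current matplotlib axes, so the figure can be displayed (or saved) by continuing the snippet above with matplotlib, for example:

  import matplotlib.pyplot as plt

  plt.title('Dendrogram (Ward linkage)')
  plt.xlabel('Sample index')
  plt.ylabel('Distance')
  plt.show()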

  25. On the x axis there are the samples (numbered progressively), while the y axis represents the distance. Every arch connects two clusters that are merged together by the algorithm. For example, 23 and 24 are single elements merged together. The element 13 is then aggregated to the resulting cluster, and so the process continues.

  26. As you can see, if we decide to cut the graph at the distance of 10, we get two separate clusters: the first one from 15 to 24 and the other one from 0 to 20. Looking at the previous dataset plot, all the points with Y < 10 are considered to be part of the first cluster, while the others belong to the second cluster. • If we increase the distance, the linkage becomes very aggressive (particularly in this example with only a few samples) and with values greater than 27, only one cluster is generated (even if the internal variance is quite high!).

  27. Agglomerative clustering in scikit-learn • Let's consider a more complex dummy dataset with 8 centers:
  >>> nb_samples = 3000
  >>> X, _ = make_blobs(n_samples=nb_samples, n_features=2, centers=8, cluster_std=2.0)
  • A graphical representation is shown in the following figure:

  28. Let's start with a complete linkage (AgglomerativeClustering uses the method fit_predict() to train the model and transform the original dataset):
  from sklearn.cluster import AgglomerativeClustering
  >>> ac = AgglomerativeClustering(n_clusters=8, linkage='complete')
  >>> Y = ac.fit_predict(X)
  • A plot of the result (using both different markers and colors) is shown in the following figure: the result is quite poor.
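The cluster plot is not included in this transcript; one possible way to reproduce it (continuing from the snippet above, using colors only rather than per-cluster markers) is:

  import matplotlib.pyplot as plt

  plt.scatter(X[:, 0], X[:, 1], c=Y, cmap='tab10', s=20)   # Y holds the cluster labels
  plt.title('Agglomerative clustering, complete linkage (8 clusters)')
  plt.show()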

  29. Let's now consider the average linkage:
  >>> ac = AgglomerativeClustering(n_clusters=8, linkage='average')
  >>> Y = ac.fit_predict(X)
  • The result is shown in the following screenshot: in this case, the clusters are better defined, even if some of them could have become really small. It can also be useful to try other metrics.

  30. Ward's linkage can be used only with a Euclidean metric (it is also the default linkage):
  >>> ac = AgglomerativeClustering(n_clusters=8)
  >>> Y = ac.fit_predict(X)
  • The resulting plot is shown in the following figure:
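As an aside (not part of the original slides), the three linkages can also be compared quantitatively with the silhouette score, where higher values indicate better-separated clusters:

  from sklearn.cluster import AgglomerativeClustering
  from sklearn.metrics import silhouette_score

  for linkage in ('complete', 'average', 'ward'):
      ac = AgglomerativeClustering(n_clusters=8, linkage=linkage)
      labels = ac.fit_predict(X)              # X is the 8-center dataset above
      print(linkage, silhouette_score(X, labels))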

  31. What is KNN? • In four years of my career in analytics, more than 80% of the models I have built have been classification models and just 15-20% regression models. These ratios can be more or less generalized throughout the industry. The reason for the bias towards classification models is that most analytical problems involve making a decision. • For instance: will a customer attrite or not, should we target customer X for digital campaigns, does a customer have high potential, and so on. These analyses are more insightful and link directly to an implementation roadmap.

  32. When do we use the KNN algorithm? • KNN can be used for both classification and regression predictive problems. However, it is more widely used for classification problems in industry. • To evaluate any technique we generally look at 3 important aspects: • 1. Ease of interpreting the output • 2. Calculation time • 3. Predictive power

  33. Let us take a few examples to place KNN on this scale: • The KNN algorithm fares well across all parameters of consideration. It is commonly used for its ease of interpretation and low calculation time.

  34. How does the KNN algorithm work? • Let's take a simple case to understand this algorithm. Following is a spread of red circles (RC) and green squares (GS). You intend to find out the class of the blue star (BS). BS can either be RC or GS and nothing else. The “K” in the KNN algorithm is the number of nearest neighbors we wish to take a vote from. Let's say K = 3.

  35. Hence, we will now make a circle with BS as the center, just big enough to enclose only three data points on the plane. Refer to the following diagram for more details:

  36. The three closest points to BS are all RC. Hence, with a good confidence level, we can say that BS should belong to the class RC. • Here, the choice became very obvious, as all three votes from the closest neighbors went to RC. The choice of the parameter K is very crucial in this algorithm. • Next we will understand the factors to be considered in order to choose the best K.

  37. How do we choose the factor K? • First let us try to understand what exactly K influences in the algorithm. • If we look at the last example, given that all 6 training observations remain constant, with a given K value we can draw boundaries for each class. These boundaries will segregate RC from GS. • In the same way, let's try to see the effect of the value of K on the class boundaries. Following are the different boundaries separating the two classes for different values of K.

  38. If you watch carefully, you can see that the boundary becomes smoother with increasing values of K. As K increases to infinity, everything finally becomes all blue or all red, depending on the overall majority. • The training error rate and the validation error rate are two parameters we need to assess for different K values. • Following is the curve of the training error rate for varying values of K:

  39. As you can see, the error rate at K=1 is always zero for the training sample. This is because the closest point to any training data point is itself; hence the prediction is always accurate with K=1. If the validation error curve had been similar, our choice of K would have been 1. Following is the validation error curve for varying values of K:
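A minimal sketch (on an assumed synthetic dataset, not the one from the slides) of producing such a validation curve with scikit-learn's KNeighborsClassifier and cross-validation:

  from sklearn.datasets import make_blobs
  from sklearn.model_selection import cross_val_score
  from sklearn.neighbors import KNeighborsClassifier

  X, y = make_blobs(n_samples=300, centers=2, cluster_std=3.0, random_state=0)

  for k in (1, 3, 5, 7, 9, 11):
      knn = KNeighborsClassifier(n_neighbors=k)
      scores = cross_val_score(knn, X, y, cv=5)      # 5-fold cross-validation accuracy
      print(k, 1.0 - scores.mean())                  # validation error rate for this K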

  40. We can implement a KNN model by following the below steps: • Load the data • Initialise the value of k • For getting the predicted class, iterate from 1 to total number of training data points • Calculate the distance between test data and each row of training data. Here we will use Euclidean distance as our distance metric since it’s the most popular method. The other metrics that can be used are Chebyshev, cosine, etc. • Sort the calculated distances in ascending order based on distance values • Get top k rows from the sorted array • Get the most frequent class of these rows • Return the predicted class
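A from-scratch sketch of these steps (the function name predict_knn and the toy data are illustrative, not from the slides):

  import numpy as np
  from collections import Counter

  def predict_knn(X_train, y_train, x_test, k=3):
      # Euclidean distance from the test point to every training row
      distances = np.sqrt(((X_train - x_test) ** 2).sum(axis=1))
      # Sort ascending and keep the indices of the k nearest rows
      nearest = np.argsort(distances)[:k]
      # Return the most frequent class among the k nearest labels
      return Counter(y_train[nearest]).most_common(1)[0][0]

  # Tiny usage example with made-up data
  X_train = np.array([[1, 1], [1, 2], [2, 2], [8, 8], [8, 9], [9, 8]])
  y_train = np.array(['RC', 'RC', 'RC', 'GS', 'GS', 'GS'])
  print(predict_knn(X_train, y_train, np.array([2, 1]), k=3))   # -> 'RC'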

  41. Connectivity constraints • scikit-learn also allows specifying a connectivity matrix, which can be used as a constraint when finding the clusters to merge. In this way, clusters which are far from each other (non-adjacent in the connectivity matrix) are skipped. • A very common method for creating such a matrix involves using the k-nearest neighbors graph function (implemented as kneighbors_graph()), which is based on the number of neighbors a sample has (according to a specific metric). In the following example, we consider a circular dummy dataset (often used in the official documentation as well):

  42. from sklearn.datasets import make_circles
  >>> nb_samples = 3000
  >>> X, _ = make_circles(n_samples=nb_samples, noise=0.05)
  • A graphical representation is shown in the following figure:

  43. We start with unstructured agglomerative clustering based on average linkage and impose 20 clusters:
  >>> ac = AgglomerativeClustering(n_clusters=20, linkage='average')
  >>> ac.fit(X)
  • In this case, we have used the method fit() because the class AgglomerativeClustering, after being trained, exposes the labels (cluster numbers) through the instance variable labels_, and it's easier to use this variable when the number of clusters is very high. • A graphical plot of the result is shown in the following figure:

  44. Now we can try to impose a constraint with different values for k:
  from sklearn.neighbors import kneighbors_graph
  >>> acc = []
  >>> k = [50, 100, 200, 500]
  >>> for i in range(4):
  ...     kng = kneighbors_graph(X, k[i])
  ...     ac1 = AgglomerativeClustering(n_clusters=20, connectivity=kng, linkage='average')
  ...     ac1.fit(X)
  ...     acc.append(ac1)
  • The resulting plots are shown in the following screenshot:
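The screenshot is not included in this transcript; one possible way to visualize the four constrained results side by side (continuing from the loop above) is:

  import matplotlib.pyplot as plt

  fig, axes = plt.subplots(1, 4, figsize=(16, 4))
  for i, ax in enumerate(axes):
      ax.scatter(X[:, 0], X[:, 1], c=acc[i].labels_, cmap='tab20', s=5)
      ax.set_title('k = %d' % k[i])
  plt.show()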
