Functional clustering



  1. Functional clustering Marian Scott, Ruth Haggarty NERC workshop, University of Glasgow March 2014

  2. First- what is clustering? • We anticipate that each sampling unit (person, site) belongs uniquely to one (unknown) group; we typically have a series of measurements on each unit • We don’t know how many groups there are • Membership is defined based on measures of similarity between the sampling units

  3. First- what is clustering? • We expect that members of a cluster or group are more similar to one another than to members of other clusters • We measure dissimilarity (often as a measure of distance between pairs of observations) • Algorithmic clustering focuses on using measures of distance; hierarchical and k-means methods are the most commonly used

  4. The dendrogram • A dendrogram shows the connections like a tree: we can see where observations merge together. The height on the y-axis is the distance between the clusters being merged
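
A small sketch of the merge record behind a dendrogram, using scipy (not mentioned in the slides; the data values are made up for illustration):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Toy data (hypothetical): six 2-D observations in two loose groups.
X = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
              [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])

# Agglomerative clustering with average linkage; each row of Z records
# one merge: (cluster i, cluster j, merge height, size of new cluster).
Z = linkage(X, method="average")

# The third column is the y-axis height in the dendrogram: the distance
# between the two clusters being merged.
print(Z[:, 2])
# scipy.cluster.hierarchy.dendrogram(Z) would draw the tree itself
```

The merge heights are non-decreasing, which is why the tree can be read from bottom (individual observations) to top (everything in one cluster).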

  5. Similarity or distance • How do we measure distance? • Euclidean distance • Weighted Euclidean distance • Mahalanobis • Manhattan • …
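
The four named distances, computed for one pair of points (the vectors, weights, and covariance below are arbitrary illustrative values):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 0.0, 3.0])

# Euclidean: square root of the sum of squared differences.
euclid = np.sqrt(np.sum((x - y) ** 2))

# Weighted Euclidean: each squared difference scaled by a weight.
w = np.array([1.0, 0.5, 2.0])
w_euclid = np.sqrt(np.sum(w * (x - y) ** 2))

# Manhattan (city block): sum of absolute differences.
manhattan = np.sum(np.abs(x - y))

# Mahalanobis: Euclidean after scaling by the inverse covariance S;
# S here is a made-up diagonal covariance matrix.
S_inv = np.linalg.inv(np.diag([1.0, 4.0, 1.0]))
mahal = np.sqrt((x - y) @ S_inv @ (x - y))

print(euclid, w_euclid, manhattan, mahal)
```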

  6. Clustering methods • Hierarchical • Divisive: put everything together and split • Agglomerative: keep everything separate and join the most similar points (classical cluster analysis) • Non-hierarchical • K-means clustering

  7. Agglomerative hierarchical • Single linkage (nearest neighbour) finds the minimum spanning tree: the shortest tree that connects all points • Chaining can be a problem

  8. Agglomerative hierarchical • Complete linkage (furthest neighbour) gives compact clusters of approximately equal size (it makes compact groups even when none exist)

  9. Agglomerative hierarchical • Average linkage methods lie between single and complete linkage

  10. Hierarchical Clustering

  11. Hierarchical Clustering

  12. K-means • Different set-up • For a fixed number of clusters, k-means tries to find the assignment that minimises the sum over all clusters of the within-cluster sum of squares • Very computationally intensive, so we need a search strategy that is feasible

  13. K-means • Begin with k starting centres • Assign each observation to the closest centre • Recompute the centres • Re-assign • Keep repeating until convergence
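
The steps above can be sketched directly in numpy (a minimal version for illustration, not a production implementation; the data values are hypothetical):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain k-means: start with k centres, assign, recompute, repeat."""
    rng = np.random.default_rng(seed)
    # 1. Begin with k starting centres drawn from the data.
    centres = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # 2. Assign each observation to its closest centre.
        dists = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. Recompute each centre as the mean of its cluster
        #    (keeping the old centre if a cluster goes empty).
        new_centres = centres.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members):
                new_centres[j] = members.mean(axis=0)
        # 4./5. Re-assign and repeat until the centres stop moving.
        if np.allclose(new_centres, centres):
            break
        centres = new_centres
    return labels, centres

# Two well-separated groups; k-means recovers them.
X = np.array([[0.0, 0.0], [0.2, 0.0], [0.0, 0.2], [0.1, 0.1],
              [10.0, 10.0], [10.2, 10.0], [10.0, 10.2], [10.1, 10.1]])
labels, centres = kmeans(X, k=2)
print(labels)
```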

  14. Model based clustering • Typically the data are clustered using some assumed mixture modeling structure. • Then the group memberships are ‘learned’ in an unsupervised fashion. • Assume the data are collected from a finite collection of populations. • The data within each population can be modeled using a standard statistical model.

  16. Model based clustering The data within each population can be modeled using a standard statistical model, often a mixture of Normals. Note the cluster probabilities (the mixing proportions of the components).
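
A minimal sketch of a two-component mixture of (univariate) Normals; the parameter values are made up. The density is f(x) = pi_1 N(x; mu_1, s_1^2) + pi_2 N(x; mu_2, s_2^2), and Bayes' rule gives each point a probability of cluster membership:

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

pis    = np.array([0.3, 0.7])   # cluster probabilities pi_k, sum to 1
mus    = np.array([0.0, 5.0])
sigmas = np.array([1.0, 1.5])

x = 1.0
weighted = pis * normal_pdf(x, mus, sigmas)   # pi_k * N(x; mu_k, s_k^2)
density = weighted.sum()                      # mixture density f(x)

# Posterior probability of cluster membership for x (Bayes' rule) --
# the quantity algorithmic clustering does not provide.
membership = weighted / density
print(membership)
```

Here x = 1 is far more plausible under the first component, so its membership probability for cluster 1 is close to one.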

  17. Can we tell when we reach a good k? • Elbow plot: plot the total within-cluster sum of squares against k and look for an ‘elbow’ • Gap statistic: same idea as the elbow plot, but normalised against a reference with no clustering, to make comparisons easier
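
The numbers behind an elbow plot can be sketched with scipy's k-means (an illustration on made-up data with three blobs; scipy is an assumption, not something the slides use):

```python
import numpy as np
from scipy.cluster.vq import kmeans2

rng = np.random.default_rng(1)
# Hypothetical data: three well-separated blobs in 2-D.
X = np.vstack([rng.normal(c, 0.3, size=(20, 2)) for c in (0.0, 5.0, 10.0)])

# Total within-cluster sum of squares for each candidate k.
wss = []
for k in range(1, 7):
    centres, labels = kmeans2(X, k, minit="++", seed=1)
    wss.append(((X - centres[labels]) ** 2).sum())

# WSS falls steeply up to k = 3 (the true number of blobs) and then
# flattens; that bend is the 'elbow' we look for in the plot.
print([round(w, 1) for w in wss])
```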

  18. Functional clustering • Functional clustering has the same sub-types • Hierarchical • K-means • Model based • We might also want to think about the ‘typical’ curve of a cluster, the functional average curve

  19. Air monitoring network

  20. Examples of curves

  21. Sets of curves

  22. The identified clusters

  23. Example 5 from intro Hierarchical cluster analysis applied to the functional distance matrix for the 26 total nitrogen site trends yields the dendrogram in Figure 7. Average linkage is used. Sites within each dam tend to group together.

  24. Example 6 Hierarchical cluster analysis applied to the functional distance matrix for the 26 total nitrogen site trends yields the dendrogram in Figure 7. Average linkage is used. Sites within each dam tend to group together.

  25. Functional clustering (1) • As always we start with the scenario that we have a set of ‘monitoring locations’, and that we measure our variable(s) of interest over time. • The temporal frequency might be rather irregular and sparse, or it might be very regular and very frequent (e.g. daily) • We will fit smooth curves to each of the time series and each curve will become the ‘data unit’. • The norm is to use B-splines as a means of creating the smooth curves. The coefficients of the splines are key going forward.
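
A sketch of the smoothing step for one site, using scipy's B-spline routines (the time series itself is simulated for illustration):

```python
import numpy as np
from scipy.interpolate import splrep, splev

# One site's time series (hypothetical): a smooth signal plus noise.
t = np.linspace(0.0, 1.0, 50)
rng = np.random.default_rng(0)
y = np.sin(2 * np.pi * t) + rng.normal(0.0, 0.1, size=t.size)

# Fit a smoothing cubic B-spline; s trades off fit against smoothness.
tck = splrep(t, y, s=0.5)
knots, coeffs, degree = tck

# The fitted coefficients become the 'data unit' for this site.
fitted = splev(0.25, tck)   # smooth curve evaluated at t = 0.25
print(len(coeffs), float(fitted))
```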

  26. Functional clustering (2) • The coefficients of the basis functions for each time series will be an important aspect • For functional hierarchical clustering, we measure the distance between two curves in terms of their coefficients and thus create a functional distance matrix • Now that we have the distance matrix, we can then apply one of the algorithms used for clustering, e.g. average linkage or k-means
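
Putting the two steps together on simulated curves (everything below is illustrative: the sites, curves, and knot choices are made up, and scipy is assumed; fitting every site with the same fixed knots keeps the coefficient vectors comparable):

```python
import numpy as np
from scipy.interpolate import splrep
from scipy.spatial.distance import pdist, squareform
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(2)
t = np.linspace(0.0, 1.0, 40)

# Six hypothetical sites: three rising trends, three seasonal curves.
curves = [a * t + rng.normal(0, 0.05, t.size) for a in (1.0, 1.1, 0.9)]
curves += [np.sin(2 * np.pi * t) + rng.normal(0, 0.05, t.size)
           for _ in range(3)]

# Fit every site with the SAME cubic B-spline basis (fixed interior
# knots) so the coefficient vectors line up term by term.
interior = np.linspace(0.1, 0.9, 7)
coeffs = np.array([splrep(t, y, task=-1, t=interior)[1] for y in curves])

# Functional distance matrix: Euclidean distance between coefficients.
D = squareform(pdist(coeffs))

# Apply an ordinary algorithm to it, e.g. average-linkage clustering.
Z = linkage(pdist(coeffs), method="average")
groups = fcluster(Z, t=2, criterion="maxclust")
print(D.shape, groups)
```

Cutting the tree at two clusters recovers the trend sites and the seasonal sites as separate groups.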

  27. Functional clustering (3) • Good for description: we can create the functional average curve for each cluster. • Might not be such a good approach if, as is often the case in practice, the individual curves are sparsely sampled. • Not about inference: no model underpins these approaches, so we don’t have probabilities of cluster membership, hence • Model based clustering

  28. Model based clustering (1) Instead of treating the basis coefficients as parameters and fitting a separate spline curve for each site, we use a random effects model for the coefficients. This allows us to borrow strength across curves (handling sparsely or irregularly sampled curves). Furthermore, it automatically weights the estimated spline coefficients according to their variances and is highly efficient because it requires fitting few parameters.

  29. Model based clustering (2) • Big advantage: we estimate the probability of cluster membership • Computationally challenging: choose the number of clusters, fit the model, repeat for different cluster numbers, compare the models

  30. Choosing how many (1) • One benefit of using model-based clustering techniques is that model selection criteria such as Akaike’s Information Criterion (AIC) and the Bayesian Information Criterion (BIC) can often be used to determine the appropriate number of clusters. But this can be computationally expensive.
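
The fit/compare loop on slide 29 can be sketched with scikit-learn's Gaussian mixture as a stand-in for the functional model (scikit-learn and the simulated three-cluster data are assumptions for illustration, not the method used in the slides):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(3)
# Hypothetical data from three Gaussian clusters in 2-D.
X = np.vstack([rng.normal(c, 0.4, size=(40, 2)) for c in (0.0, 4.0, 8.0)])

# Fit the mixture model for each candidate number of clusters and
# record its BIC; the smallest BIC suggests the number to use.
bics = []
for k in range(1, 7):
    gm = GaussianMixture(n_components=k, n_init=3, random_state=0).fit(X)
    bics.append(gm.bic(X))

best_k = int(np.argmin(bics)) + 1
print(best_k)
```

This is exactly the "choose k, fit, repeat, compare" recipe: one model fit per candidate k, then a single criterion to compare them.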

  31. Choosing how many (2) • Another popular approach for selecting the number of clusters is the gap statistic proposed by Tibshirani et al. (2001), which compares the average within-cluster dispersion for the observed data, Wk, with the average within-cluster dispersion for a null reference distribution, which assumes that there is no clustering within the sites. A number of reference datasets, say B, are generated, and the same clustering technique that was applied to the observed data is applied to each.
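
A bare-bones sketch of that comparison (a simplified gap statistic on simulated two-cluster data, taking the k with the largest gap; scipy's k-means and a uniform bounding-box reference are illustrative choices, and Tibshirani et al.'s standard-error correction is omitted):

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def log_wk(X, k):
    """log of the within-cluster dispersion W_k after k-means."""
    centres, labels = kmeans2(X, k, minit="++", seed=0)
    return np.log(((X - centres[labels]) ** 2).sum())

rng = np.random.default_rng(4)
# Hypothetical data with two clear clusters.
X = np.vstack([rng.normal(0.0, 0.3, (30, 2)), rng.normal(5.0, 0.3, (30, 2))])

B = 20                                  # number of reference data sets
lo, hi = X.min(axis=0), X.max(axis=0)   # bounding box of the data
gaps = []
for k in range(1, 5):
    # Reference: uniform over the box, i.e. no clustering in the sites.
    ref = np.mean([log_wk(rng.uniform(lo, hi, X.shape), k)
                   for _ in range(B)])
    gaps.append(ref - log_wk(X, k))

best_k = int(np.argmax(gaps)) + 1
print(best_k)
```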

  32. Model based clustering (4) • Can be extended to deal with multiple curves per site, so for example nitrate and phosphate • Still an active research area • Resources are becoming available in R: MFDA (model based functional clustering), Funclustering

  33. Multivariate clustering(1)

  34. References • James, G. and Sugar, C. (2003). Clustering for sparsely sampled functional data. JASA 98(462) • Henderson, B. (2006). Exploring between site differences in water quality trends. Environmetrics 17 • Jacques, J. and Preda, C. (2014). Model-based clustering for multivariate functional data. CSDA 71 • Tibshirani, R., Walther, G. and Hastie, T. (2001). Estimating the number of clusters in a data set via the gap statistic. JRSS B 63(2)
