- 83 Views
- Uploaded on
- Presentation posted in: General

Grouping Data

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Grouping Data

Methods of cluster analysis

- We want to identify groups of similar artifacts or features or sites or graves, etc that represent cultural, functional, or chronological differences
- We want to create groups as a measurement technique to see how they vary with external variables

- We want to cluster artifacts or sites based on their location to identify spatial clusters

- Differences in goals
- Real types are the aim of Goal 1
- Created types are the aim of Goal 2

- Debate over whether Real types can be discovered with any degree of certainty
- Cluster analysis guarantees groups – you must confirm their utility

- What variables to use?
- All possible
- Constructed variables (from principal components, correspondence analysis, or multi-dimensional scaling)
- Restricted set of variables that support the goal(s) of creating groups (e.g. functional groups, cultural or stylistic groups)

- How to transform the variables?
- Log transforms
- Conversion to percentages (to weight rows equally)
- Size standardization (dividing by geometric mean)
- Z – scores (to weight columns equally)
- Conversion of categorical variables

- How to measure distance?
- Types of variables
- Goals of the analysis
- If uncertain, try multiple methods

- Partitioning Methods – divide the data into groups
- Hierarchical Methods
- Agglomerating – from n clusters to 1 cluster
- Divisive – from 1 cluster to k clusters

- K – Means, K – Medoids, Fuzzy
- Measure of distance – but do not need to compute full distance matrix
- Specify number of groups in advance
- Minimizing within group variability
- Finds spherical clusters

- Start with centers for k groups (user-supplied or random)
- Repeat up to iter.max times (default 10)
- Allocate rows to their closest center
- Recalculate the center positions

- Stop
- Different criteria for allocation
- Use multiple starts (e.g. 5 – 15)

- Compute groups for a range of cluster sizes and plot within group sums of squares to look for sharp increases
- Cluster randomized versions of the data and compare the results
- Examine table of statistics by group

- Plot groups in two dimensions with PCA, CA, or MDS
- Compare the groups using data or information not included in the analysis

- Base R includes kmeans() for forming groups by partitioning
- Rcmdr includes KMeans() to iterate kmeans() for best solution
- Package cluster() includes pam() which uses medoids for more robust grouping and fanny() which forms fuzzy clusters

- DarlPoints (not DartPoints) has 4 measurements for 23 Darl points
- Create Z-scores to weight variables equally with Data | Manage variables in active data set | Standardize variables …
- (or could use PCA and PC Scores)

- Use Rcmdr to partition the data into 5, 4, 3, and 2 groups
- Statistics | Dimensional analysis | Cluster analysis | k-means cluster analysis …
- TWSS = 15.42, 19.78, 25.83, 34.24
- Select group number and have Rcmdr add group to data set

- Evaluate groups against randomized data
- Randomly permute each variable
- Run k-means
- Compare random and non-random results

- Evaluate groups against external criteria (location, material, age, etc)

KMPlotWSS <- function(data, ming, maxg) {

WSS <- sapply(ming:maxg, function(x) kmeans(data, centers = x,

iter.max = 10, nstart = 10)$tot.withinss)

plot(ming:maxg, WSS, las=1, type="b", xlab="Number of Groups",

ylab="Total Within Sum of Squares", pch=16)

print(WSS)

}

KMRandWSS <- function(data, samples, min, max) {

KRand <- function(data, min, max){

Rnd <- apply(data, 2, sample)

sapply(min:max, function(y) kmeans(Rnd, y, iter.max= 10,

nstart=5)$tot.withinss)

}

Sim <- sapply(1:samples, function(x) KRand(data, min, max))

t(apply(Sim, 1, quantile, c(0,.005, .01, .025, .5,

.975, .99, .995, 1)))

}

# Compare data to randomized sets

KMPlotWSS(DarlPoints[,6:9], 1, 10)

Qtiles <- KMRandWSS(DarlPoints[,6:9], 2000, 1, 10)

matlines(1:10, Qtiles[,c(1, 5, 9)], lty=c(3, 2, 3),

lwd=2, col="dark gray")

legend("topright", c("Observed", "Median (Random)",

"Max/Min Random"), col=c("black", "dark gray",

"dark gray"), lwd=c(1, 2, 2), lty=c(1, 2, 3))

- Agglomerative – successive merging
- Divisive - successive splitting
- Monothetic – binary data
- Polythetic – interval/ratio

- At the start all rows are in separate groups (n groups or clusters)
- At each stage two rows are merged, a row and a group are merged, or two groups are merged
- The process stops when all rows are in a single cluster

- How should clusters be formed?
- Single Linkage, irregular shape groups
- Average Linkage – spherical groups
- Complete Linkage – spherical groups
- Ward’s Method – spherical groups
- Median – dendrogram inversions
- Centroid – dendrogram inversions
- McQuitty – similarity by reciprocal pairs

- Base R includes hclus() for forming groups by partitioning
- Package cluster() includes agnes()
- Rcmdr uses hclus() via Statistics | Dimensional analysis | Cluster analysis | Hierarchical cluster analysis …

- Rcmdr menus provide
- Cluster analysis and plot
- Summary statistics by group
- Adding cluster to data set

- To get traditional dendrogram:
- plot(HClust.1, hang=-1, main= "Darl Points", xlab= "Catalog Number", sub="Method=Ward; Distance=Euclidian")
- rect.hclust(HClust.1, 3)

summary(as.factor(cutree(HClust.1, k = 3))) # Cluster Sizes

1 2 3

11 6 6

by(model.matrix(~-1 + Z.Length + Z.Thickness + Z.Weight +

Z.Width, DarlPoints), as.factor(cutree(HClust.1, k = 3)), mean) # Cluster Centroids

INDICES: 1

Z.LengthZ.ThicknessZ.WeightZ.Width

-0.1345150 -0.1585615 -0.2523805 -0.1241642

------------------------------------------------------------

INDICES: 2

Z.LengthZ.ThicknessZ.WeightZ.Width

-1.1085541 -0.9209550 -0.9400026 -0.8200594

------------------------------------------------------------

INDICES: 3

Z.LengthZ.ThicknessZ.WeightZ.Width

1.355165 1.211651 1.402700 1.047694

> biplot(princomp(model.matrix(~-1 + Z.Length + Z.Thickness +

Z.Weight + Z.Width, DarlPoints)),

xlabs = as.character(cutree(HClust.1, k = 3)))

> cbind(HClust.1$merge, HClust.1$height)

[,1] [,2] [,3]

[1,] -12 -13 0.3983821

[2,] -2 -3 0.5112670

[3,] -9 -14 0.5247650

[4,] -10 -17 0.5572146

[5,] -15 3 0.7362171

[6,] -1 -11 0.7471874

[7,] -6 -18 0.8120594

[8,] -7 -8 0.8491895

[9,] 4 5 0.9841552

[10,] 2 6 1.2150606

[11,] -19 -21 1.2300507

[12,] 1 10 1.4059158

[13,] -22 11 1.4963400

[14,] -16 -20 1.5800167

[15,] -4 9 1.6195709

[16,] -5 12 2.1556543

[17,] -23 13 2.4007863

[18,] 7 14 2.4252670

[19,] 8 17 3.2632812

[20,] 16 18 4.9021149

[21,] 15 20 6.6290417

[22,] 19 21 18.7730146

- At the start all rows are considered to be a single group
- At each stage a group is divided into two groups based on the average dissimilarities
- The process stops when all rows are in separate clusters