Grouping Data

Methods of cluster analysis


Goals 1

  • We want to identify groups of similar artifacts, features, sites, graves, etc. that represent cultural, functional, or chronological differences

  • We want to create groups as a measurement technique to see how they vary with external variables


Goals 2

  • We want to cluster artifacts or sites based on their location to identify spatial clusters


Real vs. Created Types

  • Differences in goals

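    • Real types are the aim of the first goal (finding groups that reflect cultural, functional, or chronological differences)

    • Created types are the aim of the second goal (creating groups as a measurement technique)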

  • Debate over whether Real types can be discovered with any degree of certainty

  • Cluster analysis guarantees groups – you must confirm their utility


Initial Decisions 1

  • What variables to use?

    • All possible

    • Constructed variables (from principal components, correspondence analysis, or multi-dimensional scaling)

    • Restricted set of variables that support the goal(s) of creating groups (e.g. functional groups, cultural or stylistic groups)


Initial Decisions 2

  • How to transform the variables?

    • Log transforms

    • Conversion to percentages (to weight rows equally)

    • Size standardization (dividing by geometric mean)

    • Z-scores (to weight columns equally)

    • Conversion of categorical variables
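The transformations listed above can be sketched in base R. This is a minimal illustration, not the presenter's code: the matrix `x`, the factor `material`, and all values in them are hypothetical.

```r
# Hypothetical measurements (mm, mm, g) for six artifacts; not from the source data
x <- cbind(Length = c(40, 55, 62, 38, 71, 49),
           Width  = c(18, 22, 25, 17, 28, 20),
           Weight = c(5.1, 9.8, 12.4, 4.6, 16.0, 7.9))

# Log transform: compresses large values
x_log <- log(x)

# Row percentages: every row sums to 100, so rows are weighted equally
# (most natural for counts; used here only to show the mechanics)
x_pct <- 100 * x / rowSums(x)

# Size standardization: divide each row by its geometric mean
geo_mean <- exp(rowMeans(log(x)))
x_shape <- x / geo_mean

# Z-scores: every column gets mean 0 and sd 1, so columns are weighted equally
x_z <- scale(x)

# Categorical variable converted to 0/1 indicator columns
material <- factor(c("chert", "quartzite", "chert", "chert", "quartzite", "chert"))
dummies <- model.matrix(~ material - 1)
```

Which transformation is appropriate depends on the goal: row percentages and size standardization remove overall size, while Z-scores keep size but put the variables on a common scale.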


Initial Decisions 3

  • How to measure distance?

    • Types of variables

    • Goals of the analysis

    • If uncertain, try multiple methods
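As a rough illustration of how the choice of distance follows the type of variable, the base R function dist() offers several measures. The small matrix `x` and its recoding to presence/absence are hypothetical and chosen only to make the arithmetic easy to follow.

```r
# Hypothetical data: four rows, three numeric variables (not from the source)
x <- matrix(c(1, 2, 3,
              2, 4, 6,
              0, 1, 0,
              5, 5, 5), ncol = 3, byrow = TRUE)

d_euc <- dist(x, method = "euclidean")  # straight-line distance
d_man <- dist(x, method = "manhattan")  # sum of absolute differences

# Presence/absence data: method "binary" ignores joint absences
b <- (x > 1) * 1                        # hypothetical recoding to 0/1
d_bin <- dist(b, method = "binary")

# Compare the measures for the first two rows
c(euclidean = as.matrix(d_euc)[1, 2],
  manhattan = as.matrix(d_man)[1, 2],
  binary    = as.matrix(d_bin)[1, 2])
```

Running the same clustering on two or three of these distance objects is one simple way to "try multiple methods" when the choice is uncertain.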


Methods of Grouping

  • Partitioning Methods – divide the data into groups

  • Hierarchical Methods

    • Agglomerative – from n clusters to 1 cluster

    • Divisive – from 1 cluster to n clusters


Partitioning

  • K-means, K-medoids, fuzzy clustering

  • Use a measure of distance, but do not need to compute the full distance matrix

  • Specify the number of groups in advance

  • Minimize within-group variability

  • Tend to find spherical clusters


Procedure

  • Start with centers for k groups (user-supplied or random)

  • Repeat up to iter.max times (default 10)

    • Allocate rows to their closest center

    • Recalculate the center positions

  • Stop

  • Different criteria for allocation

  • Use multiple starts (e.g. 5 – 15)
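The procedure above can be written out as a short loop. This is a teaching sketch under stated assumptions, not the algorithm kmeans() uses by default: `simple_kmeans` is a hypothetical helper, the data are simulated, allocation uses squared Euclidean distance, and empty groups are not handled.

```r
set.seed(42)
# Hypothetical data: two well-separated groups of 20 points each
x <- rbind(matrix(rnorm(40, mean = 0, sd = 0.5), ncol = 2),
           matrix(rnorm(40, mean = 4, sd = 0.5), ncol = 2))

simple_kmeans <- function(x, k, iter.max = 10) {
  centers <- x[sample(nrow(x), k), , drop = FALSE]  # random starting centers
  for (i in seq_len(iter.max)) {
    # Squared Euclidean distance from every row to every center
    d2 <- sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centers[j, ])^2))
    cluster <- max.col(-d2, ties.method = "first")  # closest center per row
    # Recalculate each center as the mean of the rows allocated to it
    new_centers <- t(sapply(seq_len(k), function(j)
      colMeans(x[cluster == j, , drop = FALSE])))
    if (all(abs(new_centers - centers) < 1e-12)) break  # nothing moved: stop
    centers <- new_centers
  }
  # Total within-group sum of squares for the final allocation
  wss <- sum((x - centers[cluster, , drop = FALSE])^2)
  list(cluster = cluster, centers = centers, tot.withinss = wss)
}

# Multiple starts: keep the solution with the smallest within-group sum of squares
fits <- lapply(1:10, function(i) simple_kmeans(x, k = 2))
best <- fits[[which.min(sapply(fits, `[[`, "tot.withinss"))]]
table(best$cluster)
```

In practice kmeans(x, centers = 2, iter.max = 10, nstart = 10) does the same job, with the multiple starts handled by the nstart argument.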


Evaluation 1

  • Compute groups for a range of group numbers and plot the within-group sums of squares to look for a sharp increase as the number of groups decreases

  • Cluster randomized versions of the data and compare the results

  • Examine table of statistics by group
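A table of statistics by group can be produced with aggregate(). The sketch below assumes a hypothetical data frame `x` of simulated measurements and a three-group k-means solution; neither comes from the presentation.

```r
set.seed(5)
# Hypothetical measurements for 30 artifacts (not the DarlPoints data)
x <- data.frame(Length = rnorm(30, 50, 8), Width = rnorm(30, 20, 3))
km <- kmeans(scale(x), centers = 3, nstart = 10)

# Group sizes, then the mean and standard deviation of each variable by group
sizes <- table(km$cluster)
means <- aggregate(x, by = list(Group = km$cluster), FUN = mean)
sds   <- aggregate(x, by = list(Group = km$cluster), FUN = sd)
print(sizes)
print(means)
print(sds)
```

The groups are formed on Z-scores, but the statistics are reported in the original units so they are easier to interpret.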


Evaluation 2

  • Plot groups in two dimensions with PCA, CA, or MDS

  • Compare the groups using data or information not included in the analysis
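Both checks can be sketched in a few lines of base R. The data are simulated and the external variable `material` is hypothetical; prcomp() stands in for whichever of PCA, CA, or MDS suits the data.

```r
set.seed(6)
# Hypothetical data: 30 artifacts, four measurements, two loose groups
x <- rbind(matrix(rnorm(60, mean = 0), ncol = 4),
           matrix(rnorm(60, mean = 3), ncol = 4))
z <- scale(x)
km <- kmeans(z, centers = 2, nstart = 10)

# Project onto the first two principal components and color by group
pc <- prcomp(z)
plot(pc$x[, 1:2], pch = 16, col = km$cluster, xlab = "PC 1", ylab = "PC 2")

# Hypothetical external variable that was not used to form the groups
material <- factor(rep(c("chert", "quartzite"), each = 15))
tab <- table(Group = km$cluster, Material = material)
print(tab)
```

If the groups carry information beyond the clustering variables, the cross-tabulation should depart clearly from what chance alone would produce.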


Partitioning Using R

  • Base R includes kmeans() for forming groups by partitioning

  • Rcmdr includes KMeans() to iterate kmeans() for best solution

  • Package cluster includes pam(), which uses medoids for more robust grouping, and fanny(), which forms fuzzy clusters
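A minimal side-by-side run of the three functions, assuming the cluster package is installed (it is one of the recommended packages distributed with R). The data are simulated for illustration only.

```r
library(cluster)  # provides pam() and fanny()

set.seed(1)
# Hypothetical data: two groups of 15 points each
x <- rbind(matrix(rnorm(30, mean = 0), ncol = 2),
           matrix(rnorm(30, mean = 5), ncol = 2))

km <- kmeans(x, centers = 2, nstart = 10)  # centers are group means
pm <- pam(x, k = 2)                        # centers are actual rows (medoids)
fz <- fanny(x, k = 2)                      # membership coefficients for each row

km$tot.withinss       # total within-group sum of squares
pm$medoids            # the two representative rows
head(fz$membership)   # degrees of membership; each row sums to 1
table(kmeans = km$cluster, pam = pm$clustering)
```

The group labels are arbitrary, so the cross-tabulation may show the two solutions agreeing on the off-diagonal rather than the diagonal.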


Example

  • DarlPoints (not DartPoints) has 4 measurements for 23 Darl points

  • Create Z-scores to weight variables equally with Data | Manage variables in active data set | Standardize variables …

  • (or could use PCA and PC Scores)


Example (cont)

  • Use Rcmdr to partition the data into 5, 4, 3, and 2 groups

  • Statistics | Dimensional analysis | Cluster analysis | k-means cluster analysis …

  • Total within-group sum of squares (TWSS) = 15.42, 19.78, 25.83, 34.24 for 5, 4, 3, and 2 groups, respectively

  • Select group number and have Rcmdr add group to data set


Evaluation

  • Evaluate groups against randomized data

    • Randomly permute each variable

    • Run k-means

    • Compare random and non-random results

  • Evaluate groups against external criteria (location, material, age, etc)


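# Plot the total within-group sum of squares for ming to maxg groups
KMPlotWSS <- function(data, ming, maxg) {
  WSS <- sapply(ming:maxg, function(x) kmeans(data, centers = x,
      iter.max = 10, nstart = 10)$tot.withinss)
  plot(ming:maxg, WSS, las = 1, type = "b", xlab = "Number of Groups",
      ylab = "Total Within Sum of Squares", pch = 16)
  print(WSS)
}

# Quantiles of the total within-group sum of squares for randomized data
KMRandWSS <- function(data, samples, min, max) {
  # Permute each column independently, then run k-means for min to max groups
  KRand <- function(data, min, max) {
    Rnd <- apply(data, 2, sample)
    sapply(min:max, function(y) kmeans(Rnd, y, iter.max = 10,
        nstart = 5)$tot.withinss)
  }
  # One column of results per randomized data set
  Sim <- sapply(1:samples, function(x) KRand(data, min, max))
  # One row per number of groups, one column per quantile
  t(apply(Sim, 1, quantile, c(0, .005, .01, .025, .5,
      .975, .99, .995, 1)))
}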


# Compare data to randomized sets
KMPlotWSS(DarlPoints[, 6:9], 1, 10)
Qtiles <- KMRandWSS(DarlPoints[, 6:9], 2000, 1, 10)
matlines(1:10, Qtiles[, c(1, 5, 9)], lty = c(3, 2, 3),
    lwd = 2, col = "dark gray")
legend("topright", c("Observed", "Median (Random)",
    "Max/Min Random"), col = c("black", "dark gray",
    "dark gray"), lwd = c(1, 2, 2), lty = c(1, 2, 3))


Hierarchical Methods

  • Agglomerative – successive merging

  • Divisive – successive splitting

    • Monothetic – binary data

    • Polythetic – interval/ratio


Agglomerative

  • At the start all rows are in separate groups (n groups or clusters)

  • At each stage two rows are merged, a row and a group are merged, or two groups are merged

  • The process stops when all rows are in a single cluster
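The merging sequence can be followed on a tiny example. The five values below are hypothetical and average linkage is an arbitrary choice for illustration. In the merge matrix returned by hclust(), a negative entry is an original row and a positive entry refers to the group formed at that earlier step.

```r
# Hypothetical one-dimensional data: five values on a line
x <- c(a = 1, b = 2, c = 6, d = 7.5, e = 15)
hc <- hclust(dist(x), method = "average")

# Each row is one merge; height is the distance at which the merge happened
steps <- cbind(hc$merge, height = hc$height)
print(steps)

# n rows always produce n - 1 merges
nrow(hc$merge) == length(x) - 1
```

Here the first two steps each join two rows (a with b, then c with d), the third joins those two groups, and the last adds the remaining row e to the group of four.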


Agglomeration Methods

  • How should clusters be formed?

    • Single Linkage – irregularly shaped groups

    • Average Linkage – spherical groups

    • Complete Linkage – spherical groups

    • Ward’s Method – spherical groups

    • Median – dendrogram inversions

    • Centroid – dendrogram inversions

    • McQuitty – similarity by reciprocal pairs
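One way to compare the methods is to run hclust() with each of them on the same distance matrix. The sketch below uses simulated data; "ward.D2" is used as the Ward variant, which is an assumption since the source does not say which one it means. A dendrogram inversion shows up as merge heights that are not in increasing order.

```r
set.seed(3)
# Hypothetical data: two groups of 20 points each
x <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
           matrix(rnorm(40, mean = 5), ncol = 2))
d <- dist(scale(x))

methods <- c("single", "average", "complete", "ward.D2",
             "median", "centroid", "mcquitty")
fits <- lapply(methods, function(m) hclust(d, method = m))
names(fits) <- methods

# Group sizes when each tree is cut into two groups
sizes <- sapply(fits, function(h) table(cutree(h, k = 2)))
print(sizes)

# TRUE means the merge heights decrease somewhere (a dendrogram inversion);
# only the median and centroid methods can produce this
inversions <- sapply(fits, function(h) is.unsorted(h$height))
print(inversions)
```

Whether median or centroid linkage actually shows an inversion depends on the data, so a FALSE for them here does not rule it out in general.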


Agglomerating with R

  • Base R includes hclust() for forming groups by hierarchical agglomeration

  • Package cluster includes agnes()

  • Rcmdr uses hclust() via Statistics | Dimensional analysis | Cluster analysis | Hierarchical cluster analysis …


HClust

  • Rcmdr menus provide

    • Cluster analysis and plot

    • Summary statistics by group

    • Adding cluster to data set

  • To get traditional dendrogram:

    • plot(HClust.1, hang=-1, main= "Darl Points", xlab= "Catalog Number", sub="Method=Ward; Distance=Euclidean")

    • rect.hclust(HClust.1, 3)


> summary(as.factor(cutree(HClust.1, k = 3))) # Cluster Sizes
 1  2  3
11  6  6

> by(model.matrix(~-1 + Z.Length + Z.Thickness + Z.Weight +
+     Z.Width, DarlPoints), as.factor(cutree(HClust.1, k = 3)), colMeans) # Cluster Centroids
INDICES: 1
   Z.Length Z.Thickness    Z.Weight     Z.Width
 -0.1345150  -0.1585615  -0.2523805  -0.1241642
------------------------------------------------------------
INDICES: 2
   Z.Length Z.Thickness    Z.Weight     Z.Width
 -1.1085541  -0.9209550  -0.9400026  -0.8200594
------------------------------------------------------------
INDICES: 3
   Z.Length Z.Thickness    Z.Weight     Z.Width
   1.355165    1.211651    1.402700    1.047694

> biplot(princomp(model.matrix(~-1 + Z.Length + Z.Thickness +
+     Z.Weight + Z.Width, DarlPoints)),
+     xlabs = as.character(cutree(HClust.1, k = 3)))


> cbind(HClust.1$merge, HClust.1$height)
      [,1] [,2]       [,3]
 [1,]  -12  -13  0.3983821
 [2,]   -2   -3  0.5112670
 [3,]   -9  -14  0.5247650
 [4,]  -10  -17  0.5572146
 [5,]  -15    3  0.7362171
 [6,]   -1  -11  0.7471874
 [7,]   -6  -18  0.8120594
 [8,]   -7   -8  0.8491895
 [9,]    4    5  0.9841552
[10,]    2    6  1.2150606
[11,]  -19  -21  1.2300507
[12,]    1   10  1.4059158
[13,]  -22   11  1.4963400
[14,]  -16  -20  1.5800167
[15,]   -4    9  1.6195709
[16,]   -5   12  2.1556543
[17,]  -23   13  2.4007863
[18,]    7   14  2.4252670
[19,]    8   17  3.2632812
[20,]   16   18  4.9021149
[21,]   15   20  6.6290417
[22,]   19   21 18.7730146


Divisive

  • At the start all rows are considered to be a single group

  • At each stage a group is divided into two groups based on the average dissimilarities

  • The process stops when all rows are in separate clusters
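Divisive clustering of this kind is available as diana() in the cluster package. The sketch below assumes that package is installed and uses simulated data; the source does not name the function it has in mind.

```r
library(cluster)  # provides diana()

set.seed(4)
# Hypothetical data: two groups of 15 points each
x <- rbind(matrix(rnorm(30, mean = 0), ncol = 2),
           matrix(rnorm(30, mean = 5), ncol = 2))

dv <- diana(x)   # start with one group and split until every row is alone
dv$dc            # divisive coefficient: closer to 1 suggests clearer structure

# Convert to an hclust object to cut the tree into a chosen number of groups
groups <- cutree(as.hclust(dv), k = 2)
table(groups)
```

The converted object can also be passed to plot() to draw the same kind of dendrogram used for the agglomerative results.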

