Multivariate Data Analysis

Multivariate Data Analysis  G. Quinn, M. Burgman & J. Carey 2003

Objects • Things we wish to compare • sampling or experimental units • e.g. quadrats, animals, plants, cages etc.

Variables • Characteristics measured from each object • usually continuous variables • e.g. counts of species, size of body parts etc.

Ecological data • Objects: • sampling units (SU’s, e.g. quadrats, plots etc.) • Variables: • species abundances and/or environmental data • Common in community ecology

Wisconsin forests (Peet & Loucks 1977) • Plots (quadrats) in Wisconsin forests • Number of individuals of each species of tree recorded in each quadrat • Objects: • quadrats • Variables: • abundances of each tree species

Data

Plot  Bur oak  Black oak  White oak  Red oak  etc.
1     9        8          5          3
2     8        9          4          4
3     3        8          9          0
4     5        7          9          6
5     6        0          7          9
6     0        0          7          8
etc.

Garroch Head dumping ground (Clarke & Ainsworth 1993) • Sewage sludge dumping ground in bay • Transect across dumping ground • Core of mud at each of 10 stations along transect • Objects: • stations • Variables: • metal concentrations in ppm

Data

Station  Cu   Mn    Co  Ni  Zn   Cd   etc.
1        26   2470  14  34  160  0
2        30   1170  15  32  156  0.2
3        37   394   12  38  182  0.2
4        74   349   12  41  227  0.5
5        115  317   10  37  329  2.2
etc.

Morphological data • Objects: • usually organisms or specimens • Variables: • morphological measurements

Morphological data • Morphological variation between dog species/types • Objects: • dog types (7) • Variables: • sizes of 6 different parts of mandible • mandible breadth, mandible height, etc.

Data

                 Variable
Dog type         1     2     3     4     5     6
Modern dog       9.7   21.0  19.4  7.7   32.0  36.5
Jackal           8.1   16.7  18.3  7.0   30.3  32.9
Chinese wolf     13.5  27.3  26.8  10.6  41.9  48.1
Indian wolf      11.5  24.3  24.5  9.3   40.0  44.6
Cuon             10.7  23.5  21.4  8.5   28.8  37.6
Dingo            9.6   22.6  21.1  8.3   34.4  43.1
Prehistoric dog  10.3  22.1  19.1  8.1   32.3  25.0

Presentation of Multivariate Data
• Raw data matrix: objects (O1, O2, . . , Op) by variables (V1, V2, . . , Vn)
• Resemblance matrix: objects (O1, O2, . . , Op) by objects, created using correlations, covariances or dissimilarity indices
• Ordination
• Classification

Data Standardization
Adjusting of data so that means and/or variances or totals are the same for each variable. Examples:
• 1) centering + standardizing: $x_i' = \dfrac{x_i - \bar{x}}{s}$
• 2) rescaling relative to the maximum: $x_i' = \dfrac{x_i}{x_{\max}}$
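The two standardizations can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names are made up here, and s is assumed to be the sample standard deviation (divisor n - 1), which the slide does not state. The Cu values are those of stations 1 to 5 in the Garroch Head table.

```python
import statistics


def standardize(values):
    """Centre on the mean, then divide by the standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # assumption: sample SD (divisor n - 1)
    return [(x - mean) / sd for x in values]


def rescale_to_max(values):
    """Express each value as a fraction of the largest value."""
    largest = max(values)
    return [x / largest for x in values]


# Cu concentrations (ppm) at stations 1 to 5
cu = [26, 30, 37, 74, 115]
cu_std = standardize(cu)
cu_rel = rescale_to_max(cu)

print([round(v, 2) for v in cu_std])
print([round(v, 2) for v in cu_rel])
```

After standardizing, the variable has mean 0 and standard deviation 1; after rescaling, its largest value is 1.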

Mantel Test A statistical test of association between the corresponding elements of two matrices. 1) Calculate r0, the correlation between the elements in the matrices (cophenetic correlation) 2) Randomly permute the rows and corresponding columns of one matrix. 3) Calculate r, the correlation between the permuted matrix and the unchanged matrix.

1) Matrix A

     1    2    3    4    5
1    0
2    20   0
3    41   39   0
4    12   25   53   0
5    13   14   45   17   0

Matrix B

     1    2    3    4    5
1    0
2    84   0
3    26   51   0
4    10   17   45   0
5    22   35   28   32   0

2) Random permutation of the rows (and columns) of Matrix A; Matrix B unchanged

     2    1    5    4    3
2    0
1    20   0
5    14   13   0
4    25   12   17   0
3    39   41   45   53   0

Mantel Test 4) Repeat steps 2 and 3 many times (at least 1000). 5) Estimate how likely a correlation as large as r0 is to arise by chance by comparing r0 to the randomization distribution of r.
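The five steps can be sketched as follows, using Matrix A and Matrix B from the example. This is an illustrative sketch under stated assumptions: the helper names are invented here, the test is taken to be one-sided (counting permuted r at least as large as r0), and the p-value uses the common (count + 1) / (permutations + 1) convention, neither of which is specified on the slides.

```python
import math
import random

# Full symmetric versions of Matrix A and Matrix B
A = [[0, 20, 41, 12, 13],
     [20, 0, 39, 25, 14],
     [41, 39, 0, 53, 45],
     [12, 25, 53, 0, 17],
     [13, 14, 45, 17, 0]]
B = [[0, 84, 26, 10, 22],
     [84, 0, 51, 17, 35],
     [26, 51, 0, 45, 28],
     [10, 17, 45, 0, 32],
     [22, 35, 28, 32, 0]]


def lower_triangle(m):
    """Elements below the diagonal, read row by row."""
    return [m[i][j] for i in range(len(m)) for j in range(i)]


def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def permute(m, order):
    """Reorder rows and the corresponding columns together."""
    return [[m[i][j] for j in order] for i in order]


def mantel(a, b, n_perm=1000, seed=0):
    rng = random.Random(seed)            # seeded so the run is repeatable
    b_vals = lower_triangle(b)
    r0 = pearson(lower_triangle(a), b_vals)            # step 1
    n = len(a)
    count = 0
    for _ in range(n_perm):                            # step 4
        order = rng.sample(range(n), n)                # step 2
        r = pearson(lower_triangle(permute(a, order)), b_vals)  # step 3
        if r >= r0:
            count += 1
    p_value = (count + 1) / (n_perm + 1)               # step 5 (assumed convention)
    return r0, p_value


r0, p_value = mantel(A, B)
print(round(r0, 3), round(p_value, 3))
```

Calling `permute(A, [1, 0, 4, 3, 2])` reproduces the permuted matrix shown in the example (object order 2, 1, 5, 4, 3).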

Principal Components Analysis • Aims to reduce a large number of variables to a smaller number of summary variables called Principal Components (or factors), that explain most of the variation in the data. • Is basically a rotation of axes after centring to the means of the variables, the rotated axes being the Principal Components. • Is usually carried out using a matrix algebra technique called eigenanalysis.

Steps in PCA
1) From raw data matrix, calculate correlation matrix, or covariance matrix on standardized variables

Raw data matrix

        NO3   Total Organic N   Total N   . . . .
Site 1
Site 2
Site 3
:

Correlation matrix

        Site 1  Site 2  Site 3  . . . .
Site 1  1
Site 2  0.37    1
Site 3  0.84    0.13    1
:

Steps in PCA 2) Calculate eigenvectors (weightings of each original variable on each component) and eigenvalues (= "latent roots") (relative measures of the variation explained by each component)

Eigenvectors
$z_{ik} = c_1 y_{i1} + c_2 y_{i2} + \dots + c_j y_{ij} + \dots + c_p y_{ip}$
where
• $z_{ik}$ = score for component k for object i
• $y_{ij}$ = value of original variable j for object i
• $c_j$ = factor score coefficient (weight) of variable j for component k
• Example: soil chemistry in a forest
• $z_{ik}$ = c1(NO3) + c2(total organic N) + c3(total N) + ..
• the objects are sampling sites
• the variables are chemical measurements, e.g. total N

Steps in PCA - continued 3) Decide how many components to retain (scree plot of eigenvalues)
[Figure: scree plot of Eigenvalue (0 to 5) against Factor (1 to 8)]

Steps in PCA 4) Using factor score coefficients, calculate factor score = coefficient x (standardized) variable

Steps in PCA 5) Position objects on scatterplot, using factor scores on first two (or three) Principal Components
[Figure: scatterplot of Site 1, Site 2 and Site 3 on FACTOR(1) (x-axis) and FACTOR(2) (y-axis), both axes running from -3 to 3]
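Steps 1 to 4 can be sketched in plain Python on the dog mandible data. This is a hedged illustration rather than the method used for the slides: the eigenanalysis is done with a cyclic Jacobi rotation routine, chosen here only because it needs no external library, and the eigenvector weights are used directly as factor score coefficients (some packages scale them differently). All function names are invented for this sketch.

```python
import math

# Mandible measurements (variables 1 to 6) for the seven dog types
dogs = [
    [9.7, 21.0, 19.4, 7.7, 32.0, 36.5],     # Modern dog
    [8.1, 16.7, 18.3, 7.0, 30.3, 32.9],     # Jackal
    [13.5, 27.3, 26.8, 10.6, 41.9, 48.1],   # Chinese wolf
    [11.5, 24.3, 24.5, 9.3, 40.0, 44.6],    # Indian wolf
    [10.7, 23.5, 21.4, 8.5, 28.8, 37.6],    # Cuon
    [9.6, 22.6, 21.1, 8.3, 34.4, 43.1],     # Dingo
    [10.3, 22.1, 19.1, 8.1, 32.3, 25.0],    # Prehistoric dog
]


def standardize_columns(rows):
    """Centre each variable and divide by its sample SD (n - 1)."""
    n, p = len(rows), len(rows[0])
    out = [[0.0] * p for _ in range(n)]
    for j in range(p):
        col = [row[j] for row in rows]
        mean = sum(col) / n
        sd = math.sqrt(sum((x - mean) ** 2 for x in col) / (n - 1))
        for i in range(n):
            out[i][j] = (col[i] - mean) / sd
    return out


def correlation_matrix(z):
    """Correlations between variables, from standardized data."""
    n, p = len(z), len(z[0])
    return [[sum(z[i][a] * z[i][b] for i in range(n)) / (n - 1)
             for b in range(p)] for a in range(p)]


def jacobi_eigen(matrix, max_sweeps=100, tol=1e-20):
    """Eigenvalues and eigenvectors of a symmetric matrix (cyclic Jacobi)."""
    n = len(matrix)
    a = [row[:] for row in matrix]
    v = [[float(i == j) for j in range(n)] for i in range(n)]
    for _ in range(max_sweeps):
        off = sum(a[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
        if off < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if a[p][q] == 0.0:
                    continue
                if a[p][p] == a[q][q]:
                    theta = math.pi / 4
                else:
                    theta = 0.5 * math.atan(2 * a[p][q] / (a[q][q] - a[p][p]))
                c, s = math.cos(theta), math.sin(theta)
                for k in range(n):   # rotate columns p and q
                    akp, akq = a[k][p], a[k][q]
                    a[k][p], a[k][q] = c * akp - s * akq, s * akp + c * akq
                for k in range(n):   # rotate rows p and q
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k], a[q][k] = c * apk - s * aqk, s * apk + c * aqk
                for k in range(n):   # accumulate the eigenvectors
                    vkp, vkq = v[k][p], v[k][q]
                    v[k][p], v[k][q] = c * vkp - s * vkq, s * vkp + c * vkq
    order = sorted(range(n), key=lambda j: a[j][j], reverse=True)
    values = [a[j][j] for j in order]
    vectors = [[v[i][j] for i in range(n)] for j in order]  # one list per component
    return values, vectors


z = standardize_columns(dogs)                 # standardized variables
R = correlation_matrix(z)                     # step 1
eigvals, eigvecs = jacobi_eigen(R)            # step 2
explained = [ev / sum(eigvals) for ev in eigvals]   # step 3: values for a scree plot
scores = [[sum(w * x for w, x in zip(vec, row)) for vec in eigvecs]
          for row in z]                       # step 4: factor scores

print([round(ev, 3) for ev in eigvals])
print([round(e, 3) for e in explained])
```

The first two columns of `scores` give the coordinates for the scatterplot in step 5. Because the analysis is on a correlation matrix, the eigenvalues sum to the number of variables (6 here).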

Dissimilarity Indices • Dissimilarity indices: • measure how different objects are in terms of their variable values • how different sampling units are in species composition • how different organisms are in morphological structure

Dissimilarity Indices • Dissimilarity: • calculated for each pair of objects in data set • dissimilarity between 2 quadrats in terms of species composition • dissimilarity between 2 dogs in terms of morphological structure

Dissimilarity
• Consider 2 objects j and k (eg. 2 quadrats)
• Let yij and yik be values for variable i in objects j and k (i = 1 to 3):

Quadrat  Sp1  Sp2  Sp3
j        3    6    9
k        6    12   18

• For sp1, y1j = 3 and y1k = 6
• For sp2, y2j = 6 and y2k = 12
• For sp3, y3j = 9 and y3k = 18

Euclidean Distance (yij - yik)2 [(3-6)2+(6-12)2+(9-18)2] = 11.2

Euclidean Distance • Distance between objects when plotted in multidimensional (multivariable) space
[Figure: Quadrat 1 and Quadrat 2 plotted against Abundance of species 1 (x-axis, 0 to 100) and Abundance of species 2 (y-axis, 0 to 100); the Euclidean distance is the line joining them]

Bray-Curtis (Czekanowski)
$1 - \dfrac{2\sum_i \min(y_{ij}, y_{ik})}{\sum_i (y_{ij} + y_{ik})} = \dfrac{\sum_i |y_{ij} - y_{ik}|}{\sum_i (y_{ij} + y_{ik})}$
- where $\sum_i \min(y_{ij}, y_{ik})$ = sum of the lesser abundance of each species when it occurs in both sampling units
- note summation over species

Bray-Curtis for quadrats j and k:
$1 - \dfrac{2(3+6+9)}{9+18+27} = \dfrac{3+6+9}{9+18+27} = 0.33$
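Both forms of the Bray-Curtis formula can be checked against the worked value with a small Python sketch (function names are illustrative):

```python
def bray_curtis_min_form(a, b):
    """1 - 2 * (sum of lesser abundances) / (sum of all abundances)."""
    shared = sum(min(x, y) for x, y in zip(a, b))
    total = sum(x + y for x, y in zip(a, b))
    return 1 - 2 * shared / total


def bray_curtis_diff_form(a, b):
    """(sum of absolute differences) / (sum of all abundances)."""
    diff = sum(abs(x - y) for x, y in zip(a, b))
    total = sum(x + y for x, y in zip(a, b))
    return diff / total


quadrat_j = [3, 6, 9]
quadrat_k = [6, 12, 18]

bc_min = bray_curtis_min_form(quadrat_j, quadrat_k)
bc_diff = bray_curtis_diff_form(quadrat_j, quadrat_k)
print(round(bc_min, 2), round(bc_diff, 2))  # 0.33 0.33
```

The two forms agree because, for each species, the absolute difference equals the sum of the two abundances minus twice the lesser one.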

Dissimilarities in ecology
• reach maximum value (eg. 1) when quadrats have no species in common

Quadrat  Sp1  Sp2  Sp3
1        0    3    0
2        2    0    4

• Euclidean = 5.4
• Bray-Curtis = 1

• equal 0 when quadrats are identical in species abundances

Quadrat  Sp1  Sp2  Sp3
1        2    4    7
2        2    4    7

• Euclidean = 0
• Bray-Curtis = 0

Preferred dissimilarity indices • Species abundance data: • zeros common • max. value when quadrats have no species in common • Bray-Curtis preferred • Measurement data: • zeros uncommon • Euclidean OK
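Since a dissimilarity is calculated for each pair of objects, the result is a square resemblance matrix. The sketch below builds a Bray-Curtis matrix for the six Wisconsin forest plots listed earlier. Only the four species columns printed on the slide are used, so the values are illustrative rather than the full-data result, and the function names are invented here.

```python
# Abundances of bur oak, black oak, white oak and red oak in plots 1 to 6
plots = [
    [9, 8, 5, 3],
    [8, 9, 4, 4],
    [3, 8, 9, 0],
    [5, 7, 9, 6],
    [6, 0, 7, 9],
    [0, 0, 7, 8],
]


def bray_curtis(a, b):
    """Sum of absolute differences over sum of all abundances."""
    diff = sum(abs(x - y) for x, y in zip(a, b))
    total = sum(x + y for x, y in zip(a, b))
    return diff / total


def dissimilarity_matrix(objects, index):
    """Apply a dissimilarity index to every pair of objects."""
    return [[index(a, b) for b in objects] for a in objects]


bc = dissimilarity_matrix(plots, bray_curtis)

# Print the lower triangle, one row per plot
for i, row in enumerate(bc):
    print(i + 1, [round(value, 2) for value in row[:i + 1]])
```

Passing a different function as `index` (for example a Euclidean distance) gives the corresponding matrix for measurement data.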

Cluster Analysis • Agglomerative / divisive • Hierarchical / non-hierarchical • SAHN - Sequential Agglomerative Hierarchical Non-overlapping classification

Distance Matrix

     A    B    C    D    E
A    -
B    2    -
C    6    5    -
D    10   9    4    -
E    9    8    5    3    -

Average Linkage (UPGMA) • Unweighted Pair-Group Method of Arithmetic Averaging • Distance measured using the average distance of a point to a cluster

1) Shortest distance is 2, between A and B

     A    B    C    D    E
A    -
B    2    -
C    6    5    -
D    10   9    4    -
E    9    8    5    3    -

2) From above, dist(AC) = 6 and dist(BC) = 5. In new matrix, group AB is (6 + 5)/2 = 5.5 from C. Shortest distance is now 3, between D and E

      A/B   C    D    E
A/B   -
C     5.5   -
D     9.5   4    -
E     8.5   5    3    -

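3) From Step 2, dist(CD) = 4 and dist(CE) = 5. In new matrix, group DE is (4 + 5)/2 = 4.5 from C, and group AB is (9.5 + 8.5)/2 = 9 from group DE. Shortest distance is now 4.5, between C and D/E

      A/B   C     D/E
A/B   -
C     5.5   -
D/E   9     4.5   -

4) Group AB is the average of its six pairwise distances from group CDE: (6 + 5 + 10 + 9 + 9 + 8)/6 = 7.83

        A/B    C/D/E
A/B     -
C/D/E   7.83   -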

Dendrograms
Linkage values can be used to construct a dendrogram

Distance  Groups
0         A, B, C, D, E
2         (A, B), C, D, E
3         (A, B), C, (D, E)
4.5       (A, B), (C, D, E)
7.8       (A, B, C, D, E)

[Figure: dendrogram of A, B, C, D and E, with Distance (0 to 8) on the vertical axis]
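The linkage values in the worked example can be reproduced with a short Python sketch of average linkage. It is a minimal illustration with invented names: the distance between two clusters is taken as the mean of all pairwise distances between their members, and the closest pair of clusters is merged at each step.

```python
from itertools import combinations

labels = ["A", "B", "C", "D", "E"]
pairs = {
    ("A", "B"): 2,
    ("A", "C"): 6, ("B", "C"): 5,
    ("A", "D"): 10, ("B", "D"): 9, ("C", "D"): 4,
    ("A", "E"): 9, ("B", "E"): 8, ("C", "E"): 5, ("D", "E"): 3,
}
dist = {frozenset(pair): value for pair, value in pairs.items()}


def average_distance(c1, c2):
    """Mean of all pairwise distances between members of two clusters."""
    total = sum(dist[frozenset((x, y))] for x in c1 for y in c2)
    return total / (len(c1) * len(c2))


def upgma(names):
    """Return the merged groups and the distance at which each merge occurs."""
    clusters = [(name,) for name in names]
    merges = []
    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2),
                     key=lambda pair: average_distance(*pair))
        height = average_distance(c1, c2)
        clusters = [c for c in clusters if c not in (c1, c2)]
        clusters.append(tuple(sorted(c1 + c2)))
        merges.append((clusters[-1], height))
    return merges


merges = upgma(labels)
for group, height in merges:
    print(round(height, 2), group)
```

The merge heights (2, 3, 4.5 and 7.83) are the levels at which the branches of the dendrogram join.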

Other Linkage Methods Single Linkage (Nearest Neighbour) • distance measured to closest point in cluster Complete Linkage (Furthest Neighbour) • distance between two clusters defined as the furthest distance between any two points in them
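For comparison, the same distance matrix (A to E) can be clustered with single and complete linkage. The sketch below is illustrative and uses invented names: the only change from average linkage is that the distance between two clusters is the minimum (single) or maximum (complete) of the pairwise distances between their members.

```python
from itertools import combinations

labels = ["A", "B", "C", "D", "E"]
pairs = {
    ("A", "B"): 2,
    ("A", "C"): 6, ("B", "C"): 5,
    ("A", "D"): 10, ("B", "D"): 9, ("C", "D"): 4,
    ("A", "E"): 9, ("B", "E"): 8, ("C", "E"): 5, ("D", "E"): 3,
}
dist = {frozenset(pair): value for pair, value in pairs.items()}


def agglomerate(names, linkage):
    """Merge the closest clusters repeatedly; return the merge distances.

    `linkage` reduces the pairwise distances between two clusters to a
    single value: min for single linkage, max for complete linkage.
    """
    def between(c1, c2):
        return linkage(dist[frozenset((x, y))] for x in c1 for y in c2)

    clusters = [(name,) for name in names]
    heights = []
    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2),
                     key=lambda pair: between(*pair))
        heights.append(between(c1, c2))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return heights


single = agglomerate(labels, min)
complete = agglomerate(labels, max)
print(single)    # [2, 3, 4, 5]
print(complete)  # [2, 3, 5, 10]
```

For this matrix the groups form in the same order under both methods, but single linkage joins them at shorter distances and complete linkage at longer ones than average linkage.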

Minimum Spanning Trees • Edges linked to nearest points (vertices) • MST may be mapped onto eigenspace, showing which points are distorted in two dimensions

Minimum Spanning Trees Steps: • Find minimum value in resemblance matrix. Draw the two points and join with a line. Write the distance value on the line. • Find the next lowest value in the matrix. Draw these points and join them with a line, unless both points are already connected through earlier lines (joining them would close a loop). • Repeat until all points have been drawn and connected into a single tree. • Redraw the whole plot to make the line lengths representative of the distances.
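One way to make these steps precise is Kruskal's algorithm, which takes the distances in increasing order and skips any link between two points that are already connected, so that no loops form. The sketch below applies it to the A to E distance matrix used in the clustering example; the function name and the tie-breaking rule (alphabetical order of the pair) are choices made here, not taken from the slides.

```python
labels = ["A", "B", "C", "D", "E"]
pairs = {
    ("A", "B"): 2,
    ("A", "C"): 6, ("B", "C"): 5,
    ("A", "D"): 10, ("B", "D"): 9, ("C", "D"): 4,
    ("A", "E"): 9, ("B", "E"): 8, ("C", "E"): 5, ("D", "E"): 3,
}


def minimum_spanning_tree(names, distances):
    """Kruskal's algorithm: add the shortest links that do not close a loop."""
    parent = {name: name for name in names}

    def find(x):
        # Follow parent links to the representative of x's group
        while parent[x] != x:
            x = parent[x]
        return x

    tree = []
    ordered = sorted(distances.items(), key=lambda item: (item[1], item[0]))
    for (a, b), d in ordered:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:          # a and b are not yet connected
            parent[root_a] = root_b   # join the two groups
            tree.append((a, b, d))
        if len(tree) == len(names) - 1:
            break                     # every point is now in one tree
    return tree


tree = minimum_spanning_tree(labels, pairs)
total_length = sum(d for _, _, d in tree)
print(tree)
print(total_length)  # 14
```

A tree on five points always has four links; here they are A-B, D-E, C-D and B-C.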
