Ordinations

Peter Shaw

Introduction to ordination

- “Ordination” is the term used for arranging multivariate data in a rational order, usually into a 2D space (i.e. a diagram on paper).
- The underlying concept is that of taking a set of community data (or chemical or physical…) and organising them into a picture that gives you insight into their organisation.

[Figure: five samples arranged by habitat: 1 and 2 under Wet, 4 and 5 under Dry, 3 under Muddy.]

[Figure: A direct ordination, projecting communities onto time: the sand dune succession at the southern end of Lake Michigan (re-drawn from Olson 1958). Stages shown: bare sand stabilised by Marram grass; cottonwood (Populus deltoides); pines (Pinus spp.); black oak (Quercus velutina). Axes: elevation, m (177 to 193) against years stable (0 to 10,000).]

[Figure: A 2-D direct ordination, also called a mosaic diagram, showing the distribution of vegetation types in relation to elevation and moisture in Sequoia National Park. The communities are laid out in relation to two well-understood axes of variation: elevation, m (1000 to 4000) and moisture (moist to dry). Vegetation types shown: alpine dwarf scrub, subalpine conifer, wet meadows, lodgepole pine, red fir, Jeffrey pine, pinyon, mixed conifers, ponderosa pine, montane hardwood, chaparral, hardwoods, valley-foothill hardwood, annual grassland. Redrawn from Vankat (1982) with kind permission of the California Botanical Society.]

Bray-Curtis ordination

- This technique is good for introducing the concept of ordination, but is almost never used nowadays.
- It dates back to the late 1950s when computers were unavailable, and had the advantage that it could be run by hand.
- 3 steps:
- 1: Convert raw data to a matrix of distances between samples.
- 2: Identify end points.
- 3: Plot each sample on a graph in relation to the end points.

Sample data - a succession:

Year    A    B    C
1     100    0    0
2      90   10    0
3      80   20    5
4      60   35   10
5      50   50   20
6      40   60   30
7      20   30   40
8       5   20   60
9       0   10   75
10      0    0   90

Choose a measure of distance between years:

The usual one is the Bray-Curtis index (the Czekanowski index).

Between year 1 & 2:

          A    B    C
Y1      100    0    0
Y2       90   10    0
Minimum  90    0    0   =  90
Sum     190   10    0   = 200

distance = 1 - 2*90/200 = 0.1
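The same arithmetic can be written as a few lines of Python. This is only a sketch of the calculation shown above; the function name `bray_curtis` is my own label, not something from a package.

```python
def bray_curtis(x, y):
    """Bray-Curtis distance: 1 - 2 * (sum of minima) / (sum of all values)."""
    shared = sum(min(a, b) for a, b in zip(x, y))  # the "Minimum" row, summed
    total = sum(x) + sum(y)                        # the "Sum" row, summed
    return 1 - 2 * shared / total


year1 = [100, 0, 0]   # species A, B, C in year 1
year2 = [90, 10, 0]   # species A, B, C in year 2

# Should come out at 0.1 (give or take floating-point rounding).
print(round(bray_curtis(year1, year2), 3))
```

Two samples with no species in common score the maximum distance of 1, which is why years 1 and 10 end up 1.0 apart.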

The matrix of Bray-Curtis distances between each of the 10 years. Notice that the matrix is symmetrical about the leading diagonal, so only the lower half is shown.

PS I did this by hand.

PPS it took about 30 minutes!

      1     2     3     4     5     6     7     8     9     10
1   0.00
2   0.10  0.00
3   0.29  0.12  0.00
4   0.41  0.28  0.15  0.00
5   0.54  0.45  0.33  0.16  0.00
6   0.65  0.50  0.45  0.28  0.12  0.00
7   0.79  0.68  0.54  0.38  0.33  0.27  0.00
8   0.95  0.84  0.68  0.63  0.56  0.49  0.26  0.00
9   1.00  0.89  0.74  0.79  0.71  0.63  0.43  0.18  0.00
10  1.00  1.00  0.95  0.90  0.81  0.73  0.56  0.33  0.14  0.00

Now establish the end points. This is a subjective choice; I chose years 1 and 10 as being extreme values.

- Now draw a line with 1 at one end, 10 at the other.
- The length of this line = a distance of 1.0, based on the matrix above.

[Figure: a line of length 1.0 joining year 1 to year 10.]

Now locate year 2, which is 1.0 from year 10 and 0.1 from year 1. This can be done with circles: draw one circle of radius 0.1 centred on year 1 and one of radius 1.0 centred on year 10. Year 2 lies where the two circles intersect.

[Figure: the two circles around years 1 and 10, with year 2 located at their intersection.]

The final ordination of the points (approx.)

[Figure: years 1 to 10 plotted in relation to the two end points, years 1 and 10.]
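The circle-drawing step can be done numerically rather than with compasses. The sketch below assumes the usual geometry of two intersecting circles: with the end points a distance L apart, a sample at distance dA from the first end point and dB from the second sits at x = (L² + dA² - dB²) / (2L) along the line, and at height y = sqrt(dA² - x²) above it. The function names and the clamping of small negative values to zero are my own choices, and distances are recomputed from the raw data, so a few may differ slightly from the hand-calculated matrix.

```python
import math


def bray_curtis(x, y):
    """Bray-Curtis distance between two samples."""
    shared = sum(min(a, b) for a, b in zip(x, y))
    return 1 - 2 * shared / (sum(x) + sum(y))


# Sample succession data: species A, B, C for years 1-10.
data = {
    1: (100, 0, 0), 2: (90, 10, 0), 3: (80, 20, 5), 4: (60, 35, 10),
    5: (50, 50, 20), 6: (40, 60, 30), 7: (20, 30, 40), 8: (5, 20, 60),
    9: (0, 10, 75), 10: (0, 0, 90),
}


def polar_ordination(samples, end_a, end_b):
    """Place every sample relative to two chosen end points."""
    length = bray_curtis(samples[end_a], samples[end_b])
    coords = {}
    for key, values in samples.items():
        d_a = bray_curtis(values, samples[end_a])
        d_b = bray_curtis(values, samples[end_b])
        # Position along the line joining the end points.
        x = (length**2 + d_a**2 - d_b**2) / (2 * length)
        # Height above the line; clamp at zero in case the circles do not quite meet.
        y = math.sqrt(max(d_a**2 - x**2, 0.0))
        coords[key] = (x, y)
    return coords


coords = polar_ordination(data, end_a=1, end_b=10)
for year, (x, y) in coords.items():
    print(year, round(x, 2), round(y, 2))
```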

Principal Components Analysis (PCA)

- PCA is our first true multivariate technique, and is one of my favourites.
- It is fairly close to multiple linear regression (MLR), with one big conceptual difference (and a hugely different output).
- MLR fits the points vertically to one special variable (Y, the dependent).
- PCA fits them orthogonally - all variables are equally important, there is no dependent.

[Figure: MLR shown on axes Y, X1 and X2; PCA shown on axes V1, V2 and V3.]

- MLR seeks to find the best hyper-plane through the data.
- PCA [can be thought of as] starting off by fitting one best LINE through the cloud of datapoints, like shining a laser beam through an array of tethered balloons. This line passes through the middle of the dataset (defined as the mean of each variable), and runs along the axis which explains the greatest amount of variation within the data.
- This first line is known as the first principal axis – it is the most useful single line that can be fitted through the data (formally the linear combination of variables with the greatest variance).

[Figure: The 1st principal axis of a dataset, shown as a laser in a room of tethered balloons, passing through the overall mean of the dataset.]

Often the first axis of a dataset will be an obvious simple source of variation: overall size (body/catchment size etc) for allometric data, experimental treatment in a well designed experiment.

A good indicator of the importance of a principal axis is the % variance it explains. This will tend to decrease as the number of variables (= the number of axes in the dataspace) increases. For half-decent data with 10-20 variables you expect c. 30% of variation on the 1st principal axis.

Having fitted one principal axis, you can fit a 2nd. It will explain less variance than the 1st axis.

This 2nd axis explains the maximum possible variation, subject to 2 constraints:

- 1: it is orthogonal (at 90 degrees) to the 1st principal axis
- 2: it runs through the mean of the dataset.

[Figure: The 2nd principal axis of a dataset, shown in blue, passing through the overall mean of the dataset. This diagram shows a 3D dataspace (axes being V1, V2 and V3).]

V1    V2    V3
1.0   0.1   0.3
0.3   1.0   2.1
0.5   2.0   2.0
1.3   0.8   1.0
1.5   1.5   1.2
2.5   2.2   2.5
2.5   2.8   2.5
3.4   3.4   3.5

- The first 2 principal axes define a 2D plane, on which the positions of all the datapoints can be projected.
- Note that this is exactly the same as casting a shadow.
- Thus PCA allows us to examine the shadow of a high-dimensional object, say a 30-dimensional dataspace (defined by 30 variables such as species density).
- Such a projection is a basic ordination diagram.
- I always use such diagrams when exploring data – the scattergraph of the 1st 2 principal axes of a dataset is the most genuinely informative description of the entire data.

A PCA can generate as many principal axes as there are axes in the original dataspace, but each one is progressively less important than the one preceding it.

In my experience the first axis is usually intelligible (often blindingly obvious), the second often useful, the third rarely useful, 4th and above always seem to be random noise.

- Inspecting the scattergraph is useful, but you need to know the importance of each variable with respect to each axis. This is the gradient of the axis with respect to that variable.
- This is given as a standard in PCA output, but you need to know how the numbers are arrived at to know what the gibberish means!

- 1: Derive the matrix of all correlation coefficients – the correlation matrix. Note the similarity to Bray-Curtis ordination: we start with N columns of data, then derive an N*N matrix informing us about the relationship between each pair of columns.
- 2: Derive the eigenvectors and eigenvalues of this matrix. It turns out that MOST multivariate techniques involve eigenvector analysis, so you may as well get used to the term!

Eigenvectors 1:

- Understanding eigenvectors involves you knowing a little about matrix multiplication.
- Matrix multiplication is essentially the same as solving a shopping bill!
- I have 2 apples, 1 banana and 3 oranges.
- You have 1 apple, 2 bananas and 3 oranges.
- Costs: 10p per apple, 15p per banana, 20p per orange.
- I pay 2*10 + 1*15 + 3*20 = 95p
- You pay 1*10 + 2*15 + 3*20 = 100p

As a matrix multiplication:

my amounts:    2 1 3       10       95   (my total)
your amounts:  1 2 3   X   15   =  100   (your total)
                           20

Such a calculation is called a linear combination.
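If you want to see the shopping bill as an actual matrix multiplication, here is a minimal sketch using NumPy (my choice of tool; any matrix package would do). With oranges at 20p each, the two totals come to 95p and 100p.

```python
import numpy as np

# One row per shopper, one column per fruit (apples, bananas, oranges).
amounts = np.array([[2, 1, 3],    # my amounts
                    [1, 2, 3]])   # your amounts

# Price per fruit in pence, in the same order as the columns above.
prices = np.array([10, 15, 20])

# Each total is a linear combination: amount * price, summed across fruit.
totals = amounts @ prices
print(totals.tolist())  # [95, 100]
```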

Eigenvectors 2:

- Are intimately linked with matrix multiplication. You don’t have to know this bit, but it would help!
- Take an [N*N] matrix M, and use it to multiply a [1*N] vector of 1’s.
- This gives you a new [1*N] vector, of different numbers. Call this V1
- Multiply V1 by M, to get V2.
- Multiply V2 * M to get V3, etc.
- After enough repetitions the elements of V will settle down to a steady pattern. This is the dominant eigenvector of the matrix M.
- Each time this eigenvector is multiplied by M it grows by a constant multiple, which is the first eigenvalue of M.

[Figure: the vector of 1's multiplied by M gives V1; V1 multiplied by M gives V2; V2 multiplied by M gives V3.]

After a while, each successive multiplication preserves the shape (the eigenvector) while increasing values by a constant multiple (the eigenvalue).
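The repeated multiplication described above is usually called power iteration, and it is easy to try for yourself. The sketch below is illustrative only: the 3*3 matrix is a made-up correlation matrix, the function name is mine, and I rescale the vector to unit length at each step so that the numbers do not blow up (the growth removed by that rescaling is the eigenvalue).

```python
import numpy as np


def dominant_eigen(matrix, n_iter=500):
    """Repeatedly multiply a vector of 1's by the matrix (power iteration)."""
    m = np.asarray(matrix, dtype=float)
    v = np.ones(m.shape[0])
    v = v / np.linalg.norm(v)          # start from a unit-length vector of 1's
    growth = 0.0
    for _ in range(n_iter):
        w = v @ m                      # [1*N] vector times [N*N] matrix
        growth = np.linalg.norm(w)     # how much the vector grew this step
        v = w / growth                 # keep the shape, discard the growth
    return growth, v


# Hypothetical correlation matrix, for illustration only.
M = np.array([[1.0, 0.8, 0.6],
              [0.8, 1.0, 0.7],
              [0.6, 0.7, 1.0]])

growth, vector = dominant_eigen(M)
print("first eigenvalue:", round(growth, 4))
print("dominant eigenvector:", np.round(vector, 4))
```

In practice you would ask the package for all eigenvectors at once (for example with `np.linalg.eigh`), but the loop shows where the numbers come from.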

- Axis scores are derived by multiplying the source data by the corresponding eigenvector elements, then adding these together.
- Thus the projection of site A on the first principal axis is based on the calculation:
- (spp1A*V11 + spp2A*V21 + spp3A*V31 …)
- Where
- spp1A = number of species 1 at site A, etc
- V21 = 1st eigenvector element for species 2

- Luckily the PC does this for you.
- There is one added complication: you do not usually use raw data in the above calculation. It is possible to do so, but the resulting scores are very dominated by the commonest species.
- Instead all species data is first converted to Z scores, so that mean = 0.00 and sd = 1.00
- This means that principal axes, which always run through the mean, are always centred on the origin 0,0,0,0,…
- It also means that typically half the numbers in a PCA output are negative, and all are apparently unintelligible!

- 1: Derive eigenvectors of correlation matrix. Call the 1st one E1.
- 2: convert all raw data into Z scores (mean = 0, sd = 1.0)
- 3: 1st axis scores = Z * E1 (where Z is an N*R matrix of Z scores, N observations by R variables, and E1 is a vector of R elements).
- 2nd axis scores = Z*E2, etc.
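The three steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the routine any particular package uses. For data I have taken the V1, V2, V3 values from the 3D dataspace figure, assuming each position in the three columns is one observation; the variable names are my own.

```python
import numpy as np

# Eight observations of V1, V2, V3 (assumed to be paired by position).
data = np.array([
    [1.0, 0.1, 0.3], [0.3, 1.0, 2.1], [0.5, 2.0, 2.0], [1.3, 0.8, 1.0],
    [1.5, 1.5, 1.2], [2.5, 2.2, 2.5], [2.5, 2.8, 2.5], [3.4, 3.4, 3.5],
])

# Step 1: eigenvectors and eigenvalues of the correlation matrix.
corr = np.corrcoef(data, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(corr)
order = np.argsort(eigenvalues)[::-1]          # largest eigenvalue first
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]          # column k is the k-th eigenvector

# Step 2: convert raw data to Z scores (mean = 0, sd = 1).
z = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)

# Step 3: axis scores = Z scores multiplied by the eigenvectors.
scores = z @ eigenvectors

percent_variance = 100 * eigenvalues / eigenvalues.sum()
print("% variance per axis:", np.round(percent_variance, 1))
print("1st axis loadings:", np.round(eigenvectors[:, 0], 2))
print("1st axis scores:", np.round(scores[:, 0], 2))
```

Do not be surprised if another package gives the same loadings with every sign flipped; an eigenvector and its negative describe the same axis.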

- What remains is interpreting the output!
- DON’T PANIC!!
- Stage 1: look at the variance on the 1st axis. Values above 50% are hopeful, values in the 20s suggest a poor or noisy dataset. There is no significance test, but referral to a table of predicted values from the broken stick distribution is often helpful.
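If you have no broken stick table to hand, the predicted values are easy to generate. The sketch below assumes the standard broken stick expectation, which is not spelled out in these notes: for p axes, the k-th axis is expected to carry a proportion (1/p) * (1/k + 1/(k+1) + ... + 1/p) of the variance. The function name is my own.

```python
def broken_stick(n_axes):
    """Expected proportion of variance on each axis under the broken stick model."""
    return [
        sum(1.0 / i for i in range(k, n_axes + 1)) / n_axes
        for k in range(1, n_axes + 1)
    ]


# Expected % variance for a dataset with 10 variables (hence 10 axes).
for axis, share in enumerate(broken_stick(10), start=1):
    print(axis, round(100 * share, 1))
```

An axis that explains clearly more variance than its broken stick value is worth a look; one that explains less is probably noise.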

- Stage 2: Look at the eigenvector elements on the first axis. Which species / variables have large loadings (positive or negative)? Do some variables have opposed loadings (one very +ve, one very –ve), suggesting a gradient between the two extremes?
- The actual values of the eigenvector elements are meaningless – it is their pattern which matters. In fact the sign of elements will sometimes differ between packages if they use different algorithms! It’s OK! The diagrams will look the same, it’s just that the pattern of the points will be reversed.

- Stage 3: plot the scattergraph of the ordination scores and have a look.
- This is the shadow of the dataset.
- Look for clusters, outliers, gradients. Each point is one observation, so you can identify odd points and check the values in them. (Good way to pick up typing errors).
- Overlay the graph with various markers – often this picks out important trends.

Data on Collembola of industrial waste sites.

The 1st axis of a PCA ordination detected habitat type: woodland vs open lagoon, while the second detected succession within the Tilbury trial plots.

- Since there is an intimate connection between eigenvector loadings and axis scores, it is helpful to inspect them together.
- There is an elegant solution here, known as a biplot.
- You plot the site scores as points on a graph, and put eigenvector elements on the same graph (usually as arrows).

Pond community handout data – site scores plotted by SPSS

These are the new variables which appear after PCA: FAC1 and FAC2. Note aquatic sites at left hand side.

[Figure: Factor scores (= eigenvector elements) for each species in the pond dataset, plotted on AX1 (-1.0 to 1.0) against AX2 (0.0 to 0.8). Species labels: potamoge, epilob, ranuscle, phragaus, potentil. Note aquatic species at left hand side.]

Pond community handout data – the biplot.

Note aquatic sites and species at left hand side. Note also that this is only a description of the data – no hypotheses are stated and no significance values can be calculated.

[Figure: the biplot, with species labelled Potamog, Epilobium, Ranunc, Potent. and Phrag.]

Correspondence analysis (CA), also known as Reciprocal Averaging (RA)

This technique is widely used, and relates to the notion of the weighted mean.

1: Take an N*R matrix of sites*species data, and calculate the weighted mean score for each site, giving each species an arbitrary weighting.

Now use the site scores to get a weighted mean for each species.

2: repeat stage 1 until the scores stabilise.
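The reciprocal averaging loop can be sketched directly. Treat this as an illustration under assumptions rather than what any package does: the notes do not say how scores are kept from collapsing to a single value, so I rescale the species scores to run from 0 to 1 after every pass (a common device), and I reuse the succession data from the Bray-Curtis example, with years as sites. Function and variable names are my own.

```python
import numpy as np


def reciprocal_averaging(table, n_iter=1000, tol=1e-10):
    """First-axis site and species scores by repeated weighted averaging."""
    table = np.asarray(table, dtype=float)
    site_totals = table.sum(axis=1)
    species_totals = table.sum(axis=0)
    # Arbitrary starting weights for the species: 0, 1, 2, ...
    species = np.arange(table.shape[1], dtype=float)
    for _ in range(n_iter):
        # Site score = weighted mean of the species scores at that site.
        sites = table @ species / site_totals
        # Species score = weighted mean of the site scores where it occurs.
        new_species = sites @ table / species_totals
        # Rescale to 0-1 so the scores cannot shrink towards one value.
        new_species = (new_species - new_species.min()) / (
            new_species.max() - new_species.min()
        )
        converged = np.max(np.abs(new_species - species)) < tol
        species = new_species
        if converged:
            break
    sites = table @ species / site_totals
    return sites, species


# Succession data: rows are years 1-10, columns are species A, B, C.
data = np.array([
    [100, 0, 0], [90, 10, 0], [80, 20, 5], [60, 35, 10], [50, 50, 20],
    [40, 60, 30], [20, 30, 40], [5, 20, 60], [0, 10, 75], [0, 0, 90],
])

sites, species = reciprocal_averaging(data)
print("species scores:", np.round(species, 3))
print("site scores:", np.round(sites, 3))
```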

- CA has one notable property, namely that it derives scores for species and sites at the same time. These can be displayed in a biplot, just as with PCA. This is conceptually simpler than a PCA biplot due to the simultaneous derivation of scores.
- The formal algorithm involved here is almost the same as PCA – except that (in theory) you extract eigenvectors of the matrix of chi-squared distances between samples or species, instead of the correlation coefficients.

- The horseshoe effect (arch distortion) concerns the ordination diagram produced when analysing a succession, or a community which changes steadily along a transect in space.

You would expect this pattern, and would be wrong!!

[Figure: the expected ordination, with Years 1 to 6 plotted in sequence on Axis 1 and Axis 2.]

Idealised successional data (used in Bray-Curtis ordination last week).

[Figure: PCA ordination of idealised successional data, showing the Horseshoe effect (arch distortion).]

- Ideas were that the arch represented a bug in the techniques, or a hitherto undiscovered fundamental truth about ecological successions.
- Neither is the case. The algorithm simply tells the truth! It’s just that humans are no good at visualising high-dimensional spaces. If they were they would know that successions must be arched in a space where distance = difference between sites.
- Think about the difference between an early-successional site and a mid-successional one. It is likely to be absolute – no species in common. How about an early vs a late successional site? The same, not more.

[Figure: sites 1 to 4 along a succession, with pairs of sites marked "max. separation" and "not far!".]

The only way to ensure that the early-mid distance = early-late distance is for the succession to be portrayed as a curve. It is!

[Figure: the succession drawn as an arch on Axis 1 and Axis 2, running from Early through mid to Late.]
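You can check the "the same, not more" argument with numbers. The three sites below are invented for illustration (six hypothetical species, none shared between stages), and I use the Bray-Curtis index as the measure of difference; once two sites have nothing in common, the distance cannot grow any further.

```python
def bray_curtis(x, y):
    """Bray-Curtis distance between two samples."""
    shared = sum(min(a, b) for a, b in zip(x, y))
    return 1 - 2 * shared / (sum(x) + sum(y))


# Hypothetical abundances of six species at three successional stages.
early = [50, 30, 0, 0, 0, 0]
mid = [0, 0, 40, 40, 0, 0]
late = [0, 0, 0, 0, 60, 20]

early_mid = bray_curtis(early, mid)
early_late = bray_curtis(early, late)

# Both distances are already at the maximum of 1.0.
print(early_mid, early_late)
```

Since late sites cannot sit any further from early sites than mid sites already do, the only layout that respects the distances is a curve.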

- In analysing the communities of fungi in pine forests (which roughly corresponded to a succession), I found that the 2nd axis of the PCA correlated significantly with diversity (Shannon index).
- Getting my head round why the 2D projection of a 10D space should correlate with the information content of the community was not exactly easy! Well done to my supervisor for pointing out that the arch distortion and community diversity both peaked mid-succession.

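[Figure: Ax2 plotted against Ax1 forms an arch from Young through Medium to Old, hence Ax2 peaks at medium site age. Also, Ax2 rises with Diversity. Hence Diversity peaks at medium site age.]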

- This arch effect was a well known irritant to ecologists by the mid 1970s, and was “sorted out” by a piece of software written by Mark Hill (then and still of ITE).
- The program was called DECORANA, for DEtrended CORrespondence ANAlysis. The algorithm is properly called Detrended correspondence analysis or DCA - it is a common minor misnomer to confuse the algorithm and Mark Hill’s FORTRAN code which implements the algorithm.

- In fact recently a minor bug has been found in the source code - it could invalidate a few analyses, but in practice seems of minor importance (it concerns the way eigenvalues are sorted when values are tied).
- Post-1997 issues of DCA should be safe. I still use an older version, with a reasonably clean conscience!
- It is not supplied in any standard package, but is widely available in ecological packages - we have a copy in the dept.

- The algorithm uses CA rather than PCA (faster extraction, + simultaneous extraction of species and site scores).
- It is iterative: the 1st pass is simply a 2D CA ordination.
- The 1st axis is then chopped up into N subsegments (default = 26), and each one rescaled to have a common mean.

- This is to ensure that the extremes get equal representation (the technical term is “a constant turnover rate”).
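The segment step is easier to follow with a toy example. The sketch below is a simplification under stated assumptions, not DECORANA's actual code: I take it that the 2nd axis scores are the ones adjusted, that segments are equal-width slices of the 1st axis, and that "a common mean" means subtracting each segment's mean so that all segments end up with a mean of zero. The function name and the artificial arch are my own.

```python
import numpy as np


def detrend_by_segments(axis1, axis2, n_segments=26):
    """Give the axis-2 scores in every axis-1 segment the same mean (zero)."""
    axis1 = np.asarray(axis1, dtype=float)
    axis2 = np.asarray(axis2, dtype=float)
    # Equal-width segment boundaries along the first axis.
    edges = np.linspace(axis1.min(), axis1.max(), n_segments + 1)
    # Using only the interior edges gives segment numbers 0 .. n_segments - 1.
    segment = np.digitize(axis1, edges[1:-1])
    detrended = axis2.copy()
    for s in np.unique(segment):
        in_segment = segment == s
        detrended[in_segment] -= axis2[in_segment].mean()
    return detrended


# An artificial arch: axis 2 is a quadratic function of axis 1.
axis1 = np.linspace(0.0, 1.0, 260)
axis2 = -(axis1 - 0.5) ** 2

flat = detrend_by_segments(axis1, axis2)
print("arch height before:", round(axis2.max() - axis2.min(), 3))
print("arch height after: ", round(flat.max() - flat.min(), 3))
```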

[Figure: Before detrending, points 1 to 7 form an arch on Ax2 against Ax1. After detrending, points 1 to 7 lie in a line along Ax1.]

- This is repeated until the iterations converge. This gives the 1st DCA axis. It has an eigenvalue, species and site scores just like CA - but don’t ask what the numbers actually mean!!
- The procedure is then repeated to get a 2nd axis - same procedure, but the 1st axis is removed by subtraction at each stage.
- DECORANA also gives a 3rd axis by the same algorithm, then stops. Note that the DCA algorithm could give N axes. As I said, 4th axes and above tend to be merely noise.

- Results are displayed as biplots, although the scores can be subject to ANOVA etc.
- There is no hypothesis inherent in DCA so no significance test inherent.
- It is excellent for dealing with successions etc, and is many ecologists' ordination of choice.
- I try to avoid it only because the input data format required (Cornell Condensed Format) is an unmitigated headache! (Unless you are happy with fixed columns and Fortran format statements).

"The availability of computer packages of classification techniques has led to the waste of more valuable scientific time than any other statistical innovation (with the possible exception of multiple regression techniques)." Cormack (1971) A review of classification. Journal of the Royal Statistical Society A 134, 321-367.

Here the aim is to aggregate individuals into clusters, based on an objective estimate of the distance between them. The aim is to produce a dendrogram:

This involves 2 choices, both allowing many options:

1: How to measure the distance?

2: What rules to build the dendrogram by?

In practice there are c. 3 standard distance measures and c. 5 standard sets of rules to build the dendrogram, giving 15 different algorithms for cluster analysis.

In addition to these 15 different STANDARD ways of making a dendrogram, there are other options. One package (called CLUSTAN) offers a total of around 180 different algorithms.

But each different algorithm can generate a different shape of the dendrogram, a different pattern of relationships – and you have no way of knowing which one is most useful.

For a book I explored a small number of datasets in painful detail using various multivariate techniques, and they all gave the same basic story, identified the same extreme values and clusters.

Except cluster analysis, which told me several different stories, all garbage!!

Worse, if you do happen to find a dendrogram sequence that makes sense – be careful, it may be lying to you! Dendrograms can be re-arranged around any joint (or node), and quite distantly related points can end up side-by-side.

[Figure: the dendrogram with leaf order D C B A (C and B next to each other, but connected only by a high-level node) is the same dendrogram as the one with leaf order B A D C (C and B far apart).]

The 8 different ways of presenting the dendrogram of the quarry floor data (Usher 1975) using nearest neighbour analysis on a matrix of euclidean distances.

These patterns can re-arrange around any node, like a child’s mobile.

The number of permutations of a dendrogram with n points is 2^(n-1): 8 for the 4 points here.

A B C D

B A C D

A B D C

B A D C

C D A B

D C A B

C D B A

D C B A
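You can generate these eight arrangements by flipping the dendrogram at every node in turn. In the sketch below the tree shape ((A, B), (C, D)) is my reading of the eight orders listed above, and the function name is my own; each of the n-1 = 3 nodes can be drawn either way round, giving 2^3 = 8 leaf orders.

```python
def leaf_orders(tree):
    """Every left-to-right leaf order obtainable by flipping nodes of the tree."""
    if isinstance(tree, str):          # a single leaf cannot be flipped
        return [tree]
    left, right = tree
    orders = []
    for left_order in leaf_orders(left):
        for right_order in leaf_orders(right):
            orders.append(left_order + right_order)   # node drawn one way
            orders.append(right_order + left_order)   # node flipped
    return orders


tree = (("A", "B"), ("C", "D"))
orders = leaf_orders(tree)
print(len(orders))
print(sorted(orders))
```

Note that an order such as A C B D never appears: flipping can move C next to B, but it cannot split A from B.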

This conjunction of S4 (top) and S1, S2, S3 below it was mentioned as showing the Lake Superior samples clustered together.

I have cut&pasted this dendrogram to show an alternative, valid arrangement, in which the Lake Superior samples are widely scattered.

The lesson here – ignore the ordering of objects along a cluster analysis axis! (Unlike all ordinations, in which order matters and is preserved under transformations.)

I know of at least one case of published work using a dendrogram to classify lake communities, where one stated discovery was that samples from one particular lake clustered together. It is true that they were next to each other on the published dendrogram, but there was a fault line down the middle of the cluster, which could have been re-arranged to show it as 2 widely spaced clusters!

My summary about cluster analysis – DON’T!! Ordinate it, or use TWINSPAN. I do have concern about the over-reliance of DNA researchers on dendrograms, although they seem to operate with fewer choices (3 distance measures, 6 ways to build up a tree) and usually publish topologically valid trees!

"The best form of cluster analysis is ordination, because ordination is not a form of cluster analysis". Byron Morgan, personal communication.

TWINSPAN was written by Mark Hill (of ITE, now CEH, author of DECORANA) in 1979, and has DCA at its core. The aim is to produce a 2-way ordered classification of the data. This produces 2 linked dendrograms (one of species, one of sites), but these are fixed (they may not rotate), and in my experience make a great deal of sense. It also shows you which species are most associated with which areas and identifies indicator species for different parts of the dendrograms. Like DCA, this analysis has so far been confined to ecologists but deserves to be used far more widely.

Its technical details are a lecture in themselves – read it up well if you choose to use this.

TWO-WAY ORDERED TABLE

Ordered list of observations (columns; observation numbers read vertically):

1223331111222222 22333111 1 1
52912516781345788906034049143235672

Ordered list of species (rows), with the dendrogram to classify species coded in the right-hand column:

2 CORTSEMI 455344---1-----1----2-------------- 000
4 INOCLACE 413-----1-------------------------- 000
6 LACTRUFU --5411-------------223------------- 000
9 SUILLUTE 24423--1-3--11--------------------- 000
1 BOLEFERR --1-2-22232224-2-22-1-2-----1------ 001
10 SUILVARI 555-2--4-44525552-33335------------ 001
3 GOMPROSE 3325555555555555152444342---------- 01
8 SUILBOVI 55555555555555553555555555--------- 01
5 LACCPROX 54555455555535535555555555555542555 1
7 PAXIINVO --221232--22212-1514244554122553345 1

Dendrogram to classify observations:

00000000000000000000000111111111111
00000011111111111111111000111111111
00011100000000001111111 000111111

- CANOCO is a package, but just like DECORANA it has become associated with the algorithm it implements.
- To be correct, CANOCO implements several ordination algorithms but the best known and most used is CCA = Canonical Correspondence Analysis.

- CCA deals with the common ecological situation where you wish to relate community data to environmental data.
- Which spp are associated with which chemicals? Is the pattern random?

[Figure: a matrix of species data set alongside a matrix of environmental data with columns pH, Na, K and OM.]

- The calculations behind CCA are nightmarish, unless you enjoy advanced matrix algebra!
- The output is not much better!
- Luckily it can be converted into 2 user-friendly facets:
- a tri-plot
- a significance value

[Figure: CCA triplot of Collembola succession on PFA in relation to soil conditions. Environmental variables shown: LOI, pH, Conductivity. Site groups: Fresh wet PFA, Young dry PFA, Woodland stage PFA, Barking woods, Thurrock woods, Thurrock lagoon, TilbO, TilbZ, TilbC. Annotations: increasing age of PFA; separation of Thurrock’s saline lagoon. Other labels: P, Em, Im, Hv, Ipal, Ssp, Ct, Lc, Bp, Sv, Ivp, Tm, Ll, Ip, Se, Fc, In, Fm, Ivg, Enico.]

- The significance test takes H0 = no association between community and environment.
- It is tested by obtaining an eigenvalue corresponding to the linkage in your actual data, then randomly shuffling the environmental data and re-calculating this eigenvalue. Repeat 200-1000 times, and see how your true value ranks among the randomly shuffled values.
- This is a Monte-Carlo test - the way forward for inferential testing (IMHO).
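The shuffling logic is simple enough to sketch. To keep the example short I have swapped the CCA eigenvalue for a much simpler stand-in statistic, the absolute correlation between one community score and one environmental variable; the data are simulated, and all names are my own. The shuffle-and-rank procedure is the part that carries over.

```python
import numpy as np


def permutation_test(community, environment, n_perm=999, seed=0):
    """Monte-Carlo test of H0: no association between the two variables."""
    rng = np.random.default_rng(seed)

    def statistic(c, e):
        # Stand-in for the CCA eigenvalue: strength of linear association.
        return abs(np.corrcoef(c, e)[0, 1])

    observed = statistic(community, environment)
    # Shuffle the environmental data and recalculate, n_perm times.
    as_extreme = sum(
        statistic(community, rng.permutation(environment)) >= observed
        for _ in range(n_perm)
    )
    # Rank of the true value among the shuffled values (true value included).
    p_value = (as_extreme + 1) / (n_perm + 1)
    return observed, p_value


# Simulated data: 20 sites, community score loosely driven by pH.
rng = np.random.default_rng(1)
ph = np.linspace(4.0, 8.0, 20)
site_scores = 0.5 * ph + rng.normal(0.0, 0.3, size=20)

observed, p_value = permutation_test(site_scores, ph)
print("observed statistic:", round(observed, 3))
print("p value:", p_value)
```

With 999 shuffles the smallest p value you can get is 0.001, which is why the number of repeats matters.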