Ordinations. Peter Shaw. Introduction to ordination. “Ordination” is the term for arranging multivariate data in a rational order, usually in a 2D space (i.e. a diagram on paper).
A direct ordination, projecting communities onto time: the sand dune succession at the southern end of Lake Michigan (re-drawn from Olson 1958)
Figure labels: bare sand stabilised by marram grass; cottonwood (Populus deltoides); pines (Pinus spp.); black oak (Quercus velutina). Axis ticks: 0, 350, 600, 850, 1100, 3500, 10,000, 4000.
Alpine dwarf scrub
A 2-D direct ordination, also called a mosaic diagram, in this case showing the distribution of vegetation types in relation to elevation and moisture in Sequoia National Park. It lays out communities in relation to two well-understood axes of variation. Redrawn from Vankat (1982) with kind permission of the California Botanical Society.
Bray-Curtis ordination
Sample data - a succession:
Year    A    B    C
  1   100    0    0
  2    90   10    0
  3    80   20    5
  4    60   35   10
  5    50   50   20
  6    40   60   30
  7    20   30   40
  8     5   20   60
  9     0   10   75
 10     0    0   90
Choose a measure of distance between years:
The usual one is the Bray-Curtis index (also known as the Czekanowski index).
Between year 1 & 2:
            A    B    C
Y1        100    0    0
Y2         90   10    0
Minimum    90    0    0   = 90
Sum       190   10    0   = 200
distance = 1 - 2*90/200 = 0.1
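The arithmetic above can be sketched in a few lines of Python. This is only an illustration of the hand calculation: the function name `bray_curtis` is made up for this example, and it assumes the two samples list the same species in the same order and are not both empty.

```python
def bray_curtis(sample_a, sample_b):
    """Bray-Curtis distance: 1 - 2 * (sum of minima) / (sum of all values)."""
    shared = sum(min(a, b) for a, b in zip(sample_a, sample_b))
    total = sum(sample_a) + sum(sample_b)
    return 1 - 2 * shared / total


year1 = [100, 0, 0]   # species A, B, C in year 1
year2 = [90, 10, 0]   # species A, B, C in year 2
print(round(bray_curtis(year1, year2), 2))   # 0.1, as in the worked example
```

Two years with no species in common share a minimum of zero for every species, so their distance comes out as 1.0, the largest possible value.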
The matrix of Bray-Curtis distances between each of the 10 years. Notice that the matrix is symmetrical about the leading diagonal, so only the lower half is shown.
PS I did this by hand.
PPS it took about 30 minutes!
       1     2     3     4     5     6     7     8     9    10
 2  0.10  0.00
 3  0.29  0.12  0.00
 4  0.41  0.28  0.15  0.00
 5  0.54  0.45  0.33  0.16  0.00
 6  0.65  0.50  0.45  0.28  0.12  0.00
 7  0.79  0.68  0.54  0.38  0.33  0.27  0.00
 8  0.95  0.84  0.68  0.63  0.56  0.49  0.26  0.00
 9  1.00  0.89  0.74  0.79  0.71  0.63  0.43  0.18  0.00
10  1.00  1.00  0.95  0.90  0.81  0.73  0.56  0.33  0.14  0.00
Now establish the end points. This is a subjective choice; I choose years 1 and 10 as being extreme values.
Distance between years 1 and 10 = 1.0
Now locate observation 2, which is 1.0 from year 10 and 0.1 from year 1. This can be done by circles.
Distance 1<->2 = 0.1 units, distance 2<->10 = 1.0 units, so draw 2 circles: one of radius 0.1 centred on year 1 and one of radius 1.0 centred on year 10.
Year 2 locates where the two circles intersect.
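The circle construction can also be done numerically. The sketch below uses the ordinary geometry of two intersecting circles; the function name and argument names are invented for this example, and it assumes the first end point sits at position 0 on the axis and the second at a distance `d_ends` along it.

```python
import math


def place_between_endpoints(d_first, d_second, d_ends):
    """Locate a point from its distances to the two end points.

    Returns (x, h): x is the position along the axis from the first end
    point (x = 0) to the second (x = d_ends), h is the height above the axis.
    """
    x = (d_ends ** 2 + d_first ** 2 - d_second ** 2) / (2 * d_ends)
    h_squared = d_first ** 2 - x ** 2
    # Dissimilarities need not obey the triangle inequality, so the two
    # circles may fail to meet; in that case the height is clamped to zero.
    h = math.sqrt(max(h_squared, 0.0))
    return x, h


# Year 2: 0.1 from year 1, 1.0 from year 10; end points are 1.0 apart.
x, h = place_between_endpoints(0.1, 1.0, 1.0)
print(round(x, 3), round(h, 3))
```

For year 2 this gives a position of about 0.005 along the axis, i.e. almost on top of year 1, which is what the circles show.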
The final ordination of the points (approx.)
The 1st principal axis of a dataset – shown as a laser in a room of
Overall mean of the dataset
Often the first axis of a dataset will be an obvious simple source of variation: overall size (body/catchment size etc.) for allometric data, or experimental treatment in a well-designed experiment.
A good indicator of the importance of a principal axis is the % variance it explains. This will tend to decrease as the number of variables (= the number of axes in the dataspace) increases. For half-decent data with 10-20 variables you expect c. 30% of variation on the 1st principal axis.
Having fitted one principal axis, you can fit a 2nd. It will explain less variance than the 1st axis. This 2nd axis explains the maximum possible variation, subject to 2 constraints:
1: it is orthogonal (at 90 degrees) to the 1st principal axis
2: it runs through the mean of the dataset.
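To make the idea of % variance and orthogonal axes concrete, here is a minimal sketch for the simplest possible case, a dataset with only two variables, where the principal axes of the covariance matrix can be written down in closed form. The function name `pca_2d` and the measurements are invented for illustration; a real analysis would use a statistics package.

```python
import math


def pca_2d(xs, ys):
    """Principal axes of a two-variable dataset (covariance-based PCA)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # Covariance matrix [[a, b], [b, c]] of the data about its mean.
    a = sum((x - mean_x) ** 2 for x in xs) / (n - 1)
    c = sum((y - mean_y) ** 2 for y in ys) / (n - 1)
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)
    # Eigenvalues of a symmetric 2x2 matrix = variance along each axis.
    middle = (a + c) / 2
    spread = math.sqrt(((a - c) / 2) ** 2 + b ** 2)
    variances = (middle + spread, middle - spread)
    # Direction of the 1st axis; the 2nd is at 90 degrees to it.
    angle = 0.5 * math.atan2(2 * b, a - c)
    axis1 = (math.cos(angle), math.sin(angle))
    axis2 = (-math.sin(angle), math.cos(angle))
    percent = [100 * v / sum(variances) for v in variances]
    return (mean_x, mean_y), axis1, axis2, percent


# Hypothetical allometric data: body length and width of five animals.
length = [1.0, 2.0, 3.0, 4.0, 5.0]
width = [2.1, 3.9, 6.2, 7.8, 10.1]
mean, axis1, axis2, percent = pca_2d(length, width)
print("mean:", mean)
print("1st axis:", axis1, "explains", round(percent[0], 1), "%")
print("2nd axis:", axis2, "explains", round(percent[1], 1), "%")
```

Both axes pass through the mean, the two directions are at right angles, and the two percentages add up to 100. With made-up size data like this, nearly all the variance falls on the 1st axis, which is the "overall size" axis described above.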
The 2nd principal axis of a dataset, shown in blue.
This diagram shows a 3D dataspace (axes being V1, V2 and V3).
Overall mean of the dataset
A PCA can generate as many principal axes as there are axes in the original dataspace, but each one is progressively less important than the one preceding it.
In my experience the first axis is usually intelligible (often blindingly obvious), the second often useful, the third rarely useful, and the 4th and above always seem to be random noise.
2 1 3
1 2 3
Such a calculation is called a linear combination
Eigenvectors 2:
After a while, each successive multiplication preserves the shape (the eigenvector) while scaling the values by a constant factor (the eigenvalue).
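Repeated multiplication of this kind is usually called power iteration. The sketch below is one simple way to do it, not necessarily the calculation shown on the slide: the function name and the small example matrix are invented, the vector is rescaled to unit length after each multiplication so the numbers do not blow up, and it assumes the largest eigenvalue is positive and that the starting vector is not at right angles to its eigenvector.

```python
import math


def power_iteration(matrix, steps=100):
    """Return (growth factor, vector) after repeated multiplication."""
    n = len(matrix)
    vector = [1.0] + [0.0] * (n - 1)   # arbitrary starting vector
    growth = 0.0
    for _ in range(steps):
        # One multiplication of the matrix by the current vector.
        product = [sum(matrix[i][j] * vector[j] for j in range(n))
                   for i in range(n)]
        # How much the vector grew in length: settles on the eigenvalue.
        growth = math.sqrt(sum(p * p for p in product))
        # Rescale to unit length: the shape settles on the eigenvector.
        vector = [p / growth for p in product]
    return growth, vector


eigenvalue, eigenvector = power_iteration([[2.0, 1.0], [1.0, 2.0]])
print(round(eigenvalue, 6), [round(v, 6) for v in eigenvector])
```

For this example matrix the shape settles with both elements equal, and each further multiplication scales it by a factor of 3.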
Data on Collembola of industrial waste sites.
The 1st axis of a PCA ordination detected habitat type: woodland vs open lagoon, while the second detected succession within the Tilbury trial plots.
Pond community handout data – site scores plotted by SPSS
These are the new variables which appear after PCA: FAC1 and FAC2. Note aquatic sites at left hand side.
Factor scores (=eigenvector elements) for each species in pond dataset. Note aquatic species at left hand side.
Pond community handout data – the biplot.
Note aquatic sites and species at left hand side. Note also that this is only a description of the data – no hypotheses are stated and no significance values can be calculated.
This technique is widely used, and relates to the notion of the weighted mean.
1: Take an N*R matrix of sites*species data, give each species an arbitrary weighting (score), and calculate the weighted mean score for each site. Now use the site scores to get a weighted mean for each species.
2: Repeat stage 1 until the scores stabilise.
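The two stages above can be sketched as follows. This is a minimal illustration under some assumptions that are not on the slide: the function names are invented, the site scores are rescaled to run from 0 to 1 on every round (without some rescaling the scores would shrink towards a single value), and every site and every species is assumed to have at least one non-zero entry.

```python
def weighted_means(weight_rows, scores):
    """Weighted mean of `scores` for each row of weights."""
    return [sum(w * s for w, s in zip(row, scores)) / sum(row)
            for row in weight_rows]


def rescale(values):
    """Stretch values to run from 0 to 1."""
    low, high = min(values), max(values)
    return [(v - low) / (high - low) for v in values]


def reciprocal_averaging(table, tolerance=1e-10, max_rounds=1000):
    """table[i][j] = abundance of species j at site i."""
    by_species = [list(column) for column in zip(*table)]
    # Stage 1 starts from arbitrary species scores: here 0, 1, 2, ...
    species_scores = [float(j) for j in range(len(by_species))]
    site_scores = None
    for _ in range(max_rounds):
        new_sites = rescale(weighted_means(table, species_scores))
        species_scores = weighted_means(by_species, new_sites)
        stable = site_scores is not None and max(
            abs(a - b) for a, b in zip(new_sites, site_scores)) < tolerance
        site_scores = new_sites
        if stable:
            break
    return site_scores, species_scores


# The 10-year succession data used earlier (species A, B, C).
succession = [
    [100, 0, 0], [90, 10, 0], [80, 20, 5], [60, 35, 10], [50, 50, 20],
    [40, 60, 30], [20, 30, 40], [5, 20, 60], [0, 10, 75], [0, 0, 90],
]
sites, species = reciprocal_averaging(succession)
print([round(s, 2) for s in sites])
print([round(s, 2) for s in species])
```

The starting scores do not matter much: a different arbitrary weighting should settle on the same ordering of sites, possibly reversed end to end.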
You would expect this pattern – and would be wrong!!
PCA ordination of idealised successional data – the Horseshoe effect (arch distortion).
The only way to ensure that the early-mid distance = early-late distance is for the succession to be portrayed as a curve. It is!
1 2 3 4 5 6 7
1 2 3 4 5 6 7
"The availability of computer packages of classification techniques has led to the waste of more valuable scientific time than any other statistical innovation (with the possible exception of multiple regression techniques)." Cormack (1971) A review of classification. Journal of the Royal Statistical SocietyA 134, 321-367.
Here the aim is to aggregate individuals into clusters, based on an objective estimate of the distance between them. The aim is to produce a dendrogram:
This involves 2 choices, both allowing many options:
1: How to measure the distance?
2: What rules to build the dendrogram by?
In practice there are c. 3 standard distance measures and c. 5 standard sets of rules to build the dendrogram, giving c. 15 different algorithms for cluster analysis.
In addition to these 15 different STANDARD ways of making a dendrogram, there are other options. One package (called CLUSTAN) offers a total of around 180 different algorithms.
But each different algorithm can generate a different shape of the dendrogram, a different pattern of relationships – and you have no way of knowing which one is most useful.
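A small sketch shows how the choice of rule alone changes the tree. The four points and their positions are invented for this example, the distance is simply the gap between positions on a line, and the two rules compared are nearest neighbour (distance between clusters = smallest gap between members) and furthest neighbour (largest gap), chosen here only as a contrast. The tree is returned as nested pairs.

```python
from itertools import combinations


def agglomerate(points, linkage):
    """Build a tree by repeatedly merging the two closest clusters.

    points: dict of label -> position on a line
    linkage: min for nearest neighbour, max for furthest neighbour
    """
    # Each cluster is (tree so far, list of member positions).
    clusters = [(label, [x]) for label, x in points.items()]
    while len(clusters) > 1:
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            gap = linkage(abs(p - q)
                          for p in clusters[i][1] for q in clusters[j][1])
            if best is None or gap < best[0]:
                best = (gap, i, j)
        _, i, j = best
        merged = ((clusters[i][0], clusters[j][0]),
                  clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0][0]


points = {"A": 0.0, "B": 1.0, "C": 2.2, "D": 3.6}   # made-up data
print("nearest neighbour: ", agglomerate(points, min))
print("furthest neighbour:", agglomerate(points, max))
```

With these points, nearest neighbour chains the points on one at a time (A with B, then C, then D), while furthest neighbour pairs A with B and C with D before joining the two pairs: same data, same distance measure, two different dendrograms.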
For a book I explored a small number of datasets in painful detail using various multivariate techniques, and they all gave the same basic story, identified the same extreme values and clusters.
Except cluster analysis, which told me several different stories, all garbage!!
Worse, if you do happen to find a dendrogram sequence that makes sense – be careful, it may be lying to you! Dendrograms can be re-arranged around any joint (or node), and quite distantly related points can end up side-by-side.
D C B A (C and B next to each other, but connected only by a high-level node)
is the same dendrogram as
B A D C (C and B far apart)
The 8 different ways of presenting the dendrogram of the quarry floor data (Usher 1975) using nearest neighbour analysis on a matrix of Euclidean distances.
These patterns can re-arrange around any node, like a child’s mobile.
The number of permutations of a dendrogram with n points is 2^(n-1): each of the n-1 joints can be flipped independently, so 4 points give 2^3 = 8 arrangements.
A B C D
B A C D
A B D C
B A D C
C D A B
D C A B
C D B A
D C B A
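The eight orderings listed above can be generated mechanically by flipping every joint in turn. In this sketch the dendrogram is written as nested pairs, a representation chosen just for this example, with (A, B) and (C, D) joined by one high-level node.

```python
def leaf_orders(tree):
    """All left-to-right orders of the points obtainable by flipping joints."""
    if isinstance(tree, str):          # a single point
        return [(tree,)]
    left, right = tree
    orders = []
    for a in leaf_orders(left):
        for b in leaf_orders(right):
            orders.append(a + b)       # joint drawn as given
            orders.append(b + a)       # joint flipped
    return orders


dendrogram = (("A", "B"), ("C", "D"))
orders = leaf_orders(dendrogram)
for order in orders:
    print(" ".join(order))
print(len(orders), "arrangements")
```

All of these are the same dendrogram. B and C are neighbours in some arrangements and at opposite ends in others, although nothing about their relationship has changed.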
This conjunction of S4 (top) and S1, S2, S3 below it was mentioned as showing the Lake Superior samples clustered together.
I have cut and pasted this dendrogram to show an alternative, valid arrangement, in which the Lake Superior samples are widely scattered.
The lesson here – ignore the ordering of objects along a cluster analysis axis! (Unlike all ordinations, in which order matters and is preserved under transformations.)
I know of at least one case of published work using a dendrogram to classify lake communities, where one stated discovery was that samples from one particular lake clustered together. It is true that they were next to each other on the published dendrogram, but there was a fault line down the middle of the cluster, which could have been re-arranged to show it as 2 widely spaced clusters!
My summary about cluster analysis – DON’T!! Ordinate it, or use TWINSPAN. I do have concerns about the over-reliance of DNA researchers on dendrograms, although they seem to operate with fewer choices (3 distance measures, 6 ways to build up a tree) and usually publish topologically valid trees!
"The best form of cluster analysis is ordination, because ordination is not a form of cluster analysis". Byron Morgan, personal communication.
TWINSPAN was written by Mark Hill (of ITE, now CEH, author of DECORANA) in 1979, and has DCA at its core. The aim is to produce a 2-way ordered classification of the data. This produces 2 linked dendrograms (one of species, one of sites), but these are fixed (they may not rotate), and in my experience make a great deal of sense. It also shows you which species are most associated with which areas and identifies indicator species for different parts of the dendrograms. Like DCA, this analysis has so far been confined to ecologists but deserves to be used far more widely.
Its technical details are a lecture in themselves – read it up well if you choose to use this.
TWO-WAY ORDERED TABLE
Ordered list of species
Ordered list of observations
1223331111222222 22333111 1 1
 2 CORTSEMI 455344---1-----1----2-------------- 000
 4 INOCLACE 413-----1-------------------------- 000
 6 LACTRUFU --5411-------------223------------- 000
 9 SUILLUTE 24423--1-3--11--------------------- 000
 1 BOLEFERR --1-2-22232224-2-22-1-2-----1------ 001
10 SUILVARI 555-2--4-44525552-33335------------ 001
 3 GOMPROSE 3325555555555555152444342---------- 01
 8 SUILBOVI 55555555555555553555555555--------- 01
 5 LACCPROX 54555455555535535555555555555542555 1
 7 PAXIINVO --221232--22212-1514244554122553345 1
0 1 1 23 3
2 3 4 23 4
2 12 13 3
0 2 1 3 1
CCA triplot of Collembola succession on PFA in relation to soil conditions.
Figure labels: soil variables pH, Na, K, OM; Increasing age of PFA; Fresh wet PFA; Young dry PFA; Woodland stage PFA.