Best Practices vs. Misuse of PCA in the Analysis of Climate Variability
Bob Livezey
Climate Services/Office of Services/NWS/NOAA
30th Climate Diagnostics and Prediction Workshop
State College, PA, October 26, 2005
Outline: Motivation, take-home messages and references
1. Preprocessing often has major impact on results and their interpretation.
2. PCA results are inherently domain dependent, as I will illustrate later.
3. Standardization means each record has equal weight in variance-based multivariate analyses, e.g. high latitudes vs. tropics, January vs. November.
If this is desirable, then PCA should be based on the correlation matrix; if not, then on the covariance matrix.
4. PCA should be performed on as narrow a window in the seasonal cycle as sample considerations permit to avoid mixing inhomogeneous climates (like the January vs. November example in 3 above).
5. Area-averaged or gridded data often must be weighted in multivariate analyses:
Smaller areas can influence results as much as larger ones;
On lat/lon grids the density of points (and hence their influence) increases with latitude.
5. Two ways to treat the problem:
Create an approximate equal-area representation (e.g. CPC megadivisions, or the Barnston and Livezey, 1987, grid);
Weight the data – generally proportional to the square root of the area.
5. If weights are needed and PCA on the correlation matrix is the objective, then standardization should be performed before weighting, and then the covariance matrix formed. Otherwise the weights are removed by the standardization step.
6. In EPCA (see below), CCA, etc., maps of variables with greater numbers of data points will have disproportionate influence on the results unless the maps are weighted, e.g. in proportion to the square root of the ratio of the total variance in all variables to the total variance in the weighted variable (see Livezey and Smith, 1999b).
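Points 3 and 5 above can be sketched with NumPy. This is only an illustration on synthetic data: the grid, the sample size and all variable names are made up, and the grid-box area is assumed to be proportional to cos(latitude), so the weight is sqrt(cos(latitude)).

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: 120 monthly maps on a 4-latitude x 6-longitude grid.
lats = np.array([20.0, 40.0, 60.0, 80.0])
n_time, n_lon = 120, 6
field = rng.standard_normal((n_time, lats.size, n_lon))

# Assumed weight: square root of grid-box area, with area ~ cos(latitude).
w = np.sqrt(np.cos(np.deg2rad(lats)))
w = np.repeat(w, n_lon)                 # one weight per grid point (lat-major order)

x = field.reshape(n_time, -1)           # (time, grid point)
x = x - x.mean(axis=0)                  # anomalies about the time mean

# Point 3: the covariance matrix of standardized data is the correlation matrix.
z = x / x.std(axis=0, ddof=1)
corr = np.corrcoef(x, rowvar=False)
cov_of_std = np.cov(z, rowvar=False)

# Point 5, recommended order: standardize, then weight, then form the covariance.
cov_weighted = np.cov(z * w, rowvar=False)

# Reversed order: weight, then standardize. The weights cancel out.
xw = x * w
z_reversed = xw / xw.std(axis=0, ddof=1)
cov_reversed = np.cov(z_reversed, rowvar=False)

print("standardized covariance equals correlation:", np.allclose(cov_of_std, corr))
print("reversed order loses the weights:", np.allclose(cov_reversed, corr))
print("weighted variances by latitude row:", np.diag(cov_weighted)[::n_lon].round(3))
```

With the recommended order, the variance carried by each grid point is cos(latitude), so the crowded high-latitude points no longer dominate the analysis.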
4. Example of first four patterns of 3-day precipitation for May-August over the central US (Richman and Lamb, 1985). The sequence of patterns is seen repeatedly in other analyses and can be considered an artifact of the geometry of PCA:
(a) a point with 0.5 is more than 6 times more important than a point with 0.2, a point with 0.8 more than 7 times more important than one with 0.3, etc.;
(b) summations of the squares over the maps give the total variances listed in 5 above;
(c) comparing the squared central values within closed contours allows practical discrimination between monopoles, dipoles, etc.
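The arithmetic behind (a) is just a ratio of squares. A minimal sketch, assuming the mapped values are loadings whose squares measure the local variance explained (the function name is hypothetical):

```python
def relative_importance(loading_a, loading_b):
    """Ratio of squared loadings: how much more variance point a carries than point b."""
    return loading_a**2 / loading_b**2

print(round(relative_importance(0.5, 0.2), 2))  # 6.25
print(round(relative_importance(0.8, 0.3), 2))  # 7.11
```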
7. The time series that go with the patterns (the a’s) are uncorrelated (i.e. not collinear), so they are desirable for multiple linear regression.
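A small check of this property on synthetic data, computing the PCA by singular value decomposition. The predictors and their noise level are invented for illustration only.

```python
import numpy as np

rng = np.random.default_rng(2)

# Three strongly collinear predictors built from one common signal (made up).
n_time = 300
common = rng.standard_normal(n_time)
x = np.column_stack(
    [common + 0.3 * rng.standard_normal(n_time) for _ in range(3)]
)
x = x - x.mean(axis=0)                  # anomalies about the time mean

# PCA via SVD: rows of vt are the patterns, u * s are the time series.
u, s, vt = np.linalg.svd(x, full_matrices=False)
pcs = u * s

corr_x = np.corrcoef(x, rowvar=False)   # original predictors: highly correlated
corr_pc = np.corrcoef(pcs, rowvar=False)  # PC time series: identity matrix

print(corr_x.round(2))
print(corr_pc.round(2))
```

Regressing on the columns of `pcs` instead of the columns of `x` avoids the unstable coefficients that collinear predictors produce.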
8. To compress or filter the data some of the patterns must be thrown out, i.e. the series must be truncated; this is an ART (see O’Lenic and Livezey, 1988 for the best approach I know).
In these applications over-truncation (throwing baby out with the bath water) is of far more concern than under-truncation (retention of some noise). As a pre-step for rotation, CCA, etc., both should be of concern (see below).
9. Physical interpretation of other than the leading PC pattern is usually unwarranted, and this is often the case for the first as well. Richman (1986) shows this for the example in two ways. First he splits the domain in two and does separate PCA on each. Here’s the result for the first PCA mode. Note that the first mode for the southern domain (a monopole covering the domain) is not reproduced in the full domain analysis:
Next he computes the one-point teleconnection pattern for the largest loading on each pattern. Here’s the result for the second PCA mode. The PCA mode is a dipole, the teleconnection pattern (reflecting the physical covariance structure around the point) a monopole:
10. The North et al. (1982) test is used to determine whether two consecutive patterns can reasonably be interpreted as distinct patterns or separate signals. It assumes the n samples are independent (heuristically adjust n downward for dependence):
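The rule of thumb usually quoted from North et al. (1982) is that the sampling error of an eigenvalue λ is roughly λ·sqrt(2/n). The sketch below applies it under one common reading, which is an assumption here: two consecutive modes are treated as separated when their eigenvalue gap exceeds the sampling error of the larger one. The function name, the eigenvalues and the effective sample size are all hypothetical.

```python
import numpy as np

def north_separated(eigenvalues, n_eff):
    """Flag consecutive eigenvalue pairs whose gap exceeds the rule-of-thumb error.

    eigenvalues: sorted from largest to smallest.
    n_eff: number of independent samples (reduce it if samples are dependent).
    """
    lam = np.asarray(eigenvalues, dtype=float)
    err = lam * np.sqrt(2.0 / n_eff)     # approximate sampling error of each eigenvalue
    gaps = lam[:-1] - lam[1:]            # spacing to the next eigenvalue
    return gaps > err[:-1], err

lam = [10.0, 9.0, 4.0, 1.0]              # made-up eigenvalues
separated, err = north_separated(lam, n_eff=50)
print(err)        # sampling errors
print(separated)  # one flag per consecutive pair
```

In this made-up case the first two modes fail the test, so their patterns are likely to be arbitrary mixtures of each other and should not be interpreted separately.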
10. Other kinds of PCA:
Combined (CPCA) – more than one mapped variable;
Extended (EPCA) – group of maps of same variable at different lags to capture pattern evolution (MSSA is a variant);
Rotated (RPCA) – to reduce sampling error and improve physical representativeness.
2. Note the robustness of rotated patterns in Richman’s split domain example (all patterns are present in both analyses):
Now compare rotated mode 2 and its corresponding teleconnection pattern (both are monopoles with similar scales):
3. Barnston and Livezey (1987) compared 120 monthly 700 mb height PCA and RPCA patterns with their corresponding one-point teleconnection patterns – the average pattern correlation was 0.69 and 0.90 respectively. They also used sensitivity tests to demonstrate dramatic reductions in sampling error.
4. The most likely reason for the success of rotation is the relaxation of the geometrical and mathematical constraints on the analysis, i.e. the data can speak more for itself.
In a commonly used variant of varimax, where the eigenvectors are weighted by the square root of the eigenvalue, the resulting patterns do not have to be orthogonal and the resulting time series do not have to be independent (Jolliffe, 1995).
5. Under-rotation (truncation of too many modes) can result in discarded signal, while over-rotation (truncation of too few) can result in over-regionalization of signals (see O’Lenic and Livezey, 1988).
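For orientation, here is a minimal sketch of a varimax rotation applied to truncated loadings scaled by the square root of the eigenvalue, the variant mentioned in 4. It uses a standard SVD-based iteration on synthetic data and is not the procedure of any of the cited papers; the function name, the data and the choice of k = 3 retained modes are assumptions.

```python
import numpy as np

def varimax(loadings, max_iter=100, tol=1e-8):
    """Orthogonal varimax rotation of a (n_points, n_modes) loading matrix."""
    p, k = loadings.shape
    rot = np.eye(k)
    crit_old = 0.0
    for _ in range(max_iter):
        b = loadings @ rot
        # Gradient of the varimax criterion, mapped to the nearest rotation by SVD.
        grad = loadings.T @ (b**3 - b * (b**2).sum(axis=0) / p)
        u, s, vt = np.linalg.svd(grad)
        rot = u @ vt
        crit = s.sum()
        if crit_old != 0.0 and crit < crit_old * (1.0 + tol):
            break                        # criterion has stopped increasing
        crit_old = crit
    return loadings @ rot, rot

rng = np.random.default_rng(3)
n_time, n_pts, k = 200, 10, 3            # k is the truncation point (an assumption)
x = rng.standard_normal((n_time, n_pts)) @ rng.standard_normal((n_pts, n_pts))
x = x - x.mean(axis=0)

evals, evecs = np.linalg.eigh(np.cov(x, rowvar=False))
order = np.argsort(evals)[::-1][:k]      # keep the k leading modes
loadings = evecs[:, order] * np.sqrt(evals[order])

rotated, rot = varimax(loadings)
gram = rotated.T @ rotated               # off-diagonal terms show non-orthogonal patterns
print(gram.round(3))
```

The rotation redistributes the variance retained by the k modes without changing its total, which is why the choice of k matters: whatever was truncated cannot be recovered by rotating.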
Map (a) here is a dipole but (b) and (c) are monopoles.