Introduction to multivariate analysis

- Data does not always come with a single response
- Nor does it always have a response
- A data set may consist simply of n measurements on p variables
- For example, a doctor might record a patient’s height, weight, blood pressure and pulse.
- We could envision situations where any one of these variables is the response and the others are predictors.
- Or we could have a situation where we just wanted to examine similarities (and differences) of patients

Statistical Data Analysis - Lecture25 - 23/05/03

- So we can see that in some ways we’ve already encountered multivariate data – multiple regression is an example
- But we really haven’t learned how to deal with anything other than a single response
- And we’re not going to!
- There is a whole body of multivariate data analysis literature devoted to extensions of techniques we’ve seen so far
- Hotelling’s $T^2$ – a multivariate extension to the t-test
- MANOVA – multivariate ANOVA with more than one response
- Multiple regression where the response is multidimensional (canonical correlation)

- The fact is that in some ways these extensions are trivial
- Sure the interpretation issues are harder
- And the assumptions are just about impossible to verify
- For this reason, we will concentrate on multivariate data description
- This is by no means easy
- For example, how do we visualize data in more than 3 dimensions?
- As with EDA for low-dimensional problems, multivariate data visualization and explanation is possibly one of the most important treatments of multivariate data

- Whilst it is theoretically possible to avoid discussing linear algebra when talking about multivariate techniques, it is practically impossible.
- The reason for this is that we need a common language in order to get some handle on what we’re doing and what we’re talking about
- Linear algebra provides that common language (and indeed underlies a majority of the statistics you have encountered already)
- We will not get hung up on computational techniques, as they are often abstracted from the theory

- Dimensions in linear algebra – every object in linear algebra (scalar, vector or matrix) has a set of dimensions associated with it.
- These dimensions are reported as rows and columns. So for example a scalar (a real number) has 1 row and 1 column. An $r \times c$ matrix has r rows and c columns
- An n-dimensional row vector is a list of n numbers (arranged in 1 row and n columns) which describes a point in n-dimensional space.
- An n-dimensional column vector is a list of n numbers (arranged in n rows and 1 column) which describes a point in n-dimensional space.
- In terms of using row or column vectors to describe a point in space, it makes no difference. However they do behave differently when it comes to operations like multiplication

- If a vector (of length n) is said to be real valued then it represents a point in $\mathbb{R}^n$, real-valued n-dimensional space ($\mathbb{R}^2$ is the standard 2D plane, and $\mathbb{R}^3$ is the space we live in)
- We write vectors as bold lower case letters.
- In statistics, the default orientation of a vector (unless otherwise defined by the author) is a column vector.
E.g. $\mathbf{a}$ is a column vector, not a row vector

- An $n \times p$ matrix is a collection of $np$ elements arranged in n rows and p columns.
- Given that we tend to represent variables as column vectors, a matrix is a collection of p column vectors of length n.
- We write matrices as bold capital letters, e.g. $\mathbf{X}$ is a matrix

- The elements of a matrix are referred to by two indices, a row index and a column index, so $x_{ij}$ refers to the jth element on the ith row of the matrix $\mathbf{X}$.
- The transpose of a matrix reverses the order of the rows and the columns. That is, an $n \times p$ matrix becomes a $p \times n$ matrix with the $ij$th element equal to $x_{ji}$
- The transpose of a matrix $\mathbf{X}$ is denoted $\mathbf{X}^T$, $\mathbf{X}^t$ or $\mathbf{X}'$
- Usually we denote the jth column of $\mathbf{X}$ as $\mathbf{x}_j$ and the ith row as
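
The indexing and transpose rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data matrix is hypothetical, the matrix is stored as a list of rows, and the helper names `transpose` and `element` are not from the lecture.

```python
# Hypothetical data: n = 3 patients (rows), p = 2 variables (columns).
X = [
    [170.0, 65.0],   # patient 1: height (cm), weight (kg)
    [182.0, 80.0],   # patient 2
    [165.0, 58.0],   # patient 3
]

def transpose(m):
    """Swap rows and columns: an n x p matrix becomes p x n."""
    return [list(column) for column in zip(*m)]

def element(m, i, j):
    """Return x_ij using the 1-based indices of the mathematical notation."""
    return m[i - 1][j - 1]

Xt = transpose(X)

print(element(X, 3, 2))    # x_32: weight of patient 3
print(element(Xt, 2, 3))   # the same value, read from the transpose
print(Xt[0])               # first column of X (all heights)
```

Because each variable is a column of the data matrix, the rows of the transpose are exactly the variables.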

- As with ordinary arithmetic, there is a matrix equivalent of division, but some special conditions apply.
- If $\mathbf{X}$ is an $n \times n$ matrix (a square matrix) and $\mathbf{X}$ is of full rank (the only solution $\mathbf{b}$ to the equation $\mathbf{X}\mathbf{b} = \mathbf{0}$ is $\mathbf{b} = \mathbf{0}$), then $\mathbf{X}$ has an inverse $\mathbf{X}^{-1}$ such that $\mathbf{X}\mathbf{X}^{-1} = \mathbf{X}^{-1}\mathbf{X} = \mathbf{I}_n$, where $\mathbf{I}_n$ is the identity matrix (a matrix with ones down the diagonal and zeroes elsewhere)
- E.g. if
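
As a small illustration of the inverse, here is a Python sketch for the 2 × 2 case only. The matrix values are made up for the example, and `inverse_2x2` and `matmul` are hypothetical helper names. Exact fractions are used so that the check against the identity matrix is not affected by rounding.

```python
from fractions import Fraction

def inverse_2x2(m):
    """Invert a 2 x 2 integer matrix; raise ValueError if it is not of full rank."""
    (a, b), (c, d) = m
    det = a * d - b * c
    if det == 0:
        raise ValueError("matrix is singular (not of full rank)")
    return [[Fraction(d, det), Fraction(-b, det)],
            [Fraction(-c, det), Fraction(a, det)]]

def matmul(a, b):
    """Multiply two conformable matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

X = [[2, 1],
     [5, 3]]                      # illustrative full-rank matrix
X_inv = inverse_2x2(X)
identity = [[1, 0], [0, 1]]

# Both products should give the identity matrix.
print(matmul(X, X_inv) == identity, matmul(X_inv, X) == identity)
```

A matrix such as `[[1, 2], [2, 4]]`, whose second row is twice the first, is not of full rank, and the function raises an error instead of returning an inverse.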

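- If $\mathbf{A}$ is an $n \times p$ matrix and $\mathbf{B}$ is an $r \times c$ matrix then we can multiply $\mathbf{B}$ by $\mathbf{A}$ (to get $\mathbf{AB}$) only if $p = r$
- Note this does not automatically mean we can multiply $\mathbf{A}$ by $\mathbf{B}$ – this can only happen if $c = n$
- In general $\mathbf{AB} \neq \mathbf{BA}$
- If $\mathbf{A}$ is an $n \times p$ matrix, $\mathbf{B}$ is an $r \times c$ matrix and $p = r$, then the $ij$th element of the product (let $\mathbf{C} = \mathbf{AB}$) is given by
  $$c_{ij} = \sum_{k=1}^{p} a_{ik} b_{kj}$$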
- This is more easily demonstrated than seen from the formula
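
A minimal Python sketch of the multiplication rule, assuming matrices are stored as lists of rows. The function name `matmul` and the example matrices are illustrative, not from the lecture. It checks conformability first, then computes each element as a sum of products along a row of the first matrix and a column of the second.

```python
def matmul(a, b):
    """Return C = AB, where c_ij is the sum over k of a_ik * b_kj."""
    n, p = len(a), len(a[0])
    r, c = len(b), len(b[0])
    if p != r:
        raise ValueError(f"cannot multiply {n}x{p} by {r}x{c}: need p == r")
    return [[sum(a[i][k] * b[k][j] for k in range(p)) for j in range(c)]
            for i in range(n)]

A = [[1, 2, 3],
     [4, 5, 6]]            # 2 x 3
B = [[1, 0],
     [0, 1],
     [1, 1]]               # 3 x 2

AB = matmul(A, B)          # 2 x 2
BA = matmul(B, A)          # 3 x 3
print(AB)
print(BA)
```

Here both products exist, but they do not even have the same dimensions, so they cannot be equal. Calling `matmul(A, A)` raises an error because a 2 × 3 matrix cannot be multiplied by a 2 × 3 matrix.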

- There are a variety of different ways that distance is measured (between two multidimensional points), and their pros and their cons are just about as varied
- If we have just two variables (p = 2) X and Y with n observations on each, then the distance between the ith and the jth point is given by Pythagoras’ theorem
  $$d_{ij} = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2}$$

- The examples you have seen so far are called Euclidean distances and have a natural extension when there are more than three variables (p>3).
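- For any p the Euclidean distance between two points is given by
  $$d_{ij} = \sqrt{\sum_{k=1}^{p} (x_{ik} - x_{jk})^2} = \lVert \mathbf{x}_i - \mathbf{x}_j \rVert_2$$
  where $x_{ik}$ is the value of the kth variable for the ith point, and $\mathbf{x}_i$ here denotes the vector of all p measurements on the ith point
- This should look familiar – this is the distance we minimize for regression.
- The second part of the equation is called the $L_2$ norm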
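
The Euclidean distance for any number of variables can be sketched in Python as follows. The function name `euclidean` and the example points are illustrative assumptions.

```python
import math

def euclidean(x, y):
    """Euclidean (L2) distance between two points with the same number of coordinates."""
    if len(x) != len(y):
        raise ValueError("points must have the same dimension")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

d2 = euclidean([0, 0], [3, 4])              # p = 2: the 3-4-5 right triangle
d4 = euclidean([1, 2, 3, 4], [2, 4, 5, 8])  # p = 4: squared differences 1, 4, 4, 16
print(d2, d4)
```

The same function handles the two-variable Pythagoras case and the higher-dimensional case, since only the number of terms in the sum changes.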

- One direct downside of the Euclidean distance is that it is dominated by variables measured on a large scale (relative to the other variables).
- For example, if variable $X_1$ measures height in mm and variable $X_2$ measures weight in stone, then the distance will be dominated by $X_1$
- One solution to this is to scale each variable before measuring the distance
- That is, we subtract the mean of each variable from every measurement for that variable and divide by the standard deviation
- This can work well, but has the disadvantage of removing information about separation
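
A short Python sketch of the scaling step, using made-up measurements for three patients. The helper name `standardize` is hypothetical, and the sample standard deviation (n − 1 divisor) is an assumption, since the slides do not say which divisor is intended.

```python
import math
import statistics

def standardize(values):
    """Subtract the mean and divide by the sample standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)   # sample standard deviation (n - 1 divisor)
    return [(v - mean) / sd for v in values]

# Hypothetical measurements, chosen only for illustration.
height_mm = [1600.0, 1700.0, 1800.0]
weight_stone = [9.0, 11.0, 13.0]

z_height = standardize(height_mm)
z_weight = standardize(weight_stone)

# Distance between patients 1 and 2, before and after scaling.
raw_d = math.hypot(height_mm[0] - height_mm[1], weight_stone[0] - weight_stone[1])
std_d = math.hypot(z_height[0] - z_height[1], z_weight[0] - z_weight[1])
print(raw_d, std_d)
```

Before scaling, the distance is almost entirely the 100 mm height difference. After scaling, both variables contribute equally for this data.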

- There are a whole set of distance measures based on norms
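- The $L_2$ norm is only one of a family of measures (called p-norms, or $L_p$ norms) given by the general formula
  $$\lVert \mathbf{x}_i - \mathbf{x}_j \rVert_p = \left( \sum_{k} \lvert x_{ik} - x_{jk} \rvert^{p} \right)^{1/p}$$
  where $x_{ik}$ is the value of variable k for point i, the sum runs over the variables, and p is the order of the norm (not the number of variables)
- When p = 1 the $L_1$ norm is sometimes called the Manhattan distance
- When p is infinite the $L_\infty$ norm is called the infinity norm (or max or sup norm) and is defined by
  $$\lVert \mathbf{x}_i - \mathbf{x}_j \rVert_\infty = \max_{k} \lvert x_{ik} - x_{jk} \rvert$$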
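
The family of norm-based distances can be sketched with a single Python function. The name `lp_distance` and the example points are illustrative. The argument `p` is the order of the norm, and `float("inf")` is used here as a convention for the max norm.

```python
def lp_distance(x, y, p):
    """L_p distance between two points; p = float('inf') gives the max norm."""
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if p == float("inf"):
        return max(diffs)
    if p < 1:
        raise ValueError("p must be at least 1")
    return sum(d ** p for d in diffs) ** (1.0 / p)

x, y = [0, 0], [3, 4]
manhattan = lp_distance(x, y, 1)             # |3| + |4|
euclid = lp_distance(x, y, 2)                # sqrt(9 + 16)
maximum = lp_distance(x, y, float("inf"))    # max(|3|, |4|)
print(manhattan, euclid, maximum)
```

For the same pair of points the distance shrinks as the order grows: the Manhattan distance is the largest and the max norm is the smallest.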

- The Mahalanobis distance is used to measure the distance of a single multivariate observation from the centre of the population that the observation comes from.
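- If $x_1, x_2, \ldots, x_p$ are the values of $X_1, X_2, \ldots, X_p$ for the individual, with corresponding population mean values $\mu_1, \mu_2, \ldots, \mu_p$, then
  $$D^2 = (\mathbf{x} - \boldsymbol{\mu})^T \mathbf{V}^{-1} (\mathbf{x} - \boldsymbol{\mu})$$
  where $\mathbf{V}$ is the population covariance matrix.

- If we have the population means and covariance matrix, and the population is multivariate normal, then $D^2$ follows a chi-square distribution with p degrees of freedom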
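
A minimal Python sketch of the squared Mahalanobis distance for the two-variable case (p = 2), where the covariance matrix can be inverted by hand. The population mean, covariance matrix and observation are hypothetical values chosen for illustration, and `mahalanobis_sq_2d` is not a name from the lecture. The tail probability uses the fact that a chi-square variable with 2 degrees of freedom has upper-tail probability exp(−x/2), which applies only under the multivariate normal assumption.

```python
import math

def mahalanobis_sq_2d(x, mu, V):
    """Squared Mahalanobis distance D^2 = (x - mu)^T V^{-1} (x - mu) for p = 2."""
    (a, b), (c, d) = V
    det = a * d - b * c
    if det == 0:
        raise ValueError("covariance matrix is singular")
    # Inverse of a 2 x 2 matrix.
    V_inv = [[d / det, -b / det], [-c / det, a / det]]
    dx = [x[0] - mu[0], x[1] - mu[1]]
    # Quadratic form dx^T V_inv dx.
    return sum(dx[i] * V_inv[i][j] * dx[j] for i in range(2) for j in range(2))

# Hypothetical population values, chosen only for illustration.
mu = [170.0, 70.0]                   # mean height (cm), mean weight (kg)
V = [[100.0, 40.0], [40.0, 64.0]]    # population covariance matrix
x = [180.0, 70.0]                    # one individual's measurements

D2 = mahalanobis_sq_2d(x, mu, V)
tail_prob = math.exp(-D2 / 2)        # P(chi-square with 2 df > D2)
print(D2, tail_prob)
```

When the covariance matrix is the identity, the result reduces to the squared Euclidean distance from the mean.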

- The covariance matrix is the multivariate equivalent of the variance of a single variable, with the diagonal elements equal to the sample variances and the off-diagonal elements
  $$c_{ij} = \frac{1}{n-1} \sum_{k=1}^{n} (x_{ki} - \bar{x}_i)(x_{kj} - \bar{x}_j)$$
  i.e. the sample covariance between the ith and jth variables

- Unfortunately we almost never have the means or the covariance matrix, and so we must estimate them from the data
- The covariance matrix V is estimated by taking a pooled average of the covariance matrices for each of the variables
- It is unclear how quickly the Mahalanobis distance converges to a chi-square distribution, but Manly suggests when p = 100 (100 independent variables) there should be no problem in assuming this.
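
Estimating the covariance matrix from data can be sketched in Python as follows. This is a single-sample estimate with the usual n − 1 divisor; pooling across several samples is not shown. The data values are made up and `sample_covariance` is a hypothetical helper name.

```python
def sample_covariance(X):
    """Sample covariance matrix (n - 1 divisor) of an n x p data matrix."""
    n, p = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(p)]
    return [[sum((row[i] - means[i]) * (row[j] - means[j]) for row in X) / (n - 1)
             for j in range(p)]
            for i in range(p)]

# Hypothetical data: 4 observations (rows) on 2 variables (columns).
X = [[1.0, 2.0],
     [2.0, 4.0],
     [3.0, 6.0],
     [4.0, 9.0]]

C = sample_covariance(X)
print(C)
```

The diagonal of `C` holds the sample variance of each variable, and the matrix is symmetric because the covariance between variables i and j is the same as that between j and i.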
