# Introduction to multivariate analysis

### Introduction to multivariate analysis

• Data does not always come with a single response

• Nor does it always have a response

• A data set may consist simply of n measurements on p variables

• For example, a doctor might record a patient’s height, weight, blood pressure and pulse.

• We could envision situations where any one of these variables is the response and the others are predictors.

• Or we could have a situation where we just want to examine similarities (and differences) between patients

Statistical Data Analysis - Lecture25 - 23/05/03

### Multivariate data

• So we can see that in some ways we’ve already encountered multivariate data – multiple regression is an example

• But we really haven’t learned how to deal with anything other than a single response

• And we’re not going to!

• There is a whole body of multivariate data analysis literature devoted to extensions of the techniques we’ve seen so far

• Hotelling’s $T^2$ – a multivariate extension of the t-test

• MANOVA – multivariate ANOVA with more than one response

• Multiple regression where the response is multidimensional (canonical correlation)

### Multivariate data

• The fact is that in some ways these extensions are trivial

• Sure, the interpretation issues are harder

• And the assumptions are just about impossible to verify

• For this reason, we will concentrate on multivariate data description

• This is by no means easy

• For example, how do we visualize data in more than 3 dimensions?

• Like EDA for low-dimensional problems, multivariate data visualization and explanation is possibly one of the most important treatments of multivariate data

### An introduction to linear algebra

• Whilst it is theoretically possible to avoid discussing linear algebra when talking about multivariate techniques, it is practically impossible.

• The reason for this is that we need a common language in order to get some handle on what we’re doing and what we’re talking about

• Linear algebra provides that common language (and indeed underlies a majority of the statistics you have encountered already)

• We will not get hung up on computational techniques, as they are often abstracted from the theory

### Some definitions

• Dimensions in linear algebra – every object in linear algebra (scalar, vector or matrix) has a set of dimensions associated with it.

• These dimensions are reported as rows and columns. So for example a scalar (a real number) has 1 row and 1 column. An $r \times c$ matrix has r rows and c columns

• An n-dimensional row vector is a list of n numbers (arranged in 1 row and n columns) which describes a point in n-dimensional space.

• An n-dimensional column vector is a list of n numbers (arranged in n rows and 1 column) which describes a point in n-dimensional space.

• In terms of using row or column vectors to describe a point in space, it makes no difference. However, they do behave differently when it comes to operations like multiplication

• If a vector (of length n) is said to be real valued then it represents a point in $\mathbb{R}^n$, the n-dimensional real space ($\mathbb{R}^2$ is the standard 2D plane, and $\mathbb{R}^3$ is what we live in)

• We write vectors as bold lower case letters.

• In statistics, the default orientation (unless otherwise defined by the author) of a vector, is a column vector.

E.g. $\mathbf{a}$ is a column vector, not a row vector

• An $n \times p$ matrix is a collection of $np$ elements arranged in n rows and p columns.

• Given that we tend to represent variables as column vectors, a matrix is a collection of p column vectors of length n.

• We write matrices as bold capital letters, e.g. $\mathbf{X}$ is a matrix

• The elements of a matrix are referred to by two indices, a row index and a column index, so $x_{ij}$ refers to the $j$th element on the $i$th row of the matrix $\mathbf{X}$.

• The transpose of a matrix swaps its rows and columns. That is, an $n \times p$ matrix becomes a $p \times n$ matrix with the $ij$th element equal to $x_{ji}$

• The transpose of a matrix $\mathbf{X}$ is denoted $\mathbf{X}^T$, $\mathbf{X}^t$ or $\mathbf{X}'$

• Usually we denote the $j$th column of $\mathbf{X}$ as $\mathbf{x}_j$ and the $i$th row as
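
The indexing and transpose conventions above can be sketched in a few lines of plain Python. The matrix values are made up for illustration, and the helper names `transpose` and `column` are hypothetical; the 1-based arguments are chosen only to match the $x_{ij}$ notation.

```python
# A 3 x 2 matrix X stored as a list of rows
# (n = 3 observations, p = 2 variables; values are illustrative only).
X = [
    [170.0, 65.0],
    [182.0, 80.0],
    [165.0, 58.0],
]


def transpose(m):
    """Return the transpose: element (i, j) of the result is m[j][i]."""
    return [list(col) for col in zip(*m)]


def column(m, j):
    """Return the j-th column of m (1-based, to match the x_ij notation)."""
    return [row[j - 1] for row in m]


x_21 = X[2 - 1][1 - 1]      # element on row 2, column 1
Xt = transpose(X)           # a 2 x 3 matrix

print(x_21)                 # 182.0
print(len(Xt), len(Xt[0]))  # 2 3
print(column(X, 2))         # [65.0, 80.0, 58.0]
```

Each column of `X` holds one variable, which is why the columns of `X` become the rows of `Xt`.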

• As with ordinary arithmetic, there is a matrix equivalent of division, but some special conditions apply.

• If $\mathbf{X}$ is an $n \times n$ matrix (such a matrix is called a square matrix), and $\mathbf{X}$ is of full rank (the only solution $\mathbf{b}$ to the equation $\mathbf{X}\mathbf{b} = \mathbf{0}$ is $\mathbf{b} = \mathbf{0}$), then $\mathbf{X}$ has an inverse $\mathbf{X}^{-1}$, such that $\mathbf{X}\mathbf{X}^{-1} = \mathbf{X}^{-1}\mathbf{X} = \mathbf{I}_n$, where $\mathbf{I}_n$ is the identity matrix (a matrix with ones down the diagonal and zeroes elsewhere)

• E.g. if
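
As a minimal sketch of the inverse, restricted to the $2 \times 2$ case where a closed-form formula exists, the following inverts an illustrative full-rank matrix and checks that the product with its inverse is the identity. The function names and the matrix values are assumptions made for this example.

```python
def inverse_2x2(m):
    """Invert a 2 x 2 matrix using the closed-form formula.

    Raises ValueError when the determinant is zero, i.e. when the
    matrix is not of full rank and therefore has no inverse.
    """
    (a, b), (c, d) = m
    det = a * d - b * c
    if det == 0:
        raise ValueError("matrix is singular (not of full rank)")
    return [[d / det, -b / det], [-c / det, a / det]]


def matmul_2x2(p, q):
    """Multiply two 2 x 2 matrices stored as lists of rows."""
    return [[sum(p[i][k] * q[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


X = [[2.0, 1.0], [1.0, 1.0]]    # illustrative full-rank matrix
X_inv = inverse_2x2(X)

print(X_inv)                    # [[1.0, -1.0], [-1.0, 2.0]]
print(matmul_2x2(X, X_inv))     # [[1.0, 0.0], [0.0, 1.0]]
```

A matrix such as `[[1.0, 2.0], [2.0, 4.0]]`, whose second row is twice the first, is not of full rank, and `inverse_2x2` raises `ValueError` for it.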

• If A is an np matrix and Bis a rc then we can multiply B by A (to get AB)only ifp = r

• Note this does not automatically mean we can multiply A by B – this can only happen if c = n

• In general AB  BA

• If A is an np matrix and Bis a rc and p = r then the ijth element of the product (let C = AB) is given by

• This is more easily demonstrated than seen from the formula
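
A short demonstration may help here. The sketch below implements the row-by-column rule directly in plain Python (the function name `matmul` and the example matrices are chosen for illustration) and shows that $\mathbf{AB}$ and $\mathbf{BA}$ need not even have the same dimensions.

```python
def matmul(a, b):
    """Return C = AB, where c_ij is the sum over k of a_ik * b_kj."""
    n, p = len(a), len(a[0])
    r, c = len(b), len(b[0])
    if p != r:
        raise ValueError(f"cannot multiply {n}x{p} by {r}x{c}: need p == r")
    return [[sum(a[i][k] * b[k][j] for k in range(p)) for j in range(c)]
            for i in range(n)]


A = [[1, 2, 3],
     [4, 5, 6]]     # 2 x 3
B = [[1, 0],
     [0, 1],
     [1, 1]]        # 3 x 2

AB = matmul(A, B)   # 2 x 2
BA = matmul(B, A)   # 3 x 3

print(AB)           # [[4, 5], [10, 11]]
print(BA)           # [[1, 2, 3], [4, 5, 6], [5, 7, 9]]
```

For instance, the top-left element of `AB` is $1 \cdot 1 + 2 \cdot 0 + 3 \cdot 1 = 4$. Calling `matmul(A, A)` raises `ValueError`, because a $2 \times 3$ matrix cannot be multiplied by another $2 \times 3$ matrix.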

### Distance measures

• There are a variety of different ways that distance is measured (between two multidimensional points), and their pros and their cons are just about as varied

• If we have just two variables (p = 2), X and Y, with n observations on each, then the distance between the $i$th and the $j$th point is given by Pythagoras’ theorem: $d_{ij} = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2}$

### Euclidean distance

• The examples you have seen so far are called Euclidean distances and have a natural extension when there are more than three variables (p>3).

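• For any p the Euclidean distance between two points $\mathbf{x} = (x_1, \ldots, x_p)$ and $\mathbf{y} = (y_1, \ldots, y_p)$ is given by $d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_{k=1}^{p} (x_k - y_k)^2} = \lVert \mathbf{x} - \mathbf{y} \rVert_2$

• This should look familiar – this is the distance we minimize for regression.

• The second part of the equation is called the $L_2$ norm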
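
The Euclidean distance for any number of variables can be sketched as follows. The function name `euclidean` and the points are illustrative; `math.dist` from the Python standard library (version 3.8 and later) computes the same quantity and is used only as a cross-check.

```python
import math


def euclidean(x, y):
    """Euclidean (L2) distance between two points with the same number of coordinates."""
    if len(x) != len(y):
        raise ValueError("points must have the same number of coordinates")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


# p = 2: the familiar 3-4-5 right-angled triangle.
d2 = euclidean((0.0, 0.0), (3.0, 4.0))

# p = 4: the same formula, just with more terms in the sum.
d4 = euclidean((1.0, 2.0, 3.0, 4.0), (2.0, 4.0, 5.0, 8.0))

print(d2)   # 5.0
print(d4)   # 5.0, since 1 + 4 + 4 + 16 = 25
print(math.dist((0.0, 0.0), (3.0, 4.0)))   # 5.0
```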

• One direct downside of the Euclidean distance is that it is dominated by variables measured on a large scale (relative to the other variables), because their differences are numerically large.

• For example, if variable $X_1$ measures height in mm and variable $X_2$ measures weight in stone, then most of the distance will be dominated by $X_1$

• One solution to this is to scale each variable before measuring the distance

• That is, we subtract the mean of each variable from every measurement for that variable and divide by the standard deviation

• This can work well, but has the disadvantage of removing information about separation
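
The effect of scaling can be seen in a small sketch. The three patients' heights and weights below are invented for illustration, and `standardise` is a hypothetical helper that subtracts the mean and divides by the sample standard deviation.

```python
import math
import statistics

# Illustrative data only: heights in mm and weights in stone for three patients.
heights_mm = [1600.0, 1700.0, 1800.0]
weights_st = [9.0, 13.0, 11.0]


def standardise(values):
    """Subtract the mean and divide by the sample standard deviation."""
    m = statistics.mean(values)
    s = statistics.stdev(values)
    return [(v - m) / s for v in values]


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


raw = list(zip(heights_mm, weights_st))
scaled = list(zip(standardise(heights_mm), standardise(weights_st)))

raw_d = euclidean(raw[0], raw[1])          # sqrt(100^2 + 4^2)
scaled_d = euclidean(scaled[0], scaled[1]) # sqrt(1^2 + 2^2)

# Share of the squared distance contributed by height, before and after scaling.
raw_share = (raw[0][0] - raw[1][0]) ** 2 / raw_d ** 2
scaled_share = (scaled[0][0] - scaled[1][0]) ** 2 / scaled_d ** 2

print(round(raw_share, 3), round(scaled_share, 3))
```

In the raw data nearly all of the squared distance between the first two patients comes from height, simply because millimetres produce large numbers. After standardising, the weight difference (two standard deviations) counts for more than the height difference (one standard deviation).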

### Alternative distance measures

• There are a whole set of distance measures based on norms

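• The $L_2$ norm is only one of a family of measures (called p norms, or $L_p$ norms) given by the general formula $\lVert \mathbf{x} \rVert_p = \left( \sum_{k} |x_k|^p \right)^{1/p}$ (here $p$ indexes the norm, not the number of variables)

• When p = 1 the $L_1$ norm is sometimes called the Manhattan distance

• When p is infinite the $L_\infty$ norm is called the infinity norm (or max or sup norm) and is defined by $\lVert \mathbf{x} \rVert_\infty = \max_{k} |x_k|$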
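
The family of norm-based distances can be sketched with one function. The names `lp_norm` and `lp_distance` are illustrative, and the infinite case is handled separately as a maximum rather than through the power formula.

```python
import math


def lp_norm(x, p):
    """L_p norm of a vector x; p may be math.inf for the max (sup) norm."""
    if p == math.inf:
        return max(abs(v) for v in x)
    if p < 1:
        raise ValueError("p must be at least 1 for a norm")
    return sum(abs(v) ** p for v in x) ** (1.0 / p)


def lp_distance(x, y, p):
    """Distance between x and y measured by the L_p norm of their difference."""
    return lp_norm([a - b for a, b in zip(x, y)], p)


x = [1.0, 5.0]
y = [4.0, 1.0]      # the difference x - y is (-3, 4)

manhattan = lp_distance(x, y, 1)            # |-3| + |4| = 7
euclid = lp_distance(x, y, 2)               # sqrt(9 + 16) = 5
chebyshev = lp_distance(x, y, math.inf)     # max(3, 4) = 4

print(manhattan, euclid, chebyshev)
```

For the same pair of points the distance shrinks as p grows (7, then 5, then 4), because larger p gives more weight to the single largest coordinate difference.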

### Mahalanobis distance

• The Mahalanobis distance is used to measure the distance of a single multivariate observation from the centre of the population that the observation comes from.

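• If $x_1, x_2, \ldots, x_p$ are the values of $X_1, X_2, \ldots, X_p$ for the individual, with corresponding population mean values $\mu_1, \mu_2, \ldots, \mu_p$,

then

$$D^2 = (\mathbf{x} - \boldsymbol{\mu})^T \mathbf{V}^{-1} (\mathbf{x} - \boldsymbol{\mu})$$

where $\mathbf{V}$ is the population covariance matrix.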

• If we have the population means and covariance matrix, and the population is multivariate normal, then $D^2$ follows a chi-square distribution with p degrees of freedom
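
A minimal sketch of the squared Mahalanobis distance for the two-variable case follows. It assumes p = 2 so that the covariance matrix can be inverted in closed form; the means, covariance matrix and observation are invented for illustration, and `mahalanobis_sq_2d` is a hypothetical helper name. For general p a linear algebra library would normally be used instead.

```python
import math


def mahalanobis_sq_2d(x, mu, V):
    """Squared Mahalanobis distance D^2 = (x - mu)^T V^{-1} (x - mu) for p = 2."""
    d0, d1 = x[0] - mu[0], x[1] - mu[1]
    (v11, v12), (v21, v22) = V
    det = v11 * v22 - v12 * v21
    if det == 0:
        raise ValueError("covariance matrix is singular")
    # w = V^{-1} (x - mu), using the closed-form 2 x 2 inverse.
    w0 = (v22 * d0 - v12 * d1) / det
    w1 = (-v21 * d0 + v11 * d1) / det
    return d0 * w0 + d1 * w1


mu = (170.0, 70.0)                 # assumed population means
V = [[4.0, 2.0], [2.0, 2.0]]       # assumed population covariance matrix
x = (172.0, 71.0)                  # one individual's observation

D2 = mahalanobis_sq_2d(x, mu, V)
print(D2)                          # 1.0

# With uncorrelated variables D^2 reduces to a sum of squared z-scores.
print(mahalanobis_sq_2d(x, mu, [[4.0, 0.0], [0.0, 1.0]]))   # 2.0

# For p = 2 only, the chi-square upper tail probability is exp(-D^2 / 2).
tail_prob = math.exp(-D2 / 2.0)
```

The positive covariance between the two variables makes this observation less unusual (1.0 rather than 2.0), because being above the mean on both variables at once is what the covariance leads us to expect.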

### Mahalanobis distance

• The covariance matrix is the multivariate equivalent of the variance of a single variable, with the diagonal elements equal to the sample variances and the off-diagonal elements $c_{ij} = \frac{1}{n-1} \sum_{k=1}^{n} (x_{ki} - \bar{x}_i)(x_{kj} - \bar{x}_j)$,

i.e. the sample covariance between the $i$th and $j$th variables

• Unfortunately we almost never have the means or the covariance matrix, and so we must estimate them from the data

• The covariance matrix $\mathbf{V}$ is estimated by the sample covariance matrix; when the data come from several samples, a pooled average of the covariance matrices of the samples is used

• It is unclear how quickly the Mahalanobis distance converges to a chi-square distribution, but Manly suggests when p = 100 (100 independent variables) there should be no problem in assuming this.
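
Estimating the means and the covariance matrix from data can be sketched as below for a single sample. The data values are invented, and `sample_covariance_matrix` is an illustrative helper using the usual $n - 1$ divisor; pooling across several samples is not shown.

```python
import statistics


def sample_covariance_matrix(data):
    """Return (means, C) for data given as n rows of p values each.

    C[i][j] is the sample covariance between variables i and j,
    computed with the n - 1 divisor; C[i][i] is the sample variance.
    """
    n, p = len(data), len(data[0])
    means = [statistics.mean([row[j] for row in data]) for j in range(p)]
    cov = [[sum((row[i] - means[i]) * (row[j] - means[j]) for row in data) / (n - 1)
            for j in range(p)]
           for i in range(p)]
    return means, cov


# Illustrative data: n = 3 observations on p = 2 variables.
data = [
    [1.0, 2.0],
    [2.0, 4.0],
    [3.0, 9.0],
]

means, C = sample_covariance_matrix(data)
print(means)   # [2.0, 5.0]
print(C)       # [[1.0, 3.5], [3.5, 13.0]]
```

The diagonal entries agree with `statistics.variance` applied to each column, and the matrix is symmetric because the covariance between variables i and j does not depend on their order.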
