Introduction to multivariate analysis

  • Data does not always come with a single response

  • Nor does it always have a response

  • A data set may consist simply of n measurements on p variables

  • For example, a doctor might record a patient’s height, weight, blood pressure and pulse.

  • We could envision situations where any one of these variables is the response and the others are predictors.

  • Or we could have a situation where we just wanted to examine similarities (and differences) of patients

Statistical Data Analysis - Lecture25 - 23/05/03


Multivariate data

  • So we can see that in some ways we’ve already encountered multivariate data – multiple regression is an example

  • But we really haven’t learned how to deal with anything other than a single response

  • And we’re not going to!

  • There is a whole body of multivariate data analysis literature devoted to extensions of techniques we’ve seen so far

    • Hotelling’s $T^2$ – a multivariate extension to the t-test

    • MANOVA – multivariate ANOVA with more than one response

    • Multiple regression where the response is multidimensional (canonical correlation)


  • The fact is that in some ways these extensions are trivial

  • Sure, the interpretation issues are harder

  • And the assumptions are just about impossible to verify

  • For this reason, we will concentrate on multivariate data description

  • This is by no means easy

  • For example, how do we visualize data in more than 3 dimensions?

  • As with EDA for low-dimensional problems, multivariate data visualization and explanation is possibly one of the most important treatments of multivariate data


An introduction to linear algebra

  • Whilst it is theoretically possible to avoid discussing linear algebra when talking about multivariate techniques, it is practically impossible.

  • The reason for this is that we need a common language in order to get some handle on what we’re doing and what we’re talking about

  • Linear algebra provides that common language (and indeed underlies a majority of the statistics you have encountered already)

  • We will not get hung up on computational techniques, as they are often abstracted from the theory


Some definitions

  • Dimensions in linear algebra – every object in linear algebra (scalar, vector or matrix) has a set of dimensions associated with it.

  • These dimensions are reported as rows and columns. So for example a scalar (a real number) has 1 row and 1 column. An r × c matrix has r rows and c columns

  • An n-dimensional row vector is a list of n numbers (arranged in 1 row and n columns) which describes a point in n-dimensional space.

  • An n-dimensional column vector is a list of n numbers (arranged in n rows and 1 column) which describes a point in n-dimensional space.

  • In terms of using row or column vectors to describe a point in space, it makes no difference. However, they do behave differently when it comes to operations like multiplication


  • If a vector (of length n) is said to be real valued then it represents a point in $\mathbb{R}^n$, n-dimensional real space ($\mathbb{R}^2$ is the standard 2D plane, and $\mathbb{R}^3$ is the space we live in)

  • We write vectors as bold lower case letters.

  • In statistics, the default orientation of a vector (unless otherwise defined by the author) is a column vector.

    E.g. $\mathbf{a}$ is a column vector, not a row vector

  • An n × p matrix is a collection of np elements arranged in n rows and p columns.

  • Given that we tend to represent variables as column vectors, a matrix is a collection of p column vectors of length n.

  • We write matrices as bold capital letters, e.g. X is a matrix


  • The elements of a matrix are referred to by two indices, a row index and a column index, so $x_{ij}$ refers to the jth element in the ith row of the matrix $\mathbf{X}$.

  • The transpose of a matrix swaps the rows and the columns. That is, an n × p matrix becomes a p × n matrix with the ijth element equal to $x_{ji}$

  • The transpose of a matrix $\mathbf{X}$ is denoted $\mathbf{X}^T$, $\mathbf{X}^t$ or $\mathbf{X}'$

  • Usually we denote the jth column of $\mathbf{X}$ as $\mathbf{x}_j$ and the ith row as [notation not preserved in the transcript]
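
The indexing and transpose definitions can be tried out in a few lines of code. This is only a minimal sketch in plain Python, assuming a matrix is stored as a list of rows; the helper names `shape`, `transpose` and `column`, and the patient values, are made up for illustration and do not come from the lecture.

```python
# A small data matrix X: n = 3 patients (rows), p = 2 variables (columns).
# The values are invented for illustration: height (cm), weight (kg).
X = [
    [170.0, 65.0],
    [182.0, 80.0],
    [158.0, 52.0],
]

def shape(m):
    """Return (rows, columns) of a matrix stored as a list of rows."""
    return len(m), len(m[0])

def transpose(m):
    """Swap rows and columns: element (i, j) of the result is m[j][i]."""
    return [list(col) for col in zip(*m)]

def column(m, j):
    """Return the j-th column (0-based index) as a list."""
    return [row[j] for row in m]

Xt = transpose(X)
print(shape(X))      # (3, 2)
print(shape(Xt))     # (2, 3)
print(X[0][1])       # x_12 in the 1-based notation of the slides: 65.0
print(column(X, 0))  # the first variable: [170.0, 182.0, 158.0]
```

Note that Python indexes from 0, so the element written $x_{ij}$ on the slides is `X[i-1][j-1]` here.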


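  • As with ordinary multiplication, there is a matrix equivalent of division, but some special conditions apply.

  • If $\mathbf{X}$ is an n × n matrix ($\mathbf{X}$ is then called a square matrix), and $\mathbf{X}$ is of full rank (the only solution $\mathbf{b}$ to the equation $\mathbf{Xb} = \mathbf{0}$ is $\mathbf{b} = \mathbf{0}$), then $\mathbf{X}$ has an inverse $\mathbf{X}^{-1}$, such that $\mathbf{X}\mathbf{X}^{-1} = \mathbf{X}^{-1}\mathbf{X} = \mathbf{I}_n$, where $\mathbf{I}_n$ is the identity matrix (a matrix with ones down the diagonal and zeroes elsewhere)

  • E.g. [the worked example on this slide was not preserved in the transcript]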
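
As a stand-in illustration of the inverse (not the example from the original slide, which is lost), the sketch below inverts a 2 × 2 matrix with the closed-form formula and checks that the product with the original matrix is the identity. The function names and the matrix values are assumptions chosen for the illustration.

```python
def inverse_2x2(m):
    """Invert a 2 x 2 matrix using the closed-form (adjugate / determinant) formula."""
    (a, b), (c, d) = m
    det = a * d - b * c
    if det == 0:
        # A zero determinant means the matrix is not of full rank.
        raise ValueError("matrix is not of full rank, so it has no inverse")
    return [[d / det, -b / det], [-c / det, a / det]]

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]

X = [[2.0, 1.0], [1.0, 1.0]]
X_inv = inverse_2x2(X)
print(X_inv)             # [[1.0, -1.0], [-1.0, 2.0]]
print(matmul(X, X_inv))  # [[1.0, 0.0], [0.0, 1.0]], the identity matrix I_2
```

A matrix such as [[1, 2], [2, 4]], whose second row is twice the first, is not of full rank, and `inverse_2x2` raises an error for it.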


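  • If $\mathbf{A}$ is an n × p matrix and $\mathbf{B}$ is an r × c matrix then we can multiply $\mathbf{B}$ by $\mathbf{A}$ (to get $\mathbf{AB}$) only if p = r

  • Note this does not automatically mean we can multiply $\mathbf{A}$ by $\mathbf{B}$ (to get $\mathbf{BA}$); this can only happen if c = n

  • In general $\mathbf{AB} \neq \mathbf{BA}$

  • If $\mathbf{A}$ is an n × p matrix and $\mathbf{B}$ is an r × c matrix and p = r then the ijth element of the product (let $\mathbf{C} = \mathbf{AB}$) is given by $c_{ij} = \sum_{k=1}^{p} a_{ik} b_{kj}$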

  • This is more easily demonstrated than seen from the formula
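
One way to demonstrate it is with a short sketch, in which each element of the product is the sum over k of (row i of A, element k) times (column j of B, element k). The function name `matmul` and the example matrices are assumptions made for this illustration.

```python
def matmul(a, b):
    """Return the product AB of an n x p matrix a and an r x c matrix b."""
    n, p = len(a), len(a[0])
    r, c = len(b), len(b[0])
    if p != r:
        raise ValueError(f"cannot multiply a {n} x {p} matrix by a {r} x {c} matrix")
    # c_ij is the sum over k of a_ik * b_kj
    return [[sum(a[i][k] * b[k][j] for k in range(p)) for j in range(c)]
            for i in range(n)]

A = [[1, 2], [3, 4]]
B = [[0, 1], [1, 0]]
print(matmul(A, B))  # [[2, 1], [4, 3]]
print(matmul(B, A))  # [[3, 4], [1, 2]], so AB != BA

# A 2 x 3 matrix times a 3 x 1 matrix works (p = r = 3) and gives a 2 x 1 result.
D = [[1, 0, 2], [0, 1, 3]]
E = [[1], [1], [1]]
print(matmul(D, E))  # [[3], [4]]
```

Calling `matmul(E, D)` raises an error, because E has 1 column while D has 2 rows.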


Distance measures

  • There are a variety of different ways that distance is measured (between two multidimensional points), and their pros and their cons are just about as varied

  • If we have just two variables (p = 2) X and Y with n observations on each, then the distance between the ith and the jth point is given by Pythagoras’ theorem: $d_{ij} = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2}$


Euclidean distance

  • The examples you have seen so far are called Euclidean distances and have a natural extension when there are more than three variables (p>3).

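  • For any p the Euclidean distance between two points $\mathbf{x} = (x_1, \dots, x_p)$ and $\mathbf{y} = (y_1, \dots, y_p)$ is given by $d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_{k=1}^{p} (x_k - y_k)^2} = \lVert \mathbf{x} - \mathbf{y} \rVert_2$

  • This should look familiar – this is the distance we minimize for regression.

  • The second part of the equation is called the $L_2$ norm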
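
The Euclidean distance, the square root of the sum of squared coordinate differences, is straightforward to compute for any number of variables. The sketch below is a minimal plain-Python version; the function name `euclidean` and the example points are assumptions made for illustration.

```python
import math

def euclidean(x, y):
    """Euclidean (L2) distance between two points with the same number of coordinates."""
    if len(x) != len(y):
        raise ValueError("points must have the same number of coordinates")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# p = 2: the familiar 3-4-5 right-angled triangle.
print(euclidean([0, 0], [3, 4]))              # 5.0

# p = 4: the same function works unchanged in more than three dimensions.
print(euclidean([1, 2, 3, 4], [2, 4, 5, 8]))  # sqrt(1 + 4 + 4 + 16) = 5.0
```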


  • One direct downside of the Euclidean distance is that it is dominated by variables measured on a large numerical scale (relative to the other variables).

  • For example, if variable X1 measures height in mm and variable X2 measures weight in stone, then most of the distance will be dominated by X1

  • One solution to this is to scale each variable before measuring the distance

  • That is, we subtract the mean of each variable from every measurement for that variable and divide by the standard deviation

  • This can work well, but has the disadvantage of removing information about separation
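
The effect of scaling can be seen in a small sketch. The heights (in mm) and weights (in stone) below are invented for illustration, and the function names are assumptions; each variable is standardized by subtracting its mean and dividing by its sample standard deviation.

```python
import math
import statistics

def standardize(values):
    """Subtract the mean and divide by the sample standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # uses the n - 1 divisor
    return [(v - mean) / sd for v in values]

def euclidean(x, y):
    """Euclidean distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Hypothetical measurements on three patients.
height_mm = [1600.0, 1700.0, 1800.0]
weight_stone = [9.0, 11.0, 13.0]

raw = list(zip(height_mm, weight_stone))
scaled = list(zip(standardize(height_mm), standardize(weight_stone)))

# Distance between the first and third patient, before and after scaling.
d_raw = euclidean(raw[0], raw[2])           # about 200.04, almost entirely height
d_scaled = euclidean(scaled[0], scaled[2])  # about 2.83, both variables contribute equally
print(d_raw, d_scaled)
```

Before scaling, the 4 stone difference in weight adds almost nothing to a distance driven by the 200 mm difference in height; after scaling, the two variables contribute equally.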


Alternative distance measures

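  • There is a whole set of distance measures based on norms

  • The $L_2$ norm is only one of a family of measures (called p norms, or $L_p$ norms) given by the general formula $\lVert \mathbf{x} - \mathbf{y} \rVert_p = \left( \sum_{k=1}^{m} |x_k - y_k|^p \right)^{1/p}$, where m is the number of variables (written m here because p now denotes the order of the norm)

  • When p = 1 the $L_1$ norm is sometimes called the Manhattan distance

  • When p is infinite the $L_\infty$ norm is called the infinity norm (or max or sup norm) and is defined by $\lVert \mathbf{x} - \mathbf{y} \rVert_\infty = \max_k |x_k - y_k|$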
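
The family of $L_p$ distances can be sketched with a single function: raise the absolute coordinate differences to the power p, sum them, and take the pth root, with the maximum difference as the limiting case for infinite p. The function name `lp_distance` and the example points are assumptions made for illustration.

```python
import math

def lp_distance(x, y, p):
    """L_p distance between two points; p = math.inf gives the max (sup) norm."""
    if len(x) != len(y):
        raise ValueError("points must have the same number of coordinates")
    if p < 1:
        raise ValueError("p must be at least 1")
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if p == math.inf:
        return max(diffs)
    return sum(d ** p for d in diffs) ** (1.0 / p)

x = [0.0, 0.0]
y = [3.0, 4.0]
print(lp_distance(x, y, 1))         # Manhattan distance: 3 + 4 = 7.0
print(lp_distance(x, y, 2))         # Euclidean distance: 5.0
print(lp_distance(x, y, math.inf))  # infinity norm: max(3, 4) = 4.0
```

For the same pair of points the distance shrinks as p grows, from the sum of the differences (p = 1) down to the largest single difference (p infinite).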


Mahalanobis distance

  • The Mahalanobis distance is used to measure the distance of a single multivariate observation from the centre of the population that the observation comes from.

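  • If $\mathbf{x} = (x_1, x_2, \dots, x_p)^T$ are the values of $X_1, X_2, \dots, X_p$ for the individual, with corresponding population mean values $\boldsymbol{\mu} = (\mu_1, \mu_2, \dots, \mu_p)^T$

    then

    $$D^2 = (\mathbf{x} - \boldsymbol{\mu})^T \mathbf{V}^{-1} (\mathbf{x} - \boldsymbol{\mu})$$

    where $\mathbf{V}$ is the population covariance matrix.

  • If we have the population means and covariance matrix, and the population is multivariate normal, then $D^2$ follows a chi-square distribution with p degrees of freedom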
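
For p = 2 the squared Mahalanobis distance, $(\mathbf{x} - \boldsymbol{\mu})^T \mathbf{V}^{-1} (\mathbf{x} - \boldsymbol{\mu})$, can be computed by hand using the closed-form inverse of a 2 × 2 matrix. The sketch below assumes the population mean and covariance matrix are known; the function name and all numbers are hypothetical and chosen only to show how correlation changes the distance.

```python
def mahalanobis_sq_2d(x, mu, v):
    """Squared Mahalanobis distance (x - mu)^T V^{-1} (x - mu) for p = 2."""
    (a, b), (c, d) = v
    det = a * d - b * c
    if det == 0:
        raise ValueError("covariance matrix is singular")
    # Closed-form inverse of the 2 x 2 covariance matrix.
    v_inv = [[d / det, -b / det], [-c / det, a / det]]
    dx = [x[0] - mu[0], x[1] - mu[1]]
    return sum(dx[i] * v_inv[i][j] * dx[j] for i in range(2) for j in range(2))

mu = [0.0, 0.0]
V = [[1.0, 0.5], [0.5, 1.0]]  # two positively correlated variables

# Both points are the same Euclidean distance from the centre, but the point
# that goes against the correlation is much further away in Mahalanobis terms.
d_with = mahalanobis_sq_2d([1.0, 1.0], mu, V)      # about 1.33
d_against = mahalanobis_sq_2d([1.0, -1.0], mu, V)  # about 4.0
print(d_with, d_against)
```

With an identity covariance matrix the result reduces to the squared Euclidean distance from the mean.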


  • The covariance matrix is the multivariate equivalent of the variance of a single variable, with the diagonal elements equal to the sample variances and the off-diagonal elements $c_{ij} = \frac{1}{n-1} \sum_{k=1}^{n} (x_{ki} - \bar{x}_i)(x_{kj} - \bar{x}_j)$

    i.e. the sample covariance between the ith and jth variables

  • Unfortunately we almost never have the means or the covariance matrix, and so we must estimate them from the data

  • The covariance matrix V is estimated by taking a pooled average of the sample covariance matrices for each of the samples (groups)

  • It is unclear how quickly the Mahalanobis distance converges to a chi-square distribution, but Manly suggests that when p = 100 (100 independent variables) there should be no problem in assuming this.
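
Estimating the covariance matrix from a single sample can be sketched as follows: the diagonal holds the sample variance of each variable and the off-diagonal elements hold the sample covariances, all with an n - 1 divisor. The function name and the data values are assumptions made for illustration, and the pooling across several groups mentioned in the list is not shown.

```python
import statistics

def covariance_matrix(data):
    """Sample covariance matrix of an n x p data matrix stored as a list of rows."""
    n, p = len(data), len(data[0])
    means = [statistics.mean([row[j] for row in data]) for j in range(p)]
    return [
        [
            sum((row[i] - means[i]) * (row[j] - means[j]) for row in data) / (n - 1)
            for j in range(p)
        ]
        for i in range(p)
    ]

# Hypothetical data: n = 3 observations on p = 2 variables.
data = [
    [1.0, 2.0],
    [2.0, 1.0],
    [3.0, 6.0],
]
C = covariance_matrix(data)
print(C)  # [[1.0, 2.0], [2.0, 7.0]]
```

The matrix is symmetric, since the covariance between variables i and j is the same as that between j and i.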
