
Introduction to multivariate analysis



Presentation Transcript


1. Introduction to multivariate analysis
• Data does not always come with a single response, nor does it always have a response at all.
• A data set may consist simply of n measurements on p variables.
• For example, a doctor might record a patient's height, weight, blood pressure and pulse.
• We could envision situations where any one of these variables is the response and the others are predictors.
• Or we could have a situation where we just want to examine similarities (and differences) between patients.

2. Multivariate data
• So we can see that in some ways we've already encountered multivariate data – multiple regression is an example.
• But we really haven't learned how to deal with anything other than a single response.
• And we're not going to!
• There is a whole body of multivariate data analysis literature devoted to extensions of techniques we've seen so far:
• Hotelling's T² – a multivariate extension of the t-test
• MANOVA – multivariate ANOVA, with more than one response
• Multiple regression where the response is multidimensional (canonical correlation)

3. Multivariate data
• The fact is that in some ways these extensions are trivial.
• Sure, the interpretation issues are harder.
• And the assumptions are just about impossible to verify.
• For this reason, we will concentrate on multivariate data description.
• This is by no means easy – for example, how do we visualize data in more than 3 dimensions?
• Like EDA for low-dimensional problems, multivariate data visualization and exploration is possibly one of the most important treatments of multivariate data.

4. An introduction to linear algebra
• Whilst it is theoretically possible to avoid discussing linear algebra when talking about multivariate techniques, it is practically impossible.
• The reason for this is that we need a common language in order to get some handle on what we're doing and what we're talking about.
• Linear algebra provides that common language (and indeed underlies the majority of the statistics you have encountered already).
• We will not get hung up on computational techniques, as they are often abstracted away from the theory.

5. Some definitions
• Dimensions in linear algebra – every object in linear algebra (scalar, vector or matrix) has a set of dimensions associated with it.
• These dimensions are reported as rows and columns. So, for example, a scalar (a real number) has 1 row and 1 column. An r × c matrix has r rows and c columns.
• An n-dimensional row vector is a list of n numbers (arranged in 1 row and n columns) which describes a point in n-dimensional space.
• An n-dimensional column vector is a list of n numbers (arranged in n rows and 1 column) which describes a point in n-dimensional space.
• In terms of using row or column vectors to describe a point in space, it makes no difference. However, they do behave differently when it comes to operations like multiplication.

6.
• If a vector (of length n) is said to be real valued, then it represents a point in ℝⁿ – real n-dimensional space (ℝ² is the standard 2D plane, and ℝ³ is the space we live in).
• We write vectors as bold lower case letters.
• In statistics, the default orientation (unless otherwise defined by the author) of a vector is a column vector. E.g. a is a column vector, not a row vector.
• An n × p matrix is a collection of np elements arranged in n rows and p columns.
• Given that we tend to represent variables as column vectors, a matrix is a collection of p column vectors of length n.
• We write matrices as bold capital letters, e.g. X is a matrix.
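None of the following code appears in the original lecture; it is a minimal NumPy sketch (assuming numpy is installed) of these shape conventions:

    import numpy as np

    # A column vector (n rows, 1 column) and a row vector (1 row, n columns)
    # describe the same point in 3-dimensional space.
    a = np.array([[1.0], [2.0], [3.0]])   # 3 x 1 column vector
    a_row = a.T                           # 1 x 3 row vector

    # An n x p matrix as a collection of p column vectors of length n:
    # here, 4 observations (rows) on 2 variables (columns).
    X = np.array([[1.0, 10.0],
                  [2.0, 20.0],
                  [3.0, 30.0],
                  [4.0, 40.0]])
    print(a.shape, a_row.shape, X.shape)  # (3, 1) (1, 3) (4, 2)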

7.
• The elements of a matrix are referred to by two indices, a row index and a column index, so $x_{ij}$ refers to the jth element in the ith row of the matrix X.
• The transpose of a matrix reverses the order of the rows and the columns. That is, an n × p matrix becomes a p × n matrix with the ijth element equal to $x_{ji}$.
• The transpose of a matrix X is denoted Xᵀ, Xᵗ or X′.
• Usually we denote the jth column of X as $\mathbf{x}_j$ and the ith row as $\mathbf{x}_i^T$.
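A small NumPy illustration of indexing and the transpose (the example matrix here is invented, not from the slides; note NumPy indices are 0-based where the slides are 1-based):

    import numpy as np

    X = np.array([[1, 2, 3],
                  [4, 5, 6]])      # a 2 x 3 matrix

    print(X[0, 2])                 # element x_13 = 3

    # The transpose reverses rows and columns: 2 x 3 becomes 3 x 2,
    # and the (i, j)th element of X.T equals x_ji.
    print(X.T.shape)               # (3, 2)
    print(X.T[2, 0] == X[0, 2])    # True

    print(X[:, 1])                 # the second column of X
    print(X[0, :])                 # the first row of X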

8.
• Like ordinary multiplication, there is a matrix equivalent of division, but some special conditions apply.
• If X is an n × n matrix (then X is called a square matrix), and X is of full rank (the only solution b to the equation Xb = 0 is b = 0), then X has an inverse X⁻¹, such that XX⁻¹ = X⁻¹X = I_n, where I_n is the identity matrix (a matrix with ones down the diagonal and zeroes elsewhere).
• A numerical example is given in the sketch below.
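A minimal NumPy sketch of the inverse (the 2 × 2 matrix is an invented example, since the slide's original numerical example did not survive the transcript):

    import numpy as np

    X = np.array([[2.0, 1.0],
                  [1.0, 3.0]])          # a square, full-rank matrix

    X_inv = np.linalg.inv(X)
    I2 = np.eye(2)                      # the 2 x 2 identity matrix

    # X X^-1 and X^-1 X both recover the identity (up to rounding error).
    print(np.allclose(X @ X_inv, I2))   # True
    print(np.allclose(X_inv @ X, I2))   # True

    # A rank-deficient matrix has no inverse: np.linalg.inv raises
    # LinAlgError; np.linalg.matrix_rank can be used to check first.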

9.
• If A is an n × p matrix and B is an r × c matrix, then we can multiply B by A (to get AB) only if p = r.
• Note this does not automatically mean we can multiply A by B – this can only happen if c = n.
• In general, AB ≠ BA.
• If A is an n × p matrix and B is an r × c matrix with p = r, then the ijth element of the product C = AB is given by $c_{ij} = \sum_{k=1}^{p} a_{ik} b_{kj}$.
• This is more easily demonstrated than seen from the formula – see the sketch below.
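A quick NumPy demonstration (with invented matrices) of conformability and non-commutativity:

    import numpy as np

    A = np.array([[1, 2, 3],
                  [4, 5, 6]])          # 2 x 3
    B = np.array([[1, 0],
                  [0, 1],
                  [1, 1]])             # 3 x 2

    # AB is defined because A has 3 columns and B has 3 rows; result is 2 x 2.
    C = A @ B
    print(C.shape)                     # (2, 2)

    # c_ij is the sum over k of a_ik * b_kj, e.g. c_11:
    print(C[0, 0] == sum(A[0, k] * B[k, 0] for k in range(3)))  # True

    # BA is also defined here, but it is 3 x 3: in general AB != BA.
    print((A @ B).shape, (B @ A).shape)  # (2, 2) (3, 3)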

10. Distance measures
• There are a variety of different ways that distance is measured (between two multidimensional points), and their pros and cons are just about as varied.
• If we have just two variables (p = 2), X and Y, with n observations on each, then the distance between the ith and jth points is given by Pythagoras' theorem: $d_{ij} = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2}$.

11. [This slide consisted of a worked figure/example only; the image is not recoverable from the transcript.]

12. Euclidean distance
• The examples you have seen so far are called Euclidean distances, and they have a natural extension when there are more than three variables (p > 3).
• For any p, the Euclidean distance between two points is given by $d_{ij} = \sqrt{\sum_{k=1}^{p} (x_{ik} - x_{jk})^2} = \lVert \mathbf{x}_i - \mathbf{x}_j \rVert_2$.
• This should look familiar – this is the distance we minimize for regression.
• The second part of the equation is called the L2 norm.
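A minimal Python/NumPy sketch of this distance (random example data, not from the lecture):

    import numpy as np

    # Five observations on p = 4 variables.
    rng = np.random.default_rng(1)
    X = rng.normal(size=(5, 4))

    def euclidean(u, v):
        """Euclidean (L2) distance between two p-dimensional points."""
        return np.sqrt(np.sum((u - v) ** 2))

    print(euclidean(X[0], X[1]))

    # Equivalent to the L2 norm of the difference vector:
    print(np.linalg.norm(X[0] - X[1]))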

13.
• One direct downside of the Euclidean distance is that it is dominated by variables measured on a large scale (relative to the other variables).
• For example, if variable X1 measures height in mm and variable X2 measures weight in stone, then most of the distance will be dominated by X1.
• One solution to this is to scale each variable before measuring the distance.
• That is, we subtract the mean of each variable from every measurement for that variable and divide by the standard deviation (see the sketch below).
• This can work well, but has the disadvantage of removing information about separation.
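A short sketch of this standardization step (the height/weight numbers are invented for illustration):

    import numpy as np

    # Height in mm and weight in stone: very different scales.
    X = np.array([[1700.0, 11.0],
                  [1800.0, 14.0],
                  [1650.0, 10.0]])

    # Standardize each column: subtract its mean, divide by its
    # standard deviation.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

    # Distances on Z give each variable equal weight.
    print(np.linalg.norm(X[0] - X[1]))   # dominated by height
    print(np.linalg.norm(Z[0] - Z[1]))   # both variables contribute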

14. Alternative distance measures
• There is a whole set of distance measures based on norms.
• The L2 norm is only one of a family of measures (called p-norms, or Lp norms) given by the general formula $\lVert \mathbf{x} \rVert_p = \left( \sum_{i=1}^{n} |x_i|^p \right)^{1/p}$.
• When p = 1, the L1 norm is sometimes called the Manhattan distance.
• As p tends to infinity, the L∞ norm is called the infinity norm (or max or sup norm) and is defined by $\lVert \mathbf{x} \rVert_\infty = \max_i |x_i|$.
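These norms can be checked numerically; a small NumPy sketch (not from the original slides):

    import numpy as np

    # Think of d as the difference between two points.
    d = np.array([3.0, -4.0, 1.0])

    # L1 (Manhattan), L2 (Euclidean) and L-infinity (max) norms.
    print(np.linalg.norm(d, ord=1))       # 8.0   = |3| + |-4| + |1|
    print(np.linalg.norm(d, ord=2))       # ~5.10 = sqrt(9 + 16 + 1)
    print(np.linalg.norm(d, ord=np.inf))  # 4.0   = max(|3|, |-4|, |1|)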

15. Mahalanobis distance
• The Mahalanobis distance is used to measure the distance of a single multivariate observation from the centre of the population that the observation comes from.
• If $\mathbf{x} = (x_1, \ldots, x_p)^T$ are the values of the p variables for the individual, with corresponding population mean values $\boldsymbol{\mu} = (\mu_1, \ldots, \mu_p)^T$, then $D^2 = (\mathbf{x} - \boldsymbol{\mu})^T V^{-1} (\mathbf{x} - \boldsymbol{\mu})$, where V is the population covariance matrix.
• If we have the population means and covariance matrix, then D² follows a chi-square distribution with p degrees of freedom.

16. Mahalanobis distance
• The covariance matrix is the multivariate equivalent of the variance of a single variable, with the diagonal elements equal to the sample variances and the off-diagonal elements $c_{ij}$ equal to the sample covariance between the ith and jth variables.
• Unfortunately, we almost never have the population means or the covariance matrix, and so we must estimate them from the data.
• The covariance matrix V is estimated by taking a pooled average of the sample covariance matrices for each of the samples.
• It is unclear how quickly the Mahalanobis distance converges to a chi-square distribution, but Manly suggests that when p = 100 (100 independent variables) there should be no problem in assuming this. A sketch of the sample-based calculation follows below.
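A sketch of the sample-based Mahalanobis calculation against the chi-square reference (illustrative only; assumes numpy and scipy, which the lecture does not mention, and simulated data in place of real measurements):

    import numpy as np
    from scipy import stats

    # n observations on p variables from some population.
    rng = np.random.default_rng(0)
    n, p = 200, 3
    X = rng.multivariate_normal(mean=[0, 0, 0],
                                cov=[[2, 1, 0], [1, 2, 0], [0, 0, 1]],
                                size=n)

    # Estimate the mean vector and covariance matrix from the data.
    xbar = X.mean(axis=0)
    V = np.cov(X, rowvar=False)          # p x p sample covariance matrix
    V_inv = np.linalg.inv(V)

    # Squared Mahalanobis distance of each observation from the centre:
    # D2_i = (x_i - xbar)^T V^-1 (x_i - xbar).
    diffs = X - xbar
    D2 = np.einsum('ij,jk,ik->i', diffs, V_inv, diffs)

    # Compare against chi-square with p degrees of freedom, e.g. flag
    # observations beyond the 97.5th percentile as unusual.
    cutoff = stats.chi2.ppf(0.975, df=p)
    print(np.sum(D2 > cutoff))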
