Introduction to multivariate analysis
  • Data does not always come with a single response
  • Nor does it always have a response
  • A data set may consist simply of n measurements on each of p variables
  • For example, a doctor might record each patient’s height, weight, blood pressure and pulse.
  • We could envision situations where any one of these variables is the response and the others are predictors.
  • Or we could have a situation where we just wanted to examine similarities (and differences) between patients

Statistical Data Analysis - Lecture25 - 23/05/03

Multivariate data
  • So we can see that in some ways we’ve already encountered multivariate data – multiple regression is an example
  • But we really haven’t learned how to deal with anything other than a single response
  • And we’re not going to!
  • There is a whole body of multivariate data analysis literature devoted to extensions of the techniques we’ve seen so far
    • Hotelling’s T² – a multivariate extension of the t-test
    • MANOVA – multivariate ANOVA with more than one response
    • Multiple regression where the response is multidimensional (canonical correlation)

Multivariate data
  • The fact is that in some ways these extensions are trivial
  • Sure, the interpretation issues are harder
  • And the assumptions are just about impossible to verify
  • For this reason, we will concentrate on multivariate data description
  • This is by no means easy
  • For example, how do we visualize data in more than 3 dimensions?
  • As with EDA for low-dimensional problems, visualizing and describing multivariate data is possibly one of the most important things we can do with it

An introduction to linear algebra
  • Whilst it is theoretically possible to avoid discussing linear algebra when talking about multivariate techniques, it is practically impossible.
  • The reason for this is that we need a common language in order to get some handle on what we’re doing and what we’re talking about
  • Linear algebra provides that common language (and indeed underlies a majority of the statistics you have encountered already)
  • We will not get hung up on computational techniques, as they are often abstracted from the theory

Some definitions
  • Dimensions in linear algebra – every object in linear algebra (scalar, vector or matrix) has a set of dimensions associated with it.
  • These dimensions are reported as rows and columns. So, for example, a scalar (a real number) has 1 row and 1 column. An r × c matrix has r rows and c columns
  • An n-dimensional row vector is a list of n numbers (arranged in 1 row and n columns) which describes a point in n-dimensional space.
  • An n-dimensional column vector is a list of n numbers (arranged in n rows and 1 column) which describes a point in n-dimensional space.
  • In terms of using row or column vectors to describe a point in space, it makes no difference. However, they do behave differently when it comes to operations like multiplication
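
The dimension conventions above can be sketched in plain Python, assuming a matrix is stored as a list of rows. The helper name `shape` and the numbers are hypothetical, chosen only for illustration.

```python
def shape(obj):
    """Return (rows, columns) for an object stored as a list of rows."""
    return (len(obj), len(obj[0]))

scalar = [[3.5]]                             # 1 row, 1 column
row_vector = [[170.0, 65.0, 120.0]]          # 1 row, 3 columns
column_vector = [[170.0], [65.0], [120.0]]   # 3 rows, 1 column
matrix = [[170.0, 65.0, 120.0],
          [182.0, 80.0, 135.0]]              # 2 rows, 3 columns

print(shape(scalar), shape(row_vector), shape(column_vector), shape(matrix))
# (1, 1) (1, 3) (3, 1) (2, 3)
```

The row and column vectors hold the same three numbers, so they describe the same point; only their dimensions differ.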

  • If a vector (of length n) is said to be real valued then it represents a point in $\mathbb{R}^n$, the n-dimensional real space ($\mathbb{R}^2$ is the standard 2D plane, and $\mathbb{R}^3$ is the space we live in)
  • We write vectors as bold lower case letters.
  • In statistics, the default orientation of a vector (unless otherwise defined by the author) is a column vector.

E.g. $\mathbf{a}$ is a column vector, not a row vector

  • An n × p matrix is a collection of np elements arranged in n rows and p columns.
  • Given that we tend to represent variables as column vectors, a matrix is a collection of p column vectors of length n.
  • We write matrices as bold capital letters, e.g. $\mathbf{X}$ is a matrix

  • The elements of a matrix are referred to by two indices, a row index and a column index, so $x_{ij}$ refers to the jth element on the ith row of the matrix $\mathbf{X}$.
  • The transpose of a matrix interchanges its rows and columns. That is, an n × p matrix becomes a p × n matrix whose ijth element is equal to $x_{ji}$
  • The transpose of a matrix $\mathbf{X}$ is denoted $\mathbf{X}^T$, $\mathbf{X}^t$ or $\mathbf{X}'$
  • Usually we denote the jth column of $\mathbf{X}$ as $\mathbf{x}_j$ and the ith row as
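
A minimal sketch of the transpose, again assuming a matrix is stored as a list of rows (the function name `transpose` is illustrative):

```python
def transpose(matrix):
    """Swap rows and columns: element (i, j) of the result is element (j, i) of the input."""
    return [list(column) for column in zip(*matrix)]

X = [[1, 2, 3],
     [4, 5, 6]]       # a 2 x 3 matrix
Xt = transpose(X)     # a 3 x 2 matrix
print(Xt)             # [[1, 4], [2, 5], [3, 6]]
```

Transposing twice returns the original matrix.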

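  • As with ordinary multiplication, there is a matrix equivalent of division, but some special conditions apply.
  • If $\mathbf{X}$ is an n × n matrix (such an $\mathbf{X}$ is called a square matrix), and $\mathbf{X}$ is of full rank (the only solution $\mathbf{b}$ to the equation $\mathbf{X}\mathbf{b} = \mathbf{0}$ is $\mathbf{b} = \mathbf{0}$), then $\mathbf{X}$ has an inverse $\mathbf{X}^{-1}$, such that $\mathbf{X}\mathbf{X}^{-1} = \mathbf{X}^{-1}\mathbf{X} = \mathbf{I}_n$, where $\mathbf{I}_n$ is the n × n identity matrix (a matrix with ones down the diagonal and zeroes elsewhere)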
  • E.g. if
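
A small illustration of the inverse, assuming a hypothetical 2 × 2 matrix chosen for this sketch rather than taken from the lecture. The helpers `inverse_2x2` and `matmul` are illustrative names. The closed-form inverse used here applies only to the 2 × 2 case.

```python
def inverse_2x2(m):
    """Invert a 2 x 2 matrix [[a, b], [c, d]] using the closed-form formula."""
    (a, b), (c, d) = m
    det = a * d - b * c
    if det == 0:
        raise ValueError("matrix is not of full rank, so it has no inverse")
    return [[d / det, -b / det],
            [-c / det, a / det]]

def matmul(A, B):
    """Multiply two conformable matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

X = [[2.0, 1.0],
     [1.0, 1.0]]          # hypothetical full-rank matrix (determinant 1)
X_inv = inverse_2x2(X)
print(X_inv)              # [[1.0, -1.0], [-1.0, 2.0]]
print(matmul(X, X_inv))   # [[1.0, 0.0], [0.0, 1.0]], the 2 x 2 identity
```

A matrix whose rows are multiples of each other, such as [[1, 2], [2, 4]], has determinant zero and is rejected, which matches the full-rank condition.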

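  • If $\mathbf{A}$ is an n × p matrix and $\mathbf{B}$ is an r × c matrix, then we can multiply $\mathbf{B}$ by $\mathbf{A}$ (to get $\mathbf{AB}$) only if p = r
  • Note this does not automatically mean we can multiply $\mathbf{A}$ by $\mathbf{B}$ (to get $\mathbf{BA}$) – this can only happen if c = n
  • In general $\mathbf{AB} \neq \mathbf{BA}$
  • If $\mathbf{A}$ is an n × p matrix, $\mathbf{B}$ is an r × c matrix and p = r, then the ijth element of the product (let $\mathbf{C} = \mathbf{AB}$) is given by

$$c_{ij} = \sum_{k=1}^{p} a_{ik} b_{kj}$$

  • This is more easily demonstrated than seen from the formula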
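
As a demonstration, here is a minimal sketch of matrix multiplication in plain Python. The function name `matmul` and the two small matrices are hypothetical choices for illustration. Each element of the product is the sum over k of the products of the ith row of A with the jth column of B.

```python
def matmul(A, B):
    """Multiply an n x p matrix A by an r x c matrix B (requires p == r)."""
    p, r = len(A[0]), len(B)
    if p != r:
        raise ValueError(f"cannot multiply: A has {p} columns but B has {r} rows")
    # Element (i, j) of the product is the sum over k of A[i][k] * B[k][j].
    return [[sum(A[i][k] * B[k][j] for k in range(p)) for j in range(len(B[0]))]
            for i in range(len(A))]

A = [[1, 2],
     [3, 4]]
B = [[0, 1],
     [1, 0]]
print(matmul(A, B))   # [[2, 1], [4, 3]]
print(matmul(B, A))   # [[3, 4], [1, 2]]
```

The two products differ, so the order of multiplication matters even when both products exist.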

Distance measures
  • There are many ways to measure the distance between two multidimensional points, and their pros and cons are just about as varied
  • If we have just two variables (p = 2) X and Y with n observations on each, then the distance between the ith and the jth point is given by Pythagoras’ theorem

$$d_{ij} = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2}$$

Euclidean distance
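  • The examples you have seen so far are called Euclidean distances and have a natural extension when there are more than three variables (p > 3).
  • For any p the Euclidean distance between the ith and jth points is given by

$$d_{ij} = \sqrt{\sum_{k=1}^{p} (x_{ik} - x_{jk})^2} = \lVert \mathbf{x}_i - \mathbf{x}_j \rVert_2$$

  • This should look familiar – it is the distance (between the observed and fitted responses) that we minimize in least-squares regression.
  • The second part of the equation is called the L2 norm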
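
A minimal sketch of the Euclidean distance for any number of variables. The two patient records are hypothetical values for height, weight, blood pressure and pulse, and the function name is illustrative.

```python
import math

def euclidean_distance(x, y):
    """Square root of the sum of squared coordinate differences (the L2 norm of x - y)."""
    if len(x) != len(y):
        raise ValueError("points must have the same number of coordinates")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

patient_a = [170.0, 65.0, 120.0, 72.0]   # hypothetical measurements
patient_b = [174.0, 68.0, 120.0, 72.0]
print(euclidean_distance(patient_a, patient_b))   # 5.0
```

In Python 3.8 and later, the standard library function `math.dist` computes the same quantity.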

  • One direct downside of the Euclidean distance is that it is dominated by variables measured on a large scale (those with a large spread relative to the other variables).
  • For example, if variable X1 measures height in mm and variable X2 measures weight in stone, then the distance will be dominated by X1
  • One solution to this is to scale each variable before measuring the distance
  • That is, we subtract the mean of each variable from every measurement for that variable and divide by the standard deviation
  • This can work well, but has the disadvantage of removing information about separation
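
The effect of scaling can be sketched as follows, assuming three hypothetical patients with height in mm and weight in stone. The function names are illustrative, and the sample standard deviation is used for the scaling.

```python
import math
import statistics

def standardise(values):
    """Subtract the mean and divide by the sample standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]

def euclidean_distance(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

height_mm = [1600.0, 1700.0, 1800.0]   # hypothetical heights
weight_stone = [9.0, 13.0, 11.0]       # hypothetical weights

raw = list(zip(height_mm, weight_stone))
scaled = list(zip(standardise(height_mm), standardise(weight_stone)))

# Raw distance between patients 1 and 2: almost entirely due to height.
print(euclidean_distance(raw[0], raw[1]))
# After scaling, both variables contribute on a comparable footing.
print(euclidean_distance(scaled[0], scaled[1]))
```

In the raw data the squared height difference is 10000 while the squared weight difference is only 16. After scaling, the weight difference is the larger of the two.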

Alternative distance measures
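  • There is a whole set of distance measures based on norms
  • The L2 norm is only one of a family of measures (called p norms, or Lp norms) given by the general formula below, where p is the order of the norm rather than the number of variables and the sum runs over the variables

$$\lVert \mathbf{x} - \mathbf{y} \rVert_p = \left( \sum_{k} \lvert x_k - y_k \rvert^p \right)^{1/p}$$

  • When p = 1 the L1 norm is sometimes called the Manhattan distance
  • When p is infinite, the $L_\infty$ norm is called the infinity norm (or max or sup norm) and is defined by

$$\lVert \mathbf{x} - \mathbf{y} \rVert_\infty = \max_{k} \lvert x_k - y_k \rvert$$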
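
A minimal sketch of the Lp family, with the infinity norm handled as a special case. The function name `lp_distance` and the two points are hypothetical choices for illustration.

```python
import math

def lp_distance(x, y, p):
    """Lp distance between x and y; p = math.inf gives the max (sup) norm."""
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if math.isinf(p):
        return max(diffs)
    return sum(d ** p for d in diffs) ** (1.0 / p)

x = [0.0, 0.0]
y = [3.0, 4.0]
print(lp_distance(x, y, 1))          # 7.0  (Manhattan distance)
print(lp_distance(x, y, 2))          # 5.0  (Euclidean distance)
print(lp_distance(x, y, math.inf))   # 4.0  (largest single coordinate difference)
```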

Mahalanobis distance
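  • The Mahalanobis distance is used to measure the distance of a single multivariate observation from the centre of the population that the observation comes from.
  • If $x_1, x_2, \ldots, x_p$ are the values of the variables $X_1, X_2, \ldots, X_p$ for the individual, with corresponding population mean values $\mu_1, \mu_2, \ldots, \mu_p$,

then

$$D^2 = (\mathbf{x} - \boldsymbol{\mu})^T \mathbf{V}^{-1} (\mathbf{x} - \boldsymbol{\mu})$$

where $\mathbf{x} = (x_1, \ldots, x_p)^T$, $\boldsymbol{\mu} = (\mu_1, \ldots, \mu_p)^T$ and $\mathbf{V}$ is the population covariance matrix.

  • If the population is multivariate normal and we know the population means and covariance matrix, then $D^2$ follows a chi-square distribution with p degrees of freedom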
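
A sketch of the calculation for p = 2, assuming a known (here entirely hypothetical) population mean vector and covariance matrix. The function name is illustrative. The tail probability uses the fact that a chi-square distribution with 2 degrees of freedom has upper tail probability exp(−D²/2), so this shortcut applies only when p = 2.

```python
import math

def mahalanobis_squared_2d(x, mu, V):
    """D^2 = (x - mu)^T V^{-1} (x - mu) for p = 2, using the closed-form 2 x 2 inverse."""
    (a, b), (c, d) = V
    det = a * d - b * c
    if det == 0:
        raise ValueError("covariance matrix is singular")
    V_inv = [[d / det, -b / det],
             [-c / det, a / det]]
    diff = [x[0] - mu[0], x[1] - mu[1]]
    return sum(diff[i] * V_inv[i][j] * diff[j] for i in range(2) for j in range(2))

mu = [170.0, 70.0]        # hypothetical population means
V = [[100.0, 40.0],
     [40.0, 64.0]]        # hypothetical population covariance matrix
x = [185.0, 72.0]         # one individual's measurements

d2 = mahalanobis_squared_2d(x, mu, V)
p_value = math.exp(-d2 / 2.0)   # chi-square upper tail, valid for 2 degrees of freedom only
print(d2)                       # about 2.58
print(p_value)
```

When V is the identity matrix, D² reduces to the squared Euclidean distance from the mean.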

Mahalanobis distance
  • The covariance matrix is the multivariate equivalent of the variance of a single variable, with the diagonal elements equal to the sample variances and the off-diagonal elements $c_{ij}$ given by

$$c_{ij} = \frac{1}{n-1} \sum_{k=1}^{n} (x_{ki} - \bar{x}_i)(x_{kj} - \bar{x}_j)$$

i.e. the sample covariance between the ith and jth variables

  • Unfortunately we almost never have the means or the covariance matrix, and so we must estimate them from the data
  • The covariance matrix V is estimated by the sample covariance matrix or, when the data come from several groups, by taking a pooled average of the covariance matrices for each of the groups
  • It is unclear how quickly the Mahalanobis distance converges to a chi-square distribution, but Manly suggests when p = 100 (100 independent variables) there should be no problem in assuming this.
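
Estimating the covariance matrix from a single sample can be sketched as follows, assuming the data are stored with one row per observation and one column per variable. The function name and the small data set are hypothetical.

```python
def sample_covariance_matrix(rows):
    """Sample covariance matrix (divisor n - 1) of data with one row per observation."""
    n = len(rows)
    p = len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(p)]
    # Element (i, j) is the sample covariance between variables i and j;
    # the diagonal elements are the sample variances.
    return [[sum((row[i] - means[i]) * (row[j] - means[j]) for row in rows) / (n - 1)
             for j in range(p)]
            for i in range(p)]

data = [[1.0, 2.0],
        [2.0, 4.0],
        [3.0, 9.0]]   # hypothetical: 3 observations on 2 variables
C = sample_covariance_matrix(data)
print(C)   # [[1.0, 3.5], [3.5, 13.0]]
```

The result is symmetric, since the covariance between variables i and j equals that between j and i.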
