
Linear Methods For Classification Chapter 4



  1. Linear Methods For Classification, Chapter 4 • Machine Learning Seminar • Shinjae Yoo, Tal Blum

  2. Bayesian Decision Theory • World states ω_j (i.e. classes) • Actions α(x) (i.e. classifications) • R(α(x) | x) – risk or cost function • The total risk: R = ∫ R(α(x) | x) p(x) dx • The Bayes decision rule: α(x) = argmin_j R(α_j | x), where R(α_j | x) = Σ_{k=1..c} λ(α_j | ω_k) P(ω_k | x)
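A small numerical sketch of the Bayes decision rule (the loss matrix and posteriors below are illustrative values of my own, not from the slides): compute the conditional risk of each action and take the minimizer.

```python
import numpy as np

# Illustrative numbers: 3 classes, 3 actions, a mostly zero-one loss with one
# asymmetric entry, and posteriors P(w_k | x) for a single observation x.
loss = np.array([[0.0, 1.0, 1.0],      # lambda(alpha_j | w_k); rows = actions
                 [1.0, 0.0, 2.0],
                 [1.0, 1.0, 0.0]])
posterior = np.array([0.2, 0.5, 0.3])  # P(w_k | x)

cond_risk = loss @ posterior           # R(alpha_j | x) = sum_k lambda(j|k) P(w_k|x)
bayes_action = int(np.argmin(cond_risk))  # Bayes rule: minimize conditional risk
print(cond_risk, bayes_action)
```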

  3. Minimum Error-Rate Classification • Introduce the zero-one loss function: λ(α_i | ω_j) = 0 if i = j, 1 otherwise • Therefore, the conditional risk is: R(α_i | x) = Σ_{j≠i} P(ω_j | x) = 1 − P(ω_i | x) • "The risk corresponding to this loss function is the average probability of error"

  4. Minimum Error-Rate Classification • Minimizing the risk is equivalent to maximizing P(ω_i | x) • Optimal strategy: decide ω_i if P(ω_i | x) > P(ω_j | x) for all j ≠ i • For classification: choose the class with the highest posterior probability

  5. Discriminant Functions • Denote g_i(x) = P(ω_i | x) • For any monotonically increasing f, the set {f(g_i(x))} gives the same classification as the set {g_i(x)} • Examples: g_c'(x) = P(x | c') P(c') / Σ_c P(x | c) P(c);  g_c'(x) = P(x | c') P(c');  g_c'(x) = ln P(x | c') + ln P(c')

  6. Linear Discriminant Functions • Special case where a monotone function f(g(x)) of the discriminant is linear • Example: logistic regression, with f the log function • The decision boundary is then linear

  7. Extensions to Linear Discriminant Functions

  8. Linear Regression Of An Indicator Matrix • Y = (Y_1, …, Y_K), an N × K indicator matrix • Y_{j,k} = 1 iff G_j = k • Can be seen as K separate linear regressions

  9. Linear Regression Of An Indicator Matrix • The algorithm: fit B̂ = (XᵀX)⁻¹XᵀY by least squares • For a new input x, compute the K-vector of fitted values f̂(x) = [(1, x)B̂]ᵀ • Classify to the largest component: Ĝ(x) = argmax_k f̂_k(x)

  10. Linear Regression Of An Indicator Matrix • Properties: • Is linear regression flexible enough to model f_i(x)? • f̂_i(x) can be negative or bigger than 1 • Incorporating more basis elements can help • Gives the same estimate as min_B Σ_{i=1..N} ||y_i − [(1, x_i)B]ᵀ||²
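A minimal sketch of this procedure on synthetic data (class structure and sizes are my own choice): build the indicator matrix, run one multi-output least-squares fit, and classify to the largest fitted value.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data: N points in 2-D, class labels g in {0, 1, 2}
N, K = 150, 3
g = rng.integers(0, K, size=N)
X = rng.normal(size=(N, 2)) + 3.0 * np.eye(K)[g, :2]   # shift each class

Y = np.eye(K)[g]                        # N x K indicator matrix, Y[j, k] = 1 iff g_j = k
X1 = np.column_stack([np.ones(N), X])   # add intercept column
B, *_ = np.linalg.lstsq(X1, Y, rcond=None)   # K separate least-squares fits at once

f_hat = X1 @ B                          # fitted values, one column per class
g_hat = f_hat.argmax(axis=1)            # classify to the largest fitted value
print("training accuracy:", (g_hat == g).mean())
```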

  11. Masking Effect
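A small simulation of the masking effect (my own illustration, in the spirit of the figure this slide refers to): three well-separated classes strung out along one direction; the fitted value for the middle class is nearly flat, so it is (almost) never predicted.

```python
import numpy as np

rng = np.random.default_rng(1)
# Three Gaussian classes along a line: means at -4, 0, +4.
n = 200
x = np.concatenate([rng.normal(m, 1.0, n) for m in (-4.0, 0.0, 4.0)])[:, None]
g = np.repeat([0, 1, 2], n)

Y = np.eye(3)[g]
X1 = np.column_stack([np.ones(len(x)), x])
B, *_ = np.linalg.lstsq(X1, Y, rcond=None)
g_hat = (X1 @ B).argmax(axis=1)

# The middle class is "masked": its count of predictions is (near) zero.
print(np.bincount(g_hat, minlength=3))
```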

  12. Gaussian Discriminant Functions • Multivariate normal density in d dimensions: p(x) = (2π)^(−d/2) |Σ|^(−1/2) exp[−½ (x − μ)ᵀ Σ⁻¹ (x − μ)] • where x = (x_1, x_2, …, x_d)ᵀ (ᵀ denotes the transpose), μ = (μ_1, μ_2, …, μ_d)ᵀ is the mean vector, Σ is the d × d covariance matrix, and |Σ| and Σ⁻¹ are its determinant and inverse respectively
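A quick sketch evaluating this density for illustrative parameter values, using scipy as a cross-check:

```python
import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([0.0, 1.0])                       # mean vector
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])      # covariance matrix
x = np.array([0.5, 0.5])

d = len(mu)
diff = x - mu
dens = np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff) / \
       np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))
print(dens, multivariate_normal(mu, Sigma).pdf(x))   # the two values should agree
```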

  13. Gaussian Discriminant Functions (2) • The discriminant function we use: g_i(x) = ln P(x | ω_i) + ln P(ω_i) • The parameters are usually not known, so they are estimated from the data

  14. LDA – Linear Discriminant Analysis • Case 1: the Σ_i are equal and Σ_i = σ²I • The separating plane is: wᵀ(x − x_0) = 0 • where w = μ_i − μ_j is the direction of the difference of the means • x_0 is given by: x_0 = ½(μ_i + μ_j) − [σ² / ||μ_i − μ_j||²] ln[P(ω_i) / P(ω_j)] (μ_i − μ_j) • x_0 lies on the line joining the means, but is not necessarily at its midpoint unless the priors are equal.

  15. Case where Σ_i = σ²I

  16. LDA – where Σ_i = Σ • Case 2: the Σ_i are all equal (Σ_i = Σ) • The separating plane is: wᵀ(x − x_0) = 0 • where w = Σ⁻¹(μ_i − μ_j) • The separating hyperplane is generally not orthogonal to the line between the means
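A short sketch of the case-2 direction w = Σ⁻¹(μ_i − μ_j) with illustrative numbers, showing that w is generally not aligned with the mean difference (so the separating hyperplane is not orthogonal to the line between the means):

```python
import numpy as np

mu_i = np.array([1.0, 0.0])
mu_j = np.array([-1.0, 0.0])
Sigma = np.array([[1.0, 0.8], [0.8, 1.0]])       # shared covariance (case 2)

w = np.linalg.solve(Sigma, mu_i - mu_j)          # w = Sigma^{-1} (mu_i - mu_j)
diff = mu_i - mu_j
cos_angle = w @ diff / (np.linalg.norm(w) * np.linalg.norm(diff))
print(w, cos_angle)   # cos_angle < 1: w is not parallel to the mean difference
```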

  17. Connection Between LDA and Multiple Linear Regression • For a 2-class problem, the directions of the decision boundaries are the same, but unless N1 = N2 the intercepts differ. • While both estimate discriminant functions and produce the same type of linear boundary, linear regression is a discriminative method while LDA is generative.

  18. QDA – Arbitrary Σ_i • Case 3: the Σ_i are arbitrary • The discriminant functions are quadratic: g_i(x) = xᵀW_i x + w_iᵀx + w_{i0} • where W_i = −½ Σ_i⁻¹, w_i = Σ_i⁻¹ μ_i, and w_{i0} = −½ μ_iᵀ Σ_i⁻¹ μ_i − ½ ln|Σ_i| + ln P(ω_i)
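A compact sketch of these quadratic discriminants computed directly as g_i(x) = ln p(x | ω_i) + ln P(ω_i), with made-up per-class parameters:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Illustrative two-class setup with different covariances (case 3).
params = [
    (np.array([0.0, 0.0]), np.array([[1.0, 0.0], [0.0, 1.0]]), 0.6),   # (mu, Sigma, prior)
    (np.array([2.0, 1.0]), np.array([[2.0, 0.7], [0.7, 0.5]]), 0.4),
]

def qda_discriminants(x):
    # g_i(x) = ln p(x | w_i) + ln P(w_i); the resulting boundaries are quadratic in x
    return [multivariate_normal(mu, S).logpdf(x) + np.log(prior)
            for mu, S, prior in params]

x = np.array([1.0, 0.5])
g = qda_discriminants(x)
print(g, int(np.argmax(g)))   # classify to the largest discriminant
```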

  19. Comparing Extended LDA and QDA

  20. What do we use, LDA or QDA? • QDA is more expressive, but requires more parameters. • Number of parameters: LDA: (K−1)(P+1); QDA: (K−1)[P(P+3)/2 + 1] • Both perform very well on many tasks, usually because the data do not support more complex boundaries. • If the data are not Gaussian, cross-validating the cutoffs may help.
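To make these counts concrete, a tiny worked example assuming vowel-data-sized dimensions (K = 11 classes, P = 10 inputs; the sizes are my own illustration):

```python
K, P = 11, 10                                    # e.g. a vowel-data-sized problem
lda_params = (K - 1) * (P + 1)                   # 110
qda_params = (K - 1) * (P * (P + 3) // 2 + 1)    # 660
print(lda_params, qda_params)
```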

  21. Regularized Discriminant Analysis • Is there a middle way between LDA and QDA?

  22. Regularized Discriminant Analysis • Shrink each class covariance toward the pooled estimate: Σ̂_k(α) = α Σ̂_k + (1 − α) Σ̂, with α ∈ [0, 1] • α = 1 gives QDA, α = 0 gives LDA
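A minimal sketch of this shrinkage idea, assuming estimated per-class and pooled covariances are already in hand (the function name and example matrices are mine):

```python
import numpy as np

def rda_covariances(class_covs, pooled_cov, alpha):
    """Shrink each class covariance toward the pooled one:
    Sigma_k(alpha) = alpha * Sigma_k + (1 - alpha) * Sigma_pooled.
    alpha = 1 gives QDA, alpha = 0 gives LDA."""
    return [alpha * S_k + (1.0 - alpha) * pooled_cov for S_k in class_covs]

# Illustrative 2-D covariances for two classes.
S1 = np.array([[1.0, 0.2], [0.2, 1.0]])
S2 = np.array([[2.0, -0.3], [-0.3, 0.5]])
pooled = 0.5 * (S1 + S2)
print(rda_covariances([S1, S2], pooled, alpha=0.5))
```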

  23. Computation of LDA • Compute the eigendecomposition of the pooled covariance: Σ̂ = UDUᵀ • Sphere the data: X* = D^(−1/2)UᵀX • Classify to the closest class centroid in the transformed space, modulo the effect of the class prior probabilities π_i.
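A minimal sketch of this recipe on synthetic data (my own toy example): eigendecompose the pooled within-class covariance, sphere, then pick the nearest centroid after adjusting for the log priors.

```python
import numpy as np

rng = np.random.default_rng(2)
# Synthetic 2-class data with a shared covariance structure.
n, K = 100, 2
true_means = np.array([[0.0, 0.0], [2.0, 1.0]])
X = np.vstack([rng.normal(size=(n, 2)) + true_means[k] for k in range(K)])
g = np.repeat(np.arange(K), n)

M = np.array([X[g == k].mean(axis=0) for k in range(K)])   # class centroids
centered = X - M[g]
Sigma = centered.T @ centered / (len(X) - K)               # pooled within-class covariance
D, U = np.linalg.eigh(Sigma)                               # Sigma = U D U^T

sphere = lambda Z: Z @ U / np.sqrt(D)                      # X* = D^{-1/2} U^T x (row-wise)
Xs, Ms = sphere(X), sphere(M)
prior = np.bincount(g) / len(g)
dist2 = ((Xs[:, None, :] - Ms[None, :, :]) ** 2).sum(axis=-1)
g_hat = (0.5 * dist2 - np.log(prior)).argmin(axis=1)       # nearest centroid + log prior
print("training accuracy:", (g_hat == g).mean())
```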

  24. Matrix factorization • Using matrix factorization, we hope to reduce the dimensionality (compress): X ≈ WC [numeric example matrices omitted]

  25. SVD
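A brief sketch of the SVD as the workhorse for such factorizations: truncating to the top r singular values gives the best rank-r approximation of X (random matrix used purely for illustration).

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(8, 5)) @ rng.normal(size=(5, 6))   # rank at most 5

U, s, Vt = np.linalg.svd(X, full_matrices=False)        # X = U diag(s) V^T
r = 2
X_r = (U[:, :r] * s[:r]) @ Vt[:r]      # best rank-r approximation (Eckart-Young)
print("rank-2 reconstruction error:", np.linalg.norm(X - X_r))
```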

  26. Reduced Rank LDA • LDA can be computed by projecting the points into a (K−1)-dimensional space and computing distances there • Reduced-rank PCA minimizes the reconstruction error • Reduced-rank LDA finds an orthogonal set of vectors that maximize the Rayleigh quotient aᵀBa / aᵀWa • W – the within-class covariance, a sum of the per-class covariance matrices • B – the between-class covariance, the covariance matrix of the class centroids • W + B = T, the total covariance XᵀX (for centered X)
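A sketch of how the reduced-rank directions can be computed on toy data: the vectors maximizing the Rayleigh quotient aᵀBa / aᵀWa are the leading generalized eigenvectors of (B, W). Data and sizes here are my own illustration.

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(4)
K, n = 3, 80
means = np.array([[0.0, 0.0, 0.0], [3.0, 1.0, 0.0], [1.0, 3.0, 0.0]])
X = np.vstack([rng.normal(size=(n, 3)) + m for m in means])
g = np.repeat(np.arange(K), n)

M = np.array([X[g == k].mean(axis=0) for k in range(K)])
W = sum((X[g == k] - M[k]).T @ (X[g == k] - M[k]) for k in range(K))          # within-class scatter
B = sum(n * np.outer(M[k] - X.mean(0), M[k] - X.mean(0)) for k in range(K))   # between-class scatter

# Generalized eigenproblem B a = lambda W a; the top K-1 eigenvectors span the
# discriminant subspace (they maximize the Rayleigh quotient a'Ba / a'Wa).
vals, vecs = eigh(B, W)
A = vecs[:, ::-1][:, :K - 1]        # leading discriminant directions
Z = X @ A                           # discriminant coordinates (e.g. for plotting)
print(vals[::-1][:K - 1])
```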

  27. Reduced Rank LDA • Helps to visualize high-dimensional data • The reduction in dimension is usually done by ordering the vectors and keeping the first M of them • LDA is also used purely as a dimension-reduction method, combined with other classifiers such as nearest neighbor • It is equivalent to projecting the vectors and their class centroids into a low-dimensional subspace

  28. Reduced LDA

  29. Questions • 2 Kinds of questions • Clarification / derivation • Performance / understanding

  30. Questions • 1) Why is E(Y_k | X = x) = Pr(G = k | X = x)? • Answer: In general, E(Y | X = x) = Σ_y y P(Y = y | X = x) = 1·p + 0·(1 − p) = p = P(Y = 1 | X = x). In this example Y_k is an indicator variable for class k, so p = Pr(G = k | X = x).

  31. Questions(2) • The relationship to Linear Regression is that we are modeling E(Y|X=x) as a linear function of x.

  32. Questions (3) • 1) The last paragraph on p. 90 says "the large discrepancy between the training and test error is partly due to the fact that there are many repeat measurements on a small number of individuals". Two questions about this: a) since it is only "partly due to" this, what are the other factors? b) please explain "many repeat measurements on a small number of individuals". • 2) I am curious where the number 30% in the first paragraph on p. 105 comes from. Can we explain it solely by the Gaussian assumption on f_k(x)?

  33. Questions (4) • 1) In Eq. 4.3, Y is a matrix of 0's and 1's, with each row having a single 1. If we put more than one 1 in a row, can we extend linear regression to multi-label classification? • 2) On p. 95 the book says we can apply classification after data reduction. However, LDA already uses the labels (Y) for the reduction, and classification uses the labels again to estimate the distribution. Does this overfit? • 3) What are generative and discriminative models? Is LDA generative or discriminative?

  34. Questions (5) • A high-level picture I get from this chapter is that we are trying to build models that approximate the conditional expectation E = Pr(G = k | X = x). We start with a linear regression model that approximates E by f̂_k(x), which is rigid and not a good approximation to the posterior probability, since it can be negative or greater than 1. Then we move to LDA, where we assume a model for the class densities and use them to compute the posterior; the result is a better approximation of E, but we are also adding bias (the class-density assumption), which in turn reduces variance. Logistic regression still approximates E better than linear regression, and it is more robust: it maximizes the likelihood of the training data and makes fewer assumptions than LDA. • How can we extend this picture? What other models could better approximate the conditional expectation Pr(G = k | X = x)? One way to extend LDA is to get a better approximation of the class models using unlabeled data, labeling it with a distance metric similar to the one used by the k-NN classifier in Ch. 2.

  35. Questions (6) • 2. Sometimes we can get an estimate of the class prior probabilities from domain knowledge, and the LDA framework lets us use these priors when computing the posterior. Can we use prior probabilities in other models such as linear regression and logistic regression?

  36. Questions (7) • The chapter has two broad categories: one where we use discriminant functions such as Pr(G = k | X = x) for classification, and another where we directly model the boundaries between the classes (the hyperplane approach). When should we use which? One case: if you know the densities, use LDA, because the optimal hyperplane might use noise points as support points. But what if you don't know the density? Another example: a variant of the hyperplane approach (SVMs) performs better on high-dimensional input data. Are there other situations where a particular approach is preferred?

  37. Questions (8) • 1. P. 83 describes the masking phenomenon. In Figure 4.2, consider the case where the middle class is moved slightly down and to the right. Though it will not be masked, it still gets only a small slice of the space, which gives very bad predictions. Hence, to me, the masking phenomenon means that linear regression for multivariate Gaussian data is generally inaccurate. And of course it is inaccurate, because linear regression assumes the probability is linear in x, not Gaussian. Now that this assumption no longer holds, why do we go on to a quadratic fit? What is the rationale? Since the data are Gaussian, why not just use LDA or QDA, with sound theoretical backing? • Put another way, the logic looks like: linear regression may cause masking; a quadratic fit can avoid masking; when K ≥ 3, use polynomial terms up to degree K − 1. My point is, just because a quadratic fit avoids masking does not mean it is a reasonable classifier, especially when the reason for masking is that the model is Gaussian, not linear.

  38. Questions (9) • I suppose in most cases people use LDA / QDA, but in some cases people use polynomial forms of linear regression based on heuristics and experience. What is the real situation? • P. 90, Regularized Discriminant Analysis: why do we want a compromise between LDA and QDA? Is it for computational tractability? It seems to me that RDA requires more computation, not less.

  39. Questions (10) • When does masking usually happen with the regression approach? I was wondering the opposite question: when doing linear regression on an indicator matrix with K ≥ 3 classes, is there ever a case where K − 2 of the classes aren't masked? It seems that with a linear decision boundary you can only ever bisect the space, so you can only decide between two classes (except when augmenting the data with X², etc.). Is this correct? • (pp. 83-84) The book mentions a "loose but general rule" for using polynomial terms in linear regression for classification. Is there a hard rule for the maximal degree of polynomial input required to resolve K separable classes? Is all separable data separable by some polynomial? Of bounded degree?

  40. Questions (11) • In logistic regression, it is shown (Eq. 4.26) that fitting is a reweighted least-squares problem. Conceptually, what is the role of the reweighting matrix W? Do other reweighting schemes exist that give better classification performance?
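To make the role of W concrete, here is a sketch of the IRLS iteration for two-class logistic regression on simulated data: W has diagonal entries p(1 − p), so observations the model is most uncertain about get the most weight in each weighted least-squares step. (Data generation and iteration count are my own choices.)

```python
import numpy as np

rng = np.random.default_rng(5)
N = 200
X = np.column_stack([np.ones(N), rng.normal(size=N)])   # intercept + one feature
beta_true = np.array([-0.5, 2.0])
y = (rng.random(N) < 1 / (1 + np.exp(-X @ beta_true))).astype(float)

beta = np.zeros(2)
for _ in range(10):                       # IRLS: repeat Newton steps until convergence
    p = 1 / (1 + np.exp(-X @ beta))
    W = p * (1 - p)                       # weights: largest where p is near 0.5
    z = X @ beta + (y - p) / W            # working response
    # Weighted least squares: beta <- (X'WX)^{-1} X'Wz  (Eq. 4.26-style update)
    beta = np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (W * z))
print(beta)
```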

  41. Questions (12) • (1) In Section 4.3.1, regularized discriminant analysis, what is the effect of diagonalizing Σ̂? It seems to me to strengthen the independence assumption, so does it really help in practice? • (2) On p. 92 the l-th discriminant variable is computed as Z_l = v_lᵀX, where v_l = W^(−1/2) v*_l. Why does v_l have a W^(−1/2) term instead of a W^(+1/2) term?

  42. Questions (13) • I'm not sure I immediately see how allowing linear regression onto basis expansions h(X) of the inputs will lead to consistent estimates of the posterior probabilities Pr(G = k | X = x) (p. 82). Is there a clearer way of demonstrating this (perhaps this question is better suited to Ch. 5)? The text also suggests that these expansions should be applied adaptively as the size of our training set grows, which is also a little ambiguous. Perhaps basis expansions could be applied to the example vowel recognition problem on pp. 84-85 to demonstrate this? I'll look into it.

  43. More Questions • 1. In Chapter 3 we learned how to do hypothesis testing for linear regression. Could you talk a little about how this works for logistic regression, i.e. what are the assumptions and what are the tests? • 2. Already seen this question twice: where does the 30% come from, anyway?
