
### Introduction to Machine Learning

### Binary Classification

### Multi-Class Classification

### Ordinal Regression

### Regression

### Ranking

### Multi-Label Classification

### Are These Problems Distinct?

### In This Course

### Learning from Noisy Data

### Probability Theory

### Probability Distribution Functions

### A Toy Example

### The Maximum Likelihood Approach

### The Maximum A Posteriori Approach

### The Bayesian Approach

### Approaches to Classification

### Notation

### Bayes’ Decision Rule

### Issues to Think About

### Bayesian Approach

### Maximum A Posteriori (MAP)

### MAP & Maximum Likelihood (ML)

### IID Data

### Generative Methods

### Generative Methods – Naïve Bayes

### Generative Methods – Naïve Bayes

### Naïve Bayes – Prediction

### Discriminative Methods

### Disc. Methods – Logistic Regression

### Regularized Logistic Regression

### Convex Functions

### Gradient Descent

### Gradient Descent – Logistic Regression

### Newton Methods

### Quasi-Newton Methods

### Generative versus Discriminative

### Generative versus Discriminative

### Generative versus Discriminative

### Multi-class Logistic Regression

### Multi-class Logistic Regression

### Multi-class Logistic Regression

### Multinomial Logistic Regression

### Multinomial Logistic Regression

### Calculating the Margin

### Hard Margin SVM Primal

### Linearly Inseparable Data

### The C-SVM Primal Formulation

### The C-SVM Dual Formulation

### SVMs versus Regularized LR

### SVMs versus Regularized LR

### SVMs versus Regularized LR

### Duality

### Duality

### Karush-Kuhn-Tucker (KKT) Conditions

### SVM – Duality

### SVM – KKT Conditions

### Hinge Loss and Sparseness in α

### Linearly Inseparable Data

### The Kernel Trick

### The Kernel Trick

### Some Popular Kernels

### Valid Kernels – Mercer’s Theorem

### Valid Kernels – Mercer’s Theorem

### Operations on Kernels

### Kernels

### Structured Output Prediction

### Multi-Class SVM

### Multi-Class SVM Dual

### Multi-Class Classification

Manik Varma

Microsoft Research India

http://research.microsoft.com/~manik

manik@microsoft.com

- Is this person Madhubala or not?
- Is this person male or female?
- Is this person beautiful or not?

- Is this person Madhubala, Lalu or Rakhi Sawant?
- Is this person happy, sad, angry or bemused?

- Is this person very beautiful, beautiful, ordinary or ugly?

- How beautiful is this person on a continuous scale of 1 to 10? 9.99?

- Rank these people in decreasing order of attractiveness.

- Tag this image with the set of relevant labels from {female, Madhubala, beautiful, IITD faculty}

Can regression solve all these problems?

- Binary classification – predict p(y=1|x)
- Multi-Class classification – predict p(y=k|x)
- Ordinal regression – predict p(y=k|x)
- Ranking – predict and sort by relevance
- Multi-Label Classification – predict $p(y \in \{\pm 1\}^k \mid x)$
- Learning from experience and data
- In what form can the training data be obtained?
- What is known a priori?
- Complexity of training
- Complexity of prediction

- Classification
- Generative methods
- Nearest neighbour, Naïve Bayes
- Discriminative methods
- Logistic Regression
- Discriminant methods
- Support Vector Machines
- Regression, Ranking, Feature Selection, etc.
- Unsupervised learning
- Semi-supervised learning
- Reinforcement learning

- Unknown generative model Y = f(X)
- Noise in measuring input and feature extraction
- Noise in labels
- Nuisance variables
- Missing data
- Finite training set size

Non-negativity and unit measure

- $0 \le p(y)$, $p(\Omega) = 1$, $p(\emptyset) = 0$
- Conditional probability – $p(y \mid x)$
- $p(x, y) = p(y \mid x)\, p(x) = p(x \mid y)\, p(y)$
- Bayes’ Theorem
- $p(y \mid x) = p(x \mid y)\, p(y) / p(x)$
- Marginalization
- $p(x) = \int_y p(x, y)\, dy$
- Independence
- $p(x_1, x_2) = p(x_1)\, p(x_2) \Leftrightarrow p(x_1 \mid x_2) = p(x_1)$
- Chris Bishop, “Pattern Recognition & Machine Learning”

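Bernoulli: Single trial with probability of success = $\theta$

- $n \in \{0, 1\}$, $\theta \in [0, 1]$
- $p(n \mid \theta) = \theta^n (1 - \theta)^{1-n}$
- Binomial: $N$ iid Bernoulli trials with $n$ successes
- $n \in \{0, 1, \dots, N\}$, $\theta \in [0, 1]$
- $p(n \mid N, \theta) = \binom{N}{n} \theta^n (1 - \theta)^{N-n}$
- Multinomial: $N$ iid trials, outcome $k$ occurs $n_k$ times
- $n_k \in \{0, 1, \dots, N\}$, $\sum_k n_k = N$, $\theta_k \in [0, 1]$, $\sum_k \theta_k = 1$
- $p(n \mid N, \theta) = N! \prod_k \theta_k^{n_k} / n_k!$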
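
The three distributions above can be evaluated directly from their formulas. The following is a minimal sketch using only the Python standard library; the function names are illustrative and not part of the original slides.

```python
from math import comb, factorial, prod

def bernoulli_pmf(n, theta):
    """p(n | theta) for a single trial, n in {0, 1}."""
    return theta ** n * (1 - theta) ** (1 - n)

def binomial_pmf(n, N, theta):
    """p(n | N, theta): n successes in N iid Bernoulli trials."""
    return comb(N, n) * theta ** n * (1 - theta) ** (N - n)

def multinomial_pmf(counts, thetas):
    """p(n | N, theta) where counts[k] = n_k and N = sum(counts)."""
    coeff = factorial(sum(counts))
    for n_k in counts:
        coeff //= factorial(n_k)   # exact: every partial quotient is an integer
    return coeff * prod(t ** n_k for t, n_k in zip(thetas, counts))
```

With two outcomes the multinomial reduces to the binomial, so `multinomial_pmf([2, 3], [0.3, 0.7])` and `binomial_pmf(2, 5, 0.3)` should agree up to rounding.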

We don’t know whether a coin is fair or not. We are told that heads occurred n times in N coin flips.

- We are asked to predict whether the next coin flip will result in a head or a tail.
- Let y be a binary random variable such that y = 1 represents the event that the next coin flip will be a head and y = 0 that it will be a tail
- We should predict heads if p(y=1|n,N) > p(y=0|n,N)

Let $p(y=1 \mid n,N) = \theta$ and $p(y=0 \mid n,N) = 1 - \theta$ so that we should predict heads if $\theta > \tfrac12$

- How should we estimate $\theta$?
- Assuming that the observed coin flips followed a Binomial distribution, we could choose the value of $\theta$ that maximizes the likelihood of observing the data
- $\theta_{ML} = \arg\max_\theta p(n \mid \theta) = \arg\max_\theta \binom{N}{n} \theta^n (1 - \theta)^{N-n}$
- $= \arg\max_\theta\ n \log(\theta) + (N - n) \log(1 - \theta)$
- $= n / N$
- We should predict heads if $n > \tfrac12 N$

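We should choose the value of $\theta$ maximizing the posterior probability of $\theta$ conditioned on the data

- We assume a
- Binomial likelihood : $p(n \mid \theta) = \binom{N}{n} \theta^n (1 - \theta)^{N-n}$
- Beta prior : $p(\theta \mid a,b) = \theta^{a-1} (1 - \theta)^{b-1}\, \Gamma(a+b) / (\Gamma(a)\Gamma(b))$
- $\theta_{MAP} = \arg\max_\theta p(\theta \mid n,a,b) = \arg\max_\theta p(n \mid \theta)\, p(\theta \mid a,b)$
- $= \arg\max_\theta \theta^n (1 - \theta)^{N-n}\, \theta^{a-1} (1 - \theta)^{b-1}$
- $= (n+a-1) / (N+a+b-2)$ as if we saw an extra $a - 1$ heads & $b - 1$ tails
- We should predict heads if $n > \tfrac12 (N + b - a)$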

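- $p(y=1 \mid n,a,b) = \int p(y=1 \mid n,\theta)\, p(\theta \mid a,b,n)\, d\theta$
- $= \int \theta\, p(\theta \mid a,b,n)\, d\theta$
- $= \int \theta\, \mathrm{Beta}(\theta \mid a + n,\, b + N - n)\, d\theta$
- $= (n + a) / (N + a + b)$ as if we saw an extra $a$ heads & $b$ tails
- We should predict heads if $n > \tfrac12 (N + b - a)$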
- The Bayesian and MAP prediction coincide in this case
- In the very large data limit, both the Bayesian and MAP prediction coincide with the ML prediction (n > ½ N)
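
The three estimators for the coin example can be compared side by side. This is a small sketch of the closed-form results for ML, MAP and the Bayesian predictive probability; the function names are hypothetical, and the Beta parameters `a` and `b` are assumed to be at least 1 so that the MAP denominator stays positive.

```python
def theta_ml(n, N):
    """Maximum likelihood estimate: n / N."""
    return n / N

def theta_map(n, N, a, b):
    """MAP estimate under a Beta(a, b) prior: (n + a - 1) / (N + a + b - 2)."""
    return (n + a - 1) / (N + a + b - 2)

def p_heads_bayes(n, N, a, b):
    """Bayesian predictive probability of heads: (n + a) / (N + a + b)."""
    return (n + a) / (N + a + b)

def predict_heads(p):
    """Predict heads when the estimated probability exceeds one half."""
    return p > 0.5
```

For example, with `n = 3`, `N = 4` and `a = b = 2`, the three estimates are 3/4, 4/6 and 5/8. They differ in value, but MAP and the Bayesian predictive lead to the same decision, since both reduce to the test n > ½ (N + b – a).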

- Can not deal with previously unseen data
- Large scale annotated data acquisition cost might be very high
- Rule based expert system
- Dependent on the competence of the expert.
- Complex problems lead to a proliferation of rules, exceptions, exceptions to exceptions, etc.
- Rules might not transfer to similar problems
- Learning from training data and prior knowledge
- Focuses on generalization to novel data

- Set of $N$ labeled examples of the form $(x_i, y_i)$
- Feature vector – $x \in \mathbb{R}^D$. $X = [x_1\ x_2\ \dots\ x_N]$
- Label – $y \in \{\pm 1\}$. $y = [y_1, y_2, \dots, y_N]^t$. $Y = \mathrm{diag}(y)$
- Example – Gender Identification

[Four example images: $(x_1, y_1 = +1)$, $(x_2, y_2 = +1)$, $(x_3, y_3 = +1)$, $(x_4, y_4 = -1)$]

- p(y=+1|x) > p(y=-1|x) ? y = +1 : y = -1
- p(y=+1|x) > ½ ? y = +1 : y = -1

- Should we choose just one function to explain the data?
- If yes, should this be the function that explains the data the best?
- What about prior knowledge?
- Generative versus Discriminative
- Can we learn from “positive” data alone?
- Should we model the data distribution?
- Are there any missing variables?
- Do we just care about the final decision?

$p(y \mid x,X,Y) = \int_f p(y,f \mid x,X,Y)\, df$

- $= \int_f p(y \mid f,x,X,Y)\, p(f \mid x,X,Y)\, df$
- $= \int_f p(y \mid f,x)\, p(f \mid X,Y)\, df$
- This integral is often intractable.
- To solve it we can
- Choose the distributions so that the solution is analytic (conjugate priors)
- Approximate the true distribution of p(f|X,Y) by a simpler distribution (variational methods)
- Sample from p(f|X,Y) (MCMC)

$p(y \mid x,X,Y) = \int_f p(y \mid f,x)\, p(f \mid X,Y)\, df$

- $= p(y \mid f_{MAP}, x)$ when $p(f \mid X,Y) = \delta(f - f_{MAP})$
- The more training data there is the better $p(f \mid X,Y)$ approximates a delta function
- We can make predictions using a single function, $f_{MAP}$, and our focus shifts to estimating $f_{MAP}$.

- $f_{MAP} = \arg\max_f p(f \mid X,Y)$
- $= \arg\max_f p(X,Y \mid f)\, p(f) / p(X,Y)$
- $= \arg\max_f p(X,Y \mid f)\, p(f)$
- $f_{ML} = \arg\max_f p(X,Y \mid f)$ (Maximum Likelihood)
- The Maximum Likelihood estimate approximates the MAP estimate if
- There is a lot of training data so that
- $p(X,Y \mid f) \gg p(f)$
- Or if there is no prior knowledge so that $p(f)$ is uniform (improper)

- $f_{ML} = \arg\max_f \prod_i p(x_i, y_i \mid f)$
- The independent and identically distributed assumption holds only if we know everything about the joint distribution of the features and labels.
- In particular, $p(X,Y) \ne \prod_i p(x_i, y_i)$

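$\theta_{MAP} = \arg\max_\theta p(\theta) \prod_i p(x_i, y_i \mid \theta)$

- $= \arg\max_\theta p(\theta_x)\, p(\theta_y) \prod_i p(x_i, y_i \mid \theta)$
- $= \arg\max_\theta p(\theta_x)\, p(\theta_y) \prod_i p(x_i \mid y_i, \theta)\, p(y_i \mid \theta)$
- $= \arg\max_\theta p(\theta_x)\, p(\theta_y) \prod_i p(x_i \mid y_i, \theta_x)\, p(y_i \mid \theta_y)$
- $= [\arg\max_{\theta_x} p(\theta_x) \prod_i p(x_i \mid y_i, \theta_x)]$ *
- $[\arg\max_{\theta_y} p(\theta_y) \prod_i p(y_i \mid \theta_y)]$
- $\theta_x$ and $\theta_y$ can be solved for independently
- The parameters of each class decouple and can be solved for independently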

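$\theta_{MAP} = [\arg\max_{\theta_x} p(\theta_x) \prod_i p(x_i \mid y_i, \theta_x)]$ *

- $[\arg\max_{\theta_y} p(\theta_y) \prod_i p(y_i \mid \theta_y)]$
- Naïve Bayes assumptions
- Independent Gaussian features
- $p(x_i \mid y_i, \theta_x) = \prod_j p(x_{ij} \mid y_i, \theta_x)$
- $p(x_{ij} \mid y_i = \pm 1, \theta_x) = N(x_{ij} \mid \mu_j^{\pm 1}, \sigma_j)$
- Improper uniform priors (no prior knowledge)
- $p(\theta_x) = p(\theta_y) = \text{const}$
- Bernoulli labels
- $p(y_i = +1 \mid \theta_y) = \theta$, $p(y_i = -1 \mid \theta_y) = 1 - \theta$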

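$\theta_{ML} = [\arg\max_{\mu,\sigma} \prod_i \prod_j N(x_{ij} \mid \mu_j^{\pm 1}, \sigma_j)]$ *

- $[\arg\max_\theta \prod_i \theta^{(1+y_i)/2} (1-\theta)^{(1-y_i)/2}]$
- Estimating $\theta_{ML}$
- $\theta_{ML} = \arg\max_\theta \prod_i \theta^{(1+y_i)/2} (1-\theta)^{(1-y_i)/2}$
- $= \arg\max_\theta\ (N + \sum_i y_i) \log(\theta) + (N - \sum_i y_i) \log(1-\theta)$
- $= N_+ / N$ (by differentiating and setting to zero)
- Estimating $\mu_{ML}$, $\sigma_{ML}$
- $\mu^{\pm}_{ML} = (1 / N_\pm) \sum_{y_i = \pm 1} x_i$
- $\sigma^2_{j,ML} = [\sum_{y_i = +1} (x_{ij} - \mu^{+}_{j,ML})^2 + \sum_{y_i = -1} (x_{ij} - \mu^{-}_{j,ML})^2] / N$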

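$p(y=+1 \mid x) = p(x \mid y=+1)\, p(y=+1) / p(x)$

- $= 1 / (1 + \exp(\log(p(y=-1) / p(y=+1))$
- $+ \log(p(x \mid y=-1) / p(x \mid y=+1))))$
- $= 1 / (1 + \exp(\log(1/\theta - 1) - \tfrac12 \mu_-^t \Sigma^{-1} \mu_-$
- $+ \tfrac12 \mu_+^t \Sigma^{-1} \mu_+ - (\mu_+ - \mu_-)^t \Sigma^{-1} x))$
- $= 1 / (1 + \exp(-b - w^t x))$ (Logistic Regression), with $w = \Sigma^{-1}(\mu_+ - \mu_-)$ and $b = -\log(1/\theta - 1) + \tfrac12 \mu_-^t \Sigma^{-1} \mu_- - \tfrac12 \mu_+^t \Sigma^{-1} \mu_+$
- $p(y=-1 \mid x) = \exp(-b - w^t x) / (1 + \exp(-b - w^t x))$
- $\log(p(y=-1 \mid x) / p(y=+1 \mid x)) = -b - w^t x$
- $y = \mathrm{sign}(b + w^t x)$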
- The decision boundary will be linear!
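
The closed-form Naïve Bayes estimates and the resulting linear decision function can be checked numerically. The sketch below assumes NumPy, stores one example per row of `X` (the transpose of the slides' column convention), and uses a per-feature variance shared by both classes as in the estimate above; all function names are illustrative.

```python
import numpy as np

def fit_naive_bayes(X, y):
    """X: (N, D) features, y: (N,) labels in {-1, +1}. Returns ML estimates."""
    pos, neg = X[y == 1], X[y == -1]
    theta = len(pos) / len(X)                          # N_+ / N
    mu_pos, mu_neg = pos.mean(axis=0), neg.mean(axis=0)
    var = (((pos - mu_pos) ** 2).sum(axis=0)
           + ((neg - mu_neg) ** 2).sum(axis=0)) / len(X)
    return theta, mu_pos, mu_neg, var

def to_linear(theta, mu_pos, mu_neg, var):
    """Weights and bias such that p(y=+1|x) = 1 / (1 + exp(-b - w^t x))."""
    w = (mu_pos - mu_neg) / var
    b = (np.log(theta / (1 - theta))
         + 0.5 * np.sum(mu_neg ** 2 / var)
         - 0.5 * np.sum(mu_pos ** 2 / var))
    return w, b

def posterior_direct(x, theta, mu_pos, mu_neg, var):
    """p(y=+1|x) computed from Bayes' theorem without the linear shortcut."""
    def log_gauss(mu):
        return -0.5 * np.sum((x - mu) ** 2 / var + np.log(2 * np.pi * var))
    log_pos = np.log(theta) + log_gauss(mu_pos)
    log_neg = np.log(1 - theta) + log_gauss(mu_neg)
    return 1.0 / (1.0 + np.exp(log_neg - log_pos))
```

Evaluating `posterior_direct` and the logistic form with the `w` and `b` from `to_linear` on the same point should give the same probability, which is the sense in which the boundary is linear.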

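$\theta_{MAP} = \arg\max_\theta p(\theta) \prod_i p(x_i, y_i \mid \theta)$

- We assume that
- $p(\theta) = p(w)\, p(\tilde{w})$
- $p(x_i, y_i \mid \theta) = p(y_i \mid x_i, \theta)\, p(x_i \mid \theta)$
- $= p(y_i \mid x_i, w)\, p(x_i \mid \tilde{w})$
- $\theta_{MAP} = [\arg\max_w p(w) \prod_i p(y_i \mid x_i, w)]$ *
- $[\arg\max_{\tilde{w}} p(\tilde{w}) \prod_i p(x_i \mid \tilde{w})]$
- It turns out that $\tilde{w}$ plays no role in determining the posterior distribution
- $p(y \mid x,X,Y) = p(y \mid x, \theta_{MAP}) = p(y \mid x, w_{MAP})$
- where $w_{MAP} = \arg\max_w p(w) \prod_i p(y_i \mid x_i, w)$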

$\theta_{MAP} = \arg\max_{w,b} p(w) \prod_i p(y_i \mid x_i, w)$

- Regularized Logistic Regression
- Gaussian prior – $p(w) = \exp(-\tfrac12 w^t w)$
- Logistic likelihood –
- $p(y_i \mid x_i, w) = 1 / (1 + \exp(-y_i(b + w^t x_i)))$

$\theta_{MAP} = \arg\max_{w,b} p(w) \prod_i p(y_i \mid x_i, w)$

- $= \arg\min_{w,b} \tfrac12 w^t w + \sum_i \log(1 + \exp(-y_i(b + w^t x_i)))$
- Bad news: No closed form solution for w and b
- Good news: We have to minimize a convex function
- We can obtain the global optimum
- The function is smooth
- Tom Minka, “A comparison of numerical optimizers for LR” (Matlab code)
- Keerthi et al., “A Fast Dual Algorithm for Kernel Logistic Regression”, ML 05
- Andrew and Gao, “OWL-QN” ICML 07
- Krishnapuram et al., “SMLR” PAMI 05

Convex $f$ : $f(\lambda x_1 + (1-\lambda) x_2) \le \lambda f(x_1) + (1-\lambda) f(x_2)$

- The Hessian $\nabla^2 f$ is always positive semi-definite
- The tangent is always a lower bound to f

Iteration : $x_{n+1} = x_n - \eta_n \nabla f(x_n)$

- Step size selection : Armijo rule
- Stopping criterion : Change in $f$ is “minuscule”

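$\mathcal{L}(w, b) = \tfrac12 w^t w + \sum_i \log(1 + \exp(-y_i(b + w^t x_i)))$

- $\nabla_w \mathcal{L}(w, b) = w - \sum_i p(-y_i \mid x_i, w)\, y_i x_i$
- $\nabla_b \mathcal{L}(w, b) = -\sum_i p(-y_i \mid x_i, w)\, y_i$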
- Beware of numerical issues while coding!
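
One way to address the numerical issues is to evaluate log(1 + exp(·)) through `numpy.logaddexp`, which avoids overflow for large margins. The sketch below combines the gradients above with a halving (Armijo-style) backtracking line search; it assumes NumPy, one example per row of `X`, labels in {-1, +1}, and hypothetical constants for the sufficient-decrease factor and stopping tolerance.

```python
import numpy as np

def loss_and_grad(w, b, X, y):
    """Regularized logistic loss and its gradients with respect to w and b."""
    z = y * (X @ w + b)                            # margins y_i (b + w^t x_i)
    loss = 0.5 * w @ w + np.sum(np.logaddexp(0.0, -z))
    p_wrong = np.exp(-np.logaddexp(0.0, z))        # p(-y_i | x_i, w) = 1 / (1 + exp(z_i))
    grad_w = w - X.T @ (p_wrong * y)
    grad_b = -np.sum(p_wrong * y)
    return loss, grad_w, grad_b

def gradient_descent(X, y, max_iter=5000, tol=1e-12):
    w, b = np.zeros(X.shape[1]), 0.0
    loss, gw, gb = loss_and_grad(w, b, X, y)
    for _ in range(max_iter):
        sq_norm = gw @ gw + gb * gb
        step = 2.0
        for _ in range(60):                        # backtracking: 1, 1/2, 1/4, ...
            step *= 0.5
            w_new, b_new = w - step * gw, b - step * gb
            new_loss, gw_new, gb_new = loss_and_grad(w_new, b_new, X, y)
            if new_loss <= loss - 1e-4 * step * sq_norm:
                break                              # sufficient decrease reached
        decrease = loss - new_loss
        w, b, loss, gw, gb = w_new, b_new, new_loss, gw_new, gb_new
        if decrease < tol:                         # change in the loss is minuscule
            break
    return w, b, loss
```

Starting from `w = 0`, `b = 0` the loss equals N log 2, so any returned loss below that value indicates that the iterations made progress.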

Iteration : $x_{n+1} = x_n - \eta_n H^{-1} \nabla f(x_n)$

- Approximate f by a 2nd order Taylor expansion
- The error can now decrease quadratically
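
As an illustration, the Newton iteration can be applied to the regularized logistic regression objective from the earlier slides. This is a sketch under several assumptions: NumPy is available, examples are rows of `X`, the bias is handled by appending a constant feature, and the step size is chosen by simple halving until the objective does not increase. The helper names are hypothetical.

```python
import numpy as np

def objective(v, A, y, R):
    return 0.5 * v @ R @ v + np.sum(np.logaddexp(0.0, -y * (A @ v)))

def gradient(v, A, y, R):
    p = np.exp(-np.logaddexp(0.0, y * (A @ v)))    # p(-y_i | x_i)
    return R @ v - A.T @ (p * y)

def hessian(v, A, y, R):
    p = np.exp(-np.logaddexp(0.0, y * (A @ v)))
    return R + A.T @ (A * (p * (1.0 - p))[:, None])

def fit_newton(X, y, n_iter=25):
    N, D = X.shape
    A = np.hstack([X, np.ones((N, 1))])            # constant feature so that v = [w, b]
    R = np.eye(D + 1)
    R[D, D] = 0.0                                  # the Gaussian prior covers w only
    v = np.zeros(D + 1)
    for _ in range(n_iter):
        d = np.linalg.solve(hessian(v, A, y, R), gradient(v, A, y, R))
        step = 1.0
        while step > 1e-8 and objective(v - step * d, A, y, R) > objective(v, A, y, R):
            step *= 0.5                            # damp the step if the objective rises
        v = v - step * d
    return v[:D], v[D], np.linalg.norm(gradient(v, A, y, R))
```

The third return value is the gradient norm at the final iterate, which gives a quick check that a stationary point was reached in a handful of iterations.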

Computing and inverting the Hessian is expensive

- Quasi-Newton methods can approximate $H^{-1}$ directly (LBFGS)
- Iteration : $x_{n+1} = x_n - \eta_n B_n^{-1} \nabla f(x_n)$
- Secant equation : $\nabla f(x_{n+1}) - \nabla f(x_n) = B_{n+1}(x_{n+1} - x_n)$
- The secant equation does not fully determine $B$
- LBFGS updates $B_{n+1}^{-1}$ using two rank one matrices
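
The secant equation can be verified on the standard (full-memory) BFGS update of the inverse approximation; LBFGS applies the same update but keeps only a few recent step and gradient-difference pairs instead of the full matrix. The sketch assumes NumPy, and the function and argument names are illustrative.

```python
import numpy as np

def bfgs_inverse_update(B_inv, s, dg):
    """Update the inverse Hessian approximation B_inv.

    s  = x_{n+1} - x_n                 (the step taken)
    dg = grad f(x_{n+1}) - grad f(x_n) (the change in gradient)
    The result satisfies the secant equation in inverse form: B_inv_new @ dg == s.
    """
    rho = 1.0 / (dg @ s)
    V = np.eye(len(s)) - rho * np.outer(s, dg)
    return V @ B_inv @ V.T + rho * np.outer(s, s)
```

For a quadratic with Hessian `A`, the gradient difference is `A @ s`, so one update starting from the identity already maps `A @ s` back to `s`.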

A discriminative model might be correct even when the corresponding generative model is not

- A discriminative model has fewer parameters than the corresponding generative model
- A generative model's parameters are uncoupled and can often be estimated in closed form
- A discriminative model's parameters are correlated and training algorithms can be relatively expensive
- A discriminative model often has lower test error given a “reasonable” amount of training data.
- A generative model can deal with missing data

Let $\epsilon(h_{A,N})$ denote the error of hypothesis $h$ trained using algorithm $A$ on $N$ data points

- When the generative model is correct
- $\epsilon(h_{Dis,\infty}) = \epsilon(h_{Gen,\infty})$
- When the generative model is incorrect
- $\epsilon(h_{Dis,\infty}) \le \epsilon(h_{Gen,\infty})$
- For a linear classifier trained in $D$ dimensions
- $\epsilon(h_{Dis,N}) \le \epsilon(h_{Dis,\infty}) + O([-z \log z]^{1/2})$ where $z = D/N \le 1$
- It suffices to pick $N = \Omega(D)$ points for discriminative learning of linear classifiers
- For some generative models $N = \Omega(\log D)$

A generative classifier might converge much faster to its higher asymptotic error

- Ng & Jordan, “On Discriminative vs. Generative Classifiers” NIPS 02.
- Tom Mitchell, “Generative and Discriminative Classifiers”

Multinomial Logistic Regression

- 1-vs-All
- Learn L binary classifiers for an L class problem
- For the lth classifier, examples from class l are +ve while examples from all other classes are –ve
- Classify new points according to max probability
- 1-vs-1
- Learn L(L-1)/2 binary classifiers for an L class problem by considering every class pair
- Classify novel points by majority vote
- Classify novel points by building a DAG

- Non-linear multi-class classifier
- Number of classes = L
- Number of training points per class = N
- Algorithm training time for $M$ points = $O(M^3)$
- Classification time given $M$ training points = $O(M)$

Multinomial Logistic Regression

- Training time = $O(L^6 N^3)$
- Classification time for a new point = $O(L^2 N)$
- 1-vs-All
- Training time = $O(L^4 N^3)$
- Classification time for a new point = $O(L^2 N)$
- 1-vs-1
- Training time = $O(L^2 N^3)$
- Majority vote classification time = $O(L^2 N)$
- DAG classification time = $O(LN)$
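
These orders of growth follow from counting how many binary problems each scheme trains and how many points each one sees. The sketch below reproduces the counts with constants dropped, assuming the cubic training and linear prediction costs stated for the non-linear classifier; the joint multinomial count assumes the cost is cubic in L²N variables, which mirrors the NL² dual variables mentioned for the multi-class SVM. Function names are hypothetical.

```python
def train_cost(M):
    """Training cost of the base algorithm on M points: O(M^3)."""
    return M ** 3

def predict_cost(M):
    """Prediction cost given M training points: O(M)."""
    return M

def multinomial(L, N):
    # Assumption: one joint problem over L * L * N variables.
    return train_cost(L * L * N), L * predict_cost(L * N)

def one_vs_all(L, N):
    # L classifiers, each trained on all L * N points.
    return L * train_cost(L * N), L * predict_cost(L * N)

def one_vs_one(L, N):
    # L(L-1)/2 classifiers, each trained on the 2N points of one class pair.
    pairs = L * (L - 1) // 2
    train = pairs * train_cost(2 * N)
    vote = pairs * predict_cost(2 * N)          # every pairwise classifier is evaluated
    dag = (L - 1) * predict_cost(2 * N)         # a DAG evaluates only L - 1 classifiers
    return train, vote, dag
```

For `L = 10` classes and `N = 100` points per class, 1-vs-1 training comes out roughly 30 times cheaper than 1-vs-All under these assumptions, and DAG prediction is 5 times cheaper than majority voting.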

$\theta_{MAP} = \arg\max_{w,b} p(w) \prod_i p(y_i \mid x_i, w)$

- Regularized Multinomial Logistic Regression
- Gaussian prior
- $p(w) = \exp(-\tfrac12 \sum_l w_l^t w_l)$
- Multinomial logistic posterior
- $p(y_i = l \mid x_i, w) = e^{f_l(x_i)} / \sum_k e^{f_k(x_i)}$
- where $f_k(x_i) = w_k^t x_i + b_k$
- Note that we have to learn an extra classifier by not explicitly enforcing $\sum_l p(y_i = l \mid x_i, w) = 1$

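$\mathcal{L}(w, b) = \tfrac12 \sum_k w_k^t w_k + \sum_i [\log(\sum_k e^{f_k(x_i)}) - \sum_k \delta_{k,y_i} f_k(x_i)]$

- $\nabla_{w_k} \mathcal{L}(w, b) = w_k + \sum_i [p(y_i = k \mid x_i, w) - \delta_{k,y_i}]\, x_i$
- $\nabla_{b_k} \mathcal{L}(w, b) = \sum_i [p(y_i = k \mid x_i, w) - \delta_{k,y_i}]$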
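
The loss and gradients above translate almost line by line into array code. The sketch below assumes NumPy, one example per row of `X`, integer class labels `0..L-1`, and one weight vector per row of `W`; the log-sum-exp is shifted by the row maximum for numerical stability. The function name is illustrative.

```python
import numpy as np

def softmax_loss_grad(W, b, X, y):
    """W: (L, D), b: (L,), X: (N, D), y: (N,) integer labels in 0..L-1."""
    F = X @ W.T + b                                   # F[i, k] = f_k(x_i)
    F_max = F.max(axis=1)
    log_Z = F_max + np.log(np.exp(F - F_max[:, None]).sum(axis=1))
    P = np.exp(F - log_Z[:, None])                    # P[i, k] = p(y_i = k | x_i, w)
    onehot = np.eye(W.shape[0])[y]                    # delta_{k, y_i}
    loss = 0.5 * np.sum(W * W) + np.sum(log_Z - F[np.arange(len(y)), y])
    grad_W = W + (P - onehot).T @ X
    grad_b = (P - onehot).sum(axis=0)
    return loss, grad_W, grad_b
```

Because each row of `P` and of `onehot` sums to one, the bias gradients always sum to zero, which is a convenient sanity check alongside a finite-difference comparison.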

- Geometric Intuition: Choose the perpendicular bisector of the shortest line segment joining the convex hulls of the two classes

[Figure: the separating hyperplane $w^t x + b = 0$ with normal $w$ and offset $b$, the supporting planes $w^t x + b = -1$ and $w^t x + b = +1$ passing through the support vectors, and Margin = $2 / \sqrt{w^t w}$]

Let $x_+$ be any point on the +ve supporting plane and $x_-$ the closest point on the –ve supporting plane

- Margin = $|x_+ - x_-|$
- $= |\lambda w|$ (since $x_+ = x_- + \lambda w$)
- $= 2 |w| / |w|^2$ (assuming $\lambda = 2 / |w|^2$)
- $= 2 / |w|$
- $w^t x_+ + b = +1$
- $w^t x_- + b = -1$
- $w^t (x_+ - x_-) = 2 \Rightarrow \lambda w^t w = 2 \Rightarrow \lambda = 2 / |w|^2$
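
A quick numerical check of this derivation: pick a point on the negative supporting plane, step along w by λ = 2/|w|², and measure the distance travelled. The values of `w` and `b` below are hypothetical and NumPy is assumed.

```python
import numpy as np

w = np.array([3.0, 4.0])      # hypothetical normal vector, |w| = 5
b = -2.0                      # hypothetical offset

# A point on the negative supporting plane w^t x + b = -1.
x_minus = (-1.0 - b) * w / (w @ w)

# Step along w by lambda = 2 / |w|^2 to land on the positive supporting plane.
lam = 2.0 / (w @ w)
x_plus = x_minus + lam * w

margin = np.linalg.norm(x_plus - x_minus)   # equals 2 / |w|
```

Here `margin` comes out as 2/5, matching 2/|w|, and `w @ x_plus + b` evaluates to +1 up to rounding.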

- Maximize the margin $2 / |w|$
- such that $w^t x_i + b \ge +1$ if $y_i = +1$
- $w^t x_i + b \le -1$ if $y_i = -1$
- Difficult to optimize directly
- Convex Quadratic Program (QP) reformulation
- Minimize $\tfrac12 w^t w$
- such that $y_i(w^t x_i + b) \ge 1$
- Convex QPs can be easy to optimize

Minimize $\tfrac12 w^t w + C\, \#(\text{Misclassified points})$

- such that $y_i(w^t x_i + b) \ge 1$ (for “good” points)
- The optimization problem is NP Hard in general
- Disastrous errors are penalized the same as near misses

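[Figure: soft-margin SVM showing the hyperplane $w^t x + b = 0$ with normal $w$ and offset $b$, the supporting planes $w^t x + b = -1$ and $w^t x + b = +1$, and Margin = $2 / \sqrt{w^t w}$. Points are labelled with their slack: $\xi > 1$ (misclassified point), $\xi < 1$, and $\xi = 0$ (support vectors)]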

- Minimize $\tfrac12 w^t w + C \sum_i \xi_i$
- such that $y_i(w^t x_i + b) \ge 1 - \xi_i$
- $\xi_i \ge 0$
- The optimization is a convex QP
- The globally optimal solution will be obtained
- Number of variables = D + N + 1
- Number of constraints = 2N
- Solvers can train on 800K points in 47K (sparse) dimensions in less than 2 minutes on a standard PC
- Fan et al., “LIBLINEAR” JMLR 08
- Bordes et al., “LaRank” ICML 07

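- Maximize $1^t \alpha - \tfrac12 \alpha^t Y K Y \alpha$
- such that $1^t Y \alpha = 0$
- $0 \le \alpha \le C$
- $K$ is a kernel matrix such that $K_{ij} = K(x_i, x_j) = x_i^t x_j$
- $\alpha$ are the dual variables (Lagrange multipliers)
- Knowing $\alpha$ gives us $w$ and $b$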
- The dual is also a convex QP
- Number of variables = N
- Number of constraints = 2N + 1
- Fan et al., “LIBSVM” JMLR 05
- Joachims, “SVMLight”

Most of the SVM $\alpha$s are zero!

Most of the SVM $\alpha$s are zero!

Most of the SVM $\alpha$s are not zero

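- Primal $P = \min_x f_0(x)$
- s. t. $f_i(x) \le 0$, $1 \le i \le N$
- $h_i(x) = 0$, $1 \le i \le M$
- Lagrangian $L(x, \lambda, \nu) = f_0(x) + \sum_i \lambda_i f_i(x) + \sum_i \nu_i h_i(x)$
- Dual $D = \max_{\lambda,\nu} \min_x L(x, \lambda, \nu)$
- s. t. $\lambda \ge 0$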

The Lagrange dual is always concave (even if the primal is not convex) and might be an easier problem to optimize

- Weak duality : $P \ge D$
- Always holds
- Strong duality : P = D
- Does not always hold
- Usually holds for convex problems
- Holds for the SVM QP

If strong duality holds, then for $x^*$, $\lambda^*$ and $\nu^*$ to be optimal the following KKT conditions must necessarily hold

- Primal feasibility : $f_i(x^*) \le 0$ & $h_i(x^*) = 0$ for all $i$
- Dual feasibility : $\lambda^* \ge 0$
- Stationarity : $\nabla_x L(x^*, \lambda^*, \nu^*) = 0$
- Complementary slackness : $\lambda_i^* f_i(x^*) = 0$
- If $x^+$, $\lambda^+$ and $\nu^+$ satisfy the KKT conditions for a convex problem then they are optimal

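- Primal $P = \min_{w,\xi,b} \tfrac12 w^t w + C\, 1^t \xi$
- s. t. $Y(X^t w + b1) \ge 1 - \xi$
- $\xi \ge 0$
- Lagrangian $L(\alpha, \beta, w, \xi, b) = \tfrac12 w^t w + C\, 1^t \xi - \beta^t \xi - \alpha^t [Y(X^t w + b1) - 1 + \xi]$
- Dual $D = \max_\alpha 1^t \alpha - \tfrac12 \alpha^t Y K Y \alpha$
- s. t. $1^t Y \alpha = 0$
- $0 \le \alpha \le C$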

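Lagrangian $L(\alpha, \beta, w, \xi, b) = \tfrac12 w^t w + C\, 1^t \xi - \beta^t \xi - \alpha^t [Y(X^t w + b1) - 1 + \xi]$

- Stationarity conditions
- $\nabla_w L = 0 \Rightarrow w^* = X Y \alpha^*$ (Representer Theorem)
- $\nabla_\xi L = 0 \Rightarrow C = \alpha^* + \beta^*$
- $\nabla_b L = 0 \Rightarrow \alpha^{*t} Y 1 = 0$
- Complementary Slackness conditions
- $\alpha_i^* [y_i (x_i^t w^* + b^*) - 1 + \xi_i^*] = 0$
- $\beta_i^* \xi_i^* = 0$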

Misclassifications and margin violations

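- $y_i f(x_i) < 1 \Rightarrow \xi_i^* > 0 \Rightarrow \beta_i^* = 0 \Rightarrow \alpha_i^* = C$
- Support vectors
- $y_i f(x_i) = 1 \Rightarrow \xi_i^* = 0$ & $0 \le \alpha_i^* \le C$
- Correct classifications
- $y_i f(x_i) > 1 \Rightarrow y_i f(x_i) - 1 + \xi_i^* > 0 \Rightarrow \alpha_i^* = 0$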

This 1D dataset can not be separated using a single hyperplane (threshold)

- We need a non-linear decision boundary

[Figure: the 1D dataset plotted along the $x$ axis]

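Let the “lifted” training set be $\{ (\phi(x_i), y_i) \}$

- Define the kernel such that $K_{ij} = K(x_i, x_j) = \phi(x_i)^t \phi(x_j)$
- Primal $P = \min_{w,\xi,b} \tfrac12 w^t w + C\, 1^t \xi$
- s. t. $Y(\phi(X)^t w + b1) \ge 1 - \xi$
- $\xi \ge 0$
- Dual $D = \max_\alpha 1^t \alpha - \tfrac12 \alpha^t Y K Y \alpha$
- s. t. $1^t Y \alpha = 0$
- $0 \le \alpha \le C$
- Classifier: $f(x) = \mathrm{sign}(\phi(x)^t w + b) = \mathrm{sign}(\alpha^t Y K(:,x) + b)$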

Let $\phi(x) = [1, \sqrt{2} x_1, \dots, \sqrt{2} x_D, x_1^2, \dots, x_D^2, \sqrt{2} x_1 x_2, \dots, \sqrt{2} x_1 x_D, \dots, \sqrt{2} x_{D-1} x_D]^t$

- Define $K(x_i, x_j) = \phi(x_i)^t \phi(x_j) = (x_i^t x_j + 1)^2$
- Primal
- Number of variables = $D + N + 1$
- Number of constraints = $2N$
- Number of flops for calculating $\phi(x)^t w$ = $O(D^2)$
- Number of flops for deg 20 polynomial = $O(D^{20})$
- Dual
- Number of variables = $N$
- Number of constraints = $2N + 1$
- Number of flops for calculating $K_{ij}$ = $O(D)$
- Number of flops for deg 20 polynomial = $O(D)$
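
The identity between the explicit degree-2 feature map and the kernel can be confirmed numerically. The sketch below assumes NumPy and builds the lifted vector in the order given above (constant, scaled linear terms, squares, scaled cross terms); `phi` and `poly_kernel` are illustrative names.

```python
import numpy as np
from itertools import combinations

def phi(x):
    """Explicit degree-2 feature map with 1 + 2D + D(D-1)/2 entries."""
    r2 = np.sqrt(2.0)
    cross = [r2 * x[i] * x[j] for i, j in combinations(range(len(x)), 2)]
    return np.concatenate(([1.0], r2 * x, x ** 2, cross))

def poly_kernel(x, z):
    """Same inner product computed in O(D) without forming phi."""
    return (x @ z + 1.0) ** 2
```

For `D = 4` the lifted vector has 15 entries, while the kernel needs only one 4-dimensional dot product; the gap widens rapidly with the polynomial degree.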

- Polynomial : $K(x_i, x_j) = (x_i^t \Sigma^{-1} x_j + c)^d$
- Gaussian (RBF) : $K(x_i, x_j) = \exp(-\sum_k \gamma_k (x_{ik} - x_{jk})^2)$
- Chi-Squared : $K(x_i, x_j) = \exp(-\gamma \chi^2(x_i, x_j))$
- Sigmoid : $K(x_i, x_j) = \tanh(x_i^t x_j - c)$
- $\Sigma$ should be positive definite, $c \ge 0$, $\gamma \ge 0$ and $d$ should be a natural number

Let $Z$ be a compact subset of $\mathbb{R}^D$ and $K$ a continuous symmetric function. Then $K$ is a kernel if

- $\int_Z \int_Z f(x)\, K(x,z)\, f(z)\, dx\, dz \ge 0$
- for all square integrable real valued functions $f$ on $Z$.

Let $Z$ be a compact subset of $\mathbb{R}^D$ and $K$ a continuous symmetric function. Then $K$ is a kernel if

- $\int_Z \int_Z f(x)\, K(x,z)\, f(z)\, dx\, dz \ge 0$
- for all square integrable real valued functions $f$ on $Z$.
- $K$ is a kernel if every finite symmetric matrix formed by evaluating $K$ on pairs of points from $Z$ is positive semi-definite
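
The finite-matrix form of the condition suggests a practical spot check: build the Gram matrix on a sample of points and inspect its eigenvalues. A single sample can only disprove validity, never prove it. The sketch assumes NumPy; the tolerance, the RBF width and the counter-example function are hypothetical choices.

```python
import numpy as np

def gram(kernel, X):
    """Matrix of kernel evaluations on all pairs of rows of X."""
    return np.array([[kernel(a, b) for b in X] for a in X])

def is_psd(K, tol=1e-9):
    """True if all eigenvalues of the symmetric matrix K are >= -tol."""
    return bool(np.all(np.linalg.eigvalsh(K) >= -tol))

def rbf(a, b):
    return np.exp(-0.5 * np.sum((a - b) ** 2))

def neg_sq_dist(a, b):
    # Not a valid kernel: zero diagonal and negative off-diagonal entries
    # force at least one negative eigenvalue for distinct points.
    return -np.sum((a - b) ** 2)
```

On a random sample of distinct points, `is_psd(gram(rbf, X))` should hold while `is_psd(gram(neg_sq_dist, X))` should fail.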

The following operations result in valid kernels

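- $K(x_i, x_j) = \sum_k \alpha_k K_k(x_i, x_j)$ ($\alpha_k \ge 0$)
- $K(x_i, x_j) = \prod_k K_k(x_i, x_j)$
- $K(x_i, x_j) = f(x_i)\, f(x_j)$ ($f : \mathbb{R}^D \to \mathbb{R}$)
- $K(x_i, x_j) = p(K_1(x_i, x_j))$ ($p$ : +ve coeff poly)
- $K(x_i, x_j) = \exp(K_1(x_i, x_j))$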
- Kernels can be defined over graphs, sets, strings and many other interesting data structures

Kernels should encode all our prior knowledge about feature similarities.

- Kernel parameters can be chosen through cross validation or learnt (see Multiple Kernel Learning).
- Non-linear kernels can sometimes boost classification performance tremendously.
- Non-linear kernels are generally expensive (both during training and for prediction)

- such that $f(x_i, y_i) \ge f(x_i, y) + \Delta(y_i, y) - \xi_i \quad \forall y \ne y_i$
- $\xi_i \ge 0$
- Prediction : $\arg\max_y f(x, y)$
- This formulation minimizes the hinge loss on the training set subject to regularization on $f$
- Can be used to predict sets, graphs, etc. for suitable choices of $\Delta$
- Taskar et al., “Max-Margin Markov Networks” NIPS 03
- Tsochantaridis et al., “Large Margin Methods for Structured & Interdependent Output Variables” JMLR 05

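- such that $f(x_i, y_i) \ge f(x_i, y) + \Delta(y_i, y) - \xi_i \quad \forall y \ne y_i$
- $\xi_i \ge 0$
- Prediction : $\arg\max_y f(x, y)$
- $\Delta(y_i, y) = 1 - \delta_{y_i, y}$
- $f(x, y) = w^t [\phi(x) \otimes \psi(y)]$
- $= w_y^t \phi(x)$ (assuming $\psi(y) = e_y$)
- Weston and Watkins, “SVMs for Multi-Class Pattern Recognition” ESANN 99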
- Bordes et al., “LaRank” ICML 07

For $L$ classes, with $N$ points per class, the total number of dual variables is $NL^2$

- Finding the exact solution for real world non-linear problems is often infeasible
- In practice, we can obtain an approximate solution or switch to the 1-vs-All or 1-vs-1 formulations

- A non-linear problem with $L$ classes and $N$ points/class
- SMO training is cubic in the number of dual variables
- The number of support vectors is the same order as the number of training points
