Classification III


### Classification III

Tamara Berg

CS 590-133 Artificial Intelligence

Many slides throughout the course adapted from Svetlana Lazebnik, Dan Klein, Stuart Russell, Andrew Moore, Percy Liang, Luke Zettlemoyer, Rob Pless, Killian Weinberger, Deva Ramanan

Announcements

Pick up your midterm from the TAs if you haven’t gotten it yet

Assignment 4 due today

Nearest Neighbor · Decision Tree · Linear Functions

Discriminant Function
• It can be an arbitrary function of x, such as:
Linear classifier
• Find a linear function to separate the classes

f(x) = sgn(w1x1 + w2x2 + … + wDxD) = sgn(w · x)
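A minimal sketch of this classifier in Python; the weight vector and test points are made-up illustrative values, with the bias folded in as a trailing feature fixed to 1 (as discussed on the perceptron slide below):

```python
# Linear classifier f(x) = sgn(w . x); w and the points are illustrative.

def sgn(v):
    return 1 if v >= 0 else -1

def linear_classify(w, x):
    # dot product w . x; the bias is the last weight, paired with a feature of 1
    return sgn(sum(wi * xi for wi, xi in zip(w, x)))

w = [1.0, -2.0, 0.5]          # hypothetical learned weights (last entry = bias)
x_pos = [3.0, 1.0, 1.0]       # trailing 1.0 is the always-on bias feature
x_neg = [0.0, 2.0, 1.0]

print(linear_classify(w, x_pos))  # 1
print(linear_classify(w, x_neg))  # -1
```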

Perceptron

[Diagram: inputs x1, x2, x3, …, xD with weights w1, w2, w3, …, wD feeding a single unit; Output: sgn(w · x + b)]

Can incorporate the bias as a component of the weight vector by always including a feature with value set to 1

Perceptron training algorithm
• Initialize weights
• Cycle through training examples in multiple passes (epochs)
• For each training example:
• If classified correctly, do nothing
• If classified incorrectly, update weights
Perceptron update rule
• For each training instance x with label y:
• Classify with current weights: y′ = sgn(w · x)
• Update weights (only if misclassified): w ← w + α · y · x
• α is a learning rate that should decay as 1/t, e.g., 1000/(1000+t)
• What happens if the answer is correct? Nothing changes
• Otherwise, consider what happens to individual weights:
• If y = 1 and y′ = −1, wi will be increased if xi is positive or decreased if xi is negative → w · x gets bigger
• If y = −1 and y′ = 1, wi will be decreased if xi is positive or increased if xi is negative → w · x gets smaller
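The training loop and update rule above can be sketched as follows; the toy dataset and epoch count are made up for illustration, and the bias is handled by appending a feature fixed to 1:

```python
# Perceptron training: initialize, cycle through examples for several epochs,
# and update weights only on mistakes, with a decaying learning rate.

def sgn(v):
    return 1 if v >= 0 else -1

def train_perceptron(data, dim, epochs=100):
    w = [0.0] * dim                      # initialize weights (all zeros)
    for t in range(epochs):
        alpha = 1000.0 / (1000.0 + t)    # learning rate decaying roughly as 1/t
        for x, y in data:
            y_pred = sgn(sum(wi * xi for wi, xi in zip(w, x)))
            if y_pred != y:              # update only on a mistake
                w = [wi + alpha * y * xi for wi, xi in zip(w, x)]
    return w

# Linearly separable toy set; the last feature, fixed to 1, acts as the bias.
data = [([2.0, 1.0, 1.0], 1), ([1.5, 2.0, 1.0], 1),
        ([-1.0, -1.5, 1.0], -1), ([-2.0, -0.5, 1.0], -1)]
w = train_perceptron(data, dim=3)
print(all(sgn(sum(wi * xi for wi, xi in zip(w, x))) == y for x, y in data))  # True
```

For separable data like this, the perceptron convergence theorem guarantees the loop eventually classifies every training example correctly.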
Implementation details
• Bias (add feature dimension with value fixed to 1) vs. no bias
• Initialization of weights: all zeros vs. random
• Number of epochs (passes through the training data)
• Order of cycling through training examples
Multi-class perceptrons
• Need to keep a weight vector wc for each class c
• Decision rule: y = argmaxc wc · x
• Update rule: suppose an example from class c gets misclassified as c′
• Update for c: wc ← wc + α · x
• Update for c′: wc′ ← wc′ − α · x
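A sketch of this multi-class scheme, assuming the standard formulation (one weight vector per class, argmax decision, paired additive/subtractive updates); the three-class toy data is made up:

```python
# Multi-class perceptron: predict argmax_c w_c . x; on a mistake, add alpha*x
# to the true class's weights and subtract it from the predicted class's.

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def train_multiclass(data, n_classes, dim, epochs=50, alpha=1.0):
    W = [[0.0] * dim for _ in range(n_classes)]
    for _ in range(epochs):
        for x, c in data:
            c_pred = max(range(n_classes), key=lambda k: dot(W[k], x))
            if c_pred != c:
                W[c] = [wi + alpha * xi for wi, xi in zip(W[c], x)]            # boost true class
                W[c_pred] = [wi - alpha * xi for wi, xi in zip(W[c_pred], x)]  # demote predicted class
    return W

# Three well-separated classes in 2D, with a bias feature appended.
data = [([5.0, 0.0, 1.0], 0), ([4.0, 1.0, 1.0], 0),
        ([0.0, 5.0, 1.0], 1), ([1.0, 4.0, 1.0], 1),
        ([-4.0, -4.0, 1.0], 2), ([-5.0, -3.0, 1.0], 2)]
W = train_multiclass(data, n_classes=3, dim=3)
preds = [max(range(3), key=lambda k: dot(W[k], x)) for x, _ in data]
print(preds)  # [0, 0, 1, 1, 2, 2]
```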
Differentiable perceptron

[Diagram: inputs x1, x2, x3, …, xd with weights w1, w2, w3, …, wd feeding a single unit; Output: σ(w · x + b)]

Sigmoid function: σ(t) = 1 / (1 + e^(−t))

Update rule for differentiable perceptron
• Define total classification error or loss on the training set: E(w) = Σj (yj − σ(w · xj))²
• Update weights by gradient descent: w ← w − α ∂E/∂w
• For a single training point, the update is: w ← w + α (y − σ(w · x)) σ′(w · x) x
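A small numeric check of the single-point update, using the standard identity σ′(t) = σ(t)(1 − σ(t)); the data point and learning rate are illustrative:

```python
# One gradient-descent step on E = (y - sigma(w.x))^2 for a sigmoid unit:
# w <- w + alpha * (y - sigma(w.x)) * sigma'(w.x) * x
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def loss(w, x, y):
    return (y - sigmoid(sum(wi * xi for wi, xi in zip(w, x)))) ** 2

def update(w, x, y, alpha=0.5):
    s = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
    g = (y - s) * s * (1.0 - s)          # sigma'(t) = sigma(t) * (1 - sigma(t))
    return [wi + alpha * g * xi for wi, xi in zip(w, x)]

w, x, y = [0.1, -0.2], [1.0, 2.0], 1.0
w_new = update(w, x, y)
print(loss(w_new, x, y) < loss(w, x, y))  # True: one step reduces the error
```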
Multi-Layer Neural Network
• Can learn nonlinear functions
• Training: find network weights to minimize the error between true and estimated labels of training examples: E(w) = Σj (yj − fw(xj))²
• Minimization can be done by gradient descent provided f is differentiable
• This training method is called back-propagation
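Back-propagation can be sketched for a tiny network with one sigmoid hidden layer and a sigmoid output on squared error; the network sizes and weight values are illustrative, and the analytic gradient is verified against a finite-difference estimate:

```python
# Back-propagation for a 2-input, 2-hidden-unit, 1-output sigmoid network.
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def forward(W1, w2, x):
    h = [sigmoid(sum(wij * xj for wij, xj in zip(row, x))) for row in W1]
    return h, sigmoid(sum(w2j * hj for w2j, hj in zip(w2, h)))

def loss(W1, w2, x, y):
    _, out = forward(W1, w2, x)
    return (y - out) ** 2

def backprop(W1, w2, x, y):
    h, out = forward(W1, w2, x)
    # chain rule: dE/dz_out, then propagate back through each layer
    d_out = -2.0 * (y - out) * out * (1.0 - out)
    g_w2 = [d_out * hj for hj in h]
    g_W1 = [[d_out * w2i * h[i] * (1.0 - h[i]) * xj for xj in x]
            for i, w2i in enumerate(w2)]
    return g_W1, g_w2

W1 = [[0.2, -0.3], [0.4, 0.1]]   # hidden-layer weights (2 units, 2 inputs)
w2 = [0.5, -0.25]                # output-layer weights
x, y = [1.0, 0.5], 1.0

g_W1, g_w2 = backprop(W1, w2, x, y)

# finite-difference check on one hidden weight
eps = 1e-6
W1p = [row[:] for row in W1]
W1p[0][0] += eps
num = (loss(W1p, w2, x, y) - loss(W1, w2, x, y)) / eps
print(abs(num - g_W1[0][0]) < 1e-5)  # True: analytic and numeric gradients agree
```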
Deep convolutional neural networks

Zeiler, M., and Fergus, R. Visualizing and Understanding Convolutional Neural Networks, tech report, 2013.

Krizhevsky, A., Sutskever, I., and Hinton, G.E. ImageNet classification with deep convolutional neural networks. NIPS, 2012.

Demo from Berkeley

http://decaf.berkeleyvision.org/

Demo in the browser!

https://www.jetpac.com/deepbelief

Linear classifier
• Find a linear function to separate the classes

f(x) = sgn(w1x1 + w2x2 + … + wDxD) = sgn(w · x)

Linear Discriminant Function

• f(x) is a linear function: f(x) = wᵀx + b
• The decision boundary wᵀx + b = 0 is a hyperplane in the feature space: points with wᵀx + b > 0 fall on the +1 side, points with wᵀx + b < 0 on the −1 side

[Figure: 2D feature space (x1, x2); ● denotes +1, ○ denotes −1; the hyperplane wᵀx + b = 0 separates the two regions]

Linear Discriminant Function

• How would you classify these points (+1 vs. −1) using a linear discriminant function in order to minimize the error rate?
• Many different separating lines classify the training data perfectly. Which one is the best?

[Figure: the same 2D data shown with several candidate separating lines]

Large Margin Linear Classifier

• The linear discriminant function (classifier) with the maximum margin is the best
• Margin is defined as the width that the boundary could be increased by before hitting a data point
• Why is it the best? Strong generalization ability

[Figure: separating line in (x1, x2) with the margin, a "safe zone" on either side of the boundary]

Linear SVM

[Figure: positive examples x+ and negative examples x−; the examples lying on the margin boundaries are the support vectors]

Large Margin Linear Classifier

[Figure: decision boundary wᵀx + b = 0 with margin boundaries wᵀx + b = 1 and wᵀx + b = −1 in the (x1, x2) plane]

Support vector machines
• Find hyperplane that maximizes the margin between the positive and negative examples

For support vectors, yi(wᵀxi + b) = 1

Distance between point and hyperplane: |wᵀx + b| / ‖w‖

Therefore, the margin is 2 / ‖w‖

[Figure: separating hyperplane with the margin; the points on the margin boundaries are the support vectors]

C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining and Knowledge Discovery, 1998

Finding the maximum margin hyperplane
• Maximize margin 2 / ‖w‖, i.e., minimize ‖w‖² / 2
• Correctly classify all training data: yi(wᵀxi + b) ≥ 1 for all i
C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining and Knowledge Discovery, 1998
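A numeric illustration of these two conditions; the hyperplane and points below are made-up values, with w scaled so the support vectors satisfy yi(wᵀxi + b) = 1:

```python
# Check the margin constraints and compute the margin 2/||w||.
import math

w, b = [1.0, 1.0], -3.0         # hyperplane x1 + x2 - 3 = 0
points = [([2.0, 2.0], 1), ([3.0, 1.0], 1), ([1.0, 1.0], -1), ([0.0, 2.0], -1)]

# every training point satisfies y_i (w.x_i + b) >= 1
feasible = all(y * (sum(wi * xi for wi, xi in zip(w, x)) + b) >= 1
               for x, y in points)
print(feasible)                  # True

margin = 2.0 / math.sqrt(sum(wi * wi for wi in w))
print(round(margin, 4))          # 1.4142, i.e. 2/sqrt(2)
```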

Solving the Optimization Problem
• The linear discriminant function is: f(x) = Σi αi yi xiᵀx + b, where the sum runs over the support vectors
• Notice it relies on a dot product between the test point x and the support vectors xi

Non-linear SVMs: Feature Space

• General idea: the original input space can be mapped to some higher-dimensional feature space where the training set is separable:

Φ: x → φ(x)

Slide courtesy of www.iro.umontreal.ca/~pift6080/documents/papers/svm_tutorial.ppt

Nonlinear SVMs: The Kernel Trick

• With this mapping, our discriminant function becomes: f(x) = Σi αi yi φ(xi)ᵀφ(x) + b
• No need to know this mapping explicitly, because we only use the dot product of feature vectors in both training and testing.
• A kernel function is defined as a function that corresponds to a dot product of two feature vectors in some expanded feature space: K(xi, xj) = φ(xi)ᵀφ(xj)
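This identity can be checked numerically for the classic example of the degree-2 polynomial kernel K(u, v) = (uᵀv)², whose explicit feature map in 2D is φ(x) = (x1², √2·x1x2, x2²); the test vectors are arbitrary:

```python
# Verify K(u, v) = phi(u) . phi(v) for the degree-2 polynomial kernel.
import math

def K(u, v):
    return (u[0] * v[0] + u[1] * v[1]) ** 2

def phi(x):
    return [x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2]

u, v = [1.0, 2.0], [3.0, -1.0]
lhs = K(u, v)
rhs = sum(a * b for a, b in zip(phi(u), phi(v)))
print(abs(lhs - rhs) < 1e-9)   # True: the kernel is a dot product in feature space
```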

Nonlinear SVMs: The Kernel Trick

• Examples of commonly-used kernel functions:
• Linear kernel: K(xi, xj) = xiᵀxj
• Polynomial kernel: K(xi, xj) = (1 + xiᵀxj)^p
• Gaussian (Radial Basis Function, RBF) kernel: K(xi, xj) = exp(−‖xi − xj‖² / (2σ²))
• Sigmoid kernel: K(xi, xj) = tanh(β0 xiᵀxj + β1)
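The four kernels written out in code; the parameter defaults (p, σ, β0, β1) are illustrative choices, not prescribed by the slides:

```python
# Common SVM kernel functions.
import math

def linear(u, v):
    return sum(a * b for a, b in zip(u, v))

def polynomial(u, v, p=2):
    return (1.0 + linear(u, v)) ** p

def gaussian(u, v, sigma=1.0):
    sq = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-sq / (2.0 * sigma ** 2))

def sigmoid_kernel(u, v, beta0=1.0, beta1=0.0):
    return math.tanh(beta0 * linear(u, v) + beta1)

u, v = [1.0, 0.0], [0.0, 1.0]
print(linear(u, v))        # 0.0
print(polynomial(u, v))    # 1.0
print(gaussian(u, u))      # 1.0 (a point has similarity 1 with itself)
```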
Support Vector Machine: Algorithm

1. Choose a kernel function

2. Choose a value for C and any other parameters (e.g. σ)

3. Solve the quadratic programming problem (many software packages available)

4. Classify held out validation instances using the learned model

5. Select the best learned model based on validation accuracy

6. Classify test instances using the final selected model
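The six steps above can be run end-to-end with scikit-learn (assumed installed), whose `SVC` class wraps the quadratic-programming solver; the toy data split and parameter grid are illustrative:

```python
# SVM training/selection pipeline following the six steps above.
from sklearn.svm import SVC

# toy separable data: class +1 upper-right, class -1 lower-left
X_train = [[2, 2], [3, 3], [2, 3], [-2, -2], [-3, -3], [-2, -3]]
y_train = [1, 1, 1, -1, -1, -1]
X_val, y_val = [[2.5, 2.5], [-2.5, -2.5]], [1, -1]

best_model, best_acc = None, -1.0
for C in [0.1, 1.0, 10.0]:                         # step 2: try several C values
    model = SVC(kernel='rbf', C=C, gamma='scale')  # steps 1 and 3
    model.fit(X_train, y_train)
    acc = model.score(X_val, y_val)                # step 4: validate
    if acc > best_acc:                             # step 5: keep the best model
        best_model, best_acc = model, acc

X_test, y_test = [[3, 2], [-3, -2]], [1, -1]
print(best_model.score(X_test, y_test))            # step 6: classify test instances
```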

Some Issues
• Choice of kernel

- Gaussian or polynomial kernel is the default

- If ineffective, more elaborate kernels are needed

- Domain experts can give assistance in formulating appropriate similarity measures

• Choice of kernel parameters

- e.g., σ in the Gaussian kernel

- In the absence of reliable criteria, applications rely on the use of a validation set or cross-validation to set such parameters.

This slide is courtesy of www.iro.umontreal.ca/~pift6080/documents/papers/svm_tutorial.ppt

Summary: Support Vector Machine

1. Large Margin Classifier

• Better generalization ability & less over-fitting

2. The Kernel Trick

• Map data points to higher dimensional space in order to make them linearly separable.
• Since only dot product is needed, we do not need to represent the mapping explicitly.
Detection

[Pipeline: window x → features F(x) → classify → label y ∈ {+1 pos, −1 neg}]

• We slide a window over the image
• Extract features for each window
• Classify each window into pos/neg
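The sliding-window loop can be sketched as below; the image size, window size, stride, and the stand-in classifier (a brightness threshold instead of real features plus a trained model F) are all made up for illustration:

```python
# Sliding-window detection: enumerate windows, classify each as pos/neg.

def sliding_windows(width, height, win=4, stride=2):
    """Yield the top-left corner of every window position."""
    for top in range(0, height - win + 1, stride):
        for left in range(0, width - win + 1, stride):
            yield left, top

def classify_window(image, left, top, win=4):
    # stand-in for F(x): call a window positive if its mean pixel is bright
    pixels = [image[top + r][left + c] for r in range(win) for c in range(win)]
    return 1 if sum(pixels) / len(pixels) > 0.5 else -1

# 8x8 toy "image": bright 4x4 square in the top-left corner
image = [[1.0 if r < 4 and c < 4 else 0.0 for c in range(8)] for r in range(8)]

detections = [(l, t) for l, t in sliding_windows(8, 8)
              if classify_window(image, l, t) == 1]
print(detections)  # [(0, 0)] -- only the window over the bright square fires
```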