# Supervised Learning Recap

Supervised Learning Recap. Machine Learning. Last Time: Support Vector Machines, Kernel Methods. Today: Review of Supervised Learning; Unsupervised Learning; (Soft) K-means Clustering; Expectation Maximization; Spectral Clustering; Principal Component Analysis; Latent Semantic Analysis.

## Presentation Transcript

### Supervised Learning Recap

Machine Learning

Last Time
• Support Vector Machines
• Kernel Methods
Today
• Review of Supervised Learning
• Unsupervised Learning
• (Soft) K-means clustering
• Expectation Maximization
• Spectral Clustering
• Principal Component Analysis
• Latent Semantic Analysis
Supervised Learning
• Linear Regression
• Logistic Regression
• Graphical Models
• Hidden Markov Models
• Neural Networks
• Support Vector Machines
• Kernel Methods
Major concepts
• Gaussian, Multinomial, Bernoulli Distributions
• Joint vs. Conditional Distributions
• Marginalization
• Maximum Likelihood
• Risk Minimization
• Feature Extraction, Kernel Methods
Some favorite distributions
• Bernoulli
• Multinomial
• Gaussian
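
For reference (not part of the original slides), the standard forms of these three distributions are:

```latex
% Bernoulli (single binary outcome x in {0,1} with mean mu)
p(x \mid \mu) = \mu^{x}(1-\mu)^{1-x}

% Multinomial (counts x_1,...,x_K out of N draws with probabilities mu_k)
p(x_1,\dots,x_K \mid \boldsymbol{\mu}, N) = \frac{N!}{x_1!\cdots x_K!}\prod_{k=1}^{K}\mu_k^{x_k}

% Gaussian (univariate, with mean mu and variance sigma^2)
p(x \mid \mu, \sigma^{2}) = \frac{1}{\sqrt{2\pi\sigma^{2}}}\exp\!\left(-\frac{(x-\mu)^{2}}{2\sigma^{2}}\right)
```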
Maximum Likelihood
• Identify the parameter values that yield the maximum likelihood of generating the observed data.
• Take the partial derivative of the likelihood function
• Set to zero
• Solve
• NB: maximum likelihood parameters are the same as maximum log likelihood parameters
Maximum Log Likelihood
• Why do we like the log function?
• It turns products (difficult to differentiate) into sums (easy to differentiate)
• log(xy) = log(x) + log(y)
• log(x^c) = c log(x)
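
A quick worked example, added here for reference: applying the recipe above to a Bernoulli likelihood (using the log to turn the product into a sum) recovers the sample mean.

```latex
L(\mu) = \prod_{i=1}^{N} \mu^{x_i}(1-\mu)^{1-x_i}
\quad\Rightarrow\quad
\log L(\mu) = \sum_{i=1}^{N}\bigl[x_i\log\mu + (1-x_i)\log(1-\mu)\bigr]

\frac{\partial \log L}{\partial \mu}
  = \frac{\sum_i x_i}{\mu} - \frac{N-\sum_i x_i}{1-\mu} = 0
\quad\Rightarrow\quad
\hat{\mu}_{\text{ML}} = \frac{1}{N}\sum_{i=1}^{N} x_i
```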
Risk Minimization
• Pick a loss function
• Squared loss
• Linear loss
• Perceptron (classification) loss
• Identify the parameters that minimize the loss function.
• Take the partial derivative of the loss function
• Set to zero
• Solve
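
A minimal NumPy sketch of this recipe for squared loss on a linear model; the function and data below are illustrative, not from the slides.

```python
import numpy as np

def fit_squared_loss(X, y, lr=0.01, steps=1000):
    """Minimize empirical risk with squared loss: R(w) = (1/N) * sum_i (x_i . w - y_i)^2."""
    N, d = X.shape
    w = np.zeros(d)
    for _ in range(steps):
        residual = X @ w - y               # prediction error on each example
        grad = (2.0 / N) * X.T @ residual  # partial derivative of the risk w.r.t. w
        w -= lr * grad                     # gradient step toward lower risk
    return w

# Toy usage: recover a known weight vector from noisy data
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=200)
print(fit_squared_loss(X, y))
```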
Frequentists vs. Bayesians
• Point estimates vs. Posteriors
• Risk Minimization vs. Maximum Likelihood
• L2-Regularization
• Frequentists: Add a constraint on the size of the weight vector
• Bayesians: Introduce a zero-mean prior on the weight vector
• Result is the same!
L2-Regularization
• Frequentists:
• Introduce a cost on the size of the weights
• Bayesians:
• Introduce a prior on the weights
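
Written out (a standard derivation, added for reference), the two views lead to the same objective up to a positive scale factor and an additive constant:

```latex
% Frequentist view: squared loss plus an L2 penalty on w
J(\mathbf{w}) = \sum_{i}\bigl(y_i - \mathbf{w}^{\top}\mathbf{x}_i\bigr)^{2} + \lambda\lVert\mathbf{w}\rVert^{2}

% Bayesian view: Gaussian likelihood times a zero-mean Gaussian prior on w
-\log\Bigl[\prod_{i}\mathcal{N}\bigl(y_i \mid \mathbf{w}^{\top}\mathbf{x}_i,\sigma^{2}\bigr)\,
           \mathcal{N}\bigl(\mathbf{w}\mid\mathbf{0},\tau^{2}I\bigr)\Bigr]
  = \frac{1}{2\sigma^{2}}\sum_{i}\bigl(y_i - \mathbf{w}^{\top}\mathbf{x}_i\bigr)^{2}
  + \frac{1}{2\tau^{2}}\lVert\mathbf{w}\rVert^{2} + \text{const}

% Both objectives have the same minimizer when lambda = sigma^2 / tau^2
```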
Types of Classifiers
• Generative Models
• Highest resource requirements.
• Need to approximate the joint probability
• Discriminative Models
• Moderate resource requirements.
• Typically fewer parameters to approximate than generative models
• Discriminant Functions
• Can be trained probabilistically, but the output does not include confidence information
Linear Regression
• Fit a line to a set of points
Linear Regression
• Extension to higher dimensions
• Polynomial fitting
• Arbitrary function fitting
• Wavelets
• Classifier output
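
A minimal NumPy sketch of both ideas, fitting a straight line and then a polynomial by least squares (the data and the degree are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 50)
y = 2.0 * x + 0.5 + 0.05 * rng.normal(size=x.size)   # noisy points around a line

# Fit a line: design matrix [x, 1], then solve the least-squares problem
A = np.column_stack([x, np.ones_like(x)])
slope, intercept = np.linalg.lstsq(A, y, rcond=None)[0]
print(slope, intercept)

# Polynomial fitting is the same idea with more basis functions (here degree 3)
coeffs = np.polyfit(x, y, deg=3)
print(coeffs)
```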
Logistic Regression
• Fit Gaussians to the data for each class
• The decision boundary is where the PDFs cross
• No “closed form” solution when the gradient is set to zero, so the weights are found iteratively
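
A minimal NumPy sketch of that iterative fit, ascending the log likelihood by gradient steps (the function and data here are illustrative, not from the slides):

```python
import numpy as np

def fit_logistic(X, y, lr=0.1, steps=2000):
    """Fit weights so that p(y=1 | x) = sigmoid(x . w), for labels y in {0, 1}."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X @ w))   # predicted probabilities
        grad = X.T @ (y - p)               # gradient of the log likelihood
        w += lr * grad / len(y)            # ascend the (average) log likelihood
    return w

# Toy usage on two linearly separable blobs, with a bias feature appended
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(-1, 1, (50, 2)), rng.normal(1, 1, (50, 2))])
X = np.column_stack([X, np.ones(len(X))])
y = np.array([0] * 50 + [1] * 50)
print(fit_logistic(X, y))
```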
Graphical Models
• General way to describe the dependence relationships between variables.
• Junction Tree Algorithm allows us to efficiently calculate marginals over any variable.
Junction Tree Algorithm
• Moralization
• “Marry the parents”
• Make undirected
• Triangulation
• Add chords so that no cycle of length four or more is left without one
• Junction Tree Construction
• Identify separators such that the running intersection property holds
• Introduction of Evidence
• Pass messages around the junction tree to generate marginals
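
As a small illustration of the first step (moralization, then dropping edge directions) on a DAG stored as a parent map; this is only a sketch of that one step, not the full junction tree algorithm, and the representation is an assumption made here.

```python
from itertools import combinations

def moralize(parents):
    """Given a DAG as {node: [its parents]}, return the undirected edge set of the moral graph."""
    edges = set()
    for child, ps in parents.items():
        # keep the original edges, dropping direction
        for p in ps:
            edges.add(frozenset((p, child)))
        # "marry the parents": connect every pair of parents that share a child
        for a, b in combinations(ps, 2):
            edges.add(frozenset((a, b)))
    return edges

# Toy DAG: A -> C <- B  (moralization adds the undirected edge A-B)
print(moralize({"A": [], "B": [], "C": ["A", "B"]}))
```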
Hidden Markov Models
• Sequential Modeling
• Generative Model
• Relationship between observations and state (class) sequences
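
For reference, the joint distribution of a standard HMM factorizes over the state sequence and the observation sequence as:

```latex
% Joint over states z_{1:T} and observations x_{1:T}
p(x_{1:T}, z_{1:T}) = p(z_1)\,p(x_1 \mid z_1)\prod_{t=2}^{T} p(z_t \mid z_{t-1})\,p(x_t \mid z_t)
```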
Perceptron
• Step function used for squashing.
• Classifier as Neuron metaphor.
Perceptron Loss
• Classification Error vs. Sigmoid Error
• Loss is only calculated on Mistakes

Perceptrons use strictly classification error.
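
A minimal sketch of the perceptron rule, which updates the weights only on misclassified points (labels assumed to be ±1; the data is illustrative):

```python
import numpy as np

def train_perceptron(X, y, epochs=20):
    """Perceptron with labels y in {-1, +1}; the weights change only on mistakes."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            if y_i * (x_i @ w) <= 0:   # mistake: prediction sign disagrees with label
                w += y_i * x_i         # nudge the boundary toward the misclassified point
    return w

# Toy usage: two separable clusters with a bias feature appended
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(-2, 1, (30, 2)), rng.normal(2, 1, (30, 2))])
X = np.column_stack([X, np.ones(len(X))])
y = np.array([-1] * 30 + [1] * 30)
print(train_perceptron(X, y))
```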

Neural Networks
• Interconnected Layers of Perceptrons or Logistic Regression “neurons”
Neural Networks
• There are many possible configurations of neural networks
• Vary the number of layers
• Size of layers
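
A minimal forward pass through one such configuration, two layers of logistic "neurons"; the layer sizes and random weights here are arbitrary illustrations, not a trained network.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward(x, W1, b1, W2, b2):
    """Two-layer network: input -> hidden logistic layer -> logistic output layer."""
    h = sigmoid(W1 @ x + b1)      # hidden layer activations
    return sigmoid(W2 @ h + b2)   # network output

# Toy sizes: 3 inputs, 4 hidden units, 1 output
rng = np.random.default_rng(4)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=(1, 4)), np.zeros(1)
print(forward(np.array([0.5, -1.0, 2.0]), W1, b1, W2, b2))
```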
Support Vector Machines
• Maximum Margin Classification

[Figure: comparison of a small-margin separator and a large-margin separator]

Support Vector Machines
• Optimization Function
• Decision Function
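
In standard form (added here for reference), the hard-margin optimization problem and the resulting decision function are:

```latex
% Hard-margin optimization: maximize the margin by minimizing ||w||^2
\min_{\mathbf{w},b}\ \tfrac{1}{2}\lVert\mathbf{w}\rVert^{2}
\quad\text{subject to}\quad y_i\bigl(\mathbf{w}^{\top}\mathbf{x}_i + b\bigr) \ge 1 \ \ \forall i

% Decision function; in the kernelized dual form only the support vectors have alpha_i > 0
f(\mathbf{x}) = \operatorname{sign}\bigl(\mathbf{w}^{\top}\mathbf{x} + b\bigr)
             = \operatorname{sign}\Bigl(\sum_i \alpha_i y_i\, k(\mathbf{x}_i,\mathbf{x}) + b\Bigr)
```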
Questions?
• Now would be a good time to ask questions about Supervised Techniques.
Clustering
• Identify discrete groups of similar data points
• Data points are unlabeled
Recall K-Means
• Algorithm
• Select K – the desired number of clusters
• Initialize K cluster centroids
• For each point in the data set, assign it to the cluster with the closest centroid
• Update the centroid based on the points assigned to each cluster
• If any data point has changed clusters, repeat
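
A minimal NumPy sketch of exactly this loop (illustrative only; empty clusters are not handled):

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), k, replace=False)]   # initialize K centroids
    for _ in range(iters):
        # assign each point to the cluster with the closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # update each centroid as the mean of the points assigned to it
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):   # nothing changed: stop
            break
        centroids = new_centroids
    return centroids, labels

# Toy usage: two well-separated blobs
data_rng = np.random.default_rng(1)
data = np.vstack([data_rng.normal(0, 0.3, (50, 2)), data_rng.normal(3, 0.3, (50, 2))])
print(kmeans(data, k=2)[0])
```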
Soft K-means
• In k-means, we force every data point to exist in exactly one cluster.
• This constraint can be relaxed.

Minimizes the entropy of cluster assignment

Soft k-means
• We still define a cluster by a centroid, but we calculate the centroid as the weighted mean of all the data points
• Convergence is based on a stopping threshold rather than changed assignments
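
With a stiffness parameter β (an assumption of this particular soft k-means formulation), the soft assignments (responsibilities) and the weighted centroid update can be written as:

```latex
% Soft assignment (responsibility) of cluster k for point x_n, with stiffness beta
r_{nk} = \frac{\exp\bigl(-\beta\,\lVert\mathbf{x}_n-\boldsymbol{\mu}_k\rVert^{2}\bigr)}
              {\sum_{j}\exp\bigl(-\beta\,\lVert\mathbf{x}_n-\boldsymbol{\mu}_j\rVert^{2}\bigr)}

% Centroid update: weighted mean of all the data points
\boldsymbol{\mu}_k = \frac{\sum_{n} r_{nk}\,\mathbf{x}_n}{\sum_{n} r_{nk}}
```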
Gaussian Mixture Models
• Rather than identifying clusters by “nearest” centroids
• Fit a Set of k Gaussians to the data.
Gaussian Mixture Models
• Formally, a mixture model is the weighted sum of a number of PDFs, where the weights are determined by a mixing distribution.
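
Concretely, with mixing weights that are non-negative and sum to one, the standard form is:

```latex
p(\mathbf{x}) = \sum_{k=1}^{K} \pi_k\,\mathcal{N}\bigl(\mathbf{x}\mid\boldsymbol{\mu}_k,\boldsymbol{\Sigma}_k\bigr),
\qquad \pi_k \ge 0,\ \ \sum_{k=1}^{K}\pi_k = 1
```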
Graphical Models with Unobserved Variables
• What if you have variables in a Graphical model that are never observed?
• Latent Variables
• Training latent variable models is an unsupervised learning application

[Figure: example model in which latent states (uncomfortable, amused) generate observations (sweating, laughing)]

Latent Variable HMMs
• We can cluster sequences using an HMM with unobserved state variables
• We will train the latent variable models using Expectation Maximization
Expectation Maximization
• Both the training of GMMs and of graphical models with latent variables is accomplished using Expectation Maximization
• Step 1: Expectation (E-step)
• Evaluate the “responsibilities” of each cluster with the current parameters
• Step 2: Maximization (M-step)
• Re-estimate parameters using the existing “responsibilities”
• Related to k-means
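
For a GMM, these two steps take a standard closed form, added here for reference:

```latex
% E-step: responsibility of component k for point x_n under the current parameters
\gamma_{nk} = \frac{\pi_k\,\mathcal{N}(\mathbf{x}_n\mid\boldsymbol{\mu}_k,\boldsymbol{\Sigma}_k)}
                   {\sum_{j}\pi_j\,\mathcal{N}(\mathbf{x}_n\mid\boldsymbol{\mu}_j,\boldsymbol{\Sigma}_j)}

% M-step: re-estimate the parameters using the responsibilities as soft counts
\boldsymbol{\mu}_k = \frac{\sum_n \gamma_{nk}\,\mathbf{x}_n}{\sum_n \gamma_{nk}},\qquad
\boldsymbol{\Sigma}_k = \frac{\sum_n \gamma_{nk}\,(\mathbf{x}_n-\boldsymbol{\mu}_k)(\mathbf{x}_n-\boldsymbol{\mu}_k)^{\top}}{\sum_n \gamma_{nk}},\qquad
\pi_k = \frac{1}{N}\sum_n \gamma_{nk}
```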
Questions
• One more time for questions on supervised learning…
Next Time
• Gaussian Mixture Models (GMMs)
• Expectation Maximization