Supervised Learning Recap


Machine Learning. Last time: Support Vector Machines, Kernel Methods. Today: Review of Supervised Learning; Unsupervised Learning: (Soft) K-means clustering, Expectation Maximization, Spectral Clustering, Principal Components Analysis, Latent Semantic Analysis.

Presentation Transcript
Last Time
  • Support Vector Machines
  • Kernel Methods
Today
  • Review of Supervised Learning
  • Unsupervised Learning
    • (Soft) K-means clustering
    • Expectation Maximization
    • Spectral Clustering
    • Principal Components Analysis
    • Latent Semantic Analysis
Supervised Learning
  • Linear Regression
  • Logistic Regression
  • Graphical Models
    • Hidden Markov Models
  • Neural Networks
  • Support Vector Machines
    • Kernel Methods
Major concepts
  • Gaussian, Multinomial, Bernoulli Distributions
  • Joint vs. Conditional Distributions
  • Marginalization
  • Maximum Likelihood
  • Risk Minimization
  • Gradient Descent
  • Feature Extraction, Kernel Methods
Some favorite distributions
  • Bernoulli
  • Multinomial
  • Gaussian
Maximum Likelihood
  • Identify the parameter values that yield the maximum likelihood of generating the observed data.
      • Take the partial derivative of the likelihood function
      • Set to zero
      • Solve
  • NB: maximum likelihood parameters are the same as maximum log likelihood parameters
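The recipe above can be sketched for one concrete case. This is a minimal illustration, assuming a univariate Gaussian and synthetic data (both are choices made here, not taken from the slides): setting the derivatives of the log likelihood to zero yields the sample mean and the divide-by-N sample variance.

```python
import numpy as np

def gaussian_mle(x):
    """Closed-form maximum likelihood estimates for a 1-D Gaussian."""
    x = np.asarray(x, dtype=float)
    mu = x.mean()                      # d/dmu of log L = 0 gives the sample mean
    var = ((x - mu) ** 2).mean()       # d/dvar of log L = 0 divides by N, not N-1
    return mu, var

def log_likelihood(x, mu, var):
    """Log likelihood of the data under N(mu, var)."""
    x = np.asarray(x, dtype=float)
    return np.sum(-0.5 * np.log(2 * np.pi * var) - (x - mu) ** 2 / (2 * var))

rng = np.random.default_rng(0)          # synthetic data, for illustration only
data = rng.normal(loc=2.0, scale=1.5, size=1000)
mu_hat, var_hat = gaussian_mle(data)
```

Evaluating `log_likelihood` at slightly perturbed parameters should never give a larger value than at `(mu_hat, var_hat)`, which is a quick sanity check on the derivation.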
Maximum Log Likelihood
  • Why do we like the log function?
  • It turns products (difficult to differentiate) into sums (easy to differentiate)
  • log(xy) = log(x) + log(y)
  • log(x^c) = c log(x)
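As a worked example of why the log helps, here is the Bernoulli case, assuming N independent observations x_i in {0, 1} with parameter θ (the notation is chosen here for illustration):

```latex
\begin{align*}
L(\theta) &= \prod_{i=1}^{N} \theta^{x_i} (1-\theta)^{1-x_i} \\
\log L(\theta) &= \sum_{i=1}^{N} \bigl[ x_i \log\theta + (1-x_i)\log(1-\theta) \bigr] \\
\frac{\partial \log L}{\partial \theta} &= \frac{\sum_i x_i}{\theta} - \frac{N - \sum_i x_i}{1-\theta} = 0
\;\Longrightarrow\; \hat{\theta} = \frac{1}{N}\sum_{i=1}^{N} x_i
\end{align*}
```

The product of N terms becomes a sum whose derivative can be taken term by term, and the maximizer is unchanged because the log is monotonic.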
Risk Minimization
  • Pick a loss function
    • Squared loss
    • Linear loss
    • Perceptron (classification) loss
  • Identify the parameters that minimize the loss function.
    • Take the partial derivative of the loss function
    • Set to zero
    • Solve
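For squared loss with a linear model, the "differentiate, set to zero, solve" steps above have a closed form (the normal equations). A minimal sketch, assuming a one-dimensional input and a hypothetical helper name `fit_least_squares`:

```python
import numpy as np

def fit_least_squares(x, y):
    """Minimize the sum of squared errors for y ~ w0 + w1 * x."""
    X = np.column_stack([np.ones_like(x), x])   # add a bias column
    # Setting the gradient 2 X^T (X w - y) to zero gives the normal equations.
    w = np.linalg.solve(X.T @ X, X.T @ y)
    return w

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 1.0 + 2.0 * x                                # noiseless toy data
w = fit_least_squares(x, y)                      # recovers intercept 1, slope 2
```

Other losses (for example the perceptron loss) generally have no such closed form, which is where gradient descent comes in.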
Frequentists v. Bayesians
  • Point estimates vs. Posteriors
  • Risk Minimization vs. Maximum Likelihood
  • L2-Regularization
    • Frequentists: Add a constraint on the size of the weight vector
    • Bayesians: Introduce a zero-mean prior on the weight vector
    • Result is the same!
L2-Regularization
  • Frequentists:
    • Introduce a cost on the size of the weights
  • Bayesians:
    • Introduce a prior on the weights
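The equivalence of the two views can be written out for one common case. This sketch assumes squared loss, Gaussian noise with variance σ², and a zero-mean Gaussian prior with variance τ² on the weights; the symbols λ, σ and τ are introduced here for illustration.

```latex
\begin{align*}
\text{Frequentist:}\quad & \hat{w} = \arg\min_{w} \sum_{i=1}^{N} (y_i - w^\top x_i)^2 + \lambda \lVert w \rVert_2^2 \\
\text{Bayesian (MAP):}\quad & \hat{w} = \arg\max_{w} \; \log p(y \mid X, w) + \log p(w), \qquad w \sim \mathcal{N}(0, \tau^2 I) \\
& \phantom{\hat{w}} = \arg\min_{w} \frac{1}{2\sigma^2}\sum_{i=1}^{N} (y_i - w^\top x_i)^2 + \frac{1}{2\tau^2} \lVert w \rVert_2^2
\end{align*}
```

Multiplying the last objective by 2σ² shows that the two coincide when λ = σ²/τ².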
Types of Classifiers
  • Generative Models
    • Highest resource requirements.
    • Need to approximate the joint probability
  • Discriminative Models
    • Moderate resource requirements.
    • Typically fewer parameters to approximate than generative models
  • Discriminant Functions
    • Can be trained probabilistically, but the output does not include confidence information
Linear Regression
  • Fit a line to a set of points
Linear Regression
  • Extension to higher dimensions
    • Polynomial fitting
    • Arbitrary function fitting
      • Wavelets
      • Radial basis functions
      • Classifier output
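The extensions listed above all keep the model linear in its weights and only change the features. A minimal sketch, assuming hypothetical helpers `polynomial_features` and `rbf_features` and a toy quadratic target:

```python
import numpy as np

def polynomial_features(x, degree):
    """Map scalar inputs to [1, x, x^2, ..., x^degree]."""
    return np.vander(x, degree + 1, increasing=True)

def rbf_features(x, centers, width):
    """Gaussian radial basis functions centred at `centers`, plus a bias column."""
    phi = np.exp(-((x[:, None] - centers[None, :]) ** 2) / (2 * width ** 2))
    return np.column_stack([np.ones(len(x)), phi])

x = np.linspace(-1.0, 1.0, 20)
y = 1.0 - 2.0 * x + 3.0 * x ** 2                # toy quadratic target
Phi = polynomial_features(x, degree=2)
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)      # still linear in the weights w
```

Swapping `polynomial_features` for `rbf_features` changes the function class but not the fitting procedure.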
Logistic Regression
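  • Motivation: fit Gaussians to the data for each class; logistic regression then models the class posterior directly with a sigmoid
  • The decision boundary is where the class PDFs cross (posterior of 0.5, assuming equal priors)
  • No “closed form” solution for the parameters when the gradient is set to zero.
  • Use Gradient Descent instead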
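Because the parameters have no closed form, they are found iteratively. The sketch below assumes batch gradient descent on the mean negative log likelihood, a fixed learning rate, and synthetic one-dimensional data; the function names and hyperparameters are illustrative choices, not from the slides.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_regression(X, y, lr=0.1, n_iters=2000):
    """Batch gradient descent on the mean negative log likelihood."""
    X = np.column_stack([np.ones(len(X)), X])    # bias column
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        p = sigmoid(X @ w)                       # predicted P(y = 1 | x)
        grad = X.T @ (p - y) / len(y)            # gradient of the mean NLL
        w -= lr * grad
    return w

def predict(w, X):
    X = np.column_stack([np.ones(len(X)), X])
    return (sigmoid(X @ w) >= 0.5).astype(int)

rng = np.random.default_rng(0)                   # synthetic data, illustration only
X0 = rng.normal(-2.0, 1.0, size=(50, 1))         # class 0
X1 = rng.normal(+2.0, 1.0, size=(50, 1))         # class 1
X = np.vstack([X0, X1])
y = np.concatenate([np.zeros(50), np.ones(50)])
w = fit_logistic_regression(X, y)
accuracy = (predict(w, X) == y).mean()
```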
Graphical Models
  • General way to describe the dependence relationships between variables.
  • Junction Tree Algorithm allows us to efficiently calculate marginals over any variable.
Junction Tree Algorithm
  • Moralization
    • “Marry the parents”
    • Make undirected
  • Triangulation
    • Add edges (chords) so that no cycle of length 4 or more is left without a chord
  • Junction Tree Construction
    • Identify separators such that the running intersection property holds
  • Introduction of Evidence
    • Pass slices around the junction tree to generate marginals
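Only the first step above, moralization, is sketched here; triangulation, tree construction and message passing are not covered. The graph representation (a dict from node to parent list) and the function name `moralize` are assumptions made for this illustration.

```python
from itertools import combinations

def moralize(parents):
    """Return the undirected edge set of the moral graph.

    `parents` maps each node to a list of its parents in the directed graph.
    """
    edges = set()
    for child, pars in parents.items():
        for p in pars:
            edges.add(frozenset((p, child)))     # drop edge directions
        for a, b in combinations(pars, 2):
            edges.add(frozenset((a, b)))         # "marry the parents"
    return edges

# Hypothetical v-structure: A -> C <- B
dag = {"A": [], "B": [], "C": ["A", "B"]}
moral_edges = moralize(dag)
```

In this example the parents A and B share the child C, so moralization adds the edge A-B alongside the undirected versions of A-C and B-C.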
Hidden Markov Models
  • Sequential Modeling
    • Generative Model
  • Relationship between observations and state (class) sequences
Perceptron
  • Step function used for squashing.
  • Classifier as Neuron metaphor.
Perceptron Loss
  • Classification Error vs. Sigmoid Error
    • Loss is only calculated on Mistakes

  • Perceptrons use strictly classification error
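The idea that loss is only incurred on mistakes translates directly into the classic perceptron update. A minimal sketch, assuming labels in {-1, +1} and a toy linearly separable data set (logical AND); the function names are illustrative.

```python
import numpy as np

def train_perceptron(X, y, n_epochs=100):
    """Perceptron learning with labels in {-1, +1}."""
    X = np.column_stack([np.ones(len(X)), X])    # bias column
    w = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * (w @ xi) <= 0:               # misclassified (or on the boundary)
                w += yi * xi                     # update only on mistakes
                mistakes += 1
        if mistakes == 0:                        # every point classified correctly
            break
    return w

def predict(w, X):
    X = np.column_stack([np.ones(len(X)), X])
    return np.where(X @ w > 0, 1, -1)            # step function as the "squashing"

# Toy linearly separable data (logical AND), for illustration only.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([-1, -1, -1, 1])
w = train_perceptron(X, y)
```

Correctly classified points contribute nothing to the update, which is the contrast with the sigmoid error used by logistic regression.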


Neural Networks
  • Interconnected Layers of Perceptrons or Logistic Regression “neurons”
Neural Networks
  • There are many possible configurations of neural networks
    • Vary the number of layers
    • Size of layers
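To make "layers of logistic regression neurons" concrete, here is a forward pass only (no training), assuming sigmoid units throughout and small random weights; `init_network` and `forward` are hypothetical names used for this sketch.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_network(layer_sizes, seed=0):
    """Random weights and zero biases for each consecutive pair of layers."""
    rng = np.random.default_rng(seed)
    return [(rng.normal(0.0, 0.1, size=(n_in, n_out)), np.zeros(n_out))
            for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]

def forward(params, x):
    """Each layer is a bank of logistic-regression "neurons"."""
    a = x
    for W, b in params:
        a = sigmoid(a @ W + b)
    return a

# One possible configuration: 3 inputs, hidden layers of 5 and 4 units, 1 output.
params = init_network([3, 5, 4, 1])
out = forward(params, np.ones((2, 3)))           # batch of 2 inputs
```

Changing the list passed to `init_network` varies the number and size of layers without touching `forward`.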
Support Vector Machines
  • Maximum Margin Classification

[Figure: small-margin vs. large-margin separating boundaries]

Support Vector Machines
  • Optimization Function
  • Decision Function
  • Now would be a good time to ask questions about Supervised Techniques.
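The equations behind the "Optimization Function" and "Decision Function" bullets on the Support Vector Machines slide did not survive in this transcript. As a hedged reminder, the standard hard-margin formulation is sketched below; the notation (targets t_i in {-1, +1}, multipliers α_i, kernel k) is an assumption and may differ from the original slide.

```latex
\begin{align*}
\text{Optimization:}\quad & \min_{w,\,b}\; \tfrac{1}{2}\lVert w \rVert^2
\quad \text{subject to} \quad t_i\,(w^\top x_i + b) \ge 1, \quad i = 1,\dots,N \\
\text{Decision (primal):}\quad & \hat{t}(x) = \operatorname{sign}\!\bigl(w^\top x + b\bigr) \\
\text{Decision (dual, kernelized):}\quad & \hat{t}(x) = \operatorname{sign}\!\Bigl(\sum_{i=1}^{N} \alpha_i t_i\, k(x_i, x) + b\Bigr)
\end{align*}
```

With the linear kernel k(x_i, x) = x_i^T x the two decision functions agree, since w is the sum of α_i t_i x_i; only the support vectors have nonzero α_i.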
Clustering
  • Identify discrete groups of similar data points
  • Data points are unlabeled
Recall K-Means
  • Algorithm
    • Select K – the desired number of clusters
    • Initialize K cluster centroids
    • For each point in the data set, assign it to the cluster with the closest centroid
    • Update the centroid based on the points assigned to each cluster
    • If any data point has changed clusters, repeat
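The algorithm above can be sketched in a few lines. In this illustration K is implied by the number of initial centroids passed in, Euclidean distance is assumed, and the initial centroids are picked by hand from the toy data so that the example is deterministic.

```python
import numpy as np

def kmeans(X, init_centroids, n_iters=100):
    """Plain K-means; K is the number of rows in `init_centroids`."""
    centroids = np.array(init_centroids, dtype=float)
    k = len(centroids)
    assign = np.full(len(X), -1)
    for _ in range(n_iters):
        # Assign each point to the cluster with the closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_assign = dists.argmin(axis=1)
        if np.array_equal(new_assign, assign):   # no point changed cluster: stop
            break
        assign = new_assign
        # Update each centroid to the mean of its assigned points.
        for j in range(k):
            if np.any(assign == j):              # leave empty clusters untouched
                centroids[j] = X[assign == j].mean(axis=0)
    return centroids, assign

rng = np.random.default_rng(1)                   # two synthetic blobs
X = np.vstack([rng.normal(0.0, 0.3, size=(30, 2)),
               rng.normal(5.0, 0.3, size=(30, 2))])
centroids, assign = kmeans(X, init_centroids=X[[0, 30]])
```

In practice the initial centroids are usually chosen at random, and the result can depend on that choice.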
Soft K-means
  • In k-means, we force every data point to exist in exactly one cluster.
  • This constraint can be relaxed.

Minimizes the entropy of cluster


Soft k-means
  • We still define a cluster by a centroid, but we calculate the centroid as the weighted mean of all the data points
  • Convergence is based on a stopping threshold rather than changed assignments
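A minimal sketch of the two points above, assuming responsibilities given by a softmax over negative squared Euclidean distances with a stiffness parameter `beta`. Both the exact form of the soft assignment and the parameter name are assumptions for illustration; the slide's own formula is not in this transcript.

```python
import numpy as np

def soft_kmeans(X, init_centroids, beta=2.0, tol=1e-6, max_iters=500):
    """Soft K-means; returns (centroids, responsibilities)."""
    centroids = np.array(init_centroids, dtype=float)
    for _ in range(max_iters):
        # Soft assignment: softmax over negative squared distances.
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        logits = -beta * d2
        logits -= logits.max(axis=1, keepdims=True)       # numerical stability
        resp = np.exp(logits)
        resp /= resp.sum(axis=1, keepdims=True)
        # Each centroid is the responsibility-weighted mean of ALL points.
        new_centroids = (resp.T @ X) / resp.sum(axis=0)[:, None]
        shift = np.abs(new_centroids - centroids).max()
        centroids = new_centroids
        if shift < tol:                                    # stopping threshold
            break
    return centroids, resp

rng = np.random.default_rng(1)                             # two synthetic blobs
X = np.vstack([rng.normal(0.0, 0.3, size=(30, 2)),
               rng.normal(5.0, 0.3, size=(30, 2))])
centroids, resp = soft_kmeans(X, init_centroids=X[[0, 30]])
labels = resp.argmax(axis=1)
```

As `beta` grows, the responsibilities approach hard 0/1 assignments and the procedure behaves like ordinary K-means.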
Gaussian Mixture Models
  • Rather than identifying clusters by “nearest” centroids
  • Fit a Set of k Gaussians to the data.
Gaussian Mixture Models
  • Formally, a Mixture Model is a weighted sum of a number of PDFs, where the weights are determined by a distribution.
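Written out for the Gaussian case, with mixing weights π_k, means μ_k and covariances Σ_k (symbols chosen here, since the slide's equation is not in the transcript):

```latex
p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k),
\qquad \pi_k \ge 0, \quad \sum_{k=1}^{K} \pi_k = 1
```

The constraints on π_k are what make the weights a distribution over the K components.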
Graphical Models with unobserved variables
  • What if you have variables in a Graphical model that are never observed?
    • Latent Variables
  • Training latent variable models is an unsupervised learning application





Latent Variable HMMs
  • We can cluster sequences using an HMM with unobserved state variables
  • We will train the latent variable models using Expectation Maximization
Expectation Maximization
  • Training both GMMs and Graphical Models with latent variables is accomplished using Expectation Maximization
    • Step 1: Expectation (E-step)
      • Evaluate the “responsibilities” of each cluster with the current parameters
    • Step 2: Maximization (M-step)
      • Re-estimate parameters using the existing “responsibilities”
  • Related to k-means
  • One more time for questions on supervised learning…
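The E-step and M-step described above can be sketched for a one-dimensional Gaussian mixture. This is an illustrative sketch under stated assumptions: a fixed number of iterations instead of a convergence test, means initialized at evenly spaced quantiles of the data, and a small variance floor of 1e-6. None of these choices come from the slides.

```python
import numpy as np

def normal_pdf(x, mu, var):
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def em_gmm_1d(x, k, n_iters=100):
    """EM for a 1-D Gaussian mixture; returns (weights, means, variances)."""
    pi = np.full(k, 1.0 / k)
    mu = np.quantile(x, (np.arange(k) + 0.5) / k)           # spread-out initial means
    var = np.full(k, x.var())
    for _ in range(n_iters):
        # E-step: responsibility of each component for each point.
        dens = pi * normal_pdf(x[:, None], mu, var)         # shape (N, k)
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate parameters using those responsibilities.
        nk = resp.sum(axis=0)
        pi = nk / len(x)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk + 1e-6
    return pi, mu, var

rng = np.random.default_rng(1)                              # synthetic two-component data
x = np.concatenate([rng.normal(-3.0, 1.0, 200), rng.normal(3.0, 1.0, 200)])
pi, mu, var = em_gmm_1d(x, k=2)
```

The relation to k-means shows up in the E-step: replacing the soft responsibilities with a hard argmax, and fixing equal weights and variances, reduces the loop to the K-means assign/update cycle.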
Next Time
  • Gaussian Mixture Models (GMMs)
  • Expectation Maximization