Part 4: ADVANCED SVM-based LEARNING METHODS


Vladimir Cherkassky, University of Minnesota. Presented at Tech Tune Ups, ECE Dept, June 1, 2011. Electrical and Computer Engineering.



Vladimir Cherkassky

University of Minnesota

Presented at Tech Tune Ups, ECE Dept, June 1, 2011

Electrical and Computer Engineering






Motivation for non-standard approaches: high-dimensional data

Alternative Learning Settings

- Transduction and SSL

- Inference Through Contradictions

- Learning using privileged information (or SVM+)

- Multi-task Learning


Insights provided by SVM (VC-theory)

  • Why can linear classifiers generalize?

    (1) Margin is large (relative to R, the radius of the smallest sphere containing the data)

    (2) % of support vectors (SVs) is small

    (3) ratio d/n (dimensionality to sample size) is small

  • SVM offers an effective way to control complexity (via margin + kernel selection), i.e., implementing (1) or (2) or both

  • What happens when d>>n ?

    - standard inductive methods usually fail
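Conditions (1) to (3) can be tied to two classical VC-theory results, recalled here as a reminder rather than as part of the original slide. The notation is assumed: h is the VC-dimension of hyperplanes with margin Δ, R is the radius of the smallest sphere containing the data, d is the input dimensionality, n is the number of training samples, and #SV is the number of support vectors. The second bound comes from the leave-one-out argument for the separable case, and its left side refers to a classifier trained on n − 1 samples.

```latex
h \;\le\; \min\!\left( \left\lceil \frac{R^2}{\Delta^2} \right\rceil,\; d \right) + 1
\qquad\text{and}\qquad
\mathbb{E}\big[\text{test error}\big] \;\le\; \frac{\mathbb{E}\big[\#\mathrm{SV}\big]}{n}
```

When d >> n, the first bound stays useful only if the margin term R²/Δ² is small, which is why margin-based complexity control matters for high-dimensional data.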

How to improve generalization for HDLSS?

Conventional approach:

Incorporate a priori knowledge into learning method

  • Preprocessing and feature selection

  • Model parameterization (~ good kernels in SVM)

Assumption: a priori knowledge is about a good model

Non-standard learning formulations:

Incorporate a priori knowledge into a new non-standard learning formulation (learning setting)

Assumption: a priori knowledge is about properties of application data and/or the goal of learning

  • Which type of assumptions makes more sense?



Motivation for non-standard approaches

Alternative Learning Settings

- Transduction and SSL

- Inference Through Contradictions

- Learning with Structured Data

- Multi-task Learning


Examples of non-standard settings

  • Application domain: hand-written digit recognition

  • Standard inductive setting

  • Transduction: labeled training + unlabeled data

  • Learning through contradictions:

    labeled training data ~ examples of digits 5 and 8

    unlabeled examples (Universum) ~ all other (eight) digits

  • Learning using hidden information:

    Training data ~ t groups (i.e., from t different persons)

    Test data ~ group label not known

  • Multi-task learning:

    Training data ~ t groups (from different persons)

    Test data ~ t groups (group label is known)

Modifications of Inductive Setting

  • Standard Inductive learning assumes

    Finite training set

    Predictive model derived using only training data

    Prediction for all possible test inputs

  • Possible modifications

    1. Predict only for given test points → transduction

    2. A priori knowledge in the form of additional ‘typical’ samples → learning through contradiction

    3. Additional (group) info about training data → Learning using privileged information (LUPI), aka SVM+

    4. Additional (group) info about training + test data → Multi-task learning

Transduction (Vapnik, 1982, 1995)

  • How to incorporate unlabeled test data into the learning process? Assume binary classification

  • Estimating function at given points

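    Given: labeled training data $(\mathbf{x}_i, y_i),\ i = 1, \dots, n$

    and unlabeled test points $\mathbf{x}^*_j,\ j = 1, \dots, m$

    Estimate: class labels $y^*_j$ at these test points

    Goal of learning: minimization of risk on the test set:

    $R = \frac{1}{m} \sum_{j=1}^{m} L\big(y^*_j, f(\mathbf{x}^*_j)\big)$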


Transduction based on margin size

Single unlabeled test point X

Many test points X (aka working samples)

Transduction based on margin size

  • Binary classification, linear parameterization, joint set of (training + working) samples

  • Two objectives of transductive learning:

    (TL1) separate labeled training data using a large-margin hyperplane (as in standard inductive SVM)

    (TL2) separate (explain) the working data set using a large-margin hyperplane.

Transduction based on margin size

  • Standard SVM hinge loss for labeled samples

  • Loss function for unlabeled samples:

    → Mathematical optimization formulation
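The two loss functions named on this slide can be sketched in a few lines. This is a minimal sketch under the usual assumptions: the hinge loss max(0, 1 − y·f(x)) for labeled samples, and its symmetric version max(0, 1 − |f(x)|) for unlabeled samples, which penalizes any working sample that falls inside the margin regardless of side. The function names are illustrative, not taken from the presentation.

```python
def hinge_loss(y, fx):
    """Standard SVM hinge loss for a labeled sample, y in {-1, +1}."""
    return max(0.0, 1.0 - y * fx)


def unlabeled_loss(fx):
    """Symmetric hinge loss for an unlabeled (working) sample.

    It is zero when the sample lies outside the margin on either side
    and grows as the sample approaches the decision boundary f(x) = 0.
    """
    return max(0.0, 1.0 - abs(fx))


# Evaluate both losses on a small grid of decision-function values.
grid = [-2.0, -0.5, 0.0, 0.5, 2.0]
labeled_losses = [hinge_loss(+1, fx) for fx in grid]
unlabeled_losses = [unlabeled_loss(fx) for fx in grid]
```

The unlabeled loss peaks at f(x) = 0 and is therefore not convex in f(x), which is the source of the non-convexity of transductive SVM optimization.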

Optimization formulation for SVM transduction

  • Given: joint set of (training + working) samples

  • Denote slack variables $\xi_i$ for training samples and $\xi^*_j$ for working samples

  • Minimize

    subject to


    → Solution (~ decision boundary)

  • Unbalanced situation (small training / large test)

    → all unlabeled samples assigned to one class

  • Additional constraint:
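The optimization problem outlined on this slide can be sketched in the commonly used soft-margin transductive SVM form. All notation here is assumed: n labeled samples (x_i, y_i), m working samples x*_j with unknown labels y*_j, slack variables ξ_i and ξ*_j, and hyperparameters C and C*. The last line is one frequently used form of the balancing constraint, which keeps the working samples from being assigned to a single class; the exact constraint used in the presentation may differ.

```latex
\begin{aligned}
\min_{\mathbf{w},\, b,\, y^*_1, \dots, y^*_m}\quad
  & \tfrac{1}{2}\,\|\mathbf{w}\|^2 + C \sum_{i=1}^{n} \xi_i + C^* \sum_{j=1}^{m} \xi^*_j \\
\text{subject to}\quad
  & y_i\,(\mathbf{w} \cdot \mathbf{x}_i + b) \ge 1 - \xi_i, \qquad \xi_i \ge 0, \quad i = 1, \dots, n \\
  & y^*_j\,(\mathbf{w} \cdot \mathbf{x}^*_j + b) \ge 1 - \xi^*_j, \qquad \xi^*_j \ge 0, \quad y^*_j \in \{-1, +1\}, \quad j = 1, \dots, m \\
\text{balancing constraint (assumed form)}\quad
  & \frac{1}{n} \sum_{i=1}^{n} y_i = \frac{1}{m} \sum_{j=1}^{m} (\mathbf{w} \cdot \mathbf{x}^*_j + b)
\end{aligned}
```

The resulting decision boundary is D(x) = w·x + b evaluated at the optimal (w, b).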

Optimization formulation (cont’d)

  • Hyperparameters control the trade-off between explanation and margin size

  • Soft-margin inductive SVM is a special case of soft-margin transduction with zero slacks

  • Dual + kernel version of SVM transduction

  • Transductive SVM optimization is not convex

    (~ non-convexity of the loss for unlabeled data)

    → different optimization heuristics ~ different solutions

  • Exact solution (via exhaustive search) is possible for a small number of test samples (m), but this solution is NOT very useful (~ inductive SVM).
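To make the exhaustive-search idea concrete, here is a toy sketch, not the soft-margin kernel formulation discussed above. It assumes one-dimensional inputs and a hard-margin threshold classifier: every one of the 2^m labelings of the working samples is tried, and the labeling that gives the largest margin on the joint (training + working) set is kept. The function name and the data are hypothetical.

```python
from itertools import product


def best_transductive_labeling(train, working):
    """Exhaustive hard-margin transduction for 1-D inputs.

    train:   list of (x, y) pairs with y in {-1, +1}
    working: list of unlabeled x values
    Returns (margin, labels) for the labeling of the working samples that
    maximizes the margin on the joint set, or None if no labeling is separable.
    """
    best = None
    for labels in product((-1, 1), repeat=len(working)):  # 2^m candidates
        points = train + list(zip(working, labels))
        pos = [x for x, y in points if y == 1]
        neg = [x for x, y in points if y == -1]
        if not pos or not neg:
            continue
        # Gap between the classes for either orientation of the threshold.
        gap = max(min(pos) - max(neg), min(neg) - max(pos))
        margin = gap / 2.0
        if margin <= 0:
            continue  # this labeling is not separable by a threshold
        if best is None or margin > best[0]:
            best = (margin, labels)
    return best


train = [(0.0, -1), (10.0, 1)]
working = [1.0, 2.0, 8.0, 9.0]
result = best_transductive_labeling(train, working)
```

The loop runs 2^m times, which is why exact search is feasible only for a handful of working samples.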

Many applications for transduction

  • Text categorization: classify word documents into a number of predetermined categories

  • Email classification: Spam vs non-spam

  • Web page classification

  • Image database classification

  • All these applications:

    - high-dimensional data

    - small labeled training set (human-labeled)

    - large unlabeled test set

Example application

  • Prediction of molecular bioactivity for drug discovery

  • Training data ~ 1,909 samples; test data ~ 634 samples

  • Input space ~ 139,351-dimensional

  • Prediction accuracy:

    SVM induction ~ 74.5%; transduction ~ 82.3%

    Ref: J. Weston et al., KDD Cup 2001 data analysis: prediction of molecular bioactivity for drug design – binding to thrombin, Bioinformatics, 2003

Semi-Supervised Learning (SSL)

  • Labeled data + unlabeled data → Model

  • Similar to transduction (but not the same):

    - Goal 1 ~ prediction for unlabeled samples

    - Goal 2 ~ estimate an inductive model

  • Many algorithms

  • Applications similar to transduction

  • Typically

    - Transduction works better for HDLSS

    - SSL works better for low-dimensional data

Example: Self-Learning Algorithm

Given initial labeled set L and unlabeled set U


(1) estimate a classifier using labeled set L

(2) classify randomly chosen unlabeled sample using decision rule estimated in Step (1)

(3) move this new labeled sample to set L

Iterate steps (1) – (3) until all unlabeled samples are classified.
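Steps (1) to (3) can be sketched as follows. The slide does not say which classifier is estimated in step (1); this sketch assumes a simple nearest-centroid rule so that the example stays self-contained, and the function names and the `seed` parameter are illustrative only.

```python
import random


def centroid_classifier(labeled):
    """Fit a nearest-centroid rule on labeled = [(x, y), ...], x a tuple of floats."""
    sums, counts = {}, {}
    for x, y in labeled:
        acc = sums.setdefault(y, [0.0] * len(x))
        for k, value in enumerate(x):
            acc[k] += value
        counts[y] = counts.get(y, 0) + 1
    centroids = {y: [s / counts[y] for s in acc] for y, acc in sums.items()}

    def predict(x):
        # Return the label of the closest class centroid (squared distance).
        return min(
            centroids,
            key=lambda y: sum((a - c) ** 2 for a, c in zip(x, centroids[y])),
        )

    return predict


def self_learning(labeled, unlabeled, seed=0):
    """Self-learning: move unlabeled samples into the labeled set one at a time."""
    rng = random.Random(seed)
    labeled = list(labeled)
    pool = list(unlabeled)
    while pool:
        predict = centroid_classifier(labeled)   # step (1): estimate classifier on L
        x = pool.pop(rng.randrange(len(pool)))   # step (2): pick a random unlabeled sample
        labeled.append((x, predict(x)))          # step (3): add it to L with its predicted label
    return labeled


L = [((0.0, 0.0), -1), ((4.0, 4.0), 1)]
U = [(0.5, 0.2), (3.5, 4.2), (1.0, 0.5), (3.0, 3.0)]
final_labeled = self_learning(L, U)
```

Because each newly labeled sample changes the classifier used in the next iteration, early mistakes can propagate; the random order of step (2) is kept here only because the slide specifies it.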

Example of Self-Learning Algorithm

Noisy Hyperbolas data set: unlabeled samples in green

[Figure panels: initial condition; Iteration 50; Iteration 100 (final)]

Inference through contradiction (Vapnik 2006)

  • Motivation: what is a priori knowledge?

    - info about the space of admissible models

    - info about admissible data samples

  • Labeled training samples + unlabeled samples from the Universum

  • Universum samples encode info about the region of input space (where application data lives):

    - Usually from a different distribution than training/test data

  • Examples of the Universum data

  • Large improvement for small training samples

Main Idea

  • Handwritten digit recognition: digit 5 vs 8

Fig. courtesy of J. Weston (NEC Labs)

Learning with the Universum

  • Inductive setting for binary classification

    Given: labeled training data

    and unlabeled Universum samples

    Goal of learning: minimization of prediction risk (as in standard inductive setting)

  • Balance between two goals:

    - explain labeled training data using large-margin hyperplane

    - achieve maximum falsifiability ~ max # contradictions on the Universum

    → Math optimization formulation (extension of SVM)
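The extension of the soft-margin SVM to Universum samples is usually written as below. This is a sketch with assumed notation: labeled samples (x_i, y_i), i = 1..n; Universum samples x*_j, j = 1..m; slack variables ξ_i and ξ*_j; hyperparameters C, C* and ε. The labeled samples keep the standard margin constraints, while the Universum samples are pushed toward the decision boundary, so that they fall inside an ε-band around it and act as contradictions.

```latex
\begin{aligned}
\min_{\mathbf{w},\, b}\quad
  & \tfrac{1}{2}\,\|\mathbf{w}\|^2 + C \sum_{i=1}^{n} \xi_i + C^* \sum_{j=1}^{m} \xi^*_j \\
\text{subject to}\quad
  & y_i\,(\mathbf{w} \cdot \mathbf{x}_i + b) \ge 1 - \xi_i, \qquad \xi_i \ge 0, \quad i = 1, \dots, n \\
  & \big|\mathbf{w} \cdot \mathbf{x}^*_j + b\big| \le \varepsilon + \xi^*_j, \qquad \xi^*_j \ge 0, \quad j = 1, \dots, m
\end{aligned}
```

Unlike transduction, this problem remains convex, because the Universum constraint is a pair of linear inequalities in (w, b).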

ε-insensitive loss for Universum samples
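A minimal sketch of this loss, assuming the usual definition max(0, |f(x)| − ε): a Universum sample costs nothing while its decision-function value stays within ε of the boundary, and is penalized linearly once it moves toward either class. The helper that counts samples inside the band is a hypothetical illustration of "number of contradictions".

```python
def universum_loss(fx, eps):
    """Epsilon-insensitive loss for a Universum sample with decision value fx."""
    return max(0.0, abs(fx) - eps)


def count_contradictions(decision_values, eps):
    """Number of Universum samples that fall inside the eps-band around f(x) = 0."""
    return sum(1 for fx in decision_values if abs(fx) <= eps)


values = [-0.75, -0.125, 0.0, 0.25, 1.5]
losses = [universum_loss(fx, eps=0.25) for fx in values]
inside = count_contradictions(values, eps=0.25)
```

Minimizing the summed loss over the Universum set therefore encourages a decision boundary that passes through the region where the Universum samples lie.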

Random averaging Universum

[Figure: Class 1 and Class -1 samples, with Universum samples generated by random averaging]

Random Averaging for digits 5 and 8

  • Two randomly selected examples

  • Universum sample:
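Random averaging can be sketched as follows, assuming each image is stored as a flat list of pixel intensities of equal length. One example is drawn at random from each class and the two are averaged pixel by pixel to form a Universum sample. The function name, the `seed` parameter and the tiny two-pixel "images" are illustrative only.

```python
import random


def random_averaging_universum(class_a, class_b, n_samples, seed=0):
    """Create Universum samples by averaging random pairs from two classes."""
    rng = random.Random(seed)
    universum = []
    for _ in range(n_samples):
        a = rng.choice(class_a)  # e.g. a randomly selected image of digit 5
        b = rng.choice(class_b)  # e.g. a randomly selected image of digit 8
        universum.append([(p + q) / 2.0 for p, q in zip(a, b)])
    return universum


fives = [[0.0, 1.0], [0.0, 0.5]]
eights = [[1.0, 1.0], [1.0, 0.5]]
samples = random_averaging_universum(fives, eights, n_samples=4)
```

Each generated sample lies halfway between a member of one class and a member of the other, so it is expected to fall near the decision boundary.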

Application Study (Vapnik, 2006)

  • Binary classification of handwritten digits 5 and 8

  • For this binary classification problem, the following Universum sets were used:

    U1: randomly selected digits (0,1,2,3,4,6,7,9)

    U2: randomly mixing pixels from images 5 and 8

    U3: average of randomly selected examples of 5 and 8

    Training set size tried: 250, 500, … 3,000 samples

    Universum set size: 5,000 samples

  • Prediction error: improved over standard SVM, e.g., for 500 training samples: 1.4% vs 2% (SVM)

Cultural Interpretation of Universum: jokes, absurd examples

[Figure examples: "neither Hillary nor Obama"; dadaism]

Application Study: predicting gender of human faces

  • Binary classification setting

  • Difficult problem:

    dimensionality ~ large (10K - 20K)

    labeled sample size ~ small (~ 10 - 20)

  • Humans perform very well for this task

  • Issues:

    - possible improvement (vs standard SVM)

    - how to choose ‘good’ Universum?

    - model parameter tuning

Male Faces: examples

Universum Faces: neither male nor female

Empirical Study (cont’d)

  • Universum generation:

    U1 Average: of male and female samples randomly selected from the training set (U. of Essex database)

    U2 Empirical Distribution: estimate pixel-wise distribution of the training data. Generate a new picture from this distribution

    U3 Animal faces:
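The U2 (Empirical Distribution) generator can be sketched as below. The slide does not say how the pixel-wise distribution is estimated or sampled, so this sketch makes a simple assumption: each pixel of a new picture is drawn independently from the values observed at that pixel position in the training images. Images are assumed to be flat lists of equal length, and the function name is hypothetical.

```python
import random


def empirical_distribution_universum(images, n_samples, seed=0):
    """Generate pictures by sampling every pixel from its empirical distribution.

    For each pixel position k, a training image is chosen at random and its
    k-th pixel value is copied, independently of all other positions.
    """
    rng = random.Random(seed)
    n_pixels = len(images[0])
    return [
        [rng.choice(images)[k] for k in range(n_pixels)]
        for _ in range(n_samples)
    ]


training_images = [[0, 10], [1, 11], [2, 12]]
universum = empirical_distribution_universum(training_images, n_samples=5)
```

Sampling pixels independently keeps the per-pixel statistics of the training data but destroys the spatial structure of a face, which is what makes the result a candidate Universum sample.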

Universum generation: examples

U1 Averaging:

U2 Empirical Distribution:


Results of gender classification

  • Classification accuracy: improves vs standard SVM by ~ 2% with U1 Universum,

    and by ~ 1% with U2 Universum.

  • Universum by averaging gives better results for this problem, when number of Universum samples N = 500 or 1,000

Results of gender classification

Universum ~ Animal Faces:

Degrades classification accuracy by 2-5% (vs standard SVM)

Animal faces are not relevant to this problem


Learning with Structured Data (Vapnik, 2006)

• Application: Handwritten digit recognition

Labeled training data provided by t persons (t > 1)

Goal 1: find a classifier that will generalize well for future samples generated by these persons ~ Learning with Structured Data (LWSD) or Learning using Hidden Information

Goal 2: find t classifiers with good generalization (one for each person) ~ Multi-Task Learning (MTL)

• Application: Medical diagnosis

Labeled training data provided by t groups of patients (t > 1), say men and women (t = 2)

Goal 1: estimate a classifier to predict/diagnose a disease using training data from t groups of patients ~ LWSD

Goal 2: find t classifiers specialized for each group of patients ~ MTL

Different Ways of Using Group Information

















SVM+ technology (Vapnik, 2006)

  • Map the input vectors simultaneously into:

    - Decision space (standard SVM classifier)

    - Correcting space (where correcting functions model slack variables for different groups)

  • Decision space/function ~ the same for all groups

  • Correcting functions ~ different for each group (but correcting space may be the same)

  • SVM+ optimization formulation incorporates:

    - the capacity of decision function

    - capacity of correcting functions for group r

    - relative importance (weight) of these two capacities

SVM+ approach (Vapnik, 2006)

[Figure: input vectors are mapped into a decision space (decision function separating Class 1 and Class -1) and into correcting spaces (correcting functions modeling the slack variable for group r)]

SVM+ Formulation

Decision Space

Correcting Space

subject to:
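The SVM+ problem is commonly written in the form below, given here as a sketch with assumed notation: Φ_z maps inputs into the decision space, Φ_{z_r} maps them into the correcting space of group r, T_r is the index set of training samples in group r (r = 1..t), and γ weights the capacity of the correcting functions against the capacity of the decision function.

```latex
\begin{aligned}
\min_{\mathbf{w},\, b,\, \mathbf{w}_1, \dots, \mathbf{w}_t,\, d_1, \dots, d_t}\quad
  & \underbrace{\tfrac{1}{2}\,(\mathbf{w}, \mathbf{w})}_{\text{decision space}}
    \;+\; \underbrace{\tfrac{\gamma}{2} \sum_{r=1}^{t} (\mathbf{w}_r, \mathbf{w}_r)}_{\text{correcting space}}
    \;+\; C \sum_{r=1}^{t} \sum_{i \in T_r} \xi_i^r \\
\text{subject to}\quad
  & y_i \big( (\mathbf{w}, \Phi_z(\mathbf{x}_i)) + b \big) \ge 1 - \xi_i^r, \qquad i \in T_r,\; r = 1, \dots, t \\
  & \xi_i^r = (\mathbf{w}_r, \Phi_{z_r}(\mathbf{x}_i)) + d_r \ge 0, \qquad i \in T_r,\; r = 1, \dots, t
\end{aligned}
```

The slack variables are no longer free: within each group they must follow a correcting function, which is how the group information constrains the solution.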

SVM+ for Multi-task Learning (Liang 2008)

New learning formulation: SVM+MTL

Define decision function for each group as

Common decision function models the relatedness among groups

Correcting functions fine-tune the model for each group (task)
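The structure of the per-group decision function, a common part plus a group-specific correction, can be sketched as follows. For readability the sketch assumes linear functions in both the decision and correcting spaces, which is a simplification, and all parameter values are made up for illustration.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def svm_plus_mtl_decision(x, group, w, b, w_corr, d_corr):
    """Decision value for sample x that belongs to the given group (task).

    common part:      (w, x) + b                     shared by all groups
    correcting part:  (w_corr[group], x) + d_corr[group]   specific to one group
    """
    common = dot(w, x) + b
    correction = dot(w_corr[group], x) + d_corr[group]
    return common + correction


# Hypothetical parameters for two groups (tasks).
w, b = [1.0, 0.0], 0.0
w_corr = {0: [0.0, 0.5], 1: [0.0, -0.5]}
d_corr = {0: 0.0, 1: 0.25}

x = [1.0, 2.0]
value_group_0 = svm_plus_mtl_decision(x, 0, w, b, w_corr, d_corr)
value_group_1 = svm_plus_mtl_decision(x, 1, w, b, w_corr, d_corr)
```

The same input can receive different decision values in different groups, which is why the group label must be known at test time in the multi-task setting.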



SVM+MTL Formulation

Decision Space

Correcting Space

subject to:
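A sketch of the SVM+MTL problem in its commonly cited form, with assumed notation: Φ_z and Φ_{z_r} are the mappings into the decision and correcting spaces, T_r indexes the training samples of task r (r = 1..t), and the decision function for task r is f_r(x) = (w, Φ_z(x)) + b + (w_r, Φ_{z_r}(x)) + d_r.

```latex
\begin{aligned}
\min_{\mathbf{w},\, b,\, \mathbf{w}_1, \dots, \mathbf{w}_t,\, d_1, \dots, d_t}\quad
  & \underbrace{\tfrac{1}{2}\,(\mathbf{w}, \mathbf{w})}_{\text{decision space}}
    \;+\; \underbrace{\tfrac{\gamma}{2} \sum_{r=1}^{t} (\mathbf{w}_r, \mathbf{w}_r)}_{\text{correcting space}}
    \;+\; C \sum_{r=1}^{t} \sum_{i \in T_r} \xi_i^r \\
\text{subject to}\quad
  & y_i^r \big( (\mathbf{w}, \Phi_z(\mathbf{x}_i^r)) + b + (\mathbf{w}_r, \Phi_{z_r}(\mathbf{x}_i^r)) + d_r \big) \ge 1 - \xi_i^r, \\
  & \xi_i^r \ge 0, \qquad i \in T_r,\; r = 1, \dots, t
\end{aligned}
```

Compared with SVM+, the correcting function moves from the slack variables into the decision function itself, so each task gets its own classifier while the common term (w, b) ties the tasks together.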


Empirical Validation

Different ways of using group info → different learning settings:

- which one yields better generalization?

- how is performance affected by sample size?

Empirical comparisons:

- synthetic data set


Different Ways of Using Group Information

















Comparison for Synthetic Data Set

Generate x where each

The coefficient vectors of three tasks are specified as

For each task and each data vector,

Details of methods used:

- linear SVM classifier (single parameter C)

- SVM+, SVM+MTL classifier (3 parameters: linear kernel for decision space, RBF kernel for correcting space, and parameter γ)

- Independent validation set for model selection


Experimental Results

Comparison results (average over 10 trials):

n ~ number of training samples per task

Average test error (%):

Note: relative performance depends on sample size

Note: SVM+ always better than SVM

SVM+MTL always better than mSVM




Motivation for non-standard approaches

Alternative Learning Settings

Summary: Advantages/limitations of non-standard settings

Advantages + limitations of non-standard settings

Advantages:

- make common sense

- follow methodological framework (VC-theory)

- yield better generalization (but not always)

Limitations:

- need to formalize application requirements → need to understand application domain

- generally more complex learning formulations

- more difficult model selection

- few known empirical comparisons (to date)

SVM+ is a promising new technology for hard problems

References and Resources

  • Vapnik, V. Estimation of Dependencies Based on Empirical Data. Empirical Inference Science: Afterword of 2006, Springer, 2006

  • Cherkassky, V. and F. Mulier, Learning from Data, second edition, Wiley, 2007

  • Chapelle, O., Schölkopf, B., and A. Zien, Eds., Semi-Supervised Learning, MIT Press, 2006

  • Cherkassky, V. and Y. Ma, Introduction to Predictive Learning, Springer, 2011 (to appear)

  • Hastie, T., R. Tibshirani and J. Friedman, The Elements of Statistical Learning. Data Mining, Inference and Prediction, New York: Springer, 2001

  • Schölkopf, B. and A. Smola, Learning with Kernels. MIT Press, 2002.

Public-domain SVM software

  • Main web page link

  • LIBSVM software library

  • SVM-Light software library

  • Non-standard SVM-based methodologies: Universum, SVM+, MTL