Part 4: ADVANCED SVM-based LEARNING METHODS

Vladimir Cherkassky

University of Minnesota, Electrical and Computer Engineering

Presented at Tech Tune Ups, ECE Dept, June 1, 2011



OUTLINE

Motivation for non-standard approaches: high-dimensional data

Alternative Learning Settings

- Transduction and SSL

- Inference Through Contradictions

- Learning using privileged information (or SVM+)

- Multi-task Learning


Insights provided by SVM (VC-theory)
  • Why can linear classifiers generalize?

(1) Margin is large (relative to R, the radius of the smallest sphere enclosing the data)

(2) Percentage of support vectors (SVs) is small

(3) Ratio d/n (input dimensionality to number of training samples) is small

  • SVM offers an effective way to control complexity (via margin + kernel selection), i.e., by implementing (1) or (2) or both
  • What happens when d >> n?

- standard inductive methods usually fail
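
Conditions (1) to (3) can be related to two well-known VC-theoretic results, sketched below. The notation is an assumption added here, not taken from the slides: h is the VC dimension of hyperplanes with margin Δ, R is the radius of the smallest sphere enclosing the data, d is the input dimensionality, n is the training sample size, and #SV is the number of support vectors.

```latex
% VC dimension of margin hyperplanes: small when the margin is large or d is small
h \le \min\left( \left\lceil \frac{R^2}{\Delta^2} \right\rceil,\; d \right) + 1

% Leave-one-out bound: expected error of an SVM trained on n-1 samples
\mathbb{E}\left[ P_{\text{error}}^{\,n-1} \right] \le \frac{\mathbb{E}\left[ \#\mathrm{SV} \right]}{n}
```

When d >> n the second argument of the minimum gives no useful control, so generalization has to come from the margin or from a small fraction of support vectors.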

How to improve generalization for HDLSS (high-dimensional, low sample size) data?

Conventional approach:

Incorporate a priori knowledge into learning method

  • Preprocessing and feature selection
  • Model parameterization (~ good kernels in SVM)

Assumption: a priori knowledge is about a good model

Non-standard learning formulations:

Incorporate a priori knowledge into new non-standard learning formulation (learning setting)

Assumption: a priori knowledge is about properties of application data and/or goal of learning

  • Which type of assumption makes more sense?


OUTLINE

Motivation for non-standard approaches

Alternative Learning Settings

- Transduction and SSL

- Inference Through Contradictions

- Learning with Structured Data

- Multi-task Learning


Examples of non-standard settings
  • Application domain: handwritten digit recognition
  • Standard inductive setting
  • Transduction: labeled training + unlabeled data
  • Learning through contradictions:

labeled training data ~ examples of digits 5 and 8

unlabeled examples (Universum) ~ all other (eight) digits

  • Learning using hidden information:

Training data ~ t groups (i.e., from t different persons)

Test data ~ group label not known

  • Multi-task learning:

Training data ~ t groups (from different persons)

Test data ~ t groups (group label is known)

Modifications of Inductive Setting
  • Standard Inductive learning assumes

Finite training set

Predictive model derived using only training data

Prediction for all possible test inputs

  • Possible modifications

1. Predict only for given test points → transduction

2. A priori knowledge in the form of additional ‘typical’ samples → learning through contradiction

3. Additional (group) info about training data → Learning using privileged information (LUPI), aka SVM+

4. Additional (group) info about training + test data → Multi-task learning

Transduction (Vapnik, 1982, 1995)
  • How to incorporate unlabeled test data into the learning process? Assume binary classification
  • Estimating function at given points

Given: labeled training data and unlabeled test points

Estimate: class labels at these test points

Goal of learning: minimization of risk on the test set
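
The formulas on this slide did not survive in the transcript. A sketch of the transductive setting in commonly used notation follows; the symbols (n labeled samples, m test points, loss L, decision function f) are assumptions introduced here rather than the slide's own.

```latex
\begin{aligned}
\text{Given:}\quad & (\mathbf{x}_i, y_i),\ i = 1,\dots,n
  \quad\text{and}\quad \mathbf{x}^*_j,\ j = 1,\dots,m \\
\text{Estimate:}\quad & \hat{y}^*_j = f(\mathbf{x}^*_j),\ j = 1,\dots,m \\
\text{Risk on the test set:}\quad & R(f) = \frac{1}{m} \sum_{j=1}^{m} L\bigl(y^*_j,\, f(\mathbf{x}^*_j)\bigr)
\end{aligned}
```

Here y*_j denotes the true but unknown label of test point x*_j, so the risk is defined only over the given test points and not over all possible inputs.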


Transduction based on margin size

Single unlabeled test point X

Transduction based on margin size (cont’d)
  • Binary classification, linear parameterization, joint set of (training + working) samples
  • Two objectives of transductive learning:

(TL1) separate labeled training data using a large-margin hyperplane (as in standard inductive SVM)

(TL2) separate (explain) the working data set using a large-margin hyperplane

Transduction based on margin size (cont’d)
  • Standard SVM hinge loss for labeled samples
  • Loss function for unlabeled samples:
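
The slide's formula for the unlabeled loss is missing from the transcript. A minimal sketch follows, assuming the symmetric hinge loss that is commonly used for transductive SVM; the function names are hypothetical.

```python
def hinge_loss(y, fx):
    """Standard SVM hinge loss for a labeled sample, with y in {-1, +1}."""
    return max(0.0, 1.0 - y * fx)


def unlabeled_loss(fx):
    """Symmetric hinge loss for an unlabeled sample (assumed form).

    It equals the hinge loss under whichever label fits better, so it only
    penalizes samples that fall inside the margin, on either side.
    """
    return max(0.0, 1.0 - abs(fx))


# The unlabeled loss is the smaller of the two labeled hinge losses.
fx = 0.25
assert unlabeled_loss(fx) == min(hinge_loss(+1, fx), hinge_loss(-1, fx))
```

Because this loss has a peak at f(x) = 0, it is not convex in f(x).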

→ Mathematical optimization formulation

Optimization formulation for SVM transduction
  • Given: joint set of (training + working) samples
  • Denote separate slack variables for the training samples and for the working samples
  • Minimize

subject to


→ Solution (~ decision boundary)

  • Unbalanced situation (small training set / large test set)

→ all unlabeled samples assigned to one class

  • Additional constraint:
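
The objective, constraints, and balancing constraint are missing from the transcript. One commonly used soft-margin formulation of SVM transduction is sketched below. The symbols are assumptions: ξ_i and ξ*_j are the slacks for training and working samples, C and C* are the corresponding trade-off parameters, and y*_j are the unknown working-set labels.

```latex
\begin{aligned}
\min_{\mathbf{w},\, b,\, \mathbf{y}^*}\quad
  & \tfrac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i=1}^{n} \xi_i + C^* \sum_{j=1}^{m} \xi^*_j \\
\text{subject to}\quad
  & y_i(\mathbf{w}\cdot\mathbf{x}_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1,\dots,n \\
  & y^*_j(\mathbf{w}\cdot\mathbf{x}^*_j + b) \ge 1 - \xi^*_j, \quad \xi^*_j \ge 0, \quad j = 1,\dots,m \\
\text{balancing constraint}\quad
  & \frac{1}{m} \sum_{j=1}^{m} (\mathbf{w}\cdot\mathbf{x}^*_j + b) = \frac{1}{n} \sum_{i=1}^{n} y_i
\end{aligned}
```

The balancing constraint ties the average output on the working set to the class balance of the training set, which guards against assigning all unlabeled samples to one class.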

Optimization formulation (cont’d)
  • Hyperparameters control the trade-off between explanation and margin size
  • Soft-margin inductive SVM is a special case of soft-margin transduction with zero slacks
  • Dual + kernel version of SVM transduction
  • Transductive SVM optimization is not convex (~ non-convexity of the loss for unlabeled data)

→ different optimization heuristics ~ different solutions

  • Exact solution (via exhaustive search) possible for small number of test samples (m) – but this solution is NOT very useful (~ inductive SVM).
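
To make the exhaustive search concrete, here is a toy sketch under strong simplifying assumptions: one-dimensional inputs, a hard margin, and a threshold classifier. It is an illustration only, not the soft-margin kernel formulation, and all function names are hypothetical.

```python
from itertools import product


def margin_1d(xs, ys):
    """Hard margin of the best 1-D threshold separating the two classes.

    Returns None when the classes cannot be separated by a threshold.
    """
    neg = [x for x, y in zip(xs, ys) if y == -1]
    pos = [x for x, y in zip(xs, ys) if y == +1]
    if not neg or not pos:
        return None  # degenerate labeling: one class is empty
    # Allow either orientation (negatives on the left or on the right).
    gap = max(min(pos) - max(neg), min(neg) - max(pos))
    return gap / 2.0 if gap > 0 else None


def transduce_exhaustive(train_x, train_y, work_x):
    """Try all 2^m labelings of the working set; keep the largest margin."""
    best_labels, best_margin = None, -1.0
    for labels in product((-1, +1), repeat=len(work_x)):
        m = margin_1d(train_x + work_x, list(train_y) + list(labels))
        if m is not None and m > best_margin:
            best_labels, best_margin = labels, m
    return best_labels, best_margin


train_x, train_y = [0.0, 1.0, 5.0, 6.0], [-1, -1, +1, +1]
work_x = [1.5, 2.0, 4.5]
labels, margin = transduce_exhaustive(train_x, train_y, work_x)
print(labels, margin)
```

The search visits 2^m labelings, which is why the exact approach is feasible only for a small number of working samples m.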

Many applications for transduction
  • Text categorization: classify word documents into a number of predetermined categories
  • Email classification: Spam vs non-spam
  • Web page classification
  • Image database classification
  • All these applications:

- high-dimensional data

- small labeled training set (human-labeled)

- large unlabeled test set

Example application
  • Prediction of molecular bioactivity for drug discovery
  • Training data ~ 1,909 samples; test data ~ 634 samples
  • Input space ~ 139,351-dimensional
  • Prediction accuracy:

SVM induction ~ 74.5%; transduction ~ 82.3%

Ref: J. Weston et al., KDD Cup 2001 data analysis: prediction of molecular bioactivity for drug design – binding to thrombin, Bioinformatics, 2003

Semi-Supervised Learning (SSL)
  • Labeled data + unlabeled data → Model
  • Similar to transduction (but not the same):

- Goal 1 ~ prediction for unlabeled samples

- Goal 2 ~ estimate an inductive model

  • Many algorithms
  • Applications similar to transduction
  • Typically

- Transduction works better for HDLSS

- SSL works better for low-dimensional data

Example: Self-Learning Algorithm

Given initial labeled set L and unlabeled set U


(1) estimate a classifier using labeled set L

(2) classify randomly chosen unlabeled sample using decision rule estimated in Step (1)

(3) move this new labeled sample to set L

Iterate steps (1) – (3) until all unlabeled samples are classified.
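
Steps (1) to (3) can be sketched as follows. The slides do not say which classifier is estimated in step (1); a nearest-centroid rule is assumed here purely to keep the example short, and all function names are hypothetical.

```python
import random


def centroid(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return [sum(c) / len(points) for c in zip(*points)]


def fit_nearest_centroid(labeled):
    """Step (1): estimate a classifier from the labeled set L.

    `labeled` is a list of (x, y) pairs; returns a decision rule x -> y.
    """
    by_class = {}
    for x, y in labeled:
        by_class.setdefault(y, []).append(x)
    centroids = {y: centroid(xs) for y, xs in by_class.items()}

    def predict(x):
        def sq_dist(c):
            return sum((a - b) ** 2 for a, b in zip(x, c))
        return min(centroids, key=lambda y: sq_dist(centroids[y]))

    return predict


def self_learning(labeled, unlabeled, seed=0):
    """Iterate steps (1)-(3) until the unlabeled set U is empty."""
    rng = random.Random(seed)
    L, U = list(labeled), list(unlabeled)
    while U:
        predict = fit_nearest_centroid(L)   # step (1)
        x = U.pop(rng.randrange(len(U)))    # step (2): randomly chosen sample
        L.append((x, predict(x)))           # step (3): move it to L
    return L


labeled = [((0.0, 0.0), -1), ((4.0, 4.0), +1)]
unlabeled = [(0.5, 0.2), (3.5, 4.2), (1.0, 0.8), (3.0, 3.1)]
result = self_learning(labeled, unlabeled)
```

Since every newly labeled sample is treated as ground truth in later iterations, an early mistake can propagate; the order in which samples are drawn therefore matters on harder data.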

Example of Self-Learning Algorithm

Noisy Hyperbolas data set: unlabeled samples in green

[Figures: initial condition; iteration 50; iteration 100 (final)]

Inference through contradiction (Vapnik 2006)
  • Motivation: what is a priori knowledge?

- info about the space of admissible models

- info about admissible data samples

  • Labeled training samples + unlabeled samples from the Universum
  • Universum samples encode info about the region of input space (where application data lives):

- Usually from a different distribution than training/test data

  • Examples of the Universum data
  • Large improvement for small training samples

Main Idea
  • Handwritten digit recognition: digit 5 vs 8

Fig. courtesy of J. Weston (NEC Labs)

Learning with the Universum
  • Inductive setting for binary classification

Given: labeled training data and unlabeled Universum samples

Goal of learning: minimization of prediction risk (as in standard inductive setting)

  • Balance between two goals:

- explain labeled training data using large-margin hyperplane

- achieve maximum falsifiability ~ max # contradictions on the Universum

→ Mathematical optimization formulation (extension of SVM)
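
The formulation itself is not in the transcript. A sketch of the Universum SVM in its commonly cited form is given below. The notation is an assumption: x*_j are the m Universum samples, ε is the width of the insensitive zone around the decision boundary, and C, C* weight the two kinds of slack.

```latex
\begin{aligned}
\min_{\mathbf{w},\, b}\quad
  & \tfrac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i=1}^{n} \xi_i + C^* \sum_{j=1}^{m} \xi^*_j \\
\text{subject to}\quad
  & y_i(\mathbf{w}\cdot\mathbf{x}_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1,\dots,n \\
  & \left| \mathbf{w}\cdot\mathbf{x}^*_j + b \right| \le \varepsilon + \xi^*_j, \quad \xi^*_j \ge 0, \quad j = 1,\dots,m
\end{aligned}
```

A Universum sample that falls inside the ε-zone counts as a contradiction, so a small total Universum slack corresponds to a large number of contradictions.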

Random averaging Universum

[Figure: random averaging Universum for Class 1 and Class -1]

Random Averaging for digits 5 and 8
  • Two randomly selected examples
  • Universum sample:
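
A minimal sketch of random averaging, assuming each image is stored as a flat list of pixel intensities of equal length; the function name and the tiny four-pixel "images" are hypothetical.

```python
import random


def random_averaging_universum(class_a, class_b, n_samples, seed=0):
    """Build Universum samples by averaging random pairs, one from each class."""
    rng = random.Random(seed)
    universum = []
    for _ in range(n_samples):
        a, b = rng.choice(class_a), rng.choice(class_b)
        # Pixel-wise average of one example from each class.
        universum.append([(p + q) / 2.0 for p, q in zip(a, b)])
    return universum


fives = [[0.0, 1.0, 1.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
eights = [[1.0, 1.0, 1.0, 1.0], [1.0, 0.0, 1.0, 1.0]]
u = random_averaging_universum(fives, eights, n_samples=3)
```

Each averaged sample lies halfway between a "5" and an "8" in input space, so it belongs to neither class but stays in the same region as the application data.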

Application Study (Vapnik, 2006)
  • Binary classification of handwritten digits 5 and 8
  • For this binary classification problem, the following Universum sets were used:

U1: randomly selected digits (0,1,2,3,4,6,7,9)

U2: randomly mixing pixels from images 5 and 8

U3: average of randomly selected examples of 5 and 8

Training set size tried: 250, 500, … 3,000 samples

Universum set size: 5,000 samples

  • Prediction error: improved over standard SVM, e.g., for 500 training samples: 1.4% vs 2% (standard SVM)

Application Study: predicting gender of human faces
  • Binary classification setting
  • Difficult problem:

dimensionality ~ large (10K - 20K)

labeled sample size ~ small (~ 10 - 20)

  • Humans perform very well for this task
  • Issues:

- possible improvement (vs standard SVM)

- how to choose ‘good’ Universum?

- model parameter tuning

Empirical Study (cont’d)
  • Universum generation:

U1 Averaging: average of male and female samples randomly selected from the training set (U. of Essex database)

U2 Empirical Distribution: estimate pixel-wise distribution of the training data. Generate a new picture from this distribution

U3 Animal faces:
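
The U2 (Empirical Distribution) Universum can be sketched as below. This is one possible reading of "pixel-wise distribution", namely that each pixel is drawn independently from the values that pixel takes across the training images; the slides do not give the exact procedure, and the function name is hypothetical.

```python
import random


def empirical_distribution_universum(images, n_samples, seed=0):
    """Draw each pixel independently from its empirical distribution.

    `images` is a list of equal-length pixel vectors (the training set).
    """
    rng = random.Random(seed)
    n_pixels = len(images[0])
    return [
        # For pixel k, copy the value of pixel k from a random training image.
        [rng.choice(images)[k] for k in range(n_pixels)]
        for _ in range(n_samples)
    ]


images = [[0, 10, 20], [1, 11, 21], [2, 12, 22]]
u2 = empirical_distribution_universum(images, n_samples=5)
```

Sampling pixels independently preserves the per-pixel statistics of the training data while destroying the correlations between pixels.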

Universum generation: examples

[Figures: example images for U1 Averaging and U2 Empirical Distribution]

Results of gender classification
  • Classification accuracy improves over standard SVM by ~ 2% with the U1 Universum and by ~ 1% with the U2 Universum.
  • The averaging Universum (U1) gives better results for this problem when the number of Universum samples is N = 500 or 1,000

Results of gender classification (cont’d)

Universum ~ Animal Faces:

Degrades classification accuracy by 2-5% (vs standard SVM)

Animal faces are not relevant to this problem


Learning with Structured Data (Vapnik, 2006)

• Application: Handwritten digit recognition

Labeled training data provided by t persons (t >1)

Goal 1: find a classifier that will generalize well for future samples generated by these persons ~ Learning with Structured Data (LWSD), or Learning using Hidden Information

Goal 2: find t classifiers, each generalizing well for its own person ~ Multi-Task Learning (MTL)

• Application: Medical diagnosis

Labeled training data provided by t groups of patients (t >1), say men and women (t = 2)

Goal 1: estimate a classifier to predict/diagnose a disease using training data from t groups of patients ~ LWSD

Goal 2: find t classifiers specialized for each group of patients ~ MTL

Different Ways of Using Group Information

[Figure: diagram comparing the ways of using group information]

SVM+ technology (Vapnik, 2006)
  • Map the input vectors simultaneously into:

- Decision space (standard SVM classifier)

- Correcting space (where correcting functions model slack variables for different groups)

  • Decision space/function ~ the same for all groups
  • Correcting functions ~ different for each group (but correcting space may be the same)
  • SVM+ optimization formulation incorporates:

- the capacity of decision function

- capacity of correcting functions for group r

- relative importance (weight) of these two capacities

SVM+ approach (Vapnik, 2006)

[Figure: decision space (decision function separating Class 1 from Class -1) and correcting space (correcting functions that model the slack variable for group r)]


SVM+ Formulation

Decision Space

Correcting Space

subject to:
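
The equations of this slide are missing from the transcript. A sketch of the SVM+ formulation for t groups, in the form usually given in the literature, follows. The notation is an assumption: T_r is the index set of group r, Φ_z maps inputs into the decision space, Φ_{z_r} maps them into the correcting space, and γ weights the capacity of the correcting functions against that of the decision function.

```latex
\begin{aligned}
\min_{\mathbf{w},\, b,\, \mathbf{w}_r,\, d_r}\quad
  & \underbrace{\tfrac{1}{2}(\mathbf{w},\mathbf{w})}_{\text{decision space}}
    + \underbrace{\frac{\gamma}{2} \sum_{r=1}^{t} (\mathbf{w}_r,\mathbf{w}_r)}_{\text{correcting space}}
    + C \sum_{r=1}^{t} \sum_{i \in T_r} \xi_i^r \\
\text{subject to:}\quad
  & y_i\bigl((\mathbf{w}, \Phi_z(\mathbf{x}_i)) + b\bigr) \ge 1 - \xi_i^r,
    \quad i \in T_r,\ r = 1,\dots,t \\
  & \xi_i^r = (\mathbf{w}_r, \Phi_{z_r}(\mathbf{x}_i)) + d_r \ge 0,
    \quad i \in T_r,\ r = 1,\dots,t
\end{aligned}
```

The slack variables are no longer free: within each group they must follow a correcting function, which restricts the admissible slacks and hence the capacity of the method.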

SVM+ for Multi-task Learning (Liang 2008)

New learning formulation: SVM+MTL

Define decision function for each group as

Common decision function models the relatedness among groups

Correcting functions fine-tune the model for each group (task)
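
The per-group decision function is missing from the transcript. In the notation commonly used for SVM+MTL (an assumption here: Φ_z is the decision-space mapping, Φ_{z_r} is the correcting-space mapping, and r indexes the t groups) it can be sketched as:

```latex
f_r(\mathbf{x}) =
  \underbrace{(\mathbf{w}, \Phi_z(\mathbf{x})) + b}_{\text{common decision function}}
  + \underbrace{(\mathbf{w}_r, \Phi_{z_r}(\mathbf{x})) + d_r}_{\text{correcting function for group } r},
  \qquad r = 1,\dots,t
```

The first part is shared by all groups, and only the second part changes from one task to the next.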



SVM+MTL Formulation

Decision Space

Correcting Space

subject to:
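
The equations of this slide are missing from the transcript. A sketch of the SVM+MTL optimization problem as usually stated follows, with assumed notation: T_r is the index set of group r, Φ_z and Φ_{z_r} map inputs into the decision and correcting spaces, and γ weights the two capacities.

```latex
\begin{aligned}
\min_{\mathbf{w},\, b,\, \mathbf{w}_r,\, d_r}\quad
  & \underbrace{\tfrac{1}{2}(\mathbf{w},\mathbf{w})}_{\text{decision space}}
    + \underbrace{\frac{\gamma}{2} \sum_{r=1}^{t} (\mathbf{w}_r,\mathbf{w}_r)}_{\text{correcting space}}
    + C \sum_{r=1}^{t} \sum_{i \in T_r} \xi_i^r \\
\text{subject to:}\quad
  & y_i\bigl((\mathbf{w}, \Phi_z(\mathbf{x}_i)) + b
    + (\mathbf{w}_r, \Phi_{z_r}(\mathbf{x}_i)) + d_r\bigr) \ge 1 - \xi_i^r \\
  & \xi_i^r \ge 0, \quad i \in T_r,\ r = 1,\dots,t
\end{aligned}
```

Unlike SVM+, the correcting functions here are part of the decision rule itself, so the group label must be known at test time.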



Empirical Validation

Different ways of using group info → different learning settings:

- which one yields better generalization?

- how is performance affected by sample size?

Empirical comparisons:

- synthetic data set


Comparison for Synthetic Data Set

Generate x where each

The coefficient vectors of three tasks are specified as

For each task and each data vector,

Details of methods used:

- linear SVM classifier (single parameter C)

- SVM+ and SVM+MTL classifiers (3 parameters: linear kernel for decision space, RBF kernel for correcting space, and parameter γ)

- Independent validation set for model selection


Experimental Results

Comparison results (average over 10 trials):

n ~ number of training samples per task

Average test error (%):

Note: relative performance depends on sample size

Note: SVM+ always better than SVM

SVM+MTL always better than mSVM




OUTLINE

Motivation for non-standard approaches

Alternative Learning Settings

Summary: Advantages/limitations of non-standard settings

Advantages and limitations of non-standard settings

Advantages:

- make common sense

- follow methodological framework (VC-theory)

- yield better generalization (but not always)

Limitations:

- need to formalize application requirements → need to understand application domain

- generally more complex learning formulations

- more difficult model selection

- few known empirical comparisons (to date)

SVM+ is a promising new technology for hard problems

References and Resources
  • Vapnik, V. Estimation of Dependencies Based on Empirical Data. Empirical Inference Science: Afterword of 2006, Springer, 2006
  • Cherkassky, V. and F. Mulier, Learning from Data, second edition, Wiley, 2007
  • Chapelle, O., Schölkopf, B., and A. Zien, Eds., Semi-Supervised Learning, MIT Press, 2006
  • Cherkassky, V. and Y. Ma, Introduction to Predictive learning, Springer, 2011 (to appear)
  • Hastie, T., R. Tibshirani and J. Friedman, The Elements of Statistical Learning. Data Mining, Inference and Prediction, New York: Springer, 2001
  • Schölkopf, B. and A. Smola, Learning with Kernels. MIT Press, 2002.

Public-domain SVM software

  • Main web page link
  • LIBSVM software library
  • SVM-Light software library
  • Non-standard SVM-based methodologies: Universum, SVM+, MTL