Part 4: ADVANCED SVM-based LEARNING METHODS
Vladimir Cherkassky
University of Minnesota
Presented at Tech Tune Ups, ECE Dept, June 1, 2011
Electrical and Computer Engineering
OUTLINE
Motivation for non-standard approaches: high-dimensional data
Alternative Learning Settings
- Transduction and SSL
- Inference Through Contradictions
- Learning using privileged information (or SVM+)
- Multi-task Learning
Summary
(1) Margin is large (relative to R, the radius of the smallest sphere containing the data)
(2) % of support vectors (SVs) is small
(3) ratio d/n is small
- standard inductive methods usually fail
Conventional approach:
Incorporate a priori knowledge into learning method
Assumption: a priori knowledge about good model
Non-standard learning formulations:
Incorporate a priori knowledge into new non-standard learning formulation (learning setting)
Assumption: a priori knowledge is about properties of application data and/or goal of learning
OUTLINE
Motivation for non-standard approaches
Alternative Learning Settings
- Transduction and SSL
- Inference Through Contradictions
- Learning with Structured Data
- Multi-task Learning
Summary
labeled training data ~ examples of digits 5 and 8
unlabeled examples (Universum) ~ all other (eight) digits
Training data ~ t groups (i.e., from t different persons)
Test data ~ group label not known
Training data ~ t groups (from different persons)
Test data ~ t groups (group label is known)
Finite training set
Predictive model derived using only training data
Prediction for all possible test inputs
1. Predict only for given test points → transduction
2. A priori knowledge in the form of additional ‘typical’ samples → learning through contradiction
3. Additional (group) info about training data → Learning using privileged information (LUPI) aka SVM+
4. Additional (group) info about training + test data → Multi-task learning
Given: labeled training data
and unlabeled test points
Estimate: class labels at these test points
Goal of learning: minimization of risk on the test set:
where
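One common way to write this setting is sketched below. The notation is assumed, since none of the symbols survive in the slide text: n labeled pairs, m test inputs, a loss L and a model f(x, ω).

```latex
\begin{align*}
&\text{Given: } (\mathbf{x}_i, y_i),\ i = 1, \dots, n
  \quad\text{and}\quad \mathbf{x}^*_j,\ j = 1, \dots, m \\
&\text{Estimate: } \mathbf{y}^* = (y^*_1, \dots, y^*_m) \\
&R(\mathbf{y}^*) = \frac{1}{m} \sum_{j=1}^{m} \int L(y, y^*_j)\, dP(y \mid \mathbf{x}^*_j),
  \qquad y^*_j = f(\mathbf{x}^*_j, \omega)
\end{align*}
```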
Induction vs Transduction
Single unlabeled test point X
Many test points X aka working samples
(TL1) separate labeled training data using a large-margin hyperplane (as in standard inductive SVM)
(TL2) separate (explain) the working data set using a large-margin hyperplane.
Mathematical optimization formulation
subject to
where
Solution (~ decision boundary)
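Goals (TL1) and (TL2) are usually combined in the transductive SVM problem. The sketch below uses the commonly published soft-margin form with assumed notation: slacks ξ_i for the n labeled samples, slacks ξ*_j and unknown labels y*_j for the m working samples, and trade-off constants C and C*. It may differ in detail from the slide's own formula.

```latex
\begin{align*}
\min_{\mathbf{w},\, b,\, \mathbf{y}^*}\quad
  & \tfrac{1}{2}\,\|\mathbf{w}\|^2
    + C \sum_{i=1}^{n} \xi_i
    + C^* \sum_{j=1}^{m} \xi^*_j \\
\text{subject to}\quad
  & y_i\,(\mathbf{w} \cdot \mathbf{x}_i + b) \ge 1 - \xi_i, \qquad \xi_i \ge 0, \quad i = 1, \dots, n \\
  & y^*_j\,(\mathbf{w} \cdot \mathbf{x}^*_j + b) \ge 1 - \xi^*_j, \qquad \xi^*_j \ge 0, \quad j = 1, \dots, m \\
\text{where}\quad
  & y^*_j \in \{-1, +1\} \\
\text{Solution:}\quad
  & D(\mathbf{x}) = \mathbf{w} \cdot \mathbf{x} + b
\end{align*}
```

Because the labels y*_j are optimization variables, the problem is combinatorial and the loss on the working samples is non-convex.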
all unlabeled samples assigned to one class
(~ non-convexity of the loss for unlabeled data)
different optimization heuristics ~ different solutions
- high-dimensional data
- small labeled training set (human-labeled)
- large unlabeled test set
SVM induction ~ 74.5%; transduction ~ 82.3%
Ref: J. Weston et al., KDD Cup 2001 data analysis: prediction of molecular bioactivity for drug design – binding to thrombin, Bioinformatics, 2003
- Goal 1 ~ prediction for unlabeled samples
- Goal 2 ~ estimate an inductive model
- Transduction works better for high-dimensional, low-sample-size (HDLSS) data
- SSL works better for low-dimensional data
Given initial labeled set L and unlabeled set U
Repeat:
(1) estimate a classifier using labeled set L
(2) classify randomly chosen unlabeled sample using decision rule estimated in Step (1)
(3) move this new labeled sample to set L
Iterate steps (1) – (3) until all unlabeled samples are classified.
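Steps (1) to (3) can be sketched in a few lines of Python. The base classifier is an assumption: a nearest-class-mean rule stands in for whatever classifier is actually used, and the names `centroid_classifier` and `self_learning` are hypothetical.

```python
import random


def centroid_classifier(labeled):
    """Fit a nearest-class-mean rule (an assumed stand-in for any base classifier)."""
    sums, counts = {}, {}
    for x, y in labeled:
        acc = sums.setdefault(y, [0.0] * len(x))
        for k, v in enumerate(x):
            acc[k] += v
        counts[y] = counts.get(y, 0) + 1
    means = {y: [s / counts[y] for s in acc] for y, acc in sums.items()}

    def predict(x):
        def sqdist(m):
            return sum((a - b) ** 2 for a, b in zip(x, m))
        return min(means, key=lambda y: sqdist(means[y]))

    return predict


def self_learning(labeled, unlabeled, seed=0):
    """Label the unlabeled pool one randomly chosen sample at a time."""
    rng = random.Random(seed)
    labeled = list(labeled)
    pool = list(unlabeled)
    while pool:
        predict = centroid_classifier(labeled)   # step (1): fit on current L
        x = pool.pop(rng.randrange(len(pool)))   # step (2): pick a random sample
        labeled.append((x, predict(x)))          # step (3): move it into L
    return labeled


# Toy data: two well-separated clusters, one labeled point per class.
L = [((0.0, 0.0), -1), ((4.0, 4.0), +1)]
U = [(0.5, 0.2), (3.5, 4.1), (1.0, 0.8), (3.0, 3.2)]
result = self_learning(L, U)
```

Each newly labeled sample changes the classifier used for the next one, so early mistakes can propagate. On well-separated toy data like this the order of selection does not matter.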
Noisy Hyperbolas: unlabeled samples in green
Initial condition:
Iteration 50
Iteration 100 (final)
- info about the space of admissible models
- info about admissible data samples
- Usually from a different distribution than training/test data
Fig. courtesy of J. Weston (NEC Labs)
Given: labeled training data
and unlabeled Universum samples
Goal of learning: minimization of prediction risk (as in standard inductive setting)
- explain labeled training data using large-margin hyperplane
- achieve maximum falsifiability ~ max # contradictions on the Universum
Math optimization formulation (extension of SVM)
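A sketch of the Universum SVM problem in its commonly published form follows. The notation is assumed: n labeled samples (x_i, y_i), m Universum samples x*_j, trade-off constants C and C*, and an ε-insensitive zone around the hyperplane. Universum samples that fall inside this zone count as contradictions, so penalizing the ones outside it pushes the hyperplane toward maximum falsifiability.

```latex
\begin{align*}
\min_{\mathbf{w},\, b}\quad
  & \tfrac{1}{2}\,\|\mathbf{w}\|^2
    + C \sum_{i=1}^{n} \xi_i
    + C^* \sum_{j=1}^{m} \xi^*_j \\
\text{subject to}\quad
  & y_i\,(\mathbf{w} \cdot \mathbf{x}_i + b) \ge 1 - \xi_i, \qquad \xi_i \ge 0, \quad i = 1, \dots, n \\
  & |\mathbf{w} \cdot \mathbf{x}^*_j + b| \le \varepsilon + \xi^*_j, \qquad \xi^*_j \ge 0, \quad j = 1, \dots, m
\end{align*}
```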
Figure labels: Class 1, Class -1, Average, Hyperplane
U1: randomly selected digits (0,1,2,3,4,6,7,9)
U2: randomly mixing pixels from images 5 and 8
U3: average of randomly selected examples of 5 and 8
Training set size tried: 250, 500, … 3,000 samples
Universum set size: 5,000 samples
neither Hillary nor Obama
dadaism
dimensionality ~ large (10K - 20K)
labeled sample size ~ small (~ 10 - 20)
- possible improvement (vs standard SVM)
- how to choose ‘good’ Universum?
- model parameter tuning
U1 Average: of male and female samples randomly selected from the training set (U. of Essex database)
U2 Empirical Distribution: estimate pixel-wise distribution of the training data. Generate a new picture from this distribution
U3 Animal faces:
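The two synthetic Universum types (U1 and U2) can be sketched as follows. This is one reading of the descriptions above, not the procedure used in the experiments: images are assumed to be flat lists of pixel values, U2 is assumed to draw each pixel independently from that pixel's empirical values in the training set, and both function names are hypothetical.

```python
import random


def average_universum(class_a, class_b, n_samples, rng):
    """U1-style: average one randomly chosen sample from each class."""
    out = []
    for _ in range(n_samples):
        a, b = rng.choice(class_a), rng.choice(class_b)
        out.append([(p + q) / 2.0 for p, q in zip(a, b)])
    return out


def empirical_universum(training, n_samples, rng):
    """U2-style: draw every pixel independently from its empirical
    distribution, i.e. from the values that pixel takes in the training set."""
    n_pixels = len(training[0])
    return [
        [rng.choice(training)[k] for k in range(n_pixels)]
        for _ in range(n_samples)
    ]


# Toy 3-pixel "images" standing in for face pictures.
rng = random.Random(0)
males = [[0.0, 0.2, 0.4], [0.2, 0.2, 0.6]]
females = [[1.0, 0.8, 0.6], [0.8, 1.0, 0.4]]
u1 = average_universum(males, females, 5, rng)
u2 = empirical_universum(males + females, 5, rng)
```

Both generators need only the labeled training set, which is what makes them cheap alternatives to collecting real Universum data such as animal faces.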
U1 Averaging:
U2 Empirical Distribution:
and by ~ 1% with U2 Universum.
Universum ~ Animal Faces:
Degrades classification accuracy by 2-5% (vs standard SVM)
Animal faces are not relevant to this problem
• Application: Handwritten digit recognition
Labeled training data provided by t persons (t >1)
Goal 1: find a classifier that will generalize well for future samples generated by these persons ~ Learning with Structured Data or Learning using Hidden Information
Goal 2: find t classifiers with good generalization (one for each person) ~ Multi-Task Learning (MTL)
• Application: Medical diagnosis
Labeled training data provided by t groups of patients (t >1), say men and women (t = 2)
Goal 1: estimate a classifier to predict/diagnose a disease using training data from t groups of patients ~ Learning with Structured Data (LWSD)
Goal 2: find t classifiers specialized for each group of patients ~ MTL
sSVM: single SVM → f(x)
SVM+: SVM+ → f(x)
mSVM: one SVM per group → f1(x), f2(x)
MTL: SVM+MTL → f1(x), f2(x)
- Decision space (standard SVM classifier)
- Correcting space (where correcting functions model slack variables for different groups)
- the capacity of decision function
- capacity of correcting functions for group r
- relative importance (weight) of these two capacities
Figure: mapping of Group 1 and Group 2 samples (Class 1, Class -1) into the decision space (decision function) and into the correcting spaces (correcting functions; slack variable for group r)
Decision Space
Correcting Space
subject to:
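A sketch of the SVM+ problem in its commonly published form follows. The notation is assumed: t groups with index sets T_r, a decision-space mapping Φ_z, a correcting-space mapping Φ_{z_r} for group r, and trade-off constants C and γ. The three terms correspond to the capacity of the decision function, the capacity of the correcting functions, and their relative weight γ.

```latex
\begin{align*}
\min_{\mathbf{w},\, b,\, \mathbf{w}_r,\, d_r}\quad
  & \tfrac{1}{2}\,(\mathbf{w}, \mathbf{w})
    + \frac{\gamma}{2} \sum_{r=1}^{t} (\mathbf{w}_r, \mathbf{w}_r)
    + C \sum_{r=1}^{t} \sum_{i \in T_r} \xi^r_i \\
\text{subject to}\quad
  & y_i \big( (\mathbf{w}, \Phi_z(\mathbf{x}_i)) + b \big) \ge 1 - \xi^r_i,
    \qquad i \in T_r,\ r = 1, \dots, t \\
  & \xi^r_i \ge 0, \qquad
    \xi^r_i = (\mathbf{w}_r, \Phi_{z_r}(\mathbf{x}_i)) + d_r
\end{align*}
```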
New learning formulation: SVM+MTL
Define decision function for each group as
Common decision function models the relatedness among groups
Correcting functions fine-tune the model for each group (task)
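The split into a common part and a group-specific correction can be sketched as below. Both parts are assumed to be linear in the original input space, which is a simplification (the correcting space may use a different kernel), and `make_group_decision` is a hypothetical helper name.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def make_group_decision(w, b, corrections):
    """Build f_r(x) = (w . x + b) + (w_r . x + d_r).

    w, b        -- common decision function shared by all groups
    corrections -- {group r: (w_r, d_r)}, one correcting function per group
    """
    def f(x, r):
        w_r, d_r = corrections[r]
        common = dot(w, x) + b          # models relatedness among groups
        correction = dot(w_r, x) + d_r  # fine-tunes the model for group r
        return common + correction

    return f


# Toy example: two groups share w but differ by an offset.
w, b = [1.0, -1.0], 0.0
corrections = {1: ([0.0, 0.0], 0.5), 2: ([0.0, 0.0], -0.5)}
f = make_group_decision(w, b, corrections)
x = [0.2, 0.0]
score_group1 = f(x, 1)
score_group2 = f(x, 2)
```

In this toy case the same input receives scores of opposite sign in the two groups, which a single shared decision function could not produce.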
Decision Space
Correcting Space
subject to:
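A sketch of the SVM+MTL problem in its commonly published form follows, with assumed notation: t groups with index sets T_r, mappings Φ_z and Φ_{z_r} into the decision and correcting spaces, and constants C and γ. The objective has the same form as in SVM+, but the correcting function now enters the margin constraint of each group's decision function instead of modeling the slack.

```latex
\begin{align*}
\min_{\mathbf{w},\, b,\, \mathbf{w}_r,\, d_r}\quad
  & \tfrac{1}{2}\,(\mathbf{w}, \mathbf{w})
    + \frac{\gamma}{2} \sum_{r=1}^{t} (\mathbf{w}_r, \mathbf{w}_r)
    + C \sum_{r=1}^{t} \sum_{i \in T_r} \xi^r_i \\
\text{subject to}\quad
  & y_i \big( (\mathbf{w}, \Phi_z(\mathbf{x}_i)) + b
      + (\mathbf{w}_r, \Phi_{z_r}(\mathbf{x}_i)) + d_r \big) \ge 1 - \xi^r_i \\
  & \xi^r_i \ge 0, \qquad i \in T_r,\ r = 1, \dots, t
\end{align*}
```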
Empirical Validation
Different ways of using group info → different learning settings:
- which one yields better generalization?
- how is performance affected by sample size?
Empirical comparisons:
- synthetic data set
Generate x where each
The coefficient vectors of three tasks are specified as
For each task and each data vector,
Details of methods used:
- linear SVM classifier (single parameter C)
- SVM+, SVM+MTL classifier (3 parameters: linear kernel for decision space, RBF kernel for correcting space, and parameter γ)
- Independent validation set for model selection
Comparison results (average over 10 trials):
n ~ number of training samples per task
average test error (%):
Note: relative performance depends on sample size
Note: SVM+ always better than SVM
SVM+MTL always better than mSVM
OUTLINE
Motivation for non-standard approaches
Alternative Learning Settings
Summary: Advantages/limitations of non-standard settings
Advantages and limitations of non-standard settings
Advantages
- make common sense
- follow methodological framework (VC-theory)
- yield better generalization (but not always)
Limitations
- need to formalize application requirements → need to understand the application domain
- generally more complex learning formulations
- more difficult model selection
- few known empirical comparisons (to date)
SVM+ is a promising new technology for hard problems
Public-domain SVM software