
Part 4: ADVANCED SVM-based LEARNING METHODS


Vladimir Cherkassky

University of Minnesota

cherk001@umn.edu

Presented at Tech Tune Ups, ECE Dept, June 1, 2011

Electrical and Computer Engineering


OUTLINE

Motivation for non-standard approaches: high-dimensional data

Alternative Learning Settings

- Transduction and SSL

- Inference Through Contradictions

- Learning using privileged information (or SVM+)

- Multi-task Learning

Summary

- Why can linear classifiers generalize?
(1) Margin is large (relative to R, the radius of the smallest sphere containing the data)

(2) % of SVs is small

(3) ratio d/n is small (d ~ dimensionality, n ~ number of training samples)

- SVM offers an effective way to control complexity (via margin + kernel selection), i.e. implementing (1) or (2) or both
- What happens when d >> n?
- standard inductive methods usually fail
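
Conditions (1) and (2) above correspond to two well-known results from VC theory. A sketch in assumed notation (not taken from the slides): Δ is the margin, h the VC-dimension of the set of margin-Δ hyperplanes, and #SV the number of support vectors.

```latex
% (1) VC-dimension of hyperplanes with margin \Delta, for data inside a sphere of radius R in d dimensions
h \le \min\left( \frac{R^2}{\Delta^2},\, d \right) + 1

% (2) Leave-one-out type bound: expected test error vs. expected number of support vectors
E[\mathrm{error}] \le \frac{E[\#\mathrm{SV}]}{n}
```

When d >> n, the term d no longer limits h, so only a large margin (or a small fraction of support vectors) can keep complexity under control.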

Conventional approach:

Incorporate a priori knowledge into learning method

- Preprocessing and feature selection
- Model parameterization (~ good kernels in SVM)
Assumption: a priori knowledge is about a good model

Non-standard learning formulations:

Incorporate a priori knowledge into new non-standard learning formulation (learning setting)

Assumption: a priori knowledge is about properties of application data and/or goal of learning

- Which type of assumption makes more sense?

OUTLINE

Motivation for non-standard approaches

Alternative Learning Settings

- Transduction and SSL

- Inference Through Contradictions

- Learning with Structured Data

- Multi-task Learning

Summary

- Application domain: handwritten digit recognition
- Standard inductive setting
- Transduction: labeled training + unlabeled data
- Learning through contradictions:
labeled training data ~ examples of digits 5 and 8

unlabeled examples (Universum) ~ all other (eight) digits

- Learning using hidden information:
Training data ~ t groups (i.e., from t different persons)

Test data ~ group label not known

- Multi-task learning:
Training data ~ t groups (from different persons)

Test data ~ t groups (group label is known)

- Standard Inductive learning assumes
Finite training set

Predictive model derived using only training data

Prediction for all possible test inputs

- Possible modifications
1. Predict only for given test points → transduction

2. A priori knowledge in the form of additional ‘typical’ samples → learning through contradiction

3. Additional (group) info about training data → Learning using privileged information (LUPI) aka SVM+

4. Additional (group) info about training + test data → Multi-task learning

- How to incorporate unlabeled test data into the learning process? Assume binary classification
- Estimating function at given points
Given: labeled training data

and unlabeled test points

Estimate: class labels at these test points

Goal of learning: minimization of risk on the test set:

where

Induction vs Transduction

Single unlabeled test point X

Many test points X aka working samples
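
The formulas for the transductive setting are missing from the text above. One common way to write the setting is sketched below. The notation is an assumption: n labeled training samples, m unlabeled test points marked with an asterisk, and \bar{y}_j for the unknown true label of the j-th test point.

```latex
\text{Given: } (\mathbf{x}_i, y_i),\; i = 1, \dots, n
\qquad \text{and} \qquad \mathbf{x}^*_j,\; j = 1, \dots, m

\text{Estimate: } \mathbf{y}^* = (y^*_1, \dots, y^*_m)

% risk on the test set, with L the 0/1 classification loss
R(\mathbf{y}^*) = \frac{1}{m} \sum_{j=1}^{m} L\left( \bar{y}_j,\, y^*_j \right)
```

In contrast, induction estimates a function for all possible inputs and only then evaluates it at the test points.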

- Binary classification, linear parameterization, joint set of (training + working) samples
- Two objectives of transductive learning:
(TL1) separate labeled training data using a large-margin hyperplane (as in standard inductive SVM)

(TL2) separate (explain) the working data set using a large-margin hyperplane.

- Standard SVM hinge loss for labeled samples
- Loss function for unlabeled samples:
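
The two losses can be sketched in a few lines. This is a minimal illustration, assuming the usual symmetric hinge form for unlabeled samples (the hinge loss with the label set to the sign of f(x)); the function names are hypothetical.

```python
def hinge_loss(y, f_x):
    """Hinge loss for a labeled sample: max(0, 1 - y * f(x)), with y in {-1, +1}."""
    return max(0.0, 1.0 - y * f_x)


def unlabeled_loss(f_x):
    """Symmetric hinge loss for an unlabeled sample: max(0, 1 - |f(x)|).

    Same as the hinge loss with the label chosen as sign(f(x)), so a sample
    is penalized only when it falls inside the margin.
    """
    return max(0.0, 1.0 - abs(f_x))


if __name__ == "__main__":
    # Print both losses for a few decision-function values f(x).
    for f_x in (-2.0, -0.5, 0.0, 0.5, 2.0):
        print(f_x, hinge_loss(+1, f_x), unlabeled_loss(f_x))
```

The unlabeled loss peaks at f(x) = 0 and falls off on both sides, so it is not convex in f(x).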
Mathematical optimization formulation

- Given: joint set of (training + working) samples
- Denote slack variables for training, for working
- Minimize
subject to

where

Solution (~ decision boundary)
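
The Minimize / subject to / where outline above lost its formulas. A sketch of the soft-margin transductive SVM as it is commonly stated, in assumed notation: slack variables ξ_i for the n training samples, ξ*_j for the m working samples, and regularization parameters C and C*.

```latex
\min_{\mathbf{w},\, b,\, y^*_1, \dots, y^*_m} \quad
\frac{1}{2} (\mathbf{w} \cdot \mathbf{w})
+ C \sum_{i=1}^{n} \xi_i
+ C^* \sum_{j=1}^{m} \xi^*_j

\text{subject to} \quad
y_i \left[ (\mathbf{w} \cdot \mathbf{x}_i) + b \right] \ge 1 - \xi_i, \quad
y^*_j \left[ (\mathbf{w} \cdot \mathbf{x}^*_j) + b \right] \ge 1 - \xi^*_j, \quad
\xi_i \ge 0, \; \xi^*_j \ge 0

\text{where} \quad y^*_j = \mathrm{sign}\left( (\mathbf{w} \cdot \mathbf{x}^*_j) + b \right)
```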

- Unbalanced situation (small training / large test) → all unlabeled samples assigned to one class

- Additional constraint:
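
The constraint itself is missing here. One balancing constraint used in the transductive SVM literature is sketched below; it is an assumption that this is the form intended. It forces the average output on the m working samples x*_j to match the average label of the n training samples (x_i, y_i).

```latex
\frac{1}{m} \sum_{j=1}^{m} \left[ (\mathbf{w} \cdot \mathbf{x}^*_j) + b \right]
= \frac{1}{n} \sum_{i=1}^{n} y_i
```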

- Hyperparameters control the trade-off between explanation and margin size
- Soft-margin inductive SVM is a special case of soft-margin transduction with zero slacks
- Dual + kernel version of SVM transduction
- Transductive SVM optimization is not convex (~ non-convexity of the loss for unlabeled data), so different optimization heuristics ~ different solutions
- Exact solution (via exhaustive search) is possible for a small number of test samples (m), but this solution is NOT very useful (~ inductive SVM).

- Text categorization: classify word documents into a number of predetermined categories
- Email classification: Spam vs non-spam
- Web page classification
- Image database classification
- All these applications:
- high-dimensional data

- small labeled training set (human-labeled)

- large unlabeled test set

- Prediction of molecular bioactivity for drug discovery
- Training data ~ 1,909 samples; test ~ 634 samples
- Input space ~ 139,351-dimensional
- Prediction accuracy:
SVM induction ~ 74.5%; transduction ~ 82.3%

Ref: J. Weston et al., KDD Cup 2001 data analysis: prediction of molecular bioactivity for drug design – binding to thrombin, Bioinformatics, 2003

- Labeled data + unlabeled data → Model
- Similar to transduction (but not the same):
- Goal 1 ~ prediction for unlabeled samples

- Goal 2 ~ estimate an inductive model

- Many algorithms
- Applications similar to transduction
- Typically
- Transduction works better for HDLSS (high-dimensional, low sample size) data

- SSL works better for low-dimensional data

Given initial labeled set L and unlabeled set U

Repeat:

(1) estimate a classifier using labeled set L

(2) classify a randomly chosen unlabeled sample using the decision rule estimated in Step (1)

(3) move this new labeled sample to set L

Iterate steps (1) – (3) until all unlabeled samples are classified.
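
Steps (1) to (3) can be sketched as follows. This is a minimal illustration, not the implementation behind the slides: a nearest-centroid rule stands in for the classifier (an assumption, chosen to keep the example dependency-free), and all function names are hypothetical.

```python
import random


def fit_centroids(labeled):
    """Step (1): estimate a nearest-centroid classifier from (x, y) pairs."""
    sums, counts = {}, {}
    for x, y in labeled:
        s = sums.setdefault(y, [0.0] * len(x))
        for k, v in enumerate(x):
            s[k] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in s] for y, s in sums.items()}


def predict(centroids, x):
    """Assign x to the class whose centroid is nearest (squared Euclidean distance)."""
    def dist2(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda y: dist2(centroids[y]))


def self_learning(labeled, unlabeled, seed=0):
    """Label all unlabeled samples one at a time, re-estimating the classifier each time."""
    rng = random.Random(seed)
    L = list(labeled)    # labeled set L
    U = list(unlabeled)  # unlabeled set U
    while U:
        centroids = fit_centroids(L)          # step (1): estimate classifier from L
        x = U.pop(rng.randrange(len(U)))      # step (2): pick a random unlabeled sample
        L.append((x, predict(centroids, x)))  # steps (2)-(3): classify it, move it to L
    return L


if __name__ == "__main__":
    labeled = [((0.0, 0.0), -1), ((4.0, 4.0), 1)]
    unlabeled = [(0.5, 0.2), (3.8, 4.1), (1.0, 0.8), (3.0, 3.5)]
    for x, y in self_learning(labeled, unlabeled):
        print(x, y)
```

Because each newly labeled sample is treated as ground truth in later iterations, an early mistake can propagate; the result depends on the order in which samples are drawn.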

[Figure: Noisy Hyperbolas data, unlabeled samples in green. Panels: initial condition, iteration 50, iteration 100 (final).]

- Motivation: what is a priori knowledge?
- info about the space of admissible models

- info about admissible data samples

- Labeled training samples + unlabeled samples from the Universum
- Universum samples encode info about the region of input space (where application data lives):
- Usually from a different distribution than training/test data

- Examples of the Universum data
- Large improvement for small training samples

- Handwritten digit recognition: digit 5 vs 8

Fig. courtesy of J. Weston (NEC Labs)

- Inductive setting for binary classification
Given: labeled training data

and unlabeled Universum samples

Goal of learning: minimization of prediction risk (as in standard inductive setting)

- Balance between two goals:
- explain labeled training data using large-margin hyperplane

- achieve maximum falsifiability ~ max # contradictions on the Universum

Math optimization formulation (extension of SVM)
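
The formulation itself is missing here. A sketch of the Universum SVM as it is commonly stated, in assumed notation: n labeled samples (x_i, y_i) with slacks ξ_i, m Universum samples x*_j with slacks ξ*_j, parameters C and C*, and an ε-insensitive zone around the decision boundary.

```latex
\min_{\mathbf{w},\, b} \quad
\frac{1}{2} (\mathbf{w} \cdot \mathbf{w})
+ C \sum_{i=1}^{n} \xi_i
+ C^* \sum_{j=1}^{m} \xi^*_j

\text{subject to} \quad
y_i \left[ (\mathbf{w} \cdot \mathbf{x}_i) + b \right] \ge 1 - \xi_i, \quad \xi_i \ge 0
\qquad \text{(labeled data)}

\left| (\mathbf{w} \cdot \mathbf{x}^*_j) + b \right| \le \varepsilon + \xi^*_j, \quad \xi^*_j \ge 0
\qquad \text{(Universum)}
```

A Universum sample that falls inside the ε-zone around the hyperplane counts as a contradiction and costs nothing; samples outside the zone are penalized.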

[Figure: Class 1 and Class -1 separated by a hyperplane; a Universum sample is formed as the average of two randomly selected examples.]

- Binary classification of handwritten digits 5 and 8
- For this binary classification problem, the following Universum sets were used:
U1: randomly selected digits (0,1,2,3,4,6,7,9)

U2: randomly mixing pixels from images 5 and 8

U3: average of randomly selected examples of 5 and 8

Training set size tried: 250, 500, … 3,000 samples

Universum set size: 5,000 samples
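
The U2 and U3 constructions can be sketched as below. This is an illustration under assumptions: images are flat lists of pixel values, "randomly mixing pixels" is read as taking each pixel from one of the two images with equal probability, and the function names are hypothetical.

```python
import random


def universum_by_averaging(images_5, images_8, n_samples, seed=0):
    """U3-style: pixel-wise average of one randomly selected 5 and one randomly selected 8."""
    rng = random.Random(seed)
    out = []
    for _ in range(n_samples):
        a = rng.choice(images_5)
        b = rng.choice(images_8)
        out.append([(p + q) / 2.0 for p, q in zip(a, b)])
    return out


def universum_by_pixel_mixing(images_5, images_8, n_samples, seed=0):
    """U2-style: each pixel is copied at random from a selected 5 or a selected 8."""
    rng = random.Random(seed)
    out = []
    for _ in range(n_samples):
        a = rng.choice(images_5)
        b = rng.choice(images_8)
        out.append([p if rng.random() < 0.5 else q for p, q in zip(a, b)])
    return out


if __name__ == "__main__":
    # Toy 4-pixel "images" standing in for real digit images.
    fives = [[0, 0, 1, 1]]
    eights = [[1, 1, 1, 1]]
    print(universum_by_averaging(fives, eights, 2))
    print(universum_by_pixel_mixing(fives, eights, 2))
```

Both constructions produce samples that belong to neither class but live in the same region of input space as the training data.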

- Prediction error: improved over standard SVM, e.g. for 500 training samples: 1.4% vs 2% (SVM)

[Figure labels: "neither Hillary nor Obama"; "dadaism"]

- Binary classification setting
- Difficult problem:
dimensionality ~ large (10K - 20K)

labeled sample size ~ small (~ 10 - 20)

- Humans perform very well for this task
- Issues:
- possible improvement (vs standard SVM)

- how to choose ‘good’ Universum?

- model parameter tuning

- Universum generation:
U1 Averaging: average of male and female samples randomly selected from the training set (U. of Essex database)

U2 Empirical Distribution: estimate the pixel-wise distribution of the training data, then generate a new picture from this distribution

U3 Animal faces

[Figure: example Universum images for U1 Averaging and U2 Empirical Distribution]

- Classification accuracy: improves vs standard SVM by ~ 2% with U1 Universum,
and by ~ 1% with U2 Universum.

- Universum by averaging gives better results for this problem, when number of Universum samples N = 500 or 1,000

Universum ~ Animal Faces:

Degrades classification accuracy by 2-5% (vs standard SVM)

Animal faces are not relevant to this problem


• Application: Handwritten digit recognition

Labeled training data provided by t persons (t > 1)

Goal 1: find a classifier that will generalize well for future samples generated by these persons ~ Learning with Structured Data (LWSD) or Learning using Hidden Information

Goal 2: find t classifiers with good generalization (one for each person) ~ Multi-Task Learning (MTL)

• Application: Medical diagnosis

Labeled training data provided by t groups of patients (t > 1), say men and women (t = 2)

Goal 1: estimate a classifier to predict/diagnose a disease using training data from t groups of patients ~ LWSD

Goal 2: find t classifiers specialized for each group of patients ~ MTL

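[Figure: four ways of using group information. sSVM: a single SVM yields f(x). SVM+: SVM+ yields a single f(x). mSVM: a separate SVM per group yields f1(x), f2(x). MTL: SVM+MTL yields f1(x), f2(x).]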

- Map the input vectors simultaneously into:
- Decision space (standard SVM classifier)

- Correcting space (where correcting functions model slack variables for different groups)

- Decision space/function ~ the same for all groups
- Correcting functions ~ different for each group (but correcting space may be the same)
- SVM+ optimization formulation incorporates:
- the capacity of decision function

- capacity of correcting functions for group r

- relative importance (weight) of these two capacities
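
The three ingredients listed above can be seen in the SVM+ formulation as it is usually written. The sketch below uses assumed notation: t groups with index sets T_r, mapping Φ_z into the decision space, mappings Φ_{z_r} into the correcting space, and γ as the weight between the two capacities.

```latex
\min_{\mathbf{w},\, b,\, \mathbf{w}_1, \dots, \mathbf{w}_t,\, d_1, \dots, d_t} \quad
\frac{1}{2} (\mathbf{w} \cdot \mathbf{w})
+ \frac{\gamma}{2} \sum_{r=1}^{t} (\mathbf{w}_r \cdot \mathbf{w}_r)
+ C \sum_{r=1}^{t} \sum_{i \in T_r} \xi_i^r

\text{subject to} \quad
y_i \left[ (\mathbf{w} \cdot \Phi_z(\mathbf{x}_i)) + b \right] \ge 1 - \xi_i^r, \quad
\xi_i^r \ge 0,

\xi_i^r = (\mathbf{w}_r \cdot \Phi_{z_r}(\mathbf{x}_i)) + d_r, \quad
i \in T_r, \; r = 1, \dots, t
```

The first term is the capacity of the decision function, the second the capacity of the correcting functions, and the slack variables are no longer free: each is the value of the correcting function for its group.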

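[Figure: input vectors are mapped into a decision space (decision function separating Class 1 and Class -1) and into correcting spaces (correcting functions for Group1 and Group2, modeling the slack variable for group r).]

[Equation: SVM+ optimization problem, with terms labeled Decision Space and Correcting Space, and constraints]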

New learning formulation: SVM+MTL

Define decision function for each group as

Common decision function models the relatedness among groups

Correcting functions fine-tune the model for each group (task)
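
The definition of the per-group decision function is missing above. A sketch of SVM+MTL as it is commonly written, in assumed notation: t groups with index sets T_r, a common part (w, b) in the decision space, and a group-specific correcting part (w_r, d_r).

```latex
f_r(\mathbf{x}) = (\mathbf{w} \cdot \Phi_z(\mathbf{x})) + b
+ (\mathbf{w}_r \cdot \Phi_{z_r}(\mathbf{x})) + d_r,
\qquad r = 1, \dots, t

\min \quad
\frac{1}{2} (\mathbf{w} \cdot \mathbf{w})
+ \frac{\gamma}{2} \sum_{r=1}^{t} (\mathbf{w}_r \cdot \mathbf{w}_r)
+ C \sum_{r=1}^{t} \sum_{i \in T_r} \xi_i^r

\text{subject to} \quad
y_i^r \, f_r(\mathbf{x}_i^r) \ge 1 - \xi_i^r, \quad
\xi_i^r \ge 0, \quad
i \in T_r, \; r = 1, \dots, t
```

Unlike SVM+, the correcting functions here are part of the prediction, so the group label must be known at test time.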

[Equation: SVM+MTL optimization problem, with terms labeled Decision Space and Correcting Space, and constraints]

Empirical Validation

Different ways of using group info → different learning settings:

- which one yields better generalization?

- how is performance affected by sample size?

Empirical comparisons:

- synthetic data set


Generate x where each

The coefficient vectors of three tasks are specified as

For each task and each data vector,

Details of methods used:

- linear SVM classifier (single parameter C)

- SVM+, SVM+MTL classifier (3 parameters: linear kernel for decision space, RBF kernel for correcting space, and parameter γ)

- Independent validation set for model selection


Comparison results (average over 10 trials):

n ~ number of training samples per task

ave test error (%):

Note: relative performance depends on sample size

Note: SVM+ always better than SVM

SVM+MTL always better than mSVM


OUTLINE

Motivation for non-standard approaches

Alternative Learning Settings

Summary: Advantages/limitations of non-standard settings

Advantages + limitations of non-standard settings

Advantages

- make common sense

- follow methodological framework (VC-theory)

- yield better generalization (but not always)

Limitations

- need to formalize application requirements → need to understand application domain

- generally more complex learning formulations

- more difficult model selection

- few known empirical comparisons (to date)

SVM+ is a promising new technology for hard problems

Public-domain SVM software

- Main web page link http://www.kernel-machines.org
- LIBSVM software library http://www.csie.ntu.edu.tw/~cjlin/libsvm/
- SVM-Light software library http://svmlight.joachims.org/
- Non-standard SVM-based methodologies: Universum, SVM+, MTL http://www.ece.umn.edu/users/cherkass/predictive_learning/