Max-Margin Classification of Data with Absent Features

By Chechik, Heitz, Elidan, Abbeel and Koller, JMLR 2008. Presented by Chunping Wang, Machine Learning Group, Duke University, July 3, 2008.

Outline • Introduction • Standard SVM • Max-Margin Formulation for Missing Features • Three Algorithms • Experimental Results • Conclusions

Introduction (1) • Pattern of missing features: • due to measurement noise or corruption: existing but unknown • due to the inherent properties of the instances: non-existing • Example 1: Two subpopulations of instances (animals and buildings) with few overlapping features (body parts, architectural aspects). • Example 2: In a web-page task, one useful feature of a given page may be the most common topic of other sites that point to it; however, this particular page may have no such parents.

Introduction (2) • Common methods for handling missing features (assume the features exist but their values are unknown): • Single imputation: zeros, mean, kNN • Imputation by building probabilistic generative models • Proposed method (assume the features are structurally absent): each data instance resides in a lower-dimensional subspace of the feature space, determined by its own existing features. We try to maximize the worst-case margin of the separating hyperplane, while measuring the margin of each data instance in its own lower-dimensional subspace.

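Standard SVM (1) • Binary classification: real-valued predictors $x_i \in \mathbb{R}^d$, binary response $y_i \in \{-1, +1\}$, $i = 1, \dots, n$ • A classifier could be defined as $f(x) = \operatorname{sign}(w \cdot x + b)$, based on the linear function $w \cdot x + b$ • Parameters: $w \in \mathbb{R}^d$, $b \in \mathbb{R}$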

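Standard SVM (2) • Functional margin for each instance: $y_i (w \cdot x_i + b)$ • Geometric margin for each instance: $\rho_i = y_i (w \cdot x_i + b) / \|w\|$ • Geometric margin of a hyperplane: $\rho = \min_i \rho_i$ • SVM: maximize the geometric margin by fixing the functional margin to 1, i.e., $\min_{w, b, \xi} \frac{1}{2} \|w\|^2 + C \sum_i \xi_i$ subject to $y_i (w \cdot x_i + b) \ge 1 - \xi_i$, $\xi_i \ge 0$ • $\xi_i$'s: slack variables; $C$: cost • This is a quadratic program (QP)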
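The three margins named on this slide can be computed directly. The sketch below assumes the usual definitions (functional margin $y(w \cdot x + b)$, geometric margin equal to the functional margin divided by $\|w\|$); the function names and the toy data are made up for illustration.

```python
import math

def functional_margin(w, b, x, y):
    """y * (w . x + b): positive when (x, y) lies on the correct side."""
    return y * (sum(wj * xj for wj, xj in zip(w, x)) + b)

def geometric_margin(w, b, x, y):
    """Functional margin divided by ||w||: signed distance to the hyperplane."""
    return functional_margin(w, b, x, y) / math.sqrt(sum(wj * wj for wj in w))

def hyperplane_margin(w, b, X, Y):
    """Worst-case (smallest) geometric margin over the data set."""
    return min(geometric_margin(w, b, x, y) for x, y in zip(X, Y))

# Toy data, invented for this example.
w, b = [3.0, 4.0], -5.0
X = [[3.0, 4.0], [0.0, 0.0], [1.0, 3.0]]
Y = [1, -1, 1]
margins = [geometric_margin(w, b, x, y) for x, y in zip(X, Y)]
rho = hyperplane_margin(w, b, X, Y)
```

Rescaling `w` and `b` by the same positive factor changes the functional margins but leaves the geometric margins unchanged, which is why the functional margin can be fixed to 1.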

Max-Margin Formulation for Missing Features (1) • A 2-D case with missing data (figure: margin in the subspace vs. margin in the full feature space) • The margin of instances with missing features is underestimated when it is measured in the full feature space.

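Max-Margin Formulation for Missing Features (2) • Instance margin: $\rho_i(w) = y_i (w^{(i)} \cdot x_i + b) / \|w^{(i)}\|$, where $w^{(i)}$ keeps only the entries of $w$ for the features that exist in $x_i$ • Optimization problem: $\max_{w, b} \min_i \rho_i(w)$ • $\|w^{(i)}\|$ is instance dependent and thus cannot be taken out of the minimization • The objective is non-convex in $w$ • It is difficult to solve this optimization problem directly.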
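A small numeric sketch of the instance margin, assuming it is the geometric margin computed only over the features that exist in the instance (absent features are marked `None` here, which is a convention of this example). It is compared with the margin obtained by zero-filling and measuring in the full feature space; the function names and numbers are made up.

```python
import math

def instance_margin(w, b, x, y):
    """Margin of (x, y) measured in the subspace of the features present in x."""
    present = [j for j, v in enumerate(x) if v is not None]
    score = sum(w[j] * x[j] for j in present) + b
    norm = math.sqrt(sum(w[j] * w[j] for j in present))  # ||w^(i)||, assumed non-zero
    return y * score / norm

def zero_filled_margin(w, b, x, y):
    """Margin in the full feature space after setting absent features to zero."""
    filled = [0.0 if v is None else v for v in x]
    score = sum(wj * xj for wj, xj in zip(w, filled)) + b
    return y * score / math.sqrt(sum(wj * wj for wj in w))

w, b = [3.0, 4.0], 0.0
x, y = [2.0, None], 1          # second feature is structurally absent
sub = instance_margin(w, b, x, y)      # 6 / 3
full = zero_filled_margin(w, b, x, y)  # 6 / 5
```

Both versions share the same numerator, so the full-space margin is smaller exactly because it divides by the larger norm $\|w\|$ instead of $\|w^{(i)}\|$.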

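Three Algorithms (1) • A convex formulation for the linearly separable case • Introduce a lower bound $\gamma$ for the instance margins: $\max_{w, b, \gamma} \gamma$ subject to $y_i (w^{(i)} \cdot x_i + b) \ge \gamma \|w^{(i)}\|$ for all $i$, where $w^{(i)}$ is $w$ restricted to the features that exist in $x_i$ • For a given $\gamma$, this is a second order cone program (SOCP), which is convex and can be solved efficiently • To find the optimal $\gamma$, do a bisection search over $\gamma$ • Unfortunately, extending it to the non-separable case is difficult.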
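The bisection search can be sketched independently of the SOCP solver. In the sketch below, `is_feasible` is a hypothetical callback standing in for "the SOCP with this margin bound has a solution"; the search assumes feasibility is monotone (every bound below the optimum is feasible, every bound above it is not). The toy oracle in the usage lines is invented so the example runs without a solver.

```python
def bisect_margin(is_feasible, lo, hi, tol=1e-6):
    """Return the largest feasible margin bound in [lo, hi], to within tol.

    is_feasible(bound) -> bool is a placeholder for an SOCP feasibility check.
    """
    if not is_feasible(lo):
        raise ValueError("lower end of the search interval must be feasible")
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if is_feasible(mid):
            lo = mid   # the optimum is at mid or above
        else:
            hi = mid   # the optimum is below mid
    return lo

# Toy oracle: pretend every bound up to 0.75 is achievable.
best = bisect_margin(lambda bound: bound <= 0.75, lo=0.0, hi=2.0)
```

Each iteration halves the interval, so the number of SOCP solves grows only logarithmically with the requested precision.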

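Three Algorithms (2) • Average norm: a convex approximation for the non-separable case • Get rid of the instance dependence by replacing the per-instance norms with a single average norm shared by all instances • The non-separable case then follows as in the standard SVM, with slack variables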

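Three Algorithms (3) • Geometric margin: an exact non-convex approach for the non-separable case • Define $s_i = \|w^{(i)}\| / \|w\|$, where $w^{(i)}$ is $w$ restricted to the features that exist in $x_i$ • Non-separable case: $\min_{w, b, \xi} \frac{1}{2} \|w\|^2 + C \sum_i \xi_i$ subject to $\frac{1}{s_i} y_i (w^{(i)} \cdot x_i + b) \ge 1 - \xi_i$, $\xi_i \ge 0$ • This is a QP for a given set of $s_i$'s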

Three Algorithms (4) • Geometric margin: the exact non-convex approach for the non-separable case • [Pseudo-code figure] • Convergence is not always guaranteed. • Cross validation is used to choose an early stopping point.
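Since the slide's pseudo-code did not survive, here is one plausible reading of the iteration as a minimal sketch, not the authors' implementation. It assumes the scaling terms are $s_i = \|w^{(i)}\| / \|w\|$, that they start at 1, and that each round solves the problem with the $s_i$ held fixed and then recomputes them. A plain subgradient descent on the hinge-loss objective is swapped in for a QP solver to keep the example dependency-free, and a fixed round count stands in for the cross-validated early stopping. All names and the toy data are hypothetical.

```python
import math

def present(x):
    """Indices of the features that exist in x (None marks an absent feature)."""
    return [j for j, v in enumerate(x) if v is not None]

def score(w, b, x):
    """Linear score computed over the existing features only."""
    return sum(w[j] * x[j] for j in present(x)) + b

def solve_fixed_s(X, Y, s, C=1.0, steps=300, lr=0.01):
    """Subgradient descent on 0.5*||w||^2 + C*sum_i max(0, 1 - y_i*score_i/s_i).
    Stand-in for the QP solve; s is held fixed throughout."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(steps):
        gw, gb = list(w), 0.0  # gradient of the regulariser 0.5*||w||^2
        for x, y, si in zip(X, Y, s):
            if y * score(w, b, x) / si < 1.0:  # scaled margin constraint violated
                for j in present(x):
                    gw[j] -= C * y * x[j] / si
                gb -= C * y / si
        w = [wj - lr * gj for wj, gj in zip(w, gw)]
        b -= lr * gb
    return w, b

def update_s(w, X, eps=1e-8):
    """Recompute s_i = ||w^(i)|| / ||w||, floored at eps to avoid dividing by zero."""
    full = math.sqrt(sum(wj * wj for wj in w)) or 1.0
    return [max(math.sqrt(sum(w[j] * w[j] for j in present(x))) / full, eps) for x in X]

def fit_geometric_margin(X, Y, rounds=3):
    s = [1.0] * len(X)        # assumed initialisation
    for _ in range(rounds):   # fixed number of rounds instead of early stopping
        w, b = solve_fixed_s(X, Y, s)
        s = update_s(w, X)
    return w, b, s

# Toy data: each instance lacks at most one of three features.
X = [[1.0, 2.0, None], [-1.0, -2.0, None], [None, 1.5, 1.0], [None, -1.5, -1.0],
     [2.0, None, 1.5], [-2.0, None, -1.5], [1.0, 1.0, 1.0], [-1.0, -1.0, -1.0]]
Y = [1, -1, 1, -1, 1, -1, 1, -1]
w, b, s = fit_geometric_margin(X, Y)
```

Because the $s_i$ depend on the solution they are used to compute, successive rounds need not settle, which is consistent with the slide's remark that convergence is not guaranteed.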

Experimental Results (1) • Methods compared: • Zero: missing values were set to zero. • Mean: missing values were set to the average value of the feature over all data. • Flag: additional features ("flags") were added, explicitly denoting whether a feature is missing for a given instance. • kNN: missing features were set to the mean value obtained from the K nearest neighbor instances. • EM: a Gaussian mixture model (GMM) is learned by iterating between (1) learning a GMM of the filled data and (2) re-filling missing values using cluster means, weighted by the posterior probability that a cluster generated the sample. • Averaged norm (avg |w|): the proposed approximate convex approach. • Geometric margin (geom): the proposed exact non-convex approach.
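The three simplest baselines (Zero, Mean, Flag) can be sketched in a few lines. Absent values are represented as `None`, which is a convention of this example. For the Flag variant the slide does not say what value fills the missing entry, so zero is assumed here; the function names and data are made up.

```python
def impute_zero(X):
    """Zero: replace every missing value with 0."""
    return [[0.0 if v is None else v for v in row] for row in X]

def impute_mean(X):
    """Mean: replace a missing value with the feature's mean over observed values."""
    means = []
    for j in range(len(X[0])):
        observed = [row[j] for row in X if row[j] is not None]
        means.append(sum(observed) / len(observed) if observed else 0.0)
    return [[means[j] if v is None else v for j, v in enumerate(row)] for row in X]

def add_flags(X):
    """Flag: zero-fill (an assumption) and append one 0/1 missingness flag per feature."""
    return [[0.0 if v is None else v for v in row] +
            [1.0 if v is None else 0.0 for v in row] for row in X]

X = [[1.0, None], [3.0, 4.0], [None, 2.0]]
zero_filled = impute_zero(X)
mean_filled = impute_mean(X)
flagged = add_flags(X)
```

All three produce complete vectors that a standard SVM can consume, in contrast to the proposed methods, which leave the absent entries out of the margin computation.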

Experimental Results (2) • UCI data sets (missing at random): remove 90% of the features of each sample randomly • Digits 5 & 6 from MNIST: remove a patch covering 25% of the pixels, with the location of the patch uniformly sampled
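Both removal schemes are easy to reproduce in outline. The sketch below marks removed entries as `None`. The patch is assumed to be square, which the slide does not state, and the function names are hypothetical.

```python
import math
import random

def drop_random_features(x, frac, rng):
    """Remove a fraction `frac` of the features of one sample, chosen uniformly."""
    n_drop = round(frac * len(x))
    dropped = set(rng.sample(range(len(x)), n_drop))
    return [None if j in dropped else v for j, v in enumerate(x)]

def drop_square_patch(image, frac, rng):
    """Remove a square patch covering about `frac` of the pixels (square shape assumed)."""
    h, w = len(image), len(image[0])
    side = round(math.sqrt(frac * h * w))
    top = rng.randint(0, h - side)    # randint is inclusive on both ends
    left = rng.randint(0, w - side)
    return [[None if top <= r < top + side and left <= c < left + side else v
             for c, v in enumerate(row)] for r, row in enumerate(image)]

rng = random.Random(0)
sample = drop_random_features([float(j) for j in range(10)], 0.9, rng)
digit = drop_square_patch([[0.0] * 28 for _ in range(28)], 0.25, rng)
```

For a 28-by-28 image a 25% patch is 14 pixels on a side, so 196 of the 784 pixels are removed.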

Experimental Results (3) • Visual object recognition • Task: to determine whether an automobile is present in a given image • Feature pipeline (from the slide's diagram): local edge information and a generative model give the likelihood of patches to match each of 19 landmarks; a threshold is set to select up to 10 candidate patches (21-by-21 pixels) per landmark; PCA keeps the first 10 principal components of each patch; these are concatenated into a feature vector (up to 1900 features) • If the number of candidates for a given landmark is less than ten, we consider the rest to be structurally absent
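The layout of the 1900-dimensional vector (19 landmarks, up to 10 candidates each, 10 principal components per candidate) can be sketched as follows. The input format, the function name, and the use of `None` for structurally absent entries are assumptions of this example; the PCA step itself is taken as already done.

```python
N_LANDMARKS, MAX_CANDIDATES, N_COMPONENTS = 19, 10, 10

def build_feature_vector(candidates):
    """candidates[l] lists the PCA descriptors (length N_COMPONENTS each) found for
    landmark l; slots without a candidate are filled with None (structurally absent)."""
    if len(candidates) != N_LANDMARKS:
        raise ValueError("expected one candidate list per landmark")
    features = []
    for descriptors in candidates:
        for k in range(MAX_CANDIDATES):
            if k < len(descriptors):
                features.extend(descriptors[k])
            else:
                features.extend([None] * N_COMPONENTS)
    return features

# Made-up input: landmark l has min(l, 10) candidates with dummy descriptors.
candidates = [[[0.1 * k] * N_COMPONENTS for k in range(min(l, MAX_CANDIDATES))]
              for l in range(N_LANDMARKS)]
vec = build_feature_vector(candidates)
n_absent = sum(v is None for v in vec)
```

Keeping a fixed slot per candidate means every image maps to a vector of the same length, while the `None` entries record which slots never existed for that image.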

Experimental Results (4) An example image: the best 5 candidates matched to the front windshield landmark

Experimental Results (5)

Experimental Results (6) • Metabolic pathway reconstruction • Figure: a fragment of the full metabolic pathway network (arrows: chemical reactions; purple boxed names: enzymes)

Experimental Results (7) • Three types of neighborhood relations between enzyme pairs: • Linear chains (ARO7, PHA2) • Forks (TRP2, ARO7): same input, different outputs • Funnels (ARO9, PHA2): same output, different inputs • One feature vector represents an enzyme and concatenates three blocks: features for the linear chain neighbor, features for the fork neighbor, and features for the funnel neighbor • A feature vector will have structurally missing entries if the enzyme does not have all types of neighbors, e.g., PHA2 does not have a neighbor of type fork.

Experimental Results (8) Task: to identify if a candidate enzyme is in the right “neighborhood”. Data creation: • Positive samples: from the reactions with known enzymes (in the right “neighborhood”); • Negative samples: for each positive sample, replace the true enzyme with a random impostor, and calculate the features in such a wrong “neighborhood”. The impostor was uniformly chosen from the set of other enzymes.
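The negative-sample step can be sketched as below, assuming a positive sample is stored as a (reaction, enzyme) pair. That representation, the reaction label, and the function names are hypothetical, and the feature computation for the resulting wrong neighborhood is not shown.

```python
import random

def sample_impostor(true_enzyme, enzymes, rng):
    """Pick an impostor uniformly from the set of other enzymes."""
    others = [e for e in enzymes if e != true_enzyme]
    return rng.choice(others)

def make_negative(positive, enzymes, rng):
    """Replace the true enzyme of a positive (reaction, enzyme) pair with an impostor."""
    reaction, enzyme = positive
    return (reaction, sample_impostor(enzyme, enzymes, rng))

rng = random.Random(0)
enzymes = ["ARO7", "PHA2", "TRP2", "ARO9"]   # enzyme names taken from the slides
positive = ("reaction_1", "PHA2")            # hypothetical reaction label
negative = make_negative(positive, enzymes, rng)
```

Generating one negative per positive keeps the two classes balanced.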

Experimental Results (9)

Conclusions • The authors presented a modified SVM model for max-margin training of classifiers in the presence of missing features, where the pattern of missing features is an inherent part of the domain. • The authors directly classified instances by skipping the non-existing features, rather than filling them with hypothetical values. • The proposed model was competitive with a range of single imputation approaches when tested in missing-at-random (MAR) settings. • One variant (geometric margin) significantly outperformed other methods in two real problems with non-existing features.
