
Presentation Transcript


  1. Optimal Reverse Prediction: A Unified Perspective on Supervised, Unsupervised and Semi-supervised Learning. Linli Xu, Martha White and Dale Schuurmans. ICML 2009, Best Overall Paper Honorable Mention. Discussion led by Chunping Wang, ECE, Duke University, October 23, 2009

  2. Outline
  • Motivations
  • Preliminary Foundations
  • Reverse Supervised Least Squares
  • Relationship between Unsupervised Least Squares and PCA, K-means, and Normalized Graph-cut
  • Semi-supervised Least Squares
  • Experiments
  • Conclusions

  3. Motivations
  • Lack of a foundational connection between supervised and unsupervised learning
  • Supervised learning: minimizing prediction error
  • Unsupervised learning: re-representing the input data
  • For semi-supervised learning, one needs to consider both together
  • The semi-supervised learning literature relies on intuitions: the “cluster assumption” and the “manifold assumption”
  • A unification demonstrated in this paper leads to a novel semi-supervised principle

  4. Preliminary Foundations: Forward Supervised Least Squares
  • Data: an input matrix X (t instances, n features) and an output matrix Y (t instances, k responses)
  • Regression: Y is real-valued; classification: Y is a 0/1 class-indicator matrix with one 1 per row
  • Assumption: X and Y are full rank
  • Problem: find parameters W minimizing the least squares loss for the linear model Y ≈ XW
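To make the forward problem concrete, a minimal numpy sketch of the ordinary least squares solution with the dimensions named above (the variable names are illustrative, not from the paper):

```python
import numpy as np

# Toy data with the slide's dimensions: X is t x n, Y is t x k.
rng = np.random.default_rng(0)
t, n, k = 100, 5, 2
X = rng.standard_normal((t, n))
Y = rng.standard_normal((t, k))

# Forward least squares: W = argmin_W ||XW - Y||_F^2, i.e. the normal
# equations (X'X) W = X'Y; lstsq solves them without forming the inverse.
W, *_ = np.linalg.lstsq(X, Y, rcond=None)

forward_loss = np.linalg.norm(X @ W - Y) ** 2
print(W.shape, forward_loss)   # (5, 2) parameter matrix and its squared loss
```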

  5. Preliminary Foundations
  • Linear
  • Ridge regularization
  • Kernelization
  • Instance weighting
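For orientation, the standard forms of these variants, assumed here to match the slide's list (λ is the ridge parameter, K = XX' the kernel matrix, Λ a diagonal matrix of instance weights):

```latex
\text{Linear: } \min_W \|XW-Y\|_F^2, \qquad W^* = (X^\top X)^{-1}X^\top Y \\
\text{Ridge: } \min_W \|XW-Y\|_F^2 + \lambda\|W\|_F^2, \qquad W^* = (X^\top X + \lambda I)^{-1}X^\top Y \\
\text{Kernelized: } W^* = X^\top (K + \lambda I)^{-1} Y \ \text{ with } K = XX^\top \\
\text{Instance weighting: } \min_W \operatorname{tr}\!\big[(XW-Y)^\top\Lambda(XW-Y)\big], \qquad W^* = (X^\top\Lambda X)^{-1}X^\top\Lambda Y
```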

  6. Preliminary Foundations
  • Principal Components Analysis - dimensionality reduction
  • k-means – clustering
  • Normalized Graph-cut – clustering: a weighted undirected graph whose nodes are connected by edges with weights given by an affinity matrix. Graph partition problem: find a partition minimizing the total weight of edges connecting nodes in distinct subsets.

  7. Preliminary Foundations
  • Normalized Graph-cut – clustering
  • Partition indicator matrix Z
  • Weighted degree matrix
  • Total cut and normalized cut (constraint and objective; from Xing & Jordan, 2003)
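For reference, the usual definitions in the Xing & Jordan (2003) form, assumed here to match the slide (K is the affinity matrix, Z the partition indicator, D the weighted degree matrix):

```latex
D = \operatorname{diag}(K\mathbf{1}), \qquad
\text{cut}(Z) = \sum_{c=1}^{k} z_c^\top (D - K)\, z_c, \qquad
\text{Ncut}(Z) = \sum_{c=1}^{k} \frac{z_c^\top (D - K)\, z_c}{z_c^\top D\, z_c}
= k - \operatorname{tr}\!\big[(Z^\top D Z)^{-1} Z^\top K Z\big]
```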

  8. First contribution. In the literature, supervised methods (Least Squares Regression, Least Squares Classification) and unsupervised methods (Principal Components Analysis, K-means, Normalized Graph-cut) are treated separately.

  9. First contribution. This paper: a unification connecting the supervised methods (Least Squares Regression, Least Squares Classification) with the unsupervised methods (Principal Components Analysis, K-means, Normalized Graph-cut).

  10. Reverse Supervised Least Squares
  • Traditional forward least squares: predict the outputs from the inputs
  • Reverse least squares: predict the inputs from the outputs
  Given reverse solutions U, the corresponding forward solutions W can be recovered exactly.
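A minimal numpy sketch of this equivalence, assuming (consistently with the slides) that the reverse model predicts the inputs as X ≈ YU; variable names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
t, n, k = 200, 6, 3
X = rng.standard_normal((t, n))
Y = rng.standard_normal((t, k))

# Forward problem:  min_W ||X W - Y||^2   ->  W = (X'X)^{-1} X'Y
W_fwd = np.linalg.solve(X.T @ X, X.T @ Y)

# Reverse problem:  min_U ||X - Y U||^2   ->  U = (Y'Y)^{-1} Y'X
U_rev = np.linalg.solve(Y.T @ Y, Y.T @ X)

# Recovery: since U'(Y'Y) = X'Y, the forward solution follows from U
# together with the second-moment matrices of the data.
W_rec = np.linalg.solve(X.T @ X, U_rev.T @ (Y.T @ Y))

print(np.allclose(W_fwd, W_rec))  # True: the forward solution is recovered exactly
```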

  11. Reverse Supervised Least Squares
  For each variant (ridge regularization, kernelization, instance weighting), the reverse problem is posed and the corresponding forward solution is recovered.

  12. Reverse Supervised Least Squares
  For supervised learning with least squares loss:
  • forward and reverse perspectives are equivalent
  • each can be recovered exactly from the other
  • the forward and reverse losses are not identical since they are measured in different units – it is not principled to combine them directly!

  13. Unsupervised Least Squares
  Unsupervised learning: no training labels Y are given. Principle: optimize over guessed labels Z.
  • Forward: for any W, we can choose Z = XW to achieve zero loss, so it only gives trivial solutions. It does not work! (See the small illustration below.)
  • Reverse: it gives non-trivial solutions.
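A two-line illustration of the degeneracy noted above: in the forward unsupervised problem the guessed labels can simply copy the predictions, driving the loss to zero for any W.

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.standard_normal((50, 4))
W = rng.standard_normal((4, 2))      # any W at all
Z = X @ W                            # choose the guessed labels to match the predictions
print(np.linalg.norm(X @ W - Z))     # 0.0: the forward unsupervised loss is trivially zero
```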

  14. Unsupervised Least Squares – PCA
  Proposition 1: Unconstrained reverse prediction is equivalent to principal components analysis.
  This connection has been made in Jong & Kotz, 1999, and the authors extend it to the kernelized case.
  Corollary 1: Kernelized reverse prediction is equivalent to kernel principal components analysis.
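A small numpy check of one reading of Proposition 1: the unconstrained reverse problem min over Z, U of ||X - ZU||^2 with a rank-k factorization is a best rank-k approximation, so its optimum is attained by the truncated SVD, i.e. by the top-k principal directions (the question of centering X first is left aside here).

```python
import numpy as np

rng = np.random.default_rng(2)
t, n, k = 150, 10, 3
X = rng.standard_normal((t, n))

# Best rank-k reconstruction via the truncated SVD: Z = U_k S_k, U = V_k'.
Usvd, s, Vt = np.linalg.svd(X, full_matrices=False)
Z = Usvd[:, :k] * s[:k]          # t x k "guessed labels" (principal components)
U = Vt[:k, :]                    # k x n reverse weights (principal directions)

reverse_loss = np.linalg.norm(X - Z @ U) ** 2
# Eckart-Young: no rank-k factorization can beat the discarded singular values.
print(np.isclose(reverse_loss, np.sum(s[k:] ** 2)))   # True
```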

  15. Unsupervised Least Squares – PCA
  Proposition 1: Unconstrained reverse prediction is equivalent to principal components analysis. Proof.

  16. Unsupervised Least Squares – PCA
  Proposition 1: Unconstrained reverse prediction is equivalent to principal components analysis.
  Proof (continued): the solution for Z is not unique; recall the reverse problem stated earlier.

  17. Unsupervised Least Squares – PCA
  Proposition 1: Unconstrained reverse prediction is equivalent to principal components analysis.
  Proof (continued): consider the SVD of Z, substitute it into the objective, and read off the solution.

  18. Unsupervised Least Squares – k-means
  Proposition 2: Constrained reverse prediction is equivalent to k-means clustering.
  Corollary 2: Constrained kernelized reverse prediction is equivalent to kernel k-means.
  The connection between PCA and k-means clustering has been made in Ding & He, 2004, but the authors show the connection of both to supervised (reverse) least squares.

  19. Unsupervised Least Squares – k-means
  Proposition 2: Constrained reverse prediction is equivalent to k-means clustering.
  Proof: consider an equivalent problem and the difference between the two objectives. Here Z'Z is a diagonal matrix counting the data in each class, and Z'X is a matrix whose rows are the sums of the data in each class.

  20. Unsupervised Least Squares – k-means
  Proposition 2: Constrained reverse prediction is equivalent to k-means clustering.
  Proof (continued): the inner solution U encodes the class means.

  21. Unsupervised Least Squares – k-means
  Proposition 2: Constrained reverse prediction is equivalent to k-means clustering.
  Proof (continued): therefore the constrained reverse prediction objective equals the k-means objective.
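A numpy sketch of the identity behind Proposition 2: when Z is constrained to be a 0/1 class-indicator matrix, the inner solution U = (Z'Z)^{-1} Z'X stacks the cluster means, so the constrained reverse loss equals the k-means distortion. The assignment below is a fixed random one, used only to exercise the identity.

```python
import numpy as np

rng = np.random.default_rng(3)
t, n, k = 120, 4, 3
X = rng.standard_normal((t, n))

# A (fixed, arbitrary) hard assignment as a 0/1 indicator matrix Z.
labels = rng.integers(0, k, size=t)
Z = np.eye(k)[labels]                      # t x k, one 1 per row

# Inner minimization over U: U = (Z'Z)^{-1} Z'X, i.e. row c of U is the mean of cluster c.
U = np.linalg.solve(Z.T @ Z, Z.T @ X)

reverse_loss = np.linalg.norm(X - Z @ U) ** 2
kmeans_distortion = sum(np.linalg.norm(X[labels == c] - U[c]) ** 2 for c in range(k))
print(np.isclose(reverse_loss, kmeans_distortion))     # True
```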

  22. Unsupervised Least Squares – Norm-cut
  Proposition 3: For a doubly nonnegative matrix K and weighting Λ, weighted reverse prediction is equivalent to normalized graph-cut.
  Proof: for any Z, solve the inner minimization over U in closed form and substitute it to obtain a reduced objective in Z alone.
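Assuming the weighted reverse loss has the instance-weighted form tr[(X - ZU)'Λ(X - ZU)] with Λ = diag(K1) (an assumption about notation, not a quote from the paper), the inner minimization and the reduced objective would be:

```latex
U^*(Z) = (Z^\top \Lambda Z)^{-1} Z^\top \Lambda X, \qquad
\min_U \operatorname{tr}\!\big[(X - ZU)^\top \Lambda (X - ZU)\big]
= \operatorname{tr}(X^\top \Lambda X)
- \operatorname{tr}\!\big[(Z^\top \Lambda Z)^{-1} Z^\top \Lambda X X^\top \Lambda Z\big]
```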

  23. Unsupervised Least Squares – Norm-cut
  Proposition 3: For a doubly nonnegative matrix K and weighting Λ, weighted reverse prediction is equivalent to normalized graph-cut.
  Proof (continued): recall the normalized cut (from Xing & Jordan, 2003). Since K is doubly nonnegative, it can serve as an affinity matrix, and the reduced objective is equivalent to normalized graph-cut.

  24. Unsupervised Least Squares – Norm-cut
  With a specific K, we can relate normalized graph-cut to reverse least squares.
  Corollary 3: The weighted least squares problem is equivalent to normalized graph-cut on K = XX' provided XX' is doubly nonnegative.

  25. Second contribution: Reverse Prediction. Figure (taken from Xu's slides): reverse prediction links supervised least squares learning with Principal Components Analysis, K-means and Normalized Graph-cut.

  26. Second contribution: Reverse Prediction. Figure (taken from Xu's slides): the same diagram with a new semi-supervised component added, connecting the supervised and unsupervised methods.

  27–31. Semi-supervised Least Squares
  A principled approach: reverse loss decomposition. Figure (taken from Xu's slides, built up over several slides): the supervised reverse losses and the unsupervised reverse losses.

  32. Semi-supervised Least Squares
  Proposition 4: For any X, Y, and U, the supervised reverse loss decomposes into the unsupervised reverse loss plus a squared distance term.
  The unsupervised loss depends only on the input data X; the squared distance depends on both X and Y.
  Note: we cannot compute the true supervised loss since we do not have all the labels Y. We may estimate it using only labeled data, or also using auxiliary unlabeled data.
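Reading the three labels, the statement is the Pythagorean identity checked below, where Z_hat denotes the unconstrained minimizer of the unsupervised reverse loss for the given U; the normalization of each term is assumed, and the paper may scale the terms differently.

```python
import numpy as np

rng = np.random.default_rng(4)
t, n, k = 80, 7, 3
X = rng.standard_normal((t, n))
Y = rng.standard_normal((t, k))
U = rng.standard_normal((k, n))

supervised = np.linalg.norm(X - Y @ U) ** 2                 # supervised reverse loss
Z_hat = X @ U.T @ np.linalg.inv(U @ U.T)                    # argmin_Z ||X - Z U||^2
unsupervised = np.linalg.norm(X - Z_hat @ U) ** 2           # unsupervised reverse loss
distance = np.linalg.norm(Z_hat @ U - Y @ U) ** 2           # squared distance term

print(np.isclose(supervised, unsupervised + distance))      # True for any X, Y, U
```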

  33. Semi-supervised Least Squares
  Corollary 4: For any U, the supervised loss estimate decomposes into an unsupervised loss estimate plus a squared distance estimate.
  Labeled data are scarce, but plenty of unlabeled data are available. The variance of the supervised loss estimate is strictly reduced by introducing the second term, using the unlabeled data to obtain a better unbiased estimate of the unsupervised loss.

  34. Semi-supervised Least Squares
  A naive approach: combine the reverse loss on the labeled data with the reverse loss on the unlabeled data (a sketch of the alternating scheme follows below).
  Advantages:
  • The authors combine supervised and unsupervised reverse losses, whereas previous approaches combine an unsupervised (reverse) loss with a supervised (forward) loss, which are not in the same units.
  • Compared to the principled approach, it admits more straightforward optimization procedures (alternating between U and Z).
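A sketch of the naive combined objective and the alternating scheme, with an unconstrained Z and an equal weighting of the two terms chosen for illustration (the paper's exact weighting and any constraints on Z are not reproduced here):

```python
import numpy as np

def semi_supervised_reverse(XL, YL, XU, iters=50):
    """Alternate between Z (guessed labels for the unlabeled data) and U to
    minimize ||XL - YL U||^2 + ||XU - Z U||^2 (a naive, equally weighted sum)."""
    # Initialize U from the supervised reverse solution on the labeled data.
    U = np.linalg.solve(YL.T @ YL, YL.T @ XL)
    for _ in range(iters):
        # Z-step: with U fixed, the unconstrained minimizer projects XU onto U's row space.
        Z = XU @ U.T @ np.linalg.inv(U @ U.T)
        # U-step: with Z fixed, stack labeled and unlabeled parts into one least squares problem.
        A = np.vstack([YL, Z])
        B = np.vstack([XL, XU])
        U = np.linalg.solve(A.T @ A, A.T @ B)
    return U

# Toy usage
rng = np.random.default_rng(5)
XL, YL = rng.standard_normal((20, 6)), rng.standard_normal((20, 2))
XU = rng.standard_normal((200, 6))
print(semi_supervised_reverse(XL, YL, XU).shape)   # (2, 6)
```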

  35. Regression Experiments: Least Squares + PCA
  Basic formulation:
  • The two terms are not jointly convex, so there is no closed-form solution
  • Learning method: alternating optimization, with an initial U obtained from the supervised setting
  • Recovered forward solution
  • Testing: given a new x, predict its response with the recovered forward solution
  • Can be kernelized

  36. Regression Experiments: Least Squares + PCA
  Table (taken from Xu's paper): forward root mean squared error (mean ± standard deviation over 10 random splits of the data). The values of (k, n; T_L, T_U) are indicated for each data set.

  37. Classification Experiments: Least Squares + k-means and Least Squares + Norm-cut
  • Recovered forward solution
  • Testing: given a new x, compute its responses and predict the class with the maximum response
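A hedged sketch of this test rule, where W denotes the recovered forward solution from the previous slides (this is one reading of "predict max response", not a quote of the paper's procedure):

```python
import numpy as np

def predict_class(x_new, W):
    """W is the recovered forward solution (n x k). Compute the k responses
    for a new input x and predict the class with the maximum response."""
    responses = x_new @ W          # length-k vector of class responses
    return int(np.argmax(responses))
```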

  38. Classification Experiments: Least Squares + k-means
  Table (taken from Xu's paper): forward root mean squared error (mean ± standard deviation over 10 random splits of the data). The values of (k, n; T_L, T_U) are indicated for each data set.

  39. Conclusions
  Two main contributions:
  • A unified framework based on the reverse least squares loss is proposed for several existing supervised and unsupervised algorithms.
  • In the unified framework, a novel semi-supervised principle is proposed.
