Presentation Transcript


  1. Presenter: Yanlin Wu Advisor: Professor Geman Date: 10/17/2006

  2. Is cross-validation valid for small-sample classification? Ulisses M. Braga-Neto and Edward R. Dougherty

  3. Background • What is the classification problem? • How do we evaluate the accuracy of a classifier, i.e., measure the error of a classification model? • Different error-measurement methods • Points to keep in mind…

  4. Classification problem • In statistical pattern recognition, we observe a feature vector X ∈ R^d and a label Y, which takes numerical values representing the different classes; for a two-class problem, Y ∈ {0,1} • A classifier is a function g: R^d → {0,1} • The error rate of g is ε[g] = P(g(X) ≠ Y) • The Bayes classifier: g*(x) = 1 if P(Y = 1 | X = x) > 1/2 and g*(x) = 0 otherwise. • For any g, ε[g*] ≤ ε[g], so that g* is the optimal classifier.

  5. Training data • The feature-label distribution F is unknown – we use training data Sn = {(X1,Y1), …, (Xn,Yn)} to design a classifier • A classification rule is a mapping Ψn : [R^d × {0,1}]^n → {classifiers g: R^d → {0,1}} • The classification rule maps the training data Sn into the designed classifier gn = Ψn(Sn) • The true error of a designed classifier is its error rate given the fixed training dataset: εn = EF[ |Y − gn(X)| ] = P(gn(X) ≠ Y | Sn), where EF indicates expectation with respect to F (with Sn held fixed)

  6. Training data (continued) • The expected error rate over the data is given by E[εn] = EFn[εn], where Fn is the joint distribution of the data Sn. • It is also called the unconditional error of the classification rule.

  7. Question: How can we state a measure of the true error of a model, since we don’t have access to the universe of observations to test it on – that is, we don’t know F? Answer: Error estimation methods have been developed.

  8. Error estimation techniques • For all methods, the final model M is built based on all n observations, and then these n observations are used again to estimate the error of the model. • Types: • Re-substitution • Holdout method • Cross-validation • Bootstrap

  9. Re-substitution • Re-use the same training sample to measure error • This estimate tends to be biased low and can be made arbitrarily close to zero by overfitting the model, since the same data are reused to measure error.
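A minimal sketch of the resubstitution estimate, using scikit-learn's LDA as the classifier; the synthetic data and the choice of classifier are illustrative assumptions, not taken from the slides:

    import numpy as np
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    def resubstitution_error(clf, X, y):
        # Fit on the full sample, then measure error on that same sample.
        clf.fit(X, y)
        return np.mean(clf.predict(X) != y)

    # Illustrative synthetic two-class data (assumption, not the paper's model).
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 1, (10, 2)), rng.normal(1, 1, (10, 2))])
    y = np.array([0] * 10 + [1] * 10)
    print(resubstitution_error(LinearDiscriminantAnalysis(), X, y))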

  10. Holdout Method • For large samples, randomly hold out a subset St of the data as test data, design the classifier on the remaining data Sn \ St, and estimate its error by applying the classifier to St. • This is an unbiased estimator of the true error of the classifier designed on Sn \ St, with respect to expectation over the test set St

  11. Holdout Method Comments • This estimate can be slightly biased high because not all n observations are used to build the classifier; the bias tends to decrease as n increases. • The choice of what percentage of the n observations goes into the test set is important, and that choice is itself affected by n • The holdout method can be run multiple times, with the accuracy estimates from all the runs averaged. • Impractical with small samples
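A sketch of the holdout estimate; the 70/30 split fraction and the LDA classifier are assumptions for illustration:

    import numpy as np
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
    from sklearn.model_selection import train_test_split

    def holdout_error(clf, X, y, test_frac=0.3, seed=0):
        # Design the classifier on the training portion,
        # then estimate its error on the held-out portion.
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=test_frac, stratify=y, random_state=seed)
        clf.fit(X_tr, y_tr)
        return np.mean(clf.predict(X_te) != y_te)

    # usage: holdout_error(LinearDiscriminantAnalysis(), X, y)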

  12. Cross-Validation • Algorithm: • Split the data into k mutually exclusive subsets (folds), then build the model on k − 1 of them and measure error on the remaining one. • Each subset acts as the test set exactly once • The error estimate is the average of these k error measures. • When k = n, it is called “leave-one-out cross-validation”
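A sketch of k-fold cross-validation written out fold by fold so each subset's role is visible; the fold count and classifier are illustrative assumptions:

    import numpy as np
    from sklearn.model_selection import KFold
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    def cv_error(clf, X, y, k=5, seed=0):
        errors = []
        for train_idx, test_idx in KFold(n_splits=k, shuffle=True,
                                         random_state=seed).split(X):
            clf.fit(X[train_idx], y[train_idx])   # design on k-1 folds
            errors.append(np.mean(clf.predict(X[test_idx]) != y[test_idx]))
        return np.mean(errors)                    # average of the k fold errors

    # k = n gives leave-one-out cross-validation:
    # loo_error = cv_error(LinearDiscriminantAnalysis(), X, y, k=len(y))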

  13. Cross-Validation (continued) • Stratified cross-validation: the classes are represented in each fold in the same proportion as in the original data – there is evidence that this improves the estimator • The k-fold cross-validation estimator is unbiased as an estimator of E[εn−n/k], the expected error of the rule applied to samples of size n − n/k • The leave-one-out estimator is unbiased for E[εn−1] and hence nearly unbiased as an estimator of E[εn]

  14. Cross-Validation Comments • May be biased high, for the same reason as the holdout method • Often used when n is small, in which case the holdout method can become even more biased. • Very computationally intensive, especially for large k and n.

  15. Bootstrap Method • Based on the notion of an ‘empirical distribution’ F*, which puts mass 1/n on each of the n data points • A bootstrap sample Sn* from F* consists of n equally likely draws with replacement from the original data Sn • The probability that any given data point will not appear in Sn* is (1 − 1/n)^n ≈ e^−1 ≈ 0.368 when n >> 1 • A bootstrap sample of size n therefore contains on average (1 − e^−1)n ≈ 0.632n of the original data points.

  16. Bootstrap Method (continued) • Bootstrap zero estimator: design the classifier on the bootstrap sample Sn* and count errors only on the points of Sn that do not appear in Sn*, taking the expectation EF* over bootstrap samples • In practice, EF* has to be approximated by a sample mean based on independent replicates Sn*b, for b = 1, …, B, where B is recommended to be between 25 and 200: ε̂zero ≈ [ Σb=1..B Σi=1..n |yi − g*b(xi)| I(Pi*b = 0) ] / [ Σb=1..B Σi=1..n I(Pi*b = 0) ], where g*b is the classifier designed on Sn*b and Pi*b is the actual proportion of times a data point (xi, yi) appears in Sn*b

  17. Bootstrap Method (continued) • The bootstrap zero estimator tends to be a high-biased estimator of the true error εn, since the classifiers are designed on samples containing only about 0.632n distinct points • The 0.632 bootstrap estimator tries to correct this bias: ε̂b632 = 0.368 ε̂resub + 0.632 ε̂zero • The bias-corrected bootstrap estimator instead adds to the resubstitution estimate a bootstrap estimate of its bias
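A sketch of the bootstrap zero and 0.632 estimators following the standard definitions above, not any particular implementation; the number of replicates B and the classifier are illustrative choices:

    import numpy as np
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    def b632_error(make_clf, X, y, B=100, seed=0):
        rng = np.random.default_rng(seed)
        n = len(y)
        err_sum, count = 0.0, 0
        for _ in range(B):
            idx = rng.integers(0, n, size=n)        # n draws with replacement
            out = np.setdiff1d(np.arange(n), idx)   # points left out of the bootstrap sample
            if len(out) == 0:
                continue
            clf = make_clf()
            clf.fit(X[idx], y[idx])
            err_sum += np.sum(clf.predict(X[out]) != y[out])
            count += len(out)
        eps_zero = err_sum / count                  # bootstrap zero estimate
        clf = make_clf()
        clf.fit(X, y)
        eps_resub = np.mean(clf.predict(X) != y)    # resubstitution estimate
        return 0.368 * eps_resub + 0.632 * eps_zero # 0.632 bootstrap estimate

    # usage: b632_error(LinearDiscriminantAnalysis, X, y)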

  18. Bootstrap Comments • Computationally intensive. • The choice of B is important. • Tends to be slightly more accurate than cross-validation in some situations, but tends to have greater variance.

  19. Classification procedure • Assess gene expressions with microarrays • Determine genes whose expression levels can be used as classifier variables • Apply a rule to design the classifier from the sample data • Apply an error estimation procedure

  20. Error estimation challenges • What if the number of training samples is very small? • Error estimation is strongly affected by small samples. • A dilemma: unbiasedness or small variance? • Prefer small variance: an unbiased estimator with large variance is of little use

  21. Error estimators under small samples • Holdout: impractical with small samples • Resubstitution: always low-biased • Cross-validation: has higher variance than resubstitution or bootstrap. This variance problem makes its use questionable for these kinds of very small samples!

  22. Variability affecting error estimation • Two sources: internal variance and the variability due to the random training sample. The latter is much larger than the internal variance. • Error-counting estimates, such as resubstitution and cross-validation, can only change in 1/n increments.

  23. Variability (continued) • In cross-validation, the test samples are not independent samples, which adds variance to the estimate. • Surrogate problem: the originally designed classifier is assessed in terms of surrogate classifiers, designed by the classification rule applied to reduced data. If these surrogate classifiers differ too much from the original classifier too often, the estimate may be far from the true error rate.

  24. Experimental Setup • Classification rules: • linear discriminant analysis (LDA) • 3-nearest-neighbor (3NN) • decision trees (CART) • Error estimators: • resubstitution (resub) • cross-validation: leave-one-out (loo), 5-fold c-v (cv5), 10-fold c-v (cv10) and repeated 10-fold c-v (cv10r) • Bootstrap: 0.632 bootstrap (b632) and the bias-corrected bootstrap (bbc)

  25. Performance measures for error estimators • Study the performance of an error estimator via the distribution of ε̂ − εn: the deviation distribution of the error estimator • Estimator bias: E[ε̂ − εn] • Confidence we can have in our estimates from actual samples • The root-mean-square (RMS) error: RMS = sqrt( E[(ε̂ − εn)²] ) • Quartiles of the deviation distribution: these are less affected by outliers than the mean
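A sketch of how these summaries of the deviation distribution might be computed from paired estimated/true error values, assuming such arrays are available from a simulation like the one described in the later slides:

    import numpy as np

    def deviation_summary(est_err, true_err):
        # est_err, true_err: arrays of estimated and true errors over many replications
        dev = np.asarray(est_err) - np.asarray(true_err)   # deviation distribution
        return {
            "bias": dev.mean(),
            "variance": dev.var(ddof=1),
            "rms": np.sqrt(np.mean(dev ** 2)),
            "quartiles": np.percentile(dev, [25, 50, 75]),
        }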

  26. Linear Discriminant Analysis (LDA) The class posteriors P(G = k | X = x) are what we need for optimal classification. Suppose fk(x) is the class-conditional density of X in class G = k, and let πk be the prior probability of class k, with Σk πk = 1. A simple application of Bayes’ theorem gives P(G = k | X = x) = fk(x) πk / Σl fl(x) πl. Suppose we model each class density as multivariate Gaussian: fk(x) = (2π)^(−p/2) |Σk|^(−1/2) exp( −(1/2)(x − μk)ᵀ Σk⁻¹ (x − μk) )

  27. LDA (continued) Assume the classes have a common covariance matrix Σk = Σ for all k; this gives LDA. In comparing two classes k and l, it is sufficient to look at the log-ratio log[ P(G = k | X = x) / P(G = l | X = x) ] = log(πk/πl) − (1/2)(μk + μl)ᵀ Σ⁻¹ (μk − μl) + xᵀ Σ⁻¹ (μk − μl), which is linear in x. The linear discriminant functions δk(x) = xᵀ Σ⁻¹ μk − (1/2) μkᵀ Σ⁻¹ μk + log πk give an equivalent description of the decision rule, with G(x) = argmaxk δk(x).

  28. LDA (continued) • Estimate the parameters of the Gaussian distributions from the training data: • 1. π̂k = Nk / N, where Nk is the number of class-k observations • 2. μ̂k = Σ{gi = k} xi / Nk • 3. Σ̂ = Σk=1..K Σ{gi = k} (xi − μ̂k)(xi − μ̂k)ᵀ / (N − K) Figure 1: three Gaussian distributions with the same covariance and different means; included are the contours of constant density enclosing 95% of the probability of each class (Bayes decision boundaries). Figure 2: 30 samples drawn from each Gaussian distribution, and the fitted LDA decision boundaries.
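A compact sketch of LDA built directly from the estimates above (priors, class means, and the pooled covariance with the N − K denominator), rather than from a library; the function name and interface are illustrative:

    import numpy as np

    def lda_fit(X, y):
        classes = np.unique(y)
        N, K = len(y), len(classes)
        pis, mus = [], []
        Sigma = np.zeros((X.shape[1], X.shape[1]))
        for k in classes:
            Xk = X[y == k]
            pis.append(len(Xk) / N)              # prior estimate Nk / N
            mus.append(Xk.mean(axis=0))          # class mean estimate
            Sigma += (Xk - mus[-1]).T @ (Xk - mus[-1])
        Sigma /= (N - K)                         # pooled covariance estimate
        Sinv = np.linalg.inv(Sigma)

        def predict(x):
            # linear discriminant functions delta_k(x); pick the largest
            scores = [x @ Sinv @ m - 0.5 * m @ Sinv @ m + np.log(p)
                      for m, p in zip(mus, pis)]
            return classes[int(np.argmax(scores))]
        return predict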

  29. KNN: Nearest-Neighbor Methods Nearest-neighbor methods use the observations in the training set T closest in input space to x to form Ŷ(x). Specifically, the k-nearest-neighbor fit for Ŷ is defined as Ŷ(x) = (1/k) Σ{xi ∈ Nk(x)} yi, where Nk(x) is the neighborhood of x defined by the k closest points xi in the training sample. Closeness implies a metric, which for the moment we assume is Euclidean distance (other distances can also be defined). So, in words, we find the k observations xi closest to x in input space and average their responses.
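A minimal sketch of a k-nearest-neighbor classifier under the Euclidean metric; k is an illustrative parameter, and for labels in {0,1} averaging the neighbors' responses and thresholding at 1/2 amounts to a majority vote:

    import numpy as np

    def knn_predict(X_train, y_train, x, k=3):
        d = np.linalg.norm(X_train - x, axis=1)      # Euclidean distance to each training point
        neighbors = np.argsort(d)[:k]                # indices of the k closest points
        return int(np.mean(y_train[neighbors]) >= 0.5)  # majority vote for labels in {0, 1}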

  30. Figure 1: 15-nearest-neighbor classifier. Figure 2: 1-nearest-neighbor classifier. Figure 3: 7-nearest-neighbor classifier. Figure 4: Misclassification curves (training size = 20, test size = 10000).

  31. Decision tree (CART) • Decide how to split (conditional Gini index or conditional entropy) • Decide when to stop splitting • Decide how to prune the tree • Using the training sample: • Pessimistic pruning / minimal error pruning / error-based pruning / cost-complexity pruning • Using a separate pruning sample: • Reduced-error pruning

  32. Simulation (synthetic data) • Six sample sizes: 20 to 120 in increments of 20 • Total experimental conditions: 3*6*6=108 • For each experimental condition and sample size, compute the empirical deviation distribution using 1000 replications with different sample data drawn from an underlying model. • True error • Computed exactly for LDA • By Monte-Carlo computation for 3NN and CART

  33. Simulation (synthetic data) Empirical deviation distribution for selected simulations (synthetic data). Beta fits, n = 20.

  34. Simulation (synthetic data) Empirical deviation distribution for selected simulations. Variance as a function of sample size.

  35. Simulation (synthetic data) • Cross-validation: slightly high-biased; its main drawback is high variability, and it also tends to produce large outliers • Resubstitution is low-biased, but shows smaller variance than cross-validation • The 0.632 bootstrap proved to be the best overall • Computational cost also needs to be considered

  36. Simulation (patient data) • Microarrays from breast tumor samples from 295 patients: 115 good-prognosis, 180 poor-prognosis. • Use log-ratio gene expression values associated with the top p = 2 and top p = 5 genes. In each case, 1000 observations of size n = 20 and n = 40 were drawn independently from the pool of 295 microarrays • Sampling was stratified • True error for each observation of size n: holdout estimator, with the 295 − n sample points not drawn used as the test set (a good approximation given the large test sample)

  37. Simulation (patient data) Empirical deviation distribution for selected simulations (patient data). Beta fits, n = 20.

  38. Simulation (patient data) • The observations are not independent, but only weakly dependent • The results obtained with the patient data confirm the general conclusions obtained with the synthetic data

  39. Conclusion • Cross-validation error estimation is much less biased than resubstitution, but has excessive variance. Bootstrap methods provide improved performance with respect to variance, but at a high computational cost and often with increased bias (though much less bias than resubstitution).

  40. My own opinion • Since the universal distribution underlying the training sample is unknown, the true error can only be defined relative to the training sample. If the number of training samples is very small, or the sampling method used to obtain the training sample is not carried out correctly, the training sample will not be able to represent the population. In that case, the classifiers and the error estimates based on such a small sample cannot provide useful information about the underlying classification problem.

  41. Outlier sums statistic method Robert Tibshirani, Trevor Hastie

  42. Background • What is an outlier? • Common methods to detect outliers • Outliers in cancer gene studies • The t-statistic in outlier studies • COPA (Cancer Outlier Profile Analysis)

  43. What is an Outlier? • Definition: an outlier is an unusual value in a dataset, one that does not fit the typical pattern of the data • Sources of outliers: • Recording or measurement errors • Natural variation of the data (valid data)

  44. Outlier analysis • Issues: • If an outlier is a true error and is not dealt with, results can be severely biased. • If an outlier is valid data and is removed, valuable information regarding important patterns in the data is lost • Objective: identify outliers, then decide how to deal with them.

  45. Outlier detection • Visual inspection of data – not applicable for large complex datasets • Automated methods • Normal distribution-based method • Median Absolute Deviation • Distance Based Method • …

  46. Normal distribution-based method • Works on one variable at a time: Xk, k = 1, …, p • Assumes a normal distribution for each variable • Algorithm: the i-th observation’s value of variable Xk (i = 1, …, n) is xik. Sample mean for variable Xk: x̄k = (1/n) Σi xik. Sample standard deviation for Xk: sk = sqrt( Σi (xik − x̄k)² / (n − 1) ). Calculate zik = (xik − x̄k) / sk for each i = 1, …, n. Label xik an outlier if |zik| > 3; about 0.27% of values will be labeled if the normality assumption is correct
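A sketch of this z-score rule applied to one numeric variable; the threshold of 3 follows the slide, and the input is assumed to be a 1-D numpy array:

    import numpy as np

    def zscore_outliers(x, threshold=3.0):
        # x: 1-D array of values for one variable
        z = (x - x.mean()) / x.std(ddof=1)          # standardize with sample mean and sd
        return np.where(np.abs(z) > threshold)[0]   # indices of labeled outliers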

  47. Normal distribution-based method • Very dependent on the assumption of normality • x̄k and sk are themselves not robust to outliers: • Many positive outliers inflate x̄k • Many outliers (in either direction) inflate sk • As a result, the |zik| values are small if there are real outliers in the data, so fewer outliers will be detected • Applies only to numeric-valued variables (the same holds for the other methods)

  48. Robust Normal Method • Deals with the robustness problem • Same as the normal distribution method, but • Use the trimmed mean or median instead of x̄k • Use the trimmed standard deviation instead of sk • Calculate zRik = (xik − x̄Rk) / sRk and still use the |zRik| > 3 labeling rule (the R superscript denotes the robust versions of the mean and standard deviation)

  49. Median Absolute Deviation (MAD) • Another method for dealing with the robustness problem • Use the median as a robust estimate of the mean • Use the MAD as a robust estimate of the standard deviation • Calculate Dik = xik − median(x1k, …, xnk) for i = 1, …, n • Calculate MADk = median(|D1k|, …, |Dnk|) • Calculate the modified zik value: zik = Dik / (1.4826 · MADk) • Label xik as an outlier if |zik| > 3.5 • Note: 1.4826 is used because, for normally distributed data, 1.4826 · MAD is a consistent estimate of the standard deviation σ
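A sketch of the MAD rule as just described; the 3.5 cutoff and the 1.4826 scaling follow the slide:

    import numpy as np

    def mad_outliers(x, threshold=3.5):
        d = x - np.median(x)                  # deviations from the median
        mad = np.median(np.abs(d))            # median absolute deviation
        z = d / (1.4826 * mad)                # modified z-scores
        return np.where(np.abs(z) > threshold)[0]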

  50. Distance Based Method • Non-parametric (no assumption of normality) • Multidimensional – detects outliers across all attributes at once (instead of one attribute at a time) • Algorithm: • Calculate the distance between all pairs of observations, e.g. the Euclidean distance dij = ||xi − xj|| from observation i to observation j • Label observation i an outlier if fewer than r% of the observations lie within distance d of i (r and d are parameters)
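A sketch of this distance-based rule; r and d are parameters the user must choose, and scipy's pairwise-distance helper is used for brevity:

    import numpy as np
    from scipy.spatial.distance import cdist

    def distance_based_outliers(X, d, r):
        # X: (n, p) data matrix; d: distance cutoff; r: required percentage of neighbors
        dist = cdist(X, X)                               # all pairwise Euclidean distances
        n = len(X)
        # percentage of other points within distance d (excluding the point itself)
        neighbor_pct = ((dist <= d).sum(axis=1) - 1) / (n - 1) * 100
        return np.where(neighbor_pct < r)[0]             # outliers: fewer than r% of points nearby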
