
Final Exam (not cumulative) Next Tuesday Dec. 12, 7-8:15 PM 1105 SC (This Room)


Presentation Transcript


1. Final Exam (not cumulative) • Next Tuesday • Dec. 12, 7-8:15 PM • 1105 SC (This Room)

2. Statistical Learning – Topics Since Midterm • Parameterized Models • Generative Models • Discriminative Models • Bayes: Rule, Networks, Naïve Bayes • Likelihood function • Estimation: Maximum Likelihood, Maximum A Posteriori • Conjugate Priors • K-means Clustering • Expectation Maximization • Jordan's talk • Logistic Regression • Information Theory: Conditional Information, Mutual Information, KL Divergence • Ensembles: Bayes Optimal Classifier, Bagging, Boosting, Weak Learning, Margin Distribution • Frequentist / Bayesian Statistics • ANNs: Backpropagation • Bias / Variance • Nonparametric classifiers: Nearest Neighbor, k Nearest Neighbor, Kernel Smoothing • Dimensionality Reduction: LDA, PCA, MDS, FA, Local Linear Embedding, Neural Network Derived Features • Model Selection: Fit / Regularization, AIC, BIC, Kolmogorov Complexity, MDL • K-fold Cross Validation, Leave one out cross validation • Learning Curve • ROC Curve

3. Dimensionality Reduction • Many approaches • Supervised • Feature selection • Linear Discriminant Analysis (LDA/Fisher; we’ve seen this already) • … • Unsupervised - what does this mean? • Principal Component Analysis (PCA) • Multidimensional Scaling • Factor Analysis • … • Also some nonlinear dimensionality reduction • Neural Net application

4. Linear Discriminant Analysis (LDA) • Introduce a “new” feature • Linear combination of old features • The new feature should maximize distance between classes • And simultaneously minimize variance within classes: • Maximize |μP – μN|² / (σP² + σN²), where μP, μN are the projected class means and σP², σN² the projected within-class variances • Project points into the subspace and repeat
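A minimal sketch of the two-class Fisher/LDA criterion above, using numpy on made-up Gaussian data (the data and all sizes here are illustrative assumptions, not from the slides).

```python
import numpy as np

def fisher_direction(X_pos, X_neg):
    """Direction w maximizing |mu_P - mu_N|^2 / (s_P^2 + s_N^2) after projection."""
    mu_p, mu_n = X_pos.mean(axis=0), X_neg.mean(axis=0)
    # Within-class scatter: sum of the per-class covariance matrices.
    S_w = np.cov(X_pos, rowvar=False) + np.cov(X_neg, rowvar=False)
    w = np.linalg.solve(S_w, mu_p - mu_n)      # closed form: w proportional to S_w^-1 (mu_P - mu_N)
    return w / np.linalg.norm(w)

rng = np.random.default_rng(0)
X_pos = rng.normal(loc=[2.0, 0.0], scale=1.0, size=(100, 2))
X_neg = rng.normal(loc=[0.0, 2.0], scale=1.0, size=(100, 2))
w = fisher_direction(X_pos, X_neg)
print("new feature = projection onto", w)
print("separation of projected class means:", (X_pos @ w).mean() - (X_neg @ w).mean())
```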

5. Unsupervised – do not use class label • Principal component analysis (PCA) • Like LDA but maximize variance in X • Factor Analysis • Observed raw features are imagined to be derived • Introduce K latent features that “cause” the observed features • Estimate them • Multidimensional Scaling • Suppose we know distances between examples • Find a placement of the examples in a K-dimensional map that reproduces the desired distances

6. Principal Component Analysis (PCA) • Find direction of maximum variance • Project • Repeat • This is an eigenvalue problem; we are looking for eigenvectors [figure: 2-D data in the X-Y plane with the first (1) and second (2) principal directions marked]

7. Principal Component Analysis • Eigenvector of the largest eigenvalue is the direction of greatest variance • Second largest is greatest remaining variance, etc. • They are orthogonal and form the new (linear) features • Use eigenvectors of the largest k eigenvalues, or down to some relative size of eigenvalues • Previous LDA example: • First LDA component • First PCA component • Is this good?
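A minimal numpy sketch of the eigenvalue view above: the eigenvectors of the covariance matrix, ordered by eigenvalue, are the new orthogonal linear features (function and variable names are illustrative).

```python
import numpy as np

def pca(X, k):
    Xc = X - X.mean(axis=0)                 # center each feature
    C = np.cov(Xc, rowvar=False)            # covariance matrix of the features
    vals, vecs = np.linalg.eigh(C)          # symmetric eigenproblem, ascending eigenvalues
    order = np.argsort(vals)[::-1]          # largest variance first
    W = vecs[:, order[:k]]                  # top-k orthogonal directions = new features
    return Xc @ W, vals[order]              # projected data, full eigenvalue spectrum

# To choose k by relative eigenvalue size instead, keep components until e.g.
# np.cumsum(vals) / vals.sum() reaches 0.95.
```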

8. Multidimensional Scaling (MDS) • Given distances between examples • Position examples in a lower dimensional metric space faithful to these distances • Example: flying times between pairs of cities works particularly well here. Why?
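A sketch of classical (metric) MDS, assuming a symmetric distance matrix D with zero diagonal: double-center the squared distances and read coordinates off the top eigenvectors. Flying times work well because they roughly track great-circle distances, which are nearly realizable in a flat 2-D map.

```python
import numpy as np

def classical_mds(D, k=2):
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n      # centering matrix
    B = -0.5 * J @ (D ** 2) @ J              # double-centered squared distances
    vals, vecs = np.linalg.eigh(B)
    order = np.argsort(vals)[::-1][:k]       # keep the k largest eigenvalues
    scale = np.sqrt(np.maximum(vals[order], 0.0))   # guard against tiny negatives
    return vecs[:, order] * scale            # one k-dimensional point per example
```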

9. Factor Analysis (FA) • Assume the observed features are really manifestations of some number k of more primitive factors • These factors are unobservable • Assume they are uncorrelated linear combinations of the original features • In matrix form: X = μ + LF + ε • Where X is an example, μ is the mean of each feature, L is a matrix of factor loadings, F are the factors, and ε is a noise term • Similar to PCA except for ε, which can aid in post hoc interpretation of the factors [figure: Class → Factors → Features]
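A sketch of the generative model X = μ + LF + ε above, with the loadings then re-estimated by scikit-learn's FactorAnalysis (that class and its attributes are assumptions about the library; all sizes are made up).

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
n, d, k = 500, 6, 2
L = rng.normal(size=(d, k))                  # factor loadings
F = rng.normal(size=(n, k))                  # latent, unobservable factors
mu = rng.normal(size=d)                      # per-feature means
eps = 0.1 * rng.normal(size=(n, d))          # independent noise
X = mu + F @ L.T + eps                       # the observed features

fa = FactorAnalysis(n_components=k).fit(X)
print("estimated loadings shape:", fa.components_.shape)   # (k, d)
print("estimated noise variances:", fa.noise_variance_)
```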

10. Non-linear Dimensionality Reduction • Linear is limited: many classifiers easily handle linear transformations anyway • Potentially big gains from nonlinear transformations, but a rich space with huge potential for overfitting • Kernel PCA – PCA with kernel functions • Local Linear Embedding (LLE) – currently very popular • Note that distance is measured along the lower dimensional manifold; assumptions: smoothness, density
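Hedged sketches of the two nonlinear reducers named above, via scikit-learn (class names, parameters, and the swiss-roll dataset are assumptions about that library; any manifold-shaped data would do).

```python
from sklearn.datasets import make_swiss_roll
from sklearn.decomposition import KernelPCA
from sklearn.manifold import LocallyLinearEmbedding

X, _ = make_swiss_roll(n_samples=1000, random_state=0)    # 3-D data on a 2-D manifold

# Kernel PCA: ordinary PCA in the implicit feature space of an RBF kernel.
X_kpca = KernelPCA(n_components=2, kernel="rbf", gamma=0.05).fit_transform(X)

# LLE: preserves each point's local linear reconstruction from its neighbors,
# so "distance" is effectively measured along the manifold.
X_lle = LocallyLinearEmbedding(n_neighbors=12, n_components=2).fit_transform(X)
print(X_kpca.shape, X_lle.shape)             # (1000, 2) each
```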

11. Using Neural Networks • Use hidden layers to learn lower dimensional features • Couple the example features to both the input and output • Learns to reproduce the input features • Hidden layer is floating [figure: Features → Hidden Layer → Features]

12. Using Neural Networks • Use hidden layers to learn lower dimensional features • Couple the example features to both the input and output • Learns to reproduce the input features • Hidden layer is floating • Limit the number of hidden units [figure: Features → Hidden Layer → Features]

13. Using Neural Networks • Use hidden layers to learn lower dimensional features • Couple the example features to both the input and output • Learns to reproduce the input features • Hidden layer is floating • Limit the number of hidden units • Hidden nodes learn the best nonlinear transformations that reproduce the input features [figure: Features → Hidden Layer → Features]
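A minimal numpy sketch of the idea on slides 11-13: the same features at the input and the output, a small hidden layer in between, trained by backpropagation to reproduce its input. All sizes, the learning rate, and the synthetic data are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, h = 500, 10, 3                          # 10 features forced through 3 hidden units
X = rng.normal(size=(n, 3)) @ rng.normal(size=(3, d))   # data that truly lies in 3-D
X += 0.01 * rng.normal(size=(n, d))

W1, b1 = 0.1 * rng.normal(size=(d, h)), np.zeros(h)
W2, b2 = 0.1 * rng.normal(size=(h, d)), np.zeros(d)
lr = 0.05

for step in range(5000):
    H = np.tanh(X @ W1 + b1)                  # hidden layer = learned nonlinear features
    Xhat = H @ W2 + b2                        # reconstruction of the input features
    err = Xhat - X
    # Backpropagate the mean squared reconstruction error.
    dXhat = 2.0 * err / X.size
    dW2, db2 = H.T @ dXhat, dXhat.sum(axis=0)
    dH = dXhat @ W2.T
    dZ1 = dH * (1.0 - H ** 2)                 # tanh'(z) = 1 - tanh(z)^2
    dW1, db1 = X.T @ dZ1, dZ1.sum(axis=0)
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

print("final reconstruction MSE:", float((err ** 2).mean()))
```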

14. Model Selection • Comparing error rate alone on models of different complexity is not fair • Consider the Likelihood Function • It will tend to prefer a more complex model • Why? • Overfitting • Need regularization • Penalty to compensate for complexity • Richer model families are more likely to find a good fit by accident

15. Information Criteria • L is the likelihood; k is the number of parameters; N is the number of examples • Prefer the higher scoring models • Akaike Information Criterion (AIC): AIC = ln(L) – k (log-likelihood fit minus a complexity penalty) • Bayesian Information Criterion (BIC): BIC = ln(L) – k ln(N)/2 • Minimum Description Length (MDL) is the same as BIC • Kolmogorov complexity • Learning = Data Compression • Compression bounds; bound test accuracy from training alone • Luckiness Framework; PAC-Bayes
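A toy sketch of the two scores above in their "higher is better" form: reward the log-likelihood, penalize the parameter count, with BIC penalizing more harshly as N grows. The numbers plugged in are made up.

```python
import numpy as np

def aic(log_lik, k):
    return log_lik - k                        # fit minus complexity penalty

def bic(log_lik, k, n):
    return log_lik - 0.5 * k * np.log(n)      # penalty grows with sample size

# A slightly better fit does not justify many extra parameters:
print("AIC:", aic(-100.0, 3), "vs", aic(-98.0, 10))
print("BIC:", bic(-100.0, 3, 1000), "vs", bic(-98.0, 10, 1000))
```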

16. Cross Validation • Good For • Setting Parameters • Choosing Models • Evaluating a Learner • Data Resampling Technique • Different partition sets of the training data are somewhat independent • Overlap introduces some bias; this can be estimated if necessary • In statistics: bootstrap, jackknife

17. Computing a learning curve • Classifier performance as a function of the amount of training data • Desire confidences • Perhaps error bars: usually 95%, from the standard error of the mean • Need multiple runs but have limited data • Each point is generated by cross validation
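A sketch of one way to generate learning-curve points with roughly 95% error bars: cross-validated accuracy at increasing training-set sizes. The scikit-learn dataset and classifier here are placeholder assumptions; any learner and data would do.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
idx = np.random.default_rng(0).permutation(len(X))
X, y = X[idx], y[idx]                          # shuffle before taking prefixes

for frac in (0.1, 0.25, 0.5, 1.0):
    n = int(frac * len(X))
    scores = cross_val_score(LogisticRegression(max_iter=5000), X[:n], y[:n], cv=10)
    sem = scores.std(ddof=1) / np.sqrt(len(scores))      # standard error of the mean
    print(f"n={n:4d}  accuracy={scores.mean():.3f} +/- {1.96 * sem:.3f}")
```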

18. Cross Validation • k-fold cross validation • Partition the data D into k equal disjoint sets: d1, d2, … dk • For i = 1 to k: train on D - di, test on di • Generates a population of results: can compute average performance and confidence measures • Most popular / most standard is k = 10 • When k = |D| it is called “leave one out cross validation”
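A from-scratch sketch of the loop on slide 18: partition D into k disjoint folds, train on D - di, test on di. The `train` and `evaluate` callables are hypothetical placeholders for whatever learner is being assessed.

```python
import numpy as np

def k_fold_cv(X, y, train, evaluate, k=10, seed=0):
    idx = np.random.default_rng(seed).permutation(len(X))
    folds = np.array_split(idx, k)                      # k roughly equal disjoint sets
    scores = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        model = train(X[train_idx], y[train_idx])       # train on D - d_i
        scores.append(evaluate(model, X[test_idx], y[test_idx]))   # test on d_i
    return np.mean(scores), np.std(scores, ddof=1)      # average performance + spread

# Leave-one-out cross validation is the special case k = len(X).
```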

19. Cross Validation • Every example gets used as a test example exactly once • Every example gets used as a training example k-1 times. • Test sets are independent but training sets overlap significantly. • The hypotheses are generated using (k-1)/k of the training data. • With resampling, “paired” statistical design can be used to compare two or more learners • Paired tests are statistically stronger since outcome variations due to the test set are identical in each fold
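A sketch of the paired design above: two learners scored on the same k folds, and their per-fold differences tested with a paired t-test. scipy's ttest_rel is assumed available; the fold accuracies below are made-up numbers.

```python
import numpy as np
from scipy.stats import ttest_rel

acc_a = np.array([0.81, 0.84, 0.79, 0.83, 0.82, 0.85, 0.80, 0.84, 0.83, 0.81])
acc_b = np.array([0.78, 0.82, 0.77, 0.80, 0.81, 0.83, 0.78, 0.82, 0.80, 0.79])

t, p = ttest_rel(acc_a, acc_b)        # each pair shares the same train/test split
print(f"mean per-fold difference {np.mean(acc_a - acc_b):.3f}, t={t:.2f}, p={p:.4f}")
```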

20. ROC Curve • Often a classifier can be adjusted to have more false positives or more false negatives • This can be used to hide weaknesses of the classifier • Receiver Operating Characteristic Curve • Prob. of True Positive vs. Prob. of False Positive as sensitivity is increased • In the figure, the – (left peak) and + (right peak) populations overlap • The classification boundary is the vertical line • The relevant areas are labeled: TP: true positives = Red + Purple; FP: false positives = Pink + Purple; TN: true negatives = Dark Blue + Light Blue; FN: false negatives = Light Blue
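A minimal sketch of tracing the curve described above: sweep the decision threshold across the classifier's scores and record P(true positive) against P(false positive). The scores and labels are synthetic, with deliberately overlapping + and – populations.

```python
import numpy as np

def roc_points(scores, labels):
    order = np.argsort(-scores)                  # most confident "+" predictions first
    labels = labels[order]
    tp = np.cumsum(labels)                       # true positives as the threshold drops
    fp = np.cumsum(1 - labels)                   # false positives as the threshold drops
    fpr = np.concatenate(([0.0], fp / (1 - labels).sum()))
    tpr = np.concatenate(([0.0], tp / labels.sum()))
    return fpr, tpr

rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=200)
scores = labels + rng.normal(size=200)           # overlapping score distributions
fpr, tpr = roc_points(scores, labels)
print("approximate area under the ROC curve:", np.trapz(tpr, fpr))
```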
