
Prediction with Regression






Presentation Transcript


  1. Prediction with Regression An Introduction to Linear Regression and Shrinkage Methods Ehsan Khoddam Mohammadi

  2. Outline • Prediction • Estimation • Bias-Variance Trade-Off • Regression • Ordinary Least Squares • Ridge Regression • Lasso

  3. Prediction: definition • Set of inputs: X1, X2, …, Xp • The output: Y • We want to analyze the relationship between these variables (interpretation) • We want to estimate the output based on the inputs (prediction)

  4. Prediction: the same concept in different fields • Machine learning: supervised learning • Finance: forecasting • Politics: prediction • Estimation theory: function approximation

  5. Regression: why? • Performs well and accurately in both interpretation and prediction • Strong foundations in mathematics, statistics, and computation • Many modern and advanced methods are based on regression, or are variants of it • New methods are still being invented for regression: Nobel prizes are still given to investigations involving regression; it's a hot topic • It can be formulated as an optimization problem: that's the reason I chose it for this class; it's more related to the subject of the class than any other prediction method I know

  6. Regression: classification • Linear regression • Least squares • Best-subset selection (regression with feature selection) • Stepwise regression • Shrinkage (regularization) for regression: • Ridge regression • Lasso regression • Non-linear regression • Numerical data fitting • ANN (artificial neural networks) • Discrete regression • Logistic regression

  7. Before proceeding with regression, let's investigate some statistical properties of ESTIMATION

  8. Estimating the parameter • Assume that we have i.i.d. (independent and identically distributed) samples X1, . . . , Xn with unknown distribution. • Estimating their p.d.f. is too hard in many situations; instead, we want to estimate a parameter θ. • $\hat{\theta}$ is an estimate of θ; it is a function of X1, . . . , Xn.
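A concrete example (my addition, not on the original slide): take θ to be the mean of the distribution. Then the sample mean

$$\hat{\theta} = \bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i$$

is an estimator of θ, and since $\mathbb{E}[\bar{X}_n] = \theta$, it is unbiased in the sense defined on the next slide.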

  9. Bias-Variance dilemma • Definition 1: The bias of an estimator is $\operatorname{Bias}(\hat{\theta}) = \mathbb{E}[\hat{\theta}] - \theta$. If it is 0, the estimator is said to be unbiased. • Definition 2: The mean squared error (MSE) of an estimator is $\operatorname{MSE}(\hat{\theta}) = \mathbb{E}\big[(\hat{\theta} - \theta)^2\big]$. • An interesting equation: $\operatorname{MSE}(\hat{\theta}) = \operatorname{Var}(\hat{\theta}) + \operatorname{Bias}(\hat{\theta})^2$. What does it really mean?
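The decomposition behind that "interesting equation" is standard; a short derivation (my addition) makes it transparent. Add and subtract $\mathbb{E}[\hat{\theta}]$ inside the square:

$$\operatorname{MSE}(\hat{\theta}) = \mathbb{E}\big[(\hat{\theta} - \mathbb{E}[\hat{\theta}] + \mathbb{E}[\hat{\theta}] - \theta)^2\big] = \underbrace{\mathbb{E}\big[(\hat{\theta} - \mathbb{E}[\hat{\theta}])^2\big]}_{\operatorname{Var}(\hat{\theta})} + \underbrace{\big(\mathbb{E}[\hat{\theta}] - \theta\big)^2}_{\operatorname{Bias}(\hat{\theta})^2},$$

where the cross term vanishes because $\mathbb{E}\big[\hat{\theta} - \mathbb{E}[\hat{\theta}]\big] = 0$. So an estimator with a little bias but much lower variance can achieve a smaller MSE; that trade-off is exactly what the shrinkage methods below exploit.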

  10. [Image from “More on Regularization and (Generalized) Ridge Operators”, Takane (2007)]

  11. Test and training error as a function of model complexity. [Image from “The Elements of Statistical Learning”, Second Edition, Hastie et al. (2008)]

  12. Linear Regression: model • Set of training data: $(x_1, y_1), \ldots, (x_N, y_N)$, where each input is a vector $x_i = (x_{i1}, x_{i2}, \ldots, x_{ip})^T$ • Linear regression model: $f(x) = \beta_0 + \sum_{j=1}^{p} x_j \beta_j$ • The real-valued coefficients β need to be estimated

  13. Linear Regression: least squares • The most popular estimation method • Minimize the Residual Sum of Squares: $\operatorname{RSS}(\beta) = \sum_{i=1}^{N}\big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\big)^2$. How do we minimize it?

  14. Linear Regression: least squares • Let's rewrite the last formula in matrix form: $\operatorname{RSS}(\beta) = (y - X\beta)^T (y - X\beta)$ • It is a quadratic function (not the point here, but we shall use this property later) • Differentiating with respect to β and setting it to zero: $X^T(y - X\beta) = 0$ • Unique solution: $\hat{\beta} = (X^T X)^{-1} X^T y$. Under which assumptions can we obtain a unique solution?
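A minimal numerical sketch of that closed form (my illustration with made-up toy data, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: N samples, p features, plus an intercept column.
N, p = 100, 3
X = rng.normal(size=(N, p))
X = np.hstack([np.ones((N, 1)), X])            # prepend the intercept column
beta_true = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.1, size=N)

# Solve the normal equations X^T X beta = X^T y.
# np.linalg.solve is preferred over forming (X^T X)^{-1} explicitly.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)   # close to beta_true when X^T X is well conditioned
```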

  15. Linear Regression: least squares, assumptions • X should be of full rank; then $X^T X$ is positive definite and invertible, and the unique solution can be obtained • In other words, the feature vectors should be linearly independent, i.e. uncorrelated • What happens to β if X is not of full rank, or if some features are highly correlated?
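A quick made-up demonstration of what goes wrong (my sketch, not from the slides): duplicate a feature up to tiny noise and the solution becomes numerically unstable.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 100
x1 = rng.normal(size=N)
x2 = x1 + 1e-6 * rng.normal(size=N)     # nearly a copy of x1
X = np.column_stack([x1, x2])
y = x1 + rng.normal(scale=0.1, size=N)

# X^T X is almost singular, so the "unique" solution is numerically unstable.
print(np.linalg.cond(X.T @ X))            # enormous condition number
print(np.linalg.solve(X.T @ X, X.T @ y))  # wild, mutually offsetting coefficients
```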

  16. Linear Regression: least squares, flaws • Low bias but high variance: $\operatorname{Var}(\hat{\beta}) = (X^T X)^{-1}\sigma^2$, and one can estimate the noise variance $\sigma^2$ by $\hat{\sigma}^2 = \frac{1}{N - p - 1}\sum_{i=1}^{N}(y_i - \hat{y}_i)^2$ • It's hard to find a meaningful relation if we have too many features. What would you recommend to solve these problems?
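Continuing the same toy-data style (my illustration), those two formulas translate directly into code:

```python
import numpy as np

rng = np.random.default_rng(4)
N, p = 100, 3
X = rng.normal(size=(N, p))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.5, size=N)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta_hat

# Unbiased estimate of the noise variance sigma^2.
# (No intercept here, so N - p degrees of freedom; the slide's
#  N - p - 1 applies when an intercept is included.)
sigma2_hat = resid @ resid / (N - p)

# Estimated covariance of beta_hat: (X^T X)^{-1} sigma^2
cov_beta = np.linalg.inv(X.T @ X) * sigma2_hat
print(np.sqrt(np.diag(cov_beta)))   # standard errors of the coefficients
```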

  17. Linear Regression: improvements • Model selection (feature selection): • Best-subset selection (leaps-and-bounds algorithm, Furnival & Wilson (1974)) • Stepwise selection (greedy approach; sub-optimal but often preferred) • mRMR (uses a mutual-information criterion for selection) • Shrinkage methods: impose a constraint on β • Ridge regression • Lasso regression

  18. Ridge Regression • When you have a problem in statistics that you want solved, there is always a Russian statistician waiting to solve it for you. (Be careful! I guarantee this only in statistics; they will betray you in any other situation) • Andrey Nikolayevich Tychonoff provided Tikhonov (!!!) regularization for ill-posed problems, also known as ridge regression in statistics.

  19. Ridge Regression: first attempt • Remember this? $\hat{\beta} = (X^T X)^{-1} X^T y$. Tychonoff added a term to avoid singularity and changed the above formula to: $\hat{\beta}^{\text{ridge}} = (X^T X + \lambda I)^{-1} X^T y$. Now the inverse can be computed even if $X^T X$ is not of full rank, and $\hat{\beta}$ is still a linear function of y. Everything starts from the formula above, but now we have a better point of view than Tychonoff; let's take a look!

  20. Ridge Regression: better motivation • To avoid the high variance of β we simply impose a constraint on it; our problem is now an optimization problem with a constraint: $\hat{\beta}^{\text{ridge}} = \arg\min_{\beta} \sum_{i=1}^{N}\big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\big)^2$ subject to $\sum_{j=1}^{p}\beta_j^2 \le t$.

  21. An even better representation, using the Lagrangian form: $\hat{\beta}^{\text{ridge}} = \arg\min_{\beta}\Big\{\sum_{i=1}^{N}\big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\big)^2 + \lambda\sum_{j=1}^{p}\beta_j^2\Big\}$. Or, better still, in matrix form: $\operatorname{RSS}(\lambda) = (y - X\beta)^T(y - X\beta) + \lambda\beta^T\beta$; we can differentiate this formula and set it to zero. Could you guess the solution? Could you find the relation between $\hat{\beta}$ and $\hat{\beta}^{\text{ridge}}$ when the inputs are orthonormal?
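A minimal sketch of the resulting closed form (my illustration, not the author's code). The intercept is conventionally left unpenalized, which this toy version handles by centering X and y first:

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge solution beta = (X^T X + lam*I)^{-1} X^T y.

    Assumes X and y are already centered, so no intercept is needed
    (the intercept is conventionally not penalized)."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(2)
N, p = 100, 5
X = rng.normal(size=(N, p))
y = X @ np.array([2.0, -1.0, 0.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=N)
X, y = X - X.mean(axis=0), y - y.mean()

for lam in (0.0, 1.0, 10.0):
    print(lam, ridge_fit(X, y, lam))  # coefficients shrink toward 0 as lam grows
```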

  22. LASSO Least Absolute Shrinkage and Selection Operator

  23. LASSO • We impose an L1-norm constraint on our regression: $\hat{\beta}^{\text{lasso}} = \arg\min_{\beta} \sum_{i=1}^{N}\big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\big)^2$ subject to $\sum_{j=1}^{p}|\beta_j| \le t$ • No closed form exists; the solution is a non-linear function of y. How could you solve the above problem? (hint: ask Mr. Iranmehr!)
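One standard answer is proximal gradient descent (ISTA) on the Lagrangian form; a rough sketch under the same centered-data assumption as before (my illustration, not a method from the slides):

```python
import numpy as np

def soft_threshold(z, t):
    """Proximal operator of the L1 norm: shrink each entry toward zero by t."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_ista(X, y, lam, n_iter=500):
    """Minimize 0.5*||y - X b||^2 + lam*||b||_1 by iterative
    soft-thresholding (ISTA). Step size 1/L with L = largest
    eigenvalue of X^T X guarantees convergence."""
    L = np.linalg.eigvalsh(X.T @ X).max()
    b = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (X @ b - y)           # gradient of the smooth part
        b = soft_threshold(b - grad / L, lam / L)
    return b

rng = np.random.default_rng(3)
N, p = 100, 5
X = rng.normal(size=(N, p))
y = X @ np.array([2.0, 0.0, 0.0, -1.0, 0.0]) + rng.normal(scale=0.1, size=N)
X, y = X - X.mean(axis=0), y - y.mean()
print(lasso_ista(X, y, lam=5.0))  # several coefficients come out exactly zero
```

Unlike ridge, the soft-thresholding step can set coefficients exactly to zero, which is why the lasso also performs selection.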

  24. LASSO: why? • One of the first uses of the L1 norm; it showed significant results in signal processing and denoising [Chen et al. (1998)] • The basis for LARS (a new and novel method for regression, not covered here) [Efron et al. (2004)] • Good for sparse model selection where p > N [Donoho (2006b)]

  25. REFERENCES • “The Elements of Statistical Learning”, Second Edition, Hastie et al., 2008 • “More on Regularization and (Generalized) Ridge Operators”, Takane, 2007 • “Bias, Variance and MSE of Estimators”, Guy Lebanon, 2004 • “Least Squares Optimization with L1-Norm Regularization”, Mark Schmidt, 2005 • “Regularization: Ridge Regression and the LASSO”, Tibshirani, 2006
