
Data Mining


Presentation Transcript


  1. Data Mining Ed Rothman Joe Heidi Kathy CSCAR

  2. Agenda

  3. Purpose • Given a (training) set of n independent observations on a set of p potential predictors x1, x2, x3, …, xp and a response y, our purpose is to predict a future value of the response. • Though there are p distinct variables called predictors, these may well represent potentially many more, since transformed or derived terms also count as predictors. • For example, a squared term such as x1² is a new predictor built from x1.

  4. Basic Measures in Regression • R²: the squared correlation reflects the fraction of the total variation in y accounted for by the linear regression model. • The regression coefficients are chosen so that R² is maximized for the data set at hand; R² is therefore a biased (optimistic) reflection of predictive performance. • The formula for R² is 1 − (sum of squared residuals / total sum of squares).
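The formula can be checked directly; the response and fitted values below are illustrative, not taken from any real regression:

```python
# R^2 = 1 - (sum of squared residuals) / (total sum of squares)
ys    = [2.0, 4.0, 5.0, 4.0, 5.0]
preds = [2.2, 3.8, 4.6, 4.4, 5.0]   # hypothetical fitted values

mean_y = sum(ys) / len(ys)
ssr = sum((y - p) ** 2 for y, p in zip(ys, preds))   # residual sum of squares
sst = sum((y - mean_y) ** 2 for y in ys)             # total sum of squares
r2 = 1 - ssr / sst
print(round(r2, 3))
```

The closer the fitted values track the responses, the closer R² gets to 1; a model no better than predicting the mean gives R² near 0.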

  5. Predictive Error Sum of Squares • The sum of squared residuals measures the deviations between each of the responses (the y’s) in the sample and its fitted value. • PRESS, the predicted error sum of squares, measures the deviations between each y in the sample and the prediction based on all the data except that one response.
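A minimal leave-one-out computation of PRESS for simple (one-predictor) least squares, on illustrative data:

```python
# Exact PRESS: refit with each observation held out, predict it, sum the
# squared prediction errors. Data below is illustrative.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.2, 1.9, 3.2, 3.8, 5.1, 6.2]

def fit(x, y):
    """Least-squares intercept and slope for one predictor."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) \
        / sum((xi - mx) ** 2 for xi in x)
    return my - b * mx, b

press = 0.0
for i in range(len(xs)):
    # Refit with observation i held out, then predict the held-out y.
    a, b = fit(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
    press += (ys[i] - (a + b * xs[i])) ** 2
print(round(press, 4))
```

Because each held-out point does not help choose the coefficients that predict it, PRESS is always at least as large as the ordinary residual sum of squares.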

  6. Approximate Press • The predictive error sum of squares is approximately the sum of squared errors divided by (1 − p/n)²: PRESS ≈ SSE / (1 − p/n)². • The effect of using too many predictors that do not contribute to the fit in a substantial way is evident: when p is large compared with n, we must have an excellent fit or PRESS can be very large.
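A sketch of how the approximation behaves as p grows toward n; the values SSE = 10 and n = 50 are illustrative, not from any data set:

```python
# Approximate PRESS = SSE / (1 - p/n)^2.
# Watch the blow-up as the number of predictors p approaches n.
sse, n = 10.0, 50
approx = {p: sse / (1 - p / n) ** 2 for p in (2, 10, 25, 45)}
for p, v in approx.items():
    print(f"p = {p:2d}  ->  approximate PRESS = {v:.1f}")
```

With p = 45 of n = 50, the same SSE of 10 inflates to an approximate PRESS of 1000: extra predictors that do not improve the fit are paid for in predictive error.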

  7. Assessment • Training set • Validation set • Test data set • The measure of quality can be more general than measures of fit; we may, for example, look to a function of both types of errors.
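A minimal sketch of the three-way split named above; the 60/20/20 proportions are an assumption for illustration:

```python
import random

# Shuffle case indices, then carve out training / validation / test subsets.
random.seed(0)                      # reproducible shuffle
indices = list(range(100))
random.shuffle(indices)
train = indices[:60]                # fit the model here
valid = indices[60:80]              # tune and compare models here
test  = indices[80:]                # final, untouched performance estimate
print(len(train), len(valid), len(test))
```

Keeping the test set untouched until the very end is what makes its error estimate honest, unlike R² computed on the training data.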

  8. OTHER ISSUES • In multiple regression, the impact of a potential predictor variable on the response is judged against the standard error of its coefficient estimate. The square of this number, the estimated variance, is a function of four factors: • Var(b) ≈ σ²(Y at fixed values of the predictors)/n × 1/σ²(predictor) × 1/(1 − R²_B)

  9. σ²(Y at fixed values of the predictors) • This is the variance of the response at a fixed value of the predictors. We assume this number is the same whatever the value of the predictors. However, when you leave an important predictor out, this number can be quite large: • The variance of the aggregate is the average of the within-group variances plus the variance between the group averages.

  10. σ²(Y at fixed values of the predictors) • Consider leaving the variable “basket number” out of the model: • Basket 1: response is 1 or 3 • Basket 2: response is 7 or 9 • The variance within each basket is 1, but the variance of the aggregate is 10!
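The decomposition from slide 9 (aggregate variance = average within-basket variance + variance between the basket averages) checks out numerically for this example:

```python
def pvar(v):
    """Population variance: mean squared deviation from the mean."""
    m = sum(v) / len(v)
    return sum((x - m) ** 2 for x in v) / len(v)

basket1 = [1, 3]
basket2 = [7, 9]
within = (pvar(basket1) + pvar(basket2)) / 2       # average within-basket variance
means = [sum(basket1) / 2, sum(basket2) / 2]       # basket averages: 2 and 8
between = pvar(means)                              # variance between the averages
aggregate = pvar(basket1 + basket2)                # ignore baskets entirely
print(within, between, aggregate)
```

Within-basket variance is 1, the variance between the averages is 9, and their sum is the aggregate variance of 10, exactly the blow-up described above when “basket number” is dropped from the model.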

  11. n: sample size • The larger the sample size, the smaller the variance. • Standard errors decrease in proportion to 1 over the square root of the sample size.

  12. σ²(predictor) • This number is the variance of the predictor. A predictor without variance cannot be studied. • In selecting cases to study, if you are able to choose, pick predictors with substantial variation.

  13. 1/(1 − R²_B) • R²_B comes from the regression of the predictor under study on all the other predictors. • A strong linear relationship between predictors creates a problem. • This is called multicollinearity. • Some potential solutions include ridge regression and principal components.
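The inflation factor 1/(1 − R²_B) can be tabulated directly; the R²_B values below are illustrative, not from any real data set:

```python
# Variance inflation factor: 1/(1 - R^2_B), where R^2_B is from regressing
# the predictor under study on all the other predictors.
vifs = {r2_b: 1 / (1 - r2_b) for r2_b in (0.0, 0.5, 0.9, 0.99)}
for r2_b, vif in vifs.items():
    print(f"R^2_B = {r2_b:4.2f}  ->  coefficient variance inflated {vif:.1f}x")
```

An uncorrelated predictor (R²_B = 0) has its coefficient variance left alone; at R²_B = 0.99 the variance is inflated roughly a hundredfold, which is why near-collinear predictors produce unstable estimates.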

  14. OTHER MODEL BASED ISSUES • The response variable may be binary • Use binary logistic regression • The response variable may be ordinal • Use ordinal logistic analysis • The response variable may have a discrete distribution • GLM, Poisson Regression, etc.

  15. Predictor Effects • An interaction between predictors means that the effect of a change in both is different from the sum of the individual effects. • Create new predictors equal to the product of the predictors. • The predictors may not impact the response in a smooth fashion. • Regression trees handle this.
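The non-additive effect can be seen on a toy response; a sketch assuming the hypothetical form y = x1 + x2 + 2·x1·x2:

```python
# With an interaction term, the effect of raising both predictors together
# differs from the sum of the two individual effects.
def y(x1, x2):
    return x1 + x2 + 2 * x1 * x2    # hypothetical response with interaction

base = y(1, 1)                      # 4
effect_x1 = y(2, 1) - base          # raise x1 alone
effect_x2 = y(1, 2) - base          # raise x2 alone
effect_both = y(2, 2) - base        # raise both together
print(effect_x1, effect_x2, effect_both)
```

Adding the product column x1·x2 to the design matrix lets ordinary multiple regression estimate exactly this kind of non-additive effect.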

  16. Interactions (local) • In multiple regression models we sometimes model non-additive effects by including a single function (e.g. the product of the two variables); with regression trees we allow interactions that are more local and specific.

  17. Software Choices • Commercial software: StatSoft • R and S-Plus • JMP • Enterprise Miner (SAS)
