Explaining Logistic Regression and Model Selection in Financial Time Series Prediction

Financial Time Series I / Methods of Statistical Prediction
Suggested Answers to Project 2
Project 2: Logistic Regression and Model Selection
1/19/2003

Question 1
• Give a brief explanation of the meaning of these commands.
• data(birthwt)
    # Load the data set “birthwt”, which ships with the MASS package.
    # Alternatively, load the whole package into the workspace with require(MASS).
• attach(birthwt)
    # Attach the data set “birthwt” to the search path so that each variable can be accessed directly by name.
    # Otherwise, we have to write birthwt$varname instead of varname.
• race <- factor(race, labels = c("white", "black", "other"))
    # The function “factor” encodes a vector as a factor. “race” is originally stored as a numeric code;
    # this command converts it into a categorical factor with labelled levels.
• table(ftv)
    # “table” uses the cross-classifying factors to build a contingency table of the counts at each combination of factor levels.
    # The result of this command is as follows:
      0   1   2   3   4   6
    100  47  30   7   4   1

Question 1
• ftv <- factor(ftv)
    # The function “factor” encodes a vector as a factor.
• levels(ftv)[-(1:2)] <- "2+"
    # This command collapses the levels of “ftv” into three: “0”, “1”, and “2+”.
• 2. Convert ptl to two levels and name the new variable ptd.
• ptd <- factor(ptl > 0)
    # Here, ptd is a new factor with only two levels (FALSE and TRUE).
• 3. Create a new data frame bwt.
• bwt <- data.frame(low = factor(low), age, lwt, race, smoke = (smoke > 0), ptd, ht = (ht > 0), ui = (ui > 0), ftv)
    # Create a new data frame.
• 4. Clean up data.
• detach("birthwt")
    # Remove the data set from the search path of available R objects.
    # This command can remove either a data frame that has been attached or a package that was loaded previously.
• rm(race, ptd, ftv)
    # “remove” and “rm” remove objects. Here, the variables “race”, “ptd”, and “ftv”, which were created in the workspace, are removed and no longer exist.
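The same data preparation can be sketched without attach/detach, which avoids leaving temporary variables in the workspace. This is only an illustrative alternative, assuming the MASS package (which provides birthwt) is installed; the result should be the same bwt data frame.

```r
library(MASS)  # provides the birthwt data set

# Build bwt inside with(), so race, ptd and ftv never leak into the workspace
bwt <- with(birthwt, {
  race <- factor(race, labels = c("white", "black", "other"))
  ptd <- factor(ptl > 0)            # any previous premature labour?
  ftv <- factor(ftv)
  levels(ftv)[-(1:2)] <- "2+"       # collapse 2, 3, 4, 6 visits into "2+"
  data.frame(low = factor(low), age, lwt, race, smoke = (smoke > 0),
             ptd, ht = (ht > 0), ui = (ui > 0), ftv)
})

str(bwt)
table(bwt$ftv)
```

With this form, steps 3 and 4 collapse into one expression and no clean-up is needed.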

Question 2
• Give a brief explanation of the specification of the regression model.
• birthwt.glm <- glm(low ~ ., family = binomial, data = bwt)
    # “glm” is used to fit generalized linear models (GLM) to the data.
• In a GLM, two components need to be specified.
• The first is the error distribution of the dependent variable. Here, the chosen distribution is the binomial distribution, specified by “family=binomial”.
• The second concerns the unknown parameter of that distribution. For the binomial distribution, it is the probability of success p, and p is what connects the independent variables to the dependent variable.
• The connection is made through the so-called link function. The default link function for the binomial distribution is the logit, log[p/(1 − p)] = z, or equivalently:
    F(x) = P(Y = 1 | x) = exp(z) / [1 + exp(z)], where z = β0 + β1X1 + ... + βKXK
• The fitted values of this model are the probabilities of “low=1” given all the other variables in the data frame “bwt”. Here, we only consider the additive model: all variables enter the linear predictor z additively, and there is no interaction term in the model.
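As a small check of the logit link, the sketch below compares the fitted probabilities with exp(z)/[1 + exp(z)] computed from the linear predictor z = β0 + β1X1 + ... + βKXK. It assumes the MASS package for the data and rebuilds bwt so that it runs on its own; the object names are illustrative.

```r
library(MASS)  # provides the birthwt data set

bwt <- with(birthwt, {
  race <- factor(race, labels = c("white", "black", "other"))
  ptd <- factor(ptl > 0)
  ftv <- factor(ftv)
  levels(ftv)[-(1:2)] <- "2+"
  data.frame(low = factor(low), age, lwt, race, smoke = (smoke > 0),
             ptd, ht = (ht > 0), ui = (ui > 0), ftv)
})

fit <- glm(low ~ ., family = binomial, data = bwt)

z <- predict(fit, type = "link")       # linear predictor for each case
p <- predict(fit, type = "response")   # fitted P(low = 1 | x)

# The inverse logit of z should reproduce the fitted probabilities
p_manual <- exp(z) / (1 + exp(z))
all.equal(unname(p), unname(p_manual))
```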


Question 2
• summary(birthwt.glm, correlation = FALSE)
• It gives the deviance residuals and the coefficients.
• The command “summary” is a generic function that produces summaries of the results of various model-fitting functions.
• The correlations of the coefficients are not shown because the parameter “correlation” is set to FALSE.
• The AIC value is 217.48 for this additive model containing all the variables. In addition, the prediction error rate is 0.2698413.
• In the following analyses, we can compare AIC values and prediction error rates against these figures to show the effect of including or excluding particular variables.
• From the p-value Pr(>|z|) of each coefficient, we find that “ptdTRUE”, “htTRUE”, “lwt”, and “raceblack” are significant for the prediction if the significance level α is set at 0.05.
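The slides do not state how the prediction error rate was computed. A minimal sketch, assuming it is the in-sample misclassification rate with a 0.5 probability threshold (an assumption, not confirmed by the source), could look like this:

```r
library(MASS)  # provides the birthwt data set

bwt <- with(birthwt, {
  race <- factor(race, labels = c("white", "black", "other"))
  ptd <- factor(ptl > 0)
  ftv <- factor(ftv)
  levels(ftv)[-(1:2)] <- "2+"
  data.frame(low = factor(low), age, lwt, race, smoke = (smoke > 0),
             ptd, ht = (ht > 0), ui = (ui > 0), ftv)
})

fit <- glm(low ~ ., family = binomial, data = bwt)

# Coefficient table: estimates, standard errors, z values and Pr(>|z|)
coef(summary(fit))

# Assumed definition: classify as low = 1 when the fitted probability exceeds 0.5
pred <- as.numeric(fitted(fit) > 0.5)
obs <- as.numeric(bwt$low) - 1         # factor levels "0","1" -> 0/1
err <- mean(pred != obs)

cat("AIC:", AIC(fit), " in-sample error rate:", err, "\n")
```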

Question 2: Model Selection
• Consider a model with the above four “important” variables:
    glm(low ~ lwt + race + ptd + ht, family = binomial, data = bwt)
• The AIC value becomes 217.40, which is slightly smaller than that of the model using all of the variables.
• In addition, the prediction error rate is 0.2698413.
• If we drop only the two least significant variables, “ftv” and “age”, this leads to an alternative model:
    glm(low ~ lwt + race + ptd + ht + smoke + ui, family = binomial, data = bwt)
• The AIC value becomes 213.8516, which is smaller than those of the two models described above.
• The prediction error rate is 0.2433862.
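The three candidate models above can be compared side by side with AIC(). The sketch below assumes the MASS package for the data; the model names full, four, and six are illustrative only.

```r
library(MASS)  # provides the birthwt data set

bwt <- with(birthwt, {
  race <- factor(race, labels = c("white", "black", "other"))
  ptd <- factor(ptl > 0)
  ftv <- factor(ftv)
  levels(ftv)[-(1:2)] <- "2+"
  data.frame(low = factor(low), age, lwt, race, smoke = (smoke > 0),
             ptd, ht = (ht > 0), ui = (ui > 0), ftv)
})

full <- glm(low ~ ., family = binomial, data = bwt)
four <- glm(low ~ lwt + race + ptd + ht, family = binomial, data = bwt)
six  <- glm(low ~ lwt + race + ptd + ht + smoke + ui,
            family = binomial, data = bwt)

# One row per model: degrees of freedom and AIC (smaller is better)
aic_table <- AIC(full, four, six)
print(aic_table)
```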

Question 2: Model Selection with AIC
• Use AIC as the objective of model selection.
• birthwt.step <- step(birthwt.glm, trace = FALSE)
• The command “step” selects a formula-based model by AIC. Its basic idea is to remove variables one at a time in order to find models with smaller AIC values.
• The parameter “trace” is turned off so that the deletion process is not printed.
• We can build the model with backward elimination or forward selection:
    birthwt.step <- step(birthwt.glm, trace = FALSE, direction = "forward")
    birthwt.step <- step(birthwt.glm, trace = FALSE, direction = "backward")
  (Forward selection needs a “scope” argument or a smaller starting model; otherwise there is nothing to add to the full model.)
• birthwt.step$anova
• This command is useful for showing the process of deleting variables.
• For this data set, it removes the two variables “ftv” and “age”, and the final AIC value is reduced to 213.8516. The selection procedure stops because removing any other variable does not reduce the AIC further.
• It is interesting that the model selected by AIC coincides with the one obtained above by dropping the two least significant variables.
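Backward elimination and forward selection can be sketched as follows. The forward search here starts from an intercept-only model and uses the full additive formula as its scope; that starting point and the names back, null, and fwd are assumptions made for illustration.

```r
library(MASS)  # provides the birthwt data set

bwt <- with(birthwt, {
  race <- factor(race, labels = c("white", "black", "other"))
  ptd <- factor(ptl > 0)
  ftv <- factor(ftv)
  levels(ftv)[-(1:2)] <- "2+"
  data.frame(low = factor(low), age, lwt, race, smoke = (smoke > 0),
             ptd, ht = (ht > 0), ui = (ui > 0), ftv)
})

full <- glm(low ~ ., family = binomial, data = bwt)

# Backward elimination from the full additive model
back <- step(full, direction = "backward", trace = 0)
back$anova                      # which terms were dropped, and the AIC path

# Forward selection from the intercept-only model, up to the full model
null <- glm(low ~ 1, family = binomial, data = bwt)
fwd <- step(null, scope = formula(full), direction = "forward", trace = 0)

kept <- attr(terms(back), "term.labels")
print(kept)
cat("backward AIC:", AIC(back), " forward AIC:", AIC(fwd), "\n")
```

The two directions need not end at the same model in general, so it is worth comparing their final AIC values.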

Question 3
• Repeat the steps in Question 2 and consider all models that include pairwise interactions.
• To ensure that collinearity is not present in the model, we usually start by checking the correlation coefficients among all independent variables.
• The difficult question is how to address correlation with categorical variables. Refer to your notes on association.
• Model building strategy:
• Strategy 1: backward elimination
• Add all pairwise interactions and then remove them one by one.
• Implementation of strategy 1
• Start from the model with all independent variables and their pairwise interactions:
    birthwt.glm <- glm(low ~ .^2, family = binomial, data = bwt, maxit = 20)
• Because of convergence problems, we can increase the upper bound on the number of iterations, which is done with “maxit”.
• Suggestion: Start from the best additive linear model derived in Question 2 (exclude the two variables ftv and age).
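One possible way to screen for association among predictors is sketched below: a correlation coefficient for a pair of numeric variables and a chi-squared test of independence for a pair of categorical ones. The choice of these two measures and of the variable pairs is an assumption for illustration, not taken from the course notes.

```r
library(MASS)  # provides the birthwt data set

bwt <- with(birthwt, {
  race <- factor(race, labels = c("white", "black", "other"))
  ptd <- factor(ptl > 0)
  ftv <- factor(ftv)
  levels(ftv)[-(1:2)] <- "2+"
  data.frame(low = factor(low), age, lwt, race, smoke = (smoke > 0),
             ptd, ht = (ht > 0), ui = (ui > 0), ftv)
})

# Numeric vs numeric: Pearson correlation coefficient
r_age_lwt <- cor(bwt$age, bwt$lwt)

# Categorical vs categorical: contingency table and chi-squared test
tab <- table(race = bwt$race, smoke = bwt$smoke)
assoc <- chisq.test(tab)

print(r_age_lwt)
print(tab)
print(assoc$p.value)
```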

Question 3: backward elimination
• birthwt.glm <- glm(low ~ (. - ftv - age)^2, family = binomial, data = bwt, maxit = 20)
• Consider a model without ftv and age.
• This leads to the following model:
    low ~ age + lwt + race + smoke + ptd + ht + ui + ftv + age:ht + age:ftv + lwt:smoke + lwt:ht + lwt:ui + race:ht + smoke:ht + ptd:ht + ht:ui + ht:ftv
• Model selection with AIC:
    birthwt.step.pwall <- stepAIC(birthwt.glm.pwall, trace = FALSE)
    birthwt.step.pwall$anova
• This leads to the following model with AIC = 210.8205:
    low ~ age + lwt + race + smoke + ptd + ht + ui + ftv + lwt:smoke + lwt:ht + lwt:ui + ht:ui
• Suggestion: Can we simply compare all possible pairs of pairwise interaction terms?
• The best one has the interaction terms age:ftv and ht:ui.
• Its AIC is 209.0006.
• Although the variables “age” and “ftv” are not “important” when we consider only an additive linear model, they are included once interaction terms are taken into consideration.

Question 3: Model Searching Strategy
• birthwt.step.both <- stepAIC(birthwt.glm, scope = list(upper = ~ .^2, lower = ~ 1), trace = FALSE)
• The direction of the stepwise search can be one of “both”, “backward”, or “forward”, with a default of “both”.
• If the “scope” argument is missing, the default for “direction” is “backward”.
• Therefore, with a scope supplied, we not only remove predictors from the model but also add predictors to reduce the AIC value.
• For this data set, start with the original model with no interaction term.
• The interaction terms “age:ftv” and “smoke:ui” are added sequentially.
• Finally, the “race” term is removed.
• The process stops at the model
    low ~ age + lwt + smoke + ptd + ht + ui + ftv + age:ftv + smoke:ui
  with the lowest AIC value, 207.0734.
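The bidirectional search can be sketched as a self-contained run. It assumes the MASS package, which provides both stepAIC and the data; the object names start and both are illustrative, and the fits may emit warnings about fitted probabilities close to 0 or 1.

```r
library(MASS)  # provides stepAIC and the birthwt data set

bwt <- with(birthwt, {
  race <- factor(race, labels = c("white", "black", "other"))
  ptd <- factor(ptl > 0)
  ftv <- factor(ftv)
  levels(ftv)[-(1:2)] <- "2+"
  data.frame(low = factor(low), age, lwt, race, smoke = (smoke > 0),
             ptd, ht = (ht > 0), ui = (ui > 0), ftv)
})

# Start from the additive model; allow moves up to all pairwise interactions
start <- glm(low ~ ., family = binomial, data = bwt)
both <- stepAIC(start, scope = list(upper = ~ .^2, lower = ~ 1),
                trace = FALSE)

both$anova        # sequence of added (+) and removed (-) terms
formula(both)     # final model
cat("start AIC:", AIC(start), " final AIC:", AIC(both), "\n")
```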

Question 5
• Repeat the above procedure for the two chosen models, using cross-validation to estimate the prediction error again.
• How do we divide the data randomly into several groups (fold = 5, for example)?
• Use the cv.glm procedure in the “boot” package.
• Some students wrote their own version of cross-validation.
• cv.glm
• Cross-validation for Generalized Linear Models
• This function calculates the estimated K-fold cross-validation prediction error for generalized linear models.
• cv.glm(data, glmfit, cost, K)
• data: A matrix or data frame containing the data. The rows should be cases and the columns correspond to variables, one of which is the response.
• glmfit: An object of class "glm" containing the results of a generalized linear model fitted to data.
• cost: A function of two vector arguments specifying the cost function for the cross-validation. The first argument to cost should correspond to the observed responses and the second argument should correspond to the predicted or fitted responses from the generalized linear model. The default is the average squared error function.
• K: The number of groups into which the data should be split to estimate the cross-validation prediction error.
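A call to cv.glm for the additive model might look like the sketch below. The misclassification cost function (a 0.5 threshold on the predicted probability) and the fixed seed are assumptions chosen for illustration; the default cost is the average squared error instead.

```r
library(MASS)  # provides the birthwt data set
library(boot)  # provides cv.glm

bwt <- with(birthwt, {
  race <- factor(race, labels = c("white", "black", "other"))
  ptd <- factor(ptl > 0)
  ftv <- factor(ftv)
  levels(ftv)[-(1:2)] <- "2+"
  data.frame(low = factor(low), age, lwt, race, smoke = (smoke > 0),
             ptd, ht = (ht > 0), ui = (ui > 0), ftv)
})

fit <- glm(low ~ ., family = binomial, data = bwt)

# Cost: proportion of cases whose predicted class (threshold 0.5) is wrong.
# y is the observed 0/1 response, yhat the predicted probability.
miss_cost <- function(y, yhat) mean(abs(y - yhat) > 0.5)

set.seed(1)                     # the fold assignment is random
cv5 <- cv.glm(bwt, fit, cost = miss_cost, K = 5)

# delta[1] is the raw CV estimate, delta[2] a bias-adjusted version
print(cv5$delta)
```

Because the folds are drawn at random, the estimate changes from run to run unless the seed is fixed.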

Cross-validation Algorithm

    k <- 5  # k is not defined in the original; 5 folds assumed, as in the example above

    # Shuffle the rows of bwt by repeated random swaps
    bwt.shuffle <- bwt; shuf.iter <- 1000; datasize <- length(bwt$low)
    for (i in 1:shuf.iter) {
      n1 <- round(runif(n = 1, min = 1, max = datasize))
      n2 <- round(runif(n = 1, min = 1, max = datasize))
      temp <- bwt.shuffle[n1, ]
      bwt.shuffle[n1, ] <- bwt.shuffle[n2, ]
      bwt.shuffle[n2, ] <- temp
    }

    # Split into folds; each fold serves once as the test set
    fold <- k; testsize <- round(datasize / fold); rate.fold <- rep(0, fold)
    for (i in 1:fold) {
      test.start <- (i - 1) * testsize; test.end <- test.start + testsize
      if (test.end > datasize) test.end <- datasize
      bwt.test <- data.frame(bwt.shuffle[(test.start + 1):test.end, ])
      bwt.train <- data.frame(bwt.shuffle[-((test.start + 1):test.end), ])
      # Fit on the training folds, predict probabilities on the held-out fold
      train.glm <- glm(low ~ ., family = binomial, data = bwt.train)
      pred <- predict(train.glm,
                      subset(bwt.test, select = c(age, lwt, race, smoke, ptd, ht, ui, ftv)),
                      type = "response")
      # Proportion of correct predictions at the 0.5 threshold
      rate.fold[i] <- sum(round(pred) == bwt.test$low) / length(pred)
    }
    cvrate <- mean(rate.fold)
    cat("prediction error rate of cross validation:", 1 - cvrate, "\n")
