
  1. Lecture: Data Mining in R, 732A44 Programming in R

  2. Logistic regression: two classes • Consider a logistic model with one predictor: X = price of the car, Y = equipment • Logistic model: p(Y=1|X) = exp(β0 + β1X) / (1 + exp(β0 + β1X)) • Use function glm(formula, family, data) • Formula: Response ~ Model • The model is built from terms: a+b (addition), a:b (interaction), a*b (addition and interaction), . (all predictors) • Family: specify binomial

  3. Logistic regression: two classes reg <- glm(X3...Equipment ~ Price.in.SEK., family=binomial, data=mydata);

  4. Logistic regression: several predictors Data about contraceptive use • Several diagnostic plots can be obtained by plot(lrfit) • Response: a two-column matrix of successes/failures
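A minimal sketch of a binomial fit with a success/failure matrix response. The data frame cuse, its column names and all counts below are made up for illustration; only the cbind(successes, failures) ~ predictors pattern is the point.

# Hypothetical contraceptive-use counts per predictor combination (values invented)
cuse <- data.frame(
  age       = rep(c("<25", "25-29", "30-39", "40-49"), times = 2),
  education = rep(c("low", "high"), each = 4),
  using     = c(6, 14, 33, 6, 4, 10, 80, 31),
  notUsing  = c(53, 10, 80, 48, 12, 19, 115, 58)
)

# The response is a two-column matrix: cbind(successes, failures)
lrfit <- glm(cbind(using, notUsing) ~ age + education,
             family = binomial, data = cuse)
summary(lrfit)
plot(lrfit)   # several diagnostic plots, as mentioned on the slide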

  5. Logistic regression: further comments • Nominal (multinomial) logistic regression: library mlogit, function mlogit() • Stepwise model selection: step() function • Prediction: predict() function (see the sketch below)
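A small sketch of stepwise selection and prediction, applied to the binomial fit lrfit from the hypothetical example above; type = "response" returns fitted probabilities.

# Stepwise model selection by AIC, starting from the fitted model
lrstep <- step(lrfit, direction = "both")

# Predicted probabilities; newdata can be any data frame with the same predictor columns
phat <- predict(lrstep, newdata = cuse, type = "response")
head(phat)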

  6. Smoothing splines Minimize a penalized sum of squared residuals, RSS(f, λ) = Σᵢ (yᵢ − f(xᵢ))² + λ ∫ f''(t)² dt, where λ is the smoothing parameter. λ = 0: any function interpolating the data; λ = +∞: the least-squares line fit

  7. Smoothing splines • smooth.spline(x, y, df, spar, cv, …) • df: degrees of freedom • spar: penalty (smoothing) parameter • cv = TRUE: ordinary (leave-one-out) cross-validation • cv = FALSE: generalized cross-validation (GCV) • cv = NA: no cross-validation plot(m2$Kilometer, m2$Price, main="df=40"); res<-smooth.spline(m2$Kilometer, m2$Price, df=40); lines(res, col="blue");
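A brief follow-up sketch: the fitted smooth.spline object res can be evaluated at new x values with predict(); m2 with columns Kilometer and Price is the data frame assumed on the slide.

# Evaluate the fitted spline on a grid of new Kilometer values
newx <- seq(min(m2$Kilometer), max(m2$Kilometer), length.out = 100)
pred <- predict(res, x = newx)   # returns a list with components x and y
lines(pred, col = "red", lty = 2)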

  8. Generalized additive models A function of the expected response is additive in the set of inputs, i.e., g(E[Y]) = α + f₁(X₁) + f₂(X₂) + … + fₚ(Xₚ) Example: nonlinear logistic regression of a binary response, log(p/(1−p)) = α + f₁(X₁) + … + fₚ(Xₚ)

  9. GAM Library: mgcv • gam(formula, family=gaussian, data, method="GCV.Cp", select=FALSE, sp) • method: method for selection of smoothing parameters • select: TRUE means variable selection is performed • sp: smoothing parameters (maximal df) • Formula: usual terms plus spline terms s(…) • Car properties data • predict.gam() can be used for predictions (see the sketch below) bp<-gam(MPG~s(WT, sp=2)+s(SP, sp=1), data=m3) vis.gam(bp, theta=10, phi=30);
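A minimal prediction sketch for the fitted GAM bp; the data frame m3 with columns WT and SP is assumed from the slide, and the new values below are invented for illustration.

library(mgcv)

# Hypothetical new cars to predict MPG for
newcars <- data.frame(WT = c(2.5, 3.5), SP = c(100, 120))

# predict() dispatches to predict.gam for a gam object
pred <- predict(bp, newdata = newcars, type = "response")
pred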

  10. GAM: smoothing components plot(bp, pages=1)

  11. Principal components analysis Idea: introduce a new coordinate system (PC1, PC2, …) where • The first principal component (PC1) is the direction that maximizes the variance of the projected data • The second principal component (PC2) is the direction that maximizes the variance of the projected data after the variation along PC1 has been removed • … In the new coordinate system, the coefficients corresponding to the last principal components are very small, so these columns can be dropped. [Figure: data cloud with the PC1 and PC2 directions]

  12. Principal components analysis • princomp(x, ...) m4<-m3; m4$MODEL<-c(); res<-princomp(m4); loadings(res); plot(res); biplot(res); summary(res);
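A short follow-up sketch: the component scores (the data in the new coordinate system) are stored in res$scores, and with variables on very different scales it is often preferable to base the PCA on the correlation matrix (cor = TRUE). Keeping exactly two components below is an arbitrary choice for illustration.

# Scores: observations expressed in the PC coordinate system;
# keeping only the first two columns gives a 2-D representation
scores2d <- res$scores[, 1:2]
plot(scores2d, xlab = "PC1", ylab = "PC2")

# With variables on different scales, base the PCA on correlations instead
res_cor <- princomp(m4, cor = TRUE)
summary(res_cor)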

  13. Decision trees [Figure: example classification tree splitting on X1 (at 9 and 15) and X2 (at 16 and 7), with leaves labelled 0 or 1, shown next to the corresponding rectangular partition of the (X1, X2) plane]

  14. Regression tree example

  15. Training-validation-test • Training/validation split (60/40) • If a training/validation/test split is required, use a similar strategy (see the sketch below) sub <- sample(nrow(m2), floor(nrow(m2) * 0.6)) training <- m2[sub, ] validation <- m2[-sub, ]
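A minimal sketch of the "similar strategy" for a three-way split; the 60/20/20 proportions and the seed value are assumptions for illustration, m2 is the data frame from the slide.

set.seed(12345)                      # for a reproducible split
n <- nrow(m2)
idx <- sample(n)                     # random permutation of row indices

n_train <- floor(n * 0.6)
n_valid <- floor(n * 0.2)

training   <- m2[idx[1:n_train], ]
validation <- m2[idx[(n_train + 1):(n_train + n_valid)], ]
test       <- m2[idx[(n_train + n_valid + 1):n], ]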

  16. Decision trees by CART Library "tree" • Growing a full tree: tree(formula, data, subset, split = c("deviance", "gini"), …) • subset: if only a subset of cases is to be used for training • split: splitting criterion • More parameters via the control argument • Prune the tree with the help of a validation set: prune.tree(tree, newdata, method = c("deviance", "misclass"), …) • Prune the tree with cross-validation: cv.tree(object, FUN = prune.tree, K = 10, ...) • K is the number of folds in cross-validation

  17. Classification trees: CART Example: olive oils in Italy sub <- sample(nrow(m5), floor(nrow(m5) * 0.6)) training <- m5[sub, ] validation <- m5[-sub, ] mytree<-tree(Area~.-Region-X, data=training); summary(mytree) plot(mytree, type="uniform"); text(mytree, cex=0.5);

  18. Classification trees: CART • Dependence of the misclassification rate on the size of the tree (a pruning sketch follows below): treeseq1<-prune.tree(mytree, newdata=validation, method="misclass") plot(treeseq1); title("Validation"); treeseq2<-cv.tree(mytree, method="misclass") plot(treeseq2); title("CV");
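A follow-up sketch: once a tree size has been chosen from the validation or CV curve, the tree can be pruned to that size and its misclassification rate computed on the validation set. The value best = 5 is an arbitrary illustration, not a number from the slides.

# Prune to a chosen number of leaves (5 is just an example value)
pruned <- prune.tree(mytree, best = 5, method = "misclass")

# Confusion matrix and misclassification rate on the validation set
pred <- predict(pruned, newdata = validation, type = "class")
tab  <- table(validation$Area, pred)
tab
1 - sum(diag(tab)) / sum(tab)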

  19. Regression trees: CART mytree2<-tree(eicosenoic~linoleic+linolenic+palmitic+palmitoleic, data=training); mytree3<-prune.tree(mytree2, best=4) # 4 leaves in total print(mytree3) summary(mytree3) plot(mytree3) text(mytree3)

  20. Decision trees: other techniques • Conditional inference trees, library: party • CART, another library: "rpart" library(party); training$X<-c(); training$Area<-c(); mytree4<-ctree(Region~., data=training); print(mytree4) plot(mytree4, type="simple"); # gives nice plots

  21. Neural network • Input nodes, input layer • [Hidden nodes, hidden layer(s)] • Output nodes, output layer • Weights • Activation functions • Combination functions [Figure: feed-forward network with inputs x1 … xp, hidden units z1 … zM and outputs f1 … fK]

  22. Neural networks • Feed-forward NNs, library: neuralnet • neuralnet(formula, data, hidden = 1, rep = 1, startweights = NULL, algorithm = "rprop+", err.fct = "sse", act.fct = "logistic", linear.output = TRUE, …) • hidden: vector giving the number of hidden neurons in each layer • rep: number of training repetitions of the network • startweights: starting weights • algorithm: "backprop", "rprop+", "sag", "slr" • err.fct: any function, or "sse", or "ce" (cross-entropy) • act.fct: any function, or "logistic", or "tanh" • linear.output: TRUE if no activation at the output • confidence.interval(x, alpha = 0.05): confidence intervals for the weights • compute(x, covariate): prediction • plot(x, …): plot the given neural network

  23. Neural networks • Example library(neuralnet); mynet<-neuralnet(Region~eicosenoic+linoleic+linolenic+palmitic, data=training, rep=5, hidden=c(2,2), act.fct="tanh") plot(mynet); mynet$result.matrix

  24. Neural networks • Prediction with compute() (see the sketch below) • Finding the misclassification rate: table(true_values, predicted_values) – not only for neural networks • Another package, ready for a qualitative response (classical nnet): library(nnet); mynet1<-nnet(Region~eicosenoic+linoleic, data=training, size=3); coef(mynet1) predict(mynet1, newdata=validation);
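A minimal sketch of prediction with compute() and a confusion table for the neuralnet model from the previous slide; mynet, training and validation are assumed from the earlier examples, and rounding the network output to a numeric Region code is an illustrative convention, not something stated on the slides.

# Network outputs for the validation covariates (same columns as in the formula)
covars <- validation[, c("eicosenoic", "linoleic", "linolenic", "palmitic")]
out <- compute(mynet, covars)$net.result

# Assumed rule: if Region is coded numerically, round the output to get a class
pred <- round(out)

# Confusion table and misclassification rate (works for any classifier)
tab <- table(validation$Region, pred)
tab
1 - sum(diag(tab)) / sum(tab)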

  25. Clustering • The purpose is to identify (separated) groups of observations in the input space • K-means • Hierarchical • Density-based

  26. K-means • The number of seeds K should be given • Starting seed positions are needed • kmeans(x, centers, iter.max = 10, nstart = 1) • x: data frame • centers: either the value K or a set of initial cluster centers • iter.max: maximum number of iterations res<-kmeans(data.frame(m5$linoleic, m5$eicosenoic), 2);

  27. K-means • One way to visualize the result plot(m5$linoleic, m5$eicosenoic, col=res$cluster); points(res$centers[,1], res$centers[,2], col=1:2, pch=8, cex=2)

  28. Hierarchical clustering • Agglomerative: • Place each point into its own cluster • Merge the nearest clusters until you get 1 cluster • What does "two objects are close" mean? • Measure of proximity (e.g. for quantitative variables, Euclidean distance) • Similarity measure s_rs (= 1 if the same object, < 1 otherwise) • Ex: correlation • Dissimilarity measure δ_rs (= 0 if the same object, > 0 otherwise) • Ex: Euclidean distance

  29. Hierarchical clustering • hclust(d, method = "complete", members=NULL) • d: dissimilarity measure (e.g. from dist()) • method: "ward", "single", "complete", "average", "mcquitty", "median" or "centroid" • Returned: a tree showing the merging sequence • cutree(tree, k = NULL, h = NULL) • k: number of clusters to make • h: at which height to cut • Returned: cluster index for each observation

  30. Hierarchical clustering • Example x<-data.frame(m5$linolenic, m5$eicosenoic); m5_dist<-dist(x); m5_dend<-hclust(m5_dist, method="complete") plot(m5_dend);

  31. Hierarchical clustering • Example, continued. DO NOT forget to standardize (see the sketch below)! clust=cutree(m5_dend, k=2); plot(m5$linoleic, m5$eicosenoic, col=clust);
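A small sketch of what "standardize" can mean here: scaling each variable to zero mean and unit variance before computing distances, so that no single variable dominates the Euclidean distance. The variables repeat the slide's example; the rest is an assumed, illustrative workflow.

# Standardize the variables before computing distances
x_std <- scale(data.frame(m5$linolenic, m5$eicosenoic))

m5_dist_std <- dist(x_std)
m5_dend_std <- hclust(m5_dist_std, method = "complete")
clust_std   <- cutree(m5_dend_std, k = 2)
plot(m5$linolenic, m5$eicosenoic, col = clust_std)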

  32. Density-based clustering • Kernel-based density estimation. Library: pdfCluster • pdfCluster(x, h = h.norm(x), hmult = 0.75, …) • x: data to be partitioned • h: a vector of smoothing parameters • hmult: shrinkage factor library(pdfCluster); x<-data.frame(m5$linolenic, m5$eicosenoic); res<-pdfCluster(x); plot(res)

  33. Reference http://cran.r-project.org/doc/contrib/YanchangZhao-refcard-data-mining.pdf
