Lab exercises regression knn and k means
This presentation is the property of its rightful owner.
Sponsored Links
1 / 36

Lab exercises: regression, kNN and K-means PowerPoint PPT Presentation


  • 88 Views
  • Uploaded on
  • Presentation posted in: General

Lab exercises: regression, kNN and K-means. Peter Fox Data Analytics – ITWS-4963/ITWS-6965 Week 4 b , February 14, 2014. Today. Linear regression K Nearest Neighbors K Means. The Dataset(s). http://escience.rpi.edu/data/DA Some new ones; nyt / and sales/ and the fb100/ (.mat files)

Download Presentation

Lab exercises: regression, kNN and K-means

An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Presentation Transcript


Lab exercises regression knn and k means

Lab exercises: regression, kNN and K-means

Peter Fox

Data Analytics – ITWS-4963/ITWS-6965

Week 4b, February 14, 2014


Today

Today

  • Linear regression

  • K Nearest Neighbors

  • K Means


The dataset s

The Dataset(s)

  • http://escience.rpi.edu/data/DA

  • Some new ones; nyt/ and sales/ and the fb100/ (.mat files)

  • 3 script (fragments, i.e. they will not run as-is, I think) to help with code for today: Lab4b_{1,2,3}.R


Linear and least squares

Linear and least-squares

> multivariate <- read.csv(”EPI_data.csv")

> attach(EPI_data);

> boxplot(ENVHEALTH,DALY,AIR_H,WATER_H)

> lmENVH<-lm(ENVHEALTH~DALY+AIR_H+WATER_H)

> lmENVH

… (what should you get?)

> summary(lmENVH)

> cENVH<-coef(lmENVH)


Predict

Predict

> DALYNEW<-c(seq(5,95,5))

> AIR_HNEW<-c(seq(5,95,5))

> WATER_HNEW<-c(seq(5,95,5))

> NEW<-data.frame(DALYNEW,AIR_HNEW,WATER_HNEW)

> pENV<- predict(lmENV,NEW,interval=“prediction”)

> cENV<- predict(lmENV,NEW,interval=“confidence”)


Repeat for

Repeat for

AIR_E

CLIMATE


Remember a few useful cmds

Remember a few useful cmds

head(<object>)

tail(<object>)

summary(<object>)


K nearest neighbors classification

K Nearest Neighbors (classification)

> nyt1<-read.csv(“nyt1.csv")

> nyt1<-nyt1[which(nyt1$Impressions>0 & nyt1$Clicks>0 & nyt1$Age>0),]

> nnyt1<-dim(nyt1)[1]# shrink it down!

> sampling.rate=0.9

> num.test.set.labels=nnyt1*(1.-sampling.rate)

> training <-sample(1:nnyt1,sampling.rate*nnyt1, replace=FALSE)

> train<-subset(nyt1[training,],select=c(Age,Impressions))

> testing<-setdiff(1:nnyt1,training)

> test<-subset(nyt1[testing,],select=c(Age,Impressions))

> cg<-nyt1$Gender[training]

> true.labels<-nyt1$Gender[testing]

> classif<-knn(train,test,cg,k=5)

#

> classif

> attributes(.Last.value)

# interpretation to come!


Regression

Regression

> bronx<-read.xlsx(”sales/rollingsales_bronx.xls",pattern="BOROUGH",stringsAsFactors=FALSE,sheetIndex=1,startRow=5,header=TRUE)

> plot(log(bronx$GROSS.SQUARE.FEET), log(bronx$SALE.PRICE) )

> m1<-lm(log(bronx$SALE.PRICE)~log(bronx$GROSS.SQUARE.FEET),data=bronx)

  • What’s wrong?


Clean up

Clean up…

> bronx<-bronx[which(bronx$GROSS.SQUARE.FEET>0 & bronx$LAND.SQUARE.FEET>0 & bronx$SALE.PRICE>0),]

> m1<-lm(log(bronx$SALE.PRICE)~log(bronx$GROSS.SQUARE.FEET),data=bronx)

#

> summary(m1)


Lab exercises regression knn and k means

Call:

lm(formula = log(SALE.PRICE) ~ log(GROSS.SQUARE.FEET), data = bronx)

Residuals:

Min 1Q Median 3Q Max

-14.4529 0.0377 0.4160 0.6572 3.8159

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 7.0271 0.3088 22.75 <2e-16 ***

log(GROSS.SQUARE.FEET) 0.7013 0.0379 18.50 <2e-16 ***

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.95 on 2435 degrees of freedom

Multiple R-squared: 0.1233, Adjusted R-squared: 0.1229

F-statistic: 342.4 on 1 and 2435 DF, p-value: < 2.2e-16


Lab exercises regression knn and k means

Plot

> plot(log(bronx$GROSS.SQUARE.FEET), log(bronx$SALE.PRICE))

> abline(m1,col="red",lwd=2)

# then

> plot(resid(m1))


Another model 2

Another model (2)?

Add two more variables to the linear model LAND.SQUARE.FEET and NEIGHBORHOOD

Repeat but suppress the intercept (2a)


Model 3 4

Model 3/4

Model 3

Log(SALE.PRICE) vs. no intercept Log(GROSS.SQUARE.FEET), Log(LAND.SQUARE.FEET), NEIGHBORHOOD, BUILDING.CLASS.CATEGORY

Model 4

Log(SALE.PRICE) vs. no intercept Log(GROSS.SQUARE.FEET), Log(LAND.SQUARE.FEET), NEIGHBORHOOD*BUILDING.CLASS.CATEGORY


Solution model 2

Solution model 2

> m2<-lm(log(bronx$SALE.PRICE)~log(bronx$GROSS.SQUARE.FEET)+log(bronx$LAND.SQUARE.FEET)+factor(bronx$NEIGHBORHOOD),data=bronx)

> summary(m2)

> plot(resid(m2))

#

> m2a<-lm(log(bronx$SALE.PRICE)~0+log(bronx$GROSS.SQUARE.FEET)+log(bronx$LAND.SQUARE.FEET)+factor(bronx$NEIGHBORHOOD),data=bronx)

> summary(m2a)

> plot(resid(m2a))


Solution model 3 and 4

Solution model 3 and 4

> m3<-lm(log(bronx$SALE.PRICE)~0+log(bronx$GROSS.SQUARE.FEET)+log(bronx$LAND.SQUARE.FEET)+factor(bronx$NEIGHBORHOOD)+factor(bronx$BUILDING.CLASS.CATEGORY),data=bronx)

> summary(m3)

> plot(resid(m3))

#

> m4<-lm(log(bronx$SALE.PRICE)~0+log(bronx$GROSS.SQUARE.FEET)+log(bronx$LAND.SQUARE.FEET)+factor(bronx$NEIGHBORHOOD)*factor(bronx$BUILDING.CLASS.CATEGORY),data=bronx)

> summary(m4)

> plot(resid(m4))


And now a complex example

And now… a complex example

> install.packages("geoPlot")

> install.packages(”xslx")

> require(class)

> require(gdata)

> require(geoPlot)

> require(”xslx”)

#if not already read-in:

bronx<-read.xlsx(”sales/rollingsales_bronx.xls",pattern="BOROUGH",stringsAsFactors=FALSE,sheetIndex=1,startRow=5,header=TRUE)


View bronx

View(bronx)

#clean up with regular expressions

bronx$SALE.PRICE<- as.numeric(gsub("[^]:digit:]]","",bronx$SALE.PRICE))

#missing values?

sum(is.na(bronx$SALE.PRICE))

#zero sale prices

sum(bronx$SALE.PRICE==0)

#clean these numeric and date fields

bronx$GROSS.SQUARE.FEET<- as.numeric(gsub("[^]:digit:]]","",bronx$GROSS.SQUARE.FEET))

bronx$LAND.SQUARE.FEET<- as.numeric(gsub("[^]:digit:]]","",bronx$LAND.SQUARE.FEET))

bronx$SALE.DATE<- as.Date(gsub("[^]:digit:]]","",bronx$SALE.DATE))

bronx$YEAR.BUILT<- as.numeric(gsub("[^]:digit:]]","",bronx$YEAR.BUILT))

bronx$ZIP.CODE<- as.character(gsub("[^]:digit:]]","",bronx$ZIP.CODE))


More corrections

More corrections

#filter out low prices

minprice<-10000

bronx<-bronx[which(bronx$SALE.PRICE>=minprice),]

#how many left?

nval<-dim(bronx)[1]

#addresses contain apartment #'s even though there is another column for that - remove them - compresses addresses

bronx$ADDRESSONLY<- gsub("[,][[:print:]]*","",gsub("[ ]+","",trim(bronx$ADDRESS)))

#new data frame for sorting the addresses, fixing etc.

bronxadd<-unique(data.frame(bronx$ADDRESSONLY, bronx$ZIP.CODE,stringsAsFactors=FALSE))

# fix the names

names(bronxadd)<-c("ADDRESSONLY","ZIP.CODE")

bronxadd<-bronxadd[order(bronxadd$ADDRESSONLY),]


Yep more

Yep, more…

# duplicates?

duplicates<-duplicated(bronxadd$ADDRESSONLY)

##if(duplicates) dupadd<-bronxadd[bronxadd$duplicates,1]

##bronxadd<-bronxadd[(bronxadd$ADDRESSONLY!=dupadd[1] & bronxadd$ADDRESSONLY != dupadd[2]),]

#how many?

nadd<-dim(bronxadd)[1]


Oh we want nearest neighbors how

Oh, we want nearest neighbors? How?

#problem, we need a spatial distribution since none of the columns have that

#we will use google maps so limit the number to under 500 (ask me why)

nsample=450

addsample<-bronxadd[sample.int(nadd,size=nsample),]

#new data frame for the full address

addrlist<-data.frame(1:nsample,addsample$ADDRESSONLY,rep("NEW YORK",times=nsample),rep("NY",times=nsample),addsample$ZIP.CODE,rep("US",times=nsample))

#look them up

querylist<-addrListLookup(addrlist)


Lots missing why

Lots missing – why?

# how many returned valid lat/long?

querylist$matched <- (querylist$latitude !=0)

unmatchedid<- which(!querylist$matched)

#MANY missing - what's up?

unmatched<- length(unmatchedid)

#WEST -> W and EAST -> E - do again.

addrlist2<-data.frame(1:unmatched,gsub(" WEST "," W ",gsub(" EAST "," E ",addsample[unmatchedid,1])),rep("NEW YORK",times=unmatched),rep("NY",times=unmatched),addsample[unmatchedid,2],rep("US",times=unmatched))

querylist[unmatchedid,1:4]<-addrListLookup(addrlist2)[,1:4]

querylist$matched <- (querylist$latitude !=0)

unmatchedid<- which(!querylist$matched)

unmatched<- length(unmatchedid)


Not enough

Not enough

#this fixed a LOT but we need more: STREET and AVENUE (could have done PLACE) and others

addrlist3<-data.frame(1:unmatched,gsub("WEST","W",gsub("EAST","E",gsub("STREET","ST ", gsub("AVENUE","AVE", addsample[unmatchedid,1])))),rep("NEW YORK", times=unmatched), rep("NY",times=unmatched), addsample[unmatchedid,2], rep("US",times=unmatched))

querylist[unmatchedid,1:4]<-addrListLookup(addrlist3)[,1:4]

querylist$matched <- (querylist$latitude !=0)

unmatchedid<- which(!querylist$matched)

unmatched<- length(unmatchedid)

# 9 left now? good enough.


Rebuild

Rebuild!

addsample<-cbind(addsample,querylist$latitude,querylist$longitude)

##names(addsample[3:4])<-c("latitude","longitude") - this was meant to correct the column names but did not work for me

addsample<-addsample[addsample$'querylist$latitude'!=0,] # note ' ' to work around column name

adduse<-merge(bronx,addsample)

adduse<-adduse[!is.na(adduse$latitude),]


Most satisfying part

Most satisfying part!

mapcoord<-adduse[,c(2,4,24,25)]

table(mapcoord$NEIGHBORHOOD)

mapcoord$NEIGHBORHOOD <- as.factor(mapcoord$NEIGHBORHOOD)

#

geoPlot(mapcoord,zoom=12,color=mapcoord$NEIGHBORHOOD)


Did you forget the knn

Did you forget the KNN?

#almost there.

mapcoord$class<as.numeric(mapcoord$NEIGHBORHOOD)

nclass<-dim(mapcoord)[1]

split<-0.8

trainid<-sample.int(nclass,floor(split*nclass))

testid<-(1:nclass)[-trainid]

##mappred<-mapcoord[testid,]

##mappred$class<as.numeric(mappred$NEIGHBORHOOD)


Lab exercises regression knn and k means

KNN!

kmax<-10

knnpred<-matrix(NA,ncol=kmax,nrow=length(testid))

knntesterr<-rep(NA,times=kmax)

for (i in 1:kmax){# loop over k

knnpred[,i]<-knn(mapcoord[trainid,3:4],mapcoord[testid,3:4],cl=mapcoord[trainid,2],k=i)

knntesterr[i]<-sum(knnpred[,i]!=mapcoord[testid,2])/length(testid)

}


Finally k means

Finally K-Means!

> mapmeans<-data.frame(adduse$ZIP.CODE, as.numeric(mapcoord$NEIGHBORHOOD), adduse$TOTAL.UNITS, adduse$"LAND.SQUARE.FEET", adduse$GROSS.SQUARE.FEET, adduse$SALE.PRICE, adduse$'querylist$latitude', adduse$'querylist$longitude')

> mapobj<-kmeans(mapmeans,5, iter.max=10, nstart=5, algorithm = c("Hartigan-Wong", "Lloyd", "Forgy", "MacQueen"))

> fitted(mapobj,method=c("centers","classes"))

> plot(mapmeans,mapobj$cluster)


Assignment 3

Assignment 3

  • Preliminary and Statistical Analysis. Due ~ Feb 28. 15% (written)

    • Distribution analysis and comparison, visual ‘analysis’, statistical model fitting and testing of some of the nyt1…31 datasets.


Tentative assignments

Tentative assignments

  • Assignment 4: Pattern, trend, relations: model development and evaluation. Due ~ early March. 15% (10% written and 5% oral; individual);

  • Assignment 5: Term project proposal. Due ~ week 7. 5% (0% written and 5% oral; individual);

  • Assignment 6: Predictive and Prescriptive Analytics. Due ~ week 8. 15% (15% written and 5% oral; individual);

  • Term project. Due ~ week 13. 30% (25% written, 5% oral; individual).


Admin info keep print this slide

Admin info (keep/ print this slide)

  • Class: ITWS-4963/ITWS 6965

  • Hours: 12:00pm-1:50pm Tuesday/ Friday

  • Location: SAGE 3101

  • Instructor: Peter Fox

  • Instructor contact: [email protected], 518.276.4862 (do not leave a msg)

  • Contact hours: Monday** 3:00-4:00pm (or by email appt)

  • Contact location: Winslow 2120 (sometimes Lally 207A announced by email)

  • TA: Lakshmi Chenicheri [email protected]

  • Web site: http://tw.rpi.edu/web/courses/DataAnalytics/2014

    • Schedule, lectures, syllabus, reading, assignments, etc.


Table matlab r scipy numpy

Table: Matlab/R/scipy-numpy

  • http://hyperpolyglot.org/numerical-analysis


  • Login