
Orthogonal (Symmetric, Total) Regression


kostya

Presentation Transcript


  1. Orthogonal (symmetric, total) regression

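  2. Which is y, which is x? • Does height predict weight, or does weight predict height? • Does the father's height predict the child's height, or is the father's height inferred from the child's height?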

  3. #heights {alr3} #Karl Pearson organized the collection of data on over 1100 families in England #in the period 1893 to 1898. This particular data set gives the heights # in inches of mothers and their daughters, with up to two daughters per mother. # All daughters are at least age 18, and all mothers are younger than 65. #Data were given in the source as a frequency table to the nearest inch. #Rounding error has been added to remove discreteness from graph • #Davis {car} The Davis data frame has 200 rows and 5 columns. The subjects were men and women engaged in regular exercise. There are some missing data. • # father.son {UsingR} #1078 measurements of a father's height and his son's height

  4. father.son {UsingR} • library(UsingR) # provides the father.son data • summary(lm(fheight~sheight,father.son)) • summary(lm(sheight~fheight,father.son)) • o1=lm(fheight~sheight,father.son) • o2=lm(sheight~fheight,father.son) • plot(fheight~sheight,father.son) • s.prid=expand.grid(sheight=seq(50,90,1))

  5. s.prid$fheight=predict(o1,s.prid) • s.prid2=expand.grid(fheight=s.prid$fheight) • s.prid2$sheight=predict(o2,s.prid2) • lines(fheight~sheight,s.prid,col="red") • lines(fheight~sheight,s.prid2,col="blue") • legend("topleft",c("fheight~sheight","sheight~fheight"),lty=1,col=c("red","blue"))

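  6. Symmetric regression • If it is hard to decide which of x and y is the response variable, how should the functional relationship between them be established? • If x and y play equal (symmetric) roles, neither y~x nor x~y is appropriate. A symmetric regression method should be used instead: major axis regression (or orthogonal regression), reduced major axis regression (or impartial regression), or bisector regression (or double regression). • Pearson proposed major axis regression (also called orthogonal regression), which is one such symmetric regression method.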
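The major axis line can be sketched with a principal-component view: it passes through the point of means along the first eigenvector of the covariance matrix. The snippet below is a minimal illustration using the IQ and P vectors from the data slide of this deck; the names b_ma, a_ma, and b_sw are introduced here for illustration only.

```r
# Data from the "Data" slide of this presentation
IQ <- c(90, 92, 93, 95, 97, 98, 100)
P  <- c(39, 42, 36, 45, 39, 45, 42)

# The first eigenvector of the covariance matrix gives the major-axis direction
v <- eigen(cov(cbind(IQ, P)))$vectors[, 1]
b_ma <- v[2] / v[1]                 # slope dy/dx; the arbitrary sign of v cancels
a_ma <- mean(P) - b_ma * mean(IQ)   # the line passes through the point of means

# Symmetry check: swapping the roles of x and y gives the reciprocal slope,
# i.e. the same line in the plane
v_sw <- eigen(cov(cbind(P, IQ)))$vectors[, 1]
b_sw <- v_sw[2] / v_sw[1]
c(b_ma = b_ma, a_ma = a_ma, product = b_ma * b_sw)
```

Unlike the two ordinary regressions, the fitted line does not depend on which variable is called x.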

  7. Reduced major axis regression (impartial regression): the SD line
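A minimal sketch of the SD line, again on the IQ and P vectors from the data slide: its slope is sign(r) times sd(y)/sd(x), which equals the geometric mean of the two ordinary regression slopes when both are expressed as dy/dx. The variable names are chosen here for illustration.

```r
IQ <- c(90, 92, 93, 95, 97, 98, 100)
P  <- c(39, 42, 36, 45, 39, 45, 42)

r    <- cor(IQ, P)
b_sd <- sign(r) * sd(P) / sd(IQ)    # slope of the SD line
a_sd <- mean(P) - b_sd * mean(IQ)   # intercept: the line passes through the means

# The two ordinary regression slopes, both expressed as dy/dx
b_yx <- unname(coef(lm(P ~ IQ))[2])       # y regressed on x
b_xy <- 1 / unname(coef(lm(IQ ~ P))[2])   # x regressed on y, inverted

# Geometric mean of the two slopes reproduces the SD-line slope
c(b_sd = b_sd, geometric_mean = sign(r) * sqrt(b_yx * b_xy))
```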

  8. Other symmetric regressions • Bisector regression (double regression): bisects the angle between the y~x and x~y regression lines
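The bisector can be illustrated by averaging the angles of the two regression lines and comparing the result with the closed-form slope used in the program slide of this deck. This is a sketch on the IQ and P data from the data slide; b_angle and b_formula are names introduced here.

```r
IQ <- c(90, 92, 93, 95, 97, 98, 100)
P  <- c(39, 42, 36, 45, 39, 45, 42)

b1 <- unname(coef(lm(P ~ IQ))[2])        # slope of y~x
b2 <- 1 / unname(coef(lm(IQ ~ P))[2])    # slope of x~y, expressed as dy/dx

# Bisect the angle between the two lines
b_angle <- tan((atan(b1) + atan(b2)) / 2)

# Closed-form expression for the same slope
b_formula <- (b1 * b2 - 1 + sqrt((1 + b1^2) * (1 + b2^2))) / (b1 + b2)

c(b_angle = b_angle, b_formula = b_formula)
```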

  9. Bivariate normal distribution: regression and inverse regression
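For a bivariate normal pair, the regression of y on x has slope rho*s_y/s_x, and the inverse regression of x on y has slope rho*s_x/s_y. The simulation below is a rough check under assumed parameter values (rho = 0.6, s_x = 2, s_y = 3, zero means); none of these numbers come from the slides.

```r
set.seed(42)
n   <- 20000
s_x <- 2; s_y <- 3; rho <- 0.6   # assumed values, for illustration only

# Build a bivariate normal pair from two independent standard normals
x <- s_x * rnorm(n)
y <- s_y * (rho * x / s_x + sqrt(1 - rho^2) * rnorm(n))

b_reg <- unname(coef(lm(y ~ x))[2])   # estimates rho * s_y / s_x
b_inv <- unname(coef(lm(x ~ y))[2])   # estimates rho * s_x / s_y

# Drawn in the (x, y) plane, the inverse-regression line has slope 1 / b_inv,
# which is steeper than b_reg whenever |rho| < 1
c(b_reg = b_reg, b_inv = b_inv, inverse_line_slope = 1 / b_inv)
```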

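  10. Program • ol<-function(x,y) • { • s_xy=sum((x-mean(x))*(y-mean(y))) • s_xx=sum((x-mean(x))^2) • s_yy=sum((y-mean(y))^2) • b1=s_xy/s_xx # OLS slope of y regressed on x • b2=s_yy/s_xy # slope of the x-on-y regression line, expressed as dy/dx • r=cor(x,y) • b_ol=((b2-1/b1)+sign(r)*sqrt(4+(b2-1/b1)^2))/2 # orthogonal (major axis) slope • b_sd=sign(r)*sqrt(b1*b2) # reduced major axis (SD line) slope • b_bi=(b1*b2-1+sqrt((1+b1^2)*(1+b2^2)))/(b1+b2) # bisector slope • B=list(b_xy=b1,b_yx=b2,b_ol=b_ol,b_sd=b_sd,b_bi=b_bi) • return(B) • }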

  11. Data • IQ=c(90,92,93,95,97,98,100) • P=c(39,42,36,45,39,45,42)

  12. Analysis • B=as.numeric(ol(IQ,P)) • A=mean(P)-B*mean(IQ) • plot(IQ,P) • lines(IQ,A[1]+B[1]*IQ) • lines(IQ,A[2]+B[2]*IQ,col="purple") • lines(IQ,A[3]+B[3]*IQ,col="red") • lines(IQ,A[4]+B[4]*IQ,col="blue") • lines(IQ,A[5]+B[5]*IQ,col="green") • legend("topleft",c("y~x","x~y","ol","sd","bi"),lty=1,col=c("black","purple","red","blue","green"))
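As a numerical check on what the orthogonal line means, the sketch below minimizes the sum of squared perpendicular distances to a line through the point of means and compares the result with the closed-form major-axis slope. The helper perp_ss and the search interval c(0, 3) are choices made here for illustration.

```r
IQ <- c(90, 92, 93, 95, 97, 98, 100)
P  <- c(39, 42, 36, 45, 39, 45, 42)
dx <- IQ - mean(IQ)
dy <- P - mean(P)

# Sum of squared perpendicular distances to a line of slope b through the means
perp_ss <- function(b) sum((dy - b * dx)^2) / (1 + b^2)

# Numerical minimizer over an interval that contains both OLS slopes
b_num <- optimize(perp_ss, interval = c(0, 3))$minimum

# Closed-form major-axis slope from the centered sums of squares and products
d <- sum(dy^2) - sum(dx^2)
s <- sum(dx * dy)
b_closed <- (d + sqrt(d^2 + 4 * s^2)) / (2 * s)

c(b_num = b_num, b_closed = b_closed)
```

The ordinary y~x fit minimizes vertical distances instead, so its slope gives a larger value of perp_ss than b_closed does.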

  13. Some special issues • 1. Outliers / standardization • 2. Centering

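  14. 1. Outliers / standardization Example 1.1. IQ scores of young people are normally distributed; those above the 99% quantile can be defined as intellectually gifted (outliers):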
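A small sketch of Example 1.1, assuming the common IQ scaling with mean 100 and standard deviation 15 (the slide does not state the scale) and a handful of made-up scores:

```r
mu <- 100; sigma <- 15          # assumed IQ scale, not given in the slides

# Cutoff on the raw scale: the 99% quantile of N(mu, sigma^2)
cutoff <- qnorm(0.99, mean = mu, sd = sigma)

# Equivalent rule on the standardized scale
iq <- c(96, 112, 128, 137, 141)   # hypothetical scores
z  <- (iq - mu) / sigma
gifted <- z > qnorm(0.99)

data.frame(iq, z = round(z, 2), gifted)
```

Standardizing first means the same cutoff, qnorm(0.99), applies whatever the original scale.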

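  15. Example 1.2. Body mass index. An inappropriate definition of obesity: anyone whose weight exceeds the 95% quantile of the population is obese. People of different height, sex, and age are not comparable; that is, μ is a function of several factors. A simple but tedious approach is to stratify: for a given group, find the distribution of W and define as obese those who exceed C (for example, the 95% quantile of the standard normal distribution).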
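The stratified rule can be sketched as a within-group z-score compared against C. The data below are simulated purely for illustration (two hypothetical groups with different mean weights), and the rule assumes weight is roughly normal within each stratum.

```r
set.seed(1)
# Simulated weights for two hypothetical strata; not real data
d <- data.frame(
  group  = rep(c("F", "M"), each = 50),
  weight = c(rnorm(50, mean = 58, sd = 8), rnorm(50, mean = 72, sd = 10))
)

# Standardize weight within each stratum
d$z <- ave(d$weight, d$group, FUN = function(v) (v - mean(v)) / sd(v))

C <- qnorm(0.95)          # cutoff on the standardized scale
d$obese <- d$z > C

# For contrast: the pooled 95% quantile rule criticized in the example
d$pooled <- d$weight > quantile(d$weight, 0.95)

table(group = d$group, stratified = d$obese)
table(group = d$group, pooled = d$pooled)
```

With many stratifying factors the number of cells grows quickly, which is why the slide calls this approach tedious.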

  16. Another approach is to remove the effect of log(H) on log(W) (while also controlling for sex G and age), i.e., to assume the regression model:
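The model itself is not reproduced in this transcript. The sketch below assumes a linear form, log(W) regressed on log(H), G, and Age, and flags large standardized residuals; the data are simulated, and every coefficient used to generate them is an arbitrary choice made here.

```r
set.seed(2)
n   <- 200
G   <- factor(sample(c("F", "M"), n, replace = TRUE))
Age <- round(runif(n, 18, 65))
H   <- rnorm(n, mean = ifelse(G == "M", 175, 162), sd = 7)   # height in cm
# Simulated weights (kg); the generating coefficients are assumptions
W   <- exp(-6.1 + 2 * log(H) + 0.08 * (G == "M") + 0.002 * Age +
           rnorm(n, sd = 0.12))

# Assumed model form: log(W) = b0 + b1*log(H) + b2*G + b3*Age + error
fit <- lm(log(W) ~ log(H) + G + Age)

# Standardized residuals are weight with height, sex, and age removed
z     <- rstandard(fit)
obese <- z > qnorm(0.95)

summary(fit)$coefficients
table(obese)
```

One fitted model replaces the separate strata, and the cutoff applies to the residual rather than to raw weight.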

  17. 2. Centering
