Orthogonal (Symmetric, Total) Regression
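Which is y, which is x?
• Does height predict weight, or does weight predict height?
• Does the father's height predict the child's height, or is the child's height used to infer the father's height?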
• heights {alr3}: Karl Pearson organized the collection of data on over 1100 families in England in the period 1893 to 1898. This particular data set gives the heights in inches of mothers and their daughters, with up to two daughters per mother. All daughters are at least age 18, and all mothers are younger than 65. Data were given in the source as a frequency table to the nearest inch. Rounding error has been added to remove discreteness from the graph.
• Davis {car}: The Davis data frame has 200 rows and 5 columns. The subjects were men and women engaged in regular exercise. There are some missing data.
• father.son {UsingR}: 1078 measurements of a father's height and his son's height.
father.son {UsingR}
library(UsingR)   # provides the father.son data frame
summary(lm(fheight~sheight,father.son))
summary(lm(sheight~fheight,father.son))
o1=lm(fheight~sheight,father.son)   # father's height regressed on son's
o2=lm(sheight~fheight,father.son)   # son's height regressed on father's
plot(fheight~sheight,father.son)    # x axis: sheight, y axis: fheight
# fitted line of o1 over a grid of son heights
s.prid=expand.grid(sheight=seq(50,90,1))
s.prid$fheight=predict(o1,s.prid)
# fitted line of o2 over a grid of father heights, drawn on the same axes
s.prid2=expand.grid(fheight=seq(50,90,1))
s.prid2$sheight=predict(o2,s.prid2)
lines(fheight~sheight,s.prid,col="red")
lines(fheight~sheight,s.prid2,col="blue")
legend("topleft",c("fheight~sheight","sheight~fheight"),lty=1,col=c("red","blue"))
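The two fitted lines in the plot do not coincide, and a short check shows why. The sketch below uses simulated heights as a stand-in (the means, spreads, and coefficients are assumptions, not the UsingR values) and confirms that the product of the two OLS slopes equals r^2, so the lines agree only when |r| = 1.

```r
# Simulated stand-in for father/son heights in inches (assumed parameters,
# NOT the father.son data from UsingR).
set.seed(42)
n <- 1000
fheight <- rnorm(n, mean = 68, sd = 2.7)
sheight <- 34 + 0.5 * fheight + rnorm(n, sd = 2.4)

# OLS slope in each direction
b_son_on_father <- unname(coef(lm(sheight ~ fheight))[2])
b_father_on_son <- unname(coef(lm(fheight ~ sheight))[2])
r <- cor(fheight, sheight)

# The product of the two slopes equals r^2
c(product = b_son_on_father * b_father_on_son, r_squared = r^2)

# Slope of the fheight ~ sheight fit redrawn with fheight on the x axis;
# it is steeper than b_son_on_father whenever 0 < r < 1
c(son_on_father = b_son_on_father, inverse_fit = 1 / b_father_on_son)
```

Every symmetric slope discussed in this deck lies between these two values.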
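Symmetric regression
• If it is hard to decide which of x and y is the response variable, how should the functional relationship between them be established?
• If x and y play equivalent (symmetric) roles, neither y~x nor x~y is appropriate. A symmetric regression method should be used instead: major-axis regression (or orthogonal regression), reduced major-axis regression (or impartial regression), or bisector regression (or double regression).
• Pearson proposed major-axis regression (also called orthogonal regression), which is one such symmetric method.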
Reduced major axis regression (impartial regression): the SD line, whose slope is sign(r)·sd(y)/sd(x)
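As a minimal sketch of the SD line, the code below computes its slope and intercept directly from the standard deviations and checks that the slope is the geometric mean of the y~x slope and the inverted x~y slope. The vectors reuse the IQ and P values from this deck's data slide; the variable names are illustrative.

```r
x <- c(90, 92, 93, 95, 97, 98, 100)   # IQ values from the data slide
y <- c(39, 42, 36, 45, 39, 45, 42)    # P values from the data slide

# SD line: passes through the means with slope sign(r) * sd(y) / sd(x)
b_sd <- sign(cor(x, y)) * sd(y) / sd(x)
a_sd <- mean(y) - b_sd * mean(x)

# Geometric-mean form: sqrt of (y~x slope) times (x~y slope expressed as dy/dx)
b_yx <- unname(coef(lm(y ~ x))[2])
b_xy_inverted <- 1 / unname(coef(lm(x ~ y))[2])
b_geo <- sign(cor(x, y)) * sqrt(b_yx * b_xy_inverted)

c(b_sd = b_sd, b_geo = b_geo, intercept = a_sd)
```

Swapping x and y gives the reciprocal slope, which is the sense in which this line treats the two variables symmetrically.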
Other symmetric regressions
• Bisector regression (double regression): the line that bisects the angle between the y~x and x~y regression lines
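The bisector definition can be checked numerically. This sketch, again using the IQ and P values from the data slide, averages the angles of the two regression lines and compares the result with the usual closed-form bisector slope; the variable names are illustrative.

```r
x <- c(90, 92, 93, 95, 97, 98, 100)
y <- c(39, 42, 36, 45, 39, 45, 42)

b1 <- unname(coef(lm(y ~ x))[2])       # slope of y~x
b2 <- 1 / unname(coef(lm(x ~ y))[2])   # slope of x~y, expressed as dy/dx

# Bisector by construction: the line whose angle is the mean of the two angles
b_angle <- tan((atan(b1) + atan(b2)) / 2)

# Closed-form expression for the same slope
b_closed <- (b1 * b2 - 1 + sqrt((1 + b1^2) * (1 + b2^2))) / (b1 + b2)

c(b_angle = b_angle, b_closed = b_closed)
```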
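Program
ol<-function(x,y)
{
  # centered sums of squares and cross-products
  s_xy=sum((x-mean(x))*(y-mean(y)))
  s_xx=sum((x-mean(x))^2)
  s_yy=sum((y-mean(y))^2)
  b1=s_xy/s_xx   # OLS slope from regressing y on x
  b2=s_yy/s_xy   # slope from regressing x on y, expressed as dy/dx
  r=cor(x,y)
  # orthogonal (major-axis) slope; the leading term is +(b2-1/b1),
  # otherwise the result is the reciprocal of the intended slope
  b_ol=((b2-1/b1)+sign(r)*sqrt(4+(b2-1/b1)^2))/2
  # reduced major-axis (SD line) slope
  b_sd=sign(r)*sqrt(b1*b2)
  # bisector slope
  b_bi=(b1*b2-1+sqrt((1+b1^2)*(1+b2^2)))/(b1+b2)
  B=list(b_xy=b1,b_yx=b2,b_ol=b_ol,b_sd=b_sd,b_bi=b_bi)
  return(B)
}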
Data
IQ=c(90,92,93,95,97,98,100)
P=c(39,42,36,45,39,45,42)
Analysis
B=as.numeric(ol(IQ,P))    # the five slopes, in the order returned by ol()
A=mean(P)-B*mean(IQ)      # every line passes through (mean(IQ), mean(P))
plot(IQ,P)
lines(IQ,A[1]+B[1]*IQ)                  # regression of P on IQ
lines(IQ,A[2]+B[2]*IQ,col="purple")     # regression of IQ on P
lines(IQ,A[3]+B[3]*IQ,col="red")        # orthogonal
lines(IQ,A[4]+B[4]*IQ,col="blue")       # SD line
lines(IQ,A[5]+B[5]*IQ,col="green")      # bisector
legend("topleft",c("y~x","x~y","ol","sd","bi"),lty=1,col=c("black","purple","red","blue","green"))
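One way to cross-check an orthogonal-regression slope is to compute it independently as the direction of the first principal component. The sketch below does that for the IQ and P data, compares it with the closed form written in terms of the sums of squares, and confirms that it gives a smaller sum of squared perpendicular distances than the y~x slope. The helper `ssq_perp` is a hypothetical name introduced here for illustration.

```r
IQ <- c(90, 92, 93, 95, 97, 98, 100)
P  <- c(39, 42, 36, 45, 39, 45, 42)

# First eigenvector of the covariance matrix = direction of the major axis
# (eigen() returns eigenvalues of a symmetric matrix in decreasing order)
v <- eigen(cov(cbind(IQ, P)))$vectors[, 1]
b_pca <- v[2] / v[1]

# Closed form from the centered sums of squares and cross-products
s_xx <- sum((IQ - mean(IQ))^2)
s_yy <- sum((P - mean(P))^2)
s_xy <- sum((IQ - mean(IQ)) * (P - mean(P)))
b_closed <- (s_yy - s_xx + sqrt((s_yy - s_xx)^2 + 4 * s_xy^2)) / (2 * s_xy)

# Sum of squared perpendicular distances to a line through the means
ssq_perp <- function(b) sum((P - mean(P) - b * (IQ - mean(IQ)))^2) / (1 + b^2)

c(b_pca = b_pca, b_closed = b_closed)             # both about 0.86
c(orthogonal = ssq_perp(b_pca), ols = ssq_perp(s_xy / s_xx))
```

Because s_yy is smaller than s_xx for these data, the major axis leans toward the x direction and the slope comes out below 1.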
Some special issues
• 1. Outliers / standardization
• 2. Centering
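1. Outliers / standardization
Example 1.1. IQ scores of young people are normally distributed; those above the 99% quantile can be defined as intellectually gifted (outliers).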
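A minimal sketch of this rule in R: standardize the score and compare it with the 99% quantile of the standard normal distribution. The mean of 100 and standard deviation of 15 are assumed here as the conventional IQ scaling; they are not stated in the slides, and `is_outlier` is a hypothetical helper name.

```r
mu    <- 100   # assumed population mean (conventional IQ scaling)
sigma <- 15    # assumed population SD (conventional IQ scaling)

z_cut  <- qnorm(0.99)                          # cutoff on the standardized scale
cutoff <- qnorm(0.99, mean = mu, sd = sigma)   # same cutoff on the IQ scale

# TRUE when the standardized score exceeds the 99% quantile
is_outlier <- function(iq) (iq - mu) / sigma > z_cut

cutoff                        # about 134.9
is_outlier(c(100, 130, 140))  # FALSE FALSE TRUE
```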
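Example 1.2. Body mass index. An inappropriate definition of obesity: anyone whose weight exceeds the 95% quantile of the population is obese. People of different height, sex, and age are not comparable; that is, μ is a function of several factors. A simple but tedious approach is stratification: for each given group, find the distribution of W and define those above a cutoff C (for example, the 95% quantile of the standard normal distribution) as obese.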
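The difference between a pooled cutoff and a stratified one can be illustrated with simulated data. Everything in this sketch is an assumption for illustration (two groups, their means, and a common SD); it standardizes weight within each group and applies the standard normal 95% quantile as the cutoff C.

```r
# Simulated weights in kg for two strata (assumed parameters, illustration only)
set.seed(1)
g <- rep(c("F", "M"), each = 500)
w <- rnorm(1000, mean = ifelse(g == "F", 60, 75), sd = 10)

# Pooled rule: one 95% cutoff for everyone
pooled <- w > quantile(w, 0.95)

# Stratified rule: standardize within each group, then compare with C
z <- ave(w, g, FUN = function(v) (v - mean(v)) / sd(v))
stratified <- z > qnorm(0.95)

# Share flagged in each group under the two rules
rbind(pooled = tapply(pooled, g, mean),
      stratified = tapply(stratified, g, mean))
```

Under the pooled rule the heavier group supplies most of the flagged cases, while the stratified rule flags roughly the same share in each group.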
Another approach is to remove the effect of log(H) on log(W) (while also controlling for sex G and age); that is, assume a regression model for log(W) with log(H), G, and age as predictors.
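One possible form of such a model, assumed here because the slide's own formula is not preserved, is log(W) = β0 + β1·log(H) + β2·G + β3·Age + ε. The sketch below fits it to simulated data (all coefficients and distributions are assumptions) and flags people whose standardized residual exceeds the standard normal 95% quantile.

```r
# Simulated data; every parameter below is an assumption for illustration
set.seed(7)
n    <- 2000
G    <- factor(sample(c("F", "M"), n, replace = TRUE))
age  <- runif(n, 18, 65)
logH <- log(ifelse(G == "M", 175, 162)) + rnorm(n, sd = 0.04)   # height in cm
logW <- -6 + 2 * logH + 0.05 * (G == "M") + 0.002 * age + rnorm(n, sd = 0.15)

# Regress log weight on log height, sex, and age
fit <- lm(logW ~ logH + G + age)

# Standardized residuals: weight relative to what height, sex, and age predict
z <- rstandard(fit)
obese <- z > qnorm(0.95)

mean(obese)              # overall share flagged, close to 0.05
tapply(obese, G, mean)   # share flagged within each sex
```

The residuals are uncorrelated with log(H) by construction, so the flag no longer depends on how tall a person is.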