
  1. Multiple Linear Regression AMS 572 Group #2

  2. Outline • Jinmiao Fu—Introduction and History • Ning Ma—Establishment and Fitting of the Model • Ruoyu Zhou—Multiple Regression Model in Matrix Notation • Dawei Xu and Yuan Shang—Statistical Inference for Multiple Regression • Yu Mu—Regression Diagnostics • Chen Wang and Tianyu Lu—Topics in Regression Modeling • Tian Feng—Variable Selection Methods • Hua Mo—Chapter Summary and Modern Application

  3. Introduction • Multiple linear regression attempts to model the relationship between two or more explanatory variables and a response variable by fitting a linear equation to observed data. Every value of each independent variable x is associated with a value of the dependent variable y.

  4. Example: the relationship between an adult’s health and his/her daily intake of wheat, vegetables and meat.

  5. History

  6. Karl Pearson (1857–1936) Lawyer, Germanist, eugenicist, mathematician and statistician. Contributions: the correlation coefficient; the method of moments; Pearson's system of continuous curves; chi distance and the p-value; statistical hypothesis testing theory and statistical decision theory; Pearson's chi-square test; principal component analysis.

  7. Sir Francis Galton FRS (16 February 1822 – 17 January 1911) Anthropologist and polymath; doctoral student: Karl Pearson. In the late 1860s, Galton conceived the standard deviation. He created the statistical concept of correlation and also discovered the properties of the bivariate normal distribution and its relationship to regression analysis.

  8. Galton invented the use of the regression line (Bulmer 2003, p. 184), and was the first to describe and explain the common phenomenon of regression toward the mean, which he first observed in his experiments on the size of the seeds of successive generations of sweet peas.

  9. The publication by his cousin Charles Darwin of The Origin of Species in 1859 was an event that changed Galton's life. He came to be gripped by the work, especially the first chapter on "Variation under Domestication" concerning the breeding of domestic animals.

  10. Adrien-Marie Legendre (18 September 1752 – 10 January 1833) was a French mathematician. He made important contributions to statistics, number theory, abstract algebra and mathematical analysis. He developed the least squares method, which has broad application in linear regression, signal processing, statistics, and curve fitting.

  11. Johann Carl Friedrich Gauss (30 April 1777 – 23 February 1855) was a German mathematician and scientist who contributed significantly to many fields, including number theory, statistics, analysis, differential geometry, geodesy, geophysics, electrostatics, astronomy and optics.

  12. Gauss, who was 23 at the time, heard about the problem and tackled it. After three months of intense work, he predicted a position for Ceres in December 1801—just about a year after its first sighting—and this turned out to be accurate within a half-degree. In the process, he so streamlined the cumbersome mathematics of 18th century orbital prediction that his work—published a few years later as Theory of Celestial Movement—remains a cornerstone of astronomical computation.

  13. It introduced the Gaussian gravitational constant, and contained an influential treatment of the method of least squares, a procedure used in all sciences to this day to minimize the impact of measurement error. Gauss was able to prove the method in 1809 under the assumption of normally distributed errors (see Gauss–Markov theorem; see also Gaussian). The method had been described earlier by Adrien-Marie Legendre in 1805, but Gauss claimed that he had been using it since 1795.

  14. Sir Ronald Aylmer Fisher FRS (17 February 1890 – 29 July 1962) was an English statistician, evolutionary biologist, eugenicist and geneticist. He was described by Anders Hald as "a genius who almost single-handedly created the foundations for modern statistical science," and Richard Dawkins described him as "the greatest of Darwin's successors".

  15. In addition to "analysis of variance", Fisher invented the technique of maximum likelihood and originated the concepts of sufficiency, ancillarity, Fisher's linear discriminant and Fisher information.

  16. Establishment and Fitting of the Model

  17. Probabilistic Model: the observed value yᵢ of the random variable (r.v.) Yᵢ depends on the fixed predictor values xᵢ₁, xᵢ₂, …, xᵢₖ: Yᵢ = β₀ + β₁xᵢ₁ + β₂xᵢ₂ + … + βₖxᵢₖ + εᵢ, i = 1, 2, …, n, where β₀, β₁, …, βₖ are unknown model parameters, the εᵢ are i.i.d. ~ N(0, σ²), and n is the number of observations.

  18. Fitting the Model • The LS method provides estimates β̂₀, β̂₁, …, β̂ₖ of the unknown model parameters, which minimize Q = Σᵢ [yᵢ − (β₀ + β₁xᵢ₁ + … + βₖxᵢₖ)]²; they solve the normal equations ∂Q/∂βⱼ = 0 (j = 0, 1, …, k).

  19. Tire tread wear vs. mileage (example11.1 in textbook) The table gives the measurements on the groove of one tire after every 4000 miles. Our Goal: to build a model to find the relation between the mileage and groove depth of the tire.

  20. SAS code----fitting the model

  data example;
    input mile depth @@;
    sqmile = mile*mile;
  datalines;
  0 394.33 4 329.5 8 291 12 255.17 16 229.33
  20 204.83 24 179 28 163.83 32 150.33
  ;
  run;

  proc reg data=example;
    model depth = mile sqmile;
  run;

  21. The fitted model: Depth = 386.26 - 12.77 mile + 0.172 sqmile
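The same fit can be reproduced outside SAS; here is a minimal sketch in Python with NumPy (not part of the original slides), fitting the quadratic model to the Example 11.1 data:

```python
import numpy as np

# Tire tread wear data (Example 11.1): groove depth (mils) after every 4000 miles
mile = np.array([0, 4, 8, 12, 16, 20, 24, 28, 32], dtype=float)
depth = np.array([394.33, 329.50, 291.00, 255.17, 229.33,
                  204.83, 179.00, 163.83, 150.33])

# Design matrix with columns 1, mile, mile^2 (same model as the PROC REG call)
X = np.column_stack([np.ones_like(mile), mile, mile ** 2])

# Least squares fit: minimizes the sum of squared residuals
beta, _, _, _ = np.linalg.lstsq(X, depth, rcond=None)
print(beta)  # close to [386.26, -12.77, 0.172]
```

The coefficients agree with the SAS output above.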

  22. Goodness of Fit of the Model • Residuals: eᵢ = yᵢ − ŷᵢ, where the ŷᵢ = β̂₀ + β̂₁xᵢ₁ + … + β̂ₖxᵢₖ are the fitted values. • An overall measure of the goodness of fit is the error sum of squares: SSE = Σᵢ eᵢ² = Σᵢ (yᵢ − ŷᵢ)². • Total sum of squares: SST = Σᵢ (yᵢ − ȳ)². • Regression sum of squares: SSR = SST − SSE. • The coefficient of determination r² = SSR/SST = 1 − SSE/SST measures the proportion of variation in y explained by the regression.
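These sums of squares can be computed directly from the fitted model; a small illustration in Python/NumPy on the Example 11.1 data (not part of the original slides):

```python
import numpy as np

# Example 11.1 data: groove depth vs. mileage
mile = np.array([0, 4, 8, 12, 16, 20, 24, 28, 32], dtype=float)
depth = np.array([394.33, 329.50, 291.00, 255.17, 229.33,
                  204.83, 179.00, 163.83, 150.33])
X = np.column_stack([np.ones_like(mile), mile, mile ** 2])
beta, _, _, _ = np.linalg.lstsq(X, depth, rcond=None)

fitted = X @ beta                           # fitted values y-hat_i
resid = depth - fitted                      # residuals e_i = y_i - y-hat_i
SSE = np.sum(resid ** 2)                    # error sum of squares
SST = np.sum((depth - depth.mean()) ** 2)   # total sum of squares
SSR = SST - SSE                             # regression sum of squares
r2 = SSR / SST                              # coefficient of determination
```

For this data r² is very close to 1, reflecting the nearly quadratic tread-wear pattern.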

  23. Multiple Regression Model In Matrix Notation

  24. 1. Transform the Formulas to Matrix Notation

  25. The first column of X corresponds to the constant (intercept) term; we can treat it as a predictor xᵢ₀ with xᵢ₀ ≡ 1 for all i.

  26. Finally, let β = (β₀, β₁, …, βₖ)ᵀ and β̂ = (β̂₀, β̂₁, …, β̂ₖ)ᵀ be the (k+1)×1 vectors of the unknown parameters and their LS estimates, respectively.

  27. The model formula becomes y = Xβ + ε. • Simultaneously, the normal equations become XᵀXβ̂ = Xᵀy. Solving this equation with respect to β̂, we get β̂ = (XᵀX)⁻¹Xᵀy (if the inverse of the matrix XᵀX exists).

  28. 2. Example 11.2 (Tire Wear Data: Quadratic Fit Using Hand Calculations) • We will do Example 11.1 again in this part using the matrix approach. • For the quadratic model to be fitted, y is the 9×1 vector of groove depths and X is the 9×3 matrix whose ith row is (1, mileᵢ, mileᵢ²).

  29. According to the formula β̂ = (XᵀX)⁻¹Xᵀy, we need to calculate XᵀX first, then invert it to get (XᵀX)⁻¹.

  30. Finally, we calculate the vector of LS estimates β̂ = (XᵀX)⁻¹Xᵀy.

  31. Therefore, the LS quadratic model is Depth = 386.26 - 12.77 mile + 0.172 mile². This model is the same as the one we obtained in Example 11.1.
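The hand calculation above can be mirrored with explicit matrix operations; a sketch in Python/NumPy (not part of the original slides, data as in Example 11.1):

```python
import numpy as np

mile = np.array([0, 4, 8, 12, 16, 20, 24, 28, 32], dtype=float)
depth = np.array([394.33, 329.50, 291.00, 255.17, 229.33,
                  204.83, 179.00, 163.83, 150.33])

# X is the 9x3 design matrix with rows (1, mile_i, mile_i^2)
X = np.column_stack([np.ones_like(mile), mile, mile ** 2])

XtX = X.T @ X                       # X'X
XtX_inv = np.linalg.inv(XtX)        # (X'X)^{-1}
beta_hat = XtX_inv @ X.T @ depth    # LS estimates (X'X)^{-1} X'y
```

The result matches the coefficients obtained in Example 11.1.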

  32. Statistical Inference for Multiple Regression

  33. Statistical Inference for Multiple Regression • Determine which predictor variables have statistically significant effects. • We test the hypotheses H₀ⱼ: βⱼ = 0 vs. H₁ⱼ: βⱼ ≠ 0 (j = 1, …, k). • If we cannot reject H₀ⱼ, then xⱼ is not a significant predictor of y.

  34. Statistical Inference on βⱼ • Review statistical inference for simple linear regression.

  35. Statistical Inference on βⱼ • What about multiple regression? • The steps are similar.

  36. Statistical Inference on βⱼ • What is Vⱼⱼ, and why is β̂ⱼ ~ N(βⱼ, σ²Vⱼⱼ)? • 1. Mean: recall from simple linear regression that the least squares estimators of the regression parameters are unbiased. Here the vector β̂ of least squares estimators is also unbiased: E(β̂) = β.

  37. Statistical Inference on βⱼ • 2. Variance • Under the constant variance assumption Var(εᵢ) = σ², the covariance matrix of β̂ is Cov(β̂) = σ²(XᵀX)⁻¹.

  38. Statistical Inference on βⱼ • Let Vⱼⱼ be the jth diagonal entry of the matrix V = (XᵀX)⁻¹, so that Var(β̂ⱼ) = σ²Vⱼⱼ.

  39. Statistical Inference on βⱼ • Since β̂ⱼ ~ N(βⱼ, σ²Vⱼⱼ), we have (β̂ⱼ − βⱼ)/(σ√Vⱼⱼ) ~ N(0, 1).

  40. Statistical Inference on βⱼ • σ² is unknown; estimate it by s² = SSE/[n − (k+1)] = MSE, where [n − (k+1)]s²/σ² follows a chi-square distribution with n − (k+1) d.f., independently of β̂ⱼ.

  41. Statistical Inference on βⱼ • Therefore, t = (β̂ⱼ − βⱼ)/(s√Vⱼⱼ) follows a t-distribution with n − (k+1) d.f., where SE(β̂ⱼ) = s√Vⱼⱼ.

  42. Statistical Inference on βⱼ • Derivation of the confidence interval for βⱼ: the 100(1 − α)% confidence interval for βⱼ is β̂ⱼ ± t_{n-(k+1), α/2} SE(β̂ⱼ).

  43. Statistical Inference on βⱼ • Reject H₀ⱼ if |tⱼ| > t_{n-(k+1), α/2}, where tⱼ = β̂ⱼ/SE(β̂ⱼ).
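The t-statistics and confidence intervals can be computed step by step for the tire data; a sketch in Python/NumPy (not part of the original slides; the critical value t_{6, 0.025} ≈ 2.447 is hard-coded from a t-table since n − (k+1) = 6 here):

```python
import numpy as np

mile = np.array([0, 4, 8, 12, 16, 20, 24, 28, 32], dtype=float)
depth = np.array([394.33, 329.50, 291.00, 255.17, 229.33,
                  204.83, 179.00, 163.83, 150.33])
X = np.column_stack([np.ones_like(mile), mile, mile ** 2])

n, p = X.shape                          # p = k + 1 = 3, so n - p = 6 d.f.
V = np.linalg.inv(X.T @ X)              # V, with V_jj on the diagonal
beta = V @ X.T @ depth                  # LS estimates
resid = depth - X @ beta
s2 = resid @ resid / (n - p)            # s^2 = SSE / [n - (k+1)] = MSE
se = np.sqrt(s2 * np.diag(V))           # SE(beta_j) = s * sqrt(V_jj)
t_stats = beta / se                     # t_j for testing H_0j: beta_j = 0

t_crit = 2.447                          # t_{6, 0.025} from a t-table
ci = np.column_stack([beta - t_crit * se, beta + t_crit * se])
```

For this data both the linear and quadratic coefficients have |t| well above 2.447, so each is significant at α = 0.05.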

  44. Prediction of Future Observations • Having fitted a multiple regression model, suppose we wish to predict the future value Y* for a specified vector of predictor values x* = (x₀*, x₁*, …, xₖ*)ᵀ. • One way is to estimate E(Y*) by a confidence interval (CI).

  45. Prediction of Future Observations • The point prediction is Ŷ* = x*ᵀβ̂ = β̂₀ + β̂₁x₁* + … + β̂ₖxₖ*. • The 100(1 − α)% CI for E(Y*) is Ŷ* ± t_{n-(k+1), α/2} s √(x*ᵀ(XᵀX)⁻¹x*). • The 100(1 − α)% prediction interval (PI) for Y* itself is Ŷ* ± t_{n-(k+1), α/2} s √(1 + x*ᵀ(XᵀX)⁻¹x*).
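Both intervals can be computed for the tire data at a chosen x*; a sketch in Python/NumPy (not part of the original slides; predicting at mile = 16 is an illustrative choice, and t_{6, 0.025} ≈ 2.447 is taken from a t-table):

```python
import numpy as np

mile = np.array([0, 4, 8, 12, 16, 20, 24, 28, 32], dtype=float)
depth = np.array([394.33, 329.50, 291.00, 255.17, 229.33,
                  204.83, 179.00, 163.83, 150.33])
X = np.column_stack([np.ones_like(mile), mile, mile ** 2])

n, p = X.shape
V = np.linalg.inv(X.T @ X)
beta = V @ X.T @ depth
resid = depth - X @ beta
s = np.sqrt(resid @ resid / (n - p))        # s estimates sigma

x_star = np.array([1.0, 16.0, 16.0 ** 2])   # predict at mile = 16
y_star = x_star @ beta                      # point prediction Y-hat*
h = x_star @ V @ x_star                     # x*' (X'X)^{-1} x*
t_crit = 2.447                              # t_{6, 0.025}
ci_half = t_crit * s * np.sqrt(h)           # half-width of CI for E(Y*)
pi_half = t_crit * s * np.sqrt(1 + h)       # half-width of PI for Y*
```

The PI is always wider than the CI, since it must also cover the new observation's own random error.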

  46. F-Test for Overall Significance • Consider H₀: β₁ = β₂ = … = βₖ = 0 vs. H₁: βⱼ ≠ 0 for at least one j. • Here H₀ is the overall null hypothesis, which states that none of the predictor variables are related to y. The alternative states that at least one is related.

  47. How to Build an F-Test • The test statistic F = MSR/MSE follows the F-distribution with k and n − (k+1) d.f. under H₀. The α-level test rejects H₀ if F > f_{k, n-(k+1), α}. • Recall that MSR = SSR/k is the regression mean square and MSE = SSE/[n − (k+1)] is the error mean square, with n − (k+1) degrees of freedom.
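The F statistic for the tire model can be assembled from these pieces; a sketch in Python/NumPy (not part of the original slides; the critical value f_{2, 6, 0.05} ≈ 5.14 is hard-coded from an F-table):

```python
import numpy as np

mile = np.array([0, 4, 8, 12, 16, 20, 24, 28, 32], dtype=float)
depth = np.array([394.33, 329.50, 291.00, 255.17, 229.33,
                  204.83, 179.00, 163.83, 150.33])
X = np.column_stack([np.ones_like(mile), mile, mile ** 2])

n, p = X.shape
k = p - 1                                   # number of predictors
beta = np.linalg.inv(X.T @ X) @ X.T @ depth
resid = depth - X @ beta
SSE = resid @ resid
SST = np.sum((depth - depth.mean()) ** 2)
SSR = SST - SSE

MSR = SSR / k                  # regression mean square
MSE = SSE / (n - (k + 1))      # error mean square
F = MSR / MSE                  # overall F statistic
f_crit = 5.14                  # f_{2, 6, 0.05} from an F-table
```

Here F far exceeds the critical value, so the overall regression is highly significant.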

  48. The Relation Between F and r² • F can be written as a function of r². Using r² = SSR/SST, F = (SSR/k)/[SSE/(n − (k+1))] = (r²/k)/[(1 − r²)/(n − (k+1))]. • We see that F is an increasing function of r² and tests the significance of r².
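This identity can be checked numerically on the tire data; a sketch in Python/NumPy (not part of the original slides):

```python
import numpy as np

mile = np.array([0, 4, 8, 12, 16, 20, 24, 28, 32], dtype=float)
depth = np.array([394.33, 329.50, 291.00, 255.17, 229.33,
                  204.83, 179.00, 163.83, 150.33])
X = np.column_stack([np.ones_like(mile), mile, mile ** 2])

n, p = X.shape
k = p - 1
beta = np.linalg.inv(X.T @ X) @ X.T @ depth
resid = depth - X @ beta
SSE = resid @ resid
SST = np.sum((depth - depth.mean()) ** 2)
r2 = 1 - SSE / SST

# F computed two ways: directly as MSR/MSE, and through r^2
F_direct = ((SST - SSE) / k) / (SSE / (n - (k + 1)))
F_from_r2 = (r2 / k) / ((1 - r2) / (n - (k + 1)))
```

The two expressions agree, confirming that the F-test is equivalent to a test of the significance of r².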
