Chapter 13

Simple Linear Regression & Correlation: Inferential Methods

© 2008 Brooks/Cole, a division of Thomson Learning, Inc.


Deterministic Models

  • Consider the two variables x and y. A deterministic relationship is one in which the value of y (the dependent variable) is described by some formula or mathematical function of x, such as y = f(x), y = 3 + 2x, or y = 5e^(-2x), where x is the independent variable.

© 2008 Brooks/Cole, a division of Thomson Learning, Inc.


Probabilistic Models

  • A description of the relation between two variables x and y that are not deterministically related can be given by specifying a probabilistic model.

  • The general form of an additive probabilistic model allows y to be larger or smaller than f(x) by a random amount, e.

  • The model equation is of the form

    y = (deterministic function of x) + (random deviation) = f(x) + e

© 2008 Brooks/Cole, a division of Thomson Learning, Inc.


Probabilistic Models

[Figure: deviations from the deterministic part of a probabilistic model; a point lying below the curve f(x) corresponds to a negative deviation such as e = -1.5.]

© 2008 Brooks/Cole, a division of Thomson Learning, Inc.


Simple Linear Regression Model

The simple linear regression model assumes that there is a line with vertical (y) intercept α and slope β, called the true or population regression line. When a value of the independent variable x is fixed and an observation on the dependent variable y is made,

y = α + βx + e

Without the random deviation e, all observed (x, y) points would fall exactly on the population regression line. The inclusion of e in the model equation allows points to deviate from the line by random amounts.
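To make the roles of α, β, σ, and e concrete, here is a minimal simulation sketch in Python/NumPy. The parameter values and the number of observations are illustrative assumptions, not values from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative population parameters (assumed values, not from the text)
alpha, beta, sigma = 3.0, 0.5, 5.0      # y intercept, slope, sd of the random deviation e

n = 18
x = rng.uniform(20, 60, size=n)         # fixed values of the independent variable
e = rng.normal(0.0, sigma, size=n)      # random deviations: mean 0, sd sigma, independent
y = alpha + beta * x + e                # the model equation  y = alpha + beta*x + e
```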

© 2008 Brooks/Cole, a division of Thomson Learning, Inc.


Simple Linear Regression Model

[Figure: the population regression line with vertical intercept α and slope β; the observations made when x = x1 and x = x2 deviate from the line by the random amounts e1 and e2 (points above the line correspond to positive deviations, points below to negative deviations).]

© 2008 Brooks/Cole, a division of Thomson Learning, Inc.


Basic Assumptions of the Simple Linear Regression Model

  • The distribution of e at any particular x value has mean value 0 (µe = 0).

  • The standard deviation of e (which describes the spread of its distribution) is the same for any particular value of x. This standard deviation is denoted by σ.

  • The distribution of e at any particular x value is normal.

  • The random deviations e1, e2, …, en associated with different observations are independent of one another.

© 2008 Brooks/Cole, a division of Thomson Learning, Inc.


More About the Simple Linear Regression Model

For any fixed x value, y itself has a normal distribution, with

(mean y value for fixed x) = α + βx

and

(standard deviation of y for fixed x) = σ.

© 2008 Brooks/Cole, a division of Thomson Learning, Inc.


Interpretation of Terms

The slope β of the population regression line is the mean (average) change in y associated with a 1-unit increase in x.

The vertical intercept α is the height of the population line when x = 0.

The size of σ determines the extent to which the (x, y) observations deviate from the population line: when σ is small the points cluster tightly about the line, and when σ is large they are widely scattered.

[Figure: scatterplots illustrating a small σ versus a large σ.]

© 2008 Brooks/Cole, a division of Thomson Learning, Inc.


Illustration of Assumptions

© 2008 Brooks/Cole, a division of Thomson Learning, Inc.


Estimates for the Regression Line

The point estimates of β, the slope, and α, the y intercept of the population regression line, are the slope and y intercept, respectively, of the least squares line. That is,

b = Σ(x - x̄)(y - ȳ) / Σ(x - x̄)²   and   a = ȳ - b x̄
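As a quick check on these formulas, here is a small sketch (Python/NumPy) that computes the least squares slope and intercept directly from the definitions; the x and y arrays are made-up illustrative values, not data from the chapter.

```python
import numpy as np

# Made-up illustrative data; substitute any paired observations
x = np.array([23.0, 27.0, 39.0, 45.0, 53.0, 58.0])
y = np.array([ 9.5, 17.8, 22.1, 27.5, 31.0, 33.0])

x_bar, y_bar = x.mean(), y.mean()
b = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)   # least squares slope
a = y_bar - b * x_bar                                              # least squares y intercept

print(f"least squares line: y-hat = {a:.3f} + {b:.3f} x")
```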

© 2008 Brooks/Cole, a division of Thomson Learning, Inc.


Interpretation of y = a + bx

  • Let x* denote a specific value of the predictor variable x. Then a + bx* has two interpretations:

    • a + bx* is a point estimate of the mean y value when x = x*.

    • a + bx* is a point prediction of an individual y value to be observed when x = x*.

© 2008 Brooks/Cole, a division of Thomson Learning, Inc.


Example

The following data was collected in a study of age and fatness in humans.

One of the questions was, “What is the relationship between age and fatness?”

* Mazess, R.B., Peppler, W.W., and Gibbons, M. (1984) Total body composition by dual-photon (153Gd) absorptiometry. American Journal of Clinical Nutrition, 40, 834-839

© 2008 Brooks/Cole, a division of Thomson Learning, Inc.




Example

A point estimate for the %Fat for a human who is 45 years old is

ŷ = 3.22 + 0.548(45) = 27.88

If 45 is substituted for x in the equation, the result can be interpreted either as an estimate of the %Fat for a particular 45-year-old human or as an estimate of the average %Fat for all 45-year-old humans.

The two interpretations are quite different.

© 2008 Brooks/Cole, a division of Thomson Learning, Inc.


Example

A plot of the data points, along with the least squares regression line, was created with Minitab.

[Figure: Minitab scatterplot of age vs. %Fat with the fitted least squares line.]

© 2008 Brooks/Cole, a division of Thomson Learning, Inc.


Terminology

© 2008 Brooks/Cole, a division of Thomson Learning, Inc.


Definition Formulae

The total sum of squares, denoted by SSTo, is defined as

SSTo = Σ(y - ȳ)²

The residual sum of squares, denoted by SSResid, is defined as

SSResid = Σ(y - ŷ)²
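These definitions translate directly into code. The sketch below (Python/NumPy, with a hypothetical helper name) computes both sums of squares from the fitted least squares line.

```python
import numpy as np

def sums_of_squares(x, y):
    """Return (SSTo, SSResid) for the least squares line fit to paired data x, y."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    a = y.mean() - b * x.mean()
    y_hat = a + b * x                          # fitted (predicted) values
    ss_to = np.sum((y - y.mean()) ** 2)        # total sum of squares
    ss_resid = np.sum((y - y_hat) ** 2)        # residual sum of squares
    return ss_to, ss_resid
```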

© 2008 Brooks/Cole, a division of Thomson Learning, Inc.


Calculation Formulae Recalled

SSTo and SSResid are generally found as part of the standard output from most statistical packages, or they can be obtained using the following computational formulas:

SSTo = Σy² - (Σy)²/n   and   SSResid = Σy² - aΣy - bΣxy

© 2008 Brooks/Cole, a division of Thomson Learning, Inc.


Coefficient of Determination

  • The coefficient of determination, denoted by r², gives the proportion of variation in y that can be attributed to an approximate linear relationship between x and y:

    r² = 1 - SSResid/SSTo

© 2008 Brooks/Cole, a division of Thomson Learning, Inc.


Estimated Standard Deviation, s_e

The statistic for estimating the variance σ² is

s_e² = SSResid / (n - 2)

where

SSResid = Σ(y - ŷ)²

© 2008 Brooks/Cole, a division of Thomson Learning, Inc.


Estimated Standard Deviation, s_e

The estimate of σ is the estimated standard deviation

s_e = √( SSResid / (n - 2) )

The number of degrees of freedom associated with estimating σ² or σ in simple linear regression is n - 2.
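A self-contained sketch (Python/NumPy, hypothetical function name) that computes both s_e and r² for any paired data:

```python
import numpy as np

def se_and_r_squared(x, y):
    """Estimated standard deviation s_e and coefficient of determination r^2."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n = len(y)
    b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    a = y.mean() - b * x.mean()
    ss_resid = np.sum((y - (a + b * x)) ** 2)
    ss_to = np.sum((y - y.mean()) ** 2)
    s_e = np.sqrt(ss_resid / (n - 2))          # df = n - 2
    r_squared = 1 - ss_resid / ss_to
    return s_e, r_squared
```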

© 2008 Brooks/Cole, a division of Thomson Learning, Inc.




Example continued

With r² = 0.627 (62.7%), we can say that 62.7% of the observed variation in %Fat can be attributed to the probabilistic linear relationship with human age.

The magnitude of a typical sample deviation from the least squares line is about 5.75 (%Fat), which is reasonably large compared to the y values themselves.

This suggests that the model is only useful in the sense of providing gross "ballpark" estimates of %Fat for humans based on age.

© 2008 Brooks/Cole, a division of Thomson Learning, Inc.


Properties of the Sampling Distribution of b

When the four basic assumptions of the simple linear regression model are satisfied, the following conditions are met:

  • The mean value of b is β. Specifically,

    • μb = β, and hence b is an unbiased statistic for estimating β

  • The statistic b has a normal distribution (a consequence of the error e being normally distributed)

© 2008 Brooks/Cole, a division of Thomson Learning, Inc.


Estimated Standard Deviation of b

The estimated standard deviation of the statistic b is

s_b = s_e / √( Σ(x - x̄)² )

When the four basic assumptions of the simple linear regression model are satisfied, the probability distribution of the standardized variable

t = (b - β) / s_b

is the t distribution with df = n - 2.

© 2008 Brooks/Cole, a division of Thomson Learning, Inc.


Confidence Interval for β

When the four basic assumptions of the simple linear regression model are satisfied, a confidence interval for β, the slope of the population regression line, has the form

b ± (t critical value) · s_b

where the t critical value is based on df = n - 2.
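A sketch of this interval in Python/SciPy (hypothetical function name; works for any paired data arrays):

```python
import numpy as np
from scipy import stats

def slope_confidence_interval(x, y, conf_level=0.95):
    """Confidence interval for the population slope beta:  b +/- (t critical value) * s_b."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n = len(y)
    s_xx = np.sum((x - x.mean()) ** 2)
    b = np.sum((x - x.mean()) * (y - y.mean())) / s_xx
    a = y.mean() - b * x.mean()
    s_e = np.sqrt(np.sum((y - (a + b * x)) ** 2) / (n - 2))
    s_b = s_e / np.sqrt(s_xx)                                  # estimated sd of b
    t_crit = stats.t.ppf(1 - (1 - conf_level) / 2, df=n - 2)   # t critical value, df = n - 2
    return b - t_crit * s_b, b + t_crit * s_b
```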

© 2008 Brooks/Cole, a division of Thomson Learning, Inc.


Example continued

Recall from the Minitab output that b = 0.548, s_b = 0.1056, and df = n - 2 = 16, so the t critical value is 2.12.

A 95% confidence interval estimate for β is

b ± (t critical value) · s_b = 0.548 ± (2.12)(0.1056) = (0.324, 0.772)

© 2008 Brooks/Cole, a division of Thomson Learning, Inc.


Example continued

The resulting 95% confidence interval estimate for β is (0.324, 0.772).

Based on sample data, we are 95% confident that the true mean increase in %Fat associated with a year of age is between 0.324% and 0.772%.

© 2008 Brooks/Cole, a division of Thomson Learning, Inc.


Example continued

The Minitab output (which identifies the regression line, the estimated slope b, the estimated y intercept a, the residual df = n - 2, SSResid, and SSTo) looks like:

Regression Analysis: % Fat y versus Age (x)

The regression equation is
% Fat y = 3.22 + 0.548 Age (x)

Predictor     Coef   SE Coef      T      P
Constant     3.221     5.076   0.63  0.535
Age (x)     0.5480    0.1056   5.19  0.000

S = 5.754   R-Sq = 62.7%   R-Sq(adj) = 60.4%

Analysis of Variance

Source          DF       SS      MS      F      P
Regression       1   891.87  891.87  26.94  0.000
Residual Error  16   529.66   33.10
Total           17  1421.54
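For readers without Minitab, a comparable summary table can be produced with Python's statsmodels package. This is only a sketch: the age and %Fat arrays below are placeholders, since the study data themselves are not reproduced in this transcript.

```python
import numpy as np
import statsmodels.api as sm

# Illustrative placeholder data -- not the 18 observations from the Mazess et al. study
age = np.array([23, 27, 39, 45, 49, 53, 58, 61], dtype=float)
fat = np.array([ 9, 18, 22, 27, 25, 32, 33, 35], dtype=float)

X = sm.add_constant(age)        # design matrix with an intercept column
fit = sm.OLS(fat, X).fit()      # ordinary least squares fit
print(fit.summary())            # coefficients with SEs, t ratios, P-values, R-squared, ANOVA
```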

© 2008 Brooks/Cole, a division of Thomson Learning, Inc.


Hypothesis Tests Concerning β

Null hypothesis: H0: β = hypothesized value

Test statistic: t = (b - hypothesized value) / s_b

© 2008 Brooks/Cole, a division of Thomson Learning, Inc.


Hypothesis Tests Concerning β

Alternate hypothesis and finding the P-value:

  • Ha: β > hypothesized value

    P-value = area under the t curve with n - 2 degrees of freedom to the right of the calculated t

  • Ha: β < hypothesized value

    P-value = area under the t curve with n - 2 degrees of freedom to the left of the calculated t

© 2008 Brooks/Cole, a division of Thomson Learning, Inc.


Hypothesis Tests Concerning β

  • Ha: β ≠ hypothesized value

    • If t is positive, P-value = 2 · (area under the t curve with n - 2 degrees of freedom to the right of the calculated t)

    • If t is negative, P-value = 2 · (area under the t curve with n - 2 degrees of freedom to the left of the calculated t)
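These three tail-area rules can be wrapped in one small helper; a sketch in Python/SciPy (hypothetical function name) follows.

```python
from scipy import stats

def slope_test_p_value(t, n, alternative="two-sided"):
    """P-value for a test about beta, given the calculated t ratio and the sample size n."""
    df = n - 2
    if alternative == "greater":            # Ha: beta > hypothesized value
        return stats.t.sf(t, df)            # area to the right of t
    if alternative == "less":               # Ha: beta < hypothesized value
        return stats.t.cdf(t, df)           # area to the left of t
    return 2 * stats.t.sf(abs(t), df)       # two-sided: double the tail area beyond |t|
```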

© 2008 Brooks/Cole, a division of Thomson Learning, Inc.


Hypothesis Tests Concerning β

Assumptions:

The distribution of e at any particular x value has mean value 0 (μe = 0)

The standard deviation of e is σ, which does not depend on x

The distribution of e at any particular x value is normal

The random deviations e1, e2, … , en associated with different observations are independent of one another

© 2008 Brooks/Cole, a division of Thomson Learning, Inc.


Hypothesis Tests Concerning β

Quite often the test is performed with the hypotheses

H0: β = 0 vs. Ha: β ≠ 0

In this case the test statistic simplifies to t = b / s_b and is called the t ratio.

This particular form of the test is called the model utility test for simple linear regression.

The null hypothesis specifies that there is no useful linear relationship between x and y, whereas the alternative hypothesis specifies that there is a useful linear relationship between x and y.
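The model utility test is common enough that SciPy computes it directly. The sketch below uses illustrative placeholder arrays and reports the t ratio and the two-sided P-value for H0: β = 0.

```python
import numpy as np
from scipy import stats

# Illustrative placeholder data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 6.8, 8.2, 8.8])

res = stats.linregress(x, y)            # least squares fit plus inference on the slope
t_ratio = res.slope / res.stderr        # t = b / s_b
print(f"t ratio = {t_ratio:.2f}, two-sided P-value for H0: beta = 0 is {res.pvalue:.4f}")
```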

© 2008 Brooks/Cole, a division of Thomson Learning, Inc.


Example

Consider the following data on percentage unemployment and suicide rates.

* Smith, D. (1977) Patterns in Human Geography, Canada: Douglas David and Charles Ltd., 158.

© 2008 Brooks/Cole, a division of Thomson Learning, Inc.


Example

The plot of the data points produced by Minitab follows

© 2008 Brooks/Cole, a division of Thomson Learning, Inc.


Example

Some basic summary statistics were computed for these data, and the calculations were continued to obtain the least squares line and the quantities needed for the model utility test.

© 2008 Brooks/Cole, a division of Thomson Learning, Inc.


Example - Model Utility Test

  • β = the true average change in suicide rate associated with an increase in the unemployment rate of 1 percentage point

  • H0: β = 0

  • Ha: β ≠ 0

  • A significance level α has not been preselected; we shall interpret the observed level of significance (P-value).

© 2008 Brooks/Cole, a division of Thomson Learning, Inc.


Example - Model Utility Test

  • Assumptions: The following plot (Minitab) of the data shows a linear pattern and the variability of points does not appear to be changing with x. Assuming that the distribution of errors (residuals) at any given x value is approximately normal, the assumptions of the simple linear regression model are appropriate.

© 2008 Brooks/Cole, a division of Thomson Learning, Inc.


Example - Model Utility Test

  • Test statistic: t = b / s_b = 59.05 / 14.24 = 4.15, with df = n - 2 = 9.

  • P-value: The table of tail areas for t distributions only has t values up to 4, so we can see that the corresponding tail area is < 0.002. Since this is a two-tailed test, the P-value < 0.004. (The actual calculation gives a P-value = 0.002.)

© 2008 Brooks/Cole, a division of Thomson Learning, Inc.


Example - Model Utility Test

  • Conclusion:

  • Even though no specific significance level was chosen for the test, with the P-value being so small (< 0.004) one would generally reject the null hypothesis that β = 0 and conclude that there is a useful linear relationship between the % unemployed and the suicide rate.

© 2008 Brooks/Cole, a division of Thomson Learning, Inc.


Example - Minitab Output

The Minitab output below gives the T value and P-value for the model utility test (H0: β = 0 vs. Ha: β ≠ 0).

Regression Analysis: Suicide Rate (y) versus Percentage Unemployed (x)

The regression equation is
Suicide Rate (y) = -93.9 + 59.1 Percentage Unemployed (x)

Predictor     Coef   SE Coef      T      P
Constant    -93.86     51.25  -1.83  0.100
Percenta     59.05     14.24   4.15  0.002

S = 36.06   R-Sq = 65.7%   R-Sq(adj) = 61.8%

© 2008 Brooks/Cole, a division of Thomson Learning, Inc.


Example - Reality Check!

Although the model utility test indicates that the model is useful, we should be hesitant to use the model principally as an estimation tool.

Notice that s_e = 36.06, whereas the actual range of suicide rates is 235 - 71 = 164. This means the typical error in estimating the suicide rate would be approximately 22% of that range. With 9 of the 11 data points having suicide rates at or below 104, this would constitute a very large amount of error in the estimation.

The statistics are clear: we have established a strong positive linear relationship between the percentage unemployed and the suicide rate. It would just not be particularly meaningful or useful to provide actual numerical estimates of suicide rates.

© 2008 Brooks/Cole, a division of Thomson Learning, Inc.


Residual Analysis

  • The simple linear regression model equation is y = α + βx + e, where e represents the random deviation of an observed y value from the population regression line α + βx.

  • Key assumptions about e:

    • At any particular x value, the distribution of e is a normal distribution

    • At any particular x value, the standard deviation of e is σ, which is constant over all values of x

© 2008 Brooks/Cole, a division of Thomson Learning, Inc.


Residual Analysis

To check on these assumptions, one would examine the deviations e1, e2, ..., en.

Generally, the deviations are not known, so we check on the assumptions by looking at the residuals, which are the deviations from the estimated line, a + bx.

The residuals are given by

y1 - (a + bx1), y2 - (a + bx2), ..., yn - (a + bxn)

© 2008 Brooks/Cole, a division of Thomson Learning, Inc.


Standardized Residuals

Recall: A quantity is standardized by subtracting its mean value and then dividing by its true (or estimated) standard deviation. For the residuals, the true mean is zero (0) if the assumptions are true.

The estimated standard deviation of a residual depends on the x value. The estimated standard deviation of the ith residual is given by

s_e · √( 1 - 1/n - (x_i - x̄)² / Σ(x - x̄)² )

The ith standardized residual is the ith residual divided by this estimated standard deviation.
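A sketch of this calculation in Python/NumPy (hypothetical function name), following the formula above:

```python
import numpy as np

def standardized_residuals(x, y):
    """Residuals divided by their estimated standard deviations,
    s_e * sqrt(1 - 1/n - (x_i - x_bar)^2 / Sxx)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n, x_bar = len(y), x.mean()
    s_xx = np.sum((x - x_bar) ** 2)
    b = np.sum((x - x_bar) * (y - y.mean())) / s_xx
    a = y.mean() - b * x_bar
    resid = y - (a + b * x)
    s_e = np.sqrt(np.sum(resid ** 2) / (n - 2))
    sd_resid = s_e * np.sqrt(1 - 1/n - (x - x_bar) ** 2 / s_xx)
    return resid / sd_resid
```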

© 2008 Brooks/Cole, a division of Thomson Learning, Inc.


Standardized Residuals

As you can see from the formula for the estimated standard deviation, calculating the standardized residuals by hand is a bit of a computational nightmare.

Fortunately, most statistical software packages are set up to perform these calculations and do so quite proficiently.

© 2008 Brooks/Cole, a division of Thomson Learning, Inc.


Standardized Residuals - Example

Consider the data on percentage unemployment and suicide rates.

City            Percentage      Suicide       ŷ       Residual    Standardized
                Unemployed (x)  Rate (y)              (y - ŷ)     Residual
New York             3.0            72       83.31      -11.31       -0.34
Los Angeles          4.7           224      183.70       40.30        1.34
Chicago              3.0            82       83.31       -1.31       -0.04
Philadelphia         3.2            92       95.12       -3.12       -0.09
Detroit              3.8           104      130.55      -26.55       -0.78
Boston               2.5            71       53.78       17.22        0.55
San Francisco        4.8           235      189.61       45.39        1.56
Washington           2.7            81       65.59       15.41        0.48
Pittsburgh           4.4            86      165.99      -79.98       -2.50
St. Louis            3.1           102       89.21       12.79        0.38
Cleveland            3.5           104      112.84       -8.84       -0.26

Notice that the standardized residual for Pittsburgh is -2.50, somewhat large for this size data set.
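As a check, the sketch below (Python/NumPy, using the city data from the table above) recomputes the fitted values and standardized residuals; the printed values should match the table up to rounding.

```python
import numpy as np

cities = ["New York", "Los Angeles", "Chicago", "Philadelphia", "Detroit", "Boston",
          "San Francisco", "Washington", "Pittsburgh", "St. Louis", "Cleveland"]
x = np.array([3.0, 4.7, 3.0, 3.2, 3.8, 2.5, 4.8, 2.7, 4.4, 3.1, 3.5])        # % unemployed
y = np.array([72., 224., 82., 92., 104., 71., 235., 81., 86., 102., 104.])   # suicide rate

n, x_bar = len(x), x.mean()
s_xx = np.sum((x - x_bar) ** 2)
b = np.sum((x - x_bar) * (y - y.mean())) / s_xx
a = y.mean() - b * x_bar

resid = y - (a + b * x)                                            # residuals y - y-hat
s_e = np.sqrt(np.sum(resid ** 2) / (n - 2))
std_resid = resid / (s_e * np.sqrt(1 - 1/n - (x - x_bar) ** 2 / s_xx))

for city, r in zip(cities, std_resid):
    print(f"{city:15s} {r:6.2f}")
```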

© 2008 Brooks/Cole, a division of Thomson Learning, Inc.


Example

[Figure: scatterplot of % unemployed vs. suicide rate with the fitted line; the Pittsburgh point has an unusually large residual.]

© 2008 Brooks/Cole, a division of Thomson Learning, Inc.


Normal Plots

Notice that both of the normal plots look similar. If a software package is available to do the calculations and plots, it is preferable to look at the normal plot of the standardized residuals.

In both cases, the points look reasonably linear, with the possible exception of Pittsburgh, so the assumption that the errors are normally distributed seems to be supported by the sample data.

© 2008 Brooks/Cole, a division of Thomson Learning, Inc.


More Comments

The fact that Pittsburgh has a large standardized residual makes it worthwhile to look at that city carefully to make sure the figures were reported correctly. One might also look to see if there is some reason that Pittsburgh should be treated separately, because some other characteristic distinguishes it from all of the other cities.

Pittsburgh does have a large effect on the model.

© 2008 Brooks/Cole, a division of Thomson Learning, Inc.


Visual Interpretation of Standardized Residuals

This plot is an example of a satisfactory plot that indicates that the model assumptions are reasonable.

© 2008 Brooks/Cole, a division of Thomson Learning, Inc.


Visual Interpretation of Standardized Residuals

This plot suggests that a curvilinear regression model is needed.

© 2008 Brooks/Cole, a division of Thomson Learning, Inc.


Visual Interpretation of Standardized Residuals

This plot suggests a non-constant variance. The assumptions of the model are not correct.

© 2008 Brooks/Cole, a division of Thomson Learning, Inc.


Visual Interpretation of Standardized Residuals

This plot shows a data point with a large standardized residual.

© 2008 Brooks/Cole, a division of Thomson Learning, Inc.


Visual Interpretation of Standardized Residuals

This plot shows a potentially influential observation.

© 2008 Brooks/Cole, a division of Thomson Learning, Inc.


Example - % Unemployment vs. Suicide Rate

This plot of the residuals (errors) indicates some possible problems with this linear model; you can see a pattern to the points.

[Figure annotations: there is a generally decreasing pattern to the points; two points are quite influential since they are far away from the others in terms of % unemployed; one point has an unusually large residual and is clearly an influential point.]

© 2008 Brooks/Cole, a division of Thomson Learning, Inc.


Properties of the Sampling Distribution of a + bx for a Fixed x Value

  • Let x* denote a particular value of the independent variable x. When the four basic assumptions of the simple linear regression model are satisfied, the sampling distribution of the statistic a + bx* has the following properties:

    • The mean value of a + bx* is α + βx*, so a + bx* is an unbiased statistic for estimating the average y value when x = x*

© 2008 Brooks/Cole, a division of Thomson Learning, Inc.


Properties of the Sampling Distribution of a + bx for a Fixed x Value

  • The standard deviation of the statistic a + bx* is

    σ · √( 1/n + (x* - x̄)² / Σ(x - x̄)² )

  • The distribution of the statistic a + bx* is normal.

© 2008 Brooks/Cole, a division of Thomson Learning, Inc.


Additional Information about the Sampling Distribution of a + bx for a Fixed x Value

The estimated standard deviation of the statistic a + bx*, denoted by sa+bx*, is given by

sa+bx* = s_e · √( 1/n + (x* - x̄)² / Σ(x - x̄)² )

When the four basic assumptions of the simple linear regression model are satisfied, the probability distribution of the standardized variable

t = ( a + bx* - (α + βx*) ) / sa+bx*

is the t distribution with df = n - 2.

© 2008 Brooks/Cole, a division of Thomson Learning, Inc.


Confidence Interval for a Mean y Value

When the four basic assumptions of the simple linear regression model are met, a confidence interval for α + βx*, the average y value when x has the value x*, is

a + bx* ± (t critical value) · sa+bx*

where the t critical value is based on df = n - 2.

Many authors give the following equivalent form for the confidence interval:

a + bx* ± (t critical value) · s_e · √( 1/n + (x* - x̄)² / Σ(x - x̄)² )
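A sketch of this interval in Python/SciPy (hypothetical function name; any paired data arrays and a chosen x* value):

```python
import numpy as np
from scipy import stats

def mean_y_confidence_interval(x, y, x_star, conf_level=0.95):
    """Confidence interval for the mean y value when x = x_star."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n, x_bar = len(y), x.mean()
    s_xx = np.sum((x - x_bar) ** 2)
    b = np.sum((x - x_bar) * (y - y.mean())) / s_xx
    a = y.mean() - b * x_bar
    s_e = np.sqrt(np.sum((y - (a + b * x)) ** 2) / (n - 2))
    fit = a + b * x_star                                          # point estimate a + b x*
    se_fit = s_e * np.sqrt(1/n + (x_star - x_bar) ** 2 / s_xx)    # estimated sd of a + b x*
    t_crit = stats.t.ppf(1 - (1 - conf_level) / 2, df=n - 2)
    return fit - t_crit * se_fit, fit + t_crit * se_fit
```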

© 2008 Brooks/Cole, a division of Thomson Learning, Inc.


Confidence Interval for a Single y Value

When the four basic assumptions of the simple linear regression model are met, a prediction interval for y*, a single y observation made when x has the value x*, has the form

a + bx* ± (t critical value) · √( s_e² + s²a+bx* )

where the t critical value is based on df = n - 2.

Many authors give the following equivalent form for the prediction interval:

a + bx* ± (t critical value) · s_e · √( 1 + 1/n + (x* - x̄)² / Σ(x - x̄)² )
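The prediction interval differs from the confidence interval for the mean y value only by the extra "1" under the square root. A self-contained sketch (Python/SciPy, hypothetical function name):

```python
import numpy as np
from scipy import stats

def single_y_prediction_interval(x, y, x_star, conf_level=0.95):
    """Prediction interval for a single y observation made when x = x_star."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n, x_bar = len(y), x.mean()
    s_xx = np.sum((x - x_bar) ** 2)
    b = np.sum((x - x_bar) * (y - y.mean())) / s_xx
    a = y.mean() - b * x_bar
    s_e = np.sqrt(np.sum((y - (a + b * x)) ** 2) / (n - 2))
    fit = a + b * x_star
    # extra "1" under the square root accounts for the variability of a new observation
    se_pred = s_e * np.sqrt(1 + 1/n + (x_star - x_bar) ** 2 / s_xx)
    t_crit = stats.t.ppf(1 - (1 - conf_level) / 2, df=n - 2)
    return fit - t_crit * se_pred, fit + t_crit * se_pred
```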

© 2008 Brooks/Cole, a division of Thomson Learning, Inc.


Example - Mean Annual Temperature vs. Mortality

Data was collected in certain regions of Great Britain, Norway and Sweden to study the relationship between the mean annual temperature and the mortality rate for a specific type of breast cancer in women.

* Lea, A.J. (1965) New Observations on distribution of neoplasms of female breast in certain European countries. British Medical Journal, 1, 488-490

© 2008 Brooks/Cole, a division of Thomson Learning, Inc.


Example - Mean Annual Temperature vs. Mortality

Regression Analysis: Mortality index versus Mean annual temperature

The regression equation is
Mortality index = -21.8 + 2.36 Mean annual temperature

Predictor     Coef   SE Coef      T      P
Constant    -21.79     15.67  -1.39  0.186
Mean ann    2.3577    0.3489   6.76  0.000

S = 7.545   R-Sq = 76.5%   R-Sq(adj) = 74.9%

Analysis of Variance

Source          DF      SS      MS      F      P
Regression       1  2599.5  2599.5  45.67  0.000
Residual Error  14   796.9    56.9
Total           15  3396.4

Unusual Observations

Obs  Mean ann  Mortalit    Fit  SE Fit  Residual  St Resid
 15      31.8     67.30  53.18    4.85     14.12    2.44RX

R denotes an observation with a large standardized residual
X denotes an observation whose X value gives it large influence.

© 2008 Brooks/Cole, a division of Thomson Learning, Inc.


Example - Mean Annual Temperature vs. Mortality

[Figure: scatterplot of mortality index vs. mean annual temperature with the fitted line.]

The highlighted point has a large standardized residual and is influential because of the low mean annual temperature.

© 2008 Brooks/Cole, a division of Thomson Learning, Inc.


Example - Mean Annual Temperature vs. Mortality

Predicted Values for New Observations

New Obs    Fit  SE Fit        95.0% CI            95.0% PI
      1  53.18    4.85  ( 42.79,  63.57)  ( 33.95,  72.41) X
      2  60.72    3.84  ( 52.48,  68.96)  ( 42.57,  78.88)
      3  72.51    2.48  ( 67.20,  77.82)  ( 55.48,  89.54)
      4  83.34    1.89  ( 79.30,  87.39)  ( 66.66, 100.02)
      5  96.09    2.67  ( 90.37, 101.81)  ( 78.93, 113.25)
      6  99.16    3.01  ( 92.71, 105.60)  ( 81.74, 116.57)

X denotes a row with X values away from the center

Values of Predictors for New Observations

New Obs  Mean ann
      1      31.8
      2      35.0
      3      40.0
      4      44.6
      5      50.0
      6      51.3

These are the x* values for which the fits, standard errors of the fits, 95% confidence intervals for the mean y value, and 95% prediction intervals for a single y value are given above.

© 2008 Brooks/Cole, a division of Thomson Learning, Inc.


Example - Mean Annual Temperature vs. Mortality

95% confidence interval for the mean y value at x = 40: (67.20, 77.82)

95% prediction interval for a single y value at x = 45: (67.62, 100.98)

© 2008 Brooks/Cole, a division of Thomson Learning, Inc.


A Test for Independence in a Bivariate Normal Population

Null hypothesis: H0: ρ = 0

Test statistic:

t = r / √( (1 - r²) / (n - 2) )

The t critical value is based on df = n - 2.

Assumption: r is the correlation coefficient for a random sample from a bivariate normal population.
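A sketch of this test in Python/SciPy (hypothetical function name); scipy.stats.pearsonr should give an equivalent two-sided P-value directly.

```python
import numpy as np
from scipy import stats

def correlation_t_test(x, y):
    """t statistic and two-sided P-value for H0: rho = 0 (bivariate normal population)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n = len(x)
    r = np.corrcoef(x, y)[0, 1]                    # sample correlation coefficient
    t = r / np.sqrt((1 - r ** 2) / (n - 2))        # test statistic with df = n - 2
    p_two_sided = 2 * stats.t.sf(abs(t), df=n - 2)
    return t, p_two_sided
```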

© 2008 Brooks/Cole, a division of Thomson Learning, Inc.


A Test for Independence in a Bivariate Normal Population

Alternate hypothesis: Ha: ρ > 0 (positive dependence): P-value is the area under the appropriate t curve to the right of the computed t.

Alternate hypothesis: Ha: ρ < 0 (negative dependence): P-value is the area under the appropriate t curve to the left of the computed t.

  • Alternate hypothesis: Ha: ρ ≠ 0 (dependence):

  • P-value is

    • twice the area under the appropriate t curve to the left of the computed t value if t < 0 and

    • twice the area under the appropriate t curve to the right of the computed t value if t > 0

© 2008 Brooks/Cole, a division of Thomson Learning, Inc.


Example

Recall the data from the study of %Fat vs. Age for humans.

There are 18 data points, and a quick calculation of the Pearson correlation coefficient gives

r = 0.79209.

We will test to see if there is a dependence at the 0.05 significance level.

© 2008 Brooks/Cole, a division of Thomson Learning, Inc.


Example

  • ρ = the correlation between %Fat and age in the population from which the sample was selected

  • H0: ρ = 0

  • Ha: ρ ≠ 0

  • α = 0.05

© 2008 Brooks/Cole, a division of Thomson Learning, Inc.


Example

  • Looking at the two normal plots, we can see that it is not reasonable to assume that either the distribution of age or the distribution of %Fat is normal. (Notice that the data points deviate from a linear pattern quite substantially.) Since neither appears normal, we shall not continue with the test.

© 2008 Brooks/Cole, a division of Thomson Learning, Inc.


Another Example - Height vs. Joint Length

The professor in an elementary statistics class wanted to explain correlation so he needed some bivariate data. He asked his class (presumably a random or representative sample of late adolescent humans) to measure the length of the metacarpal bone on the index finger of the right hand (in cm) and height (in ft). The data are provided on the next slide.

© 2008 Brooks/Cole, a division of Thomson Learning, Inc.


Example - Height vs. Joint Length

There are 17 data points, and a quick calculation of the Pearson correlation coefficient gives r = 0.74908.

We will test to see if the true population correlation coefficient is positive at the 0.05 level of significance.

© 2008 Brooks/Cole, a division of Thomson Learning, Inc.


Example - Height vs. Joint Length

  • ρ = the true correlation between height and right index finger metacarpal length in the population from which the sample was selected

  • H0: ρ = 0

  • Ha: ρ > 0

  • α = 0.05

© 2008 Brooks/Cole, a division of Thomson Learning, Inc.


Example - Height vs. Joint Length

  • Looking at the two normal plots, we can see that it is reasonable to assume that the distribution of height and the distribution of metacarpal length are both normal. (Notice that the data points follow a reasonably linear pattern.) This appears to support the assumption that the sample is from a bivariate normal distribution. We will assume that the class was a random sample of young adults.

© 2008 Brooks/Cole, a division of Thomson Learning, Inc.


Example - Height vs. Joint Length

  • Calculation:

    t = r / √( (1 - r²) / (n - 2) ) = 0.74908 / √( (1 - 0.74908²) / 15 ) = 4.379

  • P-value: Looking in the table of tail areas for t curves under 15 degrees of freedom, 4.379 is off the bottom of the table, so the P-value < 0.001. Minitab reports the P-value to be 0.001.

  • Conclusion: The P-value is smaller than α = 0.05, so we reject H0. We can conclude that the true population correlation coefficient is greater than 0, i.e., the metacarpal bone is longer for taller people.
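A quick numerical check of the calculation above (Python/SciPy), using only the r and n reported in the example:

```python
import numpy as np
from scipy import stats

r, n = 0.74908, 17                               # values from the example
t = r / np.sqrt((1 - r ** 2) / (n - 2))          # test statistic, df = 15
p_upper = stats.t.sf(t, df=n - 2)                # Ha: rho > 0, upper-tail area
print(f"t = {t:.3f}, P-value = {p_upper:.5f}")   # t is approximately 4.379; P-value well below 0.001
```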

© 2008 Brooks/Cole, a division of Thomson Learning, Inc.

