Outline

- Basic Concepts
- Sample Size Calculation
  - Precision analysis
  - Power analysis
- Power analysis for the three most frequently used regressions
  - Logistic Regression
  - Cox Regression
  - Linear Regression

- Two kinds of errors can occur when testing hypotheses.
- Type I error: If the null hypothesis is rejected when it is true, then a type I error occurs.
- Type II error: If the null hypothesis is not rejected when it is false, then a type II error is made.

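- Probabilities of making type I and II errors
- $\alpha = P(\text{type I error}) = P(\text{reject } H_0 \mid H_0 \text{ is true})$
- $\beta = P(\text{type II error}) = P(\text{fail to reject } H_0 \mid H_0 \text{ is false})$
- Significance level: an upper bound for $\alpha$.
- Power: the probability of correctly rejecting the null hypothesis when the null hypothesis is false, i.e. $\text{Power} = 1 - \beta$.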

- In practice, sample size may be determined based on either precision analysis or power analysis. The next few slides describe each analysis in detail. Since power analysis is more practical, the discussion focuses on power analysis.

- For a confidence interval, the precision of the interval depends on its width. The narrower the interval is, the more precise the inference is. Therefore, the precision analysis for sample size determination is to consider the maximum half width of the confidence interval of the unknown parameter that one is willing to accept. The maximum half width of the confidence interval is usually referred to as the maximum error of an estimate of the unknown parameter.

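- For example, let $X_1, \dots, X_n$ be independent and identically distributed normal random variables with mean $\mu$ and variance $\sigma^2$. When $\sigma^2$ is known, a $(1-\alpha)100\%$ confidence interval for $\mu$ can be obtained as
$$\bar{X} \pm z_{\alpha/2} \frac{\sigma}{\sqrt{n}},$$
where $z_{\alpha/2}$ is the upper $(\alpha/2)$th percentile of the standard normal distribution. The maximum error, denoted by $E$, is then defined as
$$E = z_{\alpha/2} \frac{\sigma}{\sqrt{n}}.$$

- Thus, the sample size required to achieve the desired maximum error can be chosen as
$$n = \frac{z_{\alpha/2}^2 \sigma^2}{E^2}.$$
- An Example: suppose that we wish to have a 95% assurance that the error in the estimated mean is less than 10% of the standard deviation (i.e., $E = 0.1\sigma$). The required sample size is
$$n = \frac{z_{0.025}^2 \sigma^2}{(0.1\sigma)^2} = \frac{1.96^2}{0.1^2} = 384.2 \approx 385.$$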
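
The precision-analysis calculation can be checked numerically. The following is a minimal Python sketch (standard library only), assuming a known standard deviation and the usual formula $n = (z_{\alpha/2}\sigma/E)^2$ rounded up to the next integer. The function name `precision_sample_size` is illustrative and does not come from the original slides.

```python
import math
from statistics import NormalDist


def precision_sample_size(conf_level, max_error, sigma=1.0):
    """Smallest n such that z_{alpha/2} * sigma / sqrt(n) <= max_error.

    Assumes i.i.d. normal data with known standard deviation `sigma`.
    """
    alpha = 1.0 - conf_level
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # upper alpha/2 percentile
    return math.ceil((z * sigma / max_error) ** 2)


# 95% assurance that the error is below 10% of the standard deviation.
# Only the ratio max_error / sigma matters, so sigma is set to 1 here.
print(precision_sample_size(0.95, max_error=0.1, sigma=1.0))
```

Halving the acceptable error roughly quadruples the required sample size, because $n$ is proportional to $1/E^2$.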

- Since a type I error is usually considered the more serious error, a typical approach in hypothesis testing is to control $\alpha$ at an acceptable level and try to minimize $\beta$ by choosing an appropriate sample size. In other words, the null hypothesis can be tested at a pre-determined level of significance $\alpha$ with a desired power ($1-\beta$). This approach is usually referred to as power analysis for sample size determination.

- To determine sample size by power analysis, the investigator must specify the following information. First, select a significance level $\alpha$: the chance of wrongly concluding that a difference exists when in fact there is no real difference (type I error). Typically, 0.05 is chosen. Second, select a desired power $1-\beta$: the chance of correctly detecting a difference when the difference truly exists. A conventional choice of power is either 90% or 80%.

- Third, specify a clinically meaningful difference, denoted by $\delta$. The smaller $\delta$ is, the larger the sample size needed. Finally, knowledge of the standard deviation (i.e. $\sigma$) of the primary endpoint considered in the study is also required for sample size determination. A very precise method of measurement (small $\sigma$) will permit detection of any given difference with a much smaller sample size than would be required with a less precise measurement.

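- Suppose there are two groups of observations, namely $x_i$, $i = 1, \dots, n_1$ (treatment) and $y_i$, $i = 1, \dots, n_2$ (control). Assume that $x_i$ and $y_i$ are independent and normally distributed with means $\mu_1$ and $\mu_2$ and variances $\sigma_1^2$ and $\sigma_2^2$, respectively. Suppose the hypotheses of interest are
$$H_0: \mu_1 = \mu_2 \quad \text{versus} \quad H_a: \mu_1 \neq \mu_2.$$

For illustration purposes, we assume (i) $\sigma_1^2$ and $\sigma_2^2$ are known, and (ii) $n_1 = n_2 = n$. Under these assumptions, a Z-statistic can be used to test the mean difference.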

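- The Z-test is given by
$$Z = \frac{\bar{x} - \bar{y}}{\sqrt{\sigma_1^2/n + \sigma_2^2/n}}.$$
Under the null hypothesis of no treatment difference, Z is distributed as N(0,1). Hence, we reject the null hypothesis when
$$|Z| > z_{\alpha/2}.$$

Under the alternative hypothesis that $\mu_1 = \mu_2 + \delta$ (taking $\delta > 0$ without loss of generality), Z is distributed as $N(\mu^*, 1)$, where
$$\mu^* = \frac{\delta}{\sqrt{\sigma_1^2/n + \sigma_2^2/n}} > 0.$$

- The corresponding power is then given by
$$P\{|N(\mu^*, 1)| > z_{\alpha/2}\} \approx P\{N(\mu^*, 1) > z_{\alpha/2}\} = P\{N(0, 1) > z_{\alpha/2} - \mu^*\}.$$
- To achieve the desired power of $(1-\beta)100\%$, we set
$$z_{\alpha/2} - \mu^* = -z_{\beta}.$$
- This leads to the required sample size
$$n = \frac{(\sigma_1^2 + \sigma_2^2)(z_{\alpha/2} + z_{\beta})^2}{\delta^2}.$$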

- An example: suppose the objective of the study is to compare a test drug with a control, and the standard deviation is 1 for the treatment group ($\sigma_1 = 1$) and 2 for the control group ($\sigma_2 = 2$). Then, by choosing $\delta = 1$, $\alpha = 0.05$ and $\beta = 0.10$, we have
$$n = \frac{(1^2 + 2^2)(1.96 + 1.28)^2}{1^2} = 52.5 \approx 53 \text{ per group}.$$
- Thus, a total of 106 subjects is required to achieve 90% power for detecting a clinically meaningful difference of $\delta = 1$ at the 5% level of significance.
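
The two-sample calculation can be reproduced with a short script. This is a minimal Python sketch (standard library only), assuming the two-sided Z-test with known variances and equal group sizes described above. The function names `n_per_group` and `approx_power` are illustrative, and the power function uses the same one-tail approximation as the derivation.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal N(0, 1)


def n_per_group(sigma1, sigma2, delta, alpha=0.05, power=0.90):
    """Per-group sample size for the two-sample Z-test with known variances."""
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha / 2)  # z_{alpha/2}
    z_beta = STD_NORMAL.inv_cdf(power)           # z_{beta}
    n = (sigma1**2 + sigma2**2) * (z_alpha + z_beta) ** 2 / delta**2
    return math.ceil(n)


def approx_power(n, sigma1, sigma2, delta, alpha=0.05):
    """Approximate power P{N(0,1) > z_{alpha/2} - mu*} with n subjects per group."""
    mu_star = delta / math.sqrt((sigma1**2 + sigma2**2) / n)
    return 1 - STD_NORMAL.cdf(STD_NORMAL.inv_cdf(1 - alpha / 2) - mu_star)


n = n_per_group(sigma1=1, sigma2=2, delta=1)
print(n, 2 * n)                            # per-group and total sample size
print(round(approx_power(n, 1, 2, 1), 3))  # power actually achieved at that n
```

Calling `n_per_group` with a smaller `delta` shows how quickly the requirement grows, since the sample size is proportional to $1/\delta^2$.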

- A variety of programs are available for sample size calculation. They differ in cost and in the sample size scenarios they cover. For a comprehensive review, go to the following link:
http://www.biostat.ucsf.edu/sampsize.html

- Among these programs, PASS is widely regarded as reliable and comprehensive, and it is commonly accepted in academic settings, especially in NIH grant submissions. However, PASS is not free: the cost is about $650 per license.

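- Simple logistic regression expresses the relationship between a binary response variable ($Y$) and a covariate ($X_1$). The simple logistic regression model relates the probability of $Y = 1$ to $X_1$ by the formula
$$\log\left(\frac{P}{1-P}\right) = \beta_0 + \beta_1 X_1,$$
where P is the probability of $Y = 1$ given the value of the covariate $X_1$.

- Suppose one wants to test the null hypothesis that
$$H_0: \beta_1 = 0 \quad (\text{equivalently } B = 1),$$
where $B$ is the odds ratio comparing the odds at one standard deviation of $X_1$ above the mean with the odds at the mean of $X_1$.

- Hsieh, Bloch, and Larsen (1998) gave the following sample size formula when $X_1$ is normally distributed:
$$n = \frac{(z_{\alpha/2} + z_{\beta})^2}{P_0 (1 - P_0) (\log B)^2},$$
where $z_p$ is the upper $p$th percentile of the standard normal distribution, i.e. $P(Z > z_p) = p$.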

- The sample size formula indicates that to determine the required sample size, one needs to know the following factors:
- $\alpha$: significance level
- $1-\beta$: desired power
- $B$: odds ratio to be detected
- $P_0$: probability of $Y = 1$ at the mean of the covariate

- Example 1 : A study is to be undertaken to study the relationship between post-traumatic stress disorder and heart rate after viewing video tapes containing violent sequences. Heart rate is assumed to be normally distributed. The post-traumatic stress disorder rate is thought to be 7% among the soldiers with mean heart rate. The researchers want a sample size large enough to detect an odds ratio of 1.5 with 90% power at the 0.05 significance level with a two-sided test.

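- The example described on the previous slide indicates that
$$\alpha = 0.05, \quad 1 - \beta = 0.90, \quad B = 1.5, \quad P_0 = 0.07.$$
- Plugging these values into the sample size formula, we have
$$n = \frac{(1.9600 + 1.2816)^2}{0.07 \times 0.93 \times (\log 1.5)^2} \approx 981.8,$$
which rounds up to 982.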

- The following Splus code can be used to carry out the hand calculation on the previous slide:

simple.logistic.regression.continuous<-function(alpha,beta,p0,B){
  # alpha---significance level
  # beta---one minus power
  # p0---probability of Y=1 at the mean of the covariate
  # B---odds ratio comparing the odds at one standard deviation of the covariate
  #     above the mean with the odds at the mean
  N<-(qnorm(1-alpha/2)+qnorm(1-beta))^2/(p0*(1-p0)*(log(B))^2)
  N
}

simple.logistic.regression.continuous(0.05,0.1,0.07,1.5)

- Summary statement: A logistic regression of post-traumatic stress disorder on heart rate (assumed normally distributed) with a sample size of 982 observations achieves 90% power at a 0.05 significance level to detect an odds ratio of 1.5, i.e. the ratio of the odds at one standard deviation above the mean heart rate to the odds at the mean.

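- When $X_1$ is a binary covariate, one also wants to test the null hypothesis that
$$H_0: \beta_1 = 0 \quad (\text{equivalently } B = 1),$$
where $B$ is the odds ratio comparing the odds at $X_1 = 1$ with the odds at $X_1 = 0$. Notice that the interpretation of $B$ is different from that in the continuous covariate case.

- Hsieh, Bloch, and Larsen (1998) also gave a sample size formula when $X_1$ is binomially distributed. The sample size formula is
$$n = \frac{\left( z_{\alpha/2} \sqrt{\bar{P}(1-\bar{P})/R} + z_{\beta} \sqrt{P_0(1-P_0) + P_1(1-P_1)(1-R)/R} \right)^2}{(P_1 - P_0)^2 (1 - R)},$$
where $P_1$ is the probability of $Y = 1$ at $X_1 = 1$ and $\bar{P} = (1-R) P_0 + R P_1$ is the overall probability of $Y = 1$.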

- The sample size formula indicates that to determine the required sample size, one needs to know the following factors:
- $\alpha$: significance level
- $1-\beta$: desired power
- $B$: odds ratio to be detected
- $P_0$: probability of $Y = 1$ at $X_1 = 0$
- $R$: the proportion of the sample with $X_1 = 1$

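- Example 2: A study is to be undertaken to study the relationship between post-traumatic stress disorder and gender. The post-traumatic stress disorder rate is thought to be 10% among the males, and the proportion of females is 50%. The researchers want a sample size large enough to detect an odds ratio of 2 with 80% power at the 0.05 significance level with a two-sided test.

- The example described on the previous slide indicates that
$$\alpha = 0.05, \quad 1 - \beta = 0.80, \quad B = 2, \quad P_0 = 0.10, \quad R = 0.5.$$
- To apply the sample size formula, we still need to calculate $P_1$ and $\bar{P}$. They can be obtained by
$$P_1 = \frac{B P_0}{1 - P_0 + B P_0} = \frac{2 \times 0.10}{0.90 + 0.20} = 0.1818, \qquad \bar{P} = (1-R) P_0 + R P_1 = 0.5 \times 0.10 + 0.5 \times 0.1818 = 0.1409.$$

- Plugging these values into the sample size formula, we have
$$n = \frac{\left( 1.9600 \sqrt{0.1409 \times 0.8591 / 0.5} + 0.8416 \sqrt{0.10 \times 0.90 + 0.1818 \times 0.8182} \right)^2}{(0.1818 - 0.10)^2 \times 0.5} \approx 565.4,$$
which rounds up to 566.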

- The following Splus code can be used to carry out the hand calculation on the previous slide:

simple.logistic.regression.binary<-function(alpha,beta,p0,B,R){
  # alpha---significance level
  # beta---one minus power
  # p0---probability of Y=1 when x1=0
  # B---odds ratio to be detected
  # R---the proportion of the sample with x1=1
  p1<-B*p0/(1-p0+B*p0)
  pbar<-(1-R)*p0+R*p1
  temp1<-pbar*(1-pbar)/R
  temp2<-p0*(1-p0)+p1*(1-p1)*(1-R)/R
  temp3<-(p1-p0)^2*(1-R)
  N<-(qnorm(1-alpha/2)*sqrt(temp1)+qnorm(1-beta)*sqrt(temp2))^2/temp3
  N
}

simple.logistic.regression.binary(0.05,0.2,0.1,2,0.5)

- Summary statement: A logistic regression of post-traumatic stress disorder on gender with a sample size of 566 observations (565.4 rounded up) achieves 80% power at a 0.05 significance level to detect an odds ratio of 2, i.e. the ratio of the odds for females to the odds for males, given the inputs used in the function call ($P_0 = 0.10$, $R = 0.5$).

- Multiple logistic regression expresses the relationship between a binary response variable, $Y$, and two or more covariates, $X_1, \dots, X_k$. The multiple logistic regression model relates the probability of $Y = 1$ to $X_1, \dots, X_k$ by the formula
$$\log\left(\frac{P}{1-P}\right) = \beta_0 + \beta_1 X_1 + \cdots + \beta_k X_k,$$
where P is the probability of Y=1 given the values of the covariates.

- When there are multiple covariates, the following adjustment was given by Hsieh, Bloch, and Larsen (1998) to give the adjusted sample size,
$$n_a = \frac{n}{1 - \rho^2},$$
where $n$ is the sample size resulting from the simple logistic regression with $X_1$ (the variable of interest) being the covariate, and $\rho$ is the multiple correlation coefficient between $X_1$ and the remaining covariates; $\rho^2$ is equal to the proportion of the variance of $X_1$ explained by the remaining covariates.

- Example 3: A study is to be undertaken to study the relationship between post-traumatic stress disorder and heart rate after viewing video tapes containing violent sequences. Heart rate is assumed to be normally distributed. The post-traumatic stress disorder rate is thought to be 7% among the soldiers with mean heart rate. In addition to heart rate, two more covariates, gender and age, are intended to be included in the model. The multiple correlation of heart rate with gender and age is 0.2. The researchers want a sample size large enough to detect an odds ratio of 1.5 with 90% power at the 0.05 significance level with a two-sided test.

- From example 1, when only heart rate is the covariate in the model, the required sample size is $n = 982$. Thus the adjusted sample size with two more covariates with multiple correlation of 0.2 added in the model becomes
$$n_a = \frac{982}{1 - 0.2^2} = \frac{982}{0.96} = 1022.9 \approx 1023.$$

- Summary statement: A multiple logistic regression of post-traumatic stress disorder on heart rate, gender and age with a sample size of 1023 observations achieves 90% power at a 0.05 significance level to detect an odds ratio of 1.5, i.e. the ratio of the odds at one standard deviation above the mean of heart rate to the odds at the mean of heart rate, controlling for gender and age.
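
The adjustment in Example 3 can be checked in a few lines. This is a minimal Python sketch (standard library only), assuming the simple logistic regression formula for a normally distributed covariate and the variance-inflation adjustment $n/(1-\rho^2)$ attributed above to Hsieh, Bloch, and Larsen (1998). The function names are illustrative and are not the Splus functions from the slides.

```python
import math
from statistics import NormalDist


def logistic_n_continuous(alpha, power, p0, odds_ratio):
    """Unrounded n for simple logistic regression with a normal covariate.

    p0 is the event probability at the covariate mean; odds_ratio compares
    one standard deviation above the mean with the mean.
    """
    z = NormalDist().inv_cdf
    numerator = (z(1 - alpha / 2) + z(power)) ** 2
    return numerator / (p0 * (1 - p0) * math.log(odds_ratio) ** 2)


def adjust_for_covariates(n_simple, rho):
    """Inflate n by 1 / (1 - rho^2), rho being the multiple correlation."""
    return n_simple / (1 - rho**2)


n_simple = math.ceil(logistic_n_continuous(0.05, 0.90, 0.07, 1.5))
n_adjusted = math.ceil(adjust_for_covariates(n_simple, rho=0.2))
print(n_simple, n_adjusted)
```

With a multiple correlation of 0.2 the inflation is only about 4%, but it grows quickly as the covariate of interest becomes more strongly correlated with the other covariates.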

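- Cox proportional hazards regression models the relationship between the hazard function $h(t \mid X_1, \dots, X_k)$ of survival time and k covariates using the following formula
$$\log h(t \mid X_1, \dots, X_k) = \log h_0(t) + \beta_1 X_1 + \cdots + \beta_k X_k,$$
where $h_0(t)$ is the baseline hazard.

- Suppose one wants to test the null hypothesis
$$H_0: \beta_1 = 0 \quad (\text{equivalently } B = 1),$$
where $B = e^{\beta_1}$ is the hazard ratio: the ratio of the hazard rates for a one-unit increase in $X_1$. The spread of $X_1$ enters the sample size formula separately through its variance $\sigma^2$.

- Hsieh and Lavori (2000) gave the following sample size formula when $X_1$ is normally distributed:
$$n = \frac{(z_{\alpha/2} + z_{\beta})^2}{P \, \sigma^2 (\log B)^2},$$
where $z_p$ is the upper $p$th percentile of the standard normal distribution.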

- The sample size formula indicates that to determine the required sample size, one needs to know the following factors:
- $\alpha$: significance level
- $1-\beta$: desired power
- $B$: hazard ratio to be detected
- $P$: the proportion of subjects that become incident cases
- $\sigma^2$: the variance of $X_1$

- Example 4: Compute the sample size required to achieve 80% power in detecting a hazard ratio of 1.5 for a continuous covariate of interest with standard deviation 0.3, assuming only 85% of subjects survive until the end of the study (so that 15% become incident cases).
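
Example 4 can be worked through numerically. This is a minimal Python sketch (standard library only) under stated assumptions: a two-sided 0.05 significance level (the example does not state one), a hazard ratio expressed per unit of the covariate, and the Hsieh and Lavori (2000) form $n = (z_{\alpha/2} + z_{\beta})^2 / (P \sigma^2 (\log B)^2)$. The function name `cox_n_continuous` is illustrative.

```python
import math
from statistics import NormalDist


def cox_n_continuous(hazard_ratio, sd, event_prop, alpha=0.05, power=0.80):
    """Total sample size for a Cox model with one normal covariate.

    hazard_ratio: hazard ratio per unit increase of the covariate (assumed)
    sd:           standard deviation of the covariate
    event_prop:   proportion of subjects expected to become incident cases
    """
    z = NormalDist().inv_cdf
    numerator = (z(1 - alpha / 2) + z(power)) ** 2
    events_needed = numerator / (sd**2 * math.log(hazard_ratio) ** 2)
    return math.ceil(events_needed / event_prop)


# 85% survive to the end of the study, so 15% become incident cases.
print(cox_n_continuous(hazard_ratio=1.5, sd=0.3, event_prop=0.15))
```

The calculation first finds the number of events needed and then divides by the event proportion, which is why a low event rate inflates the total sample size so strongly.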

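- When $X_1$ is a binary covariate, one also wants to test the null hypothesis
$$H_0: \beta_1 = 0 \quad (\text{equivalently } B = 1),$$
where $B$ is the hazard ratio: the ratio of the hazard at $X_1 = 1$ to the hazard at $X_1 = 0$.

- Schoenfeld (1983) gave the following sample size formula when $X_1$ is binomially distributed:
$$n = \frac{(z_{\alpha/2} + z_{\beta})^2}{P \, R (1 - R) (\log B)^2}.$$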

- The sample size formula indicates that to determine the required sample size, one needs to know the following factors:
- $\alpha$: significance level
- $1-\beta$: desired power
- $B$: hazard ratio to be detected
- $P$: the proportion of subjects that become incident cases
- $R$: the proportion of the sample with $X_1 = 1$

- Example 5: Compute the sample size required to achieve 80% power in detecting a hazard ratio of 1.5 for a binary covariate of interest with an exposure rate of 0.2 (20% of the sample has $X_1 = 1$), assuming only 85% of subjects survive until the end of the study (so that 15% become incident cases).
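
Example 5 can be sketched the same way. This is a minimal Python sketch (standard library only), assuming a two-sided 0.05 significance level (not stated in the example) and the Schoenfeld (1983) form $n = (z_{\alpha/2} + z_{\beta})^2 / (P R (1-R) (\log B)^2)$. The function name `cox_n_binary` is illustrative.

```python
import math
from statistics import NormalDist


def cox_n_binary(hazard_ratio, exposed_prop, event_prop, alpha=0.05, power=0.80):
    """Total sample size for a Cox model with one binary covariate.

    hazard_ratio: hazard at X1 = 1 divided by hazard at X1 = 0
    exposed_prop: proportion of the sample with X1 = 1
    event_prop:   proportion of subjects expected to become incident cases
    """
    z = NormalDist().inv_cdf
    numerator = (z(1 - alpha / 2) + z(power)) ** 2
    allocation = exposed_prop * (1 - exposed_prop)
    events_needed = numerator / (allocation * math.log(hazard_ratio) ** 2)
    return math.ceil(events_needed / event_prop)


# Exposure rate 0.2 and 15% of subjects becoming incident cases.
print(cox_n_binary(hazard_ratio=1.5, exposed_prop=0.2, event_prop=0.15))
```

The term $R(1-R)$ is largest at $R = 0.5$, so a balanced exposure split requires the fewest subjects for a given hazard ratio.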

- When there are multiple covariates, the following adjustment was given by Hsieh and Lavori (2000) to give the adjusted sample size,
$$n_a = \frac{n}{1 - \rho^2},$$
where $n$ is the sample size resulting from the simple Cox regression with $X_1$ (the variable of interest) being the covariate, and $\rho$ is the multiple correlation coefficient between $X_1$ and the remaining covariates; $\rho^2$ is equal to the proportion of the variance of $X_1$ explained by the remaining covariates.

- From example 4, when only the continuous covariate is in the model, let the required sample size be $n$. Thus the adjusted sample size with two more covariates with multiple correlation of 0.2 added in the model becomes
$$n_a = \frac{n}{1 - 0.2^2} = \frac{n}{0.96},$$
i.e. about 4% larger than $n$.

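- Linear regression expresses the relationship between a continuous response variable, $Y$, and one or more covariates, $X_1, \dots, X_p$. The multiple linear regression model relates $Y$ to $X_1, \dots, X_p$ by the formula
$$Y = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p + \varepsilon,$$
where $\varepsilon$ is a normally distributed random variable with mean 0 and variance $\sigma^2$.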

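- Suppose one wants to test the null hypothesis
$$H_0: R^2_{T \mid C} = 0,$$
where C refers to the variables controlled, and T refers to the variables tested.

- Let $R^2_C$ be the $R^2$ achieved when $Y$ is regressed on those in set C, and $R^2_{TC}$ the $R^2$ achieved when $Y$ is regressed on those in both sets T and C. Then $R^2_{T \mid C} = R^2_{TC} - R^2_C$ is the increase in $R^2$ due to the variables in T.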

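- The formula for computing the power is
$$\text{Power} = P\left(F > F_{1-\alpha, u, v}\right),$$
where

(1) $F_{1-\alpha, u, v}$ is the $(1-\alpha)100$th percentile of the central F distribution with u and v degrees of freedom. The value of u is the number of variables in T, $v = n - u - k - 1$, and k is the number of variables in C, and

(2) F is distributed as a non-central F with u and v degrees of freedom and non-centrality parameter $\lambda$. The value of $\lambda$ is
$$\lambda = n \, \frac{R^2_{TC} - R^2_C}{1 - R^2_{TC}}.$$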

- The power formula indicates that to compute the power, one needs to know the following factors. The required sample size is the smallest $n$ for which the power reaches the desired level.
- $\alpha$: significance level
- $n$: sample size
- u: the number of covariates tested
- k: the number of covariates controlled
- $R^2_C$: the $R^2$ achieved when $Y$ is regressed on those in set C
- $R^2_{TC}$: the $R^2$ achieved when $Y$ is regressed on those in sets T and C

- Example 6: A school district is designing a multiple regression study looking at the effect of gender, family income, mother's education and language spoken in the home on the English language proficiency scores of Latino high school students. The variables gender, family income, and Mother's education are control variables and not of primary research interest. The variable language spoken in the home, the variable of primary research interest, is a categorical research variable with three levels: 1) Spanish only, 2) both Spanish and English, and 3) English only. Since there are three levels, it will take two dummy variables to code language spoken in the home.

- The full regression model will look something like this,
engprof = b0 + b1(gender) + b2(income) + b3(momeduc) + b4(homelang1) + b5(homelang2)

- Thus, the primary research hypothesis is the test of b4 = b5 = 0.

- To begin with, we believe, from previous research, that the $R^2$ for the full model with five predictor variables (3 control variables and 2 dummy variables for the categorical variable) will be about 0.48.
- We think that the two dummy variables will add about 0.03 to the $R^2$ when they are added last to the model. This means that the $R^2$ for the model without the two dummy variables would be about 0.45.
- The total number of variables controlled is three and the number being tested is two.

power.linear.regression<-function(N,alpha,kc,kt,Rsq.c,Rsq.tc){
  # N---sample size
  # alpha---significance level
  # kc---the number of variables controlled
  # kt---the number of variables tested
  # Rsq.c---the R square when the variables controlled are in the model
  # Rsq.tc---the R square when both the variables controlled and the variables tested are in the model
  u<-kt                            # numerator degrees of freedom
  v<-N-kc-kt-1                     # denominator degrees of freedom
  falpha<-qf(1-alpha,u,v)          # critical value of the central F
  fsq<-(Rsq.tc-Rsq.c)/(1-Rsq.tc)   # effect size f squared
  lambda<-N*fsq                    # non-centrality parameter
  power<-1-pf(falpha,u,v,lambda)   # power under the non-central F
  power
}

power.linear.regression(170,0.05,3,2,0.45,0.48)

- Hsieh, F.Y., Bloch, D.A., and Larsen, M.D. (1998). A Simple Method of Sample Size Calculation for Linear and Logistic Regression. Statistics in Medicine, 17, 1623-1634.
http://personal.health.usf.edu/ywu/logistic.pdf

- Schoenfeld, D.A. (1983). Sample-Size Formula for the Proportional-Hazards Regression Model. Biometrics, 39, 499-503.
http://personal.health.usf.edu/ywu/cox_bin.pdf

- Hsieh, F.Y. and Lavori, P.W. (2000). Sample-Size Calculations for the Cox Proportional Hazards Regression Model with Nonbinary Covariates. Controlled Clinical Trials, 21, 552-560.
http://personal.health.usf.edu/ywu/cox.pdf

- Suppose somebody asks you to calculate the sample size for a matched case-control study, and you know that the sample size formula is available from the following paper:
http://personal.health.usf.edu/ywu/matched_case_control.pdf

1. Identify the sample size formula, and write a Splus function to carry out the sample size calculation;

2. Use an example to illustrate how to use your Splus function, and write a statement to summarize the result of your power analysis.