Biostatistics in practice
Biostatistics in Practice. Session 2: Summarization of Quantitative Information. Peter D. Christenson Biostatistician http://gcrc. /Biostat. Topics for this Session. Experimental Units Independence of Measurements Graphs: Summarizing Results Graphs: Aids for Analysis

Biostatistics in Practice

Session 2:

Summarization of Quantitative Information

Peter D. Christenson


Topics for this Session

Experimental Units

Independence of Measurements

Graphs: Summarizing Results

Graphs: Aids for Analysis

Summary Measures

Confidence Intervals

Prediction Intervals

Most Practical from this Session

Geometric Means

Confidence Intervals

Reference Ranges

Justify Methods from Graphs

Experimental Units and Independence of Measurements

Statistical Independence

Experimental units are the smallest independent entities for addressing a scientific question in an analysis of an experiment.

“Independent” refers to the measurement that is made and the question, not the units.

Definition: If knowledge of the value for a unit does not provide information about another unit’s value, given other factors (and the overall mean) in the analysis of the experiment, then the units are independent for this measurement.

There may be a hierarchy of units.

Importance of Independence

Many basic statistical methods require that measurements are independent for the analysis to be valid.

Other methods can incorporate the lack of independence.

There can be some subjectivity regarding independence. Statistical methods use models. Models can be wrong.

Example: Units and Independence

Ten mice receive treatment A, each is bled, and blood samples are each divided into 3 aliquots. The same is done for 10 mice on treatment B.

  • A serum hormone is measured in the 60 aliquots and compared between A and B.

  • The aliquots for a mouse are not independent.

  • The unit is a mouse.

  • A summary statistic from each mouse’s 3 aliquots (e.g., maximum or mean) is independent of those from other mice.

  • N=10 and 10, not 30 and 30.
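The point of the example above can be sketched in code. This is a minimal illustration on made-up simulated data (the hormone levels, spreads, and function names are assumptions, not values from the slides): each mouse's 3 aliquots are collapsed to one mean, so the comparison of A and B uses 10 values per group.

```python
import random
import statistics

random.seed(0)  # reproducible, made-up data for illustration only


def simulate_group(n_mice, group_mean):
    """Return {mouse_id: [3 aliquot values]}; aliquots share their mouse's level."""
    data = {}
    for mouse in range(n_mice):
        mouse_level = random.gauss(group_mean, 10.0)  # varies between mice
        # Aliquots from one mouse cluster around that mouse's level,
        # so they are not independent of each other.
        data[mouse] = [random.gauss(mouse_level, 2.0) for _ in range(3)]
    return data


def per_mouse_means(group):
    """Collapse each mouse's aliquots to one value per experimental unit."""
    return [statistics.mean(aliquots) for aliquots in group.values()]


group_a = simulate_group(10, 100.0)  # hypothetical treatment A
group_b = simulate_group(10, 110.0)  # hypothetical treatment B

means_a = per_mouse_means(group_a)
means_b = per_mouse_means(group_b)

print("N per group:", len(means_a), len(means_b))
print("Mean of A:", round(statistics.mean(means_a), 1))
print("Mean of B:", round(statistics.mean(means_b), 1))
```

Any later comparison of A and B would then be run on `means_a` and `means_b`, not on the 60 raw aliquot values.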

Example, Continued

  • One of the 30 A aliquots is further divided into 25 parts and 5 different in vitro challenges are each made to a random set of 5 of the parts. The same is done for a single B aliquot.

  • For this challenge experiment, each part is a unit, the values of challenge response are independent, and N=25+25.

  • For comparing A and B, there are only N=1+1 experimental units, the two mice.

Experimental Units in Case Study

There is a nested hierarchy of several "levels" of data: Schools, children within the schools, and diets received by every child. What would you use for the "N" for this study?

Which outcomes do you intuitively think are correlated (in common language)? Results from one child's three diets? Results from children in the same school? Schools?

Experimental Units in Case Study

N = Number of children

Results from one child's three diets cannot be modeled as independent.

Results from children in the same school also could be “correlated” (dependent). They can be modeled as independent, if the effect of school is included in the analysis. Knowing one child’s score and the school mean gives no info on another child’s score.

Units and Analysis in the Case Study

N = Number of children


This method is a complex generalization of methods we discuss in Session 3.

For any method, though, you need to inform the software of the correct experimental units. For some experiments, it is obvious and implicit.

Graphs: Summarizing Results

Common Graphical Summaries

Graph Name     Y-axis           X-axis
Histogram      Count or %       Category
Scatterplot    Continuous       Continuous
Dot Plot       Continuous       Category
Box Plot       Percentiles      Category
Line Plot      Mean or value    Category


Many of the examples are from

Data Graphical Displays


Scatter plot

Raw Data


* Raw data version is a stem-leaf plot. We will see one later.

Data Graphical Displays

Dot Plot

Box Plot

Raw Data


Data Graphical Displays

Line or Profile Plot

Summarized - bars can represent various types of ranges

Data Graphical Displays

Kaplan-Meier Plot

Probability of Surviving 5 years is 0.35

This is not necessarily 35% of subjects

Graphs: Aids for Analysis

Graphical Aids for Analysis

Most statistical analyses involve modeling.

Parametric methods (t-test, ANOVA, χ²) have stronger requirements than non-parametric (rank-based) methods.

Every method is based on data satisfying certain requirements.

Many of these requirements can be assessed with some useful common graphics.

Look at the Data for Analysis Requirements

  • What do we look for?

  • In Histograms (one variable):

  • Ideal: Symmetric, bell-shaped.

  • Potential Problems:

    • Skewness.

    • Multiple peaks.

    • Many values at, say, 0, and bell-shaped otherwise.

    • Outliers.

Example Histogram: OK for Typical* Analyses

  • Symmetric.

  • One peak.

  • Roughly bell-shaped.

  • No outliers.

*Typical: mean, SD, confidence intervals, to be discussed in later slides.

Histograms: Not OK for Typical Analyses



Need to transform intensity to another scale, e.g. Log(intensity)

Need to summarize with percentiles, not mean.

Histograms: Not OK for Typical Analyses

Truncated Values


Undetectable in 28 samples (<LLOQ)


Need to use percentiles for most analyses.

Need to use median, not mean, and percentiles.

Look at the Data for Analysis Requirements

  • What do we look for?

  • In Scatter Plots (two variables):

  • Ideal: Football-shaped; ellipse.

  • Potential Problems:

    • Outliers.

    • Funnel-shaped.

    • Gap with no values for one or both variables.

Example Scatter Plot: OK for Typical Analyses

Scatter Plot: Not OK for Typical Analyses

Gap and Outlier


Ott, Amer J Obstet Gyn 2005;192:1803-9.

Ferber et al, Amer J Obstet Gyn 2004;190:1473-5.

Should transform y-value to another scale, e.g. logarithm.

Consider analyzing subgroups.

Summary Measures

Common Summary Measures

  • Mean and SD or SEM

  • Geometric Mean

  • Z-Scores

  • Correlation

  • Survival Probability

  • Risks, Odds, and Hazards

Summary Statistics: One Variable

  • Data Reduction to a few summary measures.

  • Basic: Need Typical Value and Variability of Values

  • Typical Values (“Location”):

    • Mean for symmetric data.

    • Median for skewed data.

    • Geometric mean for some skewed data - details in later slides.

Summary Statistics: Variation in Values

  • Standard Deviation: SD ≈ 1.25 × (average |deviation| of values from their mean), for bell-shaped data.

  • A standard convention, although its values are not intuitive.

  • SD of what? E.g., SD of individuals, or of group means.

  • Fundamental, critical measure for most statistical methods.
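The 1.25 rule of thumb for the SD can be checked numerically. The sketch below uses simulated bell-shaped data (the mean of 60 and SD of 10 are arbitrary assumptions, chosen only for illustration) and compares the SD with the average absolute deviation.

```python
import random
import statistics

random.seed(1)  # reproducible simulated data, not data from the slides

# Simulated bell-shaped values; the mean and SD here are arbitrary.
values = [random.gauss(60.0, 10.0) for _ in range(100_000)]

mean = statistics.fmean(values)
sd = statistics.stdev(values)

# Average absolute deviation of the values from their mean.
mean_abs_dev = statistics.fmean(abs(v - mean) for v in values)

ratio = sd / mean_abs_dev
print("SD / average |deviation| =", round(ratio, 3))
```

For normally distributed data the theoretical ratio is √(π/2), which is about 1.25; for skewed or heavy-tailed data the ratio can differ noticeably.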

Examples: Mean and SD



Mean = 60.6 min.

SD = 9.6 min.

Mean = 15.1

SD = 2.8

Note that the entire range of data in A is about 6 SDs wide; this is the origin of the name of the “Six Sigma” process used in quality control and business.

Examples: Mean and SD



Mean = 1.0 min.

SD = 1.1 min.

Mean = 70.3

SD = 22.3

Summary Statistics: Rule of Thumb

  • For bell-shaped distributions of data (“normally” distributed):

  • ~ 68% of values are within mean ±1 SD

  • ~ 95% of values are within mean ±2 SD

    • “(Normal) Reference Range”

  • ~ 99.7% of values are within mean ±3 SD
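The three percentages in this rule of thumb follow from the normal distribution itself. A minimal sketch using Python's standard library (the helper name `fraction_within` is just an illustrative choice):

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, SD 1.
std_normal = NormalDist(mu=0.0, sigma=1.0)


def fraction_within(k):
    """Fraction of a normal population lying within mean ± k SD."""
    return std_normal.cdf(k) - std_normal.cdf(-k)


for k in (1, 2, 3):
    print(f"within mean ± {k} SD: {fraction_within(k):.1%}")
```

The multiplier that gives exactly 95% is about 1.96, which is why ±2 SD is used as a convenient approximation.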

    Summary Statistics: Geometric means

    • Commonly used for skewed data.

      • Take logs of individual values.

      • Find, say, mean ±2 SD → mean and (low, up) of the logged values.

      • Find antilogs of mean, low, up. Call them GM, low2, up2 (back on original scale).

      • GM is the “geometric mean”. The interval (low2,up2) is skewed about GM (corresponds to graph).

  • [See next slide]

    Geometric Means

    These are flipped histograms rotated 90º, with box plots.

    Any log base can be used.

    GM = exp(4.633) = 102.8

    low2 = exp(4.633 - 2*1.09) = 11.6

    up2 = exp(4.633 + 2*1.09) = 909.6
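The back-transformation above can be reproduced in a few lines. This sketch starts from the slide's log-scale summary (mean 4.633 and SD 1.09 of the natural-log values); the function name `back_transform` is an assumption for illustration.

```python
import math


def back_transform(log_mean, log_sd, k=2.0):
    """Antilog of mean and mean ± k SD computed on the natural-log scale."""
    gm = math.exp(log_mean)                 # geometric mean
    low2 = math.exp(log_mean - k * log_sd)  # lower limit, original scale
    up2 = math.exp(log_mean + k * log_sd)   # upper limit, original scale
    return gm, low2, up2


# Log-scale mean and SD taken from the slide.
gm, low2, up2 = back_transform(4.633, 1.09)
print(f"GM = {gm:.1f}, range = ({low2:.1f}, {up2:.1f})")
```

The interval is skewed about GM on the original scale but symmetric as a ratio: both `gm / low2` and `up2 / gm` equal exp(2 × 1.09). With another log base, the same steps apply using that base's antilog.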

    Confidence Intervals

    Reference ranges (or prediction intervals) are for individuals.

    A 95% reference range contains the values for 95% of individuals.


    Confidence intervals (CI) are for a summary measure (parameter) of an entire population.

    A 95% CI contains the (still unknown) summary measure for “everyone” with 95% certainty.

    Z-Score = (Measure - Mean)/SD

    Standardize a measure to have mean=0 and SD=1.

    Z-scores make different measures comparable.

    Example: Time has Mean = 60.6 min. and SD = 9.6 min.

    Z-Score = (Time - 60.6)/9.6 has Mean = 0 and SD = 1.

    Times of 41, 61, and 79 min. correspond to Z-scores of about -2, 0, and 2.
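As a small worked example of the standardization, the sketch below applies the slide's mean (60.6 min.) and SD (9.6 min.) to the three times shown on the axis. The function name is an illustrative choice.

```python
def z_score(measure, mean, sd):
    """Standardize a measure: number of SDs above (+) or below (-) the mean."""
    return (measure - mean) / sd


MEAN_MIN = 60.6  # mean time in minutes, from the slide
SD_MIN = 9.6     # SD in minutes, from the slide

z_values = {time: z_score(time, MEAN_MIN, SD_MIN) for time in (41, 61, 79)}
for time, z in z_values.items():
    print(f"{time} min -> Z = {z:.2f}")
```

The results are close to -2, 0, and 2, matching the paired axis labels on the slide.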

    Outcome Measure in Case Study

    GHA = Global Hyperactivity Aggregate

    For each child at each time:

    Z1 = Z-Score for ADHD from Teachers

    Z2 = Z-Score for WWP from Parents

    Z3 = Z-Score for ADHD in Classroom

    Z4 = Z-Score for Conner on Computer

    All have higher values ↔ more hyperactive.

    Z’s make each measure scaled similarly.

    GHA= Mean of Z1, Z2, Z3, Z4

    Confidence Interval for Population Mean

    The 95% reference range (also called the prediction interval, or the “normal range” if the subjects are normal) is

    sample mean ± 2(SD)


    95% Confidence interval (CI) for the (true, but unknown) mean for the entire population is

    sample mean ± 2(SD/√N)

    SD/√N is called “Std Error of the Mean” (SEM)

    Confidence Interval: More Details

    Confidence interval (CI) for the (true, but unknown) mean for the entire population is

    95%, N=100: sample mean ± 1.98(SD/√N)

    95%, N= 30: sample mean ± 2.05(SD/√N)

    90%, N=100: sample mean ± 1.66(SD/√N)

    99%, N=100: sample mean ± 2.63(SD/√N)

    If N is small (N<30?), the data need a normal, bell-shaped distribution. Otherwise, skewness is OK. This is not true for the PI: with skewed data, percentiles are needed regardless of N.

    Confidence Interval: Case Study

    Confidence Interval:

    -0.14 ± 1.99(1.04/√73) = -0.14 ± 0.24 → -0.38 to 0.10

    This is close to the adjusted CI in Table 2: -0.12 (-0.37 to 0.13).

    Prediction Interval:

    -0.14 ± 1.99(1.04) = -0.14 ± 2.07 → -2.21 to 1.93
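The two intervals differ only in whether the SD is divided by √N. A minimal sketch that reproduces the case-study arithmetic (mean -0.14, SD 1.04, N = 73, multiplier 1.99, all from the slide; the function name is illustrative):

```python
import math


def mean_intervals(mean, sd, n, multiplier):
    """Return (confidence interval, prediction interval) for a sample mean."""
    sem = sd / math.sqrt(n)  # standard error of the mean
    conf = (mean - multiplier * sem, mean + multiplier * sem)
    pred = (mean - multiplier * sd, mean + multiplier * sd)
    return conf, pred


conf, pred = mean_intervals(mean=-0.14, sd=1.04, n=73, multiplier=1.99)
print("CI: %.2f to %.2f" % conf)
print("PI: %.2f to %.2f" % pred)
```

The CI narrows as N grows because of the √N divisor; the PI does not, since it describes individuals and not the mean.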

    CI for the Antibody Example

    Prediction interval (PI):

    GM = exp(4.633) = 102.8

    low2 = exp(4.633 - 2*1.09) = 11.6

    up2 = exp(4.633 + 2*1.09) = 909.6

    So, there is 95% assurance that an individual is between 11.6 and 909.6, the PI.

    Confidence interval (CI):

    GM = exp(4.633) = 102.8

    low2 = exp(4.633 - 2*1.09/√394) = 92.1

    up2 = exp(4.633 + 2*1.09/√394) = 114.8

    So, there is 95% certainty that the population geometric mean is between 92.1 and 114.8, the CI.
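The CI calculation for the antibody example can be checked the same way. This sketch uses the slide's values (log-scale mean 4.633, log-scale SD 1.09, N = 394) and the approximate multiplier of 2; the variable names are illustrative.

```python
import math

log_mean = 4.633  # mean of natural-log values, from the slide
log_sd = 1.09     # SD of natural-log values, from the slide
n = 394           # sample size, from the slide

# Half-width of the CI on the log scale: 2 * SEM of the logged values.
half_width = 2 * log_sd / math.sqrt(n)

gm = math.exp(log_mean)
ci_low = math.exp(log_mean - half_width)
ci_up = math.exp(log_mean + half_width)
print(f"GM = {gm:.1f}, 95% CI = ({ci_low:.1f}, {ci_up:.1f})")
```

Because the interval is built on the log scale and then back-transformed, it refers to the geometric mean, not the arithmetic mean, of the population.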

    Summary Statistics: Two Variables (Correlation)

    • Always look at scatterplot.

    • Correlation, r, ranges from -1 (perfect inverse relation) to +1 (perfect direct relation). Zero = no linear relation.

    • Specific to the ranges of the two variables.

    • Typically, cannot extrapolate to populations with other ranges.

    • Measures association, not causation.

    • We will examine details in Session 5.

    Correlation Depends on Range of Data



    Graph B contains only the points from graph A that are in the ellipse.

    Correlation is reduced in graph B.

    Thus: correlation between two quantities may be quite different in different study populations.
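The effect of range restriction can be demonstrated by simulation. The sketch below uses made-up data (a true correlation of about 0.8 is an assumption) and, as a simplification of the ellipse on the slide, restricts only the x-range.

```python
import math
import random

random.seed(2)  # reproducible simulated data, not the data in the graphs


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# "Graph A": full population, constructed so that r is about 0.8.
xs = [random.gauss(0.0, 1.0) for _ in range(5000)]
ys = [0.8 * x + 0.6 * random.gauss(0.0, 1.0) for x in xs]

# "Graph B": only points with x in a narrow central range.
subset = [(x, y) for x, y in zip(xs, ys) if abs(x) < 0.5]
sub_xs = [x for x, _ in subset]
sub_ys = [y for _, y in subset]

r_full = pearson_r(xs, ys)
r_restricted = pearson_r(sub_xs, sub_ys)
print("r, full range:      ", round(r_full, 2))
print("r, restricted range:", round(r_restricted, 2))
```

The underlying relation between x and y is identical in both sets of points; only the range of x differs, yet r is clearly smaller in the restricted set.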

    Correlation and Measurement Precision






    [Figure: scatter plot; r = 0 within the marked range 5 < x < 6]


    A lack of correlation for the subpopulation with 5<x<6 may be due to inability to measure x and y well.

    Lack of evidence of association is not evidence of lack of association.

    Summary Statistics: Survival Probability

    Example: 100 subjects start a study. Nine subjects drop out at 2 years and 7 drop out at 4 years; 20, 20, and 17 die in the intervals 0-2, 2-4, and 4-5 years.

    Then, the 0-2 yr interval has 80/100 surviving.

    The 2-4 interval has 51/71 surviving; 4-5 has 27/44 surviving.

    So, 5-yr survival prob is (80/100)(51/71)(27/44) = 0.35.

    Actually uses finer subdivisions than 0-2, 2-4, 4-5 years, with exact death times.

    Don’t know vital status of 16 subjects at 5 years.
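The interval-by-interval calculation above can be written as a short loop. This is a sketch of the slide's simplified three-interval version, not a full Kaplan-Meier estimator with exact death times; the function name and argument layout are assumptions.

```python
def interval_survival(n_start, deaths, dropouts_after):
    """Multiply interval survival fractions; dropouts leave at interval ends.

    deaths[i] is the number dying during interval i;
    dropouts_after[i] is the number dropping out at the end of interval i.
    Returns (estimated survival probability, number still followed at the end).
    """
    at_risk = n_start
    survival = 1.0
    for died, dropped in zip(deaths, dropouts_after):
        survival *= (at_risk - died) / at_risk  # e.g. 80/100, 51/71, 27/44
        at_risk = at_risk - died - dropped
    return survival, at_risk


# Numbers from the slide: intervals 0-2, 2-4, 4-5 years.
prob_5yr, known_alive = interval_survival(
    n_start=100, deaths=[20, 20, 17], dropouts_after=[9, 7, 0]
)
print("5-year survival probability:", round(prob_5yr, 2))
print("Subjects known alive at 5 years:", known_alive)
```

Only 27 of the 100 subjects are known to be alive at 5 years, yet the estimated survival probability is 0.35, because the 16 dropouts contribute information up to the time they leave.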

    Summary Statistics: Relative Likelihood of an Event

    Compare groups A and B on mortality.

    Relative Risk = Prob_A[Death] / Prob_B[Death]

    where Prob[Death] ≈ Deaths per 100 Persons

    Odds Ratio = Odds_A[Death] / Odds_B[Death]

    where Odds = Prob[Death] / Prob[Survival]

    Hazard Ratio ≈ I_A[Death] / I_B[Death]

    where I = Incidence = Deaths per 100 Person-Days
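The three ratios can be compared side by side on one set of counts. The counts and person-days below are hypothetical, chosen only to make the arithmetic easy to follow; the incidence ratio is used as the approximation to the hazard ratio given above.

```python
def relative_risk(deaths_a, n_a, deaths_b, n_b):
    """Ratio of death probabilities, group A versus group B."""
    return (deaths_a / n_a) / (deaths_b / n_b)


def odds_ratio(deaths_a, n_a, deaths_b, n_b):
    """Ratio of death odds, where odds = deaths / survivors."""
    odds_a = deaths_a / (n_a - deaths_a)
    odds_b = deaths_b / (n_b - deaths_b)
    return odds_a / odds_b


def incidence_ratio(deaths_a, person_days_a, deaths_b, person_days_b):
    """Ratio of deaths per person-day, approximating the hazard ratio."""
    return (deaths_a / person_days_a) / (deaths_b / person_days_b)


# Hypothetical data: 20 of 100 die in group A, 10 of 100 die in group B.
rr = relative_risk(20, 100, 10, 100)
orr = odds_ratio(20, 100, 10, 100)
# Hypothetical follow-up: 30,000 person-days in A, 33,000 in B.
ir = incidence_ratio(20, 30_000, 10, 33_000)

print("Relative risk:  ", round(rr, 2))
print("Odds ratio:     ", round(orr, 2))
print("Incidence ratio:", round(ir, 2))
```

With these counts the relative risk is 2.0 and the odds ratio is 2.25; the two measures move closer together as the event becomes rarer.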
