Lecture 8

Lecture 8 • Resistance of two-sample t-tools and outliers (Chapters 3.3-3.4) • Transformations of the Data (Chapter 3.5)

Outliers and resistance • Outliers are observations relatively far from their estimated means. • Outliers may arise either • (a) because the population distribution is long-tailed, or • (b) because they do not belong to the population of interest (they come from a contaminating population). • A statistical procedure is resistant if one or a few outliers cannot have an undue influence on the result.

Resistance • Illustration for understanding resistance: the sample mean is not resistant; the sample median is. • Sample: 9, 3, 5, 8, 100 • Mean with outlier: 25, without: 6.25 • Median with outlier: 8, without: 6.5 • t-tools are not resistant to outliers because they are based on sample means.
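The arithmetic on this slide can be checked with a short sketch using Python's standard `statistics` module. The variable names are illustrative only; the sample is the one from the slide, with 100 treated as the outlier.

```python
from statistics import mean, median

sample = [9, 3, 5, 8, 100]                      # sample from the slide
without_outlier = [x for x in sample if x != 100]

# The mean moves a long way when the outlier is dropped (25 -> 25/4 = 6.25).
mean_with = mean(sample)
mean_without = mean(without_outlier)

# The median barely moves (8 -> 6.5), which is what "resistant" means here.
median_with = median(sample)
median_without = median(without_outlier)

print("mean:  ", mean_with, "->", mean_without)
print("median:", median_with, "->", median_without)
```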

Strategy for Dealing With Outliers • Follow Display 3.6 • Important aspect of strategy: An outlier does not get swept under the rug simply because it is different from the other observations. To warrant its removal, an explanation for why it is different must be established.

Excluding Observations from Analysis in JMP for Investigating Outliers • Click on row you want to exclude. • Click on rows menu and then click exclude/unexclude. A red circle with a line through it will appear next to the excluded observation. • Multiple observations can be excluded. • To include an observation that was excluded back into the analysis, click on excluded row, click on rows menu and then click exclude/unexclude. The red circle next to observation should disappear.

Conceptual Question #6 • (a) What course of action would you propose for the statistical analysis if it was learned that Vietnam veteran #646 (the largest observation in Display 3.6) worked for several years, after Vietnam, handling herbicides with dioxin? • (b) What would you propose if this was learned instead for Vietnam veteran #645 (second largest observation)?

Rules of thumb for validity of t-tools • Assumptions and rules of thumb for validity of t-tools in the face of violations • Normality: Look for gross skewness. Okay if both sample sizes greater than 30. • Equal spread: Validity okay if ratio of larger sample standard deviation to smaller sample standard deviation is less than 2 and ratio of larger group size to smaller group size is less than 2. Consider transformations. • Outliers: Look for outliers in box plots, especially very extreme points (more than 3 box-lengths away from box). Apply the examination strategy in Display 3.6. • Independence: If indep. not appropriate, apply matched pairs if appropriate or other tools later in course.
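The equal-spread rule of thumb above can be sketched as a small check. This is a minimal illustration, not part of the lecture: the function name `equal_spread_ok` and the three data lists are made up for the example, and the check assumes both groups have a nonzero standard deviation.

```python
from statistics import stdev

def equal_spread_ok(group1, group2):
    """Return (ok, sd_ratio, n_ratio) for the rule of thumb on the slide:
    larger SD / smaller SD < 2 and larger n / smaller n < 2."""
    s1, s2 = stdev(group1), stdev(group2)
    n1, n2 = len(group1), len(group2)
    sd_ratio = max(s1, s2) / min(s1, s2)   # assumes min(s1, s2) > 0
    n_ratio = max(n1, n2) / min(n1, n2)
    return (sd_ratio < 2 and n_ratio < 2), sd_ratio, n_ratio

# Made-up data for illustration only.
a = [1, 2, 3, 4, 5]
b = [2, 3.5, 5, 6.5, 8]       # SD is 1.5 times that of a
c = [10, 13, 16, 19, 22]      # SD is 3 times that of a

ok_ab, sd_ratio_ab, n_ratio_ab = equal_spread_ok(a, b)
ok_ac, sd_ratio_ac, n_ratio_ac = equal_spread_ok(a, c)
print(ok_ab, ok_ac)
```

When the check fails, the slide's advice is to consider a transformation rather than to proceed with the t-tools on the original scale.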

Case Study 3.1.1: Cloud Seeding • A randomized experiment was conducted to test the hypothesis that massive injection of silver iodide into cumulus clouds can lead to increased rainfall. • On each of 52 days that were deemed suitable for cloud seeding, a random mechanism was used to decide whether to seed the target cloud on that day or leave it unseeded as a control. • The airplane flew through the cloud in both cases, and the experimenters were blind to whether seeding was used – double blind trial. • Question of interest: Did cloud seeding cause higher rainfall in this experiment?

The log transformation • Let log denote the logarithm to the base e (ln): log(x)=c means e^c=x. • log(2.718)≈1, log(2.718^2)≈2, etc. • Procedure: • Transform to get two new columns: Z1=log(Y1), Z2=log(Y2) • Graphically examine to see if the t-tools are appropriate for Z1 and Z2 • If appropriate, use t-tools on Z1 and Z2 • Interpret results on original scale

Cloud seeding data after log transformation

Interpretation – Causal Inference • If the randomized experiment model with additive treatment effect is thought to hold for the log-transformed data, then an experimental unit that would respond to treatment 1 with a logged outcome of log(Y) would respond to treatment 2 with a logged outcome of log(Y)+δ, where δ is the additive treatment effect on the log scale • i.e., the experimental unit responds to treatment 1 with an outcome of Y and to treatment 2 with an outcome of Y·exp(δ) • Multiplicative treatment effect model: The effect of treatment 2 is to multiply the treatment 1 outcome by exp(δ)

Inference for multiplicative treatment effects • To test whether there is any treatment effect, perform the usual t-test of δ=0 with the log-transformed data, where δ is the additive treatment effect on the log scale • To describe the treatment effect, “back-transform” (exponentiate) the estimate of δ and the endpoints of the confidence interval for δ from the log-transformed data; the results estimate the multiplicative effect exp(δ).
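The back-transformation step can be sketched in a few lines. This is a minimal illustration under stated assumptions: `back_transform` is a hypothetical helper name, and the numbers passed to it are invented for the example, not the cloud seeding results.

```python
import math

def back_transform(log_estimate, log_ci):
    """Turn an additive effect estimated on the log scale into a
    multiplicative effect on the original scale by exponentiating the
    estimate and both confidence interval endpoints."""
    lower, upper = log_ci
    return math.exp(log_estimate), (math.exp(lower), math.exp(upper))

# Hypothetical log-scale results: estimate log(2), interval (0, log(4)).
effect, (ci_low, ci_high) = back_transform(math.log(2), (0.0, math.log(4)))
print(f"treatment 2 multiplies the outcome by about {effect:.2f} "
      f"(interval {ci_low:.2f} to {ci_high:.2f})")
```

Because exp is increasing, the lower endpoint on the log scale stays the lower endpoint on the original scale; an interval that contains 0 on the log scale contains 1 after back-transformation.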

Log Transformation for Population Inference • Consider comparing the means of two populations. If the populations appear skewed, with the population that has the larger center also having the larger spread, using the t-tools to analyze the log-transformed data might be more appropriate. • Using the t-tools on the log-transformed data is appropriate (i.e., produces approximately valid results) if Z1=log(Y1) and Z2=log(Y2) are approximately normally distributed.

Inference for Population Medians • If distributions of Z1=log(Y1) and Z2=log(Y2) appear approximately normal with equal SD, then we can make inferences about the ratio of population medians for Y1 and Y2 as follows: • To test if population medians are the same, test the null hypothesis that the means of Z1 and Z2 are the same • An estimate of the ratio of the population 2 median to the population 1 median is exp(Z̄2 − Z̄1), where Z̄1 and Z̄2 are the sample averages of Z1 and Z2. • To form a confidence interval for the ratio of population medians, form a confidence interval for the difference in the means of Z1 and Z2 (mean of Z2 minus mean of Z1), (L, U). A confidence interval for the ratio of the population 2 median to the population 1 median is (exp(L), exp(U)).
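The steps on this slide can be sketched end to end with the standard library. This is an illustration under assumptions, not the lecture's own computation: `median_ratio_ci` is a hypothetical name, the interval uses the pooled-SD two-sample t formula, the t critical value is supplied by the caller (for example from a t table) rather than computed, and the data are made up.

```python
import math
from statistics import mean, variance

def median_ratio_ci(y1, y2, t_crit):
    """Estimate and CI for (population 2 median) / (population 1 median).

    t_crit: t quantile for n1 + n2 - 2 degrees of freedom, supplied by the
    caller; it is an input here, not something this sketch computes.
    """
    z1 = [math.log(y) for y in y1]          # requires positive data
    z2 = [math.log(y) for y in y2]
    n1, n2 = len(z1), len(z2)
    diff = mean(z2) - mean(z1)              # difference of log-scale averages
    pooled_var = ((n1 - 1) * variance(z1) + (n2 - 1) * variance(z2)) / (n1 + n2 - 2)
    se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    lower, upper = diff - t_crit * se, diff + t_crit * se
    # Back-transform the estimate and both endpoints to the original scale.
    return math.exp(diff), (math.exp(lower), math.exp(upper))

# Made-up data: every value in y2 is 3 times the matching value in y1.
y1 = [1, 2, 4, 8]
y2 = [3, 6, 12, 24]
ratio, (ratio_low, ratio_high) = median_ratio_ci(y1, y2, t_crit=2.447)  # 6 df, 95%
print(ratio, ratio_low, ratio_high)
```

With these made-up values the estimated ratio is 3 by construction, which makes the sketch easy to sanity-check by hand.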

When to use log transformation • What indicates that log might work? • Distributions are skewed • Spread is greater in the distribution with larger center • The data values differ by orders of magnitude, e.g., as a rough guide, the ratio of the largest to the smallest is >10 (or perhaps >4) • Multiplicative statement is desirable
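The "orders of magnitude" guide above is easy to check mechanically. The sketch below is illustrative only: `spans_orders_of_magnitude` is a hypothetical name, the data are made up, and the threshold defaults to the slide's rough value of 10. It covers only this one indicator; skewness and spread still need to be examined graphically.

```python
def spans_orders_of_magnitude(values, threshold=10):
    """Rough guide from the slide: is largest / smallest above the threshold?"""
    smallest, largest = min(values), max(values)
    if smallest <= 0:
        # The log is only defined for strictly positive values.
        raise ValueError("log transformation needs strictly positive data")
    return largest / smallest > threshold

# Made-up data for illustration only.
wide = [2, 15, 40, 900]      # ratio 450
narrow = [5, 6, 8, 9]        # ratio 1.8

wide_flag = spans_orders_of_magnitude(wide)
narrow_flag = spans_orders_of_magnitude(narrow)
narrow_flag_4 = spans_orders_of_magnitude(narrow, threshold=4)
print(wide_flag, narrow_flag, narrow_flag_4)
```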

Other transformations • Square root transformation - applies to data that are counts and to measurements of area • Reciprocal transformation - applies to data that are waiting times (e.g., time to failure of lightbulbs); the reciprocal of a time measurement can often be interpreted directly as a rate or a speed • Goals of transformation: Establish a scale on which two groups have roughly the same spread. • Inferences from log transformation are directly interpretable when converted back to original scale of measurement. Other transformations are not so easily interpretable, e.g., the square of the difference between the means of sqrt(Y1) and sqrt(Y2) is not so easily interpretable.
