1 / 28

# Introduction to resampling in MATLAB - PowerPoint PPT Presentation

Feb 5 2009 Austin Hilliard. Introduction to resampling in MATLAB. So you've done an experiment. Two independent datasets: control experimental Have n numbers in each dataset, representing independent measurements Question:

I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.

## PowerPoint Slideshow about 'Introduction to resampling in MATLAB' - emil

An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript

Austin Hilliard

Introduction to resampling in MATLAB

• Two independent datasets:

• control

• experimental

• Have n numbers in each dataset, representing independent measurements

• Question:

Did the experimental manipulation have an effect, i.e. did it make a difference?

• Is there a significant (i.e. “real”) numerical difference between the control and experimental datasets?

• What is the probability that the two datasets are subsets of two different/independent distributions?

• What is the probability that the two datasets are NOT subsets of the same underlying distribution?

• What do we need?

• A number(s) to represent/summarize each dataset

• A test for comparing these numbers

→ test statistic

• A way to assess the significance of the test statistic

→ distribution of test statistics generated from datasets that ARE from the same underlying distribution

• Know your data!

• Visualize it

• Does it have a central tendency? (i.e. is histogram vaguely mound shaped?)

• - If yes, use the mean as descriptor

• - If no, median may be better

• Typical test: compute difference (subtraction) between descriptors

Simulate and inspect data statistic soon, that's the resampling part...

ctrl = 10.*(5.75+randn(25,1));

exp = 10.*(5+randn(25,1));

boxplot([ctrl exp])

groups = {'control''experimental'}

boxplot([ctrl exp],'labels',groups)

Central tendency? statistic soon, that's the resampling part...

hist(ctrl) hist(exp)

title('Control group') title('Experimental group')

Compute descriptors and test stat statistic soon, that's the resampling part...

• Our data looks pretty mound shaped, so let's use the mean as data descriptor

• Take the absolute value of the difference as our test stat

ctrl_d = mean(ctrl);

exp_d = mean(exp);

test_stat = abs(ctrl_d - exp_d);

Assessing the test statistic statistic soon, that's the resampling part...

• What we would really like to test:

• The probability that the two datasets are subsets of two different/independent distributions

• Problem:

• We don't know what those hypothetical independent distributions are, if we did we wouldn't have to go through all of this!

Assessing the test statistic statistic soon, that's the resampling part...

• Solution:

• Compute the probability that the two datasets are NOT subsets of the same underlying distribution

• How to do this?

• Start by assuming datasets are subsets of same distribution → null hypothesis

• See what test statistic looks like when null hypothesis is true

Assessing the test statistic statistic soon, that's the resampling part...

• Need to generate a distribution of test statistic generated when datasets really ARE from the same underlying distribution

→ distribution of test stat under null hypothesis

• Then we can quantify how (a)typical the value of our test stat is under null hypothesis

• if typical, our datasets are likely from same distribution → no effect of experiment

• if atypical, there is a good chance datasets are from different distributions → experiment had an effect

Generate the null distribution using bootstrap statistic soon, that's the resampling part...

• Basic procedure:

• Combine datasets

• Randomly sample (w/ replacement) from combined dataset to create two pseudo datasets w/ same n as real datasets

• Compute descriptors and test statistic for pseudo datasets

• Repeat 10,000 times, keeping track of pseudo test stat

Combine datasets and resample statistic soon, that's the resampling part...

box = [ctrl; exp]; % combine datasets into 'box'

% could use concat()

pseudo_stat_dist = zeros(1,10000); % create vector 'pseudo_stat_dist'

% to store results of resampling

% --- start resampling ---

for trials = 1:10000 % resample 10000 times

pseudo_ctrl = ... % create pseudo groups to be

sample(length(ctrl), box); % same size as actual groups

pseudo_exp = ...

sample(length(exp), box);

pseudo_ctrl_d = ... % compute pseudo group

mean(pseudo_ctrl); % descriptors

pseudo_exp_d = ...

mean(pseudo_exp);

pseudo_stat = ... % compute pseudo test stat

abs(pseudo_ctrl_d - pseudo_exp_d);

pseudo_stat_dist(trials) = ... % store pseudo test stat to

pseudo_stat; % build null distribution

end

Now what? statistic soon, that's the resampling part...

• Compute how many values of pseudo test stat are greater than our actual test stat, then divide by the total number of data points in the null distribution

• This approximates the likelihood of getting our actual test stat (i.e. the likelihood of seeing a difference as big as we see) if our two datasets were truly from the same underlying distribution

→ p-value

Compute p-val and visualize null distribution statistic soon, that's the resampling part...

bigger = count(pseudo_stat_dist > test_stat); % count pseudo test stats

% bigger than actual stat

pval = bigger / length(pseudo_stat_dist) % divide by total number

% of pseudo test stats to

% compute p-value

% could use proportion()

hist(pseudo_stat_dist) % show histogram of pseudo

title('Pseudo test stat distribution') % test stats

xlabel('Pseudo test stat')

ylabel('Frequency')

How to interpret the p-value? statistic soon, that's the resampling part...

• Assuming that the null hypothesis is true, the p-value is the likelihood that we would see a value of the test statistic that is greater than the value of our actual test statistic

Restated →

• Assuming both groups really are from the same underlying distribution, the p-value is the likelihood that we would see a difference between them as big as we do

“Paired” data statistic soon, that's the resampling part...

• What if your two datasets are not actually independent?

• Maybe you have a single group of subjects/birds/cells/etc. with measurements taken under different conditions

• Important: the measurements within each condition must still be independent

Bootstrap to test paired data statistic soon, that's the resampling part...

• Basic strategy is the same, but...

• Instead of one descriptor/group, have as many descriptors as data pairs

• compute difference between measurement under condition1 and measurement under condition2

• Test statistic is mean/median/etc. of the differences

• Slightly different resampling procedure

Visualize data statistic soon, that's the resampling part...

• Investigate distributions for raw data, also should plot distribution of the differences between paired data-points

• Why?

→ Test statistic is computed from these differences, not the raw datasets

Resampling procedure statistic soon, that's the resampling part...

• Randomly sample 1 or -1 and store in vector

• Multiply by vector of differences

• randomizes direction of change

• Compute test stat (mean, median, etc.) for randomized differences

• Repeat

• Now we have distribution of test statistic if direction of change is random, i.e. NOT dependent on experimental conditions

Compute descriptors and test stat statistic soon, that's the resampling part...

diffs = ctrl - exp; % compute change from ctrl condition

% to exp condition

test_stat = abs(mean(diffs)); % compute test statistic

pseudo_stat_dist = zeros(1,10000); % create vector to store pseudo test

% stats generated from resampling

Resampling statistic soon, that's the resampling part...

for trials = 1:10000 % resample 10,000 times

signs = sample(length(diffs), [-1 1])'; % randomly create vector of 1's

% and -1's same size as diffs

pseudo_diffs = signs .* diffs; % multiply by actual diffs to

% randomize direction of diff

pseudo_stat = abs(mean(pseudo_diffs)); % compute pseudo test stat from

% pseudo diffs

pseudo_stat_dist(trials) = pseudo_stat; % store pseudo test stat to

% grow null distribution

end

Compute p-val and visualize null distribution statistic soon, that's the resampling part...

bigger = count(pseudo_stat_dist > test_stat); % count pseudo test stats

% bigger than actual stat

pval = bigger / length(pseudo_stat_dist) % divide by total number

% of pseudo test stats to

% compute p-value

% could use proportion()

hist(pseudo_stat_dist) % show histogram of pseudo

title('Pseudo test stat distribution') % test stats

xlabel('Pseudo test stat')

ylabel('Frequency')

function statistic soon, that's the resampling part... [] = bootstrap2groupsMEAN_PAIRED (ctrl, exp)

% --- prepare for resampling ---

diffs = ctrl - exp; % compute change from ctrl condition

% to exp condition

test_stat = abs(mean(diffs)) % compute test statistic

pseudo_stat_dist = zeros(1,10000); % create vector to store pseudo test

% stats generated from resampling

% --- resampling ---

for trials = 1:10000 % resample 10,000 times

signs = sample(length(diffs), [-1 1])'; % randomly create vector of 1's

% and -1's same size as diffs

pseudo_diffs = signs .* diffs; % multiply by actual diffs to

% randomize direction of diff

pseudo_stat = abs(mean(pseudo_diffs)); % compute pseudo test stat from

% pseudo diffs

pseudo_stat_dist(trials) = pseudo_stat; % store pseudo test stat to

% grow null distribution

end

% --- end resampling ---

bigger = count(pseudo_stat_dist > test_stat); % count pseudo test stats

% bigger than actual stat

pval = bigger / length(pseudo_stat_dist) % divide by total number

% of pseudo test stats to

% compute p-value

% could use proportion()

hist(pseudo_stat_dist) % show histogram of pseudo

title('Pseudo test stat distribution') % test stats

xlabel('Pseudo test stat')

ylabel('Frequency')

function statistic soon, that's the resampling part... [] = bootstrap2groupsMEANS (ctrl, exp)

%

% function [] = bootstrap2groupsMEANS (ctrl, exp)

%

% 'ctrl' and 'exp' are assumed to be column vectors, this is

% necessary for creation of 'box'

%

% --- prepare for resampling ---

ctrl_d = mean(ctrl); % compute descriptors

exp_d = mean(exp); % 'ctrl_d' and 'exp_d'

test_stat = abs(ctrl_d - exp_d) % compute test statistic 'test_stat'

box = [ctrl; exp]; % combine datasets into 'box'

% could use concat()

pseudo_stat_dist = zeros(1,10000); % create vector 'pseudo_stat_dist'

% to store results of resampling

% --- start resampling ---

for trials = 1:10000 % resample 10,000 times

pseudo_ctrl = ... % create pseudo groups to be

sample(length(ctrl), box); % same size as actual groups

pseudo_exp = ...

sample(length(exp), box);

pseudo_ctrl_d = ... % compute pseudo group

mean(pseudo_ctrl); % descriptors

pseudo_exp_d = ...

mean(pseudo_exp);

pseudo_stat = ... % compute pseudo test stat

abs(pseudo_ctrl_d - pseudo_exp_d);

pseudo_stat_dist(trials) = ... % store pseudo test stat to

pseudo_stat; % build null distribution

end

% --- end resampling ---

bigger = count(pseudo_stat_dist > test_stat); % count pseudo test stats

% bigger than actual stat

pval = bigger / length(pseudo_stat_dist) % divide by total number

% of pseudo test stats to

% compute p-value

% could use proportion()

hist(pseudo_stat_dist) % show histogram of pseudo

title('Pseudo test stat distribution') % test stats

xlabel('Pseudo test stat')

ylabel('Frequency')

• Don't forget to look statistic soon, that's the resampling part...online to get a more generalized version of 'bootstrap2groupsMEANS.m' called 'bootstrap2groups.m'

• It's more flexible (can define method for descriptor and iterations as input), stores output in a struct, plots raw data, has one-tailed tests if needed, and can be used for demo (try running 'out = bootstrap2groups(@mean)'

Possible topics for next week statistic soon, that's the resampling part...

• confidence intervals

• normality testing

• ANOVA (1 or 2-way)

• power