- 53 Views
- Uploaded on
- Presentation posted in: General

Introduction to resampling in MATLAB

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Feb 5 2009

Austin Hilliard

- Two independent datasets:
- control
- experimental

- Have n numbers in each dataset, representing independent measurements
- Question:
Did the experimental manipulation have an effect, i.e. did it make a difference?

- Is there a significant (i.e. “real”) numerical difference between the control and experimental datasets?
- What is the probability that the two datasets are subsets of two different/independent distributions?
- What is the probability that the two datasets are NOT subsets of the same underlying distribution?

- What do we need?
- A number(s) to represent/summarize each dataset
→ datadescriptors

- A test for comparing these numbers
→ test statistic

- A way to assess the significance of the test statistic
→ distribution of test statistics generated from datasets that ARE from the same underlying distribution

- A number(s) to represent/summarize each dataset

- Know your data!
- Visualize it
- Does it have a central tendency? (i.e. is histogram vaguely mound shaped?)
- - If yes, use the mean as descriptor
- - If no, median may be better

- Typical test: compute difference (subtraction) between descriptors

- We'll discuss how to assess the significance of your test statistic soon, that's the resampling part...
- But first let's visualize some data

ctrl = 10.*(5.75+randn(25,1));

exp = 10.*(5+randn(25,1));

boxplot([ctrl exp])

groups = {'control''experimental'}

boxplot([ctrl exp],'labels',groups)

hist(ctrl) hist(exp)

title('Control group') title('Experimental group')

- Our data looks pretty mound shaped, so let's use the mean as data descriptor
- Take the absolute value of the difference as our test stat

ctrl_d = mean(ctrl);

exp_d = mean(exp);

test_stat = abs(ctrl_d - exp_d);

- What we would really like to test:
- The probability that the two datasets are subsets of two different/independent distributions

- Problem:
- We don't know what those hypothetical independent distributions are, if we did we wouldn't have to go through all of this!

- Solution:
- Compute the probability that the two datasets are NOT subsets of the same underlying distribution

- How to do this?
- Start by assuming datasets are subsets of same distribution → null hypothesis
- See what test statistic looks like when null hypothesis is true

- Need to generate a distribution of test statistic generated when datasets really ARE from the same underlying distribution
→ distribution of test stat under null hypothesis

- Then we can quantify how (a)typical the value of our test stat is under null hypothesis
- if typical, our datasets are likely from same distribution → no effect of experiment
- if atypical, there is a good chance datasets are from different distributions → experiment had an effect

- Basic procedure:
- Combine datasets
- Randomly sample (w/ replacement) from combined dataset to create two pseudo datasets w/ same n as real datasets
- Compute descriptors and test statistic for pseudo datasets
- Repeat 10,000 times, keeping track of pseudo test stat

box = [ctrl; exp]; % combine datasets into 'box'

% could use concat()

pseudo_stat_dist = zeros(1,10000); % create vector 'pseudo_stat_dist'

% to store results of resampling

% --- start resampling ---

for trials = 1:10000 % resample 10000 times

pseudo_ctrl = ... % create pseudo groups to be

sample(length(ctrl), box); % same size as actual groups

pseudo_exp = ...

sample(length(exp), box);

pseudo_ctrl_d = ... % compute pseudo group

mean(pseudo_ctrl); % descriptors

pseudo_exp_d = ...

mean(pseudo_exp);

pseudo_stat = ... % compute pseudo test stat

abs(pseudo_ctrl_d - pseudo_exp_d);

pseudo_stat_dist(trials) = ... % store pseudo test stat to

pseudo_stat; % build null distribution

end

- Compute how many values of pseudo test stat are greater than our actual test stat, then divide by the total number of data points in the null distribution
- This approximates the likelihood of getting our actual test stat (i.e. the likelihood of seeing a difference as big as we see) if our two datasets were truly from the same underlying distribution
→ p-value

bigger = count(pseudo_stat_dist > test_stat); % count pseudo test stats

% bigger than actual stat

pval = bigger / length(pseudo_stat_dist) % divide by total number

% of pseudo test stats to

% compute p-value

% could use proportion()

hist(pseudo_stat_dist) % show histogram of pseudo

title('Pseudo test stat distribution') % test stats

xlabel('Pseudo test stat')

ylabel('Frequency')

- Assuming that the null hypothesis is true, the p-value is the likelihood that we would see a value of the test statistic that is greater than the value of our actual test statistic
Restated →

- Assuming both groups really are from the same underlying distribution, the p-value is the likelihood that we would see a difference between them as big as we do

- What if your two datasets are not actually independent?
- Maybe you have a single group of subjects/birds/cells/etc. with measurements taken under different conditions
- Important: the measurements within each condition must still be independent

- Basic strategy is the same, but...
- Instead of one descriptor/group, have as many descriptors as data pairs
- compute difference between measurement under condition1 and measurement under condition2

- Test statistic is mean/median/etc. of the differences
- Slightly different resampling procedure

- Investigate distributions for raw data, also should plot distribution of the differences between paired data-points
- Why?
→ Test statistic is computed from these differences, not the raw datasets

- Randomly sample 1 or -1 and store in vector
- Multiply by vector of differences
- randomizes direction of change

- Compute test stat (mean, median, etc.) for randomized differences
- Repeat
- Now we have distribution of test statistic if direction of change is random, i.e. NOT dependent on experimental conditions

diffs = ctrl - exp; % compute change from ctrl condition

% to exp condition

test_stat = abs(mean(diffs)); % compute test statistic

pseudo_stat_dist = zeros(1,10000); % create vector to store pseudo test

% stats generated from resampling

for trials = 1:10000 % resample 10,000 times

signs = sample(length(diffs), [-1 1])'; % randomly create vector of 1's

% and -1's same size as diffs

pseudo_diffs = signs .* diffs; % multiply by actual diffs to

% randomize direction of diff

pseudo_stat = abs(mean(pseudo_diffs)); % compute pseudo test stat from

% pseudo diffs

pseudo_stat_dist(trials) = pseudo_stat; % store pseudo test stat to

% grow null distribution

end

bigger = count(pseudo_stat_dist > test_stat); % count pseudo test stats

% bigger than actual stat

pval = bigger / length(pseudo_stat_dist) % divide by total number

% of pseudo test stats to

% compute p-value

% could use proportion()

hist(pseudo_stat_dist) % show histogram of pseudo

title('Pseudo test stat distribution') % test stats

xlabel('Pseudo test stat')

ylabel('Frequency')

function [] = bootstrap2groupsMEAN_PAIRED (ctrl, exp)

% --- prepare for resampling ---

diffs = ctrl - exp; % compute change from ctrl condition

% to exp condition

test_stat = abs(mean(diffs)) % compute test statistic

pseudo_stat_dist = zeros(1,10000); % create vector to store pseudo test

% stats generated from resampling

% --- resampling ---

for trials = 1:10000 % resample 10,000 times

signs = sample(length(diffs), [-1 1])'; % randomly create vector of 1's

% and -1's same size as diffs

pseudo_diffs = signs .* diffs; % multiply by actual diffs to

% randomize direction of diff

pseudo_stat = abs(mean(pseudo_diffs)); % compute pseudo test stat from

% pseudo diffs

pseudo_stat_dist(trials) = pseudo_stat; % store pseudo test stat to

% grow null distribution

end

% --- end resampling ---

bigger = count(pseudo_stat_dist > test_stat); % count pseudo test stats

% bigger than actual stat

pval = bigger / length(pseudo_stat_dist) % divide by total number

% of pseudo test stats to

% compute p-value

% could use proportion()

hist(pseudo_stat_dist) % show histogram of pseudo

title('Pseudo test stat distribution') % test stats

xlabel('Pseudo test stat')

ylabel('Frequency')

function [] = bootstrap2groupsMEANS (ctrl, exp)

%

% function [] = bootstrap2groupsMEANS (ctrl, exp)

%

% 'ctrl' and 'exp' are assumed to be column vectors, this is

% necessary for creation of 'box'

%

% --- prepare for resampling ---

ctrl_d = mean(ctrl); % compute descriptors

exp_d = mean(exp); % 'ctrl_d' and 'exp_d'

test_stat = abs(ctrl_d - exp_d) % compute test statistic 'test_stat'

box = [ctrl; exp]; % combine datasets into 'box'

% could use concat()

pseudo_stat_dist = zeros(1,10000); % create vector 'pseudo_stat_dist'

% to store results of resampling

% --- start resampling ---

for trials = 1:10000 % resample 10,000 times

pseudo_ctrl = ... % create pseudo groups to be

sample(length(ctrl), box); % same size as actual groups

pseudo_exp = ...

sample(length(exp), box);

pseudo_ctrl_d = ... % compute pseudo group

mean(pseudo_ctrl); % descriptors

pseudo_exp_d = ...

mean(pseudo_exp);

pseudo_stat = ... % compute pseudo test stat

abs(pseudo_ctrl_d - pseudo_exp_d);

pseudo_stat_dist(trials) = ... % store pseudo test stat to

pseudo_stat; % build null distribution

end

% --- end resampling ---

bigger = count(pseudo_stat_dist > test_stat); % count pseudo test stats

% bigger than actual stat

pval = bigger / length(pseudo_stat_dist) % divide by total number

% of pseudo test stats to

% compute p-value

% could use proportion()

hist(pseudo_stat_dist) % show histogram of pseudo

title('Pseudo test stat distribution') % test stats

xlabel('Pseudo test stat')

ylabel('Frequency')

- Don't forget to look online to get a more generalized version of 'bootstrap2groupsMEANS.m' called 'bootstrap2groups.m'
- It's more flexible (can define method for descriptor and iterations as input), stores output in a struct, plots raw data, has one-tailed tests if needed, and can be used for demo (try running 'out = bootstrap2groups(@mean)'

- confidence intervals
- normality testing
- ANOVA (1 or 2-way)
- power