# Introduction to resampling in MATLAB - PowerPoint PPT Presentation

1 / 28

Feb 5 2009 Austin Hilliard. Introduction to resampling in MATLAB. So you've done an experiment. Two independent datasets: control experimental Have n numbers in each dataset, representing independent measurements Question:

I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.

Introduction to resampling in MATLAB

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Feb 5 2009

Austin Hilliard

### So you've done an experiment...

• Two independent datasets:

• control

• experimental

• Have n numbers in each dataset, representing independent measurements

• Question:

Did the experimental manipulation have an effect, i.e. did it make a difference?

### The question restated

• Is there a significant (i.e. “real”) numerical difference between the control and experimental datasets?

• What is the probability that the two datasets are subsets of two different/independent distributions?

• What is the probability that the two datasets are NOT subsets of the same underlying distribution?

• What do we need?

• A number(s) to represent/summarize each dataset

• A test for comparing these numbers

→ test statistic

• A way to assess the significance of the test statistic

→ distribution of test statistics generated from datasets that ARE from the same underlying distribution

### Descriptors and tests

• Visualize it

• Does it have a central tendency? (i.e. is histogram vaguely mound shaped?)

• - If yes, use the mean as descriptor

• - If no, median may be better

• Typical test: compute difference (subtraction) between descriptors

• We'll discuss how to assess the significance of your test statistic soon, that's the resampling part...

• But first let's visualize some data

### Simulate and inspect data

ctrl = 10.*(5.75+randn(25,1));

exp = 10.*(5+randn(25,1));

boxplot([ctrl exp])

groups = {'control''experimental'}

boxplot([ctrl exp],'labels',groups)

### Central tendency?

hist(ctrl) hist(exp)

title('Control group') title('Experimental group')

### Compute descriptors and test stat

• Our data looks pretty mound shaped, so let's use the mean as data descriptor

• Take the absolute value of the difference as our test stat

ctrl_d = mean(ctrl);

exp_d = mean(exp);

test_stat = abs(ctrl_d - exp_d);

### Assessing the test statistic

• What we would really like to test:

• The probability that the two datasets are subsets of two different/independent distributions

• Problem:

• We don't know what those hypothetical independent distributions are, if we did we wouldn't have to go through all of this!

### Assessing the test statistic

• Solution:

• Compute the probability that the two datasets are NOT subsets of the same underlying distribution

• How to do this?

• Start by assuming datasets are subsets of same distribution → null hypothesis

• See what test statistic looks like when null hypothesis is true

### Assessing the test statistic

• Need to generate a distribution of test statistic generated when datasets really ARE from the same underlying distribution

→ distribution of test stat under null hypothesis

• Then we can quantify how (a)typical the value of our test stat is under null hypothesis

• if typical, our datasets are likely from same distribution → no effect of experiment

• if atypical, there is a good chance datasets are from different distributions → experiment had an effect

### Generate the null distribution using bootstrap

• Basic procedure:

• Combine datasets

• Randomly sample (w/ replacement) from combined dataset to create two pseudo datasets w/ same n as real datasets

• Compute descriptors and test statistic for pseudo datasets

• Repeat 10,000 times, keeping track of pseudo test stat

### Combine datasets and resample

box = [ctrl; exp]; % combine datasets into 'box'

% could use concat()

pseudo_stat_dist = zeros(1,10000); % create vector 'pseudo_stat_dist'

% to store results of resampling

% --- start resampling ---

for trials = 1:10000 % resample 10000 times

pseudo_ctrl = ... % create pseudo groups to be

sample(length(ctrl), box); % same size as actual groups

pseudo_exp = ...

sample(length(exp), box);

pseudo_ctrl_d = ... % compute pseudo group

mean(pseudo_ctrl); % descriptors

pseudo_exp_d = ...

mean(pseudo_exp);

pseudo_stat = ... % compute pseudo test stat

abs(pseudo_ctrl_d - pseudo_exp_d);

pseudo_stat_dist(trials) = ... % store pseudo test stat to

pseudo_stat; % build null distribution

end

### Now what?

• Compute how many values of pseudo test stat are greater than our actual test stat, then divide by the total number of data points in the null distribution

• This approximates the likelihood of getting our actual test stat (i.e. the likelihood of seeing a difference as big as we see) if our two datasets were truly from the same underlying distribution

→ p-value

### Compute p-val and visualize null distribution

bigger = count(pseudo_stat_dist > test_stat); % count pseudo test stats

% bigger than actual stat

pval = bigger / length(pseudo_stat_dist) % divide by total number

% of pseudo test stats to

% compute p-value

% could use proportion()

hist(pseudo_stat_dist) % show histogram of pseudo

title('Pseudo test stat distribution') % test stats

xlabel('Pseudo test stat')

ylabel('Frequency')

### How to interpret the p-value?

• Assuming that the null hypothesis is true, the p-value is the likelihood that we would see a value of the test statistic that is greater than the value of our actual test statistic

Restated →

• Assuming both groups really are from the same underlying distribution, the p-value is the likelihood that we would see a difference between them as big as we do

### “Paired” data

• What if your two datasets are not actually independent?

• Maybe you have a single group of subjects/birds/cells/etc. with measurements taken under different conditions

• Important: the measurements within each condition must still be independent

### Bootstrap to test paired data

• Basic strategy is the same, but...

• Instead of one descriptor/group, have as many descriptors as data pairs

• compute difference between measurement under condition1 and measurement under condition2

• Test statistic is mean/median/etc. of the differences

• Slightly different resampling procedure

### Visualize data

• Investigate distributions for raw data, also should plot distribution of the differences between paired data-points

• Why?

→ Test statistic is computed from these differences, not the raw datasets

### Resampling procedure

• Randomly sample 1 or -1 and store in vector

• Multiply by vector of differences

• randomizes direction of change

• Compute test stat (mean, median, etc.) for randomized differences

• Repeat

• Now we have distribution of test statistic if direction of change is random, i.e. NOT dependent on experimental conditions

### Compute descriptors and test stat

diffs = ctrl - exp; % compute change from ctrl condition

% to exp condition

test_stat = abs(mean(diffs)); % compute test statistic

pseudo_stat_dist = zeros(1,10000); % create vector to store pseudo test

% stats generated from resampling

### Resampling

for trials = 1:10000 % resample 10,000 times

signs = sample(length(diffs), [-1 1])'; % randomly create vector of 1's

% and -1's same size as diffs

pseudo_diffs = signs .* diffs; % multiply by actual diffs to

% randomize direction of diff

pseudo_stat = abs(mean(pseudo_diffs)); % compute pseudo test stat from

% pseudo diffs

pseudo_stat_dist(trials) = pseudo_stat; % store pseudo test stat to

% grow null distribution

end

### Compute p-val and visualize null distribution

bigger = count(pseudo_stat_dist > test_stat); % count pseudo test stats

% bigger than actual stat

pval = bigger / length(pseudo_stat_dist) % divide by total number

% of pseudo test stats to

% compute p-value

% could use proportion()

hist(pseudo_stat_dist) % show histogram of pseudo

title('Pseudo test stat distribution') % test stats

xlabel('Pseudo test stat')

ylabel('Frequency')

function [] = bootstrap2groupsMEAN_PAIRED (ctrl, exp)

% --- prepare for resampling ---

diffs = ctrl - exp; % compute change from ctrl condition

% to exp condition

test_stat = abs(mean(diffs)) % compute test statistic

pseudo_stat_dist = zeros(1,10000); % create vector to store pseudo test

% stats generated from resampling

% --- resampling ---

for trials = 1:10000 % resample 10,000 times

signs = sample(length(diffs), [-1 1])'; % randomly create vector of 1's

% and -1's same size as diffs

pseudo_diffs = signs .* diffs; % multiply by actual diffs to

% randomize direction of diff

pseudo_stat = abs(mean(pseudo_diffs)); % compute pseudo test stat from

% pseudo diffs

pseudo_stat_dist(trials) = pseudo_stat; % store pseudo test stat to

% grow null distribution

end

% --- end resampling ---

bigger = count(pseudo_stat_dist > test_stat); % count pseudo test stats

% bigger than actual stat

pval = bigger / length(pseudo_stat_dist) % divide by total number

% of pseudo test stats to

% compute p-value

% could use proportion()

hist(pseudo_stat_dist) % show histogram of pseudo

title('Pseudo test stat distribution') % test stats

xlabel('Pseudo test stat')

ylabel('Frequency')

function [] = bootstrap2groupsMEANS (ctrl, exp)

%

% function [] = bootstrap2groupsMEANS (ctrl, exp)

%

% 'ctrl' and 'exp' are assumed to be column vectors, this is

% necessary for creation of 'box'

%

% --- prepare for resampling ---

ctrl_d = mean(ctrl); % compute descriptors

exp_d = mean(exp); % 'ctrl_d' and 'exp_d'

test_stat = abs(ctrl_d - exp_d) % compute test statistic 'test_stat'

box = [ctrl; exp]; % combine datasets into 'box'

% could use concat()

pseudo_stat_dist = zeros(1,10000); % create vector 'pseudo_stat_dist'

% to store results of resampling

% --- start resampling ---

for trials = 1:10000 % resample 10,000 times

pseudo_ctrl = ... % create pseudo groups to be

sample(length(ctrl), box); % same size as actual groups

pseudo_exp = ...

sample(length(exp), box);

pseudo_ctrl_d = ... % compute pseudo group

mean(pseudo_ctrl); % descriptors

pseudo_exp_d = ...

mean(pseudo_exp);

pseudo_stat = ... % compute pseudo test stat

abs(pseudo_ctrl_d - pseudo_exp_d);

pseudo_stat_dist(trials) = ... % store pseudo test stat to

pseudo_stat; % build null distribution

end

% --- end resampling ---

bigger = count(pseudo_stat_dist > test_stat); % count pseudo test stats

% bigger than actual stat

pval = bigger / length(pseudo_stat_dist) % divide by total number

% of pseudo test stats to

% compute p-value

% could use proportion()

hist(pseudo_stat_dist) % show histogram of pseudo

title('Pseudo test stat distribution') % test stats

xlabel('Pseudo test stat')

ylabel('Frequency')

• Don't forget to look online to get a more generalized version of 'bootstrap2groupsMEANS.m' called 'bootstrap2groups.m'

• It's more flexible (can define method for descriptor and iterations as input), stores output in a struct, plots raw data, has one-tailed tests if needed, and can be used for demo (try running 'out = bootstrap2groups(@mean)'

### Possible topics for next week

• confidence intervals

• normality testing

• ANOVA (1 or 2-way)

• power