Feb 5 2009 austin hilliard
This presentation is the property of its rightful owner.
Sponsored Links
1 / 28

Introduction to resampling in MATLAB PowerPoint PPT Presentation


  • 53 Views
  • Uploaded on
  • Presentation posted in: General

Feb 5 2009 Austin Hilliard. Introduction to resampling in MATLAB. So you've done an experiment. Two independent datasets: control experimental Have n numbers in each dataset, representing independent measurements Question:

Download Presentation

Introduction to resampling in MATLAB

An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Presentation Transcript


Feb 5 2009 austin hilliard

Feb 5 2009

Austin Hilliard

Introduction to resampling in MATLAB


So you ve done an experiment

So you've done an experiment...

  • Two independent datasets:

    • control

    • experimental

  • Have n numbers in each dataset, representing independent measurements

  • Question:

    Did the experimental manipulation have an effect, i.e. did it make a difference?


The question restated

The question restated

  • Is there a significant (i.e. “real”) numerical difference between the control and experimental datasets?

  • What is the probability that the two datasets are subsets of two different/independent distributions?

  • What is the probability that the two datasets are NOT subsets of the same underlying distribution?


Answering the question

Answering the question

  • What do we need?

    • A number(s) to represent/summarize each dataset

      → datadescriptors

    • A test for comparing these numbers

      → test statistic

    • A way to assess the significance of the test statistic

      → distribution of test statistics generated from datasets that ARE from the same underlying distribution


Descriptors and tests

Descriptors and tests

  • Know your data!

    • Visualize it

    • Does it have a central tendency? (i.e. is histogram vaguely mound shaped?)

      • - If yes, use the mean as descriptor

      • - If no, median may be better

  • Typical test: compute difference (subtraction) between descriptors


Introduction to resampling in matlab

  • We'll discuss how to assess the significance of your test statistic soon, that's the resampling part...

  • But first let's visualize some data


Simulate and inspect data

Simulate and inspect data

ctrl = 10.*(5.75+randn(25,1));

exp = 10.*(5+randn(25,1));

boxplot([ctrl exp])

groups = {'control''experimental'}

boxplot([ctrl exp],'labels',groups)


Central tendency

Central tendency?

hist(ctrl) hist(exp)

title('Control group') title('Experimental group')


Compute descriptors and test stat

Compute descriptors and test stat

  • Our data looks pretty mound shaped, so let's use the mean as data descriptor

  • Take the absolute value of the difference as our test stat

ctrl_d = mean(ctrl);

exp_d = mean(exp);

test_stat = abs(ctrl_d - exp_d);


Assessing the test statistic

Assessing the test statistic

  • What we would really like to test:

    • The probability that the two datasets are subsets of two different/independent distributions

  • Problem:

    • We don't know what those hypothetical independent distributions are, if we did we wouldn't have to go through all of this!


Assessing the test statistic1

Assessing the test statistic

  • Solution:

    • Compute the probability that the two datasets are NOT subsets of the same underlying distribution

  • How to do this?

    • Start by assuming datasets are subsets of same distribution → null hypothesis

    • See what test statistic looks like when null hypothesis is true


Assessing the test statistic2

Assessing the test statistic

  • Need to generate a distribution of test statistic generated when datasets really ARE from the same underlying distribution

    → distribution of test stat under null hypothesis

  • Then we can quantify how (a)typical the value of our test stat is under null hypothesis

    • if typical, our datasets are likely from same distribution → no effect of experiment

    • if atypical, there is a good chance datasets are from different distributions → experiment had an effect


Generate the null distribution using bootstrap

Generate the null distribution using bootstrap

  • Basic procedure:

    • Combine datasets

    • Randomly sample (w/ replacement) from combined dataset to create two pseudo datasets w/ same n as real datasets

    • Compute descriptors and test statistic for pseudo datasets

    • Repeat 10,000 times, keeping track of pseudo test stat


Combine datasets and resample

Combine datasets and resample

box = [ctrl; exp]; % combine datasets into 'box'

% could use concat()

pseudo_stat_dist = zeros(1,10000); % create vector 'pseudo_stat_dist'

% to store results of resampling

% --- start resampling ---

for trials = 1:10000 % resample 10000 times

pseudo_ctrl = ... % create pseudo groups to be

sample(length(ctrl), box); % same size as actual groups

pseudo_exp = ...

sample(length(exp), box);

pseudo_ctrl_d = ... % compute pseudo group

mean(pseudo_ctrl); % descriptors

pseudo_exp_d = ...

mean(pseudo_exp);

pseudo_stat = ... % compute pseudo test stat

abs(pseudo_ctrl_d - pseudo_exp_d);

pseudo_stat_dist(trials) = ... % store pseudo test stat to

pseudo_stat; % build null distribution

end


Now what

Now what?

  • Compute how many values of pseudo test stat are greater than our actual test stat, then divide by the total number of data points in the null distribution

  • This approximates the likelihood of getting our actual test stat (i.e. the likelihood of seeing a difference as big as we see) if our two datasets were truly from the same underlying distribution

    → p-value


Compute p val and visualize null distribution

Compute p-val and visualize null distribution

bigger = count(pseudo_stat_dist > test_stat); % count pseudo test stats

% bigger than actual stat

pval = bigger / length(pseudo_stat_dist) % divide by total number

% of pseudo test stats to

% compute p-value

% could use proportion()

hist(pseudo_stat_dist) % show histogram of pseudo

title('Pseudo test stat distribution') % test stats

xlabel('Pseudo test stat')

ylabel('Frequency')


How to interpret the p value

How to interpret the p-value?

  • Assuming that the null hypothesis is true, the p-value is the likelihood that we would see a value of the test statistic that is greater than the value of our actual test statistic

    Restated →

  • Assuming both groups really are from the same underlying distribution, the p-value is the likelihood that we would see a difference between them as big as we do


Paired data

“Paired” data

  • What if your two datasets are not actually independent?

  • Maybe you have a single group of subjects/birds/cells/etc. with measurements taken under different conditions

  • Important: the measurements within each condition must still be independent


Bootstrap to test paired data

Bootstrap to test paired data

  • Basic strategy is the same, but...

  • Instead of one descriptor/group, have as many descriptors as data pairs

    • compute difference between measurement under condition1 and measurement under condition2

  • Test statistic is mean/median/etc. of the differences

  • Slightly different resampling procedure


Visualize data

Visualize data

  • Investigate distributions for raw data, also should plot distribution of the differences between paired data-points

  • Why?

    → Test statistic is computed from these differences, not the raw datasets


Resampling procedure

Resampling procedure

  • Randomly sample 1 or -1 and store in vector

  • Multiply by vector of differences

    • randomizes direction of change

  • Compute test stat (mean, median, etc.) for randomized differences

  • Repeat

  • Now we have distribution of test statistic if direction of change is random, i.e. NOT dependent on experimental conditions


Compute descriptors and test stat1

Compute descriptors and test stat

diffs = ctrl - exp; % compute change from ctrl condition

% to exp condition

test_stat = abs(mean(diffs)); % compute test statistic

pseudo_stat_dist = zeros(1,10000); % create vector to store pseudo test

% stats generated from resampling


Resampling

Resampling

for trials = 1:10000 % resample 10,000 times

signs = sample(length(diffs), [-1 1])'; % randomly create vector of 1's

% and -1's same size as diffs

pseudo_diffs = signs .* diffs; % multiply by actual diffs to

% randomize direction of diff

pseudo_stat = abs(mean(pseudo_diffs)); % compute pseudo test stat from

% pseudo diffs

pseudo_stat_dist(trials) = pseudo_stat; % store pseudo test stat to

% grow null distribution

end


Compute p val and visualize null distribution1

Compute p-val and visualize null distribution

bigger = count(pseudo_stat_dist > test_stat); % count pseudo test stats

% bigger than actual stat

pval = bigger / length(pseudo_stat_dist) % divide by total number

% of pseudo test stats to

% compute p-value

% could use proportion()

hist(pseudo_stat_dist) % show histogram of pseudo

title('Pseudo test stat distribution') % test stats

xlabel('Pseudo test stat')

ylabel('Frequency')


Introduction to resampling in matlab

function [] = bootstrap2groupsMEAN_PAIRED (ctrl, exp)

% --- prepare for resampling ---

diffs = ctrl - exp; % compute change from ctrl condition

% to exp condition

test_stat = abs(mean(diffs)) % compute test statistic

pseudo_stat_dist = zeros(1,10000); % create vector to store pseudo test

% stats generated from resampling

% --- resampling ---

for trials = 1:10000 % resample 10,000 times

signs = sample(length(diffs), [-1 1])'; % randomly create vector of 1's

% and -1's same size as diffs

pseudo_diffs = signs .* diffs; % multiply by actual diffs to

% randomize direction of diff

pseudo_stat = abs(mean(pseudo_diffs)); % compute pseudo test stat from

% pseudo diffs

pseudo_stat_dist(trials) = pseudo_stat; % store pseudo test stat to

% grow null distribution

end

% --- end resampling ---

bigger = count(pseudo_stat_dist > test_stat); % count pseudo test stats

% bigger than actual stat

pval = bigger / length(pseudo_stat_dist) % divide by total number

% of pseudo test stats to

% compute p-value

% could use proportion()

hist(pseudo_stat_dist) % show histogram of pseudo

title('Pseudo test stat distribution') % test stats

xlabel('Pseudo test stat')

ylabel('Frequency')


Introduction to resampling in matlab

function [] = bootstrap2groupsMEANS (ctrl, exp)

%

% function [] = bootstrap2groupsMEANS (ctrl, exp)

%

% 'ctrl' and 'exp' are assumed to be column vectors, this is

% necessary for creation of 'box'

%

% --- prepare for resampling ---

ctrl_d = mean(ctrl); % compute descriptors

exp_d = mean(exp); % 'ctrl_d' and 'exp_d'

test_stat = abs(ctrl_d - exp_d) % compute test statistic 'test_stat'

box = [ctrl; exp]; % combine datasets into 'box'

% could use concat()

pseudo_stat_dist = zeros(1,10000); % create vector 'pseudo_stat_dist'

% to store results of resampling

% --- start resampling ---

for trials = 1:10000 % resample 10,000 times

pseudo_ctrl = ... % create pseudo groups to be

sample(length(ctrl), box); % same size as actual groups

pseudo_exp = ...

sample(length(exp), box);

pseudo_ctrl_d = ... % compute pseudo group

mean(pseudo_ctrl); % descriptors

pseudo_exp_d = ...

mean(pseudo_exp);

pseudo_stat = ... % compute pseudo test stat

abs(pseudo_ctrl_d - pseudo_exp_d);

pseudo_stat_dist(trials) = ... % store pseudo test stat to

pseudo_stat; % build null distribution

end

% --- end resampling ---

bigger = count(pseudo_stat_dist > test_stat); % count pseudo test stats

% bigger than actual stat

pval = bigger / length(pseudo_stat_dist) % divide by total number

% of pseudo test stats to

% compute p-value

% could use proportion()

hist(pseudo_stat_dist) % show histogram of pseudo

title('Pseudo test stat distribution') % test stats

xlabel('Pseudo test stat')

ylabel('Frequency')


Introduction to resampling in matlab

  • Don't forget to look online to get a more generalized version of 'bootstrap2groupsMEANS.m' called 'bootstrap2groups.m'

  • It's more flexible (can define method for descriptor and iterations as input), stores output in a struct, plots raw data, has one-tailed tests if needed, and can be used for demo (try running 'out = bootstrap2groups(@mean)'


Possible topics for next week

Possible topics for next week

  • confidence intervals

  • normality testing

  • ANOVA (1 or 2-way)

  • power


  • Login