Primer on Probability

1 / 33

# Primer on Probability - PowerPoint PPT Presentation

Primer on Probability. Sushmita Roy BMI/CS 576 www.biostat.wisc.edu/bmi576/ Sushmita Roy sroy@biostat.wisc.edu Sep 25 th , 2012. BMI/CS 576. Definition of probability.

I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.

## PowerPoint Slideshow about 'Primer on Probability' - josiah

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript

### Primer on Probability

Sushmita Roy

BMI/CS 576

www.biostat.wisc.edu/bmi576/

Sushmita Roy

sroy@biostat.wisc.edu

Sep 25th, 2012

BMI/CS 576

Definition of probability
• frequentist interpretation: the probability of an event from a random experiment is the proportion of the time events of same kind will occur in the long run, when the experiment is repeated
• examples
• the probability my flight to Chicago will be on time
• the probability this ticket will win the lottery
• the probability it will rain tomorrow
• always a number in the interval [0,1]

0 means “never occurs”

1 means “always occurs”

Sample spaces
• sample space: a set of possible outcomes for some event
• event: a subset of sample space
• examples
• flight to Chicago: {on time, late}
• lottery: {ticket 1 wins, ticket 2 wins,…,ticket n wins}
• weather tomorrow:

{rain, not rain} or

{sun, rain, snow} or

{sun, clouds, rain, snow, sleet} or…

Random variables
• random variable: a function associating a value with an attribute of the outcome of an experiment
• example
• X represents the outcome of my flight to Chicago
• we write the probability of my flight being on time as P(X = on-time)
• or when it’s clear which variable we’re referring to, we may use the shorthand P(on-time)
Notation
• uppercase letters and capitalized words denote random variables
• lowercase letters and uncapitalized words denote values
• we’ll denote a particular value for a variable as follows
• we’ll also use the shorthand form
• for Boolean random variables, we’ll use the shorthand
0.3

0.2

0.1

sun

rain

sleet

snow

clouds

Probability distributions
• if X is a random variable, the function given by P(X = x)for each x is the probability distribution of X
• requirements:
Joint distributions
• joint probability distribution: the function given by P(X = x, Y = y)
• read “X equals xandY equals y”
•  example

probability that it’s sunny

and my flight is on time

Marginal distributions
• the marginal distribution of X is defined by

“the distribution of X ignoring other variables”

• this definition generalizes to more than two variables, e.g.
Marginal distribution example

joint distribution

marginal distribution for X

Conditional distributions
• the conditional distribution of Xgiven Y is defined as:

“the distribution of X given that we know the value of Y”

Conditional distribution example

conditional distribution for X

givenY=on-time

joint distribution

Independence
• two random variables, X and Y, are independent if
Independence example #1

joint distribution

marginal distributions

Are X and Y independent here?

NO.

Independence example #2

joint distribution

marginal distributions

Are X and Y independent here?

YES.

Conditional independence
• two random variables X and Y are conditionally independent given Z if
• “once you know the value of Z, knowing Y doesn’t tell you anything about X”
• alternatively
Conditional independence example

NO.

Conditional independence example

Are Fever and Vomitconditionally independent given Flu:

YES.

Chain rule of probability
• for two variables
• for three variables
• etc.
• to see that this is true, note that
Bayes theorem
• this theorem is extremely useful
• there are many cases when it is hard to estimate P(x| y) directly, but it’s not too hard to estimate P(y| x) andP(x)
Bayes theorem example
• MDs usually aren’t good at estimating P(Disorder| Symptom)
• they’re usually better at estimating P(Symptom| Disorder)
• if we can estimate P(Fever| Flu) and P(Flu) we can use Bayes’ Theorem to do diagnosis
Expected values
• the expected value of a random variable that takes on numerical values is defined as:

this is the same thing as the mean

• we can also talk about the expected value of a function of a random variable
Expected value examples
• Suppose each lottery ticket costs \$1 and the winning ticket pays out \$100. The probability that a particular ticket is the winning ticket is 0.001.
The binomial distribution
• distribution over the number of successes in a fixed number n of independent trials (with same probability of success p in each)
• e.g. the probability of x heads in ncoin flips

p=0.5

p=0.1

P(X=x)

x

x

The multinomial distribution
• k possible outcomes on each trial
• probability pifor outcome xi in each trial
• distribution over the number of occurrences xifor each outcome in a fixed number n of independent trials
• e.g. with k=6 (a six-sided die) and n=30

vector of outcome

occurrences

Statistics of alignment scores

Q: How do we assess whether an alignment provides good evidence for homology?

A: determine how likely it is that such an alignment score would result from chance.

What is “chance”?

• real but non-homologous sequences
• real sequences shuffled to preserve compositional properties
• sequences generated randomly based upon a DNA/protein sequence model
Model forunrelatedsequences
• we’ll assume that each position in the alignment is sampled randomly from some distribution of amino acids
• let be the probability of amino acid a
• the probability of an n-character alignment of x and y is given by
Model forrelatedsequences
• we’ll assume that each pair of aligned amino acids evolved from a common ancestor
• let be the probability that evolution gave rise to amino acid a in one sequence and b in another sequence
• the probability of an alignment of x and y is given by
taking the log, we getProbabilistic model of alignments
• How can we decide which possibility (U or R) is more likely?
• one principled way is to consider the relative likelihood of the two possibilities
Probabilistic model of alignments
• the score for an alignment is thus given by:
• the substitution matrix score for the pair a, b should thus be given by:
Scores from random alignments
• suppose we assume
• sequence lengths m and n
• a particular substitution matrix and amino-acid frequencies
• and we consider generating random sequences of lengths m and n and finding the best alignment of these sequences
• this will give us a distribution over alignment scores for random pairs of sequences
The extreme value distribution
• but we’re picking thebest alignments, so we want to know what the distribution of max scores for alignments against a random set of sequences looks like
• this is given by an extreme value distribution
Distribution of scores
• the expected number of alignments, E, with score at least S is given by:
• S is a given score threshold
• m and n are the lengths of the sequences under consideration
• K and are constants that can be calculated from
• the substitution matrix
• the frequencies of the individual amino acids
Statistics of alignment scores
• to generalize this to searching a database, have n represent the summed length of the sequences in the DB (adjusting for edge effects)
• the NCBI BLAST server does just this
• theory for gapped alignments not as well developed
• computational experiments suggest this analysis holds for gapped alignments (but K and must be estimated from data)