
# Primer on Probability



## Primer on Probability

Sushmita Roy

BMI/CS 576

www.biostat.wisc.edu/bmi576/

sroy@biostat.wisc.edu

Sep 25th, 2012

### Definition of probability

• frequentist interpretation: the probability of an event is the proportion of the time that events of the same kind occur in the long run, when the random experiment is repeated

• examples

• the probability my flight to Chicago will be on time

• the probability this ticket will win the lottery

• the probability it will rain tomorrow

• always a number in the interval [0,1]

0 means “never occurs”

1 means “always occurs”
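The long-run-frequency idea can be illustrated with a small simulation. This is only a sketch: the function name `long_run_frequency`, the assumed true probability of 0.8, and the fixed seed are hypothetical choices made for illustration, not part of the lecture.

```python
import random

def long_run_frequency(p_event, n_trials, seed=0):
    """Estimate a probability as the fraction of repeated trials in which the event occurs."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    hits = sum(rng.random() < p_event for _ in range(n_trials))
    return hits / n_trials

# Assumed (hypothetical) chance that a flight is on time: 0.8
estimate = long_run_frequency(p_event=0.8, n_trials=100_000)
print(f"estimated probability after 100,000 trials: {estimate:.3f}")
```

As the number of trials grows, the estimate should settle near the assumed value and always stays inside [0,1].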

### Sample spaces

• sample space: the set of possible outcomes of some experiment

• event: a subset of sample space

• examples

• flight to Chicago: {on time, late}

• lottery: {ticket 1 wins, ticket 2 wins,…,ticket n wins}

• weather tomorrow: {rain, not rain} or {sun, rain, snow} or {sun, clouds, rain, snow, sleet} or…

### Random variables

• random variable: a function associating a value with an attribute of the outcome of an experiment

• example

• X represents the outcome of my flight to Chicago

• we write the probability of my flight being on time as P(X = on-time)

• or when it’s clear which variable we’re referring to, we may use the shorthand P(on-time)

### Notation

• uppercase letters and capitalized words denote random variables

• lowercase letters and uncapitalized words denote values

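• we’ll denote a particular value for a variable as follows: P(X = x), e.g. P(Fever = true)

• we’ll also use the shorthand form P(x) for P(X = x)

• for Boolean random variables, we’ll use the shorthand P(fever) for P(Fever = true) and P(¬fever) for P(Fever = false)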

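### Probability distributions

• if X is a random variable, the function given by P(X = x) for each x is the probability distribution of X

[Figure: bar chart of a probability distribution over the weather values sun, clouds, rain, snow, sleet; vertical axis ticks at 0.1, 0.2, 0.3]

• requirements:

$$P(x) \ge 0 \quad \text{for every } x$$

$$\sum_x P(x) = 1$$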
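The two standard requirements on a discrete distribution (every probability is non-negative, and the probabilities sum to 1) can be checked mechanically. The sketch below assumes a distribution stored as a Python dict; the helper name `is_valid_distribution` and the weather numbers are hypothetical and only illustrative.

```python
import math

def is_valid_distribution(dist, tol=1e-9):
    """Return True if all probabilities are non-negative and they sum to 1."""
    values = list(dist.values())
    non_negative = all(p >= 0 for p in values)
    sums_to_one = math.isclose(sum(values), 1.0, abs_tol=tol)
    return non_negative and sums_to_one

# Hypothetical weather distribution; the numbers are illustrative only.
weather = {"sun": 0.3, "clouds": 0.3, "rain": 0.2, "snow": 0.1, "sleet": 0.1}
print(is_valid_distribution(weather))
```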

### Joint distributions

• joint probability distribution: the function given by P(X = x, Y = y)

• read “X equals x and Y equals y”

• example: with X the weather and Y the outcome of my flight, P(X = sun, Y = on-time) is the probability that it’s sunny and my flight is on time

### Marginal distributions

• the marginal distribution of X is defined by

$$P(x) = \sum_y P(x, y)$$

“the distribution of X ignoring other variables”

• this definition generalizes to more than two variables, e.g.

$$P(x) = \sum_y \sum_z P(x, y, z)$$
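Marginalizing means summing the joint probabilities over every value of the variable being ignored. A minimal sketch, assuming the joint distribution is stored as a dict keyed by (x, y) pairs; the function name `marginal_x` and the joint table are hypothetical, with made-up numbers.

```python
from collections import defaultdict

def marginal_x(joint):
    """Sum a joint distribution {(x, y): p} over y to obtain {x: P(x)}."""
    totals = defaultdict(float)
    for (x, _y), p in joint.items():
        totals[x] += p
    return dict(totals)

# Hypothetical joint distribution over (weather, flight); numbers are illustrative only.
joint = {
    ("sun", "on-time"): 0.45, ("sun", "late"): 0.05,
    ("rain", "on-time"): 0.20, ("rain", "late"): 0.10,
    ("snow", "on-time"): 0.05, ("snow", "late"): 0.15,
}

marginal = marginal_x(joint)
print(marginal)
```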

### Marginal distribution example

[Tables: joint distribution; marginal distribution for X]

### Conditional distributions

• the conditional distribution of X given Y is defined as:

$$P(x \mid y) = \frac{P(x, y)}{P(y)}$$

“the distribution of X given that we know the value of Y”
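Conditioning on Y = y amounts to keeping the joint entries with that y and dividing by their total, P(y). The sketch below assumes the joint is a dict keyed by (x, y); `conditional_x_given_y` and the numbers are hypothetical.

```python
def conditional_x_given_y(joint, y):
    """Return {x: P(x | y)} from a joint distribution {(x, y): p}."""
    p_y = sum(p for (_x, y_val), p in joint.items() if y_val == y)  # marginal P(y)
    if p_y == 0:
        raise ValueError("cannot condition on an event with zero probability")
    return {x: p / p_y for (x, y_val), p in joint.items() if y_val == y}

# Hypothetical joint distribution over (weather, flight); numbers are illustrative only.
joint = {
    ("sun", "on-time"): 0.45, ("sun", "late"): 0.05,
    ("rain", "on-time"): 0.20, ("rain", "late"): 0.10,
    ("snow", "on-time"): 0.05, ("snow", "late"): 0.15,
}

given_on_time = conditional_x_given_y(joint, "on-time")
print(given_on_time)
```

The result is itself a probability distribution over X, so its values sum to 1.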

### Conditional distribution example

[Tables: joint distribution; conditional distribution for X given Y = on-time]

### Independence

• two random variables, X and Y, are independent if

$$P(x, y) = P(x)\,P(y) \quad \text{for all } x, y$$
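Independence can be tested by comparing every joint entry with the product of the corresponding marginals, P(x, y) = P(x) P(y). This is a sketch under assumptions: the joint is a dict keyed by (x, y), and the helper name `is_independent` and both example tables are hypothetical.

```python
import math
from itertools import product

def is_independent(joint, tol=1e-9):
    """Check whether P(x, y) equals P(x) * P(y) for every pair (x, y)."""
    xs = {x for x, _ in joint}
    ys = {y for _, y in joint}
    p_x = {x: sum(joint.get((x, y), 0.0) for y in ys) for x in xs}
    p_y = {y: sum(joint.get((x, y), 0.0) for x in xs) for y in ys}
    return all(
        math.isclose(joint.get((x, y), 0.0), p_x[x] * p_y[y], abs_tol=tol)
        for x, y in product(xs, ys)
    )

# Hypothetical tables; the numbers are illustrative only.
dependent = {("sun", "on-time"): 0.45, ("sun", "late"): 0.05,
             ("rain", "on-time"): 0.25, ("rain", "late"): 0.25}
independent = {("sun", "on-time"): 0.40, ("sun", "late"): 0.10,
               ("rain", "on-time"): 0.40, ("rain", "late"): 0.10}

print(is_independent(dependent), is_independent(independent))
```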

### Independence example #1

[Tables: joint distribution; marginal distributions]

Are X and Y independent here? NO.

### Independence example #2

[Tables: joint distribution; marginal distributions]

Are X and Y independent here? YES.

### Conditional independence

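• two random variables X and Y are conditionally independent given Z if

$$P(X \mid Y, Z) = P(X \mid Z)$$

• “once you know the value of Z, knowing Y doesn’t tell you anything about X”

• alternatively

$$P(x, y \mid z) = P(x \mid z)\,P(y \mid z) \quad \text{for all } x, y, z$$

### Conditional independence example

Are Fever and Vomit independent? NO.

Are Fever and Vomit conditionally independent given Flu? YES.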
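A numeric check can make the distinction concrete: two symptoms can be dependent overall yet independent once the value of Flu is known. The sketch below builds a joint distribution with that property. Every number and name in it (`p_flu`, `p_fever_given_flu`, `p_vomit_given_flu`, `prob`) is a hypothetical assumption, not data from the lecture.

```python
import math
from itertools import product

# Hypothetical parameters (illustrative only).
p_flu = {True: 0.1, False: 0.9}
p_fever_given_flu = {True: 0.9, False: 0.1}    # P(Fever = true | Flu)
p_vomit_given_flu = {True: 0.6, False: 0.05}   # P(Vomit = true | Flu)

def bernoulli(p_true, value):
    """Probability of a Boolean value when P(true) = p_true."""
    return p_true if value else 1.0 - p_true

# Joint P(flu, fever, vomit), constructed as P(flu) P(fever | flu) P(vomit | flu).
joint = {
    (flu, fever, vomit): p_flu[flu]
    * bernoulli(p_fever_given_flu[flu], fever)
    * bernoulli(p_vomit_given_flu[flu], vomit)
    for flu, fever, vomit in product([True, False], repeat=3)
}

NAMES = ("flu", "fever", "vomit")

def prob(**fixed):
    """Sum the joint over all entries that match the given variable values."""
    return sum(
        p for key, p in joint.items()
        if all(key[NAMES.index(name)] == value for name, value in fixed.items())
    )

def marginally_independent(tol=1e-9):
    """Check P(fever, vomit) == P(fever) P(vomit) for all values."""
    return all(
        math.isclose(prob(fever=f, vomit=v), prob(fever=f) * prob(vomit=v), abs_tol=tol)
        for f, v in product([True, False], repeat=2)
    )

def conditionally_independent_given_flu(tol=1e-9):
    """Check P(fever, vomit | flu) == P(fever | flu) P(vomit | flu) for all values."""
    for flu, f, v in product([True, False], repeat=3):
        p_z = prob(flu=flu)
        lhs = prob(flu=flu, fever=f, vomit=v) / p_z
        rhs = (prob(flu=flu, fever=f) / p_z) * (prob(flu=flu, vomit=v) / p_z)
        if not math.isclose(lhs, rhs, abs_tol=tol):
            return False
    return True

print(marginally_independent(), conditionally_independent_given_flu())
```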

### Chain rule of probability

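• for two variables

$$P(x, y) = P(x \mid y)\,P(y)$$

• for three variables

$$P(x, y, z) = P(x \mid y, z)\,P(y \mid z)\,P(z)$$

• etc.

• to see that this is true, note that

$$P(x \mid y, z)\,P(y \mid z)\,P(z) = \frac{P(x, y, z)}{P(y, z)} \cdot \frac{P(y, z)}{P(z)} \cdot P(z) = P(x, y, z)$$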

### Bayes theorem

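$$P(x \mid y) = \frac{P(y \mid x)\,P(x)}{P(y)} = \frac{P(y \mid x)\,P(x)}{\sum_{x'} P(y \mid x')\,P(x')}$$

• this theorem is extremely useful

• there are many cases when it is hard to estimate P(x | y) directly, but it’s not too hard to estimate P(y | x) and P(x)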

### Bayes theorem example

• MDs usually aren’t good at estimating P(Disorder | Symptom)

• they’re usually better at estimating P(Symptom | Disorder)

• if we can estimate P(Fever | Flu) and P(Flu) we can use Bayes’ Theorem to do diagnosis
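A small numeric sketch of the diagnosis calculation follows. Besides P(Fever | Flu) and P(Flu), it also needs P(Fever | no Flu) to obtain the denominator P(Fever). All three numbers and the function name `posterior` are hypothetical, chosen only to show the arithmetic.

```python
def posterior(prior, likelihood, likelihood_given_not):
    """P(H | E) via Bayes' theorem for a Boolean hypothesis H and observed evidence E."""
    # P(E) = P(E | H) P(H) + P(E | not H) P(not H)
    evidence = likelihood * prior + likelihood_given_not * (1.0 - prior)
    return likelihood * prior / evidence

# Hypothetical estimates (illustrative only).
p_flu = 0.05                 # P(Flu)
p_fever_given_flu = 0.9      # P(Fever | Flu)
p_fever_given_no_flu = 0.1   # P(Fever | no Flu)

p_flu_given_fever = posterior(p_flu, p_fever_given_flu, p_fever_given_no_flu)
print(f"P(Flu | Fever) = {p_flu_given_fever:.3f}")
```

With these assumed numbers the posterior stays well below P(Fever | Flu), because flu is rare to begin with.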

### Expected values

• the expected value of a random variable that takes on numerical values is defined as:

$$E[X] = \sum_x x\,P(x)$$

this is the same thing as the mean

• we can also talk about the expected value of a function of a random variable

$$E[g(X)] = \sum_x g(x)\,P(x)$$

### Expected value examples

• Suppose each lottery ticket costs \$1 and the winning ticket pays out \$100. The probability that a particular ticket is the winning ticket is 0.001.
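The expected net gain from buying one ticket can be worked out from the numbers above. The sketch assumes "gain" means payout minus the ticket price, so a win nets 99 and a loss nets -1; the function name `expected_value` is hypothetical.

```python
def expected_value(dist):
    """Expected value of a numeric random variable given as {value: probability}."""
    return sum(value * p for value, p in dist.items())

ticket_cost = 1
payout = 100
p_win = 0.001

# Net gain in dollars: +99 on a win, -1 otherwise (assumed interpretation of "gain").
net_gain = {payout - ticket_cost: p_win, -ticket_cost: 1 - p_win}

expected_gain = expected_value(net_gain)
print(f"expected net gain per ticket: {expected_gain:.2f} dollars")
```

Under that assumption the calculation is 99 × 0.001 + (-1) × 0.999 = -0.9, so a ticket loses 90 cents on average.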

### The binomial distribution

• distribution over the number of successes in a fixed number n of independent trials (with same probability of success p in each)

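• e.g. the probability of x heads in n coin flips

$$P(X = x) = \binom{n}{x}\, p^x (1 - p)^{n - x}$$

[Figure: two plots of P(X = x) against x, one for p=0.5 and one for p=0.1]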
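The binomial probability of x successes in n trials, C(n, x) p^x (1 - p)^(n - x), is straightforward to compute with the standard library. A minimal sketch; the function name `binomial_pmf` and the choice of n = 10 are illustrative assumptions.

```python
from math import comb

def binomial_pmf(x, n, p):
    """Probability of exactly x successes in n independent trials with success probability p."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)

n = 10  # illustrative number of coin flips
for p in (0.5, 0.1):
    pmf = [binomial_pmf(x, n, p) for x in range(n + 1)]
    most_likely = max(range(n + 1), key=lambda x: pmf[x])
    print(f"p={p}: most likely number of successes is {most_likely}")
```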

### The multinomial distribution

• k possible outcomes on each trial

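• probability $p_i$ for outcome $i$ in each trial

• distribution over the number of occurrences $x_i$ for each outcome in a fixed number n of independent trials

$$P(x_1, \ldots, x_k) = \frac{n!}{\prod_{i=1}^{k} x_i!} \prod_{i=1}^{k} p_i^{x_i}$$

• e.g. with k=6 (a six-sided die) and n=30, the vector of outcome occurrences $(x_1, \ldots, x_6)$ counts how many times each face came up, with $x_1 + \cdots + x_6 = 30$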
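The multinomial probability of a count vector is n! / (x_1! ... x_k!) times the product of p_i^(x_i). The sketch below computes it for a fair six-sided die rolled 30 times. The function name `multinomial_pmf` and the specific count vector are hypothetical choices for illustration.

```python
from math import factorial, prod

def multinomial_pmf(counts, probs):
    """Probability of observing the given per-outcome counts in sum(counts) independent trials."""
    if len(counts) != len(probs):
        raise ValueError("counts and probs must have the same length")
    coefficient = factorial(sum(counts))
    for c in counts:
        coefficient //= factorial(c)  # each step divides exactly, so integer division is safe
    return coefficient * prod(p**c for p, c in zip(probs, counts))

fair_die = [1 / 6] * 6
counts = [5, 5, 5, 5, 5, 5]  # hypothetical outcome vector with n = 30
print(multinomial_pmf(counts, fair_die))
```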

### Statistics of alignment scores

Q: How do we assess whether an alignment provides good evidence for homology?

A: determine how likely it is that such an alignment score would result from chance.

What is “chance”?

• real but non-homologous sequences

• real sequences shuffled to preserve compositional properties

• sequences generated randomly based upon a DNA/protein sequence model

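### Model for unrelated sequences

• we’ll assume that each position in the alignment is sampled randomly from some distribution of amino acids

• let $q_a$ be the probability of amino acid a

• the probability of an n-character alignment of x and y is given by

$$P(x, y \mid U) = \prod_{i=1}^{n} q_{x_i} \prod_{i=1}^{n} q_{y_i}$$

### Model for related sequences

• we’ll assume that each pair of aligned amino acids evolved from a common ancestor

• let $p_{ab}$ be the probability that evolution gave rise to amino acid a in one sequence and b in another sequence

• the probability of an alignment of x and y is given by

$$P(x, y \mid R) = \prod_{i=1}^{n} p_{x_i y_i}$$

### Probabilistic model of alignments

• How can we decide which possibility (U or R) is more likely?

• one principled way is to consider the relative likelihood of the two possibilities

$$\frac{P(x, y \mid R)}{P(x, y \mid U)} = \frac{\prod_{i} p_{x_i y_i}}{\prod_{i} q_{x_i} \prod_{i} q_{y_i}} = \prod_{i=1}^{n} \frac{p_{x_i y_i}}{q_{x_i}\, q_{y_i}}$$

• taking the log, we get

$$\log \frac{P(x, y \mid R)}{P(x, y \mid U)} = \sum_{i=1}^{n} \log \frac{p_{x_i y_i}}{q_{x_i}\, q_{y_i}}$$

### Probabilistic model of alignments

• the score for an alignment is thus given by:

$$S = \sum_{i=1}^{n} s(x_i, y_i)$$

• the substitution matrix score for the pair a, b should thus be given by:

$$s(a, b) = \log \frac{p_{ab}}{q_a\, q_b}$$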
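A log-odds substitution score compares how likely a pair (a, b) is under the related model with how likely it is under the unrelated model: the log of the pair probability divided by the product of the two background frequencies. The sketch below uses a toy two-letter alphabet. The dict names `pair_prob` and `background`, the probabilities, and the use of natural logs are all hypothetical assumptions, since the base only rescales the scores.

```python
import math

# Toy two-letter alphabet with made-up probabilities (illustrative only).
background = {"A": 0.5, "B": 0.5}                      # frequency of each residue
pair_prob = {("A", "A"): 0.4, ("B", "B"): 0.4,         # probability of each aligned pair
             ("A", "B"): 0.1, ("B", "A"): 0.1}         # under the related model

def substitution_score(a, b):
    """Log-odds score: log of pair probability over the product of background frequencies."""
    return math.log(pair_prob[(a, b)] / (background[a] * background[b]))

def alignment_score(x, y):
    """Sum of substitution scores over the columns of an ungapped alignment."""
    if len(x) != len(y):
        raise ValueError("ungapped alignment requires sequences of equal length")
    return sum(substitution_score(a, b) for a, b in zip(x, y))

print(substitution_score("A", "A"), substitution_score("A", "B"))
print(alignment_score("AAB", "ABB"))
```

Pairs that are more likely under the related model than by chance get positive scores, and pairs that are less likely get negative ones.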

### Scores from random alignments

• suppose we assume

• sequence lengths m and n

• a particular substitution matrix and amino-acid frequencies

• and we consider generating random sequences of lengths m and n and finding the best alignment of these sequences

• this will give us a distribution over alignment scores for random pairs of sequences

### The extreme value distribution

• but we’re picking the best alignments, so we want to know what the distribution of max scores for alignments against a random set of sequences looks like

• this is given by an extreme value distribution

### Distribution of scores

• the expected number of alignments, E, with score at least S is given by:

$$E = K\, m\, n\, e^{-\lambda S}$$

• S is a given score threshold

• m and n are the lengths of the sequences under consideration

• K and λ are constants that can be calculated from

• the substitution matrix

• the frequencies of the individual amino acids
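The standard Karlin-Altschul form of this expectation is E = K m n e^(-λS). The sketch below evaluates it for a few score thresholds. The function name `expected_alignments` is hypothetical, and the values of K, λ, m, and n are placeholders chosen only to show the shape of the formula, not parameters of any real scoring system.

```python
import math

def expected_alignments(score, m, n, K, lam):
    """Expected number of chance alignments with score at least `score`."""
    return K * m * n * math.exp(-lam * score)

# Placeholder values (illustrative only).
K, lam = 0.1, 0.3
m, n = 250, 1_000_000   # e.g. query length and summed database length

for score in (30, 60, 90):
    print(score, expected_alignments(score, m, n, K, lam))
```

E falls off exponentially as the score threshold rises and grows linearly with each sequence length.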

### Statistics of alignment scores

• to generalize this to searching a database, have n represent the summed length of the sequences in the DB (adjusting for edge effects)

• the NCBI BLAST server does just this

• the theory for gapped alignments is not as well developed

• computational experiments suggest this analysis holds for gapped alignments (but K and λ must be estimated from data)