Primer on Probability

Sushmita Roy

BMI/CS 576

www.biostat.wisc.edu/bmi576/

Sep 25th, 2012

- frequentist interpretation: the probability of an event in a random experiment is the proportion of the time that events of the same kind occur in the long run, when the experiment is repeated many times
- examples
- the probability my flight to Chicago will be on time
- the probability this ticket will win the lottery
- the probability it will rain tomorrow
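
The long-run reading of probability can be illustrated with a small simulation. This is only a sketch: the function name `long_run_frequency` and the on-time probability of 0.8 are assumptions made for the example, not values from the slides.

```python
import random

def long_run_frequency(p_event, n_trials, seed=0):
    """Estimate P(event) as the fraction of n_trials in which the event occurs."""
    rng = random.Random(seed)  # seeded so that repeated runs give the same numbers
    occurrences = sum(rng.random() < p_event for _ in range(n_trials))
    return occurrences / n_trials

# Hypothetical true probability that the flight is on time (assumed, not from the slides).
P_ON_TIME = 0.8
for n in (10, 1_000, 100_000):
    print(n, long_run_frequency(P_ON_TIME, n))
```

With more repetitions the observed proportion should settle close to the assumed 0.8, which is the sense in which the frequentist interpretation defines the probability.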

- a probability is always a number in the interval [0,1]
- 0 means “never occurs”
- 1 means “always occurs”

- sample space: the set of possible outcomes of a random experiment
- event: a subset of sample space
- examples
- flight to Chicago: {on time, late}
- lottery: {ticket 1 wins, ticket 2 wins,…,ticket n wins}
- weather tomorrow: {rain, not rain} or {sun, rain, snow} or {sun, clouds, rain, snow, sleet} or…

- random variable: a function associating a value with an attribute of the outcome of an experiment
- example
- X represents the outcome of my flight to Chicago
- we write the probability of my flight being on time as P(X = on-time)
- or when it’s clear which variable we’re referring to, we may use the shorthand P(on-time)

- uppercase letters and capitalized words denote random variables
- lowercase letters and uncapitalized words denote values
- we’ll denote a particular value for a variable as follows: P(X = x)
- we’ll also use the shorthand form P(x) for P(X = x)
- for Boolean random variables, we’ll use the shorthand P(fever) for P(Fever = true) and P(¬fever) for P(Fever = false)

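[Figure: bar chart of an example probability distribution over weather outcomes (sun, rain, sleet, snow, clouds); the vertical axis shows probability, with ticks at 0.1, 0.2 and 0.3]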

- if X is a random variable, the function given by P(X = x) for each x is the probability distribution of X
- requirements: $P(x) \ge 0$ for every $x$, and $\sum_x P(x) = 1$
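
The two requirements on a distribution (no negative values, total mass of 1) can be checked mechanically. A minimal sketch follows; the helper name `is_probability_distribution` and the weather probabilities are assumptions made for illustration.

```python
def is_probability_distribution(dist, tol=1e-9):
    """Return True if every P(x) is non-negative and the values sum to 1 (within tol)."""
    non_negative = all(p >= 0.0 for p in dist.values())
    sums_to_one = abs(sum(dist.values()) - 1.0) <= tol
    return non_negative and sums_to_one

# Hypothetical weather distribution; the numbers are illustrative, not read off a chart.
weather = {"sun": 0.3, "clouds": 0.3, "rain": 0.2, "snow": 0.1, "sleet": 0.1}
print(is_probability_distribution(weather))
```

The tolerance is there because floating-point sums of decimal fractions are rarely exactly 1.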

- joint probability distribution: the function given by P(X = x, Y = y)
- read “X equals x and Y equals y”
- example: P(X = sun, Y = on-time) is the probability that it’s sunny and my flight is on time

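- the marginal distribution of X is defined by $P(x) = \sum_y P(x, y)$: “the distribution of X ignoring other variables”

- this definition generalizes to more than two variables, e.g. $P(x) = \sum_y \sum_z P(x, y, z)$

[Table: an example joint distribution and the marginal distribution for X derived from it]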

- the conditional distribution of X given Y is defined as $P(X = x \mid Y = y) = \frac{P(X = x, Y = y)}{P(Y = y)}$: “the distribution of X given that we know the value of Y”

[Table: an example joint distribution and the conditional distribution for X given Y = on-time]
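
Marginal and conditional distributions can both be read off a joint table: sum over the ignored variable for the marginal, and renormalize one slice of the table for the conditional. The sketch below assumes a made-up joint distribution over weather (X) and flight status (Y); the numbers and the function names are illustrative only.

```python
# Hypothetical joint distribution P(X = weather, Y = flight status).
# The numbers are illustrative only and are not taken from the slides.
joint = {
    ("sun", "on-time"): 0.20, ("sun", "late"): 0.05,
    ("rain", "on-time"): 0.20, ("rain", "late"): 0.15,
    ("snow", "on-time"): 0.05, ("snow", "late"): 0.35,
}

def marginal_x(joint):
    """P(x) = sum over y of P(x, y)."""
    result = {}
    for (x, _y), p in joint.items():
        result[x] = result.get(x, 0.0) + p
    return result

def conditional_x_given_y(joint, y):
    """P(x | y) = P(x, y) / P(y), for one fixed value y."""
    p_y = sum(p for (_x, y_val), p in joint.items() if y_val == y)
    return {x: p / p_y for (x, y_val), p in joint.items() if y_val == y}

print(marginal_x(joint))
print(conditional_x_given_y(joint, "on-time"))
```

In this made-up table, learning that the flight was on time raises the probability of sun above its marginal value, because most of the on-time mass sits in the sun and rain rows.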

- two random variables, X and Y, are independent if $P(x, y) = P(x)\,P(y)$ for all $x$ and $y$

[Table: a joint distribution with its marginal distributions]

Are X and Y independent here? NO.

[Table: a second joint distribution with its marginal distributions]

Are X and Y independent here? YES.
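
Independence can be tested directly from a joint table: compute both marginals and compare every joint entry with the product of the corresponding marginal entries. The two tables below are hypothetical stand-ins, one dependent and one independent; the function names are likewise assumptions of this sketch.

```python
from itertools import product

def marginals(joint):
    """Return (P(x), P(y)) computed from a joint table keyed by (x, y)."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    return px, py

def is_independent(joint, tol=1e-9):
    """True if P(x, y) = P(x) P(y) for every pair of values (within tol)."""
    px, py = marginals(joint)
    return all(abs(joint.get((x, y), 0.0) - px[x] * py[y]) <= tol
               for x, y in product(px, py))

# Hypothetical tables (illustrative numbers only).
dependent = {("sun", "on-time"): 0.5, ("sun", "late"): 0.1,
             ("rain", "on-time"): 0.1, ("rain", "late"): 0.3}
independent = {("sun", "on-time"): 0.42, ("sun", "late"): 0.18,
               ("rain", "on-time"): 0.28, ("rain", "late"): 0.12}

print(is_independent(dependent), is_independent(independent))
```

The second table was built as the product of the marginals (0.6, 0.4) and (0.7, 0.3), so every cell factorizes; in the first, P(sun, on-time) = 0.5 differs from 0.6 × 0.6.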

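- two random variables X and Y are conditionally independent given Z if $P(X \mid Y, Z) = P(X \mid Z)$
- “once you know the value of Z, knowing Y doesn’t tell you anything about X”

- alternatively, $P(x, y \mid z) = P(x \mid z)\,P(y \mid z)$ for all $x$, $y$ and $z$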

Are Fever and Headache independent? NO.

Are Fever and Vomit conditionally independent given Flu? YES.
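
The difference between independence and conditional independence can be seen numerically. The sketch below builds a joint distribution over Fever, Vomit and Flu from assumed parameters (all numbers are hypothetical), constructed so that the two symptoms are conditionally independent given Flu, and then checks both properties.

```python
from itertools import product

# Hypothetical parameters (not from the slides).
p_flu = {True: 0.1, False: 0.9}               # P(Flu = z)
p_fever_given_flu = {True: 0.9, False: 0.05}  # P(Fever = true | Flu = z)
p_vomit_given_flu = {True: 0.6, False: 0.02}  # P(Vomit = true | Flu = z)

def bernoulli(p_true, value):
    """Probability of a Boolean value when P(true) = p_true."""
    return p_true if value else 1.0 - p_true

# Joint built as P(z) P(f | z) P(v | z), so conditional independence holds by construction.
joint = {
    (f, v, z): p_flu[z] * bernoulli(p_fever_given_flu[z], f) * bernoulli(p_vomit_given_flu[z], v)
    for f, v, z in product([True, False], repeat=3)
}

def prob(fever=None, vomit=None, flu=None):
    """Sum the joint over entries matching the given values (None means summed out)."""
    return sum(p for (f, v, z), p in joint.items()
               if fever in (None, f) and vomit in (None, v) and flu in (None, z))

def conditionally_independent(tol=1e-9):
    """Check P(f, v | z) = P(f | z) P(v | z) for all values."""
    for f, v, z in product([True, False], repeat=3):
        pz = prob(flu=z)
        lhs = prob(fever=f, vomit=v, flu=z) / pz
        rhs = (prob(fever=f, flu=z) / pz) * (prob(vomit=v, flu=z) / pz)
        if abs(lhs - rhs) > tol:
            return False
    return True

def marginally_independent(tol=1e-9):
    """Check P(f, v) = P(f) P(v) for all values."""
    return all(abs(prob(fever=f, vomit=v) - prob(fever=f) * prob(vomit=v)) <= tol
               for f, v in product([True, False], repeat=2))

print(conditionally_independent(), marginally_independent())
```

Under these assumed numbers the symptoms are conditionally independent given Flu but not independent overall: seeing one symptom makes flu more likely, which in turn makes the other symptom more likely.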

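- chain rule, for two variables: $P(X, Y) = P(X \mid Y)\,P(Y)$
- for three variables: $P(X, Y, Z) = P(X \mid Y, Z)\,P(Y \mid Z)\,P(Z)$
- etc.
- to see that this is true, note that $P(X, Y, Z) = \frac{P(X, Y, Z)}{P(Y, Z)} \cdot \frac{P(Y, Z)}{P(Z)} \cdot P(Z)$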

- Bayes’ theorem: $P(x \mid y) = \frac{P(y \mid x)\,P(x)}{P(y)} = \frac{P(y \mid x)\,P(x)}{\sum_{x'} P(y \mid x')\,P(x')}$
- this theorem is extremely useful
- there are many cases when it is hard to estimate P(x | y) directly, but it’s not too hard to estimate P(y | x) and P(x)

- MDs usually aren’t good at estimating P(Disorder | Symptom)
- they’re usually better at estimating P(Symptom | Disorder)
- if we can estimate P(Fever | Flu) and P(Flu) we can use Bayes’ Theorem to do diagnosis
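
A diagnosis of this kind can be sketched in a few lines. Note that computing P(Fever) also needs P(Fever | no Flu), which the bullets above do not mention; it appears here as an extra assumed input, and all three numbers in the example are hypothetical.

```python
def posterior(p_symptom_given_disorder, p_disorder, p_symptom_given_no_disorder):
    """P(Disorder | Symptom) by Bayes' theorem.

    P(Symptom) is expanded with the law of total probability:
    P(s) = P(s | d) P(d) + P(s | not d) (1 - P(d)).
    """
    p_symptom = (p_symptom_given_disorder * p_disorder
                 + p_symptom_given_no_disorder * (1.0 - p_disorder))
    return p_symptom_given_disorder * p_disorder / p_symptom

# Hypothetical estimates: P(Fever | Flu) = 0.9, P(Flu) = 0.1, P(Fever | no Flu) = 0.05.
print(posterior(0.9, 0.1, 0.05))
```

With these assumed inputs the posterior P(Flu | Fever) comes out at about two thirds, much higher than the prior of 0.1, even though only the easier-to-estimate quantities were supplied.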

- the expected value of a random variable that takes on numerical values is defined as $E[X] = \sum_x x\,P(x)$; this is the same thing as the mean

- we can also talk about the expected value of a function of a random variable: $E[g(X)] = \sum_x g(x)\,P(x)$

- Suppose each lottery ticket costs $1 and the winning ticket pays out $100. The probability that a particular ticket is the winning ticket is 0.001.
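
The expected gain from buying one ticket can be worked out as the expected value of a gain function. This sketch assumes the $100 payout is gross, so a winning ticket nets $99 after the $1 cost and a losing ticket nets -$1; the helper name `expected_value` is hypothetical.

```python
def expected_value(outcomes):
    """E[g(X)] = sum of g(x) * P(x) over (value, probability) pairs."""
    return sum(value * p for value, p in outcomes)

P_WIN = 0.001
# Assumption: gain = payout minus ticket price, i.e. +99 for a win and -1 for a loss.
gain = [(100.0 - 1.0, P_WIN), (-1.0, 1.0 - P_WIN)]

print(expected_value(gain))
```

Under that assumption the expected gain is 99 × 0.001 − 1 × 0.999, about −$0.90 per ticket.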

- binomial distribution: the distribution over the number of successes in a fixed number n of independent trials (with the same probability of success p in each): $P(x) = \binom{n}{x} p^x (1 - p)^{n - x}$
- e.g. the probability of x heads in n coin flips

[Figure: two binomial distributions, one with p = 0.5 and one with p = 0.1; horizontal axis x, vertical axis P(X = x)]
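
The probability of x successes in n trials follows the standard binomial formula, P(x) = C(n, x) p^x (1 − p)^(n − x). A minimal sketch using only the standard library; the choice of n = 10 for the printed examples is arbitrary.

```python
from math import comb

def binomial_pmf(x, n, p):
    """P(X = x) for x successes in n independent trials with success probability p."""
    return comb(n, x) * p**x * (1.0 - p) ** (n - x)

# Compare a fair coin (p = 0.5) with a rare event (p = 0.1) over n = 10 trials.
n = 10
for p in (0.5, 0.1):
    print(p, [round(binomial_pmf(x, n, p), 4) for x in range(n + 1)])
```

For p = 0.5 the probabilities are symmetric around n/2, while for p = 0.1 most of the mass sits at small x.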

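- multinomial distribution: k possible outcomes on each trial
- probability $p_i$ for outcome $x_i$ in each trial
- distribution over the number of occurrences $x_i$ for each outcome in a fixed number n of independent trials: $P(x_1, \ldots, x_k) = \frac{n!}{\prod_i x_i!} \prod_i p_i^{x_i}$, where $(x_1, \ldots, x_k)$ is the vector of outcome occurrences
- e.g. with k=6 (a six-sided die) and n=30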
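
The die example can be evaluated with the standard multinomial formula, n! / (x_1! ⋯ x_k!) × p_1^{x_1} ⋯ p_k^{x_k}. The sketch below assumes a fair die and asks for the probability that each face appears exactly 5 times in 30 rolls; both the fairness and that particular count vector are choices made for this example.

```python
from math import factorial, prod

def multinomial_pmf(counts, probs):
    """Probability of observing counts[i] occurrences of outcome i in sum(counts) trials."""
    if len(counts) != len(probs):
        raise ValueError("counts and probs must have the same length")
    coefficient = factorial(sum(counts))
    for c in counts:
        coefficient //= factorial(c)  # each step divides exactly
    return coefficient * prod(p**c for p, c in zip(probs, counts))

# Assumed fair six-sided die, n = 30 rolls, every face seen exactly 5 times.
print(multinomial_pmf([5] * 6, [1 / 6] * 6))
```

With k = 2 this reduces to the binomial case, which gives a quick sanity check.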

Q: How do we assess whether an alignment provides good evidence for homology?

A: determine how likely it is that such an alignment score would result from chance.

What is “chance”?

- real but non-homologous sequences
- real sequences shuffled to preserve compositional properties
- sequences generated randomly based upon a DNA/protein sequence model

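- unrelated model (U): we’ll assume that each position in the alignment is sampled randomly from some distribution of amino acids
- let $q_a$ be the probability of amino acid a
- the probability of an n-character alignment of x and y is given by $P(x, y \mid U) = \prod_{i=1}^{n} q_{x_i} \prod_{i=1}^{n} q_{y_i}$

- related model (R): we’ll assume that each pair of aligned amino acids evolved from a common ancestor
- let $p_{ab}$ be the probability that evolution gave rise to amino acid a in one sequence and b in another sequence
- the probability of an alignment of x and y is given by $P(x, y \mid R) = \prod_{i=1}^{n} p_{x_i y_i}$

- How can we decide which possibility (U or R) is more likely?
- one principled way is to consider the relative likelihood of the two possibilities: $\frac{P(x, y \mid R)}{P(x, y \mid U)} = \prod_{i=1}^{n} \frac{p_{x_i y_i}}{q_{x_i} q_{y_i}}$
- taking the log, we get $\log \frac{P(x, y \mid R)}{P(x, y \mid U)} = \sum_{i=1}^{n} \log \frac{p_{x_i y_i}}{q_{x_i} q_{y_i}}$

- the score for an alignment is thus given by: $S = \sum_{i=1}^{n} s(x_i, y_i)$

- the substitution matrix score for the pair a, b should thus be given by: $s(a, b) = \log \frac{p_{ab}}{q_a q_b}$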
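
The link between per-pair scores and the likelihood ratio can be checked on a toy example. The sketch writes `q[a]` for the background probability of residue a and `p[(a, b)]` for the probability of the aligned pair (a, b) under the common-ancestor model; these names, the two-letter alphabet and all numbers are assumptions for illustration, not real amino-acid statistics.

```python
import math

# Hypothetical two-letter alphabet with made-up probabilities.
q = {"A": 0.5, "B": 0.5}                       # background frequencies
p = {("A", "A"): 0.4, ("A", "B"): 0.1,
     ("B", "A"): 0.1, ("B", "B"): 0.4}         # aligned-pair probabilities

def substitution_score(a, b):
    """Log-odds score s(a, b) = log(p_ab / (q_a q_b)), natural log."""
    return math.log(p[(a, b)] / (q[a] * q[b]))

def alignment_score(x, y):
    """Sum of substitution scores over the columns of an ungapped alignment."""
    if len(x) != len(y):
        raise ValueError("aligned sequences must have equal length")
    return sum(substitution_score(a, b) for a, b in zip(x, y))

def log_likelihood_ratio(x, y):
    """log of P(x, y | related) / P(x, y | unrelated), computed from the two products."""
    related = math.prod(p[(a, b)] for a, b in zip(x, y))
    unrelated = math.prod(q[a] for a in x) * math.prod(q[b] for b in y)
    return math.log(related / unrelated)

print(alignment_score("AAB", "ABB"), log_likelihood_ratio("AAB", "ABB"))
```

The two printed values agree up to rounding, which is the point of the derivation: summing log-odds scores per column is the same as taking the log of the likelihood ratio. The base of the logarithm only rescales the scores.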

- suppose we assume
- sequence lengths m and n
- a particular substitution matrix and amino-acid frequencies

- and we consider generating random sequences of lengths m and n and finding the best alignment of these sequences
- this will give us a distribution over alignment scores for random pairs of sequences

- but we’re picking the best alignments, so we want to know what the distribution of max scores for alignments against a random set of sequences looks like
- this is given by an extreme value distribution

- the expected number of alignments, E, with score at least S is given by: $E = K m n\, e^{-\lambda S}$

- S is a given score threshold
- m and n are the lengths of the sequences under consideration
- K and $\lambda$ are constants that can be calculated from
- the substitution matrix
- the frequencies of the individual amino acids
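
Assuming the formula meant here is the standard Karlin-Altschul expression E = K m n e^{−λS}, the expected number of chance alignments can be computed as below. The values of `K` and `lam` are placeholders picked for illustration; real values depend on the substitution matrix and the amino-acid frequencies.

```python
import math

def expected_alignments(score, m, n, K, lam):
    """E = K * m * n * exp(-lam * score): expected number of chance alignments scoring >= score."""
    return K * m * n * math.exp(-lam * score)

# Placeholder constants (assumed, not derived from any real substitution matrix).
K, lam = 0.1, 0.3
# Query of length 250 against 1,000,000 residues of sequence (hypothetical sizes).
for s in (20, 40, 60):
    print(s, expected_alignments(s, m=250, n=1_000_000, K=K, lam=lam))
```

E grows linearly with the size of the search space m × n and falls off exponentially with the score threshold, so a modest increase in S can turn thousands of expected chance hits into far fewer than one.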

- to generalize this to searching a database, have n represent the summed length of the sequences in the DB (adjusting for edge effects)
- the NCBI BLAST server does just this
- the theory for gapped alignments is not as well developed
- computational experiments suggest this analysis holds for gapped alignments (but K and $\lambda$ must be estimated from data)