Basic Model For Genetic Linkage Analysis Lecture #5

1 / 31

# Basic Model For Genetic Linkage Analysis Lecture #5 - PowerPoint PPT Presentation

Basic Model For Genetic Linkage Analysis Lecture #5. Prepared by Dan Geiger. Using the Maximum Likelihood Approach. The probability of pedigree data Pr(data |  ) is a function of the known and unknown recombination fractions denoted collectively by .

I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.

## Basic Model For Genetic Linkage Analysis Lecture #5

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript

### Basic Model For Genetic Linkage AnalysisLecture #5

.

Prepared by Dan Geiger

Using the Maximum Likelihood Approach

The probability of pedigree data Pr(data |  ) is a function of the known and unknown recombination fractions denoted collectively by .

How can we construct this likelihood function ?

The maximum likelihood approach is to seek the value of  which maximizes the likelihood function Pr(data |  ) . This is the ML estimate.

Constructing the Likelihood function

First, we need to determine the variables that describe the problem. There are many possible choices. Some variables we can observe and some we cannot.

Lijm = Maternal allele at locus i of person j. The values of this variables are the possible alleles li at locus i.

Lijf = Paternal allele at locus i of person j. The values of this variables are the possible alleles li at locus i (Same as for Lijm) .

Xij= Unordered allele pair at locus i of person j. The values are pairs of ith-locus alleles (li,l’i).

As a starting point, We assume that the data consists of an assignment to a subset of the variables {Xij}. In other words some (or all) persons are genotyped at some (or all) loci.

Maternal allele at locus 1 of person 1

Paternal allele at locus 1 of person 1

L11f

L11m

P(L11m = a) is the frequency of allele a. We use lower case letters for states writing, in short, P(l11m).

Unordered allele pair at locus 1 of person 1 = data

X11

P(x11 | l11m, l11f) = 0 or 1 depending on consistency

What is the relationships among the variables across individuals ?

L11m

L11f

L12m

L12f

Mother

Father

X11

X12

L13f

L13m

Offspring

X12

P(l13m | l11m, l11f) = 1/2 if l13m = l11m or l13m = l11f

P(l13m | l11m, l11f) = 0 otherwise

First attempt: correct but not efficient as we shall see.

L11m

L11f

L12m

L12f

X11

X12

L13f

L13m

X13

Model for locus 1

L21m

L21f

L22m

L22f

X21

X22

L23m depends on whether

L13m got the value from L11m or L11f, whether a recombination occurred,

and on the values of L21m and L21f. This is quite complex.

L23f

L23m

X23

Model for locus 2

Probabilistic model for two loci

L11f

L11m

Selector of maternal allele at locus 1 of person 3

X11

S13m

P(s13m) = ½

L13m

Maternal allele at locus 1 of person 3 (offspring)

Selector variables Sijm are 0 or 1 depending on whose allele is transmitted to offspring i at maternal locus j.

P(l13m | l11m, l11f,,S13m=0) = 1 if l13m = l11m

P(l13m | l11m, l11f,,S13m=1) = 1 if l13m = l11f

P(l13m | l11m, l11f,,s13m) = 0 otherwise

L11m

L11f

L12m

L12f

X11

S13m

X12

S13f

L13f

L13m

X13

L21m

L21f

L22m

L22f

X21

S23m

X22

S23f

L23f

L23m

X23

Model for locus 2

Probabilistic model for two loci

Model for locus 1

L21m

L11m

L21f

L11f

L12m

L22m

L12f

L22f

X21

X11

S13m

S23m

X22

X12

S13f

S23f

L13f

L23f

L13m

L23m

X13

X23

Probabilistic Model for Recombination

 is the recombination fraction between loci 2 & 1.

L11f

L11m

X11

S13m

L13m

P(l11m, l11f,, x11, s13m,l13m) =

Joint probability

P(l11m) P(l11f) P(x11 | l11m, l11f,) P(s13m)P(l13m | s13m, l11m, l11f)

Probability of data (sum over all states of all hidden variables)

Prob(data) = P(x11) =

l11m  l11f s13m  l13m P(l11m, l11f,, x11, s13m,l13m)

Constructing the likelihood function I

Observed variable

All other variables are not-observed (hidden)

Prob(data| 2) = P(x11 ,x12 ,x13 ,x21 ,x22 ,x23) =

Prob(data| 2) = P(x11 ,x12 ,x13 ,x21 ,x22 ,x23) = l11m, l11f … s23f [P(l11m) P(l11f) P(x11 | l11m, l11f,) … P(s13m) P(s13f) P(s23m |s13m, 2) P(s23m |s13m, 2) ]

Constructing the likelihood function II

P(l11m,l11f,x11,l12m,l12f,x12,l13m,l13f,x13, l21m,l21f,x21,l22m,l22f,x22,l23m,l23f,x23,

s13m,s13f,s23m,s23f, 2) =

Product over all local probability tables

= P(l11m) P(l11f) P(x11 | l11m, l11f,) … P(s13m) P(s13f) P(s23m |s13m, 2) P(s23m |s13m, 2)

The result is a function of the recombination fraction. The ML estimate is the 2 value that maximizes this function.

Modeling Phenotypes I

L11f

L11m

X11

S13m

Y11

L13m

Phenotype variables Yij are 0 or 1 depending on whether a phenotypic trait associated with locus i of person j is observed. E.g., sick versus healthy. For example model of perfect recessive disease yields the penetrance probabilities:

P(y11 = sick | X11= (a,a)) = 1

P(y11 = sick | X11= (A,a)) = 0

P(y11 = sick | X11= (A,A)) = 0

Modeling Phenotypes II

L11f

L11m

X11

S13m

Y11

L13m

Note that in this model we assume the phenotype depends only on the alleles of one locus. Also we did not model levels of sickness. We did not model continuous phenotypic observations either.

L21m

L11m

L21f

L11f

L12m

L22m

L12f

L22f

X21

X11

S13m

S23m

X12

X22

S13f

S23f

L13f

L23f

L13m

L23m

X23

X13

Introducing a tentative disease Locus

Marker locus

Disease locus: assume

sick means xij=(a,a)

The recombination fraction 2 is unknown. Finding it can help determine whether a gene causing the disease lies in the vicinity of the marker locus.

Li1m

Li1f

Li2m

Li2f

Xi1

Si3m

Xi2

Si3f

Li3f

Li3m

Xi3

2

3

4

1

Locus-by-Locus Summation order

Sum over locus i vars before summing over locus i+1 vars

Sum over orange vars (Lijt) before summing selector vars (Sijt).

This order yields a Hidden Markov Model (HMM).

S1

S2

S3

Si-1

Si

Si+1

R1

R2

R3

Ri-1

Ri

Ri+1

X1

X1

X2

X2

X3

X3

Xi-1

Xi-1

Xi

Xi

Xi+1

Xi+1

Hidden Markov Models in General

Which depicts the factorization:

Application in communication: message sent is (s1,…,sm) but we receive (r1,…,rm) . Compute what is the most likely message sent ?

Application in speech recognition: word said is (s1,…,sm) but we recorded (r1,…,rm) . Compute what is the most likely word said ?

Application in Genetic linkage analysis: to be discussed now.

S1

S2

S3

Si-1

Si

Si+1

X1

X2

X3

Xi-1

Xi

Xi+1

X1

X1

X2

X2

X3

X3

Xi-1

Xi-1

Xi

Xi

Xi+1

Xi+1

Hidden Markov Model In our case

The compounded variable Si = (Si,1,m,…,Si,2n,f)is called the inheritance vector. It has 22n states where n is the number of persons that have parents in the pedigree (non-founders). The compounded variable Xi = (Xi,1,m,…,Xi,2n,f) is the data regarding locus i.

To specify the HMM we need to write down the transition matrices from Si-1 to Si and the matrices P(xi|Si). Note that these quantities have already been implicitly defined.

(The Kronecker product)

For n non-founders, the transition matrix is the n-fold Kronecker product:

The transition matrix

Recall that:

Therefore, in our example, where we have one non-founder (n=1), the transition probability table size is 4  4 = 22n 22n,encoding four options of recombination/non-recombination for the two parental meiosis:

L21m

L21f

L22m

L22f

X21

S23m

X22

S23f

=  P(l21m)P(l21f)P(l22m)P(l22f) P(x21 | l21m, l21f)

P(x22 | l22m, l22f) P(x23 | l23m, l23f) P(l23m | l21m, l21f, S23m) P(l23f | l22m, l22f, S23f)

L23f

L23m

l21m,l21f,l22m,l22f l22m,l22f

X23

Model for locus 2

Probability of data in one locus given an inheritance vector

P(x21, x22 , x23 |s23m,s23f) =

The five last terms are always zero-or-one, namely, indicator functions.

H1

H2

HL-1

HL

Hi

X1

X2

XL-1

XL

Xi

1. Compute the posteriori belief in Hi (specific i) given the evidence {x1,…,xL} for each of Hi’s values hi, namely, compute p(hi | x1,…,xL).

2. Do the same computation for every Hibut without repeating the first task L times.

The solution is called the forward-backward algorithm.

Posterior decoding

The standard query for HMM is belief update (also called posterior decoding).

H1

H2

HL-1

HL

Hi

X1

X2

XL-1

XL

Xi

Likelihood of evidence

To compute the likelihood of evidence P(x1,…,xL), which depends on the recombination fractions in our case, we will use either the forward or the backward algorithm to be described now.

H1

H2

HL-1

HL

Hi

X1

X2

XL-1

XL

Xi

= P(x1,…,xi,hi) P(xi+1,…,xL | hi)

 f(hi) b(hi)

Decomposing the computation

Answer: P(hi | x1,…,xL) = (1/K) P(x1,…,xL,hi)

where K= hiP(x1,…,xL,hi).

P(x1,…,xL,hi) = P(x1,…,xi,hi) P(xi+1,…,xL | x1,…,xi,hi)

Equality due to Ind({xi+1,…,xL}, {x1,…,xi} | Hi}

H1

H2

Hi

X1

X2

Xi

The task: Compute f(hi) = P(x1,…,xi,hi) for i=1,…,L (namely, considering evidence up to time slot i).

P(x1, h1) = P(h1) P(x1|h1) {Basis step}

P(x1,x2,h2) =  P(x1,h1,h2,x2) {Second step}

=  P(x1,h1) P(h2 | x1,h1) P(x2 | x1,h1,h2)

h1

h1

Last equality due to conditional independence

=  P(x1,h1) P(h2 | h1) P(x2 | h2)

h1

{step i}

P(x1,…,xi,hi) =  P(x1,…,xi-1, hi-1) P(hi | hi-1 ) P(xi | hi)

hi-1

The forward algorithm

Hi

Hi+1

HL-1

HL

Xi+1

XL-1

XL

P(xL| hL-1) =  P(xL ,hL |hL-1) =  P(hL |hL-1) P(xL |hL-1,hL )=

hL

hL

Last equality due to conditional independence

=  P(hL |hL-1) P(xL |hL ) {first step}

hL

{step i}

P(xi+1,…,xL|hi) =  P(hi+1 | hi) P(xi+1 | hi+1) P(xi+2,…,xL| hi+1)

hi+1

=b(hi)=

=b(hi+1)=

The backward algorithm

The task: Compute b(hi) = P(xi+1,…,xL|hi) for i=L-1,…,1 (namely, considering evidence after time slot i).

H1

H2

HL-1

HL

Hi

X1

X2

XL-1

XL

Xi

1. To Compute the posteriori belief in Hi (specific i) given the evidence {x1,…,xL} run the forward algorithm and compute f(hi) = P(x1,…,xi,hi), run the backward algorithm to compute b(hi) = P(xi+1,…,xL|hi), the product f(hi)b(hi) is the answer (for every possible value hi).

2. To Compute the posteriori belief for every Hisimply run the forward and backward algorithms once, storing f(hi) and b(hi) for every i (and value hi). Compute f(hi)b(hi) for every i.

H1

H2

HL-1

HL

Hi

X1

X2

XL-1

XL

Xi

Likelihood of evidence revisited
• To compute the likelihood of evidence P(x1,…,xL), do one more step in the forward algorithm, namely,
•  f(hL) =  P(x1,…,xL,hL)
• 2. Alternatively, do one more step in the backward algorithm, namely,
•  b(h1) P(h1) P(x1|h1) =  P(x2,…,xL|h1) P(h1) P(x1|h1)

hL

hL

h1

h1

H1

H2

HL-1

HL

Hi

X1

X2

XL-1

XL

Xi

Time and Space Complexity of the forward/backward algorithms

Time complexity is linear in the length of the chain, provided the number of states of each variable is a constant. More precisely, time complexity is O(k2n) where k is the maximum domain size of each variable.

Space complexity is also O(k2n).

In our case: O(k2n) is really O(24nL). Next class we will see how to save on these computations using the special matrices we have. The savings have been implemented in GeneHunter – a leading software of linkage.

H1

H2

HL-1

HL

Hi

X1

X2

XL-1

XL

Xi

• Recall that the query asking likelihood of evidence is to compute P(x1,…,xL) =  P(x1,…,xL, h1,…,hL)
• Now we wish to compute a similar quantity:
• P*(x1,…,xL) = MAXP(x1,…,xL, h1,…,hL)

(h1,…,hL)

(h1,…,hL)

The Maximum APosteriori query

h1

h3

h2

= P(h1)P(x1|h1) b(h2)P(h1|h2)P(x2|h2)

h1

h2

=  b(h1) P(h1)P(x1|h1)

h1

Example: Revisiting likelihood of evidence

H1

H2

H3

X1

X2

X3

maximum = max P(h1)P(x1|h1) max P(h2|h1)P(x2|h2) max P(h3 |h2)P(x3|h3)

h1

h3

h2

x* (h1)

x* (h2)

h2

h3

= max P(h1)P(x1|h1) max b (h2)P(h1|h2)P(x2|h2)

h3

h1

h2

{Finding the maximum}

= max b (h1)P(h1)P(x1|h1)

h2

h1

{Finding the map assignment}

h1* = arg max b (h1)P(h1)P(x1|h1)

h2

h1

h2* = x* (h1*);

h3* = x* (h2*)

h2

h3

Example: Computing the MAP assignment

H1

H2

H3

X1

X2

X3

Replace sums with taking maximum:

H1

H2

HL-1

HL

Hi

Backward phase:

b (hL) = 1

hL+1

X1

X2

XL-1

XL

Xi

For i=L-1 downto 1 do

b (hi) = MAX P(hi+1 | hi) P(xi+1 | hi+1) b (hi+1)

hi+1

hi+2

hi+1

x* (hi) = ARGMAX P(hi+1 | hi) P(xi+1 | hi+1) b (hi+1)

hi+1

hi+2

hi+1

(Storing the best value as a function of the parent’s values)

Forward phase (Tracing the MAP assignment) :

h1* = ARG MAX P(h1) P(x1|h1) b (h1)

h2

h2

For i=1 to L-1 do

hi+1* = x* (hi *)

hi+1

Viterbi’s algorithm