3-page Detailed Project Outline & Preliminary Results


Presentation Transcript
Slide 1

3-page Detailed Project Outline & Preliminary Results

Due Tuesday, November 7

Lab Next week (Nov 2): help with projects


Slide 2

First, representation of motifs: Position-specific Weight Matrices (PWMs, aka Position-Specific Scoring Matrices, PSSMs)

Site 1   A G A T G G A T G G
Site 2   T G A T T G A T G T
Site 3   T G A T G G A T G G
Site 4   A G A T T G A T C G
Site 5   T G A T G G A T T G
Site 6   T G A T G G A T T G
Site 7   A G A T G G A T T G

PWM represents frequencies of each base at each position in the motif *

G    0    1.0  0    0    0.7  1.0  0    0    0.4  0.8
A    0.4  0    1.0  0    0    0    1.0  0    0    0
T    0.6  0    0    1.0  0.3  0    0    1.0  0.4  0.2
C    0    0    0    0    0    0    0    0    0.2  0

* These days, PWM/PSSM can correspond to the frequency matrix or a likelihood matrix
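
The frequency matrix above can be rebuilt from the seven aligned sites with a few lines of Python. This is a minimal sketch; the names `SITES`, `BASES`, and `frequency_matrix` are illustrative and do not come from the slides. It prints exact fractions to two decimals (for example 3/7 = 0.43), so small differences from the slide's one-decimal values are expected.

```python
from collections import Counter

# The seven aligned sites from the slide.
SITES = [
    "AGATGGATGG",
    "TGATTGATGT",
    "TGATGGATGG",
    "AGATTGATCG",
    "TGATGGATTG",
    "TGATGGATTG",
    "AGATGGATTG",
]
BASES = "GATC"  # row order used on the slide


def frequency_matrix(sites):
    """Return {base: [frequency at each position]} for equal-length sites."""
    n = len(sites)
    width = len(sites[0])
    pwm = {b: [0.0] * width for b in BASES}
    for pos in range(width):
        counts = Counter(site[pos] for site in sites)
        for b in BASES:
            pwm[b][pos] = counts[b] / n
    return pwm


pwm = frequency_matrix(SITES)
for b in BASES:
    print(b, " ".join(f"{f:.2f}" for f in pwm[b]))
```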


Slide 3

Information content (IC)

The least variable positions likely are important for specifying the protein-DNA interaction

Therefore high information content = low sequence variation at that position.

G    0    1.0  0    0    0.7  1.0  0    0    0.4  0.8
A    0.4  0    1.0  0    0    0    1.0  0    0    0
T    0.6  0    0    1.0  0.3  0    0    1.0  0.4  0.2
C    0    0    0    0    0    0    0    0    0.2  0

IC   1.0  2.0  2.0  2.0  1.1  2.0  2.0  2.0  0.5  1.3

Sum of IC over all positions = bit score of 15.9

Information Profile: [figure: information content (bits) plotted against position]
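
The slide does not spell out the formula for information content. A common definition, assumed here, is IC = 2 - H bits per position, where H is the Shannon entropy of the base frequencies and 2 bits is the maximum for four equally likely bases. The sketch below applies that assumption to the frequency matrix as printed; `PWM` and `information_content` are illustrative names.

```python
import math

# Frequency matrix as printed on the slide (columns = positions 1..10).
PWM = {
    "G": [0, 1.0, 0, 0, 0.7, 1.0, 0, 0, 0.4, 0.8],
    "A": [0.4, 0, 1.0, 0, 0, 0, 1.0, 0, 0, 0],
    "T": [0.6, 0, 0, 1.0, 0.3, 0, 0, 1.0, 0.4, 0.2],
    "C": [0, 0, 0, 0, 0, 0, 0, 0, 0.2, 0],
}


def information_content(pwm):
    """Per-position IC in bits, assuming a uniform background (max 2 bits)."""
    width = len(next(iter(pwm.values())))
    ic = []
    for pos in range(width):
        # Shannon entropy of the column; zero frequencies contribute nothing.
        entropy = -sum(
            row[pos] * math.log2(row[pos]) for row in pwm.values() if row[pos] > 0
        )
        ic.append(2.0 - entropy)
    return ic


ic = information_content(PWM)
print([round(x, 1) for x in ic])  # per-position IC
print(round(sum(ic), 1))          # total bit score
```

Under this assumption the rounded per-position values and their sum agree with the IC row and the bit score of 15.9 given on the slide.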


Slide 4

Pseudo-counts: protecting against overfitting due to small sample sizes

Add 1 count to each base at each position, then divide by n + 4

Site 1   A G A T G G A T G G
Site 2   T G A T T G A T G T
Site 3   T G A T G G A T G G
Site 4   A G A T T G A T C G
Site 5   T G A T G G A T T G
Site 6   T G A T G G A T T G
Site 7   A G A T G G A T T G

With pseudo-counts (rounded values):

G    0.1  0.7  0.1  0.1  0.4   0.7  0.1  0.1  0.3  0.7
A    0.3  0.1  0.7  0.1  0.1   0.1  0.7  0.1  0.1  0.1
T    0.4  0.1  0.1  0.7  0.25  0.1  0.1  0.7  0.3  0.2
C    0.1  0.1  0.1  0.1  0.1   0.1  0.1  0.1  0.2  0.1
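
The pseudo-count rule (add 1 count to each base at each position, then divide by n + 4) can be sketched as follows. With n = 7 sites the denominator is 11, so an unobserved base gets 1/11 and a fully conserved base gets 8/11. The function name and the `pseudo` parameter are illustrative; the code prints exact fractions, which may differ slightly from the slide's rounded values.

```python
from collections import Counter

SITES = [
    "AGATGGATGG",
    "TGATTGATGT",
    "TGATGGATGG",
    "AGATTGATCG",
    "TGATGGATTG",
    "TGATGGATTG",
    "AGATGGATTG",
]
BASES = "GATC"


def pseudocount_matrix(sites, pseudo=1):
    """Frequencies with `pseudo` counts added to every base at every position."""
    n = len(sites)
    width = len(sites[0])
    denom = n + pseudo * len(BASES)  # n + 4 when pseudo == 1
    pwm = {b: [] for b in BASES}
    for pos in range(width):
        counts = Counter(site[pos] for site in sites)
        for b in BASES:
            pwm[b].append((counts[b] + pseudo) / denom)
    return pwm


pc = pseudocount_matrix(SITES)
for b in BASES:
    print(b, " ".join(f"{f:.2f}" for f in pc[b]))
```

Because no entry is zero any more, a single mismatch at a conserved position no longer forces the probability of a candidate site to zero.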

Various programs for finding instances of a matrix (PATSER, MAST, ScanAce)


Slide 5

Finding matches to (instances of) a PWM

G    0    1.0  0    0    0.7  1.0  0    0    0.4  0.8
A    0.4  0    1.0  0    0    0    1.0  0    0    0
T    0.6  0    0    1.0  0.3  0    0    1.0  0.4  0.2
C    0    0    0    0    0    0    0    0    0.2  0

Is the sequence A G A T T G A T C T a match to this matrix?

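Joint probability: assuming each position is independent,

$P(\text{motif}) = \prod_i P_b(i)$, where $i$ runs over the positions of the motif and $b \in \{G, A, T, C\}$ is the base observed at position $i$

Background model: $P(G) = P(A) = P(T) = P(C) = 0.25$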

P(sequence | matrix model ) = (0.4)(1.0)(1.0)(1.0)(0.3)(1.0)(1.0)(1.0)(0.2)(0.2) = 0.0048

P(sequence | background model ) = (0.25)(0.25)(0.25)(0.25)(0.25)(0.25)(0.25)(0.25)(0.25)(0.25) = 9.5e-7
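
The two probabilities can be checked numerically. The sketch below multiplies the per-position frequencies for the query sequence under the matrix model and under a uniform background (0.25 per base, so 0.25^10 ≈ 9.5e-7). Taking the ratio of the two as a match score is an addition here, not something stated on the slide; all function names are illustrative.

```python
# Frequency matrix as printed on the slide (columns = positions 1..10).
PWM = {
    "G": [0, 1.0, 0, 0, 0.7, 1.0, 0, 0, 0.4, 0.8],
    "A": [0.4, 0, 1.0, 0, 0, 0, 1.0, 0, 0, 0],
    "T": [0.6, 0, 0, 1.0, 0.3, 0, 0, 1.0, 0.4, 0.2],
    "C": [0, 0, 0, 0, 0, 0, 0, 0, 0.2, 0],
}
BACKGROUND = {"G": 0.25, "A": 0.25, "T": 0.25, "C": 0.25}


def prob_given_matrix(seq, pwm):
    """Product of the matrix frequencies of the observed base at each position."""
    p = 1.0
    for pos, base in enumerate(seq):
        p *= pwm[base][pos]
    return p


def prob_given_background(seq, background):
    """Product of position-independent background frequencies."""
    p = 1.0
    for base in seq:
        p *= background[base]
    return p


seq = "AGATTGATCT"
p_motif = prob_given_matrix(seq, PWM)
p_background = prob_given_background(seq, BACKGROUND)
ratio = p_motif / p_background  # > 1 means the matrix explains the sequence better
print(p_motif, p_background, ratio)
```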


Slide 6

Motif finding methods and algorithms

Given a set of n promoters of n coregulated genes, find a motif common to the promoters.

Both the PWM and the motif sequences are unknown.

Common methods:

1. Enumeration:

Simplest case: look at the frequency of all n-mers

2. EM algorithms (MEME):

Iteratively home in on the most likely motif model

3. Gibbs sampling methods (AlignAce, BioProspector)

Iteratively replace (‘sample’) sites to retrain the matrix
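
The enumeration approach in its simplest form can be sketched as a count of every n-mer across the input promoters. The promoter strings below are hypothetical toy data, and `count_kmers` is an illustrative name; a real analysis would also compare the counts with a background expectation.

```python
from collections import Counter


def count_kmers(sequences, k):
    """Count every overlapping k-mer across all input sequences."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


# Hypothetical toy promoters, for illustration only.
promoters = ["TTAGATGGATGGCC", "CCTGATTGATGTAA", "GGTGATGGATGGTT"]
counts = count_kmers(promoters, 4)
print(counts.most_common(3))
```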


Slide 7

Motif finding using the EM algorithm: MEME

(Bailey & Elkan 1995)

http://meme.sdsc.edu/meme/intro.html

  • EM algorithm: Expectation-Maximization

    • In one run, trains the matrix model and identifies examples of the matrix

MEME works by iteratively refining matrix and identifying sites:

1. Estimate motif model

a. Start with an n-mer seed (random or specified)

b. Build a matrix by incorporating some of the background frequencies

2. Identify examples of the model

a. For every n-mer in the input set, identify its probability given the matrix model

3. Re-estimate the motif model

a. Calculate a new matrix, based on the weighted frequencies of all n-mers in the set

4. Iteratively refine the matrix and identify sites until convergence.
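
The four steps above can be sketched as a toy EM loop. This is not MEME's implementation: it assumes exactly one site per sequence, a uniform background, a fixed seed weight of 0.7, and a small pseudo-count, all of which are choices made here for illustration. The function names and the toy sequences are hypothetical.

```python
BASES = "ACGT"


def seed_matrix(seed, weight=0.7):
    """Step 1: matrix from a seed w-mer, mixed with uniform background."""
    rest = (1.0 - weight) / 3
    return [{b: (weight if b == s else rest) for b in BASES} for s in seed]


def wmer_prob(wmer, pwm):
    """Probability of a w-mer given the matrix (positions independent)."""
    p = 1.0
    for col, base in zip(pwm, wmer):
        p *= col[base]
    return p


def em_step(seqs, pwm, pseudo=0.1):
    """Steps 2 and 3: weight every w-mer, then re-estimate the matrix."""
    w = len(pwm)
    counts = [{b: pseudo for b in BASES} for _ in range(w)]
    for seq in seqs:
        wmers = [seq[i:i + w] for i in range(len(seq) - w + 1)]
        probs = [wmer_prob(m, pwm) for m in wmers]
        total = sum(probs)
        for m, p in zip(wmers, probs):
            for pos, base in enumerate(m):
                counts[pos][base] += p / total  # weighted count
    return [{b: col[b] / sum(col.values()) for b in BASES} for col in counts]


def run_em(seqs, seed, iters=50, tol=1e-6):
    """Step 4: iterate until the matrix stops changing."""
    pwm = seed_matrix(seed)
    for _ in range(iters):
        new = em_step(seqs, pwm)
        delta = max(abs(new[i][b] - pwm[i][b]) for i in range(len(pwm)) for b in BASES)
        pwm = new
        if delta < tol:
            break
    return pwm


def consensus(pwm):
    return "".join(max(col, key=col.get) for col in pwm)


# Hypothetical toy input with a planted GATTGA site in each sequence.
toy = ["ACGATTGACC", "TTGATTGAAC", "CCCGATTGAT"]
model = run_em(toy, "GATTGA")
print(consensus(model))
```

With a uniform background and one site per sequence, the weight of each start position is proportional to the w-mer's probability under the matrix, which is why the background terms do not appear in `em_step`.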


Slide 8

Motif finding using the EM algorithm: MEME

(Bailey & Elkan 1995)

http://meme.sdsc.edu/meme/intro.html

  • EM algorithm: Expectation-Maximization

    • In one run, trains the matrix model and identifies examples of the matrix

Choice of parameters significantly affects the algorithm

-- motif width w

-- motif model:

- “zoops” = zero-or-one motif per promoter sequence*

- “oops” = exactly one motif per promoter sequence*

- “tcm” = two-component mixture model (“any number of sites”; ie. each w-mer sequence is either an example of the background model or the motif model)

-- background model:

- simplest case: genomic nucleotide frequencies P(G,A,T,C)

- nth-order Markov chain

(eg. 1st-order Markov chain: each base depends on the one before it, eg. P(A|C), estimated from dinucleotide frequencies)

*These models keep track of which input sequence (promoter) the motif came from,

whereas tcm throws all “w-mers” into a bag
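
A Markov-chain background can be estimated directly from sequence. The sketch below builds a first-order chain, in which each base depends only on the base before it, from dinucleotide counts. The pseudo-count of 1 and the function name are assumptions made for illustration.

```python
from collections import Counter

BASES = "ACGT"


def first_order_background(sequences, pseudo=1.0):
    """Return trans[prev][nxt] = P(nxt | prev) estimated from dinucleotide counts."""
    pair_counts = Counter()
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            pair_counts[prev + nxt] += 1
    trans = {}
    for prev in BASES:
        total = sum(pair_counts[prev + nxt] + pseudo for nxt in BASES)
        trans[prev] = {
            nxt: (pair_counts[prev + nxt] + pseudo) / total for nxt in BASES
        }
    return trans


# Hypothetical toy sequence: AC occurs 3 times, CA twice.
trans = first_order_background(["ACACAC"])
print(trans["A"]["C"], trans["C"]["A"])
```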


Slide 9

Gibbs Sampling

(AlignAce by Hughes et al. 2000 http://atlas.med.harvard.edu/download/index.html,

BioProspector by Liu et al. 2001 http://motif.stanford.edu/distributions/r

  • Start by randomly choosing sites and create an initial matrix

  • Sample other sites

    • Remove some set of matrix examples

    • Randomly choose other sites and calculate P given matrix

    • If they have a high score to the matrix, keep the new site

  • Iterate to convergence
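
The procedure above can be sketched as a simple site sampler. This is not the AlignAce or BioProspector implementation: it assumes one site per sequence, a uniform background of 0.25, a pseudo-count of 1, and a fixed number of sweeps instead of a convergence test. All names and the toy input are hypothetical.

```python
import random

BASES = "ACGT"


def build_pwm(sites, pseudo=1.0):
    """Frequency matrix (list of per-position dicts) with pseudo-counts."""
    pwm = []
    for pos in range(len(sites[0])):
        col = {b: pseudo for b in BASES}
        for s in sites:
            col[s[pos]] += 1
        total = sum(col.values())
        pwm.append({b: c / total for b, c in col.items()})
    return pwm


def score(wmer, pwm, bg=0.25):
    """Likelihood ratio of a w-mer: matrix model versus uniform background."""
    r = 1.0
    for col, base in zip(pwm, wmer):
        r *= col[base] / bg
    return r


def gibbs(seqs, w, sweeps=200, seed=0):
    """Return one sampled start position per sequence (needs >= 2 sequences)."""
    rng = random.Random(seed)
    # Start by randomly choosing one site in each sequence.
    starts = [rng.randrange(len(s) - w + 1) for s in seqs]
    for _ in range(sweeps):
        for i, seq in enumerate(seqs):
            # Remove sequence i's site and rebuild the matrix from the rest.
            others = [
                s[p:p + w] for j, (s, p) in enumerate(zip(seqs, starts)) if j != i
            ]
            pwm = build_pwm(others)
            # Sample a replacement site in proportion to its score.
            weights = [score(seq[p:p + w], pwm) for p in range(len(seq) - w + 1)]
            starts[i] = rng.choices(range(len(weights)), weights=weights)[0]
    return starts


# Hypothetical toy input.
toy = ["ACGATTGACC", "TTGATTGAAC", "CCCGATTGAT", "GATTGATTTT"]
starts = gibbs(toy, 6)
print([s[p:p + 6] for s, p in zip(toy, starts)])
```

Because the replacement site is drawn at random in proportion to its score rather than always taking the best one, the sampler can move away from a poor initial alignment.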


Slide 10

Gibbs sampling: basic idea

Current motif = PWM formed by circled substrings

Slides generously and unknowingly provided by S. Sinha, Urbana-Champaign CS Dept.


Slide 11

Gibbs sampling: basic idea

Delete one substring



Slide 12

Gibbs sampling: basic idea

Try a replacement: compute its score, then accept the replacement depending on the score.



Slide 13

Gibbs Sampling

(AlignAce by Hughes et al. 2000 http://atlas.med.harvard.edu/download/index.html,

BioProspector by Liu et al. 2001 http://motif.stanford.edu/distributions/r

Start by randomly choosing sites and create an initial matrix

Sample other sites

Remove some set of matrix examples

Randomly choose other sites and calculate P given matrix

If they have a high score to the matrix, keep the new site

Iterate to convergence

Gibbs sampling is less likely to get stuck in a local optimum, since it randomly samples other sites, whereas MEME is more prone to converging on local optima (in theory, anyway)


Slide 14

Assessing the biological relevance of identified motifs

Keep an eye on these features:

1. Bit score (or normalized bit score)

Bit score = Information Content at each position

2. Information content profile

Real TF binding sites typically show smooth IC profiles

3. Number of input sequences that contain the motif

Overfitting: great-looking motif, but found in only a few of the input sequences

4. Nucleotide frequencies

Eg. In yeast, AT rich sequences are common

… doesn’t necessarily mean they’re not real binding sites

5. Enrichment of motif in the training set compared to genomic bg

Our old friend, the hypergeometric distribution.

6. Any other nonrandom observation can give you confidence

(palindromic motif, nonrandom distribution of motifs in input sequences, etc)
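
The enrichment test in point 5 can be sketched with the hypergeometric distribution: given N sequences in the genome, K of which contain the motif, how likely is it to see k or more motif-containing sequences in a training set of size n by chance? The function name and the numbers in the usage line are hypothetical.

```python
from math import comb


def hypergeom_pvalue(N, K, n, k):
    """P(X >= k) when n sequences are drawn from N, K of which contain the motif."""
    upper = min(K, n)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)


# Hypothetical numbers: 6000 promoters genome-wide, 300 with the motif,
# 20 promoters in the training set, 8 of them with the motif.
print(hypergeom_pvalue(6000, 300, 20, 8))
```

A small value suggests the motif occurs in the training set more often than the genomic background would predict.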

