Lab 6 motif analysis
This presentation is the property of its rightful owner.
Sponsored Links
1 / 40

Lab 6 Motif Analysis PowerPoint PPT Presentation


  • 88 Views
  • Uploaded on
  • Presentation posted in: General

Lab 6 Motif Analysis. March 5, 2013 Lin Liu Yang Li. EGR-1 (Early growth response protein 1) also known as Zif268 (zinc finger protein 225) or NGFI-A (nerve growth factor-induced protein A) is a protein that in humans is encoded by the EGR1 gene.

Download Presentation

Lab 6 Motif Analysis

An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Presentation Transcript


Lab 6 motif analysis

Lab 6Motif Analysis

March 5, 2013

Lin Liu

Yang Li


Lab 6 motif analysis

EGR-1 (Early growth response protein 1) also known as Zif268 (zinc finger protein 225) or NGFI-A (nerve growth factor-induced protein A) is a protein that in humans is encoded by the EGR1 gene.

EGR-1 is a mammalian transcription factor. It was also named Krox-24, TIS8, and ZENK. It was originally discovered in mice.


Motif analysis basics

Motif AnalysisBasics

What is a Motif? A pattern common to a set of DNA, RNA or protein sequences that share a common biological property, such as functioning as binding sites for a particular protein.


Motif analysis basics1

Motif AnalysisBasics

ICk(b) =

- pk(b) * log2 [pk(b)/0.25]


Motif analysis basics2

Motif AnalysisBasics

  • Ways of representing a motif

    • • Consensus sequence

    • • Regular expression

    • • Weight matrix/PSPM/PSSM

    • • More complicated models


Motif analysis basics3

Motif AnalysisBasics

Where do motifs come from?

• Sequences of known common function

• Cross-linking/pulldown experiments

• SELEX experiments

• Multiple sequence alignments


Motif analysis basics4

Motif AnalysisBasics

Why are they important?

• Identify proteins that have a specific property

• Can be used to infer which factors regulate which genes

• Important for efforts to model gene expression


Motif analysis which methods have been proposed

Motif AnalysisWhich methods have been proposed?

• Enumerative (‘dictionary’)

- search for a k-mer/set of k-mers/regular expression that is over-represented

• Probabilistic Optimization (e.g., Gibbs sampler)

- stochastic search of the space of possible PSPMs

• Deterministic Optimization (e.g., MEME)

- deterministic search of space of possible PSPMs


Motif logos

Motif Logos


Related r package

Related R package

  • seqLogo

  • Toy example:

    • source(“http://bioconductor.org/biocLite.R”)

    • biocLite(seqLogo)

    • library(seqLogo)

    • m <- cbind(c(0.5, 0.2, 0.2, 0.1), c(0.2, 0.6, 0.1, 0.1), c(0.1, 0.05, 0.8, 0.05))

    • pwm <- makePWM(m)

    • seqLogo(pwm)


Decode gene regulation

Decode Gene Regulation

Look at genes always expressed together:

Upstream RegionsCo-expressed

Genes

GATGGCTGCACATTTACCTATGCCCTACGACCTCTCGC

CACATCGCATATTTACCACCAGTTCAGACACGGACGGC

GCCTCGATTTACCGTGGTACAGTTCAAACCTGACTAAA

TCTCGTTAGGACCATATTTACCACCCACATCGAGAGCG

CGCTAGCCATTTACCGATCTTGTTCGAGAATTGCCTAT


Challenges base substitutions

Challenges: Base substitutions

Sequences do not have to match the motif

perfectly, base substitutions are allowed

GATGGCTGCACATTTACCTATGCCCTACGACCTCTCGC

CACATCGCATATGTACCACCAGTTCAGACACGGACGGC

GCCTCGATTTGCCGTGGTACAGTTCAAACCTGACTAAA

TCTCGTTAGGACCATATTTATCACCCACATCGAGAGCG

CGCTAGCCAATTACCGATCTTGTTCGAGAATTGCCTAT


De novo sequence motif finding

De novo Sequence Motif Finding

  • Goal: look for common sequence patterns enriched in the input data (compared to the genome background)

  • Regular expression enumeration

    • Pattern driven approach

    • Enumerate patterns, check significance in dataset

    • Oligonucleotide analysis, MobyDick

  • Position weight matrix update

    • Data driven approach, use data to refine motifs

    • Consensus, EM & Gibbs sampling

    • Motif score and Markov background


Expectation maximization and gibbs sampling model

Expectation Maximization and Gibbs Sampling Model

  • Objects:

    • Seq: sequence data to search for motif

    • 0: non-motif (genome background) probability

    • : motif probability matrix parameter

    • : motif site locations

  • Problem: P(, | seq, 0)

  • Approach: alternately estimate

    •  by P( | , seq, 0)

    •  by P( | , seq, 0)

    • EM and Gibbs differ in the estimation methods


Expectation maximization

E step:  | , seq, 0

TTGACGACTGCACGT

TTGACp1

TGACGp2

GACGAp3

ACGACp4

CGACTp5

GACTGp6

ACTGCp7

CTGCAp8

...

P1 = likelihood ratio =

P(TTGAC| )

P(TTGAC| 0)

Expectation Maximization

p0T  p0T  p0G  p0A p0C

= 0.3  0.3  0.2  0.3  0.2


Expectation maximization1

E step:  | , seq, 0

TTGACGACTGCACGT

TTGACp1

TGACGp2

GACGAp3

ACGACp4

CGACTp5

GACTGp6

ACTGCp7

CTGCAp8

...

M step:  | , seq, 0

p1 TTGAC

p2 TGACG

p3 GACGA

p4 ACGAC

...

Scale ACGT at each position,  reflects weighted average of 

Expectation Maximization


Em derivatives

EM Derivatives

  • First EM motif finder (C Lawrence)

    • Deterministic algorithm, guarantee local optimum

  • MEME (TL Bailey)

    • Prior probability allows 0-n site / sequence

    • Parallel running multiple

      EM with different seed

    • User friendly results


Gibbs sampling

Gibbs Sampling

  • Stochastic process, although still may need multiple initializations

    • Sample  from P( | , seq, 0)

    • Sample  from P( | , seq, 0)

  • Collapsed form:

    •  estimatedwith counts, not sampling from Dirichlet

    • Sample site from one seq based on sites from other seqs

  • Converged motif matrix  and converged motif sites  represent stationary distribution of a Markov Chain


Lab 6 motif analysis

Gibbs Sampler

nA1 + sA

nA1 + sA +nC1 + sC +nG1 + sG +nT1 + sT

 estimated with counts

pA1 =

1

11

2

21

31

3

4

41

51

5

Initial 1

  • Randomly initialize a probability matrix


Gibbs sampler

1 Without

11Segment

Gibbs Sampler

  • Take out one sequence with its sites from current motif

11

21

31

41

51


Gibbs sampler1

1 Without

11Segment

Gibbs Sampler

  • Score each possible segment of this sequence

Sequence 1

Segment (1-8)

21

31

41

51


Lab 6 motif analysis

Gibbs Sampler

1 Without

11Segment

  • Score each possible segment of this sequence

Sequence 1

Segment (2-9)

21

31

41

51


Segment score

Motif Matrix

Pos 12345678

ATGGCATG

AGGGTGCG

ATCGCATG

TTGCCACG

ATGGTATT

ATTGCACG

AGGGCGTT

ATGACATG

ATGGCATG

ACTGGATG

Segment ATGCAGCT score =

p(generate ATGCAGCT from motif matrix)

p(generate ATGCAGCT from background)

p0A  p0T  p0G  p0C  p0A  p0G  p0C  p0T

Sites

Segment Score

  • Use current motif matrix to score a segment


Scoring segments

Scoring Segments

Motif12345bg

A0.40.10.30.40.20.3

T0.20.50.10.20.20.3

G0.20.20.20.30.40.2

C0.20.20.40.10.20.2

Ignore pseudo counts for now…

Sequence: TTCCATATTAATCAGATTCCG… score

TAATC…

AATCA0.4/0.3 x 0.1/0.3 x 0.1/0.3 x 0.1/0.2 x 0.2/0.3 = 0.049383

ATCAG0.4/0.3 x 0.5/0.3 x 0.4/0.2 x 0.4/0.3 x 0.4/0.2 = 11.85185

TCAGA0.2/0.3 x 0.2/0.3 x 0.3/0.3 x 0.3/0.2 x 0.2/0.3 = 0.444444

CAGAT…


Gibbs sampler2

12

Modified 1

 estimated with counts

Gibbs Sampler

  • Sample site from one seq based on sites from other seqs

21

31

41

51


How to sample

How to Sample?

  • Rand(subtotal) = X

  • Find the first position with subtotal larger than X


Gibbs sampler3

1 Without

21Segment

Gibbs Sampler

  • Repeat the process until motif converges

21

12

31

41

51


Gibbs sampler intuition

Gibbs Sampler Intuition

  • Beginning:

    • Randomly initialized motif

    • No preference towards any segment


Gibbs sampler intuition1

Gibbs Sampler Intuition

  • Motif appears:

    • Motif should have enriched signal (more sites)

    • By chance some correct sites come to alignment

    • Sites bias motif to attract other similar sites


Gibbs sampler intuition2

Gibbs Sampler Intuition

  • Motif converges:

    • All sites come to alignment

    • Motif totally biased to sample sites every time


Lab 6 motif analysis

Gibbs Sampler

1

2

3

4

5

1i

2i

3i

4i

5i

  • Column shift

  • Metropolis algorithm:

    • Propose * as  shifted 1 column to left or right

    • Calculate motif score u() and u(*)

    • Accept * with prob = min(1, u(*) / u())


Gibbs sampling derivatives

Gibbs Sampling Derivatives

  • Gibbs Motif Sampler (JS Liu)

    • Add prior probability to allow 0-n site / seq

    • Sample motif positions to consider

  • AlignACE (F Roth)

    • Look for motifs from both strands

    • Mask out one motif to find more different motifs

  • BioProspector (XS Liu)

    • Use background model with Markov dependencies

    • Sampling with threshold (0-n sites / seq), new scoring function

    • Can find two-block motifs with variable gap


Scoring motifs

Motif Matrix

Pos 12345678

ATGGCATG

AGGGTGCG

ATCGCATG

TTGCCACG

ATGGTATT

ATTGCACG

AGGGCGTT

ATGACATG

ATGGCATG

ACTGGATG

Segment ATGCAGCT score =

p(generate ATGCAGCT from motif matrix)

p(generate ATGCAGCT from background)

p0A  p0T  p0G  p0C  p0A  p0G  p0C  p0T

Sites

Scoring Motifs

  • Information Content (also known as relative entropy)

    • Suppose you have x aligned segments for the motif

    • pb(s1 from mtf) / pb(s1 from bg) *

      pb(s2 from mtf) / pb(s2 from bg) *…

      pb(sx from mtf) / pb(sx from bg)


Scoring motifs1

Motif Matrix

Pos 12345678

ATGGCATG

AGGGTGCG

ATCGCATG

TTGCCACG

ATGGTATT

ATTGCACG

AGGGCGTT

ATGACATG

ATGGCATG

ACTGGATG

Segment ATGCAGCT score =

p(generate ATGCAGCT from motif matrix)

p(generate ATGCAGCT from background)

p0A  p0T  p0G  p0C  p0A  p0G  p0C  p0T

Sites

Scoring Motifs

  • Information Content (also known as relative entropy)

    • Suppose you have x aligned segments for the motif

    • pb(s1 from mtf) / pb(s1 from bg) *

      pb(s2 from mtf) / pb(s2 from bg) *…

      pb(sx from mtf) / pb(sx from bg)


Scoring motifs2

Scoring Motifs

pb(s1 from mtf) / pb(s1 from bg) *

pb(s2 from mtf) / pb(s2 from bg) *…

pb(sx from mtf) / pb(sx from bg)

= (pA1/pA0)A1 (pT1/pT0)T1 (pT2/pT0)T2 (pG2/pG0)G2 (pC2/pC0)C2…

Take log of this:

= A1 log (pA1/pA0) + T1 log (pT1/pT0) +

T2 log (pT2/pT0) + G2 log (pG2/pG0) + …

Divide by the number of segments (if all the motifs have same number of segments)

= pA1 log (pA1/pA0) + pT1 log (pT1/pT0) + pT2 log (pT2/pT0)…

Pos 12345678

ATGGCATG

AGGGTGCG

ATCGCATG

TTGCCACG

ATGGTATT

ATTGCACG

AGGGCGTT

ATGACATG

ATGGCATG

ACTGGATG


Scoring motifs3

=

Motif Conservedness: How likely to see the current aligned segments from this motif model

Bad

AGGCA

ATCCC

GCGCA

CGGTA

TGCCA

ATGGT

TTGAA

Good

ATGCA

ATGCC

ATGCA

ATGCA

TTGCA

ATGGA

ATGCA

Scoring Motifs

  • Original function: Information Content


Scoring motifs4

=

Motif Specificity:

How likely to see the current aligned segments from background

Scoring Motifs

  • Original function: Information Content

Good

AGTCC

AGTCC

AGTCC

AGTCC

AGTCC

AGTCC

AGTCC

Bad

ATAAA

ATAAA

ATAAA

ATAAA

ATAAA

ATAAA

ATAAA


Scoring motifs5

=

Scoring Motifs

  • Original function: Information Content

    Which is better?

    (data = 8 seqs)

Motif 1

AGGCTAAC

AGGCTAAC

Motif 2

AGGCTAAC

AGGCTACC

AGGCTAAC

AGCCTAAC

AGGCCAAC

AGGCTAAC

TGGCTAAC

AGGCTTAC

AGGCTAAC

AGGGTAAC


Scoring motifs6

Specific (unlikely in genome background)

Motif Signal

Abundant

Positions

Conserved

Scoring Motifs

  • Motif scoring function:

  • Prefer: conserved motifs with many sites, but are not often seen in the genome background


Markov background increases motif specificity

Prefers motif segments enriched only in data, but not so likely to occur in the background

Segment ATGTA score =

p(generate ATGTA from )

p(generate ATGTA from 0)

3rd order Markov dependency

p( )

Markov Background Increases Motif Specificity

TCAGC = .25  .25  .25  .25  .25

.3  .18  .16  .22  .24

ATATA = .25  .25  .25  .25  .25

.3  .41  .38  .42  .30


  • Login