Comparative genomics to identify DNA binding motifs

Saurabh Sinha

Dept. of Computer Science

University of Illinois, Urbana-Champaign


Outline

  • Binding sites and motifs

  • The motif finding problem in one species

  • Comparative genomics and alignment

  • The motif finding problem with comparative genomics


Motif finding in multiple species

  • Footprinter : the approach without alignments

  • PhyloCon : The use of alignments

  • PhyME & PhyloGibbs : The use of alignments and an evolutionary model

  • MCS : Genome-wide motif finding from multiple species


Binding sites and motifs


Binding sites

  • A few binding sites of transcription factor “Bicoid” in the Drosophila (fruitfly) genome, collected experimentally


[Figure: Bicoid binding sites. Source: http://webdisk.berkeley.edu/~dap5/data_04/motifs/bicoid.gif]

Motif: T A A T C C C

Motif as a “consensus string”: W A A T C C N

W = T or A

N = A, C, G, or T


Motif

  • Common sequence “pattern” in the binding sites of a transcription factor

  • A succinct way of capturing variability among the binding sites


Alternative way to represent a motif

Position weight matrix (PWM), or simply “weight matrix”


Motif representation

  • Consensus string

    • May allow “degenerate” symbols in string, e.g., N = A/C/G/T; W = A/T; S = C/G; R = A/G; Y = T/C etc.

    • Tractable search space, enumerative algorithms

  • Position weight matrix

    • More powerful representation

    • Probabilistic treatment, algorithms

    • More popular
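The two representations can be compared with a minimal sketch. The four sites below are made up to be consistent with the consensus W A A T C C N; they are not the experimental Bicoid sites from the figure. The function names and the pseudocount are assumptions for illustration.

```python
from collections import Counter

# Degenerate symbols listed on the slide, mapped to the bases they allow.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "W": "AT", "S": "CG", "R": "AG", "Y": "CT", "N": "ACGT"}

def matches_consensus(site, consensus):
    """True if every base of `site` is allowed by the consensus symbol at that position."""
    return len(site) == len(consensus) and all(
        base in IUPAC[symbol] for base, symbol in zip(site, consensus))

def build_pwm(sites, pseudocount=1.0):
    """Estimate one probability column per position from aligned sites.

    The pseudocount (an assumption, not from the slides) keeps unseen bases
    from getting probability zero.
    """
    total = len(sites) + 4 * pseudocount
    pwm = []
    for column in zip(*sites):
        counts = Counter(column)
        pwm.append({b: (counts[b] + pseudocount) / total for b in "ACGT"})
    return pwm

# Hypothetical sites, invented for this example.
sites = ["TAATCCC", "AAATCCG", "TAATCCA", "TAATCCT"]
pwm = build_pwm(sites)
for i, column in enumerate(pwm):
    print(i, {b: round(p, 3) for b, p in column.items()})
```

The consensus string gives a yes/no answer for each candidate site, while the PWM keeps per-position frequencies, which is what makes a probabilistic treatment possible.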


The motif finding problem (in one species)

  • Suppose a transcription factor (TF) regulates five different genes

  • Each of the five genes should have binding sites for TF in their promoter region

[Figure: promoter regions of Gene 1 through Gene 5, with binding sites for TF marked]


The motif finding problem

  • Now suppose we are given the promoter regions of the five genes G1, G2, … G5

  • Can we find the binding sites of TF without knowing about them a priori?

    • Binding sites are similar to each other, but not necessarily identical

  • This is the motif finding problem

  • To find a motif that represents binding sites of an unknown TF


Motif finding algorithms

  • Version 1: Given promoter regions of co-regulated genes, find the motif

  • Existing algorithms:

    • Gibbs sampling (MCMC) : Lawrence et al. 1993

    • MEME (Expectation-Maximization) : Bailey & Elkan 1994

    • CONSENSUS (Greedy local search, beam search) : Hertz & Stormo

    • Word enumeration methods (with emphasis on statistical accuracy)

      • van Helden et al. 1998, Sinha & Tompa 2000

    • And a hundred others


Comparative Genomics


More Data

  • Genomes of multiple species available

species1  GCGTGATCGAGCTATAACGGAA

species2  CTGTGATCGTCGGGTAACGCCC

species3  TGGTGATCGGAACCCCTAACGA

species4  AAGTGATCGATTATCCTAACGT

[Figure: the four species are related by an evolutionary tree; blocks of conservation are marked in the sequences]


Using multiple genomes

  • Functional parts of the genome evolve more slowly than non-functional parts

  • Identify conserved parts by sequence alignment algorithms

  • Look for functional features in conserved regions – this improves the signal

Popular Paradigm in Computational Biology
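As a toy illustration of "identify conserved parts", the sketch below scans the four example sequences from the "More Data" slide column by column and reports runs of identical columns. It assumes the sequences are already aligned without gaps, which a real pipeline would get from an alignment program; the function name and the `min_len` threshold are assumptions.

```python
def conserved_blocks(aligned, min_len=4):
    """Return (start, block) for each run of >= min_len fully conserved columns.

    `aligned` is a list of equal-length sequences, treated as an ungapped alignment.
    """
    length = len(aligned[0])
    blocks = []
    start = None
    for col in range(length + 1):
        # A sentinel column past the end closes any run that is still open.
        conserved = col < length and len({seq[col] for seq in aligned}) == 1
        if conserved and start is None:
            start = col
        elif not conserved and start is not None:
            if col - start >= min_len:
                blocks.append((start, aligned[0][start:col]))
            start = None
    return blocks

aligned = [
    "GCGTGATCGAGCTATAACGGAA",  # species1
    "CTGTGATCGTCGGGTAACGCCC",  # species2
    "TGGTGATCGGAACCCCTAACGA",  # species3
    "AAGTGATCGATTATCCTAACGT",  # species4
]
print(conserved_blocks(aligned))
```

Only the block GTGATCG is found this way. The second shared word, TAACG, starts at position 14 in species1 and species2 but at position 16 in species3 and species4, so recovering it requires an alignment that allows gaps.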


Multiple sequence alignment

  • Comparative genomics relies upon the ability to detect “similar” (evolutionarily related) regions in different genomes

  • This is the problem of multiple-species alignment

  • A hard computational problem (“NP-hard”)

  • Several fast heuristics exist (MLAGAN, TBA)

  • Assume this functionality exists …


Back to motif finding

[Figure: promoter regions of Gene 1 through Gene 5, with binding sites for TF marked]


Motif finding from multiple species data

  • Version 2: Given promoter regions of the same gene from multiple species, find the motif

[Figure: promoter of Gene G in Species 1 through Species 5, with binding sites for TF marked]


One approach

  • Do multiple sequence alignment of the upstream regions of the gene

  • Look for recurring motifs in conserved blocks

[Figure: upstream regions of Gene G in Species 1 through Species 5, with blocks of conservation marked]


Another approach (alignment-free)

  • What if binding sites are not entirely within conserved blocks?

  • Look for recurring motifs in entire upstream regions

[Figure: upstream regions of Gene G in Species 1 through Species 5, with blocks of conservation marked]


Footprinter (Blanchette et al.): The method without alignments


Footprinter

  • The input sequences are promoter regions of the same gene, but from multiple species.

  • Such sequences are said to be “orthologous” to each other.


Footprinter

[Figure: input sequences, related by an evolutionary tree → find motif]


A side note: Parsimony

  • A guiding principle in cross-species comparison

  • If the data can be explained in multiple ways, prefer the explanation with the fewest events (be parsimonious)

  • Parsimony score = number of evolutionary events (e.g., substitutions) on the tree

  • Maximum parsimony principle: minimize parsimony score
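To make the parsimony score concrete, here is a small sketch that counts the minimum number of substitutions for a single position on a fixed tree. It uses Fitch's algorithm, a standard method that the slides do not name; the tree topology and the observed bases are hypothetical.

```python
def fitch(node, states):
    """Return (candidate_states, substitutions) for the subtree rooted at `node`.

    `node` is a leaf name (str) or a pair (left, right) of child nodes.
    `states` maps each leaf name to the base observed at one position.
    """
    if isinstance(node, str):
        return {states[node]}, 0
    left, right = node
    left_set, left_cost = fitch(left, states)
    right_set, right_cost = fitch(right, states)
    common = left_set & right_set
    if common:
        # The children can agree on a base, so no extra substitution is needed.
        return common, left_cost + right_cost
    # The children disagree: one substitution is charged at this node.
    return left_set | right_set, left_cost + right_cost + 1

# Hypothetical topology and hypothetical bases at one motif position.
tree = ((("Human", "Chimp"), "Rabbit"), ("Mouse", "Rat"))
states = {"Human": "T", "Chimp": "T", "Rabbit": "T", "Mouse": "T", "Rat": "G"}
root_states, score = fitch(tree, states)
print(root_states, score)
```

With four leaves showing T and one showing G, a single substitution on the branch leading to Rat explains the data, so the parsimony score of this position is 1.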


Phylogenetic footprinting: formally speaking

Given:

  • phylogenetic tree T,

  • set of orthologous sequences at leaves of T,

  • length k of motif

  • threshold d

Problem:

  • Find set S of k-mers, one k-mer from each leaf, such that the “parsimony” score of S in T is at most d.


Small Example

Size of motif sought: k = 4

AGTCGTACGTGAC... (Human)

AGTAGACGTGCCG... (Chimp)

ACGTGAGATACGT... (Rabbit)

GAACGGAGTACGT... (Mouse)

TCGTGACGGTGAT... (Rat)


Solution

[Figure: the five sequences from the small example, with the 4-mer labels ACGT, ACGT, ACGT, and ACGG]

Parsimony score: 1 mutation


An Exact Algorithm (Blanchette’s algorithm)

W_u[s] = best parsimony score for the subtree rooted at node u, if u is labeled with string s.

[Figure: every node of the tree holds a table of 4^k entries, one score per k-mer. At a leaf the entry is 0 if the k-mer occurs in the sequence and ∞ otherwise, for example “… ACGG: ∞, ACGT: 0 …” for the leaf AGTCGTACGTG. Internal nodes hold entries such as “… ACGG: 1, ACGT: 0 …”. Leaf sequences shown: AGTCGTACGTG, ACGGGACGTGC, ACGTGAGATAC, GAACGGAGTAC, TCGTGACGGTG.]


Recurrence

  • $W_u[s] = \sum_{v:\ \text{child of } u} \min_t \left( W_v[t] + d(s, t) \right)$

  • A post-order traversal algorithm

Running Time

  • $O(k \cdot 4^{2k})$ time per node


Footprinter: features

  • One of the earliest motif-finding algorithms based on comparative genomics

  • Simple formulation of motif score, algorithm efficient in practice

  • Cannot combine evolutionary conservation information with overrepresentation information

    • e.g., two motifs may be equally conserved, but only one occurs in many co-regulated genes (promoters); Footprinter cannot tell them apart


PhyloCon (Stormo lab): The method with alignments


The underlying single-species algorithm: CONSENSUS

Final goal: Find a set of substrings, one in each input sequence.

The set of substrings defines a PWM.

Goal: This PWM should have high information content.

High information content means that the motif “stands out”.

The underlying single-species algorithm: CONSENSUS

  • Start with a substring in one input sequence

  • Build the set of substrings incrementally, adding one substring at a time

  • Consider every substring in the next sequence, try adding it to the current motif, and score the resulting motif

  • Pick the best one …

  • … and repeat

[Figure: the current set of substrings and the current motif at each step]


The key: Scoring a motif

Scoring the current motif:

  • Build a PWM

  • Compute information content of PWM:

    • For each column, compute relative entropy relative to a “background” distribution

    • Sum over all columns

Key: to align the sites of a motif, and score the alignment
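The scoring rule and the greedy build-up can be sketched together. This is a simplified illustration, not the CONSENSUS program: it keeps a single candidate per starting substring, whereas the slides list CONSENSUS as using beam search. The uniform background, the pseudocount, the function names, and the input sequences are all assumptions.

```python
import math
from collections import Counter

BACKGROUND = {b: 0.25 for b in "ACGT"}  # assumed uniform background

def information_content(sites, background=BACKGROUND, pseudocount=0.5):
    """Sum over columns of the relative entropy between column and background."""
    n = len(sites)
    total = 0.0
    for column in zip(*sites):
        counts = Counter(column)
        for base in "ACGT":
            # The pseudocount keeps every frequency positive so the log is defined.
            f = (counts[base] + pseudocount * background[base]) / (n + pseudocount)
            total += f * math.log2(f / background[base])
    return total

def greedy_consensus(seqs, k):
    """Try each k-mer of the first sequence as a seed and extend it greedily."""
    best_sites, best_score = None, -math.inf
    first, rest = seqs[0], seqs[1:]
    for i in range(len(first) - k + 1):
        sites = [first[i:i + k]]
        for seq in rest:
            candidates = [seq[j:j + k] for j in range(len(seq) - k + 1)]
            # Add the substring that gives the highest-scoring motif so far.
            sites.append(max(candidates,
                             key=lambda c: information_content(sites + [c])))
        score = information_content(sites)
        if score > best_score:
            best_sites, best_score = sites, score
    return best_sites, best_score

# Hypothetical promoters, each containing one copy of TAATCC.
seqs = ["GGCTAATCCGA", "TAATCCTTGCA", "ACGTTAATCCG"]
best_sites, best_score = greedy_consensus(seqs, 6)
print(best_sites, round(best_score, 3))
```

Because identical columns have the highest relative entropy against a uniform background, the three planted copies of TAATCC score higher than any other choice of one substring per sequence in this toy input.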


Extending CONSENSUS to multiple species

Single-species goal: Find a set of substrings, one in each input sequence

Multi-species goal: Find a set of “profiles”, one in each set of orthologous input sequences

[Figure: “profiles” built from orthologous input sequences]


Aligning two “profiles”

  • Compare two profiles column by column

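  • Each column of a profile is (n_A, n_C, n_G, n_T), and equivalently, (f_A, f_C, f_G, f_T)

  • Probabilistic score to capture if two columns {n_bi, f_bi}_b and {n_bj, f_bj}_b are from the same distribution (and different from background)

  • ALLR: Avg. Log Likelihood Ratio

$$\mathrm{ALLR}(i, j) = \frac{\sum_b n_{bj} \ln\frac{f_{bi}}{p_b} + \sum_b n_{bi} \ln\frac{f_{bj}}{p_b}}{\sum_b \left( n_{bi} + n_{bj} \right)}$$

where p_b is the background frequency of base b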
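A minimal sketch of the column score follows, assuming the ALLR definition from Wang & Stormo (2003): the counts of each column are scored under the log-ratio of the other column's frequencies to the background, and the sum is divided by the total count. The pseudocount used to avoid log(0), the uniform background, and the function names are assumptions.

```python
import math

BASES = "ACGT"

def frequencies(counts, background, pseudocount=1.0):
    """Turn base counts into frequencies, smoothed toward the background."""
    total = sum(counts.get(b, 0) for b in BASES) + pseudocount
    return {b: (counts.get(b, 0) + pseudocount * background[b]) / total
            for b in BASES}

def allr(counts_i, counts_j, background, pseudocount=1.0):
    """Average log likelihood ratio between two profile columns."""
    f_i = frequencies(counts_i, background, pseudocount)
    f_j = frequencies(counts_j, background, pseudocount)
    numerator = sum(
        counts_j.get(b, 0) * math.log(f_i[b] / background[b])
        + counts_i.get(b, 0) * math.log(f_j[b] / background[b])
        for b in BASES)
    denominator = sum(counts_i.get(b, 0) + counts_j.get(b, 0) for b in BASES)
    return numerator / denominator

background = {b: 0.25 for b in BASES}  # assumed uniform background
same = allr({"A": 4}, {"A": 4}, background)
different = allr({"A": 4}, {"C": 4}, background)
print(round(same, 3), round(different, 3))
```

Two columns that agree on a base far from the background score positive, while two columns that disagree score negative, which is why extending a profile with unrelated columns lowers the total.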


One cool feature of ALLR

  • Its expected value is negative, so very long profiles will not automatically give large ALLR scores

  • Therefore, can automatically detect the “right” motif length


PhyloCon: features

  • One of the first algorithms to find motifs that are conserved across species and occur in multiple co-regulated gene promoters

  • Does not consider the evolutionary relationships among species (all species weighted equally)


PhyME (Sinha et al.): A method with alignments and an evolutionary model


  • Input

    • A set of promoters with many matches to unknown motif W

    • For some promoters, sequence from other species also given

  • Output: The motif W

[Figure: Promoter 1 through Promoter 4, with sequences from species1 through species4]


Step 1: Use alignment program (LAGAN) to find ungapped blocks of conservation

[Figure: Promoter 1 through Promoter 4]


For each promoter, maximize

Pr (promoter + orthologs | Model with motif as parameter)

Model = “Hidden Markov Model”

Find motif (parameter) that maximizes the likelihood.

We’ll study the model in detail this evening


A key component of the likelihood computation: Pr (site s | motif W)

  • Evolutionarily unrelated sites

  • Evolutionarily related sites: given by evolutionary model


Evolutionary model

[Figure: ancestral site a evolves over time t into sites s1 and s2]

  • Two species, sites s1 and s2 in a conserved block

  • Pr (s1,s2 | W)

  • Short time limit (t ~ 0):

    • a = s1 = s2

    • Pr (s1,s2 | W) = Pr (a | W)

  • Long time limit (t ~ ∞)

    • Pr (s1,s2 | W) = Pr (s1 | W) · Pr (s2 | W)

  • Interpolate between these two limits


Model of binding site evolution

[Figure: ancestral site a evolves over time t into sites s1 and s2; the evolution (i) depends on time t and (ii) depends on motif W]

  • Evolving binding site must bind the same protein

  • $\Pr(s_1, s_2 \mid W) = \sum_a \Pr(a \mid W) \prod_i \Pr(s_i \mid a, W, t)$

  • Can be generalized to more than two species (recursively)
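The slides do not spell out Pr(s_i | a, W, t). One simple rule that satisfies both limits, used here purely as an assumption for illustration, is that each position keeps the ancestral base with probability 1 − mu and is otherwise redrawn from the corresponding column of W, where the hypothetical parameter mu grows from 0 to 1 with time t. Since positions are independent under this rule, the sum over ancestral sites a factorizes into a product of per-position sums.

```python
def column_pair_prob(x1, x2, column, mu):
    """Sum over the ancestral base a of W[a] * Pr(x1 | a) * Pr(x2 | a) at one position."""
    def descend(x, a):
        # Keep the ancestral base with prob. 1 - mu, else redraw from the motif column.
        return (1.0 - mu) * (1.0 if x == a else 0.0) + mu * column[x]
    return sum(column[a] * descend(x1, a) * descend(x2, a) for a in "ACGT")

def site_prob(site, pwm):
    """Pr(site | W) for a single site under the PWM."""
    prob = 1.0
    for base, column in zip(site, pwm):
        prob *= column[base]
    return prob

def pair_prob(s1, s2, pwm, mu):
    """Pr(s1, s2 | W) for two aligned sites sharing a common ancestor."""
    prob = 1.0
    for x1, x2, column in zip(s1, s2, pwm):
        prob *= column_pair_prob(x1, x2, column, mu)
    return prob

# Hypothetical two-column motif.
pwm = [{"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
       {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7}]
print(pair_prob("AT", "AT", pwm, 0.0), pair_prob("AT", "CT", pwm, 1.0))
```

At mu = 0 the two sites must equal the ancestor, so the pair probability is Pr(a | W) for identical sites and 0 otherwise. At mu = 1 the sites are independent draws from W, so the pair probability is the product Pr(s1 | W) · Pr(s2 | W). Intermediate values of mu interpolate between the two limits.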


Training the motif

  • Given a motif, we can compute

    • Pr (promoter + orthologs | model with motif)

  • But we have to find the motif that maximizes this probability

  • Expectation-maximization algorithm

  • Local search, not guaranteed to find global maximum

  • More on E-M in evening lecture


PhyloGibbs (Siddharthan et al.)

  • Problem formulation very similar to PhyME (alignments, evolutionary model)

  • Gibbs sampling approach to find motif

    • A special MCMC strategy

    • E-M (PhyME) prone to local optima

  • Can find multiple motifs simultaneously


PhyME & PhyloGibbs

  • Algorithms that consider the phylogenetic tree relating the species

    • Another algorithm of same genre: MONKEY (Moses et al. 2004)

  • Allow binding sites to occur in conserved (aligned) as well as unconserved regions

  • Designed to find motifs in sets of co-regulated genes (and their orthologs)

  • Not designed to find motifs from whole-genome analysis


MCS (Kellis lab): Genome-wide motif finding from multiple species


Algorithm

  • Align four mammalian species genomes

    • human, mouse, rat, dog

  • Focus on all promoter regions and 3’ UTRs

  • For every possible motif (consensus string model)

    • Count the number of occurrences in the human genome

    • Count how many of these are completely conserved in all four species (obvious from alignment)

    • Evaluate statistical significance
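The two counting steps can be sketched on a toy input. The example reuses the four short sequences from the "More Data" slide as if they were an ungapped alignment, with species1 standing in for the reference genome. It matches exact strings only, so degenerate consensus symbols are not handled, and the function name is an assumption.

```python
def conservation_counts(motif, reference, others):
    """Return (n, k): occurrences of `motif` in the reference, and how many of
    them are identical at the same columns in every other aligned sequence."""
    width = len(motif)
    n = conserved = 0
    for i in range(len(reference) - width + 1):
        if reference[i:i + width] == motif:
            n += 1
            if all(seq[i:i + width] == motif for seq in others):
                conserved += 1
    return n, conserved

reference = "GCGTGATCGAGCTATAACGGAA"  # species1, standing in for the reference genome
others = [
    "CTGTGATCGTCGGGTAACGCCC",  # species2
    "TGGTGATCGGAACCCCTAACGA",  # species3
    "AAGTGATCGATTATCCTAACGT",  # species4
]
print(conservation_counts("GTGATC", reference, others))
print(conservation_counts("TAACG", reference, others))
```

GTGATC occurs once in the reference and is conserved in all species, while TAACG occurs once but is not conserved at the same columns in species3 and species4, so it would not count toward k in this ungapped toy alignment.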


Statistical Significance

  • k = # of conserved occurrences of motif

  • n = # of occurrences of motif

  • p = k/n = “conservation rate” of motif

  • For 100 random motifs of the same type, compute average conservation rate p0

  • n occurrences, p0 rate of being conserved: how significant are k conserved occurrences?

  • Exact p-value: Binomial(n, p0, k)

  • Binomial mean = np0, variance = np0(1-p0)

  • Compute the z-score: $z = \frac{k - n p_0}{\sqrt{n p_0 (1 - p_0)}}$
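The significance calculation can be written out in a few lines. The z-score below standardizes k with the binomial mean and variance given above. The exact p-value is interpreted here as the upper tail Pr(X ≥ k) for X ~ Binomial(n, p0), which is an assumption about what "Binomial(n, p0, k)" denotes. The numbers in the example are hypothetical.

```python
import math

def z_score(k, n, p0):
    """Standardize k using the binomial mean n*p0 and variance n*p0*(1 - p0)."""
    return (k - n * p0) / math.sqrt(n * p0 * (1 - p0))

def binomial_upper_tail(k, n, p0):
    """Pr(X >= k) for X ~ Binomial(n, p0)."""
    return sum(math.comb(n, i) * p0 ** i * (1 - p0) ** (n - i)
               for i in range(k, n + 1))

# Hypothetical motif: 100 occurrences, 28 conserved, background conservation rate 0.1.
n, k, p0 = 100, 28, 0.1
print(round(z_score(k, n, p0), 3), binomial_upper_tail(k, n, p0))
```

With these made-up numbers the mean is 10 and the standard deviation is 3, so 28 conserved occurrences lie six standard deviations above what is expected by chance.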


MCS score

  • The “z” score is called the MCS score

  • Output all motifs with MCS > 6

  • Post-process this list of motifs, to remove similar looking motifs (“clustering”)

  • A final list of 174 motifs from promoters

  • 69 match known motifs

  • 105 potential new regulatory motifs


Conclusion

  • Comparative genomics has infused new life into the motif-finding community

  • A variety of algorithms geared towards various assumptions

    • Footprinter: no alignments (2000)

    • PhyloCon: alignments, but no tree (2003)

    • PhyME: alignments, tree, and evolutionary model (2004)

    • MCS: genome-wide motif discovery from very closely related species (2005)


QUESTIONS ?


References: single species motif finding

  • Timothy L. Bailey and Charles Elkan, "Unsupervised Learning of Multiple Motifs in Biopolymers using EM", Machine Learning, 21(1-2):51-80, October, 1995

  • Lawrence, C. E., S. F. Altschul, M. S. Boguski, J. S. Liu, A. F. Neuwald, and J. C. Wootton (1993, October). Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment. Science 262, 208--214.

  • Hertz GZ, Hartzell GW 3rd, Stormo GD. Identification of consensus patterns in unaligned DNA sequences known to be functionally related. CABIOS (now Bioinformatics) 1990. 6(2):81-92.

  • van Helden,J., Andre,B. and Collado-Vides,J. (1998) Extracting regulatory sites from the upstream region of yeast genes by computational analysis of oligonucleotide frequencies. J. Mol. Biol., 281, 827-842.

  • Sinha, S. and Tompa, M., A statistical method for finding transcription factor binding sites, Proc. Int. Conf. Intell. Syst. Mol. Biol., 8:344--354, 2000.


References: multiple species motif finding

  • Blanchette, M., Schwikowski, B. and Tompa, M. (2000). An exact algorithm to identify motifs in orthologous sequences from multiple species. Proceedings of the Eighth International Conference on Intelligent Systems for Molecular Biology (ISMB 2000), pp. 37-45.

  • Wang T, Stormo GD. Combining phylogenetic data with co-regulated genes to identify regulatory motifs. Bioinformatics. 2003 Dec 12;19(18):2369-80.

  • Sinha S, Blanchette M, Tompa M. PhyME: a probabilistic algorithm for finding motifs in sets of orthologous sequences. BMC Bioinformatics. 2004 Oct 28;5:170.

  • Siddharthan R, Siggia ED, van Nimwegen E. PhyloGibbs: a Gibbs sampling motif finder that incorporates phylogeny. PLoS Comput Biol. 2005 Dec;1(7):e67. Epub 2005 Dec 9.

  • Moses AM, Chiang DY, Pollard DA, Iyer VN, Eisen MB. MONKEY: identifying conserved transcription-factor binding sites in multiple alignments using a binding site-specific evolutionary model. Genome Biol. 2004;5(12):R98. Epub 2004 Nov 30.

  • Xie X, Lu J, Kulbokas EJ, Golub TR, Mootha V, Lindblad-Toh K, Lander ES, Kellis M. Systematic discovery of regulatory motifs in human promoters and 3' UTRs by comparison of several mammals. Nature. 2005 Mar 17;434(7031):338-45. Epub 2005 Feb 27.

