Comparative genomics to identify DNA binding motifs

Comparative genomics to identify DNA binding motifs. Saurabh Sinha Dept. of Computer Science University of Illinois, Urbana-Champaign. Outline. Binding sites and motifs The motif finding problem in one species Comparative genomics and alignment

    Presentation Transcript
    1. Comparative genomics to identify DNA binding motifs Saurabh Sinha Dept. of Computer Science University of Illinois, Urbana-Champaign

    2. Outline • Binding sites and motifs • The motif finding problem in one species • Comparative genomics and alignment • The motif finding problem with comparative genomics

    3. Motif finding in multiple species • Footprinter : the approach without alignments • PhyloCon : The use of alignments • PhyME & PhyloGibbs : The use of alignments and an evolutionary model • MCS : Genome-wide motif finding from multiple species

    4. Binding sites and motifs

    5. Binding sites • A few binding sites of transcription factor “Bicoid” in the Drosophila (fruitfly) genome, collected experimentally

    6. http://webdisk.berkeley.edu/~dap5/data_04/motifs/bicoid.gif

    7. Motif: T A A T C C C http://webdisk.berkeley.edu/~dap5/data_04/motifs/bicoid.gif

    8. Motif as a “consensus string”: W A A T C C N, where W = T or A and N = A, C, G, or T http://webdisk.berkeley.edu/~dap5/data_04/motifs/bicoid.gif
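A consensus string with degenerate symbols can be scanned against a sequence by expanding each symbol into the set of bases it allows. The following is a minimal sketch, not part of the original slides: the function names and the test sequence are hypothetical, and only the degenerate symbols mentioned in this presentation are included.

```python
import re

# Degenerate nucleotide codes used in the slides; each maps to the bases it allows.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "W": "AT", "S": "CG", "R": "AG", "Y": "CT", "N": "ACGT",
}

def consensus_to_regex(consensus):
    """Translate a consensus string into a regular expression."""
    return "".join(
        IUPAC[c] if len(IUPAC[c]) == 1 else "[" + IUPAC[c] + "]"
        for c in consensus
    )

def find_sites(consensus, sequence):
    """Return (position, site) for every match, including overlapping ones."""
    # A lookahead group lets the regex engine report overlapping matches.
    pattern = re.compile("(?=(" + consensus_to_regex(consensus) + "))")
    return [(m.start(), m.group(1)) for m in pattern.finditer(sequence)]

# Hypothetical test sequence, made up for illustration.
hits = find_sites("WAATCCN", "GGTAATCCCAGAAATCCGTT")
print(hits)  # [(2, 'TAATCCC'), (11, 'AAATCCG')]
```

Because the consensus string either matches or does not, this representation supports simple enumerative search, at the cost of ignoring how often each base occurs at each position.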

    9. Motif • Common sequence “pattern” in the binding sites of a transcription factor • A succinct way of capturing variability among the binding sites

    10. Alternative way to represent motif Position weight matrix (PWM) Or simply, “weight matrix”

    11. Motif representation • Consensus string • May allow “degenerate” symbols in string, e.g., N = A/C/G/T; W = A/T; S = C/G; R = A/G; Y = T/C etc. • Tractable search space, enumerative algorithms • Position weight matrix • More powerful representation • Probabilistic treatment, algorithms • More popular
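A position weight matrix can be built from a set of aligned, equal-length binding sites by counting bases column by column. The sketch below is illustrative only: `build_pwm` and the `pseudocount` parameter are hypothetical names, and the four sites are made up rather than taken from the Bicoid data.

```python
from collections import Counter

BASES = "ACGT"

def build_pwm(sites, pseudocount=0.0):
    """Return one {base: frequency} dict per column of the aligned sites."""
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("all sites must have the same length")
    pwm = []
    total = len(sites) + pseudocount * len(BASES)
    for column in zip(*sites):
        counts = Counter(column)
        pwm.append({b: (counts[b] + pseudocount) / total for b in BASES})
    return pwm

# Made-up sites for illustration.
sites = ["TAATCCC", "AAATCCG", "TAATCCA", "TAATCCT"]
pwm = build_pwm(sites)
print(pwm[0])  # first column: T in 3 of 4 sites, A in 1 of 4
```

Unlike a consensus string, the matrix keeps the per-position base frequencies, which is what makes a probabilistic treatment possible.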

    12. The motif finding problem (in one species) • Suppose a transcription factor (TF) regulates five different genes • Each of the five genes should have binding sites for the TF in its promoter region [Figure: promoters of Gene 1 through Gene 5, with the binding sites for TF marked]

    13. The motif finding problem • Now suppose we are given the promoter regions of the five genes G1, G2, … G5 • Can we find the binding sites of TF, without knowing about them a priori ? • Binding sites are similar to each other, but not necessarily identical • This is the motif finding problem • To find a motif that represents binding sites of an unknown TF

    14. Motif finding algorithms • Version 1: Given promoter regions of co-regulated genes, find the motif • Existing algorithms: • Gibbs sampling (MCMC): Lawrence et al. 1993 • MEME (Expectation-Maximization): Bailey & Elkan 1994 • CONSENSUS (greedy local search, beam search): Hertz & Stormo • Word enumeration methods (with emphasis on statistical accuracy): van Helden et al. 1998, Sinha & Tompa 2000 • And a hundred others

    15. Comparative Genomics

    16. More Data • Genomes of multiple species available [Figure: four sequences related by an evolutionary tree, with blocks of conservation highlighted] species1 GCGTGATCGAGCTATAACGGAA species2 CTGTGATCGTCGGGTAACGCCC species3 TGGTGATCGGAACCCCTAACGA species4 AAGTGATCGATTATCCTAACGT
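The blocks of conservation in this four-species example can be recovered with a very simple exact-match scan. The sketch below is a stand-in for a real alignment method, not the approach used by any tool named in these slides: it reports the k-mers present in every sequence, and the function names and the choice k = 5 are assumptions.

```python
def kmers(seq, k):
    """Set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def shared_kmers(seqs, k):
    """k-mers that occur, exactly, in every sequence."""
    common = kmers(seqs[0], k)
    for s in seqs[1:]:
        common &= kmers(s, k)
    return common

# Sequences copied from the slide.
species = {
    "species1": "GCGTGATCGAGCTATAACGGAA",
    "species2": "CTGTGATCGTCGGGTAACGCCC",
    "species3": "TGGTGATCGGAACCCCTAACGA",
    "species4": "AAGTGATCGATTATCCTAACGT",
}
conserved = sorted(shared_kmers(list(species.values()), 5))
print(conserved)  # ['GATCG', 'GTGAT', 'TAACG', 'TGATC']
```

The three overlapping 5-mers GTGAT, TGATC, and GATCG together cover the block GTGATCG, and TAACG is the second block. Note that TAACG sits at different offsets in different species, which is why real data calls for alignment with gaps rather than column-by-column comparison.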

    17. Using multiple genomes • Functional parts of the genome evolve more slowly than non-functional parts • Identify conserved parts by sequence alignment algorithms • Look for functional features in conserved regions – this improves the signal Popular Paradigm in Computational Biology

    18. Multiple sequence alignment • Comparative genomics relies upon the ability to detect “similar” (evolutionarily related) regions in different genomes • The problem of multiple species alignment • A hard computational problem (“NP-hard”) • Several fast heuristics exist (MLAGAN, TBA) • Assume this functionality exists …

    19. Back to motif finding [Figure: promoters of Gene 1 through Gene 5, with the binding sites for TF marked]

    20. Motif finding from multiple species data • Version 2: Given promoter regions of the same gene from multiple species, find the motif [Figure: upstream regions of Gene G in Species 1 through Species 5, with the binding sites for TF marked]

    21. One approach • Do multiple sequence alignment of upstream regions of the gene • Look for recurring motifs in conserved blocks [Figure: upstream regions of Gene G in Species 1 through Species 5, with blocks of conservation marked]

    22. Another approach (alignment-free) • What if binding sites are not entirely within conserved blocks? • Look for recurring motifs in entire upstream regions [Figure: upstream regions of Gene G in Species 1 through Species 5, with blocks of conservation marked]

    23. Footprinter (Blanchette et al.): The method without alignments

    24. Footprinter • The input sequences are promoter regions of the same gene, but from multiple species. • Such sequences are said to be “orthologous” to each other.

    25. Footprinter • Input sequences, related by an evolutionary tree • Find motif

    26. A side note: Parsimony • A guiding principle in cross-species comparison • If the data can be explained in multiple ways, prefer the explanation with the fewest events (be parsimonious) • Parsimony score = number of evolutionary events (e.g., substitutions) on the tree • Maximum parsimony principle: minimize parsimony score

    27. Phylogenetic footprinting: formally speaking Given: • phylogenetic tree T, • set of orthologous sequences at leaves of T, • length k of motif, • threshold d. Problem: • Find set S of k-mers, one k-mer from each leaf, such that the “parsimony” score of S in T is at most d.

    28. AGTCGTACGTGAC...(Human) AGTAGACGTGCCG...(Chimp) ACGTGAGATACGT...(Rabbit) GAACGGAGTACGT...(Mouse) TCGTGACGGTGAT... (Rat) Small Example Size of motif sought: k = 4

    29. AGTCGTACGTGAC... AGTAGACGTGCCG... ACGTGAGATACGT... GAACGGAGTACGT... TCGTGACGGTGAT... ACGT ACGT ACGT ACGG Solution Parsimony score: 1 mutation
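The parsimony score of a labeled tree can be checked by counting substitutions along every edge. The sketch below assumes a particular topology, ((Human, Chimp), Rabbit, (Mouse, Rat)), and assumes that every internal node is labeled ACGT; neither is stated in the transcript, and the helper names are hypothetical.

```python
def hamming(s, t):
    """Number of positions at which two equal-length strings differ."""
    return sum(a != b for a, b in zip(s, t))

def parsimony_score(node):
    """node = (label, children); sum substitutions over all edges below node."""
    label, children = node
    return sum(hamming(label, child[0]) + parsimony_score(child)
               for child in children)

def leaf(kmer):
    return (kmer, [])

# Assumed topology and internal labels; leaf k-mers are taken from the
# sequences on the slide (Rat has ACGG but no ACGT).
tree = ("ACGT", [
    ("ACGT", [leaf("ACGT"), leaf("ACGT")]),  # Human, Chimp
    leaf("ACGT"),                            # Rabbit
    ("ACGT", [leaf("ACGT"), leaf("ACGG")]),  # Mouse, Rat
])
score = parsimony_score(tree)
print(score)  # 1
```

The single substitution falls on the edge leading to Rat, matching the "1 mutation" reported on the slide.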

    30. An Exact Algorithm (Blanchette’s algorithm) • $W_u[s]$ = best parsimony score for subtree rooted at node u, if u is labeled with string s • Each node stores a table of $4^k$ entries, one per k-mer [Figure: tree over the sequences AGTCGTACGTG, ACGGGACGTGC, ACGTGAGATAC, GAACGGAGTAC, TCGTGACGGTG, showing the table entries for ACGG and ACGT at each node; an entry is 0 at a leaf that contains the k-mer and $+\infty$ otherwise]

    31. Recurrence • $W_u[s] = \sum_{v:\ \text{child of } u} \min_{t} \left( W_v[t] + d(s, t) \right)$ • A post-order traversal algorithm

    32. Running Time • $W_u[s] = \sum_{v:\ \text{child of } u} \min_{t} \left( W_v[t] + d(s, t) \right)$ • $O(k \cdot 4^{2k})$ time per node
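The recurrence can be sketched directly as a post-order traversal that fills one table of $4^k$ scores per node. This is a minimal illustration, not the Footprinter implementation: it assumes d(s, t) is the Hamming distance, it assumes the tree topology ((Human, Chimp), Rabbit, (Mouse, Rat)), which the transcript does not give, and it uses only the sequence prefixes shown on the small-example slide.

```python
from itertools import product

K = 4
KMERS = ["".join(p) for p in product("ACGT", repeat=K)]
INF = float("inf")

def hamming(s, t):
    return sum(a != b for a, b in zip(s, t))

def leaf_table(seq):
    """W[s] = 0 if the k-mer s occurs in the leaf sequence, else infinity."""
    present = {seq[i:i + K] for i in range(len(seq) - K + 1)}
    return {s: (0 if s in present else INF) for s in KMERS}

def node_table(tree, seqs):
    """Post-order evaluation of W_u[s] = sum over children v of min_t (W_v[t] + d(s, t))."""
    if isinstance(tree, str):          # a leaf is named by its species
        return leaf_table(seqs[tree])
    table = dict.fromkeys(KMERS, 0)
    for child in tree:
        child_table = node_table(child, seqs)
        finite = [(t, w) for t, w in child_table.items() if w < INF]
        for s in KMERS:
            table[s] += min(w + hamming(s, t) for t, w in finite)
    return table

SEQS = {
    "Human": "AGTCGTACGTGAC", "Chimp": "AGTAGACGTGCCG",
    "Rabbit": "ACGTGAGATACGT", "Mouse": "GAACGGAGTACGT",
    "Rat": "TCGTGACGGTGAT",
}
TREE = (("Human", "Chimp"), "Rabbit", ("Mouse", "Rat"))  # assumed topology

root_table = node_table(TREE, SEQS)
best_score = min(root_table.values())
print(best_score, root_table["ACGT"])  # 1 1
```

For each node the inner loops compare every pair of k-mers (s, t), and each comparison costs O(k), which is where the $O(k \cdot 4^{2k})$ bound per node comes from. Recovering the chosen k-mer at each leaf would additionally require storing the minimizing t for every table entry.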

    33. Footprinter: features • One of the earliest motif-finding algorithms based on comparative genomics • Simple formulation of motif score, algorithm efficient in practice • Cannot combine evolutionary conservation information with overrepresentation information • For example, it cannot distinguish between two motifs that are equally conserved when only one of them occurs in the promoters of many co-regulated genes

    34. PhyloCon (Stormo lab): The method with alignments

    35. The underlying single-species algorithm: CONSENSUS Final goal: Find a set of substrings, one in each input sequence. The set of substrings defines a PWM. Goal: This PWM should have high information content. High information content means that the motif “stands out”.

    36. The underlying single-species algorithm: CONSENSUS Start with a substring in one input sequence Build the set of substrings incrementally, adding one substring at a time The current set of substrings.

    37. The underlying single-species algorithm: CONSENSUS Start with a substring in one input sequence Build the set of substrings incrementally, adding one substring at a time The current set of substrings. The current motif.

    38. The underlying single-species algorithm: CONSENSUS Start with a substring in one input sequence Build the set of substrings incrementally, adding one substring at a time The current set of substrings. The current motif. Consider every substring in the next sequence, try adding it to current motif and scoring resulting motif

    39. The underlying single-species algorithm: CONSENSUS Start with a substring in one input sequence Build the set of substrings incrementally, adding one substring at a time The current set of substrings. The current motif. Pick the best one ….

    40. The underlying single-species algorithm: CONSENSUS Start with a substring in one input sequence Build the set of substrings incrementally, adding one substring at a time The current set of substrings. The current motif. Pick the best one … and repeat
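The incremental procedure described on these slides can be sketched as a greedy search that keeps only the single best extension at each step. This is a simplified illustration and not the CONSENSUS program itself, which the earlier slide describes as using beam search; the function names, the uniform background, and the toy sequences are all assumptions.

```python
import math
from collections import Counter

def score_motif(sites, background=0.25):
    """Information content in bits, assuming a uniform background."""
    total = 0.0
    for column in zip(*sites):
        n = len(column)
        for count in Counter(column).values():
            f = count / n
            total += f * math.log2(f / background)
    return total

def greedy_consensus(seqs, width):
    """Try every start substring in the first sequence, then extend greedily."""
    best_sites, best_score = None, -math.inf
    for i in range(len(seqs[0]) - width + 1):
        sites = [seqs[0][i:i + width]]
        for seq in seqs[1:]:
            candidates = [seq[j:j + width] for j in range(len(seq) - width + 1)]
            # Add the substring that gives the highest-scoring motif so far.
            sites.append(max(candidates, key=lambda c: score_motif(sites + [c])))
        score = score_motif(sites)
        if score > best_score:
            best_sites, best_score = sites, score
    return best_sites, best_score

# Toy sequences with TAATCC planted in each; made up for illustration.
seqs = ["GGCTAATCCAGT", "TTAATCCGACGA", "CAGTCTAATCCT"]
sites, score = greedy_consensus(seqs, 6)
print(sites, score)
```

Keeping only one candidate per step makes the search fast but easy to mislead; retaining several partial motifs at each step, as in a beam search, reduces that risk.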

    41. The key: Scoring a motif The current motif. Scoring a motif:

    42. The key: Scoring a motif The current motif. Scoring a motif: • Build a PWM • Compute information content of PWM: • For each column, compute relative entropy relative to a “background” distribution • Sum over all columns • Key: to align the sites of a motif, and score the alignment
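The scoring step can be made concrete by computing, for each column of the aligned sites, the relative entropy of the observed base frequencies against the background, then summing over columns. The sketch below is illustrative: the function names, both background distributions, and the example sites are assumptions, and no pseudocounts are applied.

```python
import math
from collections import Counter

def column_relative_entropy(column, background):
    """Sum over observed bases b of f_b * log2(f_b / p_b), in bits."""
    n = len(column)
    counts = Counter(column)
    # Bases with zero count contribute nothing (0 * log 0 is taken as 0).
    return sum((c / n) * math.log2((c / n) / background[b])
               for b, c in counts.items())

def information_content(sites, background):
    """Total relative entropy over all columns of the aligned sites."""
    return sum(column_relative_entropy(col, background) for col in zip(*sites))

uniform = {b: 0.25 for b in "ACGT"}
at_rich = {"A": 0.35, "C": 0.15, "G": 0.15, "T": 0.35}  # hypothetical background

sites = ["TAATCC", "TAATCC", "AAATCC", "TAATCG"]  # made-up sites
print(information_content(sites, uniform))
print(information_content(sites, at_rich))
```

With a uniform background a perfectly conserved column scores 2 bits. Under an AT-rich background, a conserved C or G column scores more than a conserved A or T column, because it is more surprising relative to what the background would produce.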

    43. Extending CONSENSUS to multiple species Final goal: Find a set of substrings, one in each input sequence

    44. Extending CONSENSUS to multiple species Final goal: Find a set of “profiles”, one in each set of orthologous input sequences

    45. Extending CONSENSUS to multiple species “Profiles”

    46. Extending CONSENSUS to multiple species “Profiles”

    47. Extending CONSENSUS to multiple species

    48. Aligning two “profiles” • Compare two profiles column by column • Each column of a profile is $(n_A, n_C, n_G, n_T)$, or equivalently, $(f_A, f_C, f_G, f_T)$ • Probabilistic score to capture whether two columns $\{n_{bi}, f_{bi}\}_b$ and $\{n_{bj}, f_{bj}\}_b$ are from the same distribution (and different from background) • ALLR: Avg. Log Likelihood Ratio, $\mathrm{ALLR} = \frac{\sum_b n_{bj} \ln(f_{bi}/p_b) + \sum_b n_{bi} \ln(f_{bj}/p_b)}{\sum_b (n_{bi} + n_{bj})}$, where $p_b$ is the background frequency of base b

    49. One cool feature of ALLR • Its expected value is negative, which means that very long profiles will not automatically give large ALLR scores • Therefore, it can automatically detect the “right” motif length
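The ALLR score for a pair of columns can be sketched as follows: each column's counts are scored against the other column's frequencies, relative to the background, and the two log-likelihood ratios are averaged over the total count. This is a minimal illustration under assumptions, not PhyloCon's code: the function names are hypothetical, and the pseudocount used to keep frequencies away from zero is an addition that is not in the slides.

```python
import math

BASES = "ACGT"

def frequencies(counts, pseudocount):
    """Convert base counts (n_A, n_C, n_G, n_T) to smoothed frequencies."""
    total = sum(counts[b] for b in BASES) + pseudocount * len(BASES)
    return {b: (counts[b] + pseudocount) / total for b in BASES}

def allr(counts_i, counts_j, background, pseudocount=1.0):
    """Average log likelihood ratio of two profile columns i and j."""
    f_i = frequencies(counts_i, pseudocount)
    f_j = frequencies(counts_j, pseudocount)
    numerator = sum(
        counts_j[b] * math.log(f_i[b] / background[b])
        + counts_i[b] * math.log(f_j[b] / background[b])
        for b in BASES
    )
    return numerator / sum(counts_i[b] + counts_j[b] for b in BASES)

background = {b: 0.25 for b in BASES}
col_a = {"A": 5, "C": 0, "G": 0, "T": 0}
col_t = {"A": 0, "C": 0, "G": 0, "T": 5}

print(allr(col_a, col_a, background))  # positive: columns agree
print(allr(col_a, col_t, background))  # negative: columns disagree
```

Summing this score over the aligned columns of two profiles gives a profile-to-profile score, and because mismatched or background-like columns contribute negative values, extending an alignment past the true motif tends to lower the total.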

    50. PhyloCon: features • One of the first algorithms to find motifs that are conserved across species and occur in multiple co-regulated gene promoters • Does not consider the evolutionary relationships among species (all species weighted equally)