
Regulatory Motif Finding

Regulatory Motif Finding. Statistical Models for Biological Sequence Motif Discovery, Liu J, Gupta, Liu X, Mayerhofere, Lawrence. Discovery of Regulatory Elements by a Computational Method for Phylogenetic Footprinting, Blanchette & Tompa (2002). “Regulatory Motif Finding”.

arnie

Presentation Transcript


  1. Regulatory Motif Finding • Statistical Models for Biological Sequence Motif Discovery, Liu J, Gupta, Liu X, Mayerhofere, Lawrence • Discovery of Regulatory Elements by a Computational Method for Phylogenetic Footprinting, Blanchette & Tompa (2002)

  2. “Regulatory Motif Finding” • What is being regulated? • What is a “Motif?” • Why do we want to find them?

  3. Central Dogma of Genetics • It’s “TRUE,” right?! (pict by Andrew Hughes, Rice University) • Yes, but…

  4. Every Protein in Every Cell? • Clearly, there are complicated mechanisms at work • Rhodopsin • But, we have the same DNA in all cells…

  5. Transcriptional Regulation • It is transcription (DNA → RNA) that is being regulated. • RNA Polymerase II, aided by Transcription Factors (TFs) • Where do TFs bind?

  6. Promoter Regions • TATA box – usually ~ 30 bp upstream of gene (pict by Andrew Hughes, Rice University) • But, there are others...Where? What Sequence?

  7. Promoter Sequence • Many different possible locations, sometimes extremely far from the start of transcription! • What Sequence? THAT is the $64k (or $1B) Question…

  8. Motifs • Many different promoter sequences found • Basal: TATA-box (about -30), CCAAT-box (-100) • Additional transcriptional regulatory domains • Activators and inhibitors use these domains

  9. Motifs (2) • Not exact sequences – that would be too easy • Strength of binding affects the level of promotion/inhibition (C/G vs A/T) • Often palindromic (GATATC) • Described either probabilistically with motif logos or with extended single-letter nucleotide codes
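A small sketch of what "palindromic" means for a DNA site: the sequence equals its own reverse complement, so it reads the same on both strands. The function names here are illustrative and not from the slides.

```python
# Translation table mapping each base to its Watson-Crick complement.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def is_palindromic(seq: str) -> bool:
    """A DNA palindrome equals its own reverse complement."""
    return seq.upper() == reverse_complement(seq)


print(is_palindromic("GATATC"))  # True: the example from the slide
print(is_palindromic("GATTAC"))  # False
```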

  10. Extended Single-Letter Codes • Letters represent possible bases in each position: • TGASTMA – Promoter Sequence for several oncogenes

      Symbol    Meaning
      A         Adenine
      G         Guanine
      C         Cytosine
      T         Thymine
      U         Uracil
      Y         pYrimidine (C or T)
      R         puRine (A or G)
      W         "Weak" (A or T)
      S         "Strong" (C or G)
      K         "Keto" (T or G)
      M         "aMino" (C or A)
      B         not A (C or G or T)
      D         not C (A or G or T)
      H         not G (A or C or T)
      V         not T (A or C or G)
      X, N, ?   unknown (A or C or G or T)
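One way to use the extended codes in practice is to expand a pattern into a regular expression and scan a sequence with it. The sketch below assumes DNA input (U is treated as T, which is a simplification of mine), and the helper names and the test sequence are made up for illustration.

```python
import re

# Extended single-letter codes from the table, as sets of allowed bases.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "T",  # assumption: U read as T
    "Y": "CT", "R": "AG", "W": "AT", "S": "CG", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT", "X": "ACGT",
}


def iupac_to_regex(pattern: str) -> str:
    """Expand each code letter into a regex character class."""
    parts = []
    for symbol in pattern.upper():
        bases = IUPAC[symbol]
        parts.append(bases if len(bases) == 1 else "[" + bases + "]")
    return "".join(parts)


def find_sites(pattern: str, seq: str):
    """Return (start, matched word) for every match, including overlapping ones."""
    regex = re.compile("(?=(" + iupac_to_regex(pattern) + "))")
    return [(m.start(), m.group(1)) for m in regex.finditer(seq.upper())]


print(iupac_to_regex("TGASTMA"))                      # TGA[CG]T[AC]A
print(find_sites("TGASTMA", "CCTGACTCAGGTGAGTAATT"))  # hypothetical sequence
```

The lookahead group is there so that overlapping occurrences are all reported rather than only the first of each overlapping run.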

  11. Motif Logos • Height of letters represents probability of being found in that location in the motif
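As a rough sketch of what sits behind a logo: a set of aligned binding sites gives a per-position frequency matrix, and in the commonly used logo convention each letter is drawn with height equal to its frequency times the information content of the column (in bits). The aligned sites below are invented for illustration, and no pseudocounts or small-sample correction are applied.

```python
import math

BASES = "ACGT"


def frequency_matrix(sites):
    """Per-position base frequencies for equal-length aligned sites."""
    k = len(sites[0])
    matrix = []
    for i in range(k):
        column = [site[i] for site in sites]
        matrix.append({b: column.count(b) / len(sites) for b in BASES})
    return matrix


def information_content(freqs):
    """Column information in bits: 2 minus the Shannon entropy (DNA alphabet)."""
    return 2.0 + sum(p * math.log2(p) for p in freqs.values() if p > 0)


def letter_heights(freqs):
    """Height of each letter in the logo column: frequency times information."""
    ic = information_content(freqs)
    return {b: p * ic for b, p in freqs.items()}


# Hypothetical aligned sites, all consistent with the pattern TGASTMA.
sites = ["TGACTCA", "TGAGTCA", "TGACTAA", "TGAGTCA"]
matrix = frequency_matrix(sites)
for position, freqs in enumerate(matrix, start=1):
    print(position, round(information_content(freqs), 3), letter_heights(freqs))
```

A fully conserved column carries 2 bits and a column with all four bases equally likely carries 0, which is why conserved positions stand tall in a logo.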

  12. Why do we care? • Gene regulation → transcriptional regulation • Can teach us about our complex signaling pathways • Drugs and Money

  13. So…Finding Regulatory Motifs • Statistical Models paper (Liu et al) • Assumes: We have located genes that we expect to be co-regulated (microarrays, co-expression)

  14. So…Finding Regulatory Motifs • Experimental methods of determining TF binding sites (Gel Shift assay, DNA Protection Assay) • Statistical models

  15. Single-Site Model • Assumes: - Each sequence contains exactly 1 motif occurrence - Sequences are generated by random draws from {A,C,G,T} with given prior probabilities - The motif has a frequency matrix with one column per position • Use the Gibbs site sampler, treating the unknown motif locations as a missing-data problem: randomly choose initial motif locations, then repeatedly move each motif location based on P(a_k)

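  16. Gibbs Sampling • Sampling: for every k-long word $x_j, \ldots, x_{j+k-1}$ in $x$:
      • $Q_j = \Pr[\text{word} \mid \text{motif}] = M(1, x_j) \cdots M(k, x_{j+k-1})$
      • $P_j = \Pr[\text{word} \mid \text{background}] = B(x_j) \cdots B(x_{j+k-1})$
      • Let $A_j = \dfrac{Q_j / P_j}{\sum_{j'=1}^{|x|-k+1} Q_{j'} / P_{j'}}$
      • Sample a random new position $a_i$ according to the probabilities $A_1, \ldots, A_{|x|-k+1}$
      • [Figure: sampling probability plotted against position in the sequence, from 0 to |x|]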
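The sampling step can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation from the Liu et al. paper: the background B is taken to be uniform, the motif matrix M is re-estimated with a pseudocount of 1 from the sites currently chosen in the other sequences, and all function names and parameters are mine.

```python
import random

BASES = "ACGT"


def build_profile(sites, k, pseudocount=1.0):
    """Motif matrix M: per-position base frequencies with pseudocounts."""
    profile = []
    for i in range(k):
        counts = {b: pseudocount for b in BASES}
        for site in sites:
            counts[site[i]] += 1
        total = sum(counts.values())
        profile.append({b: counts[b] / total for b in BASES})
    return profile


def position_weights(seq, k, profile, background):
    """Unnormalized weights Q_j / P_j for every k-long word in seq."""
    weights = []
    for j in range(len(seq) - k + 1):
        q = p = 1.0
        for i in range(k):
            base = seq[j + i]
            q *= profile[i][base]   # Prob[word | motif]
            p *= background[base]   # Prob[word | background]
        weights.append(q / p)
    return weights


def gibbs_site_sampler(seqs, k, iterations=200, seed=0):
    """Return one sampled motif start position per sequence."""
    rng = random.Random(seed)
    background = {b: 0.25 for b in BASES}  # assumption: uniform background
    positions = [rng.randrange(len(s) - k + 1) for s in seqs]
    for _ in range(iterations):
        for idx, seq in enumerate(seqs):
            # Estimate the motif from all sequences except the current one.
            others = [s[a:a + k]
                      for n, (s, a) in enumerate(zip(seqs, positions))
                      if n != idx]
            profile = build_profile(others, k)
            weights = position_weights(seq, k, profile, background)
            # random.choices normalizes the weights, which yields the A_j.
            positions[idx] = rng.choices(range(len(weights)), weights=weights)[0]
    return positions


seqs = ["AGTCGTACGTGAC", "AGTAGACGTGCCG", "ACGTGAGATACGT",
        "GAACGGAGTACGT", "TCGTGACGGTGAT"]
starts = gibbs_site_sampler(seqs, k=4)
print([s[a:a + 4] for s, a in zip(seqs, starts)])
```

Because the procedure is stochastic, the reported sites depend on the seed and the number of iterations; in practice one would run several chains and keep the best-scoring alignment.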

  17. Repetitive Block-Motif Model • View K sequences as one long sequence of length n. Model probability of a motif starting at each position ‘i’. • Problems: - Lose evolutionary relationship between sequences - Allows multiple copies of motif in each sequence - Total number of occurrences unknown

  18. The Rest of the Statistical Models Paper… • Much math: • Scoring motif candidates • Using potential motif dictionaries • Bayesian Prior Probabilities • Finding motifs with insertions in them (“gapped” motifs) • On to: Phylogenetic Footprinting

  19. Phylogenetic Footprinting • Most of paper spent describing background, results • Methods are brief, not too deep

  20. Let Evolution Be Your Guide • Phylogenetic Footprinting – “Identifying regulatory elements by finding unusually well conserved regions in a set of orthologous noncoding DNA sequences from multiple species”

  21. Orthologs and Paralogs Gene duplicate within species: Paralog Same gene in species with common ancestor: Ortholog

  22. Advantages • Doesn’t rely on reliably determining co-regulated genes (single-genome approach, non-trivial!) • Can be used to find regulatory elements specific to one single gene (caveat: conserved across species)

  23. Standard Methods • Usually start with a multiple sequence alignment (MSA; ProbCons, ClustalW) • But, this can lose signal (short regulatory elements ~20 bp, long promoter regions ~1000 bp) • Also, if species are evolutionarily close, nonfunctional regions may also be well conserved • Can start with general motif discovery algorithms (MEME, Consensus, AlignACE, DIALIGN …) • But, these don’t take into account the relative phylogenetic relationships of the sequences, so they weight closely related sequences too highly

  24. The PF Algorithm Given: • phylogenetic tree T, • set of orthologous sequences at leaves of T, • length k of motif • threshold d Problem: • Find each set S of k-mers, one k-mer from each leaf, such that the “parsimony” score of S in T is at most d.

  25. Small Example (merci, CS262) • Size of motif sought: k = 4
      AGTCGTACGTGAC... (Human)
      AGTAGACGTGCCG... (Chimp)
      ACGTGAGATACGT... (Rabbit)
      GAACGGAGTACGT... (Mouse)
      TCGTGACGGTGAT... (Rat)

  26. Solution • Parsimony score: 1 mutation
      AGTCGTACGTGAC...
      AGTAGACGTGCCG...
      ACGTGAGATACGT...
      GAACGGAGTACGT...
      TCGTGACGGTGAT...
      k-mers shown on the tree: ACGT, ACGT, ACGT, ACGG

  27. An Exhaustive Algorithm • W_u[s] = best parsimony score for the subtree rooted at node u, if u is labeled with string s • Each node carries a table with 4^k entries, one per k-mer
      [Figure: the phylogenetic tree with a score table at every node. Leaf tables hold 0 for k-mers that occur in the leaf sequence and +∞ otherwise (e.g. ACGG: +∞, ACGT: 0); internal tables hold small finite scores (e.g. ACGG: 1, ACGT: 0).]
      AGTCGTACGTG
      ACGGGACGTGC
      ACGTGAGATAC
      GAACGGAGTAC
      TCGTGACGGTG

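  28. Simple Recurrence • $W_u[s] = \sum_{v:\ \text{child of } u} \min_t \left( W_v[t] + h(s, t) \right)$ • In words: the score of a k-mer s at node u is the sum, over the children v of u, of each child’s best parsimony score given s, that is, the child’s score W_v[t] for its best label t plus the number of mismatches h(s, t) between s and t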
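A direct, unoptimized sketch of this recurrence in Python follows. The tree encoding (nested tuples with sequences at the leaves), the function names, and the tree topology used in the example are assumptions of mine; the transcript does not give the topology for the small example, and this is not FootPrinter's actual code.

```python
from itertools import product

INF = float("inf")


def hamming(s, t):
    """h(s, t): number of positions at which two k-mers differ."""
    return sum(a != b for a, b in zip(s, t))


def all_kmers(k):
    """All 4^k strings of length k over the DNA alphabet."""
    return ["".join(p) for p in product("ACGT", repeat=k)]


def parsimony_table(node, k, kmers):
    """Return W_u as a dict mapping each k-mer s to W_u[s]."""
    if isinstance(node, str):
        # Leaf: score 0 for k-mers present in the sequence, infinity otherwise.
        present = {node[i:i + k] for i in range(len(node) - k + 1)}
        return {s: (0 if s in present else INF) for s in kmers}
    child_tables = [parsimony_table(child, k, kmers) for child in node]
    table = {}
    for s in kmers:
        # Sum over children of the best (child label score + mismatch cost).
        table[s] = sum(min(child[t] + hamming(s, t) for t in kmers)
                       for child in child_tables)
    return table


def best_parsimony_score(tree, k):
    """Smallest parsimony score over all possible root labels."""
    return min(parsimony_table(tree, k, all_kmers(k)).values())


# Sequences from the small example; the topology is assumed for illustration.
human, chimp = "AGTCGTACGTGAC", "AGTAGACGTGCCG"
rabbit, mouse, rat = "ACGTGAGATACGT", "GAACGGAGTACGT", "TCGTGACGGTGAT"
tree = ((human, chimp), (rabbit, (mouse, rat)))
print(best_parsimony_score(tree, 4))  # prints 1
```

Recovering the actual k-mers at the leaves would additionally require tracing back which label t achieved each minimum; that step is omitted here to keep the sketch close to the recurrence.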

  29. Running Time • $W_u[s] = \sum_{v:\ \text{child of } u} \min_t \left( W_v[t] + h(s, t) \right)$ • $O(k \cdot 4^{2k})$ time per node • Total time $O(n k (4^{2k} + l))$, where n = number of species, k = motif length, l = average sequence length

  30. FootPrinter (http://bio.cs.washington.edu/software.html) • Avoids pitfalls of using MSA or general-purpose motif-finding algorithms • Identifies all DNA motifs that appear to have evolved more slowly than the surrounding sequence • Allows motifs to not appear in all sequences (LexA in gram +/- bacteria)

  31. FootPrinter (2) • “Given n orthologous input sequences and the phylogenetic tree T relating them, [FootPrinter] is guaranteed to produce every set of k-mers, one from each input sequence, that have a parsimony score at most d with respect to T, where k and d are parameters specified by the user.”

  32. Parameters • Can set minimum threshold on fraction of the phylogeny that must be spanned for motifs with each parsimony score ‘s’.

  33. Results • Examined 9 sets of orthologous or paralogous sequences (the method also works for duplicated genes that have since diverged). • Found: many previously known motifs, plus some highly conserved motifs of unknown function (time for the experimentalists!)

  34. One example: Metallothionein Gene Family • Good test family: • Large number of promoter sequences • Wide variety of species • Large number of regulatory elements experimentally verified in several species. • Most binding sites are within 300 bp of start codon (ATG)

  35. Inputs • Sequences: 590 bp upstream of the start codon • Most motifs found were present in multiple isoform families – accuracy was gained by considering the paralogs, not just the orthologs

  36. But, FootPrinter isn’t Perfect • Some known regulatory binding sites were missed. Why? • Ultimately, must be because the motifs were not well-enough conserved to be detected (but we can discuss more…)

  37. FootPrinter Error (1) • Some binding sites not well matched in other species. Example: Thyroid hormone receptor T3R is conserved within rodents, but not beyond. Would need many closely related species to detect this motif.

  38. FootPrinter Error (2-5) • Some motifs are well conserved, but too short • InDels in the middle of a motif – could allow them, but would get many false positives • Some barely fail to meet statistical thresholds (close but no cigar) • Dimer TFs tend to bind two conserved regions separated by a variable internal sequence
