Motif Search

Presentation Transcript


  1. Motif Search

  2. What are Motifs • Motif (dictionary) A recurrent thematic element, a common theme

  3. Find a common motif in the text

  4. Find a short common motif in the text

  5. Motifs in biological sequences Sequence motifs represent a short common sequence (length 4-20) which is highly represented in the data

  6. Motifs in biological sequences What can we learn from these motifs? • Regulatory motifs on DNA or RNA • Functional sites in proteins

  7. Regulatory Motifs on DNA • Transcription factors (TFs) are regulatory proteins that bind to regulatory motifs near a gene and act as a switch button (on/off) • TF binding motifs are usually 6-20 nucleotides long • They are located near the target gene, mostly upstream of the transcription start site [Figure: Gene X and its transcription start site, with TF1 and TF2 bound to the TF1 motif and the TF2 motif]

  8. What can we learn from these motifs? About half of all cancer patients have a mutation in a gene called p53, which codes for a key transcription factor. The mutations are in the DNA-binding region and allow tumors to survive and continue growing even after chemotherapy severely damages their DNA. [Figure: the p53 transcription factor bound to binding sites (motifs) near a target gene]

  9. Why is p53 involved in so many cancer types? p53 regulates over 100 different genes (it is a hub). We are interested in identifying the genes regulated by p53.

  10. Can we find TF targets using a bioinformatics approach?

  11. Finding TF targets using a bioinformatics approach? Scenario 1 : Binding motif is known (easier case) Scenario 2 : Binding motif is unknown (hard case)

  12. Scenario 1 : Binding motif is known • Given a motif find the binding sites in an input sequence

  13. Challenges in biological sequences Motifs are usually not exact words …….

  14. How to present non exact motifs?

  15. How to represent non-exact motifs? • Consensus string: NTAHAWT. May allow “degenerate” symbols in the string, e.g., N = A/C/G/T; W = A/T; H = not G; S = C/G; R = A/G; Y = T/C, etc. • Position Specific Scoring Matrix (PSSM): probability for each base in each position:

            1    2    3    4    5    6
        A  0.1  0.7  0.2  0.6  0.5  0.1
        T  0.7  0.1  0.5  0.2  0.2  0.8
        G  0.1  0.1  0.1  0.1  0.1  0.0
        C  0.1  0.1  0.2  0.1  0.1  0.1

  16. Given a consensus : For each position l in the input sequence, check if substring starting at position l matches the motif. Example: find the consensus motif NTAHAWT in the promoter of a gene >promoter of gene A ACGCGTATATTACGGGTACACCCTCCCAATTACTACTATAAATTCATACGGACTCAGACCTTAAAA…….
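
The position-by-position check described on this slide can be sketched in Python as follows. This is a minimal illustration, not the presenter's implementation: the `IUPAC` table and the function name `find_consensus` are assumptions made here, the table covers only the degenerate symbols listed on the previous slide, and the promoter string is just the visible prefix of the slide's example (the part before the ellipsis). Only the given strand is searched.

```python
import re

# Degenerate symbols from the slide, written as regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "N": "[ACGT]", "W": "[AT]", "H": "[ACT]",  # H = not G
    "S": "[CG]", "R": "[AG]", "Y": "[CT]",
}

# Visible prefix of the slide's "promoter of gene A" example.
PROMOTER = "ACGCGTATATTACGGGTACACCCTCCCAATTACTACTATAAATTCATACGGACTCAGACCTTAAAA"


def find_consensus(seq, consensus):
    """Return (0-based start, matched substring) for every match of the consensus."""
    pattern = "".join(IUPAC[symbol] for symbol in consensus)
    # The lookahead lets overlapping matches be reported as well.
    return [(m.start(), m.group(1)) for m in re.finditer(f"(?=({pattern}))", seq)]


hits = find_consensus(PROMOTER, "NTAHAWT")
```

On this prefix, `hits` contains two matches, `GTATATT` at offset 4 and `ATAAATT` at offset 37 (0-based).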

  17. Given a PSSM: Starting from a set of aligned motifs Seq 1 AAAGCCC Seq 2 CTATCCA Seq 3 CTATCCC Seq 4 CTATCCC Seq 5 GTATCCC Seq 6 CTATCCC Seq 7 CTATCCC Seq 8 CTATCCC Seq 9 TTATCTG
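
As a rough sketch of how the aligned sequences on this slide turn into a PSSM, the snippet below counts each base per column and divides by the number of sequences. The names `ALIGNED` and `build_pssm` are illustrative assumptions. No pseudocounts are added, so unseen bases get probability 0.

```python
from collections import Counter

# The nine aligned sequences from the slide.
ALIGNED = [
    "AAAGCCC", "CTATCCA", "CTATCCC", "CTATCCC", "GTATCCC",
    "CTATCCC", "CTATCCC", "CTATCCC", "TTATCTG",
]


def build_pssm(seqs):
    """Return one dict per column mapping each base to its observed frequency."""
    n = len(seqs)
    pssm = []
    for column in zip(*seqs):  # iterate over alignment columns
        counts = Counter(column)
        pssm.append({base: counts[base] / n for base in "ACGT"})
    return pssm


pssm = build_pssm(ALIGNED)
```

For example, C appears in 6 of the 9 sequences in the first column, so `pssm[0]["C"]` is 6/9, about 0.67.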

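  18. Given a PSSM W: the counts of each base in each column are converted to the probability of each base in each column, where $W_{b,k}$ = probability of base $b$ in column $k$ • Given a string s of length l = 7 • s = s1s2…sl • Pr(s | W) = $\prod_{k=1}^{l} W_{s_k,k}$ • Example (with W built from the aligned sequences on slide 17): Pr(CTATCCG | W) = 0.67 × 0.89 × 1 × 0.89 × 1 × 0.89 × 0.11 ≈ 0.052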

  19. Given a PSSM: • Given sequence S (e.g., 1000 base-pairs long) • For each substring s of S, • Compute Pr(s|W) • If Pr(s|W) > some threshold, call that a binding site • In DNA sequences we need to search both strands AGTTACACCA TGGTGTAACT (reverse complement) Seq1 :AAAACGTGCGTAGCAGTTACACCAACTCTA TTTTGCACGCATCGTCAATGTGGTTGAGAT Seq2 :ACTTACTACTGGTGTAACTATATATTTTCG TGAATGATGACCACATTGATATATAAAAGC
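
The scan described on this slide might look like the sketch below. It is hedged in several ways: the PSSM is a made-up one built around the slide's motif AGTTACACCA (probability 0.85 for the consensus base and the remainder split over the other three, both assumed values), the threshold 0.1 is arbitrary, and the helper names are illustrative. The reverse strand is handled by scanning the reverse complement and mapping hits back to forward-strand coordinates.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def make_pssm(consensus, match=0.85):
    """Hypothetical PSSM: `match` for the consensus base, the rest shared equally."""
    other = (1 - match) / 3
    return [{b: (match if b == c else other) for b in "ACGT"} for c in consensus]


def prob(s, pssm):
    """Pr(s | W): product of the per-column probabilities of the bases of s."""
    p = 1.0
    for k, base in enumerate(s):
        p *= pssm[k][base]
    return p


def scan(seq, pssm, threshold):
    """Return (forward-strand 0-based start, strand) for every window above threshold."""
    width = len(pssm)
    hits = []
    for strand, text in (("+", seq), ("-", reverse_complement(seq))):
        for j in range(len(text) - width + 1):
            if prob(text[j:j + width], pssm) > threshold:
                start = j if strand == "+" else len(seq) - width - j
                hits.append((start, strand))
    return hits


SEQ1 = "AAAACGTGCGTAGCAGTTACACCAACTCTA"
SEQ2 = "ACTTACTACTGGTGTAACTATATATTTTCG"
W = make_pssm("AGTTACACCA")
```

With these assumed values only exact matches pass the threshold, so `scan(SEQ1, W, 0.1)` reports a hit on the forward strand and `scan(SEQ2, W, 0.1)` reports one on the reverse strand, mirroring the slide's two examples.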

  20. Scenario 2 : Binding motif is unknown “Ab initio motif finding”

  21. Ab initio motif finding: Expectation Maximization • Local search algorithm - Start from a random PWM • Move from one PWM to another so as to improve the score which fits the sequence to the motif • Keep doing this until no more improvement is obtained : Convergence to local optima

  22. Expectation Maximization • Let W be a PWM. Let S be the input sequence. • Imagine a process that randomly searches, picks different strings matching W and threads them together to a new PWM

  23. Expectation Maximization • Find W so as to maximize Pr(S|W) • The “Expectation-Maximization” (EM) algorithm iteratively finds a new motif W that improves Pr(S|W)

  24. Expectation Maximization 1. Start from a random motif (PWM) 2. Scan the sequence for good matches to the current motif. Build a new PWM out of these matches, and make it the new motif 3. Repeat step 2 until no more improvement is obtained
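
One way to make the loop on this slide concrete is the sketch below. It is a simplified illustration under stated assumptions, not the presenter's algorithm or any particular tool: it assumes exactly one motif occurrence per input sequence, a uniform background of 0.25 per base, a fixed motif width, and a small pseudocount (`pseudo`) to avoid zero probabilities. The function name `em_motif` and all parameters are hypothetical. Like any local search, it can converge to a local optimum that depends on the random start.

```python
import random

BASES = "ACGT"


def em_motif(seqs, width, iters=50, seed=0, pseudo=0.1):
    """Return a PWM (list of per-column dicts) fitted to DNA sequences over ACGT."""
    rng = random.Random(seed)
    # Step 1: start from a random PWM with strictly positive entries.
    pwm = []
    for _ in range(width):
        raw = [rng.random() + 0.1 for _ in BASES]
        total = sum(raw)
        pwm.append({b: r / total for b, r in zip(BASES, raw)})

    for _ in range(iters):
        counts = [{b: pseudo for b in BASES} for _ in range(width)]
        for seq in seqs:
            # E-step: score every window against the current PWM
            # (likelihood ratio versus the uniform background).
            likes = []
            for i in range(len(seq) - width + 1):
                p = 1.0
                for k in range(width):
                    p *= pwm[k][seq[i + k]] / 0.25
                likes.append(p)
            total = sum(likes)
            # Each window contributes to the new counts in proportion to its score.
            for i, like in enumerate(likes):
                weight = like / total
                for k in range(width):
                    counts[k][seq[i + k]] += weight
        # M-step: the weighted counts become the new PWM.
        pwm = [{b: c[b] / sum(c.values()) for b in BASES} for c in counts]
    return pwm
```

A call such as `em_motif(["ACGTTGACAT", "GGTTGACACC", "TTGACAGGGA"], 6)` returns a six-column PWM whose columns each sum to 1. Whether it recovers a shared word depends on the starting point, which is why such searches are usually restarted from several random seeds.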

  25. The final PSSM represents the motif that is most enriched in the data. The PSSM can also be represented as a sequence logo • A letter’s height indicates the information it contains

  26. Presenting a sequence motif as a logo • Aligned sequences: TTCACG, TACATG, TACAGG, TACAAG • PWM → PSSM: divide each probability by the background probability 0.25 to get a score $S$ • Letter height = $\log_2 S$ • T at position 1: $\log_2 4 = 2$ • T at position 5: $\log_2 1 = 0$
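
The letter-height calculation on this slide can be reproduced with a few lines of Python. This sketch follows the slide's simplified rule (log2 of frequency divided by a background of 0.25), which is not necessarily how a given logo-drawing tool computes heights. The function name `letter_heights` is an assumption, and bases that never occur in a column are skipped because their log is undefined.

```python
from collections import Counter
from math import log2

SEQS = ["TTCACG", "TACATG", "TACAGG", "TACAAG"]


def letter_heights(seqs, background=0.25):
    """Return one dict per column: observed base -> log2(frequency / background)."""
    n = len(seqs)
    heights = []
    for column in zip(*seqs):
        counts = Counter(column)
        heights.append({b: log2((c / n) / background) for b, c in counts.items()})
    return heights


heights = letter_heights(SEQS)
```

Here `heights[0]["T"]` is 2.0 (T occurs in all four sequences at position 1) and `heights[4]["T"]` is 0.0 (T occurs once in four at position 5), matching the two values on the slide.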

  27. Riddle • What is the maximum height we can get in a logo describing a motif obtained from protein sequences?

  28. Are common motifs the right thing to search for ?

  29. ?

  30. Solutions: • Search for motifs which are enriched in one set but not in a random set • Use experimental information to rank the sequences according to their binding affinity and search for enriched motifs at the top of the list

  31. ChIP-Seq: sequencing the regions in the genome to which a protein (e.g. a transcription factor) binds.

  32. Finding the p53 binding motif in a set of p53 target sequences which are ranked according to binding affinity [Figure: ChIP-Seq target sequences ranked from best binders to weak binders]

  33. A word search approach to search for enriched motifs in a ranked list [Figure: candidate k-mers (CTGTGC, CTATGC, CTACGC, CTGTGA, ACTTGA, CTGTAC, ACGTGA, ATGTGC, ACGTGC, ATGTGA) searched against the ranked sequences list, with occurrences of CTGTGA and CTGTGC marked in the list]

  34. The minimal hypergeometric (mHG) statistic is used to find enriched motifs. It depends on: • The total number of input sequences • The number of sequences containing the motif • The number of sequences at the top of the list • The number of sequences containing the motif among the top sequences [Figure: occurrences of CTGTGA marked in the ranked sequences list]
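
To illustrate how the four quantities on this slide combine, the sketch below computes a hypergeometric tail for every possible cutoff of the ranked list and keeps the smallest one. The symbols `N`, `B`, `n`, `b` and the function names are assumptions introduced here, not taken from the slides. The input is a list of 0/1 flags (1 = the sequence at that rank contains the motif). The returned minimum is a score, not a p-value: because the cutoff is chosen after looking at the data, it still needs a correction that is not shown here.

```python
from math import comb


def hypergeom_tail(N, B, n, b):
    """P(X >= b) when n items are drawn from N, of which B contain the motif."""
    upper = min(n, B)
    return sum(comb(B, i) * comb(N - B, n - i) for i in range(b, upper + 1)) / comb(N, n)


def mhg_score(flags):
    """Return (smallest tail probability, cutoff n at which it is reached)."""
    N = len(flags)  # total number of input sequences
    B = sum(flags)  # number of sequences containing the motif
    best_p, best_n = 1.0, 0
    b = 0           # motif-containing sequences among the top n
    for n in range(1, N + 1):
        b += flags[n - 1]
        p = hypergeom_tail(N, B, n, b)
        if p < best_p:
            best_p, best_n = p, n
    return best_p, best_n
```

For a toy ranked list `[1, 1, 1, 0, 0, 0, 0, 0]` the minimum is reached at the cutoff n = 3, where all three motif-containing sequences sit at the top and the tail probability is 1/56.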

  35. The enriched motifs are combined to get a PSSM which represents the binding motif

  36. Protein Motifs: protein motifs are usually 6-20 amino acids long and can be represented as a consensus/profile, e.g. P[ED]XK[RW][RK]X[ED], or as a PWM
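
A consensus such as the one on this slide can be searched with an ordinary regular expression. The sketch below assumes that X means "any amino acid" and that bracketed groups list the alternatives allowed at a position. The function name and the protein string in the usage note are made up for illustration.

```python
import re


def pattern_to_regex(pattern):
    """Turn a consensus like P[ED]XK[RW][RK]X[ED] into a compiled regex."""
    # Bracket groups are already valid regex character classes;
    # X (assumed to mean any residue) becomes the regex wildcard.
    return re.compile(pattern.replace("X", "."))


motif = pattern_to_regex("P[ED]XK[RW][RK]X[ED]")
```

With a hypothetical sequence, `motif.search("MAPEAKRKGDLL")` matches `PEAKRKGD` starting at offset 2 (0-based).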
