
Finding Motifs

Finding Motifs. Vasileios Hatzivassiloglou University of Texas at Dallas. Motif consensus. The consensus is the true underlying motif, that is expressed imperfectly in real genes because of mutations across organisms

Presentation Transcript


  1. Finding Motifs Vasileios Hatzivassiloglou University of Texas at Dallas

  2. Motif consensus • The consensus is the true underlying motif, which is expressed imperfectly in real genes because of mutations across organisms • A motif instance is a particular realization of the motif consensus in a given gene; it differs from the consensus in a small number of positions

  3. Motif data example (made up) • Motif instances: • AAAAACAC • CAAAACAA • ACACAAAA • CAAAAAAC • AAAGAACA • GACAAAAA • AAGAGAAA • Motif consensus: AAAAAAAA
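
One way to recover the consensus from the instances above is a per-column majority vote. The following is a minimal sketch under that assumption; the helper names `consensus` and `mismatches` are illustrative and do not come from the slides.

```python
from collections import Counter

# Motif instances copied from the made-up example on slide 3.
instances = [
    "AAAAACAC", "CAAAACAA", "ACACAAAA", "CAAAAAAC",
    "AAAGAACA", "GACAAAAA", "AAGAGAAA",
]

def consensus(seqs):
    """Return the string made of the most frequent letter in each column."""
    # zip(*seqs) iterates over columns; all instances must share one length.
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*seqs))

def mismatches(a, b):
    """Count positions where two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

motif = consensus(instances)
print(motif)                                       # AAAAAAAA
print([mismatches(motif, s) for s in instances])   # every instance differs in 2 positions
```

Majority voting is only one possible definition of a consensus; it works here because no column has a tie.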

  4. Motif data example (real) • Positions 3-9 (out of about 22) of the cyclic AMP receptor protein transcription factor binding site in 20 samples • TTGTGGC • TTTTGAT • AAGTGTC • ATTTGCA • CTGTGAG • ATGCAAA • GTGTTAA • ATTTGAA • TTGTGAT • ATTTATT • ACGTGAT • ATGTGAG • TTGTGAG • CTGTAAC • CTGTGAA • TTGTGAC • GCCTGAC • TTGTGAT • TTGTGAT • GTGTGAA
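
Real instances are noisier than the made-up ones, so a table of per-position nucleotide counts is often more informative than a single consensus string. The sketch below builds such a table for the 20 fragments listed above; `count_matrix` is a hypothetical helper name used only for illustration.

```python
from collections import Counter

# The 20 binding-site fragments listed on slide 4.
sites = [
    "TTGTGGC", "TTTTGAT", "AAGTGTC", "ATTTGCA", "CTGTGAG",
    "ATGCAAA", "GTGTTAA", "ATTTGAA", "TTGTGAT", "ATTTATT",
    "ACGTGAT", "ATGTGAG", "TTGTGAG", "CTGTAAC", "CTGTGAA",
    "TTGTGAC", "GCCTGAC", "TTGTGAT", "TTGTGAT", "GTGTGAA",
]

def count_matrix(seqs, alphabet="ACGT"):
    """Return one dict per column mapping each nucleotide to its count."""
    matrix = []
    for col in zip(*seqs):
        tally = Counter(col)  # missing letters count as 0
        matrix.append({base: tally[base] for base in alphabet})
    return matrix

counts = count_matrix(sites)
for pos, column in enumerate(counts, start=3):  # the slide numbers these positions 3-9
    print(pos, column)
```

In this data the first listed position is an exact tie between A and T (7 each), while the fourth is T in 19 of 20 samples, which is the kind of detail a consensus string alone would hide.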

  5. Phylogenetic footprinting • A phylogenetic tree organizes related (orthologous) sequences from different species • The sequences appear as leaves • Internal nodes indicate evolutionary divergence between species • A footprint is a highly conserved region across species

  6. Identifying footprints • Main assumption: Functional DNA changes more slowly than other DNA • Therefore, closely related regions in different species are • more likely to be functional sequences • a basis for grouping species together • Footprints are DNA motifs

  7. Phylogenetic footprinting example • AGTCGTACGTGAC... (Human) • AGTAGACGTGCCG... (Chimp) • ACGTGAGATACAG... (Rabbit) • GAACGGAGTACTG... (Mouse) • TCGTGACGGTGAT... (Rat)

  8. Phylogenetic footprinting example • AGTCGTACGTGAC... (Human) • AGTAGACGTGCCG... (Chimp) • ACGTGAGATACAG... (Rabbit) • GAACGGAGTACTG... (Mouse) • TCGTGACGGTGAT... (Rat)

  9. Phylogenetic footprinting example • AGTCGTACGTGAC... (Human) • AGTAGACGTGCCG... (Chimp) • ACGTGAGATACAG... (Rabbit) • GAACGGAGTACTG... (Mouse) • TCGTGACGGTGAT... (Rat) • Tree annotations: ACGT

  10. Phylogenetic footprinting example • AGTCGTACGTGAC... (Human) • AGTAGACGTGCCG... (Chimp) • ACGTGAGATACAG... (Rabbit) • GAACGGAGTACTG... (Mouse) • TCGTGACGGTGAT... (Rat) • Tree annotations: ACGT ACGG

  11. Phylogenetic footprinting example • AGTCGTACGTGAC... (Human) • AGTAGACGTGCCG... (Chimp) • ACGTGAGATACAG... (Rabbit) • GAACGGAGTACTG... (Mouse) • TCGTGACGGTGAT... (Rat) • Tree annotations: ACGT ACG[TG] ACGG

  12. Phylogenetic footprinting example • AGTCGTACGTGAC... (Human) • AGTAGACGTGCCG... (Chimp) • ACGTGAGATACAG... (Rabbit) • GAACGGAGTACTG... (Mouse) • TCGTGACGGTGAT... (Rat) • Tree annotations: ACGT ACGT T→G mutation ACGT ACGG
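
The footprint in this example can be located in each leaf sequence with a simple pattern scan. The sketch below assumes the pattern `ACG[TG]` taken from the slide annotations and uses only the sequence prefixes shown; the function name `locate` is illustrative.

```python
import re

# Leaf sequences as shown on the slides (the trailing "..." is omitted).
leaves = {
    "Human":  "AGTCGTACGTGAC",
    "Chimp":  "AGTAGACGTGCCG",
    "Rabbit": "ACGTGAGATACAG",
    "Mouse":  "GAACGGAGTACTG",
    "Rat":    "TCGTGACGGTGAT",
}

# ACG followed by either T or G, as in the ACG[TG] annotation.
footprint = re.compile(r"ACG[TG]")

def locate(pattern, sequences):
    """Map each species to (0-based offset, matched text) of the first hit, or None."""
    hits = {}
    for species, seq in sequences.items():
        m = pattern.search(seq)
        hits[species] = (m.start(), m.group()) if m else None
    return hits

hits = locate(footprint, leaves)
for species, hit in hits.items():
    print(species, hit)
```

The scan finds ACGT in human, chimp and rabbit and ACGG in mouse and rat, at a different offset in each sequence. Real footprinting has to discover the pattern rather than being given it; this only checks a known candidate.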

  13. Finding motifs • Start with a number of related genes (or proteins) • In regulatory motif finding, the related genes are co-expressed • Recall our discussion of DNA microarrays

  14. Finding motifs: Start The red part is a gene, the green line is the associated regulatory region in non-coding DNA, and the yellow boxes are the motif instances (unknown) . . .

  15. Finding motifs: Goal The red part is a gene, the green line is the associated regulatory region in non-coding DNA, and the yellow boxes are the motif instances (unknown) . . .

  16. How does this relate to what we have discussed before? • Motif finding is a clear instance of a data mining problem • Motif finding is equivalent to local alignment across multiple sequences • Typically hundreds of sequences are aligned, sometimes thousands • There are also corresponding biological problems for global alignment of multiple sequences

  17. Multiple sequence alignment • Protein families • Sets of proteins with similar structure (3D shape), function, or evolutionary history • Usually the above properties are correlated • Given several families, to which one should a new protein be assigned? • Repeated DNA sequences • Alu sequence in humans (300 bp, appears more than 1 million times – 10% of our DNA) • An estimated 60% of the “junk” in the human genome consists of such sequences

  18. Optimal alignment • We define the multiple global alignment as an extension of strings S1, S2, ..., Sk to S′1, S′2, ..., S′k that may contain spaces with • |S′1| = |S′2| = ... = |S′k| • Removing all spaces from each S′i leaves Si • No position has a space in all S′i • We need to extend our similarity function to handle multiple strings • The optimal alignment is the one that maximizes the similarity function
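
The three conditions in this definition can be checked mechanically. The sketch below assumes "-" as the space character and uses made-up strings; `is_valid_alignment` is a hypothetical helper name.

```python
def is_valid_alignment(originals, aligned, gap="-"):
    """Check the three conditions for a multiple global alignment."""
    if not aligned or len(originals) != len(aligned):
        return False
    # 1. All extended strings have the same length.
    same_length = len({len(row) for row in aligned}) == 1
    # 2. Removing the spaces from each extended string gives back the original.
    restores = all(row.replace(gap, "") == s for s, row in zip(originals, aligned))
    # 3. No column consists of spaces only.
    no_empty_column = all(any(ch != gap for ch in col) for col in zip(*aligned))
    return same_length and restores and no_empty_column

originals = ["ACGT", "AGT", "CGT"]
print(is_valid_alignment(originals, ["ACGT", "A-GT", "-CGT"]))      # True
print(is_valid_alignment(originals, ["ACGT-", "A-GT-", "-CGT-"]))   # False: last column is all spaces
print(is_valid_alignment(originals, ["ACGT", "AGT", "CGT"]))        # False: lengths differ
```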

  19. Multiple string similarity • Many ways to do so. Most common: Sum of pairwise similarities • Assumes symmetric similarity • We need to account for σ(-,-) (usually 0) • Alternatively, we can use distances between strings and minimize the sum of the pairwise distances
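
The sum of pairwise similarities can be written as a double loop over row pairs and columns. In the sketch below the match, mismatch and gap scores are arbitrary assumptions chosen for illustration; only σ(-,-) = 0 follows the slide.

```python
from itertools import combinations

def sigma(a, b, match=1, mismatch=-1, gap=-2):
    """Score one pair of aligned symbols; two opposing spaces score 0."""
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch

def sum_of_pairs(aligned):
    """Add sigma over every pair of rows and every column of the alignment."""
    return sum(
        sigma(x, y)
        for row_a, row_b in combinations(aligned, 2)
        for x, y in zip(row_a, row_b)
    )

alignment = ["ACGT", "A-GT", "-CGT"]
print(sum_of_pairs(alignment))  # 1 + 1 + (-2) = 0 with the assumed scores
```

Because `combinations` visits each unordered pair once, this relies on σ being symmetric, as the slide assumes.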

  20. Dynamic programming for multiple sequence alignment • In pairwise alignment, we used a two-dimensional matrix to record three choices at each cell: {01}, {10}, and {11} where 1 means consume a character from the corresponding string
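
As a reminder of the pairwise case, the sketch below fills the two-dimensional table and labels the three choices with the slide's {11}, {10}, {01} notation. The scoring values are assumptions, and the function returns only the optimal score, not the alignment itself.

```python
def global_alignment_score(s, t, match=1, mismatch=-1, gap=-2):
    """Optimal global alignment score of s and t by dynamic programming."""
    rows, cols = len(s) + 1, len(t) + 1
    V = [[0] * cols for _ in range(rows)]
    # First column and row: one string aligned entirely against spaces.
    for i in range(1, rows):
        V[i][0] = i * gap
    for j in range(1, cols):
        V[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            both = V[i - 1][j - 1] + (match if s[i - 1] == t[j - 1] else mismatch)  # {11}
            only_s = V[i - 1][j] + gap   # {10}: consume a character of s only
            only_t = V[i][j - 1] + gap   # {01}: consume a character of t only
            V[i][j] = max(both, only_s, only_t)
    return V[-1][-1]

print(global_alignment_score("ACGT", "AGT"))   # 1: three matches and one space
print(global_alignment_score("ACGT", "ACGT"))  # 4
```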

  21. DP for multiple alignment • For k strings we need a k-dimensional table • Each dimension has as many elements as the length of the corresponding string plus one (for gaps at the start) • Assuming the same length n, the matrix has (n+1)^k cells • At each cell, we consider 2^k − 1 choices
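
The choices at a cell generalize the pairwise {01}, {10}, {11}: every 0/1 vector of length k except the all-zero one, which gives 2^k − 1 of them. A minimal sketch that enumerates them (the name `moves` is illustrative):

```python
from itertools import product

def moves(k):
    """All 0/1 vectors of length k with at least one 1.

    A 1 in position i means: consume a character from string i in this column.
    The all-zero vector is excluded because it would create an all-space column.
    """
    return [m for m in product((0, 1), repeat=k) if any(m)]

print(moves(2))        # [(0, 1), (1, 0), (1, 1)], the three pairwise choices
print(len(moves(3)))   # 7
print(len(moves(10)))  # 1023
```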

  22. Multiple alignment complexity • (n+1)^k = O(n^k) entries need to be filled, each in O(2^k) time • Total time O(n^k 2^k) = O((2n)^k) • Total space O(n^k) • Typically n is a few thousand and k a few hundred, making this approach impractical • Independently of whether DP is used, for the sum of pairwise similarities the problem is provably NP-complete
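
A quick calculation shows how fast the work grows. The sketch below multiplies the number of cells by the number of choices per cell; the sample values of n and k are arbitrary and only meant to give a feel for the magnitudes.

```python
def dp_work(n, k):
    """Cells in the k-dimensional table times the choices examined per cell."""
    cells = (n + 1) ** k
    choices = 2 ** k - 1
    return cells * choices

for n, k in [(10, 2), (10, 3), (100, 5), (1000, 100)]:
    work = dp_work(n, k)
    # len(str(work)) - 1 is the order of magnitude (power of ten) of the count.
    print(f"n={n}, k={k}: roughly 10^{len(str(work)) - 1} cell updates")
```

Even for n = 10 the count rises from 363 at k = 2 to 9317 at k = 3, and for realistic n and k it is far beyond what any machine can enumerate.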

  23. What to do for NP-complete problems? • Use exact methods (such as DP) for small inputs only • Use approximate methods with polynomial time and a provable error bound • Use heuristic approaches that follow plausible choices but have no guaranteed error bound • specific to the problem (such as FASTA) • general (optimization, estimation via statistical sampling such as MCMC)

  24. Center star algorithm for multiple sequence global alignment • T is the set of strings that we want to align • Pick S ∈ T that minimizes the sum of its pairwise distances to the other strings in T • The initial alignment starts with S (≡S1) • Suppose we have already aligned S1, S2, ..., Si as S′1, S′2, ..., S′i. Then we add the remaining strings one at a time by aligning Si+1 with S′1, obtaining S′i+1 and S′′1. We replace S′1 with S′′1 and add spaces to S′2, ..., S′i wherever spaces were added to S′1.
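
The procedure can be sketched in a few dozen lines. This is a simplified variant, not necessarily the exact formulation on the slide: it assumes unit-cost edit distance as the pairwise distance, aligns each string against the ungapped center, and then merges the pairwise alignments by propagating the spaces inserted into the center. All function names are illustrative.

```python
def align(s, t):
    """Unit-cost global alignment; returns (distance, s_aligned, t_aligned)."""
    n, m = len(s), len(t)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i
    for j in range(1, m + 1):
        D[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(D[i - 1][j - 1] + (s[i - 1] != t[j - 1]),
                          D[i - 1][j] + 1, D[i][j - 1] + 1)
    a, b, i, j = [], [], n, m
    while i > 0 or j > 0:  # trace one optimal path back to the origin
        if i > 0 and j > 0 and D[i][j] == D[i - 1][j - 1] + (s[i - 1] != t[j - 1]):
            a.append(s[i - 1]); b.append(t[j - 1]); i -= 1; j -= 1
        elif i > 0 and D[i][j] == D[i - 1][j] + 1:
            a.append(s[i - 1]); b.append("-"); i -= 1
        else:
            a.append("-"); b.append(t[j - 1]); j -= 1
    return D[n][m], "".join(reversed(a)), "".join(reversed(b))

def merge(msa, c_aln, t_aln):
    """Add t_aln to msa (row 0 is the gapped center), padding rows as needed."""
    rows, new_t, i, j = [[] for _ in msa], [], 0, 0
    while i < len(msa[0]) or j < len(c_aln):
        old_gap = i < len(msa[0]) and msa[0][i] == "-"
        new_gap = j < len(c_aln) and c_aln[j] == "-"
        if old_gap and not new_gap:      # column known only to the old rows
            for r, row in zip(rows, msa):
                r.append(row[i])
            new_t.append("-"); i += 1
        elif new_gap and not old_gap:    # space newly added to the center
            for r in rows:
                r.append("-")
            new_t.append(t_aln[j]); j += 1
        else:                            # same center position in both alignments
            for r, row in zip(rows, msa):
                r.append(row[i])
            new_t.append(t_aln[j]); i += 1; j += 1
    return ["".join(r) for r in rows] + ["".join(new_t)]

def center_star(strings):
    """Return aligned rows; the center string is always the first row."""
    center = min(strings, key=lambda s: sum(align(s, t)[0] for t in strings))
    others = list(strings)
    others.remove(center)
    msa = [center]
    for t in others:
        _, c_aln, t_aln = align(center, t)
        msa = merge(msa, c_aln, t_aln)
    return msa

strings = ["ACGT", "AGT", "CGT", "ACGGT"]
msa = center_star(strings)
print("\n".join(msa))
```

For this made-up input the center is ACGT, whose distances to the other three strings sum to 3. The result is a valid multiple alignment but not necessarily an optimal one; the method trades optimality for polynomial running time.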
