1 / 18

Whole genome alignments

Whole genome alignments. Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas. Review. What a score matrix is and how to calculate and use one. Why an affine gap penalty is desirable. How to align sequences using dynamic programming.

Download Presentation

Whole genome alignments

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Whole genome alignments Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas

  2. Review • What a score matrix is and how to calculate and use one. • Why an affine gap penalty is desirable. • How to align sequences using dynamic programming. • How to calculate and interpret p-values and E-values for pair alignments and database searches.

  3. Whole genome alignments Why?

  4. UCSC Browser track individual genome alignments, darker = higher scoring averaged conservation for 17 genomes alignment discontinuity (e.g. translocation break point) known gap in assembly questionable alignment segment sequence present but unalignable

  5. GQSQVGQGPPCPHHRCTTCCPDGCHFEPQVCMCDWESCCEEG GQSEVRQGPQCPYHKCIKCQPDGCHYEPTVCICREKPCDEKG

  6. How are genome-wide alignments made? • mouse and human genomes are each about 3x109 nucleotides. • how many calculations would a dynamic programming alignment have to make? • at a minimum - 3 integer additions and 3 inequality tests for each DP matrix position (by the way, there are other problems too, including assuming colinearity)

  7. Making large searches faster • Most common method is the BLAST search (Basic Local Alignment Search Tool). Only the initial step is substantially different from dynamic alignment. • Search sequence is broken into small words (usually 3 residues long for proteins). 20 * 20 * 20 = 8,000 words. These act as seeds for searches. • The target dataset is pre-indexed to indicate the positions in the database sequences that match each search word above some score threshold (using a global score matrix such as BLOSUM62).

  8. For example, the search sequence word “WVH” might score above threshold with these indexed sequences: Indexed word Score WVH 23 WIH 22 WVY 17 WIY 16 • Target sequences around each indexed word hit are retrieved and the initial match is extended in both directions: your sequence ...VFEWVHLLP... WIY database (many sites) BLAST searches (cont.)

  9. Schematic of indexed matches Result – instead of aligning these 3 amino acids to everything, they are aligned only with the tiny fraction of sequence regions that are good candidates for a valid alignment. (note- blast actually looks for two such matches close to each other)

  10. Extension and scoring Match Score: 16 -3 -2 -1 Total Score: 16 13 11 10 ...QSVFEWVHLLPGA... ..WIY.. ...QSVFEWVHLLPGA... ..WIYQ.. ...QSVFEWVHLLPGA... ..WIYQK.. ...QSVFEWVHLLPGA... ..WIYQKA.. [mention gap variant]

  11. Extension termination • Extension is continued until the cumulative score drops below some threshold (usually 0). • This permits the match to cross a region of marginal similarity or frank mismatching (e.g. a small intron in tblastn) if it flanks a region of high similarity. • Extensions whose maximal cumulative score is above some threshold are kept for reporting to user. • For web interfaces, various formatting, links, and overviews are added and reported according to user settings (it is also fairly easy to download and run your own blast).

  12. Key to speed: word matching and prior indexing • Though gapped blast local alignment is slow (like dynamic programming), only a very small part of total search space is analyzed. • Because the positions of all database word matches are indexed and stored prior to the blast search, the relevant parts of search space are reached quickly. • Tradeoff is in accuracy and certainty – occasionally matches will be missed (when they are distant enough and dispersed enough that no local word pairs match well enough).

  13. Dynamic programming after BLAST matching genome A BLAST matches DP alignment region genome B M x N manageable

  14. time Defining what a “tree” means unrooted tree (used when the root isn’t known): rooted tree (all real trees are rooted): sequences (leaves or tips) branch points root ancestral sequence branches time vaguely radiates out from somewhere near the center …divergence time is the sum of (horizontal) branch lengths

  15. A tree has topology and distances Are these different trees?

  16. The number of tree topologies grows extremely fast 3 leaves 3 branches 1 internal node 1 topology (3 insertions) 4 leaves 5 branches 2 internal nodes 3 topologies (x3) (5 insertions) In general, an unrooted tree with N leaves has: 2N – 3 branches N – 2 internal nodes ~ O(N!) topologies 5 leaves 7 branches 3 internal nodes 15 topologies (x5) (7 insertions)

  17. There are many rooted trees for each unrooted tree For each unrooted tree, there are 2N - 3 times as many rooted trees, where N is the number of leaves (# internal branches = 2N – 3). 20 leaves - 564,480,989,588,730,591,336,960,000,000 topologies

More Related