
MAVID: Constrained Ancestral Alignment of Multiple Sequences

Authors: Nicholas Bray and Lior Pachter


Presentation Transcript


  1. MAVID: Constrained Ancestral Alignment of Multiple Sequences • Authors: Nicholas Bray and Lior Pachter

  2. Outline • AVID • MAVID • Progressive alignment • Constraints • Tree Building • Experimental Results

  3. AVID: A Global Alignment Program • Fast • Memory efficient • Practical for aligning large genomic regions • Sensitive in finding homologous regions • Specific, avoiding false positives

  4. Algorithm • Repeat Masking (Optional) • Finding Matches Using Suffix Trees • Anchor Selection • Recursion

  5. (Flowchart) Repeat masking -> Match finding -> Anchor selection -> Enough anchors? • If yes: split sequences using anchors, then recurse • If no: base pair alignment

  6. Repeat Masking (Optional) • RepeatMasker (http://ftp.genome.washington.edu/RM/RepeatMasker.html) • Repeat matches • Clean matches

  7. Finding Matches Using Suffix Trees

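  8. Finding Matches Using Suffix Trees • Maximal repeated substring (match) • A repeated substring such that no longer substring containing it is repeated in the string • Maximal matches between two sequences • Pairs of matching subsequences whose flanking bases are mismatches • Transform: maximal matches between two sequences are found by recasting them as maximal repeated substrings of a single string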
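
To make the definition of a maximal match concrete, here is a minimal sketch that enumerates maximal exact matches by brute force. It is not the suffix-tree method named on the slide (which is far faster on genomic sequence); the function name `maximal_matches` and the `min_len` parameter are hypothetical and chosen only for illustration.

```python
def maximal_matches(s, t, min_len=1):
    """Return (i, j, length) for each maximal exact match between s and t.

    A match is maximal when it cannot be extended to the left or right,
    i.e. its flanking bases are mismatches (or a sequence boundary).
    Brute force, quadratic or worse; an illustration only.
    """
    matches = []
    for i in range(len(s)):
        for j in range(len(t)):
            if s[i] != t[j]:
                continue
            # Skip starts that could be extended to the left.
            if i > 0 and j > 0 and s[i - 1] == t[j - 1]:
                continue
            # Extend to the right as far as the bases agree.
            k = 0
            while i + k < len(s) and j + k < len(t) and s[i + k] == t[j + k]:
                k += 1
            if k >= min_len:
                matches.append((i, j, k))
    return matches


print(maximal_matches("ACGTT", "CGTA", min_len=2))
```

With `min_len=2`, the only maximal match reported for these two toy strings is `CGT`, starting at position 1 of the first string and position 0 of the second.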

  9. (Figure) Maximal repeated substring • Maximal matches between two sequences • Transform

  10. Anchor Selection • Eliminate noisy matches (those less than half the length of the longest match) • The remaining matches are ordered as follows • Long clean -> short clean -> long repeat -> short repeat

  11. Anchor Selection • A variant of the Smith-Waterman algorithm (no overlapping matches) • Gap score: 0 • Mismatch score: -∞ • Match score: 10 bp
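
The two anchor-selection slides can be sketched as a filter followed by a chaining step. This is a simplified stand-in, not AVID's Smith-Waterman variant: it drops matches shorter than half the longest match, then uses a quadratic dynamic program to pick the collinear, non-overlapping chain with the largest total match length. It ignores the clean/repeat ordering, and the name `select_anchors` and the `(i, j, length)` tuple layout are assumptions.

```python
def select_anchors(matches):
    """Pick a collinear, non-overlapping chain of matches.

    matches: list of (i, j, length) tuples, where i and j are start
    positions in the first and second sequence (assumed layout).
    """
    if not matches:
        return []
    longest = max(length for _, _, length in matches)
    # Eliminate noisy matches: less than half the longest match.
    kept = sorted(m for m in matches if 2 * m[2] >= longest)

    best = [m[2] for m in kept]   # best chain score ending at each match
    prev = [-1] * len(kept)       # back-pointers for the traceback
    for k, (i, j, length) in enumerate(kept):
        for p in range(k):
            pi, pj, plen = kept[p]
            # p may precede k only if it ends before k starts in both sequences.
            if pi + plen <= i and pj + plen <= j and best[p] + length > best[k]:
                best[k] = best[p] + length
                prev[k] = p

    k = max(range(len(kept)), key=best.__getitem__)
    chain = []
    while k != -1:
        chain.append(kept[k])
        k = prev[k]
    return chain[::-1]


candidates = [(0, 0, 10), (12, 15, 8), (5, 30, 9), (25, 26, 6), (40, 2, 3)]
print(select_anchors(candidates))
```

In this toy input the match of length 3 is filtered out as noise, and the match at `(5, 30)` is left out of the chain because it is not collinear with the matches that follow it.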

  12. Recursion

  13. Condition • If there are still significant matches: recurse when the anchor set covers >50% of the length of the sequence; otherwise use the Needleman-Wunsch algorithm • If there are no significant matches: use the Needleman-Wunsch algorithm for a short sequence (<4 kb); use the trivial alignment (gap) for a long sequence
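
Read as a two-level decision, the condition slide might be sketched as follows. The grouping of the bullets is inferred from the flattened slide text, so treat the branch structure as an assumption; the function name `next_step` and its parameters are hypothetical, while the 50% and 4 kb thresholds come from the slide.

```python
def next_step(has_significant_matches, anchor_coverage, seq_len,
              short_limit=4000):
    """Decide what to do with a pair of subsequences.

    anchor_coverage: fraction of the sequence covered by the anchor set.
    seq_len: sequence length in base pairs.
    """
    if has_significant_matches:
        # Enough anchors: split on them and recurse; otherwise align directly.
        return "recurse" if anchor_coverage > 0.5 else "needleman-wunsch"
    # No significant matches: exact alignment only if the region is short.
    return "needleman-wunsch" if seq_len < short_limit else "trivial-gap"


print(next_step(True, 0.7, 120_000))
print(next_step(False, 0.0, 2_500))
print(next_step(False, 0.0, 50_000))
```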

  14. MAVID • Rapidly aligning multiple large genomic regions • Incorporating biologically meaningful heuristics • Sound alignment strategies

  15. Method • Core: progressive ancestral alignment, which incorporates preprocessed constraints • Terminology • Match • A similar (not necessarily identical) region shared by two sequences • Constraint • A required ordering of positions in the alignment

  16. Standard progressive alignment • Compute the distance matrix by aligning all pairs of sequences • Build a phylogenetic tree (guide tree) from the distance matrix • Cluster • Midpoint method • Progressively align the sequences according to the branching order in the guide tree • Aligning two alignments • An alignment is viewed as a sequence
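
As an illustration of the "distance matrix to guide tree" step, here is a minimal average-linkage (UPGMA-style) clustering sketch. The slide does not say which clustering method is used, so UPGMA is an assumption, and the distances in the example are made up for illustration. The function name `guide_tree` is hypothetical.

```python
from itertools import combinations


def guide_tree(names, distances):
    """Build a guide tree as nested tuples by average-linkage clustering.

    distances: dict mapping (name_a, name_b) pairs to a distance.
    """
    sizes = {name: 1 for name in names}  # cluster -> number of leaves
    dist = {frozenset(pair): d for pair, d in distances.items()}

    while len(sizes) > 1:
        # Merge the closest pair of clusters.
        a, b = min(combinations(sizes, 2), key=lambda p: dist[frozenset(p)])
        size_a, size_b = sizes.pop(a), sizes.pop(b)
        merged = (a, b)
        # Distance to the merged cluster is the size-weighted average.
        for c in sizes:
            dist[frozenset((merged, c))] = (
                size_a * dist[frozenset((a, c))]
                + size_b * dist[frozenset((b, c))]
            ) / (size_a + size_b)
        sizes[merged] = size_a + size_b

    return next(iter(sizes))


toy = {("human", "mouse"): 0.50, ("human", "rat"): 0.52, ("mouse", "rat"): 0.20}
print(guide_tree(["human", "mouse", "rat"], toy))
```

The nested tuple gives the branching order: the innermost pair is aligned first, and each enclosing tuple is a later progressive-alignment step.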

  17. Method

  18. Key difference • Instead of aligning alignments, we first infer ancestral sequences of alignments using maximum-likelihood estimation within a probabilistic evolutionary model • Maximum-likelihood estimation • A widely used statistical method for inferring the parameters of the probability distribution underlying a given data set

  19. Key difference • The ancestral sequences are then aligned with AVID • The scores of the Smith-Waterman step are assigned according to the branch lengths of the two alignments • The alignment of the ancestral sequences is then used to glue the two alignments together. Gaps in the ancestral sequences lead to gaps in the multiple alignment

  20. (Figure) Alignment A -> Ancestral A • Alignment B -> Ancestral B • Ancestral A and Ancestral B are aligned with AVID
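
The gluing step can be sketched under a strong simplifying assumption: each ancestral sequence has exactly one character per column of the alignment it summarizes. Given the pairwise alignment of the two ancestors as gapped strings, every ancestral gap becomes a column of gaps in the corresponding block. The function name `glue` is hypothetical, and real ancestral inference is more involved than this.

```python
def glue(aln_a, aln_b, anc_a, anc_b, gap="-"):
    """Merge two alignments using the alignment of their ancestors.

    aln_a, aln_b: lists of equal-length gapped rows.
    anc_a, anc_b: the two ancestral sequences after pairwise alignment,
    as gapped strings of equal length. Assumes one ancestral character
    per column of the alignment it represents.
    """
    assert len(anc_a) == len(anc_b)
    cols_a = zip(*aln_a)  # iterate alignment A column by column
    cols_b = zip(*aln_b)
    merged_cols = []
    for x, y in zip(anc_a, anc_b):
        # A gap in an ancestor opens a gap column in its whole block.
        col_a = next(cols_a) if x != gap else (gap,) * len(aln_a)
        col_b = next(cols_b) if y != gap else (gap,) * len(aln_b)
        merged_cols.append(col_a + col_b)
    return ["".join(row) for row in zip(*merged_cols)]


print(glue(["ACG", "A-G"], ["AG"], "ACG", "A-G"))
```

Here the gap in the second ancestor inserts a gap into the single row of the second alignment, so all three output rows end up with length 3.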

  21. AVID with preprocessed data • Gene predictions using GENSCAN • Protein alignments using BLAT • Finding exon matches without using a suffix tree • In addition, the exon matches can be used to shape the final multiple alignment

  22. MAVID (Constraints, Tree Building, and Experimental Results) • Speaker: 羅正偉 • 2005/12/07

  23. Constraints (1/3) • Notation: a_i ≤ b_j • This means that position i in sequence a must appear before position j in sequence b in the multiple sequence alignment.

  24. Constraints (2/3) • (Figure: sequences a, c, and b, with positions a_i, c_x, c_y, and b_j marked) • If x ≤ y, then a_i ≤ c_x ≤ c_y ≤ b_j, and so a_i ≤ b_j by transitivity.
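
The transitivity argument can be sketched in a few lines, assuming that position a_i is matched to c_x and position b_j is matched to c_y through a third sequence c. The dictionaries and the function name `implied_constraints` are hypothetical; the sketch only enumerates the pairs (i, j) for which x ≤ y, which is the condition on the slide.

```python
def implied_constraints(a_to_c, b_to_c):
    """Derive constraints a_i <= b_j through a third sequence c.

    a_to_c: dict mapping position i in a to its matched position x in c.
    b_to_c: dict mapping position j in b to its matched position y in c.
    Returns the set of (i, j) pairs with x <= y.
    """
    return {
        (i, j)
        for i, x in a_to_c.items()
        for j, y in b_to_c.items()
        if x <= y  # a_i <= c_x <= c_y <= b_j
    }


print(implied_constraints({10: 100}, {7: 150, 3: 50}))
```

Only the pair (10, 7) is produced, because position 3 of b maps to c position 50, which lies before c position 100.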

  25. Constraints(3/3) • The above information can be used in the alignment of the ancestral sequences by requiring potential anchors between the sequences to satisfy the constraints.

  26. Prime Constraints (1/4) • Consider every triplet of sequences (a, b, c) with a in u, b in v, and c not in x (in the figure, u and v are the subtrees below node x). • Every triplet can provide potential constraints for the alignment. • If there are n sequences, there are O(n^3) such triplets: too many constraints!

  27. Prime Constraints(2/4) • Actually, we don’t need to find all possible constraints, many of which will be redundant. • Instead, we wish to find a set of prime constraints • In this set, no constraint is implied by the others. • Such a set can be inferred from the homology map.

  28. Illustration

  29. Prime Constraints (3/4) • If there are m sets of orthologous exons, then at node x there can be at most O(m) prime constraints. • The set of all prime constraints can be found in O(mk^2) time, where k is the number of leaves below x.

  30. Prime Constraints (4/4) • Matches between the ancestral sequences that are inconsistent with this set of constraints can be filtered out in O(N log m) time, where N is the total number of matches. • For typical values of m and k, the time spent computing and applying the constraints is negligible.
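
To show what "inconsistent with the constraints" could mean in practice, here is a naive filter. It assumes matches and constraints are expressed in the same two coordinate systems, and it treats a_i ≤ b_j as "a_i is not placed after b_j". A match that aligns a_p with b_q then breaks the constraint exactly when p ≤ i and q > j. The check runs in O(N·m) time rather than the O(N log m) quoted on the slide, and the names `is_consistent` and `filter_matches` are hypothetical.

```python
def is_consistent(match, constraints):
    """match: (p, q), meaning position p in a is aligned to position q in b.
    constraints: iterable of (i, j), meaning a_i must not come after b_j.
    """
    p, q = match
    # If p <= i and q > j, then b_j is forced strictly before a_i.
    return not any(p <= i and q > j for i, j in constraints)


def filter_matches(matches, constraints):
    """Keep only the matches that violate no constraint."""
    return [m for m in matches if is_consistent(m, constraints)]


print(filter_matches([(5, 25), (5, 15), (12, 30), (10, 20)], [(10, 20)]))
```

Of the four toy matches, only (5, 25) is dropped: it would place position 20 of b before position 10 of a.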

  31. Tree Building (1/3) • Most multiple alignment programs require pairwise alignments of all the sequences to build an initial guide tree (a quadratic number of sequence alignments). • We utilize an iterative method to obtain a guide tree using only a linear number of alignments.

  32. Tree Building(2/3) • The initial guide tree is selected randomly from the set of complete binary trees. • The sequences are aligned using this random tree, and then a phylogenetic tree is inferred from the resulting multiple alignment. • The above process is iterated until the alignment and tree are satisfactory.

  33. Tree Building(3/3) • Instead of computing all pairwise alignments, only O(nk) alignments are necessary to perform n iterations with k sequences. • We found that for typical alignment problems, only a small number of iterations were necessary.
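
A quick count illustrates the O(nk) claim. The sketch assumes that each iteration aligns along a binary guide tree with k leaves, which has k - 1 internal nodes and therefore needs k - 1 pairwise (ancestral) alignments. The function names are hypothetical, and the choice of three iterations in the example is illustrative only.

```python
def all_pairs_alignments(k):
    """Pairwise alignments needed to fill a full distance matrix."""
    return k * (k - 1) // 2


def iterative_alignments(k, n):
    """Alignments for n iterations over a binary guide tree with k leaves."""
    return n * (k - 1)


k = 21  # number of sequences, as in the 21-organism experiment
print(all_pairs_alignments(k))     # 210
print(iterative_alignments(k, 3))  # 60
```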

  34. Experimental Results 1 • A human, mouse, and rat whole-genome multiple alignment. • A homology map for the genomes was built by C. Dewey, and was used to generate gene anchors and constraints. • Chromosome 20 was chosen because it aligns almost completely with mouse chromosome 2.

  35. Experimental Results 1 (cont.) Coverage of human chromosome 20 RefSeq exons by the MAVID alignments. Of a total of 3927 exons, only six were not in the homology map. A total of 53.5% of the exons were covered by precomputed exon anchors in either mouse or rat. The remaining exons are mostly aligned by MAVID, resulting in 93.6% of the exons covered by alignment in either mouse or rat.

  36. Experimental Results 2 • Alignment of 21 Organisms • We aligned 1.8 Mb of human sequence together with the homologous regions from 20 other organisms, for a total of 23 Mb of sequence. • Baboon, cat, chicken, chimp, cow, dog, dunnart, fugu, hedgehog, horse, lemur, macaque, mouse, opossum, pig, platypus, rabbit, rat, tetraodon, and zebrafish.

  37. Experimental Results 2 (cont.) • The MAVID alignments were compared with MLAGAN, version 1.1 (Brudno et al. 2003). • MLAGAN is the only other program we know of that is able to align the 21 sequences in a reasonable period of time.

  38. Experimental Results 2 (cont.) • MAVID and MLAGAN both aligned the sequences correctly. • MAVID took 40 min, while MLAGAN took roughly 6 h.
