
Multiple Sequence Alignment

In this lecture, Dr. Emad Nabil explains the concepts and importance of multiple sequence alignment (MSA) in computational biology and bioinformatics. He discusses scoring functions, algorithms for MSA, and the tasks involved in creating an alignment.


Presentation Transcript


  1. Computational Biology and Bioinformatics, Lecture #6: Multiple Sequence Alignment. By Dr. Emad Nabil, Fall 2018, FCI-CU.

  2. At the end of this lecture you will be able to develop a program that takes the input below and produces the output below.

     Input:

     >Rosalind_18
     GACATGTTTGTTTGCCTTAAACTCGTGGCGGCCTAGCCGTAAGTTAAG
     >Rosalind_23
     ACTCATGTTTGTTTGCCTTAAACTCTTGGCGGCTTAGCCGTAACTTAAG
     >Rosalind_51
     TCCTATGTTTGTTTGCCTCAAACTCTTGGCGGCCTAGCCGTAAGGTAAG
     >Rosalind_7
     CACGTCTGTTCGCCTAAAACTTTGATTGCCGGCCTACGCTAGTTAGTTA
     >Rosalind_28
     GGGGTCATGGCTGTTTGCCTTAAACCCTTGGCGGCCTAGCCGTAATGTTT

     Output: phylogenetic tree (http://www.ebi.ac.uk/goldman-srv/webprank/)

     More MSA tools: http://www.ebi.ac.uk/Tools/msa/
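The input on this slide is in FASTA format. As a first step toward such a program, here is a minimal sketch of a FASTA reader. The function name `parse_fasta` is an illustrative choice, not something from the lecture, and the sketch assumes well-formed input in which every sequence is preceded by a `>` header line.

```python
def parse_fasta(text):
    """Parse FASTA text into a dict mapping record name -> sequence string."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            name = line[1:]          # header line: start a new record
            records[name] = []
        elif name is not None:
            records[name].append(line)  # sequence line (may span several lines)
    return {n: "".join(parts) for n, parts in records.items()}


# Two of the records from the slide, used as sample input.
example = """>Rosalind_18
GACATGTTTGTTTGCCTTAAACTCGTGGCGGCCTAGCCGTAAGTTAAG
>Rosalind_23
ACTCATGTTTGTTTGCCTTAAACTCTTGGCGGCTTAGCCGTAACTTAAG
"""
sequences = parse_fasta(example)
for record_name, seq in sequences.items():
    print(record_name, len(seq))
```

The returned dictionary keeps the records in input order, so the sequences can be passed on to an alignment routine in the same order as they appear in the file.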

  3. Agenda
     • What is MSA?
     • What is its importance?
     • Scoring functions:
       • Entropy based
       • Sum of pairs
     • How to align many sequences? Algorithms:
       • Progressive alignment
         • Star
           • Dependent upon a center sequence
           • Keep merging each pairwise alignment into the current alignment
         • Tree
           • Create an approximate guide tree
           • Use the tree to align the sequences
       • Iterative alignment
         • Don't commit to a fixed ordering; revisit the alignment until the score does not change

  4. What is multiple sequence alignment?
     • A multiple sequence alignment (MSA) is a sequence alignment of three or more biological sequences, generally protein, DNA, or RNA.

  5. Why is MSA important?
     • Build phylogenetic trees and determine evolutionary relationships between sequences.
     • Represent a family of proteins with similar function, so that a new sequence can be compared to a "family" of known proteins.
     • Discover common signatures or protein domains among a group of proteins.
     • Identify genetic variation among individuals of a population.

  6. Why is MSA important?
     • A low (and statistically insignificant) similarity between two sequences becomes significant if it is present in many other sequences.
     • Multiple alignments can reveal subtle similarities that pairwise alignments do not reveal.
     • Question (refers to a figure not reproduced here): which is the most similar set of DNA sequences in the group shown?

  7. A success story of MSA: identification of the non-ribosomal code. Alignment of three A-domains:

     YAFDLGYTCMFPVLLGGGELHIVQKETYTAPDEIAHYIKEHGITYIKLTPSLFHTIVNTASFAFDANFESLRLIVLGGEKIIPIDVIAFRKMYGHTE-FINHYGPTEATIGA
     -AFDVSAGDFARALLTGGQLIVCPNEVKMDPASLYAIIKKYDITIFEATPALVIPLMEYI-YEQKLDISQLQILIVGSDSCSMEDFKTLVSRFGSTIRIVNSYGVTEACIDS
     IAFDASSWEIYAPLLNGGTVVCIDYYTTIDIKALEAVFKQHHIRGAMLPPALLKQCLVSA----PTMISSLEILFAAGDRLSSQDAILARRAVGSGV-Y-NAYGPTENTVLS

  8. Agenda (repeated; same outline as slide 3)

  9. The tasks in multiple sequence alignment:
     • Algorithms for creating an alignment
     • Scoring an alignment

  10. How to align more than two sequences?

  11. Generalizing Pairwise to Multiple Alignment
      • An alignment of 2 sequences is a 2-row matrix.
      • An alignment of 3 sequences is a 3-row matrix:

        A T - G C G -
        A - C G T - A
        A T C A C - A

      • Our scoring function should give higher scores to alignments with conserved columns.

  12. Analogy
      • Think of the k=2 case: every alignment is a path through a 2D matrix.
      • The three possible directions (down-right, down, right) correspond to the three possible column patterns (XX, X_, _X).
      • As a path grows, it aligns growing prefixes of both sequences.
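To make the path analogy concrete, the sketch below turns a sequence of moves through the 2D matrix into a two-row alignment. The move letters (`D`, `V`, `H`) and the function name are assumptions made for this illustration, as is the convention that a vertical step consumes a symbol of the first sequence.

```python
def path_to_alignment(v, w, moves):
    """Convert a path through the 2D grid into a two-row alignment.

    moves is a string over:
      'D' diagonal step, consumes a symbol of both sequences (column XX)
      'V' vertical step, consumes a symbol of v only         (column X_)
      'H' horizontal step, consumes a symbol of w only       (column _X)
    """
    i = j = 0
    top, bottom = [], []
    for move in moves:
        if move == "D":
            top.append(v[i]); bottom.append(w[j])
            i += 1; j += 1
        elif move == "V":
            top.append(v[i]); bottom.append("-")
            i += 1
        elif move == "H":
            top.append("-"); bottom.append(w[j])
            j += 1
        else:
            raise ValueError(f"unknown move: {move!r}")
    if i != len(v) or j != len(w):
        raise ValueError("path does not end at the bottom-right corner")
    return "".join(top), "".join(bottom)


top_row, bottom_row = path_to_alignment("ATGC", "AGC", "DVDD")
print(top_row)     # ATGC
print(bottom_row)  # A-GC
```

After each move, the path has aligned the prefixes `v[:i]` and `w[:j]`, which is the "growing prefixes" idea from the slide.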

  13. Multiple Alignment: Dynamic Programming
      • Assume k=3. Think of a 3-dimensional cube in which each of the three sequences indexes one dimension.
      • Now, paths align growing prefixes of three sequences.
      • Every column has seven possible patterns (XXX, XX_, X_X, _XX, X__, _X_, __X).
      • In 2D dynamic programming, (x, y) is an entry in the 2-D scoring matrix; in 3D, (x, y, z) is an entry in the 3-D scoring matrix.
      [Figures: alignment path in the 2D matrix; alignment path in the 3D matrix]
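The seven column patterns for k=3 (and the three patterns for k=2) can be enumerated mechanically: each sequence either contributes a symbol or a gap, and the all-gap column is excluded. The following is a small sketch of that enumeration; the function names are hypothetical.

```python
from itertools import product


def column_patterns(k):
    """All ways a column can take a symbol (1) or a gap (0) from each of
    k sequences, excluding the column that consists only of gaps."""
    return [p for p in product((0, 1), repeat=k) if any(p)]


def show(pattern):
    """Render a pattern in the slide's notation, e.g. (1, 0, 1) -> 'X_X'."""
    return "".join("X" if bit else "_" for bit in pattern)


for k in (2, 3):
    patterns = column_patterns(k)
    print(k, len(patterns), [show(p) for p in patterns])
```

Each pattern is also a move in the k-dimensional table: the positions holding a 1 are the indices that advance by one.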

  14. Multiple Alignment: Dynamic Programming. (x, y) is an entry in the 2-D scoring matrix; (x, y, z) is an entry in the 3-D scoring matrix.

  15. Multiple Alignment: Running Time
      For 3 sequences of length n:
      • There are three indices, so the table is a cube with $n^3$ cells.
      • For each cell (the bottom-right-front corner of a unit cube), we need to look at the 7 other corners.
      • Together: $O(7 n^3)$ computations, where $7 = 2^3 - 1$.
      For k sequences of length n:
      • There are $n^k$ cells in the k-dimensional table.
      • For each cell, we need to look at $2^k - 1$ other corners.
      • Together: $O(2^k n^k)$ computations.
      The running time grows exponentially with k. The problem is NP-complete.
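To get a feel for how quickly this grows, the sketch below evaluates the count $n^k (2^k - 1)$ for a few values of k. The function name `dp_work` and the choice n = 100 are illustrative only, and the count ignores constant factors and boundary cells.

```python
def dp_work(n, k):
    """Approximate number of predecessor look-ups for exact DP alignment
    of k sequences of length n: n**k cells times (2**k - 1) moves per cell."""
    cells = n ** k
    moves_per_cell = 2 ** k - 1
    return cells * moves_per_cell


for k in (2, 3, 4, 5, 10):
    print(f"k={k:2d}  look-ups ~ {dp_work(100, k):.3e}")
```

For n = 100 the count is 30,000 for two sequences and 7,000,000 for three, and it gains roughly a factor of 200 with every additional sequence, which is why exact dynamic programming is only practical for very small k.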

  16. Find a Highest-Scoring Multiple Sequence Alignment (http://rosalind.info/problems/ba5m/): the score of an alignment column is 1 if all three symbols are identical and 0 otherwise. Note: the backtracking matrix is 3D, and each cell stores one of seven values (0 to 6, or 1 to 7), one for each possible move.
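A minimal sketch of the 3D dynamic program for this scoring rule follows. It is one possible implementation, not the lecture's reference solution: the function name, the order of the seven moves, and the tie-breaking rule (first best move wins) are all choices made here, and gaps are not penalized because the stated column score is only 0 or 1.

```python
from itertools import product

# The seven moves: each tuple says which of the three sequences advance.
MOVES = [m for m in product((0, 1), repeat=3) if any(m)]


def align_three(u, v, w):
    """Return (score, row_u, row_v, row_w) where a column scores 1 if all
    three symbols are identical and 0 otherwise."""
    n1, n2, n3 = len(u), len(v), len(w)
    # score[i][j][k]: best score for prefixes u[:i], v[:j], w[:k]
    score = [[[0] * (n3 + 1) for _ in range(n2 + 1)] for _ in range(n1 + 1)]
    # back[i][j][k]: index (0 to 6) into MOVES of the move used to reach the cell
    back = [[[-1] * (n3 + 1) for _ in range(n2 + 1)] for _ in range(n1 + 1)]

    for i, j, k in product(range(n1 + 1), range(n2 + 1), range(n3 + 1)):
        if i == j == k == 0:
            continue
        best, best_move = None, -1
        for idx, (di, dj, dk) in enumerate(MOVES):
            pi, pj, pk = i - di, j - dj, k - dk
            if pi < 0 or pj < 0 or pk < 0:
                continue  # this move would leave the table
            gain = 1 if (di and dj and dk and u[pi] == v[pj] == w[pk]) else 0
            candidate = score[pi][pj][pk] + gain
            if best is None or candidate > best:
                best, best_move = candidate, idx
        score[i][j][k] = best
        back[i][j][k] = best_move

    # Backtrack from the far corner to the origin, emitting one column per move.
    rows = ([], [], [])
    i, j, k = n1, n2, n3
    while (i, j, k) != (0, 0, 0):
        di, dj, dk = MOVES[back[i][j][k]]
        i, j, k = i - di, j - dj, k - dk
        rows[0].append(u[i] if di else "-")
        rows[1].append(v[j] if dj else "-")
        rows[2].append(w[k] if dk else "-")
    aligned = tuple("".join(reversed(r)) for r in rows)
    return (score[n1][n2][n3],) + aligned


best_score, row_u, row_v, row_w = align_three("ATATCCG", "TCCGA", "ATGTACTG")
print(best_score)
print(row_u, row_v, row_w, sep="\n")
```

With this scoring rule the best score equals the length of the longest subsequence common to all three strings, which is 3 for the strings used above. Several alignments can reach that score, so the printed rows depend on the tie-breaking order.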

  17. Scoring a Multiple Sequence Alignment (MSA)
      • Entropy
      • Sum of pairs

  18. Some notation
      • Let m denote a multiple sequence alignment.
      • $m_i$ is the ith column of the alignment m.
      • $m_i^j$ is the symbol in the ith column and jth row.
      • $c_{ia}$ is the count of residue a in column i.

      Example alignment (each line is a row, each position is a column):

        G A R F I E L D T H E F A T C A T
        G A R F I E L D T H E - - - C A T
        G A R F I E L D T H A T - - C A T
        G A R R Y - L I K E D A - - C A T

  19. Scoring a Multiple Sequence Alignment (MSA)
      • Key issue: how do we score a multiple sequence alignment?
      • Usually, we assume that the columns of an alignment are independent, so the alignment score is the sum of its column scores.
      • For now, we simplify the score by assuming a linear gap penalty.
      • A linear gap penalty s can be incorporated into the substitution matrix S:
        • $S(a, -) = S(-, a) = -s$
        • $S(-, -) = 0$

  20. Sum of pairs

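  21. Scoring of a column: Sum of Pairs
      • Compute the sum of the pairwise scores: iterate over all pairs of rows in the column and add the substitution score of each pair, taken from a substitution matrix such as BLOSUM or PAM:
        $SP(m_i) = \sum_{j<k} S(m_i^j, m_i^k)$
      • Example, for a column containing A, C, G, T:
        score of the column $= S(A,C) + S(A,G) + S(A,T) + S(C,G) + S(C,T) + S(G,T)$
      • Number of pairs: $\binom{4}{2} = \frac{4!}{2!\,2!} = 6$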
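The sum-of-pairs rule can be sketched in a few lines. The scoring function `toy_score` below is a hypothetical stand-in for a real substitution matrix such as BLOSUM or PAM: its match, mismatch, and gap values are made up for illustration, and it treats gaps the way slide 19 describes (a fixed penalty against a residue, 0 for gap against gap).

```python
from itertools import combinations


def toy_score(a, b, match=1, mismatch=-1, gap=-2):
    """Hypothetical pairwise score standing in for BLOSUM/PAM."""
    if a == "-" and b == "-":
        return 0          # S(-,-) = 0
    if a == "-" or b == "-":
        return gap        # S(a,-) = S(-,a) = -s
    return match if a == b else mismatch


def sum_of_pairs_column(column, score=toy_score):
    """Sum the pairwise score over all unordered pairs of rows in one column."""
    return sum(score(a, b) for a, b in combinations(column, 2))


def sum_of_pairs(alignment, score=toy_score):
    """Score a whole alignment (list of equal-length rows) column by column."""
    return sum(sum_of_pairs_column(col, score) for col in zip(*alignment))


print(sum_of_pairs_column("ACGT"))   # six mismatching pairs
print(sum_of_pairs_column("AAAA"))   # six matching pairs
print(sum_of_pairs(["AT-GCG-", "A-CGT-A", "ATCAC-A"]))
```

A column of four rows always contributes six pair scores, matching the count on the slide; a fully conserved column therefore scores highest under any matrix that rewards matches.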

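  22. Entropy is a measure of the uncertainty of a probability distribution $(p_1, \dots, p_N)$:
      $H(p_1, \dots, p_N) = -\sum_{i=1}^{N} p_i \log p_i$
      The entropy of a multiple alignment is the sum of the entropies of its columns:
      $H(m) = \sum_i H(m_i)$, where the distribution for column i is given by the frequencies of the symbols in that column.
      A gap is treated as one more symbol, like a base.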
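The entropy score can be sketched as follows. The transcript does not say which logarithm base the lecture uses, so the natural logarithm here is an assumption, and the function names are illustrative. Gaps are counted as an ordinary symbol, as stated on the slide.

```python
import math
from collections import Counter


def column_entropy(column):
    """Entropy of one alignment column, using symbol frequencies as
    probabilities. Natural log is an assumption; gaps count as a symbol."""
    counts = Counter(column)
    total = len(column)
    return sum(-(c / total) * math.log(c / total) for c in counts.values())


def alignment_entropy(alignment):
    """Entropy of an alignment: the sum of the entropies of its columns."""
    return sum(column_entropy(col) for col in zip(*alignment))


print(column_entropy("AAAAA"))            # fully conserved column: 0.0
print(round(column_entropy("AAACG"), 4))  # frequencies 3/5, 1/5, 1/5
print(alignment_entropy(["AT-GCG-", "A-CGT-A", "ATCAC-A"]))
```

Lower entropy means a more conserved alignment, so an aligner using this score tries to minimize it. Under the natural-log assumption a column with frequencies 3/5, 1/5, 1/5 gives about 0.9503, which happens to equal the value quoted on slide 23, though the transcript does not show which column that value refers to.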

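  23. Entropy example (the formula did not survive in this transcript): Entropy = 0.9503

  24. Entropy example, continued (the formula did not survive in this transcript): result = 0.45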
