Multiple Sequence Alignment

Definition • Homology: related by descent • Homologous sequence positions
ATTGCGC
AT-CCGC

Reasons for aligning sets of sequences • Organise data to reflect sequence homology • Infer phylogenetic trees from homologous sites • Highlight conserved sites/regions • Highlight variable sites/regions • Uncover changes in gene structure • Summarise information

Alignments help to • Organise • Visualise • Analyse sequence data

The process of aligning sequences is a game of playing off gaps against mismatches

Ways of aligning multiple sequences • By hand • Automated • Combination

Definition • Optimality criteria: some kind of rule or scoring scheme that helps you decide which alignment you consider to be the best
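
One simple optimality criterion is a column-by-column score. The sketch below uses illustrative values (match +1, mismatch -1, gap -2); these numbers and the function name `alignment_score` are assumptions made for the example, not the scheme of any particular program. It scores two candidate alignments of ATTGCGC with ATCCGC.

```python
def alignment_score(row_a, row_b, match=1, mismatch=-1, gap=-2):
    """Score two equal-length gapped rows column by column."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    score = 0
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            score += gap          # a gap in either row
        elif a == b:
            score += match        # identical symbols
        else:
            score += mismatch     # substitution
    return score

# Two candidate alignments of the same pair of sequences.
candidates = [("ATTGCGC", "AT-CCGC"), ("ATTGCGC", "ATC-CGC")]
scores = [alignment_score(a, b) for a, b in candidates]
print(scores)
```

Under these assumed values both candidates contain five matches, one mismatch and one gap, so they receive the same score. A scoring scheme on its own does not always single out one best alignment.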

Pairwise vs Multiple Sequences • Pairs of sequences are typically aligned using exhaustive algorithms (dynamic programming) • Complexity of exhaustive methods is O(2^n m^n), where n = number of sequences and m = sequence length • Multiple sequence alignment therefore uses heuristic methods
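
To make the contrast concrete, here is a minimal sketch of exhaustive pairwise alignment by dynamic programming (a Needleman-Wunsch style global score), followed by the 2^n m^n growth for more sequences. The scoring values and the function names are assumptions chosen for illustration.

```python
def needleman_wunsch_score(s, t, match=1, mismatch=-1, gap=-2):
    """Optimal global alignment score of s and t by dynamic programming."""
    # prev[j] is the best score for the prefix of s handled so far against t[:j].
    prev = [j * gap for j in range(len(t) + 1)]
    for i in range(1, len(s) + 1):
        curr = [i * gap]
        for j in range(1, len(t) + 1):
            diag = prev[j - 1] + (match if s[i - 1] == t[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]

def exhaustive_cost(n, m):
    """Rough operation count 2^n * m^n for n sequences of length m."""
    return (2 ** n) * (m ** n)

print(needleman_wunsch_score("ATTGCGC", "ATCCGC"))
for n in (2, 3, 5):
    print(n, exhaustive_cost(n, 100))
```

For two sequences of length 100 the count is 40,000; for five it is already 3.2 x 10^11, which is why heuristics take over for multiple alignment.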

The Correct Alignment
ATTGCGC      ATTGCGC
AT-CCGC  or  ATC-CGC

Sequence alignment is easy with sufficiently closely related sequences • Below a certain level of identity, sequence alignment may become meaningless • Twilight zone for amino acid sequences: ~30% identity • In the twilight zone it is good to make use of additional information if possible (e.g. structure)

Consensus Sequences • Simplest form: a single sequence which represents the most common amino acid/base in each position
Y   D   D   G   A   V   -   E   A   L
Y   D   G   G   -   -   -   E   A   L
F   E   G   G   I   L   V   E   A   L
F   D   -   G   I   L   V   Q   A   V
Y   E   G   G   A   V   V   Q   A   L
Y   D   G   G   A/I V/L V   E   A   L   (consensus)
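
The simplest form of consensus can be sketched in a few lines. The function below is an illustration only: it assumes that gaps are ignored when counting and that ties are reported as alternatives joined with "/", which matches the example rows but is not the only possible convention.

```python
from collections import Counter

def consensus(rows):
    """Most common non-gap symbol per column; ties are joined with '/'."""
    result = []
    for column in zip(*rows):
        counts = Counter(symbol for symbol in column if symbol != "-")
        if not counts:
            result.append("-")    # column made up entirely of gaps
            continue
        best = max(counts.values())
        # Counter keeps first-seen order, so ties appear in row order.
        result.append("/".join(sym for sym, n in counts.items() if n == best))
    return result

alignment = ["YDDGAV-EAL", "YDGG---EAL", "FEGGILVEAL", "FD-GILVQAV", "YEGGAVVQAL"]
print(" ".join(consensus(alignment)))
```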

Multiple Alignment Formats • e.g. Clustal, Phylip, MSF, MEGA etc.

Clustal Format
CLUSTAL X (1.81) multiple sequence alignment
CAS1_BOVIN      MKLLILTCLVAVALARPKHPIKHQGLPQ--------EVLNEN-
CAS1_SHEEP      MKLLILTCLVAVALARPKHPIKHQGLSP--------EVLNEN-
CAS1_PIG        MKLLIFICLAAVALARPKPPLRHQEHLQNEPDSRE--------
CAS1_HUMAN      MRLLILTCLVAVALARPKLPLRYPERLQNPSESSE--------
CAS1_RABBIT     MKLLILTCLVATALARHKFHLGHLKLTQEQPESSEQEILKERK
CAS1_MOUSE      MKLLILTCLVAAAFAMPRLHSRNAVSSQTQ------QQHSSSE
CAS1_RAT        MKLLILTCLVAAALALPRAHRRNAVSSQTQ-------------
                *:***: **.*.*:* : . :

Phylip Format (Interleaved)
7 100
SOMA_BOVIN MMAAGPRTSL LLAFALLCLP WTQVVGAFPA MSLSGLFANA VLRAQHLHQL
SOMA_SHEEP MMAAGPRTSL LLAFTLLCLP WTQVVGAFPA MSLSGLFANA VLRAQHLHQL
SOMA_RAT_P -MAADSQTPW LLTFSLLCLL WPQEAGAFPA MPLSSLFANA VLRAQHLHQL
SOMA_MOUSE -MATDSRTSW LLTVSLLCLL WPQEASAFPA MPLSSLFSNA VLRAQHLHQL
SOMA_RABIT -MAAGSWTAG LLAFALLCLP WPQEASAFPA MPLSSLFANA VLRAQHLHQL
SOMA_PIG_P -MAAGPRTSA LLAFALLCLP WTREVGAFPA MPLSSLFANA VLRAQHLHQL
SOMA_HUMAN -MATGSRTSL LLAFGLLCLP WLQEGSAFPT IPLSRLFDNA MLRAHRLHQL
           AADTFKEFER TYIPEGQRYS -IQNTQVAFC FSETIPAPTG KNEAQQKSDL
           AADTFKEFER TYIPEGQRYS -IQNTQVAFC FSETIPAPTG KNEAQQKSDL
           AADTYKEFER AYIPEGQRYS -IQNAQAAFC FSETIPAPTG KEEAQQRTDM
           AADTYKEFER AYIPEGQRYS -IQNAQAAFC FSETIPAPTG KEEAQQRTDM
           AADTYKEFER AYIPEGQRYS -IQNAQAAFC FSETIPAPTG KDEAQQRSDM
           AADTYKEFER AYIPEGQRYS -IQNAQAAFC FSETIPAPTG KDEAQQRSDV
           AFDTYQEFEE AYIPKEQKYS FLQNPQTSLC FSESIPTPSN REETQQKSNL

Phylip Format (Sequential)
3 100
Rat       ATGGTGCACCTGACTGATGCTGAGAAGGCTGCTGTTAATGGCCG
          TGGTGGCTGGAGTGGCCAGTGCCCTGGCTCACAAGTACCACTAA
Mouse     ATGGTGCACCTGACTGATGCTGAGAAGGCTGCTGTCTCTTGCCT
          TGGGGAAAGGTGAACTCCGATGAAGTTGGTGGTGAGGCCCTGGG
Rabbit    ATGGTGCATCTGTCCAGT---GAGGAGAAGTCTGCGGTCACTGC
          TGGGGCAAGGTGAATGTGGAAGAAGTTGGTGGTGAGGCCCTGGG
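
As a rough illustration of how little structure the sequential Phylip layout carries, the sketch below writes an alignment in that form. It assumes the strict convention of a 10-character name field and puts each sequence on a single line; the function name `to_phylip_sequential` and the toy records are made up for the example.

```python
def to_phylip_sequential(records):
    """records: list of (name, aligned_sequence) pairs of equal length."""
    lengths = {len(seq) for _, seq in records}
    if len(lengths) != 1:
        raise ValueError("all aligned sequences must have the same length")
    # Header line: number of sequences, then number of alignment columns.
    lines = [f"{len(records)} {lengths.pop()}"]
    for name, seq in records:
        # Strict Phylip reserves the first 10 characters of the line for the name.
        lines.append(f"{name[:10]:<10}{seq}")
    return "\n".join(lines)

toy_records = [("Rat", "ATGGTGCACC"), ("Mouse", "ATGGTGCACC"), ("Rabbit", "ATGGTGCATC")]
phylip_text = to_phylip_sequential(toy_records)
print(phylip_text)
```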

Mega Format
#mega
TITLE: No title
#Rat      ATGGTGCACCTGACTGATGCTGAGAAGGCTGCTGT
#Mouse    ATGGTGCACCTGACTGATGCTGAGAAGGCTGCTGT
#Rabbit   ATGGTGCATCTGTCCAGT---GAGGAGAAGTCTGC
#Human    ATGGTGCACCTGACTCCT---GAGGAGAAGTCTGC
#Oppossum ATGGTGCACTTGACTTTT---GAGGAGAAGAACTG
#Chicken  ATGGTGCACTGGACTGCT---GAGGAGAAGCAGCT
#Frog     ---ATGGGTTTGACAGCACATGATCGT---CAGCT

Progressive Multiple Alignment • Heuristic • Perform pairwise alignments • Align sequences to alignments or alignments to existing alignments (profile alignments) • Do the alignments in some sensible order

Iterative methods • Several progressive alignment methods can be iterated • e.g. Barton-Sternberg, ClustalX

ClustalX Algorithm • Perform alignments and calculate distances for all pairs of sequences • Construct a guide tree (dendrogram) joining the most similar sequences using Neighbour Joining • Align sequences, starting at the leaves of the guide tree. This involves pairwise comparisons as well as comparisons of a single sequence with a group of sequences (profile)
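
The first two steps (pairwise distances, then a guide tree) can be sketched as follows. This is a simplified stand-in and not the ClustalX implementation: it swaps Neighbour Joining for plain average-linkage (UPGMA-style) clustering, and it computes distances from toy rows that are assumed to be already aligned. All names are hypothetical.

```python
from itertools import combinations

def p_distance(row_a, row_b):
    """Fraction of differing columns, skipping columns that contain a gap."""
    pairs = [(a, b) for a, b in zip(row_a, row_b) if a != "-" and b != "-"]
    if not pairs:
        return 1.0
    return sum(a != b for a, b in pairs) / len(pairs)

def guide_tree(seqs):
    """Build a nested-tuple guide tree by average-linkage clustering."""
    dist = {frozenset(pair): p_distance(seqs[pair[0]], seqs[pair[1]])
            for pair in combinations(seqs, 2)}
    # Each cluster maps its subtree (a name or a nested tuple) to its member names.
    clusters = {name: [name] for name in seqs}

    def average_distance(c1, c2):
        total = sum(dist[frozenset((a, b))]
                    for a in clusters[c1] for b in clusters[c2])
        return total / (len(clusters[c1]) * len(clusters[c2]))

    while len(clusters) > 1:
        # Join the two closest clusters first, as a progressive aligner would.
        c1, c2 = min(combinations(clusters, 2),
                     key=lambda pair: average_distance(*pair))
        clusters[(c1, c2)] = clusters.pop(c1) + clusters.pop(c2)
    return next(iter(clusters))

toy = {"A": "AAAAAAAA", "B": "AAAAAAAT", "C": "TTTTAAAA", "D": "TTTTAACT"}
tree = guide_tree(toy)
print(tree)
```

Reading the nested tuple from the innermost pairs outwards gives the order in which a progressive method would align sequences and then profiles.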

ClustalX is not optimal • There are known areas in which ClustalX performs badly e.g. • errors introduced early cannot be corrected by subsequent information • alignments of sequences of differing lengths cause strange guide trees and unpredictable effects • edges: ClustalX does not penalise gaps at edges • There are alternatives to ClustalX available

Using ClustalX • Start with sequences in FASTA format (or an existing alignment in Clustal format) • Choose [Do Alignment] on the Alignment menu

ClustalX Parameters • Scoring Matrix • Gap opening penalty • Gap extension penalty • Protein gap parameters • Additional algorithm parameters • Secondary structure penalties

Score Matrices • Pairwise matrices and multiple alignment matrix series • PAM (Dayhoff), BLOSUM (Henikoff), GONNET (default), user defined • Transition (A<->G, C<->T)/Transversion (purine<->pyrimidine, e.g. A<->C) ratio: low for distantly related sequences
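
For nucleotide data the distinction between the two substitution classes is easy to state in code. The sketch below is illustrative only (the function names are assumptions): a transition stays within purines (A, G) or within pyrimidines (C, T), and a transversion crosses between the two groups.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def substitution_type(a, b):
    """Return 'transition', 'transversion', or None for identical/non-ACGT pairs."""
    if a == b or not {a, b} <= PURINES | PYRIMIDINES:
        return None
    same_class = {a, b} <= PURINES or {a, b} <= PYRIMIDINES
    return "transition" if same_class else "transversion"

def count_ts_tv(row_a, row_b):
    """Count transitions and transversions between two aligned rows."""
    kinds = [substitution_type(a, b)
             for a, b in zip(row_a.upper(), row_b.upper())]
    return kinds.count("transition"), kinds.count("transversion")

transitions, transversions = count_ts_tv("AGCT", "GACA")
print(transitions, transversions)
```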

Gap Penalties • Linear gap penalties vs affine gap penalties, p = o + l·e (l = gap length) • Gap opening (o) • Gap extension (e) • Protein specific penalties (on by default) • Increase the probability of gaps associated with certain residues • Increase the chances of gaps in loop regions (> 5 hydrophilic residues)
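
The effect of the affine form p = o + l·e is easiest to see on a small example. In the sketch below the penalty values (o = 10, e = 1) are arbitrary illustrative numbers, not program defaults, and the function names are assumptions.

```python
import re

def affine_gap_penalty(length, gap_open, gap_extend):
    """Penalty p = o + l * e for a single gap of `length` positions."""
    return gap_open + length * gap_extend

def total_gap_penalty(row, gap_open, gap_extend):
    """Sum the affine penalty over every run of '-' in one aligned row."""
    return sum(affine_gap_penalty(len(run), gap_open, gap_extend)
               for run in re.findall(r"-+", row))

one_long_gap = total_gap_penalty("AT---GC", 10, 1)      # one gap of length 3
three_short_gaps = total_gap_penalty("A-T-G-C", 10, 1)  # three gaps of length 1
print(one_long_gap, three_short_gaps)
```

Both rows contain three gap characters, but the single long gap costs 13 while the three separate gaps cost 33, so an affine scheme favours fewer, longer gaps.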

Algorithm parameters • Slow-accurate pair-wise alignment • Do alignment from guide tree • Reset gaps before aligning (iteration) • Delay Divergent sequences (%)

Additional displays • Column Scores • Low quality regions • Exceptional residues

Multiple Alignment Strategies • Align pairs of sequences using an optimal method • Choose representative sequences to align carefully • Choose sequences of comparable lengths • Progressive alignment programs such as ClustalX for multiple alignment • Progressive alignment programs may be combined • Review alignment by eye and edit

Alignment of coding regions • Nucleotide sequences are much harder to align accurately than proteins • Protein coding sequences can be aligned using the protein sequences
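
A minimal sketch of the second point: once the protein sequences are aligned, each coding sequence can be threaded back onto its aligned protein, one codon per residue and three gap characters per protein gap. The function name is hypothetical, and the sketch assumes the coding sequence is in frame and really encodes the given protein (it does not translate or check this).

```python
def protein_guided_alignment(aligned_protein, cds):
    """Thread an unaligned coding sequence onto its aligned protein."""
    residues = aligned_protein.replace("-", "")
    if len(cds) < 3 * len(residues):
        raise ValueError("coding sequence is too short for the protein")
    # One codon per residue, consumed in order as residues are encountered.
    codons = (cds[i:i + 3] for i in range(0, 3 * len(residues), 3))
    return "".join("---" if aa == "-" else next(codons)
                   for aa in aligned_protein)

aligned_proteins = {"seq1": "MKL", "seq2": "M-L"}
coding = {"seq1": "ATGAAACTG", "seq2": "ATGCTG"}
codon_alignment = {name: protein_guided_alignment(aligned_proteins[name], coding[name])
                   for name in aligned_proteins}
for name, row in codon_alignment.items():
    print(name, row)
```

Gaps then always fall between codons, which keeps the reading frame intact in the nucleotide alignment.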

Multiple Alignments and Phylogenetic Trees • You can make a more accurate multiple sequence alignment if you know the tree already • A good multiple sequence alignment is an important starting point for drawing a tree • The process of constructing a multiple alignment (unlike pair-wise) needs to take account of phylogenetic relationships

Editing a multiple sequence alignment • It is NOT fraud to edit a multiple sequence alignment • Incorporate additional knowledge if possible • Alignment editors help to keep the data organised and help to prevent unwanted mistakes

Alignment Editors • e.g. GDE, BioEdit, Seaview, Jalview etc. • Alignment editors can function as an organisational tool (analysis tools in BioEdit) • Construct sub-sequences (GDE, Seaview) • Annotate sequences (Seaview)

Aligning weakly similar sequences

Sequences contain conserved regions • e.g. DIALIGN (Morgenstern, Dress, Werner) • Re-aligns regions between conserved blocks http://bibiserv.techfak.uni-bielefeld.de/ • Useful if sequences contain consistent conserved blocks • Block Maker: searches for conserved words that may be inconsistent http://blocks.fhcrc.org/

Profile Alignment (Gribskov et al. 1987) • Position specific scores • Allows alignment of alignments • Gaps introduced as whole columns in the separate alignments • Optimal alignment in time O(a^2 l^2), where a = alphabet size and l = sequence length • Information about the degree of conservation of sequence positions is included
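
The idea of position specific scores can be illustrated with a bare frequency profile. This is a deliberately simplified sketch and an assumption on my part: a real profile also weights residues with a substitution matrix and carries position specific gap penalties, and the scoring here is ungapped. The function names are hypothetical.

```python
from collections import Counter

def build_profile(rows):
    """Per-column symbol frequencies (gaps counted as their own symbol)."""
    profile = []
    for column in zip(*rows):
        counts = Counter(column)
        profile.append({sym: n / len(column) for sym, n in counts.items()})
    return profile

def profile_score(profile, sequence):
    """Sum, over columns, of the frequency of the sequence's symbol there."""
    if len(sequence) != len(profile):
        raise ValueError("sequence and profile must have the same length")
    return sum(col.get(sym, 0.0) for col, sym in zip(profile, sequence))

family = ["YDDGAV-EAL", "YDGG---EAL", "FEGGILVEAL", "FD-GILVQAV", "YEGGAVVQAL"]
profile = build_profile(family)
typical = profile_score(profile, "YDGGAVVEAL")
unrelated = profile_score(profile, "WWWWWWWWWW")
print(typical > unrelated)
```

A fully conserved column contributes a score of 1 for the conserved residue and 0 for anything else, while a variable column spreads its score across several residues; this is how the degree of conservation enters the comparison.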

Good reasons to use profile alignments • Adding a new sequence to an existing multiple alignment that you want to keep the same (align sequence to profile) • Searching a database for new members of your protein family (pfsearch) • Searching a database of profiles to find out which one your sequence belongs to (pfscan) • Combining two multiple sequence alignments (profile to profile)

Profile Alignment Using ClustalX • Profile Alignment Mode • Align sequence to profile • Align profile 1 to profile 2 • Secondary structure parameters

Profile searching using PSI-BLAST • Position Specific Iterative • Perform search – construct profile – perform search • Convergence (hopefully…) • Increased sensitivity for distantly related sequences • Available on-line (NCBI)

Databases of Aligned Sequences • Hovergen http://pbil.univ-lyon1.fr/databases/hovergen.html (vertebrate alignments) • Pfam http://www.sanger.ac.uk/Software/Pfam/ (protein domain alignments and profile HMMs) • BLOCKS http://blocks.fhcrc.org/ • Ribosomal Database Project http://rdp.cme.msu.edu/html/ alignments and trees derived from rRNA sequences • Interpro – combines information from other sources • Many more…

Probabilistic Models of Sequence Alignment • Hidden Markov Models • Sequence of states and associated symbol probabilities • Produce a probabilistic model of a sequence alignment • Align a sequence to a profile Hidden Markov Model • Algorithms exist to find the most probable path through the model (e.g. the Viterbi algorithm)

Markov Chain: A chain of things. The probability of the next thing depends only on the current thing. • Hidden Markov Model: A sequence of states which forms a Markov chain. The states are not observable. The observable characters have “emission” probabilities which depend on the current state.
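
These definitions can be made concrete with a toy two-state model and the Viterbi algorithm, which recovers the most probable sequence of hidden states for an observed sequence. The states, probabilities and names below are invented for illustration and do not describe a profile HMM; the sketch also assumes every probability is non-zero so that logarithms can be taken.

```python
import math

def viterbi(observed, states, start_p, trans_p, emit_p):
    """Most probable hidden-state path for `observed`, computed in log space."""
    scores = [{s: math.log(start_p[s]) + math.log(emit_p[s][observed[0]])
               for s in states}]
    back_pointers = []
    for symbol in observed[1:]:
        current, pointers = {}, {}
        for s in states:
            # Best previous state to have come from when entering state s.
            prev = max(states,
                       key=lambda p: scores[-1][p] + math.log(trans_p[p][s]))
            current[s] = (scores[-1][prev] + math.log(trans_p[prev][s])
                          + math.log(emit_p[s][symbol]))
            pointers[s] = prev
        scores.append(current)
        back_pointers.append(pointers)
    # Trace back from the best final state.
    path = [max(states, key=lambda s: scores[-1][s])]
    for pointers in reversed(back_pointers):
        path.append(pointers[path[-1]])
    return path[::-1]

states = ("AT-rich", "GC-rich")
start_p = {"AT-rich": 0.5, "GC-rich": 0.5}
trans_p = {"AT-rich": {"AT-rich": 0.9, "GC-rich": 0.1},
           "GC-rich": {"AT-rich": 0.1, "GC-rich": 0.9}}
emit_p = {"AT-rich": {"A": 0.4, "T": 0.4, "G": 0.1, "C": 0.1},
          "GC-rich": {"A": 0.1, "T": 0.1, "G": 0.4, "C": 0.4}}
state_path = viterbi("AATGCG", states, start_p, trans_p, emit_p)
print(state_path)
```

Only the letters are observed; the path of states is inferred, which is the sense in which the states are hidden.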
