Sequence alignment

Sequence alignment School B&I TCD Bioinformatics May 2010

What is an alignment?
• CENTRAL concept in bioinformatics
• Easy if the sequences are similar and the alignment is straightforward:

    THISTHESAME    or    THISTHESAME
    |  |||| |||          |   |||  ||
    TOSSTHEGAME          TRANTHELPME

• Hard and CPU-intensive if the sequences are very different, e.g. THISTHESAME vs THATGAMETHE:

    THISTHESAME---    or    THIS----THESAME
    ||      |||             ||      |||
    THAT---GAMETHE          THATGAMETHE        < better

Why align?
• Trying to establish homology by similarity
• Homology: having a common ancestor
• whale fin, bat wing, human hand (Cuvier)
• human beta globin, dog beta globin (orthologs)
• human beta globin, human alpha globin (paralogs)
• You can have % similarity, % identity
• Can’t have % homology: two sequences either share a common ancestor or they do not

Why homology?
• Homologous structures/molecules have similar function
• Related by evolution
• More similar seqs = more recent common ancestor = more likely similar function
• Human hand not for locomotion
• Bricolage: evolutionary tinkering

Define terms
• Indel
• Insertion or deletion
• May get a better alignment if you put a gap in one sequence
• Implies a mutation in one of the seqs
• Not clear whether it was an insertion in one or a deletion in the other

Optimal alignment
• Best guess at evolutionary relationship
• Which residues/bases are homologous
• Depends on model of evolution and parameters of alignment
• Is a gap more likely than a substitution?
• Is one substitution more likely than another?
• Transition (Y-Y or R-R) vs transversion (R-Y), where R = purine and Y = pyrimidine
• Similar shape amino acid or different?
• No “correct” answer.

Global alignment
• Needleman & Wunsch
• Tries to align two sequences from 5’ to 3’ or N terminus to C terminus
• Assumes (only works well if) seqs are similar over their entire length
• So less good if there are large indels (but can identify such features)
• Assesses overall (functional) similarity

    LARGGHYFGKISTGREFDN
    L      FGKIST  E
    LNAHILSFGKISTSLEDA

• Identify (and count) every difference/mutation

Local alignment
• BLAST: Basic Local Alignment Search Tool
• Smith & Waterman
• Ignores the whole and focuses on a region or domain that has good similarity
• Use to make high quality alignments

    ----------FGKI----------
              ||||
    ----------FGKI----------
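A minimal sketch of the Smith & Waterman idea, assuming a simple match/mismatch score and a linear gap penalty. The function name and the values 1, -1 and -2 are illustrative assumptions, not taken from the slides. Cell scores are never allowed to fall below zero, so the best-scoring region is found whatever flanks it. BLAST itself uses a faster heuristic rather than this full calculation.

```python
def smith_waterman_score(seq1, seq2, match=1, mismatch=-1, gap=-2):
    """Return the best local alignment score (score only, no traceback)."""
    best = 0
    prev = [0] * (len(seq2) + 1)          # previous row of the DP table
    for a in seq1:
        cur = [0]                         # first column is always 0
        for j, b in enumerate(seq2, start=1):
            diag = prev[j - 1] + (match if a == b else mismatch)
            # Flooring at 0 lets an alignment restart anywhere.
            cell = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            cur.append(cell)
            best = max(best, cell)
        prev = cur
    return best


# Hypothetical example: two unrelated contexts sharing one FGKI motif.
print(smith_waterman_score("XXFGKIXX", "YYYFGKIY"))
```

Keeping only two rows of the table is enough for the score; recovering the aligned region itself would need the full table and a traceback.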

Algorithm
• Both local and global alignment programs use “dynamic programming” (look it up on Wikipedia)
• … to make the optimal alignment
• The alignment that tells the evolutionary story
• True story unknown without time-travel
• The alignment that has the highest score
• Choose/change parameters to maximise score
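A minimal sketch of dynamic programming for global (Needleman & Wunsch style) alignment, computing the score only. The function name and the default parameters (match 1, mismatch 0, linear gap -1) are assumptions chosen for illustration; real programs use substitution matrices and affine gaps.

```python
def needleman_wunsch_score(seq1, seq2, match=1, mismatch=0, gap=-1):
    """Return the optimal global alignment score for two sequences."""
    rows, cols = len(seq1) + 1, len(seq2) + 1
    score = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against nothing costs one gap per residue.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if seq1[i - 1] == seq2[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,  # align the two residues
                score[i - 1][j] + gap,       # gap in seq2
                score[i][j - 1] + gap,       # gap in seq1
            )
    return score[-1][-1]


print(needleman_wunsch_score("GARFIELDTHECAT", "GARFIELDTHERAT"))
print(needleman_wunsch_score("GARFIELDTHECAT", "GARFIELDACAT"))
```

Each cell holds the best score for aligning the two prefixes that end there, which is why the bottom-right cell is the optimum for the full sequences under the chosen parameters.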

2 sequence alignment
Aligning GARFIELDTHECAT & GARFIELDTHERAT is easy:

    GARFIELDTHECAT
    ||||||||||| ||
    GARFIELDTHERAT

Scoring systems: DNA
• In an alignment, add 1 if the bases are identical, 0 if they are different
• Transition/transversion?
• AG purines, CUT pyrimidines

Identity matrix (1/0):

       A  T  C  G
    A  1  0  0  0
    T  0  1  0  0
    C  0  0  1  0
    G  0  0  0  1

Transition/transversion matrix (identity 2, transition 1, transversion 0):

       A  T  C  G
    A  2  0  0  1
    T  0  2  1  0
    C  0  1  2  0
    G  1  0  0  2

Scoring comparison: DNA

    CTAGCGATGC
    CGAACGACAC
    1010111001    1/0 score = 6/10
    2021222112    Ts/Tv score = 15/20

• Transitions are 5x more common than transversions
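The two scoring schemes can be checked with a short sketch. The function names are illustrative assumptions; the scores (1/0, and 2/1/0 for identity/transition/transversion) are those on the slide.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T", "U"}


def identity_score(a, b):
    """1 for identical bases, 0 otherwise."""
    return 1 if a == b else 0


def ts_tv_score(a, b):
    """2 for identity, 1 for a transition, 0 for a transversion."""
    if a == b:
        return 2
    same_class = {a, b} <= PURINES or {a, b} <= PYRIMIDINES
    return 1 if same_class else 0


def score_ungapped(seq1, seq2, pair_score):
    """Score each column of an ungapped alignment of equal-length sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have the same length")
    return [pair_score(a, b) for a, b in zip(seq1, seq2)]


top, bottom = "CTAGCGATGC", "CGAACGACAC"
identity = score_ungapped(top, bottom, identity_score)
ts_tv = score_ungapped(top, bottom, ts_tv_score)
print("".join(map(str, identity)), sum(identity))  # 1010111001 6
print("".join(map(str, ts_tv)), sum(ts_tv))        # 2021222112 15
```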

Insert gaps
Sometimes you can get a better overall alignment if you insert gaps:

    GARFIELDTHECAT
    ||||||||   |||
    GARFIELDA--CAT

is better (scores higher) than

    GARFIELDTHECAT
    ||||||||
    GARFIELDACAT

No gap penalty
But there must be some sort of gap penalty, or you can align ANY two sequences:

    G-R--E------AT
    | |  |      ||
    GARFIELDTHECAT

Gap penalty
• Could set a negative score for each indel
• Linear gap penalty
• But a mutation could be a point mutation or a deletion
• The latter is a single event, whatever its length
• Advisable to use an affine penalty (open + extend)
• Open -10, extend -0.05
• How to choose the penalty?
• Start with program defaults
• Use good judgment: trial and error
• Investigate the statistical distribution of indels
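A small sketch comparing linear and affine gap costs. The convention assumed here, which varies between programs, is that the open penalty is charged for the first gapped position and the extend penalty for each further one. The linear value of -10 per position is a hypothetical choice for comparison; the affine defaults are the slide's -10 and -0.05.

```python
def linear_gap_cost(length, per_position=-10.0):
    """Every gapped position costs the same."""
    return per_position * length


def affine_gap_cost(length, gap_open=-10.0, gap_extend=-0.05):
    """Opening a gap is expensive; lengthening it is cheap."""
    if length <= 0:
        return 0.0
    return gap_open + gap_extend * (length - 1)


for length in (1, 3, 10):
    print(length, linear_gap_cost(length), affine_gap_cost(length))
```

Under the affine scheme one deletion of three residues costs far less than three separate single-residue gaps, which matches the point that a long deletion is a single event.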

Scoring for similarities: proteins
• Gap penalty?
• Traded off against positive scores for matches in aligned residues
• Could, as with DNA, use match=1, mismatch=0
• Or …

Scoring system: proteins
Where to put the gap?
• When doing a similarity search against a database, you are trying to decide which of many sequences is the CLOSEST match to your search sequence. Which of the following alignment pairs is better?

    FGDERTHHS    FGDERTHHS    FGDERTHHS
    FGDD--HRS    FGD--DHRS    FGD-D-HRS
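One way to compare the three candidates is to score each gapped alignment directly. This is a sketch under stated assumptions: match=1 and mismatch=0 as suggested for DNA, affine gaps with open -10 and extend -0.05, and a hypothetical helper name `score_alignment`.

```python
def score_alignment(top, bottom, match=1.0, mismatch=0.0,
                    gap_open=-10.0, gap_extend=-0.05):
    """Score a pre-aligned pair of equal-length strings ('-' marks a gap)."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    score = 0.0
    prev_gap = None                      # which sequence was gapped last
    for a, b in zip(top, bottom):
        gap = "top" if a == "-" else "bottom" if b == "-" else None
        if gap:
            # Continuing the same gap is cheap; starting one is expensive.
            score += gap_extend if gap == prev_gap else gap_open
        else:
            score += match if a == b else mismatch
        prev_gap = gap
    return score


query = "FGDERTHHS"
candidates = ["FGDD--HRS", "FGD--DHRS", "FGD-D-HRS"]
scores = [score_alignment(query, c) for c in candidates]
for c, s in zip(candidates, scores):
    print(c, s)
```

With these parameters the first two candidates tie and the third scores lower because it opens two gaps. A substitution matrix that rewards the conservative E/D pair may separate the first two.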

3 Garfield relatives

    GARFIELDTHECAT
    ||||   |||||||
    GARFRIEDTHECAT

    GARFIELDTHECAT
    ||| |||  |||||
    GARWIELESHECAT

    GARFIELDTHECAT
    ||  ||||||| ||
    GAVGIELDTHEMAT

Substitution matrices
Top left part of a BLOSUM 90 matrix. The matrix is symmetrical; positive off-diagonal scores mark conservative substitutions.

         A   R   N   D   C   Q   E   G   H   I   L
    A    5  -2  -2  -3  -1  -1  -1   0  -2  -2  -2
    R   -2   6  -1  -3  -5   1  -1  -3   0  -4  -3
    N   -2  -1   7   1  -4   0  -1  -1   0  -4  -4
    D   -3  -3   1   7  -5  -1   1  -2  -2  -5  -5
    C   -1  -5  -4  -5   9  -4  -6  -4  -5  -2  -2
    Q   -1   1   0  -1  -4   7   2  -3   1  -4  -3
    E   -1  -1  -1   1  -6   2   6  -3  -1  -4  -4
    G    0  -3  -1  -2  -4  -3  -3   6  -3  -5  -5
    H   -2   0   0  -2  -5   1  -1  -3   8  -4  -4
    I   -2  -4  -4  -5  -2  -4  -4  -5  -4   5   1
    L   -2  -3  -4  -5  -2  -3  -4  -5  -4   1   5
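As a sketch, the excerpt can be loaded into a lookup table to check the symmetry and list the positively scored substitutions. The values are copied from the slide's excerpt only; the variable names are illustrative assumptions.

```python
RESIDUES = "ARNDCQEGHIL"
ROWS = """
5 -2 -2 -3 -1 -1 -1 0 -2 -2 -2
-2 6 -1 -3 -5 1 -1 -3 0 -4 -3
-2 -1 7 1 -4 0 -1 -1 0 -4 -4
-3 -3 1 7 -5 -1 1 -2 -2 -5 -5
-1 -5 -4 -5 9 -4 -6 -4 -5 -2 -2
-1 1 0 -1 -4 7 2 -3 1 -4 -3
-1 -1 -1 1 -6 2 6 -3 -1 -4 -4
0 -3 -1 -2 -4 -3 -3 6 -3 -5 -5
-2 0 0 -2 -5 1 -1 -3 8 -4 -4
-2 -4 -4 -5 -2 -4 -4 -5 -4 5 1
-2 -3 -4 -5 -2 -3 -4 -5 -4 1 5
"""

# Map (residue, residue) pairs to their scores.
blosum90_excerpt = {}
for a, line in zip(RESIDUES, ROWS.strip().splitlines()):
    for b, value in zip(RESIDUES, line.split()):
        blosum90_excerpt[a, b] = int(value)

is_symmetric = all(
    blosum90_excerpt[a, b] == blosum90_excerpt[b, a]
    for a in RESIDUES for b in RESIDUES
)

# Off-diagonal pairs with a positive score in this excerpt.
conservative = sorted(
    a + b
    for i, a in enumerate(RESIDUES)
    for b in RESIDUES[i + 1:]
    if blosum90_excerpt[a, b] > 0
)
print(is_symmetric, conservative)
```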

Willie Taylor’s AA Venn Diagram

Substitution matrices
• Plenty of choice
• Identical = 1.0; similar (K/R, F/Y) = 0.5; rest 0.0
• PAM series, BLOSUM series, others
• Based on observations and counting in real seqs
• BLOSUM 90 made from aligned seqs 90% identical
• Main diagonal elements positive
• Some more positive than others
• More highly conserved (C, F etc.)
• Off-diagonal elements mostly negative
• Some more negative than others (less likely)
• Some positive scores (K-R, D-E etc.)

Dotplot theory
Another way of comparing 2 sequences
Task: align ATGATATTCTT and ATTGTTC

      A T G A T A T T C T T
    A . . . . . . . . . . .
    T . . . . . . . . . . .
    T . . . . . . . . . . .
    G . . . . . . . . . . .
    T . . . . . . . . . . .
    T . . . . . . . . . . .
    C . . . . . . . . . . .

Go along the first seq, inserting a + wherever at least 2 of the 3 bases in a moving window match. The first seq is compared to ATT (the first 3 bases in the vertical sequence).

      A T G A T A T T C T T
    A . . . . . . . . . . .
    T . + . . + . + . . + .
    T . . . . . . . . . . .
    G . . . . . . . . . . .
    T . . . . . . . . . . .
    T . . . . . . . . . . .
    C . . . . . . . . . . .

Windowsize = 3, Threshold = 2

Then go along the first seq again, inserting a + wherever at least 2 of the 3 bases in a moving window match. This time the first seq is compared to TTG (the next 3 bases in the vertical sequence).

      A T G A T A T T C T T
    A . . . . . . . . . . .
    T . + . . + . + . . + .
    T . + . . . . . + . . .
    G . . . . . . . . . . .
    T . . . . . . . . . . .
    T . . . . . . . . . . .
    C . . . . . . . . . . .

Iterate until every window of the vertical sequence has been compared:

      A T G A T A T T C T T
    A . . . . . . . . . . .
    T . + . . + . + . . + .
    T . + . . . . . + . . .
    G . . + . . + . . + . .
    T . . . + . . + . . + .
    T . . . . . . . + . . .
    C . . . . . . . . . . .
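The windowed comparison can be sketched in a few lines. The function name `dotplot` is an assumption, and the mark is placed at the centre of each pair of windows, following the worked rows for ATT and TTG.

```python
def dotplot(horizontal, vertical, window=3, threshold=2):
    """Return a grid of '.' and '+' (one row per base of the vertical seq)."""
    grid = [["." for _ in horizontal] for _ in vertical]
    half = window // 2
    for i in range(len(vertical) - window + 1):
        for j in range(len(horizontal) - window + 1):
            matches = sum(
                vertical[i + k] == horizontal[j + k] for k in range(window)
            )
            if matches >= threshold:
                # Mark the centre position of the two windows.
                grid[i + half][j + half] = "+"
    return grid


horizontal, vertical = "ATGATATTCTT", "ATTGTTC"
grid = dotplot(horizontal, vertical)
print("  " + " ".join(horizontal))
for base, row in zip(vertical, grid):
    print(base, " ".join(row))
```

Raising `window` and `threshold` makes the plot more stringent and removes scattered marks; lowering them does the opposite.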

      A T G A T A T T C T T
    A
    T   +     +   +     +
    T   +           +
    G     +     +     +
    T       +     +     +
    T               +
    C

The human eye is particularly good at picking up structure from the pattern of dots. You might see a hint of a duplicated region in the horizontal sequence that is not so clear from the sequence itself.

Jurassic Dotplot Mark Boguski 1st smartass

Dinosaur DNA 2
• New seq published in Jurassic Park II
• Search database with “dinosaur” DNA
• Top hit: GAT1_CHICK (sw:P17678, Erythroid Transcription Factor)
• Scoring matrix: BLOSUM50; gap penalties: -12/-2
• 95.6% identity; global alignment score: 2144
• But alignment not perfect: gaps inserted

Dinosaur Boguski Alignment
Aligning the translated “dinosaur” DNA (upper) with the chicken protein (lower):

    TPQWVPPATQMEPPHYLELLQPPRGSPPHPSSGPLLPLSSGPPPCEARECVMARKNCGAT
    :::::::::::::::::::::::::::::::::::::::::::::::::::    :::::
    TPQWVPPATQMEPPHYLELLQPPRGSPPHPSSGPLLPLSSGPPPCEARECV----NCGAT

    ATPLWRRDGTGHYLCNWASACGLYHRLNGQNRPLIRPKKRLLVSKRAGTVCSHERENCQT
    ::::::::::::::::   :::::::::::::::::::::::::::::::::    ::::
    ATPLWRRDGTGHYLCN---ACGLYHRLNGQNRPLIRPKKRLLVSKRAGTVCS----NCQT

    STTTLWRRSPMGDPVCNNIHACGLYYKLHQVNRPLTMRKDGIQTRNRKVSSKGKKRRPPG
    :::::::::::::::::   ::::::::::::::::::::::::::::::::::::::::
    STTTLWRRSPMGDPVCN---ACGLYYKLHQVNRPLTMRKDGIQTRNRKVSSKGKKRRPPG

When global fails
Two blood clotting genes, Factor 12 (F12) and Plasminogen Activator (PLAT), have F, E, K and Catalytic domains typical of the pathway.

    F12:   F2  E  F1  E  K  Catalytic
    PLAT:  F1  E  K   K  Catalytic

• The alignment doesn’t recognise the second K domain in PLAT but forces an alignment to the other sequence
• By aligning PLAT’s F1 domain with F12’s F2 domain, you miss a better alignment (in grey) between the two F1 domains
• The alignment doesn’t recognise the second E domain in F12 but just puts a gap in the other sequence

Alignment protocol
• What should real biologists do?
• Dotplot against self to identify internal repeats
• Dotplot against other sequence
• Alter windowsize and stringency
• If similarity along whole seq, do global alignment
• Take default parameters
• Then change parameters to check effect
• If local/domain similarity only, then do local alignment
• If in doubt, do local alignment
• LOOK at the alignment and see if you can improve it by hand: use good judgment

LALIGN
• Internal repeats really confuse global alignment
• Local alignment reports only the BEST alignment
• What about sub-optimal, second best hits?
• If you do a dotplot, repeats will be clear
• Use LALIGN to report not only the best alignment but also any other repeated elements
• And show you the aligned sequences there

2 sequence alignment
Finally, some sequences are similar even if they have no recent common ancestor. Huntington's disease is caused by repeated CAG tracts in the DNA, which result in polyglutamine (Gln, Q) tracts in the protein. If you do a similarity search with QQQQQQQQQQ, you get hits to other proteins that have a lot of glutamines but have totally different functions.

2 sequence alignment
Huntingtin:

    MATLEKLMKA FESLKSFQQQ QQQQQQQQQQ QQQQQQQQQQ PPPPPPPPPP PQLPQPPPQA

Search against database hits:

    >MM16_MOUSE MATRIX METALLOPROTEINASE-16
    Score = 34.4 bits (78), Expect = 0.18
    Identities = 21/65 (32%), Positives = 25/65 (38%), Gaps = 2/65 (3%)

    FQQQQQQQQQQQQQQQQQQQQQQQPPPPPPPPPPPQLPQPPPQ--AQPLLPQPQPPPPPP
    F Q  +    +      Q  Q+   PP   PPP   LP  PP     P  P+    P PP
    FYQYMETDNFKLPNDDLQGIQKIYGPPDKIPPPTRPLPTVPPHRSVPPADPRRHDRPKPP

But MM16_MOUSE is a hit not because it is involved in microtubule-mediated transport! PRPs (proline-rich proteins) have the same problem.
