Bioinformatics Unit 1: Data Bases and Alignments

Lecture 3: “Homology” Searches and Sequence Alignments (cont.): The Mechanics of Alignments

Overview • Introduction/review • Reading alignment outputs • Scoring (substitution) matrices • More on alignment algorithms and dynamic programming • Useful alignment algorithms • Examples

Introduction Sequence alignment is a useful tool with many diverse applications. • Examples of sequence alignments: • Compare a new sequence against an established sequence from a database • In sequencing a new gene one usually sequences both strands and then aligns them (reversing one of them, of course!). Agreement between the two strands helps confirm accuracy.
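"Reversing" one strand before aligning it against the other means taking its reverse complement. A minimal Python sketch, assuming input made up only of A, C, G, T and N; the helper name `reverse_complement` is illustrative and does not come from the lecture:

```python
# Illustrative helper; the name and the restriction to A/C/G/T/N are assumptions.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    # Complement every base, then reverse the string so it reads 5' to 3'.
    return seq.translate(_COMPLEMENT)[::-1]


print(reverse_complement("ATGC"))  # GCAT
```

After this step both reads run in the same orientation and can be aligned base by base.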

Examples of Sequence Alignments (cont.) • To compare sequences and look for evolutionary relatedness (homology) • To identify the sites of mutations • To find regions of overlapping sequence (cosmids or YACs for example) • To identify conserved functional domains in gene products • Others to be sure!

Understanding Alignment Outputs • One sequence is placed above another and the aligned vertical pairs are compared (scored) • Matching pairs are joined with a bar ( | ) to indicate identity. • A colon ( : ) is used to identify similar but nonidentical pairs. • IUB ambiguity codes are used (e.g. N pairs with G, C, T or A). • Nonidentical amino acids with similar physical properties can also be reported as similar.

Example
330 CCTTNATTTCCTTTTTGACA 349
    ||||:||| |||||||||||
991 CCTTAATTCCCTTTTTGACA 972
• Only 20 bases of each sequence aligned (a local alignment)
• The numbers at each end of the alignment correspond to the nucleotide numbers in the original sequences.
• The 329 nucleotides before the aligned region of the top sequence, and the 971 nucleotides before the aligned region of the lower sequence, did not align.
• There may have been unaligned sequence beyond positions 349 and 991 too, or the entered sequences may only have been 349 and 991 bases long, respectively.

Example (cont.)
330 CCTTNATTTCCTTTTTGACA 349
    ||||:||| |||||||||||
991 CCTTAATTCCCTTTTTGACA 972
• The lower sequence is shown as its reverse complement (its numbering runs down from 991 to 972)
• There are two non-identical pairs
• Nucleotides number 334 and 987 are paired by a colon (:). The nucleotide at this position on the upper strand is an N, indicating that the sequencer was unable to determine the nucleotide identity.
• The pair at positions 338 (top) and 983 (bottom) comprises a T and a C. These do not match, so no line has been drawn between them. This may be the result of a point mutation, or a mistake in determining or entering the sequence.
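The middle line of such an output can be rebuilt from the two aligned sequences. The sketch below is one possible reading of the rules described above, not the behavior of any particular alignment program: identical unambiguous bases get a bar, pairs that are compatible only through an IUB ambiguity code get a colon, and everything else (including gaps) gets a space. The function name `match_line` is hypothetical.

```python
# IUB/IUPAC nucleotide codes mapped to the bases each one can stand for.
IUB = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def match_line(top, bottom):
    """Build the symbol line shown between two aligned nucleotide sequences."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    symbols = []
    for a, b in zip(top.upper(), bottom.upper()):
        if a == "-" or b == "-":
            symbols.append(" ")   # gap position
        elif a == b and len(IUB[a]) == 1:
            symbols.append("|")   # identical, unambiguous bases
        elif set(IUB[a]) & set(IUB[b]):
            symbols.append(":")   # compatible via an ambiguity code
        else:
            symbols.append(" ")   # mismatch
    return "".join(symbols)


print(match_line("CCTTNATTTCCTTTTTGACA", "CCTTAATTCCCTTTTTGACA"))
# ||||:||| |||||||||||
```

The N at position 334 yields the colon, and the T/C pair at position 338 yields the blank.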

Scoring Alignments • Positive values are given for each identical match • Smaller positive values are given for “conservative substitutions” • Negative values are given for non-identical, non-conservative pairs • Gaps are penalized • Total score is the sum of the individual pairwise scores • Longer alignments tend to give higher scores than shorter ones

Gaps and Scoring • Gaps may be caused by insertion in one sequence or deletion in the other (“indel” events). We don’t know which. • Gaps in an alignment are indicated by a ‘-’ in one or both of the sequences • Gaps are penalized in scoring an alignment in two ways • Origination penalty - the scoring penalty for creating a gap of any length (larger) • Length penalty - based on the length of the gap (smaller)

A Simple Example of Gap Scoring If the scoring scheme says: Match = +1, Mismatch = 0, Gap origination penalty = -2, Gap length penalty = -1 (for each base). Calculate the scores for each alignment. Which alignment is best and why?

A Simple Example of Gap Scoring With Match = +1, Mismatch = 0, Gap origination penalty = -2 and Gap length penalty = -1 (for each base), the three alignments score -3, -1 and 1. The third alignment is best: it has the highest score and, from an evolutionary standpoint, requires only one genetic event (an indel spanning 2 bases).
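The scoring rules can be checked with a short Python sketch. The three alignments used here are illustrative: they were chosen because they reproduce the scores of -3, -1 and 1 quoted above, and it is an assumption that they resemble the alignments the lecture had in mind. The function and parameter names are hypothetical.

```python
def score_alignment(top, bottom, match=1, mismatch=0, gap_open=-2, gap_extend=-1):
    """Score a finished alignment with origination and length gap penalties."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    score = 0
    gap_row = None  # which row the current run of gap characters is in
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            row = "top" if a == "-" else "bottom"
            if row != gap_row:
                score += gap_open    # charged once per gap, whatever its length
                gap_row = row
            score += gap_extend      # charged for every gapped position
        else:
            gap_row = None
            score += match if a == b else mismatch
    return score


# Illustrative alignments of the same two sequences (an assumption, see above).
for bottom in ("AAG-AT-A", "AA-G-ATA", "AA--GATA"):
    print(bottom, score_alignment("AATCTATA", bottom))
# AAG-AT-A -3
# AA-G-ATA -1
# AA--GATA 1
```

The second and third alignments contain the same five matches; only the number of separate gaps differs, which is why the origination penalty favors the single two-base indel.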

Scoring Matrices: How values are assigned for each pair in an alignment • DNA scoring matrices are fairly simple
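As a rough illustration of how simple a DNA matrix can be, the sketch below builds an identity-style matrix in Python and sums the pair scores of an ungapped alignment. The values (+1 for a match, 0 for a mismatch) are borrowed from the gap-scoring example and are an assumption; real programs use various values, and the names `identity_matrix` and `score_ungapped` are hypothetical.

```python
BASES = "ACGT"


def identity_matrix(match=1, mismatch=0):
    """Return a 4 x 4 DNA scoring matrix as a dict keyed by base pairs."""
    return {(a, b): (match if a == b else mismatch) for a in BASES for b in BASES}


def score_ungapped(top, bottom, matrix):
    """Sum the matrix values of each vertical pair in an ungapped alignment."""
    return sum(matrix[(a, b)] for a, b in zip(top, bottom))


matrix = identity_matrix()
print(score_ungapped("ACGT", "ACGA", matrix))  # 3
```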

Scoring Matrices: How values are assigned for each pair in an alignment • Protein matrices are far more complex • There are 20 “letters” v. only 4 in DNA • Far greater opportunity for conservative substitutions • Some are based on “observed” substitutions • Others are based on chemical/physical properties of the amino acids • Others are based on the genetic code (how easily could a codon specifying one amino acid be changed to a codon specifying a different amino acid?)

Two Common Protein Scoring Matrices • The Point Accepted Mutation (PAM) matrix • Based on observed substitution rates • Different versions are used depending on how long ago the sequences are assumed to have diverged • PAM-1 may be best for comparing two closely related sequences • PAM-1000 may be best for comparing sequences with distant relationships • PAM-250 is a suitable compromise

A PAM250 Scoring Matrix

Two Common Protein Scoring Matrices (cont.) • BLOSUM matrices are also commonly used • Constructed by analyzing substitution rates for sequences that cluster by phylogenetic analysis • Also appended with numbers, but with a different meaning: higher numbers suit more closely related sequences, the reverse of PAM • BLOSUM-62 is best for comparing sequences with approximately 62% identity • BLOSUM-80 is best for comparing sequences with approximately 80% identity

Alignment Algorithms and Dynamic Programming • Computer trickery! • The straightforward approach (scoring every possible alignment) is too computationally intensive • For 2 sequences of 95 and 100 nucleotides there are ~ 55 million possible alignments! • (imagine a database search in this context!) • Dynamic programming breaks the problem into a series of small steps and combines the results of these small steps to answer the whole problem

Dynamic Programming (cont.) When you run an alignment, a dynamic programming matrix is formed with the two sequences along its sides. Scores for each pair are placed in the matrix. The alignment is read by tracing a path from the lower right corner back to the upper left corner; if the sequences match exactly, the path runs straight along the diagonal.
AC--TCG
ACAGTAG
Alignment score = 2
Vertical arrows indicate internal gaps
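The matrix fill and traceback can be sketched in Python as a basic global alignment in the style of Needleman and Wunsch. The scoring values (match +1, mismatch 0, flat gap penalty of -1 per position, no origination penalty) are an assumption, inferred because they give the score of 2 quoted above for AC--TCG over ACAGTAG. The function name `global_align` is hypothetical, and when several alignments tie for the best score this sketch returns just one of them.

```python
def global_align(seq_a, seq_b, match=1, mismatch=0, gap=-1):
    """Return (score, aligned_a, aligned_b) for one optimal global alignment."""
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    table = [[0] * cols for _ in range(rows)]
    # First column and first row: a prefix aligned entirely against gaps.
    for i in range(1, rows):
        table[i][0] = i * gap
    for j in range(1, cols):
        table[0][j] = j * gap
    # Each cell keeps the best of: pair the two bases, or gap one of them.
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            table[i][j] = max(table[i - 1][j - 1] + pair,
                              table[i - 1][j] + gap,
                              table[i][j - 1] + gap)
    # Traceback from the lower right corner to the upper left corner.
    i, j = len(seq_a), len(seq_b)
    top, bottom = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            pair = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            if table[i][j] == table[i - 1][j - 1] + pair:
                top.append(seq_a[i - 1])
                bottom.append(seq_b[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and table[i][j] == table[i - 1][j] + gap:
            top.append(seq_a[i - 1])   # base of seq_a opposite a gap
            bottom.append("-")
            i -= 1
        else:
            top.append("-")            # base of seq_b opposite a gap
            bottom.append(seq_b[j - 1])
            j -= 1
    return table[-1][-1], "".join(reversed(top)), "".join(reversed(bottom))


score, top, bottom = global_align("ACTCG", "ACAGTAG")
print(score)  # 2
print(top)
print(bottom)
```

The table has only (len(seq_a) + 1) x (len(seq_b) + 1) cells, so the work grows with the product of the two lengths rather than with the number of possible alignments.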

Graphical Output: Dot plots and Path Graphs

Comparison
Dot Plots
• Have been popular
• Reveal complex relationships involving multiple regions
• Difficult to interpret as they (may) show many alignments
• Hard to see gaps and visualize the “best” alignment
Path Diagrams
• Simpler to interpret
• Show only one alignment (some can show more)
• Gaps appear as horizontal or vertical segments of the path line
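A dot plot is easy to generate: put one sequence along each axis and mark every cell where the two sequences agree. The Python sketch below is text-only and illustrative; the function name `dot_plot` and the optional `window` parameter (which requires a run of consecutive matches before a dot is drawn) are assumptions and are not taken from the lecture.

```python
def dot_plot(seq_x, seq_y, window=1):
    """Return the rows of a text dot plot: seq_x across, seq_y down."""
    rows = []
    for j in range(len(seq_y) - window + 1):
        row = ""
        for i in range(len(seq_x) - window + 1):
            # Mark the cell when the two windows starting here are identical.
            same = seq_x[i:i + window] == seq_y[j:j + window]
            row += "*" if same else "."
        rows.append(row)
    return rows


for line in dot_plot("ACGT", "ACGT"):
    print(line)
# *...
# .*..
# ..*.
# ...*
```

Identical sequences give an unbroken diagonal. A gap would show up as a shift of the diagonal, which is harder to spot than the horizontal or vertical segment a path diagram draws.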

Example 1 [Figure: sequences X and Y, each labeled 5’ to 3’]

Example 2 [Figure: sequences X and Y, each labeled 5’ to 3’]

Example 3 [Figure: sequences X and Y, each labeled 5’ to 3’]

Some Useful Alignment Programs • BLAST 2 Sequences (NCBI) • CLUSTALW (Biology Workbench) • MAP (Multiple Alignment Program) at Baylor, TX • Many others

A Nice BLAST 2 Sequences Example at: http://www.ncbi.nlm.nih.gov/blast/
