Sequence Alignment by Dynamic Programming

Bios 560R, Lecture 2 Genome sequencing (very brief) Overview of HGP Shotgun sequencing Next generation sequencing Sequence Alignment by Dynamic Programming Overview The difficulty The scoring system Dynamic programming for pairwise alignment

Human genome project (HGP) – some milestones Science 9 February 2001: Vol. 291. no. 5507, p. 1195 • 1977. Allan Maxam and Walter Gilbert at Harvard University and Frederick Sanger at the U.K. Medical Research Council (MRC) independently develop methods for sequencing DNA. • 1988. NIH establishes the Office of Human Genome Research. • 1995. Patrick Brown of Stanford and colleagues publish first paper using a printed glass microarray of complementary DNA (cDNA) probes • Researchers at Whitehead and Généthon (led by Lander and Thomas Hudson at Whitehead) publish a physical map of the human genome containing 15,000 markers. • 1996. NIH funds six groups to attempt large-scale sequencing of the human genome. • Affymetrix makes DNA chips commercially available. • An international consortium publicly releases the complete genome sequence of the yeast • 1998. a new company named Celera. • NIH announces a new project to find SNPs • The HGP consortium publishes its working draft in Nature (15 February), Celera publishes its draft in Science (16 February). • 2006. Sequence of all chromosomes finalized.

Genome sequencing strategies Shotgun sequencing. Shotgun: fires a number of small spherical pellets. The genome DNA (~ 3 billion base pairs, in 22 autosomes and 2 sex chromosomes) is too long for sequencing directly. The DNA is randomly broken is to small pieces. The pieces are sequenced. The short sequences are assembled into long sequences based on the overlapping of fragments. http://www.shotgun-insight.com/expressFibreVsPlastic/proFibre5_25yds80percent.jpg Genome Reads Contigs (islands)

Genome sequencing strategies The Lander-Waterman model. Genomics. 1988 Apr;2(3):231-9. Important result about coverage (direct quotes from the paper):

Genome sequencing strategies Proc NatlAcadSci U S A. 2002 March 19; 99(6): 3712–3716. HGP Celera “additional preliminary work” “a greater risk of long-range misassembly”

Whole genome using next generation sequencing ? Next generation sequencing generates much shorter reads at higher throughput. nature biotechnology 26(10):1135. Sanger method Next Generation

Whole genome using next generation sequencing ? Application of next-generation sequencing (NGS) in various ways: Nature Reviews Drug Discovery 12, 358–369 (2013)

Between protein and gene sequences • The gene sequence (DNA/RNA)are made of 4 characters (nucleotides) • The protein sequence is made of 20 characters (amino acids) • The DNA codons: • When discussing alignment, we consider both DNA sequence and protein sequence.

Sequence alignment • Alignment answers the question of whether two or more sequences are (evolutionarily) related. • Example: A new gene is found in human. We wish to study its properties. To get a hint, we try to find its corresponding part in mouse. Among the tens of thousands of genes in mouse, which is the one that’s most related to this human gene? (BLAST.) • Sequence segments more important to biological functions are more conserved. Align proteins sharing a function to find peptide sequences more important to the function. Align DNA sequences upstream to (functionally or evolutionarily) related genes to find segments that bind transcription factors.

Sequence alignment Kapitonov VV, Jurka J PLoS Biology Vol. 3, No. 6, e181

Sequence alignment http://pevsnerlab.kennedykrieger.org/wiley/bestfit2.gif

Sequence alignment • Alignment serves as the basis for determining evolutionary relationships. • Example: http://www.computational-genomics.net/case_studies/sabertooth_demo_37.png

Sequence alignment • In the design and/or data interpretation of sequence-based high-throughput technologies, sequence comparison is necessary: • To find microarray probes that do not have similar sequence with other genes. • To match sequences in high-throughput sequencing data to genome. • To find motifs based on ChIP-chip/ChIP-seqdata. • ……

Difficulty in alignment The huge search space (the number of possible alignments) ! Sequence 1: GSAQVK Sequence 2: GNPKVK GSAQVK----- GSAQVK---- -----GNPKVK ----GNPKVK GSAQ--VK -----GSAQVK -G-NPKVK GNPKVK----- ………………… How to pick the best??

The scoring system • A quasi-statistical model log-likelihood ratio: x: sequence segment 1, y: sequence segment 2 i: aligned position Pxiyi: Prob(xi and yi are in aligned position) Pxi: Prob(xi occurs) Pyi: prob(yi occurs) M: this aligned model R: x, y are independent sequences

The scoring system Note: this kind of “model” is for illustration purposes only! There are many possible models: M1, M2, ……, Mn According to the Bayes Rule, This is MLE also since we are using an uninformative prior.

The scoring system • Given the above, the best alignment model is the one that maximizes Notice that there are a limited number of possible xi, yi pairs. For amino acids, there are 20 choose 2 pairs, plus any amino acid against Gap (all equal values : the “gap penalty”). How do we get these values ????

The scoring system • These values are obtained from some high-confidence alignments: GAPVKFC GAPKKFC The log likelihood ratio of the pair occurring as an aligned pair as opposed to an unaligned pair. It can be obtained in multiple ways. Generally, such a matrix describes the rate that one character (amino acid or nucleotide) is replaced by another one during evolution.

The scoring system PAM (Point Accepted Mutation) matrix is calculated from closely related sequences. Finding aligned positions is simple as there are only a few positions where the two sequences are different. The number following “PAM” denotes evolutionary distance. PAM1 matrix considers only direct mutations. Taking this matrix to N-th power with proper normalization yields PAM-N matrix.

The scoring system http://www.brc.dcs.gla.ac.uk/~drg/courses/bioinformatics_mscIT/slides/slides3/img033.gif

The scoring system BLOSUM (BLOcks of Amino Acid SUbstitutionMatrix). It comes from blocks of aligned sequences with less similarity than those used by PAM. The log-odds ratio are calculated directly from the observed alignment. There is no extrapolation as PAM.

The scoring system http://en.wikipedia.org/wiki/Image:BLOSUM62.gif

Dynamic programming • Now, for every pair of amino acids, we have a score that can be considered the log likelihood ratio. So what? Do we still need to perform endless computations to compare all models? • The answer is of course NO. • Different models share sub-models. We should not re-calculate the score • For non-optimal models, we do not need to know their exact score. GSAQVK G-SA-QVK GSA-QVK GNPKVK GN-PK-VK G-NPKVK

Dynamic programming “Two sledgehammers of the algorithms craft, dynamic programming and linear programming” Dynamic programming: Breaking the overall optimization problem into overlapping smaller problems; Solve each sub-problem once, and reuse the results, thus reducing the computing cost (dramatically); Often working backward.

Dynamic programming A simple example: Find the shortest path from S to E in the directed acyclic graph below. Take node D as an example. The way to get to D is through B or C. So, Algorithms, Dasgupta et al.

Dynamic programming Linearization: Algorithm: Algorithms, Dasgupta et al.

Dynamic programming Compare: Exhaustive approach SABE: 1+6+2 SCABE: 2+4+6+2 SCDE: 2+3+1 SCABDE: 2+4+6+1+1 (2) DP approach Dist(A)=min(1, 2+4)=1 Dist(C)=2 Dist(B)=min(dist(A)+6)=1+6=7 Dist(D)=min(dist(B)+1, dist(C)+3)=min(7+1, 2+3)=5 Dist(E)=min(dist(B)+2, dist(D)+1)=min(7+2, 5+1)=6 11 additions. Complexity grows exponentially with the size of graph 6 additions. Complexity grows linearly with the size of graph

Dynamic programming Another example: A game of picking up matches. There are 30 matches on the table. You start by picking up 1~3 matches, then your opponent picks up 1~3 matches. This goes on until the last match is picked up. The person picking up the last is the loser. 1, 5, 9, 13, 17, 21, 25, 29 One step back: this is what you want to leave to your opponent Two steps back: this is what you want to leave to your opponent The first step: this is what you want to leave to your opponent 29x Last step: you want to leave the last match to your opponent Operations Research: mathematical programming, Winston WL

Dynamic programming Two critical components of dynamic programming sequence alignment: • The assumption that the score of an aligned pair of amino acids is independent of other parts of the alignment. • Due to the additive nature of the likelihood, any sub-alignment of the optimal overall alignment must also be optimal. Thus the search for the global optimum becomes a sequential search for local optima.

Dynamic programming Any sub-model of the best model must be optimal. Example. if this is the best model, GS-AQVK G-NPKVK Then any part of it, for example the middle part -AQ NPK Must be better than any other possible sub-model. Otherwise, say if it is not as good as A-Q NPK Then, GSA-QVK G-NPKVK Is better than the “best” model.

Dynamic programming Dynamic programming is used when the overall solution can be achieved by recursively solving sub-problems. It saves computing time by re-using results of submodels. GSAQVK G-SA-QVK GSA-QVK GNPKVK GN-PK-VK G-NPKVK In contrast, the “brute-force” method that simply lists all possible alignments and re-calculate the scores for every model solves the same sub-problem over and over again.

Needleman-Wunsch algorithm for global alignment Create a matrix of dimension (m+1)x(n+1), where m and n are the lengths of the two sequences. Let M(0,0)=0 Iteratively calculate other values in the matrix, while recording which way (align, gap-x, gap-y) led to the maximum for each cell. Mi,j = MAXIMUM[ Mi-1, j-1 + Si,j (match/mismatch in the diagonal), Mi,j-1 + w (gap in sequence #1), Mi-1,j + w (gap in sequence #2) ]

Gap penalty: -8

Needleman-Wunsch algorithm http://www.matfys.kvl.dk/databehandling/f2004/eksempler/word/needleman-wunsch.png

Smith-Waterman algorithm The previous algorithm looks for the best OVERALL alignment between two sequences. A more common task is to find best SUB-SEQUENCE alignments between two sequences. Mi,j = MAXIMUM[ 0, Mi-1, j-1 + Si,j (match/mismatch in the diagonal), Mi,j-1 + w (gap in sequence #1), Mi-1,j + w (gap in sequence #2) ]

Smith-Waterman algorithm http://www.bi.a.u-tokyo.ac.jp/~shimizu/bioinfo/s-w.gif

Comments In biological sequences, a gap is often longer than 1. For example, the insertion of a new functioning domain. A long gap may be penalized too much by the previous methods. - assign two different penalties: gap initiation penalty. gap extension penalty. Now at every position, 3 values, instead of 1, are kept. Significance of an alignment? - Bayesian approach model comparison. - significance derived from extreme value distribution.

Difficulty in higher-dimensions When we have two sequences to align, we can use dynamic programming with o(n^2) calculations. Can we use the same idea when we have more sequences? It is hard. For example, when we have 3 sequences, instead of operating in a square, we will be operating in a cube. The number of calculations explodes with more sequences. http://www.bioinf2.leeds.ac.uk/b/MA_sect.gif

Sequence Alignment by Dynamic Programming