Pairwise sequence alignments

Pairwise sequence alignments contains essentially the same algorithms as water (local) and needle (global) from Bio import pairwise2 from Bio import SeqIO seq1 = SeqIO.read("alpha.faa", "fasta”) seq2 = SeqIO.read("beta.faa", "fasta") alignments = pairwise2.align.globalxx(seq1.seq, seq2.seq) print(pairwise2.format_alignment(*alignments[0])) global alignment 2 letter code (scoring matches, gaps) x … match scores 1, x … no cost for gaps

Pairwise sequence alignments contains essentially the same algorithms as water (local) and needle (global) from Bio import pairwise2 from Bio import SeqIO seq1 = SeqIO.read("alpha.faa", "fasta”) seq2 = SeqIO.read("beta.faa", "fasta") alignments = pairwise2.align.globalmx(seq1.seq, seq2.seq, 2, -1) print(pairwise2.format_alignment(*alignments[0])) match, mismatch score

Pairwise sequence alignments contains essentially the same algorithms as water (local) and needle (global) from Bio import pairwise2 from Bio import SeqIO seq1 = SeqIO.read("alpha.faa", "fasta”) seq2 = SeqIO.read("beta.faa", "fasta") alignments = pairwise2.align.globalms(seq1.seq, seq2.seq, 2, -1, -1, -0.5) print(pairwise2.format_alignment(*alignments[0])) match, mismatch score gap opening, extending

Pairwise sequence alignments from Bio import pairwise2 from Bio import SeqIO rom Bio.SubsMat.MatrixInfo import blosum62 seq1 = SeqIO.read("alpha.faa", "fasta”) seq2 = SeqIO.read("beta.faa", "fasta") alignments = pairwise2.align.localds(seq1.seq, seq2.seq, blosum62, -10, -0.5) print(pairwise2.format_alignment(*alignments[0])) costs for opening, extending gaps substitution matrix, Blosum, PAM local alignment

Pairwise sequence alignments parameters <alignment type> X X gap match global, local Match parameters x … No parameters. Identical characters have score of 1, otherwise 0. m… A match score is the score of identical chars, otherwise mismatch d … A dictionary returns the score of any pair of characters. c… A callback function returns scores. Gap parameters x … no gap penalties. s … same open and extend gap penalties for both sequences. d … the sequences have different open and extend gap penalties. c … A callback function returns the gap penalties.

Problem with N-W and S-W They are exhaustive, with stringent statistical thresholds As the databases got larger, these began to take too long – computational constraints Need a program that runs an algorithm that is based on dynamic programming (substrings) but uses a heuristic approach (takes assumptions, quicker, not as refined). Therefore can deal with comparing a sequence against 100 million others in a relatively short time… the bigger the database, the more optimal the results.

Two heuristic approaches Each is based on different assumptions and calculations of probability thresholds (ie the statistical significance of results) • FASTA • All proteins in the db are equally likely to be related to the query •  probability multiplied by the number of sequences in database • 2. BLAST • Query more likely to be related to a sequence its own length or shorter •  probability multiplied byN/n • N: total number of residues in database • n: length of subject sequence

The BLAST algorithm Basic Local Alignment Search Tool Breaks down your sequence into smaller segments (words) Does the same for all sequences in the database Looks for exact matches, word by word, and expands those up- and down- stream one base at a time, allowing for a certain number of mismatches Stops when sequence ends or statistical significance becomes too low (too many mismatches) Can find more than one area of similarity between two sequences

The BLAST algorithm query sequence STEP 1: k-mers in query sequence … PQGEFGFAT… PQG QGE GEF EFG words … STEP 2: Score the words: Match all 3-mers obtained in step 1 with all 203 matches. With a threshold T, keep the best scoring words.

The BLAST algorithm STEP 3: organize the search words in a tree for efficient retrieval STEP 4: repeat step 2-3 for each 3-mer STEP 5: scan the database sequences to find exact matches of words, extend the exact matches to high-scoring segment pair (HSP). query sequence … PQGEFGFAT… database sequence … AQGEFEAQT… ... -2 67 2 6 1 -2 -3 7 … HSP, score: 22

The BLAST algorithm STEP 6: evaluate the significance of the HSP score Karlin-Altschul statistics: Using the HSPs, we calculate the expected number of HSPs with score at least S: K, l… constant m, n length of query and database sequence

The BLAST algorithm Bit scores: Raw scores have little meaning without detailed knowledge of the scoring system used, or more simply its statistical parameters K and l. Unless the scoring system is understood, citing a raw score alone is like citing a distance without specifying feet, meters, or light years.Therefore we normalize to a Bitscore so that which is independent of K and l.

The BLAST algorithm P-values: The number of random HSPs with score ≥Sis described by a Poisson distribution(i.e. the probability of finding exactly a HSPs with score ≥Sis given by The chance of finding zero HSP with score ≥S is e-E. So the chance to find at least one HSP with score ≥S is

The BLAST algorithm So far our considerations were for pairwise alignments. What are we doing if we search a database? Correction for multiple testing, because a priori all proteins in the database are a priori equally likely to be related to the query. Therefore in the simplest case, E-value is corrected by the number of sequences In the database: E’= E*N An alternative view is that a query is a priori more likely to be related to a long than to a short sequence, because long sequences are often composed of multiple distinct domains. If we assume the a priori chance of relatedness is proportional to sequence length, then the pairwise E-value involving a database sequence of length n should be multiplied by N/n, where N is the total length of the database in residues.

BLAST Using online BLAST database type from Bio.Blast import NCBIWWW result_handle= NCBIWWW.qblast("blastn", "nt", "8332116") sequence Blast program or from Bio.Blast import NCBIWWW from Bio import SeqIO record = SeqIO.read("m_cold.fasta", format="fasta") result_handle= NCBIWWW.qblast("blastn", "nt", record.seq) result_handle = NCBIWWW.qblast("blastn", "nt", record.format(“fasta”) Just sequence whole seq. information

BLAST Using local BLAST http://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=Download - First we create a command line prompt (for example): blastp–query alphaProtein.faa–db HS –out out.txt–evalue 0.01 - then we use the BLAST Python wrappers from BioPython: evalue threshold Blast program query sequence output file database from Bio.Blast.Applications import NcbiblastpCommandline blastp_cline = NcbiblastpCommandline(cmd ="~/ncbi-blast-2.7.1+/bin/blastp", query="alphaProtein.faa", db="HS", evalue=0.01, out="out.txt") blastp_cline() use Bio.Blast.NCBIStandalone.BlastParser() to parse results in the output file

BLAST Parsing BLAST output from Bio.Blast import NCBIStandalone result_handle = open(”out.txt") blast_parser = NCBIStandalone.BlastParser() blast_record = blast_parser.parse(result_handle) for alignment in blast_record.alignments: for hsp in alignment.hsps: if hsp.expect < 0.001: print (alignment.title) print (alignment.length) print (hsp.expect) print (hsp.query) print (hsp.match) print (hsp.sbjct) list of alignments holds the information about Blast output E-value query sequence alignment matched db seq.

Algorithm evolution Smith Waterman Localalignment algorithm – finds small, locally similar regions (substrings), matrix-based, each cell in the matrix defined the end of a potential alignment. BLAST – start with highest scoring short pairs and extend and down the sequence. Great, but when you’re talking about millions of reads…

Pairwise sequence alignments