BLAST, PSI-BLAST and position-specific scoring matrices

Prof. William Stafford Noble, Department of Genome Sciences, Department of Computer Science and Engineering, University of Washington, thabangh@gmail.com

Outline • Responses from last class • Revision • BLAST • PSI-BLAST • Position specific scoring matrices (PSSMs) • Python

One-minute responses • Please explain the null and alternative hypothesis again. • Liked giving examples on the statistical concepts. • Sometimes the class is boring because you are using only the projector. • For Python, we learn more by practicing than just looking at your code. • Python session was good, but too fast. • More Python examples, please. • The Python is difficult because it is different from what we learned before. • The problem is how to use sys in Python. I hope you give lots of examples for the sys command. • Please be available for consultation over the weekend on the assignment. • Does BLAST use p-values to decide which alignments to consider?

Revision • What is a distribution? • A mathematical function whose values sum to 1. • If you roll a single die many times and make a histogram of the resulting values, what kind of distribution will you observe? • Uniform • If you compare a protein sequence to many randomly shuffled protein sequences and make a histogram of the resulting scores, what kind of distribution will you observe? • Extreme value distribution • What is the definition of “null hypothesis”? • A statistical model of the situation that we are not interested in. • What is the opposite of the null hypothesis? • The alternative hypothesis. • What is the name of the estimated probability of observing the data, assuming that it was generated according to the null hypothesis? • p-value • How do you decide what p-value threshold to use? • Consider the costs associated with making a mistake.

Significance of scores • Two sequences, HPDKKAHSIHAWILSKSKVLEGNTKEVVDNVLKT and LENENQGKCTIAEYKYDGKKASVYNSFVSNGVKE, are given to a sequence alignment algorithm, which returns a score of 45. • Low score = unrelated • High score = homologs • How high is high enough?

Database searching Sequence database Query Targets ranked by score Sequence comparison algorithm

Dynamic programming matrix: query sequence of length n, target sequence of length m. How long does DP take? There are nm entries in the matrix. Each entry requires a constant number c of operations. The total number of required operations is approximately nmc. We say that the algorithm is “order nm” or “O(nm).”
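
To make the nm count concrete, here is a minimal sketch of a local-alignment score computed by dynamic programming. The function name and the scoring values (match = 2, mismatch = -1, gap = -2) are assumptions chosen for illustration; a real protein search would use a substitution matrix and usually affine gap penalties.

```python
def local_alignment_score(query, target, match=2, mismatch=-1, gap=-2):
    """Return (best local alignment score, number of DP cells filled)."""
    n, m = len(query), len(target)
    previous = [0] * (m + 1)      # row i-1 of the DP matrix
    best = 0
    cells = 0
    for i in range(1, n + 1):
        current = [0] * (m + 1)   # row i of the DP matrix
        for j in range(1, m + 1):
            substitution = match if query[i - 1] == target[j - 1] else mismatch
            current[j] = max(
                0,                            # start a new local alignment
                previous[j - 1] + substitution,  # align the two residues
                previous[j] + gap,            # gap in the target
                current[j - 1] + gap,         # gap in the query
            )
            best = max(best, current[j])
            cells += 1
        previous = current
    return best, cells


score, cells = local_alignment_score("HPDKK", "HPDKK")
print(score, cells)  # cells is always len(query) * len(target)
```

Each cell costs a fixed handful of comparisons and additions, which is the constant c in the estimate above.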

How long does DP take? • Say that your query is 200 amino acids long. • You are searching a database that contains a million proteins. • If their average length is 200, then you have to fill in 200 × 200 × 1,000,000 = 4 × 10^10 DP entries. • If it takes only 10 operations to fill in each cell, then you still have to do 4 × 10^11 operations.
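
The same back-of-the-envelope arithmetic can be checked in a few lines of Python. The variable names are illustrative; the numbers are the ones from the slide.

```python
# Cost of searching a protein database with dynamic programming.
query_length = 200          # amino acids in the query
num_targets = 1_000_000     # proteins in the database
avg_target_length = 200     # average protein length
ops_per_cell = 10           # assumed constant cost per DP cell

dp_cells = query_length * avg_target_length * num_targets
total_ops = dp_cells * ops_per_cell

print(f"DP cells:   {dp_cells:.1e}")
print(f"Operations: {total_ops:.1e}")
```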

BLAST • DP is O(nm); BLAST is O(m). • Fundamental innovation: employ a data structure to index the query sequence. • The data structure allows you to look up entries in a table in O(1) time. • Example: does my length-n sequence contain the subsequence “GTR”? • Naive method: scan the sequence, O(n). • Improved method: hash table or search tree lookup, O(1).
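
A minimal sketch of the indexing idea, using a Python dictionary as the hash table. The function name and the word length of 3 are assumptions for illustration; the query is the first sequence from the "Significance of scores" slide.

```python
from collections import defaultdict


def build_word_index(sequence, word_length=3):
    """Map every length-k word in the sequence to its start positions."""
    index = defaultdict(list)
    for start in range(len(sequence) - word_length + 1):
        index[sequence[start:start + word_length]].append(start)
    return index


query = "HPDKKAHSIHAWILSKSKVLEGNTKEVVDNVLKT"
index = build_word_index(query)

# Naive method: scan the whole sequence, O(n) per question.
print("GTR" in query)

# Improved method: one dictionary lookup per question, O(1) on average.
print(index.get("GTR", []))   # start positions of "GTR" (empty if absent)
print(index.get("HAW", []))   # start positions of "HAW"
```

Building the index costs one pass over the query, but it is paid once and then reused for every word of every target sequence.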

BLAST Query Query sequence List of words in query and similar words Target sequence

BLAST Query Query sequence List of words in query and similar words Target sequence “Does this target word appear in the query word list?”

BLAST Query Query sequence List of words in query and similar words x “Yes, at position 34 in the query sequence.” Target sequence

BLAST Query x x x x Query sequence x List of words in query and similar words x x x x Target sequence

BLAST Query x x x These two hits are on the diagonal and close to each other, so let’s try to connect them. x Query sequence x List of words in query and similar words x x x x Target sequence
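
The "same diagonal and close together" test can be sketched as follows. This is a simplified illustration, not the BLAST implementation: it uses exact word matches only (no similar words), and the function names, the toy sequences, and the distance cutoff of 40 are all assumptions.

```python
from collections import defaultdict


def find_hits(query, target, word_length=3):
    """Return (query_pos, target_pos) for every word shared by both sequences."""
    index = defaultdict(list)
    for q in range(len(query) - word_length + 1):
        index[query[q:q + word_length]].append(q)
    hits = []
    for t in range(len(target) - word_length + 1):
        for q in index.get(target[t:t + word_length], []):
            hits.append((q, t))
    return hits


def pairs_on_same_diagonal(hits, max_distance=40):
    """Pair up neighbouring hits that share a diagonal (target_pos - query_pos)."""
    by_diagonal = defaultdict(list)
    for q, t in hits:
        by_diagonal[t - q].append(q)
    pairs = []
    for diagonal, positions in by_diagonal.items():
        positions.sort()
        for first, second in zip(positions, positions[1:]):
            if second - first <= max_distance:
                pairs.append((diagonal, first, second))
    return pairs


hits = find_hits("MKGTRLLPWYHEA", "QQMKGTRAAPWYHQ")
for diagonal, first, second in pairs_on_same_diagonal(hits):
    print(f"diagonal {diagonal}: query positions {first} and {second}")
```

Two hits lie on the same diagonal exactly when the offset between target position and query position is equal, which means they can be joined without introducing a gap.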

BLAST Query x x x x Query sequence x List of words in query and similar words x x x x Target sequence

BLAST Assign a score to each hit Query x 0.005 Query sequence x List of words in query and similar words 0.27 x x Target sequence

BLAST • “The central idea of the BLAST algorithm is that a statistically significant alignment is likely to contain a high-scoring pair of aligned words.” • The initial word threshold T is the most important parameter. • Low T = high sensitivity, long compute. • High T = low sensitivity, quick compute.
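
The effect of the word threshold T can be seen by counting how many "similar words" a single query word generates. The sketch below uses a toy scoring scheme (+4 for identical residues, -1 otherwise) as a stand-in for a real substitution matrix, so the counts are illustrative only; the function names are assumptions.

```python
from itertools import product

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def toy_score(a, b):
    """Toy substitution score (NOT BLOSUM62): +4 for identity, -1 otherwise."""
    return 4 if a == b else -1


def neighborhood(word, threshold):
    """All words of the same length that score at least `threshold` against `word`."""
    return [
        "".join(candidate)
        for candidate in product(ALPHABET, repeat=len(word))
        if sum(toy_score(a, b) for a, b in zip(word, candidate)) >= threshold
    ]


for T in (12, 7, 2):
    print(f"T = {T:2d}: {len(neighborhood('GTR', T))} words in the list")
```

Lowering T makes the word list longer, so more target words produce hits: sensitivity goes up, and so does the amount of work.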

When does BLAST fail? ERDCRVSSFRVKENFDKARFAGTWYAMAKKDPEGLFLQDNIVAEFSVDENGHMSATAKGRVRLLNNWDVCADMVGTFTDT E R F E K A Y K E L I F E M A V N V M F ECEIRQFLFIQRESARKEACATGTYREKKMDPELIVLVIWICPQFEQLEMRAMWIHAKJEVIUENAQCVIYTMQEPFCII • BLAST works by joining together short regions of high similarity. • Therefore, BLAST will fail to detect long regions of low similarity.

Summary of BLAST • Dynamic programming is O(nm), where n is the length of the query and m is the size of the database. • BLAST is O(m). • BLAST produces an index of the query sequence that allows fast matching to the database. • Relative to Smith-Waterman, BLAST can produce false negatives; i.e., homologs that BLAST fails to detect.

BLAST Query Homologs Sequence database BLAST

Position-specific iterated BLAST Position-specific scoring matrix (PSSM) Query Statistical model of protein family Homologs Sequence database BLAST

Position-specific scoring matrix Position in query sequence • A PSSM is an n by m matrix, where n is the size of the alphabet, and m is the length of the sequence. • The entry at (i, j) is the score assigned by the PSSM to letter i at the jth position. “K” at position 3 gets a score of 2.

Position-specific scoring matrix • This PSSM assigns the sequence NMFWAFGH a score of 0 + -2 + -3 + -2 + -1 + 6 + 6 + 8 = 12.

What score does this PSSM assign to KRPGHFLA? • 2 + 0 + -2 + 6 + 0 + 6 + -4 + -2 = 6
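
Scoring a sequence with a PSSM is one table lookup per position followed by a sum. The sketch below stores only the PSSM entries that are quoted in the two worked sums on these slides, assuming the terms of each sum are listed in sequence order; the full 20-row matrix is in the slide figure and is not reproduced here. The names are illustrative.

```python
# One dictionary per position; only the entries quoted on the slides are filled in.
PARTIAL_PSSM = [
    {"N": 0, "K": 2},
    {"M": -2, "R": 0},
    {"F": -3, "P": -2},
    {"W": -2, "G": 6},
    {"A": -1, "H": 0},
    {"F": 6},
    {"G": 6, "L": -4},
    {"H": 8, "A": -2},
]


def score_with_pssm(pssm, sequence):
    """Sum the PSSM entry for each residue at its position."""
    if len(sequence) != len(pssm):
        raise ValueError("sequence length must equal the number of PSSM columns")
    return sum(column[residue] for column, residue in zip(pssm, sequence))


print(score_with_pssm(PARTIAL_PSSM, "NMFWAFGH"))
print(score_with_pssm(PARTIAL_PSSM, "KRPGHFLA"))
```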

How PSI-BLAST makes PSSMs

Position-specific iterated BLAST ? Query PSSM Multiple alignment Sequence database BLAST

Creating a PSSM from 1 sequence R L RNRGQFGH R BLOSUM62 matrix 20 by 20 20 by L
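
With a single sequence, each PSSM column is simply the substitution-matrix row for the query residue at that position. The sketch below shows the construction on a three-letter toy alphabet; the matrix values and names are illustrative assumptions, and a real run would use the full 20 by 20 BLOSUM62 matrix.

```python
def pssm_from_sequence(query, substitution_matrix, alphabet):
    """Build an (alphabet size) by (query length) PSSM from one sequence.

    Entry [letter][j] is the substitution score between query[j] and letter.
    """
    return {
        letter: [substitution_matrix[residue][letter] for residue in query]
        for letter in alphabet
    }


# Toy symmetric substitution matrix over three residues (illustrative values).
TOY_MATRIX = {
    "G": {"G": 6, "H": -2, "R": -2},
    "H": {"G": -2, "H": 8, "R": 0},
    "R": {"G": -2, "H": 0, "R": 5},
}

pssm = pssm_from_sequence("GHRG", TOY_MATRIX, "GHR")
for letter, row in pssm.items():
    print(letter, row)
```

Scoring a target against this PSSM gives the same result as scoring it against the query with the substitution matrix directly; the PSSM only starts to differ once more sequences are added.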

Position-specific iterated BLAST ? Query PSSM Multiple alignment Sequence database BLAST

Creating a PSSM from multiple sequences • Discard columns that contain gaps in the query. • For each column C • Compute relative sequence weights • Compute PSSM entries, taking into account • Observed residues in this column • Sequence weights • Substitution matrix

Discard query gap columns • Before (the query is the first row): EEFG----SVDGLVNNA QKYG----RLDVMINNA RRLG----TLNVLVNNA GGIG----PVD-LVNNA KALG----GFNVIVNNA ARFG----KID-LIPNA FEPEGPEKGMWGLVNNA AQLK----TVDVLINGA • After: EEFGSVDGLVNNA QKYGRLDVMINNA RRLGTLNVLVNNA GGIGPVD-LVNNA KALGGFNVIVNNA ARFGKID-LIPNA FEPEGMWGLVNNA AQLKTVDVLINGA
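
A minimal sketch of this step, assuming the query is the first row of the alignment and gaps are written as "-". The function name is illustrative; the data is the alignment from this slide.

```python
def discard_query_gap_columns(alignment):
    """Drop every column in which the query (first row) has a gap."""
    query = alignment[0]
    keep = [j for j, residue in enumerate(query) if residue != "-"]
    return ["".join(row[j] for j in keep) for row in alignment]


alignment = [
    "EEFG----SVDGLVNNA",
    "QKYG----RLDVMINNA",
    "RRLG----TLNVLVNNA",
    "GGIG----PVD-LVNNA",
    "KALG----GFNVIVNNA",
    "ARFG----KID-LIPNA",
    "FEPEGPEKGMWGLVNNA",
    "AQLK----TVDVLINGA",
]

for row in discard_query_gap_columns(alignment):
    print(row)
```

Gaps in the other sequences are kept; only columns that have no counterpart in the query are removed, so the PSSM ends up with exactly one column per query position.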

Compute sequence weights • Low weights are assigned to redundant sequences. • High weights are assigned to unique sequences. EEFGSVDGLVNNA 1.2 QKYGRLDVMINNA 1.2 RRLGTLNVLVNNA 0.8 GGIGPVDLLVNNA 0.8 KALGGFNVIVNNA 1.1 ARFGKIDTLIPNA 0.9 FEPEGMWGLVNNA 1.1 AQLKTVDVLINGA 1.3
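
The slides do not say how the weights are computed. One common scheme, shown here purely as an illustration, is position-based weighting: in each column, a residue shared by many sequences contributes little to each of them, and a rare residue contributes a lot. The function name is an assumption, and the weights it prints are not expected to match the values on the slide.

```python
from collections import Counter


def position_based_weights(alignment):
    """Position-based sequence weights (one common scheme, assumed here).

    In each column, a sequence receives 1 / (k * n), where k is the number of
    distinct residues in the column and n is how many sequences share its residue.
    """
    weights = [0.0] * len(alignment)
    for column in zip(*alignment):
        counts = Counter(column)
        distinct = len(counts)
        for row, residue in enumerate(column):
            weights[row] += 1.0 / (distinct * counts[residue])
    return weights


alignment = [
    "EEFGSVDGLVNNA", "QKYGRLDVMINNA", "RRLGTLNVLVNNA", "GGIGPVDLLVNNA",
    "KALGGFNVIVNNA", "ARFGKIDTLIPNA", "FEPEGMWGLVNNA", "AQLKTVDVLINGA",
]
for sequence, weight in zip(alignment, position_based_weights(alignment)):
    print(sequence, round(weight, 2))
```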

Compute PSSM entries EEFGSVDGLVNNA 1.2 QKYGRLDVMINNA 1.2 RRLGTLNVLVNNA 0.8 GGIGPVDLLVNNA 0.8 KALGGFNVIVNNA 1.1 ARFGKIDTLIPNA 0.9 FEPEGMWGLVNNA 1.1 AQLKTVDVLINGA 1.3 PSSM BLOSUM62 matrix
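
The first ingredient of each PSSM column is the weighted frequency of the residues observed in it. The sketch below computes only that part, using the alignment and weights shown on this slide; combining these frequencies with the substitution matrix to obtain the final scores is not shown, and the function name is an assumption.

```python
from collections import defaultdict

ALIGNMENT = [
    "EEFGSVDGLVNNA", "QKYGRLDVMINNA", "RRLGTLNVLVNNA", "GGIGPVDLLVNNA",
    "KALGGFNVIVNNA", "ARFGKIDTLIPNA", "FEPEGMWGLVNNA", "AQLKTVDVLINGA",
]
WEIGHTS = [1.2, 1.2, 0.8, 0.8, 1.1, 0.9, 1.1, 1.3]


def weighted_column_frequencies(alignment, weights):
    """For each column, return {residue: weighted fraction of sequences}."""
    total = sum(weights)
    frequencies = []
    for column in zip(*alignment):
        column_freqs = defaultdict(float)
        for residue, weight in zip(column, weights):
            column_freqs[residue] += weight / total
        frequencies.append(dict(column_freqs))
    return frequencies


freqs = weighted_column_frequencies(ALIGNMENT, WEIGHTS)
# Column 4 (index 3) contains G in six sequences, plus one E and one K.
print({residue: round(f, 3) for residue, f in freqs[3].items()})
```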

Position-specific iterated BLAST Query PSSM Multiple alignment Sequence database BLAST

Summary of PSI-BLAST • PSI-BLAST builds a model of the query sequence and its close homologs. • Instead of comparing a target sequence to the query, each target is compared to the model. • The PSI-BLAST model is called a position-specific scoring matrix (PSSM). • The PSSM can be constructed from a collection of targets aligned to the query sequence. • PSI-BLAST is more sensitive than BLAST: it can detect distant homologs that a single BLAST search misses.

Sample problem #1 • Given: • a file containing a sequence of amino acids • Return: • the amino acid counts
./compute-counts.py seq1.txt
Read 68 amino acids from seq1.txt.
A 5
C 2
D 3
E 1
F 6
G 0
H 0
I 2
K 2
L 8
M 1
N 5
P 7
Q 1
R 1
S 2
T 5
V 6
W 3
Y 8
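
One possible solution sketch, not the official one. It assumes the input file is plain text containing only the sequence (no FASTA header), and it shows how sys.argv supplies the file name given on the command line. The function names are illustrative.

```python
#!/usr/bin/env python
import sys

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def read_sequence(filename):
    """Return the contents of the file with all whitespace removed."""
    with open(filename) as handle:
        return "".join(handle.read().split())


def count_amino_acids(sequence):
    """Count each of the 20 amino acids; letters outside the alphabet are ignored."""
    counts = {amino_acid: 0 for amino_acid in AMINO_ACIDS}
    for letter in sequence.upper():
        if letter in counts:
            counts[letter] += 1
    return counts


# sys.argv[0] is the script name; sys.argv[1] is the first command-line argument.
if __name__ == "__main__" and len(sys.argv) == 2:
    filename = sys.argv[1]
    sequence = read_sequence(filename)
    print(f"Read {len(sequence)} amino acids from {filename}.")
    for amino_acid, count in count_amino_acids(sequence).items():
        print(amino_acid, count)
```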

Sample problem #2 • Given: • a pseudocount weight • a file containing amino acid frequencies • a file containing a sequence of amino acids • Return: • the summed amino acid counts and pseudocounts
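
A sketch of the core computation, under stated assumptions: the frequency file is taken to hold one "letter frequency" pair per line (the slides do not specify the format), and the pseudocount for each amino acid is the pseudocount weight multiplied by its background frequency. All names are illustrative.

```python
def parse_frequencies(lines):
    """Parse lines such as 'A 0.25' into {letter: frequency} (assumed format)."""
    frequencies = {}
    for line in lines:
        fields = line.split()
        if len(fields) == 2:
            frequencies[fields[0]] = float(fields[1])
    return frequencies


def add_pseudocounts(counts, frequencies, pseudocount_weight):
    """Return observed count + pseudocount_weight * background frequency."""
    return {
        amino_acid: counts.get(amino_acid, 0) + pseudocount_weight * frequency
        for amino_acid, frequency in frequencies.items()
    }


frequencies = parse_frequencies(["A 0.25", "C 0.75"])
print(add_pseudocounts({"A": 2, "C": 0}, frequencies, pseudocount_weight=4))
```

The pseudocounts keep amino acids that were never observed (such as G and H in sample problem #1) from ending up with a count of exactly zero.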

Sample problem #3 • Given: • a pseudocount weight • a file containing amino acid frequencies • a file containing a sequence of amino acids • Return: • the normalized summed amino acid counts and pseudocounts
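
Normalizing means dividing each summed value by the total, so that the results sum to 1 and can be read as probabilities. The sketch below takes counts and frequencies as dictionaries and assumes, as in the previous problem, that each pseudocount is the weight multiplied by the background frequency; the function name is illustrative.

```python
def normalized_counts(counts, frequencies, pseudocount_weight):
    """Add pseudocounts to the observed counts, then scale so the values sum to 1."""
    summed = {
        amino_acid: counts.get(amino_acid, 0) + pseudocount_weight * frequency
        for amino_acid, frequency in frequencies.items()
    }
    total = sum(summed.values())
    return {amino_acid: value / total for amino_acid, value in summed.items()}


result = normalized_counts({"A": 2, "C": 0}, {"A": 0.5, "C": 0.5}, pseudocount_weight=2)
print(result)
```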
