Multiple Sequence Alignments Advanced BLAST searches

Multiple Sequence Alignments Advanced BLAST searches June 17, 2014

Topics • Overview of MSA • MSA methods • Practical aspects • MSA to Profiles • PSI-BLAST • PHI-BLAST

Overview of MSA • Alignment of ≥ 3 sequences to bring as many similar characters into register as possible • Hypothetical model of mutations (substitutions, insertions & deletions) • Best represents most likely evolutionary scenario. • Cannot be unambiguously established

MSA: Motivation • Correspondence. Find out which parts “do the same thing” • Similar genes are conserved across widely divergent species, often performing similar functions • Structure prediction • Use knowledge of structure of one or more members of a protein MSA to predict structure of other members • Structure is more conserved than sequence • Create “profiles” for protein families • Allow us to search for other members of the family • Genome assembly: • Automated reconstruction of “contig” maps of genomic fragments such as ESTs • MSA is the starting point for phylogenetic analysis

MSA: Approaches • Optimal Global Alignments - Dynamic programming • Find alignment that maximizes a score function • Computationally expensive: time grows as the product of the sequence lengths, so it rises steeply with every sequence added • Global Progressive Alignments - Match closely-related sequences first using a guide tree (CLUSTALW) • Global Iterative Alignments - Multiple re-building attempts to find best alignment (MUSCLE) • Local alignments • Profiles, Blocks, Patterns
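To make the dynamic-programming cost concrete, here is a minimal sketch of optimal global alignment scoring for only two sequences (the Needleman-Wunsch recurrence). The function name and the scoring values (match +1, mismatch -1, gap -1) are illustrative assumptions, not values from the slides. The table it fills has (len(a)+1) x (len(b)+1) cells; an exact alignment of N sequences needs one table dimension per sequence, which is why the optimal approach does not scale.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Optimal global alignment score of two sequences (Needleman-Wunsch)."""
    # prev is the previous table row: prev[j] is the best score for
    # aligning a[:i-1] with b[:j]. Row 0 aligns b's prefixes to gaps only.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # a[:i] aligned against an empty prefix of b
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # a[i-1] aligned to a gap in b
            left = curr[j - 1] + gap  # b[j-1] aligned to a gap in a
            curr.append(max(diag, up, left))
        prev = curr
    return prev[-1]


print(global_alignment_score("GATTACA", "GATTACA"))  # 7
print(global_alignment_score("GATTACA", "GATACA"))   # 5 (six matches, one gap)
```

Only the score is returned here; recovering the alignment itself would also require keeping the full table for a traceback.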

MSA algorithms • ClustalW • Hierarchical & progressive (NOT iterative) • Uses guide trees • Can propagate errors made early in the alignment • Most commonly used • Webserver & local (BioEdit, Geneious) • MUSCLE • Progressive & iterative • Faster than CLUSTALW, especially on larger sequence sets • Command line & in Geneious Pro, MacVector and MEGA5

Overview of hierarchical method • Do a pairwise comparison of all sequences • Create a guide tree ordering the sequences from most to least similar • Align the 2 most similar sequences, then the next 2 most similar • Add sequences progressively in decreasing order of similarity • Gaps that are introduced are never removed

Step 1 - Pairwise alignments
Compare each sequence with every other sequence and calculate a distance matrix.

      A     B     C
A     -
B    .87    -
C    .59   .60    -

Each number represents the number of exact matches divided by the sequence length (ignoring gaps). Thus, the higher the number, the more closely related the two sequences are. In this distance matrix, sequence A is 87% identical to sequence B.
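The calculation in Step 1 can be sketched as follows. This is a minimal illustration that assumes each pair has already been aligned; the toy sequences, the function names, and the reading of "ignoring gaps" as "skip any column where either sequence has a gap" are assumptions, not details from the slides.

```python
from itertools import combinations


def fraction_identity(aligned_a, aligned_b, gap="-"):
    """Exact matches divided by the number of ungapped aligned columns."""
    pairs = [(x, y) for x, y in zip(aligned_a, aligned_b)
             if x != gap and y != gap]
    if not pairs:
        return 0.0
    matches = sum(1 for x, y in pairs if x == y)
    return matches / len(pairs)


def identity_matrix(aligned):
    """Map each unordered pair of sequence names to its fraction identity."""
    return {
        (n1, n2): fraction_identity(aligned[n1], aligned[n2])
        for n1, n2 in combinations(sorted(aligned), 2)
    }


# Hypothetical toy sequences, already padded to a common length.
toy = {"A": "ACGTACGTAC", "B": "ACGTACGTTC", "C": "ACG-TCGAAC"}
for pair, value in identity_matrix(toy).items():
    print(pair, round(value, 2))
```

Subtracting each value from 1 turns these similarities into divergences, which is the form used for branch lengths in the guide tree.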

Step 2 - Create guide tree
Use the distance matrix to create a guide tree that determines the “order” of the sequences.
[Figure: guide tree for sequences A, B and C built from the Step 1 matrix. A and B join first at 0.87 (divergence 0.13); C joins at 0.60 (divergence 0.40).]
Branch length is proportional to the estimated divergence between sequences, e.g. 0.13 between A and B.
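As a rough illustration of how a join order falls out of the matrix, the sketch below repeatedly merges the two most similar clusters, using the average similarity between their members. ClustalW itself builds its guide tree with neighbor-joining; the simpler average-linkage clustering is swapped in here for brevity, and the function name and data layout are assumptions.

```python
def guide_tree(names, similarity):
    """Greedy average-linkage clustering; returns a nested tuple of names."""
    # Each cluster is (tree, members); start with one cluster per sequence.
    clusters = [(name, [name]) for name in names]

    def average_similarity(m1, m2):
        # Mean similarity over all cross-cluster pairs of sequences.
        total = sum(similarity[frozenset((x, y))] for x in m1 for y in m2)
        return total / (len(m1) * len(m2))

    while len(clusters) > 1:
        # Pick the pair of clusters with the highest average similarity.
        i, j = max(
            ((i, j) for i in range(len(clusters))
             for j in range(i + 1, len(clusters))),
            key=lambda ij: average_similarity(clusters[ij[0]][1],
                                              clusters[ij[1]][1]),
        )
        (t1, m1), (t2, m2) = clusters[i], clusters[j]
        rest = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters = [((t1, t2), m1 + m2)] + rest
    return clusters[0][0]


# Similarities from the Step 1 matrix.
similarity = {
    frozenset(("A", "B")): 0.87,
    frozenset(("A", "C")): 0.59,
    frozenset(("B", "C")): 0.60,
}
print(guide_tree(["A", "B", "C"], similarity))  # (('A', 'B'), 'C')
```

The nested tuple reads as the alignment order: A and B are aligned first, then C is added to that alignment.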

Step 3 - Progressive alignment
First, align A and B. Then add sequence C to the previous alignment. In closely related sequences, gaps are given a heavier weight than in more divergent sequences.
[Figure: guide tree for A, B and C]

Amino acid weight matrices • Series of scoring matrices that one can use depending on the relatedness of the proteins aligned. • As the alignment proceeds in CLUSTALW the AA weight matrices are changed to more divergent scoring matrices. • Length of the branch is used to determine which matrix to use and contributes to the alignment score.

Globin alignment • Starting with a group of 7 globin-related sequences from different species • Do pairwise alignments between all 7 sequences • Calculate similarity between each pair; higher score indicates more similar

Cluster the sequences by similarity to create a guide tree • Branch length is proportional to estimated divergence between the two sequences

Globin alignment

ClustalW alignment • Symbols: * identity, : high similarity, . low similarity, - gap in sequence • Amino acids are often color-coded based on physicochemical properties

ClustalW vs MUSCLE [Figure panels: ClustalW alignment; MUSCLE alignment]

Practical aspects • Identify & download sequences in correct format • Should meet criteria for MSA: • Closely related (E < 1e-10) • Similar length and number of domains • Same domain order • If necessary, extract regions of similar length • Name them appropriately • Short, descriptive names that fit on the output

Alignment viewers • Edit and prepare for publication • Different coloring schemes • Jalview -- Java based interactive viewer (free)

MSA -> Profiles • Profile: A table that lists the frequency of each amino acid at each position of a protein sequence. • Frequencies are calculated from an MSA containing a domain of interest • Allows us to identify a consensus sequence • Derived scoring scheme allows us to align a new sequence to the profile • Profile can be used in database searches • Find new sequences that match the profile • Profiles are also used to compute multiple alignments heuristically • Progressive alignment

Why not just use BLAST? • Database searches using a profile or position-specific scoring matrices (PSSM) are much more sensitive for detecting weak or distant relationships than are database searches using a single sequence as query • Information content higher in a PSSM

Pairwise alignment

Position Specific Scoring Matrix (PSSM)

MSAs -> PSSM

POS    123456
Seq1   ATGTCG
Seq2   AAGACT
Seq3   TACTCA
Seq4   CGGAGG
Seq5   AACCTG

Convert the MSA (ATGTCG, AAGACT, TACTCA, CGGAGG, AACCTG) to a raw frequency table

Normalize by dividing each positional frequency by the overall frequency of that residue in the alignment

Convert the values to log base 2 to obtain the PSSM

Match the string “AACTCG” to the matrix SUM: 1.0 + 1.0 + 0.8 + 1.0 + 1.38 + 1.15 = 6.33

Match the string “AACTGG” to the matrix SUM: 1.0 + 1.0 + 0.8 + 1.0 - 0.43 + 1.15 = 4.52
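The three steps (raw frequencies, normalization by overall frequencies, log base 2) and the two string matches can be reproduced with a short sketch. The function names are illustrative, and scoring unobserved residues as negative infinity is an assumption made to keep the example short; practical tools add pseudocounts instead.

```python
import math
from collections import Counter

msa = ["ATGTCG", "AAGACT", "TACTCA", "CGGAGG", "AACCTG"]
alphabet = "ACGT"


def build_pssm(sequences, alphabet):
    """Return one {residue: log2 odds} dict per alignment column."""
    n = len(sequences)
    length = len(sequences[0])
    # Overall frequency of each residue across the whole alignment.
    totals = Counter("".join(sequences))
    overall = {r: totals[r] / (n * length) for r in alphabet}
    pssm = []
    for pos in range(length):
        counts = Counter(seq[pos] for seq in sequences)
        column = {}
        for r in alphabet:
            freq = counts[r] / n  # raw frequency in this column
            # Residues never seen in a column get -inf (assumption: no
            # pseudocounts, since the slides do not mention them).
            column[r] = math.log2(freq / overall[r]) if freq else float("-inf")
        pssm.append(column)
    return pssm


def score(pssm, query):
    """Sum the column scores for a query as long as the PSSM."""
    return sum(column[res] for column, res in zip(pssm, query))


pssm = build_pssm(msa, alphabet)
print(round(score(pssm, "AACTCG"), 2))  # 6.31
print(round(score(pssm, "AACTGG"), 2))  # 4.53
```

These totals differ slightly from the 6.33 and 4.52 on the slides, which appear to use overall frequencies rounded to two decimals (0.23 for C, 0.27 for G) before taking the logarithm.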

PSI-BLAST • Position-Specific Iterated BLAST • Can generate a position-specific scoring matrix starting from a single sequence against a single database • Builds the PSSM iteratively • Increases sensitivity of search with each iteration

Steps in PSI-BLAST • Single protein sequence compared to database using BLASTP • Construct a multiple alignment and profile (PSSM) from any significant local alignments • query sequence is template • lengths all identical to query • Profile or PSSM is compared to database, making local alignments • Estimate statistical significance of local alignments • Iterate an arbitrary number of times or until convergence (no new sequences added)

Practical uses of PSI-BLAST • Can create a PSSM using PSI-BLAST against 1 database • e.g. NR • Use the PSSM in a search of another database for a more sensitive search • e.g. Refseq or NR restricted to a taxonomic group • Does not have to run to convergence to create a PSSM useful for finding remote homologues; usually 2 or 3 iterations are sufficient • SLOW: use when there is no known domain in your protein
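For a local installation, the two-stage workflow (build a PSSM against one database, then search another with it) might look like the sketch below. It assumes the NCBI BLAST+ `psiblast` program is installed and only assembles the argument lists; the file names, database names, and function names are hypothetical, and taxonomic restriction is left out.

```python
def psiblast_build_pssm_cmd(query_fasta, db, pssm_out, iterations=3):
    """Argument list for iterating PSI-BLAST and saving a PSSM checkpoint."""
    return [
        "psiblast",
        "-query", query_fasta,
        "-db", db,
        "-num_iterations", str(iterations),
        "-out_pssm", pssm_out,
    ]


def psiblast_search_with_pssm_cmd(pssm_in, db, results_out):
    """Argument list for searching a second database with a saved PSSM."""
    return [
        "psiblast",
        "-in_pssm", pssm_in,
        "-db", db,
        "-out", results_out,
    ]


# Hypothetical file and database names.
build_cmd = psiblast_build_pssm_cmd("query.fasta", "nr", "query.pssm")
search_cmd = psiblast_search_with_pssm_cmd("query.pssm", "refseq_protein",
                                           "hits.txt")
print(" ".join(build_cmd))
print(" ".join(search_cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)` once the databases are available locally. Three iterations is used as the default here because the slide suggests 2 or 3 are usually enough.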

Delta-BLAST • Domain enhanced lookup time accelerated BLAST • Works when you are looking for proteins with a known protein domain

Sma4 protein • 570 aa from C. elegans • Domain structure • What homologs exist in C. briggsae? • C. briggsae & C. elegans are both nematodes that diverged ~80-100 million years ago.

Sma4 from C. elegans • BLASTP against Refseq limited to C. briggsae • Delta-BLAST against Refseq limited to C. briggsae

Sma4 vs TAG-68

Similar functional role? • Sma4 (520 aa) • TAG-68 (415 aa)

Where Sma4 homologs are not... • Sma4 BLASTP against Refseq (fungi) • Sma4 PSSM PSI-BLAST against Refseq (fungi)

PHI-BLAST • Pattern-hit initiated BLAST • Enforces the presence of a motif in addition to the usual PSI-BLAST criteria for matching • Uses protein domain signatures from PROSITE database • Initiate a PSI-BLAST search, but include a signature pattern from PROSITE to limit search to sequences which contain that motif or signature

PHI-BLAST example • PHI-BLAST • Query: E3 ubiquitin ligase ARIH2 (human) • Database: Refseq (Aspergillus) • Signature of ZF_RING_1, Zinc finger RING-type: C-x-H-x-[LIVMFY]-C-x(2)-C-[LIVMYA]

PHI-BLAST, cont. • BLASTP of ARIH2 against Refseq (Aspergillus) • PHI-BLAST of ARIH2 against Refseq (Aspergillus), with ZF_RING_1 signature

Prosite pattern: C-x-H-x-[LIVMFY]-C-x(2)-C-[LIVMYA]
Human ARIH2 protein: CkHdFCwmCL
Top Aspergillus match: CkHeFCwmCM
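To check by hand that a sequence segment satisfies a PROSITE pattern, the pattern can be translated into a regular expression. The converter below is a minimal sketch covering only the syntax used in this pattern plus `{...}` exclusions and `(n,m)` repeats; it is not how PHI-BLAST is implemented, and the function name is an assumption.

```python
import re


def prosite_to_regex(pattern):
    """Translate a simple PROSITE pattern into a Python regular expression."""
    parts = []
    for element in pattern.split("-"):
        # Split off an optional repeat count such as x(2) or x(2,4).
        m = re.fullmatch(r"(.+?)(?:\((\d+(?:,\d+)?)\))?", element)
        core, repeat = m.group(1), m.group(2)
        if core == "x":
            core = "."                      # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"  # {..} means "anything but"
        # [..] classes and single letters are already valid regex.
        parts.append(core + ("{" + repeat + "}" if repeat else ""))
    return "".join(parts)


regex = prosite_to_regex("C-x-H-x-[LIVMFY]-C-x(2)-C-[LIVMYA]")
print(regex)  # C.H.[LIVMFY]C.{2}C[LIVMYA]

# The two ten-residue segments shown on the slide, upper-cased for matching.
for segment in ("CkHdFCwmCL", "CkHeFCwmCM"):
    print(segment, bool(re.search(regex, segment.upper())))
```

Both segments satisfy the pattern: the lowercase letters on the slide fall at the wildcard (x) positions, and the uppercase letters are the residues the pattern constrains.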

This week in lab • Using BLASTP & MSA to predict functional homologs in other species • Compare results of BLASTP, PHI-BLAST and Delta-BLAST to identify homologs in other species • Using PSI-BLAST to identify remote homologs of proteins with no known domains
