Sequence-based database searching Unit 9

Sequence-based database searching Unit 9 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD

Sequence-based Search for: • Retrieving and comparing DNA sequences in Databases. • Identification of related sequences and subsequences. • Exploring frequently occurring patterns of protein and nucleotide sequences • Finding informative elements in protein and DNA sequences. • Reconstruction of DNA, personal genomics, etc.

Chapter 11 Basis for search: ALIGNMENT =Matching =Positioning to maximize similarity ACTGTGGGAACCTTTGCACCGAAAC ACTGTGGGA ACCTATGCACCGAAAC

From previous lecture: • Similarity vs Homology (measure and conclusion) • Speciation vs Duplication (orthologs and paralogs) • Dotplots • Dotlet: http://www.isrec.isb-sib.ch/java/dotlet/Dotlet.html • Dotter • Dottup: http://bioweb.pasteur.fr/seqanal/interfaces/dottup.html also at: www.emboss.org

Today’s Lab What do the terms “homolog,” “ortholog,” and “paralog” mean? Go to the NCBI BLAST page (http://www.ncbi.nlm.nih.gov/BLAST/) and choose “Protein-protein BLAST.” Paste your protein sequence (human) into the “Search” box. Can you find a homologous sequence from yeast? Note that you can use the same search terms as in the Entrez databases.

The dotplot A simple picture that gives an overview of the similarities between two sequences

The dotplot

Today’s Lab: Here are some BLOSUM-62 scores: Blosum(A,A) = 4; Blosum(A,P) = -1; Blosum(A,W) = -3; Blosum(P,P) = 7; Blosum(P,W) = -4 Given these, what would your guess be for the best global alignment of these two sequences: AWAP APP

Dotplot vs Scoring Matrices Visualization vs quantitative measure of how close the sequences are. Dotplots are used to visualize tandem repeats (repeating small diagonals), regions of local alignment (major diagonals), and low-complexity regions (solid boxes).
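The dotplot idea can be sketched in a few lines of Python. This is only an illustration, not how Dotlet or Dotter are implemented: the function names `dotplot` and `render` and the `window` parameter are assumptions made for this example, and a dot is placed only where two windows match exactly.

```python
def dotplot(seq_a, seq_b, window=1):
    """Return a grid of booleans: cell (i, j) is True when the window
    starting at seq_a[i] exactly equals the window starting at seq_b[j]."""
    rows = len(seq_a) - window + 1
    cols = len(seq_b) - window + 1
    return [
        [seq_a[i:i + window] == seq_b[j:j + window] for j in range(cols)]
        for i in range(rows)
    ]


def render(matrix):
    """Draw the grid as text: '*' for a match, '.' for no match."""
    return "\n".join(
        "".join("*" if cell else "." for cell in row) for row in matrix
    )


if __name__ == "__main__":
    # A sequence with a tandem repeat compared against itself: besides the
    # main diagonal, shorter parallel diagonals appear.
    print(render(dotplot("ACTGACTG", "ACTGACTG", window=3)))
```

A larger window suppresses isolated chance matches, which is why real dotplot tools let you tune it.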

Scoring Matrices Empirical weighting schemes for comparisons. The most commonly used matrices take 3 major biological factors into account: • Conservation: which residues are capable of substituting for one another • Frequency • Evolution

Scoring Matrices • PAM (MDM/Dayhoff) - Point Accepted Mutation • BLOSUM - BLOcks SUbstitution Matrix • BLOSUM 62 is the default matrix in BLAST

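Scoring Matrices BLOSUM62: built from blocks of aligned sequences in which sequences sharing more than 62% identity are clustered together, so the counted substitutions come from sequences with no more than 62% identity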

Blosum62 – a fragment

Nucleotide Scoring Matrices a fragment

Gaps and Gap Penalties Introduced to compensate for indels (insertions and deletions) • Affine gap penalty: G + L*N • G - gap opening penalty • L - gap extension penalty (G > L) • N - the length of the gap • Nonaffine, or linear: G = 0
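A minimal sketch of the gap penalty formula above, written in Python. The function names and the example values (G = 10, L = 1) are assumptions chosen for illustration; note that some tools count the first gap position as part of the opening cost, i.e. G + L*(N-1), so check the convention of the program you use.

```python
def affine_gap_penalty(length, gap_open, gap_extend):
    """Penalty G + L*N for a gap of N positions, as defined on the slide."""
    if length <= 0:
        return 0  # no gap, no penalty
    return gap_open + gap_extend * length


def linear_gap_penalty(length, gap_extend):
    """Linear (nonaffine) penalty: the affine formula with G = 0."""
    return affine_gap_penalty(length, 0, gap_extend)


if __name__ == "__main__":
    # Hypothetical parameters: G = 10, L = 1.
    for n in (1, 2, 5):
        print(n, affine_gap_penalty(n, 10, 1), linear_gap_penalty(n, 1))
```

With an affine penalty one long gap is cheaper than several short gaps of the same total length, which matches the biological intuition that a single indel event can insert or delete many residues at once.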

ALIGNMENT servers • BLAST (Basic Local Alignment Search Tool): http://www.ncbi.nlm.nih.gov/BLAST/ • BLAT (BLAST-Like Alignment Tool): http://genome.cse.ucsc.edu/ • FastA (Fast Alignment): fasta.bioch.virginia.edu/ • ClustalW: http://www.ebi.ac.uk/clustalw/ • SIM: http://www.expasy.ch/tools/sim-prot.html • Pfam: http://pfam.wustl.edu/ • String: http://string.embl.de/ • ALION: http://motif.stanford.edu/alion/

Today’s Lab In which case is the percent match lowest when aligning the two sequences below with ALION (http://motif.stanford.edu/alion/), using the Smith-Waterman algorithm? AAGCCGGCGCTCGGCAAGTTCTCCCAGGAGAAAGCCATGTTCAGTTCGAGCGCCAAGATCGTGAAGCCCA AAAAAAAGCCGGCGCTCGGTTTTTTTTCTCCCAGGAGAAAGCCATGTTCAGTTCGAGCGCCAAGATCGTGAAGCCCAAAAAA

Today’s Lab: Use the BLAST online tutorial (http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/information3.html) to discover the meaning of the Score and E value. What is the difference between an identity and a conservative substitution (see Section 5-4A)?

BLAST

Entrez BLAST • Same syntax as in Entrez • Composition-based statistics: adjust BLAST E-values for amino acid composition • Complexity box: masks out low-complexity sequences

BLAST/BLAT/FASTA • Megablast: long and highly similar sequences • PSI-BLAST: distantly related proteins, PSSM • BLAT: more on other slides • FASTA begins its search by looking for exact matches of words; BLAST allows for conservative substitutions from the very beginning

BLAST/BLAT/FASTA • BLAST allows for automatic masking • FASTA: one hit (Univ of Virginia server?) • FASTA uses the more rigorous SW (Smith-Waterman) method and gives a better result in the end for less similar sequences • FASTA is better for translated sequences: it allows frameshifts • FASTA is the slowest, most computationally intensive

ALIGNMENT ALGORITHMS • Smith-Waterman ACTGTCTATAACCTTTGCGGCCAAAC ACTGTCTATACCTAT GCGGCGAAAC ACTGTGGGAACCTATGCGGCGAAAC • Needleman-Wunsch

Needleman-Wunsch Algorithm • General algorithm for sequence comparison • Maximise a similarity score, to give ‘maximum match’ • Maximum match = largest number of residues of one sequence that can be matched with another allowing for all possible deletions • Finds the best GLOBAL alignment of any two sequences

Needleman-Wunsch Algorithm • Three main steps 1. Assign similarity values 2. For each cell, look at all possible pathways back to the beginning of the sequence (allowing insertions and deletions) and give that cell the value of the maximum scoring pathway 3. Construct an alignment (pathway) back from the highest scoring cell to give the highest scoring alignment
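The three steps can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the original 1970 formulation: it uses the common modern variant with a linear gap penalty (the default of -4 is hypothetical), penalizes end gaps, and traces back from the bottom-right cell. The names `needleman_wunsch` and `simple_score` are invented for this example.

```python
def simple_score(x, y):
    """Hypothetical similarity values: +1 for a match, -1 for a mismatch."""
    return 1 if x == y else -1


def needleman_wunsch(a, b, score=simple_score, gap=-4):
    """Global alignment of a and b; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    # F[i][j] holds the best score for aligning a[:i] with b[:j].
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap
    for j in range(1, m + 1):
        F[0][j] = j * gap
    # Step 2: each cell takes the value of its best-scoring pathway.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(
                F[i - 1][j - 1] + score(a[i - 1], b[j - 1]),  # pair residues
                F[i - 1][j] + gap,                            # gap in b
                F[i][j - 1] + gap,                            # gap in a
            )
    # Step 3: trace one optimal pathway back from the final cell.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + score(a[i - 1], b[j - 1]):
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return F[n][m], "".join(reversed(top)), "".join(reversed(bottom))


if __name__ == "__main__":
    total, row_a, row_b = needleman_wunsch("ACTGTGGGA", "ACTGTCTATA", gap=-2)
    print(total)
    print(row_a)
    print(row_b)
```

Passing a different `score` function, for example a lookup into a BLOSUM table, is enough to switch from nucleotide to protein comparison. When several pathways tie, this sketch reports only one of them.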

Smith-Waterman Algorithm • Instead of looking at each sequence in its entirety this compares segments of all possible lengths (LOCAL alignments) and chooses whichever maximise the similarity measure • For every cell the algorithm calculates ALL possible paths leading to it. These paths can be of any length and can contain insertions and deletions
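A minimal Python sketch of local alignment in the Smith-Waterman style is shown below. The scoring values (match +2, mismatch -1, linear gap -2) and the function name are assumptions for illustration only. The key differences from a global alignment are that no cell may drop below zero and that the traceback starts at the highest-scoring cell and stops at the first zero.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment of a and b; returns (score, segment_a, segment_b)."""
    n, m = len(a), len(b)
    # H[i][j] holds the best score of any local alignment ending at a[i-1], b[j-1].
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(
                0,                        # start a fresh local alignment here
                H[i - 1][j - 1] + pair,   # extend by pairing two residues
                H[i - 1][j] + gap,        # gap in b
                H[i][j - 1] + gap,        # gap in a
            )
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Trace back from the best cell until the score falls to zero.
    top, bottom = [], []
    i, j = best_pos
    while i > 0 and j > 0 and H[i][j] > 0:
        pair = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + pair:
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


if __name__ == "__main__":
    print(smith_waterman("TTTACGTTT", "GGACGTGG"))
```

Because mismatches and gaps carry negative scores, the unrelated flanks of the two sequences are simply left out of the reported segment.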

Differences
Needleman-Wunsch: 1. Global alignments 2. Requires alignment score for a pair of residues to be >=0 3. No gap penalty required 4. Score cannot decrease between two cells of a pathway
Smith-Waterman: 1. Local alignments 2. Residue alignment score may be positive or negative 3. Requires a gap penalty to work effectively 4. Score can increase, decrease or stay level between two cells of a pathway

What is BLAT & why we need it Many alignment tools exist: - Smith-Waterman algorithm: solves the problem of aligning two short sequences - FASTA, NCBI BLAST, MegaBLAST, WU-BLAST: provide flexible & fast alignment against large databases - Sim4: does a fine job with cDNA alignment - SAM, PSI-BLAST: slowly but surely find remote homology

BLAT - BLAT (compared with existing tools) - more accurate for similar sequences - 500 times faster in mRNA/DNA alignment - 50 times faster in protein/protein alignment - BLAT’s steps 1. uses non-overlapping k-mers to create an index 2. uses the index to find homologous regions 3. aligns these regions separately 4. stitches the aligned regions into a larger alignment 5. revisits small internal exons possibly missed in the first stage and adjusts large gap boundaries that have canonical splice sites where feasible
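Steps 1 and 2 can be illustrated with a small Python sketch. It is not BLAT's actual code: the function names are invented, only perfect k-mer matches are reported (BLAT can also use near-perfect ones), and the later alignment, stitching and splice-site steps are left out.

```python
from collections import defaultdict


def build_index(database, k):
    """Step 1: index the NON-overlapping k-mers of the database sequence.
    Maps each k-mer to the list of positions where it starts."""
    index = defaultdict(list)
    for pos in range(0, len(database) - k + 1, k):
        index[database[pos:pos + k]].append(pos)
    return index


def find_hits(query, index, k):
    """Step 2: slide over every (overlapping) k-mer of the query and look it
    up in the index. Returns (query_position, database_position) pairs."""
    hits = []
    for q in range(len(query) - k + 1):
        for d in index.get(query[q:q + k], ()):
            hits.append((q, d))
    return hits


if __name__ == "__main__":
    k = 4  # hypothetical word size for this toy example
    index = build_index("ACGTACGGTTCA", k)
    print(find_hits("GGACGGTT", index, k))
```

Indexing only non-overlapping database k-mers keeps the index small, while scanning all overlapping query k-mers ensures that a matching region is still found whatever its offset.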

BLAT - BLAT’s speed & sensitivity are decided by 1. k-mer size (finding hits step) 2. mismatch scheme (aligning step) 3. number of required index matches (finding hits step)

BLAT's similarities to & differences from BLAST Similarity: - both scan for relatively short matches (hits), i.e. build an index and then find hits - both extend hits into high-scoring pairs (HSPs)

BLAT Difference: - BLAST builds an index for the query sequence, but BLAT builds an index for the database - BLAST scans linearly through the database, but BLAT scans linearly through the query sequence - BLAST triggers an extension when one or two hits occur in proximity to each other, but BLAT can trigger extensions on any number of perfect or near-perfect hits

BLAT Difference: - BLAST returns each area of homology between two sequences, but BLAT stitches them together into a larger alignment - BLAT has special code to handle introns in RNA/DNA alignments, i.e. BLAT unsplices mRNA onto the genome

BLAT - BLAT is a very effective tool for nucleotide alignments between mRNA and DNA of the same species - it is more accurate and faster than Sim4 - BLAT's strategy for nucleotide alignments becomes less effective below 90% sequence identity, but it can efficiently handle sequence divergence introduced by sequencing error - twilight zone: 20-35% sequence identity

BLAT For the search stage: - BLAT indexes the database rather than the query sequence, so it only scans the short query sequence - The program “SSAHA” also indexes the database, and it is an extremely effective tool for aligning genomic regions from the same organism against each other - but “SSAHA” does not implement “unsplicing” and always uses a single perfect match as a seed; BLAT is more flexible in this respect

Sequence Alignment in Matlab Pairwise sequence alignment: standard algorithms such as Needleman-Wunsch (nwalign) and Smith-Waterman (swalign). Standard scoring matrices such as the PAM and BLOSUM families of matrices (blosum, dayhoff, gonnet, nuc44, pam). Visualize sequence similarities with seqdotplot and sequence alignment results with showalignment.

Sequence Alignment in Matlab Multiple sequence alignment: functions for multiple sequence alignment (multialign, profalign) and functions that support multiple sequences (multialignread, fastaread, showalignment). There is also a graphical interface (multialignviewer) for viewing the results of a multiple sequence alignment and manually making adjustments.

Sequence Alignment in Matlab Multiple sequence profiles: multiple alignment and profile hidden Markov model algorithms (gethmmprof, gethmmalignment, gethmmtree, pfamhmmread, hmmprofalign, hmmprofestimate, hmmprofgenerate, hmmprofmerge, hmmprofstruct, showhmmprof). Other useful biological codes: aminolookup, baselookup, geneticcode, revgeneticcode.

Next Unit Chapter 2 Genome Analysis, Databases and Servers
