Sequence-based database searching Unit 9. BIOL221T: Advanced Bioinformatics for Biotechnology. Irene Gabashvili, PhD. Sequence-based Search for: Retrieving and comparing DNA sequences in Databases. Identification of related sequences and subsequences.
Sequence-based database searching Unit 9 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD
Sequence-based Search for: • Retrieving and comparing DNA sequences in Databases. • Identification of related sequences and subsequences. • Exploring frequently occurring patterns of protein and nucleotide sequences • Finding informative elements in protein and DNA sequences. • Reconstruction of DNA, personal genomics, etc.
Chapter 11 Basis for search: ALIGNMENT =Matching =Positioning to maximize similarity ACTGTGGGAACCTTTGCACCGAAAC ACTGTGGGA ACCTATGCACCGAAAC
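A minimal sketch of "positioning to maximize similarity", assuming the simplest case with no gaps: one sequence is slid along the other and the offset with the most identical positions is kept. The function name `best_offset` and the short example sequences are hypothetical and chosen only for illustration.

```python
def best_offset(a, b):
    """Slide b along a without gaps; return (offset, matches) for the
    placement of b that gives the most identical positions."""
    best = (0, -1)
    # offset is the index in a where b[0] is placed (may be negative).
    for offset in range(-(len(b) - 1), len(a)):
        matches = sum(
            1
            for j, ch in enumerate(b)
            if 0 <= offset + j < len(a) and a[offset + j] == ch
        )
        if matches > best[1]:
            best = (offset, matches)
    return best


if __name__ == "__main__":
    offset, matches = best_offset("ACTGTGGGAACC", "GGGAACC")
    print("best offset:", offset, "identical positions:", matches)
```

Real alignment methods also allow gaps and weight substitutions with a scoring matrix; this sketch only shows the positioning idea.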
From previous lecture: • Similarity vs Homology (measure and conclusion) • Speciation vs Duplication (orthologs and paralogs) • Dotplots • Dotlet: http://www.isrec.isb-sib.ch/java/dotlet/Dotlet.html • Dotter • Dottup: http://bioweb.pasteur.fr/seqanal/interfaces/dottup.html also at: www.emboss.org
Today’s Lab What do the terms “homolog,” “ortholog,” and “paralog” mean? Go to the NCBI BLAST page (http://www.ncbi.nlm.nih.gov/BLAST/) and choose “Protein-protein BLAST.” Paste your protein sequence (human) into the “Search” box. Can you find a homologous sequence from yeast? Note that you can use the same search terms as in the Entrez databases.
The dotplot A simple picture that gives an overview of the similarities between two sequences
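A minimal text-mode sketch of the dotplot idea: one sequence runs along the columns, the other along the rows, and a mark is placed wherever the two residues are identical. The function name `dotplot` and the example sequences are hypothetical; tools such as Dotlet additionally use sliding windows and score thresholds, which this sketch leaves out.

```python
def dotplot(seq_x, seq_y):
    """Return one string per residue of seq_y; '*' marks positions where
    the residue of seq_x (columns) equals the residue of seq_y (rows)."""
    rows = []
    for y in seq_y:
        rows.append("".join("*" if x == y else "." for x in seq_x))
    return rows


if __name__ == "__main__":
    seq_x = "ACGTACGT"   # hypothetical sequence with a tandem repeat
    seq_y = "ACGT"
    print("  " + seq_x)
    for residue, row in zip(seq_y, dotplot(seq_x, seq_y)):
        print(residue + " " + row)
```

Runs of marks along a diagonal correspond to matching segments; a repeat in one sequence shows up as parallel diagonals.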
Today’s Lab: Here are some BLOSUM-62 scores: Blosum(A,A) = 4; Blosum(A,P) = -1; Blosum(A,W) = -3; Blosum(P,P) = 7; Blosum(P,W) = -4 Given these, what would your guess be for the best global alignment of these two sequences: AWAP APP
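One way to check a guess is to score candidate alignments column by column. The sketch below uses only the BLOSUM-62 values listed above; the gap penalty of -4 per gap column is an assumption (the exercise does not give one), and the names `pair_score` and `alignment_score` are hypothetical.

```python
# Only the BLOSUM-62 entries quoted in the exercise.
BLOSUM62_SUBSET = {
    ("A", "A"): 4,
    ("A", "P"): -1,
    ("A", "W"): -3,
    ("P", "P"): 7,
    ("P", "W"): -4,
}

GAP = -4  # assumed penalty per gap column; not given in the exercise


def pair_score(x, y):
    """Look up a symmetric substitution score."""
    key = (x, y) if (x, y) in BLOSUM62_SUBSET else (y, x)
    return BLOSUM62_SUBSET[key]


def alignment_score(row1, row2, gap=GAP):
    """Sum the column scores of two equal-length gapped rows ('-' = gap)."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for x, y in zip(row1, row2):
        total += gap if "-" in (x, y) else pair_score(x, y)
    return total


if __name__ == "__main__":
    # All four ways to place a single gap in APP against AWAP.
    for candidate in ("-APP", "A-PP", "AP-P", "APP-"):
        print("AWAP /", candidate, "->", alignment_score("AWAP", candidate))
```

With a different gap penalty the absolute scores change, but because every candidate here contains exactly one gap column, their ranking does not.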
Dotplot vs Scoring Matrices Visualization vs quantitative measure of how close the sequences are. Dotplots: used to visualize tandem repeats (repeating small diagonals), regions of local alignment (major diagonals), and low-complexity regions (solid boxes)
Scoring Matrices Empirical weighting schemes for comparisons. The most commonly used matrices take 3 major biological factors into account: Conservation (which residues are capable of substituting for one another), Frequency, Evolution
Scoring Matrices • PAM (MDM/Dayhoff) - Point Accepted Mutation • BLOSUM - BLOcks SUbstitution Matrix • BLOSUM 62 is the default matrix in BLAST
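Scoring Matrices BLOSUM62 – derived from aligned blocks in which sequences sharing more than 62% identity are clustered together, so the counted pairs share no more than 62% identity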
Nucleotide Scoring Matrices a fragment
Gaps and Gap Penalties Gaps are introduced to compensate for indels (insertions and deletions) • Affine gap penalty: G + L*N • G – gap opening penalty • L – gap extension penalty (G > L) • N – the length of the gap • Nonaffine, or linear: G = 0
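The gap penalty formula above can be written out directly. This is a sketch using the slide's convention G + L*N; the function names and the example values G = 10, L = 1 are hypothetical, and individual tools may count the first gap position differently.

```python
def affine_gap_penalty(n, gap_open, gap_extend):
    """Penalty G + L*N for a gap of length n (no gap, no penalty)."""
    if n <= 0:
        return 0
    return gap_open + gap_extend * n


def linear_gap_penalty(n, gap_extend):
    """Linear (nonaffine) case: the affine formula with G = 0."""
    return affine_gap_penalty(n, 0, gap_extend)


if __name__ == "__main__":
    for n in (1, 2, 5):
        print(
            "gap length", n,
            "affine:", affine_gap_penalty(n, gap_open=10, gap_extend=1),
            "linear:", linear_gap_penalty(n, gap_extend=1),
        )
```

Because G > L, one long gap costs less than several short gaps of the same total length, which matches the idea that a single indel event can span many residues.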
ALIGNMENT servers • BLAST (Basic Local Alignment Search Tool): http://www.ncbi.nlm.nih.gov/BLAST/ • BLAT (BLAST-Like Alignment Tool): http://genome.cse.ucsc.edu/ • FastA (Fast Alignment): fasta.bioch.virginia.edu/ • ClustalW: http://www.ebi.ac.uk/clustalw/ • SIM: http://www.expasy.ch/tools/sim-prot.html • Pfam: http://pfam.wustl.edu/ • String: http://string.embl.de/ • ALION: http://motif.stanford.edu/alion/
Today’s Lab In which case is the percent match lowest when aligning the two sequences below with ALION (http://motif.stanford.edu/alion/), using the Smith-Waterman algorithm? AAGCCGGCGCTCGGCAAGTTCTCCCAGGAGAAAGCCATGTTCAGTTCGAGCGCCAAGATCGTGAAGCCCA AAAAAAAGCCGGCGCTCGGTTTTTTTTCTCCCAGGAGAAAGCCATGTTCAGTTCGAGCGCCAAGATCGTGAAGCCCAAAAAA
Today’s Lab: Use the BLAST online tutorial (http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/information3.html) to discover the meaning of the Score and the E value. What is the difference between an identity and a conservative substitution (see Section 5-4A)?
Entrez BLAST Entrez query: same syntax as in Entrez. Composition-based statistics: adjust BLAST E-values for amino acid composition. Complexity box: masks out low-complexity sequences
BLAST/BLAT/FASTA Megablast – for long and highly similar sequences. PSI-BLAST – for distantly related proteins; uses a PSSM. BLAT – more on other slides. FASTA – begins the search by looking for exact matches of words, whereas BLAST allows for conservative substitutions from the very beginning
BLAST/BLAT/FASTA BLAST allows for automatic masking. FASTA – one hit (Univ of Virginia server?). FASTA – uses the more rigorous SW method, giving a better result in the end for less similar sequences. FASTA is better for translated sequences – it allows frameshifts. FASTA is the slowest and most computationally intensive
ALIGNMENT ALGORITHMS • Smith-Waterman ACTGTCTATAACCTTTGCGGCCAAAC ACTGTCTATACCTAT GCGGCGAAAC ACTGTGGGAACCTATGCGGCGAAAC • Needleman-Wunsch
Needleman-Wunsch Algorithm • General algorithm for sequence comparison • Maximise a similarity score, to give ‘maximum match’ • Maximum match = largest number of residues of one sequence that can be matched with another allowing for all possible deletions • Finds the best GLOBAL alignment of any two sequences
Needleman-Wunsch Algorithm • Three main steps 1. Assign similarity values 2. For each cell, look at all possible pathways back to the beginning of the sequence (allowing insertions and deletions) and give that cell the value of the maximum scoring pathway 3. Construct an alignment (pathway) back from the highest scoring cell to give the highest scoring alignment
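The three steps can be sketched with the usual dynamic-programming formulation of global alignment. The scoring values (match +1, mismatch -1, linear gap -1) are assumptions chosen for illustration, not values from the slides, and the traceback starts at the bottom-right cell so that both sequences are aligned end to end.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment with a linear gap penalty.
    Returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)

    def sub(i, j):
        # Step 1: similarity value for residues a[i-1] and b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    # Step 2: fill the matrix; each cell holds the best-scoring path to it.
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # align two residues
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Step 3: trace the pathway back from the final cell.
    i, j = n, m
    out_a, out_b = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    total, row_a, row_b = needleman_wunsch("ACGT", "AGT")
    print(total)
    print(row_a)
    print(row_b)
```

Replacing the match/mismatch values with a lookup in a matrix such as BLOSUM62 turns the same sketch into a protein aligner.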
Smith-Waterman Algorithm • Instead of looking at each sequence in its entirety this compares segments of all possible lengths (LOCAL alignments) and chooses whichever maximise the similarity measure • For every cell the algorithm calculates ALL possible paths leading to it. These paths can be of any length and can contain insertions and deletions
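A minimal sketch of local alignment in the same dynamic-programming style. The key differences from the global case are that cell scores are never allowed to drop below zero and that the traceback starts at the highest-scoring cell and stops at the first zero. The scoring values (match +2, mismatch -1, linear gap -2) and the example sequences are assumptions for illustration.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Local alignment with a linear gap penalty.
    Returns (score, aligned_segment_of_a, aligned_segment_of_b)."""
    n, m = len(a), len(b)

    def sub(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    score = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                0,                                # start a new local alignment
                score[i - 1][j - 1] + sub(i, j),
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best cell until the score reaches zero.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        if score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    total, seg_a, seg_b = smith_waterman("TTTACGTAAA", "CCACGTCC")
    print(total)
    print(seg_a)
    print(seg_b)
```

The unrelated flanking residues are simply left out of the result, which is why local alignment suits database searching for shared subsequences.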
Differences
Needleman-Wunsch: 1. Global alignments 2. Requires alignment score for a pair of residues to be >=0 3. No gap penalty required 4. Score cannot decrease between two cells of a pathway
Smith-Waterman: 1. Local alignments 2. Residue alignment score may be positive or negative 3. Requires a gap penalty to work effectively 4. Score can increase, decrease or stay level between two cells of a pathway
What is BLAT & why do we need it? Many alignment tools already exist: - Smith-Waterman algorithm: solves the problem of aligning two short sequences - FASTA, NCBI BLAST, MegaBLAST, WU-BLAST: provide flexible & fast alignment against large databases - Sim4: does a fine job with cDNA alignment - SAM, PSI-BLAST: slowly but surely find remote homologs
BLAT - BLAT compared with existing tools: - more accurate for similar sequences - 500 times faster in mRNA/DNA alignment - 50 times faster in protein/protein alignment - BLAT’s steps: 1. uses non-overlapping k-mers to create an index 2. uses the index to find homologous regions 3. aligns these regions separately 4. stitches the aligned regions into larger alignments 5. revisits small internal exons possibly missed in the first stage and adjusts large gap boundaries that have canonical splice sites where feasible
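Steps 1 and 2 can be sketched as follows: the database is cut into non-overlapping k-mers that are stored in a lookup table, and every overlapping k-mer of the query is then looked up in that table. This is an illustrative sketch, not BLAT's actual implementation; the names `build_index` and `find_hits`, the value k = 4, and the toy sequences are hypothetical.

```python
from collections import defaultdict


def build_index(database, k):
    """Index the non-overlapping k-mers of the database:
    k-mer -> list of start positions in the database."""
    index = defaultdict(list)
    for start in range(0, len(database) - k + 1, k):
        index[database[start:start + k]].append(start)
    return index


def find_hits(query, index, k):
    """Scan every overlapping k-mer of the query and look it up.
    Returns (query_position, database_position) pairs."""
    hits = []
    for q in range(len(query) - k + 1):
        for d in index.get(query[q:q + k], []):
            hits.append((q, d))
    return hits


if __name__ == "__main__":
    database = "ACGTTGCAGGCTAACC"   # toy "genome"
    index = build_index(database, k=4)
    print(find_hits("TTGCAGG", index, k=4))
```

Hits whose difference (database position minus query position) is the same lie on one diagonal; grouping nearby hits on a diagonal gives the candidate regions that the later steps align and stitch together.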
BLAT - BLAT’s speed & sensitivity are determined by: 1. k-mer size (hit-finding step) 2. mismatch scheme (aligning step) 3. number of required index matches (hit-finding step)
BLAT's similarities to and differences from BLAST Similarities: - both scan for relatively short matches (hits), i.e. build an index and then find hits - both extend hits into high-scoring pairs (HSPs)
BLAT Differences: - BLAST builds an index of the query sequence, but BLAT builds an index of the database - BLAST scans linearly through the database, but BLAT scans linearly through the query sequence - BLAST triggers an extension when one or two hits occur in proximity to each other, but BLAT can trigger extensions on any number of perfect or near-perfect hits
BLAT Differences: - BLAST returns each area of homology between two sequences separately, but BLAT stitches them together into a larger alignment - BLAT has special code to handle introns in RNA/DNA alignments, i.e. BLAT unsplices mRNA onto the genome
BLAT - BLAT is a very effective tool for nucleotide alignments between mRNA and DNA from the same species - it is more accurate and faster than Sim4 - BLAT's strategy for nucleotide alignments becomes less effective below 90% sequence identity, but it can efficiently cope with sequence divergence introduced by sequencing error - twilight zone: 20-35% sequence identity
BLAT For the search stage: - BLAT indexes the database rather than the query sequence, so it only scans the short query sequence - The program “SSAHA” also indexes the database and is an extremely effective tool for aligning genomic regions from the same organism against each other - but “SSAHA” does not implement “unsplicing” and always uses a single perfect match as a seed; BLAT is more flexible in this respect
Sequence Alignment in Matlab Pairwise sequence alignment: standard algorithms such as Needleman-Wunsch (nwalign) and Smith-Waterman (swalign). Standard scoring matrices such as the PAM and BLOSUM families of matrices (blosum, dayhoff, gonnet, nuc44, pam). Visualize sequence similarities with seqdotplot and sequence alignment results with showalignment.
Sequence Alignment in Matlab Multiple sequence alignment: functions for multiple sequence alignment (multialign, profalign) and functions that support multiple sequences (multialignread, fastaread, showalignment). There is also a graphical interface (multialignviewer) for viewing the results of a multiple sequence alignment and making manual adjustments.
Sequence Alignment in Matlab Multiple sequence profiles: multiple alignment and profile hidden Markov model algorithms (gethmmprof, gethmmalignment, gethmmtree, pfamhmmread, hmmprofalign, hmmprofestimate, hmmprofgenerate, hmmprofmerge, hmmprofstruct, showhmmprof). Other useful biological codes: aminolookup, baselookup, geneticcode, revgeneticcode.
Next Unit Chapter 2 Genome Analysis, Databases and Servers