Sequence-based database searching Unit 9

Sequence-based database searching, Unit 9. BIOL221T: Advanced Bioinformatics for Biotechnology. Irene Gabashvili, PhD. Sequence-based search for: retrieving and comparing DNA sequences in databases; identification of related sequences and subsequences.

Presentation Transcript


  1. Sequence-based database searching, Unit 9. BIOL221T: Advanced Bioinformatics for Biotechnology. Irene Gabashvili, PhD

  2. Sequence-based Search for: • Retrieving and comparing DNA sequences in Databases. • Identification of related sequences and subsequences. • Exploring frequently occurring patterns of protein and nucleotide sequences • Finding informative elements in protein and DNA sequences. • Reconstruction of DNA, personal genomics, etc.

  3. Chapter 11. Basis for search: ALIGNMENT = matching = positioning to maximize similarity. ACTGTGGGAACCTTTGCACCGAAAC ACTGTGGGA ACCTATGCACCGAAAC

  4. From previous lecture: • Similarity vs Homology (measure and conclusion) • Speciation vs Duplication (orthologs and paralogs) • Dotplots • Dotlet: http://www.isrec.isb-sib.ch/java/dotlet/Dotlet.html • Dotter • Dottup: http://bioweb.pasteur.fr/seqanal/interfaces/dottup.html also at: www.emboss.org

  5. Today’s Lab: What do the terms “homolog,” “ortholog,” and “paralog” mean? Go to the NCBI BLAST page (http://www.ncbi.nlm.nih.gov/BLAST/) and choose “Protein-protein BLAST.” Paste your (human) protein sequence into the “Search” box. Can you find a homologous sequence from yeast? Note that you can use the same search terms as in the Entrez databases.

  6. The dotplot A simple picture that gives an overview of the similarities between two sequences
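
A dotplot can be sketched in a few lines of code. The example below is only an illustration, not the implementation used by Dotlet, Dotter or Dottup: the function names `dotplot` and `render` and the `window` parameter are assumptions made for this sketch. A cell is marked when the window starting at position i of one sequence exactly equals the window starting at position j of the other.

```python
def dotplot(seq_a, seq_b, window=1):
    """Return a matrix (list of rows) of booleans.

    Cell [i][j] is True when the `window`-long substring starting at
    seq_a[i] is identical to the one starting at seq_b[j].
    """
    rows = len(seq_a) - window + 1
    cols = len(seq_b) - window + 1
    return [
        [seq_a[i:i + window] == seq_b[j:j + window] for j in range(cols)]
        for i in range(rows)
    ]


def render(matrix):
    """Draw the matrix as text: '*' for a match, '.' otherwise."""
    return "\n".join(
        "".join("*" if cell else "." for cell in row) for row in matrix
    )


# Comparing a sequence with itself always gives an unbroken main diagonal.
picture = render(dotplot("ACTGTGGGA", "ACTGTGGGA", window=3))
print(picture)
```

A larger window suppresses isolated chance matches, so only longer shared stretches remain visible as diagonals.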

  7. The dotplot

  8. Today’s Lab: Here are some BLOSUM-62 scores: Blosum(A,A) = 4; Blosum(A,P) = -1; Blosum(A,W) = -3; Blosum(P,P) = 7; Blosum(P,W) = -4. Given these, what would your guess be for the best global alignment of these two sequences: AWAP and APP?
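
To compare candidate alignments for this exercise, each one can be scored column by column. The sketch below uses only the five scores quoted on the slide. The slide gives no gap penalty, so the value of -4 per gap column is an assumption, and the names `SLIDE_SCORES`, `pair_score` and `score_alignment` are hypothetical helpers.

```python
# The five BLOSUM62 entries quoted on the slide. The matrix is symmetric,
# so each unordered pair is stored once.
SLIDE_SCORES = {
    ("A", "A"): 4,
    ("A", "P"): -1,
    ("A", "W"): -3,
    ("P", "P"): 7,
    ("P", "W"): -4,
}


def pair_score(x, y):
    """Look up a residue pair in either order."""
    key = (x, y) if (x, y) in SLIDE_SCORES else (y, x)
    return SLIDE_SCORES[key]


def score_alignment(row_a, row_b, gap=-4):
    """Sum the column scores of two equal-length aligned rows.

    '-' marks a gap; gap=-4 is an assumed penalty, not from the slide.
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for x, y in zip(row_a, row_b):
        total += gap if "-" in (x, y) else pair_score(x, y)
    return total


# Try different placements of the single gap needed to align AWAP with APP.
for candidate in ("-APP", "A-PP", "AP-P", "APP-"):
    print(candidate, score_alignment("AWAP", candidate))
```

Changing the assumed gap penalty shifts every candidate by the same amount here, because each candidate contains exactly one gap column.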

  9. Dotplot vs Scoring Matrices: visualization vs a quantitative measure of how close the sequences are. Dotplots help to visualize tandem repeats (repeating small diagonals), regions of local alignment (major diagonals), and low-complexity regions (solid boxes).

  10. Scoring Matrices: empirical weighting schemes for comparisons. The most commonly used matrices take 3 major biological factors into account: • Conservation (which residues are capable of substituting for one another) • Frequency • Evolution

  11. Scoring Matrices • PAM (MDM/Dayhoff) - Point Accepted Mutation • BLOSUM - BLOcks SUbstitution Matrix • BLOSUM62 is the default matrix in BLAST

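  12. Scoring Matrices. BLOSUM62: built from aligned blocks in which sequences sharing more than 62% identity are clustered together, so the counted substitutions come from sequences with no more than 62% identity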

  13. Blosum62 – a fragment

  14. Nucleotide Scoring Matrices a fragment

  15. Gaps and Gap Penalties: introduced to compensate for indels • Affine gap penalty: G + L*N • G - gap opening penalty • L - gap extension penalty (G > L) • N - the length of the gap • Nonaffine, or linear: G = 0
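
The formula G + L*N above can be written out directly. In this sketch the function names and the example values G = 10 and L = 1 are assumptions chosen for illustration. Note that some programs charge the extension penalty for N - 1 positions instead of N; the code follows the slide's formula.

```python
def affine_gap_penalty(n, gap_open, gap_extend):
    """Penalty for one gap of length n, using the slide's formula G + L*N."""
    if n <= 0:
        return 0
    return gap_open + gap_extend * n


def linear_gap_penalty(n, gap_extend):
    """Linear (non-affine) penalty: the affine formula with G = 0."""
    return affine_gap_penalty(n, 0, gap_extend)


# Hypothetical values: G = 10, L = 1.
one_long_gap = affine_gap_penalty(5, 10, 1)
five_short_gaps = 5 * affine_gap_penalty(1, 10, 1)
print(one_long_gap, five_short_gaps)

# With a linear penalty the two cases cost the same.
print(linear_gap_penalty(5, 1), 5 * linear_gap_penalty(1, 1))
```

With an opening penalty larger than the extension penalty, one long gap is cheaper than several short ones of the same total length, which matches the idea that a single indel event can remove or insert several residues at once.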

  16. ALIGNMENT servers • BLAST (Basic Local Alignment Search Tool): http://www.ncbi.nlm.nih.gov/BLAST/ • BLAT (BLAST-Like Alignment Tool): http://genome.cse.ucsc.edu/ • FastA (Fast Alignment): fasta.bioch.virginia.edu/ • ClustalW: http://www.ebi.ac.uk/clustalw/ • SIM: http://www.expasy.ch/tools/sim-prot.html • Pfam: http://pfam.wustl.edu/ • String: http://string.embl.de/ • ALION: http://motif.stanford.edu/alion/

  17. Today’s Lab: In which case is the percent match lowest when aligning the two sequences below with ALION (http://motif.stanford.edu/alion/), using the Smith-Waterman algorithm? AAGCCGGCGCTCGGCAAGTTCTCCCAGGAGAAAGCCATGTTCAGTTCGAGCGCCAAGATCGTGAAGCCCA AAAAAAAGCCGGCGCTCGGTTTTTTTTCTCCCAGGAGAAAGCCATGTTCAGTTCGAGCGCCAAGATCGTGAAGCCCAAAAAA

  18. Today’s Lab: Use the BLAST online tutorial (http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/information3.html) to discover the meaning of the Score and E value. What is the difference between an identity and a conservative substitution (see Section 5-4A)?

  19. BLAST

  20. Entrez BLAST • Same query syntax as in Entrez • Composition-based statistics: adjust BLAST E-values for amino acid composition • Complexity box: masks out low-complexity sequences

  21. BLAST/BLAT/FASTA • Megablast: long and highly similar sequences • PSI-BLAST: distantly related proteins, PSSM • BLAT: more on other slides • FASTA begins its search by looking for exact matches of words, whereas BLAST allows for conservative substitutions from the very beginning

  22. BLAST/BLAT/FASTA • BLAST allows for automatic masking • FASTA: one hit (Univ of Virginia server?) • FASTA uses the more rigorous SW method, giving a better result in the end for less similar sequences • FASTA is better for translated sequences: it allows frameshifts • FASTA is the slowest, most computationally intensive

  23. ALIGNMENT ALGORITHMS • Smith-Waterman ACTGTCTATAACCTTTGCGGCCAAAC ACTGTCTATACCTAT GCGGCGAAAC ACTGTGGGAACCTATGCGGCGAAAC • Needleman-Wunsch

  24. Needleman-Wunsch Algorithm • General algorithm for sequence comparison • Maximise a similarity score, to give ‘maximum match’ • Maximum match = largest number of residues of one sequence that can be matched with another allowing for all possible deletions • Finds the best GLOBAL alignment of any two sequences

  25. Needleman-Wunsch Algorithm • Three main steps 1. Assign similarity values 2. For each cell, look at all possible pathways back to the beginning of the sequence (allowing insertions and deletions) and give that cell the value of the maximum scoring pathway 3. Construct an alignment (pathway) back from the highest scoring cell to give the highest scoring alignment
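
The three steps above can be sketched as a small dynamic-programming routine. This is a minimal illustration, not the original 1970 formulation: it assumes the common textbook variant with a linear gap penalty and a traceback that starts from the final (bottom-right) cell, so that both sequences are aligned end to end. The function name, the `score` callback and the default `gap=-1` are assumptions of this sketch.

```python
def needleman_wunsch(a, b, score, gap=-1):
    """Global alignment of a and b.

    score(x, y) gives the similarity of two residues (step 1);
    gap is an assumed linear penalty per gap position.
    Returns (best score, aligned a, aligned b).
    """
    n, m = len(a), len(b)
    # F[i][j] = best score for aligning a[:i] with b[:j]
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap
    for j in range(1, m + 1):
        F[0][j] = j * gap

    # Step 2: each cell takes the value of its best-scoring pathway.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(
                F[i - 1][j - 1] + score(a[i - 1], b[j - 1]),  # pair residues
                F[i - 1][j] + gap,                            # gap in b
                F[i][j - 1] + gap,                            # gap in a
            )

    # Step 3: walk back from the final cell to rebuild one best alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + score(a[i - 1], b[j - 1]):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return F[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


def simple_score(x, y):
    """Assumed toy scoring: +1 for identity, -1 for mismatch."""
    return 1 if x == y else -1


result = needleman_wunsch("ACGT", "AGT", simple_score)
print(result)
```

Passing a lookup into a substitution matrix as `score` instead of `simple_score` turns the same routine into a protein aligner.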

  26. Smith-Waterman Algorithm • Instead of looking at each sequence in its entirety this compares segments of all possible lengths (LOCAL alignments) and chooses whichever maximise the similarity measure • For every cell the algorithm calculates ALL possible paths leading to it. These paths can be of any length and can contain insertions and deletions
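
A local alignment can be sketched with one change to the global recurrence: a cell is never allowed to drop below zero, and the traceback starts at the highest-scoring cell and stops at the first zero. The scoring values below (match +2, mismatch -1, gap -2) and the function name are assumptions chosen for illustration.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment of a and b with an assumed linear gap penalty.

    Returns (best score, aligned segment of a, aligned segment of b).
    """
    n, m = len(a), len(b)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            # The 0 lets a new local alignment start at any cell.
            H[i][j] = max(0, H[i - 1][j - 1] + s, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until the score falls to zero.
    out_a, out_b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


# Only the shared core of the two sequences is reported.
local = smith_waterman("TTTACGTTT", "GGACGTGG")
print(local)
```

Because unrelated flanking regions contribute nothing, the result is the best-matching pair of segments, not an end-to-end alignment.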

  27. Differences • Needleman-Wunsch: 1. Global alignments 2. Requires the alignment score for a pair of residues to be >= 0 3. No gap penalty required 4. Score cannot decrease between two cells of a pathway • Smith-Waterman: 1. Local alignments 2. Residue alignment score may be positive or negative 3. Requires a gap penalty to work effectively 4. Score can increase, decrease or stay level between two cells of a pathway

  29. What is BLAT and why do we need it? Many alignment tools exist: - Smith-Waterman's algorithm: solves the problem of aligning two short sequences - FASTA, NCBI BLAST, MegaBLAST, WU-BLAST: provide flexible and fast alignment against large databases - Sim4: does a fine job with cDNA alignment - SAM, PSI-BLAST: slowly but surely find remote homology

  30. BLAT - BLAT (compared with existing tools): - more accurate for similar sequences - 500 times faster in mRNA/DNA alignment - 50 times faster in protein/protein alignment - BLAT’s steps: 1. use non-overlapping k-mers to create an index 2. use the index to find homologous regions 3. align these regions separately 4. stitch the aligned regions into a larger alignment 5. revisit small internal exons possibly missed in the first stage, and adjust large gap boundaries that have canonical splice sites where feasible
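
Steps 1 and 2 can be sketched as follows. This is only a toy illustration of the indexing idea, not BLAT's actual code: the function names, the tiny sequences and k = 4 are assumptions. The database is cut into non-overlapping k-mers for the index, and the query is then scanned at every position so that a match cannot be missed because of its offset.

```python
from collections import defaultdict


def build_index(database, k):
    """Step 1: index the non-overlapping k-mers of the database.

    Maps each k-mer to the list of database positions where it starts.
    """
    index = defaultdict(list)
    for pos in range(0, len(database) - k + 1, k):
        index[database[pos:pos + k]].append(pos)
    return index


def find_hits(query, index, k):
    """Step 2: slide over the query one base at a time and look up each k-mer.

    Returns (query position, database position) pairs for exact k-mer matches.
    """
    hits = []
    for q in range(len(query) - k + 1):
        for d in index.get(query[q:q + k], ()):
            hits.append((q, d))
    return hits


# Toy data: the database k-mers are ACGT (0), ACGG (4) and TTCA (8).
index = build_index("ACGTACGGTTCA", 4)
hits = find_hits("GGACGGTT", index, 4)
print(hits)
```

Each hit marks a candidate homologous region. The later steps (aligning around the hits and stitching the pieces together) are not shown here.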

  31. BLAT - BLAT’s speed and sensitivity are determined by: 1. k-mer size (hit-finding step) 2. mismatch scheme (aligning step) 3. number of required index matches (hit-finding step)

  32. BLAT's similarities to and differences from BLAST. Similarity: - both scan for relatively short matches (hits), i.e. build an index and then find hits - both extend hits into high-scoring pairs (HSPs)

  33. BLAT. Difference: - BLAST builds an index for the query sequence, but BLAT builds an index for the database - BLAST scans linearly through the database, but BLAT scans linearly through the query sequence - BLAST triggers an extension when one or two hits occur in proximity to each other, but BLAT can trigger extensions on any number of perfect or near-perfect hits

  34. BLAT. Difference: - BLAST returns each area of homology between two sequences separately, but BLAT stitches them together into a larger alignment - BLAT has special code to handle introns in RNA/DNA alignments, i.e. BLAT unsplices mRNA onto the genome

  35. BLAT - BLAT is a very effective tool for nucleotide alignments between mRNA and DNA of the same species - it is more accurate and faster than Sim4 - BLAT's strategy for nucleotide alignments becomes less effective below 90% sequence identity, but it can efficiently handle sequence divergence introduced by sequencing error - twilight zone: 20-35% sequence identity

  36. BLAT, search stage: - BLAT indexes the database rather than the query sequence, so it only scans the short query sequence - the program “SSAHA” also indexes the database and is an extremely effective tool for aligning genomic regions from the same organism against each other - but “SSAHA” does not implement “unsplicing” and always uses a single perfect match as a seed; BLAT is more flexible in this respect

  37. Sequence Alignment in Matlab. Pairwise sequence alignment: standard algorithms such as Needleman-Wunsch (nwalign) and Smith-Waterman (swalign). Standard scoring matrices such as the PAM and BLOSUM families (blosum, dayhoff, gonnet, nuc44, pam). Visualize sequence similarities with seqdotplot and sequence alignment results with showalignment.

  38. Sequence Alignment in Matlab. Multiple sequence alignment: functions for multiple sequence alignment (multialign, profalign) and functions that support multiple sequences (multialignread, fastaread, showalignment). There is also a graphical interface (multialignviewer) for viewing the results of a multiple sequence alignment and making manual adjustments.

  39. Sequence Alignment in Matlab. Multiple sequence profiles: multiple alignment and profile hidden Markov model algorithms (gethmmprof, gethmmalignment, gethmmtree, pfamhmmread, hmmprofalign, hmmprofestimate, hmmprofgenerate, hmmprofmerge, hmmprofstruct, showhmmprof). Other useful biological codes: aminolookup, baselookup, geneticcode, revgeneticcode.

  40. Next Unit Chapter 2 Genome Analysis, Databases and Servers
