
From Pairwise Alignment to Database Similarity Search

callia

Presentation Transcript


  1. From Pairwise Alignment to Database Similarity Search

  2. Global vs Local Alignment
     Global Alignment
       ATTGCAGTG-TCGAGCGTCAGGCT
       ATTGCGTCGATCGCAC-GCACGCT
     Local Alignment
       CATATTGCAGTGGTCCCGCGTCAGGCT
       TAAATTGCGT-GGTCGCACTGCACGCT

  3. Global vs. Local alignment
     Alignment of two genomic sequences
       >Human DNA
       CATGCGACTGACcgacgtcgatcgatacgactagctagcATCGATCATA
       >Mouse DNA
       CATGCGTCTGACgctttttgctagcgatatcggactATCGATATA

  4. Global vs. Local alignment
     Alignment of two genomic sequences
     Global Alignment
       Human:CATGCGACTGACcgacgtcgatcgatacgactagctagcATCGATCATA
       Mouse:CATGCGTCTGACgct---ttttgctagcgatatcggactATCGAT-ATA
             ****** *****         *     ***      *  ****** ***
     Local Alignment
       Human:CATGCGACTGAC     Human:ATCGATCATA
       Mouse:CATGCGTCTGAC     Mouse:ATCGAT-ATA

  5. Global vs. Local alignment
     Alignment of genomic DNA and mRNA
       >Human DNA
       CATGCGACTGACcgacgtcgatcgatacgactagctagcATCGATCATA
       >Human mRNA
       CATGCGACTGACATCGATCATA

  6. Global vs. Local alignment
     Alignment of genomic DNA and mRNA
     Global Alignment
       DNA: CATGCGACTGACcgacgtcgatcgatacgactagctagcATCGATCATA
       mRNA:CATGCGACTGAC---------------------------ATCGATCATA
            ************                           **********
     Local Alignment
       DNA: CATGCGACTGAC     DNA: ATCGATCATA
       mRNA:CATGCGACTGAC     mRNA:ATCGATCATA
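The difference between the two alignment modes can be sketched with the standard dynamic-programming recurrences. The slides do not name an algorithm for the global case, so this sketch assumes Needleman-Wunsch for global scoring and Smith-Waterman for local scoring, with illustrative scores (match +1, mismatch -1, linear gap -1) that are not taken from the slides. Comparison is case-sensitive, so the lowercase intron never matches the uppercase mRNA.

```python
def global_score(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch score: both sequences must be aligned end to end."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur.append(max(diag, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]


def local_score(a, b, match=1, mismatch=-1, gap=-1):
    """Smith-Waterman score: best-scoring pair of substrings (floor at 0)."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            cur.append(cell)
            best = max(best, cell)
        prev = cur
    return best


# Sequences from the DNA vs. mRNA slide (lowercase marks the intron).
dna = "CATGCGACTGACcgacgtcgatcgatacgactagctagcATCGATCATA"
mrna = "CATGCGACTGACATCGATCATA"
print("global:", global_score(dna, mrna))
print("local: ", local_score(dna, mrna))
```

Under these assumed scores the global alignment pays for the whole 27-base intron gap, while the local alignment simply reports the best exon match.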

  7. Why do we care to align sequences? • Sequences that are similar probably have the same function

  8. Why do we care to align sequences?

  9. Discover the function of a new sequence
     New sequence (function: ?) ≈ similar sequence in the sequence database → similar function

  10. Searching databases for similar sequences
      Naïve solution: use an exact algorithm to compare each sequence in the database to the query.
      Is this reasonable? How much time will it take to calculate?

  11. Complexity for genomes
      • The human genome (HG) contains 3 × 10^9 base pairs
      • Searching an mRNA against the HG requires ~10^12 dynamic-programming cells
      • Even efficient exact algorithms will be extremely slow when performed millions of times, even with parallel computing.
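The cell count above follows from the size of the dynamic-programming matrix, which is the query length times the target length. A minimal back-of-the-envelope sketch, assuming a hypothetical mRNA of 1,000 bases and a hypothetical throughput of 10^9 cells per second (neither number comes from the slides):

```python
def dp_cells(query_len, target_len):
    """Number of cells an exact pairwise alignment has to fill."""
    return query_len * target_len


HUMAN_GENOME_BP = 3 * 10**9   # from the slide
MRNA_LEN = 1_000              # assumed, for illustration only
CELLS_PER_SECOND = 1e9        # assumed machine throughput

cells = dp_cells(MRNA_LEN, HUMAN_GENOME_BP)
seconds = cells / CELLS_PER_SECOND
print(f"{cells:.1e} cells, about {seconds / 60:.0f} minutes per query")
```

At these assumed rates a single query takes on the order of an hour, which is why repeating the exact search millions of times is impractical.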

  12. So what can we do?

  13. Searching databases Solution: Use a heuristic (approximate) algorithm

  14. Heuristic strategy
      • Remove regions that are not useful for meaningful alignments
      • Preprocess the database into a new data structure to enable fast access

  15. Heuristic strategy
      • Remove regions that are not useful for meaningful alignments
      • Preprocess the database into a new data structure to enable fast access

  16. What sequences to remove?
      Low complexity sequences (JUNK???)
      • AAAAAAAAAAA
      • ATATATATATATA
      • Transposable elements
      53% of the genome is repetitive DNA

  17. Low Complexity Sequences
      What's wrong with them?
      * Not informative
      * Produce artificial high-scoring alignments
      So what do we do? We apply low-complexity masking to the database and the query sequence.
      Mask:
        TCGATCGTATATATACGGGGGGTA
        TCGATCGNNNNNNNNCNNNNNNTA
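The masking step can be illustrated with a deliberately simple rule: replace any run of a repeated 1-base or 2-base unit that is at least 6 bases long with N. This is only a sketch under that assumed rule; production BLAST uses dedicated filters (DUST for nucleotides, SEG for proteins), and the function name, threshold, and periods here are hypothetical.

```python
def mask_low_complexity(seq, min_len=6, periods=(1, 2)):
    """Mask short-period repeats (e.g. GGGGGG, TATATA) with 'N'.

    A run with period p is a stretch where seq[j] == seq[j + p].
    """
    masked = list(seq)
    n = len(seq)
    for p in periods:
        i = 0
        while i + p < n:
            j = i
            # Extend while the base p positions ahead repeats the current one.
            while j + p < n and seq[j] == seq[j + p]:
                j += 1
            run_len = (j - i) + p if j > i else 0
            if run_len >= min_len:
                for k in range(i, j + p):
                    masked[k] = "N"
            i = max(j, i + 1)
    return "".join(masked)


print(mask_low_complexity("TCGATCGTATATATACGGGGGGTA"))
```

Applied to the sequence on the slide, this rule masks the TATATATA and GGGGGG stretches and leaves the rest unchanged.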

  18. Heuristic strategy
      • Remove low-complexity regions that are not useful for meaningful alignments
      • Preprocess the database into a new data structure to enable fast access

  19. BLAST: Basic Local Alignment Search Tool (Altschul et al., 1990)
      General idea: a good alignment contains subsequences of high identity.
      • First, identify very short, almost exact matches.
      • Next, extend the best short hits from the first step to longer regions of similarity.
      • Finally, optimize the best hits using the Smith-Waterman algorithm.

  20. BLAST (Protein Sequence Example)
      • Search the database for matching words
      • Example:
        Protein sequence: …FSGTWYA…
        Words of length 3: FSG, SGT, GTW, TWY, WYA
      • All words in database (bag of words):
        FSG SGT GTW TWY WYA
        YSG TGT ATW SWY WFA
        FTG SVT GSW TWF WYS …

  21. BLAST (Protein Sequence Example)
      • Search the database for matching words
      • Example:
        Protein sequence: …FSGTWYA…
        Words of length 3: FSG, SGT, GTW, TWY, WYA, …
      • All words in database (bag of words):
        FSG SGT GTW TWY WYA
        YSG TGT ATW SWY WFA
        FTG SVT GSW TWF WYS …
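The word step, together with the idea of preprocessing the database into a lookup structure, can be sketched as a dictionary from each length-3 word to its positions in the database. This sketch matches words exactly; protein BLAST also accepts similar "neighborhood" words scored with a substitution matrix, which is left out here. The function names and the one-entry database are hypothetical.

```python
from collections import defaultdict


def words(seq, k=3):
    """All overlapping words of length k in seq."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_index(database, k=3):
    """Map each word to the (sequence name, offset) pairs where it occurs."""
    index = defaultdict(list)
    for name, seq in database.items():
        for offset, word in enumerate(words(seq, k)):
            index[word].append((name, offset))
    return index


def find_seeds(query, index, k=3):
    """Exact word hits as (query offset, sequence name, database offset)."""
    return [
        (q_off, name, d_off)
        for q_off, word in enumerate(words(query, k))
        for name, d_off in index.get(word, [])
    ]


# Database sequence D taken from the word-pair slide.
database = {"D": "INVIEIAFDGTWTCATTNAMHEWASNINETEEN"}
index = build_index(database)
print(words("FSGTWYA"))
print(find_seeds("FSGTWYA", index))
```

Building the index costs one pass over the database, after which each query word is looked up in constant time instead of being compared against every database position.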

  22. BLAST (Protein Sequence Example)
      1. Search the database for matching word pairs (L = 3)
      2. Extend word pairs as much as possible, i.e., as long as the total score increases
         • High-scoring Segment Pairs (HSPs)
      Q: FIRSTLINIHFSGTWYAAMESIRPATRICKREAD
      D: INVIEIAFDGTWTCATTNAMHEWASNINETEEN
      Q = query sequence, D = sequence in database
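The extension step can be sketched as an ungapped extension in both directions from an exact word hit. Two things here are assumptions rather than content from the slides: scoring uses simple identity (+1 match, -1 mismatch) instead of a substitution matrix, and the stopping rule is a small drop-off threshold (`x_drop`), so the extension tolerates a brief dip before giving up and then reports the best-scoring segment seen.

```python
def extend_seed(query, target, q_pos, t_pos, k,
                match=1, mismatch=-1, x_drop=2):
    """Extend an exact k-word hit without gaps.

    Returns (query start, target start, segment length, score).
    """
    best = k * match  # the seed itself is assumed to match exactly

    # Extend to the right of the seed.
    right, cur, i = 0, best, 0
    while q_pos + k + i < len(query) and t_pos + k + i < len(target):
        same = query[q_pos + k + i] == target[t_pos + k + i]
        cur += match if same else mismatch
        i += 1
        if cur > best:
            best, right = cur, i
        elif best - cur > x_drop:
            break

    # Extend to the left of the seed, starting from the best score so far.
    left, cur, i = 0, best, 0
    while q_pos - 1 - i >= 0 and t_pos - 1 - i >= 0:
        same = query[q_pos - 1 - i] == target[t_pos - 1 - i]
        cur += match if same else mismatch
        i += 1
        if cur > best:
            best, left = cur, i
        elif best - cur > x_drop:
            break

    return q_pos - left, t_pos - left, k + left + right, best


# Toy sequences (hypothetical) with a GTW seed at offset 2 in both.
print(extend_seed("MKGTWAYLL", "PPGTWAYQQ", 2, 2, 3))
```

In the toy example the seed GTW grows rightward through the matching AY and stops once the flanking mismatches pull the score down, giving the segment GTWAY.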

  23. BLAST
      3. Try to connect HSPs by aligning the sequences in between them:
         THEFIRSTLINIHFSGTWYAA____M_ESIRPATRICKREAD
         INVIEIAFDGTWTCATTNAMHEW___ASNINETEEN
      The Gapped BLAST algorithm allows several segments that are separated by short gaps to be connected into one alignment.

  24. Running BLAST to predict the function of a new protein
      >Arrestin protein (C. elegans)
      MFIANNCMPQFRWEDMPTTQINIVLAEPRCMAGEFFNAKVLLDSSDPDTVVHSFCAEIKG
      IGRTGWVNIHTDKIFETEKTYIDTQVQLCDSGTCLPVGKHQFPVQIRIPLNCPSSYESQF
      GSIRYQMKVELRASTDQASCSEVFPLVILTRSFFDDVPLNAMSPIDFKDEVDFTCCTLPF
      GCVSLNMSLTRTAFRIGESIEAVVTINNRTRKGLKEVALQLIMKTQFEARSRYEHVNEKK
      LAEQLIEMVPLGAVKSRCRMEFEKCLLRIPDAAPPTQNYNRGAGESSIIAIHYVLKLTAL
      PGIECEIPLIVTSCGYMDPHKQAAFQHHLNRSKAKVSKTEQQQRKTRNIVEENPYFR

  25. How to interpret a BLAST score:
      • The score is a measure of the similarity of the query to the sequence shown.
      • How do we know if the score is significant?
        - Statistical significance
        - Biological significance

  26. How to interpret a BLAST search:
      For each BLAST score we can calculate an expectation value (E-value).
      The E-value is the number of alignments with scores greater than or equal to score S that are expected to occur by chance in a database search.
      An E-value is related to a probability value p (p-value). (page 105)

  27. BLAST E-value: E = K·m·n·e^(−λs)
      • Increases linearly with the length of the query sequence
      • Decreases exponentially with the score of the alignment
      • Increases linearly with the length of the database
      m = length of query; n = length of database; s = score
      K, λ: statistical parameters dependent upon the scoring system and background residue frequencies
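The three dependencies listed above are those of the standard Karlin-Altschul relation E = K·m·n·e^(−λs), and the link to a p-value is p = 1 − e^(−E). A minimal sketch follows; the values of K and λ below are placeholders chosen only for illustration, since real values depend on the scoring system.

```python
import math


def e_value(score, m, n, K, lam):
    """Expected number of chance alignments scoring at least `score`."""
    return K * m * n * math.exp(-lam * score)


def p_value(e):
    """Probability of at least one chance alignment, given E-value e."""
    return -math.expm1(-e)  # 1 - exp(-e), computed stably for small e


K, LAM = 0.1, 0.3          # placeholder statistical parameters
m, n = 300, 1_000_000      # query length, database length (illustrative)

for s in (40, 60, 80):
    e = e_value(s, m, n, K, LAM)
    print(f"score={s}: E={e:.3g}, p={p_value(e):.3g}")
```

Doubling the database size doubles E for the same score, which is why the same alignment can look significant in a small database and unremarkable in a large one.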

  28. What is a good E-value? (Rule of thumb)
      • E-values of less than 0.00001 show that sequences are almost always homologues.
      • Greater E-values can represent homologues as well.
      • Generally, the decision whether an E-value is biologically significant depends on the size of the database that is searched.
      • Sometimes a real match has an E-value > 1.
      • Sometimes a similar E-value occurs for a short exact match and a long, less exact match.

  29. How to interpret a BLAST search:
      • The score is a measure of the similarity of the query to the sequence shown.
      • How do we know if the score is significant?
        - Statistical significance
        - Biological significance

  30. Treating Gaps in BLAST
      >Human DNA
      CATGCGACTGACcgacgtcgatcgatacgactagctagcATCGATCATA
      >Human mRNA
      CATGCGACTGACATCGATCATA
      Sometimes corrections to the model are needed to infer biological significance.

  31. Gap Scores
      • Standard solution: affine gap model
        w_x = g + r(x − 1)
        w_x: total gap penalty; g: gap open penalty; r: gap extend penalty; x: gap length
      • Once-off cost for opening a gap
      • Lower cost for extending the gap
      • Changes required to the algorithm
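The affine formula w_x = g + r(x − 1) can be compared with a linear penalty on the 27-base intron gap from the DNA vs. mRNA example. The penalty values g = 11 and r = 1 are assumed here purely for illustration and do not come from the slides.

```python
def affine_gap_penalty(x, g, r):
    """Total penalty w_x for a gap of length x: open once, then extend."""
    if x < 1:
        raise ValueError("gap length must be at least 1")
    return g + r * (x - 1)


def linear_gap_penalty(x, d):
    """Linear model for comparison: every gap position costs d."""
    return d * x


GAP_OPEN, GAP_EXTEND = 11, 1   # assumed values
INTRON_LEN = 27                # gap length in the DNA vs. mRNA alignment

print("affine:", affine_gap_penalty(INTRON_LEN, GAP_OPEN, GAP_EXTEND))
print("linear:", linear_gap_penalty(INTRON_LEN, GAP_OPEN))
```

With these numbers the single long gap costs 37 under the affine model and 297 under the linear one, so the affine model favors one long gap over many scattered short ones.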

  32. Significance of Gapped Alignments
      • Gapped alignments use the same statistics
      • λ and K cannot be easily estimated
      • Empirical estimations and gap scores are determined by looking at random alignments

  33. BLAST
      BLAST is a family of programs
      Query:    DNA | Protein
      Database: DNA | Protein
