
Bioinformatics Workshop 1 Sequences and Similarity Searches


lynda

Presentation Transcript


  1. Bioinformatics Workshop 1: Sequences and Similarity Searches • Open a web browser and type in the URL: • informatics.gurdon.cam.ac.uk/online/workshops • Bookmark this page • Click on the link to the file: • useful-websites.html • Bookmark this page too • It also contains links to the example sequence files used in the workshop, and the presentations themselves

  2. The Basic Questions Where, and how, do I find something? How do I know it’s real? Exercise 0: Write a concise definition of what a gene is.

  3. Part 1: Structural Genomics DNA arranged in chromosomes Vertebrate ~ 10^9 base pairs

  4. Chromosomes and Genes Total of ~30,000 genes on ~20 chromosomes 1000 – 2000 genes per chromosome

  5. Gene to Protein ~ gene locus genome primary transcript mRNA protein

  6. Sequence Signals mRNA CTACCATCCATGCTAACCATTCTACTAGCATAACTGGCTA CTACCATCCATGCTAACCATTCTACTAGCATAACTGGCTA M L T I L L A
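
The reading of this mRNA can be sketched in a few lines of Python. This is a minimal illustration, not part of the original workshop material: it assumes the standard genetic code, and the names `CODON_TABLE` and `translate_first_orf` are our own. It scans for the first ATG and translates codon by codon until an in-frame stop.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... ('*' = stop).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def translate_first_orf(mrna):
    """Translate from the first ATG up to (not including) the first in-frame stop."""
    start = mrna.find("ATG")
    if start == -1:
        return ""
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


# The mRNA shown on the slide.
slide_mrna = "CTACCATCCATGCTAACCATTCTACTAGCATAACTGGCTA"
print(translate_first_orf(slide_mrna))  # MLTILLA
```

The result matches the M L T I L L A written under the sequence on the slide; the codon after GCA is TAA, a stop.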

  7. promoters splice sites ===CACGATCGAGTC=================== ==ACGTA…………CAGTA==================== enhancers transcription start site ===CGCTATAAGCG==================== ===CGCAATAAAGCG=================== polyadenylation signal Genomic Signals

  8. 5’ EST 3’ EST Derivative Sequences mRNA 5’ 3’ capture by cloning into cDNA library EST: single pass sequence from each end of the clone cDNA sequence cDNA: multiple pass sequencing over whole length of the clone

  9. Gene Models exons gene model

  10. mRNAs/cDNAs S43105.1 ‘similar to Cyclin B1 [mus musculus]’ BT006437.1 ‘Cyclin B1, isoform 1 [mus musculus]’ X58708.1 ‘Cyclin B1, isoform 2 [mus musculus]’ NM_111985.3 ‘CCNB1, Cyclin B1 [mus musculus]’ AAB22970.1 proteins AAP21245.1 CAA41545.1 NP_187759.2 Sequences and Genes(Accession Numbers and Names) gene

  11. Gene Symbols, Names, Etc. Gene Symbol: CCNB1 Gene Name: cyclin B1 [Homo sapiens] Description: G2/mitotic-specific cyclin B1 Aliases: CCNB, CYCB1

  12. genomic location S43105.1 AAB22970.1 BT006437.1 AAP21245.1 X58708.1 CAA41545.1 NM_111985.3 NP_187759.2 expression data A Gene-Centric View Entrez Gene http://www.ncbi.nlm.nih.gov/ Cyclin B1 Exercise 1: Go to Entrez Gene and look for your favourite gene or genes.

  13. Sequences and Accession Numbers NM_001015922.1 gi=62860271 GATCGTTCGATTAGCTAGGGACACCACCGATCGATATGACCACAAAAA NM_001015922.2 gi=62860589 GACCGTTCGATTAGCTAGGGACACCACCGATCGATATGACCACAAA BC009638.1 gi=16307106 GTTCGATTAGCTAGGGACACCACCGATCGATATGACCACAAAA NP_001015922.1 protein translated from mRNA XM_001102567.1 predicted mRNA XP_001089765.1 predicted protein translated from predicted mRNA

  14. genome exon intron exon intron exon mRNA Splicing Signals mRNA CTACCATCCATGCTAACCATTCTACCATTTTATACTCATGCAACGGACCGTAGCGTAGTCGCTTAGCATCCTTTATAACTGGCTA gene model splice sites CTACCATCCATGCTAACCATTCTAC CATTTTATACTCATGCAACGGACCGT AGCGTAGTCGCTTAGCATCCTTTATAACTGGCTA CTACCATCCATGCTAACCATTCTACGTAAGTCATCTATATCAATATTATTTCAGCATTTTATACTCATGCAACGGACCGTGTCAGTATTACAGAGCGTAGTCGCTTAGCATCCTTTATAACTGGCTA GTAAG.donor .TTTCAG acceptor
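
As a rough sketch of what this slide shows (our own illustration, with the hypothetical helper name `find_introns`), the Python below locates the three exons in the genomic sequence, reports the stretches between them, and checks that each one starts with the GT donor and ends with the AG acceptor.

```python
def find_introns(genomic, exons):
    """Locate the exons in order within the genomic sequence and
    return the sequences lying between consecutive exons."""
    introns = []
    search_from = 0
    previous_end = None
    for exon in exons:
        start = genomic.index(exon, search_from)
        if previous_end is not None:
            introns.append(genomic[previous_end:start])
        previous_end = start + len(exon)
        search_from = previous_end
    return introns


# Genomic sequence and exons as printed on the slide.
genomic = (
    "CTACCATCCATGCTAACCATTCTACGTAAGTCATCTATATCAATATTATTTCAG"
    "CATTTTATACTCATGCAACGGACCGTGTCAGTATTACAGAGCGTAGTCGCTTAG"
    "CATCCTTTATAACTGGCTA"
)
exons = [
    "CTACCATCCATGCTAACCATTCTAC",
    "CATTTTATACTCATGCAACGGACCGT",
    "AGCGTAGTCGCTTAGCATCCTTTATAACTGGCTA",
]

introns = find_introns(genomic, exons)
for intron in introns:
    # Each intron should begin with GT (donor) and end with AG (acceptor).
    print(intron, intron.startswith("GT"), intron.endswith("AG"))

# Joining the exons gives the spliced mRNA.
mrna = "".join(exons)
print(mrna)
```

Real splice-site recognition is statistical rather than a simple GT...AG match, as the next slide points out; this only confirms that the example obeys the rule.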

  15. Gene Predictions • Given: • coding sequence must run from ATG – STOP codon in-frame • introns GT. . . . . . AG can be spliced out • Also take a statistical approach: • coding and non-coding sequence are slightly different in composition • some ‘possible’ splice sites are more likely than others scan genomic sequence … . . .CGTCGTATGGCTTCGATGTAGTACATCGGATCGGTATGGAATCATTTCAGTCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA. . . . .CGTCGTATGGCTTCGATGTAGTACATCGGATCGGTATGGAATCATTTCAGTCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA. . . . .CGTCGTATGGCTTCGATTCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA. . . . . .CGTCGTATGGCTTCGATGTAGTACATCGGATCGGTATGGAATCATTTCAGTCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA. . . . .CGTCGTATGGCTTCGATTCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA. . . . . .CGTCGTATGGCTTCGATGTAGTACATCGGATCGGTATGGAATCATTTCAGTCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA. . . . .CGTCGTATGGCTTCGATGTAGTACATCGGATCGTCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA. . . . . .CGTCGTATGGCTTCGATGTAGTACATCGGATCGTCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA. . . . . .CGTCGTATGGCTTCGATGTAGTACATCGGATCGGTATGGAATCATTTCAGTCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA. . most likely gene model . . .CGTCGTATGGCTTCGATGTAGTACATCGGATCGGTATGGAATCATTTCAGTCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA. .

  16. Supporting Evidence! exons: 1 2 3 4 gene model genome EST evidence We note that in the absence of EST evidence it is only really possible to predict coding sequence with any confidence (and even then…). So predicted genes based on computational gene models alone will usually lack UTR regions, which has some important consequences.

  17. Theoretical/Predicted Sequences exons: 1 2 3 4 predicted gene model genome predicted transcript predicted protein We’ve now reversed the process of working out exon structure from aligning cDNA sequences against the genome sequence, but we shouldn’t lose sight of the fact that we don’t really know if these predicted proteins exist – especially where supporting EST evidence is weak, or non-existent.

  18. Sequences for a model organism ESTs – millions @ £10 each Cheap to sequence – so we get millions per organism But lots of errors And incomplete gene sequences Can give us relative expression levels cDNAs – tens of thousands @ £1000 each Expensive – but only need to do one (or a small number) per gene Few errors with multipass sequencing Gives us protein sequences Genomes – one ! @ £30,000,000 Extremely expensive But the only way to get the whole picture Gives us gene regulation

  19. So What’s in the Databases Now? NCBI July 2005 15,000,000 ESTs 3,300,000 cDNAs DNA nr RefSeq 2,700,000 proteins 950,000 proteins Proteins

  20. Part 2: Comparative Genomics Evolution by sequence mutation Gene sequence ATGAAGGCTGCCTACGACTGCCGTG ATGCAGGCTGCCTACGACTGCCGTG ATGCAGGCTGCCAACGACTGCCGTG Imagine one mutation gets fixed every 100,000 years in this gene sequence… ATGCATGCTGCCAACGACTGCCGTG ATGCATGCTGCCAACGACTGCCCTG ATGCATGCTGCCAACGGCTGCCCTG ATGCATGCTGCCAACGGATGCCCTG ATGCATGCCGCCAACGGATGCCCTG ATGCATGCCGCCAACGGATGTCCTG

  21. Speciation Gene A ATGAAGGCTGCCTACGACTGCCGTG ATGAAGGCTGCCTACGACTGCCGTG ATGAAGGCCGCCTACGACTGCCGTG ATGCAGGCTGCCTACGACTGCCGTG ATGAAGGCCGCCAACGACTGTCGTG ATGCAGGCTGCCAACGACTGCCGTG ATGAAAGCCGCCAACGACTGTCGTG ATGCATGCTGCCAACGACTGCCGTG ATGAAAGCCGCCAACGACAGTCGTG ATGCATGCTGCCAACGACTGCCCTG ATGCATGCTGCCAACGGCTGCCCTG ATGAAAGCCGCCTACGACAGTCGTG ATGCATGCTGCCAACGGATGCCCTG ATGAAAGCCGCCTACGACAGTCCTG ATGCATGCTGCCAACGGATGCCCTG ||| | || ||| ||| | |||| ATGAAAGCCGCCTACGACAGTCCTG If the genetic difference means they can no longer interbreed, with fertile offspring – then we have a new species…

  22. ATGCATGCTGCCAACGGATGCCCTG ||| | || ||| ||| | |||| ATGAAAGCCGCCTACGACAGTCCTG ATGCATGCTGCCAACGGATGCCCTG ||| | | || | | | || | ATGGAAGGCGCTTAGGATAGTCCAG Residual Similarity We can still easily detect residual similarity between these sequences. This is what we call homology – detectable similarity because of common evolutionary origin. After longer periods of evolution, homology may no longer be detectable in the DNA sequence…

  23. ATGCATGCTGCCAACGGATGCCCTG ||| | || ||| ||| | |||| ATGAAAGCCGCCTACGACAGTCCTG GCTGACTCGTAGCGCTTAGCTAGCT | || | | | | CCAACATCTAGCCAGATTAGTTAGT Computers Can Detect Homology In fact computers are very good at this task – the two primary challenges are: (a) performing the search fast enough to look through millions of sequences in a timescale compatible with a lab scientist’s attention span (b) at low levels of similarity, being able to distinguish between biologically related sequences and chance matches…

  24. A A A Orthologs Gene duplication through speciation The two copies of Gene A will now evolve independently, but will continue to have the ~same function They are ORTHOLOGS

  25. A A A A A’ Paralogs The two copies of Gene A will now evolve independently, but will probably not continue to have exactly the same function Gene duplication through internal genome duplication They are PARALOGS

  26. A A A A A’ ‘Other’-logs What about gene duplication after speciation ? How can we describe the relationship(s) between the various copies of gene A in the two frogs? Bear in mind that understanding gene function is more important than semantics… The two copies of A in the orange frog are sometimes called IN-PARALOGS. If they were also present in the green frog (and therefore were in the ancestor species) they would be OUT-PARALOGS.

  27. A A A A cyclin b1 cyclin b1 The Essential Paradigm 1. any group of modern species can be traced back to some extinct common ancestor 2. in all likelihood they share orthologous genes which have the same function in the modern animal as in the extinct ancestor 3. If we can experimentally determine the function of a gene in one of these organisms, then there is a good chance the ORTHOLOGOUS gene in another organism will have the same function

  28. living organisms Function Conserved Longer than Detectable Similarity start from first self-replicating sequence whole genome duplication local duplication same function detectable similarity

  29. Redundancy in the Genetic Code GCA A alanine GCC A GCG A GCT A TGC C cysteine TGT C GAC D aspartate GAT D GGA G glycine GGC G GGG G GGT G ‘Synonymous’ or ‘silent’ mutations in the third position of the codon triplets have no effect on the amino acid coded for – so there is no evolutionary pressure against them…

  30. LSREPV CTATCACGAGAACCTGTG |||||| || || || || || || CTGTCCCGTGAGCCAGTT LSREPV LSREPV CTATCACGAGAACCTGTG ||| || | || | || || TTGTCCCGGTCGCCAGTT LSRFPV Protein Similarity Persists Longer CTATCACGAGAACCTGTG CTATCCCGAGAACCTGTG CTATCCCGAGAACCAGTG CTATCCCGTGAACCAGTG CTATCCCGTGAGCCAGTG CTATCCCGTGAGCCAGTT CTGTCCCGTGAGCCAGTT 100% 67% 80% 44%

  31. Always Compare Protein Sequences amino acid comparison DNA comparison ATGAATGCAGCCTATGATTGCCGAGCCAGAATGCTAAGG MNAAYDCRARMLR ||||| || || || || || || || ||||| || || | ||||||||+|| ATGAAGGCCGCATACGACTGTCGTGCTAGAATCCTGAGA MKAAYDCRARILR The DNA sequence can change while the amino acid sequence stays the same, so always look for similarities by comparing amino acid sequences.
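
The point of this slide can be checked directly. The sketch below is our own illustration (the helper names `translate` and `count_matches` are hypothetical, and the standard genetic code is assumed): it counts identical positions between the two DNA sequences on the slide, then translates both and counts again at the amino acid level.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... ('*' = stop).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def translate(dna):
    """Translate a coding sequence in frame 1, codon by codon."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def count_matches(a, b):
    """Number of identical positions in an ungapped, end-to-end comparison."""
    return sum(x == y for x, y in zip(a, b))


# The two coding sequences from the slide.
dna_a = "ATGAATGCAGCCTATGATTGCCGAGCCAGAATGCTAAGG"
dna_b = "ATGAAGGCCGCATACGACTGTCGTGCTAGAATCCTGAGA"
protein_a, protein_b = translate(dna_a), translate(dna_b)

dna_matches = count_matches(dna_a, dna_b)              # 28 of 39 bases
protein_matches = count_matches(protein_a, protein_b)  # 11 of 13 residues

print(f"DNA:     {dna_matches}/{len(dna_a)} identical")
print(f"Protein: {protein_matches}/{len(protein_a)} identical")
```

Most of the DNA differences fall in third codon positions and are silent, so the protein comparison retains a higher fraction of identical positions than the DNA comparison.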

  32. Exercise 1: nucleotide vs amino acid search Go to the file example-sequences.html, and locate the section for this exercise. There should be two sequences: ‘surfeit1’ for frog and fly. Go to NCBI Blast home page, then ‘Align two sequences’ (bottom left ‘special’ panel), paste one sequence into each window and hit ‘Align’ – this will do a direct DNA/DNA comparison. Now find the open reading frames of the two genes, and translate them into amino acid protein sequences, then repeat the two sequences comparison. Go to NCBI ORF Finder – paste sequence – hit OrfFind – identify longest ORF – click on it – next screen, hit Accept – change View to Fasta protein – hit View – copy sequence to Blast2Seqs. Do the same with the other sequence. Before you hit ‘Align’ change the ‘Program’ (top left) to blastp…

  33. Answers: Exercise 1

  34. The Essential Task experiment data mining what is its function? gene sequence Cyclin-A FoxA1 database of proteins in other species cdc25 alpha-tubulin Predicted protein Gravin-like Gravin-like Sprouty-2 calmodulin KIAA10786568 we can only do this because of implied function based on orthology frizzled Wint8 Troponin T3

  35. Xenopus gene function unknown Functional Orthologs ? Human gene function known, annotation ‘Gravin’ available sequence similarity orthologs same function ? But we know that function is largely determined by shape similar shape? Which in general we cannot determine – but it is probably SHAPE not SEQUENCE that is conserved! We make an assumption that the same gene function is likely to be present in the two organisms, and the ones that have this function are likely to be the most similar in sequence

  36. Finding Orthologs So how do we find orthologs, and can we know when we have? The simplest approach is Reciprocal Best BLAST, but it implicitly relies on having all the protein sequences of your own organism, and of the one you wish to find an ortholog in. database of human proteins database of frog proteins best match human protein frog protein x
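
The reciprocal best hit logic can be sketched as follows. This is a minimal illustration under stated assumptions: the search results are assumed to be already collected into dictionaries of scores, and every protein name and score below is invented for the example, not real BLAST output. The function name `reciprocal_best_hits` is our own.

```python
def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return pairs (a, b) where b is the top-scoring hit for a
    and a is the top-scoring hit for b.

    Each argument maps a query protein to a {subject protein: score} dict.
    """
    best_a = {q: max(hits, key=hits.get) for q, hits in a_vs_b.items() if hits}
    best_b = {q: max(hits, key=hits.get) for q, hits in b_vs_a.items() if hits}
    return {(a, b) for a, b in best_a.items() if best_b.get(b) == a}


# Hypothetical scores, for illustration only.
human_vs_frog = {
    "human_CCNB1": {"frog_ccnb1": 650.0, "frog_ccnb2": 410.0},
    "human_CCNB2": {"frog_ccnb2": 600.0, "frog_ccnb1": 400.0},
    "human_X": {"frog_ccnb1": 90.0},
}
frog_vs_human = {
    "frog_ccnb1": {"human_CCNB1": 640.0, "human_CCNB2": 395.0},
    "frog_ccnb2": {"human_CCNB2": 610.0, "human_CCNB1": 405.0},
}

pairs = reciprocal_best_hits(human_vs_frog, frog_vs_human)
print(sorted(pairs))
```

In this toy data `human_X` finds `frog_ccnb1` as its best match, but the reverse search prefers `human_CCNB1`, so that pair is not reported. This is the ‘x’ case on the slide.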

  37. Using Synteny is Better We know that large regions of (say) vertebrate genomes have preserved their overall organisation from one organism to another. Mouse chromosome 10 Human chromosome 5 Mouse chromosome 2 And we find the same genes (i.e. orthologs!) in more or less the same order in the syntenic sections. These of course represent chromosomal re-arrangements since these organisms diverged.

  38. Metazome Fortunately someone has done all the hard work for us…. Dan Rokhsar http://www.metazome.net/

  39. Metazome Exercise Go back to Entrez Gene and look for your favourite gene again. Pick probable ortholog vertebrate genes from common organisms (human, mouse, rat, chicken, frog, fish) and paste their protein sequences into a temporary space. Go to Metazome (http://www.metazome.net/), find the blast window, open two versions of it, and blast your sequences against the Tetrapod or Jawed vertebrate node. See if you get the same cluster ID as best top hit and have a look at the Metazome alignment(s)…

  40. ATGCATGCTGCCAACGGATGTCCTG ATGCATGCTGGCCAACGGATGTCCTG ||| | || ||| ||| | |||||| ||| | || ||| | | | | ATGAAAGCCGCCTACGAAAGTCCTG ATGAAAGCCGCCTACGAAAGTCCTG Part 3: Finding Sequence Similarities We want computer programs which will compare sequences at all possible different alignments, looking for a degree of similarity greater than we would expect to find by chance. But first we have to consider the implication of gaps… Insertions and deletions are other possible forms of mutations and they can really mess up our simple alignments:

  41. Gaps in Alignments Consider these two obviously similar sequences: TTCCCAACTCTCCTCTTTCACCATGAAGCTCAAGGACAGATTCCACTCGCCCCAAAATCAAGCTCACCCCGTCCAAGAA | || | || |||||||||||||||||||| ||||||||| ||| ||| | ||| | | | TTCCCACCTCTCCTCTTTGCACCATGAAGCTCAAGGACAAATTCCACTCCCCCAAAATCAAGCGCACCCCGTCCCAGAA In fact we realise that the most probable alignment (regarding biological origin) is with a small gap in each sequence: TTCCCAACTCTCCTCTTT=CACCATGAAGCTCAAGGACAGATTCCACTCGCCCCAAAATCAAGCTCACCCCGTCCAAGAA |||||| ||||||||||| |||||||||||||||||||| ||||||||| |||||||||||||| |||||||||| |||| TTCCCACCTCTCCTCTTTGCACCATGAAGCTCAAGGACAAATTCCACTC=CCCCAAAATCAAGCGCACCCCGTCCCAGAA So in general we allow ourselves to insert gaps, until we find the optimal alignment. But where should this process stop?

  42. The Downside of Gaps Take two random sequences, with no ‘real’ similarity: GACACTAGGTCGATGCGTGGTGGCGAGA ACGCATCCGGATGTGCACCGTGGAACTG And allow ‘cost free’ gaps: GAC--ACT----AGGTCGATGC---GTGG---TGGCGAGA || | | | | | ||| |||| || ACGCA-TCCGGA--T-G-TGCACCGTGGAACTG Clearly, although the alignment has no mismatches, it is obviously not biologically meaningful! To prevent this we assign a cost to adding gaps which is offset against the benefit of finding matches – and this is the essence of ‘finding gapped alignments’. We want to find the ‘alignment’ between the two (or more) sequences which shows the greatest degree of similarity while introducing the fewest gaps …
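
To make the trade-off concrete, here is a sketch of a global alignment score computed by dynamic programming (the Needleman-Wunsch recurrence) with a simple linear gap cost. It is our own illustration and not the scoring scheme of any particular program: the function name and the values match = 1, mismatch = -1, gap = 2 are arbitrary assumptions.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=2):
    """Best global alignment score, charging `gap` for every gap position."""
    # previous[j] holds the best score for aligning the processed prefix of a with b[:j].
    previous = [-gap * j for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        current = [-gap * i]
        for j in range(1, len(b) + 1):
            diagonal = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            gap_in_b = previous[j] - gap      # a[i-1] aligned to a gap
            gap_in_a = current[j - 1] - gap   # b[j-1] aligned to a gap
            current.append(max(diagonal, gap_in_b, gap_in_a))
        previous = current
    return previous[-1]


# The two unrelated sequences from the slide.
seq_a = "GACACTAGGTCGATGCGTGGTGGCGAGA"
seq_b = "ACGCATCCGGATGTGCACCGTGGAACTG"

free_gaps = global_alignment_score(seq_a, seq_b, gap=0)
costly_gaps = global_alignment_score(seq_a, seq_b, gap=2)
print("score with free gaps:  ", free_gaps)
print("score with gap cost 2: ", costly_gaps)
```

With free gaps the score is simply the largest number of bases that can be matched in order, which is misleadingly high for unrelated sequences. Charging for gaps lowers the score.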

  43. BLAST There are many programs used to find similarities between sequences. They range from relatively slow programs which find the exact best matching alignment, through ones which take progressively inexact shortcuts to speed things up. Of this latter class, the best known, and easily most widely used is BLAST, developed by Stephen Altschul and others, and continuously refined over the last 10-15 years. The essential idea is to compare your query sequence against a collection or ‘database’ of target sequences, looking for the one(s) that match the query sequence the best. database >target1 AAAACAGGAATATTTACCGGGACCGGGTAATGATGCATCTCGAGGTACACAATATACCTG GAGAACCGAATTATGAGTTGGCCACCTTACTTAACGAAACCAGCAGAGAAAATCCAACAT GGCAACACCCCTCTGACTACACTAGAAGGAACTACTATGTAAGAAAACAGCCTGTCCCTT GCAGTTTGAATGACTGGGTGATGCGAAATGGGGGTCCTGCCATAGAGCGCTTCCATGGTT TACCTTGCACATTTCAGAGAAGTCCTATGCCAGGAGTCCTTCCTACAGGGCCTTCCTGAA ACTATATATGTGCTTATTCTTGTTTGATTTGGCTTTGCAG >target2 CTCTTAATTTATTTCTCTTCCTGCAGCTCCCTCGCTTTTTCCTTTCCCTGTTACATTCAT CTGACTTGAAGAGTTGCAAATTTTCAGTGTTTCTGTTTTTGTTGCTGATATGTTGTAAAC TTTTTAATAAAATCTATTTCTATAG >target3 GCAGTTTGAATGACTGGGTGATGCGAAATGGGGGTCCTGCCATAGAGCGCTTCCATGGTT TACCTTGCACATTTCAGAGAAGTCCTATGCCAGGAGTCCTTCCTACAGGGCCTTCCTGAA ACTATATATGTGCTTATTCTTGTTTGATTTGGCTTTGCAGCTAGGGTTTTCACCTTTTCT GGAAAAAAAAATACTGGCTTCC >target4 CTGCTATTAATGGGCAAAACAACTCAAATAAAGTCCCTCTGCCACCCTCAGACACTGCCC CTGGCCCCCAGCTGCCCGCTGATCCTTGTAGCCAGAGCAGTAAAGTTTTGAAAGTGGAGC CCAAGGAGAATAAAGTTATTAAAGAAACTGGCTTTGAACAAGGTGAAAAGTCTTGTGCAG CACCTCTAGATCATACTGTGAAGGAAAATCTTGGACAAACTTCTAAAGAACAGGTGGTAG query >query AGACGAACCTAGCACAAGCGCGTCTGGAAAGACCCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGTTCAGGAGTATTTGGACTGCAATATTGGCCCTCGTTCAAGGGCGCCTACCATCACCCGACGGTCATGCCGGTCCCCAGCAGCTGCTAATAACTTCCTTCGCTACTCAAGTTACCACGCTAGCAAAACCCACGGCATACCGTTTACCCTTTAAAATCAGCTTCAACCAGCAACGAA COMPARE LIST MATCHES

  44. query sequence other operation? database sequences Flavours of BLAST BLASTn ACGATAGATCCCATCCATAAAT ATGACGATAGATCCCATCAT CGATAGGACCACCACA GATAGACCAGGATACATAGGATAATTA AGCTCGCTTGGCTCGATGGCT FAST BLASTp MKJLSPWERSYTRGHYTWER MGHTVNBZY MKLPWRHGDBKJGMNDFD MBKLRPIUHDFRTASGSLKWWRTVBN MQWCGYRWTYQGYRW FAST BLASTx ACGATAGATCCCATCCATAAAT MKJLSPWERSYTRGHYTWER MGHTVNBZY MKLPWRHGDBKJGMNDFD MBKLRPIUHDFRTASGSLKWWRTVBN 6 frame translation SLOW tBLASTn ATGACGATAGATCCCATCAT CGATAGGACCACCACA GATAGACCAGGATACATAGGATAATTA AGCTCGCTTGGCTCGATGGCT MQWCGYRWTYQGYRW SLOWER tBLASTx HORRIBLY SLOW! ACGATAGATCCCATCCATAAAT ATGACGATAGATCCCATCAT CGATAGGACCACCACA GATAGACCAGGATACATAGGATAATTA AGCTCGCTTGGCTCGATGGCT
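
The ‘6 frame translation’ step used by BLASTx, tBLASTn and tBLASTx can be sketched like this. It is a minimal illustration of the idea, assuming the standard genetic code; it is not BLAST's own code, and the function names are hypothetical. The DNA is translated in three reading frames on the forward strand and three on the reverse complement.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... ('*' = stop).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(dna):
    """Complement each base, then reverse the sequence."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def six_frame_translations(dna):
    """Return translations for frames +1, +2, +3, -1, -2, -3."""
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            codons = (seq[i:i + 3] for i in range(offset, len(seq) - 2, 3))
            frames[f"{strand}{offset + 1}"] = "".join(CODON_TABLE[c] for c in codons)
    return frames


# The nucleotide query shown on the slide.
query = "ACGATAGATCCCATCCATAAAT"
frames = six_frame_translations(query)
for name, peptide in frames.items():
    print(name, peptide)
```

Every nucleotide sequence becomes six protein sequences to compare, which is one reason the translated searches are marked as slower on the slide.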

  45. How does it work? The main task of any sequence comparison program is to test all possible mutual alignments of two sequences and see how good the match is: query CCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGTTC CCGAGCTTCTCATTGCTCTTCCTAACAGTG=TGATAGGCTAACCGTAATGGCGTTC ||||||||||||||||||||||||| |||||||||||||||||||||||| | | | | | ||||||||||||||||||||||||| |||||||||||||||||||||||| | | | | | | || | | | | | | | | CTTCTCATTGCTCTTCCTAACAGTGATGATAGGCTAACCGTAATGGCGTTCAGGAGT CTTCTCATTGCTCTTCCTAACAGTGATGATAGGCTAACCGTAATGGCGTTCAGGAGT CTTCTCATTGCTCTTCCTAACAGTGATGATAGGCTAACCGTAATGGCGTTCAGGAGT 1st database sequence This would actually be a very slow search process if implemented like this… BLAST achieves its speed through two strategies: - it takes a WORD based approach - it pre-INDEXES database sequences

  46. BLAST: WORDS and INDEXING 1 GACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA 2 TAAGCAAATTTAATTTTGTTTACATTTTC 3 GTTAAGACCTTCCCTGACATTTGCAGCAGTTTCAAATGTA Database of sequences Numbered list of all possible ‘words’ AAAAAAAA 00001 AAAAAAAC 00002 AAAAAAAG 00003 : ACAAATCC 07967 ACAAATCC 07968 ACAAATCC 07979 : GACAAATC 33568 GACAAATG 33569 : TCCAAACC 64321 TCCAAACC 64322 : Build a position index of all words in the database sequence position word 1 1 33658 1 2 07967 1 3 16210 : 3 15 33568 3 16 07967 :

  47. Analyse the Query Sequence >query AGACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA QUERY SEQUENCE Numbered list of all possible ‘words’ Analyse QUERY SEQUENCE AAAAAAAA 00001 AAAAAAAC 00002 AAAAAAAG 00003 : ACAAATCC 07967 ACAAATCC 07968 ACAAATCC 07979 : GACAAATC 33568 GACAAATG 33569 : TCCAAACC 64321 TCCAAACC 64322 : position word 1 14236 2 33658 3 07967 : Index of database sequence position word 1 1 33658 1 2 07967 1 3 16210 : 3 15 33568 3 16 07967 :
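
The word-and-index idea from the last two slides can be sketched with a plain dictionary. This is a simplified illustration, not BLAST's actual data structure: words are stored as strings rather than numbers, the word size of 8 follows the slide, and the function names are our own.

```python
from collections import defaultdict


def build_word_index(database, word_size):
    """Map every word in the database to the (sequence id, position) pairs where it occurs."""
    index = defaultdict(list)
    for seq_id, seq in database.items():
        for pos in range(len(seq) - word_size + 1):
            index[seq[pos:pos + word_size]].append((seq_id, pos))
    return index


def find_seeds(query, index, word_size):
    """Look up each query word; return (sequence id, query position, database position)."""
    seeds = []
    for qpos in range(len(query) - word_size + 1):
        for seq_id, dpos in index.get(query[qpos:qpos + word_size], []):
            seeds.append((seq_id, qpos, dpos))
    return seeds


# Database and query sequences from the slides.
database = {
    1: "GACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA",
    2: "TAAGCAAATTTAATTTTGTTTACATTTTC",
    3: "GTTAAGACCTTCCCTGACATTTGCAGCAGTTTCAAATGTA",
}
query = "AGACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA"

WORD_SIZE = 8
index = build_word_index(database, WORD_SIZE)  # built once, reused for every query
seeds = find_seeds(query, index, WORD_SIZE)

# Seeds sharing the same (query position - database position) lie on one
# diagonal, i.e. they belong to the same ungapped alignment.
on_diagonal = [s for s in seeds if s[0] == 1 and s[1] - s[2] == 1]
print(len(on_diagonal), "seeds place the query one base ahead of sequence 1")
```

The index is the expensive part, but it is built only once per database. Each query then costs one dictionary lookup per word instead of a scan of every database sequence.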

  48. Expand from Word Based Matches We ‘instantly’ know which sequences in the database have at least a word length match with our query sequence, and at what relative position. Next, the potential alignments are expanded, adding up a score for (total matches – mismatches – gap penalties), to make the best possible alignment. But this is usually for a tiny proportion of the sequences in the database – so overall it is much quicker. The highest scoring alignments are reported. But we can potentially miss alignments that have no word-sized stretch in common. Consider BLASTn with a default word-size of 11: TCGGAAGTGGAAGCTGAACCTGATTGTAGAGTTGGAGGCCAGTGTTCTGGCTGAGC ||||||||| ||||| |||||||||| |||||||||| |||| ||||| ||||||| TCGGAAGTGTAAGCTCAACCTGATTGCAGAGTTGGAGTCCAGAGTTCTAGCTGAGC Care is sometimes needed…
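
A quick check of the example above, as a sketch with a hypothetical helper name: the Python below measures the runs of consecutive identical bases along the alignment shown on the slide.

```python
def match_run_lengths(a, b):
    """Lengths of consecutive runs of identical positions in an ungapped comparison."""
    runs = []
    current = 0
    for x, y in zip(a, b):
        if x == y:
            current += 1
        else:
            if current:
                runs.append(current)
            current = 0
    if current:
        runs.append(current)
    return runs


# The two sequences from the slide.
seq_a = "TCGGAAGTGGAAGCTGAACCTGATTGTAGAGTTGGAGGCCAGTGTTCTGGCTGAGC"
seq_b = "TCGGAAGTGTAAGCTCAACCTGATTGCAGAGTTGGAGTCCAGAGTTCTAGCTGAGC"

runs = match_run_lengths(seq_a, seq_b)
print(runs)       # [9, 5, 10, 10, 4, 5, 7]
print(max(runs))  # 10
```

The sequences agree at 50 of 56 positions, yet the longest exact run along this alignment is 10 bases, one short of the word size of 11, so no word from the alignment would act as a seed.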

  49. BLAST –Typical Output INPUT: >partial cDNA sequence, Xenopus tropicalis CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCGTTCCCACCTCTCCTCTTTCACCATGAAGCTCAAGGACAAATTCCACTCCCCCAAAATCAAGCGCACCCCGTCCAAGAAGGGGAAGCCGGCCGACCTCACCGTCAAAACAGAAGAGAAACCCGTCAACAAAACCTTAAGCCGCTTGGAGGAACAGGAGAAAGAAGTCGTTAATGCCTTGCGTTACTTTAAGACAATTGTTGACAAGATGGCGGTGGACAAGATGGTGCTGGTGATGCTGCCAGGGTCGGCGA OUTPUT: Query= (311 letters) Database: NCBI Protein Reference Sequences 954,378 sequences; 347,895,532 total letters >gi|41055060|ref|NP_957420.1|   similar to guanine nucleotide-releasing factor 2 (specific for crk proto-oncogene) [Danio rerio] Length=691 Score = 133 bits (335) Expect = 6e-31 Identities = 76/98 (77%) Positives = 82/98 (83%) Gaps = 4/98 (4%) Frame = +2 Query 26 MSGKIE-KADSQRSHLSSFTMKLKDKFHSPKIKRTPSKKGKPA--DLTVKTEEKPVNKTL 196 MSGKIE K +SQ+SHLSSFTMKL KFHSPKIKRTPSKKGK + VKT EKPVNK + Sbjct 1 MSGKIESKHESQKSHLSSFTMKLM-KFHSPKIKRTPSKKGKQLQPEPAVKTPEKPVNKKV 59 Query 197 SRLEEQEKEVVNALRYFKTIVDKMAVDKMVLVMLPGSA 310 SRLEEQEK+VV+ALRYFKTIVDKM VD VL MLPGSA Sbjct 60 SRLEEQEKDVVSALRYFKTIVDKMNVDTKVLQMLPGSA 97

  50. When is a match significant? Here is a ‘typical’ weak alignment from BLASTp: RF---KISDCQHPCTYSH-NQYMTNHMREC----PYNGAATSIPSWHLIVHPSNGQSVSFPQSDPCQIKMNQNLHLVQMMYDMQTTHV F K S+ + C + + N Y N +C P+ + +W +P + D I N M ++ NFSWKKTSEKETNCQFDYPNDY--NEQTQCQPMTPFKADVFDLWNWEFNANPKLENGIRDLIDDKHDILQIFN------MCWLEVNSS In fact the sequences were randomly generated, so there is no biologically significant alignment… RFKISDCQHPCTYSHNQYMTNHMRECPYNGAATSIPSWHLIVHPSNGQSVSFPQSDPCQIKMNQNLHLVQMMYDMQTTHV NFSWKKTSEKETNCQFDYPNDYNEQTQCQPMTPFKADVFDLWNWEFNANPKLENGIRDLIDDKHDILQIFNMCWLEVNSS
