Introduction to NCBI & Ensembl tools including BLAST and database searching

Introduction to NCBI & Ensembl tools including BLAST and database searching. Incorporating Bioinformatics into the High School Biology Curriculum. Fran Lewitter, Ph.D. Founding Director, Bioinformatics & Research Computing Whitehead Institute for Biomedical Research


Presentation Transcript


  1. Introduction to NCBI & Ensembl tools including BLAST and database searching
     Incorporating Bioinformatics into the High School Biology Curriculum
     Fran Lewitter, Ph.D., Founding Director, Bioinformatics & Research Computing, Whitehead Institute for Biomedical Research
     Lewitter (-AT--) iscb.org
     http://www.iscb.org/iscb-education-home

  2. What I hope you’ll learn
     • What do we learn from database searching and sequence alignments?
     • What tools are available at NCBI?
     • What tools are available through Ensembl?
     Lewitter-ISCB High School Teachers Workshop – 7/11/14

  3. First, some background info
     It’s all about alignments.


  5. Simian sarcoma virus onc gene, v-sis, is derived from the gene (or genes) encoding a platelet-derived growth factor.
     Doolittle RF, Hunkapiller MW, Hood LE, Devare SG, Robbins KC, Aaronson SA, Antoniades HN. Science 221:275-277, 1983.

  6. Why do alignments?
     • Use sequence similarity to infer homology and/or structural similarity between two or more genes/proteins
     • Identify more conserved regions of a protein, potentially identifying regions of most functional importance
     • Compare and contrast homologs (perhaps into groups) based on shared positions or regions
     • Infer evolutionary distance from sequence dissimilarity

  7. Evolutionary Basis of Sequence Alignment
     • Similarity - an observable quantity, such as percent identity
     • Homology - a conclusion drawn from data that two genes share a common evolutionary history; no metric is associated with this
     • Paralog - genes related by duplication
     • Ortholog - genes related by speciation

  8. Local vs Global Alignment
     [Figure: example of a local alignment and a global alignment. From Mount, Bioinformatics, 2004, pg 71]

  9. Nucleotide vs Protein
     • If comparing protein-coding genes, use protein sequences because there is less noise
     • If protein sequences are very similar, it might be more instructive to use DNA sequences

  10. Example of a simple scoring system for nucleic acids
      • Match = +1 (e.g., A-A, T-T, C-C, G-G)
      • Mismatch = -1 (e.g., A-T, A-C)
      • Gap opening = -5
      • Gap extension = -2

      T  C  A  G  A  C  G  A  G  T  G
      T  C  G  G  A  -  -  G  C  T  G
      +1 +1 -1 +1 +1 -5 -2 -1 -1 +1 +1 = -4

  11. Possible Alignments
      A: T C A G A C G A G T G
      B: T C G G A G C T G

      I.   TCAGACGAGTG
           TCGGA--GCTG     score = -4
      II.  TCAGACGAGTG
           TCGGA-GC-TG     score = -5
      III. TCAGACGAGTG
           TCGGA-G-CTG     score = -5
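
The scoring on slides 10 and 11 can be reproduced with a short Python sketch. It assumes, as the worked example on slide 10 suggests, that the first position of a gap costs the gap-opening penalty and each further position in the same gap costs the extension penalty. The function name `score_alignment` is made up for illustration.

```python
def score_alignment(top, bottom, match=1, mismatch=-1, gap_open=-5, gap_extend=-2):
    """Score a pairwise alignment given as two equal-length gapped strings."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    score = 0
    previous_gap_row = None  # row ("top"/"bottom") that had a gap in the previous column
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            row = "top" if a == "-" else "bottom"
            # First column of a gap pays the opening penalty, later columns the extension.
            score += gap_extend if previous_gap_row == row else gap_open
            previous_gap_row = row
        else:
            score += match if a == b else mismatch
            previous_gap_row = None
    return score


sequence_a = "TCAGACGAGTG"
for label, aligned_b in [("I", "TCGGA--GCTG"), ("II", "TCGGA-GC-TG"), ("III", "TCGGA-G-CTG")]:
    print(label, score_alignment(sequence_a, aligned_b))
```

Alignment I scores best because it opens one gap of length two, while II and III each open two separate gaps.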

  12. Scoring for Protein Alignments - Amino Acid Substitution Matrices
      • PAM - point accepted mutation, based on global alignment [evolutionary model]
      • BLOSUM - block substitutions, based on local alignments [similarity among conserved sequences]

  13. AA Scoring Matrices
      • PAM - point accepted mutation, based on global alignment [evolutionary model]
      • BLOSUM - block substitutions, based on local alignments [similarity among conserved sequences]

      $$\text{Log-odds} = \log \frac{\text{pair in homologous proteins}}{\text{pair in unrelated proteins by chance}}$$

      $$\text{Log-odds} = \log \frac{\text{observed frequency of aa substitutions}}{\text{frequency expected by chance}}$$
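
As a rough illustration of the log-odds idea, the sketch below turns an observed pair frequency into an integer score. Everything numeric here is an assumption made for the example: the frequencies are invented, the chance expectation is simplified to the product of the two background frequencies, and the factor of 2 with a base-2 logarithm mimics the half-bit units in which BLOSUM62 is usually reported.

```python
import math


def log_odds_score(observed_pair_freq, freq_a, freq_b, scale=2.0):
    """Rounded, scaled log2 of (observed pair frequency / frequency expected by chance)."""
    expected_by_chance = freq_a * freq_b  # simplifying assumption for this sketch
    return round(scale * math.log2(observed_pair_freq / expected_by_chance))


# Hypothetical frequencies, not taken from any real matrix:
print(log_odds_score(0.02, 0.1, 0.1))   # pair seen twice as often as chance -> positive score
print(log_odds_score(0.005, 0.1, 0.1))  # pair seen half as often as chance -> negative score
```

A positive score means the pair occurs in related proteins more often than chance predicts; a negative score means it occurs less often.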

  14. Substitution Matrices
      BLOSUM (% identity):   BLOSUM 30      BLOSUM 62      BLOSUM 80
      PAM (% change):        PAM 250 (80)   PAM 120 (66)   PAM 90 (50)
      Increasing similarity from left to right

  15. Scoring for BLAST Alignments
      Score = 94.0 bits (230), Expect = 6e-19
      Identities = 45/101 (44%), Positives = 54/101 (52%), Gaps = 7/101 (6%)

      Query: 204 YTGPFCDV----DTKASCYDGRGLSYRGLARTTLSGAPCQPWASEATYRNVTAEQ---AR 256
                 Y+  FC      +  + CY G G +YRG    T SGA C PW S      V   Q   A+
      Sbjct: 198 YSSEFCSTPACSEGNSDCYFGNGSAYRGTHSLTESGASCLPWNSMILIGKVYTAQNPSAQ 257

      Query: 257 NWGLGGHAFCRNPDNDIRPWCFVLNRDRLSWEYCDLAQCQT 297
                   GLG H +CRNPD D +PWC VL   RL+WEYCD+  C T
      Sbjct: 258 ALGLGKHNYCRNPDGDAKPWCHVLKNRRLTWEYCDVPSCST 298

      Scores based on BLOSUM62:
      Position 1:  Y - Y = 7
      Position 2:  T - S = 1
      Position 3:  G - S = 0
      Position 4:  P - E = -1
      . . .
      Position 9:  - - P = -11
      Position 10: - - A = -1
      . . .
      Sum = 230
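
The "Identities" and "Gaps" figures in the BLAST report on slide 15 can be recounted directly from the two aligned rows. The sketch below does only that; it does not recompute "Positives" or the raw score, since those need the full BLOSUM62 matrix. The function name `alignment_stats` is made up for illustration, and the percentages are truncated to whole numbers to match the report.

```python
def alignment_stats(query, subject):
    """Return (identities, gaps, alignment_length) for two equal-length gapped rows."""
    if len(query) != len(subject):
        raise ValueError("aligned rows must have the same length")
    identities = sum(1 for q, s in zip(query, subject) if q == s and q != "-")
    gaps = sum(1 for q, s in zip(query, subject) if q == "-" or s == "-")
    return identities, gaps, len(query)


# Aligned rows copied from slide 15 (both blocks joined end to end).
query = (
    "YTGPFCDV----DTKASCYDGRGLSYRGLARTTLSGAPCQPWASEATYRNVTAEQ---AR"
    "NWGLGGHAFCRNPDNDIRPWCFVLNRDRLSWEYCDLAQCQT"
)
subject = (
    "YSSEFCSTPACSEGNSDCYFGNGSAYRGTHSLTESGASCLPWNSMILIGKVYTAQNPSAQ"
    "ALGLGKHNYCRNPDGDAKPWCHVLKNRRLTWEYCDVPSCST"
)

identities, gaps, length = alignment_stats(query, subject)
print(f"Identities = {identities}/{length} ({100 * identities // length}%)")
print(f"Gaps = {gaps}/{length} ({100 * gaps // length}%)")
```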

  16. Statistical Significance
      • Raw score - the score of an alignment, equal to the sum of substitution and gap scores.
      • Bit score - a scaled version of an alignment’s raw score that accounts for the statistical properties of the scoring system used.
      • E-value - the expected number of distinct alignments that would achieve a given score by chance. Lower E-value => more significant.

  17. A formula
      $E = K m n \, e^{-\lambda S}$
      This is the expected number of high-scoring segment pairs (HSPs) with score at least S for sequences of length m and n. This is the E-value for the score S. The parameters K and $\lambda$ can be thought of simply as natural scales for the search space size and the scoring system, respectively.
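
To make the formula on slide 17 concrete, here is a small Python sketch. The values of K, lambda, m and n below are illustrative assumptions, not numbers from the slides. The bit-score conversion S' = (lambda*S - ln K) / ln 2 is the usual Karlin-Altschul normalization; with it, the same E-value can be written as m * n * 2^(-S').

```python
import math


def e_value(raw_score, m, n, K, lam):
    """Expected number of HSPs scoring at least raw_score: E = K*m*n*exp(-lambda*S)."""
    return K * m * n * math.exp(-lam * raw_score)


def bit_score(raw_score, K, lam):
    """Normalized score in bits: S' = (lambda*S - ln K) / ln 2."""
    return (lam * raw_score - math.log(K)) / math.log(2)


# Illustrative (assumed) parameters and search-space sizes:
K, lam = 0.041, 0.267
m, n = 300, 1_000_000

for raw in (40, 80, 230):
    print(raw, round(bit_score(raw, K, lam), 1), e_value(raw, m, n, K, lam))
```

Because the score sits in the exponent, each additional bit halves the expected number of chance hits, which is why high-scoring alignments reach very small E-values.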

  18. What’s significant?
      • High confidence - >40% identity for long alignments (Rost, 1999, found that sequence alignments unambiguously distinguish between protein pairs of similar and non-similar structure when the pairwise sequence identity is >40%)
      • “Twilight zone” - blurry - 20-35% identity
      • “Midnight zone” - <20% identity

  19. Tools of Interest
      • NCBI (US) - http://www.ncbi.nlm.nih.gov/
        • BLAST
        • Entrez
      • EBI/EMBL/WTSI (Europe)
        • Ensembl (http://www.ensembl.org/)

  20. NCBI Home Page
      http://www.ncbi.nlm.nih.gov

  21. Ensembl Home Page
      http://ensembl.org

  22. Hands-On
      • Entrez - finding sequences
      • BLAST - look for similar sequences
      • Ensembl - finding sequences
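
The hands-on session uses the web interfaces, but the same lookups can also be scripted. The sketch below only builds request URLs and does not contact any server. It assumes the NCBI E-utilities `esearch` endpoint and the Ensembl REST `lookup/symbol` endpoint; the search term and gene symbol are arbitrary examples, and the helper names are made up for illustration.

```python
from urllib.parse import quote, urlencode

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"
ENSEMBL_REST = "https://rest.ensembl.org"


def entrez_search_url(db, term, retmax=5):
    """Build an Entrez esearch URL for a database (e.g. 'protein') and a search term."""
    params = {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


def ensembl_lookup_url(species, symbol):
    """Build an Ensembl REST URL that looks up a gene by species and symbol."""
    return (
        f"{ENSEMBL_REST}/lookup/symbol/{quote(species)}/{quote(symbol)}"
        "?content-type=application/json"
    )


print(entrez_search_url("protein", "hemoglobin"))
print(ensembl_lookup_url("homo_sapiens", "BRCA2"))
```

Pasting either URL into a browser should return a machine-readable response, which can be a useful bridge between clicking through the websites and writing scripts.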

  23. How is BLAST used?
      • Looking for homologs
      • Identifying domains in proteins
      • Peptide identification
      • Genome annotation

  24. Remember
      • Don’t be afraid to look at help pages for each website
      • Click, click, click
      • Any analysis tool will give you results
      • Interpret results you find

  25. Sequence of Note
      Biotechniques, 1992
         1 GCGTTGCTGG CGTTTTTCCA TAGGCTCCGC
        31 CCCCCTGACG AGCATCACAA AAATCGACGC
        61 GGTGGCGAAA CCCGACAGGA CTATAAAGAT
           …………..
      1371 GTAAAGTCTG GAAACGCGGA AGTCAGCGCC
      “Here you see the actual structure of a small fragment of dinosaur DNA,” Wu said. “Notice the sequence is made up of four compounds - adenine, guanine, thymine and cytosine. This amount of DNA probably contains instructions to make a single protein - say, a hormone or an enzyme. The full DNA molecule contains three billion of these bases. If we looked at a screen like this once a second, for eight hours a day, it’d still take more than two years to look at the entire DNA strand. It’s that big.” (page 103) (Published in 1990)

  26. DinoDNA
      "Dinosaur DNA" from Crichton's THE LOST WORLD, p. 135 (Published in 1995)
      GAATTCCGGAAGCGAGCAAGAGATAAGTCCTGGCATCAGATACAGTTGGAGATAAGGACGACGTGTGGCAGCTCCCGCAGAGGATTCACTGGAAGTGCATTACCTATCCCATGGGAGCCATGGAGTTCGTGGCGCTGGGGGGGCCGGATGCGGGCTCCCCCACTCCGTTCCCTGATGAAGCCGGAGCCTTCCTGGGGCTGGGGGGGGGCGAGAGGACGGAGGCGGGGGGGCTGCTGGCCTCCTACCCCCCCTCAGGCCGCGTGTCCCTGGTGCCGTGGCAGACACGGGTACTTTGGGGACCCCCCAGTGGGTGCCGCCCGCCACCCAAATGGAGCCCCCCCACTACCTGGAGCTGCTGCAACCCCCCCGGGGCAGCCCCCCCCATCCCTCCTCCGGGCCCCTACTGCCACTCAGCAGCGCCTGCGGCCTCTACTACAAAC……

  27. >Erythroid transcription factor (NF-E1 DNA-binding protein)

      Query: 121  MEFVALGGPDAGSPTPFPDEAGAFLGLGGGERTEAGGLLASYPPSGRVSLVPWADTGTLG 300
                  MEFVALGGPDAGSPTPFPDEAGAFLGLGGGERTEAGGLLASYPPSGRVSLVPWADTGTLG
      Sbjct: 1    MEFVALGGPDAGSPTPFPDEAGAFLGLGGGERTEAGGLLASYPPSGRVSLVPWADTGTLG 60

      Query: 301  TPQWVPPATQMEPPHYLELLQPPRGSPPHPSSGPLLPLSSGPPPCEARECVMARKNCGAT 480
                  TPQWVPPATQMEPPHYLELLQPPRGSPPHPSSGPLLPLSSGPPPCEARECV    NCGAT
      Sbjct: 61   TPQWVPPATQMEPPHYLELLQPPRGSPPHPSSGPLLPLSSGPPPCEARECV----NCGAT 116

      Query: 481  ATPLWRRDGTGHYLCNWASACGLYHRLNGQNRPLIRPKKRLLVSKRAGTVCSHERENCQT 660
                  ATPLWRRDGTGHYLCN   ACGLYHRLNGQNRPLIRPKKRLLVSKRAGTVCS    NCQT
      Sbjct: 117  ATPLWRRDGTGHYLCN---ACGLYHRLNGQNRPLIRPKKRLLVSKRAGTVCS----NCQT 169

      Query: 661  STTTLWRRSPMGDPVCNNIHACGLYYKLHQVNRPLTMRKDGIQTRNRKVSSKGKKRRPPG 840
                  STTTLWRRSPMGDPVCN   ACGLYYKLHQVNRPLTMRKDGIQTRNRKVSSKGKKRRPPG
      Sbjct: 170  STTTLWRRSPMGDPVCN---ACGLYYKLHQVNRPLTMRKDGIQTRNRKVSSKGKKRRPPG 226

      Query: 841  GGNPSATAGGGAPMGGGGDPSMPPPPPPPAAAPPQSDALYALGPVVLSGHFLPFGNSGGF 1020
                  GGNPSATAGGGAPMGGGGDPSMPPPPPPPAAAPPQSDALYALGPVVLSGHFLPFGNSGGF
      Sbjct: 227  GGNPSATAGGGAPMGGGGDPSMPPPPPPPAAAPPQSDALYALGPVVLSGHFLPFGNSGGF 286

      Query: 1021 FGGGAGGYTAPPGLSPQI 1074
                  FGGGAGGYTAPPGLSPQI
      Sbjct: 287  FGGGAGGYTAPPGLSPQI 304
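
The query coordinates on slide 27 advance by three nucleotides per amino acid, which indicates that the DNA from slide 26 was translated before being compared with proteins. As a minimal sketch of that translation step, the code below translates a short fragment that appears in the slide 26 sequence, using the standard genetic code. The helper name `translate` is made up for illustration, and only one reading frame is handled.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG ("*" = stop).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna):
    """Translate a DNA string codon by codon, starting at its first base."""
    dna = dna.upper()
    usable_length = len(dna) - len(dna) % 3  # ignore a trailing partial codon
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable_length, 3))


# Fragment copied from the slide 26 sequence, starting at an ATG codon.
fragment = "ATGGAGTTCGTGGCGCTGGGG"
print(translate(fragment))
```

The output matches the first seven residues of the aligned protein on slide 27 (MEFVALG), a small check students can repeat by hand with a codon table.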
