
Li-San Wang Penn Center for Bioinformatics University of Pennsylvania lswang@maild.upenn

Introduction to NCBI (and Other Online Bioinformatics Resources) Society for Developmental Biology 2008. Li-San Wang Penn Center for Bioinformatics University of Pennsylvania lswang@mail.med.upenn.edu http://people.pcbi.upenn.edu/~lswang/. Outline.



Presentation Transcript


  1. Introduction to NCBI (and Other Online Bioinformatics Resources). Society for Developmental Biology 2008. Li-San Wang, Penn Center for Bioinformatics, University of Pennsylvania. lswang@mail.med.upenn.edu http://people.pcbi.upenn.edu/~lswang/

  2. Outline • Introduction of the NCBI databases and web services • Introduction to some concepts in bioinformatics • Hands-on experience • Other online resources: • UCSC Genome Browser and NIAID DAVID

  3. http://www.ncbi.nlm.nih.gov/

  4. http://www.ncbi.nlm.nih.gov/Database/datamodel/index.html (Nov 2004)

  5. Entities Genome, Chromosome Gene, Exon, Intron Protein, Domain, SNP … Relations Homology Taxonomy Ontology OMIM etc. Annotations Phenotype Publication Gene Expression

  6. Some Common Tasks • Find information about a gene/genome, etc. • Find homologs • Find genes related to a phenotype • Find similar sequences to an input sequence (BLAST)

  7. NCBI Entrez (Google “Entrez”) http://www.ncbi.nlm.nih.gov/sites/gquery

  8. Accession Numbers • Example: TP53 • mRNA NM_000546.4 → protein NP_000537.3 (tumor protein p53 isoform a) • NM_000546.4 gi:187830767 • NP_000537.3 gi:120407068 http://www.ncbi.nlm.nih.gov/Sequin/acc.html
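The accession.version pattern on this slide can be taken apart programmatically. The following is a minimal sketch, assuming RefSeq-style identifiers of the form two letters, an underscore, digits, a dot, and a version number. The function name `split_accession` and the two-entry prefix table are illustrative assumptions, not part of any NCBI API, and the table is far from exhaustive.

```python
import re

# Illustrative, non-exhaustive table of RefSeq prefixes (assumption for this sketch).
REFSEQ_PREFIXES = {
    "NM_": "mRNA",
    "NP_": "protein",
}

# Two uppercase letters + underscore, digits, a dot, then the version number.
ACCESSION_RE = re.compile(r"^([A-Z]{2}_)(\d+)\.(\d+)$")


def split_accession(acc_version):
    """Split e.g. 'NM_000546.4' into accession, version, and molecule type."""
    match = ACCESSION_RE.match(acc_version.strip())
    if match is None:
        raise ValueError(f"not a RefSeq-style accession.version: {acc_version!r}")
    prefix, digits, version = match.groups()
    return {
        "accession": prefix + digits,
        "version": int(version),
        "molecule": REFSEQ_PREFIXES.get(prefix, "unknown"),
    }


if __name__ == "__main__":
    for acc in ("NM_000546.4", "NP_000537.3"):
        print(split_accession(acc))
```

The version number increases when the underlying sequence record is revised, so keeping accession and version separate makes it easier to notice when two analyses used different revisions of the same record.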

  9. File Format Fasta GFF XML

  10. NCBI Entrez Gene (previously LocusLink) http://www.ncbi.nlm.nih.gov/sites/entrez

  11. Add Limits in Your Query

  12. Exercise (NCBI Minicourse) • Retrieve human entries related to "prion protein" in Entrez Gene. • Name the map location of this gene on the human genome. • What is the function of this protein? • What are the alternate gene symbols? • Name the phenotypes associated with the mutations in this gene. • How many alternatively spliced products have been annotated for the gene?
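The first step of this exercise can also be expressed as a programmatic Entrez search. The sketch below only builds the request URL and does not contact NCBI. It assumes the E-utilities `esearch.fcgi` endpoint and its `db`, `term`, and `retmax` parameters as they exist today, which postdates the web forms shown in these slides; the helper name `build_esearch_url` and the exact query wording are assumptions made for illustration.

```python
from urllib.parse import urlencode

# Base URL of the E-utilities search endpoint (assumed current location).
ESEARCH_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(db, term, retmax=20):
    """Return an esearch URL for the given Entrez database and query term."""
    params = {"db": db, "term": term, "retmax": retmax}
    # urlencode percent-escapes quotes, brackets, and spaces in the term.
    return ESEARCH_BASE + "?" + urlencode(params)


if __name__ == "__main__":
    # Hypothetical query for the exercise: human entries about "prion protein".
    query = '"prion protein" AND human[Organism]'
    print(build_esearch_url("gene", query))
```

Opening the printed URL in a browser should return a list of matching record identifiers, which can then be inspected in Entrez Gene as the exercise describes.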

  13. Entrez Gene and dbSNP • Retrieve human prion protein by Entrez Gene (PRNP) • Identify the variations annotated on this gene by clicking on the SNP:geneView. • How many of them are nonsynonymous changes? • Are there known SNPs in the coding region of a gene associated with any phenotype?

  14. NCBI Map Viewer

  15. Exercise • Find human GDNF on Map Viewer • Download the gene sequence and 5kb upstream by using the "dl" link. • Add the Component and Contig maps for this region. Name the contig and GenBank accession numbers for the sequence covering this region. Are the sequences finished? • Add the Ab initio (model) and Transcript (RNA) maps. How many alternatively spliced transcripts have been annotated for the gene? • Display the current data as "Data As Table View". • Add the phenotype map. Name the disease with which the GDNF gene is associated. Obtain more information about the disease by linking to the corresponding OMIM record.

  16. NCBI Genome and Genome Project http://www.ncbi.nlm.nih.gov/Genomes/

  17. Relations Between Sequence Data • Gene • UniGene • HomoloGene • Taxonomy

  18. UniGene http://www.ncbi.nlm.nih.gov/UniGene/clust.cgi?UGID=2723799&TAXID=9606&SEARCH=tp53%20AND%20human%20[organism]

  19. HomoloGene

  20. Taxonomy

  21. Exercise • Locate chimpanzee using TaxBrowser. What is its lineage? How many sub-species are there? • How many genome projects are under Mammalia class? • Find the common tree of the following species: • Human/Chimp/Dog/Horse/Mouse/Rat/Chicken/Zebrafish • Which of mouse or dog is closer to human? • Which species diverged earliest from the human lineage?

  22. CDD

  23. Example Query • Gene: Prion Protein (PRNP) (or your preferred gene) • How many proteins does the gene encode? • What proteins in other organisms are homologous to this protein? • What are the domains in the protein? Find a sequence alignment to its homologs • View the conserved regions on the 3D structure (download NCBI CN3D)

  24. GEO http://www.ncbi.nlm.nih.gov/geo/

  25. OMIM (Online Mendelian Inheritance in Man)

  26. Examples • What human genes are related to hypertension? Which of those genes are on chromosome 17? • List the OMIM entries that describe genes on chromosome 10. • List the OMIM entries that contain information about allelic variants. • Retrieve the OMIM record for the cystic fibrosis transmembrane conductance regulator (CFTR), and link to related protein sequence records via Entrez. • Find the OMIM record for the p53 tumor protein, and link out to related information in Entrez Gene and the p53 Mutation Database. http://www.ncbi.nlm.nih.gov/Omim/omimhelp.html#SampleQuestions

  27. Complex Queries cancer[titl] AND 11[chrom] AND autosomal dominant [clin] • The Boolean operators, AND, OR, NOT, should be written in upper case. Use parentheses for precedence. • Search field tags are enclosed in square brackets
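The query rules on this slide (upper-case Boolean operators, parentheses for precedence, field tags in square brackets) can be captured in a few small helpers. This is a minimal sketch; the helper names `tag`, `all_of`, and `any_of` are made up for illustration, and the field tags are simply the ones shown on the slide.

```python
def tag(term, field):
    """Attach a search field tag in square brackets, e.g. cancer[titl]."""
    return f"{term}[{field}]"


def all_of(*clauses):
    """Join clauses with an upper-case AND."""
    return " AND ".join(clauses)


def any_of(*clauses):
    """Join clauses with an upper-case OR, parenthesized for precedence."""
    return "(" + " OR ".join(clauses) + ")"


if __name__ == "__main__":
    # The query from the slide.
    print(all_of(tag("cancer", "titl"),
                 tag("11", "chrom"),
                 tag("autosomal dominant", "clin")))
    # A variation using OR inside parentheses.
    print(all_of(any_of(tag("10", "chrom"), tag("11", "chrom")),
                 tag("cancer", "titl")))
```

Wrapping every OR group in parentheses avoids relying on the search engine's operator precedence, which is easy to get wrong in long queries.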

  28. Save your search history

  29. Quick Review • GenBank • Entrez Gene, HomoloGene, UniGene • Protein structures and CDD • Taxonomy • GEO • OMIM • Complex queries

  30. PubMed

  31. MeSH

  32. Example (NCBI PubMed tutorial exercise 4) • Use the MeSH Database to build a strategy that will find citations to articles about schizophrenia resulting from prenatal exposure to influenza. Schizophrenia and influenza should be the major topics of the articles.

  33. Basic Local Alignment Search Tool (BLAST) • Usage: Find sequences in a database that are similar to the input sequence • Applications: • Infer the function of newly sequenced genes • Predict new members of gene families • Explore evolutionary relationships • Predict the location and function of protein-coding and transcription-regulation regions in genomic DNA

  34. How BLAST works • Sequence databases are preprocessed for faster access by BLAST • Given an input sequence S: • List all k-mers (e.g. k=11 for DNA) of S • Find sequences in the database that contain similar k-mers • Extend the matched words to form High-scoring Segment Pairs (HSPs) • Evaluate the significance of each HSP
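The seed-and-extend steps on this slide can be illustrated with a toy implementation. This is a heavily simplified sketch, not the BLAST algorithm itself: it assumes exact k-mer matches as seeds, ungapped extension only, an arbitrary +1/-3 match/mismatch scoring, and a simple drop-off rule to stop extending. It skips the significance evaluation entirely. All function names and parameter values are illustrative.

```python
from collections import defaultdict


def kmers(seq, k):
    """List every (position, k-mer) pair in seq."""
    return [(i, seq[i:i + k]) for i in range(len(seq) - k + 1)]


def build_index(db, k):
    """Preprocess the database: map each k-mer to its (name, position) hits."""
    index = defaultdict(list)
    for name, seq in db.items():
        for pos, word in kmers(seq, k):
            index[word].append((name, pos))
    return index


def extend_seed(query, subject, q_pos, s_pos, k, match=1, mismatch=-3, x_drop=5):
    """Extend an exact seed without gaps; return (score, q_start, q_end)."""
    score = best = k * match
    q_start, q_end = q_pos, q_pos + k  # half-open interval on the query
    # Extend to the right until the score falls too far below the best seen.
    i, j = q_pos + k, s_pos + k
    while i < len(query) and j < len(subject):
        score += match if query[i] == subject[j] else mismatch
        if score > best:
            best, q_end = score, i + 1
        elif best - score > x_drop:
            break
        i, j = i + 1, j + 1
    # Extend to the left, restarting from the best score so far.
    score = best
    i, j = q_pos - 1, s_pos - 1
    while i >= 0 and j >= 0:
        score += match if query[i] == subject[j] else mismatch
        if score > best:
            best, q_start = score, i
        elif best - score > x_drop:
            break
        i, j = i - 1, j - 1
    return best, q_start, q_end


def search(query, db, k=11):
    """Return ungapped hits as (name, q_start, q_end, s_start, score), best first."""
    index = build_index(db, k)
    hits = set()
    for q_pos, word in kmers(query, k):
        for name, s_pos in index.get(word, []):
            score, q_start, q_end = extend_seed(query, db[name], q_pos, s_pos, k)
            s_start = s_pos - (q_pos - q_start)
            hits.add((name, q_start, q_end, s_start, score))
    return sorted(hits, key=lambda h: -h[4])


if __name__ == "__main__":
    toy_db = {"s1": "AAAACGTACGTTTT", "s2": "GGGGGGGG"}
    for hit in search("ACGTACGT", toy_db, k=4):
        print(hit)
```

Several seeds often extend to the same segment pair, which is why the hits are collected in a set before sorting.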

  35. http://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastHome

  36. >gi|187960039|ref|NM_001127233.1| Mus musculus transformation related protein 53 (Trp53), transcript variant 2, mRNA TTTCCCCTCCCACGTGCTCACCCTGGCTAAAGTTCTGTAGCTTCAGTTCATTGGGACCATCCTGGCTGTAGGTAGCGACTACAGTTAGGGGGCACCTAGCATTCAGGCCCTCATCCTCCTCCTTCCCAGCAGGGTGTCACGCTTCTCCGAAGACTGGATGACTGCCATGGAGGAGTCACAGTCGGATATCAGCCTCGAGCTCCCTCTGAGCCAGGAGACATTTTCAGGCTTATGGAAACTACTTCCTCCAGAAGATATCCTGCCATCACCTCACTGCATGGACGATCTGTTGCTGCCCCAGGATGTTGAGGAGTTTTTTGAAGGCCCAAGTGAAGCCCTCCGAGTGTCAGGAGCTCCTGCAGCACAGGACCCTGTCACCGAGACCCCTGGGCCAGTGGCCCCTGCCCCAGCCACTCCATGGCCCCTGTCATCTTTTGTCCCTTCTCAAAAAACTTACCAGGGCAACTATGGCTTCCACCTGGGCTTCCTGCAGTCTGGGACAGCCAAGTCTGTTATGTGCACGTACTCTCCTCCCCTCAATAAGCTATTCTGCCAGCTGGCGAAGACGTGCCCTGTGCAGTTGTGGGTCAGCGCCACACCTCCAGCTGGGAGCCGTGTCCGCGCCATGGCCATCTACAAGAAGTCACAGCACATGACGGAGGTCGTGAGACGCTGCCCCCACCATGAGCGCTGCTCCGATGGTGATGGCCTGGCTCCTCCCCAGCATCTTATCCGGGTGGAAGGAAATTTGTATCCCGAGTATCTGGAAGACAGGCAGACTTTTCGCCACAGCGTGGTGGTACCTTATGAGCCACCCGAGGCCGGCTCTGAGTATACCACCATCCACTACAAGTACATGTGTAATAGCTCCTGCATGGGGGGCATGAACCGCCGACCTATCCTTACCATCATCACACTGGAAGACTCCAGTGGGAACCTTCTGGGACGGGACAGCTTTGAGGTTCGTGTTTGTGCCTGCCCTGGGAGAGACCGCCGTACAGAAGAAGAAAATTTCCGCAAAAAGGAAGTCCTTTGCCCTGAACTGCCCCCAGGGAGCGCAAAGAGAGCGCTGCCCACCTGCACAAGCGCCTCTCCCCCGCAAAAGAAAAAACCACTTGATGGAGAGTATTTCACCCTCAAGATCCGCGGGCGTAAACGCTTCGAGATGTTCCGGGAGCTGAATGAGGCCTTAGAGTTAAAGGATGCCCATGCTACAGAGGAGTCTGGAGACAGCAGGGCTCACTCCAGCCTCCAGCCTAGAGCCTTCCAAGCCTTGATCAAGGAGGAAAGCCCAAACTGCTAGCTCCCATCACTTCATCCCTCCCCTTTTCTGTCTTCCTATAGCTACCTGAAGACCAAGAAGGGCCAGTCTACTTCCCGCCATAAAAAAACAATGGTCAAGAAAGTGGGGCCTGACTCAGACTGACTGCCTCTGCATCCCGTCCCCATCACCAGCCTCCCCCTCTCCTTGCTGTCTTATGACTTCAGGGCTGAGACACAATCCTCCCGGTCCCTTCTGCTGCCTTTTTTACCTTGTAGCTAGGGCTCAGCCCCCTCTCTGAGTAGTGGTTCCTGGCCCAAGTTGGGGAATAGGTTGATAGTTGTCAGGTCTCTGCTGGCCCAGCGAAATTCTATCCAGCCAGTTGTTGGACCCTGGCACCTACAATGAAATCTCACCCTACCCCACACCCTGTAAGATTCTATCTTGGGCCCTCATAGGGTCCATATCCTCCAGGGCCTACTTTCCTTCCATTCTGCAAAGCCTGTCTGCATTTATCCACCCCCCACCCTGTCTCCCTCTTTTTTTTTTTTTTACCCCTTTTTATATATCAATTTCCTATTTTACAATAAAATTTTGTTATCACTTAAAAAAAAAA

  37. Blast Types

  38. Databases • Protein • nr / refseq / swissprot / pat / pdb / month / env_nr • Nucleotide • nr / refseq_rna / refseq_genomic / est / est_human / est_others / gss / htgs / pat / pdb / month / dbsts / chromosome / wgs / env_nt http://www.ncbi.nlm.nih.gov/BLAST/blastcgihelp.shtml#nucleotide_databases

  39. Other Options • Number of hits to display • Weights for matching • Nucleotide: match/mismatch scores • Protein: scoring matrix • Weights for gaps • A gap of length k costs (open) + (k-1) * (extend) • Organism • Mask low-complexity regions
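The gap weighting on this slide, (open) + (k-1) * (extend) for a gap of length k, is an affine gap penalty: opening a gap is expensive and each additional position is cheaper. The sketch below implements the slide's formula as written; the function name and the example penalty values are assumptions. Some tools use a slightly different convention in which every gap position, including the first, is charged the extension cost, so the documentation of the specific tool should be checked.

```python
def gap_cost(k, open_cost, extend_cost):
    """Affine gap cost for a gap of length k, following the slide's formula."""
    if k < 1:
        raise ValueError("gap length must be at least 1")
    return open_cost + (k - 1) * extend_cost


if __name__ == "__main__":
    # Hypothetical penalties: opening costs 5, each further position costs 2.
    for length in (1, 2, 3, 10):
        print(length, gap_cost(length, open_cost=5, extend_cost=2))
```

With these example values one gap of length 3 costs 9, while three separate gaps of length 1 cost 15, so alignments with fewer, longer gaps are favored.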

  40. Blast two sequences (bl2seq)

  41. gi: 4759254 >gi|4759254|ref|NP_004611.1| TNF receptor-associated factor 6 [Homo sapiens] MSLLNCENSCGSSQSESDCCVAMASSCSAVTKDDSVGGTASTGNLSSSFMEEIQGYDVEFDPPLESKYECPICLMALREAVQTPCGHRFCKACIIKSIRDAGHKCPVDNEILLENQLFPDNFAKREILSLMVKCPNEGCLHKMELRHLEDHQAHCEFALMDCPQCQRPFQKFHINIHILKDCPRRQVSCDNCAASMAFEDKEIHDQNCPLANVICEYCNTILIREQMPNHYDLDCPTAPIPCTFSTFGCHEKMQRNHLARHLQENTQSHMRMLAQAVHSLSVIPDSGYISEVRNFQETIHQLEGRLVRQDHQIRELTAKMETQSMYVSELKRTIRTLEDKVAEIEAQQCNGIYIWKIGNFGMHLKCQEEEKPVVIHSPGFYTGKPGYKLCMRLHLQLPTAQRCANYISLFVHTMQGEYDSHLPWPFQGTIRLTILDQSEAPVRQNHEEIMDAKPELLAFQRPTIPRNPKGFGYVTFMHLEALRQRTFIKDDTLLVRCEVSTRFDMGSLRREGFQPRSTDAGV gi:22027612 >gi|22027612|ref|NP_066961.2| TNF receptor-associated factor 2 [Homo sapiens]MAAASVTPPGSLELLQPGFSKTLLGTKLEAKYLCSACRNVLRRPFQAQCGHRYCSFCLASILSSGPQNCAACVHEGIYEEGISILESSSAFPDNAARREVESLPAVCPSDGCTWKGTLKEYESCHEGRCPLMLTECPACKGLVRLGEKERHLEHECPERSLSCRHCRAPCCGADVKAHHEVCPKFPLTCDGCGKKKIPREKFQDHVKTCGKCRVPCRFHAIGCLETVEGEKQQEHEVQWLREHLAMLLSSVLEAKPLLGDQSHAGSELLQRCESLEKKTATFENIVCVLNREVERVAMTAEACSRQHRLDQDKIEALSSKVQQLERSIGLKDLAMADLEQKVLEMEASTYDGVFIWKISDFARKRQEAVAGRIPAIFSPAFYTSRYGYKMCLRIYLNGDGTGRGTHLSLFFVVMKGPNDALLRWPFNQKVTLMLLDQNNREHVIDAFRPDVTSSSFQRPVNDMNIASGCPLFCPVSKMEAKNSYVRDDAIFIKAIVDLTGL http://www.ncbi.nlm.nih.gov/blast/bl2seq/wblast2.cgi

  42. Example (NCBI Minicourse #6) • Problem: A laboratory has generated an EST library from a hemochromatosis patient and wants to identify the gene(s) causing the phenotype. We will follow these steps to solve the problem: • Compare ESTs from a hemochromatosis patient to the human genome (using BLAST). • Identify the gene(s) that align to the ESTs and download their sequences (using Map Viewer). • Identify whether the ESTs contain any known nucleotide variations (single nucleotide polymorphisms) (using dbSNP). • Determine whether a mutant form of the gene is known to cause a phenotype (using OMIM).

  43. Sequences • TGCCTCCTTTGGTGAAGGTGACACATCATGTGACCTCTTCAGTGACCACTCTACGGTGTCGGGCCTTGAACTACTACCCCCAGAACATCACCATGAAGTGGCTGAAGGATAAGCAGCCAATGGATGCCAAGGAGTTCGAACCTAAAGACGTATTGCCCAATGGGGATGGGACCTACCAGGGCTGGATAACCTTGGCTGTACCCCCTGGGGAAGAGCAGAGATATACGTACCAGGTGGAGCACCCAGGCCTGGATCAGCCCCTCATTGTGATCTGGG http://people.pcbi.upenn.edu/~lswang/seq1.txt http://www.ncbi.nlm.nih.gov/Class/minicourses/diseasegene.html

  44. UCSC Genome Browser • Google “Genome Browser” http://genome.ucsc.edu/

  45. Example • Locate MLL (myeloid/lymphoid or mixed-lineage leukemia) on the human genome • Find relevant information • Conservation across the gene • Retrieve the sequences of human MLL and divide into exon/intron regions • Retrieve the 5’ and 3’ flanking region sequences

  46. BLAT • Blast-like alignment tool • Quickly finds genomic regions highly similar to the input query sequence

  47. Example >hg18_knownGene_uc002gil.1_1 range=chr17:7531420-7531642 5'pad=0 3'pad=0 strand=- repeatMasking=none ACTTGTCATGGCGACTGTCCAGCTTTGTGCCAGGAGCCTCGCAGGGGTTGATGGGATTGGGGTTTTCCCCTCCCATGTGCTCAAGACTGGCGCTAAAAGTTTTGAGCTTCTCAAAAGTCTAGAGCCACCGTCCAGGGAGCAGGTAGCTGCTGGGCTCCGGGGACACTTTGCGTTCGGGCTGGGAGCGTGCTTTCCACGACGGTGACACGCTTCCCTGGATTGG http://genome.ucsc.edu/cgi-bin/hgBlat?command=start&org=Human&db=hg18&hgsid=110368820

  48. Other Tasks for Genome Browser • Download the database • Retrieve genomic sequences/annotations • Upload your own annotation (customized track) using .bed format and visualize on the browser • Many tasks are easier using the Galaxy web service from Penn State U (Google “Galaxy Trac” or go to http://galaxy.psu.edu/)
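Uploading a custom track in .bed format requires converting browser-style positions into BED coordinates. The sketch below assumes the usual conventions: positions written as chrom:start-end are 1-based and include both ends, while BED uses a 0-based start and an exclusive end, with the first six columns being chrom, chromStart, chromEnd, name, score, and strand. The function name, the feature name `my_feature`, and the score of 0 are illustrative assumptions.

```python
import re

# Matches browser-style positions such as chr17:7531420-7531642.
POSITION_RE = re.compile(r"^(\w+):(\d+)-(\d+)$")


def position_to_bed(position, name, strand="+", score=0):
    """Convert a 1-based inclusive position string into a 6-column BED line."""
    match = POSITION_RE.match(position.replace(",", ""))
    if match is None:
        raise ValueError(f"cannot parse position: {position!r}")
    chrom, start, end = match.group(1), int(match.group(2)), int(match.group(3))
    if start < 1 or end < start:
        raise ValueError("expected 1 <= start <= end")
    # BED start is 0-based, so subtract 1; the exclusive end stays the same.
    fields = [chrom, str(start - 1), str(end), name, str(score), strand]
    return "\t".join(fields)


if __name__ == "__main__":
    # Range and strand taken from the sequence header on the BLAT example slide.
    print(position_to_bed("chr17:7531420-7531642", "my_feature", strand="-"))
```

Because the BED end is exclusive, the feature length is simply chromEnd minus chromStart, which makes off-by-one mistakes easy to spot.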

  49. DAVID (NIAID) • Google “NIH DAVID”
