
Introduction to Bioinformatics Databases


clarke-chan

Presentation Transcript


  1. Introduction to Bioinformatics Databases

  2. Central dogma of molecular biology: DNA → RNA → protein → phenotype. A main focus of bioinformatics is to study molecular sequence data to gain insight into a broad range of biological problems.

  3. Page 6. After Pace NR (1997) Science 276:734. With the use of bioinformatics we can learn about the variation that occurs between species, and we can deduce the evolutionary history of life on Earth.

  4. Growth of GenBank, December 1982 to June 2006 [Figure: base pairs of DNA (billions) and sequences (millions) plotted by year]

  5. Growth of the International Nucleotide Sequence Database Collaboration [Figure: base pairs of DNA (billions) contributed by GenBank, EMBL, and DDBJ] http://www.ncbi.nlm.nih.gov/Genbank/

  6. Central dogma of molecular biology: DNA → RNA → protein. Central dogma of bioinformatics and genomics: genome → transcriptome → proteome.
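The DNA → RNA → protein flow on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the input sequence `toy_dna` is made up, it is treated as the coding strand, and `CODON_TABLE` is deliberately partial (it holds only the codons this toy sequence uses, not the full genetic code).

```python
# Minimal sketch: DNA -> RNA -> protein for a short, made-up coding sequence.
# CODON_TABLE is intentionally partial; a real table has 64 codons.
CODON_TABLE = {
    "AUG": "M",  # start codon, methionine
    "GAA": "E",  # glutamate
    "UGG": "W",  # tryptophan
    "GUG": "V",  # valine
    "UAA": "*",  # stop codon
}


def transcribe(coding_dna: str) -> str:
    """Return the mRNA for a coding-strand DNA sequence (T becomes U)."""
    return coding_dna.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Translate codon by codon until a stop codon or the end of the sequence."""
    protein = []
    usable_length = len(mrna) - len(mrna) % 3  # ignore a trailing partial codon
    for i in range(0, usable_length, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


toy_dna = "ATGGAATGGGTGTAA"  # hypothetical sequence, not a database record
toy_mrna = transcribe(toy_dna)
toy_protein = translate(toy_mrna)
print(toy_mrna, toy_protein)
```

A codon missing from the partial table raises `KeyError`, which is acceptable for a sketch but would need handling in real use.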

  7. DNA → RNA → protein → phenotype, with the corresponding databases: genomic DNA databases (DNA); cDNA, ESTs, UniGene (RNA); protein sequence databases (protein). Fig. 2.2 Page 20

  8. There are three major public DNA databases: GenBank, EMBL, and DDBJ. The underlying raw DNA sequences are identical. Page 16

  9. There are three major public DNA databases • EMBL: housed at EBI (European Bioinformatics Institute) • GenBank: housed at NCBI (National Center for Biotechnology Information) • DDBJ: housed in Japan Page 16

  10. >300,000 species are represented in GenBank Table 2-1

  11. Taxonomy nodes at NCBI 8/06 http://www.ncbi.nlm.nih.gov/Taxonomy/txstat.cgi

  12. The most sequenced organisms in GenBank • Homo sapiens 10.7 billion bases • Mus musculus 6.5b • Rattus norvegicus 5.6b • Danio rerio 1.7b • Zea mays 1.4b • Oryza sativa 0.8b • Drosophila melanogaster 0.7b • Gallus gallus 0.5b • Arabidopsis thaliana 0.5b Table 2-2 Page 18 Updated 8-12-04 GenBank release 142.0

  13. The most sequenced organisms in GenBank • Homo sapiens 11.2 billion bases • Mus musculus 7.5b • Rattus norvegicus 5.7b • Danio rerio 2.1b • Bos taurus 1.9b • Zea mays 1.4b • Oryza sativa (japonica) 1.2b • Xenopus tropicalis 0.9b • Canis familiaris 0.8b • Drosophila melanogaster 0.7b Table 2-2 Page 18 Updated 8-29-05 GenBank release 149.0

  14. The most sequenced organisms in GenBank • Homo sapiens 12.3 billion bases • Mus musculus 8.0b • Rattus norvegicus 5.7b • Bos taurus 3.5b • Danio rerio 2.5b • Zea mays 1.8b • Oryza sativa (japonica) 1.5b • Strongylocentrotus purpuratus 1.2b • Sus scrofa 1.0b • Xenopus tropicalis 1.0b Table 2-2 Page 18 Updated 7-19-06 GenBank release 154.0

  15. National Center for Biotechnology Information (NCBI) www.ncbi.nlm.nih.gov Page 24

  16. Types of Data in GenBank • DNA level • RNA level (cDNA) • Protein sequences. • …

  17. Fig. 2.5 Page 25 www.ncbi.nlm.nih.gov

  18. Fig. 2.5 Page 25

  19. PubMed is… • National Library of Medicine's search service • 16 million citations in MEDLINE • links to participating online journals • PubMed tutorial (via “Education” on side bar) Page 24

  20. Entrez integrates… • the scientific literature; • DNA and protein sequence databases; • 3D protein structure data; • population study data sets; • assemblies of complete genomes Page 24

  21. Entrez is a search and retrieval system that integrates NCBI databases Page 24

  22. BLAST is… • Basic Local Alignment Search Tool • NCBI's sequence similarity search tool • supports analysis of DNA and protein databases • 100,000 searches per day Page 25

  23. OMIM is… • Online Mendelian Inheritance in Man • catalog of human genes and genetic disorders • edited by Dr. Victor McKusick, others at JHU Page 25

  24. Books is… • searchable resource of on-line books Page 26

  25. TaxBrowser is… • browser for the major divisions of living organisms (archaea, bacteria, eukaryota, viruses) • taxonomy information such as genetic codes • molecular data on extinct organisms Page 26

  26. Structure site includes… • Molecular Modelling Database (MMDB) • biopolymer structures obtained from the Protein Data Bank (PDB) • Cn3D (a 3D-structure viewer) • vector alignment search tool (VAST) Page 26

  27. Accessing information on molecular sequences Page 26

  28. Accession numbers are labels for sequences. NCBI includes databases (such as GenBank) that contain information on DNA, RNA, or protein sequences. You may want to acquire information beginning with a query such as the name of a protein of interest, or the raw nucleotides comprising a DNA sequence of interest. DNA sequences and other molecular data are tagged with accession numbers that are used to identify a sequence or other record relevant to molecular data. Page 26

  29. What is an accession number? An accession number is a label that is used to identify a sequence. It is a string of letters and/or numbers that corresponds to a molecular sequence. Examples (all for retinol-binding protein, RBP4): DNA • X02775: GenBank genomic DNA sequence • NT_030059: genomic contig • rs7079946: dbSNP (single nucleotide polymorphism) RNA • N91759.1: an expressed sequence tag (1 of 170) • NM_006744: RefSeq DNA sequence (from a transcript) Protein • NP_006735: RefSeq protein • AAC02945: GenBank protein • Q28369: SwissProt protein • 1KT7: Protein Data Bank structure record Page 27

  30. Four ways to access DNA and protein sequences [1] Entrez Gene with RefSeq [2] UniGene [3] European Bioinformatics Institute (EBI) and Ensembl (separate from NCBI) [4] ExPASy Sequence Retrieval System (separate from NCBI) Page 27

  31. Four ways to access protein and DNA sequences: [1] Entrez Gene with RefSeq. Entrez Gene is a great starting point: it collects key information on each gene/protein from major databases. It covers all major organisms. RefSeq provides a curated, optimal accession number for each DNA (NM_006744) or protein (NP_006735). Page 27

  32. From the NCBI home page, type “rbp4” and hit “Go” revised Fig. 2.7 Page 29

  33. revised Fig. 2.7 Page 29

  34. By applying limits, there are now just two entries
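The slides run this search through the NCBI web form. As a rough programmatic counterpart, the sketch below builds query URLs for NCBI's E-utilities `esearch` endpoint, first for the bare term "rbp4" and then with field restrictions that play the role of limits. Assumptions, none of which come from the slides: the helper name `build_esearch_url` is hypothetical, and the `[Gene Name]` and `[Organism]` field tags are just one way to narrow the query. The code only constructs URLs; it does not contact NCBI, and it makes no claim about how many records either query returns.

```python
from urllib.parse import parse_qs, urlencode, urlparse

# Base address of the NCBI E-utilities service.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def build_esearch_url(term: str, db: str = "gene", retmax: int = 20) -> str:
    """Build (but do not send) an esearch query URL. Hypothetical helper."""
    query = urlencode({"db": db, "term": term, "retmax": retmax})
    return f"{EUTILS_BASE}/esearch.fcgi?{query}"


# Unrestricted search, analogous to typing "rbp4" and hitting "Go".
unrestricted_url = build_esearch_url("rbp4")

# Restricted search, analogous to applying limits (field tags are an assumption).
limited_url = build_esearch_url("rbp4[Gene Name] AND Homo sapiens[Organism]")

print(unrestricted_url)
print(limited_url)
```

Building the URL separately from sending it keeps the query easy to inspect before any request is made.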

  35. Entrez Gene (top of page). Note that links to many other RBP4 database entries are available. revised Fig. 2.8 Page 30

  36. Entrez Gene (middle of page)

  37. Entrez Gene (bottom of page)

  38. Fig. 2.9 Page 32

  39. Fig. 2.9 Page 32

  40. Fig. 2.9 Page 32

  41. FASTA format Fig. 2.10 Page 32

  42. FASTA format
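The figure for these slides is not preserved in the transcript, so here is a minimal sketch of reading FASTA-formatted text: each record starts with a header line beginning with ">", followed by one or more lines of sequence. The function name `parse_fasta` and the two toy records are made up for illustration; they are not real database entries.

```python
from typing import Dict


def parse_fasta(text: str) -> Dict[str, str]:
    """Map each FASTA header (without the leading '>') to its joined sequence."""
    records: Dict[str, str] = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]
            records[header] = ""
        elif header is not None:
            records[header] += line  # sequence may span several lines
    return records


# Hypothetical records, not real database entries.
example = """>toy_record_1 hypothetical protein example
MKWVWALLLL
AALGSGRA
>toy_record_2 hypothetical DNA example
ATGGAATGG
"""

records = parse_fasta(example)
for header, sequence in records.items():
    # The first whitespace-delimited token of the header is usually the identifier.
    print(header.split()[0], len(sequence))
```

For real work an established parser such as Biopython's `SeqIO` is the safer choice; this sketch only shows the structure of the format.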

  43. What is an accession number? An accession number is a label that is used to identify a sequence. It is a string of letters and/or numbers that corresponds to a molecular sequence. Examples (all for retinol-binding protein, RBP4): DNA • X02775: GenBank genomic DNA sequence • NT_030059: genomic contig • rs7079946: dbSNP (single nucleotide polymorphism) RNA • N91759.1: an expressed sequence tag (1 of 170) • NM_006744: RefSeq DNA sequence (from a transcript) Protein • NP_006735: RefSeq protein • AAC02945: GenBank protein • Q28369: SwissProt protein • 1KT7: Protein Data Bank structure record Page 27

  44. NCBI’s important RefSeq project: best representative sequences. RefSeq (accessible via the main page of NCBI) provides an expertly curated accession number that corresponds to the most stable, agreed-upon “reference” version of a sequence. RefSeq identifiers include the following formats: • Complete genome: NC_###### • Complete chromosome: NC_###### • Genomic contig: NT_###### • mRNA (DNA format): NM_###### (e.g. NM_006744) • Protein: NP_###### (e.g. NP_006735) Page 29-30

  45. NCBI’s RefSeq project: accession for genomic, mRNA, protein sequences

      Accession         Molecule   Note
      AP_123456         Protein    Protein products; alternate
      NC_123456         Genomic    Complete genomic molecules
      NG_123456         Genomic    Incomplete genomic regions
      NM_123456         mRNA       Transcript products; mRNA
      NM_123456789      mRNA       Transcript products; 9-digit
      NP_123456         Protein    Protein products
      NP_123456789      Protein    Protein products; 9-digit
      NR_123456         RNA        Non-coding transcripts
      NT_123456         Genomic    Genomic assemblies
      NW_123456         Genomic    Genomic assemblies
      NZ_ABCD12345678   Genomic    Whole genome shotgun data
      XM_123456         mRNA       Transcript products
      XP_123456         Protein    Protein products
      XR_123456         RNA        Transcript products
      YP_123456         Protein    Protein products
      ZP_12345678       Protein    Protein products
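The prefix table on this slide lends itself to a simple lookup. The sketch below transcribes the prefixes and molecule types from the slide into a dictionary and reads the prefix off an accession. The function name `describe_refseq` is hypothetical, and the check is deliberately loose: it looks only at the text before the underscore and does not validate the digits that follow.

```python
from typing import Optional, Tuple

# Prefix -> (molecule, note), transcribed from the slide's table.
REFSEQ_PREFIXES = {
    "AP": ("Protein", "Protein products; alternate"),
    "NC": ("Genomic", "Complete genomic molecules"),
    "NG": ("Genomic", "Incomplete genomic regions"),
    "NM": ("mRNA", "Transcript products; mRNA"),
    "NP": ("Protein", "Protein products"),
    "NR": ("RNA", "Non-coding transcripts"),
    "NT": ("Genomic", "Genomic assemblies"),
    "NW": ("Genomic", "Genomic assemblies"),
    "NZ": ("Genomic", "Whole genome shotgun data"),
    "XM": ("mRNA", "Transcript products"),
    "XP": ("Protein", "Protein products"),
    "XR": ("RNA", "Transcript products"),
    "YP": ("Protein", "Protein products"),
    "ZP": ("Protein", "Protein products"),
}


def describe_refseq(accession: str) -> Optional[Tuple[str, str]]:
    """Return (molecule, note) for a RefSeq-style accession, else None."""
    prefix, underscore, rest = accession.strip().partition("_")
    if not underscore or not rest:
        return None  # no "PREFIX_digits" shape, so not RefSeq-style
    return REFSEQ_PREFIXES.get(prefix.upper())


for accession in ["NM_006744", "NP_006735", "NT_030059", "X02775"]:
    print(accession, describe_refseq(accession))
```

X02775 comes back as `None` because it is a GenBank accession with no RefSeq prefix, which matches the distinction the slides draw between GenBank and RefSeq records.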

  46. Ensembl to access protein and DNA sequences. Try Ensembl at www.ensembl.org for a premier human genome web browser. Ensembl is a joint scientific project between the European Bioinformatics Institute and the Wellcome Trust Sanger Institute. Its aim is to provide a centralised resource for geneticists, molecular biologists and other researchers studying the genomes of our own species and other vertebrates. We will encounter Ensembl as we study the human genome, BLAST, and other topics.
