
NCBI Molecular Biology Resources

A Field Guide

part 1

September 29, 2004 ICGEB

NCBI Resources
  • About NCBI
  • The NCBI Entrez System
  • NCBI Sequence Databases
  • NCBI Genomic Resources

** Intermission **

  • NCBI Precomputed Resources
    • Behind the scenes
The National Center for Biotechnology Information
  • Created as a part of NLM in 1988
    • Establish public databases
    • Perform research in computational biology
    • Develop software tools for sequence analysis
    • Disseminate biomedical information
Number of Users and Hits Per Day

[Chart: number of users and hits per day, 1997 to 2003, with dips marked at Christmas and New Year’s Day]

Currently averaging 10,000,000 to 35,000,000 hits per day!

Part 1. The Databases

Part 2. Data Flow and Processing

Part 3. Querying and Linking the Data

Part 4. User Support

A part of the NCBI Bookshelf


OMIM - A catalogue of genes involved with human disease processes

- Detailed clinical and reference information

- Curated and maintained by Johns Hopkins

- Links to PubMed and sequence databases

The Entrez System

[Diagram: Entrez at the center, linked to these databases: Gene, UniGene, Cancer Chromosomes, UniSTS, HomoloGene, SNP, PopSet, Genome, Nucleotide, GEO, Books, Taxonomy, PubMed, GEO Datasets, MeSH, OMIM, Protein, PMC, Journals, Domains, 3D Domains, Structure]

Types of Databases
  • Primary Databases
    • Original submissions by experimentalists
    • Database staff review and may organize the data, but do not add or modify information
    • Records are “owned” and updated by their authors
      • Examples: GenBank, SNP, GEO
  • Derivative Databases
    • Human-curated (compilation and correction of data)
      • Examples: Gene (LocusLink), Structure & Literature databases
    • Computationally-Derived
      • Example: UniGene
    • Combination
      • Examples: RefSeq, Genome Assembly, Domain databases
Primary vs. Derivative Sequence Databases

[Diagram: sequence fragments flow from labs and sequencing centers into GenBank, and from GenBank into RefSeq (via curators), Genome Assembly, and UniGene (via algorithms)]

  • GenBank: updated ONLY by submitters
  • RefSeq, Genome Assembly, UniGene: updated continually by NCBI

How to Query a Particular Database

term1 term2

(term1[tag delimiter] op term2[tag delimiter] op …)

op = AND, OR, NOT

  • Boolean operators MUST be in ALL CAPS!

tag delimiter = Entrez indexing field

Examples of tag delimiters: Organism, Journal, User compounds, Author
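
The query pattern above can be sketched in a few lines of Python. This is only an illustration of the syntax on the slide: the helper names `tagged` and `combine` are hypothetical, and the field tags `title` and `orgn` are taken from the later "Using Field Limits" slide.

```python
BOOLEAN_OPS = {"AND", "OR", "NOT"}

def tagged(term, field=None):
    """Return term[field], or the bare term when no field is given."""
    return f"{term}[{field}]" if field else term

def combine(op, *terms):
    """Join already-tagged terms with one Boolean operator, forced to ALL CAPS."""
    op = op.upper()
    if op not in BOOLEAN_OPS:
        raise ValueError(f"unknown operator: {op}")
    return "(" + f" {op} ".join(terms) + ")"

query = combine("and",
                tagged("thyroid peroxidase", "title"),
                tagged("human", "orgn"))
print(query)  # (thyroid peroxidase[title] AND human[orgn])
```

Upper-casing the operator inside `combine` mirrors the rule that Boolean operators must be in ALL CAPS.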

Sample Query

Brauninger a c-src kinase

[Slide labels: Organism, Journal, User compounds, Author]

Using Fields to Find Records

Searchable fields: Accession, All Fields, Author, EC/RN Number, Feature Key, Filter, Gene Name, Issue, Journal, Keyword, Modification Date, Organism, Page Number, Primary Accession, Properties, Protein Name, Publication Date, SeqID String, Sequence Length, Substance Name, Text Word, Title, Volume

  • Most useful search field [Organism]:
    • human[orgn] …or… bacteria[orgn]
  • Useful search terms in [Properties] field:
    • srcdb: “source database” ( srcdb_genbank[prop] )
    • gbdiv: “genbank division” ( gbdiv_est[prop] )
    • biomol: “biomolecular type” ( biomol_mrna[prop] )
Using Field Limits

#1: thyroid peroxidase                            335
#2: thyroid peroxidase AND human[orgn]            291
#3: thyroid peroxidase[title] AND human[orgn]     166
#4: #3 AND srcdb_refseq[prop]                       5
#5: #3 AND srcdb_ddbj/embl/genbank[prop]          161
#6: #5 AND gbdiv_est[prop]                         20
#7: #5 AND gbdiv_pri[prop]                        141
#8: #7 AND biomol_genomic[prop]                    25
#9: #7 AND biomol_mrna[prop]                      116
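
Each numbered search above reuses an earlier one through its `#n` reference. The sketch below expands those references into one self-contained query string and checks that the hit counts from the slide partition as expected. It assumes the underscore spelling of the property terms (as in `gbdiv_est[Properties]` on the EST slide); the function name `expand` is hypothetical.

```python
import re

# Search history as shown on the slide: number -> (query, hit count).
HISTORY = {
    1: ("thyroid peroxidase", 335),
    2: ("thyroid peroxidase AND human[orgn]", 291),
    3: ("thyroid peroxidase[title] AND human[orgn]", 166),
    4: ("#3 AND srcdb_refseq[prop]", 5),
    5: ("#3 AND srcdb_ddbj/embl/genbank[prop]", 161),
    6: ("#5 AND gbdiv_est[prop]", 20),
    7: ("#5 AND gbdiv_pri[prop]", 141),
    8: ("#7 AND biomol_genomic[prop]", 25),
    9: ("#7 AND biomol_mrna[prop]", 116),
}

def expand(n, history=HISTORY):
    """Recursively replace every #k with the parenthesized text of search k."""
    query, _count = history[n]
    return re.sub(r"#(\d+)",
                  lambda m: "(" + expand(int(m.group(1)), history) + ")",
                  query)

def count(n):
    return HISTORY[n][1]

print(expand(9))
```

On these numbers, searches #4 and #5 split #3 (5 + 161 = 166), #6 and #7 split #5 (20 + 141 = 161), and #8 and #9 split #7 (25 + 116 = 141).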

Complex searches you can do with Preview/Index

Terms used (and indexed) in Entrez fields can be searched to gain useful information!

How many rat UniGene clusters contain at least one mRNA?

  • Select the UniGene database.
  • Find all the rat records.
  • Find those that have ≥ 1 mRNAs (“not 0”).

[Screenshot labels: rat[organism], NOT]

Primary Sequence Database

GenBank

  • Nucleotide-only sequence database
  • Archival in nature
  • Submission of GenBank Data to NCBI
    • Direct submissions of individual records via Web (BankIt, Sequin)
    • Batch submissions of bulk sequences (EST, GSS, STS) via Email
    • FTP accounts for Sequencing Centers
GenBank

Release 143: 37.3 million records, 41.8 billion nucleotides

Average doubling time ≈ 14 months

[Chart: sequence records (millions) and total base pairs (billions), 1983 to 2004]
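
As a rough illustration of what a 14-month doubling time implies, the sketch below applies the standard exponential growth formula N(t) = N0 · 2^(t / 14). The starting size is the Release 143 total quoted on the slide; treating the doubling time as constant is an assumption made only for this example, not a forecast.

```python
import math

DOUBLING_MONTHS = 14        # "Average doubling time ≈ 14 months" (slide)
BASES_AT_START = 41.8e9     # Release 143 total nucleotides (slide)

def projected_bases(months_ahead, start=BASES_AT_START, doubling=DOUBLING_MONTHS):
    """Size after months_ahead months of constant exponential growth."""
    return start * 2 ** (months_ahead / doubling)

def months_to_reach(target, start=BASES_AT_START, doubling=DOUBLING_MONTHS):
    """Months needed to grow from start to target at the same rate."""
    return doubling * math.log2(target / start)

# One doubling time ahead gives twice the starting size.
print(f"{projected_bases(DOUBLING_MONTHS):.3g} bases after 14 months")
print(f"{months_to_reach(100e9):.1f} months to reach 100 billion bases")
```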

Release 143, August 2004

  • 37,343,937 records
  • 41,808,045,653 nucleotides
  • >170,000 species
  • 160 gigabytes, 657 files

GenBank
  • Full release every two months
  • Incremental and cumulative updates daily
  • Available only through the Internet

ftp://ftp.ncbi.nih.gov/genbank/

The International Sequence Database Collaboration

[Diagram: three partner databases exchange data with one another; each accepts submissions and updates]

  • GenBank at NCBI (NIH): submissions via Sequin, BankIt, and ftp; retrieval via Entrez
  • EMBL at EBI: retrieval via SRS
  • DDBJ at CIB (NIG): retrieval via getentry

Organization of GenBank: GenBank Divisions (gbdiv)

Records are divided into 17 divisions (number of files in parentheses).

  • 1 Patent (11 files)
  • 5 High Throughput
    • EST (335) Expressed Sequence Tag
    • GSS (116) Genome Survey Sequence
    • HTG (61) High Throughput Genomic
    • STS (5) Sequence Tagged Site
    • HTC (6) High Throughput cDNA
  • 11 Traditional
    • PRI (28) Primate
    • PLN (12) Plant and Fungal
    • BCT (10) Bacterial and Archaeal
    • INV (6) Invertebrate
    • ROD (13) Rodent
    • VRL (3) Viral
    • VRT (7) Other Vertebrate
    • MAM (1) Mammalian (ex. ROD and PRI)
    • PHG (1) Phage
    • SYN (1) Synthetic (cloning vectors)
    • UNA (1) Unannotated

  • Traditional Divisions:
    • Direct submissions (Sequin and BankIt)
    • Accurate
    • Well characterized
  • BULK Divisions:
    • Batch submission (Email and FTP)
    • Inaccurate
    • Poorly characterized
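
The division scheme above maps naturally onto a small lookup table. The sketch below records the file counts from the slide and classifies a division code as patent, bulk (high throughput), or traditional. The code `PAT` for the patent division and the helper names are not on the slide and are assumptions of this example.

```python
# Division code -> number of files, as listed on the slide.
DIVISION_FILES = {
    "PAT": 11,
    "EST": 335, "GSS": 116, "HTG": 61, "STS": 5, "HTC": 6,
    "PRI": 28, "PLN": 12, "BCT": 10, "INV": 6, "ROD": 13, "VRL": 3,
    "VRT": 7, "MAM": 1, "PHG": 1, "SYN": 1, "UNA": 1,
}
BULK = {"EST", "GSS", "HTG", "STS", "HTC"}

def division_class(code):
    """Classify a division code as 'patent', 'bulk', or 'traditional'."""
    code = code.upper()
    if code not in DIVISION_FILES:
        raise KeyError(code)
    if code == "PAT":
        return "patent"
    return "bulk" if code in BULK else "traditional"

def files_by_class():
    """Sum the listed file counts within each class."""
    totals = {}
    for code, n_files in DIVISION_FILES.items():
        cls = division_class(code)
        totals[cls] = totals.get(cls, 0) + n_files
    return totals

print(division_class("est"), division_class("PRI"))
print(files_by_class())
```

Most of the listed files belong to the bulk divisions, which is why filters such as `gbdiv_est[prop]` are useful for including or excluding them in a search.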
File Formats of the Sequence Databases

Each sequence is represented by a text record called a flat file.

  • GenBank/GenPept (useful for scientists)
  • FASTA (the simplest format)
  • ASN.1 & XML (useful for programmers)
A Traditional “GenBank” Record

LOCUS       AF062069     3808 bp    mRNA            INV       02-MAR-2000
DEFINITION  Limulus polyphemus myosin III mRNA, complete cds.
ACCESSION   AF062069
VERSION     AF062069.2  GI:7144484
KEYWORDS    .
SOURCE      Atlantic horseshoe crab.
  ORGANISM  Limulus polyphemus
            Eukaryota; Metazoa; Arthropoda; Chelicerata; Merostomata;
            Xiphosura; Limulidae; Limulus.
REFERENCE   1  (bases 1 to 3808)
  AUTHORS   Battelle,B.-A., Andrews,A.W., Calman,B.G., Sellers,J.R.,
            Greenberg,R.M. and Smith,W.C.
  TITLE     A myosin III from Limulus eyes is a clock-regulated phosphoprotein
  JOURNAL   J. Neurosci. (1998) In press
REFERENCE   2  (bases 1 to 3808)
  AUTHORS   Battelle,B.-A., Andrews,A.W., Calman,B.G., Sellers,J.R.,
            Greenberg,R.M. and Smith,W.C.
  TITLE     Direct Submission
  JOURNAL   Submitted (29-APR-1998) Whitney Laboratory, University of Florida,
            9505 Ocean Shore Blvd., St. Augustine, FL 32086, USA
REFERENCE   3  (bases 1 to 3808)
  AUTHORS   Battelle,B.-A., Andrews,A.W., Calman,B.G., Sellers,J.R.,
            Greenberg,R.M. and Smith,W.C.
  TITLE     Direct Submission
  JOURNAL   Submitted (02-MAR-2000) Whitney Laboratory, University of Florida,
            9505 Ocean Shore Blvd., St. Augustine, FL 32086, USA
  REMARK    Sequence update by submitter
COMMENT     On Mar 2, 2000 this sequence version replaced gi:3132700.

Callouts on the slide:

  • LOCUS line: length, molecule type (mRNA = cDNA, DNA = genomic), division, date of most recent modification
  • DEFINITION = title
  • ACCESSION: accession number
  • VERSION: Accession.Version and GI number
  • ORGANISM: NCBI’s Taxonomy
  • REFERENCE: references
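
The LOCUS line packs several of the callouts into one row. A minimal sketch of pulling them apart is shown below. It splits on whitespace, which works for the seven-token line on this slide but is a simplification: real flat files use fixed columns and may carry extra tokens. The function name `parse_locus` is hypothetical.

```python
from datetime import date

MONTHS = ["JAN", "FEB", "MAR", "APR", "MAY", "JUN",
          "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"]

def parse_locus(line):
    """Split a seven-token LOCUS line into named fields."""
    tokens = line.split()
    if len(tokens) != 7 or tokens[0] != "LOCUS" or tokens[3] != "bp":
        raise ValueError("unexpected LOCUS layout")
    _, name, length, _, moltype, division, stamp = tokens
    day, month, year = stamp.split("-")
    return {
        "name": name,
        "length": int(length),          # sequence length in base pairs
        "moltype": moltype,             # mRNA = cDNA, DNA = genomic
        "division": division,           # GenBank division code
        "modified": date(int(year), MONTHS.index(month) + 1, int(day)),
    }

record = parse_locus(
    "LOCUS       AF062069     3808 bp    mRNA            INV       02-MAR-2000")
print(record["name"], record["length"], record["division"], record["modified"])
```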

Lower down in the GenBank Record

FEATURES             Location/Qualifiers
     source          1..3808
                     /organism="Limulus polyphemus"
                     /db_xref="taxon:6850"
                     /tissue_type="lateral eye"
     CDS             258..3302
                     /note="N-terminal protein kinase domain; C-terminal myosin
                     heavy chain head; substrate for PKA"
                     /codon_start=1
                     /product="myosin III"
                     /protein_id="AAC16332.2"
                     /db_xref="GI:7144485"
                     /translation="MEYKCISEHLPFETLPDPGDRFEVQELVGTGTYATVYSAIDKQA
                     NKKVALKIIGHIAENLLDIETEYRIYKAVNGIQFFPEFRGAFFKRGERESDNEVWLGI
                     EFLEEGTAADLLATHRRFGIHLKEDLIALIIKEVVRAVQYLHENSIIHRDIRAANIMF
                     SKEGYVKLIDFGLSASVKNTNGKAQSSVGSPYWMAPEVISCDCLQEPYNYTCDVWSIG
                     ITAIELADTVPSLSDIHALRAMFRINRNPPPSVKRETRWSETLKDFISECLVKNPEYR
                     PCIQEIPQHPFLAQVEGKEDQLRSELVDILKKNPGEKLRNKPYNVTFKNGHLKTISGQ
BASE COUNT 201 a 689 c 782 g 1136 t
ORIGIN
        1 tcgacatctg tggtcgcttt ttttagtaat aaaaaattgt attatgacgt cctatctgtt
     3781 aagatacagt aactagggaa aaaaaaaa
//

Callouts on the slide:

  • FEATURES: Feature Table
  • /protein_id: GenPept Protein ID
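
Feature locations such as `1..3808` and `258..3302` are 1-based, inclusive ranges, so a feature's length is end - start + 1. The sketch below works this out for the CDS on the slide. It handles only the plain `start..end` form, and the helper name `span_length` is hypothetical.

```python
def span_length(location):
    """Length in bases of a simple 'start..end' feature location."""
    start_text, end_text = location.split("..")
    start, end = int(start_text), int(end_text)
    if start < 1 or end < start:
        raise ValueError(f"bad location: {location}")
    return end - start + 1   # both endpoints are included

cds_length = span_length("258..3302")
codons, leftover = divmod(cds_length, 3)
print(cds_length, codons, leftover)  # 3045 1015 0
```

If the CDS ends with a stop codon, which is an assumption here based on the "complete cds" wording, the 1015 codons would encode a protein of 1014 residues.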

FASTA format

>gi|4680720|gb|M17755.2|HUMTPOC Homo sapiens thyroid peroxidase (TPO) mRNA, complete cds

GAGGCAATTGAGGCGCCCATTTCAGAAGAGTTACAGCCGTGAAAATTACTCAGCAGTGCAGTTGGCTGAG

AAGAGGAAAAAAGAATGAGAGCGCTGGCTGTGCTGTCTGTCACGCTGGTTATGGCCTGCACAGAAGCCTT

CTTCCCCTTCATCTCGAGAGGGAAAGAACTCCTTTGGGGAAAGCCTGAGGAGTCTCGTGTCTCTAGCGTC

TTGGAGGAAAGCAAGCGCCTGGTGGACACCGCCATGTACGCCACGATGCAGAGAAACCTCAAGAAAAGAG

GAATCCTTTCTGGAGCTCAGCTTCTGTCTTTTTCCAAACTTCCTGAGCCAACAAGCGGAGTGATTGCCCG

AGCAGCAGAGATAATGGAAACATCAATACAAGCGATGAAAAGAAAAGTCAACCTGAAAACTCAACAATCA

CAGCATCCAACGGATGCTTTATCAGAAGATCTGCTGAGCATCATTGCAAACATGTCTGGATGTCTCCCTT

ACATGCTGCCCCCAAAATGCCCAAACACTTGCCTGGCGAACAAATACAGGCCCATCACAGGAGCTTGCAA

CAACAGAGACCACCCCAGATGGGGCGCCTCCAACACGGCCCTGGCACGATGGCTCCCTCCAGTCTATGAG

GACGGCTTCAGTCAGCCCCGAGGCTGGAACCCCGGCTTCTTGTACAACGGGTTCCCACTGCCCCCGGTCC

GGGAGGTGACAAGACATGTCATTCAAGTTTCAAATGAGGTTGTCACAGATGATGACCGCTATTCTGACCT

CCTGATGGCATGGGGACAATACATCGACCACGACATCGCGTTCACACCACAGAGCACCAGCAAAGCTGCC

...

>gi|4680721|gb|AAA61217.2| thyroid peroxidase [Homo sapiens]

MRALAVLSVTLVMACTEAFFPFISRGKELLWGKPEESRVSSVLEESKRLVDTAMYATMQRNLKKRGILSG

AQLLSFSKLPEPTSGVIARAAEIMETSIQAMKRKVNLKTQQSQHPTDALSEDLLSIIANMSGCLPYMLPP

KCPNTCLANKYRPITGACNNRDHPRWGASNTALARWLPPVYEDGFSQPRGWNPGFLYNGFPLPPVREVTR

HVIQVSNEVVTDDDRYSDLLMAWGQYIDHDIAFTPQSTSKAAFGGGSDCQMTCENQNPCFPIQLPEEARP

AAGTACLPFYRSSAACGTGDQGALFGNLSTANPRQQMNGLTSFLDASTVYGSSPALERQLRNWTSAEGLL

RVHGRLRDSGRAYLPFVPPRAPAACAPEPGNPGETRGPCFLAGDGRASEVPSLTALHTLWLREHNRLAAA

LKALNAHWSADAVYQEARKVVGALHQIITLRDYIPRILGPEAFQQYVGPYEGYDSTANPTVSNVFSTAAF

RFGHATIHPLVRRLDASFQEHPDLPGLWLHQAFFSPWTLLRGGGLDPLIRGLLARPAKLQVQDQLMNEEL

TERLFVLSNSSTLDLASINLQRGRDHGLPGYNEWREFCGLPRLETPADLSTAIASRSVADKILDLYKHPD

NIDVWLGGLAENFLPRARTGPLFACLIGKQMKALRDGDWFWWENSHVFTDAQRRELEKHSLSRVICDNTG

LTRVPMDAFQVGKFPEDFESCDSITGMNLEAWRETFPQDDKCGFPESVENGDFVHCEESGRRVLVYSCRH

GYELQGREQLTCTQEGWDFQPPLCKDVNECADGAHPPCHASARCRNTKGGFQCLCADPYELGDDGRTCVD

...
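
The FASTA definition lines above follow a pipe-delimited pattern: `gi`, the GI number, a database tag, Accession.Version, and an optional locus name, followed by free-text description. A minimal sketch of splitting such a line is given below. It covers only the `gi|...|gb|...|` layout seen on this slide, and the function name `parse_defline` is hypothetical.

```python
def parse_defline(line):
    """Split a '>gi|number|db|accession.version|locus description' line."""
    if not line.startswith(">"):
        raise ValueError("not a FASTA definition line")
    identifier, _, description = line[1:].partition(" ")
    parts = identifier.split("|")
    if len(parts) < 5 or parts[0] != "gi":
        raise ValueError("unexpected identifier layout")
    accession, _, version = parts[3].partition(".")
    return {
        "gi": int(parts[1]),
        "db": parts[2],
        "accession": accession,
        "version": int(version) if version else None,
        "locus": parts[4] or None,      # empty for the protein record
        "description": description,
    }

nuc = parse_defline(">gi|4680720|gb|M17755.2|HUMTPOC Homo sapiens thyroid "
                    "peroxidase (TPO) mRNA, complete cds")
prot = parse_defline(">gi|4680721|gb|AAA61217.2| thyroid peroxidase [Homo sapiens]")
print(nuc["accession"], nuc["version"], nuc["locus"])
print(prot["accession"], prot["version"], prot["locus"])
```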

Abstract Syntax Notation: ASN.1

[Diagram: ASN.1 linked to GenBank, GenPept, nucleotide FASTA, and protein FASTA]

Seq-entry ::= set {

level 1 ,

class nuc-prot ,

descr {

title "Human thyroid peroxidase mRNA, partial cds., and translated

products" ,

source {

org {

taxname "Homo sapiens" ,

common "human" ,

db {

{

db "taxon" ,

tag

id 9606 } } ,

orgname {

name

binomial {

genus "Homo" ,

species "sapiens" } ,

lineage "Eukaryota; Metazoa; Chordata; Craniata; Vertebrata;

Euteleostomi; Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo" ,

Bulk Divisions
  • Batch Submission and htg (email and ftp)
  • Inaccurate
  • Poorly Characterized
  • Expressed Sequence Tag
    • 1st pass single read cDNA
  • Genome Survey Sequence
    • 1st pass single read gDNA
  • High Throughput Genomic
    • incomplete sequences of genomic clones
  • Sequence Tagged Site
    • PCR-based mapping reagents
EST Division: Expressed Sequence Tags

gbdiv_est[Properties]

[Diagram: nucleus with 30,000 genes; make cDNA library with 80-100,000 unique cDNA clones; reads taken from the 5’ and 3’ ends, for example gatccantgccatacg]

>IMAGE:275615 5' mRNA sequence

GACAGCATTCGGGCCGAGATGTCTCGCTCCGTGGCCTTAGCTGTGCTCGCGCTACTCTCTCTTTCTGG

TGGAGGTATCCAGCGTACTCCAAAGATTCAGGTTTACTCACGTCATCCAGCAGAGAATGGAAAGTCAA

TTCCTGAATTGCTATGTGTCTGGGTTTCATCCATCCGACATTGAAGTTGACTTACTGAAGAATGGAGA

GAATTGAAAAAGTGGAGCATTCAGACTTGTCTTTCAGCAAGGACTGGTCTTTCTATCTCTTGTACTAC

TGAATTCACCCCCACTGAAAAAGATGAGTATGCCTGCCGTGTTGAACCATGTNGACTTTGTCACAGNC

AAGTTNAGTTTAAGTGGGNATCGAGACATGTAAGGCAGGCATCATGGGAGGTTTTGAAGNATGCCGCN

TTGGATTGGGATGAATTCCAAATTTCTGGTTTGCTTGNTTTTTTAATATTGGATATGCTTTTG

[Diagram: RNA gene products; read fragment ctcgccaattcnntcg]

  • Isolate unique clones
  • Sequence once from each end

>IMAGE:275615 3', mRNA sequence

NNTCAAGTTTTATGATTTATTTAACTTGTGGAACAAAAATAAACCAGATTAACCACAACCATGCCTTA

TTATCAAATGTATAAGANGTAAATATGAATCTTATATGACAAAATGTTTCATTCATTATAACAAATTT

AATAATCCTGTCAATNATATTTCTAAATTTTCCCCCAAATTCTAAGCAGAGTATGTAAATTGGAAGTT

CTTATGCACGCTTAACTATCTTAACAAGCTTTGAGTGCAAGAGATTGANGAGTTCAAATCTGACCAAG

GTTGATGTTGGATAAGAGAATTCTCTGCTCCCCACCTCTANGTTGCCAGCCCTC

Genome Sequencing - HTG, GSS, (WGS)

[Diagram: sequencing workflow]

  • Whole BAC insert (or genome)
  • Shredding, cloning, and isolating
  • Sequencing: reads go to the GSS division or trace archive
  • Assembly: draft sequence (HTG division); whole genome shotgun assemblies (traditional division)

HTG Division: Honeybee Draft Sequences
  • Unfinished sequences of BACs
  • Gaps and unordered pieces
  • Finished sequences move to traditional GenBank division
Other Primary Databases
  • GEO (Gene Expression Omnibus)
    • Searchable microarray data repository
  • SNP (Single Nucleotide Polymorphism)
    • Allelic variations (including minisatellites/simple sequence repeats and insertions/deletions)
Redesigned with new features

  • Submit and update data
  • Query the database:
    • gene identifiers
    • field information
    • sequence
  • Browse datasets
  • Download data
[Diagram: GEO record types and who provides them]

  • GPL: platform descriptions (submitted by manufacturer*)
  • GSM: raw/processed spot intensities from a single slide/chip (submitted by experimentalists)
  • GSE: grouping of slide/chip data, “a single experiment” (submitted by experimentalists)
  • GDS: grouping of experiments (curated by NCBI)

Searchable through Entrez GEO Datasets and Entrez GEO
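
GEO accessions carry their record type in a three-letter prefix (GPL, GSM, GSE, GDS). The sketch below classifies an accession by that prefix, using the descriptions from the slide as labels. The function name `classify_geo` and the exact regular expression are assumptions made for illustration.

```python
import re

# Prefix -> description, worded as on the slide.
GEO_TYPES = {
    "GPL": "platform description",
    "GSM": "spot intensities from a single slide/chip",
    "GSE": "grouping of slide/chip data (a single experiment)",
    "GDS": "grouping of experiments",
}

ACCESSION = re.compile(r"^(GPL|GSM|GSE|GDS)(\d+)$")

def classify_geo(accession):
    """Return (prefix, number, description) for a GEO accession."""
    match = ACCESSION.match(accession.strip().upper())
    if match is None:
        raise ValueError(f"not a GEO accession: {accession!r}")
    prefix, number = match.group(1), int(match.group(2))
    return prefix, number, GEO_TYPES[prefix]

print(classify_geo("GDS177"))
print(classify_geo("gsm827"))
```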

GDS177: CMV infection of HFF cells

Comparison of gene expression profiles of HFF cells infected with CMV strains

FHCRC non-commercial human 18K array

src1: CMV infected fibroblasts; src2: uninfected fibroblasts

  • GSM827: FHCMV-T-1
  • GSM825: FHCMV-T-2
  • GSM828: FHCMV-T-3
  • GSM829: FHCMV-H-1
  • GSM830: FHCMV-H-2
  • GSM831: FHCMV-H-3
  • GSM832: CMV_AD169-2
  • GSM833: CMV_AD169-3

[Chart: Expression]