
National Center for Biotechnology Information

A Field Guide to GenBank

and NCBI’s Molecular Biology Resources

University of Colorado Health Sciences Center

August 30, 2005


Topics

  • About NCBI

  • GenBank overview

  • Primary vs derivative databases

    • The Reference Sequence (RefSeq) project

  • Entrez databases

  • Genome resources

  • Bookshelf

    -break-

  • Entrez text searching

  • BLAST sequence searching

  • VAST structure searching

  • An integrated example


The National Institutes of Health

Bethesda, MD


The National Center for Biotechnology Information

  • Accepts submissions of primary data

  • Develops tools to analyze these data

  • Creates derivative databases based on the primary data

  • Provides free search, link, and retrieval of these data, primarily through the Entrez system



Number of Users Per Day

[Figure: number of users per day, 1997 to 2003; annotation: Christmas & New Year]



[Figure: results for the query all[filter] on 1/11/2005, 3/15/2005, and 8/15/2005]


Entrez Nucleotide

# records

Primary Data

  • GenBank / DDBJ / EMBL: 57.3 million (97.4 %)

Derivative Data

  • RefSeq: 1.47 million (2.5 %)

    • RefSeq reviewed: 60,000

  • PDB (structures): 5,973

“Total”: 59 million

GenBank: NCBI’s Primary Sequence Database

Release 149, August 2005

47 x 10^6 Records

52 x 10^9 Nucleotides

195 Gigabytes, 816 files

Over 100 billion bases!

  • full release every two months

  • incremental and cumulative updates daily

  • available only through internet

  • release notes: gbrel.txt

ftp://ftp.ncbi.nih.gov/genbank/

ftp://genbank.sdsc.edu/pub

ftp://bio-mirror.net/biomirror/genbank


What is GenBank?

  • Nucleotide only sequence database

  • Archival in nature

  • GenBank Data

    • Direct submissions (traditional records)

    • Batch submissions (EST, GSS, STS)

    • ftp accounts (genome data)

  • Three collaborating databases

    • GenBank

    • DNA Database of Japan (DDBJ)

    • European Molecular Biology Laboratory (EMBL) Database


GenBank Divisions

“Organismal”

PRI (28) Primate

ROD (15) Rodent

PLN (13) Plant and Fungal

BCT (11) Bacterial/Archaeal

INV (7) Invertebrate

VRT (7) Other Vertebrate

VRL (4) Viral

MAM (2) Mammalian

PHG (1) Phage

SYN (1) Synthetic

UNA (1) Unannotated

  • Organized by taxonomy (sort of)

  • Direct submissions (Sequin/Bankit)

  • Accurate (~1 error per 10,000 bp)

  • Well characterized

“Functional”

EST (377) Expressed Sequence Tag

GSS (138) Genome Survey Sequence

HTG (63) High Throughput Genomic

PAT (17) Patent

STS (9) Sequence Tagged Site

CON (1) Contigs, virtual

  • Organized by sequence type

  • Batch submissions (ftp/email)

  • Inaccurate

  • Poorly characterized


GenBank Functional (Bulk) Divisions

[Figure labels: GenBank; EST, GSS, HTG, STS]

  • Expressed Sequence Tag

    • 1st pass single read cDNA

  • Genome Survey Sequence

    • 1st pass single read gDNA

  • High Throughput Genomic

    • incomplete sequences of genomic clones

  • Sequence Tagged Site

    • PCR-based mapping reagents

  • Whole Genome Shotgun


EST Division: Expressed Sequence Tags

[Figure labels: nucleus, 30,000 genes; RNA gene products; make cDNA library; 80-100,000 unique cDNA clones in library; clone ends 5’ and 3’; example reads gatccantgccatacg and ctcgccaattcnntcg]

  • isolate unique clones

  • sequence once from each end

>IMAGE:275615 5' mRNA sequence
GACAGCATTCGGGCCGAGATGTCTCGCTCCGTGGCCTTAGCTGTGCTCGCGCTACTCTCTCTTTCTGG
TGGAGGTATCCAGCGTACTCCAAAGATTCAGGTTTACTCACGTCATCCAGCAGAGAATGGAAAGTCAA
TTCCTGAATTGCTATGTGTCTGGGTTTCATCCATCCGACATTGAAGTTGACTTACTGAAGAATGGAGA
GAATTGAAAAAGTGGAGCATTCAGACTTGTCTTTCAGCAAGGACTGGTCTTTCTATCTCTTGTACTAC
TGAATTCACCCCCACTGAAAAAGATGAGTATGCCTGCCGTGTTGAACCATGTNGACTTTGTCACAGNC
AAGTTNAGTTTAAGTGGGNATCGAGACATGTAAGGCAGGCATCATGGGAGGTTTTGAAGNATGCCGCN
TTGGATTGGGATGAATTCCAAATTTCTGGTTTGCTTGNTTTTTTAATATTGGATATGCTTTTG

>IMAGE:275615 3', mRNA sequence
NNTCAAGTTTTATGATTTATTTAACTTGTGGAACAAAAATAAACCAGATTAACCACAACCATGCCTTA
TTATCAAATGTATAAGANGTAAATATGAATCTTATATGACAAAATGTTTCATTCATTATAACAAATTT
AATAATCCTGTCAATNATATTTCTAAATTTTCCCCCAAATTCTAAGCAGAGTATGTAAATTGGAAGTT
CTTATGCACGCTTAACTATCTTAACAAGCTTTGAGTGCAAGAGATTGANGAGTTCAAATCTGACCAAG
GTTGATGTTGGATAAGAGAATTCTCTGCTCCCCACCTCTANGTTGCCAGCCCTC
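Because EST reads are single-pass, they tend to carry ambiguous base calls (N), as the reads on this slide do. The following is a minimal sketch of reading FASTA text and summarizing ambiguity per read. The function names and the deflines `read_5prime` and `read_3prime` are hypothetical; the two toy sequences are the short example reads from the slide figure.

```python
from typing import Dict, Iterator, Tuple


def read_fasta(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (defline, sequence) pairs from FASTA-formatted text."""
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between sequence lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def ambiguity_summary(seq: str) -> Dict[str, float]:
    """Report read length and how many positions are ambiguous (N)."""
    n_count = seq.upper().count("N")
    fraction = n_count / len(seq) if seq else 0.0
    return {"length": len(seq), "n_count": n_count, "n_fraction": fraction}


# Toy input: hypothetical deflines, sequences taken from the slide figure.
EXAMPLE = """>read_5prime
gatccantgccatacg
>read_3prime
ctcgccaattcnntcg
"""

for name, seq in read_fasta(EXAMPLE):
    print(name, ambiguity_summary(seq))
```

The same two functions can be pointed at the full IMAGE:275615 reads to compare how many N calls each end contains.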


GSS, WGS, HTG

[Figure: Whole BAC insert (or genome): shred, isolate clones, sequence, assembly. Labels: GSS division or trace archive; Draft sequence (HTG division); whole genome shotgun assemblies (traditional division)]


HTG Example: Honeybee Draft Sequences

LOCUS AC141845 147720 bp DNA linear HTG 19-MAR-2004
DEFINITION Apis mellifera clone CH224-4A2, WORKING DRAFT SEQUENCE, 14 unordered pieces.
ACCESSION AC141845
VERSION AC141845.1 GI:29124029
KEYWORDS HTG; HTGS_PHASE1; HTGS_DRAFT.

  • Unfinished sequences of BACs

  • Gaps and unordered pieces

  • Finished sequences (Phase 3) move to traditional GenBank division
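The header fields of the honeybee record carry the division, length, and HTG phase. Here is a minimal sketch of pulling them out of the text as it appears on the slide. It assumes one field per line with whitespace-separated values; real GenBank flat files use fixed columns and continuation lines, so a full parser would need more care. The function name `parse_header` is hypothetical.

```python
from typing import Dict

RECORD = """LOCUS AC141845 147720 bp DNA linear HTG 19-MAR-2004
DEFINITION Apis mellifera clone CH224-4A2, WORKING DRAFT SEQUENCE, 14 unordered pieces.
ACCESSION AC141845
VERSION AC141845.1 GI:29124029
KEYWORDS HTG; HTGS_PHASE1; HTGS_DRAFT.
"""


def parse_header(text: str) -> Dict[str, object]:
    """Extract a few header fields from a simplified GenBank record."""
    fields: Dict[str, str] = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        key, _, value = line.partition(" ")
        fields[key] = value.strip()

    # Assumed LOCUS layout after splitting on whitespace:
    # name, length, "bp", molecule, topology, division, date
    locus = fields["LOCUS"].split()
    keywords = [k.strip() for k in fields["KEYWORDS"].rstrip(".").split(";")]
    phase = next((k for k in keywords if k.startswith("HTGS_PHASE")), None)
    return {
        "accession": fields["ACCESSION"],
        "version": fields["VERSION"].split()[0],
        "length_bp": int(locus[1]),
        "division": locus[5],
        "date": locus[6],
        "keywords": keywords,
        "htgs_phase": phase,
    }


header = parse_header(RECORD)
print(header["accession"], header["division"], header["htgs_phase"])
```

A record whose keywords no longer include an `HTGS_PHASE` entry would give `htgs_phase` of `None` in this sketch.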


Whole Genome Shotgun Projects

  • 351 projects

    • Bacteria (251)

    • Environmental sequences (6)

    • Archaea (6)

    • Eukaryotes (88), including:

      • Chicken, Rat, Mouse, Dog (2), Chimpanzee, Human

      • Pufferfish (2)

      • Honeybee, Anopheles, Fruit Flies (3), Silkworm

      • Nematode (2)

      • Yeasts (8), Aspergillus (2)

      • Rice (2)


Whole Genome Shotgun (WGS) Projects

wgs master[properties]


Derivative Databases

[Figure: Sequencing Centers and Labs submit to GenBank (EST, STS, HTG, GSS, INV, VRT, PHG, VRL, PRI, ROD, PLN, MAM, BCT), which is updated ONLY by submitters. UniGene, UniSTS, and RefSeq are updated by NCBI. RefSeq: Entrez Gene and annotation pipelines.]


Why Make Reference Sequences?

Entrez Nucleotide query:

human[organism] AND lipase[title]




human[organism] AND lipase[title] AND endothelial[title]

[Figure: results list; sequence lengths 3927 bp, 4150 bp, 2323 bp, 3927 bp, 261 bp]
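The field-qualified queries on these slides can also be composed programmatically. The sketch below builds the query string and a request URL for NCBI's E-utilities ESearch endpoint, which the slides themselves do not cover. It only constructs the URL and makes no network call. The helper names are hypothetical, and the database name `nucleotide` is an assumption.

```python
from typing import Sequence, Tuple
from urllib.parse import urlencode, urlsplit, parse_qs

# E-utilities ESearch endpoint (not part of the original slides).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def entrez_term(pairs: Sequence[Tuple[str, str]]) -> str:
    """Join (value, field) pairs into an Entrez query such as human[organism]."""
    return " AND ".join(f"{value}[{field}]" for value, field in pairs)


def esearch_url(db: str, term: str, retmax: int = 20) -> str:
    """Build an ESearch URL; urlencode handles the spaces and brackets."""
    return ESEARCH + "?" + urlencode({"db": db, "term": term, "retmax": retmax})


term = entrez_term([
    ("human", "organism"),
    ("lipase", "title"),
    ("endothelial", "title"),
])
url = esearch_url("nucleotide", term)  # database name is an assumption
print(term)
print(url)
```

Adding or removing a pair narrows or widens the search in the same way as editing the query in the Entrez search box.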


RefSeq Benefits

genomes

transcripts

proteins

  • non-redundant; best representative

  • updates to reflect current sequence data and biology

  • distinct, stable accession series


Reference Sequence: RefSeq

Accession          Sequence Type
NM_123456789       mRNA
NP_123456789       protein, from NM_
NR_123456          non-coding RNA
XM_123456          predicted mRNA
XP_123456          predicted protein
XR_123456          predicted non-coding RNA
ZP_12345678        predicted from NZ_
NC_123456          genomic, e.g., chromosomes
NG_123455          genomic, incomplete region
NT_123456          genomic, BAC assembly
NW_123456          genomic, WGS assembly
NZ_ABCD12345678    genomic, WGS collection

blue = curated (color coding on the original slide)
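Since the two-letter prefix identifies the record type, a lookup table is enough to classify a RefSeq accession. This is a minimal sketch using only the prefixes listed on the slide; the function name is hypothetical, and prefixes introduced after 2005 are not covered.

```python
from typing import Optional

# Prefix -> sequence type, as listed on the slide.
REFSEQ_PREFIXES = {
    "NM": "mRNA",
    "NP": "protein, from NM_",
    "NR": "non-coding RNA",
    "XM": "predicted mRNA",
    "XP": "predicted protein",
    "XR": "predicted non-coding RNA",
    "ZP": "predicted from NZ_",
    "NC": "genomic, e.g., chromosomes",
    "NG": "genomic, incomplete region",
    "NT": "genomic, BAC assembly",
    "NW": "genomic, WGS assembly",
    "NZ": "genomic, WGS collection",
}


def classify_refseq(accession: str) -> Optional[str]:
    """Return the sequence type for a RefSeq-style accession, else None."""
    prefix, sep, rest = accession.partition("_")
    if not sep or not rest:
        return None  # no underscore: not in the RefSeq accession series
    return REFSEQ_PREFIXES.get(prefix.upper())


for acc in ["NM_123456789", "XP_123456", "NZ_ABCD12345678", "AC141845"]:
    print(acc, "->", classify_refseq(acc))
```

The underscore is what makes the series distinct: a GenBank accession such as AC141845 has none, so it falls through to `None`.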


Annotation Process

[Figure labels: Genomic DNA (NC, NT, NW); Scanning....; Model mRNA (XM), (XR); Model protein (XP); Curated mRNA (NM), (NR); Curated Protein (NP); RefSeq; GenBank Sequences]


Creating NM_ Records

Genome annotation

NM’s must have cDNA support

transcript variant 1

transcript variant 2

transcript variant 3

Longest mRNA


Where is RefSeq?


The Entrez System

GENSAT

PubChem

Gene

UniGene

CancerChromosomes

UniSTS

Homologene

SNP

PopSet

Genome

Nucleotide

GEO

Books

Entrez

Taxonomy

PubMed

MeSH

OMIM

Protein

PMC

Journals

Domains

3D Domains

Structure


A Few Entrez Databases

UniGene: Clusters of ESTs, mRNAs

dbSNP: Single Nucleotide Polymorphisms

GEO: Gene Expression Omnibus

microarray and other expression data

CDD: Conserved Domain Database

protein families (COGs and KOGs)

single domains (PFAM, SMART, CD)


UniGene

Gene-oriented clusters of expressed sequences

  • Automatic clustering using MegaBlast

  • Each cluster represents a unique gene

  • Informed by genome hits

  • Information on tissue types and map locations

  • Useful for gene discovery and selection of mapping reagents
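To illustrate the idea of turning pairwise similarity hits into gene-oriented clusters, here is a toy single-linkage grouping built on union-find. It is not UniGene's actual procedure. The sequence identifiers and hit pairs are made up, and in practice the pairs would come from a similarity search such as MegaBlast.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def cluster_by_hits(ids: Iterable[str],
                    hits: Iterable[Tuple[str, str]]) -> List[List[str]]:
    """Group sequences so that any two linked by a chain of hits share a cluster.

    Every identifier that appears in `hits` must also appear in `ids`.
    """
    parent: Dict[str, str] = {i: i for i in ids}

    def find(x: str) -> str:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in hits:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two clusters

    groups: Dict[str, List[str]] = defaultdict(list)
    for i in parent:
        groups[find(i)].append(i)
    return sorted(sorted(members) for members in groups.values())


# Hypothetical identifiers and hits, for illustration only.
ids = ["est1", "est2", "est3", "est4", "mrna1"]
hits = [("est1", "mrna1"), ("est2", "mrna1"), ("est3", "est4")]
print(cluster_by_hits(ids, hits))
```

Note that est1 and est2 end up together even though they never hit each other directly; the shared mRNA links them, which is the single-linkage behavior.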



A Cluster of ESTs

[Figure labels: query; 5’ EST hits; 3’ EST hits]


UniGene Collections




UniGene Cluster Hs.95351

SELECTED PROTEIN SIMILARITIES

GENE EXPRESSION




Download sequences

web page

ftp://ftp.ncbi.nih.gov/repository/UniGene/Homo_sapiens/


Entrez GEO


NCBI’s SNP Database

  • Primary and derivative (RefSNP)

    • Single nucleotide polymorphisms

    • Repeat polymorphisms

    • Insertion-deletion polymorphisms

  • Over 19 million refSNPs (rsXXXXXXX)

    (August, 2005)


Searching dbSNP


RefSNP

Search Mouse SNP between strains


[Figure: RefSNP record; labels MapView, No 3D, OMIM, SeqView, GeneView]


Entrez GEO

  • GDS: Grouping of experiments (curated by NCBI)

  • GSE: Grouping of slide/chip data, “a single experiment” (submitted by experimentalists)

  • GSM: Raw/processed spot intensities from a single slide/chip (submitted by experimentalists)

  • GPL: Platform descriptions (submitted by manufacturer*)

GEO SEries: set of related samples

GEO SaMple: experimental conditions

[Figure labels: Entrez GEO Datasets; Entrez GEO]


What’s a DataSet?

Supplied by submitter

  • Platform (GPL): array definition

  • Sample (GSM): hyb. measurements

  • Series (GSE): related Samples

Assembled by GEO staff

  • DataSet (GDS)

    • A collection of experimentally-related samples processed using the same platform.

    • Samples within DataSets are organized into subgroups based on experimental variables.

    • Form the basis of GEO’s query, analysis and data display tools.
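The relationship between Samples, Platforms, and DataSets on this slide can be pictured as a small data model. The sketch below is illustrative only and does not reflect GEO's internal schema. The class names, the `agent` variable, and all accession strings are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Sample:
    """GSM: measurements from a single slide/chip."""
    accession: str
    platform: str  # GPL accession of the array used
    variables: Dict[str, str] = field(default_factory=dict)


@dataclass
class DataSet:
    """GDS: related samples processed using the same platform."""
    accession: str
    platform: str
    samples: List[Sample] = field(default_factory=list)

    def add(self, sample: Sample) -> None:
        if sample.platform != self.platform:
            raise ValueError("all samples in a DataSet must share one platform")
        self.samples.append(sample)

    def subgroups(self, variable: str) -> Dict[str, List[str]]:
        """Organize sample accessions by the value of an experimental variable."""
        groups: Dict[str, List[str]] = defaultdict(list)
        for s in self.samples:
            groups[s.variables.get(variable, "unspecified")].append(s.accession)
        return dict(groups)


# Hypothetical accessions and variable values, for illustration only.
ds = DataSet("GDS_demo", "GPL_demo")
ds.add(Sample("GSM_a", "GPL_demo", {"agent": "control"}))
ds.add(Sample("GSM_b", "GPL_demo", {"agent": "treated"}))
ds.add(Sample("GSM_c", "GPL_demo", {"agent": "treated"}))
print(ds.subgroups("agent"))
```

Rejecting a sample from a different platform mirrors the "same platform" requirement in the DataSet definition.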


Gene Expression Omnibus (GEO)

Dataset browser


GEO Dataset Browser


GEO Dataset Report


GEO Profiles

… of 12625


Entrez CDD


Conserved Domain Database

  • Multiple sequence alignments

  • Position-specific scoring matrices (PSSM)

  • Sources: SMART, PFAM, COGs, KOGs, and NCBI curated domains (structure-informed alignments)
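To make the link between a multiple sequence alignment and a position-specific scoring matrix concrete, the sketch below derives log-odds scores per column from a toy alignment. It assumes a uniform background distribution, a fixed pseudocount, and no gaps; the four-letter alphabet and the alignment rows are made up. It is not the procedure CDD uses.

```python
import math
from typing import Dict, List


def build_pssm(alignment: List[str], alphabet: str,
               pseudocount: float = 1.0) -> List[Dict[str, float]]:
    """Return one {residue: log2-odds score} mapping per alignment column."""
    if not alignment or len({len(row) for row in alignment}) != 1:
        raise ValueError("alignment rows must be non-empty and of equal length")
    background = 1.0 / len(alphabet)  # assumption: uniform background
    pssm = []
    for column in zip(*alignment):
        total = len(column) + pseudocount * len(alphabet)
        scores = {}
        for residue in alphabet:
            freq = (column.count(residue) + pseudocount) / total
            scores[residue] = math.log2(freq / background)
        pssm.append(scores)
    return pssm


def score(pssm: List[Dict[str, float]], seq: str) -> float:
    """Sum the per-position scores for a sequence of the same length."""
    return sum(column[residue] for column, residue in zip(pssm, seq))


# Toy gap-free alignment over a made-up four-letter alphabet.
ALIGNMENT = ["ACD", "ACE", "ACD", "AAD"]
pssm = build_pssm(ALIGNMENT, alphabet="ACDE")
print(round(score(pssm, "ACD"), 3), round(score(pssm, "EEA"), 3))
```

A sequence that matches the conserved residues in each column scores positive, while one that uses residues never seen in those columns scores negative.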


CDD

>gi|45549418|gb|AAS67634.1| ATP7A [Solenodon paradoxus]
IVYQPHLITVEEIKKQIKAVGFPAFIKKQPKYLKLGAIDIERLKNIPVKSSEGSQQMSPSSTNDSKVTLTIDGMHCNSCVSNIESALSTLHYVSSIVVSLQNKSAIIKYNANSVTPEILKKAIEAISPGQYRVSITSEVESTSNSPSSSSQKAPLNVVSQPLTQVTVININGMTCNSCVQSIEGVMSKKAGVKSIQVSLANRNGTVEYDP
LLTSPEILRE
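The definition line of the query sequence packs several identifiers between pipe characters. Here is a minimal sketch of splitting it into parts, assuming the `gi|number|db|accession|` layout seen on this slide; other defline layouts are not handled, and the function name is hypothetical.

```python
import re
from typing import Dict, Optional

DEFLINE = ">gi|45549418|gb|AAS67634.1| ATP7A [Solenodon paradoxus]"

# gi number, source database tag, accession.version, title, optional [organism]
_PATTERN = re.compile(
    r"^>gi\|(?P<gi>\d+)\|(?P<db>[a-z]+)\|(?P<accession>[^|]*)\|\s*"
    r"(?P<title>.*?)\s*(?:\[(?P<organism>[^\]]+)\])?$"
)


def parse_defline(line: str) -> Optional[Dict[str, Optional[str]]]:
    """Split a gi-style FASTA definition line into named parts, or return None."""
    match = _PATTERN.match(line.strip())
    return match.groupdict() if match else None


parts = parse_defline(DEFLINE)
print(parts)
```

The accession.version part (AAS67634.1 here) is the piece that can be used to look the record up again in Entrez Protein.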


Click on a colored bar to align your sequence to the CD

[Figure labels: CD, Pfam, COG]


Linking from Entrez Protein


Genome Resources

Genomic Biology

Gene database

Homologene

Map Viewer

Trace Archive


Genomic Biology













Entrez Gene

  • A single query interface to …

    • Sequences

      • RefSeqs

      • GenBank

      • Homologene

    • Maps – MapViewer

    • Entrez links

    • Linkouts

  • More organisms, ~ 3000

  • Entrez integration



Entrez Gene: NADH2


Gene Record for Pongo NADH2

Homo sapiens

Not found with “nadh2”



Human HFE: Transcripts

Transcripts with experimental evidence


Gene Table


Introns/Exons: Gene Table

links to sequence


Human HFE: Links


Genotype


GeneView in dbSNP


SNP in Structure

[Figure labels: H41, S43, C260]



Variants in OMIM




The New Homologene

Automated detection of homologs among the annotated genes of completely sequenced eukaryotic genomes.

  • No longer UniGene based

  • Protein similarities first

  • Guided by taxonomic tree

  • Includes orthologs and paralogs


The New Homologene

Homologene Build 43.1 (8/23/05)

[Table headers: Species; Number of genes (input, grouped); groups]


RAG1 → Homologene

[Figure labels: RING-finger, Sugar_tr]



BLASTP

bl2seq


Genome Resources

LocusLink

UniGene

Homologene

Trace Archive

Gene database

Map Viewer


List View


Human MapViewer

adar



MV Hs ADAR

3’ UTR

5’ UTR


Maps & Options

[Legend: = SNP]

--Sequence maps--

Ab initio

Assembly

Repeats

BES_Clone

Clone

NCI_Clone

Contig

Component

CpG island

dbSNP haplotype

Fosmid

GenBank_DNA

Gene

Phenotype

SAGE_Tag

STS

TCAG_RNA

Transcript (RNA)

Hs_UniGene

Hs_EST

--Cytogenetic maps--

Ideogram

FISH Clone

Gene_Cytogenetic

Mitelman Breakpoint

Morbid/Disease

--Genetic Maps--

deCODE

Genethon

Marshfield

--RH maps--

GeneMap99-G3

GeneMap99-GB4

NCBI RH

Stanford-G3

TNG

Whitehead-RH

Whitehead-YAC

Mm_UniGene

Mm_EST

Rn_UniGene

Rn_EST

Ssc_UniGene

Ssc_EST

Bt_UniGene

Bt_EST

Gga_UniGene

Gga_EST

Variation


MapViewer

Component

Gene

UniGene

Repeats


Phenotype

Variation

Gene




Trace Archive Page


Macaca mulatta Traces


Trace Archive BLAST Page

Access to sequences NOT in GenBank


Literature Links


BOOKS Database


Genes & Dis


Intermission