National Center for Biotechnology Information

A Field Guide to GenBank

and NCBI’s Molecular Biology Resources

University of Colorado Health Sciences Center

August 30, 2005

Topics
  • About NCBI
  • GenBank overview
  • Primary vs derivative databases
    • The Reference Sequence (RefSeq) project
  • Entrez databases
  • Genome resources
  • Bookshelf

-break-

  • Entrez text searching
  • BLAST sequence searching
  • VAST structure searching
  • An integrated example
The National Center for Biotechnology Information
  • Accepts submissions of primary data
  • Develops tools to analyze these data
  • Creates derivative databases based on the primary data
  • Provides free search, link, and retrieval of these data, primarily through the Entrez system
Number of Users Per Day

[Figure: number of users per day, 1997 through 2003, with the Christmas & New Year periods marked]

[Figure: all[filter] query, with dates 1/11/2005, 3/15/2005, and 8/15/2005]

Entrez Nucleotide (# records)

Primary Data
  • GenBank / DDBJ / EMBL: 57.3 million (97.4 %)

Derivative Data
  • RefSeq: 1.47 million (2.5 %)
    • RefSeq reviewed: 60,000
  • PDB (structures): 5,973

“Total”: 59 million

GenBank

GenBank: NCBI’s Primary Sequence Database

Release 149, August 2005
  • 47 x 10^6 records
  • 52 x 10^9 nucleotides
  • 195 gigabytes, 816 files

Over 100 billion bases!

  • full release every two months
  • incremental and cumulative updates daily
  • available only through internet
  • release notes: gbrel.txt

ftp://ftp.ncbi.nih.gov/genbank/

ftp://genbank.sdsc.edu/pub

ftp://bio-mirror.net/biomirror/genbank

What is GenBank?
  • Nucleotide only sequence database
  • Archival in nature
  • GenBank Data
    • Direct submissions (traditional records)
    • Batch submissions (EST, GSS, STS)
    • ftp accounts (genome data)
  • Three collaborating databases
    • GenBank
    • DNA Database of Japan (DDBJ)
    • European Molecular Biology Laboratory (EMBL) Database
GenBank Divisions

“Organismal”

PRI (28) Primate
ROD (15) Rodent
PLN (13) Plant and Fungal
BCT (11) Bacterial/Archaeal
INV (7) Invertebrate
VRT (7) Other Vertebrate
VRL (4) Viral
MAM (2) Mammalian
PHG (1) Phage
SYN (1) Synthetic
UNA (1) Unannotated

  • Organized by taxonomy (sort of)
  • Direct submissions (Sequin/Bankit)
  • Accurate (~1 error per 10,000 bp)
  • Well characterized

“Functional”

EST (377) Expressed Sequence Tag
GSS (138) Genome Survey Sequence
HTG (63) High Throughput Genomic
PAT (17) Patent
STS (9) Sequence Tagged Site
CON (1) Contigs, virtual

  • Organized by sequence type
  • Batch submissions (ftp/email)
  • Inaccurate
  • Poorly characterized
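
The division codes on this slide lend themselves to a simple lookup table. The following is a minimal Python sketch, not NCBI code: the dictionary and function names are hypothetical, and only the codes, descriptions, and the organismal/functional grouping are taken from the slide.

```python
# Division codes and descriptions as listed on the slide. The two dictionaries
# mirror the slide's "Organismal" and "Functional" columns.
ORGANISMAL = {
    "PRI": "Primate", "ROD": "Rodent", "PLN": "Plant and Fungal",
    "BCT": "Bacterial/Archaeal", "INV": "Invertebrate",
    "VRT": "Other Vertebrate", "VRL": "Viral", "MAM": "Mammalian",
    "PHG": "Phage", "SYN": "Synthetic", "UNA": "Unannotated",
}
FUNCTIONAL = {
    "EST": "Expressed Sequence Tag", "GSS": "Genome Survey Sequence",
    "HTG": "High Throughput Genomic", "PAT": "Patent",
    "STS": "Sequence Tagged Site", "CON": "Contigs, virtual",
}


def describe_division(code):
    """Return (group, description) for a division code, or None if unknown."""
    code = code.strip().upper()
    if code in ORGANISMAL:
        return ("organismal", ORGANISMAL[code])
    if code in FUNCTIONAL:
        return ("functional", FUNCTIONAL[code])
    return None


print(describe_division("htg"))
print(describe_division("PRI"))
```

A lookup like this can help decide how much to trust a record: by the slide's own characterization, organismal divisions hold well-characterized direct submissions, while functional divisions hold batch data.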

GenBank Functional (Bulk) Divisions
  • Expressed Sequence Tag
    • 1st pass single read cDNA
  • Genome Survey Sequence
    • 1st pass single read gDNA
  • High Throughput Genomic
    • incomplete sequences of genomic clones
  • Sequence Tagged Site
    • PCR-based mapping reagents

Whole Genome Shotgun

EST Division: Expressed Sequence Tags

[Figure: EST workflow. Labels: nucleus, 30,000 genes; RNA gene products; make cDNA library; 80-100,000 unique cDNA clones in library; 5’ and 3’ ends; example reads gatccantgccatacg and ctcgccaattcnntcg]

  • isolate unique clones
  • sequence once from each end

>IMAGE:275615 5' mRNA sequence
GACAGCATTCGGGCCGAGATGTCTCGCTCCGTGGCCTTAGCTGTGCTCGCGCTACTCTCTCTTTCTGG
TGGAGGTATCCAGCGTACTCCAAAGATTCAGGTTTACTCACGTCATCCAGCAGAGAATGGAAAGTCAA
TTCCTGAATTGCTATGTGTCTGGGTTTCATCCATCCGACATTGAAGTTGACTTACTGAAGAATGGAGA
GAATTGAAAAAGTGGAGCATTCAGACTTGTCTTTCAGCAAGGACTGGTCTTTCTATCTCTTGTACTAC
TGAATTCACCCCCACTGAAAAAGATGAGTATGCCTGCCGTGTTGAACCATGTNGACTTTGTCACAGNC
AAGTTNAGTTTAAGTGGGNATCGAGACATGTAAGGCAGGCATCATGGGAGGTTTTGAAGNATGCCGCN
TTGGATTGGGATGAATTCCAAATTTCTGGTTTGCTTGNTTTTTTAATATTGGATATGCTTTTG

>IMAGE:275615 3', mRNA sequence
NNTCAAGTTTTATGATTTATTTAACTTGTGGAACAAAAATAAACCAGATTAACCACAACCATGCCTTA
TTATCAAATGTATAAGANGTAAATATGAATCTTATATGACAAAATGTTTCATTCATTATAACAAATTT
AATAATCCTGTCAATNATATTTCTAAATTTTCCCCCAAATTCTAAGCAGAGTATGTAAATTGGAAGTT
CTTATGCACGCTTAACTATCTTAACAAGCTTTGAGTGCAAGAGATTGANGAGTTCAAATCTGACCAAG
GTTGATGTTGGATAAGAGAATTCTCTGCTCCCCACCTCTANGTTGCCAGCCCTC
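
Because ESTs are single-pass reads, they often contain ambiguous bases (N), and the 5’ and 3’ reads of one clone share a clone identifier in the FASTA header. The Python sketch below shows one way to read FASTA text, measure the fraction of ambiguous bases, and pair reads by clone. All function names are hypothetical. The clone name `clone_A` is made up, and the two short sequences are the example reads from the slide, standing in for full-length records.

```python
from collections import defaultdict


def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between sequence rows
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def ambiguous_fraction(seq):
    """Fraction of positions that are not A, C, G, or T (for example N)."""
    if not seq:
        return 0.0
    unambiguous = sum(seq.upper().count(base) for base in "ACGT")
    return 1.0 - unambiguous / len(seq)


def group_by_clone(records):
    """Group records by the first word of the header (assumed to be the clone ID)."""
    clones = defaultdict(list)
    for header, seq in records:
        clones[header.split()[0]].append((header, seq))
    return dict(clones)


# Hypothetical clone name; sequences are the short example reads from the slide.
example = """>clone_A 5' read
gatccantgccatacg
>clone_A 3' read
ctcgccaattcnntcg
"""

records = list(parse_fasta(example))
for header, seq in records:
    print(header, len(seq), ambiguous_fraction(seq))
print(sorted(group_by_clone(records)))
```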

GSS, WGS, HTG

[Figure: shotgun sequencing workflow. Labels: Whole BAC insert (or genome); shred; isolate clones; sequence; GSS division or trace archive; assembly; Draft sequence (HTG division); whole genome shotgun assemblies (traditional division)]

HTG Example: Honeybee Draft Sequences

LOCUS       AC141845 147720 bp DNA linear HTG 19-MAR-2004
DEFINITION  Apis mellifera clone CH224-4A2, WORKING DRAFT SEQUENCE, 14 unordered pieces.
ACCESSION   AC141845
VERSION     AC141845.1 GI:29124029
KEYWORDS    HTG; HTGS_PHASE1; HTGS_DRAFT.
  • Unfinished sequences of BACs
  • Gaps and unordered pieces
  • Finished sequences (Phase 3) move to traditional GenBank division
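
The header lines of the honeybee record can be picked apart with a few lines of Python. This is a minimal sketch under stated assumptions: it handles only a LOCUS line with the eight whitespace-separated fields shown in this example (other records may lay the line out differently), and the function names are hypothetical.

```python
import re


def parse_locus(line):
    """Split a LOCUS line laid out like the example record into named fields."""
    tokens = line.split()
    if len(tokens) != 8 or tokens[0] != "LOCUS" or tokens[3] != "bp":
        raise ValueError("unexpected LOCUS layout: " + line)
    return {
        "name": tokens[1],
        "length": int(tokens[2]),
        "molecule": tokens[4],
        "topology": tokens[5],
        "division": tokens[6],
        "date": tokens[7],
    }


def htg_phase(keywords_line):
    """Return the HTGS phase number named in a KEYWORDS line, or None."""
    match = re.search(r"HTGS_PHASE(\d)", keywords_line)
    return int(match.group(1)) if match else None


locus = parse_locus("LOCUS AC141845 147720 bp DNA linear HTG 19-MAR-2004")
phase = htg_phase("KEYWORDS HTG; HTGS_PHASE1; HTGS_DRAFT.")
print(locus["name"], locus["length"], locus["division"], phase)
```

For this record the division is HTG and the phase keyword is HTGS_PHASE1, consistent with an unfinished draft that has not yet moved to a traditional division.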
Whole Genome Shotgun Projects
  • 351 projects
    • Bacteria (251)
    • Environmental sequences (6)
    • Archaea (6)
    • Eukaryotes (88), including:
      • Chicken, Rat, Mouse, Dog (2), Chimpanzee, Human
      • Pufferfish (2)
      • Honeybee, Anopheles, Fruit Flies (3), Silkworm
      • Nematode (2)
      • Yeasts (8), Aspergillus (2)
      • Rice (2)
Derivative Databases

[Figure: Sequencing centers and labs submit to GenBank, which is updated ONLY by submitters (divisions shown: EST, STS, HTG, GSS, INV, VRT, PHG, VRL, PRI, ROD, PLN, MAM, BCT). UniGene, UniSTS, and RefSeq are updated by NCBI; RefSeq is labeled "Entrez Gene and annotation pipelines".]

Why Make Reference Sequences?

Entrez Nucleotide query:

human[organism] AND lipase[title]

human[organism] AND lipase[title] AND endothelial[title]

[Figure: result records of 3927 bp, 4150 bp, 2323 bp, 3927 bp, and 261 bp]
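
Queries like these are plain strings of `term[field]` pairs joined by AND, so they are easy to build programmatically. The Python sketch below composes the two queries from the slides and wraps one in a URL for NCBI's E-utilities ESearch service. E-utilities is not mentioned in the slides and is brought in here as an assumption about how one might run the query from a script; the helper names are hypothetical, and the code only builds the URL without contacting NCBI.

```python
from urllib.parse import urlencode

# ESearch endpoint of NCBI E-utilities (an assumption; not part of the slides).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def entrez_query(*terms):
    """Join (text, field) pairs into a query such as 'human[organism] AND lipase[title]'."""
    return " AND ".join(f"{text}[{field}]" for text, field in terms)


def esearch_url(db, term, retmax=20):
    """Build (but do not fetch) an ESearch URL for the given database and query."""
    return ESEARCH + "?" + urlencode({"db": db, "term": term, "retmax": retmax})


broad = entrez_query(("human", "organism"), ("lipase", "title"))
narrow = entrez_query(
    ("human", "organism"), ("lipase", "title"), ("endothelial", "title")
)
print(broad)
print(narrow)
print(esearch_url("nucleotide", narrow))
```

Adding the third term narrows the search, but as the slide shows, several records of different lengths can still match the same gene, which is the motivation for reference sequences.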
RefSeq Benefits

genomes

transcripts

proteins

  • non-redundant; best representative
  • updates to reflect current sequence data and biology
  • distinct, stable accession series
Reference Sequence: RefSeq

Accession          Sequence Type
NM_123456789       mRNA
NP_123456789       protein, from NM_
NR_123456          non-coding RNA
XM_123456          predicted mRNA
XP_123456          predicted protein
XR_123456          predicted non-coding RNA
ZP_12345678        predicted from NZ_
NC_123456          genomic, e.g., chromosomes
NG_123455          genomic, incomplete region
NT_123456          genomic, BAC assembly
NW_123456          genomic, WGS assembly
NZ_ABCD12345678    genomic, WGS collection

blue=curated
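
Since the two-letter prefix determines the sequence type, a RefSeq accession can be classified with a small lookup. The Python sketch below copies the prefix table from this slide; the function name and the regular expression are hypothetical, and accepting an optional `.version` suffix (as in `XP_123456.2`) is an assumption beyond what the slide shows.

```python
import re

# Prefix-to-type table copied from the slide.
REFSEQ_TYPES = {
    "NM": "mRNA",
    "NP": "protein, from NM_",
    "NR": "non-coding RNA",
    "XM": "predicted mRNA",
    "XP": "predicted protein",
    "XR": "predicted non-coding RNA",
    "ZP": "predicted from NZ_",
    "NC": "genomic, e.g., chromosomes",
    "NG": "genomic, incomplete region",
    "NT": "genomic, BAC assembly",
    "NW": "genomic, WGS assembly",
    "NZ": "genomic, WGS collection",
}

# Two capital letters, an underscore, an identifier, and an optional version.
ACCESSION = re.compile(r"([A-Z]{2})_([A-Z0-9]+)(?:\.(\d+))?")


def classify_refseq(accession):
    """Return prefix, type, and version for a RefSeq-style accession, else None."""
    match = ACCESSION.fullmatch(accession.strip())
    if match is None or match.group(1) not in REFSEQ_TYPES:
        return None
    prefix, _, version = match.groups()
    return {
        "prefix": prefix,
        "type": REFSEQ_TYPES[prefix],
        "version": int(version) if version else None,
    }


print(classify_refseq("NM_123456789"))
print(classify_refseq("XP_123456.2"))
print(classify_refseq("AC141845"))  # a GenBank accession, so not classified
```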

Annotation Process

[Figure: annotation pipeline. Labels: Genomic DNA (NC, NT, NW); Scanning....; Model mRNA (XM), (XR); Model protein (XP); Curated mRNA (NM), (NR); Curated Protein (NP); RefSeq; GenBank Sequences]

Creating NM_ Records

NM’s must have cDNA support

[Figure: genome annotation showing transcript variant 1, transcript variant 2, and transcript variant 3; label "Longest mRNA"]

The Entrez System

[Figure: Entrez at the center, surrounded by its databases: GENSAT, PubChem, Gene, UniGene, CancerChromosomes, UniSTS, Homologene, SNP, PopSet, Genome, Nucleotide, GEO, Books, Taxonomy, PubMed, MeSH, OMIM, Protein, PMC, Journals, Domains, 3D Domains, Structure]

A Few Entrez Databases
  • UniGene: Clusters of ESTs, mRNAs
  • dbSNP: Single Nucleotide Polymorphisms
  • GEO: Gene Expression Omnibus
    • microarray and other expression data
  • CDD: Conserved Domain Database
    • protein families (COGs and KOGs)
    • single domains (PFAM, SMART, CD)

UniGene

Gene-oriented clusters of expressed sequences

  • Automatic clustering using MegaBlast
  • Each cluster represents a unique gene
  • Informed by genome hits
  • Information on tissue types and map locations
  • Useful for gene discovery and selection of mapping reagents
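
The idea of grouping expressed sequences into gene-oriented clusters can be illustrated with single-linkage clustering over pairwise similarity hits. The Python sketch below is a toy, not UniGene's actual procedure: it assumes the hits have already been computed (for example by a tool such as MegaBlast), ignores the genome information that the slide says informs real clusters, and uses made-up sequence names.

```python
def cluster_by_hits(sequence_ids, hits):
    """Single-linkage clustering: sequences joined by any chain of hits share a cluster."""
    parent = {seq_id: seq_id for seq_id in sequence_ids}

    def find(item):
        # Walk up to the cluster representative, shortening the path on the way.
        while parent[item] != item:
            parent[item] = parent[parent[item]]
            item = parent[item]
        return item

    for left, right in hits:
        root_left, root_right = find(left), find(right)
        if root_left != root_right:
            parent[root_right] = root_left  # merge the two clusters

    clusters = {}
    for seq_id in sequence_ids:
        clusters.setdefault(find(seq_id), set()).add(seq_id)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical identifiers and hits, for illustration only.
ids = ["est1", "est2", "est3", "est4", "mrna1"]
hits = [("est1", "mrna1"), ("est2", "mrna1"), ("est3", "est4")]
print(cluster_by_hits(ids, hits))
```

Here est1 and est2 never hit each other directly, yet both land in the same cluster because each matches mrna1, which is how 5’ and 3’ ESTs from one transcript can end up together.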

A Cluster of ESTs

[Figure: query sequence shown with 5’ EST hits and 3’ EST hits]

UniGene Cluster Hs.95351

SELECTED PROTEIN SIMILARITIES

Download sequences

web page

ftp://ftp.ncbi.nih.gov/repository/UniGene/Homo_sapiens/

NCBI’s SNP Database
  • Primary and derivative (RefSNP)
    • Single nucleotide polymorphisms
    • Repeat polymorphisms
    • Insertion-deletion polymorphisms
  • Over 19 million refSNPs (rsXXXXXXX) as of August 2005

RefSNP

Search Mouse SNP between strains

RefSNP

[Figure: RefSNP report. Labels: MapView, No 3D, OMIM, SeqView, GeneView]
[Figure: GEO record types, available through Entrez GEO and Entrez GEO Datasets]

  • GPL: Platform descriptions. Submitted by Manufacturer*
  • GSM (GEO SaMple): Raw/processed spot intensities from a single slide/chip; experimental conditions. Submitted by Experimentalists
  • GSE (GEO SEries): Grouping of slide/chip data, “a single experiment”; set of related samples. Submitted by Experimentalists
  • GDS: Grouping of experiments. Curated by NCBI

What’s a DataSet?

Supplied by submitter
  • Platform (GPL): array definition
  • Sample (GSM): hyb. measurements
  • Series (GSE): related Samples

Assembled by GEO staff
  • DataSet (GDS)
    • A collection of experimentally-related samples processed using the same platform.
    • Samples within DataSets are organized into subgroups based on experimental variables.
    • Form the basis of GEO’s query, analysis and data display tools.
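
The four GEO record types can be told apart by their accession prefix. The Python sketch below encodes the prefixes and who provides each record type, as described on these slides; the function name is hypothetical and the accession numbers in the example are made up for illustration.

```python
import re

# Record kinds and their origin, as described on the slides.
GEO_RECORDS = {
    "GPL": ("Platform", "supplied by submitter"),
    "GSM": ("Sample", "supplied by submitter"),
    "GSE": ("Series", "supplied by submitter"),
    "GDS": ("DataSet", "assembled by GEO staff"),
}


def classify_geo(accession):
    """Return kind, source, and number for a GEO-style accession, else None."""
    match = re.fullmatch(r"(GPL|GSM|GSE|GDS)(\d+)", accession.strip().upper())
    if match is None:
        return None
    kind, source = GEO_RECORDS[match.group(1)]
    return {"kind": kind, "source": source, "number": int(match.group(2))}


# Made-up accession numbers, for illustration only.
for accession in ("GPL96", "gsm1234", "GSE100", "GDS507", "rs12345"):
    print(accession, classify_geo(accession))
```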

Conserved Domain Database
  • Multiple sequence alignments
  • Position-specific scoring matrices (PSSM)
  • Sources: SMART, PFAM, COGs, KOGs, and NCBI curated domains (structure-informed alignments)
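
To make the link between a multiple sequence alignment and a position-specific scoring matrix concrete, here is a toy Python sketch. It is not how CDD builds its matrices: it assumes a uniform background frequency, a fixed pseudocount, equal weight for every sequence, and an ungapped alignment. The function names are hypothetical, and the two aligned fragments are short motifs chosen for illustration.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(alignment, alphabet=AMINO_ACIDS, pseudocount=1.0):
    """Return one {residue: log2-odds score} dictionary per alignment column."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    background = 1.0 / len(alphabet)  # uniform background (a simplification)
    total = len(alignment) + pseudocount * len(alphabet)
    pssm = []
    for column in range(length):
        counts = Counter(seq[column] for seq in alignment)
        pssm.append({
            residue: math.log2(((counts[residue] + pseudocount) / total) / background)
            for residue in alphabet
        })
    return pssm


def score(pssm, sequence):
    """Sum the per-position scores of a sequence the same length as the PSSM."""
    if len(sequence) != len(pssm):
        raise ValueError("sequence length must match the PSSM length")
    return sum(column[residue] for column, residue in zip(pssm, sequence))


# Two short aligned fragments, used here purely as toy input.
pssm = build_pssm(["MHCNSC", "MTCNSC"])
print(round(score(pssm, "MHCNSC"), 2))
print(round(score(pssm, "AAAAAA"), 2))
```

A sequence resembling the alignment scores higher than an unrelated one, which is the property a domain search relies on when matching a query protein against a stored matrix.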

CDD

>gi|45549418|gb|AAS67634.1| ATP7A [Solenodon paradoxus]
IVYQPHLITVEEIKKQIKAVGFPAFIKKQPKYLKLGAIDIERLKNIPVKSSEGSQQMSPSSTNDSKVTLTIDGMHCNSCVSNIESALSTLHYVSSIVVSLQNKSAIIKYNANSVTPEILKKAIEAISPGQYRVSITSEVESTSNSPSSSSQKAPLNVVSQPLTQVTVININGMTCNSCVQSIEGVMSKKAGVKSIQVSLANRNGTVEYDP
LLTSPEILRE

CDD

Click on a colored bar to align your sequence to the CD

[Figure: domain hits labeled CD, Pfam, and COG]

CDD

Linking from Entrez Protein


Genome Resources

Genomic Biology

Gene database

Homologene

Map Viewer

Trace Archive

Entrez Gene
  • A single query interface to …
  • Sequences
    • RefSeqs
    • GenBank
    • Homologene
  • Maps – MapViewer
  • Entrez links
  • Linkouts
  • More organisms, ~ 3000
  • Entrez integration
Gene Record for Pongo NADH2

Not found with “nadh2”

[Figure label: Homo sapiens]

Human HFE: Transcripts

Transcripts with experimental evidence

Introns/Exons: Gene Table

links to sequence


Genome Resources

Genomic Biology

Gene database

Homologene

Map Viewer

Trace Archive

The New Homologene

Automated detection of homologs among the annotated genes of completely sequenced eukaryotic genomes.

  • No longer UniGene based
  • Protein similarities first
  • Guided by taxonomic tree
  • Includes orthologs and paralogs
The New Homologene

Homologene Build 43.1 (8/23/05)

[Table columns: Species; Number of genes (input, grouped); groups]
RAG1

RING-finger

RAG1

Sugar_tr


BLASTP

bl2seq

Genome Resources

Gene database

LocusLink

UniGene

Homologene

Map Viewer

Trace Archive

MV Hs ADAR (Map Viewer, Homo sapiens ADAR)

3’ UTR

5’ UTR

Maps & Options

[Legend: symbol = SNP]

--Sequence maps--
Ab initio, Assembly, Repeats, BES_Clone, Clone, NCI_Clone, Contig, Component, CpG island, dbSNP haplotype, Fosmid, GenBank_DNA, Gene, Phenotype, SAGE_Tag, STS, TCAG_RNA, Transcript (RNA), Hs_UniGene, Hs_EST

--Cytogenetic maps--
Ideogram, FISH Clone, Gene_Cytogenetic, Mitelman Breakpoint, Morbid/Disease

--Genetic Maps--
deCODE, Genethon, Marshfield

--RH maps--
GeneMap99-G3, GeneMap99-GB4, NCBI RH, Stanford-G3, TNG, Whitehead-RH, Whitehead-YAC

Also listed: Mm_UniGene, Mm_EST, Rn_UniGene, Rn_EST, Ssc_UniGene, Ssc_EST, Bt_UniGene, Bt_EST, Gga_UniGene, Gga_EST, Variation

MapViewer

Component

Gene

UniGene

Repeats


Phenotype

Variation

Gene

Genome Resources

Gene database

LocusLink

UniGene

Homologene

Map Viewer

Trace Archive

Trace Archive BLAST Page

Access to sequences NOT in GenBank
