NCBI Molecular Biology Resources



A Field Guide

August 2-3, 2005

University of Massachusetts

NCBI Resources
  • The NCBI Entrez System
  • NCBI Sequence Databases
    • Primary data: GenBank
    • Derivative data: RefSeq, Gene, Genome
    • Beyond RefSeq: UniGene, Trace Archive
  • NCBI Genomic Resources

** Intermission **

  • BLAST
  • Protein Structure and Function
  • Sequence polymorphisms and phenotypes
The National Center for Biotechnology Information
  • Created as a part of NLM in 1988
    • Establish public databases
    • Perform research in computational biology
    • Develop software tools for sequence analysis
    • Disseminate biomedical information
Web Access

  • Text: Entrez
  • Sequence: BLAST
  • Structure: VAST

The NCBI ftp site

30,000 files per day

620 Gigabytes per day

What does NCBI do?
  • NCBI accepts submissions of primary data
  • NCBI develops tools to analyze these data
  • NCBI uses these tools to create derivative databases based on the primary data
  • NCBI provides free search, link, and retrieval of these data, primarily through the Entrez system
Types of Databases
  • Primary Databases
    • Original submissions by experimentalists
    • Content controlled by the submitter
      • Examples: GenBank, SNP, GEO, PubChem Substance
  • Derivative Databases
    • Built from primary data
    • Content controlled by third party (NCBI)
      • Examples: RefSeq, TPA, RefSNP, UniGene, Protein, Structure, Conserved Domain, PubChem Compound
Primary vs. Derivative Databases

[Figure: Labs and sequencing centers submit sequences to GenBank, which is updated ONLY by submitters. GenBank divisions shown: INV, VRT, PHG, VRL, PRI, ROD, PLN, MAM, BCT, plus the bulk divisions EST, STS, GSS, and HTG. NCBI algorithms, curators, and pipelines (the RefSeq annotation pipeline, and the RefSeq LocusLink and Genomes pipelines) build the derivative databases UniGene, UniSTS, and RefSeq, which are updated continually by NCBI.]

What is Entrez?
  • A system of 29 linked databases
  • A text search engine
  • A tool for finding biologically linked data
  • A retrieval engine
  • A virtual workspace for manipulating large datasets
Entrez Databases
  • Each record is assigned a UID
    • unique integer identifier for internal tracking
    • GI number for Nucleotide
  • Each record is given a Document Summary
    • a summary of the record’s content (DocSum)
  • Each record is assigned links to biologically related UIDs
  • Each record is indexed by data fields
    • [author], [title], [organism], and many others
Entrez Taxonomy

The backbone of NCBI

[organism]

An Entrez Database - Nucleotide
  • GenBank: Primary Data (97.9%)
    • original submissions by experimentalists
    • submitters retain editorial control of records
    • archival in nature
  • RefSeq: Derivative Data (2.1%)
    • curated by NCBI staff
    • NCBI retains editorial control of records
    • record content is updated continually
Entrez Nucleotide

Primary Data

  • DDBJ / EMBL / GenBank 56,865,268

Derivative Data

  • RefSeq 1,226,084
  • PDB 5,973
  • Third Party Annotation 4,650

Total 58,101,975
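
The 97.9% / 2.1% split quoted on the previous slide can be reproduced from these counts. The sketch below is only a worked check of that arithmetic; the variable names are illustrative and not part of any NCBI tool.

```python
# Record counts copied from the slide (Entrez Nucleotide, mid-2005).
counts = {
    "DDBJ/EMBL/GenBank": 56_865_268,
    "RefSeq": 1_226_084,
    "PDB": 5_973,
    "Third Party Annotation": 4_650,
}

total = sum(counts.values())

# Share of each source as a percentage of all Entrez Nucleotide records.
shares = {name: 100 * n / total for name, n in counts.items()}

for name, n in counts.items():
    print(f"{name:<24}{n:>12,}  {shares[name]:5.1f}%")
print(f"{'Total':<24}{total:>12,}")
```

The primary (DDBJ/EMBL/GenBank) share rounds to 97.9% and the RefSeq share to 2.1%, matching the figures on the slide.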

What is GenBank? NCBI’s Primary Sequence Database
  • Nucleotide only sequence database
  • Archival in nature
  • Each record is assigned a stable accession number
  • GenBank Data
    • Direct submissions (traditional records)
    • Batch submissions (EST, GSS, STS)
    • ftp accounts (genome data)
  • Three collaborating databases
    • GenBank
    • DNA Database of Japan (DDBJ)
    • European Molecular Biology Laboratory (EMBL) Database
The International Sequence Database Collaboration

[Figure: GenBank (NCBI, NIH), EMBL (EBI), and DDBJ (CIB, NIG) each accept submissions and updates and exchange data with one another. Submission routes to GenBank: Sequin, BankIt, ftp. Retrieval systems shown: Entrez (NCBI), SRS (EBI), getentry (DDBJ).]

GenBank Releases

Release 148 June 2005

45,236,251 Records

49,398,852,122 Nucleotides

>140,000 Species

172 Gigabytes 785 files

  • full release every two months
  • incremental and cumulative updates daily
  • available only through internet

ftp://ftp.ncbi.nih.gov/genbank/

The Growth of GenBank

Release 148: 45.2 million records

49.4 billion nucleotides

Average doubling time ≈ 14 months*
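
To make the doubling time concrete, here is a minimal sketch of the implied exponential growth. It assumes the 14-month average holds constant, which the slide does not claim; the function name is hypothetical.

```python
def projected_size(current: float, months_ahead: float,
                   doubling_months: float = 14.0) -> float:
    """Project a database size forward, assuming a constant doubling time."""
    return current * 2 ** (months_ahead / doubling_months)


# Release 148 (June 2005): 49.4 billion nucleotides.
release_148_bases = 49.4e9

# Growth factor over one year under a 14-month doubling time.
annual_factor = projected_size(1.0, 12)

print(f"Growth per year: x{annual_factor:.2f}")
print(f"Projected bases after 14 months: {projected_size(release_148_bases, 14):.3e}")
```

Under this assumption the database grows by roughly 80% per year.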

GenBank Divisions

Traditional

  • Direct submissions (Sequin/BankIt)
  • Accurate (~1 error per 10,000 bp)
  • Well characterized
  • Organized by taxonomy

  PRI (28)   Primate
  ROD (14)   Rodent
  PLN (13)   Plant and Fungal
  BCT (10)   Bacterial/Archaeal
  INV (7)    Invertebrate
  VRT (7)    Other Vertebrate
  VRL (4)    Viral
  MAM (2)    Mammalian
  PHG (1)    Phage
  SYN (1)    Synthetic
  UNA (1)    Unannotated

Bulk

  • From sequencing projects
  • Batch submissions (ftp/email)
  • Inaccurate
  • Poorly characterized
  • Organized by sequence type

  EST (349)  Expressed Sequence Tag
  GSS (120)  Genome Survey Sequence
  HTG (62)   High Throughput Genomic
  HTC (6)    High Throughput cDNA
  STS (5)    Sequence Tagged Site

A Traditional GenBank Record

The Flatfile Format

Sections: Header, Feature Table, Sequence

LOCUS       AY182241                1931 bp    mRNA    linear   PLN 04-MAY-2004
DEFINITION  Malus x domestica (E,E)-alpha-farnesene synthase (AFS1) mRNA,
            complete cds.
ACCESSION   AY182241
VERSION     AY182241.2  GI:32265057
KEYWORDS    .
SOURCE      Malus x domestica (cultivated apple)
  ORGANISM  Malus x domestica
            Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
            Spermatophyta; Magnoliophyta; eudicotyledons; core eudicots;
            rosids; eurosids I; Rosales; Rosaceae; Maloideae; Malus.
REFERENCE   1  (bases 1 to 1931)
  AUTHORS   Pechous,S.W. and Whitaker,B.D.
  TITLE     Cloning and functional expression of an (E,E)-alpha-farnesene
            synthase cDNA from peel tissue of apple fruit
  JOURNAL   Planta 219, 84-94 (2004)
REFERENCE   2  (bases 1 to 1931)
  AUTHORS   Pechous,S.W. and Whitaker,B.D.
  TITLE     Direct Submission
  JOURNAL   Submitted (18-NOV-2002) PSI-Produce Quality and Safety Lab,
            USDA-ARS, 10300 Baltimore Ave. Bldg. 002, Rm. 205, Beltsville, MD
            20705, USA
REFERENCE   3  (bases 1 to 1931)
  AUTHORS   Pechous,S.W. and Whitaker,B.D.
  TITLE     Direct Submission
  JOURNAL   Submitted (25-JUN-2003) PSI-Produce Quality and Safety Lab,
            USDA-ARS, 10300 Baltimore Ave. Bldg. 002, Rm. 205, Beltsville, MD
            20705, USA
  REMARK    Sequence update by submitter
COMMENT     On Jun 26, 2003 this sequence version replaced gi:27804758.
FEATURES             Location/Qualifiers
     source          1..1931
                     /organism="Malus x domestica"
                     /mol_type="mRNA"
                     /cultivar="'Law Rome'"
                     /db_xref="taxon:3750"
                     /tissue_type="peel"
     gene            1..1931
                     /gene="AFS1"
     CDS             54..1784
                     /gene="AFS1"
                     /note="terpene synthase"
                     /codon_start=1
                     /product="(E,E)-alpha-farnesene synthase"
                     /protein_id="AAO22848.2"
                     /db_xref="GI:32265058"
                     /translation="MEFRVHLQADNEQKIFQNQMKPEPEASYLINQRRSANYKPNIWK
                     NDFLDQSLISKYDGDEYRKLSEKLIEEVKIYISAETMDLVAKLELIDSVRKLGLANLF
                     EKEIKEALDSIAAIESDNLGTRDDLYGTALHFKILRQHGYKVSQDIFGRFMDEKGTLE
                     NHHFAHLKGMLELFEASNLGFEGEDILDEAKASLTLALRDSGHICYPDSNLSRDVVHS
                     LELPSHRRVQWFDVKWQINAYEKDICRVNATLLELAKLNFNVVQAQLQKNLREASRWW
                     ANLGIADNLKFARDRLVECFACAVGVAFEPEHSSFRICLTKVINLVLIIDDVYDIYGS
                     EEELKHFTNAVDRWDSRETEQLPECMKMCFQVLYNTTCEIAREIEEENGWNQVLPQLT
                     KVWADFCKALLVEAEWYNKSHIPTLEEYLRNGCISSSVSVLLVHSFFSITHEGTKEMA
                     DFLHKNEDLLYNISLIVRLNNDLGTSAAEQERGDSPSSIVCYMREVNASEETARKNIK
                     GMIDNAWKKVNGKCFTTNQVPFLSSFMNNATNMARVAHSLYKDGDGFGDQEKGPRTHI
                     LSLLFQPLVN"
ORIGIN
        1 ttcttgtatc ccaaacatct cgagcttctt gtacaccaaa ttaggtattc actatggaat
       61 tcagagttca cttgcaagct gataatgagc agaaaatttt tcaaaaccag atgaaacccg
      121 aacctgaagc ctcttacttg attaatcaaa gacggtctgc aaattacaag ccaaatattt
      181 ggaagaacga tttcctagat caatctctta tcagcaaata cgatggagat gagtatcgga
      241 agctgtctga gaagttaata gaagaagtta agatttatat atctgctgaa acaatggatt
     1801 aataaatagc agcaaaagtt tgcggttcag ttcgtcatgg ataaattaat ctttacagtt
     1861 tgtaacgttg ttgccaaaga ttatgaataa aaagttgtag tttgtcgttt aaaaaaaaaa
     1921 aaaaaaaaaa a
//
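
The header of a flatfile record is plain text, so its first line can be picked apart with ordinary string handling. The sketch below parses the LOCUS line of the record above. It assumes all eight whitespace-separated fields are present, as they are here; `LocusLine` and `parse_locus` are hypothetical names, and a production parser would use the fixed column positions of the GenBank format instead.

```python
from dataclasses import dataclass


@dataclass
class LocusLine:
    name: str        # locus name (here identical to the accession)
    length: int      # sequence length
    unit: str        # "bp" for nucleotide records
    molecule: str    # e.g. mRNA, DNA
    topology: str    # linear or circular
    division: str    # GenBank division code, e.g. PLN
    date: str        # last modification date


def parse_locus(line: str) -> LocusLine:
    """Split a LOCUS line on whitespace (assumes no field is missing)."""
    fields = line.split()
    if len(fields) != 8 or fields[0] != "LOCUS":
        raise ValueError(f"unexpected LOCUS line: {line!r}")
    _, name, length, unit, molecule, topology, division, date = fields
    return LocusLine(name, int(length), unit, molecule, topology, division, date)


locus = parse_locus("LOCUS AY182241 1931 bp mRNA linear PLN 04-MAY-2004")
print(locus.name, locus.length, locus.division)
```

The division code (PLN) is the same value that the `gbdiv` property indexes in Entrez.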

An Example Record – M17755

Indexing for Nucleotide UID 4680720

Field                 Indexed Terms
[primary accession]   M17755
[title]               Homo sapiens thyroid peroxidase (TPO) mRNA…
[organism]            Homo sapiens
[sequence length]     3060
[modification date]   1999/04/26
[properties]          biomol mrna
                      gbdiv pri
                      srcdb genbank


M17755: Feature Table

Terms indexed from the feature table:

  • TPO [gene name]
  • thyroiditis [text word]
  • thyroid peroxidase [protein name]

Also highlighted: CDS position in bp, protein accession


Sequence: 99.99% Accurate

The sequence itself is not indexed… Use BLAST for that!


Entrez Protein
  • GenPept (DDBJ, EMBL, GenBank) 4,444,405
  • RefSeq 1,753,167
  • PIR 222,395
  • Swiss-Prot 189,005
  • PDB 68,621
  • PRF 12,079
  • Third Party Annotation 4,219

Total 6,693,891

Protein Sources and Links

  • PIR: no mRNA!
  • RefSeq → NM_000537
  • SWISS-PROT: no mRNA!
  • GenPept → M17755


Sequence Revisions

First seen at NCBI, not first seen at GenBank!

Version and GI change only if the sequence changes

The accession number always retrieves the most recent version
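
The accession/version rule can be sketched in a few lines. The helper names below are hypothetical, and the pairing of AY182241.1 with GI 27804758 is inferred from the COMMENT line of the apple record shown earlier ("this sequence version replaced gi:27804758"), not stated outright on the slides.

```python
from typing import List, Optional, Tuple


def split_accession(identifier: str) -> Tuple[str, Optional[int]]:
    """Split 'ACCESSION.VERSION' into its parts; the version may be absent."""
    accession, dot, version = identifier.partition(".")
    return accession, (int(version) if dot else None)


def latest(versions: List[str]) -> str:
    """Return the identifier with the highest version number."""
    return max(versions, key=lambda v: split_accession(v)[1] or 0)


# Each sequence change produced a new version AND a new GI (pairing inferred).
gi_by_version = {"AY182241.1": 27804758, "AY182241.2": 32265057}

# A bare accession behaves like "give me the most recent version".
current = latest(list(gi_by_version))
print(current, gi_by_version[current])
```

Annotation-only updates leave both the version suffix and the GI untouched, which is why neither can be used to detect them.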

Update without a Sequence Change

June 15, 1989!

GenBank came to NCBI in 1992!


GenBank File Formats

ASN.1 – The Raw Data

flat file

XML (4 flavors)

FASTA
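
Of these formats, FASTA is the simplest: one definition line starting with ">" followed by the sequence. Below is a minimal sketch of a writer, using the start of the apple mRNA shown earlier. The 70-column line width and the layout of the definition line are assumptions for illustration, and `to_fasta` is a hypothetical helper.

```python
def to_fasta(defline: str, sequence: str, width: int = 70) -> str:
    """Format a sequence as FASTA text, wrapping sequence lines at `width`."""
    # Flatfile ORIGIN blocks contain spaces between 10-base groups; drop them.
    residues = "".join(sequence.split())
    lines = [residues[i:i + width] for i in range(0, len(residues), width)]
    return "\n".join([">" + defline] + lines) + "\n"


# First ORIGIN line of AY182241.2 (bases 1-60 only, not the full record).
first_60 = ("ttcttgtatc ccaaacatct cgagcttctt "
            "gtacaccaaa ttaggtattc actatggaat")

print(to_fasta("AY182241.2 Malus x domestica AFS1 mRNA (first 60 bases)", first_60))
```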

NCBI Toolbox

Toolbox Sources

ftp> open ftp.ncbi.nih.gov
.
.
ftp> cd toolbox
ftp> cd ncbi_tools

ftp://ftp.ncbi.nih.gov/toolbox/ncbi_tools

/************************************************************************
*
* asn2ff.c
* convert an ASN.1 entry to flat file format, using the FFPrintArray.
*
**************************************************************************/
#include <accentr.h>
#include "asn2ff.h"
#include "asn2ffp.h"
#include "ffprint.h"
#include <subutil.h>
#include <objall.h>
#include <objcode.h>
#include <lsqfetch.h>
#include <explore.h>
#ifdef ENABLE_ID1
#include <accid1.h>
#endif

FILE *fpl;

Args myargs[] = {
    {"Filename for asn.1 input","stdin",NULL,NULL,TRUE,'a',ARG_FILE_IN,0.0,0,NULL},
    {"Input is a Seq-entry","F", NULL ,NULL ,TRUE,'e',ARG_BOOLEAN,0.0,0,NULL},
    {"Input asnfile in binary mode","F",NULL,NULL,TRUE,'b',ARG_BOOLEAN,0.0,0,NULL},
    {"Output Filename","stdout", NULL,NULL,TRUE,'o',ARG_FILE_OUT,0.0,0,NULL},
    {"Show Sequence?","T", NULL ,NULL ,TRUE,'h',ARG_BOOLEAN,0.0,0,NULL},


Text Searches in Entrez

term1 term2

If no [limit] is specified…

  • Organism? → [organism]
  • Journal? → [journal]
  • User compounds? → search as phrase
  • Author? → [author]
  • else → [All Fields]

term1[limit] OP term2[limit] OP …

where

  limit = Entrez indexing field (organism, author, …)
  OP = AND, OR, NOT
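
The term[limit] OP term[limit] pattern is easy to assemble programmatically. The following is a minimal sketch of a query-string builder; `tag` and `combine` are hypothetical helper names, and the code only builds text, it does not contact Entrez.

```python
from typing import Optional

OPERATORS = {"AND", "OR", "NOT"}


def tag(term: str, field: Optional[str] = None) -> str:
    """Attach an Entrez field limit to a term, e.g. human[orgn]."""
    return f"{term}[{field}]" if field else term


def combine(op: str, *terms: str) -> str:
    """Join already-tagged terms with one Boolean operator."""
    if op not in OPERATORS:
        raise ValueError(f"operator must be one of {sorted(OPERATORS)}")
    return f" {op} ".join(terms)


query = combine("AND", tag("thyroid peroxidase", "title"), tag("human", "orgn"))
print(query)
```

Leaving the field off (`tag("thyroid peroxidase")`) hands the term to Entrez untagged, so the automatic mapping described above applies.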

Entrez Tabs

  • Limits: provides a simple form for applying commonly used Entrez limits
  • Preview/Index: allows access to the full indexing of each Entrez database and aids in constructing complex queries
  • History: provides access to previous searches in the current Entrez database
  • Clipboard: a temporary storage area for selected records
  • Details: displays the detailed parsing of the current Entrez query, and lists errors and terms without matches


Programming Entrez: E-Utilities

http://www.ncbi.nih.gov/entrez/query/static/eutils_help.html

Utility    Input                 Output
ESearch    Entrez query          UID list or History
ESummary   UID list or History   Document summaries
EFetch     UID list or History   Formatted data
ELink      UID list or History   UID list or History
EPost      UID list              History
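
As a rough illustration of the ESearch → EFetch flow, the sketch below only builds request URLs and makes no network calls. The base URL is the current E-utilities endpoint rather than the 2005 address on the slide, and the helper names are hypothetical. The `db`, `term`, `retmax`, `usehistory`, `id`, `rettype`, and `retmode` parameters are standard E-utilities parameters.

```python
from typing import Iterable
from urllib.parse import urlencode

# Assumption: present-day endpoint, not the URL shown on the 2005 slide.
EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db: str, term: str, retmax: int = 20,
                use_history: bool = False) -> str:
    """Entrez query in, UID list (or History entry) out."""
    params = {"db": db, "term": term, "retmax": retmax}
    if use_history:
        params["usehistory"] = "y"
    return f"{EUTILS}/esearch.fcgi?{urlencode(params)}"


def efetch_url(db: str, ids: Iterable[int], rettype: str = "gb",
               retmode: str = "text") -> str:
    """UID list in, formatted records out (GenBank flatfile by default)."""
    params = {"db": db, "id": ",".join(str(i) for i in ids),
              "rettype": rettype, "retmode": retmode}
    return f"{EUTILS}/efetch.fcgi?{urlencode(params)}"


print(esearch_url("nucleotide", "thyroid peroxidase[title] AND human[orgn]"))
print(efetch_url("nucleotide", [4680720]))  # UID of M17755 from the slides
```

Passing `rettype="fasta"` to `efetch_url` would request FASTA instead of the flatfile.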

Finding Primary Sequences
  • Search Entrez Nucleotide
    • 97.9% GenBank (primary data)
    • 2.1% RefSeq (curated data)

Possible queries we’ve seen so far…

  • M17755 [primary accession]
  • TPO [gene name]
  • thyroid peroxidase [title]
  • thyroiditis [text word]
  • Homo sapiens [organism]
  • thyroid peroxidase [protein name]
  • 3060 [sequence length]
  • 1999/04/26 [modification date]
  • biomol mrna [properties]
  • gbdiv pri [properties]
  • srcdb genbank [properties]


A Starting Query

Find nucleotide records for human thyroid peroxidase

Query: human thyroid peroxidase (309 records)

Parsed as: (("Homo sapiens"[Organism] OR human[All Fields]) AND thyroid peroxidase[All Fields])

Field Limit!

Query: human[organism] AND thyroid peroxidase (298 records)

Parsed as: ("Homo sapiens"[Organism] AND thyroid peroxidase[All Fields])

11 records aren’t human sequences!!


Limit by Title and Database
  • Entrez Nucleotide
    • GenBank: srcdb ddbj/embl/genbank[properties]
    • RefSeq: srcdb refseq[properties]

#1: thyroid peroxidase AND human[orgn] 298

#2: thyroid peroxidase[title] AND human[orgn] 169

#3: #2 AND srcdb refseq[properties] 5

#4: #2 AND srcdb ddbj/embl/genbank[properties] 164 (primary data)


Limit by GenBank Division

  • EST Division: gbdiv est[prop]
  • Primate Division: gbdiv pri[prop]

#1: thyroid peroxidase AND human[orgn] 298

#2: thyroid peroxidase[title] AND human[orgn] 169

#3: #2 AND srcdb refseq[properties] 5

#4: #2 AND srcdb ddbj/embl/genbank[properties] 164

#5: #4 AND gbdiv est[prop] 20

#6: #4 AND gbdiv pri[prop] 144 (traditional GenBank records)


Limit by Biomolecule Type

  • Genomic DNA: biomol genomic[prop]
  • cDNA: biomol mrna[prop]

#1: thyroid peroxidase AND human[orgn] 298

#2: thyroid peroxidase[title] AND human[orgn] 169

#3: #2 AND srcdb refseq[properties] 5

#4: #2 AND srcdb ddbj/embl/genbank[properties] 164

#5: #4 AND gbdiv est[prop] 20

#6: #4 AND gbdiv pri[prop] 144

#7: #6 AND biomol genomic[prop] 26 (genomic DNA)

#8: #6 AND biomol mrna[prop] 118 (mRNA / cDNA)
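
Each numbered History entry is shorthand for the full query it was built from. The sketch below expands the `#n` references recursively so the final query can be read (or reused) on its own. It is an illustration only: `expand` is a hypothetical helper, and #5 and #6 are taken to refine #4, as on the GenBank-division slide, which is the reading the record counts support (20 + 144 = 164).

```python
import re

history = {
    1: "thyroid peroxidase AND human[orgn]",
    2: "thyroid peroxidase[title] AND human[orgn]",
    3: "#2 AND srcdb refseq[properties]",
    4: "#2 AND srcdb ddbj/embl/genbank[properties]",
    5: "#4 AND gbdiv est[prop]",
    6: "#4 AND gbdiv pri[prop]",
    7: "#6 AND biomol genomic[prop]",
    8: "#6 AND biomol mrna[prop]",
}

# Record counts from the slides, used to check that each split is a partition.
counts = {2: 169, 3: 5, 4: 164, 5: 20, 6: 144, 7: 26, 8: 118}


def expand(n: int, entries: dict) -> str:
    """Replace every #k in entry n by the parenthesised expansion of entry k."""
    def substitute(match: re.Match) -> str:
        return "(" + expand(int(match.group(1)), entries) + ")"
    return re.sub(r"#(\d+)", substitute, entries[n])


print(expand(8, history))
```

The counts confirm that each step splits its parent cleanly: 5 + 164 = 169, 20 + 144 = 164, and 26 + 118 = 144.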

Limit by Protein Name

thyroid peroxidase[protein name] AND human[orgn] AND gbdiv pri[prop] AND biomol mrna[prop]

118 records [title] → 4 records [protein name]


Entrez Document Summaries

  • Click the accession to view the record
  • Links menu: links to other Entrez databases computed for M17755


Entrez Links for GI 4680720

Gene annotation based on M17755

Full text online articles about M17755

All polymorphisms in the TPO gene

DNA/RNA sequences similar to M17755

Graphical view of TPO gene annotation

Human phenotypes involving TPO

Microarray datasets for M17755

Protein translation of M17755

Literature abstracts about M17755

Sequence polymorphisms in M17755

Source organism of M17755

STS markers in the TPO gene

TPO links beyond NCBI

GenBank Sequences for Human TPO

Which one is the best sequence???

RefSeq: NCBI’s Derivative Sequence Database

RefSeq Benefits

  • Non-redundant  
  • Explicitly linked nucleotide and protein sequences
  • Updated to reflect current sequence data and biology
  • Validated by hand
  • Format consistency
  • Distinct accession series
  • Stewardship by NCBI staff and collaborators

ftp://ftp.ncbi.nih.gov/refseq/release

RefSeq: NCBI’s Derivative Sequence Database

Accession series (Nucleotide → Protein):

  • Curated transcripts and proteins
    • NM_123456 → NP_123456
    • NR_123456 (non-coding RNA)
  • Model transcripts and proteins
    • XM_123456 → XP_123456
    • XR_123456 (non-coding RNA)
  • Assembled Genomic Regions (contigs)
    • NT_123456 (BAC clones)
    • NW_123456 (WGS)
  • Other Genomic Sequence
    • NG_123456 (complex regions, pseudogenes)
    • NZ_ABCD12345678 (WGS) → ZP_123456
  • Chromosome records in Entrez Genome
    • NC_123456 (chromosome; microbial or organelle genome)
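
Because the two-letter prefix carries the record type, a RefSeq accession can be classified without looking it up. The sketch below encodes the series listed above as a lookup table; the table reflects this slide only, and `classify` is a hypothetical helper, not an NCBI function.

```python
from typing import Optional, Tuple

# prefix -> (description, molecule type), transcribed from the slide above
REFSEQ_PREFIXES = {
    "NM": ("curated mRNA", "nucleotide"),
    "NR": ("curated non-coding RNA", "nucleotide"),
    "NP": ("curated protein", "protein"),
    "XM": ("model mRNA", "nucleotide"),
    "XR": ("model non-coding RNA", "nucleotide"),
    "XP": ("model protein", "protein"),
    "NT": ("genomic contig (BAC clones)", "nucleotide"),
    "NW": ("genomic contig (WGS)", "nucleotide"),
    "NG": ("other genomic (complex regions, pseudogenes)", "nucleotide"),
    "NZ": ("WGS genomic sequence", "nucleotide"),
    "ZP": ("protein annotated on an NZ record", "protein"),
    "NC": ("chromosome; microbial or organelle genome", "nucleotide"),
}


def classify(accession: str) -> Optional[Tuple[str, str]]:
    """Return (description, molecule) for a RefSeq accession, else None."""
    prefix, underscore, _ = accession.partition("_")
    if not underscore:
        return None  # GenBank accessions such as M17755 have no underscore
    return REFSEQ_PREFIXES.get(prefix.upper())


for acc in ("NM_000547", "XM_496543", "NC_000002", "M17755"):
    print(acc, classify(acc))
```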

Creating NM Records

  • Genome annotation
  • Longest mRNA
  • NMs must have cDNA support


NM/NP Records in Entrez

NM_000547: variant 1

COMMENT     REVIEWED REFSEQ: This record has been curated by NCBI staff.
            The reference sequence was derived from M17755.2 and AW874082.1.
            On Feb 25, 2003 this sequence version replaced gi:21361188.

AW874082.1: EST that completes 3’ end

NM_175719: variant 2

COMMENT     REVIEWED REFSEQ: This record has been curated by NCBI staff.
            The reference sequence was derived from J02970.1, AW874082.1 and M17755.2.


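Annotating the Gene

[Figure: Genomic DNA (NC, NT, NW) is scanned to produce model mRNAs (XM), model non-coding RNAs (XR), and model proteins (XP). GenBank sequences feed the RefSeq curated mRNAs (NM), non-coding RNAs (NR), and curated proteins (NP). The figure marks the comparison between model and curated records with "= ?" and "= !".]
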

Entrez Gene and RefSeq

[Figure labels: Gene, GenBank, RefSeq, Nucleotide]

  • Entrez Gene is the central depository for information about a gene available at NCBI, and often provides links to sites beyond NCBI
  • Entrez Gene includes records for organisms that have NCBI Reference Sequences (RefSeqs)
  • Entrez Gene records contain RefSeq mRNAs, proteins, and genomic DNA (if known) for a gene locus, plus links to other Entrez databases
  • NCBI RefSeqs are based on primary sequence data in GenBank
BLAST Results for XM_496543

Is there any GenBank support for this mRNA?

srcdb ddbj/embl/genbank[prop] AND biomol mrna[prop]

no full-length hit

The Perils of the XM

XM records are models based only on genomic sequence, and are subject to revision or removal with each new build of that genome.

BLAST the XM against the RefSeq database to look for a replacement:

Query= gi|20850420|ref|XM_124429.1|
       Mus musculus expressed sequence AA553001 (AA553001), mRNA

gi|19527087|ref|NM_133873.1|
       Mus musculus DNA segment, Chr 4, Wayne State University 114,
       expressed (D4Wsu114e), mRNA
       Length=1898

Score = 3701.55 bits (1867), Expect = 0
Identities = 1870/1871 (99%), Gaps = 0/1871 (0%)
Strand=Plus/Plus


Eukaryotic NM/XM Records

Bos taurus: 37541

Oryza sativa (japonica cultivar-group): 36836

Danio rerio: 30577

Homo sapiens: 29261

Arabidopsis thaliana: 28953

Mus musculus: 27033

Rattus norvegicus: 23975

Pan troglodytes: 21810

Caenorhabditis elegans: 21124

Drosophila melanogaster: 19412

Aspergillus nidulans FGSC A4: 18951

Gallus gallus: 18120

Canis familiaris: 16891

Anopheles gambiae str. PEST: 15328

Plasmodium chabaudi: 14747

Candida albicans SC5314: 13672

Dictyostelium discoideum: 13570

Ustilago maydis 521: 13044

Plasmodium berghei: 11778

Gibberella zeae PH-1: 11640

Magnaporthe grisea 70-15: 11109

Neurospora crassa: 10079

Aspergillus fumigatus Af293: 9923

Entamoeba histolytica HM-1:IMSS: 9772

Cryptococcus neoformans var. neoformans JEC21: 6594

Giardia lamblia ATCC 50803: 6569

Yarrowia lipolytica CLIB99: 6521

Debaryomyces hansenii CBS767: 6318

Apis mellifera: 6292

Kluyveromyces lactis NRRL Y-1140: 5327

Candida glabrata CBS138: 5181

Schizosaccharomyces pombe 972h-: 5035

Eremothecium gossypii: 4718

Theileria parva: 4079

Xenopus tropicalis: 4069

Cryptosporidium hominis: 3886

Cryptosporidium parvum: 3396

Sus scrofa: 938

Trypanosoma brucei: 599

Ovis aries: 253

Strongylocentrotus purpuratus: 215

Felis catus: 162

Plasmodium yoelii yoelii: 105

Takifugu rubripes: 7

Ciona intestinalis: 3

Trypanosoma cruzi: 3

Genome Annotation in Entrez Nucleotide

[Figure: GenBank components (clones, WGS) are assembled into NT/NW contigs, which make up the NC genome assembly records; NM/XM mRNA records are annotated on them. Link labels shown: Master, mRNA, Components.]


Genome Annotation Links

  • curated mRNA
  • genomic contig on human chromosome 2 containing NM_000547
  • human chromosome 2
  • the 21 contigs of the chromosome 2 assembly


Getting the Annotation Details

Genomic sequence

ACCESSION NC_000002 REGION: 1396242..1525502

exon-intron structure

These flat files contain all annotations in the gene and the full, explicit sequence
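
The REGION coordinates are inclusive, 1-based positions on the chromosome record, so the size of the displayed gene region follows directly. A minimal sketch, with hypothetical helper names:

```python
import re
from typing import Tuple

REGION_LINE = re.compile(r"ACCESSION\s+(\S+)\s+REGION:\s+(\d+)\.\.(\d+)")


def parse_region(line: str) -> Tuple[str, int, int]:
    """Extract (accession, start, stop) from an ACCESSION ... REGION line."""
    match = REGION_LINE.search(line)
    if match is None:
        raise ValueError(f"no REGION found in: {line!r}")
    accession, start, stop = match.groups()
    return accession, int(start), int(stop)


def region_length(start: int, stop: int) -> int:
    """Length of an inclusive, 1-based interval."""
    return stop - start + 1


accession, start, stop = parse_region(
    "ACCESSION NC_000002 REGION: 1396242..1525502"
)
print(accession, region_length(start, stop))
```

For the TPO region shown here this gives 129,261 bp of genomic sequence on NC_000002.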

Searching Entrez Gene

Gene symbol: human thyroid peroxidase (TPO)

tpo[sym] AND human[organism]

Protein name: topoisomerase genes from Archaea

topoisomerase[gene/protein name] AND archaea[organism]

Chromosome and Links: genes on human chromosome 2 with OMIM links

2[chromosome] AND gene omim[filter] AND human[organism]

RefSeq status and variants: Reviewed RefSeqs with transcript variants

srcdb refseq reviewed[prop] AND has transcript variants[prop]

Disease and Gene Ontology: Membrane proteins linked to cancer

integral to plasma membrane[gene ontology] AND cancer[dis]

Gene Links in Entrez

Microarray datasets for TPO

Gene homologs for TPO

DNA and RNA sequences for TPO

Phenotypes involving TPO

Protein sequences for TPO

Literature abstracts about TPO

Sequence polymorphisms in TPO

Species whose genome has this TPO gene

STS markers in the TPO gene

ESTs aligned to the TPO gene

Third Party Annotation (TPA) Database

NCBI now accepts the submission of new annotations of existing GenBank sequences.

  • Submissions must be published in a peer-reviewed journal.
  • Facilitates the annotation of sequences by experts.

Examples of sequences appropriate for TPA are:

  • Annotation of features on gene and/or mRNA sequences
  • Assembled “full length” genes and/or mRNAs

What should not be submitted to TPA?

  • Synthetic constructs (such as cloning vectors) that use well-characterized, publicly available genes, promoters, or terminators
  • Updates or changes to existing sequence data
  • Sequence annotations without experimental evidence
Beyond RefSeq

If your organism does not have RefSeqs…

  • UniGene : gene-based clusters of cDNAs and ESTs
  • WGS sequences in Entrez Nucleotide (wgs[prop])
  • Trace Archive
What is UniGene?

A gene-oriented view of sequence entries

  • MegaBlast based automated sequence clustering
  • Now informed by genome hits (New!)
  • Nonredundant set of gene-oriented clusters
  • Each cluster represents a unique gene
  • Information on tissue types and map locations
  • Includes known genes and uncharacterized ESTs
  • Useful for gene discovery and selection of mapping reagents
Organisms in UniGene

Top Ten

1. Human

2. Rice

3. Mouse

4. Cow

5. Wheat

6. Zebrafish

7. Pig

8. Chicken

9. Frog (X. laevis)

10. Frog (X. tropicalis)

Finding UniGene Clusters

by link

by Entrez search

Entrez GEO

Submitted by experimentalists:

  • GSM: raw/processed spot intensities from a single slide/chip
  • GSE: grouping of slide/chip data (“a single experiment”)

Submitted by manufacturer*:

  • GPL: platform descriptions

Curated by NCBI:

  • GDS: grouping of experiments

Entrez databases: Entrez GEO, Entrez GEO Datasets


Whole Genome Shotgun Projects

[Figure label: Traditional GenBank Divisions]

300+ projects

  • Viruses
  • Bacteria
  • Environmental sequences
  • Archaea
  • 73 Eukaryotes featuring:
    • Cow, Chicken, Rat, Mouse, Dog, Chimpanzee, Human
    • Pufferfish (2), Zebrafish
    • Honeybee, Anopheles, Fruit Flies (4), Silkworm
    • Nematode (C. briggsae)
    • Yeasts (9), Aspergillus (3)
    • Rice
Viewing Simple Genomes

All are RefSeq NC records in Entrez Genome

  • Full chromosomal sequences are provided
  • Genes are annotated
  • The annotation can be shown graphically and linked to sequence records
Viewing Complex Genomes

NCBI Map Viewer

  • Map Viewer Home Page
    • Shows all supported organisms
    • Provides links to genomic BLAST
  • Genome Overview Page
    • Provides links to individual chromosomes
    • Shows hits on a genome graphically
  • Chromosome Viewing Page
    • Allows interactive views of annotation details
    • Provides numerous maps unique to each genome
Genome Overview Page

Search the maps

Genomic BLAST

Species-specific help!

Chromosome Viewing Page

  • Map Summary
  • Add or remove maps
  • Master Map with exploded content
  • Maps shown: Genes, UniGene, Contigs
  • Zooming Controls
  • Ideogram


Map Summary

TPO’s contig!

Map Content

Map content varies greatly by species!

  • Sequence Maps
    • Core assembly
    • Annotation evidence
    • Clones & Markers
    • Polymorphisms
    • Links & Features
  • Genetic Maps
    • Cytogenetic maps
    • Linkage maps
    • Radiation hybrid maps

Maps shown: Assembly, Contig, Component, Transcript, Gene

Assembly of Chr. 2

NT_033000

1255072

1563756

View of TPO

Links to Entrez Nucleotide

Links to Entrez Gene

Links to Tools and Data

Gap in assembly

Map Content (continued)

Maps shown: Ab initio (model), GenBank DNA, EST, UniGene, Gene


Annotation Evidence

GenBank records not used in assembly

UniGene Clusters

Ab initio models

Aligned ESTs

Entrez HomoloGene

Homologs by protein BLAST
