NCBI Molecular Biology Resources. A Field Guide. August 2-3, 2005. University of Massachusetts. NCBI Resources. The NCBI Entrez System NCBI Sequence Databases Primary data: GenBank Derivative data: RefSeq, Gene, Genome Beyond Refseq: UniGene, Trace Archive NCBI Genomic Resources

Ncbi molecular biology resources

NCBI Molecular Biology Resources

A Field Guide

August 2-3, 2005

University of Massachusetts


Ncbi resources

NCBI Resources

  • The NCBI Entrez System

  • NCBI Sequence Databases

    • Primary data: GenBank

    • Derivative data: RefSeq, Gene, Genome

    • Beyond RefSeq: UniGene, Trace Archive

  • NCBI Genomic Resources

    ** Intermission **

  • BLAST

  • Protein Structure and Function

  • Sequence polymorphisms and phenotypes


The national institutes of health

The National Institutes of Health

Bethesda, MD


The national center for biotechnology information

The National Center for Biotechnology Information

  • Created as a part of NLM in 1988

    • Establish public databases

    • Perform research in computational biology

    • Develop software tools for sequence analysis

    • Disseminate biomedical information


Web access

Web Access

Text: Entrez

Sequence: BLAST

Structure: VAST


Ncbi web traffic

NCBI Web Traffic

[Chart: users per day; the plot is annotated at Christmas and New Year’s Day]


The ncbi ftp site

The NCBI ftp site

30,000 files per day

620 Gigabytes per day


What does ncbi do

What does NCBI do?

  • NCBI accepts submissions of primary data

  • NCBI develops tools to analyze these data

  • NCBI uses these tools to create derivative databases based on the primary data

  • NCBI provides free search, link, and retrieval of these data, primarily through the Entrez system


Types of databases

Types of Databases

  • Primary Databases

    • Original submissions by experimentalists

    • Content controlled by the submitter

      • Examples: GenBank, SNP, GEO, PubChem Substance

  • Derivative Databases

    • Built from primary data

    • Content controlled by third party (NCBI)

      • Examples: RefSeq, TPA, RefSNP, UniGene, Protein, Structure, Conserved Domain, PubChem Compound


Primary vs derivative databases

Primary vs. Derivative Databases

[Diagram: sequences flow from sequencing centers and labs into GenBank, and from GenBank into the derivative databases]

Primary: GenBank (divisions EST, STS, GSS, HTG, INV, VRT, PHG, VRL, PRI, ROD, PLN, MAM, BCT). Updated ONLY by submitters.

Derivative: UniGene, UniSTS, and RefSeq, built from GenBank by algorithms, pipelines (RefSeq: Annotation Pipeline; RefSeq: LocusLink and Genomes Pipelines), and curators. Updated continually by NCBI.


What is entrez

What is Entrez?

  • A system of 29 linked databases

  • A text search engine

  • A tool for finding biologically linked data

  • A retrieval engine

  • A virtual workspace for manipulating large datasets


The entrez system text searches

The Entrez System: Text Searches


Entrez databases

Entrez Databases

  • Each record is assigned a UID

    • unique integer identifier for internal tracking

    • GI number for Nucleotide

  • Each record is given a Document Summary

    • a summary of the record’s content (DocSum)

  • Each record is assigned links to biologically related UIDs

  • Each record is indexed by data fields

    • [author], [title], [organism], and many others


Entrez taxonomy

Entrez Taxonomy

The backbone of NCBI

[organism]


An entrez database nucleotide

An Entrez Database - Nucleotide

  • GenBank: Primary Data (97.9%)

    • original submissions by experimentalists

    • submitters retain editorial control of records

    • archival in nature

  • RefSeq: Derivative Data (2.1%)

    • curated by NCBI staff

    • NCBI retains editorial control of records

    • record content is updated continually


Entrez nucleotide

Entrez Nucleotide

Primary Data

  • DDBJ / EMBL / GenBank 56,865,268

Derivative Data

  • RefSeq 1,226,084

  • PDB 5,973

  • Third Party Annotation 4,650

Total 58,101,975
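The 97.9% / 2.1% split quoted for Entrez Nucleotide can be checked against these counts. The following is a minimal sketch; the dictionary keys are illustrative names only, and the figures are the ones shown on the slide.

```python
# Record counts for Entrez Nucleotide as listed on the slide.
counts = {
    "DDBJ/EMBL/GenBank": 56_865_268,
    "RefSeq": 1_226_084,
    "PDB": 5_973,
    "Third Party Annotation": 4_650,
}

total = sum(counts.values())

# Percentage share of each source, rounded to one decimal place.
shares = {name: round(100 * n / total, 1) for name, n in counts.items()}

for name, share in shares.items():
    print(f"{name}: {share}%")
print(f"Total: {total:,}")
```

The sum reproduces the slide's total, and the GenBank and RefSeq shares round to the quoted percentages.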


What is genbank ncbi s primary sequence database

What is GenBank? NCBI’s Primary Sequence Database

  • Nucleotide only sequence database

  • Archival in nature

  • Each record is assigned a stable accession number

  • GenBank Data

    • Direct submissions (traditional records)

    • Batch submissions (EST, GSS, STS)

    • ftp accounts (genome data)

  • Three collaborating databases

    • GenBank

    • DNA Database of Japan (DDBJ)

    • European Molecular Biology Laboratory (EMBL) Database


The international sequence database collaboration

The International Sequence Database Collaboration

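NCBI (NIH): GenBank

  • Submissions and updates (Sequin, BankIt, ftp)

  • Retrieval: Entrez

EBI: EMBL

  • Submissions and updates

  • Retrieval: SRS

CIB, NIG: DDBJ

  • Submissions and updates

  • Retrieval: getentry

[Diagram: the three databases exchange their data with one another]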


Genbank releases

GenBank Releases

Release 148, June 2005

45,236,251 Records

49,398,852,122 Nucleotides

>140,000 Species

172 Gigabytes, 785 files

  • full release every two months

  • incremental and cumulative updates daily

  • available only through internet

ftp://ftp.ncbi.nih.gov/genbank/


The growth of genbank

The Growth of GenBank

Release 148: 45.2 million records

49.4 billion nucleotides

Average doubling time ≈ 14 months*
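To get a feel for these figures, the sketch below computes the average record length for Release 148 and extrapolates the nucleotide count with the quoted doubling time. The constant-doubling assumption and the function name `project` are illustrative only; actual growth will differ.

```python
# Figures for GenBank Release 148 as quoted on the slides.
RECORDS = 45_236_251
NUCLEOTIDES = 49_398_852_122
DOUBLING_MONTHS = 14  # approximate average doubling time from the slide


def project(size, months, doubling_months=DOUBLING_MONTHS):
    """Extrapolate a size forward, assuming a constant doubling time."""
    return size * 2 ** (months / doubling_months)


average_length = NUCLEOTIDES / RECORDS
print(f"Average record length: {average_length:.0f} bp")
print(f"Projected nucleotides after 14 months: {project(NUCLEOTIDES, 14):,.0f}")
print(f"Projected nucleotides after 28 months: {project(NUCLEOTIDES, 28):,.0f}")
```

The short average length reflects how heavily the bulk divisions (such as EST and GSS) weigh on the record count.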


Genbank divisions

GenBank Divisions

Traditional

  • Direct Submissions (Sequin/BankIt)

  • Accurate (~1 error per 10,000 bp)

  • Well characterized

  • Organized by taxonomy

PRI (28) Primate
ROD (14) Rodent
PLN (13) Plant and Fungal
BCT (10) Bacterial/Archaeal
INV (7) Invertebrate
VRT (7) Other Vertebrate
VRL (4) Viral
MAM (2) Mammalian
PHG (1) Phage
SYN (1) Synthetic
UNA (1) Unannotated

Bulk

  • From sequencing projects

  • Batch submissions (ftp/email)

  • Inaccurate

  • Poorly Characterized

  • Organized by sequence type

EST (349) Expressed Sequence Tag
GSS (120) Genome Survey Sequence
HTG (62) High Throughput Genomic
HTC (6) High Throughput cDNA
STS (5) Sequence Tagged Site


A traditional genbank record

A Traditional GenBank Record

The Flatfile Format

The record has three parts: Header, Feature Table, Sequence.

LOCUS       AY182241                1931 bp    mRNA    linear   PLN 04-MAY-2004
DEFINITION  Malus x domestica (E,E)-alpha-farnesene synthase (AFS1) mRNA,
            complete cds.
ACCESSION   AY182241
VERSION     AY182241.2  GI:32265057
KEYWORDS    .
SOURCE      Malus x domestica (cultivated apple)
  ORGANISM  Malus x domestica
            Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
            Spermatophyta; Magnoliophyta; eudicotyledons; core eudicots;
            rosids; eurosids I; Rosales; Rosaceae; Maloideae; Malus.
REFERENCE   1  (bases 1 to 1931)
  AUTHORS   Pechous,S.W. and Whitaker,B.D.
  TITLE     Cloning and functional expression of an (E,E)-alpha-farnesene
            synthase cDNA from peel tissue of apple fruit
  JOURNAL   Planta 219, 84-94 (2004)
REFERENCE   2  (bases 1 to 1931)
  AUTHORS   Pechous,S.W. and Whitaker,B.D.
  TITLE     Direct Submission
  JOURNAL   Submitted (18-NOV-2002) PSI-Produce Quality and Safety Lab,
            USDA-ARS, 10300 Baltimore Ave. Bldg. 002, Rm. 205, Beltsville, MD
            20705, USA
REFERENCE   3  (bases 1 to 1931)
  AUTHORS   Pechous,S.W. and Whitaker,B.D.
  TITLE     Direct Submission
  JOURNAL   Submitted (25-JUN-2003) PSI-Produce Quality and Safety Lab,
            USDA-ARS, 10300 Baltimore Ave. Bldg. 002, Rm. 205, Beltsville, MD
            20705, USA
  REMARK    Sequence update by submitter
COMMENT     On Jun 26, 2003 this sequence version replaced gi:27804758.
FEATURES             Location/Qualifiers
     source          1..1931
                     /organism="Malus x domestica"
                     /mol_type="mRNA"
                     /cultivar="'Law Rome'"
                     /db_xref="taxon:3750"
                     /tissue_type="peel"
     gene            1..1931
                     /gene="AFS1"
     CDS             54..1784
                     /gene="AFS1"
                     /note="terpene synthase"
                     /codon_start=1
                     /product="(E,E)-alpha-farnesene synthase"
                     /protein_id="AAO22848.2"
                     /db_xref="GI:32265058"
                     /translation="MEFRVHLQADNEQKIFQNQMKPEPEASYLINQRRSANYKPNIWK
                     NDFLDQSLISKYDGDEYRKLSEKLIEEVKIYISAETMDLVAKLELIDSVRKLGLANLF
                     EKEIKEALDSIAAIESDNLGTRDDLYGTALHFKILRQHGYKVSQDIFGRFMDEKGTLE
                     NHHFAHLKGMLELFEASNLGFEGEDILDEAKASLTLALRDSGHICYPDSNLSRDVVHS
                     LELPSHRRVQWFDVKWQINAYEKDICRVNATLLELAKLNFNVVQAQLQKNLREASRWW
                     ANLGIADNLKFARDRLVECFACAVGVAFEPEHSSFRICLTKVINLVLIIDDVYDIYGS
                     EEELKHFTNAVDRWDSRETEQLPECMKMCFQVLYNTTCEIAREIEEENGWNQVLPQLT
                     KVWADFCKALLVEAEWYNKSHIPTLEEYLRNGCISSSVSVLLVHSFFSITHEGTKEMA
                     DFLHKNEDLLYNISLIVRLNNDLGTSAAEQERGDSPSSIVCYMREVNASEETARKNIK
                     GMIDNAWKKVNGKCFTTNQVPFLSSFMNNATNMARVAHSLYKDGDGFGDQEKGPRTHI
                     LSLLFQPLVN"
ORIGIN
        1 ttcttgtatc ccaaacatct cgagcttctt gtacaccaaa ttaggtattc actatggaat
       61 tcagagttca cttgcaagct gataatgagc agaaaatttt tcaaaaccag atgaaacccg
      121 aacctgaagc ctcttacttg attaatcaaa gacggtctgc aaattacaag ccaaatattt
      181 ggaagaacga tttcctagat caatctctta tcagcaaata cgatggagat gagtatcgga
      241 agctgtctga gaagttaata gaagaagtta agatttatat atctgctgaa acaatggatt
     1801 aataaatagc agcaaaagtt tgcggttcag ttcgtcatgg ataaattaat ctttacagtt
     1861 tgtaacgttg ttgccaaaga ttatgaataa aaagttgtag tttgtcgttt aaaaaaaaaa
     1921 aaaaaaaaaa a
//
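Because the flat file is line-oriented, the header fields of the AY182241 record can be pulled out with plain string handling. The sketch below is only an illustration: the function names are hypothetical, it handles just the LOCUS, ACCESSION, and VERSION lines in the layout shown for this record, and it is not a general GenBank parser.

```python
# Three header lines copied from the AY182241 record.
HEADER = """\
LOCUS       AY182241                1931 bp    mRNA    linear   PLN 04-MAY-2004
ACCESSION   AY182241
VERSION     AY182241.2  GI:32265057
"""


def parse_header(text):
    """Extract a few header fields from GenBank flat-file lines."""
    info = {}
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue
        if parts[0] == "LOCUS":
            # LOCUS name, length, 'bp', molecule type, topology, division, date
            info["length"] = int(parts[2])
            info["mol_type"] = parts[4]
            info["division"] = parts[6]
            info["date"] = parts[7]
        elif parts[0] == "ACCESSION":
            info["accession"] = parts[1]
        elif parts[0] == "VERSION":
            # accession.version followed by GI:number
            info["version"] = int(parts[1].split(".")[1])
            info["gi"] = int(parts[2].split(":")[1])
    return info


def span_length(location):
    """Length in bases of a simple 'start..stop' feature location."""
    start, stop = location.split("..")
    return int(stop) - int(start) + 1


info = parse_header(HEADER)
cds_length = span_length("54..1784")
print(info)
print(f"CDS length: {cds_length} nt = {cds_length // 3} codons (including the stop codon)")
```

The CDS span of 1731 nt corresponds to 577 codons, which agrees with the 576-residue translation in the record plus a stop codon.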


An example record m17755

An Example Record – M17755

Indexing for Nucleotide UID 4680720

Field: Indexed Terms

[primary accession]: M17755

[title]: Homo sapiens thyroid peroxidase (TPO) mRNA…

[organism]: Homo sapiens

[sequence length]: 3060

[modification date]: 1999/04/26

[properties]: biomol mrna; gbdiv pri; srcdb genbank


M17755 feature table

M17755: Feature Table

Indexed terms from the feature table:

TPO [gene name]

thyroiditis [text word]

thyroid peroxidase [protein name]

Also marked on the record: CDS position in bp; protein accession


Sequence 99 99 accurate

Sequence: 99.99% Accurate

The sequence itself is not indexed… Use BLAST for that!


Entrez protein

Entrez Protein

  • GenPept (DDBJ, EMBL, GenBank) 4,444,405

  • RefSeq 1,753,167

  • PIR 222,395

  • Swiss Prot 189,005

  • PDB 68,621

  • PRF 12,079

  • Third Party Annotation 4,219

Total 6,693,891


Protein sources and links

Protein Sources and Links

GenPept: linked to mRNA M17755

RefSeq: linked to mRNA NM_000547

PIR: no mRNA!

SWISS-PROT: no mRNA!


Sequence revisions

Sequence Revisions

First seen at NCBI, not first seen at GenBank!

Version and GI change only if the sequence changes

The accession number always retrieves the most recent version


Update without a sequence change

Update without a Sequence Change

June 15, 1989!

GenBank came to NCBI in 1992!


Update with a sequence change

Update with a Sequence Change


Genbank file formats

GenBank File Formats

ASN.1 – The Raw Data

flat file

XML (4 flavors)

FASTA


Ncbi molecular biology resources

NCBI Toolbox

Toolbox Sources

ftp> open ftp.ncbi.nih.gov
.
.
ftp> cd toolbox
ftp> cd ncbi_tools

ftp://ftp.ncbi.nih.gov/toolbox/ncbi_tools

/************************************************************************
*
* asn2ff.c
* convert an ASN.1 entry to flat file format, using the FFPrintArray.
*
**************************************************************************/
#include <accentr.h>
#include "asn2ff.h"
#include "asn2ffp.h"
#include "ffprint.h"
#include <subutil.h>
#include <objall.h>
#include <objcode.h>
#include <lsqfetch.h>
#include <explore.h>
#ifdef ENABLE_ID1
#include <accid1.h>
#endif

FILE *fpl;

Args myargs[] = {
    {"Filename for asn.1 input","stdin",NULL,NULL,TRUE,'a',ARG_FILE_IN,0.0,0,NULL},
    {"Input is a Seq-entry","F", NULL ,NULL ,TRUE,'e',ARG_BOOLEAN,0.0,0,NULL},
    {"Input asnfile in binary mode","F",NULL,NULL,TRUE,'b',ARG_BOOLEAN,0.0,0,NULL},
    {"Output Filename","stdout", NULL,NULL,TRUE,'o',ARG_FILE_OUT,0.0,0,NULL},
    {"Show Sequence?","T", NULL ,NULL ,TRUE,'h',ARG_BOOLEAN,0.0,0,NULL},


Text searches in entrez

Text Searches in Entrez

term1 term2

If no [limit] is specified, each term is tested:

Organism? → [organism]

Journal? → [journal]

User compounds? → search as phrase

Author? → [author]

else → [All Fields]

term1[limit] OP term2[limit] OP …

where

limit = Entrez indexing field (organism, author, …)

OP = AND, OR, NOT
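The `term[limit] OP term[limit]` pattern is easy to assemble programmatically. This is a minimal sketch with hypothetical helper names (`field_term`, `combine`); it only builds the query string and does not contact Entrez.

```python
VALID_OPS = {"AND", "OR", "NOT"}


def field_term(term, limit=None):
    """Attach an Entrez field limit to a term, e.g. human[organism]."""
    return f"{term}[{limit}]" if limit else term


def combine(op, *terms):
    """Join already-formatted terms with a single Boolean operator."""
    if op not in VALID_OPS:
        raise ValueError(f"unsupported operator: {op}")
    return f" {op} ".join(terms)


query = combine(
    "AND",
    field_term("thyroid peroxidase", "title"),
    field_term("human", "orgn"),
)
print(query)
```

A term left without a limit is passed through unchanged, so Entrez would apply its own mapping to it as described on the slide.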


Entrez tabs

Entrez Tabs

Limits: Provides a simple form for applying commonly used Entrez limits

Preview/Index: Allows access to the full indexing of each Entrez database and aids in constructing complex queries

History: Provides access to previous searches in the current Entrez database

Clipboard: A temporary storage area for selected records

Details: Displays the detailed parsing of the current Entrez query, and lists errors and terms without matches


Programming entrez e utilities

Programming Entrez: E-Utilities

http://www.ncbi.nih.gov/entrez/query/static/eutils_help.html

ESearch: Entrez query → UID list or History

ESummary: UID list or History → Document summaries

EFetch: UID list or History → Formatted data

ELink: UID list or History → UID list or History

EPost: UID list → History
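As a rough illustration of how ESearch and EFetch are addressed, the sketch below only builds request URLs and sends nothing over the network. The base URL and the `db`, `term`, `usehistory`, `retmax`, `id`, `rettype`, and `retmode` parameters reflect the public E-utilities interface as generally documented today rather than anything shown on the slides, so treat them as assumptions; the function names are hypothetical.

```python
from urllib.parse import urlencode

# Assumed base URL for the E-utilities; it does not come from the slides.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db, term, use_history=False, retmax=20):
    """Build an ESearch URL: Entrez query in, UID list (or History) out."""
    params = {"db": db, "term": term, "retmax": retmax}
    if use_history:
        params["usehistory"] = "y"
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


def efetch_url(db, uids, rettype="gb", retmode="text"):
    """Build an EFetch URL: UID list in, formatted records out."""
    params = {
        "db": db,
        "id": ",".join(str(uid) for uid in uids),
        "rettype": rettype,
        "retmode": retmode,
    }
    return f"{EUTILS_BASE}/efetch.fcgi?{urlencode(params)}"


search = esearch_url("nucleotide", "M17755[primary accession]", use_history=True)
fetch = efetch_url("nucleotide", [4680720])
print(search)
print(fetch)
```

The UID used here, 4680720, is the Nucleotide UID given for M17755 earlier in the slides.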


Finding primary sequences

Finding Primary Sequences

  • Search Entrez Nucleotide

    • 97.9% GenBank (primary data)

    • 2.1% RefSeq (curated data)

Possible queries we’ve seen so far…

M17755 [primary accession]

TPO [gene name]

thyroid peroxidase [title]

thyroiditis [text word]

Homo sapiens [organism]

thyroid peroxidase [protein name]

3060 [sequence length]

1999/04/26 [modification date]

biomol mrna [properties]

gbdiv pri [properties]

srcdb genbank [properties]


A starting query

A Starting Query

Find nucleotide records for human thyroid peroxidase

Query: human thyroid peroxidase (309 records)

Parsed as: (("Homo sapiens"[Organism] OR human[All Fields]) AND thyroid peroxidase[All Fields])

Field Limit!

Query: human[organism] AND thyroid peroxidase (298 records)

Parsed as: ("Homo sapiens"[Organism] AND thyroid peroxidase[All Fields])

11 records aren’t human sequences!!


Limit by title and database

Limit by Title and Database

  • Entrez Nucleotide

    • GenBank: srcdb ddbj/embl/genbank[properties]

    • RefSeq: srcdb refseq[properties]

#1: thyroid peroxidase AND human[orgn] 298

#2: thyroid peroxidase[title] AND human[orgn] 169

#3: #2 AND srcdb refseq[properties] 5

#4: #2 AND srcdb ddbj/embl/genbank[properties] 164 (primary data)


Limit by genbank division

Limit by Genbank Division

EST Division: gbdiv est[prop]

Primate Division: gbdiv pri[prop]

#1: thyroid peroxidase AND human[orgn] 298

#2: thyroid peroxidase[title] AND human[orgn] 169

#3: #2 AND srcdb refseq[properties] 5

#4: #2 AND srcdb ddbj/embl/genbank[properties] 164

#5: #4 AND gbdiv est[prop] 20

#6: #4 AND gbdiv pri[prop] 144 (traditional GenBank records)


Limit by biomolecule type

Limit by Biomolecule Type

Genomic DNA: biomol genomic[prop]

cDNA: biomol mrna[prop]

#1: thyroid peroxidase AND human[orgn] 298

#2: thyroid peroxidase[title] AND human[orgn] 169

#3: #2 AND srcdb refseq[properties] 5

#4: #2 AND srcdb ddbj/embl/genbank[properties] 164

#5: #4 AND gbdiv est[prop] 20

#6: #4 AND gbdiv pri[prop] 144

#7: #6 AND biomol genomic[prop] 26 (genomic DNA)

#8: #6 AND biomol mrna[prop] 118 (mRNA / cDNA)
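Each numbered search refines an earlier one, so a reference such as #8 stands for a nested query. The sketch below expands those references into one explicit query string and checks that the record counts partition as expected. It follows the numbering in which #5 and #6 refine #4 (the GenBank-only set), as on the GenBank Division slide; the `expand` helper is hypothetical and is not how Entrez itself stores History.

```python
import re

# Search history as listed on the slides.
history = {
    1: "thyroid peroxidase AND human[orgn]",
    2: "thyroid peroxidase[title] AND human[orgn]",
    3: "#2 AND srcdb refseq[properties]",
    4: "#2 AND srcdb ddbj/embl/genbank[properties]",
    5: "#4 AND gbdiv est[prop]",
    6: "#4 AND gbdiv pri[prop]",
    7: "#6 AND biomol genomic[prop]",
    8: "#6 AND biomol mrna[prop]",
}

# Record counts reported for each search.
counts = {1: 298, 2: 169, 3: 5, 4: 164, 5: 20, 6: 144, 7: 26, 8: 118}


def expand(query, history):
    """Recursively replace #n references with the parenthesized earlier query."""
    def substitute(match):
        return "(" + expand(history[int(match.group(1))], history) + ")"
    return re.sub(r"#(\d+)", substitute, query)


full_query = expand(history[8], history)
print(full_query)

# Each refinement splits its parent set into complementary subsets.
print(counts[3] + counts[4] == counts[2])
print(counts[5] + counts[6] == counts[4])
print(counts[7] + counts[8] == counts[6])
```

The fact that the subsets add up to their parent counts suggests that, for this example, every record falls into exactly one of the two categories at each step.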


Limit by protein name

Limit by Protein Name

thyroid peroxidase[protein name] AND human[orgn] AND gbdiv pri[prop] AND biomol mrna[prop]

118 records [title] → 4 records [protein name]


Entrez document summaries

Entrez Document Summaries

Links menu

Click the accession to view the record

Links to other Entrez databases computed for M17755


Entrez links for gi 4680720

Entrez Links for GI 4680720

Gene annotation based on M17755

Full text online articles about M17755

All polymorphisms in the TPO gene

DNA/RNA sequences similar to M17755

Graphical view of TPO gene annotation

Human phenotypes involving TPO

Microarray datasets for M17755

Protein translation of M17755

Literature abstracts about M17755

Sequence polymorphisms in M17755

Source organism of M17755

STS markers in the TPO gene

TPO links beyond NCBI


Viewing m17755

Viewing M17755


Genbank sequences for human tpo

GenBank Sequences for Human TPO

Which one is the best sequence???


Refseq ncbi s derivative sequence database

RefSeq: NCBI’s Derivative Sequence Database

RefSeq Benefits

  • Non-redundant  

  • Explicitly linked nucleotide and protein sequences

  • Updated to reflect current sequence data and biology

  • Validated by hand

  • Format consistency

  • Distinct accession series

  • Stewardship by NCBI staff and collaborators

ftp://ftp.ncbi.nih.gov/refseq/release


Refseq ncbi s derivative sequence database1

RefSeq: NCBI’s Derivative Sequence Database

Nucleotide → Protein

  • Curated transcripts and proteins

    • NM_123456 → NP_123456

    • NR_123456 (non-coding RNA)

  • Model transcripts and proteins

    • XM_123456 → XP_123456

    • XR_123456 (non-coding RNA)

  • Assembled Genomic Regions (contigs)

    • NT_123456 (BAC clones)

    • NW_123456 (WGS)

  • Other Genomic Sequence

    • NG_123456 (complex regions, pseudogenes)

    • NZ_ABCD12345678 (WGS) → ZP_123456

  • Chromosome records in Entrez Genome

    • NC_123456 (chromosome; microbial or organelle genome)
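Since RefSeq uses a distinct accession series, the prefix before the underscore is enough to tell the record types apart. The lookup below is a small sketch built only from the prefixes listed on this slide; the function name and the wording of the labels are illustrative.

```python
# Prefix meanings as listed on the slide.
REFSEQ_PREFIXES = {
    "NM": "curated mRNA",
    "NP": "curated protein",
    "NR": "curated non-coding RNA",
    "XM": "model mRNA",
    "XP": "model protein",
    "XR": "model non-coding RNA",
    "NT": "assembled genomic contig (BAC clones)",
    "NW": "assembled genomic contig (WGS)",
    "NG": "other genomic sequence (complex regions, pseudogenes)",
    "NZ": "other genomic sequence (WGS)",
    "ZP": "protein annotated on an NZ record",
    "NC": "chromosome; microbial or organelle genome",
}


def classify_refseq(accession):
    """Return the RefSeq record type, or None for non-RefSeq accessions."""
    prefix, underscore, _ = accession.partition("_")
    if not underscore:
        return None  # no underscore: not in the RefSeq accession series
    return REFSEQ_PREFIXES.get(prefix.upper())


for acc in ["NM_000547", "XM_496543", "NC_000002", "M17755"]:
    print(acc, "->", classify_refseq(acc))
```

A GenBank accession such as M17755 has no underscore, so it falls through as a non-RefSeq record.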


Creating nm records

Creating NM Records

Genome annotation

Longest mRNA

NMs must have

cDNA support


Nm np records in entrez

NM/NP Records in Entrez

NM_000547: variant 1

COMMENT REVIEWED REFSEQ: This record has been curated by NCBI staff. The reference sequence was derived from M17755.2 and AW874082.1. On Feb 25, 2003 this sequence version replaced gi:21361188.

AW874082.1: EST that completes 3’ end

NM_175719: variant 2

COMMENT REVIEWED REFSEQ: This record has been curated by NCBI staff. The reference sequence was derived from J02970.1, AW874082.1 and M17755.2.

Links: Nucleotide, Protein


Annotating the gene

Annotating the Gene

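[Diagram: genomic DNA (NC, NT, NW) is scanned to produce model mRNA (XM), model RNA (XR), and model protein (XP). The models are compared with GenBank sequences ("= ?"); where they agree ("= !"), the RefSeq records are curated mRNA (NM), curated RNA (NR), and curated protein (NP).]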


Entrez gene and refseq

Entrez Gene and RefSeq

Gene

GenBank

RefSeq

Nucleotide

  • Entrez Gene is the central depository for information about a gene available at NCBI, and often provides links to sites beyond NCBI

  • Entrez Gene includes records for organisms that have NCBI Reference Sequences (RefSeqs)

  • Entrez Gene records contain RefSeq mRNAs, proteins, and genomic DNA (if known) for a gene locus, plus links to other Entrez databases

  • NCBI RefSeqs are based on primary sequence data in GenBank


Entrez gene refseq annotations

Entrez Gene: RefSeq Annotations


Nm np records in entrez gene

NM/NP Records in Entrez Gene


Entrez gene refseq graphics

Entrez Gene RefSeq Graphics

NM

NP


What about loc440844

What about LOC440844?

Entrez Gene


Blast results for xm 496543

BLAST Results for XM_496543

Is there any GenBank support for this mRNA?

srcdb ddbj/embl/genbank[prop] AND biomol mrna[prop]

no full-length hit


The perils of the xm

The Perils of the XM

XM records are models based only on genomic sequence, and are subject to revision or removal with each new build of that genome.

BLAST the XM against the RefSeq database to look for a replacement:

Query= gi|20850420|ref|XM_124429.1| Mus musculus expressed sequence AA553001 (AA553001), mRNA

gi|19527087|ref|NM_133873.1| Mus musculus DNA segment, Chr 4, Wayne State University 114, expressed (D4Wsu114e), mRNA Length=1898

Score = 3701.55 bits (1867), Expect = 0

Identities = 1870/1871 (99%), Gaps = 0/1871 (0%) Strand=Plus/Plus
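When screening many XM records this way, the identity line of each hit can be read mechanically. The sketch below parses the line from the example above with a regular expression; the variable names are illustrative, and the pattern is only assumed to match the report layout shown here.

```python
import re

# Identity line copied from the BLAST hit shown on the slide.
hit_line = "Identities = 1870/1871 (99%), Gaps = 0/1871 (0%) Strand=Plus/Plus"

match = re.search(r"Identities = (\d+)/(\d+)", hit_line)
matched, aligned = int(match.group(1)), int(match.group(2))

mismatches = aligned - matched
percent_identity = 100.0 * matched / aligned

print(f"{matched}/{aligned} identical, {mismatches} mismatch(es)")
print(f"Percent identity: {percent_identity:.2f}%")
```

The exact value is slightly above the 99% printed in the report, which appears to show a whole-number percentage.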


Eukaryotic nm xm records

Eukaryotic NM/XM Records

Bos taurus: 37541

Oryza sativa (japonica cultivar-group): 36836

Danio rerio: 30577

Homo sapiens: 29261

Arabidopsis thaliana: 28953

Mus musculus: 27033

Rattus norvegicus: 23975

Pan troglodytes: 21810

Caenorhabditis elegans: 21124

Drosophila melanogaster: 19412

Aspergillus nidulans FGSC A4: 18951

Gallus gallus: 18120

Canis familiaris: 16891

Anopheles gambiae str. PEST: 15328

Plasmodium chabaudi: 14747

Candida albicans SC5314: 13672

Dictyostelium discoideum: 13570

Ustilago maydis 521: 13044

Plasmodium berghei: 11778

Gibberella zeae PH-1: 11640

Magnaporthe grisea 70-15: 11109

Neurospora crassa: 10079

Aspergillus fumigatus Af293: 9923

Entamoeba histolytica HM-1:IMSS: 9772

Cryptococcus neoformans var. neoformans JEC21: 6594

Giardia lamblia ATCC 50803: 6569

Yarrowia lipolytica CLIB99: 6521

Debaryomyces hansenii CBS767: 6318

Apis mellifera: 6292

Kluyveromyces lactis NRRL Y-1140: 5327

Candida glabrata CBS138: 5181

Schizosaccharomyces pombe 972h-: 5035

Eremothecium gossypii: 4718

Theileria parva: 4079

Xenopus tropicalis: 4069

Cryptosporidium hominis: 3886

Cryptosporidium parvum: 3396

Sus scrofa: 938

Trypanosoma brucei: 599

Ovis aries: 253

Strongylocentrotus purpuratus: 215

Felis catus: 162

Plasmodium yoelii yoelii: 105

Takifugu rubripes: 7

Ciona intestinalis: 3

Trypanosoma cruzi: 3


Genome annotation in entrez nucleotide

Genome Annotation in Entrez Nucleotide

[Diagram labels: GenBank components (clones, WGS); Master; NT/NW contigs; NC genome assembly; NM/XM mRNA. The links between levels are labeled "Components".]


Genome annotation links

Genome Annotation Links

curated mRNA

genomic contig on human chromosome 2 containing NM_000547

human chromosome 2

the 21 contigs of the chromosome 2 assembly


Getting the annotation details

Getting the Annotation Details

Genomic sequence

ACCESSION NC_000002 REGION: 1396242..1525502


Getting the annotation details1

Getting the Annotation Details

ACCESSION NC_000002 REGION: 1396242..1525502

exon-intron structure

These flat files contain all annotations in the gene and the full, explicit sequence
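The ACCESSION line of such a sub-record carries the coordinates of the displayed region on the chromosome. A minimal sketch of reading them back follows; the function name is hypothetical, and the pattern is only assumed to fit the line format shown above.

```python
import re

# ACCESSION line copied from the slide.
accession_line = "ACCESSION NC_000002 REGION: 1396242..1525502"


def parse_region(line):
    """Return (accession, start, stop) from an ACCESSION ... REGION line."""
    match = re.search(r"ACCESSION\s+(\S+)\s+REGION:\s+(\d+)\.\.(\d+)", line)
    if match is None:
        raise ValueError("no REGION found in line")
    return match.group(1), int(match.group(2)), int(match.group(3))


accession, start, stop = parse_region(accession_line)
region_length = stop - start + 1  # coordinates are 1-based and inclusive

print(f"{accession}: {start}..{stop} ({region_length:,} bp)")
```

For this region the span works out to 129,261 bp of chromosome 2.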


Searching entrez gene

Searching Entrez Gene

Gene symbol: human thyroid peroxidase (TPO)

tpo [sym] AND human [organism]

Protein name: topoisomerase genes from Archaea

topoisomerase[gene/protein name] AND archaea [organism]

Chromosome and Links: genes on human chromosome 2 with OMIM links

2 [chromosome] AND gene omim [filter] AND human [organism]

RefSeq status and variants: Reviewed RefSeqs with transcript variants

srcdb refseq reviewed[prop] AND has transcript variants[prop]

Disease and Gene Ontology: Membrane proteins linked to cancer

integral to plasma membrane[gene ontology] AND cancer [dis]


Gene links in entrez

Gene Links in Entrez

Microarray datasets for TPO

Gene homologs for TPO

DNA and RNA sequences for TPO

Phenotypes involving TPO

Protein sequences for TPO

Literature abstracts about TPO

Sequence polymorphisms in TPO

Species whose genome has this TPO gene

STS markers in the TPO gene

ESTs aligned to the TPO gene


Third party annotation tpa database

Third Party Annotation (TPA) Database

NCBI now accepts the submission of new annotations of existing GenBank sequences.

Submissions must be published in a peer-reviewed journal.

Facilitates the annotation of sequences by experts.

Examples of sequences appropriate for TPA are:

  • Annotation of features on gene and/or mRNA sequences

  • Assembled “full length” genes and/or mRNAs

What should not be submitted to TPA?

  • Synthetic constructs (such as cloning vectors) that use well-characterized, publicly available genes, promoters, or terminators

  • Updates or changes to existing sequence data

  • Sequence annotations without experimental evidence


Beyond refseq

Beyond RefSeq

If your organism does not have RefSeqs…

  • UniGene: gene-based clusters of cDNAs and ESTs

  • WGS sequences in Entrez Nucleotide (wgs[prop])

  • Trace Archive


What is unigene

What is UniGene?

A gene-oriented view of sequence entries

  • MegaBlast based automated sequence clustering

  • Now informed by genome hits (New!)

  • Nonredundant set of gene oriented clusters

  • Each cluster a unique gene

  • Information on tissue types and map locations

  • Includes known genes and uncharacterized ESTs

  • Useful for gene discovery and selection of mapping reagents


Organisms in unigene

Organisms in UniGene

Top Ten

1. Human

2. Rice

3. Mouse

4. Cow

5. Wheat

6. Zebrafish

7. Pig

8. Chicken

9. Frog (X. laevis)

10. Frog (X. tropicalis)


Finding unigene clusters

Finding UniGene Clusters

by link

by Entrez search


Unigene cluster for tpo

UniGene Cluster for TPO


Ncbi molecular biology resources

GPL: Platform descriptions. Submitted by manufacturer*

GSM: Raw/processed spot intensities from a single slide/chip. Submitted by experimentalists

GSE: Grouping of slide/chip data, “a single experiment”. Submitted by experimentalists

GDS: Grouping of experiments. Curated by NCBI

Entrez databases: Entrez GEO Datasets, Entrez GEO


Linking to geo

Linking to GEO


Geo datasets

GEO Datasets


Whole genome shotgun projects

Whole Genome Shotgun Projects

Traditional GenBank Divisions

300+ projects

Viruses

Bacteria

Environmental sequences

Archaea

73 Eukaryotes featuring:

Cow, Chicken, Rat, Mouse, Dog, Chimpanzee, Human

Pufferfish (2), Zebrafish

Honeybee, Anopheles, Fruit Flies (4), Silkworm

Nematode (C. briggsae)

Yeasts (9), Aspergillus (3)

Rice


Trace archive

Trace Archive


Short tailed opossum traces

Short-tailed opossum traces


Viewing simple genomes

Viewing Simple Genomes

All are RefSeq NC records in Entrez Genome

  • Full chromosomal sequences are provided

  • Genes are annotated

  • The annotation can be shown graphically and linked to sequence records


Ncbi molecular biology resources

mutL


Viewing complex genomes

Viewing Complex Genomes

NCBI Map Viewer

  • Map Viewer Home Page

    • Shows all supported organisms

    • Provides links to genomic BLAST

  • Genome Overview Page

    • Provides links to individual chromosomes

    • Shows hits on a genome graphically

  • Chromosome Viewing Page

    • Allows interactive views of annotation details

    • Provides numerous maps unique to each genome


Map viewer home page

Map Viewer Home Page


Genome overview page

Genome Overview Page

Search the maps

Genomic BLAST

Species-specific help!


Chromosome viewing page

Chromosome Viewing Page

Map Summary

Add or remove maps

Master Map with exploded content

Genes

UniGene

Contigs

Zooming Controls

Ideogram


Map summary

Map Summary

TPO’s contig!


Map content

Map Content

Map content varies greatly by species!

  • Sequence Maps

    • Core assembly

    • Annotation evidence

    • Clones & Markers

    • Polymorphisms

    • Links & Features

  • Genetic Maps

    • Cytogenetic maps

    • Linkage maps

    • Radiation hybrid maps

Assembly

Contig

Component

Transcript

Gene


View the assembly near tpo

View the Assembly near TPO


Assembly of chr 2

Assembly of Chr. 2

NT_033000

1255072

1563756


Assembly of chromosome 2

Assembly of Chromosome 2


Zooming

Zooming


View of tpo

View of TPO

Links to Entrez Nucleotide

Links to Entrez Gene

Links to Tools and Data

Gap in assembly


Map content1

Map Content

Map content varies greatly by species!

  • Sequence Maps

    • Core assembly

    • Annotation evidence

    • Clones & Markers

    • Polymorphisms

    • Links & Features

  • Genetic Maps

    • Cytogenetic maps

    • Linkage maps

    • Radiation hybrid maps

Ab initio (model)

GenBank DNA

EST

UniGene

Gene


Annotation evidence

Annotation Evidence

GenBank records not used in assembly

UniGene Clusters

Ab initio models

Aligned ESTs


Entrez homologene

Entrez HomoloGene

Homologs by protein BLAST

