Ncbi molecular biology resources
Download
1 / 74

NCBI Molecular Biology Resources - PowerPoint PPT Presentation


  • 172 Views
  • Uploaded on

NCBI Molecular Biology Resources. A Field Guide. Nov. 6, 2001. NCBI Resources. About NCBI NCBI Sequence Databases Primary Database – GenBank Derivative Databases - RefSeq Entrez Databases and Text Searching BLAST Services Genomic Resources.

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

PowerPoint Slideshow about 'NCBI Molecular Biology Resources' - toril


An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
Ncbi molecular biology resources l.jpg

NCBI Molecular Biology Resources

A Field Guide

Nov. 6, 2001


Ncbi resources l.jpg
NCBI Resources

  • About NCBI

  • NCBI Sequence Databases

    • Primary Database – GenBank

    • Derivative Databases - RefSeq

  • Entrez Databases and Text Searching

  • BLAST Services

  • Genomic Resources


The national center for biotechnology information ncbi l.jpg
The National Center for Biotechnology Information (NCBI)

  • Created as a part of the National Library of Medicine in 1988

    • Establish public databases

    • Research in computational biology

    • Develop software tools for sequence analysis

    • Disseminate biomedical information

  • Tools: BLAST(1990), Entrez (1992)

  • GenBank (1992)

  • Free MEDLINE (PubMed, 1997)

  • Other databases: dbEST, dbGSS, dbSTS, MMDB, OMIM, UniGene, GeneMap, Taxonomy, CGAP, SAGE, LocusLink, RefSeq


Molecular databases l.jpg
Molecular Databases

  • Primary Databases

    • Original submissions by experimentalists

    • Database staff organize but don’t add additional information

      • Example:GenBank

  • Derivative Databases

    • Human curated

      • compilation and correction of data

      • Example:SWISS-PROT, NCBI RefSeq mRNA

    • Computationally Derived

      • Example:UniGene

    • Combinations

      • Example:NCBI Genome Assembly


What is genbank ncbi s primary sequence database l.jpg
What is GenBank? NCBI’s Primary Sequence Database

  • Nucleotide only sequence database

  • Archival in nature

  • GenBank Data

    • Direct submissions individual records (BankIt, Sequin)

    • Batch submissions via email (EST, GSS, STS)

    • ftp accounts sequencing centers

  • Data shared nightly among three collaborating databases

    • GenBank

    • DNA Database of Japan (DDBJ).

    • European Molecular Biology Laboratory Database (EMBL) at EBI.


Slide6 l.jpg

Entrez

NIH

NCBI

GenBank

  • Submissions

  • Updates

  • Submissions

  • Updates

EMBL

DDBJ

EBI

CIB

NIG

  • Submissions

  • Updates

SRS

EMBL

getentry


Genbank l.jpg

Release 126 October2001

13,602,262 Records

14,396,883,064 Nucleotides

80,000 + Species

  • full release every two months

  • incremental and cumulative updates daily

  • available only through internet

ftp://ncbi.nlm.nih.gov/genbank/

or

ftp://genbank.sdsc.edu/pub/

GenBank


Genbank on ftp site l.jpg
GenBank on FTP site

ftp> open ftp.ncbi.nlm.nih.gov

.

.

ftp> cd genbank

Release 125: 243 files; 55.23 Gigabytes uncompressed


Genbank divisions l.jpg
GenBank Divisions

Bulk Sequence Divisions

PATPatent

ESTExpressed Sequence Tags (133 files)

STSSequence Tagged Site

GSSGenome Survey Sequence (41 files)

HTGHigh Throughput Genome (25 files)

HTC High Throughput cDNA

CONContig

Traditional Divisions

BCT INV MAM PHG PLN PRI

ROD SYN UNA VRL VRT


Est division e xpressed s equence t ags l.jpg

5’

3’

make cDNA

library

80-100,000 unique

cDNA clones in library

EST Division: Expressed Sequence Tags

>IMAGE:275615 5' mRNA sequence

GACAGCATTCGGGCCGAGATGTCTCGCTCCGTGGCCTTAGCTGTGCTCGCGCTACTCTCTCTTTCTGGCC

TGGAGGTATCCAGCGTACTCCAAAGATTCAGGTTTACTCACGTCATCCAGCAGAGAATGGAAAGTCAAAT

TTCCTGAATTGCTATGTGTCTGGGTTTCATCCATCCGACATTGAAGTTGACTTACTGAAGAATGGAGAGA

GAATTGAAAAAGTGGAGCATTCAGACTTGTCTTTCAGCAAGGACTGGTCTTTCTATCTCTTGTACTACAC

TGAATTCACCCCCACTGAAAAAGATGAGTATGCCTGCCGTGTTGAACCATGTNGACTTTGTCACAGNCCC

AAGTTNAGTTTAAGTGGGNATCGAGACATGTAAGGCAGGCATCATGGGAGGTTTTGAAGNATGCCGCNTT

TTGGATTGGGATGAATTCCAAATTTCTGGTTTGCTTGNTTTTTTAATATTGGATATGCTTTTG

nucleus

30,000

genes

gatccantgccatacg

ctcgccaattcnntcg

>IMAGE:275615 3', mRNA sequence

NNTCAAGTTTTATGATTTATTTAACTTGTGGAACAAAAATAAACCAGATTAACCACAACCATGCCTTACT

TTATCAAATGTATAAGANGTAAATATGAATCTTATATGACAAAATGTTTCATTCATTATAACAAATTTCC

AATAATCCTGTCAATNATATTTCTAAATTTTCCCCCAAATTCTAAGCAGAGTATGTAAATTGGAAGTTAA

CTTATGCACGCTTAACTATCTTAACAAGCTTTGAGTGCAAGAGATTGANGAGTTCAAATCTGACCAAGAT

GTTGATGTTGGATAAGAGAATTCTCTGCTCCCCACCTCTANGTTGCCAGCCCTC

  • - isolate unique clones

  • sequence once

  • from each end

80-100,000 RNA

gene products


Sts division s equence t agged s ites l.jpg
STS Division : Sequence Tagged Sites

  • Segment of gene, EST , mRNA or genomic DNA of known position (microsatellite)

  • PCR with STS primers gives unique product (one per genome)

  • Basis of Radiation Hybrid Mapping

    • UniGene

    • Genome Assembly

  • Related resource: Electronic PCR

    http://www.ncbi.nlm.nih.gov/genome/sts/epcr.cgi


Rh mapping using stss l.jpg

A

A

B

B

C

C

D

D

D

Human Chromosome

A

B

Hybrid Cells

PCR Results

A

B

C

D

+

+

-

+

-

-

+

+

+

+

-

-

RH mapping using STSs


Epcr results hexokinase 1 est l.jpg
ePCR Results Hexokinase 1 EST

SHGC-35892

dbSTS id: 44155, GenBank Accession: G29974

Organism: Homo sapiens

Primer1: CATACGACACGGCTCACAAA

Primer2: CTGTTTGTCTCGTGGGGG

STS location: 30..160 Chromosome: 10

Expected amplicon size: 129, Observed amplicon size: 130

Primers match in forward orientation

Query sequence:

1 TTTTTGAATT GGTACAAAGT TTACTAGGTC ATACGACACG GCTCACAAAG CGGTGGGAAA

61 TTCCAGTGAT GGCATTGTTT GTTGGTTGGT TCCTTTTATC CAAATGGAGA CAAGACACAT

121 TTCCGCAGAC GTGTCCACCT CCCCCCACGA GACAAACAGA ATGCAAGACT GTCACACGCG

181 GCTAGGACTG GTTCCACGGA CACACGATTT TGTGGCATTG ACACACCACG ATGCGATGCC

241 AGGCCACAGT GGGTGCCAGG AGGGGAGGAA GCAGCTAATG CTATGCCCAC ACTCGCCTTC

301 AGCATGTGCC CCGGGAGGAG GCCCGGCAGT GTCTGCTGGT GATAATACAT TTCACACGGG

361 GAGGGGGAAC CAAGGATGAG CTTTGGAGGC CAGAAGGCTG TCAGGTGGTG TG


Genome sequencing l.jpg

Whole BAC insert (or genome)

sonication

sequencing

cloning isolating

GSS division

assembly

Draft Sequence (HTG division)

Genome Sequencing


Gss division g enome s urvey s equences l.jpg

Genomic Clone (BAC)

GSS Division: Genome Survey Sequences

  • Genomic equivalent of ESTs

  • BAC and other first pass surveys

  • BAC end sequences

  • Whole Genome Shotgun (some)

  • RAPIDS and other anonymous loci

SP6 end

T7 end


Htg division h igh t hroughput g enome records l.jpg

phase 1

HTG

Acc = AC008701 gi = 6601005

phase 2

HTG

Acc = AC008701 gi = 6671909

PRI

phase 3

Acc = AC008701 gi = 7328720

HTG Division: High Throughput Genome Records

40,000 to > 350,000 bp



A simple genbank record l.jpg
A Simple GenBank Record

LOCUS AF062069 3808 bp mRNA INV 02-MAR-2000

DEFINITION Limulus polyphemus myosin III mRNA, complete cds.

ACCESSION AF062069

VERSION AF062069.2 GI:7144484

KEYWORDS .

SOURCE Atlantic horseshoe crab.

ORGANISM Limulus polyphemus

Eukaryota; Metazoa; Arthropoda; Chelicerata; Merostomata;

Xiphosura; Limulidae; Limulus.

REFERENCE 1 (bases 1 to 3808)

AUTHORS Battelle,B.-A., Andrews,A.W., Calman,B.G., Sellers,J.R.,

Greenberg,R.M. and Smith,W.C.

TITLE A myosin III from Limulus eyes is a clock-regulated phosphoprotein

JOURNAL J. Neurosci. (1998) In press

REFERENCE 2 (bases 1 to 3808)

AUTHORS Battelle,B.-A., Andrews,A.W., Calman,B.G., Sellers,J.R.,

Greenberg,R.M. and Smith,W.C.

TITLE Direct Submission

JOURNAL Submitted (29-APR-1998) Whitney Laboratory, University of Florida,

9505 Ocean Shore Blvd., St. Augustine, FL 32086, USA

REFERENCE 3 (bases 1 to 3808)

AUTHORS Battelle,B.-A., Andrews,A.W., Calman,B.G., Sellers,J.R.,

Greenberg,R.M. and Smith,W.C.

TITLE Direct Submission

JOURNAL Submitted (02-MAR-2000) Whitney Laboratory, University of Florida,

9505 Ocean Shore Blvd., St. Augustine, FL 32086, USA

REMARK Sequence update by submitter

COMMENT On Mar 2, 2000 this sequence version replaced gi:3132700.


Genbank record cont l.jpg
GenBank Record, cont.

FEATURES Location/Qualifiers

source 1..3808

/organism="Limulus polyphemus"

/db_xref="taxon:6850"

/tissue_type="lateral eye"

CDS 258..3302

/note="N-terminal protein kinase domain; C-terminal myosin

heavy chain head; substrate for PKA"

/codon_start=1

/product="myosin III"

/protein_id="AAC16332.2"

/db_xref="GI:7144485"

/translation="MEYKCISEHLPFETLPDPGDRFEVQELVGTGTYATVYSAIDKQA NKKVALKIIGHIAENLLDIETEYRIYKAVNGIQFFPEFRGAFFKRGERESDNEVWLGI EFLEEGTAADLLATHRRFGIHLKEDLIALIIKEVVRAVQYLHENSIIHRDIRAANIMF SKEGYVKLIDFGLSASVKNTNGKAQSSVGSPYWMAPEVISCDCLQEPYNYTCDVWSIG ITAIELADTVPSLSDIHALRAMFRINRNPPPSVKRETRWSETLKDFISECLVKNPEYR PCIQEIPQHPFLAQVEGKEDQLRSELVDILKKNPGEKLRNKPYNVTFKNGHLKTISGQ

BASE COUNT 1201 a 689 c 782 g 1136 t

ORIGIN

1 tcgacatctg tggtcgcttt ttttagtaat aaaaaattgt attatgacgt cctatctgtt

3781 aagatacagt aactagggaa aaaaaaaa

//


Sequence and database identifiers l.jpg

Sequence and Database Identifiers

Sequence

length

Modification Date

mol-type

mRNA (= cDNA)

rRNA

snRNA

DNA

Locus Name

GB Division

LOCUS AF062069 3808 bp mRNA INV 02-MAR-2000

Locus, accession, gi, version

DEFINITION Limulus polyphemus myosin III mRNA, complete cds.

ACCESSION AF062069

Accession Number

VERSION AF062069.2 GI:7144484

DEF line (Title)

Accession.version

gi number


Keywords source organism l.jpg
Keywords, Source-organism

  • Legacy field

  • exception

  • EST

  • GSS

  • HTG

Accepted common name

KEYWORDS .

SOURCE Atlantic horseshoe crab.

ORGANISM Limulus polyphemus

Eukaryota; Metazoa; Arthropoda; Chelicerata; Merostomata;

Xiphosura; Limulidae; Limulus.

Scientific name

Taxonomic lineage according to GenBank


Citation l.jpg
Citation

REFERENCE 1 (bases 1 to 3808)

AUTHORS Battelle,B.-A., Andrews,A.W., Calman,B.G., Sellers,J.R.,

Greenberg,R.M. and Smith,W.C.

TITLE A myosin III from Limulus eyes is a clock-regulated phosphoprotein

JOURNAL J. Neurosci. (1998) In press

REFERENCE 2 (bases 1 to 3808)

AUTHORS Battelle,B.-A., Andrews,A.W., Calman,B.G., Sellers,J.R.,

Greenberg,R.M. and Smith,W.C.

TITLE Direct Submission

JOURNAL Submitted (29-APR-1998) Whitney Laboratory, University of Florida,

9505 Ocean Shore Blvd., St. Augustine, FL 32086, USA

REFERENCE 3 (bases 1 to 3808)

AUTHORS Battelle,B.-A., Andrews,A.W., Calman,B.G., Sellers,J.R.,

Greenberg,R.M. and Smith,W.C.

TITLE Direct Submission

JOURNAL Submitted (02-MAR-2000) Whitney Laboratory, University of Florida,

9505 Ocean Shore Blvd., St. Augustine, FL 32086, USA

REMARK Sequence update by submitter

COMMENT On Mar 2, 2000 this sequence version replaced gi:3132700.

Article

Submitter Block

Update history

Previous version


Feature table l.jpg
Feature Table

FEATURES Location/Qualifiers

source 1..3808

/organism="Limulus polyphemus"

/db_xref="taxon:6850"

/tissue_type="lateral eye"

CDS 258..3302

/note="N-terminal protein kinase domain;

C-terminal myosin heavy chain head; substrate for PKA"

/codon_start=1

/product="myosin III"

/protein_id="AAC16332.2"

/db_xref="GI:7144485"

/translation="MEYKCISEHLPFETLPDPGDRFEVQELVGTGTYATVYSAIDK

NKKVALKIIGHIAENLLDIETEYRIYKAVNGIQFFPEFRGAFFKRGERESDNEVWL

"

Biosource

Coding

Sequence

Reading Frame

GenPept Protein Identifiers


Sequence l.jpg
Sequence

Indicates beginning of sequence data

BASE COUNT 1201 a 689 c 782 g 1136 t

ORIGIN

1 tcgacatctg tggtcgcttt ttttagtaat aaaaaattgt attatgacgt cctatctgtt

<sequence omitted>

3721 accaatgtta taatatgaaa tgaaataaag cagtcatggt agcagtggct gtttgaaata

3781 aagatacagt aactagggaa aaaaaaaa

//

End of record


Ncbi derivative sequence databases refseq l.jpg
NCBI Derivative Sequence Databases: RefSeq

NCBI Reference Sequences

mRNAs and Proteins

NM_123456 Curated mRNA

NP_123456 Curated Protein

XM_123456 Predicted Transcript

XP_123456 Predicted Protein

Gene Records

NG_123456 Reference Genomic Sequence

Assemblies

NT_123456 Contig (Mouse and Human Genomes)

NC_123455 Chromosome (Microbial Genomes)


Curated refseq records nm np l.jpg

REFSEQ: This reference sequence was derived from M28668.1,

M55131.1.

On Feb 17, 2000 this sequence version replaced gi:4502784.

Summary: Cystic fibrosis transmembrane conductance regulator is

member 7 of the ATP-binding cassete sub-family C. The protein

functions as a chloride channel and controls the regulation of

other transport pathways. Mutations in this gene cause the

autosomal recessive disorder, cystic fibrosis (CF) and congenital

bilateral aplasia of the vas deferens (CBAVD). Alternative splice

variants have been described, many of which result from mutations

in the CFTR gene.

COMPLETENESS: full length.

Reviewed

Curated RefSeq Records: NM_, NP_

LOCUS NM_000492 6159 bp mRNA PRI 26-JUL-1999

DEFINITION Homo sapiens cystic fibrosis transmembrane conductance

regulator(CFTR) mRNA.

ACCESSION NM_000492

RefSeq Nucleotide

LOCUS NP_000483 1480 aa PRI 26-JUL-1999

DEFINITION cystic fibrosis transmembrane conductance regulator.

ACCESSION NP_000483

PID g4502785

VERSION NP_000483.1 GI:4502785

DBSOURCE REFSEQ: accession NM_000492.1

RefSeq Protein

COMMENT REFSEQ: This reference sequence was derived from M55131.

PROVISIONAL RefSeq: This is a provisional reference sequence

record that has not yet been subject to human review. The final

curated reference sequence record may be somewhat different from

this one.


Alignment generated transcripts xm xp l.jpg
Alignment Generated Transcripts: XM_, XP_

LOCUS XM_004980 6128 bp mRNA PRI 16-NOV-2000

DEFINITION Homo sapiens cystic fibrosis transmembrane conductance regulator,

ATP-binding cassette (sub-family C, member 7) (CFTR), mRNA.

ACCESSION XM_004980

VERSION XM_004980.3 GI:13631444

mismatch


Refseq human contig nt l.jpg
RefSeq Human Contig: NT_

mRNA complement(join(1255889..1257642,1258986..1259091,

1259690..1259862,1271619..1271708,1281957..1282112,

1296780..1297028,1309837..1309937,1312742..1312969,

1313881..1314031,1317797..1317876,1320768..1321018,

1321687..1321724,1329492..1329620,1331893..1332616,

1334111..1334197,1336717..1336811,1364895..1365086,

1375727..1375909,1382442..1382534,1384204..1384450,

1387877..1388002,1389139..1389302,1390185..1390274,

1393436..1393651,1415408..1415516,1420187..1420297,

1444403..1444587))

/partial

/gene="CFTR"

/product="cystic fibrosis transmembrane conductance

regulator, ATP-binding cassette (sub-family C, member 7)"

/transcript_id="XM_004980.1"

/db_xref="LocusID:1080"

/db_xref="MIM:602421"

/note="derived by automated computational analysis using

gene prediction method: Acembly. Supporting evidence

includes similarity to: 9 proteins, 1 mRNAs See details in

AceView"

gene complement(1255889..1444587)

/gene="CFTR"

/note="CF; MRP7; ABC35; ABCC7"

/db_xref="LocusID:1080"

LOCUS NT_007935 1888399 bp DNA CON 16-NOV-2000

DEFINITION Homo sapiens chromosome 7 working draft sequence segment,

complete sequence.

ACCESSION NT_007935

VERSION NT_007935.1 GI:11422165

KEYWORDS HTG.

SOURCE human.

ORGANISM Homo sapiens

Eukaryota; Metazoa; Chordata; Craniata; Vertebrata;

Euteleostomi;Mammalia; Eutheria; Primates; Catarrhini;

Hominidae; Homo.

REFERENCE 1 (bases 1 to 1888399)

AUTHORS International Human Genome Project collaborators.

TITLE Toward the complete sequence of the human genome

JOURNAL Unpublished

COMMENT GENOME ANNOTATION REFSEQ: NCBI contigs are derived from

assembled genomic sequence data. They may include both

draft and finished sequence.

COMPLETENESS: not full length.

CONTIG join(AC073042.3:1155..2680,gap(100),AC074390.2:119526..151445,

gap(100),AC074390.2:1..5245,gap(100),

complement(AC074390.2:17705..23645),gap(100),

AC074390.2:97658..119425,AC073042.3:106479..121155,

AC074390.2:164226..165036,AC073042.3:70628..79503,gap(100),

AC073042.3:4627..6382,gap(100),AC073042.3:2781..4526,gap(100),

complement(AC073042.3:183627..209083),gap(100),

AC073042.3:79604..88622,gap(100),AC073042.3:139234..160437,

gap(100),complement(AC073042.3:6483..8319),gap(100),

complement(AC073042.3:39354..45372),gap(100),

complement(AC073042.3:21461..24064),gap(100),

AC074390.2:156347..160294,gap(100),

complement(AC074390.2:5346..10750),gap(100),

complement(AC074390.2:153911..156246),gap(100),

complement(AC074390.2:23746..32402),gap(100),

complement(AC074390.2:151546..153810),gap(100),

complement(AC074390.2:57277..75275),gap(100),

complement(AC074390.2:75376..97557),gap(100),

Reordering draft sequence


Map view of refseqs l.jpg
Map View of RefSeqs

NT_

XM_

NM_



Refseq chromosomes nc l.jpg
RefSeq Chromosomes:NC_

LOCUS NC_002695 5498450 bp DNA circular BCT 02-OCT-2001

DEFINITION Escherichia coli O157:H7, complete genome.

ACCESSION NC_002695

VERSION NC_002695.1 GI:15829254

KEYWORDS .

SOURCE Escherichia coli O157:H7.

ORGANISM Escherichia coli O157:H7

Bacteria; Proteobacteria; gamma subdivision; Enterobacteriaceae;

Escherichia.

REFERENCE 1 (sites)

AUTHORS Makino,K., Yokoyama,K., Kubota,Y., Yutsudo,C.H., Kimura,S.,

Kurokawa,K., Ishii,K., Hattori,M., Tatsuno,I., Abe,H., Iida,T.,

Yamamoto,K., Ohnishi,M., Hayashi,T., Yasunaga,T., Honda,T.,

Sasakawa,C. and Shinagawa,H.

TITLE Complete nucleotide sequence of the prophage VT2-Sakai carrying the

verotoxin 2 genes of the enterohemorrhagic Escherichia coli O157:H7

derived from the Sakai outbreak

JOURNAL Genes Genet. Syst. 74 (5), 227-239 (1999)

MEDLINE 20198780

PUBMED 10734605


Other ncbi derivative databases l.jpg
Other NCBI Derivative Databases

UniGene - gene oriented expressed sequence

clusters

LocusLink - central resource and interface for

known genes


Ncbi homepage l.jpg
NCBI Homepage


Ncbi homepage35 l.jpg

Mendelian Inheritance in Man

Entrez

Similarity

Searching

NCBI Homepage


Using entrez l.jpg

Using Entrez

An integrated database search and retrieval system


Entrez neighboring and hard links l.jpg

PubMed abstracts

Taxonomy

Genomes

Nucleotide sequences

Entrez: Neighboring and Hard Links

Word weight

3-D Structure

3 -D Structure

VAST

Phylogeny

(MMDB)

Protein sequences

BLAST

BLAST


Www entrez l.jpg
WWW Entrez

  • All of MEDLINE plus others

  • Abstracts

  • Links to online Journals

GenBank, EMBL, DDBJ

RefSeq, PDB

GenBank, DDBJ, EMBL translations

PDB, PIR, SWISS-PROT, PRF, RefSeq

Reference Genomes:

Graphical views, assembled sequence

and mapping data

NCBI’s MMDB - derived from PDB


Database searching with entrez l.jpg

Database Searching with Entrez

Using limits and field restriction to find mouse GAPD

Linking and neighboring with mouse GAPD


Entrez nucleotides l.jpg

Mouse

Entrez Nucleotides


Document summaries mouse all fields l.jpg
Document Summaries: Mouse[All Fields]

3 million records

Chicken not mouse !?


Entrez nucleotides limits preview index l.jpg

Mouse

Entrez Nucleotides: Limits: Preview/Index


Entrez nucleotides limits l.jpg

Accession

All Fields

Author Name

EC/RN Number

Feature key

Filter

Gene Name

Issue

Journal Name

Keyword

Modification Date

Organism

Page Number

Primary Accession

Properties

Protein Name

Publication Date

SeqID String

Sequence Length

Substance Name

Text Word

Title Word

Uid

Volume

Field Restriction

Entrez Nucleotides: Limits

Mouse

Exclude unwanted categories of sequences

Gene Location

Genomic DNA/RNA

Mitochondrion

Chloroplast

Molecule

Genomic DNA/RNA

mRNA

rRNA

Only From

RefSeq

GenBank

EMBL

DDBJ



Document summaries mouse organism l.jpg

2,976,070[All Fields]

-2,921,009[Organism]

55,061

Document Summaries: Mouse[Organism]



Adding terms preview index l.jpg

glyceraldehyde 3 phosphate dehydrogenase

Adding Terms: Preview/Index

Accession

All Fields

Author Name

EC/RN Number

Feature key

Filter

Gene Name

Issue

Journal Name

Keyword

Modification Date

Organism

Page Number

Primary Accession

Properties

Protein Name

Publication Date

SeqID String

Sequence Length

Substance Name

Text Word

Title Word

Uid

Volume

Search History



Displaying mouse gapd records l.jpg
Displaying Mouse GAPD Records

Summary

Brief

GenBank

ASN.1

FASTA

GI list

LinkOut

PubMed Links

Protein Links

Nucleotide Neighbors

PopSet Links

Structure Links

Genome Links

Taxonomy Links

OMIM Links

Formats

Links and neighbors (related records)



Fasta format l.jpg
FASTA Format

>gi|193425|gb|M60978.1|MUSGAPDS Mus musculus testis-specific isoform of glycerald

GGCAGCCAGGCCATGAGATCTTAGGCCATGTCGAGACGTGACGTGGTCCTTACCAATGTTACTGTTGTCC

AGCTACGGCGGGACCGATGCCCATGCCCATGCCCATGCCCATGTCCATGCCCATGCCCTGTGATCAGACC

ACCTCCACCCAAGCTTGAGGATCCACCACCCACGGTTGAAGAACAGCCACCGCCACCGCCGCCGCCACCT

CCACCTCCACCACCACCTCCTCCTCCTCCTCCACCCCAGATAGAGCCAGACAAGTTTGAAGAGGCTCCCC

CTCCCCCTCCCCCTCCTCCTCCTCCTCCCCCTCCCCCTCCTCCACCACTCCAAAAGCCAGCTAGAGAGCT

GACAGTGGGTATCAATGGATTTGGACGCATTGGTCGTCTGGTGCTGCGAGTCTGCATGGAGAAGGGCATT

AGGGTGGTAGCAGTGAATGACCCATTCATTGATCCAGAATACATGGTTTACATGTTCAAATATGACTCCA

CACATGGTAGATACAAAGGAAACGTGGAACATAAGAATGGACAACTAGTTGTGGACAACCTTGAGATCAA

CACGTACCAGTGCAAAGACCCTAAAGAAATCCCCTGGAGCTCTATAGGGAATCCCTACGTGGTGGAGTGT

ACAGGCGTCTATCTGTCCATCGAGGCAGCTTCGGCACATATTTCATCTGGTGCCAGGCGTGTGGTGGTCA

CTGCACCCTCCCCCGATGCACCCATGTTTGTCATGGGAGTGAACGAGAAGGACTATAACCCTGGCTCTAT

GACCATTGTCAGCAATGCATCCTGTACCACCAACTGCCTGGCTCCTCTCGCCAAGGTTATTCATGAAAAC

TTCGGGATCGTGGAAGGGCTAATGACCACAGTCCATTCCTACACAGCCACTCAGAAGACAGTGGATGGGC

CATCAAAGAAGGACTGGCGAGGTGGCCGCGGCGCTCACCAAAACATCATCCCATCGTCCACTGGGGCTGC

CAAGGCTGTAGGCAAAGTCATCCCAGAGCTCAAAGGGAAGCTAACAGGAATGGCATTCCGGGTGCCAACC

CCAAACGTGTCAGTTGTGGACCTGACCTGCCGCCTGGCCAAGCCTGCTTCTTACTCGGCTATCACGGAGG

CTGTGAAAGCTGCAGCCAAGGGACCTTTGGCTGGCATCCTTGCTTACACAGAGGACCAGGTGGTCTCCAC

GGACTTTAACGGCAATCCCCATTCTTCCATCTTTGATGCTAAGGCTGGAATTGCCCTCAATGACAACTTC

GTGAAGCTTGTTGCCTGGTACGACAACGAATATGGCTACAGTAACCGAGTGGTCGACCTCCTCCGCTACA

TGTTTAGCCGAGAGAAGTAACACAAAAGGCCCCTCCTTGCTCCCCTGCGCACCTCGCGTTCCTGACTTCG

GCTTCCACTCAAAGGCGCCGCCACCGGGTCAACAATGAAATAAAAACGAGAATGCGC

FASTA Definition Line

>gi|193425|gb|M60978.1|MUSGAPDS

>

gi number

Locus Name

Database Identifiers

gb GenBank

emb EMBL

dbj DDBJ

sp SWISS-PROT

pdb Protein Databank

pir PIR

prf PRF

ref RefSeq

Accession number


Abstract syntax notation asn 1 l.jpg

GenPept

GenBank

ASN.1

FASTA

Nucleotide

FASTA

Protein

Abstract Syntax Notation: ASN.1

Seq-entry ::= set {

level 1 ,

class nuc-prot ,

descr {

title "Mus musculus testis-specific isoform of glyceraldehyde 3-phosphate

dehydrogenase (Gapd-S) mRNA, and translated products" ,

update-date

std {

year 1994 ,

month 11 ,

day 9 } ,

source {

org {

taxname "Mus musculus" ,

common "house mouse" ,

db {

{

db "taxon" ,

tag

id 10090 } } ,


Slide53 l.jpg

Toolbox Sources

ftp> open ncbi.nlm.nih.gov

.

.

ftp> cd toolbox

ftp> cd ncbi_tools

ftp://ncbi.nlm.nih.gov/toolbox/ncbi_tools

NCBI Toolbox

/*****************************************************************************

*

* asn2ff.c

* convert an ASN.1 entry to flat file format, using the FFPrintArrayPtrs.

*

*****************************************************************************/

#include <accentr.h>

#include "asn2ff.h"

#include "asn2ffp.h"

#include "ffprint.h"

#include <subutil.h>

#include <objall.h>

#include <objcode.h>

#include <lsqfetch.h>

#include <explore.h>

#ifdef ENABLE_ID1

#include <accid1.h>

#endif

FILE *fpl;

Args myargs[] = {

{"Filename for asn.1 input","stdin",NULL,NULL,TRUE,'a',ARG_FILE_IN,0.0,0,NULL},

{"Input is a Seq-entry","F", NULL ,NULL ,TRUE,'e',ARG_BOOLEAN,0.0,0,NULL},

{"Input asnfile in binary mode","F",NULL,NULL,TRUE,'b',ARG_BOOLEAN,0.0,0,NULL},

{"Output Filename","stdout", NULL,NULL,TRUE,'o',ARG_FILE_OUT,0.0,0,NULL},

{"Show Sequence?","T", NULL ,NULL ,TRUE,'h',ARG_BOOLEAN,0.0,0,NULL},


Protein neighbors structure links l.jpg

Related Proteins

Cn3D GAPD Structure

Structure Links

Protein Neighbors-Structure Links






Entrez structures l.jpg

Entrez Structures

Molecular Modeling Database (MMDB) and Cn3D


Mmdb m olecular m odeling d ata b ase l.jpg
MMDB: Molecular Modeling Data Base

  • Derived from experimentally determined PDB records

  • Value added to PDB records including:

    • Addition of explicit chemical graph information

    • Validation

    • Inclusion of Taxonomy, Citation, and other information

    • Conversion to parseable ASN.1 data description language

  • Structure neighbors determined by

    Vector Alignment Search Tool (VAST)


Searching mmdb l.jpg

1CET

Searching MMDB


Structure summary l.jpg
Structure Summary

BLAST neighbors

VAST neighbors

Cn3D viewer




Structural alignments l.jpg
Structural Alignments

Chloroquine

NADH


Why do we need similarity searching l.jpg
Why do we need similarity searching?

  • Identification and annotation

    • Incomplete or no annotations (GenBank)

    • Incorrectly annotated sequences

  • Evolutionary relationships

    homologous molecules may

    have similar functions

but it ain’t necessarily so!


B asic l ocal a lignment s earch t ool l.jpg
Basic Local Alignment Search Tool

  • Widely used similarity search tool

  • Heuristic approach based on Smith Waterman algorithm

  • Finds best local alignments

  • Provides statistical significance

  • All combinations (DNA/Protein) query and database.

    • DNA vs DNA

    • DNA translation vs Protein

    • Protein vs Protein

    • Protein vs DNA translation

    • DNA translation vs DNA translation

  • www, email server, standalone, and networkclients


Local alignment statistics l.jpg
Local Alignment Statistics

High scores of local alignments between two random sequences

follow Extreme Value Distribution

For ungapped alignments:

Expected number with score S or greater

E = Kmne-S

or

E = mn2-S’

K = scale for search space

 = scale for scoring system

S’= bitscore = (S - lnK)/ln2

http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html


Scoring systems l.jpg
Scoring Systems

  • Nucleic acids identity matrix

  • Proteins

    • Position Independent Matrices

      • PAM Matrices(Percent Accepted Mutation)

        • Implicit model of evolution

        • Higher PAM number all calculated from PAM1

        • PAM250 widely used

      • BLOSUM Matrices (BLOck SUbstition Matrices)

        • Empirically determined from alignment

        • of conserved blocks

        • Each includes information up to a certain level of identity

        • BLOSUM62 widely used

    • Position Specific Score Matrices (PSSM)

      • PSI and RPS BLAST


Blosum62 l.jpg

Common amino acids have low weights

Rare amino acids have high weights

Negative for less likely substitutions

Positive for more likely substitutions

BLOSUM62

A 4

R -1 5

N -2 0 6

D -2 -2 1 6

C 0 -3 -3 -3 9

Q -1 1 0 0 -3 5

E -1 0 0 2 -4 2 5

G 0 -2 0 -1 -3 -2 -2 6

H -2 0 1 -1 -3 0 0 -2 8

I -1 -3 -3 -3 -1 -3 -3 -4 -3 4

L -1 -2 -3 -4 -1 -2 -3 -4 -3 2 4

K -1 2 0 -1 -3 1 1 -2 -1 -3 -2 5

M -1 -1 -2 -3 -1 0 -2 -3 -2 1 2 -1 5

F -2 -3 -3 -3 -2 -3 -3 -3 -1 0 0 -3 0 6

P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4 7

S 1 -1 1 0 -1 0 0 0 -1 -2 -2 0 -1 -2 -1 4

T 0 -1 0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1 1 5

W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1 1 -4 -3 -2 11

Y -2 -2 -2 -3 -2 -1 -2 -3 2 -1 -1 -2 -1 3 -3 -2 -2 2 7

V 0 -3 -3 -3 -1 -2 -2 -3 -3 3 1 -2 1 -1 -2 -2 0 -3 -1 4

X 0 -1 -1 -1 -2 -1 -1 -1 -1 -1 -1 -1 -1 -1 -2 0 0 -2 -1 -1 -1

A R N D C Q E G H I L K M F P S T W Y V X


Position specific substitution rates l.jpg
Position Specific Substitution Rates

Typical serine

Active site serine


Position specific score matrix pssm l.jpg
Position Specific Score Matrix (PSSM)

A R N D C Q E G H I L K M F P S T W Y V

206 D 0 -2 0 2 -4 2 4 -4 -3 -5 -4 0 -2 -6 1 0 -1 -6 -4 -1

207 G -2 -1 0 -2 -4 -3 -3 6 -4 -5 -5 0 -2 -3 -2 -2 -1 0 -6 -5

208 V -1 1 -3 -3 -5 -1 -2 6 -1 -4 -5 1 -5 -6 -4 0 -2 -6 -4 -2

209 I -3 3 -3 -4 -6 0 -1 -4 -1 2 -4 6 -2 -5 -5 -3 0 -1 -4 0

210 S -2 -5 0 8 -5 -3 -2 -1 -4 -7 -6 -4 -6 -7 -5 1 -3 -7 -5 -6

211 S 4 -4 -4 -4 -4 -1 -4 -2 -3 -3 -5 -4 -4 -5 -1 4 3 -6 -5 -3

212 C -4 -7 -6 -7 12 -7 -7 -5 -6 -5 -5 -7 -5 0 -7 -4 -4 -5 0 -4

213 N -2 0 2 -1 -6 7 0 -2 0 -6 -4 2 0 -2 -5 -1 -3 -3 -4 -3

214 G -2 -3 -3 -4 -4 -4 -5 7 -4 -7 -7 -5 -4 -4 -6 -3 -5 -6 -6 -6

215 D -5 -5 -2 9 -7 -4 -1 -5 -5 -7 -7 -4 -7 -7 -5 -4 -4 -8 -7 -7

216 S -2 -4 -2 -4 -4 -3 -3 -3 -4 -6 -6 -3 -5 -6 -4 7 -2 -6 -5 -5

217 G -3 -6 -4 -5 -6 -5 -6 8 -6 -8 -7 -5 -6 -7 -6 -4 -5 -6 -7 -7

218 G -3 -6 -4 -5 -6 -5 -6 8 -6 -7 -7 -5 -6 -7 -6 -2 -4 -6 -7 -7

219 P -2 -6 -6 -5 -6 -5 -5 -6 -6 -6 -7 -4 -6 -7 9 -4 -4 -7 -7 -6

220 L -4 -6 -7 -7 -5 -5 -6 -7 0 -1 6 -6 1 0 -6 -6 -5 -5 -4 0

221 N -1 -6 0 -6 -4 -4 -6 -6 -1 3 0 -5 4 -3 -6 -2 -1 -6 -1 6

222 C 0 -4 -5 -5 10 -2 -5 -5 1 -1 -1 -5 0 -1 -4 -1 0 -5 0 0

223 Q 0 1 4 2 -5 2 0 0 0 -4 -2 1 0 0 0 -1 -1 -3 -3 -4

224 A -1 -1 1 3 -4 -1 1 4 -3 -4 -3 -1 -2 -2 -3 0 -2 -2 -2 -3

Serine scored differently

in these two positions

Active site nucleophile


Gapped alignments l.jpg
Gapped Alignments

  • Gapping provides more biologically realistic alignments

  • Statistical behavior not completely understood for gapped alignments

    • Gapped BLAST parameters must be found by simulations for each matrix

  • Affine gap costs = -(a+bk)

    • a = gap open penalty b = gap extend penalty

    • A gap of length 1 receives the score -(a+b)



ad