
NCBI Molecular Biology Resources

A Field Guide

Nov. 6, 2001

NCBI Resources
  • About NCBI
  • NCBI Sequence Databases
    • Primary Database – GenBank
    • Derivative Databases - RefSeq
  • Entrez Databases and Text Searching
  • BLAST Services
  • Genomic Resources
The National Center for Biotechnology Information (NCBI)
  • Created as a part of the National Library of Medicine in 1988
    • Establish public databases
    • Research in computational biology
    • Develop software tools for sequence analysis
    • Disseminate biomedical information
  • Tools: BLAST (1990), Entrez (1992)
  • GenBank (1992)
  • Free MEDLINE (PubMed, 1997)
  • Other databases: dbEST, dbGSS, dbSTS, MMDB, OMIM, UniGene, GeneMap, Taxonomy, CGAP, SAGE, LocusLink, RefSeq
Molecular Databases
  • Primary Databases
    • Original submissions by experimentalists
    • Database staff organize but don’t add additional information
      • Example: GenBank
  • Derivative Databases
    • Human curated
      • Compilation and correction of data
      • Example: SWISS-PROT, NCBI RefSeq mRNA
    • Computationally derived
      • Example: UniGene
    • Combinations
      • Example: NCBI Genome Assembly
What is GenBank? NCBI’s Primary Sequence Database
  • Nucleotide-only sequence database
  • Archival in nature
  • GenBank Data
    • Direct submissions of individual records (BankIt, Sequin)
    • Batch submissions via email (EST, GSS, STS)
    • FTP accounts for sequencing centers
  • Data shared nightly among three collaborating databases
    • GenBank
    • DNA Database of Japan (DDBJ)
    • European Molecular Biology Laboratory Database (EMBL) at EBI
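[Figure: the three collaborating databases. GenBank (NCBI, NIH; searched through Entrez), EMBL (EBI; searched through SRS), and DDBJ (CIB, NIG; searched through getentry) each receive submissions and updates and exchange them with one another.]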

GenBank

Release 126, October 2001
  • 13,602,262 records
  • 14,396,883,064 nucleotides
  • 80,000+ species

  • Full release every two months
  • Incremental and cumulative updates daily
  • Available only through the Internet

ftp://ncbi.nlm.nih.gov/genbank/
or
ftp://genbank.sdsc.edu/pub/
GenBank on FTP site

ftp> open ftp.ncbi.nlm.nih.gov

.

.

ftp> cd genbank

Release 125: 243 files; 55.23 Gigabytes uncompressed

GenBank Divisions

Bulk Sequence Divisions
  • PAT: Patent
  • EST: Expressed Sequence Tags (133 files)
  • STS: Sequence Tagged Site
  • GSS: Genome Survey Sequence (41 files)
  • HTG: High Throughput Genome (25 files)
  • HTC: High Throughput cDNA
  • CON: Contig

Traditional Divisions
  • BCT, INV, MAM, PHG, PLN, PRI, ROD, SYN, UNA, VRL, VRT

EST Division: Expressed Sequence Tags

[Figure: EST workflow. Nucleus with 30,000 genes; 80-100,000 RNA gene products; make cDNA library; 80-100,000 unique cDNA clones in library; isolate unique clones and sequence once from each end (5’ and 3’). Example reads shown in the figure: gatccantgccatacg, ctcgccaattcnntcg.]

>IMAGE:275615 5' mRNA sequence
GACAGCATTCGGGCCGAGATGTCTCGCTCCGTGGCCTTAGCTGTGCTCGCGCTACTCTCTCTTTCTGGCC
TGGAGGTATCCAGCGTACTCCAAAGATTCAGGTTTACTCACGTCATCCAGCAGAGAATGGAAAGTCAAAT
TTCCTGAATTGCTATGTGTCTGGGTTTCATCCATCCGACATTGAAGTTGACTTACTGAAGAATGGAGAGA
GAATTGAAAAAGTGGAGCATTCAGACTTGTCTTTCAGCAAGGACTGGTCTTTCTATCTCTTGTACTACAC
TGAATTCACCCCCACTGAAAAAGATGAGTATGCCTGCCGTGTTGAACCATGTNGACTTTGTCACAGNCCC
AAGTTNAGTTTAAGTGGGNATCGAGACATGTAAGGCAGGCATCATGGGAGGTTTTGAAGNATGCCGCNTT
TTGGATTGGGATGAATTCCAAATTTCTGGTTTGCTTGNTTTTTTAATATTGGATATGCTTTTG

>IMAGE:275615 3', mRNA sequence
NNTCAAGTTTTATGATTTATTTAACTTGTGGAACAAAAATAAACCAGATTAACCACAACCATGCCTTACT
TTATCAAATGTATAAGANGTAAATATGAATCTTATATGACAAAATGTTTCATTCATTATAACAAATTTCC
AATAATCCTGTCAATNATATTTCTAAATTTTCCCCCAAATTCTAAGCAGAGTATGTAAATTGGAAGTTAA
CTTATGCACGCTTAACTATCTTAACAAGCTTTGAGTGCAAGAGATTGANGAGTTCAAATCTGACCAAGAT
GTTGATGTTGGATAAGAGAATTCTCTGCTCCCCACCTCTANGTTGCCAGCCCTC

STS Division: Sequence Tagged Sites
  • Segment of a gene, EST, mRNA, or genomic DNA of known position (e.g., a microsatellite)
  • PCR with STS primers gives unique product (one per genome)
  • Basis of Radiation Hybrid Mapping
    • UniGene
    • Genome Assembly
  • Related resource: Electronic PCR

http://www.ncbi.nlm.nih.gov/genome/sts/epcr.cgi

RH mapping using STSs

[Figure: STS markers A, B, C, and D along a human chromosome; hybrid cells that each retain different chromosome fragments; a table of PCR results (+ or -) for markers A-D in each hybrid.]
ePCR Results Hexokinase 1 EST

SHGC-35892

dbSTS id: 44155, GenBank Accession: G29974

Organism: Homo sapiens

Primer1: CATACGACACGGCTCACAAA

Primer2: CTGTTTGTCTCGTGGGGG

STS location: 30..160 Chromosome: 10

Expected amplicon size: 129, Observed amplicon size: 130

Primers match in forward orientation

Query sequence:

1 TTTTTGAATT GGTACAAAGT TTACTAGGTC ATACGACACG GCTCACAAAG CGGTGGGAAA

61 TTCCAGTGAT GGCATTGTTT GTTGGTTGGT TCCTTTTATC CAAATGGAGA CAAGACACAT

121 TTCCGCAGAC GTGTCCACCT CCCCCCACGA GACAAACAGA ATGCAAGACT GTCACACGCG

181 GCTAGGACTG GTTCCACGGA CACACGATTT TGTGGCATTG ACACACCACG ATGCGATGCC

241 AGGCCACAGT GGGTGCCAGG AGGGGAGGAA GCAGCTAATG CTATGCCCAC ACTCGCCTTC

301 AGCATGTGCC CCGGGAGGAG GCCCGGCAGT GTCTGCTGGT GATAATACAT TTCACACGGG

361 GAGGGGGAAC CAAGGATGAG CTTTGGAGGC CAGAAGGCTG TCAGGTGGTG TG
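
The primer matching behind a result like this can be sketched in a few lines. The following is a minimal illustration, not NCBI's Electronic PCR implementation: it assumes exact string matches only, and the function names `reverse_complement` and `find_amplicon` are hypothetical. It looks for the first primer on the given strand and the reverse complement of the second primer downstream of it, using the first 180 bases of the query sequence shown above.

```python
# Hypothetical helpers for illustration; exact matching only, no mismatches.
def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    pairs = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(pairs[base] for base in reversed(seq))


def find_amplicon(template, primer1, primer2):
    """Find primer1 on the given strand and primer2 on the opposite strand.

    Returns (start, end, size) in 1-based coordinates, or None when either
    primer has no exact match.
    """
    start = template.find(primer1)
    if start == -1:
        return None
    site = reverse_complement(primer2)
    hit = template.find(site, start + len(primer1))
    if hit == -1:
        return None
    end = hit + len(site)
    return start + 1, end, end - start


# First 180 bases of the query sequence from the slide (spaces removed).
QUERY = "".join("""
TTTTTGAATT GGTACAAAGT TTACTAGGTC ATACGACACG GCTCACAAAG CGGTGGGAAA
TTCCAGTGAT GGCATTGTTT GTTGGTTGGT TCCTTTTATC CAAATGGAGA CAAGACACAT
TTCCGCAGAC GTGTCCACCT CCCCCCACGA GACAAACAGA ATGCAAGACT GTCACACGCG
""".split())

result = find_amplicon(QUERY, "CATACGACACGGCTCACAAA", "CTGTTTGTCTCGTGGGGG")
print(result)
```

For these bases the sketch reports a product of 130 bases starting at base 30, which agrees with the observed amplicon size listed on the slide.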

Genome Sequencing

[Figure: shotgun sequencing workflow. Whole BAC insert (or genome); sonication; cloning and isolating; sequencing (GSS division); assembly; draft sequence (HTG division).]
GSS Division: Genome Survey Sequences
  • Genomic equivalent of ESTs
  • BAC and other first pass surveys
  • BAC end sequences
  • Whole Genome Shotgun (some)
  • RAPIDS and other anonymous loci

[Figure: a genomic clone (BAC) with its SP6 end and T7 end marked.]

HTG Division: High Throughput Genome Records
  • Phase 1: HTG division, Acc = AC008701, gi = 6601005
  • Phase 2: HTG division, Acc = AC008701, gi = 6671909
  • Phase 3: PRI division, Acc = AC008701, gi = 7328720
  • Record size: 40,000 to > 350,000 bp

A Simple GenBank Record

LOCUS       AF062069     3808 bp    mRNA            INV       02-MAR-2000
DEFINITION  Limulus polyphemus myosin III mRNA, complete cds.
ACCESSION   AF062069
VERSION     AF062069.2  GI:7144484
KEYWORDS    .
SOURCE      Atlantic horseshoe crab.
  ORGANISM  Limulus polyphemus
            Eukaryota; Metazoa; Arthropoda; Chelicerata; Merostomata;
            Xiphosura; Limulidae; Limulus.
REFERENCE   1  (bases 1 to 3808)
  AUTHORS   Battelle,B.-A., Andrews,A.W., Calman,B.G., Sellers,J.R.,
            Greenberg,R.M. and Smith,W.C.
  TITLE     A myosin III from Limulus eyes is a clock-regulated phosphoprotein
  JOURNAL   J. Neurosci. (1998) In press
REFERENCE   2  (bases 1 to 3808)
  AUTHORS   Battelle,B.-A., Andrews,A.W., Calman,B.G., Sellers,J.R.,
            Greenberg,R.M. and Smith,W.C.
  TITLE     Direct Submission
  JOURNAL   Submitted (29-APR-1998) Whitney Laboratory, University of Florida,
            9505 Ocean Shore Blvd., St. Augustine, FL 32086, USA
REFERENCE   3  (bases 1 to 3808)
  AUTHORS   Battelle,B.-A., Andrews,A.W., Calman,B.G., Sellers,J.R.,
            Greenberg,R.M. and Smith,W.C.
  TITLE     Direct Submission
  JOURNAL   Submitted (02-MAR-2000) Whitney Laboratory, University of Florida,
            9505 Ocean Shore Blvd., St. Augustine, FL 32086, USA
  REMARK    Sequence update by submitter
COMMENT     On Mar 2, 2000 this sequence version replaced gi:3132700.

GenBank Record, cont.

FEATURES             Location/Qualifiers
     source          1..3808
                     /organism="Limulus polyphemus"
                     /db_xref="taxon:6850"
                     /tissue_type="lateral eye"
     CDS             258..3302
                     /note="N-terminal protein kinase domain; C-terminal myosin
                     heavy chain head; substrate for PKA"
                     /codon_start=1
                     /product="myosin III"
                     /protein_id="AAC16332.2"
                     /db_xref="GI:7144485"
                     /translation="MEYKCISEHLPFETLPDPGDRFEVQELVGTGTYATVYSAIDKQA
                     NKKVALKIIGHIAENLLDIETEYRIYKAVNGIQFFPEFRGAFFKRGERESDNEVWLGI
                     EFLEEGTAADLLATHRRFGIHLKEDLIALIIKEVVRAVQYLHENSIIHRDIRAANIMF
                     SKEGYVKLIDFGLSASVKNTNGKAQSSVGSPYWMAPEVISCDCLQEPYNYTCDVWSIG
                     ITAIELADTVPSLSDIHALRAMFRINRNPPPSVKRETRWSETLKDFISECLVKNPEYR
                     PCIQEIPQHPFLAQVEGKEDQLRSELVDILKKNPGEKLRNKPYNVTFKNGHLKTISGQ
BASE COUNT     1201 a    689 c    782 g   1136 t
ORIGIN
        1 tcgacatctg tggtcgcttt ttttagtaat aaaaaattgt attatgacgt cctatctgtt
     3781 aagatacagt aactagggaa aaaaaaaa
//

Sequence and Database Identifiers

Locus, accession, gi, version

LOCUS       AF062069     3808 bp    mRNA            INV       02-MAR-2000
DEFINITION  Limulus polyphemus myosin III mRNA, complete cds.
ACCESSION   AF062069
VERSION     AF062069.2  GI:7144484

  • LOCUS line
    • Locus Name: AF062069
    • Sequence length: 3808 bp
    • mol-type: mRNA (= cDNA); other values include rRNA, snRNA, DNA
    • GB Division: INV
    • Modification Date: 02-MAR-2000
  • DEFINITION: DEF line (Title)
  • ACCESSION: Accession Number
  • VERSION
    • Accession.version: AF062069.2
    • gi number: 7144484
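
As a rough illustration of how these identifiers can be pulled out of a record, the sketch below splits the LOCUS and VERSION lines on whitespace. It is a minimal example, not an NCBI parser: the function names are hypothetical, and it assumes the whitespace-separated layout shown on these slides, with the division and date as the last two LOCUS fields.

```python
# Hypothetical helpers; assume the LOCUS/VERSION layout shown on the slides.
def parse_locus(line):
    """Split a LOCUS line into the fields annotated on the slide."""
    parts = line.split()
    if parts[0] != "LOCUS":
        raise ValueError("not a LOCUS line")
    return {
        "locus_name": parts[1],
        "length": int(parts[2]),
        "unit": parts[3],          # "bp" for nucleotides, "aa" for proteins
        "mol_type": parts[4],
        "division": parts[-2],     # counted from the end, so an optional
        "date": parts[-1],         # "circular" field does not shift these
    }


def parse_version(line):
    """Split a VERSION line into accession, version number, and gi."""
    _, acc_version, gi_field = line.split()
    accession, version = acc_version.split(".")
    return {
        "accession": accession,
        "version": int(version),
        "gi": int(gi_field.split(":")[1]),
    }


locus = parse_locus("LOCUS AF062069 3808 bp mRNA INV 02-MAR-2000")
version = parse_version("VERSION AF062069.2 GI:7144484")
print(locus["division"], version["gi"])
```

Reading the division and date from the end of the line is a design choice that lets the same function handle a line such as the circular NC_ record later in this deck.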

Keywords, Source-organism

KEYWORDS    .
SOURCE      Atlantic horseshoe crab.
  ORGANISM  Limulus polyphemus
            Eukaryota; Metazoa; Arthropoda; Chelicerata; Merostomata;
            Xiphosura; Limulidae; Limulus.

  • KEYWORDS: legacy field; exceptions are EST, GSS, and HTG
  • SOURCE: accepted common name
  • ORGANISM: scientific name, followed by the taxonomic lineage according to GenBank

Citation

REFERENCE 1 (bases 1 to 3808)

AUTHORS Battelle,B.-A., Andrews,A.W., Calman,B.G., Sellers,J.R.,

Greenberg,R.M. and Smith,W.C.

TITLE A myosin III from Limulus eyes is a clock-regulated phosphoprotein

JOURNAL J. Neurosci. (1998) In press

REFERENCE 2 (bases 1 to 3808)

AUTHORS Battelle,B.-A., Andrews,A.W., Calman,B.G., Sellers,J.R.,

Greenberg,R.M. and Smith,W.C.

TITLE Direct Submission

JOURNAL Submitted (29-APR-1998) Whitney Laboratory, University of Florida,

9505 Ocean Shore Blvd., St. Augustine, FL 32086, USA

REFERENCE 3 (bases 1 to 3808)

AUTHORS Battelle,B.-A., Andrews,A.W., Calman,B.G., Sellers,J.R.,

Greenberg,R.M. and Smith,W.C.

TITLE Direct Submission

JOURNAL Submitted (02-MAR-2000) Whitney Laboratory, University of Florida,

9505 Ocean Shore Blvd., St. Augustine, FL 32086, USA

REMARK Sequence update by submitter

COMMENT On Mar 2, 2000 this sequence version replaced gi:3132700.

  • REFERENCE 1: article
  • REFERENCE 2: submitter block
  • REFERENCE 3 and REMARK: update history
  • COMMENT: previous version

Feature Table

FEATURES Location/Qualifiers

source 1..3808

/organism="Limulus polyphemus"

/db_xref="taxon:6850"

/tissue_type="lateral eye"

CDS 258..3302

/note="N-terminal protein kinase domain;

C-terminal myosin heavy chain head; substrate for PKA"

/codon_start=1

/product="myosin III"

/protein_id="AAC16332.2"

/db_xref="GI:7144485"

/translation="MEYKCISEHLPFETLPDPGDRFEVQELVGTGTYATVYSAIDK

NKKVALKIIGHIAENLLDIETEYRIYKAVNGIQFFPEFRGAFFKRGERESDNEVWL

"

  • source: biosource
  • CDS: coding sequence
  • /codon_start: reading frame
  • /protein_id and /db_xref: GenPept protein identifiers
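
A simple location such as `258..3302` can be checked with a little arithmetic. The sketch below is an illustration only: the function name `span_length` is hypothetical, and it handles just the plain `start..end` form, not `join(...)` or `complement(...)` locations. It computes the CDS length and confirms that it is a whole number of codons.

```python
# Hypothetical helper; handles only plain "start..end" locations.
def span_length(location):
    """Length in bases of an inclusive, 1-based 'start..end' location."""
    start, end = (int(value) for value in location.split(".."))
    return end - start + 1


cds_length = span_length("258..3302")
codons, remainder = divmod(cds_length, 3)
print(cds_length, codons, remainder)
```

The CDS spans 3045 bases, or 1015 codons with nothing left over. If the CDS is complete, as the DEFINITION line states, that would correspond to 1014 amino acids plus a stop codon.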

Sequence

BASE COUNT     1201 a    689 c    782 g   1136 t
ORIGIN
        1 tcgacatctg tggtcgcttt ttttagtaat aaaaaattgt attatgacgt cctatctgtt
<sequence omitted>
     3721 accaatgtta taatatgaaa tgaaataaag cagtcatggt agcagtggct gtttgaaata
     3781 aagatacagt aactagggaa aaaaaaaa
//

  • ORIGIN: indicates beginning of sequence data
  • //: end of record

NCBI Derivative Sequence Databases: RefSeq

NCBI Reference Sequences

mRNAs and Proteins

NM_123456 Curated mRNA

NP_123456 Curated Protein

XM_123456 Predicted Transcript

XP_123456 Predicted Protein

Gene Records

NG_123456 Reference Genomic Sequence

Assemblies

NT_123456 Contig (Mouse and Human Genomes)

NC_123455 Chromosome (Microbial Genomes)

Curated RefSeq Records: NM_, NP_

RefSeq Nucleotide

LOCUS       NM_000492    6159 bp    mRNA            PRI       26-JUL-1999
DEFINITION  Homo sapiens cystic fibrosis transmembrane conductance
            regulator (CFTR) mRNA.
ACCESSION   NM_000492

RefSeq Protein

LOCUS       NP_000483    1480 aa                    PRI       26-JUL-1999
DEFINITION  cystic fibrosis transmembrane conductance regulator.
ACCESSION   NP_000483
PID         g4502785
VERSION     NP_000483.1  GI:4502785
DBSOURCE    REFSEQ: accession NM_000492.1
COMMENT     REFSEQ: This reference sequence was derived from M55131.
            PROVISIONAL RefSeq: This is a provisional reference sequence
            record that has not yet been subject to human review. The final
            curated reference sequence record may be somewhat different from
            this one.

Reviewed

            REFSEQ: This reference sequence was derived from M28668.1,
            M55131.1.
            On Feb 17, 2000 this sequence version replaced gi:4502784.
            Summary: Cystic fibrosis transmembrane conductance regulator is
            member 7 of the ATP-binding cassette sub-family C. The protein
            functions as a chloride channel and controls the regulation of
            other transport pathways. Mutations in this gene cause the
            autosomal recessive disorder, cystic fibrosis (CF) and congenital
            bilateral aplasia of the vas deferens (CBAVD). Alternative splice
            variants have been described, many of which result from mutations
            in the CFTR gene.
            COMPLETENESS: full length.

Alignment Generated Transcripts: XM_, XP_

LOCUS XM_004980 6128 bp mRNA PRI 16-NOV-2000

DEFINITION Homo sapiens cystic fibrosis transmembrane conductance regulator,

ATP-binding cassette (sub-family C, member 7) (CFTR), mRNA.

ACCESSION XM_004980

VERSION XM_004980.3 GI:13631444

mismatch

RefSeq Human Contig: NT_

mRNA complement(join(1255889..1257642,1258986..1259091,

1259690..1259862,1271619..1271708,1281957..1282112,

1296780..1297028,1309837..1309937,1312742..1312969,

1313881..1314031,1317797..1317876,1320768..1321018,

1321687..1321724,1329492..1329620,1331893..1332616,

1334111..1334197,1336717..1336811,1364895..1365086,

1375727..1375909,1382442..1382534,1384204..1384450,

1387877..1388002,1389139..1389302,1390185..1390274,

1393436..1393651,1415408..1415516,1420187..1420297,

1444403..1444587))

/partial

/gene="CFTR"

/product="cystic fibrosis transmembrane conductance

regulator, ATP-binding cassette (sub-family C, member 7)"

/transcript_id="XM_004980.1"

/db_xref="LocusID:1080"

/db_xref="MIM:602421"

/note="derived by automated computational analysis using

gene prediction method: Acembly. Supporting evidence

includes similarity to: 9 proteins, 1 mRNAs See details in

AceView"

gene complement(1255889..1444587)

/gene="CFTR"

/note="CF; MRP7; ABC35; ABCC7"

/db_xref="LocusID:1080"

LOCUS NT_007935 1888399 bp DNA CON 16-NOV-2000

DEFINITION Homo sapiens chromosome 7 working draft sequence segment,

complete sequence.

ACCESSION NT_007935

VERSION NT_007935.1 GI:11422165

KEYWORDS HTG.

SOURCE human.

ORGANISM Homo sapiens

Eukaryota; Metazoa; Chordata; Craniata; Vertebrata;

Euteleostomi;Mammalia; Eutheria; Primates; Catarrhini;

Hominidae; Homo.

REFERENCE 1 (bases 1 to 1888399)

AUTHORS International Human Genome Project collaborators.

TITLE Toward the complete sequence of the human genome

JOURNAL Unpublished

COMMENT GENOME ANNOTATION REFSEQ: NCBI contigs are derived from

assembled genomic sequence data. They may include both

draft and finished sequence.

COMPLETENESS: not full length.

CONTIG join(AC073042.3:1155..2680,gap(100),AC074390.2:119526..151445,

gap(100),AC074390.2:1..5245,gap(100),

complement(AC074390.2:17705..23645),gap(100),

AC074390.2:97658..119425,AC073042.3:106479..121155,

AC074390.2:164226..165036,AC073042.3:70628..79503,gap(100),

AC073042.3:4627..6382,gap(100),AC073042.3:2781..4526,gap(100),

complement(AC073042.3:183627..209083),gap(100),

AC073042.3:79604..88622,gap(100),AC073042.3:139234..160437,

gap(100),complement(AC073042.3:6483..8319),gap(100),

complement(AC073042.3:39354..45372),gap(100),

complement(AC073042.3:21461..24064),gap(100),

AC074390.2:156347..160294,gap(100),

complement(AC074390.2:5346..10750),gap(100),

complement(AC074390.2:153911..156246),gap(100),

complement(AC074390.2:23746..32402),gap(100),

complement(AC074390.2:151546..153810),gap(100),

complement(AC074390.2:57277..75275),gap(100),

complement(AC074390.2:75376..97557),gap(100),

Reordering draft sequence

RefSeq Chromosomes: NC_

LOCUS NC_002695 5498450 bp DNA circular BCT 02-OCT-2001

DEFINITION Escherichia coli O157:H7, complete genome.

ACCESSION NC_002695

VERSION NC_002695.1 GI:15829254

KEYWORDS .

SOURCE Escherichia coli O157:H7.

ORGANISM Escherichia coli O157:H7

Bacteria; Proteobacteria; gamma subdivision; Enterobacteriaceae;

Escherichia.

REFERENCE 1 (sites)

AUTHORS Makino,K., Yokoyama,K., Kubota,Y., Yutsudo,C.H., Kimura,S.,

Kurokawa,K., Ishii,K., Hattori,M., Tatsuno,I., Abe,H., Iida,T.,

Yamamoto,K., Ohnishi,M., Hayashi,T., Yasunaga,T., Honda,T.,

Sasakawa,C. and Shinagawa,H.

TITLE Complete nucleotide sequence of the prophage VT2-Sakai carrying the

verotoxin 2 genes of the enterohemorrhagic Escherichia coli O157:H7

derived from the Sakai outbreak

JOURNAL Genes Genet. Syst. 74 (5), 227-239 (1999)

MEDLINE 20198780

PUBMED 10734605

Other NCBI Derivative Databases

UniGene - gene-oriented expressed sequence clusters

LocusLink - central resource and interface for known genes


Using Entrez

An integrated database search and retrieval system

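Entrez: Neighboring and Hard Links

[Figure: the Entrez databases (PubMed abstracts, Nucleotide sequences, Protein sequences, 3-D Structure (MMDB), Genomes, Taxonomy) joined by hard links. Neighbors within a database are computed by word weight (PubMed abstracts), BLAST (nucleotide and protein sequences), VAST (3-D structure), and phylogeny (Taxonomy).]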

WWW Entrez
  • Literature: all of MEDLINE plus others; abstracts; links to online journals
  • Nucleotide sequences: GenBank, EMBL, DDBJ; RefSeq, PDB
  • Protein sequences: GenBank, DDBJ, EMBL translations; PDB, PIR, SWISS-PROT, PRF, RefSeq
  • Genomes: reference genomes with graphical views, assembled sequence, and mapping data
  • Structures: NCBI’s MMDB, derived from PDB


Database Searching with Entrez

Using limits and field restriction to find mouse GAPD

Linking and neighboring with mouse GAPD
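
A field-restricted query of this kind can also be written as a text string. The sketch below is an assumption-laden illustration: it uses NCBI's E-utilities `esearch` endpoint, which is not mentioned in these slides, and the helper names `build_term` and `esearch_url` are hypothetical. The field names `Organism` and `Gene Name` are taken from the Entrez field list in this deck. The code only builds the URL; it does not contact the server.

```python
from urllib.parse import urlencode

# Assumption: the E-utilities esearch endpoint, not shown in the slides.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_term(restrictions):
    """Join (value, field) pairs into an Entrez query such as a[X] AND b[Y]."""
    return " AND ".join(f"{value}[{field}]" for value, field in restrictions)


def esearch_url(db, term):
    """Return a search URL for the given database and query string."""
    return ESEARCH + "?" + urlencode({"db": db, "term": term})


term = build_term([("mouse", "Organism"), ("gapd", "Gene Name")])
url = esearch_url("nucleotide", term)
print(term)
print(url)
```

Restricting "mouse" to the Organism field avoids hits where the word merely appears somewhere in the record, which is the problem the next slide illustrates.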

Document Summaries: Mouse[All Fields]
  • 3 million records
  • Chicken, not mouse!?

Entrez Nucleotides: Limits

Search term: Mouse

Field Restriction: Accession, All Fields, Author Name, EC/RN Number, Feature key, Filter, Gene Name, Issue, Journal Name, Keyword, Modification Date, Organism, Page Number, Primary Accession, Properties, Protein Name, Publication Date, SeqID String, Sequence Length, Substance Name, Text Word, Title Word, Uid, Volume

Exclude unwanted categories of sequences
  • Gene Location: Genomic DNA/RNA, Mitochondrion, Chloroplast
  • Molecule: Genomic DNA/RNA, mRNA, rRNA
  • Only From: RefSeq, GenBank, EMBL, DDBJ

Adding Terms: Preview/Index

Term added: glyceraldehyde 3 phosphate dehydrogenase

Index fields: Accession, All Fields, Author Name, EC/RN Number, Feature key, Filter, Gene Name, Issue, Journal Name, Keyword, Modification Date, Organism, Page Number, Primary Accession, Properties, Protein Name, Publication Date, SeqID String, Sequence Length, Substance Name, Text Word, Title Word, Uid, Volume

Search History

Displaying Mouse GAPD Records
  • Formats: Summary, Brief, GenBank, ASN.1, FASTA, GI list
  • Links and neighbors (related records): LinkOut, PubMed Links, Protein Links, Nucleotide Neighbors, PopSet Links, Structure Links, Genome Links, Taxonomy Links, OMIM Links

FASTA Format

>gi|193425|gb|M60978.1|MUSGAPDS Mus musculus testis-specific isoform of glycerald

GGCAGCCAGGCCATGAGATCTTAGGCCATGTCGAGACGTGACGTGGTCCTTACCAATGTTACTGTTGTCC

AGCTACGGCGGGACCGATGCCCATGCCCATGCCCATGCCCATGTCCATGCCCATGCCCTGTGATCAGACC

ACCTCCACCCAAGCTTGAGGATCCACCACCCACGGTTGAAGAACAGCCACCGCCACCGCCGCCGCCACCT

CCACCTCCACCACCACCTCCTCCTCCTCCTCCACCCCAGATAGAGCCAGACAAGTTTGAAGAGGCTCCCC

CTCCCCCTCCCCCTCCTCCTCCTCCTCCCCCTCCCCCTCCTCCACCACTCCAAAAGCCAGCTAGAGAGCT

GACAGTGGGTATCAATGGATTTGGACGCATTGGTCGTCTGGTGCTGCGAGTCTGCATGGAGAAGGGCATT

AGGGTGGTAGCAGTGAATGACCCATTCATTGATCCAGAATACATGGTTTACATGTTCAAATATGACTCCA

CACATGGTAGATACAAAGGAAACGTGGAACATAAGAATGGACAACTAGTTGTGGACAACCTTGAGATCAA

CACGTACCAGTGCAAAGACCCTAAAGAAATCCCCTGGAGCTCTATAGGGAATCCCTACGTGGTGGAGTGT

ACAGGCGTCTATCTGTCCATCGAGGCAGCTTCGGCACATATTTCATCTGGTGCCAGGCGTGTGGTGGTCA

CTGCACCCTCCCCCGATGCACCCATGTTTGTCATGGGAGTGAACGAGAAGGACTATAACCCTGGCTCTAT

GACCATTGTCAGCAATGCATCCTGTACCACCAACTGCCTGGCTCCTCTCGCCAAGGTTATTCATGAAAAC

TTCGGGATCGTGGAAGGGCTAATGACCACAGTCCATTCCTACACAGCCACTCAGAAGACAGTGGATGGGC

CATCAAAGAAGGACTGGCGAGGTGGCCGCGGCGCTCACCAAAACATCATCCCATCGTCCACTGGGGCTGC

CAAGGCTGTAGGCAAAGTCATCCCAGAGCTCAAAGGGAAGCTAACAGGAATGGCATTCCGGGTGCCAACC

CCAAACGTGTCAGTTGTGGACCTGACCTGCCGCCTGGCCAAGCCTGCTTCTTACTCGGCTATCACGGAGG

CTGTGAAAGCTGCAGCCAAGGGACCTTTGGCTGGCATCCTTGCTTACACAGAGGACCAGGTGGTCTCCAC

GGACTTTAACGGCAATCCCCATTCTTCCATCTTTGATGCTAAGGCTGGAATTGCCCTCAATGACAACTTC

GTGAAGCTTGTTGCCTGGTACGACAACGAATATGGCTACAGTAACCGAGTGGTCGACCTCCTCCGCTACA

TGTTTAGCCGAGAGAAGTAACACAAAAGGCCCCTCCTTGCTCCCCTGCGCACCTCGCGTTCCTGACTTCG

GCTTCCACTCAAAGGCGCCGCCACCGGGTCAACAATGAAATAAAAACGAGAATGCGC

FASTA Definition Line

>gi|193425|gb|M60978.1|MUSGAPDS

  • gi number: 193425
  • Database identifier: gb
  • Accession number: M60978.1
  • Locus Name: MUSGAPDS

Database Identifiers
  • gb: GenBank
  • emb: EMBL
  • dbj: DDBJ
  • sp: SWISS-PROT
  • pdb: Protein Databank
  • pir: PIR
  • prf: PRF
  • ref: RefSeq
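
The definition line can be taken apart by splitting on the `|` separator. The sketch below is a minimal example under stated assumptions: the function name `parse_defline` is hypothetical, and it assumes the five-part `gi|number|db|accession|locus` layout shown above rather than every identifier style that can appear in FASTA headers.

```python
# Database tags as listed on the slide.
DATABASES = {
    "gb": "GenBank", "emb": "EMBL", "dbj": "DDBJ", "sp": "SWISS-PROT",
    "pdb": "Protein Databank", "pir": "PIR", "prf": "PRF", "ref": "RefSeq",
}


def parse_defline(defline):
    """Split a 'gi|number|db|accession|locus description' definition line."""
    identifiers, _, description = defline.lstrip(">").partition(" ")
    fields = identifiers.split("|")
    if len(fields) != 5 or fields[0] != "gi":
        raise ValueError("unexpected definition line layout")
    return {
        "gi": int(fields[1]),
        "database": DATABASES.get(fields[2], fields[2]),
        "accession": fields[3],
        "locus": fields[4],
        "description": description,
    }


record = parse_defline(
    ">gi|193425|gb|M60978.1|MUSGAPDS Mus musculus testis-specific isoform"
)
print(record["database"], record["accession"])
```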

Abstract Syntax Notation: ASN.1

[Figure: the ASN.1 record shown alongside the formats derived from it: GenBank, GenPept, FASTA Nucleotide, and FASTA Protein.]

Seq-entry ::= set {
  level 1 ,
  class nuc-prot ,
  descr {
    title "Mus musculus testis-specific isoform of glyceraldehyde 3-phosphate
 dehydrogenase (Gapd-S) mRNA, and translated products" ,
    update-date
      std {
        year 1994 ,
        month 11 ,
        day 9 } ,
    source {
      org {
        taxname "Mus musculus" ,
        common "house mouse" ,
        db {
          {
            db "taxon" ,
            tag
              id 10090 } } ,

NCBI Toolbox

Toolbox Sources

ftp> open ncbi.nlm.nih.gov
.
.
ftp> cd toolbox
ftp> cd ncbi_tools

ftp://ncbi.nlm.nih.gov/toolbox/ncbi_tools

/*****************************************************************************
*
*   asn2ff.c
*   	convert an ASN.1 entry to flat file format, using the FFPrintArrayPtrs.
*
*****************************************************************************/
#include <accentr.h>
#include "asn2ff.h"
#include "asn2ffp.h"
#include "ffprint.h"
#include <subutil.h>
#include <objall.h>
#include <objcode.h>
#include <lsqfetch.h>
#include <explore.h>

#ifdef ENABLE_ID1
#include <accid1.h>
#endif

FILE *fpl;

Args myargs[] = {
	{"Filename for asn.1 input","stdin",NULL,NULL,TRUE,'a',ARG_FILE_IN,0.0,0,NULL},
	{"Input is a Seq-entry","F", NULL ,NULL ,TRUE,'e',ARG_BOOLEAN,0.0,0,NULL},
	{"Input asnfile in binary mode","F",NULL,NULL,TRUE,'b',ARG_BOOLEAN,0.0,0,NULL},
	{"Output Filename","stdout", NULL,NULL,TRUE,'o',ARG_FILE_OUT,0.0,0,NULL},
	{"Show Sequence?","T", NULL ,NULL ,TRUE,'h',ARG_BOOLEAN,0.0,0,NULL},

Protein Neighbors-Structure Links

[Figure: screenshots of the Related Proteins list, the Structure Links, and the GAPD structure displayed in Cn3D.]

Entrez Structures

Molecular Modeling Database (MMDB) and Cn3D

MMDB: Molecular Modeling Data Base
  • Derived from experimentally determined PDB records
  • Value added to PDB records including:
    • Addition of explicit chemical graph information
    • Validation
    • Inclusion of Taxonomy, Citation, and other information
    • Conversion to parseable ASN.1 data description language
  • Structure neighbors determined by the Vector Alignment Search Tool (VAST)

Structure Summary

BLAST neighbors

VAST neighbors

Cn3D viewer

Structural Alignments

Chloroquine

NADH

Why do we need similarity searching?
  • Identification and annotation
    • Incomplete or no annotations (GenBank)
    • Incorrectly annotated sequences
  • Evolutionary relationships

Homologous molecules may have similar functions, but it ain’t necessarily so!

Basic Local Alignment Search Tool
  • Widely used similarity search tool
  • Heuristic approach based on the Smith-Waterman algorithm
  • Finds best local alignments
  • Provides statistical significance
  • All combinations (DNA/protein) of query and database
    • DNA vs DNA (blastn)
    • DNA translation vs Protein (blastx)
    • Protein vs Protein (blastp)
    • Protein vs DNA translation (tblastn)
    • DNA translation vs DNA translation (tblastx)
  • WWW, email server, standalone, and network clients
Local Alignment Statistics

High scores of local alignments between two random sequences follow the Extreme Value Distribution.

For ungapped alignments, the expected number with score S or greater is

$$E = K m n \, e^{-\lambda S}$$

or

$$E = m n \, 2^{-S'}$$

  • $K$ = scale for search space
  • $\lambda$ = scale for scoring system
  • $S'$ = bit score $= (\lambda S - \ln K)/\ln 2$

http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html
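
The relationship between raw score, bit score, and expected number can be checked numerically. The sketch below is a worked illustration, not BLAST's code: m and n are taken to be the query and database lengths, and the values of λ, K, m, n, and S are illustrative assumptions chosen only to exercise the formulas.

```python
import math


def bit_score(raw_score, lam, k):
    """Normalize a raw score S to bits: S' = (lambda*S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def expect_from_raw(raw_score, m, n, lam, k):
    """E = K * m * n * exp(-lambda * S)."""
    return k * m * n * math.exp(-lam * raw_score)


def expect_from_bits(bits, m, n):
    """E = m * n * 2 ** (-S')."""
    return m * n * 2.0 ** (-bits)


# Illustrative assumptions, not values taken from the slides.
LAM, K = 0.318, 0.134        # scoring-system and search-space scales
M, N = 250, 50_000_000       # query length and database length
S = 100                      # raw alignment score

bits = bit_score(S, LAM, K)
e_raw = expect_from_raw(S, M, N, LAM, K)
e_bits = expect_from_bits(bits, M, N)
print(bits, e_raw, e_bits)
```

Both routes give the same expected number, which is the point of the bit score: once a score is normalized, E depends only on the search space m·n and not on the scoring system.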

Scoring Systems
  • Nucleic acids: identity matrix
  • Proteins
    • Position Independent Matrices
      • PAM Matrices (Percent Accepted Mutation)
        • Implicit model of evolution
        • Higher PAM numbers all calculated from PAM1
        • PAM250 widely used
      • BLOSUM Matrices (BLOcks SUbstitution Matrix)
        • Empirically determined from alignments of conserved blocks
        • Each includes information up to a certain level of identity
        • BLOSUM62 widely used
    • Position Specific Score Matrices (PSSM)
      • PSI and RPS BLAST
BLOSUM62
  • Identities: common amino acids have low weights; rare amino acids have high weights
  • Substitutions: negative for less likely substitutions; positive for more likely substitutions

A 4

R -1 5

N -2 0 6

D -2 -2 1 6

C 0 -3 -3 -3 9

Q -1 1 0 0 -3 5

E -1 0 0 2 -4 2 5

G 0 -2 0 -1 -3 -2 -2 6

H -2 0 1 -1 -3 0 0 -2 8

I -1 -3 -3 -3 -1 -3 -3 -4 -3 4

L -1 -2 -3 -4 -1 -2 -3 -4 -3 2 4

K -1 2 0 -1 -3 1 1 -2 -1 -3 -2 5

M -1 -1 -2 -3 -1 0 -2 -3 -2 1 2 -1 5

F -2 -3 -3 -3 -2 -3 -3 -3 -1 0 0 -3 0 6

P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4 7

S 1 -1 1 0 -1 0 0 0 -1 -2 -2 0 -1 -2 -1 4

T 0 -1 0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1 1 5

W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1 1 -4 -3 -2 11

Y -2 -2 -2 -3 -2 -1 -2 -3 2 -1 -1 -2 -1 3 -3 -2 -2 2 7

V 0 -3 -3 -3 -1 -2 -2 -3 -3 3 1 -2 1 -1 -2 -2 0 -3 -1 4

X 0 -1 -1 -1 -2 -1 -1 -1 -1 -1 -1 -1 -1 -1 -2 0 0 -2 -1 -1 -1

A R N D C Q E G H I L K M F P S T W Y V X
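
To show how such a matrix is used, the sketch below loads the lower triangle exactly as printed above and scores an ungapped alignment by summing the entries for each aligned pair. It is a minimal illustration; the names `load_matrix` and `ungapped_score` are hypothetical.

```python
# Lower triangle of BLOSUM62, copied from the slide.
TRIANGLE = """
A 4
R -1 5
N -2 0 6
D -2 -2 1 6
C 0 -3 -3 -3 9
Q -1 1 0 0 -3 5
E -1 0 0 2 -4 2 5
G 0 -2 0 -1 -3 -2 -2 6
H -2 0 1 -1 -3 0 0 -2 8
I -1 -3 -3 -3 -1 -3 -3 -4 -3 4
L -1 -2 -3 -4 -1 -2 -3 -4 -3 2 4
K -1 2 0 -1 -3 1 1 -2 -1 -3 -2 5
M -1 -1 -2 -3 -1 0 -2 -3 -2 1 2 -1 5
F -2 -3 -3 -3 -2 -3 -3 -3 -1 0 0 -3 0 6
P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4 7
S 1 -1 1 0 -1 0 0 0 -1 -2 -2 0 -1 -2 -1 4
T 0 -1 0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1 1 5
W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1 1 -4 -3 -2 11
Y -2 -2 -2 -3 -2 -1 -2 -3 2 -1 -1 -2 -1 3 -3 -2 -2 2 7
V 0 -3 -3 -3 -1 -2 -2 -3 -3 3 1 -2 1 -1 -2 -2 0 -3 -1 4
X 0 -1 -1 -1 -2 -1 -1 -1 -1 -1 -1 -1 -1 -1 -2 0 0 -2 -1 -1 -1
"""


def load_matrix(text):
    """Expand a lower-triangular table into a symmetric lookup dictionary."""
    rows = [line.split() for line in text.strip().splitlines()]
    order = [row[0] for row in rows]
    matrix = {}
    for i, row in enumerate(rows):
        for j, value in enumerate(row[1:]):
            matrix[(order[i], order[j])] = int(value)
            matrix[(order[j], order[i])] = int(value)
    return matrix


def ungapped_score(matrix, seq_a, seq_b):
    """Sum substitution scores over two equal-length aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    return sum(matrix[pair] for pair in zip(seq_a, seq_b))


BLOSUM62 = load_matrix(TRIANGLE)
print(ungapped_score(BLOSUM62, "AW", "AW"))   # identities: 4 + 11
print(ungapped_score(BLOSUM62, "IL", "LI"))   # conservative swaps: 2 + 2
```

The rare tryptophan identity (11) contributes far more than the common alanine identity (4), and swapping isoleucine with leucine still scores positively.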

Position Specific Substitution Rates

Typical serine

Active site serine

Position Specific Score Matrix (PSSM)

A R N D C Q E G H I L K M F P S T W Y V

206 D 0 -2 0 2 -4 2 4 -4 -3 -5 -4 0 -2 -6 1 0 -1 -6 -4 -1

207 G -2 -1 0 -2 -4 -3 -3 6 -4 -5 -5 0 -2 -3 -2 -2 -1 0 -6 -5

208 V -1 1 -3 -3 -5 -1 -2 6 -1 -4 -5 1 -5 -6 -4 0 -2 -6 -4 -2

209 I -3 3 -3 -4 -6 0 -1 -4 -1 2 -4 6 -2 -5 -5 -3 0 -1 -4 0

210 S -2 -5 0 8 -5 -3 -2 -1 -4 -7 -6 -4 -6 -7 -5 1 -3 -7 -5 -6

211 S 4 -4 -4 -4 -4 -1 -4 -2 -3 -3 -5 -4 -4 -5 -1 4 3 -6 -5 -3

212 C -4 -7 -6 -7 12 -7 -7 -5 -6 -5 -5 -7 -5 0 -7 -4 -4 -5 0 -4

213 N -2 0 2 -1 -6 7 0 -2 0 -6 -4 2 0 -2 -5 -1 -3 -3 -4 -3

214 G -2 -3 -3 -4 -4 -4 -5 7 -4 -7 -7 -5 -4 -4 -6 -3 -5 -6 -6 -6

215 D -5 -5 -2 9 -7 -4 -1 -5 -5 -7 -7 -4 -7 -7 -5 -4 -4 -8 -7 -7

216 S -2 -4 -2 -4 -4 -3 -3 -3 -4 -6 -6 -3 -5 -6 -4 7 -2 -6 -5 -5

217 G -3 -6 -4 -5 -6 -5 -6 8 -6 -8 -7 -5 -6 -7 -6 -4 -5 -6 -7 -7

218 G -3 -6 -4 -5 -6 -5 -6 8 -6 -7 -7 -5 -6 -7 -6 -2 -4 -6 -7 -7

219 P -2 -6 -6 -5 -6 -5 -5 -6 -6 -6 -7 -4 -6 -7 9 -4 -4 -7 -7 -6

220 L -4 -6 -7 -7 -5 -5 -6 -7 0 -1 6 -6 1 0 -6 -6 -5 -5 -4 0

221 N -1 -6 0 -6 -4 -4 -6 -6 -1 3 0 -5 4 -3 -6 -2 -1 -6 -1 6

222 C 0 -4 -5 -5 10 -2 -5 -5 1 -1 -1 -5 0 -1 -4 -1 0 -5 0 0

223 Q 0 1 4 2 -5 2 0 0 0 -4 -2 1 0 0 0 -1 -1 -3 -3 -4

224 A -1 -1 1 3 -4 -1 1 4 -3 -4 -3 -1 -2 -2 -3 0 -2 -2 -2 -3

  • Serine is scored differently in these two positions
  • Active site nucleophile
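
The difference is easy to see by looking up the serine column in two rows of the matrix. The sketch below copies rows 211 and 216 from the table above; choosing those two rows as the "typical" and "active site" serines is an assumption (row 216 sits in the G-D-S-G-G-P run at positions 214 to 219), and the name `pssm_score` is hypothetical.

```python
COLUMNS = "A R N D C Q E G H I L K M F P S T W Y V".split()

# Rows 211 and 216 of the PSSM, copied from the slide.
PSSM_ROWS = {
    211: "4 -4 -4 -4 -4 -1 -4 -2 -3 -3 -5 -4 -4 -5 -1 4 3 -6 -5 -3",
    216: "-2 -4 -2 -4 -4 -3 -3 -3 -4 -6 -6 -3 -5 -6 -4 7 -2 -6 -5 -5",
}


def pssm_score(position, residue):
    """Score for placing `residue` at `position` of the profile."""
    scores = [int(value) for value in PSSM_ROWS[position].split()]
    return scores[COLUMNS.index(residue)]


for position in (211, 216):
    print(position, pssm_score(position, "S"), pssm_score(position, "T"))
```

At position 211 serine scores 4, the same as a serine identity in BLOSUM62, and threonine is nearly as good (3). At position 216 serine scores 7 and threonine scores -2, so the profile strongly penalizes replacing that residue.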

Gapped Alignments
  • Gapping provides more biologically realistic alignments
  • Statistical behavior not completely understood for gapped alignments
    • Gapped BLAST parameters must be found by simulations for each matrix
  • Affine gap costs = -(a+bk)
    • a = gap open penalty b = gap extend penalty
    • A gap of length 1 receives the score -(a+b)
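
The affine gap cost above can be written directly as a function. This is a minimal sketch with a hypothetical function name; the values a = 11 and b = 1 are illustrative assumptions, not parameters stated on the slide.

```python
def gap_score(length, gap_open, gap_extend):
    """Affine gap score -(a + b*k) for a gap of k residues."""
    if length < 1:
        raise ValueError("gap length must be at least 1")
    return -(gap_open + gap_extend * length)


# Illustrative penalties: a = 11 (open), b = 1 (extend).
for k in (1, 2, 10):
    print(k, gap_score(k, gap_open=11, gap_extend=1))
```

With these values a single gap of length 10 scores -21, while ten separate gaps of length 1 would score -120, so the scheme favors a few long gaps over many short ones.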