Databases

Where to get data?

  • GenBank

    • http://www.ncbi.nlm.nih.gov

  • Protein Databases

    • SWISS-PROT: http://www.expasy.ch/sprot

    • PDB: http://www.pdb.gov/

  • And many others




Working Draft Sequence

[Figure: working draft sequence, with gaps]


The reagent: databases

  • Organized array of information

  • Place where you put things in, and (if all is well) you should be able to get them out again.

  • Resource for other databases and tools.

  • Simplify the information space by specialization.

  • Bonus: Allows you to make discoveries.


  • A database contains files or tables, each containing numerous records and fields.

  • Flat file: the simplest form, either a single large text file or a collection of text files.

  • Relational database: the commonest type. It stores the data within a number of tables (with records and fields). Tables are linked to one another by a shared field called a key.

  • In the relational database model, queries are written in query languages based on relational algebra.

  • Structured Query Language (SQL) is commonly used.
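A minimal sketch of the relational idea, using Python's built-in sqlite3 module. The two-table schema (`gene`, `record`) and its column names are assumptions made for illustration; the accession numbers are the RBP4 examples quoted later in these slides. The shared `gene_id` field acts as the key that links the tables.

```python
import sqlite3

# Hypothetical schema: one table of genes, one table of sequence records.
# The shared gene_id field is the key that links the two tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE gene   (gene_id INTEGER PRIMARY KEY, symbol TEXT);
    CREATE TABLE record (accession TEXT PRIMARY KEY,
                         gene_id   INTEGER REFERENCES gene(gene_id),
                         molecule  TEXT);
""")
conn.execute("INSERT INTO gene VALUES (1, 'RBP4')")
conn.executemany(
    "INSERT INTO record VALUES (?, ?, ?)",
    [("NM_006744", 1, "mRNA"), ("X02775", 1, "genomic DNA")],
)

# An SQL query that joins the tables through the key.
rows = conn.execute("""
    SELECT gene.symbol, record.accession, record.molecule
    FROM gene JOIN record ON gene.gene_id = record.gene_id
    ORDER BY record.accession
""").fetchall()

for row in rows:
    print(row)
```

A flat file would hold the same facts as lines of text; the relational form lets the query system combine tables on demand.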


  • XML (eXtensible Markup Language) is now a general tool for storage of data and information. XHTML is an XML-based reformulation of HTML.

  • The key feature is the use of identifiers called tags

  • <title>Understanding Bioinformatics</title>

  • <publisher> tag can be defined and used to identify book publishers

  • Extraction from XML file is similar to database querying.
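As a sketch of how extraction from XML resembles a database query, the snippet below uses Python's standard xml.etree.ElementTree module. The `<library>` and `<book>` wrapper elements and the publisher name "Example Press" are placeholders invented for the example; only the `<title>` and `<publisher>` tags come from the slide.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: wrapper tags and publisher name are placeholders.
xml_text = """
<library>
  <book>
    <title>Understanding Bioinformatics</title>
    <publisher>Example Press</publisher>
  </book>
</library>
""".strip()

root = ET.fromstring(xml_text)

# "Select title from every book": walk the tree and read the <title> tag.
titles = [book.findtext("title") for book in root.iter("book")]

# "Select title where publisher = ...": a path expression with a predicate.
by_publisher = [
    element.text
    for element in root.findall("./book[publisher='Example Press']/title")
]

print(titles)
print(by_publisher)
```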


Databases

A database can be described as four layers, each illustrated with examples:

  • Data: GenBank flat file, PDB file, interaction record, title of a book, book

  • Storage system: boxes, Oracle, MySQL, PC binary files, Unix text files, bookshelves

  • Query system: a list you look at, a catalogue, indexed files, SQL, grep

  • Information system: the UBC library, Google, Entrez, SRS


Bioinformatics Information Space: July 17, 1999

  • Nucleotide sequences: 4,456,822

  • Protein sequences: 706,862

  • 3D structures: 9,780

  • Human Unigene Clusters: 75,832

  • Maps and Complete Genomes: 10,870

  • Different species node: 52,889

  • dbSNP: 6,377

  • RefGenes: 515

  • Human contigs > 250 kb: 341 (4.9MB)

  • PubMed records: 10,372,886

  • OMIM records: 10,695


The challenge of the information space: Feb 10, 2004

  • Nucleotide records: 36,653,899

  • Protein sequences: 4,436,362

  • 3D structures: 19,640

  • Interactions & complexes: 52,385

  • Human Unigene Clusters: 118,517

  • Maps and Complete Genomes: 6,948

  • Different taxonomy nodes: 283,121

  • Human dbSNP: 13,179,601

  • Human RefSeq records: 22,079

  • bp in human contigs > 5,000 kb (116): 2,487,920,000

  • PubMed records: 12,570,540

  • OMIM records: 15,138
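To put the two snapshots side by side, the sketch below computes the fold increase between July 17, 1999 and Feb 10, 2004 for the categories that appear in both lists. It assumes that similarly named categories (for example "Nucleotide sequences" and "Nucleotide records") count comparable things, which the slides do not state explicitly.

```python
# Counts copied from the two slides; pairing of categories is an assumption.
counts_1999 = {
    "nucleotide": 4_456_822,
    "protein": 706_862,
    "3D structures": 9_780,
    "PubMed": 10_372_886,
    "OMIM": 10_695,
}
counts_2004 = {
    "nucleotide": 36_653_899,
    "protein": 4_436_362,
    "3D structures": 19_640,
    "PubMed": 12_570_540,
    "OMIM": 15_138,
}

# Fold increase = later count divided by earlier count.
fold_increase = {
    name: counts_2004[name] / counts_1999[name] for name in counts_1999
}

for name, fold in fold_increase.items():
    print(f"{name}: {fold:.2f}-fold")
```

Sequence records grow several-fold over the period, while the literature and OMIM counts grow far more slowly.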


From a CBW student course evaluation:

“I could probably live the rest of my life happily without ever seeing the ‘growth of GenBank’ curve … again.”


Databases

  • Primary (archival)

    • GenBank/EMBL/DDBJ

    • UniProt

    • PDB

    • Medline (PubMed)

    • BIND

  • Secondary (curated)

    • RefSeq

    • Taxon

    • UniProt

    • OMIM

    • SGD


http://nar.oupjournals.org/content/vol31/issue1/


Tools of trade for the “armchair scientist”

  • Databases

    • PubMed and other NCBI databases

    • Biochemical databases

    • Protein domain databases

    • Structural databases

    • Genome comparison databases

  • Tools

    • CDD / COGs

    • VAST / FSSP



Types of databases

  • Archival or Primary Data

    • Text: PubMed

    • DNA Sequence: GenBank

    • Protein Sequence: Entrez Proteins, TREMBL

    • Protein Structures: PDB

  • Curated or Processed Data

    • DNA sequences : RefSeq, LocusLink, OMIM

    • Protein Sequences: SWISS-PROT, PIR

    • Protein Structures : SCOP, CATH, MMDB

    • Genomes: Entrez Genomes, COGs

Nucleic Acids Research publishes a Database Issue each January 1, with articles on ~100 different databases (see the NAR database web site).


4 ways to access protein and DNA sequences

[1] LocusLink with RefSeq

[2] Entrez

[3] UniGene

UniGene collects expressed sequence tags (ESTs) into clusters, in an attempt to form one gene per cluster. Use UniGene to study where your gene is expressed in the body, when it is expressed, and see its abundance.

[4] ExPASy SRS


4 ways to access protein and DNA sequences

[1] LocusLink with RefSeq

[2] Entrez

[3] UniGene

[4] ExPASy SRS

There are many bioinformatics servers outside NCBI. Try ExPASy’s sequence retrieval system at http://www.expasy.ch/ (ExPASy = Expert Protein Analysis System).

Or try ENSEMBL at www.ensembl.org for a premier human genome web browser.


National Center for Biotechnology Information (NCBI)

www.ncbi.nlm.nih.gov

Page 24


The National Center for Biotechnology Information (NCBI)

  • Created as a part of the National Library of Medicine, National Institutes of Health in 1988

    • Establish public databases

    • Research in computational biology

    • Develop software tools for sequence analysis

    • Disseminate biomedical information

  • Tools: BLAST(1990), Entrez (1992)

  • GenBank (1992)

  • Free MEDLINE (PubMed, 1997)

  • Other databases: dbEST, dbGSS, dbSTS, MMDB, OMIM, UniGene, Taxonomy, GeneMap, SAGE, LocusLink, RefSeq


What is GenBank?

  • Archival nucleotide sequence database

  • Sample slogans: “Easy deposits, unlimited withdrawals, high interest”, “All bases covered”, “Billions and billions served”

  • Data are shared nightly among three collaborating databases:

    • GenBank at NCBI - Bethesda, Maryland, USA

    • DNA Data Bank of Japan (DDBJ) at NIG - Mishima, Japan

    • European Molecular Biology Laboratory Database (EMBL) at EBI - Hinxton, UK


Some guiding principles of working with GenBank

  • GenBank is a nucleotide-centric view of the information space

  • GenBank is a repository of all publicly available sequences

  • In GenBank, records are grouped for various reasons

  • Data in GenBank is only as good as what you put in


NCBI databases and their links

[Figure: NCBI databases and the links between them. Databases shown: Article Abstracts, Medline, 3-D Structure, MMDB, Taxonomy, Genomes, Protein Sequences, Nucleotide Sequences. Link types shown: Word Weight, Phylogeny, VAST, BLAST.]


Fig. 2.5

Page 25

www.ncbi.nlm.nih.gov


  • PubMed is…

  • National Library of Medicine's search service

  • 16 million citations in MEDLINE

  • links to participating online journals

  • PubMed tutorial (via “Education” on side bar)

Page 24


  • Entrez integrates…

  • the scientific literature;

  • DNA and protein sequence databases;

  • 3D protein structure data;

  • population study data sets;

  • assemblies of complete genomes

Page 24


Entrez is a search and retrieval system that integrates NCBI databases

Page 24


Entrez: An integrated search and retrieval system


  • BLAST is…

  • Basic Local Alignment Search Tool

  • NCBI's sequence similarity search tool

  • supports analysis of DNA and protein databases

  • 100,000 searches per day

Page 25


  • OMIM is…

  • Online Mendelian Inheritance in Man

  • catalog of human genes and genetic disorders

  • edited by Dr. Victor McKusick, others at JHU

Page 25


OMIM record for Presenilin 1 (PSEN1)

Each record provides a state of the art summary of current knowledge

[Figure: annotated OMIM record showing the contents, additional info in OMIM, the associated LocusLink record, external resources, and extensive references to literature]


OMIM Search Results by Titles

Query: alzheimer AND presenilin 1


Entrez Genome: Gene Location

[Figure: Entrez Genomes Map Viewer, integrated view of chromosome 7 with multiple maps (GenBank map, contig map, STS map; STSs, ESTs, etc.)]

[Figure: Entrez Genomes Map Viewer, view of chromosome 14 (cytogenetic map) with gene names, showing the location of PSEN1 and surrounding genes]


  • Books is…

  • searchable resource of on-line books

Page 26


  • TaxBrowser is…

  • browser for the major divisions of living organisms

  • (archaea, bacteria, eukaryota, viruses)

  • taxonomy information such as genetic codes

  • molecular data on extinct organisms

Page 26


  • Structure site includes…

  • Molecular Modelling Database (MMDB): biopolymer structures obtained from the Protein Data Bank (PDB)

  • Cn3D (a 3D-structure viewer)

  • Vector Alignment Search Tool (VAST)

Page 26


PDB

  • Protein Data Bank

    • Protein and nucleic acid (NA) 3D structures

    • Sequence present

    • YAFFF


HEADER LEUCINE ZIPPER 15-JUL-93 1DGC 1DGC 2

COMPND GCN4 LEUCINE ZIPPER COMPLEXED WITH SPECIFIC 1DGC 3

COMPND 2 ATF/CREB SITE DNA 1DGC 4

SOURCE GCN4: YEAST (SACCHAROMYCES CEREVISIAE); DNA: SYNTHETIC 1DGC 5

AUTHOR T.J.RICHMOND 1DGC 6

REVDAT 1 22-JUN-94 1DGC 0 1DGC 7

JRNL AUTH P.KONIG,T.J.RICHMOND 1DGC 8

JRNL TITL THE X-RAY STRUCTURE OF THE GCN4-BZIP BOUND TO 1DGC 9

JRNL TITL 2 ATF/CREB SITE DNA SHOWS THE COMPLEX DEPENDS ON DNA 1DGC 10

JRNL TITL 3 FLEXIBILITY 1DGC 11

JRNL REF J.MOL.BIOL. V. 233 139 1993 1DGC 12

JRNL REFN ASTM JMOBAK UK ISSN 0022-2836 0070 1DGC 13

REMARK 1 1DGC 14

REMARK 2 1DGC 15

REMARK 2 RESOLUTION. 3.0 ANGSTROMS. 1DGC 16

REMARK 3 1DGC 17

REMARK 3 REFINEMENT. 1DGC 18

REMARK 3 PROGRAM X-PLOR 1DGC 19

REMARK 3 AUTHORS BRUNGER 1DGC 20

REMARK 3 R VALUE 0.216 1DGC 21

REMARK 3 RMSD BOND DISTANCES 0.020 ANGSTROMS 1DGC 22

REMARK 3 RMSD BOND ANGLES 3.86 DEGREES 1DGC 23

REMARK 3 1DGC 24

REMARK 3 NUMBER OF REFLECTIONS 3296 1DGC 25

REMARK 3 RESOLUTION RANGE 10.0 - 3.0 ANGSTROMS 1DGC 26

REMARK 3 DATA CUTOFF 3.0 SIGMA(F) 1DGC 27

REMARK 3 PERCENT COMPLETION 98.2 1DGC 28

REMARK 3 1DGC 29

REMARK 3 NUMBER OF PROTEIN ATOMS 456 1DGC 30

REMARK 3 NUMBER OF NUCLEIC ACID ATOMS 386 1DGC 31

REMARK 4 1DGC 32

REMARK 4 GCN4: TRANSCRIPTIONAL ACTIVATOR OF GENES ENCODING FOR AMINO 1DGC 33

REMARK 4 ACID BIOSYNTHETIC ENZYMES. 1DGC 34

REMARK 5 1DGC 35

REMARK 5 AMINO ACIDS NUMBERING (RESIDUE NUMBER) CORRESPONDS TO THE 1DGC 36

REMARK 5 281 AMINO ACIDS OF INTACT GCN4. 1DGC 37

REMARK 6 1DGC 38

REMARK 6 BZIP SEQUENCE 220 - 281 USED FOR CRYSTALLIZATION. 1DGC 39

REMARK 7 1DGC 40

REMARK 7 MODEL FROM AMINO ACIDS 227 - 281 SINCE AMINO ACIDS 220 - 1DGC 41

REMARK 7 226 ARE NOT WELL ORDERED. 1DGC 42

REMARK 8 1DGC 43

REMARK 8 RESIDUE NUMBERING OF NUCLEOTIDES: 1DGC 44

REMARK 8 5' T G G A G A T G A C G T C A T C T C C 1DGC 45

REMARK 8 -10 -9 -8 -7 -6 -5 -4 -3 -2 -1 1 2 3 4 5 6 7 8 9 1DGC 46

REMARK 9 1DGC 47

REMARK 9 THE ASYMMETRIC UNIT CONTAINS ONE HALF OF PROTEIN/DNA 1DGC 48

REMARK 9 COMPLEX PER ASYMMETRIC UNIT. 1DGC 49

REMARK 10 1DGC 50

REMARK 10 MOLECULAR DYAD AXIS OF PROTEIN DIMER AND PALINDROMIC HALF 1DGC 51

REMARK 10 SITES OF THE DNA COINCIDES WITH CRYSTALLOGRAPHIC TWO-FOLD 1DGC 52

REMARK 10 AXIS. THE FULL PROTEIN/DNA COMPLEX CAN BE OBTAINED BY 1DGC 53

REMARK 10 APPLYING THE FOLLOWING TRANSFORMATION MATRIX AND 1DGC 54

REMARK 10 TRANSLATION VECTOR TO THE COORDINATES X Y Z: 1DGC 55

REMARK 10 1DGC 56

REMARK 10 0 -1 0 X 117.32 X SYMM 1DGC 57

REMARK 10 -1 0 0 Y + 117.32 = Y SYMM 1DGC 58

REMARK 10 0 0 -1 Z 43.33 Z SYMM 1DGC 59

SEQRES 1 A 62 ILE VAL PRO GLU SER SER ASP PRO ALA ALA LEU LYS ARG 1DGC 60

SEQRES 2 A 62 ALA ARG ASN THR GLU ALA ALA ARG ARG SER ARG ALA ARG 1DGC 61

SEQRES 3 A 62 LYS LEU GLN ARG MET LYS GLN LEU GLU ASP LYS VAL GLU 1DGC 62

SEQRES 4 A 62 GLU LEU LEU SER LYS ASN TYR HIS LEU GLU ASN GLU VAL 1DGC 63

SEQRES 5 A 62 ALA ARG LEU LYS LYS LEU VAL GLY GLU ARG 1DGC 64

SEQRES 1 B 19 T G G A G A T G A C G T C 1DGC 65

SEQRES 2 B 19 A T C T C C 1DGC 66

HELIX 1 A ALA A 228 LYS A 276 1 1DGC 67

CRYST1 58.660 58.660 86.660 90.00 90.00 90.00 P 41 21 2 8 1DGC 68

ORIGX1 1.000000 0.000000 0.000000 0.00000 1DGC 69

ORIGX2 0.000000 1.000000 0.000000 0.00000 1DGC 70

ORIGX3 0.000000 0.000000 1.000000 0.00000 1DGC 71

SCALE1 0.017047 0.000000 0.000000 0.00000 1DGC 72

SCALE2 0.000000 0.017047 0.000000 0.00000 1DGC 73

SCALE3 0.000000 0.000000 0.011539 0.00000 1DGC 74

ATOM 1 N PRO A 227 35.313 108.011 15.140 1.00 38.94 1DGC 75

ATOM 2 CA PRO A 227 34.172 107.658 15.972 1.00 39.82 1DGC 76

ATOM 842 C5 C B 9 57.692 100.286 22.744 1.00 29.82 1DGC 916

ATOM 843 C6 C B 9 58.128 100.193 21.465 1.00 30.63 1DGC 917

TER 844 C B 9 1DGC 918

MASTER 46 0 0 1 0 0 0 6 842 2 0 7 1DGC 919

END 1DGC 920

PDB

  • HEADER

  • COMPND

  • SOURCE

  • AUTHOR

  • DATE

  • JRNL

  • REMARK

  • SEQRES

  • ATOM COORDINATES
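The record types listed above can be picked out of a PDB file by reading the first word of each line. The sketch below does this for a few lines copied from the 1DGC example. It splits on whitespace because the column layout was lost in this transcript; real PDB files are fixed-column, so that splitting is an assumption that happens to work for these lines rather than a general parser. The function and field names are illustrative.

```python
from collections import Counter

# A few lines from the 1DGC record shown above (trailing ID columns dropped).
pdb_lines = [
    "HEADER LEUCINE ZIPPER 15-JUL-93 1DGC",
    "REMARK 2 RESOLUTION. 3.0 ANGSTROMS.",
    "SEQRES 1 B 19 T G G A G A T G A C G T C",
    "ATOM 1 N PRO A 227 35.313 108.011 15.140 1.00 38.94",
    "ATOM 2 CA PRO A 227 34.172 107.658 15.972 1.00 39.82",
    "END",
]


def record_type(line):
    """Return the record name (HEADER, REMARK, ATOM, ...) of one line."""
    fields = line.split()
    return fields[0] if fields else ""


def parse_atom(line):
    """Parse a whitespace-separated ATOM line into a dictionary."""
    f = line.split()
    return {
        "serial": int(f[1]),
        "name": f[2],
        "residue": f[3],
        "chain": f[4],
        "residue_number": int(f[5]),
        "xyz": (float(f[6]), float(f[7]), float(f[8])),
        "occupancy": float(f[9]),
        "b_factor": float(f[10]),
    }


counts = Counter(record_type(line) for line in pdb_lines)
atoms = [parse_atom(line) for line in pdb_lines if record_type(line) == "ATOM"]

print(counts)
print(atoms[1]["name"], atoms[1]["xyz"])
```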



Accession numbers are labels for sequences

NCBI includes databases (such as GenBank) that contain information on DNA, RNA, or protein sequences.

You may want to acquire information beginning with a query such as the name of a protein of interest, or the raw nucleotides comprising a DNA sequence of interest.

DNA sequences and other molecular data are tagged with accession numbers that are used to identify a sequence or other record relevant to molecular data.

Page 26


What is an accession number?

An accession number is a label used to identify a sequence. It is a string of letters and/or numbers that corresponds to a molecular sequence.

Examples (all for retinol-binding protein, RBP4):

X02775 GenBank genomic DNA sequence

NT_030059 Genomic contig

rs7079946 dbSNP (single nucleotide polymorphism)

N91759.1 An expressed sequence tag (1 of 170)

NM_006744 RefSeq DNA sequence (from a transcript)

NP_006735 RefSeq protein

AAC02945 GenBank protein

Q28369 SwissProt protein

1KT7 Protein Data Bank structure record

DNA

RNA

protein

Page 27


Four ways to access DNA and protein sequences

[1] Entrez Gene with RefSeq

[2] UniGene

[3] European Bioinformatics Institute (EBI) and Ensembl (separate from NCBI)

[4] ExPASy Sequence Retrieval System (separate from NCBI)

Note: LocusLink at NCBI was recently retired. The third printing of the book has updated these sections (pages 27-31).

Page 27


4 ways to access protein and DNA sequences

[1] Entrez Gene with RefSeq

Entrez Gene is a great starting point: it collects key information on each gene/protein from major databases. It covers all major organisms.

RefSeq provides a curated, optimal accession number for each DNA (NM_006744) or protein (NP_006735)

Page 27


From the NCBI home page, type “rbp4” and hit “Go”

Pevsner, Fig. 2.7 (revised)

Page 29



GenBank Record

[Figure: annotated GenBank record showing the locus name, accession number, gi number, Medline ID, protein sequence, GenPept ID, and nucleotide sequence (most of each sequence deleted for brevity)]


LOCUS, Accession, NID and protein_id

LOCUS: Unique string of 10 letters and numbers in the database. Not maintained amongst databases, and is therefore a poor sequence identifier.

ACCESSION: A unique identifier for that record and a citable entity; does not change when the record is updated. A good record identifier, ideal for citation in publications.

VERSION: New system where the accession and version play the same function as the accession and gi number.

Nucleotide gi: GenInfo identifier (gi), a unique integer which will change every time the sequence changes.

PID: Protein identifier: g, e or d prefix to gi number. Can have one or two on one CDS.

Protein gi: GenInfo identifier (gi), a unique integer which will change every time the sequence changes.

protein_id: Identifier which has the same structure and function as the nucleotide Accession.version numbers, but a slightly different format.
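The Accession.version form can be taken apart with a single string split. The sketch below is illustrative only: the function name is made up, and it assumes the identifier contains at most one dot separating the accession from an integer version, as in the EST example N91759.1 quoted earlier.

```python
def split_accession_version(identifier):
    """Split 'ACCESSION.version' into (accession, version).

    The version is None when the identifier carries no '.version' suffix.
    """
    accession, dot, version = identifier.partition(".")
    return accession, (int(version) if dot else None)


# The accession stays fixed; the version number tracks sequence changes.
print(split_accession_version("N91759.1"))
print(split_accession_version("NM_006744"))
```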


Entrez Gene (top of page)

Note that links to many other RBP4 database entries are available

revised Fig. 2.8

Page 30




Fig. 2.9

Page 32


FASTA format

Fig. 2.10

Page 32




NCBI’s important RefSeq project: best representative sequences

RefSeq (accessible via the main page of NCBI) provides an expertly curated accession number that corresponds to the most stable, agreed-upon “reference” version of a sequence.

RefSeq identifiers include the following formats:

Complete genome NC_######

Complete chromosome NC_######

Genomic contig NT_######

mRNA (DNA format) NM_###### e.g. NM_006744

Protein NP_###### e.g. NP_006735

Page 29-30


NCBI’s RefSeq project: accession for genomic, mRNA, protein sequences

Accession         Molecule   Method            Note
AC_123456         Genomic    Mixed             Alternate complete genomic
AP_123456         Protein    Mixed             Protein products; alternate
NC_123456         Genomic    Mixed             Complete genomic molecules
NG_123456         Genomic    Mixed             Incomplete genomic regions
NM_123456         mRNA       Mixed             Transcript products; mRNA
NM_123456789      mRNA       Mixed             Transcript products; 9-digit
NP_123456         Protein    Mixed             Protein products
NP_123456789      Protein    Curation          Protein products; 9-digit
NR_123456         RNA        Mixed             Non-coding transcripts
NT_123456         Genomic    Automated         Genomic assemblies
NW_123456         Genomic    Automated         Genomic assemblies
NZ_ABCD12345678   Genomic    Automated         Whole genome shotgun data
XM_123456         mRNA       Automated         Transcript products
XP_123456         Protein    Automated         Protein products
XR_123456         RNA        Automated         Transcript products
YP_123456         Protein    Auto. & Curated   Protein products
ZP_12345678       Protein    Automated         Protein products
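Because the RefSeq prefix encodes the molecule type, a lookup on the first three characters is enough to classify an accession. The sketch below copies the prefixes from the table above; the function name is illustrative, and for NP_ it records only the method given for the six-digit form.

```python
# Prefix -> (molecule, method), copied from the RefSeq accession table.
REFSEQ_PREFIXES = {
    "AC_": ("Genomic", "Mixed"),
    "AP_": ("Protein", "Mixed"),
    "NC_": ("Genomic", "Mixed"),
    "NG_": ("Genomic", "Mixed"),
    "NM_": ("mRNA", "Mixed"),
    "NP_": ("Protein", "Mixed"),  # the 9-digit form is listed as "Curation"
    "NR_": ("RNA", "Mixed"),
    "NT_": ("Genomic", "Automated"),
    "NW_": ("Genomic", "Automated"),
    "NZ_": ("Genomic", "Automated"),
    "XM_": ("mRNA", "Automated"),
    "XP_": ("Protein", "Automated"),
    "XR_": ("RNA", "Automated"),
    "YP_": ("Protein", "Auto. & Curated"),
    "ZP_": ("Protein", "Automated"),
}


def classify_refseq(accession):
    """Return (molecule, method) for a RefSeq accession, or None otherwise."""
    return REFSEQ_PREFIXES.get(accession[:3].upper())


print(classify_refseq("NM_006744"))  # RefSeq transcript for RBP4
print(classify_refseq("X02775"))     # GenBank accession, not RefSeq
```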


Four ways to access DNA and protein sequences

[1] Entrez Gene with RefSeq

[2] UniGene

[3] European Bioinformatics Institute (EBI) and Ensembl (separate from NCBI)

[4] ExPASy Sequence Retrieval System (separate from NCBI)

Page 31


[Figure: DNA, RNA, protein, and complementary DNA (cDNA), with UniGene]

Fig. 2.3

Page 23

In genetics, complementary DNA (cDNA) is DNA synthesized from a mature mRNA template in a reaction catalyzed by the enzyme reverse transcriptase.


Expressed Sequence Tag

What Are ESTs and How Are They Made?

ESTs are small pieces of DNA sequence (usually 200 to 500 nucleotides long) that are generated by sequencing either one or both ends of an expressed gene. The idea is to sequence bits of DNA that represent genes expressed in certain cells, tissues, or organs from different organisms and use these "tags" to fish a gene out of a portion of chromosomal DNA by matching base pairs. The challenge associated with identifying genes from genomic sequences varies among organisms and is dependent upon genome size as well as the presence or absence of introns, the intervening DNA sequences interrupting the protein coding sequence of a gene.


STS

Sequence-tagged sites (STSs) are operationally unique sequences. Each is defined by a pair of primers used in a PCR assay, and it serves as a mapping reagent that maps to a single position within the genome.

Also see: http://www.ncbi.nlm.nih.gov/dbSTS/ and http://www.ncbi.nlm.nih.gov/genemap/


UniGene: unique genes via ESTs

  • Find UniGene at NCBI: www.ncbi.nlm.nih.gov/UniGene

  • UniGene clusters contain many expressed sequence tags (ESTs), which are DNA sequences (typically 500 base pairs in length) corresponding to the mRNA from an expressed gene. ESTs are sequenced from a complementary DNA (cDNA) library.

  • UniGene data come from many cDNA libraries. Thus, when you look up a gene in UniGene you get information on its abundance and its regional distribution.

Pages 20-21


Cluster sizes in UniGene

This is a gene with 1 EST associated; the cluster size is 1

Fig. 2.3

Page 23


Cluster sizes in UniGene

This is a gene with 10 ESTs associated; the cluster size is 10


Cluster sizes in UniGene (human)

Cluster size (ESTs)   Number of clusters
1                     42,800
2                     6,500
3-4                   6,500
5-8                   5,400
9-16                  4,100
17-32                 3,300
500-1000              2,128
2000-4000             233
8000-16,000           21
16,000-30,000         8

UniGene build 194, 8/06


UniGene: unique genes via ESTs

Conclusion: UniGene is a useful tool to look up information about expressed genes. UniGene displays information about the abundance of a transcript (expressed gene), as well as its regional distribution of expression (e.g. brain vs. liver).

We will discuss UniGene further later (gene expression).

Page 31


Four ways to access DNA and protein sequences

[1] Entrez Gene with RefSeq

[2] UniGene

[3] European Bioinformatics Institute (EBI) and Ensembl (separate from NCBI)

[4] ExPASy Sequence Retrieval System (separate from NCBI)

Page 31


Ensembl to access protein and DNA sequences

Try Ensembl at www.ensembl.org for a premier human genome web browser.

We will encounter Ensembl as we study the human genome, BLAST, and other topics.


[Screenshot: click “human”]

[Screenshot: enter “RBP4”]


Four ways to access DNA and protein sequences

[1] Entrez Gene with RefSeq

[2] UniGene

[3] European Bioinformatics Institute (EBI) and Ensembl (separate from NCBI)

[4] ExPASy Sequence Retrieval System (separate from NCBI)

Page 33


ExPASy to access protein and DNA sequences

ExPASy sequence retrieval system

(ExPASy = Expert Protein Analysis System)

Visit http://www.expasy.ch/

Page 33


Fig. 2.11

Page 33


Example of how to access sequence data: HIV-1 pol

There are many possible approaches. Begin at the main page of NCBI, and type an Entrez query: hiv-1 pol

Page 34


Searching for HIV-1 pol:

Following the “genome” link yields a manageable three results

Page 34


Example of how to access sequence data: HIV-1 pol

For the Entrez query: hiv-1 pol there are about 40,000 nucleotide or protein records (and >100,000 records for a search for “hiv-1”), but these can easily be reduced in two easy steps:

--specify the organism, e.g. hiv-1[organism]

--limit the output to RefSeq!

Page 34
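The two narrowing steps can also be expressed as a single query string. The sketch below only builds a search URL and does not contact NCBI. Two things are assumptions that go beyond the slides: it targets NCBI's E-utilities `esearch` endpoint rather than the web form the slides describe, and it uses the `srcdb_refseq[prop]` filter as the programmatic counterpart of "limit the output to RefSeq". The function name is illustrative.

```python
from urllib.parse import urlencode

# Assumed endpoint: NCBI E-utilities esearch (the slides use the web form).
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(term, db="nucleotide", refseq_only=False):
    """Build (but do not send) an Entrez search URL for the given term."""
    if refseq_only:
        # Assumed filter restricting results to RefSeq records.
        term = f"({term}) AND srcdb_refseq[prop]"
    return f"{EUTILS_ESEARCH}?{urlencode({'db': db, 'term': term})}"


# Step 1: specify the organism. Step 2: limit the output to RefSeq.
url = build_esearch_url("hiv-1[organism]", refseq_only=True)
print(url)
```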


over 100,000 nucleotide entries for HIV-1

only 1 RefSeq


Examples of how to access sequence data: histone

query for “histone”        # results
protein records            21847
RefSeq entries             7544
RefSeq (limit to human)    1108
NOT deacetylase            697

At this point, select a reasonable candidate (e.g. histone 2, H4) and follow its link to Entrez Gene. There, you can confirm you have the right gene/protein.

8-12-06


Access to Biomedical Literature

Page 35


PubMed at NCBI to find literature information


PubMed is the NCBI gateway to MEDLINE.

MEDLINE contains bibliographic citations and author abstracts from over 4,600 journals published in the United States and in 70 foreign countries.

It has >14 million records dating back to 1966.

Page 35


MeSH is the acronym for "Medical Subject Headings."

MeSH is the list of the vocabulary terms used for subject analysis of biomedical literature at NLM.

MeSH vocabulary is used for indexing journal articles for MEDLINE.

The MeSH controlled vocabulary imposes uniformity and consistency on the indexing of biomedical literature.

Page 35


PubMed search strategies

Try the tutorial (“education” on the left sidebar)

Use boolean queries (capitalize AND, OR, NOT), e.g. lipocalin AND disease

Try using “limits”

Try “Links” to find Entrez information and external resources

Obtain articles on-line via Welch Medical Library (and download pdf files): http://www.welch.jhu.edu/

Page 35


Boolean queries, where 1 = lipocalin and 2 = disease:

Operator   Query                    Results
1 AND 2    lipocalin AND disease    60
1 OR 2     lipocalin OR disease     1,650,000
1 NOT 2    lipocalin NOT disease    530

Fig. 2.12

Page 34

8/04
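The three operators correspond to set intersection, union, and difference. The sketch below shows this on a toy index of four hypothetical articles (the article IDs and their keyword sets are invented for the example); it is not how PubMed is implemented.

```python
# Hypothetical mini-index: article ID -> words found in that article.
articles = {
    1: {"lipocalin", "disease"},
    2: {"lipocalin"},
    3: {"disease"},
    4: {"globin"},
}


def matching(term):
    """Return the set of article IDs whose words include the term."""
    return {article_id for article_id, words in articles.items() if term in words}


lipocalin = matching("lipocalin")
disease = matching("disease")

both = lipocalin & disease        # lipocalin AND disease
either = lipocalin | disease      # lipocalin OR disease
only_first = lipocalin - disease  # lipocalin NOT disease

print(both, either, only_first)
```

AND always returns the fewest hits and OR the most, which matches the ordering of the result counts above.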


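Search result versus article contents, for a search on “globin”:

  • Search finds “globin”; “globin” is present in the article (article discusses globins): true positive

  • Search finds “globin”; “globin” is absent (article does not discuss globins): false positive

  • Search does not find “globin”; “globin” is present (article discusses globins): false negative

  • Search does not find “globin”; “globin” is absent (article does not discuss globins): true negative

8/06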
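The four outcomes depend on only two yes/no facts, so they can be written as a small function. This is a sketch; the function and parameter names are illustrative.

```python
def classify_hit(term_found, article_discusses_topic):
    """Label a search outcome given what the search returned and the truth."""
    if term_found and article_discusses_topic:
        return "true positive"
    if term_found and not article_discusses_topic:
        return "false positive"
    if not term_found and article_discusses_topic:
        return "false negative"
    return "true negative"


# An article about globins that the search failed to retrieve.
print(classify_hit(term_found=False, article_discusses_topic=True))
```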


Protein sequence motif is a descriptor of a protein family

  • Glutamine amidotransferase class I

  • [PAS]-[LIVMFYT]-[LIVMFY]-G-[LIVMFY]-C-[LIVMFYN]-G-x-[QEH]-x-[LIVMFA]

  • [C is the active site residue]

  • Glutamine amidotransferase class II

  • <x(0,11)-C-[GS]-[IV]-[LIVMFYW]-[AG]

  • [C is the active site residue]
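Patterns in this notation translate almost directly into regular expressions. The sketch below handles the features used by the two patterns above: residue sets in square brackets, `x` for any residue, `(n)` or `(n,m)` repeat counts, and `<` for the start of the sequence. It also accepts `{...}` exclusions and a trailing `>`. The function name and the two test peptides are invented for illustration; they are not real protein sequences.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    regex = ""
    for element in pattern.replace(" ", "").rstrip(".").split("-"):
        if element.startswith("<"):      # anchor at the N-terminus
            regex += "^"
            element = element[1:]
        anchored_end = element.endswith(">")
        if anchored_end:                 # anchor at the C-terminus
            element = element[:-1]
        # Separate the residue specification from an optional repeat count.
        match = re.fullmatch(r"(.+?)(?:\((\d+(?:,\d+)?)\))?", element)
        core, repeat = match.group(1), match.group(2)
        if core == "x":                  # any residue
            regex += "."
        elif core.startswith("{"):       # any residue except those listed
            regex += "[^" + core[1:-1] + "]"
        else:                            # a single residue or a [set]
            regex += core
        if repeat:
            regex += "{" + repeat + "}"
        if anchored_end:
            regex += "$"
    return regex


class_one = prosite_to_regex(
    "[PAS]-[LIVMFYT]-[LIVMFY]-G-[LIVMFY]-C-[LIVMFYN]-G-x-[QEH]-x-[LIVMFA]"
)
class_two = prosite_to_regex("<x(0,11)-C-[GS]-[IV]-[LIVMFYW]-[AG]")

print(class_one)
print(class_two)
print(bool(re.search(class_one, "ALIGLCLGAQAL")))  # made-up peptide
print(bool(re.search(class_two, "MCGIVA")))        # made-up peptide
```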


Searching MMDB


Principles of structural alignment

  • Dali: http://www.ebi.ac.uk/dali/ Looks for the minimal RMSD between Cα atoms. It calculates Cα-Cα distance matrices, then identifies the longest alignable segments

  • VAST (Vector Alignment Search Tool): http://www.ncbi.nlm.nih.gov/Structure/ Looks for pairs of secondary structure elements (α-helices, β-strands) that have similar orientation and connectivity
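The distance matrix that Dali starts from is simple to compute: entry (i, j) is the distance between the Cα atoms of residues i and j. The sketch below shows only that first step, not the alignment itself, and the three coordinates are invented for the example rather than taken from a real structure.

```python
import math


def distance_matrix(coords):
    """Return the matrix of pairwise Euclidean distances between points."""
    return [[math.dist(a, b) for b in coords] for a in coords]


# Hypothetical C-alpha coordinates (in angstroms) for three residues.
ca_coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0)]

matrix = distance_matrix(ca_coords)
for row in matrix:
    print(["%.2f" % value for value in row])
```

Because the matrix depends only on distances within one molecule, it is unchanged by rotating or translating the structure, which is what makes it convenient for comparing two proteins.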



VAST Structure Neighbors


Structure Summary

BLAST neighbors

VAST neighbors

Cn3D viewer


Cn3D: Displaying Structures

Chloroquine


Structure Neighbors


Use of structural alignments

Chloroquine

NADH




UniProt

  • New protein sequence database that is the result of merging SWISS-PROT and PIR. It will be the annotated, curated protein sequence database.

  • Data in UniProt are primarily derived from coding sequence annotations in EMBL (GenBank/DDBJ) nucleic acid sequence data.

  • UniProt is a flat-file database, just like EMBL and GenBank

  • The flat-file format is SwissProt-like, or EMBL-like


Swiss-Prot

  • SWISS-PROT incorporates:

    • Function of the protein

    • Post-translational modification

    • Domains and sites.

    • Secondary structure.

    • Quaternary structure.

    • Similarities to other proteins;

    • Diseases associated with deficiencies in the protein

    • Sequence conflicts, variants, etc.
