What are we looking for


Presentation Transcript


1. ©CMBI 2001 What are we looking for? Data & databases

2. ©CMBI 2002 Your questions Lookup Compare Predict

3. ©CMBI 2002 Your questions Lookup Is the gene known for my protein (or vice versa)? On which chromosome is the gene located? What sequence patterns are present in my protein? Are the mutations known which cause this disease? To what class or family does my protein belong? What is known about this family?

4. ©CMBI 2002 Your questions Compare Are there protein sequences in the database which resemble the protein I cloned? How can I optimally align the members of this protein family? Are these two proteins similar?

5. ©CMBI 2002 Sequence similarity

6. ©CMBI 2002 Alignment

7. Are these structures similar?

8. ©CMBI 2002 Your questions Predict Can I predict the active site residues of this enzyme? Why are these patients ill? Can I make a 3D model for my protein? Can I predict a (better) drug for this target? How can I improve the thermostability of this protein? (protein engineering) How can I predict the genes located on this genome?

9. ©CMBI 2002 How to find the answers to these questions? Outline. Morning: data in databases. Afternoon: programs (tools) to search these databases, and knowledge of how to search the databases with these tools (hands-on).

10. ©CMBI 2002 Biological Databases. The number of databases: DBCAT currently lists over 500 databases. The size of databases: grows exponentially. EMBL database: a new entry is added every 6.3 seconds (July 2001), and the content doubles every 8-9 months!

11. EMBL, July 2001: ca. 12 million entries (incl. NEW & EST). giga = 10^9. 6.3 sec/seq => ca. 10 seq/min.
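The growth figures quoted for EMBL can be checked with a little arithmetic. The sketch below is only an illustration: the function names are invented for this example, and the yearly factor assumes smooth exponential growth with a fixed doubling time, which the slides imply but do not state.

```python
SECONDS_PER_ENTRY = 6.3  # from the slide: one new EMBL entry every 6.3 s (July 2001)


def entries_per_minute(seconds_per_entry: float) -> float:
    """Convert 'seconds per new entry' into 'new entries per minute'."""
    return 60.0 / seconds_per_entry


def yearly_growth_factor(doubling_months: float) -> float:
    """Growth factor over 12 months, assuming a constant doubling time."""
    return 2.0 ** (12.0 / doubling_months)


print(f"{entries_per_minute(SECONDS_PER_ENTRY):.1f} entries/min")
for months in (8, 9):
    factor = yearly_growth_factor(months)
    print(f"doubling every {months} months -> x{factor:.2f} per year")
```

The rate comes out at roughly 9.5 entries per minute, which the slide rounds to 10. A doubling time of 8-9 months corresponds to the database growing by a factor of about 2.5 to 2.8 each year.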

12. ©CMBI 2002 Primary and Secondary Databases. Primary databases: REAL EXPERIMENTAL DATA. Biomolecular sequences or structures and associated annotation information (organism, function, mutation linked to disease, functional/structural patterns, bibliographic etc.). Secondary databases: DERIVED INFORMATION. Fruits of analyses of sequences in the primary sources (patterns, blocks, profiles etc. which represent the most conserved features of multiple alignments). Note: a primary database contains annotation, administration, etc., but also REAL EXPERIMENTAL DATA. A secondary database contains data from primary database(s), extra annotation and administration, and added value (calculation, annotation, etc.). The most conserved features of a multiple alignment are chosen such that they provide potent discriminators of family members for newly determined sequences.

13. ©CMBI 2002 Primary Databases. Sequence information: DNA: EMBL, GenBank, DDBJ; protein: SwissProt, TrEMBL, PIR, OWL. Genome information: GDB, MGD, ACeDB. Structure information: PDB, NDB, CCDB/CSD. Note: EMBL is the European Molecular Biology Laboratory DNA sequence database. DDBJ is the DNA Data Bank of Japan. PIR (1-3) is the Protein Identification Resource. OWL is a non-redundant protein sequence database produced from the following source databases: SWISSPROT, PIR (1-3), GenBank translations and NRL-3D. PDB is the Protein Data Bank (3D structures). ACeDB was originally developed for the C. elegans genome project, from which its name was derived (A C. elegans DataBase); the tools in it have since been generalized to be much more flexible, and the same software is now used for many different genomic databases, from bacteria to fungi to plants to man. CCDB is the Cambridge Crystallographic Database. A primary database contains annotation, administration, etc., but also REAL EXPERIMENTAL DATA.

14. ©CMBI 2002 Secondary Databases. Sequence-related information: ProSite, Enzyme, REBase. Genome-related information: OMIM, TransFac. Structure-related information: DSSP, HSSP, FSSP, PDBFinder. Pathway information: KEGG, Pathways. Note: PROSITE is a dictionary of protein sites and patterns (1492 patterns, Oct 2001). EC-Enzyme is the EC enzyme classification database. OMIM is Online Mendelian Inheritance in Man. SWISS-2DPAGE is a two-dimensional polyacrylamide gel electrophoresis database. REBASE is the restriction enzyme database. Refbase is a protein sequence citation database. KEGG is the Kyoto Encyclopedia of Genes and Genomes. DSSP is a database of secondary structure assignments (and much more) for all of the entries in the PDB; 15xxx entries. HSSP (homology-derived structures of proteins) is a derived database merging structure (2D & 3D) and sequence (1D) information, giving implied secondary and tertiary structure; 15xxx entries. FSSP (families of structurally similar proteins) holds structural alignments of proteins in the PDB; 3439 entries. A secondary database contains data from primary database(s), extra annotation and administration, and added value (calculation, annotation, etc.).

15. ©CMBI 2002 Databases. Data must be in a certain format for programs to recognize them. Every database can have its own format, but some data elements are essential for every database: 1. Unique identifier, or accession code 2. Name of depositor 3. Literature references 4. Deposition date 5. The real data
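The five essential elements can be pictured as one small record type. The sketch below is a hypothetical illustration, not the schema of any real database: the class and field names are invented here, and the example values are taken from the SwissProt crambin record that appears later in this transcript.

```python
from dataclasses import dataclass
from datetime import date
from typing import List

MONTHS = {name: number + 1 for number, name in enumerate(
    "JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC".split())}


def parse_db_date(text: str) -> date:
    """Parse a date written as DD-MON-YYYY, e.g. 21-JUL-1986."""
    day, month, year = text.split("-")
    return date(int(year), MONTHS[month.upper()], int(day))


@dataclass
class DatabaseEntry:
    """Hypothetical container for the five essential data elements."""
    accession: str          # 1. unique identifier / accession code
    depositor: str          # 2. name of depositor
    references: List[str]   # 3. literature references
    deposited: date         # 4. deposition date
    data: str               # 5. the real data (here: a protein sequence)


crambin = DatabaseEntry(
    accession="P01542",
    depositor="Teeter M.M.",  # first author of the sequence reference
    references=["Biochemistry 20:5437-5443(1981)"],
    deposited=parse_db_date("21-JUL-1986"),
    data="TTCCPSIVARSNFNVCRLPGTPEALCATYTGCIIIPGATCPGDYAN",
)
print(crambin.accession, crambin.deposited.isoformat(), len(crambin.data))
```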

16. ©CMBI 2002 3 examples 1. SwissProt 2. EMBL 3. PDB

17. ©CMBI 2002 Quality of databases SwissProt Data is only entered by annotation experts EMBL, PDB Everybody can submit data Data are accepted the way they are submitted

18. ©CMBI 2002 SwissProt database. Database of protein sequences. Produced by Amos Bairoch (University of Geneva) and the EMBL Data Library. Data derived from: translations of DNA sequences (from the EMBL database), adapted from the PIR collection, extracted from the literature, and directly submitted by researchers. SwissProt & SwissNew, July 2001: ~86,600 entries, ~15,000 new entries/year. SwissNew: 53,000 entries. Ca. 200 annotation experts worldwide. Keyword-organised flat file. Note: ca. 31 million aa / 86,593 entries = ca. 357 aa per entry. 20 annotators do 99% of the work; 200 remote experts handle the 1% of problem cases.

19. ©CMBI 2002 SwissProt records (1)
ID: identification line
ID ENTRY_NAME DATA_CLASS; MOLECULE_TYPE; SEQUENCE_LENGTH.
ID CRAM_CRAAB STANDARD; PRT; 46 AA.
Format for the ENTRY_NAME: NAME_SPECIES (≤ 10 characters). For a number of organisms (16), SPECIES has a recognizable name: HUMAN, MOUSE, CHICK, BOVIN, YEAST, ECOLI…
N.B. The ID can change, e.g. the serotonin receptors have been given a new nomenclature.
Note: SWISSPROT:CRAA_HUMAN; alpha crystallin A chain.
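As a minimal sketch of how a program might read the ID line shown on this slide, the regular expression below follows the layout ENTRY_NAME DATA_CLASS; MOLECULE_TYPE; SEQUENCE_LENGTH. The function and field names are assumptions made for this example, and the pattern is fitted only to the 2001-era line quoted above; other releases may lay the line out differently.

```python
import re

# Layout taken from the slide: ID ENTRY_NAME DATA_CLASS; MOLECULE_TYPE; LENGTH AA.
ID_LINE = re.compile(
    r"^ID\s+(?P<entry_name>\S+)\s+(?P<data_class>\w+);\s+"
    r"(?P<molecule_type>\w+);\s+(?P<length>\d+)\s+AA\.$"
)


def parse_id_line(line: str) -> dict:
    """Split a SwissProt ID line into its named fields."""
    match = ID_LINE.match(line.strip())
    if match is None:
        raise ValueError(f"not a SwissProt ID line: {line!r}")
    fields = match.groupdict()
    fields["length"] = int(fields["length"])
    # ENTRY_NAME has the form NAME_SPECIES.
    name, _, species = fields["entry_name"].partition("_")
    fields["name"], fields["species"] = name, species
    return fields


record = parse_id_line("ID CRAM_CRAAB STANDARD; PRT; 46 AA.")
print(record["name"], record["species"], record["length"])
```

Because the ID can change, a program that stores references to entries would normally keep the accession number (AC) rather than the fields parsed here.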

20. ©CMBI 2002 SwissProt records (2)
AC: accession number
AC P01542;
The AC is unique: name, sequence, everything can change, but the AC stays the same.
DT: deposition date
DT 21-JUL-1986 (Rel. 01, Created)
DT 30-MAY-2000 (Rel. 39, Last sequence update)
DT 30-MAY-2000 (Rel. 39, Last annotation update)
1) You cannot see what the last annotation update was. 2) No depositor record (implicit: author of the first reference).
Note: the DT date is relevant in connection with protein vs DNA sequencing.

21. ©CMBI 2002 SwissProt records (3)
DE: description
DE CRAMBIN.
DE 6-phosphofructo-2-kinase 1 (EC 2.7.1.105) (Phosphofructokinase 2 I)
1) General descriptive information. 2) Free-format.
GN: gene name
GN THI2.
OS & OC & OG: Organism Species; Organism Classification; OrGanelle
OS Crambe abyssinica (Abyssinian crambe).
OC Eukaryota; Viridiplantae; Embryophyta; Tracheophyta; Spermatophyta;
OC Magnoliophyta; eudicotyledons; Rosidae; eurosids II; Brassicales;
OC Brassicaceae; Crambe.
Note: the organelle is mainly mitochondrion or chloroplast.

22. ©CMBI 2002 SwissProt records (4)
RN: references
RN [1]
RP SEQUENCE.
RX MEDLINE; 82046542.
RA Teeter M.M., Mazer J.A., L'Italien J.J.;
RT "Primary structure of the hydrophobic plant protein crambin.";
RL Biochemistry 20:5437-5443(1981).
CC: comments or notes
CC -!- FUNCTION: THE FUNCTION OF THIS HYDROPHOBIC PLANT SEED PROTEIN
CC IS NOT KNOWN.
CC -!- MISCELLANEOUS: TWO ISOFORMS EXISTS, A MAJOR FORM PL (SHOWN HERE)
CC AND A MINOR FORM SI.
CC -!- SIMILARITY: BELONGS TO THE PLANT THIONIN FAMILY.
Note: RP is a keyword that says what the reference is about, for example STRUCTURE or SEQUENCE. If RP is SEQUENCE, the first author of this reference is the depositor of the data. Other CC topics include CATALYTIC ACTIVITY and TISSUE SPECIFICITY.

23. ©CMBI 2002 SwissProt records (5)
DR: database cross-reference
DR PIR; A01805; KECX.
DR PDB; 1CRN; 16-APR-87.
DR PDB; 1CBN; 31-JAN-94.
DR PDB; 1CCM; 31-OCT-93.
DR PDB; 1CCN; 31-JAN-94.
DR PDB; 1CNR; 31-AUG-94.
DR PDB; 1AB1; 12-AUG-97.
DR INTERPRO; IPR001010; -.
DR PFAM; PF00321; plant_thionins; 1.
DR PRINTS; PR00287; THIONIN.
DR PROSITE; PS00271; THIONIN; 1.
KW: keyword. Not standardized (under control of the depositor).
KW Thionin; 3D-structure.
Note: the PROSITE plant thionins signature is C-C-x(5)-R-x(2)-[FY]-x(2)-C (the three C's are involved in disulfide bonds). InterPro is the Integrated Resource of Protein Domains and Functional Sites. PRINTS is a compendium of protein fingerprints. A fingerprint is a group of conserved motifs used to characterise a protein family; its diagnostic power is refined by iterative scanning of OWL. Usually the motifs do not overlap, but are separated along a sequence, though they may be contiguous in 3D-space. Fingerprints can encode protein folds and functionalities more flexibly and powerfully than can single motifs: the database thus provides a useful adjunct to PROSITE.
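The plant thionins signature quoted in the notes, C-C-x(5)-R-x(2)-[FY]-x(2)-C, can be tried directly on the crambin sequence from this record. The converter below is a simplified sketch written for this example, not the PROSITE scanning software: it handles only plain residues, x, bracketed alternatives, {excluded} sets and (n) or (n,m) repeats.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a simple PROSITE pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        match = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = match.group(1), match.group(2), match.group(3)
        if core == "x":                                   # any residue
            core = "."
        elif core.startswith("{") and core.endswith("}"):  # excluded residues
            core = "[^" + core[1:-1] + "]"
        if low is not None:                                # repeat count
            core += "{" + low + ("," + high if high else "") + "}"
        parts.append(core)
    return "".join(parts)


CRAMBIN = "TTCCPSIVARSNFNVCRLPGTPEALCATYTGCIIIPGATCPGDYAN"
regex = prosite_to_regex("C-C-x(5)-R-x(2)-[FY]-x(2)-C")
hit = re.search(regex, CRAMBIN)
# Report 1-based residue positions, as sequence databases do.
print(regex, hit.start() + 1, hit.end(), hit.group())
```

The match covers residues 3 to 16, and the three cysteines in it (positions 3, 4 and 16) are the ones listed as disulfide partners in the feature table of the same record.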

24. ©CMBI 2002 SwissProt records (6)
FT: feature table data
FT DISULFID 3 40
FT DISULFID 4 32
FT DISULFID 16 26
FT VARIANT 22 22 P -> S (IN ISOFORM SI).
FT VARIANT 25 25 L -> I (IN ISOFORM SI).
FT STRAND 2 3
FT HELIX 7 16
FT TURN 17 19
FT HELIX 23 30
FT TURN 31 31
FT STRAND 33 34
FT TURN 42 43
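Each feature line carries a key, a start position, an end position and an optional description, so a feature table is easy to turn into structured data. The sketch below is an assumption-laden illustration: it splits on whitespace, which suits the lines as printed on the slide, and the function name is invented here.

```python
FT_LINES = """\
FT DISULFID 3 40
FT DISULFID 4 32
FT DISULFID 16 26
FT VARIANT 22 22 P -> S (IN ISOFORM SI).
FT VARIANT 25 25 L -> I (IN ISOFORM SI).
FT HELIX 7 16
FT HELIX 23 30
""".splitlines()


def parse_ft_line(line: str):
    """Return (key, start, end, description) for one FT line."""
    fields = line.split(maxsplit=4)
    key, start, end = fields[1], int(fields[2]), int(fields[3])
    description = fields[4] if len(fields) > 4 else ""
    return key, start, end, description


features = [parse_ft_line(line) for line in FT_LINES]
# For DISULFID the two numbers are the bonded cysteines, not a range.
disulfides = [(start, end) for key, start, end, _ in features if key == "DISULFID"]
helix_lengths = [end - start + 1 for key, start, end, _ in features if key == "HELIX"]
print(disulfides, helix_lengths)
```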

25. ©CMBI 2002 Feature table
Other features: post-translational modifications, binding sites, enzyme active sites, local secondary structure or other characteristics reported in the cited references. Sequence conflicts between references are also included.
FT CONFLICT 33 33 MISSING (IN REF. 2).
FT MUTAGEN 123 123 G->R,L,M: DNA BINDING LOST.
FT MOD_RES 11 11 PHOSPHORYLATION (BY PKC).
FT LIPID 1 1 MYRISTATE.
FT CARBOHYD 103 103 GLUCOSYLGALACTOSE.
FT METAL 87 87 COPPER (POTENTIAL).
FT BINDING 14 14 HEME (COVALENT).
FT PROPEP 27 28 ACTIVATION PEPTIDE.
FT DOMAIN 22 788 EXTRACELLULAR (POTENTIAL).
FT ACT_SITE 193 193 ACCEPTS A PROTON DURING CATALYSIS.
FT VARSPLIC 194 196 GRP -> DVR (IN SHORT FORM).

26. ©CMBI 2002 SwissProt records (7)
SQ: sequence header
SQ SEQUENCE 46 AA; 4736 MW; 919E68AF159EF722 CRC64;
Sequence data
TTCCPSIVAR SNFNVCRLPG TPEALCATYT GCIIIPGATC PGDYAN
//: termination line
Note: the number 919E68AF159EF722 (CRC64) is the so-called checksum, a number calculated by the computer to check whether the data are still correct.
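A reader of this record can at least check that the sequence data agree with the length declared in the SQ header. The following sketch does only that; the function name is an assumption for this example, and it does not recompute the CRC64 checksum or the molecular weight.

```python
def parse_sq_header(line: str):
    """Return (length, molecular_weight, checksum) from an SQ header line."""
    parts = [part.strip() for part in line.split(";") if part.strip()]
    length = int(parts[0].split()[2])      # "SQ SEQUENCE 46 AA"
    mol_weight = int(parts[1].split()[0])  # "4736 MW"
    checksum = parts[2].split()[0]         # "919E68AF159EF722 CRC64"
    return length, mol_weight, checksum


header = "SQ SEQUENCE 46 AA; 4736 MW; 919E68AF159EF722 CRC64;"
blocks = "TTCCPSIVAR SNFNVCRLPG TPEALCATYT GCIIIPGATC PGDYAN"

length, mol_weight, checksum = parse_sq_header(header)
sequence = "".join(blocks.split())  # remove the spaces between blocks of ten

if len(sequence) != length:
    raise ValueError("sequence length does not match the SQ header")
print(length, mol_weight, checksum, sequence.count("C"))
```

The six cysteines counted at the end are consistent with the three disulfide bonds in the feature table.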

27. ©CMBI 2002 EMBL database. Nucleotide database. EMBL & EMNEW, July 2001: EMBL: 3,951,820 entries; EMNEW: 323,703; EMEST*: 8,092,600; EMNEWEST*: 619,777. *) EMEST/EMNEWEST = EST section of EMBL; EST = expressed sequence tag. EMBL records follow roughly the same scheme as SwissProt. Obligatory deposit of the sequence in EMBL (or SwissProt) before publication.

28. ©CMBI 2002 Protein Data Bank (PDB). Databank for macromolecular structure data (3-dimensional coordinates). Obligatory deposit of coordinates in the PDB before publication. ~16,000 entries (October 2001). A PDB file is a keyword-organised flat file (80 columns): human readable, every line starts with a keyword (3-6 letters), platform independent. Started ca. 25 years ago (on punched cards!). Note: besides proteins, it also contains DNA & RNA.

29. ©CMBI 2002 PDB records (1)
Filename = accession number = PDB code
1) The filename has 4 positions (often 1 digit & 3 letters, e.g. 1CRN). 2) Be aware: 0HYK means entry HYK does not contain coordinates.
HEADER: describes the molecule & gives the deposition date
HEADER PLANT SEED PROTEIN 30-APR-81 1CRN 1CRND 1
COMPND: name of molecule
COMPND CRAMBIN 1CRN 4
SOURCE: organism
SOURCE ABYSSINIAN CABBAGE (CRAMBE ABYSSINICA) SEED 1CRN 5

30. ©CMBI 2002 PDB records (2)
AUTHOR: the depositor
AUTHOR W.A.HENDRICKSON,M.M.TEETER 1CRN 6
JRNL
JRNL AUTH M.BLABER,X.-J.ZHANG,B.W.MATTHEWS 111L 10
JRNL TITL STRUCTURAL BASIS OF ALPHA-HELIX PROPENSITY AT TWO 111L 11
JRNL TITL 2 SITES IN T4 LYSOZYME 111L 12
JRNL REF SCIENCE V. 260 1637 1993 111L 13
JRNL REFN ASTM SCIEAS US ISSN 0036-8075 038 111L 14
REMARK: not standardized, many different REMARK records & subrecords!
REMARK 1 REFERENCE 3 1CRNC 10
REMARK 1 AUTH M.M.TEETER,W.A.HENDRICKSON 1CRN 16
REMARK 1 TITL HIGHLY ORDERED CRYSTALS OF THE PLANT SEED PROTEIN 1CRN 17
REMARK 1 TITL 2 CRAMBIN 1CRN 18
REMARK 1 REF J.MOL.BIOL. V. 127 219 1979 1CRN 19
REMARK 1 REFN ASTM JMOBAK UK ISSN 0022-2836 070 1CRN 20
REMARK 2 1CRN 21
REMARK 2 RESOLUTION. 1.5 ANGSTROMS. 1CRN 22

31. ©CMBI 2002 PDB records (3)
SEQRES: sequence of the protein. Be aware: 3D coordinates are not always present for all the amino acids in SEQRES!
SEQRES 1 46 THR THR CYS CYS PRO SER ILE VAL ALA ARG SER ASN PHE 1CRN 51
SEQRES 2 46 ASN VAL CYS ARG LEU PRO GLY THR PRO GLU ALA ILE CYS 1CRN 52
SEQRES 3 46 ALA THR TYR THR GLY CYS ILE ILE ILE PRO GLY ALA THR 1CRN 53
SEQRES 4 46 CYS PRO GLY ASP TYR ALA ASN 1CRN 54
HET & FORMUL: metals, cofactors, ions, etc.
HET NAD A 1 44 NAD CO-ENZYME 4MDH 219
HET SUL A 2 5 SULFATE 4MDH 220
HET NAD B 1 44 NAD CO-ENZYME 4MDH 221
HET SUL B 2 5 SULFATE 4MDH 222
FORMUL 3 NAD 2(C21 H28 N7 O14 P2) 4MDH 223
FORMUL 4 SUL 2(O4 S1) 4MDH 224
FORMUL 5 HOH *471(H2 O1) 4MDH 225
Note: the HET and FORMUL examples come from entry 4MDH:
HEADER OXIDOREDUCTASE(NAD(A)-CHOH(D)) 12-APR-89 4MDH 4MDH 3
COMPND CYTOPLASMIC MALATE DEHYDROGENASE (E.C.1.1.1.37) 4MDH 4
SOURCE PORCINE (SUS $SCROFA) HEART 4MDH 5

32. ©CMBI 2002 PDB records (4)
HELIX/SHEET/TURN: secondary structure elements as provided by the crystallographer (subjective)
HELIX 1 H1 ILE 7 PRO 19 1 3/10 CONFORMATION RES 17,19 1CRN 55
SHEET 2 S1 2 CYS 32 ILE 35 -1 1CRN 58
TURN 1 T1 PRO 41 TYR 44 1CRN 59
SSBOND: disulfide bridges
SSBOND 1 CYS 3 CYS 40 1CRN 60
SSBOND 2 CYS 4 CYS 32 1CRN 61
CRYST1, ORIGX1, ORIGX2, ORIGX3, SCALE1, SCALE2, SCALE3: crystallographic parameters
CRYST1 40.960 18.650 22.520 90.00 90.77 90.00 P 21 2 1CRN 63
ORIGX1 1.000000 0.000000 0.000000 0.00000 1CRN 64
ORIGX2 0.000000 1.000000 0.000000 0.00000 1CRN 65
ORIGX3 0.000000 0.000000 1.000000 0.00000 1CRN 66
SCALE1 .024414 0.000000 -.000328 0.00000 1CRN 67
SCALE2 0.000000 .053619 0.000000 0.00000 1CRN 68
SCALE3 0.000000 0.000000 .044409 0.00000 1CRN 69

33. ©CMBI 2002 PDB records (5)
ATOM: one line for each atom, with its unique name and its x,y,z coordinates
ATOM 1 N THR 1 17.047 14.099 3.625 1.00 13.79 1CRN 70
ATOM 2 CA THR 1 16.967 12.784 4.338 1.00 10.80 1CRN 71
ATOM 3 C THR 1 15.685 12.755 5.133 1.00 9.19 1CRN 72
ATOM 4 O THR 1 15.268 13.825 5.594 1.00 9.85 1CRN 73
ATOM 5 CB THR 1 18.170 12.703 5.337 1.00 13.02 1CRN 74
ATOM 6 OG1 THR 1 19.334 12.829 4.463 1.00 15.06 1CRN 75
ATOM 7 CG2 THR 1 18.150 11.546 6.304 1.00 14.23 1CRN 76
ATOM 8 N THR 2 15.115 11.555 5.265 1.00 7.81 1CRN 77
ATOM 9 CA THR 2 13.856 11.469 6.066 1.00 8.31 1CRN 78
ATOM 10 C THR 2 14.164 10.785 7.379 1.00 5.80 1CRN 79
ATOM 11 O THR 2 14.993 9.862 7.443 1.00 6.94 1CRN 80
TER: record that terminates the amino acid chain
ATOM 325 OD1 ASN 46 11.982 4.849 15.886 1.00 11.00 1CRN 394
ATOM 326 ND2 ASN 46 13.407 3.298 15.015 1.00 10.32 1CRN 395
ATOM 327 OXT ASN 46 12.703 4.973 10.746 1.00 7.86 1CRN 396
TER 328 ASN 46 1CRN 397
Note: there is one TER record per molecule, so if the protein is a dimer, there are two TER records.
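To illustrate what the ATOM records contain, the sketch below reads two of the lines shown on this slide and computes the distance between the two atoms. It rests on a loud assumption: real PDB files are fixed-column, but this transcript has collapsed the spacing, so the example splits on whitespace instead, which works only for simple lines like these (no chain identifier, no fused fields). The function and field names are invented for this example.

```python
import math

ATOM_LINES = [
    "ATOM 1 N THR 1 17.047 14.099 3.625 1.00 13.79 1CRN 70",
    "ATOM 2 CA THR 1 16.967 12.784 4.338 1.00 10.80 1CRN 71",
]


def parse_atom_line(line: str) -> dict:
    """Parse a whitespace-separated ATOM line as printed in the transcript."""
    fields = line.split()
    return {
        "serial": int(fields[1]),
        "atom_name": fields[2],
        "residue_name": fields[3],
        "residue_number": int(fields[4]),
        "xyz": tuple(float(value) for value in fields[5:8]),
        "occupancy": float(fields[8]),
        "b_factor": float(fields[9]),
    }


def distance(atom_a: dict, atom_b: dict) -> float:
    """Euclidean distance between two atoms, in the units of the file (angstroms)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(atom_a["xyz"], atom_b["xyz"])))


n_atom, ca_atom = (parse_atom_line(line) for line in ATOM_LINES)
print(f"N-CA distance in THR 1: {distance(n_atom, ca_atom):.2f} A")
```

The result, about 1.5 angstroms, is the size expected for a covalent bond between a backbone nitrogen and its alpha carbon.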

34. ©CMBI 2002 PDB records (6)
HETATM: atomic coordinate records for atoms within the "HET & FORMUL" lines (metals, cofactors, ions, …) and for water molecules
HETATM 5158 AP NAD B 1 42.641 30.361 41.284 1.00 26.73 4MDH5495
HETATM 5159 AO1 NAD B 1 43.440 31.570 40.868 1.00 20.69 4MDH5496
HETATM 5160 AO2 NAD B 1 41.161 30.484 41.376 1.00 33.73 4MDH5497
HETATM 5207 O HOH 0 15.379 1.907 3.295 1.00 58.12 4MDH5544
HETATM 5208 O HOH 1 58.861 0.984 17.024 1.00 37.58 4MDH5545
HETATM 5209 O HOH 2 24.384 1.184 74.398 1.00 35.92 4MDH5546

35. ©CMBI 2002
