
Structure Databases

Structure Databases. DNA/Protein structure-function analysis and prediction, Lecture 6. Bioinformatics Section, Vrije Universiteit, Amsterdam. Some pictures were taken from http://www.umanitoba.ca/afs/plant_science/courses.


Presentation Transcript


  1. Structure Databases. DNA/Protein structure-function analysis and prediction, Lecture 6. Bioinformatics Section, Vrije Universiteit, Amsterdam. Some pictures were taken from http://www.umanitoba.ca/afs/plant_science/courses

  2. The dictionary definition • Main Entry: da·ta·base • Pronunciation: 'dA-t&-"bAs, 'da- also 'dä- • Origin: circa 1962 • Definition: a usually large collection of data organized especially for rapid search and retrieval (as by a computer) • Source: Webster dictionary

  3. WHAT is a database? • A collection of data that needs to be: • Structured (standardized data representation) • Searchable • Updated (periodically) • Cross referenced • Challenge: • To change “meaningless” data into useful information that can be accessed and analysed the best way possible.

  4. Organizing data into knowledge HOW would YOU organise all biological sequences so that the biological information is optimally accessible? You need an appropriate database management system (DBMS)

  5. DBMS • Internal organization • Controls speed and flexibility • A set of programs that • Store • Extract • Modify • [Diagram: USER(S) reach the Database through the DBMS operations Store, Extract and Modify]

  6. DBMS organisation types • Flat file databases (flat DBMS) • Simple, restrictive, table • Hierarchical databases (hierarchical DBMS) • Simple, restrictive, tables • Relational databases (RDBMS) • Complex, versatile, tables • Object-oriented databases (ODBMS) • Complex, versatile, objects

  7. A flat file database • Collection of records, each containing several data fields. • Disadvantages • Redundancy • Forces a single view of the data (‘organizer’ and ‘attributes’)

     Cell_Stock : "SK11.pEA215.3"
     Species     "Escherichia coli"
     Plasmid     "pEA215.3"
     Experiment  "SK11"
     Freezer     "AG334 -80C"
     Box         "Pisum ESTs II"
     Gridded     "Rack(BF7) Box(Pisum ESTs II)"

     Cell_Stock : "SK11.pI206KS"
     Species     "Escherichia coli"
     Plasmid     "pI206KS"
     Experiment  "SK11"
     Freezer     "AG334 -80C"
     Box         "Pisum ESTs II"
     Gridded     "Rack(BF7) Box(Pisum ESTs II)"

     Cell_Stock : "SK11.pEA46.2"
     Species     "Escherichia coli"
     Plasmid     "pEA46.2"
     Experiment  "SK11"
     Freezer     "AG334 -80C"
     Box         "Pisum ESTs II"
     Gridded     "Rack(BF7) Box(Pisum ESTs II)"

     . . . .
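To make the redundancy point concrete, here is a minimal Python sketch that reads the three Cell_Stock records from this slide and counts how often each field value is repeated. The parsing rule (one quoted value per line, records separated by a blank line) and the names `parse_records` and `redundant` are assumptions made for illustration, not part of the lecture.

```python
import re
from collections import Counter

# The three records shown on the slide, one field per line.
FLAT_FILE = '''\
Cell_Stock : "SK11.pEA215.3"
Species  "Escherichia coli"
Plasmid  "pEA215.3"
Experiment  "SK11"
Freezer  "AG334 -80C"
Box  "Pisum ESTs II"
Gridded  "Rack(BF7) Box(Pisum ESTs II)"

Cell_Stock : "SK11.pI206KS"
Species  "Escherichia coli"
Plasmid  "pI206KS"
Experiment  "SK11"
Freezer  "AG334 -80C"
Box  "Pisum ESTs II"
Gridded  "Rack(BF7) Box(Pisum ESTs II)"

Cell_Stock : "SK11.pEA46.2"
Species  "Escherichia coli"
Plasmid  "pEA46.2"
Experiment  "SK11"
Freezer  "AG334 -80C"
Box  "Pisum ESTs II"
Gridded  "Rack(BF7) Box(Pisum ESTs II)"
'''

# Assumed line layout: a field name, an optional colon, then a quoted value.
FIELD_LINE = re.compile(r'(\w+)\s*:?\s*"(.*)"\s*$')


def parse_records(text):
    """Split the flat file into one dict per blank-line-separated record."""
    records = []
    for block in text.strip().split("\n\n"):
        record = {}
        for line in block.splitlines():
            match = FIELD_LINE.match(line)
            if match:
                record[match.group(1)] = match.group(2)
        records.append(record)
    return records


records = parse_records(FLAT_FILE)

# Count every (field, value) pair; anything seen more than once is stored redundantly.
value_counts = Counter(
    (field, value) for record in records for field, value in record.items()
)
redundant = {pair: n for pair, n in value_counts.items() if n > 1}

for (field, value), n in sorted(redundant.items()):
    print(f"{field} = {value!r} is stored {n} times")
```

In this small sample only the Cell_Stock and Plasmid fields differ between records; every other field repeats the same value in each record.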

  8. Relational databases • Data is stored in multiple related tables • Data relationships across tables can be either many-to-one or many-to-many • A few rules allow the database to be viewed in many ways • Let's convert the “course details” to a relational database

  9. Our flat file database • Flat database 2: Course details

     Name       Depart.    Course   E1  E2  E3  P1  P2
     Student 1  Chemistry  Biology  A   B   B   A   C   …..
     Student 1  Chemistry  Maths    C   C   B   A   A   …..
     Student 1  Chemistry  English  A   A   A   A   A   …..
     . . . .
     Student 2  Ecology    Biology  A   B   A   A   A   …..
     Student 2  Ecology    Maths    A   D   A   A   A   …..
     . . . .

  10. Normalization 1: remove repeating records (rows) • The figure marks the primary keys and the foreign keys of each table

     cID  Course
     1    Biology
     2    Maths
     3    English

     sID  Name      dID
     1    Student1  1
     2    Student2  2

     dID  Department
     1    Chemistry
     2    Ecology

     sID  cID  E1  E2  E3  P1  P2
     1    1    A   B   B   A   C   …..
     1    2    C   C   B   A   A   …..
     1    3    A   A   A   A   A   …..
     . . . .
     2    1    A   B   A   A   A   …..
     2    2    A   D   A   A   A   …..
     . . . .

  11. Normalization 2: remove repeating records (columns)

     cID  Course
     1    Biology
     2    Maths
     3    English

     sID  Name      dID
     1    Student1  1
     2    Student2  2

     dID  Department
     1    Chemistry
     2    Ecology

     gID  Grade
     1    A
     2    B
     3    C

     wID  Project
     1    E1
     2    E2
     3    E3
     4    P1
     5    P2

     sID  cID  gID  wID
     1    1    1    1
     1    1    2    2
     1    1    2    3
     1    1    1    4
     1    1    3    5
     2    1    1    1
     2    1    1    2
     2    1    2    3
     2    1    1    4
     2    1    1    5

  12. Relational Databases • What have we achieved? • No repeating information • Less storage space • Better representation of reality • Easy modification/management • Easy usage of any combination of records • Remember: the DBMS has programs to access and edit this information, so the fact that primary keys are hard for humans to read does not matter
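As a rough illustration of the first normalization step, the sketch below loads the course-details tables into an in-memory SQLite database using Python's standard `sqlite3` module and rebuilds the original flat view with a join. The column names (sID, cID, dID, E1 to P2) come from the slides; the table names `department`, `student`, `course` and `result` are assumptions, since the slides do not name the tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Table names are hypothetical; column names follow the slides.
conn.executescript("""
CREATE TABLE department (dID INTEGER PRIMARY KEY, Department TEXT);
CREATE TABLE student (
    sID  INTEGER PRIMARY KEY,
    Name TEXT,
    dID  INTEGER REFERENCES department(dID)   -- foreign key
);
CREATE TABLE course (cID INTEGER PRIMARY KEY, Course TEXT);
CREATE TABLE result (
    sID INTEGER REFERENCES student(sID),      -- foreign key
    cID INTEGER REFERENCES course(cID),       -- foreign key
    E1 TEXT, E2 TEXT, E3 TEXT, P1 TEXT, P2 TEXT,
    PRIMARY KEY (sID, cID)
);
""")

conn.executemany("INSERT INTO department VALUES (?, ?)",
                 [(1, "Chemistry"), (2, "Ecology")])
conn.executemany("INSERT INTO student VALUES (?, ?, ?)",
                 [(1, "Student1", 1), (2, "Student2", 2)])
conn.executemany("INSERT INTO course VALUES (?, ?)",
                 [(1, "Biology"), (2, "Maths"), (3, "English")])
conn.executemany("INSERT INTO result VALUES (?, ?, ?, ?, ?, ?, ?)", [
    (1, 1, "A", "B", "B", "A", "C"),
    (1, 2, "C", "C", "B", "A", "A"),
    (1, 3, "A", "A", "A", "A", "A"),
    (2, 1, "A", "B", "A", "A", "A"),
    (2, 2, "A", "D", "A", "A", "A"),
])

# One possible view: the original flat table, rebuilt by following the keys.
flat_rows = conn.execute("""
    SELECT s.Name, d.Department, c.Course, r.E1, r.E2, r.E3, r.P1, r.P2
    FROM result AS r
    JOIN student    AS s ON s.sID = r.sID
    JOIN department AS d ON d.dID = s.dID
    JOIN course     AS c ON c.cID = r.cID
    ORDER BY r.sID, r.cID
""").fetchall()

for row in flat_rows:
    print(row)
```

Each department and course name is now stored once, and a different join or `WHERE` clause over the same tables gives a different view without duplicating any data.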

  13. Accessing database information • A request for data from a database is called a query • Queries can take three forms: • Choose from a list of parameters • Query by example (QBE) • A QBE build wizard lets the user choose which data to display • Query language

  14. Query Languages • The standard • SQL (Structured Query Language), originally called SEQUEL (Structured English QUEry Language) • Developed by IBM in 1974; introduced commercially in 1979 by Oracle Corp. • Standard interactive and programming language for getting information from and updating a database. • RDBMS (SQL), ODBMS (Java, C++, OQL etc.)
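The two roles of SQL named on this slide, getting information and updating it, can be sketched with Python's standard `sqlite3` module. The `cell_stock` table and its column names are assumptions modelled on the Cell_Stock records of slide 7, and moving one stock to the box "Pisum ESTs I" is a made-up change used only to show an `UPDATE`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical table holding a few fields of the slide-7 records.
conn.execute(
    "CREATE TABLE cell_stock ("
    "name TEXT PRIMARY KEY, plasmid TEXT, experiment TEXT, box TEXT)"
)
conn.executemany("INSERT INTO cell_stock VALUES (?, ?, ?, ?)", [
    ("SK11.pEA215.3", "pEA215.3", "SK11", "Pisum ESTs II"),
    ("SK11.pI206KS",  "pI206KS",  "SK11", "Pisum ESTs II"),
    ("SK11.pEA46.2",  "pEA46.2",  "SK11", "Pisum ESTs II"),
])

QUERY = "SELECT name FROM cell_stock WHERE box = ? ORDER BY name"

# Getting information: which stocks are in a given box?
before = [name for (name,) in conn.execute(QUERY, ("Pisum ESTs II",))]

# Updating: record that one stock was moved to another box (illustrative only).
conn.execute(
    "UPDATE cell_stock SET box = ? WHERE name = ?",
    ("Pisum ESTs I", "SK11.pEA46.2"),
)

after = [name for (name,) in conn.execute(QUERY, ("Pisum ESTs II",))]

print("before:", before)
print("after: ", after)
```

The `?` placeholders keep the query text separate from the values, which is the usual way to pass parameters to SQL from a program.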

  15. Querying our biological relational database • Many views are possible …

     Plasmid View
     Plasmid     Species           Cell Stock
     pEA25       Escherichia coli  SK10.2.pEA25
     pEA46.2     Escherichia coli  SK11.pEA46.2
     pEA207.2    Escherichia coli  SK11.pEA207.2
     pEA214.6    Escherichia coli  MB123.pEA214.6
     pEA215.3    Escherichia coli  SK11.pEA215.3
     pEA238.2    Escherichia coli  MB123.3.PEA238.2
     pEA238.11   Escherichia coli  MB123.3.pEA238.11
     pEA277.11   Escherichia coli  SK11.pEA277.11
     pEA303.4    Escherichia coli  SK11.pEA303.4
     pEA315.2    Escherichia coli  MB123.3.pEA315.2
     peB4        Escherichia coli  VB1.eB4

     Experiment View
     Experiment  Cell Stock       Box               Freezer
     SK4         SK4.pPS-IAA4-5   Pisum ESTs I      AG334 -80C
     SK4         SK4.pPS-IAA6     Pisum ESTs I      AG334 -80C
     SK4         SK4.pTic110      Pisum ESTs I      AG334 -80C
     SK4         SK4.pToc34       Pisum ESTs I      AG334 -80C
     SK4         SK4.pToc86       Pisum ESTs I      AG334 -80C
     SK5         SK5.pAB96.3      Pisum ESTs I      AG334 -80C
     SK5         SK5.pABR17.10    Pisum ESTs I      AG334 -80C
     SK5         SK5.pABR18.2     Pisum ESTs I      AG334 -80C
     SK5         SK5.pI39         Pisum ESTs I      AG334 -80C
     SK5         SK5.pI49KS       Pisum ESTs I      AG334 -80C
     SK5         SK5.pI176KS      Pisum ESTs I      AG334 -80C
     SK5         SK5.pI225KS      Pisum ESTs I      AG334 -80C
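One way such views could be produced from a single stored table is with SQL views. The sketch below uses Python's standard `sqlite3` module; the table name `cell_stock`, the view names `plasmid_view` and `experiment_view`, and the column names are all assumptions, and the rows are the three Cell_Stock records from slide 7.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# One stored table, two views that present different column selections of it.
conn.executescript("""
CREATE TABLE cell_stock (
    name       TEXT PRIMARY KEY,
    species    TEXT,
    plasmid    TEXT,
    experiment TEXT,
    box        TEXT,
    freezer    TEXT
);
CREATE VIEW plasmid_view AS
    SELECT plasmid, species, name AS cell_stock FROM cell_stock;
CREATE VIEW experiment_view AS
    SELECT experiment, name AS cell_stock, box, freezer FROM cell_stock;
""")

conn.executemany("INSERT INTO cell_stock VALUES (?, ?, ?, ?, ?, ?)", [
    ("SK11.pEA215.3", "Escherichia coli", "pEA215.3", "SK11",
     "Pisum ESTs II", "AG334 -80C"),
    ("SK11.pI206KS", "Escherichia coli", "pI206KS", "SK11",
     "Pisum ESTs II", "AG334 -80C"),
    ("SK11.pEA46.2", "Escherichia coli", "pEA46.2", "SK11",
     "Pisum ESTs II", "AG334 -80C"),
])

plasmid_rows = conn.execute(
    "SELECT * FROM plasmid_view ORDER BY plasmid").fetchall()
experiment_rows = conn.execute(
    "SELECT * FROM experiment_view ORDER BY cell_stock").fetchall()

print("Plasmid view:")
for row in plasmid_rows:
    print(" ", row)
print("Experiment view:")
for row in experiment_rows:
    print(" ", row)
```

The views store no data of their own, so a change to `cell_stock` shows up in both of them the next time they are queried.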

  16. Distributed databases • From local to global attitude • Data appears to be in one location but is most definitely not • A definition: Two or more data files in different locations, periodically synchronized by the DBMS to keep data in all locations consistent (A,B,C) • An intricate network for combining and sharing information • Administrators praise fast network technologies!!! • Users praise the internet!!!
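The slide's definition leaves open how the periodic synchronization is done. As a toy sketch only, the Python below merges replicas with a "highest version wins" rule, which is one simple strategy chosen here for illustration and not a description of how any particular DBMS works. The replica layout (a dict mapping a record key to a `(version, value)` pair), the function name `synchronize` and the sample annotations are all hypothetical.

```python
def synchronize(*replicas):
    """Bring all replicas to the same state, keeping the highest version of each record.

    Each replica is a dict: record key -> (version number, value).
    """
    merged = {}
    for replica in replicas:
        for key, (version, value) in replica.items():
            if key not in merged or version > merged[key][0]:
                merged[key] = (version, value)
    # Write the merged state back so every location holds identical data.
    for replica in replicas:
        replica.clear()
        replica.update(merged)
    return merged


# Two hypothetical sites (A and B) that have drifted apart since the last sync.
site_a = {"1AE5": (1, "first annotation"), "1ABC": (3, "revised at site A")}
site_b = {"1AE5": (2, "updated annotation"), "1XYZ": (1, "new entry at site B")}

synchronize(site_a, site_b)
print(site_a == site_b)
```

A real system also has to deal with conflicting edits made at the same version, deletions and network failures, all of which this sketch ignores.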

  17. Data warehouse • Periodically, one imports data from databases and stores it (locally) in the data warehouse. • Now a local database can be created, containing for instance protein family data (sequence, structure, function and pathway/process data integrated with the gene expression and other experimental data). • Disadvantage: expensive, intensive, needs to be updated. • Advantage: easy control of integrated data-mining pipeline.

  18. So why do biologists care?

  19. Three main reasons • Database proliferation • Dozens to hundreds at the moment • More and more scientific discoveries result from inter-database analysis and mining • Rising complexity of required data-combinations • E.g. translational medicine: “from bench to bedside” (genomic data vs. clinical data)

  20. Biological databases • Like any other database • Data organization for optimal analysis • Data is of different types • Raw data (DNA, RNA, protein sequences) • Curated data (DNA, RNA and protein annotated sequences and structures, expression data)

  21. Raw Biological data: Nucleic Acids (DNA)

  22. Raw Biological data: Amino acid residues (proteins)

  23. Curated Biological Data • DNA, nucleotide sequences • Gene boundaries, topology • Gene structure: introns, exons, ORFs, splicing • Mass spectrometry: identify unknown compounds • Expression data

  24. Curated Biological Data • Proteins, residue sequences • Mass spectrometry (metabolomics, proteomics) • Extended sequence information: MCTUYTCUYFSTYRCCTYFSCD • Secondary structure • Post-Translational protein Modification (PTM) • Hydrophobicity, motif data • Protein-protein interaction

  25. Curated Biological data: 3D Structures, folds

  26. Biological Databases The NAR Database Issue: http://www.oxfordjournals.org/nar/database/c/

  27. Distributed information • Pearson’s Law: The usefulness of a column of data varies as the square of the number of columns it is compared to.

  28. A few biological databases • Nucleotide Databases: Alternative Splicing, EMBL-Bank, Ensembl, Genomes Server, Genome, MOT, EMBL-Align, Simple Queries, dbSTS Queries, Parasites, Mutations, IMGT • Genome Databases: Human, Mouse, Yeast, C.elegans, FLYBASE, Parasites • Protein Databases: Swiss-Prot, TrEMBL, InterPro, CluSTr, IPI, GOA, GO, Proteome Analysis, HPI, IntEnz, TrEMBLnew, SP_ML, NEWT, PANDIT • Structure Databases: PDB, MSD, FSSP, DALI • Microarray Database: ArrayExpress • Literature Databases: MEDLINE, Software Biocatalog, Flybase Archives • Alignment Databases: BAliBASE, Homstrad, FSSP

  29. Structural Databases • Protein Data Bank (PDB) http://www.rcsb.org/pdb/ • Structural Classification of Proteins (SCOP) http://scop.berkeley.edu http://scop.mrc-lmb.cam.ac.uk/scop/

  30. PDB • 3D macromolecular structural data • Data originates from NMR or X-ray crystallography techniques • Total number of structures: 48,891 (date: this morning) • If the 3D structure of a protein is solved ... they have it

  31. PDB content

  32. PDB information • The PDB files have a standard format • Key features • Informative descriptors
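To show what a standard format buys in practice, here is a minimal Python sketch that reads one coordinate record by fixed column positions, following the ATOM record layout of the PDB file format. The names `Atom`, `format_atom_line` and `parse_atom_line` are hypothetical, and the sample atom and its coordinates are invented for the example; they are not taken from entry 1AE5 or any other real structure.

```python
from typing import NamedTuple


class Atom(NamedTuple):
    serial: int
    name: str
    res_name: str
    chain_id: str
    res_seq: int
    x: float
    y: float
    z: float


def format_atom_line(atom, occupancy=1.0, temp_factor=0.0):
    """Write an ATOM record using the fixed-width columns of the PDB format."""
    return (
        f"{'ATOM':<6}{atom.serial:>5} {atom.name:^4} {atom.res_name:>3} "
        f"{atom.chain_id}{atom.res_seq:>4}    "
        f"{atom.x:>8.3f}{atom.y:>8.3f}{atom.z:>8.3f}"
        f"{occupancy:>6.2f}{temp_factor:>6.2f}"
    )


def parse_atom_line(line):
    """Read an ATOM/HETATM record back by column position (0-based slices)."""
    if not line.startswith(("ATOM  ", "HETATM")):
        raise ValueError("not a coordinate record")
    return Atom(
        serial=int(line[6:11]),        # columns 7-11
        name=line[12:16].strip(),      # columns 13-16
        res_name=line[17:20].strip(),  # columns 18-20
        chain_id=line[21],             # column 22
        res_seq=int(line[22:26]),      # columns 23-26
        x=float(line[30:38]),          # columns 31-38
        y=float(line[38:46]),          # columns 39-46
        z=float(line[46:54]),          # columns 47-54
    )


# Invented example atom: an alpha carbon of an alanine in chain A.
original = Atom(1, "CA", "ALA", "A", 1, 11.104, 6.134, -6.504)
sample_line = format_atom_line(original)
parsed = parse_atom_line(sample_line)

print(sample_line)
print(parsed)
```

Because every field sits in fixed columns, the parser needs no delimiter logic. The sketch skips the alternate-location, insertion-code, element and charge columns, which a complete reader would also handle.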

  33. PDB-mirror on the WWW, e.g. 1AE5

  34. Example output: 1AE5

  35. Protein Structure Initiative (PSI) • Aims to determine the 3D structure of all proteins • Organize known protein sequences into families. • Select family representatives as targets. • Solve the 3D structure of targets by X-ray crystallography or NMR spectroscopy. • Build models for other proteins by homology to solved 3D structures. • Pro: many structures solved • Con: many redundant structures (40%)

  36. SCOP • Structural Classification Of Proteins • 3D Macromolecular structural data grouped based on structural classification • Data originates from the PDB • Current version (v1.73) • 34494 PDB Entries (Feb 2008). • 97178 Domains

  37. SCOP levels bottom-up • Family: Clear evolutionary relationship. Proteins clustered together into families are clearly evolutionarily related. Generally, this means that pairwise residue identities between the proteins are 30% and greater. However, in some cases similar functions and structures provide definitive evidence of common descent in the absence of high sequence identity; for example, many globins form a family though some members have sequence identities of only 15%. • Superfamily: Probable common evolutionary origin. Proteins that have low sequence identities, but whose structural and functional features suggest that a common evolutionary origin is probable, are placed together in superfamilies. For example, actin, the ATPase domain of the heat shock protein, and hexokinase together form a superfamily. • Fold: Major structural similarity. Proteins are defined as having a common fold if they have the same major secondary structures in the same arrangement and with the same topological connections. Different proteins with the same fold often have peripheral elements of secondary structure and turn regions that differ in size and conformation. In some cases, these differing peripheral regions may comprise half the structure. Proteins placed together in the same fold category may not have a common evolutionary origin: the structural similarities could arise just from the physics and chemistry of proteins favouring certain packing arrangements and chain topologies.
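The 30% guideline for families refers to pairwise residue identity, which can be sketched in a few lines of Python. This assumes the two sequences are already aligned to equal length with "-" marking gaps, and it divides by the number of columns where both sequences have a residue; other definitions of the denominator exist. The function names and the toy sequences are made up for the example, and the threshold check is only the slide's rule of thumb, since SCOP itself also relies on structure and function.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percentage of identical residues over the columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue  # skip columns with a gap in either sequence
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def meets_family_guideline(identity, threshold=30.0):
    """Rule of thumb from the slide: identities of 30% and greater suggest one family."""
    return identity >= threshold


# Toy aligned sequences, invented for the example.
identity = percent_identity("ACDE", "ACDF")
print(identity, meets_family_guideline(identity))

# The globin case from the slide: 15% identity fails the rule of thumb,
# yet the proteins are still placed in one family on other evidence.
print(meets_family_guideline(15.0))
```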

  38. SCOP-mirror on the WWW …

  39. Enter SCOP at the top of the hierarchy

  40. Keyword search of SCOP entries

  41. CATH • Class, derived from secondary structure content, is assigned for more than 90% of protein structures automatically. • Architecture, which describes the gross orientation of secondary structures, independent of connectivities, is currently assigned manually. • The Topology level clusters structures according to their topological connections and numbers of secondary structures. • The Homologous superfamilies cluster proteins with highly similar structures and functions. The assignments of structures to topology families and homologous superfamilies are made by sequence and structure comparisons.
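CATH writes a classification as four dot-separated numbers, one per level in the order Class, Architecture, Topology, Homologous superfamily (for example 3.40.50.300). The Python sketch below splits such a code into its levels and reports how many levels two codes share. The names `CathCode`, `parse_cath_code` and `shared_levels` are hypothetical helpers, not part of any CATH software, and the two codes serve only as example input.

```python
from typing import NamedTuple


class CathCode(NamedTuple):
    class_: int
    architecture: int
    topology: int
    homologous_superfamily: int


def parse_cath_code(code):
    """Split a dotted C.A.T.H code such as '3.40.50.300' into its four levels."""
    parts = code.split(".")
    if len(parts) != 4:
        raise ValueError(f"expected four dot-separated numbers, got {code!r}")
    return CathCode(*(int(part) for part in parts))


def shared_levels(a, b):
    """Count how many levels, starting from Class, two codes have in common."""
    depth = 0
    for level_a, level_b in zip(a, b):
        if level_a != level_b:
            break
        depth += 1
    return depth


first = parse_cath_code("3.40.50.300")
second = parse_cath_code("3.40.50.720")

print(first)
print("levels in common:", shared_levels(first, second))
```

In this example the two codes agree on Class, Architecture and Topology and differ only at the Homologous superfamily level.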

  42. CATH-mirror on the WWW …

  43. DSSP • Dictionary of secondary structure of proteins • The DSSP database comprises the secondary structures of all PDB entries • DSSP is actually software that translates the PDB structural co-ordinates into secondary (standardized) structure elements • A similar example is STRIDE
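DSSP labels each residue with a one-letter state (H, G and I for helices, E and B for strand and bridge, T and S for turn and bend, blank or "-" otherwise). Many analyses then collapse these into three states. The Python sketch below uses one common reduction, helix-like codes to H, strand-like codes to E and everything else to C; this particular mapping is an assumption chosen for illustration, other reductions are in use, and the input string is invented.

```python
# One widely used reduction of DSSP states to helix (H), strand (E) and coil (C).
DSSP_TO_THREE_STATE = {
    "H": "H",  # alpha helix
    "G": "H",  # 3-10 helix
    "I": "H",  # pi helix
    "E": "E",  # extended strand
    "B": "E",  # isolated beta bridge
}


def reduce_to_three_states(dssp_string):
    """Map a per-residue DSSP string to H/E/C; unlisted codes (T, S, '-', blank) become C."""
    return "".join(DSSP_TO_THREE_STATE.get(code, "C") for code in dssp_string)


# Invented per-residue assignment, not taken from a real PDB entry.
example = "-HHHHGGTT-EEEEB S"
print(reduce_to_three_states(example))
```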

  44. WHY bother??? • Researchers create and use the data • Use of known information for analyzing new data • New data needs to be screened • Structural/Functional information • Extends the knowledge and information on a higher level than DNA or protein sequences

  45. In the end …. • Computers can figure out all kinds of problems, except the things in the world that just don't add up. • James Magary • We should add: • For that we employ the human brain, experts and experience.

  46. Bio-databases: A short word on problems • Even today we face some key limitations • There is no standard format • Every database or program has its own format • There is no standard nomenclature • Every database has its own names • Data is not fully optimized • Some datasets have missing information without indications of it • Data errors • Data is sometimes of poor quality, erroneous, misspelled • Error propagation resulting from computer annotation

  47. What to take home • Databases are collections of data • They need to be accessed and maintained easily and flexibly • Biological information is vast and sometimes very redundant • Distributed databases bring it all together with quality controls, cross-referencing and standardization • Computers can only create data, they do not give answers • Suggested review: “Integrating biological databases”, Stein, Nature Reviews Genetics, 2003
