1 / 24

ProtoNet Automatic Hierarchical Classification of Proteins Brief Database overview December 2003

ProtoNet Automatic Hierarchical Classification of Proteins Brief Database overview December 2003. ProtoNet Database overview. Present Configuration. 2 Unix servers: Red Hat Linux 7.1, FreeBSD 4.7 the first serves as a web server the second as a database server

billie
Download Presentation

ProtoNet Automatic Hierarchical Classification of Proteins Brief Database overview December 2003

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. ProtoNet Automatic HierarchicalClassification of Proteins Brief Database overviewDecember 2003

  2. ProtoNet Database overview Present Configuration • 2 Unix servers: Red Hat Linux 7.1, FreeBSD 4.7 the first serves as a web server the second as a database server • Each server is dual-PIII 1000Mhz w/ 2GB RAM • Database engine: PostgreSQL 7.2

  3. ProtoNet Database overview Present Configuration • Disks situation: 130GB (w/ mirror), 265GB (RAID 5) • Both at critical utilization levels (93%)

  4. ProtoNet Database overview Database Figures: • Around 150 tables, consuming ~160GB • 2 major protein (sequence) sources • 3 protein motif sources • 8 keyword sources • Multiple versions of each source are kept and handled under release management (3 releases so far) – to be mentioned later • 7 clusterings (cluster-tree setups)

  5. ProtoNet Database overview AVA Blast Keywords Motif Systems Annotation Systems Proteins Annotated Clustering Clusters Tree Clustering Research Satellite systems Web Interface Database setup process:

  6. ProtoNet Database overview Proteins Building the PROTEINS table: • Protein Sources: Swiss-Prot (~130K sequences) TrEMBL (~950K sequences) • Each protein record comes with many attributes: • Swiss-Prot ID + Accession numbers • Description • Amino-Acid sequence + Molecular Weight + Length • Keywords + Motifs (features) • Linkage information to external databases (PDB, Taxonomy, Publication journals, etc) • and more… • We also calculate additional attributes, such as the theoretical Iso-Electric point (based on AA sequence).

  7. ProtoNet Database overview AVA Blast Proteins Building the PROTEINPROTEIN table: • Protein Sources: Swiss-Prot (~130K sequences) TrEMBL (~950K sequences) • Each pair of Swiss-Prot proteins is checked for itsBLAST e-score, which serves as a distance measure • The set of distance measures between all proteins serves as the base metric for the clustering process (to be mentioned later) • TrEMBL sequences are ignored at this stage, both due to space and computation limitations and also due to potential noise they would probably inflict on the clustering process (we haven’t had the chance to actually test this hypothesis)

  8. ProtoNet Database overview AVA Blast Proteins Clusters Tree Clustering Building the CLUSTERS table: • The clustering process: • All initial proteins are defined as singleton clusters • Closest pair of clusters is merged into a new cluster (distance between clusters is the average distance of all of their proteins) • The distance between the merged clusters induces a time measure, representing their “death-time” which is equal to the new-born cluster’s “birth-time” • Each cluster gets additional attributes, such as: size, depth, height, average length of proteins, and more… • The process repeats until a distance-cutoff is met • The overall result is the clusters tree

  9. ProtoNet Database overview AVA Blast Proteins Clusters Tree Clustering Building the CLUSTERS table: clustidfatheridsizeheightdepthbirthtimedeathtime… • The overall result is the clusters tree

  10. ProtoNet Database overview clustidfatheridsizeheightdepthbirthtimedeathtime… Clusters Tree Building the CLUSTERS table: • The overall result is the clusters tree

  11. ProtoNet Database overview clustidfatheridsizeheightdepthbirthtimedeathtime… Clusters Tree Building the CLUSTERS table: At this point TrEMBL comes into the picture: Each TrEMBL sequence is “hung” a-posterior on the tree, using our Classify-Your-Protein algorithm. This algorithm either finds a suitable cluster or declares the TrEMBL sequence as a singleton. Note: Leaves are no longer singletons! Several DFS-based orderings of the clusters in the tree are kept, and allow efficient (and elegant!) way to query the clusters hierarchy. We also calculate and attach additional attributes to each cluster, such as: amountofclusters_atbirth, amountofsingletons_atbirthaverlength, varlengthaverdepth, vardepthdfsnum, dfsnum_ofnextnondescsubtreesize, lifetime, …

  12. ProtoNet Database overview Keywords Motif Systems Annotation Systems Extracting annotation from external sources: • Notes: • External databases are downloaded from the Internet in varying file formats (all sorts of strange ASCII formats, XML, …) • Generally speaking, the more “Biological” the database is, the more peculiar is its format • Most biological databases are indexed uniquely, but not all with an integer • Being dynamic and developing, the external databases tend to change their format every now and then! • PERL Parsing scripts were written for each external source…

  13. ProtoNet Database overview Motif Systems Building the MOTIFS + PROTEINMOTIF tables: • Motif Sources: InterPro (PRINTS, Pfam, PROSITE, ProDom, Smart, TIGRFAMs, PIR SuperFamily, SuperFamily) Swiss-Prot (FEATUREs) TrEMBL (FEATUREs) • Each motif record comes with the following attributes: • ID in external system • Description • Starting AA position • Ending AA position

  14. ProtoNet Database overview Motif Systems Annotation Systems Building the KEYWORDS + PROTEINKEYWORD tables: • Keyword Sources: Swiss-Prot, TrEMBL, Enzyme, PDB, SCOP, • GO, NCBI Taxonomy, InterPro • Biological protein properties are extracted from each database, and are stored in suitable tables (proteinec, proteinpdb, pdbscop, proteinspecies, ...) • Keywords are generated for each protein from its associations to these biological properties • These keywords are also grouped into keyword-types • Statistics (quantity/frequency) are calculated for each keyword and keyword-type, according to the proteinkeyword table

  15. ProtoNet Database overview AVA Blast Keywords Motif Systems Annotation Systems Proteins Clusters Tree Clustering Annotated Clustering Combining annotations with the clustering

  16. ProtoNet Database overview Annotated Clustering Combining annotations with the clustering • In the process of combining the annotations with the clustering, we “forget” where each keyword came from. • Statistics are calculated for each keyword in the clustering, and for selected cluster-keyword pairs ( TP, FN, specificity, sensitivity, significance measures). • Several tables are maintained for this stage, some of which are huge:clusterings, clusteringprotein, clusteringkeyword, clusteringkeytype, clusters, clusterkeyword • Space/Time limitations enforce us to be selective with regards to: • Which keyword-types enter this phase in the first place • Which cluster-keyword pairs are analyzed

  17. ProtoNet Database overview Release management • RELEASE is a set of clustering-trees and protein / motif / keyword sources, each of a certain version. • Some of the releases were built explicitly for the web users. A default clustering of the most updated release is exposed to the standard user. Advanced users can switch between the various releases and clustering trees. • In order to allow multiple releases, we keep more than one version of each of the external (imported) databases + clustering trees. • This is also important for research work (that can also be comparative).

  18. ProtoNet Database overview Release management • In order to efficiently manage multiple versions (of external databases), each record contains a special number field which represents a bits-vector. • Bit in position i is turned ON if (and only if) the record is part of version i of the corresponding source. • This serves as a CONDENSATION for the data in the tables, and was introduced due to our space problems. • For example, if a protein-sequence record appears (exactly the same) in versions 2 and 3 of Swiss-Prot, the bits-vector integer is equal to 6 (000001102) • This bits-vector field is unnecessary for the clustering tables, since each clustering is associated to a release (+versions) and the tuples are (generally) non-repetitive. This implies having NO CONDENSATION in the cluster-related tables, and hence they consume A LOT of disk space (especially the cluster-keyword table).

  19. ProtoNet Database overview Problems… ProtoNet’s database is a HUGE setup, which seems to be beyond our technical capabilities. • We are limited in: • Men power: We starve for a steady, experienced database and system developers team, dedicated solely to this project (rather than preoccupied research students).This is the only way this project can fulfill its objective – to be an up-to-date homepage for Proteomics! • Disk space • CPU power: Both for serving outside users and for quicker release upgrades. • Robust, trustworthy and efficient database engine (PostgreSQL seems to fail on all three categories!)

  20. ProtoNet Database overview

  21. ProtoNet Database overview

  22. ProtoNet Database overview

  23. ProtoNet Database overview Thanks! Made by Uri Inbar and Hillel Fleischer

More Related