  1. Tutorial: Bioinformatics Resources (http://pir.georgetown.edu/pirwww/workshop/bioinfo_resource.html) Bio-Trac 25 (Proteomics: Principles and Methods) October 3, 2008 Zhang-Zhi Hu, M.D. Research Associate Professor Protein Information Resource, Department of Biochemistry and Molecular & Cellular Biology Georgetown University Medical Center

  2. What is Bioinformatics? computer (information) + mouse (biology) = bioinformatics • NIH Biomedical Information Science and Technology Initiative (BISTI) Working Definition (2000): Research, development, or application of computational tools and approaches for expanding the use of biological, medical, behavioral or health data, including those to acquire, store, organize, archive, analyze, or visualize such data.

  3. Molecular Biology Database Collection 1078 key databases of 14 categories (http://nar.oxfordjournals.org/cgi/content/full/36/suppl_1/D2)

  4. Database Collection in Nucleic Acids Res.

  5. Online Access to Database Collection http://pir.georgetown.edu/pirwww/workshop/2005_database_update.html 2008 http://www.oxfordjournals.org/nar/database/cap/

  6. Overview Database Contents, Search and Retrieval • Text search / Information retrieval • Sequence & genomics databases • Protein family databases • Databases of protein functions • Databases of protein structures • Proteomics databases Lab session

  7. Integrated one-stop search Entrez Text Searches (http://www.ncbi.nlm.nih.gov/Entrez/) Lab
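
An Entrez text search can also be issued programmatically. The following is a minimal sketch, assuming NCBI's E-utilities `esearch.fcgi` endpoint and its `db`, `term`, and `retmax` parameters; none of these appear on the slides, and the query term is only an illustration. The code builds the request URL and does not contact the server.

```python
from urllib.parse import urlencode

# Assumed base URL for NCBI E-utilities (not given on the slides).
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db, term, retmax=20):
    """Build an ESearch URL for one Entrez database and one text query."""
    params = {"db": db, "term": term, "retmax": retmax}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


# Illustrative query: Boolean operators are written in upper case.
print(esearch_url("pubmed", "alpha crystallin AND phosphorylation"))
```

Changing `db` (for example to `gene` or `protein`) reuses the same query against another Entrez database, which is the idea behind the one-stop search.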

  8. Literature mining PubMed Literature Database (http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?CMD=Search&DB=PubMed) PMID:14640721 Lab

  9. iProLINK: Protein Literature Mining Resource RLIMS-P: Text mining for protein phosphorylation BioThesaurus: Gene/protein name thesaurus: synonyms, ambiguous names… http://pir.georgetown.edu/iprolink/ Lab

  10. BioThesaurus: Gene/protein name searches - synonyms, ambiguous names… Synonyms: CRYAA; crystallin, alpha A; CRYA1; HSPB4… http://pir.georgetown.edu/iprolink/biothesaurus Lab
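
The idea of a name thesaurus (synonym lookup plus detection of ambiguous names) can be sketched with a small inverted index. This is an illustration only, not how BioThesaurus is implemented. The CRYAA synonyms come from the slide; the `GENE_X` entry is hypothetical and exists only to show an ambiguous name.

```python
from collections import defaultdict

# Synonym groups keyed by a canonical name.
GROUPS = {
    "CRYAA": ["CRYAA", "crystallin, alpha A", "CRYA1", "HSPB4"],  # from the slide
    "GENE_X": ["GENE_X", "CRYA1"],  # hypothetical entry sharing a name
}


def build_index(groups):
    """Map each lower-cased name to the set of canonical names that use it."""
    index = defaultdict(set)
    for canonical, names in groups.items():
        for name in names:
            index[name.lower()].add(canonical)
    return index


def lookup(index, name):
    """Return canonical names for a query; more than one means ambiguous."""
    return sorted(index.get(name.lower(), set()))


INDEX = build_index(GROUPS)
print(lookup(INDEX, "HSPB4"))  # one canonical name: unambiguous
print(lookup(INDEX, "CRYA1"))  # two canonical names: ambiguous
```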

  11. RLIMS-P: Text mining for protein phosphorylation http://pir.georgetown.edu/iprolink/rlimsp/ Lab

  12. PIR Text Search (I) (http://pir.georgetown.edu/pirwww/search/textsearch.html) Google-type search vs. Boolean searches: AND, OR, NOT Lab
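
How AND, OR, and NOT combine can be shown with a small filter over record titles. This is a sketch of Boolean matching in general, not the PIR search engine; the titles and the function name `matches` are made up for the example.

```python
def matches(text, all_of=(), any_of=(), none_of=()):
    """Boolean word match: every AND term, at least one OR term, no NOT term."""
    words = set(text.lower().split())
    has_all = all(term.lower() in words for term in all_of)
    has_any = not any_of or any(term.lower() in words for term in any_of)
    has_none = not any(term.lower() in words for term in none_of)
    return has_all and has_any and has_none


# Hypothetical record titles, for illustration only.
TITLES = [
    "Alpha crystallin A chain",
    "Beta crystallin B1",
    "Argininosuccinate lyase delta crystallin",
]

# crystallin AND (alpha OR beta) NOT B1
HITS = [
    title
    for title in TITLES
    if matches(title, all_of=["crystallin"], any_of=["alpha", "beta"], none_of=["B1"])
]
print(HITS)
```

A Google-type search usually treats all words as one ranked query, while the Boolean form above states exactly which terms must, may, or must not occur.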

  13. PIR Text Search (II) Search: which alpha crystallin A chain entries are in protein families? null = absent; not null = present. Search for synonyms Lab

  14. PIR Text Search (III) Search: what crystallins are enzymes and what families they belong to? Can you find which crystallins have 3D structure determined? Argininosuccinate lyase (EC Lab

  15. UniProt Text Search http://www.uniprot.org/ Find proteins related to diabetes and with 3D-structure determined? Lab

  16. Search continues… Lab

  17. I. Sequence & Genomics Databases
      • NCBI Resources
        • GenBank: An annotated collection of all publicly available nucleotide and protein sequences.
        • RefSeq: NCBI non-redundant set of reference sequences, including genomic DNA, transcript (RNA), and protein products.
        • Entrez Gene: Gene-centered information at NCBI.
        • UniGene: Unified clusters of ESTs and full-length mRNA sequences.
        • OMIM: Online Mendelian Inheritance in Man: a catalog of human genetic and genomic disorders.
      • UniProt Consortium Database: Universal protein resource, a central repository of protein sequence and function.
      • Model Organism Genome Databases: MGD, RGD, SGD, FlyBase…
      • GeneCards: Integrated database of human genes, maps, proteins and diseases.
      • SNP Consortium Database (dbSNP); International HapMap Project: Genes associated with human diseases
      (http://www.oxfordjournals.org/nar/database/cap/)

  18. UniProt Consortium Databases: Universal Protein Resource (http://www.uniprot.org). Figure labels: 6.6 million; New! UUW; Since October 2002; Since July 2008

  19. Lab UniProt Report (I) Sections of the record Entry View: Sequence & Annotation http://www.uniprot.org/uniprot/P02493

  20. UniProt Report (II) – sequence and features Lab

  21. UniProt Report (III) – UniRef90 http://www.uniprot.org/uniref/?query=member%3aP02493+identity:0.9
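
The `identity:0.9` part of the query refers to clusters whose members share at least 90% sequence identity. The sketch below illustrates that threshold with a toy greedy clustering of pre-aligned sequences. It is not the procedure UniRef uses; the sequences and identifiers are invented, and identity is computed in the simplest possible way (matching positions divided by alignment length).

```python
def percent_identity(a, b):
    """Fraction of aligned positions with the same residue (gaps never match)."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to equal length")
    same = sum(1 for x, y in zip(a, b) if x == y and x != "-")
    return same / len(a)


def greedy_cluster(seqs, threshold=0.9):
    """Assign each sequence to the first representative it matches, else start a new cluster."""
    clusters = []  # list of (representative_id, [member_ids])
    for seq_id, seq in seqs.items():
        for rep_id, members in clusters:
            if percent_identity(seqs[rep_id], seq) >= threshold:
                members.append(seq_id)
                break
        else:
            clusters.append((seq_id, [seq_id]))
    return clusters


# Invented, equal-length toy sequences.
SEQS = {
    "s1": "ACDEFGHIKL",
    "s2": "ACDEFGHIKM",  # 9 of 10 positions match s1
    "s3": "AWWEFGWIKL",  # 7 of 10 positions match s1
}
print(greedy_cluster(SEQS, threshold=0.9))
```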

  22. Entrez Gene – Gene centric information http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene&cmd=Retrieve&dopt=Graphics&list_uids=12954#ubor0_RefSeq

  23. OMIM: Online Mendelian Inheritance in Man • Autosomal recessive congenital progressive cataract • Juvenile cataract of Down syndrome (http://www.ncbi.nlm.nih.gov/entrez/dispomim.cgi?id=123580)

  24. II. Protein Family Databases
      • Whole Proteins
        • PIRSF: Nonoverlapping Classification of Full Length Proteins Based on Evolutionary Relationship
        • COG (Clusters of Orthologous Groups) of Complete Genomes
        • PANTHER: Proteins Classified into Families/Subfamilies of Shared Function
        • ProtoNet: Automatic Hierarchical Classification of Proteins
      • Protein Domains
        • Pfam: Alignments and HMM Models of Protein Domains
        • SMART: Protein Domain Identification and Annotation
        • CDD: Conserved Domain Database
      • Protein Motifs
        • PROSITE: Protein Patterns and Profiles
        • BLOCKS: Protein Sequence Motifs and Alignments
        • PRINTS: Compendium of Protein Fingerprints (a group of conserved motifs)
      • Integrated Family Databases
        • InterPro: Integrates Pfam, PRINTS, PROSITE, ProDom, SMART, PIRSF, SuperFamily…

  25. Protein Clustering Initial version: COGs (http://www.ncbi.nlm.nih.gov/COG/) New version: includes eukaryotic clusters (KOGs)

  26. Lab PIRSF: Full Length Classification iProClass Family Report (http://pir.georgetown.edu/cgi-bin/ipcSF?id=SF002280)

  27. Domain Classification – Pfam Domain (http://www.sanger.ac.uk/cgi-bin/Pfam/swisspfamget.pl?name=CRYAA_RABIT) (http://pir.georgetown.edu/cgi-bin/ipcEntry?id=P02493)

  28. Pfam Domain (http://www.sanger.ac.uk/cgi-bin/Pfam/getacc?PF00525)

  29. Protein Motifs: PROSITE: A database of protein families and domains. It consists of biologically significant sites, patterns and profiles. (http://us.expasy.org/prosite/)
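
PROSITE patterns use a compact syntax: elements are separated by `-`, `x` means any residue, `[ST]` means S or T, `{P}` means any residue except P, and `x(2,4)` means two to four repeats. A pattern can be scanned against a sequence by translating it into a regular expression. The sketch below is a simplified translator covering only those elements plus the `<` and `>` terminal anchors; the function name and the test sequence are assumptions made for illustration.

```python
import re


def prosite_to_regex(pattern):
    """Translate a simple PROSITE pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        anchor_start = element.startswith("<")
        anchor_end = element.endswith(">")
        element = element.strip("<>")
        # Separate the residue specification from an optional repeat such as (3) or (2,4).
        match = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = match.group(1), match.group(2), match.group(3)
        if core == "x":
            core = "."  # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"  # excluded residues
        # Bracketed sets such as [ST] and single letters are already valid regex.
        if low:
            core += "{" + low + ("," + high if high else "") + "}"
        parts.append(("^" if anchor_start else "") + core + ("$" if anchor_end else ""))
    return "".join(parts)


# N-glycosylation site pattern: N, not P, S or T, not P.
regex = prosite_to_regex("N-{P}-[ST]-{P}")
print(regex)
print([m.start() for m in re.finditer(regex, "AANGSAANPSA")])  # invented sequence
```

Profiles, the other PROSITE descriptor type, are position-specific scoring matrices and cannot be expressed as regular expressions.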

  30. Integrated Family Classification InterPro: An integrated resource unifying PROSITE, PRINTS, ProDom, Pfam, SMART, TIGRFAMs, and PIRSF. (http://www.ebi.ac.uk/interpro/search.html) Mapping of families

  31. III. Databases of Protein Functions
      • Metabolic Pathways, Enzymes, and Compounds
        • Enzyme Classification: Classification and Nomenclature of Enzyme-Catalysed Reactions (EC-IUBMB)
        • KEGG (Kyoto Encyclopedia of Genes and Genomes): Metabolic Pathways
        • LIGAND (at KEGG): Chemical Compounds, Reactions and Enzymes
        • EcoCyc: Encyclopedia of E. coli Genes and Metabolism
        • MetaCyc: Metabolic Encyclopedia (Metabolic Pathways)
        • BRENDA: Enzyme Database
        • UM-BBD: Microbial Biocatalytic Reactions and Biodegradation Pathways
      • Inter-Molecular Interactions and Regulatory Pathways
        • IntAct: Protein interaction data from literature and user submission
        • BIND: Descriptions of interactions, molecular complexes and pathways
        • DIP: Catalogs experimentally determined interactions between proteins
        • Reactome: A curated knowledgebase of biological pathways
        • BioCarta: Biological pathways of human and mouse
        • GO: Gene Ontology Consortium Database
        • Pathway Resources: Pathguide
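
The Enzyme Classification listed above assigns each enzyme a four-level EC number, where the first digit gives the top-level class. A minimal sketch of reading such a number follows; the function name and return shape are assumptions, and the class table includes class 7, which IUBMB added after these slides were written.

```python
# Top-level EC classes (class 7 was introduced by IUBMB in 2018).
EC_CLASSES = {
    1: "Oxidoreductases",
    2: "Transferases",
    3: "Hydrolases",
    4: "Lyases",
    5: "Isomerases",
    6: "Ligases",
    7: "Translocases",
}


def parse_ec(ec):
    """Split an EC number into its four levels and name the top-level class."""
    text = ec.strip()
    if text.upper().startswith("EC"):
        text = text[2:].strip()
    levels = text.split(".")
    if len(levels) != 4:
        raise ValueError(f"expected four levels in EC number, got {ec!r}")
    return {"class": EC_CLASSES.get(int(levels[0]), "unknown"), "levels": levels}


print(parse_ec("EC 1.1.1.1"))  # alcohol dehydrogenase
print(parse_ec("3.4.21.-"))    # partial number: last level unassigned
```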

  32. Biological Pathway Resource Collection http://www.pathguide.org/ • Protein-protein interactions • Metabolic pathways • Signaling pathways • Pathway diagrams • Transcription factors / gene regulatory networks • Protein-compound interactions • Genetic interaction networks

  33. Pathway Commons Search across multiple pathway databases; common format for global analysis http://www.pathwaycommons.org/pc/home.do

  34. Lab KEGG Metabolic & Regulatory Pathways • KEGG is a suite of databases and associated software, integrating our current knowledge on molecular interaction networks, the information of genes and proteins, and of chemical compounds and reactions. (http://www.genome.ad.jp/kegg/kegg2.html) (http://www.genome.ad.jp/dbget-bin/show_pathway?hsa00220+

  35. BioCyc: EcoCyc/MetaCyc Metabolic Pathways • The BioCyc Knowledge Library is a collection of Pathway/Genome Databases (http://biocyc.org/)

  36. BioCarta Cellular Pathways (http://www.biocarta.com/index.asp)

  37. Reactome: http://www.reactome.org/ • Collaboration of CSHL, EBI and GO Consortium • Curated resource of core pathways and reactions in human biology • Authored by biological researchers with expertise in their fields • Cross-referenced with NCBI, Ensembl, UniProt, HapMap, KEGG… • Inferred orthologous events in 22 non-human species (mouse, rat…)

  38. Transforming Growth Factor (TGF) beta signaling [Homo sapiens] (http://reactome.org/cgi-bin/eventbrowser?DB=gk_current&FOCUS_SPECIES=Homo%20sapiens&ID=170834&) Reactome: events and objects (including modified forms and complexes)
      Event -> REACT_6879.1: Activated type I receptor phosphorylates R-SMAD directly [Homo sapiens]
      Object -> REACT_7364.1: Phospho-R-SMAD [cytosol]
      Event -> REACT_6760.1: Phospho-R-SMAD forms a complex with CO-SMAD [Homo sapiens]
      Object -> REACT_7344.1: Phospho-R-SMAD:CO-SMAD complex [cytosol]
      Event -> REACT_6726.1: The phospho-R-SMAD:CO-SMAD transfers to the nucleus
      Object -> REACT_7382.2: Phospho-R-SMAD:CO-SMAD complex [nucleoplasm]
      ……
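
The event/object listing on this slide can be modeled as a small data structure. The sketch below is only one way to read it: it assumes that each object listed directly after an event is that event's output, which the slide suggests but does not state. The identifiers and names are copied from the slide; the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Event:
    stable_id: str
    name: str
    output_id: str  # assumption: the object listed right after the event


OBJECTS = {
    "REACT_7364.1": "Phospho-R-SMAD [cytosol]",
    "REACT_7344.1": "Phospho-R-SMAD:CO-SMAD complex [cytosol]",
    "REACT_7382.2": "Phospho-R-SMAD:CO-SMAD complex [nucleoplasm]",
}

EVENTS = [
    Event("REACT_6879.1", "Activated type I receptor phosphorylates R-SMAD directly", "REACT_7364.1"),
    Event("REACT_6760.1", "Phospho-R-SMAD forms a complex with CO-SMAD", "REACT_7344.1"),
    Event("REACT_6726.1", "The phospho-R-SMAD:CO-SMAD transfers to the nucleus", "REACT_7382.2"),
]


def trace(events, objects):
    """Pair each event with the name of the object it is assumed to produce."""
    return [(event.name, objects[event.output_id]) for event in events]


for step, (event_name, object_name) in enumerate(trace(EVENTS, OBJECTS), start=1):
    print(f"{step}. {event_name} -> {object_name}")
```

Keeping events and objects in separate tables mirrors the slide's distinction: the same protein in a different compartment or modification state is a different object with its own identifier.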

  39. Protein-Protein Interaction Database - IntAct (http://www.ebi.ac.uk/intact/)

  40. Gene Ontology (GO) (http://www.geneontology.org/) - Molecular Function - Biological Process - Cellular Component

  41. IV. Databases of Protein Structures
      • Protein Structure
        • PDB: Structure Determined by X-ray Crystallography and NMR
        • PDBsum: Summaries and analyses of PDB structures
        • MMDB: NCBI’s database of 3D structures, part of NCBI Entrez
        • SWISS-MODEL Repository: Database of annotated protein 3D models
        • ModBase: Annotated comparative protein structure models
      • Structure Classification
        • CATH: Hierarchical Classification of Protein Domain Structures
        • SCOP: Familial and Structural Protein Relationships
        • FSSP: Protein Fold Classification Based on Structure-Structure Alignment

  42. PDB: Experimental 3D Structure Repository Rat gamma-crystallin (chains A and B). Can you do a text search at PIR to find this (CRGE_RAT)? (http://www.rcsb.org/pdb/) Lab
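
Chains such as A and B can be read directly from a PDB-format file, where each `ATOM` record stores the chain identifier in column 22. The sketch below counts atoms per chain. The helper `atom_line` builds shortened, made-up records (no coordinates) only so the example is self-contained; it is not part of any library.

```python
from collections import Counter


def atom_line(serial, name, residue, chain, residue_number):
    """Build the first 26 columns of a PDB ATOM record (made-up data, no coordinates)."""
    return f"{'ATOM':<6}{serial:>5} {name:^4} {residue:>3} {chain}{residue_number:>4}"


def count_atoms_by_chain(lines):
    """Count ATOM records per chain identifier (column 22, index 21)."""
    counts = Counter()
    for line in lines:
        if line.startswith("ATOM") and len(line) > 21:
            counts[line[21]] += 1
    return counts


# Invented records for two chains.
LINES = [
    atom_line(1, "N", "GLY", "A", 1),
    atom_line(2, "CA", "GLY", "A", 1),
    atom_line(3, "N", "ALA", "B", 1),
    "REMARK this line is ignored",
]
print(count_atoms_by_chain(LINES))
```

With a downloaded entry, the same function can be applied to an open file handle, for example `count_atoms_by_chain(open("entry.pdb"))`, where `entry.pdb` is a placeholder file name.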

  43. PDBsum: Pictorial Database to Provide Summary and Analysis to PDB Entries Search 3-D structure summary 2-D structure summary (http://www.ebi.ac.uk/thornton-srv/databases/pdbsum/)

  44. Protein Structural Classification (1) CATH: Hierarchical domain classification of protein structures (http://www.cathdb.info/)

  45. Protein Structural Classification (2) SCOP: comprehensive description of structural and evolutionary relationships between all proteins whose structure is known. (http://scop.mrc-lmb.cam.ac.uk/scop/data/scop.b.html)

  46. SWISS-MODEL Repository http://swissmodel.expasy.org/ http://swissmodel.expasy.org/repository/ A database of annotated three-dimensional comparative protein structure models (http://swissmodel.expasy.org/repository/smr.php?sptr_ac=CRBA1_MOUSE&job=2)

  47. V. Proteomic Resources
      • GELBANK (http://gelbank.anl.gov): 2D-gel patterns of species with completed genomes.
      • SWISS-2DPAGE (http://www.expasy.org/ch2d/): index of 2D-gels
      • PEP (http://cubic.bioc.columbia.edu/pep/): Predictions for Entire Proteomes: summarized analyses of protein sequences
      • Integr8 (http://www.ebi.ac.uk/integr8/): A browser for information relating to completed genomes and proteomes, based on data contained in Genome Reviews and the UniProt proteome sets
      • PRIDE (http://www.ebi.ac.uk/pride/): PRoteomics IDEntifications database Expression Profiling databases
      • GPMdb (http://gpmdb.thegpm.org/): Mass spec proteomics databases
      • PeptideAtlas (http://www.peptideatlas.org/): compendium of peptides identified in a large set of tandem mass spectrometry proteomic experiments
      • HUPO (http://www.hupo.org/): Human Proteome Organization to foster international proteomics initiatives.

  48. Lab 2D-Gel Image Databases (http://us.expasy.org/ch2d/) Part of WORLD-2DPAGE: index to 2-D PAGE databases and services (http://us.expasy.org/swiss-2dpage/ac=P02489)

  49. GPMdb: MS Data Search (http://gpmdb.thegpm.org/) Craig, et al., J Proteome Res. 2004, 3:1234-42.
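
Mass spectrometry search tools generally compare observed spectra with peptides derived from protein sequences, and a common first step is an in-silico tryptic digest. The sketch below shows the usual trypsin rule (cleave after K or R, but not before P) with optional missed cleavages. It is a generic illustration, not GPMdb's implementation, and the sequence is invented.

```python
def tryptic_digest(sequence, missed_cleavages=0):
    """Return tryptic peptides, allowing up to `missed_cleavages` skipped sites."""
    # Cut after K or R unless the next residue is P.
    pieces, start = [], 0
    for i, residue in enumerate(sequence):
        next_residue = sequence[i + 1] if i + 1 < len(sequence) else ""
        if residue in "KR" and next_residue != "P":
            pieces.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        pieces.append(sequence[start:])

    # Join neighbouring pieces to model missed cleavages.
    peptides = []
    for i in range(len(pieces)):
        last = min(i + missed_cleavages, len(pieces) - 1)
        for j in range(i, last + 1):
            peptides.append("".join(pieces[i:j + 1]))
    return peptides


print(tryptic_digest("AAKRPGGRCC"))
print(tryptic_digest("AAKRPGGRCC", missed_cleavages=1))
```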

  50. HUPO Plasma Proteome Project PRIDE: centralized, standards-compliant, public data repository for proteomics data http://www.ebi.ac.uk/pride/