Protein Information Resource

Protein Information Resource. Oversight and Scientific Advisory Board Meeting November 14, 2005 Georgetown University Medical Center. Welcome and Introduction. Vassilios Papadopoulos, Ph.D. Associate Vice President & Director, Biomedical Graduate Research Organization


Protein Information Resource

Oversight and Scientific Advisory Board Meeting

November 14, 2005

Georgetown University Medical Center

Welcome and Introduction

Vassilios Papadopoulos, Ph.D.

Associate Vice President &

Director, Biomedical Graduate Research Organization

Georgetown University Medical Center

David States, M.D., Ph.D.

Chair, PIR Oversight and Scientific Advisory Board

Professor & Director of Bioinformatics, University of Michigan

PIR/UniProt Overview: Project Overview, Organization, Infrastructure

Cathy H. Wu, Ph.D.

Director, PIR

Professor, Georgetown University Medical Center

Protein Information Resource (PIR)

Integrated Protein Informatics Resource for Genomic/Proteomic Research

  • UniProt Universal Protein Resource: Central Resource of Protein Sequence and Function

  • PIRSF Family Classification System: Protein Classification and Functional Annotation

  • iProClass Integrated Protein Database: Data Integration and Protein Mapping

  • Cyber Infrastructure (Interoperability and Dissemination): Ontology, XML, Object/Relational DB, J2EE Architecture

UniProt: Universal Protein Resource

Central Resource of Protein Sequence and Function

  • International Consortium

    • Protein Information Resource (PIR)

    • European Bioinformatics Institute (EBI)

    • Swiss Institute of Bioinformatics (SIB)


    • Phase I (09/02-08/05): $6 Million Annual

    • Bridge (09/05-?/06): $6.6M

    • Phase II (?/06-?/09): $6.6-8.0(?)M


UniProt Databases

UniProt Archive (UniParc)

Comprehensive sequence archive with sequence history

Produced at EBI

UniProt Reference Clusters (UniRef)

Non-redundant reference clusters for sequence search

Produced at PIR

UniProt Knowledgebase (UniProtKB)

Integration of PIR-PSD, Swiss-Prot and TrEMBL databases

Stable, comprehensive, fully classified, richly and accurately annotated knowledgebase

UniProtKB/Swiss-Prot: Produced at SIB

UniProtKB/TrEMBL: Produced at EBI

Literature-based and automated annotation at SIB, PIR, EBI

UniProt Management Structure

  • Scientific Advisory Panel (SAP) to be established by NHGRI

UniProt Project Coordination

  • UniProt email discussion groups

    • Project Liaisons and Ad hoc teams

  • Tri-weekly teleconference calls

  • Tri-annual face-to-face Consortium meetings

    • January 12-13, 2006 at Geneva

    • April 10-11, 2006 at Georgetown University

  • Exchange visits of scientific and technical staff

    • Five PIR staff at SIB (1-2 weeks, Nov 05) for annotation integration

  • Retreats

    • France, 2004

UniProt Activities at PIR

  • Integration of PIR-PSD into UniProtKB Swiss-Prot/TrEMBL

    • Incorporation of unique PIR entries

    • Incorporation of PIR annotations: references, experimental features with literature evidence tag

  • Functional annotation of UniProtKB proteins

    • Development of PIRSF family classification system & PIRSF curation => Comprehensive coverage of all UniProtKB proteins

    • Development of rule-based annotation system & PIRNR (name rule) /PIRSR (site rule) curation => Rule curation and integration into Swiss-Prot/TrEMBL annotation pipelines & propagation of annotations (e.g., name, GO, site)

  • Production of UniRef100/90/50 databases => Enhancement & scaling

  • Creation of UniProt web site and help system => Unified UniProt web site & user community interaction

PIRSF Classification System

Protein Classification and Functional Annotation

  • PIRSF: Evolutionary relationships of proteins from super- to sub-families

  • Curated families with name rules and site rules

  • Curation platform with classification/visualization tools

  • Deliverables: UniProtKB annotations, InterPro families, PIRSF reports, PIRSF curation platform

PIRSF Work Group Meeting, April 2003

iProClass Integrated Protein Database

Data Integration and Protein Mapping

  • Data integration from >90 databases

  • Underlying data warehouse for protein ID/name/bibliography mapping

  • Integration of protein family, function, structure for functional annotation

  • Rich link (link + summary) for value-added reports of UniProt proteins

Funded by NSF

iProLINK Literature Mining Resource

  • Bibliography report: Annotated bibliography for UniProtKB proteins

  • BioThesaurus reports: Protein and gene names for UniProtKB proteins

  • RLIMS-P program: Tag PubMed abstracts for phosphorylation objects

  • Protein ontology DAG: PIRSF-based ontology

Funded by NSF

NIAID Proteomic Admin Center

  • NIAID Proteomic Master Catalog & Complete Proteomes

  • iProXpress for Protein Function and Pathway Analysis

    • Gene/Peptide-Protein Mapping

    • Sequence Analysis & Data Mining

    • Function/ Pathway Discovery


Funded by NIAID

Bioinformatics Infrastructure

  • NCI caBIG: PIR grid-enablement (Programming access to UniProtKB)

  • NSF TeraGrid: All-against-all BLAST (UniProtKB related sequences)

  • PIR Bioinformatics Framework

    • Software Framework: J2EE n-Tier Architecture with Object Models

    • Database Distribution: XML, FASTA, Relational (Oracle 9i, MySQL)

    • Other Deliverables: Object Models, Web Services

Funded by NCI

Computing Environment

  • Computers: Two Sun V880, IBM P690, 100-CPU Linux Cluster, Compaq 4100 Alpha

  • Networking: Internet2, GU Network (1Gbps)

  • GU UIS Advanced Research Computing

PIR Environment

  • Funding: ~$3Million Annual Total (2/3 UniProt, 1/3 Other)

  • Home Institution: Georgetown University Medical Center (GUMC)

  • Subcontract: National Biomedical Research Foundation (NBRF)

  • New Location: Off-Campus (GU North Campus), 6250 SQFT

Suite 1200, 3300 Whitehaven Street NW, Washington, DC 20007

PIR Organization

  • 25 Staff Members

    • 14 GU, 11 NBRF

  • 22 FTEs

    • 12.7 GU, 9.3 NBRF

  • 17 with Doctorate Degree

  • 11 GU Faculty

    • 2 Professors

    • 1 Research Associate Professor

    • 6 Research Assistant Professors

    • 2 Research Instructors

PIR Community Interactions (since 2004)

  • Presentations and Invited Seminars

    • NIH Proteomics Workshop (Bi-Annual) – Bioinformatics Day

    • Conference Demos/Posters: ISMB-05, US HUPO-05, SOFG04

    • Over 20 Invited Presentations: Keystone, Human Brain Project Satellite Symposium, PDB Symposium, HUPO-05

    • Policy Forums, Committees: NSF Plant Cyberinfrastructure, NIH Protein Structure Initiative, HUPO Proteomics Standards Initiative

  • Publications: Over 25 Refereed Papers and Book Chapters

  • Collaborations and Interactions

    • Collaborated and interacted with over 10 research institutions

    • Hosted face-to-face meetings for NIAID/caBIG projects

  • Paper and Grant Reviews

    • Reviewed over 20 papers for refereed journals and conferences

    • Served on NSF/NIH grant review panels

PIR-Georgetown Interactions

  • Teaching

    • Courses: Bioinformatics (BCHB 521), Advanced Bioinformatics (BCHB 621)

    • Lectures: Medical Biochemistry, Protein Biomarker, Introductory Biology

  • Mentoring

    • Mentored 9 graduate students (PhD students, MS Internship projects)

  • Intercampus Seminars

  • Proposal Submission by PIR Young Investigators as PI

    • Six proposals to federal and other agencies

PIR/UniProt – Summary & Statistics

Database Growth

Database Usage

Unified UniProt Web Site

PIR UniProt Consortium Interactions

Peter McGarvey, Ph.D.

Customer Email &

550 UniProt emails

720 PIR emails

1 Day Turnaround

“PIR is a wonderful resource.” – Craig

“Thank you for your prompt response, as always UniProt is on the ball!” – Fiona

PIR/UniProt – Unified UniProt Web Site

  • Dec. 03, Three Synchronized Sites based on PIR Design

  • Nov. 04, Established Goals for Unified Web Sites.

  • 2005, Back-end Data and Software Platform Developed.

  • Nov. 05, PIR Playing a Lead Role in Developing Specifications for the Interface.

  • June 06, Release of Unified UniProt Web Site Hosted by PIR and EBI

PIR/UniProt - Consortium Interactions

  • UniProt liaison group (discussion of high-level issues)

  • UniProt web site committee (Unified UniProt web site planning)

  • UniProt Link committee (working with external databases)

  • UniProt help-mail (answering user inquiries)

  • UniProt document committee (documentation, tutorials and FAQs)

  • UniProt XML group (XML documentation and maintenance)

  • UniProt group for automatic annotation pipeline

  • Manual curation of Swiss-Prot template sequences

  • Manual curation of site rules and controlled vocabularies

  • Development of automatic annotation rules

  • Development of protein naming guidelines

  • Incorporation of new protein families into InterPro

  • PIR routinely visits or hosts colleagues from EBI and SIB for discussions.

  • Biweekly update of UniRef, UniParc and UniProtKB databases

Protein Classification and Annotation

Darren Natale, Ph.D.

Team Lead, Protein Science, PIR

Research Assistant Professor, GUMC

Protein Curation Activities

  • PIRSF – classification of homeomorphic proteins based on evolutionary relationships

  • PIRNR – family-based “Name Rules” that define the parameters for propagating specific name, EC and GO annotation to members

  • PIRSR – family-based “Site Rules” that define the parameters for propagating specific feature annotation to members

Specialized Tools (I)

  • Pfam/PIRSF Hierarchy

  • Domain Relatives

  • Domain Composition


Preserves these three features in a navigable format

In edit mode, allows easy creation, destruction, and movement of PIRSFs

Specialized Tools (II)



Phylogenetic Tree Classification/Annotation Alignment

PIR Tree and Alignment Viewer (PIRTAV)

HPS = 3-hexulose-6-phosphate synthase

KGPDC = 3-keto-L-gulonate 6-phosphate decarboxylase

PIRSF Curation Pipeline

  • Uncurated level – computer-generated

  • Preliminary Curation Level

    • Curate membership (principal tools: BLAST results, iterative blastclust, on-the-fly HMM)

    • Curate domain architecture

    • Select seeds

  • Full Curation Level

    • Curate name and some references

    • Optional: write abstract indicating function, structure, etc.

(Full level only)

After name review session and HMM performance check, all information (HMM, membership, annotation) is sent to EBI for integration into InterPro.

PIRNR Curation Pipeline

  • Start with PIRSF curated to Full level

  • Define match criteria for application of the rule

  • Review protein name, synonyms, EC numbers, GO terms

  • Find those that are appropriate to propagate to members that match rule criteria

After review of propagable information, send match conditions, exclusion conditions, and propagated fields to EBI for inclusion into automatic annotation pipeline. Results are displayed in EBI’s UniProt entry extended view.
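
As a rough illustration of what "match conditions, exclusion conditions, and propagated fields" could look like in practice, here is a minimal Python sketch. The `NameRule` class, its field names, the length-range match condition, the taxon-based exclusion, and every identifier and value in it are hypothetical placeholders, not the actual PIRNR rule format.

```python
from dataclasses import dataclass, field


@dataclass
class NameRule:
    """Hypothetical stand-in for a family-based name rule."""
    rule_id: str
    family_id: str                                     # PIRSF the rule is attached to
    min_length: int                                    # assumed match condition
    max_length: int
    excluded_taxa: set = field(default_factory=set)    # assumed exclusion condition
    propagated: dict = field(default_factory=dict)     # e.g. name, EC, GO

    def applies_to(self, protein):
        """True when the protein meets the match conditions and no exclusion."""
        if protein["family"] != self.family_id:
            return False
        if protein["taxon"] in self.excluded_taxa:
            return False
        return self.min_length <= protein["length"] <= self.max_length


def annotate(protein, rules):
    """Collect propagated fields from every rule that applies, with the rule as evidence."""
    annotations = {}
    for rule in rules:
        if rule.applies_to(protein):
            for key, value in rule.propagated.items():
                annotations[key] = {"value": value, "evidence": rule.rule_id}
    return annotations


# Placeholder rule and proteins (all values invented for illustration).
rule = NameRule(
    rule_id="PIRNR_EXAMPLE",
    family_id="PIRSF_EXAMPLE",
    min_length=100,
    max_length=300,
    excluded_taxa={"excluded_taxon"},
    propagated={"name": "Example enzyme", "EC": "0.0.0.0", "GO": "GO:0000000"},
)
member = {"family": "PIRSF_EXAMPLE", "taxon": "some_taxon", "length": 250}
excluded = {"family": "PIRSF_EXAMPLE", "taxon": "excluded_taxon", "length": 250}
print(annotate(member, [rule]))
print(annotate(excluded, [rule]))
```

Recording the rule identifier next to each propagated value mirrors the evidence attribution that the slides describe for rule-based annotation.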

PIRSR Curation Pipeline

  • Start with PIRSF with curated membership and seeds. At least one member must have solved structure.

  • Edit seed-to-structure alignment to define and retain conserved regions covering pertinent residues

  • Build Site HMM from concatenated conserved regions

  • Define feature annotation using controlled vocabulary with evidence attribution

Apply rules to PIRSF members, create log files to send to SIB (UniProtKB/Swiss-Prot) or EBI (UniProtKB/TrEMBL). Results are incorporated into UniProtKB flat files.
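
The idea of propagating a site feature only when the conserved residues are present can be sketched as follows. This is a simplified illustration, not the PIRSR implementation: it assumes a pairwise alignment between a template and a member is already available (the real pipeline uses a Site HMM), and the function name `map_sites`, the sequences, and the site definitions are invented for the example.

```python
def map_sites(template_aln, member_aln, sites):
    """Map template site positions onto a member sequence.

    template_aln, member_aln: equal-length aligned strings, '-' marks a gap.
    sites: dict of template position (1-based, ungapped) ->
           (allowed residues, feature label).
    Returns a list of (member position, residue, label), or [] if any
    required site is missing or carries a residue that is not allowed.
    """
    t_pos = m_pos = 0
    features = []
    remaining = dict(sites)
    for t_char, m_char in zip(template_aln, member_aln):
        if t_char != "-":
            t_pos += 1
        if m_char != "-":
            m_pos += 1
        if t_char != "-" and t_pos in remaining:
            allowed, label = remaining.pop(t_pos)
            if m_char == "-" or m_char not in allowed:
                return []  # one failed site blocks propagation of all features
            features.append((m_pos, m_char, label))
    # Sites that were never reached in the alignment also block propagation.
    return features if not remaining else []


# Toy alignment; template (ungapped) is MKDHGS, member (ungapped) is MKADHS.
sites = {3: ("DE", "active site"), 6: ("S", "binding site")}
print(map_sites("MK-DHGS", "MKADH-S", sites))
print(map_sites("MK-DHGS", "MKANH-S", sites))
```

The all-or-nothing return value is a design choice made for this sketch: it errs toward not annotating when any expected residue is absent.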

Progress on Protein Curation Activities

[Chart labels: 162 Preliminary; 693 Full; 352 Full + Desc; 35 Active; 34 Metal/Binding; 14 Misc.]




Impact Measurements

  • PIRSFs integrated into InterPro

    • Sent:

    • PIRSF-unique:

  • PIRNR touches on UniProtKB/TrEMBL

    • Entries:

    • Annotation lines:

  • PIRSR touches on UniProtKB

    • Entries:

    • Feature lines:





41,000 (9,800)

100,000 (27,000)

Increasing Throughput & Impact


With Structure


Active + Ligand

To InterPro



Increased specificity




  • Emphasize Full/InterPro

  • Rules to EBI

  • Active sites

  • Comprehensive coverage

  • Curation “push”

  • Propagation at PIR

  • Add ligand-binding

All three will be integrated into the Swiss-Prot annotation platform


UniRef Databases

Hongzhan Huang, Ph.D.

Bioinformatics Team Lead

Protein Information Resource, GUMC

UniRef (UniProt Reference Clusters)

  • Non-Redundant Reference Clusters for Sequence Searching

  • Derived from UniProtKB and Selected UniParc Sources

    • UniRef100: 100% sequence identity

    • UniRef90: 90% sequence identity (1/3 size reduction from UniRef100)

    • UniRef50: 50% sequence identity (2/3 size reduction)

Release 6.4 (Nov 05)




  • The most comprehensive sequence dataset for sequence similarity search

    • 3,176K sequences in UniRef100 vs. 3,022K sequences in NCBI nr

  • Source Sequences

    • Complete UniProtKB - Splice Variants as separate entries

    • Selected UniParc (e.g. Ensembl and RefSeq)

  • Non-Redundancy

    • Combine identical sequences from all species

    • Merge sub-fragments
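
The two non-redundancy steps above (combine identical sequences, then merge sub-fragments) can be sketched in a few lines of Python. This is a toy illustration under stated assumptions: the function name, the accessions, and the sequences are made up, and the quadratic substring scan would not scale to millions of sequences the way a production pipeline must.

```python
from collections import defaultdict


def build_uniref100_like(entries):
    """entries: list of (accession, sequence) pairs.

    Returns a list of clusters, each a dict with a representative
    sequence and the accessions merged into it.
    """
    # Step 1: combine identical sequences, regardless of species.
    by_seq = defaultdict(list)
    for acc, seq in entries:
        by_seq[seq].append(acc)

    # Step 2: merge sub-fragments into a longer sequence that contains them.
    # Longest first, so a representative is always seen before its fragments.
    clusters = []
    for seq in sorted(by_seq, key=lambda s: (-len(s), s)):
        for cluster in clusters:
            if seq in cluster["representative"]:
                cluster["members"].extend(by_seq[seq])
                break
        else:
            clusters.append({"representative": seq, "members": list(by_seq[seq])})
    return clusters


# Made-up accessions and sequences.
entries = [("A1", "MKTAYIAK"), ("B2", "MKTAYIAK"), ("C3", "TAYI"), ("D4", "MQQQ")]
for cluster in build_uniref100_like(entries):
    print(cluster)
```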

UniRef90 & UniRef50

  • Reduced sequence datasets for faster sequence similarity search

  • Representative sequence for each cluster

  • Clustering Algorithm

    • CD-HIT: Fast, top down, non-overlapping

    • PIR’s parallelized version running on Linux Cluster

UniRef90: 1/3 size reduction; UniRef50: 2/3 size reduction
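
A greedy, top-down, non-overlapping clustering in the style of CD-HIT can be sketched as below. This is not CD-HIT itself and not PIR's parallelized version: the identity measure is a deliberately simple ungapped comparison standing in for a real alignment, and the function names, accessions, and sequences are assumptions made for the example.

```python
def ungapped_identity(short, long_):
    """Best fraction of identical residues over all ungapped placements
    of `short` within `long_` (a stand-in for alignment-based identity)."""
    best = 0
    for offset in range(len(long_) - len(short) + 1):
        window = long_[offset:offset + len(short)]
        best = max(best, sum(a == b for a, b in zip(short, window)))
    return best / len(short)


def greedy_cluster(seqs, threshold):
    """seqs: dict of accession -> non-empty sequence.

    Returns dict of representative accession -> list of member accessions.
    Each sequence lands in exactly one cluster (non-overlapping).
    """
    clusters = {}
    # Longest first: every representative is at least as long as its members.
    order = sorted(seqs, key=lambda acc: (-len(seqs[acc]), acc))
    for acc in order:
        for rep in clusters:
            if ungapped_identity(seqs[acc], seqs[rep]) >= threshold:
                clusters[rep].append(acc)
                break
        else:
            clusters[acc] = [acc]  # no representative is close enough
    return clusters


# Made-up sequences: P2 differs from P1 at one position, P3 at five.
seqs = {"P1": "MKTAYIAKQR", "P2": "MKTAYIAKQK", "P3": "MKTAYWWWWW"}
print(greedy_cluster(seqs, 0.9))
print(greedy_cluster(seqs, 0.5))
```

Lowering the threshold folds more sequences into each representative, which is the mechanism behind the size reductions quoted for UniRef90 and UniRef50.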

UniRef50 Sequence Classification

  • Completely automated, biweekly-updated classification of all proteins

  • How good are the UniRef50 clusters?

    • Evaluated by all-against-all BLAST search results

    • 98% of the clusters are of good quality: each sequence matches every other sequence within the cluster

  • Problematic clusters

    • One long sequence bridges two or more non-related sub-clusters.

    • May result from incorrect gene models, domain fusions, or polyproteins

    • New algorithm will be developed with length/overlap parameters to detect and regroup such clusters.
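
The quality criterion above (every sequence matches every other sequence in the cluster) and the "bridging" failure mode can be expressed as a small check over all-against-all hit pairs. The sketch below assumes the BLAST results have already been reduced to a set of unordered sequence pairs; the function names and identifiers are hypothetical, and this is not the length/overlap algorithm the slide says will be developed.

```python
from itertools import combinations


def is_good_cluster(members, hits):
    """True when every pair of members has a recorded hit.

    hits: set of frozensets, one per pair of sequences that match.
    """
    return all(frozenset(pair) in hits for pair in combinations(members, 2))


def candidate_bridges(members, hits):
    """In a cluster that fails the check, list members that match all others.

    Such a sequence may be a long protein bridging unrelated sub-clusters.
    """
    if is_good_cluster(members, hits):
        return []
    return [
        m for m in members
        if all(frozenset((m, other)) in hits for other in members if other != m)
    ]


# Toy data: "long" matches everything, but a1/a2 never match b1.
hits = {
    frozenset(("long", "a1")),
    frozenset(("long", "a2")),
    frozenset(("long", "b1")),
    frozenset(("a1", "a2")),
}
print(is_good_cluster(["a1", "a2"], hits))
print(candidate_bridges(["long", "a1", "a2", "b1"], hits))
```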

Usages of UniRef Clusters

[Diagram labels: UniRef50 Clusters; Merge related clusters; PIRCF Families (Computer-generated Families); Checked by curator; PIRSF Families]

  • UniRef90/50 for comprehensive automated classification of proteins

    • Faster searches and less cluttered similarity search outputs

    • More even sampling of sequence space and reduction of search bias

  • UniRef for integrity check of database annotation

    • UniRef100 to annotate EST sequences

    • UniRef50 to detect incorrect gene models

  • UniRef90/50 for PIRSF family classification

    • UniRef90 to recruit new PIRSF family members

    • UniRef50 to create new PIRSF families

Literature Mining

Zhang-Zhi Hu, M.D.

Associate Team Lead, Protein Science, PIR

Research Assistant Professor, GUMC

iProLINK: An Integrated Resource for Protein Literature Mining

  • Complete UniProtKB bibliography mapping

  • RLIMS-P text mining tool for protein phosphorylation

  • BioThesaurus: protein/gene names

PIR/UniProt Protein Bibliography

  • 355,629 unique citations (PMID) are in iProClass for 2.4 million UniProtKB entries.

  • 166,950 (47%) citations are currently in UniProtKB.

  • The additional 188,679 (53%) unique citations are taken from sources such as GeneRIF, SGD, MGI.

Bibliography report:

  • curated citations

  • user submitted

  • computationally mapped

BioThesaurus report

BioThesaurus: comprehensive collection of gene/protein names from multiple sources and their associations with database entities.

Applications of BioThesaurus

  • Gene/protein names mapping

    • Search synonyms

    • Resolve name ambiguity

  • Database annotation

    • Error detection: conflicting names in UniProtKB

  • Literature mining

    • Query expansion: synonyms and text-variants allow for expanded search results
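
The three applications above (synonym search, ambiguity detection, query expansion) all follow from a two-way mapping between names and database entries. The Python sketch below shows one way such a mapping could work; it is not the BioThesaurus implementation, the normalization rule is an assumption, and the entry identifiers and the second entry's name are placeholders.

```python
from collections import defaultdict


def normalize(name):
    """Lowercase and drop non-alphanumerics so text variants collapse together."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


def build_thesaurus(records):
    """records: iterable of (entry_id, name) pairs from any number of sources."""
    name_to_entries = defaultdict(set)
    entry_to_names = defaultdict(set)
    for entry_id, name in records:
        name_to_entries[normalize(name)].add(entry_id)
        entry_to_names[entry_id].add(name)
    return name_to_entries, entry_to_names


def ambiguous_names(name_to_entries):
    """Names attached to more than one entry (candidates for review)."""
    return {n: ids for n, ids in name_to_entries.items() if len(ids) > 1}


def expand_query(name, name_to_entries, entry_to_names):
    """All names recorded for any entry that carries the queried name."""
    synonyms = set()
    for entry_id in name_to_entries.get(normalize(name), ()):
        synonyms |= entry_to_names[entry_id]
    return synonyms


# Placeholder entry IDs; ENTRY_B and its second name are invented.
records = [
    ("ENTRY_A", "IAPP"),
    ("ENTRY_A", "Islet amyloid polypeptide"),
    ("ENTRY_A", "Amylin"),
    ("ENTRY_B", "IAPP"),
    ("ENTRY_B", "Placeholder name B"),
]
name_to_entries, entry_to_names = build_thesaurus(records)
print(ambiguous_names(name_to_entries))
print(expand_query("iapp", name_to_entries, entry_to_names))
```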


IAPP named in 18 entries

Rule-based LIterature Mining System for Protein Phosphorylation

PMID mapping


RLIMS-P report – PMID:1939059

MEDLINE abstract (PubMed ID)



Phosphorylation feature extraction

UniProtKB entry mapping

  • 1876 UniProtKB entries are currently annotated with 4042 phosphorylation sites.

  • 105K unique citations (PMID) are in UniProtKB/Swiss-Prot

  • Batch processing by RLIMS-P yielded 4690 abstracts with phosphorylation information, 913 of them with site information, including 214 in UniProtKB entries with no annotated phosphorylation features.

UniProtKB site feature annotation & evidence attribution
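
To give a feel for rule-based extraction of phosphorylation information, here is a toy Python sketch using two regular-expression patterns. It is far simpler than RLIMS-P and is not derived from its rules: the patterns, the field names (`kinase`, `substrate`, `site`), and the example sentences with their invented protein names are all assumptions for illustration.

```python
import re

TOKEN = r"[A-Za-z0-9-]+"
SITE = r"(?: (?:at|on) (?P<site>(?:Ser|Thr|Tyr)-?\d+))?"

# Two hand-written patterns: active voice and passive voice.
PATTERNS = [
    re.compile(rf"(?P<kinase>{TOKEN}) phosphorylates (?P<substrate>{TOKEN}){SITE}"),
    re.compile(rf"(?P<substrate>{TOKEN}) is phosphorylated by (?P<kinase>{TOKEN}){SITE}"),
]


def extract(sentence):
    """Return kinase, substrate and (optional) site from the first matching pattern."""
    for pattern in PATTERNS:
        match = pattern.search(sentence)
        if match:
            return {
                "kinase": match.group("kinase"),
                "substrate": match.group("substrate"),
                "site": match.group("site"),  # None when no site is stated
            }
    return None


# Invented sentences with placeholder protein names.
print(extract("KinaseA phosphorylates ProteinB at Ser-15."))
print(extract("ProteinB is phosphorylated by KinaseA on Thr-123."))
print(extract("ProteinB binds KinaseA."))
```

Returning `None` for the site when it is not stated reflects the distinction in the batch results above between abstracts with phosphorylation information and the subset that also gives site information.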

NIAID Biodefense Proteomics Program

  • 7 Proteomics Research Centers: Identifying Targets for Therapeutic Interventions “..discovering targets for potential candidates for the next generation of vaccines, therapeutics, and diagnostics”

  • Administrative Resource Center: Support research centers, public distribution of results and protocols

    ..establish a Scientific Working Group, Interoperability Working Group, Data infrastructure and promote awareness of the project so scientists worldwide can utilize these resources.

Administrative Resource

  • Project Management - Social & Scientific Systems (SSS)

    • Meetings and Communications

    • Web Portal

    • NIAID Annual Meeting at PIR May 2006

  • Scientific Coordination - PIR & VBI

    • Scientific Advisory Working Group (SWG)

    • Interoperability Working Group (IWG)

  • Data Infrastructure – PIR & VBI

    • Proteomic Database: Storage and Retrieval (VBI)

    • Data Management and Analysis Tools (PIR/VBI)

    • Integrated Protein Knowledge System (PIR)

[Diagram labels: Data Integration at Admin Center; Master Catalog & Complete Proteomes; Protein ID; Integrated Data at VBI; Data Exchange Format; Controlled Vocabulary; Multiple Data Types from Proteomics Research Centers]

NCI caBIG™ Projects

Baris E. Suzek

Associate Bioinformatics Team Lead

Protein Information Resource, GUMC

About caBIG

  • The cancer Biomedical Informatics Grid - WWW of cancer research

  • National Cancer Institute (NCI) and over 50 cancer centers

  • Goals:

    • Breaking down technical and collaborative barriers within the cancer community

    • Facilitating connectivity and sharing of information through common standards and unifying architecture

  • Addressing not only syntactic but also semantic interoperability


PIR Activities in caBIG

  • Domain Workspaces

    • Clinical Trial Management Systems

    • Integrative Cancer Research Workspace

      • PIR Developer Project: Grid Enablement of PIR

      • PIR Adopter Project (Tester): SEED Genome Annotation Tool

      • PIR Participant (Consultant): Protein informatics tools, databases

    • Tissue Banks and Pathology Tools Workspace

  • Cross Cutting Workspaces

    • Architecture

    • Vocabularies and Common Data Elements

      • PIR Participant: Protein models, objects, vocabularies, ontologies

Grid-Enablement of PIR

  • Goal: UniProt Knowledgebase (UniProtKB) serves as the central protein information resource for cancer research

  • One of four caBIG reference projects

    • PIR (Georgetown University)

    • caTIES (University of Pittsburgh)

    • rProteomics (Duke University)

    • caArray (NCICB/Georgetown)

  • First phase completed

    • UniProtKB is searchable through caGrid browser

  • Second phase to be developed

    • Expose more information from PIR/UniProt databases to caBIG

    • Increase semantic/syntactic interoperability with other services

Current Architecture caGrid 0.5

PIR SEED Adoption

  • SEED Genome Annotation Tool

    • Developer: U Chicago/Argonne National Lab

    • Open source and distributed framework for genome annotation

    • Supports subsystem annotation and metabolic reconstructions

    • Explore functional coupling based on genome context, metabolic pathway, and phylogenetic profile

  • PIR roles

    • Assist development of use cases

    • Create test procedures and test the system

    • Develop user manual