Protein information resource
Protein Information Resource. Oversight and Scientific Advisory Board Meeting November 14, 2005 Georgetown University Medical Center. Welcome and Introduction. Vassilios Papadopoulos, Ph.D. Associate Vice President & Director, Biomedical Graduate Research Organization

Protein information resource

Protein Information Resource

Oversight and Scientific Advisory Board Meeting

November 14, 2005

Georgetown University Medical Center


Welcome and introduction

Welcome and Introduction

Vassilios Papadopoulos, Ph.D.

Associate Vice President &

Director, Biomedical Graduate Research Organization

Georgetown University Medical Center

David States, M.D., Ph.D.

Chair, PIR Oversight and Scientific Advisory Board

Professor & Director of Bioinformatics, University of Michigan


Pir uniprot overview project overview organization infrastructure

PIR/UniProt Overview: Project Overview, Organization, Infrastructure

Cathy H. Wu, Ph.D.

Director, PIR

Professor, Georgetown University Medical Center


Protein information resource pir

Protein Information Resource (PIR)

Integrated Protein Informatics Resource for Genomic/Proteomic Research

  • UniProt Universal Protein Resource: Central Resource of Protein Sequence and Function

  • PIRSF Family Classification System: Protein Classification and Functional Annotation

  • iProClass Integrated Protein Database: Data Integration and Protein Mapping

  • Cyber Infrastructure (Interoperability and Dissemination): Ontology, XML, Object/Relational DB, J2EE Architecture

http://pir.georgetown.edu


Uniprot universal protein resource

UniProt: Universal Protein Resource

Central Resource of Protein Sequence and Function

  • International Consortium

    • Protein Information Resource (PIR)

    • European Bioinformatics Institute (EBI)

    • Swiss Institute of Bioinformatics (SIB)

  • NIH U01 Grant (NHGRI/NIGMS/NLM/NIMH/NCRR/NIDCR)

    • Phase I (09/02-08/05): $6 Million Annual

    • Bridge (09/05-?/06): $6.6M

    • Phase II (?/06-?/09): $6.6-8.0(?)M

http://www.uniprot.org

NHGRI


Uniprot databases

UniProt Archive (UniParc)

Comprehensive sequence archive with sequence history

Produced at EBI

UniProt Reference Clusters (UniRef)

Non-redundant reference clusters for sequence search

Produced at PIR

UniProt Knowledgebase (UniProtKB)

Integration of PIR-PSD, Swiss-Prot and TrEMBL databases

Stable, comprehensive, fully classified, richly and accurately annotated knowledgebase

UniProtKB/Swiss-Prot: Produced at SIB

UniProtKB/TrEMBL: Produced at EBI

Literature-based and automated annotation at SIB, PIR, EBI

UniProt Databases


Uniprot management structure

UniProt Management Structure

  • Scientific Advisory Panel (SAP) to be established by NHGRI


Uniprot project coordination

UniProt Project Coordination

  • UniProt email discussion groups

    • Project Liaisons and Ad hoc teams

  • Tri-weekly teleconference calls

  • Tri-annual face-to-face Consortium meetings

    • January 12-13, 2006 at Geneva

    • April 10-11, 2006 at Georgetown University

  • Exchange visits of scientific and technical staff

    • Five PIR staff at SIB (1-2 weeks, Nov 05) for annotation integration

  • Retreats

France, 2004


Uniprot activities at pir

UniProt Activities at PIR

  • Integration of PIR-PSD into UniProtKB Swiss-Prot/TrEMBL

    • Incorporation of unique PIR entries

    • Incorporation of PIR annotations: references, experimental features with literature evidence tag

  • Functional annotation of UniProtKB proteins

    • Development of PIRSF family classification system & PIRSF curation => Comprehensive coverage of all UniProtKB proteins

    • Development of rule-based annotation system & PIRNR (name rule) /PIRSR (site rule) curation => Rule curation and integration into Swiss-Prot/TrEMBL annotation pipelines & propagation of annotations (e.g., name, GO, site)

  • Production of UniRef100/90/50 databases => Enhancement & scaling

  • Creation of UniProt web site and help system => Unified UniProt web site & user community interaction


Pirsf classification system

PIRSF Classification System

Protein Classification and Functional Annotation

  • PIRSF: Evolutionary relationships of proteins from super- to sub-families

  • Curated families with name rules and site rules

  • Curation platform with classification/visualization tools

  • Deliverables: UniProtKB annotations, InterPro families, PIRSF reports, PIRSF curation platform

PIRSF Work Group Meeting, April 2003


Iproclass integrated protein database

iProClass Integrated Protein Database

Data Integration and Protein Mapping

  • Data integration from >90 databases

  • Underlying data warehouse for protein ID/name/bibliography mapping

  • Integration of protein family, function, structure for functional annotation

  • Rich link (link + summary) for value-added reports of UniProt proteins

Funded by NSF


Iprolink literature mining resource

iProLINK Literature Mining Resource

  • Bibliography report: Annotated bibliography for UniProtKB proteins

  • BioThesaurus reports: Protein and gene names for UniProtKB proteins

  • RLIMS-P program: Tag PubMed abstracts for phosphorylation objects

  • Protein ontology DAG: PIRSF-based ontology

Funded by NSF


Protein information resource

NIAID Proteomic Admin Center

  • NIAID Proteomic Master Catalog & Complete Proteomes

  • iProXpress for Protein Function and Pathway Analysis

    • Gene/Peptide-Protein Mapping

    • Sequence Analysis & Data Mining

    • Function/ Pathway Discovery

http://pir.georgetown.edu/proteomics/

Funded by NIAID


Bioinformatics infrastructure

Bioinformatics Infrastructure

  • NCI caBIG: PIR grid-enablement (Programming access to UniProtKB)

  • NSF TeraGrid: All-against-all BLAST (UniProtKB related sequences)

  • PIR Bioinformatics Framework

    • Software Framework: J2EE n-Tier Architecture with Object Models

    • Database Distribution: XML, FASTA, Relational (Oracle 9i, MySQL)

    • Other Deliverables: Object Models, Web Services

Funded by NCI


Computing environment

Computing Environment

  • Computers: Two Sun V880, IBM P690, 100-CPU Linux Cluster, Compaq 4100 Alpha

  • Networking: Internet2, GU Network (1Gbps)

  • GU UIS Advanced Research Computing


Pir environment

PIR Environment

  • Funding: ~$3Million Annual Total (2/3 UniProt, 1/3 Other)

  • Home Institution: Georgetown University Medical Center (GUMC)

  • Subcontract: National Biomedical Research Foundation (NBRF)

  • New Location: Off-Campus (GU North Campus), 6250 SQFT

Suite 1200, 3300 Whitehaven Street NW, Washington, DC 20007


Pir organization

PIR Organization

  • 25 Staff Members

    • 14 GU, 11 NBRF

  • 22 FTEs

    • 12.7 GU, 9.3 NBRF

  • 17 with Doctorate Degree

  • 11 GU Faculty

    • 2 Professors

    • 1 Research Associate Professor

    • 6 Research Assistant Professors

    • 2 Research Instructors


Pir community interactions since 2004

PIR Community Interactions (since 2004)

  • Presentations and Invited Seminars

    • NIH Proteomics Workshop (Bi-Annual) – Bioinformatics Day

    • Conference Demos/Posters: ISMB-05, US HUPO-05, SOFG04

    • Over 20 Invited Presentations: Keystone, Human Brain Project Satellite Symposium, PDB Symposium, HUPO-05

    • Policy Forums, Committees: NSF Plant Cyberinfrastructure, NIH Protein Structure Initiative, HUPO Proteomics Standards Initiative

  • Publications: Over 25 Refereed Papers and Book Chapters

  • Collaborations and Interactions

    • Collaborated and interacted with over 10 research institutions

    • Hosted face-to-face meetings for NIAID/caBIG projects

  • Paper and Grant Reviews

    • Reviewed over 20 papers for refereed journals and conferences

    • Served on NSF/NIH grant review panels


Pir georgetown interactions

PIR-Georgetown Interactions

  • Teaching

    • Courses: Bioinformatics (BCHB 521), Advanced Bioinformatics (BCHB 621)

    • Lectures: Medical Biochemistry, Protein Biomarker, Introductory Biology

  • Mentoring

    • Mentored 9 graduate students (PhD students, MS Internship projects)

  • Intercampus Seminars

  • Proposal Submission by PIR Young Investigators as PI

    • Six proposals to federal and other agencies


Pir uniprot summary statistics

PIR/UniProt – Summary & Statistics

Database Growth

Database Usage

Unified UniProt WebSite

PIR UniProt Consortium Interactions

Peter McGarvey, Ph.D.


Uniprot universal protein resource

UniProt: Universal Protein Resource

http://www.uniprot.org


Database growth

Database Growth


Protein information resource

Customer Email [email protected] & [email protected]

550 UniProt emails

720 PIR emails

1 Day Turnaround

“PIR is a wonderful resource.” – Craig

“Thank you for your prompt response, as always UniProt is on the ball!” – Fiona


Protein information resource

PIR/UniProt – Unified UniProt Web Site

  • Dec. 03, Three Synchronized Sites based on PIR Design

  • Nov. 04, Established Goals for Unified Web Sites.

  • 2005, Back-end Data and Software Platform Developed.

  • Nov. 05, PIR Playing a Lead Role in Developing Specifications for the Interface.

  • June 06, Release of Unified UniProt Web Site Hosted by PIR and EBI


Pir uniprot consortium interactions

PIR/UniProt - Consortium Interactions

  • UniProt liaison group (discussion of high-level issues)

  • UniProt web site committee (Unified UniProt web site planning)

  • UniProt Link committee (working with external databases)

  • UniProt help-mail (answering user inquiries)

  • UniProt document committee (documentation, tutorials and FAQs)

  • UniProt XML group (XML documentation and maintenance)

  • UniProt group for automatic annotation pipeline

  • Manual curation of Swiss-Prot template sequences

  • Manual curation of site rules and controlled vocabularies

  • Development of automatic annotation rules

  • Development of protein naming guidelines

  • Incorporation of new protein families into InterPro

  • PIR routinely visits or hosts colleagues from EBI and SIB for discussions.

  • Biweekly update of UniRef, UniParc and UniProtKB databases


Protein classification and annotation

Protein Classification and Annotation

Darren Natale, Ph.D.

Team Lead, Protein Science, PIR

Research Assistant Professor, GUMC


Protein curation activities

Protein Curation Activities

  • PIRSF – classification of homeomorphic proteins based on evolutionary relationships

  • PIRNR – family-based “Name Rules” that define the parameters for propagating specific name, EC and GO annotation to members

  • PIRSR – family-based “Site Rules” that define the parameters for propagating specific feature annotation to members


Specialized tools i

Specialized Tools (I)

  • Pfam/PIRSF Hierarchy

  • Domain Relatives

  • Domain Composition

DAG

Preserves these three features in a navigable format

In edit mode, allows easy creation, destruction, and movement of PIRSFs


Specialized tools ii

Specialized Tools (II)

HPS

KGPDC

Phylogenetic Tree Classification/Annotation Alignment

PIR Tree and Alignment Viewer (PIRTAV)

HPS= 3-hexulose-6-phosphate synthase

KGPDC= 3-keto-L-gulonate 6-phosphate decarboxylase


Pirsf curation pipeline

PIRSF Curation Pipeline

  • Uncurated level – computer-generated

  • Preliminary Curation Level

    • Curate membership (principal tools: BLAST results, iterative blastclust, on-the-fly HMM)

    • Curate domain architecture

    • Select seeds

  • Full Curation Level

    • Curate name and some references

    • Optional: write abstract indicating function, structure, etc.

(Full level only)

After name review session and HMM performance check, all information (HMM, membership, annotation) is sent to EBI for integration into InterPro.


Pirnr curation pipeline

PIRNR Curation Pipeline

  • Start with PIRSF curated to Full level

  • Define match criteria for application of the rule

  • Review protein name, synonyms, EC numbers, GO terms

  • Find those that are appropriate to propagate to members that match rule criteria

After review of propagable information, send match conditions, exclusion conditions, and propagated fields to EBI for inclusion into automatic annotation pipeline. Results are displayed in EBI’s UniProt entry extended view.
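As a rough illustration of the pipeline above, a name rule can be pictured as a record holding match conditions, exclusion conditions, and the fields to propagate. The sketch below is hypothetical throughout: the class names, the length-range match criterion, and the sample identifiers are assumptions for illustration and do not reflect the actual rule format exchanged between PIR and EBI.

```python
from dataclasses import dataclass, field


@dataclass
class Protein:
    accession: str
    pirsf_id: str
    length: int
    annotations: dict = field(default_factory=dict)


@dataclass
class NameRule:
    # Hypothetical stand-ins for the match conditions, exclusion
    # conditions, and propagated fields named on the slide.
    pirsf_id: str
    min_length: int
    max_length: int
    excluded_accessions: frozenset
    propagated: dict  # e.g. {"DE": ..., "EC": ..., "GO": ...}

    def matches(self, p: Protein) -> bool:
        """Exclusions win; otherwise family and length must both fit."""
        if p.accession in self.excluded_accessions:
            return False
        return (p.pirsf_id == self.pirsf_id
                and self.min_length <= p.length <= self.max_length)

    def apply(self, p: Protein) -> bool:
        """Copy propagated fields onto a matching protein; report whether the rule fired."""
        if not self.matches(p):
            return False
        for key, value in self.propagated.items():
            # Keep any annotation that is already present.
            p.annotations.setdefault(key, value)
        return True
```

A rule built as `NameRule("PIRSF999999", 100, 300, frozenset({"X00002"}), {"DE": "example name"})` would fire on a 200-residue member of that (made-up) family and skip the excluded accession.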


Pirsr curation pipeline

PIRSR Curation Pipeline

  • Start with PIRSF with curated membership and seeds. At least one member must have solved structure.

  • Edit seed-to-structure alignment to define and retain conserved regions covering pertinent residues

  • Build Site HMM from concatenated conserved regions

  • Define feature annotation using controlled vocabulary with evidence attribution

Apply rules to PIRSF members, create log files to send to SIB (UniProtKB/Swiss-Prot) or EBI (UniProtKB/TrEMBL). Results are incorporated into UniProtKB flat files.
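The propagation step can be sketched with a simple pairwise alignment in place of the Site HMM. This is an assumption made to keep the example short: the function names, the gapped-string input, and the residue check are all hypothetical and only illustrate how a feature position on a structure-backed template might be carried over to a family member.

```python
def map_site(template_aln: str, member_aln: str, template_pos: int):
    """Map a 1-based template residue position to the member via a gapped pairwise alignment.

    Returns (member_pos, member_residue), or None if the member has a gap
    at that column or the position lies outside the template.
    """
    if len(template_aln) != len(member_aln):
        raise ValueError("aligned strings must have equal length")
    t_count = m_count = 0
    for t_char, m_char in zip(template_aln, member_aln):
        if t_char != "-":
            t_count += 1
        if m_char != "-":
            m_count += 1
        if t_char != "-" and t_count == template_pos:
            if m_char == "-":
                return None
            return m_count, m_char
    return None


def propagate_site(template_aln, member_aln, template_pos, allowed_residues, label):
    """Return a feature record only when the member carries an allowed residue at the site."""
    hit = map_site(template_aln, member_aln, template_pos)
    if hit is None:
        return None
    pos, residue = hit
    if residue not in allowed_residues:
        return None
    return {"position": pos, "residue": residue, "feature": label}
```

Requiring an allowed residue at the mapped column mirrors the idea of site rules: a feature is propagated only where the conserved residue is actually present.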


Progress on protein curation activities

Progress on Protein Curation Activities

[Chart; values as they appear on the slide]

  • 162 Preliminary, 693 Full, 352 Full + Desc

  • 428 DE/GO/EC, 342 DE/GO, 157 DE

  • 35 Active, 34 Metal/Binding, 14 Misc.

  • Unlabeled values: 1207, 1001, 83; 561, 420, 251, 112, 38, 14; 4222, 1595, 1266


Impact measurements

Impact Measurements

  • PIRSFs integrated into InterPro

    • Sent: 1,775

    • PIRSF-unique: 840

  • PIRNR touches on UniProtKB/TrEMBL

    • Entries: 60,300

    • Annotation lines: 281,400

  • PIRSR touches on UniProtKB

    • Entries: 41,000 (9,800)

    • Feature lines: 100,000 (27,000)


Increasing throughput impact

Increasing Throughput & Impact

Curated

With Structure

Full

Active + Ligand

To InterPro

AutoAnno

Active

Increased specificity

PIRSF

PIRNR

PIRSR

  • Emphasize Full/InterPro

  • Rules to EBI

  • Active sites

  • Comprehensive coverage

  • Curation “push”

  • Propagation at PIR

  • Add ligand-binding

All three will be integrated into the Swiss-Prot annotation platform



Uniref databases

UniRef Databases

Hongzhan Huang, Ph.D.

Bioinformatics Team Lead

Protein Information Resource, GUMC


Uniref uniprot reference clusters

UniRef (UniProt Reference Clusters)

  • Non-Redundant Reference Clusters for Sequence Searching

  • Derived from UniProtKB and Selected UniParc Sources

    • UniRef100: 100% sequence identity

    • UniRef90: 90% sequence identity (1/3 size reduction from UniRef100)

    • UniRef50: 50% sequence identity (2/3 size reduction)

Release 6.4 (Nov 05)


Uniref100

Sub-fragments

UniRef100

  • The most comprehensive sequence dataset for sequence similarity search

    • 3,176K sequences in UniRef100 vs. 3,022K sequences in NCBI nr

  • Source Sequences

    • Complete UniProtKB - Splice Variants as separate entries

    • Selected UniParc (e.g. Ensembl and RefSeq)

  • Non-Redundancy

    • Combine identical sequences from all species

    • Merge sub-fragments
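The two non-redundancy steps above can be sketched in a few lines. This is a minimal illustration, assuming that a sub-fragment is a sequence contained verbatim in a longer one. The function name and the quadratic substring scan are choices made for this example and are not taken from the UniRef production code.

```python
from collections import defaultdict


def build_nonredundant(entries):
    """entries: iterable of (accession, sequence).

    Returns {representative_sequence: sorted member accessions}.
    """
    # Step 1: combine identical sequences, regardless of species.
    by_seq = defaultdict(list)
    for acc, seq in entries:
        by_seq[seq].append(acc)

    # Step 2: merge sub-fragments into the longest sequence containing them.
    # Visiting longest first guarantees any container is already a representative.
    clusters = {}
    for seq in sorted(by_seq, key=lambda s: (-len(s), s)):
        parent = next((rep for rep in clusters if seq in rep), None)
        if parent is None:
            clusters[seq] = list(by_seq[seq])
        else:
            clusters[parent].extend(by_seq[seq])
    return {rep: sorted(members) for rep, members in clusters.items()}
```

The pairwise scan is fine for a handful of sequences. A dataset of millions of sequences would need an index-based approach instead.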


Uniref90 uniref50

UniRef90 & UniRef50

  • Reduced sequence datasets for faster sequence similarity search

  • Representative sequence for each cluster

  • Clustering Algorithm

    • CD-HIT: Fast, top down, non-overlapping

    • PIR’s parallelized version running on Linux Cluster

UniRef90: 1/3 size reduction; UniRef50: 2/3 size reduction
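The greedy, non-overlapping style of clustering used by CD-HIT can be sketched as follows: sequences are processed from longest to shortest, and each one either joins the first existing representative it resembles closely enough or becomes a new representative. The identity measure here is a crude stand-in built on Python's `difflib`, an assumption made for brevity. CD-HIT itself uses short-word filtering and alignment, and nothing below reflects PIR's parallelized version.

```python
from difflib import SequenceMatcher


def crude_identity(a: str, b: str) -> float:
    """Matched characters divided by the shorter length (a stand-in for alignment identity)."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / min(len(a), len(b))


def greedy_cluster(seqs: dict, threshold: float) -> dict:
    """seqs: {id: sequence}. Returns {representative_id: [member ids]}."""
    # Longest first, so every representative is at least as long as its members.
    order = sorted(seqs, key=lambda i: (-len(seqs[i]), i))
    clusters = {}
    for sid in order:
        for rep in clusters:
            if crude_identity(seqs[sid], seqs[rep]) >= threshold:
                clusters[rep].append(sid)  # joins exactly one cluster
                break
        else:
            clusters[sid] = [sid]  # no representative is close enough
    return clusters
```

Calling `greedy_cluster(seqs, 0.9)` and `greedy_cluster(seqs, 0.5)` on the same input corresponds loosely to the UniRef90 and UniRef50 thresholds.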


Uniref50 sequence classification

UniRef50 Sequence Classification

  • Completely automated, biweekly-updated classification of all proteins

  • How good are the UniRef50 clusters?

    • Evaluated by all-against-all BLAST search results

    • 98% of the clusters are of good quality: each sequence matches every other sequence within the cluster

  • Problematic clusters

    • One long sequence bridges two or more non-related sub-clusters.

    • May result from incorrect gene models, domain fusions, or polyproteins

    • New algorithm will be developed with length/overlap parameters to detect and regroup such clusters.
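The quality criterion above (every sequence matches every other) and the bridging problem can both be expressed on a graph of pairwise hits. The sketch below is one possible formulation, assuming the all-against-all BLAST results have been reduced to a set of unordered ID pairs. It relies on plain graph connectivity and is not the length/overlap-based algorithm the slide says is planned.

```python
from itertools import combinations


def is_good_cluster(members, hits):
    """True when every pair of members has a reported similarity hit."""
    return all(frozenset(pair) in hits for pair in combinations(members, 2))


def connected_components(nodes, hits):
    """Group nodes that are linked, directly or indirectly, by hits."""
    nodes = set(nodes)
    components = []
    while nodes:
        stack = [nodes.pop()]
        comp = set(stack)
        while stack:
            cur = stack.pop()
            linked = {n for n in nodes if frozenset((cur, n)) in hits}
            nodes -= linked
            comp |= linked
            stack.extend(linked)
        components.append(comp)
    return components


def bridging_sequences(members, hits):
    """Members whose removal splits the rest of the cluster into unlinked groups."""
    return sorted(
        m for m in members
        if len(connected_components([x for x in members if x != m], hits)) > 1
    )
```

A cluster flagged by `bridging_sequences` could then be regrouped around the components that remain once the bridging sequence is set aside.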


Usages of uniref clusters

PIRCF Families

(Computer-generated Families)

UniRef50 Clusters

PIRSF Families

Merge related clusters

Checked by curator

Usages of UniRef Clusters

  • UniRef90/50 for comprehensive automated classification of proteins

    • Faster searches and less cluttered similarity search outputs

    • More even sampling of sequence space and reduction of search bias

  • UniRef for integrity check of database annotation

    • UniRef100 to annotate EST sequences

    • UniRef50 to detect incorrect gene models

  • UniRef90/50 for PIRSF family classification

    • UniRef90 to recruit new PIRSF family members

    • UniRef50 to create new PIRSF families


Literature mining

Literature Mining

Zhang-Zhi Hu, M.D.

Associate Team Lead, Protein Science, PIR

Research Assistant Professor, GUMC


iProLINK: An Integrated Resource for Protein Literature Mining

  • Complete UniProtKB bibliography mapping

  • RLIMS-P text mining tool for protein phosphorylation

  • BioThesaurus: protein/gene names


Pir uniprot protein bibliography

PIR/UniProt Protein Bibliography

  • 355,629 unique citations (PMID) are in iProClass for 2.4 million UniProtKB entries.

  • 166,950 (47%) citations are currently in UniProtKB.

  • The additional 188,679 (53%) unique citations are taken from sources such as GeneRIF, SGD, MGI.

Bibliography report:

  • curated citations

  • user submitted

  • computationally mapped


Protein information resource

BioThesaurus report

BioThesaurus: comprehensive collection of gene/protein names from multiple sources and their associations with database entities.

Applications of BioThesaurus

  • Gene/protein names mapping

    • Search synonyms

    • Resolve name ambiguity

  • Database annotation

    • Error detection: conflicting names in UniProtKB

  • Literature mining

    • Query expansion: synonyms and text-variants allow for expanded search results

IAPP

IAPP named

in 18 entries
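The applications listed above (synonym search, ambiguity detection, query expansion) all rest on a two-way mapping between names and database entries. A minimal sketch of such a mapping follows. The class, its normalization rule, and any entry IDs used with it are hypothetical and are not drawn from BioThesaurus itself.

```python
from collections import defaultdict


class NameIndex:
    """Toy two-way index between names and database entries."""

    def __init__(self):
        self._entries_by_name = defaultdict(set)
        self._names_by_entry = defaultdict(set)

    @staticmethod
    def _normalize(name: str) -> str:
        # Case- and punctuation-insensitive key so text variants collide.
        return "".join(ch for ch in name.lower() if ch.isalnum())

    def add(self, entry_id: str, name: str) -> None:
        self._entries_by_name[self._normalize(name)].add(entry_id)
        self._names_by_entry[entry_id].add(name)

    def entries_for(self, name: str) -> set:
        return set(self._entries_by_name.get(self._normalize(name), ()))

    def is_ambiguous(self, name: str) -> bool:
        """A name is ambiguous when it points to more than one entry."""
        return len(self.entries_for(name)) > 1

    def expand_query(self, name: str) -> set:
        """All names recorded for any entry that carries the given name."""
        expanded = set()
        for entry_id in self.entries_for(name):
            expanded |= self._names_by_entry[entry_id]
        return expanded
```

In this picture, a name that appears in many entries would make `is_ambiguous` return True, and `expand_query` would return every synonym recorded for those entries.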


Rule-based literature mining system for protein phosphorylation

kinase

substrate

sites

PMID mapping

RLIMS-P: Rule-based LIterature Mining System for Protein Phosphorylation

RLIMS-P report – PMID:1939059

MEDLINE abstract (PubMed ID)

P12957

RLIMS-P

Phosphorylation feature extraction

UniProtKB entry mapping

  • 1876 UniProtKB entries are currently annotated with 4042 phosphorylation sites.

  • 105K unique citations (PMID) are in UniProtKB/Swiss-Prot

  • Batch processing by RLIMS-P yielded 4690 abstracts with phosphorylation information, 913 of them with site information, including 214 in UniProtKB entries with no annotated phosphorylation features.

UniProtKB site feature annotation & evidence attribution
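To give a feel for rule-based extraction of the three phosphorylation objects (kinase, substrate, site), here is a toy example with a single hand-written pattern. It is an assumption-laden sketch: RLIMS-P's actual rules are not described on these slides, and the pattern, function name, and sample sentence are invented for illustration.

```python
import re

# One illustrative pattern: "<kinase> phosphorylates <substrate> [at|on <site>]".
PATTERN = re.compile(
    r"(?P<kinase>[A-Za-z0-9-]+)\s+phosphorylates\s+"
    r"(?P<substrate>[A-Za-z0-9-]+)"
    r"(?:\s+(?:at|on)\s+(?P<site>(?:Ser|Thr|Tyr)-?\d+))?"
)


def extract_phosphorylation(text: str) -> list:
    """Return one record per match; 'site' is None when no site is stated."""
    return [
        {"kinase": m["kinase"], "substrate": m["substrate"], "site": m["site"]}
        for m in PATTERN.finditer(text)
    ]
```

A real system needs many such patterns, plus handling for passive voice, nominalizations ("phosphorylation of X by Y"), and multi-word protein names. This sketch covers none of those.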


Protein information resource

NIAID Biodefense Proteomics Program

Peter McGarvey, Ph.D.


Protein information resource

NIAID Biodefense Proteomics Program

  • 7 Proteomics Research Centers: Identifying Targets for Therapeutic Interventions “..discovering targets for potential candidates for the next generation of vaccines, therapeutics, and diagnostics”

  • Administrative Resource Center: Support research centers, public distribution of results and protocols

    ..establish a Scientific Working Group, Interoperability Working Group, Data infrastructure and promote awareness of the project so scientists worldwide can utilize these resources.


Administrative resource

Administrative Resource

  • Project Management - Social & Scientific Systems (SSS)

    • Meetings and Communications

    • Web Portal

    • NIAID Annual Meeting at PIR May 2006

  • Scientific Coordination - PIR & VBI

    • Scientific Advisory Working Group (SWG)

    • Interoperability Working Group (IWG)

  • Data Infrastructure – PIR & VBI

    • Proteomic Database: Storage and Retrieval (VBI)

    • Data Management and Analysis Tools (PIR/VBI)

    • Integrated Protein Knowledge System (PIR)


Protein information resource

Proteomics Program Interaction Map


Protein information resource

iProClass

PIRSF

UniProt

Data Integration

at Admin Center

Master Catalog &

Complete Proteomes

at GU-PIR

Protein ID

Peptide/Protein

Sequence

Mapping

Integrated Data

at VBI

Data Exchange Format

Controlled Vocabulary

Ontology

Multiple Data Types

from Proteomics

Research Centers


Nci cabig projects

NCI caBIG™ Projects

Baris E. Suzek

Associate Bioinformatics Team Lead

Protein Information Resource, GUMC


About cabig

About caBIG

  • The cancer Biomedical Informatics Grid - WWW of cancer research

  • National Cancer Institute (NCI) and over 50 cancer centers

  • Goals:

    • Breaking down technical and collaborative barriers within the cancer community

    • Facilitating connectivity and sharing of information through common standards and unifying architecture

  • Addressing not only syntactic but also semantic interoperability

  • https://cabig.nci.nih.gov


Pir activities in cabig

PIR Activities in caBIG

  • Domain Workspaces

    • Clinical Trial Management Systems

    • Integrative Cancer Research Workspace

      • PIR Developer Project: Grid Enablement of PIR

      • PIR Adopter Project (Tester): SEED Genome Annotation Tool

      • PIR Participant (Consultant): Protein informatics tools, databases

    • Tissue Banks and Pathology Tools Workspace

  • Cross Cutting Workspaces

    • Architecture

    • Vocabularies and Common Data Elements

      • PIR Participant: Protein models, objects, vocabularies, ontologies


Grid enablement of pir

Grid-Enablement of PIR

  • Goal: UniProt Knowledgebase (UniProtKB) serves as the central protein information resource for cancer research

  • One of four caBIG reference projects

    • PIR (Georgetown University)

    • caTIES (University of Pittsburgh)

    • rProteomics (Duke University)

    • caArray (NCICB/Georgetown)

  • First phase completed

    • UniProtKB is searchable through caGrid browser

  • Second phase to be developed

    • Expose more information from PIR/UniProt databases to caBIG

    • Increase semantic/syntactic interoperability with other services

Current Architecture caGrid 0.5


Pir seed adoption

PIR SEED Adoption

  • SEED Genome Annotation Tool

    • Developer: U Chicago/Argonne National Lab

    • Open source and distributed framework for genome annotation

    • Support subsystems annotation and metabolic reconstructions

    • Explore functional coupling based on genome context, metabolic pathway, and phylogenetic profile

  • PIR roles

    • Assist development of use cases

    • Create test procedures and test the system

    • Develop user manual

