FUNCTIONAL ANALYSIS OF PROTEIN SEQUENCES:

ANNOTATION AND FAMILY CLASSIFICATION

Anastasia Nikolskaya

PIR (Protein Information Resource),

Georgetown University Medical Center


Overview

Problem:

  • Most new protein sequences come from genome sequencing projects

  • Many have unknown functions

  • Large-scale functional annotation of these sequences based simply on BLAST best hit has pitfalls; results are far from perfect

Functional Analysis of Protein Sequences:

  • Homology-based (sequence analysis, structure analysis)

  • Non-homology (genome context, phylogenetic distribution)

Solution for Large-scale Annotation:

  • Highly curated and annotated protein classification system

  • Automatic annotation of sequences based on protein families

PIRSF Protein Classification System

  • Full-length protein family classification based on evolution

  • Highly annotated, optimized for annotation propagation

  • Functional predictions for uncharacterized proteins

  • Used to facilitate and standardize annotations in UniProt


Proteomics and Bioinformatics

  • Data: Gene expression profiling (genome-wide analysis of gene expression)

  • Data: Protein-protein interaction

  • Data: Structural genomics (3D structures of all protein families)

  • Data: Genome projects (Sequencing)

  • ….

Bioinformatics

Computational analysis and integration of these data

Making predictions (function etc), reconstructing pathways


What’s In It For Me?

  • When an experiment yields a sequence (or a set of sequences), we need to find out as much as we can about this protein and its possible function from available data

  • Especially important for poorly characterized or uncharacterized (“hypothetical”) proteins

  • More challenging for large sets of sequences generated by large-scale proteomics experiments

  • The quality of this assessment is often critical for interpreting experimental results and forming hypotheses for future experiments

    Sequence -> function


[Figure: Genomic DNA sequence -> gene recognition (promoter, 5' UTR, exons 1-3, introns, 3' UTR) -> protein sequence -> structure determination, function analysis, family classification -> protein structure, protein family, molecular evolution, gene network, metabolic pathway]

Work with Protein, not DNA Sequence

DNA Sequence -> Gene -> Protein Sequence -> Function


The Changing Face of Protein Science

20th century

  • Few well-studied proteins

  • Mostly globular with enzymatic activity

  • Biased protein set

21st century

  • Many “hypothetical” proteins (most new proteins come from genome sequencing projects; many have unknown functions)

  • Various, often with no enzymatic activity

  • Natural protein set

Credit: Dr. M. Galperin, NCBI


Knowing the Complete Genome Sequence

Advantages:

  • All encoded proteins can be predicted and identified

  • The missing functions can be identified and analyzed

  • Peculiarities and novelties in each organism can be studied

  • Predictions can be made and verified

Challenge:

  • Accurate assignment of known or predicted functions (functional annotation)


                                E. coli   M. jannaschii   S. cerevisiae   H. sapiens
Characterized experimentally       2046              97            3307        10189
Characterized by similarity        1083            1025            1055        10901
Unknown, conserved                  285             211            1007         2723
Unknown, no similarity              874             411             966         7965

from Koonin and Galperin, 2003, with modifications
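
To compare the four genomes on one scale, the counts in the table can be converted to per-genome shares. The following is a minimal sketch: the names `COUNTS`, `totals` and `shares` are illustrative, and each genome's total is assumed to be simply the sum of the four categories listed.

```python
# Counts transcribed from the table above
# (rows: annotation status, columns: genome).
GENOMES = ["E. coli", "M. jannaschii", "S. cerevisiae", "H. sapiens"]
COUNTS = {
    "Characterized experimentally": [2046, 97, 3307, 10189],
    "Characterized by similarity": [1083, 1025, 1055, 10901],
    "Unknown, conserved": [285, 211, 1007, 2723],
    "Unknown, no similarity": [874, 411, 966, 7965],
}


def totals(counts):
    """Sum the four categories for each genome (column-wise sum)."""
    return [sum(column) for column in zip(*counts.values())]


def shares(counts):
    """Return, per category, the fraction of each genome's proteins."""
    per_genome_total = totals(counts)
    return {
        category: [value / total for value, total in zip(values, per_genome_total)]
        for category, values in counts.items()
    }


if __name__ == "__main__":
    for category, row in shares(COUNTS).items():
        cells = "  ".join(f"{g}: {s:.1%}" for g, s in zip(GENOMES, row))
        print(f"{category:30s}{cells}")
```

Under this assumption, only a small share of M. jannaschii proteins falls in the experimentally characterized row, which is why annotation by similarity carries so much weight for such genomes.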


Functional Annotation for Different Groups of Proteins

  • Experimentally characterized

    • Find up-to-date information, accurate interpretation

  • Characterized by similarity (“knowns”) = closely related to experimentally characterized proteins

    • Avoid propagation of errors

  • Function can be predicted (no close sequence similarity, may be distant similarity to characterized proteins)

    • Extract maximum possible information, avoid errors and overpredictions

    • Most value-added (fill the gaps in metabolic pathways, etc)

  • “Unknowns” (conserved or unique)

    • Rank by importance


How are Protein Sequences Annotated?

“regular approach”

Protein Sequence

Function

Automatic assignment based on sequence similarity (best BLAST hit):

gene name, protein name, function

Large-scale functional annotation of sequences based simply on BLAST best hit has pitfalls; results are far from perfect

To avoid mistakes, need human intervention (manual annotation)

Quality vs Quantity


Problems in Functional Assignments for “Knowns”

  • Misinterpreted experimental results (e.g. suppressors, cofactors)

  • Biologically senseless annotations

    Arabidopsis: separation anxiety protein-like

    Helicobacter: brute force protein

    Methanococcus: centromere-binding protein

    Plasmodium: frameshift

  • “Goofy” mistakes of sequence comparison (e.g. abc1/ABC)

  • Multi-domain organization of proteins

  • Low sequence complexity (coiled-coil, transmembrane, non-globular regions)

  • Enzyme evolution:

    • Divergence in sequence and function (minor mutation in active site)

    • Non-orthologous gene displacement: convergent evolution


Problems in Functional Assignments for “Knowns”: multi-domain organization of proteins

[Figure: a new sequence containing an ACT domain is searched with BLAST; the top hit is a chorismate mutase composed of a chorismate mutase domain and an ACT domain]

In BLAST output, top hits are to chorismate mutases -> the name “chorismate mutase” is automatically assigned to the new sequence. ERROR! (The protein gets an erroneous name and EC number, is assigned to an erroneous pathway, etc.)
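
One way to guard against this error is to check which domains of the hit the alignment actually covers before transferring a name. The sketch below is illustrative only: the `Domain` class, the function names, the 0.8 coverage cutoff and the domain coordinates are all assumptions, not part of any BLAST output format or PIR procedure.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class Domain:
    name: str
    start: int  # 1-based, inclusive, on the database hit
    end: int


def covered_fraction(domain: Domain, aln_start: int, aln_end: int) -> float:
    """Fraction of the domain that lies inside the aligned region of the hit."""
    overlap = min(domain.end, aln_end) - max(domain.start, aln_start) + 1
    return max(0, overlap) / (domain.end - domain.start + 1)


def supported_domains(hit_domains: List[Domain], aln_start: int, aln_end: int,
                      min_cov: float = 0.8) -> List[str]:
    """Names of the hit's domains that the alignment covers well enough."""
    return [d.name for d in hit_domains
            if covered_fraction(d, aln_start, aln_end) >= min_cov]


def name_transfer_ok(hit_domains: List[Domain], aln_start: int, aln_end: int,
                     required: Set[str], min_cov: float = 0.8) -> bool:
    """Transfer the hit's name only if every domain the name depends on is covered."""
    return required <= set(supported_domains(hit_domains, aln_start, aln_end, min_cov))


# Hypothetical hit: a chorismate mutase with a CM domain and an ACT domain.
HIT = [Domain("CM", 1, 90), Domain("ACT", 280, 360)]

if __name__ == "__main__":
    # The new sequence aligns only to the ACT region of the hit.
    print(supported_domains(HIT, 275, 365))
    print(name_transfer_ok(HIT, 275, 365, required={"CM"}))
```

Here the alignment supports only the ACT domain, so the name “chorismate mutase”, which depends on the CM domain, would not be transferred.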


Problems in Functional Assignments for “Knowns”

Previous low-quality annotations lead to propagation of mistakes


Functional Prediction: I. Sequence and Structure Analysis (homology-based methods)

In non-obvious cases:

  • Sophisticated database searches (PSI-BLAST, HMM)

  • Detailed manual analysis of sequence similarities

  • Structure-guided alignments and structure analysis

Often, only general function can be predicted:

  • Enzyme activity can be predicted, the substrate remains unknown (ATPases, GTPases, oxidoreductases, methyltransferases, acetyltransferases)

  • Helix-turn-helix motif proteins (predicted transcriptional regulators)

  • Membrane transporters


Using Sequence Analysis:

Hints

  • Proteins (domains) with different 3D folds are not homologous (unrelated by origin). Proteins with similar 3D folds are usually (but not always) homologous

  • Those amino acids that are conserved in divergent proteins within a (super)family are likely to be functionally important (catalytic or binding sites, etc.).

  • Reaction chemistry often remains conserved even when sequence diverges almost beyond recognition
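
The second hint can be illustrated by scanning a multiple sequence alignment for columns where one residue dominates. This is a minimal sketch under stated assumptions: the alignment is a toy example invented for illustration, the function name and the `min_fraction` parameter are hypothetical, and real analyses usually score conservation with substitution-aware measures rather than plain identity.

```python
from collections import Counter
from typing import List, Tuple


def conserved_columns(alignment: List[str], min_fraction: float = 1.0,
                      gap: str = "-") -> List[Tuple[int, str]]:
    """Return (0-based column index, residue) for columns where a single
    residue occurs in at least min_fraction of all aligned sequences."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    result = []
    for i in range(length):
        # Count residues in this column, ignoring gap characters.
        counts = Counter(seq[i] for seq in alignment if seq[i] != gap)
        if not counts:
            continue
        residue, n = counts.most_common(1)[0]
        if n / len(alignment) >= min_fraction:
            result.append((i, residue))
    return result


# Toy alignment of four divergent sequences (invented for illustration).
TOY_ALIGNMENT = [
    "MKHGD-LS",
    "AKHGDTLA",
    "TRHGDSVS",
    "-QHGEALT",
]

if __name__ == "__main__":
    print(conserved_columns(TOY_ALIGNMENT))        # strictly invariant columns
    print(conserved_columns(TOY_ALIGNMENT, 0.75))  # allow one deviating sequence
```

Columns that stay invariant across divergent family members are the first candidates to check against known catalytic or binding residues.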


Using Sequence Analysis:

Hints

  • Prediction of 3D fold (if distant homologs have known structures!) and of general biochemical function is much easier than prediction of exact biological function

  • Sequence analysis complements structural comparisons and can greatly benefit from them

  • Comparative analysis allows us to find subtle sequence similarities in proteins that would not have been noticed otherwise

Credit: Dr. M. Galperin, NCBI


Structural Genomics: Structure-Based Functional Predictions

Protein Structure Initiative: Determine 3D structures of all protein families

Methanococcus jannaschii MJ0577 (Hypothetical Protein)

Contains bound ATP => ATPase or ATP-Mediated Molecular Switch

Confirmed by biochemical experiments


Crystal Structure is Not a Function!

Credit: Dr. M. Galperin, NCBI


Functional Prediction: II. Computational Analysis Beyond Homology

  • Phylogenetic distribution (comparative genomics)

    • Wide - most likely essential

    • Narrow - probably clade-specific

    • Patchy - most intriguing

  • Domain association – “Rosetta Stone”

  • Genome context (gene neighborhood, operon organization)

Clues: specific to niche, pathway type
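
The wide / narrow / patchy distinction can be sketched as a simple rule over a presence/absence profile. Everything below is an assumption made for illustration: the genome and clade labels are hypothetical, the 0.8 cutoff for "wide" is arbitrary, and "narrow" is approximated as "present in a single clade only".

```python
from typing import Dict


def classify_distribution(presence: Dict[str, bool], clade_of: Dict[str, str],
                          wide_cutoff: float = 0.8) -> str:
    """Label a protein family's phylogenetic distribution.

    presence: genome name -> True if the family is found in that genome
    clade_of: genome name -> clade label
    """
    present = [genome for genome, found in presence.items() if found]
    if not present:
        return "absent"
    if len(present) / len(presence) >= wide_cutoff:
        return "wide"      # found almost everywhere: most likely essential
    if len({clade_of[genome] for genome in present}) == 1:
        return "narrow"    # confined to one clade: probably clade-specific
    return "patchy"        # scattered across clades: most intriguing


# Hypothetical genomes and clades.
CLADES = {"g1": "Bacteria", "g2": "Bacteria", "g3": "Archaea",
          "g4": "Archaea", "g5": "Eukaryota"}

if __name__ == "__main__":
    profile = {"g1": True, "g2": False, "g3": False, "g4": False, "g5": True}
    print(classify_distribution(profile, CLADES))
```

A production analysis would weigh the species tree and sampling bias; the point here is only that each category follows from the presence profile.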


Using Genome Context for Functional Prediction

SEED analysis tool (by FIG)

Embden-Meyerhof and Gluconeogenesis pathway:

6-phosphofructokinase (EC 2.7.1.11)
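
The idea behind genome context can be sketched without any particular tool: count which gene families repeatedly sit next to a gene of known function across genomes. The code below is not the SEED interface; the function name, the genome layouts and the family labels ("PFK", "famX", and so on) are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def neighbor_counts(genomes: Dict[str, List[str]], query: str,
                    window: int = 1) -> Counter:
    """Count, across genomes, how many genomes have each family within
    `window` genes of the query family (gene order given as a list)."""
    counts = Counter()
    for gene_order in genomes.values():
        neighbors = set()
        for i, family in enumerate(gene_order):
            if family == query:
                lo, hi = max(0, i - window), i + window + 1
                neighbors.update(f for f in gene_order[lo:hi] if f != query)
        counts.update(neighbors)  # each genome contributes at most once per family
    return counts


# Hypothetical gene orders; "famX" is an uncharacterized family.
GENOMES = {
    "genome1": ["famA", "PFK", "famX", "famB"],
    "genome2": ["famC", "famX", "PFK", "famD"],
    "genome3": ["famX", "famE", "famF", "PFK"],
}

if __name__ == "__main__":
    print(neighbor_counts(GENOMES, "PFK").most_common(1))
```

A family that recurs next to a glycolytic gene in unrelated genomes becomes a candidate for a role in the same pathway, to be checked by other evidence.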


Functional Prediction: Problem Areas

  • Identification of protein-coding regions

  • Delineation of potential function(s) for distant paralogs

  • Identification of domains in the absence of close homologs

  • Analysis of proteins with low sequence complexity


What to do with a new protein sequence

  • Basic:

    • Domain analysis (SMART = most sensitive; Pfam = most complete; CDD)

    • BLAST

    • Curated protein family databases (PIRSF, InterPro, COGs)

    • Literature (PubMed) via links from individual entries in BLAST output (look for SwissProt entries first)

  • If not sufficient:

    • PSI-BLAST

    • Refined PubMed search using gene/protein names, synonyms, function and other terms you found

    • Genome neighborhood (prokaryotes)

  • Advanced:

    • Multiple sequence alignments (manual)

    • Structure-guided alignments and structure analysis

    • Phylogenetic tree reconstruction


Case Study:

Prediction Verified: GGDEF domain

  • Proteins containing this domain: Caulobacter crescentus PleD controls the swarmer cell to stalk cell transition (Hecht and Newton, 1995). In Rhizobium leguminosarum and Acetobacter xylinum, such proteins are required for cellulose biosynthesis (regulation)

  • Predicted to be involved in signal transduction because it is found in fusions with other signaling domains (receiver, etc)

  • In Acetobacter xylinum, cyclic di-GMP is a specific nucleotide regulator of cellulose synthase (signalling molecule). Multidomain protein with GGDEF domain was shown to have diguanylate cyclase activity (Tal et al., 1998)

  • Detailed sequence analysis tentatively predicts GGDEF to be a diguanylate cyclase domain (Pei and Grishin, 2001)

  • Complementation experiments prove diguanylate cyclase activity of GGDEF (Ausmees et al., 2001)


The Need for Classification

Problem:

  • Most new protein sequences come from genome sequencing projects

  • Many have unknown functions

  • Large-scale functional annotation of these sequences based simply on BLAST best hit has pitfalls; results are far from perfect

  • Manual annotation of individual proteins is not efficient

Solution:

  • Highly curated and annotated protein classification system

  • Automatic annotation of sequences based on protein families

Facilitates:

  • Good-quality, large-scale annotation

  • Systematic correction of annotation errors

  • Protein name standardization

  • Functional predictions for uncharacterized proteins

This all works only if the system is optimized for annotation


Levels of Protein Classification


Protein Evolution

Domain: Evolutionary/Functional/Structural Unit

Domain shuffling

Sequence changes

With enough similarity, one can trace back to a common origin

What about these?


Consequences of Domain Shuffling

CM = chorismate mutase

PDH = prephenate dehydrogenase

PDT = prephenate dehydratase

ACT = regulatory domain

[Figure: proteins built from different combinations of CM (AroQ type), PDH, PDT and ACT domains fall into separate families: PIRSF006786, PIRSF001501, PIRSF001499, PIRSF005547, PIRSF001424, PIRSF001500. Question posed for a new sequence: CM? PDH? PDT? CM/PDH? CM/PDT?]


Whole Protein = Sum of its Parts?

PIRSF006256

Domain architecture: Acylphosphatase - ZnF - ZnF - YrdC - Peptidase M22

On the basis of domain composition alone, biological function was predicted to be:

● RNA-binding translation factor

● maturation protease

Actual function:

● [NiFe]-hydrogenase maturation factor, carbamoyltransferase

Full-length protein functional annotation is best done using annotated full-length protein families


Practical classification of proteins: setting realistic goals

We strive to reconstruct the natural classification of proteins to the fullest possible extent

BUT

Domain shuffling rapidly degrades the continuity in the protein structure (faster than sequence divergence degrades similarity)

THUS

The further we extend the classification, the finer the domain structure we need to consider

SO

We need to compromise between the depth of analysis and protein integrity

OR …

Credit: Dr. Y. Wolf, NCBI


Complementary Approaches

Domain Classification

Allows a hierarchy that can trace evolution to the deepest possible level, the last point of traceable homology and common origin

Can usually annotate only general biochemical function

Full-length Protein Classification

Cannot build a hierarchy deep along the evolutionary tree because of domain shuffling

Can usually annotate specific biological function (preferred to annotate individual proteins)

  • Can map domains onto proteins

  • Can classify proteins even when domains are not defined


Protein Classification Databases

  • Full-length protein classification

    • PIRSF

  • Domain classification

    • Pfam

    • SMART

    • CDD

  • Mixed

    • TIGRFAMS

    • COGs

  • Based on structural fold

    • SCOP

InterPro: integrates various types of classification databases


InterPro

Integrated resource for protein families, domains and sites. Combines a number of databases: PROSITE, PRINTS, Pfam, SMART, ProDom, TIGRFAMs, PIRSF

Example: SF001500, bifunctional chorismate mutase/prephenate dehydratase (domains shown: CM, ACT, PDT)


The Ideal System…

  • Comprehensive: each sequence is classified either as a member of a family or as an “orphan” sequence

  • Hierarchical: families are united into superfamilies on the basis of distant homology, and divided into subfamilies on the basis of close homology

  • Allows for simultaneous use of the full-length protein and domain information (domains mapped onto proteins)

  • Allows for automatic classification/annotation of new sequences when these sequences are classifiable into the existing families

  • Expertly curated membership, family name, function, background, etc.

  • Evidence attribution (experimental vs predicted)


PIRSF Classification System

http://pir.georgetown.edu/

  • PIRSF:

    • Reflects evolutionary relationships of full-length proteins

    • A network structure from superfamilies to subfamilies

  • Definitions:

    • Homeomorphic Family: Basic Unit

    • Homologous: Common ancestry, inferred by sequence similarity

    • Homeomorphic: Full-length similarity & common domain architecture

    • Hierarchy: Flexible number of levels with varying degrees of sequence conservation

    • Network Structure: allows multiple parents

  • Advantages:

    • Annotate both general biochemical and specific biological functions

    • Accurate propagation of annotation and development of standardized protein nomenclature and ontology


PIRSF Classification System

A protein may be assigned to only one homeomorphic family, which may have zero or more child nodes and zero or more parent nodes. Each homeomorphic family may have as many domain superfamily parents as its members have domains.
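
These constraints describe a directed graph rather than a strict tree. The sketch below models them with hypothetical class names and placeholder identifiers (`SF_A`, `FAM_1`, and so on); it is not the PIRSF database schema, only an illustration of "one homeomorphic family per protein, any number of parents and children per family".

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class Node:
    node_id: str
    level: str  # "superfamily", "family" (homeomorphic) or "subfamily"
    parents: Set[str] = field(default_factory=set)
    children: Set[str] = field(default_factory=set)


class Classification:
    def __init__(self) -> None:
        self.nodes: Dict[str, Node] = {}
        self.family_of: Dict[str, str] = {}  # protein -> homeomorphic family

    def add_node(self, node_id: str, level: str) -> None:
        self.nodes[node_id] = Node(node_id, level)

    def link(self, parent_id: str, child_id: str) -> None:
        """Network structure: a node may receive any number of parents."""
        self.nodes[parent_id].children.add(child_id)
        self.nodes[child_id].parents.add(parent_id)

    def assign(self, protein: str, family_id: str) -> None:
        """A protein belongs to exactly one homeomorphic family."""
        if self.nodes[family_id].level != "family":
            raise ValueError("proteins are assigned to homeomorphic families")
        current = self.family_of.get(protein)
        if current is not None and current != family_id:
            raise ValueError(f"{protein} is already assigned to {current}")
        self.family_of[protein] = family_id


def build_example() -> Classification:
    """Two-domain family with one domain superfamily parent per domain."""
    c = Classification()
    for node_id, level in [("SF_A", "superfamily"), ("SF_B", "superfamily"),
                           ("FAM_1", "family"), ("FAM_2", "family"),
                           ("SUB_1a", "subfamily")]:
        c.add_node(node_id, level)
    c.link("SF_A", "FAM_1")
    c.link("SF_B", "FAM_1")
    c.link("FAM_1", "SUB_1a")
    c.assign("protein_1", "FAM_1")
    return c
```

Storing parents as a set on each node is what allows a two-domain family to hang under two domain superfamilies at once.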


Creation and Curation of PIRSFs

  • Computer-Generated (Uncurated) Clusters

  • Preliminary Curation (4,700 PIRSFs)

    • Membership

    • Signature Domains

  • Full Curation (3,300 PIRSFs)

    • Family Name, Description, Bibliography

    • PIRSF Name Rules

[Figure: curation workflow. Labels: UniProtKB proteins; automatic clustering; preliminary homeomorphic families; orphans; map domains on families; merge/split clusters; add/remove members; computer-assisted manual curation; curated homeomorphic families; name, refs, description; protein name rule/site rule; final homeomorphic families; create hierarchies (superfamilies/subfamilies); build and test HMMs; new proteins; automatic placement; unassigned proteins; automatic procedure]


PIRSF Family Report: Curated Protein Family Information

Taxonomic distribution of PIRSF can be used to infer evolutionary history of the proteins in the PIRSF

Phylogenetic tree and alignment view allows further sequence analysis


PIRSF Protein Classification: Platform for Protein Analysis and Annotation

  • Improves automatic annotation quality

  • Serves as a protein analysis platform for broad range of users

  • Matching a protein sequence to a curated protein family rather than searching against a protein database

  • Provides value-added information by expert curators, e.g., annotation of uncharacterized hypothetical proteins (functional predictions)


Family-Driven Protein Annotation

Objective: Optimize for protein annotation

  • PIRSF Classification Name

    • Reflects the function when possible

    • Indicates the maximum specificity that still describes the entire group

    • Standardized format

    • Name tags: validated, tentative, predicted, functionally heterogeneous

  • Hierarchy

    • Subfamilies increase specificity (kinase -> sugar kinase -> hexokinase)


Family-Driven Protein Annotation: Site Rules and Name Rules

Goal: Automatic annotation of sequences based on protein families to address the quality versus quantity problem

  • Define conditions under which family properties propagate to individual proteins

  • Propagate protein name, function, functional sites, EC, GO terms, pathway

  • Enable further specificity based on taxonomy or motifs

  • Account for functional variations within one PIRSF, including:

    - Lack of active site residues necessary for enzymatic activity

    - Certain activities relevant only to one part of the taxonomic tree

    - Evolutionarily-related proteins whose biochemical activities are known to differ
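
The conditions listed above can be read as a small rule check applied before a family name is propagated. This is a hedged sketch, not the actual PIRSF rule format: the `NameRule` and `Protein` classes, their field names and the example values are hypothetical, and site positions are assumed to be already mapped onto the protein's own coordinates (in practice that mapping would come from an alignment to the family).

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Protein:
    accession: str
    sequence: str
    lineage: List[str]
    family_id: str


@dataclass
class NameRule:
    family_id: str
    name: str
    required_taxon: Optional[str] = None  # rule applies only within this taxon
    site_residues: Dict[int, str] = field(default_factory=dict)  # 1-based pos -> residue
    name_if_site_missing: Optional[str] = None  # used when active-site residues are absent


def has_sites(rule: NameRule, protein: Protein) -> bool:
    """True if every required residue is present at its expected position."""
    return all(
        pos <= len(protein.sequence) and protein.sequence[pos - 1] == residue
        for pos, residue in rule.site_residues.items()
    )


def propagate_name(rule: NameRule, protein: Protein) -> Optional[str]:
    """Return the name to propagate, or None if the rule does not apply."""
    if protein.family_id != rule.family_id:
        return None
    if rule.required_taxon and rule.required_taxon not in protein.lineage:
        return None  # activity is relevant only to one part of the taxonomic tree
    if not has_sites(rule, protein):
        return rule.name_if_site_missing  # e.g. lacks residues needed for activity
    return rule.name


# Hypothetical rule and proteins.
RULE = NameRule("FAM_1", "example kinase", required_taxon="Bacteria",
                site_residues={3: "H", 5: "D"},
                name_if_site_missing="example kinase homolog, predicted inactive")
ACTIVE = Protein("P_1", "MKHGD", ["Bacteria", "Proteobacteria"], "FAM_1")
INACTIVE = Protein("P_2", "MKAGD", ["Bacteria", "Firmicutes"], "FAM_1")
ARCHAEAL = Protein("P_3", "MKHGD", ["Archaea"], "FAM_1")
```

Returning `None` when a condition fails keeps such proteins out of automatic propagation and leaves them for curator review.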


Overview

Problem:

  • Most new protein sequences come from genome sequencing projects

  • Many have unknown functions

  • Large-scale functional annotation of these sequences based simply on BLAST best hit has pitfalls; results are far from perfect

Functional Analysis of Protein Sequences:

  • Homology-based (sequence analysis, structure analysis)

  • Non-homology (genome context, phylogenetic distribution)

Solution for Large-scale Annotation:

  • Highly curated and annotated protein classification system

  • Automatic annotation of sequences based on protein families

Facilitates:

  • Automatic annotation of sequences based on protein families

  • Systematic correction of annotation errors

  • Name standardization in UniProt

  • Functional predictions for uncharacterized proteins


Impact of Protein Bioinformatics and Genomics

  • Single protein level

    • Discovery of new enzymes and superfamilies

    • Prediction of active sites and 3D structures

  • Pathway level

    • Identification of “missing” enzymes

    • Prediction of alternative enzyme forms

    • Identification of potential drug targets

  • Cellular metabolism level

    • Multisubunit protein systems

    • Membrane energy transducers

    • Cellular signaling systems


PIR Team

  • Dr. Cathy Wu, Director

  • Protein Science team

    • Dr. Darren Natale (lead); Dr. Peter McGarvey

    • Dr. Cecilia Arighi; Dr. Anastasia Nikolskaya

    • Dr. Winona Barker; Dr. Sona Vasudevan

    • Dr. Zhang-zhi Hu; Dr. CR Vinayaka

    • Dr. Raja Mazumder; Dr. Lai-Su Yeh

  • Bioinformatics team

    • Dr. Hongzhan Huang (lead); Yongxing Chen, M.S.

    • Dr. Leslie Arminski; Baris Suzek, M.S.

    • Dr. Hsing-Kuo Hua; Xin Yuan, M.S.

    • Dr. Robel Kahsay; Jian Zhang, M.S.

  • Students

    • Natalia Petrova

  • UniProt Collaborators

    • Dr. Rolf Apweiler (EBI); Dr. Amos Bairoch (SIB)

UniProt is supported by the National Institutes of Health, grant # 1 U01 HG02712-01

