Automatic Domain Identification

Automatic Domain Identification Topic 10. Chapter 20, Gu and Bourne, “Structural Bioinformatics”

What is a domain? A reasonable region of complexity. Shamelessly ‘borrowed’ from Phil Bourne’s notes: www.sdsc.edu/pb/edu/pharm201/15/15.pptx

Protein Domain
• The definition of a protein domain is not well established (to say the least), which makes it difficult to identify domain boundaries.
• General considerations:
  - compact, semi-independent units (close to spherical shape) *
  - interactions between domains are weak (small contact)
  - identifiable hydrophobic core (interface is more hydrophilic) **
  - β-sheet is best preserved
* Wetlaufer DB. PNAS 1973; 70:697-701
** Swindells MB. Protein Science 1995; 4:103-112

Multi-domain Proteins Approximately 50% of proteins are multi-domain (data from 2005). The fraction could be as high as 80% in eukaryotes. Redfern et al., PLoS Computational Biology, 2007

From Wikipedia…
• A protein domain is a part of protein sequence and structure that can evolve, function, and exist independently (not likely now) of the rest of the protein chain.
• Each domain forms (formed?) a compact three-dimensional structure and often can be independently stable and folded.
• Many proteins consist of several structural domains.
• One domain may appear in a variety of different proteins. Molecular evolution uses domains as building blocks, and these may be recombined in different arrangements to create proteins with different functions.
• Domains vary in length from about 25 to about 500 amino acids. The shortest domains, such as zinc fingers, are stabilized by metal ions or disulfide bridges.
• Domains often form functional units, such as the calcium-binding EF-hand domain of calmodulin (is a single EF-hand really a domain?).
• Because they are independently stable, domains can be "swapped" by genetic engineering between one protein and another to make chimeric proteins.

EF-hands (domain or motif?) The EF-hand is another common structural element. In fact, the protein calmodulin has four of them. Calmodulin

Adding to the Complexity: Discontinuous Domains
[Figure labels: N-terminal, C-terminal]
SCOP Classification:
33844 px c.56.5.4 d1cg2a1 1cg2 A:26-213,A:327-414
39360 px d.58.19.1 d1cg2a2 1cg2 A:214-326
About 20% of multidomain proteins are not contiguous in sequence. Redfern OC et al., PLoS Computational Biology, 2007
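
The SCOP region strings on this slide can be checked mechanically for discontinuity. The following is a minimal sketch, assuming the simple `chain:start-end` form shown above; the function names are hypothetical, and real SCOP files can contain insertion codes and other forms that this sketch does not handle.

```python
import re

def parse_scop_regions(region):
    """Parse a region string such as 'A:26-213,A:327-414' into (chain, start, end) tuples.

    Assumption: every segment has the plain form chain:start-end.
    """
    segments = []
    for part in region.split(","):
        match = re.fullmatch(r"(\w):(-?\d+)-(-?\d+)", part.strip())
        if match is None:
            raise ValueError(f"unrecognized segment: {part!r}")
        segments.append((match.group(1), int(match.group(2)), int(match.group(3))))
    return segments

def is_discontinuous(region):
    """A domain built from more than one sequence segment is discontinuous."""
    return len(parse_scop_regions(region)) > 1

# The two 1cg2 domains listed on the slide.
first_domain = parse_scop_regions("A:26-213,A:327-414")
second_domain = parse_scop_regions("A:214-326")
```

Here the first domain (d1cg2a1) comes out as two segments that flank the second domain (d1cg2a2), which is the discontinuous case the slide describes.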

Domain identification
• Any structure left unclassified by the sequence-based methods is divided into its constituent domains (when appropriate). The domains are then resubmitted to the sequence and structure comparison protocols discussed previously.
• While there are many automatic domain identification algorithms, most produce a significant number of incorrect assignments (20-30% incorrect).
• This is mainly because there is no unique answer to the question, “What is a domain?” For example, one could easily envision various domain classification schemes based on sequence, phylogeny, and/or structure.
• Structure-based approaches rest on straightforward structural concepts: namely, that (globular) proteins have hydrophobic cores, and that these cores should constitute a (semi)independent folding nucleus.
• Thus the automated methods attempt to (maximize, minimize) (intra, inter)-domain contacts, that is, to maximize intra-domain contacts and minimize inter-domain contacts.
• What about non-globular (i.e., intrinsically disordered or integral) proteins???

Domain identification ADH Most automated domain identification methods are primarily based on this premise. However, as you might expect, there are myriad ways to implement such an idea.

Automatic Domain Partition Methods
• Early works only apply to single-segment domains:
  - Crippen, 1978; Nemethy & Scheraga, 1979; Lesk & Rose, 1981; Rashin, 1981.
• Current methods for multi-segment domains mostly use heuristics and approximations:
  - Holm & Sander, 1994; Siddiqui & Barton, 1995; Swindells, 1995…
Note: the focus here is structural domain partition. While structure-based domain assignment is not a trivial problem, domain prediction from sequence is even more difficult. Any advances in sequence-based domain prediction will greatly improve protein structure prediction.

The general approach Basic principle for domain partition: inter-residue interactions are denser within domains than between domains

Top-down vs. Bottom-up
Top-down: Start with the entire structure and proceed through iterative partitions into smaller units.
Bottom-up: Define very small structural units and assemble them into domains.
• Over the years, an amazing array of approaches has been put forward to solve the domain ID problem.
• In spite of very different overall approaches, an interesting observation has been made: most algorithms correctly ID 70-80% of domains within structures, but fail on the others due to complexity within some multi-domain proteins.
• The number of boundaries is either over-predicted, leading to too many domains (overcut), or under-predicted, leading to too few domains (undercut).
• Thus, the remaining problem is not “where does the boundary of the domain fall?”, but rather “is the identified boundary real?”

How do automatic methods work?
Input: 3D coordinates of the chain
Step 1:
• Top-down approach: make domains by partitioning the chain into smaller units
• Bottom-up approach: make domains by putting together primitive units of secondary structure
• Parameters involved:
  - Maximize hydrophobic core of the unit
  - Maximize compactness of the unit
  - Find mechanical hinge points between units
  - Minimize interface area between units
Step 2: Evaluate each potential domain using a set of parameters (accept or reject a given assignment)
• Parameters involved:
  - Minimum size of unit
  - Maximize globularity
  - Minimize cutting through secondary structures
  - Maximum number of discontinuous fragments within the domain
Output: Predicted domains
Shamelessly ‘borrowed’ from Phil Bourne’s notes: www.sdsc.edu/pb/edu/pharm201/15/15.pptx

Two steps of algorithm design:
Step A: Train the algorithm
• Use expert data for domain assignments
• Compare predicted domain assignments to “correct” domain assignments
• Tune parameters till the best level of prediction is achieved
Step B: Validate the performance
• Run the algorithm on an independent set of data
• Report % of correctly partitioned proteins
A problem: different algorithms use assignments from different experts for training and validation. More seriously, there is no good objective way to compare the performance of different methods, as each uses a different dataset for validation. Algorithms will reflect the same propensities toward domain assignments as the expert method they rely upon.
Shamelessly ‘borrowed’ from Phil Bourne’s notes: www.sdsc.edu/pb/edu/pharm201/15/15.pptx

Issues in Protein Domain Partition
• Compactness (contacts / # of residues…)
• Minimum domain size (35 amino acids [AA], 40 AA…?)
• Minimum size to be considered for partition (80 AA…?)
• Integrity of secondary structures (is it OK to break a β-sheet?)
• Most programs use a top-down approach; what are the stopping criteria?

CATH Domain Classification
• Uses both automatic and manual techniques
• If a chain has high sequence identity (80%) and structural similarity (SSAP score >= 80) with a protein chain X that has already been classified in CATH, use the boundaries of X.
• Otherwise, apply several domain partition programs:
  1. DETECTIVE (Swindells, 1995)
  2. PUU (Holm & Sander, 1994)
  3. DOMAK (Siddiqui and Barton, 1995)
• If there is no consensus, assign manually.

Differences WARNING: Even though each method has about 70-80% accuracy on benchmark tests, the methods disagree substantially on both the number of domains and the domain boundaries. In CATH, if consensus is not found within a tolerance of 10 residues, the domains are manually assigned (right).
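
A consensus test of this kind can be sketched as follows. This is only an illustration under stated assumptions: each method is reduced to a sorted list of boundary positions along the sequence (so domains are treated as contiguous), and "within a tolerance of 10 residues" is read as a per-boundary spread across methods. The slide does not say how CATH actually measures the tolerance, and the function name is hypothetical.

```python
def boundaries_agree(assignments, tolerance=10):
    """Return True if all methods report the same number of boundaries
    and each corresponding boundary differs by at most `tolerance` residues.

    assignments: one sorted list of boundary positions per method.
    """
    boundary_counts = {len(boundaries) for boundaries in assignments}
    if len(boundary_counts) != 1:
        return False  # methods disagree on the number of domains
    for positions in zip(*assignments):
        if max(positions) - min(positions) > tolerance:
            return False  # this boundary is placed too differently
    return True

# Hypothetical outputs of three methods for one two-domain chain.
consensus = boundaries_agree([[120], [125], [118]])
no_consensus = boundaries_agree([[120], [140], [118]])
```

When the function returns False, the chain would fall through to manual assignment in the scheme described above.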

Automatic Domain Partition Methods
• DOMAK (Siddiqui and Barton, 1995)
  - split value = (intA/extAB) * (intB/extAB)
  - intA (intB): the number of internal contacts in A (B) (contact: heavy atoms within 5 Å)
  - extAB: the number of contacts between A and B
• DETECTIVE (Swindells, 1995)
  - hydrophobic core determination
• PUU (protein unfolding units; Holm & Sander, 1994)
  - harmonic model to describe inter-domain dynamics
• DomainParser (Xu et al., 2000)
  - graph algorithm: network flow
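
The DOMAK split value can be illustrated on a toy example. This is a sketch of the formula on the slide only, not of the DOMAK program: it assumes a contact is counted once per residue pair that has any heavy atoms within 5 Å, and the function names and the one-atom-per-residue coordinates are made up for illustration.

```python
import math
from itertools import combinations, product

def in_contact(atoms_i, atoms_j, cutoff=5.0):
    """True if any pair of heavy atoms from the two residues lies within `cutoff` Å."""
    return any(math.dist(a, b) <= cutoff for a, b in product(atoms_i, atoms_j))

def split_value(atoms, part_a, part_b, cutoff=5.0):
    """split value = (intA / extAB) * (intB / extAB).

    atoms: dict mapping residue id -> list of (x, y, z) heavy-atom coordinates.
    Assumption: contacts are counted per residue pair.
    """
    int_a = sum(in_contact(atoms[i], atoms[j], cutoff) for i, j in combinations(part_a, 2))
    int_b = sum(in_contact(atoms[i], atoms[j], cutoff) for i, j in combinations(part_b, 2))
    ext_ab = sum(in_contact(atoms[i], atoms[j], cutoff) for i, j in product(part_a, part_b))
    if ext_ab == 0:
        return math.inf  # the two parts do not touch at all
    return (int_a / ext_ab) * (int_b / ext_ab)

# Toy chain: two clusters of three residues, one atom each, placed on a line.
atoms = {i: [(x, 0.0, 0.0)] for i, x in enumerate([0.0, 3.0, 6.0, 10.0, 13.0, 16.0])}
good_split = split_value(atoms, [0, 1, 2], [3, 4, 5])
poor_split = split_value(atoms, [0, 1], [2, 3, 4, 5])
```

A larger split value means more internal contacts relative to the contacts crossing the proposed boundary, so the partition between the two clusters scores higher than the one that cuts into the first cluster.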

DomainParser
• DomainParser (Xu et al, Bioinformatics 2000) uses a graph-theoretic algorithm for the decomposition of a multi-domain protein into individual structural domains.
• The underlying principle is that residue-residue contacts are denser within a domain than between domains.
• The decomposition problem is recast as a network flow problem, in which each residue is represented as a node of a network and each residue-residue contact is represented as an edge with a particular capacity, depending on the type of the contact.
• A two-domain decomposition problem is solved by finding a cut of the network that minimizes the total cross-edge capacity (minimum cut).
• To deal with networks with non-unique minimum cuts, the algorithm finds all cuts that achieve the minimum cross-edge capacity.
• A recent analysis of four automatic methods put DomainParser (marginally) at the top (Holland et al, JMB, 2006). In fact, 3/4 were nearly equal depending on the evaluation criterion.

Domain Partition as a Network Flow Problem
Basic idea: identify the bottleneck
[Figure: the interface in a domain partition corresponds to the bottleneck in a network flow]
Xu et al, Bioinformatics 2000; Guo et al, NAR 2003
Note: there is now a DomainParser 2

DomainParser Domain identification is recast as a network flow problem: the method attempts to divide the network into two interconnected parts in such a way that the total edge capacity across the division is minimized. (Note: each edge can carry a different weight, or capacity.) Intuitively, this translates into finding the bottleneck within the network. The algorithm works by systematically removing nodes until domain separation is maximized. A second (post-processing) step checks the validity of the domain boundaries using commonsense metrics such as compactness, radius of gyration, number of non-contiguous segments per domain, and distribution of domain sizes. Because the method is based on topology, it is very fast, and it scales well: O(nm²), where n = # of nodes and m = # of edges.

Maximum Flow/Minimum Cut (bottleneck) [Figure labels: source, sink, node, edge capacity] Algorithm to solve this problem: Ford-Fulkerson method. We need to construct a graph first…
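
To make the max-flow/min-cut idea concrete, here is a minimal sketch of the Ford-Fulkerson method using breadth-first search for augmenting paths (the Edmonds-Karp variant, which runs in O(nm²) for n nodes and m edges). It is a generic textbook implementation on a made-up toy graph, not DomainParser's code; the node names, capacities, and function name are assumptions for illustration.

```python
from collections import defaultdict, deque

def min_cut(edges, source, sink):
    """Return (cut_capacity, source_side_nodes) for an undirected capacitated graph.

    edges: iterable of (u, v, capacity).
    """
    if source == sink:
        raise ValueError("source and sink must differ")
    residual = defaultdict(lambda: defaultdict(int))
    for u, v, cap in edges:
        residual[u][v] += cap
        residual[v][u] += cap  # undirected edge: capacity in both directions

    def reachable_parents():
        """Breadth-first search over edges that still have residual capacity."""
        parents = {source: None}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v, cap in residual[u].items():
                if cap > 0 and v not in parents:
                    parents[v] = u
                    queue.append(v)
        return parents

    total_flow = 0
    while True:
        parents = reachable_parents()
        if sink not in parents:
            # No augmenting path left: the reachable nodes form the source side of a minimum cut.
            return total_flow, set(parents)
        # Find the bottleneck capacity along the augmenting path.
        bottleneck = float("inf")
        v = sink
        while parents[v] is not None:
            bottleneck = min(bottleneck, residual[parents[v]][v])
            v = parents[v]
        # Push that much flow along the path.
        v = sink
        while parents[v] is not None:
            u = parents[v]
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
            v = u
        total_flow += bottleneck

# Toy "protein": two densely connected clusters joined by two weak contacts.
edges = [
    ("a", "b", 5), ("a", "c", 5), ("b", "c", 5),   # cluster 1
    ("d", "e", 5), ("d", "f", 5), ("e", "f", 5),   # cluster 2
    ("c", "d", 1), ("b", "e", 1),                  # interface (bottleneck)
]
cut_capacity, source_side = min_cut(edges, "a", "f")
```

In this toy graph the minimum cut separates the two clusters across the two weak interface edges, which is the bottleneck picture the slide draws.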

Model Building for Domain Partition
Find the bottleneck
• Node: residue (C)
• Capacity: packing
• Source/sink: extreme points
• Issues:
  - Compactness
  - Minimum domain size
  - Integrity of secondary structures
  - When to stop

Capacity and Extreme Points
• Capacity between residues A/B (based on Holm & Sander 1994):
  - If atom distance <= 4.0 Å, ++1
  - If backbone contact, ++5
  - If across a β-sheet, ++12
  - If within a β-strand, ++1000
  Goal: preserve β-sheet structure
• Use multiple extreme points (sampling) as source and sink: the two farthest residues perpendicular to the axis
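
The capacity rules above can be written as a small scoring function. This is a sketch under assumptions: the increments are read as additive, the "++1" is applied once per atom pair within 4.0 Å, and the boolean flags are assumed to come from a separate contact and secondary-structure analysis that is not shown. All names are hypothetical.

```python
# Increments as listed on the slide (attributed there to Holm & Sander 1994).
ATOM_CONTACT_BONUS = 1
BACKBONE_CONTACT_BONUS = 5
ACROSS_SHEET_BONUS = 12
WITHIN_STRAND_BONUS = 1000

def edge_capacity(n_atom_contacts, backbone_contact, across_sheet, within_strand):
    """Capacity of the edge between two residues.

    n_atom_contacts: number of atom pairs within 4.0 Å (assumed: +1 for each pair).
    backbone_contact, across_sheet, within_strand: flags supplied by the caller.
    """
    capacity = n_atom_contacts * ATOM_CONTACT_BONUS
    if backbone_contact:
        capacity += BACKBONE_CONTACT_BONUS
    if across_sheet:
        capacity += ACROSS_SHEET_BONUS
    if within_strand:
        capacity += WITHIN_STRAND_BONUS
    return capacity

ordinary_pair = edge_capacity(3, True, False, False)
sheet_pair = edge_capacity(2, True, True, False)
strand_pair = edge_capacity(1, False, False, True)
```

Because a minimum cut avoids high-capacity edges, the very large within-strand increment makes it expensive to cut through a β-strand, and the across-sheet increment discourages splitting a β-sheet.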

Assignments by DomainParser vs. SCOP
Examples: 1zmec, 6prch, 1aaya
• Domains have very simple and/or extended structure (DomainParser: 1 domain)
* Violate compact globular requirement
• DomainParser preserves β-sheet (DomainParser: 1 domain): undercut

Assignments by DomainParser vs. SCOP • Structurally correct decomposition by DomainParser (DomainParser: 2 domains): overcut. Examples: 2adma, 2liv. SCOP treats them as single-domain proteins: functional consideration, or ?

Domain Assignments by DomainParser Experts: CATH, SCOP, AUTHORS Holland, et al, JMB, 2006

DomainParser tends to undercut large multi-domain proteins Holland, et al, JMB, 2006

Summary of Performance Comparison Holland, et al, JMB, 2006

But PDP (Protein Domain Parser) is the winner Holland, et al, JMB, 2006

But PDP (Protein Domain Parser) is the winner
• PDP is a recursive top-down algorithm that makes either (1) a single cut producing two contiguous domains or (2) a double cut, where the cuts are at least 35 residues apart and within 8 Å of each other.
• The best cut is selected using the criterion of minimum contacts between the resulting domains, normalized by the size of the domains.
• The algorithm continues to recursively partition each of the resulting domains until a stopping condition is met.
• During the post-processing step, the number of contacts between resulting domains is evaluated and domains with a high level of contacts are merged together. Very small domains (below 35 residues) are discarded.
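
The single-cut step described above can be sketched as a scan over cut positions. This is only an illustration of the idea, not PDP itself: the slide does not give the normalization, so dividing by the product of the two part sizes is a placeholder assumption, the 35-residue minimum part size is borrowed from the slide's size thresholds, and the double cut, the recursion, and the post-processing are omitted. The function name and toy contact set are made up.

```python
def best_single_cut(n_residues, contacts, min_size=35):
    """Find the single cut with the fewest size-normalized crossing contacts.

    contacts: set of (i, j) residue index pairs, 0-based, with i < j.
    A cut at position c splits the chain into [0, c) and [c, n_residues).
    Returns (cut_position, score), or None if the chain is too short to split.
    """
    best = None
    for cut in range(min_size, n_residues - min_size + 1):
        crossing = sum(1 for i, j in contacts if i < cut <= j)
        # Placeholder normalization (assumption): divide by the product of part sizes.
        score = crossing / (cut * (n_residues - cut))
        if best is None or score < best[1]:
            best = (cut, score)
    return best

# Toy chain of 80 residues: sequential contacts everywhere, plus extra
# i/i+3 contacts inside each half, so the halves are denser than the junction.
n = 80
contacts = {(i, i + 1) for i in range(n - 1)}
for lo, hi in [(0, 40), (40, 80)]:
    contacts |= {(i, i + 3) for i in range(lo, hi - 3)}

cut_position, cut_score = best_single_cut(n, contacts)
```

A recursive version would apply the same scan to each resulting part until a stopping condition (not specified on the slide) is met.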

Summary of Performance Comparison
• Based on the criterion of the correct number of assigned domains, PDP appears to be the most accurate method (85% correct), followed by NCBI (83%), DomainParser (77%), and PUU (74%).
• DomainParser is the most accurate on structures with few domains. However, it tends to under-cut many structures (4.5% over-cut, 18.5% under-cut).
• NCBI, on the other hand, shows a balance between over-cut and under-cut types of errors (9.9% over-cut, 7.6% under-cut).
• The performance of PDP is consistently superior to that of the other methods; it is particularly impressive on chains with a larger number of domains: the method correctly assigns four out of five five-domain chains and is the only method to correctly assign a six-domain chain.
• In general, the performance of NCBI is very similar to that of PDP, both overall and in its profile; its assignment of four-domain chains is superior to that of PDP, but NCBI fails to correctly assign most of the five-domain chains and both of the six-domain chains.
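
The "correct number of assigned domains" criterion and the over-cut/under-cut breakdown can be computed as below. This is a minimal sketch of the bookkeeping only; the function names are hypothetical and the input pairs are made-up toy values, not data from Holland et al.

```python
from collections import Counter

def classify(predicted_n, reference_n):
    """Compare a predicted domain count against the reference (expert) count."""
    if predicted_n > reference_n:
        return "overcut"
    if predicted_n < reference_n:
        return "undercut"
    return "correct"

def error_profile(pairs):
    """Percentage of chains in each category.

    pairs: list of (predicted_domain_count, reference_domain_count), one per chain.
    """
    counts = Counter(classify(predicted, reference) for predicted, reference in pairs)
    total = len(pairs)
    return {key: 100.0 * counts[key] / total for key in ("correct", "overcut", "undercut")}

# Toy benchmark of four chains (illustrative values only).
profile = error_profile([(2, 2), (1, 2), (3, 2), (2, 2)])
```

Note that this criterion only checks the domain count; a method can get the count right and still misplace the boundaries.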

Some insights from looking at automatic domain assignments:
• Maximizing the ratio of intra- to inter-domain contacts is a chief principle in algorithmic assignments and works well for ‘standard’ cases.
• As more complex structures are solved, more cases of ‘unusual’ architecture are uncovered. These tend to defy our basic rules.
• It is possible to include more parameters and tune them better to avoid some obvious cases of overcuts:
  - penalize splitting secondary structure elements (some cutting of secondary structures is essential to obtain the ‘correct’ domain, but this feature should be carefully balanced)
  - penalize domains consisting of too many short fragments (excessive fragmentation may result in very compact, but biologically unfeasible domains)
  - improve the ability to recognize ‘classical’ folds (this will improve recognition of very small and very large domains for which contact density may be misleading)
Shamelessly ‘borrowed’ from Phil Bourne’s notes: www.sdsc.edu/pb/edu/pharm201/15/15.pptx

Best practices: use a consensus approach http://pdomains.sdsc.edu
