METHOD: Family Classification Scheme

Set for a model building: 67 microbial genomes with identified protein sequences (Table 1) • Set for a model estimation: 73 microbial genomes (Table 2) METHOD:Family Classification Scheme • PSI-BLAST search against all sequences in Genbank. • Domain parsing. • Domain clustering. • The model was tuned on 50,000 randomly chosen full-length SwissProt protein sequences containing domains present in a 721 family subset of PfamA. • E-value (e-4) and number of rounds (6) for (1) and rules for (3) have been chosen to maximize the agreement between the generated and PfamA families.

METHOD:Domain boundaries are identified based on the location of indels in the PSI-BLAST alignment The result is that 96% of single domain PfamA proteins are predicted as such, but only 24% of PfamA two-domain proteins are predicted correctly.

METHOD:Evaluation of domain family construction on SCOP40 1224 superfamilies (5226 domains, 760 folds) • SCOP domains were clustered into families using the authors’ procedure (PSI-BLAST against nr and clustering). • For comparison, the same SCOP domains were clustered using BLAST, PSI-BLAST, SAM-T99 (HMM) and PRC (profile-profile) methods. • 2) Generated families were compared with the SCOP superfamilies: any pair of domains in a generated family presented in the same SCOP superfamily is considered as a true positive, presented in different SCOP fold – as a false positive.

METHOD: Evaluation of domain family construction on SCOP40 1224 superfamilies (5226 domains, 760 folds) The threshold of 1% ratio of false positives versus true positives Authors’ method gives 32% of TPs HMM and PRC detect 28% of TPs PSI-BLAST detects 18% of true positives BLAST detects 9% of true positives

RESULT 1: Distribution of domain family size for 67 genomes (178,310 proteins, 249,574 domains, 31,874 families) 87.7% coverage would be provided by 6072 structures

RESULT 2: Estimation of the number of families in 1000 microbial genomes 1) Examination of a number of families as a function of the number of genomes through 100 times repeated procedure of randomly chosen genomes among 67. 67 genomes

RESULT 2:Estimationof the number of families in 1000 microbial genomes 2) Extrapolation of the data on 1000 genomes. ~1,000,000 families for 10,000 genomes The model predicts ~250,000 families for 1000 genomes The model predicts ~140,000 singletons for 1000 genomes

RESULT 2: Estimation of the number of families in 1000 microbial genomes 3) Testing of the extrapolation model on the complete set of 140 genomes.

RESULT 3:Structural coverage of protein families for 67 genomes • For each non-membrane family with more than 2 members (4907 families), a sequence profile (PSSM) was obtained using blastgp. • PDB sequences was run against PSSMs using RPS-BLAST. Profile with E-value  e-2 was considered to represent a family. Such family was considered to be structurally covered. Overall, 20% of all families have representative structures. 3926 structures would be required to complete the coverage.

RESULT 3:Structural coverage of protein families for 67 genomes Suggested strategy of obtaining representative structures for the largest families first: ~4000 structures are required to obtain complete coverage of all families with three or more members, covering 88% of the domains in these genomes.

RESULT 4:Estimation of structural coverage for 1000 microbial genomes 1) Examination of a number of structures needed to obtain coverage of various fractions of proteins (assuming structures for the largest families are obtained first) through 100 times repeated simulation of randomly chosen genomes among 67.

RESULT 4:Estimation of structural coverage for 1000 microbial genomes 2) Extrapolation of the data on 1000 genomes. . A total of 250,000 structures would be required to obtain 100% coverage of protein domains families from 1000 genomes

RESULT 4:Estimation of structural coverage for 1000 microbial genomes 2) Extrapolation of the data on 1000 genomes. Representative structures for about 8,000 families will provide 70% coverage

RESULT 4:Estimation of structural coverage for 1000 microbial genomes 3) Testing of the extrapolation model on the complete set of 140 genomes. ?

DISCUSSION • Domain boundaries detection method. It doesn’t split at many domain boundaries that are obvious at the structure level. The result is that 96% of single domain PfamA proteins are predicted as such, but only 24% of PfamA two-domain proteins are predicted correctly. • Clusterization procedure was tuned on a set of PfamA domains. PfamA omits a large number of evolutionary relationships (placing related proteins in different families). PfamA also focuses on larger families, whereas the genome data are dominated by smaller families. As a result, a better method for PfamA may not necessarily be optimum on genome data. • The final procedure of family generation was benchmarked on a set of SCOP superfamilies. SCOP may be unrepresentative of proteins in genomes as a whole (not including disorder proteins and that form part of complexes).

METHOD: Family Classification Scheme

METHOD: Family Classification Scheme

Presentation Transcript

Types of Algorithms

The Family Attachment Scheme

A Heavy-Duty Vehicle Visual Classification Scheme: Heavy-Duty Vehicle Reclassification Method for Mobile Source Emissio

Chapter 6. Classification and Prediction

TAXONOMY

Scientific Classification

Insect Classification

Systematics and the Phylogenetic Revolution

Linnaeus And Biological Classification

Geosource

The Scientific Method The Atomic Theory Classification of Matter

Morphological Classification

Biology

Gene family classification using a semi-supervised learning method

DiVo : A Novel Distance based Voting Method for One Class Classification

Concept Evolution (Subject Ontogeny) in the NLM Classification Scheme: a Case Study

Scheme in Python

National Family Benefit Scheme

Classification of organisms

Chapter 6. Classification and Prediction

A Sorting Classification of Parallel Rendering