Partitioning Sequences Based on Association Measures Deborah Weisser Carnegie Mellon University
Feature boundaries
• The form and function of protein sequences must be known to understand complex biological systems
• Features and functions cannot be determined directly
  • Feature positions are estimated by indirect laboratory experiments, e.g. hydrophobicity
• Use statistical measures of association to determine feature boundaries
Feature boundaries
• Proteins are composed of adjacent, non-overlapping features:
  • helical, cytoplasmic, periplasmic, extracellular, intracellular, etc.
• GPCR proteins have a fixed feature pattern, although feature positions are known for only one member of the family, rhodopsin (opsd_human)
Goal: Statistically determine feature boundaries in sequences of amino acids
S H D E G C L S S E P K P R K Q S D S S T
Association measures
S H D E G C L S S E P K P R K Q S D S S T
2.5 is a measure of the strength of the association between P and R
Association measures
S H D E G C L S S E P K P R K Q S D S S T
1.1 4.8 0.3 1.2 4.5 6.2 1.2 3.7 0.7 3.4 5.2 0.8 1.1 5.5 2.3 4.1 0.2 2.5 1.8 1.1 6.2
Association measures
S H D E G C L S S E P K P R K Q S D S S T
1.1 4.8 0.3 1.2 4.5 6.2 1.2 3.7 0.7 3.4 5.2 0.8 1.1 5.5 2.3 4.1 0.2 2.5 1.8 1.1 6.2 4.2
Adjacent pairs with low association measures are candidates for partition points.
Association measures are used to quantify correlations between adjacent amino acids
• Yule’s Q statistic
• Mutual information
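As a rough illustration of the two measures named above, the sketch below scores one adjacent pair from counts of ordered neighbouring residues. The slides do not give the exact formulas, so this assumes the textbook 2x2-table form of Yule's Q and the pointwise (per-pair) form of mutual information; the function names are hypothetical.

```python
import math
from collections import Counter


def adjacent_pair_counts(sequences):
    """Count ordered adjacent residue pairs over a collection of sequences."""
    pairs = Counter()
    for seq in sequences:
        for left, right in zip(seq, seq[1:]):
            pairs[(left, right)] += 1
    return pairs


def _marginals(pairs, x, y):
    """Return (n_xy, n_x_first, n_y_second, n_total) for the pair (x, y)."""
    total = sum(pairs.values())
    x_first = sum(n for (l, _), n in pairs.items() if l == x)
    y_second = sum(n for (_, r), n in pairs.items() if r == y)
    return pairs[(x, y)], x_first, y_second, total


def yules_q(pairs, x, y):
    """Yule's Q = (ad - bc) / (ad + bc) for the event 'x is followed by y'."""
    a, x_first, y_second, total = _marginals(pairs, x, y)
    b = x_first - a          # x followed by something other than y
    c = y_second - a         # something other than x followed by y
    d = total - a - b - c    # neither x first nor y second
    denom = a * d + b * c
    return (a * d - b * c) / denom if denom else 0.0


def pointwise_mi(pairs, x, y):
    """Pointwise mutual information log2(P(x,y) / (P(x) P(y))) of the pair."""
    a, x_first, y_second, total = _marginals(pairs, x, y)
    if a == 0:
        return float("-inf")  # pair never observed
    return math.log2(a * total / (x_first * y_second))
```

Scoring every adjacent pair of a sequence against counts gathered from a training set yields one association value per pair, which is the kind of series the partitioning step works on.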
[Figure: rhodopsin sequence diagram showing the cytoplasmic (cp1, cp2, cp3), transmembrane (helices I to VII), and extracellular (ec) domains, annotated with MI breaks and hydropathy breaks]
MI breaks: 39, 63, 76, 94, 115, 136, 155, 176, 205, 233, 255, 279, 287, 301
Hydropathy breaks: 37, 61, 74, 98, 114, 133, 153, 176, 203, 230, 253, 276, 285, 309
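One way to read the two break lists above is to line them up and look at the offsets. The sketch below assumes the lists correspond position by position (first MI break with first hydropathy break, and so on), which the slide suggests but does not state; the variable and function names are hypothetical.

```python
mi_breaks = [39, 63, 76, 94, 115, 136, 155, 176, 205, 233, 255, 279, 287, 301]
hydropathy_breaks = [37, 61, 74, 98, 114, 133, 153, 176, 203, 230, 253, 276, 285, 309]


def break_offsets(predicted, reference):
    """Signed offset (in residues) of each predicted break from its reference."""
    if len(predicted) != len(reference):
        raise ValueError("break lists must have the same length")
    return [p - r for p, r in zip(predicted, reference)]


offsets = break_offsets(mi_breaks, hydropathy_breaks)
abs_offsets = [abs(o) for o in offsets]
mean_abs_offset = sum(abs_offsets) / len(abs_offsets)
max_abs_offset = max(abs_offsets)
```

Under that pairing, the MI breaks sit on average about 2.6 residues from the hydropathy breaks, with the largest gap (8 residues) at the final boundary.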
The changes in association measure values correspond to feature boundaries
• Goal: automatically detect partition points based on association measures
Partitioning algorithm
• Cluster adjacent association values
  • each group is represented by its mean value
• Calculate standard deviation of values over all clusters
• Locate partition points in data based on:
  • deviation from mean
  • [change between adjacent values]
Parameters
• Cluster adjacent association values
  • each group is represented by its mean value
  • parameter: window size for computing mean
• Calculate standard deviation of values over all clusters
• Locate partition points in data based on:
  • deviation from mean
  • [change between adjacent values]
  • parameter: cutoff distance from mean for a value to be considered “extreme”
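The steps and the two parameters above can be sketched roughly as follows. This is an illustration under stated assumptions, not the author's implementation: windows are taken as non-overlapping, the cutoff is expressed in standard deviations, only low-association windows are flagged, and the optional "change between adjacent values" criterion is left out. The names `window_means` and `partition_points` are hypothetical.

```python
from statistics import mean, pstdev


def window_means(values, window):
    """Group adjacent values into non-overlapping windows (an assumption).

    Returns a list of (start_index, window_mean) tuples.
    """
    return [
        (start, mean(values[start:start + window]))
        for start in range(0, len(values), window)
    ]


def partition_points(values, window=3, cutoff=1.0):
    """Return indices of association values flagged as partition points.

    Index i refers to the pair formed by residues i and i + 1.
    A window is "extreme" when its mean lies at least `cutoff` standard
    deviations below the mean of all window means (low association).
    """
    clusters = window_means(values, window)
    cluster_means = [m for _, m in clusters]
    center = mean(cluster_means)
    spread = pstdev(cluster_means)
    if spread == 0:
        return []  # flat profile: nothing stands out
    points = []
    for start, m in clusters:
        if (center - m) / spread >= cutoff:
            chunk = values[start:start + window]
            # Place the break at the weakest pair inside the window.
            points.append(start + chunk.index(min(chunk)))
    return points
```

Raising `cutoff` flags fewer windows, and a larger `window` smooths over isolated low values, which matches the roles the slide assigns to the two parameters.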
Effect of cutoff threshold on partitioning in opsd_human using mutual information
Effect of window size on partitioning in opsd_human using mutual information
GPCR: different subfamilies
• Class A Rhodopsin like
  • Amine
  • Peptide
  • Hormone protein
  • Rhodopsin
    • Rhodopsin Vertebrate
      • Rhodopsin Vertebrate type 1
      • Rhodopsin Vertebrate type 2
      • Rhodopsin Vertebrate type 3
      • Rhodopsin Vertebrate type 4
      • Rhodopsin Vertebrate type 5
    • Rhodopsin Arthropod
    • Rhodopsin Mollusc
    • Rhodopsin Other
  • Olfactory
  • Prostanoid
  • Nucleotide-like
  • Cannabis
  • Platelet activating factor
  • Gonadotropin-releasing hormone
  • Thyrotropin-releasing hormone & Secretagogue
  • Melatonin
  • Viral
  • Lysosphingolipid & LPA (EDG)
  • Leukotriene B4 receptor
  • Class A Orphan/other
• Class B Secretin like
• Class C Metabotropic glutamate / pheromone
• Class D Fungal pheromone
• Class E cAMP receptors (Dictyostelium)
• Frizzled/Smoothened family
GPCR: different subfamilies
Size      Hierarchy
717755    GPCR
371134    Class A
48393     Rhodopsin
33543     Vertebrate
20314     Vertebrate 1
348       opsd_human
39724     Class B
20930     Class C
Structure of curve is preserved even when the dataset is small.
In progress / Future work
• Set parameters of partition algorithm automatically
• Apply to other sources of data, types of features
• Group amino acids into sub-classes
• Quantify the effect of training set information content and training set size