Partitioning Sequences Based on Association Measures

Deborah Weisser, Carnegie Mellon University

Feature boundaries
• Need to know the form and function of protein sequences to understand complex biological systems
• Not possible to determine features or functions directly
• Estimate feature positions by indirect laboratory experiments, e.g. hydrophobicity
• Use statistical measures of association to determine feature boundaries

Feature boundaries
• Proteins are composed of adjacent, non-overlapping features: helical, cytoplasmic, periplasmic, extracellular, intracellular, etc.
• GPCR proteins have a fixed feature pattern, although feature positions are known for only one member of the family, rhodopsin (opsd_human)

Goal: Statistically determine feature boundaries in sequences of amino acids
S H D E G C L S S E P K P R K Q S D S S T

Association measures
S H D E G C L S S E P K P R K Q S D S S T
The value 2.5 is a measure of the strength of the association between P and R.

Association measures
S H D E G C L S S E P K P R K Q S D S S T
1.1 4.8 0.3 1.2 4.5 6.2 1.2 3.7 0.7 3.4 5.2 0.8 1.1 5.5 2.3 4.1 0.2 2.5 1.8 1.1 6.2

Association measures
S H D E G C L S S E P K P R K Q S D S S T
1.1 4.8 0.3 1.2 4.5 6.2 1.2 3.7 0.7 3.4 5.2 0.8 1.1 5.5 2.3 4.1 0.2 2.5 1.8 1.1 6.2 4.2
Adjacent pairs with low association measures are candidates for partition points.

Association measures are used to quantify correlations between adjacent amino acids
• Yule's Q statistic
• Mutual information
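The slide names the two measures but does not give their formulas. As a minimal sketch, assuming the standard textbook definitions, both can be computed from a 2x2 contingency table of adjacent residue pairs counted over a training set: Yule's Q as (ad - bc) / (ad + bc), and mutual information in its pointwise form log2(p(x,y) / (p(x) p(y))). The function names and the toy training sequences are hypothetical and are not taken from the presentation.

```python
import math
from collections import Counter

def adjacent_pair_counts(sequences):
    """Count adjacent (left, right) residue pairs over all training sequences."""
    pairs = Counter()
    for seq in sequences:
        pairs.update(zip(seq, seq[1:]))
    return pairs

def contingency(pairs, x, y):
    """2x2 table for 'left residue is x' versus 'right residue is y'."""
    total = sum(pairs.values())
    a = pairs[(x, y)]                                            # x followed by y
    left_x = sum(n for (l, _), n in pairs.items() if l == x)
    right_y = sum(n for (_, r), n in pairs.items() if r == y)
    b = left_x - a                                               # x followed by not-y
    c = right_y - a                                              # not-x followed by y
    d = total - a - b - c                                        # neither
    return a, b, c, d

def yules_q(pairs, x, y):
    """Yule's Q in [-1, 1]; returns 0.0 when the table is degenerate."""
    a, b, c, d = contingency(pairs, x, y)
    denom = a * d + b * c
    return (a * d - b * c) / denom if denom else 0.0

def pointwise_mi(pairs, x, y):
    """log2 of observed pair frequency over the frequency expected under independence."""
    a, b, c, d = contingency(pairs, x, y)
    if a == 0:
        return float("-inf")
    total = a + b + c + d
    return math.log2(a * total / ((a + b) * (a + c)))

def association_profile(seq, pairs, measure):
    """One association value per adjacent pair in seq."""
    return [measure(pairs, x, y) for x, y in zip(seq, seq[1:])]

# Toy training data (assumption, for illustration only).
training = ["PRPRPR", "SSSS"]
pairs = adjacent_pair_counts(training)
profile = association_profile("PRS", pairs, yules_q)
```

In this toy set P is always followed by R, so the P-R pair gets the maximum Q of 1.0, while P-S never occurs and gets -1.0.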

[Figure: two topology diagrams of rhodopsin, one marked with MI breaks and one with hydropathy breaks, each showing the cytoplasmic (cp) domain, the transmembrane domain (helices I to VII), and the extracellular (ec) domain.]
MI: 39, 63, 76, 94, 115, 136, 155, 176, 205, 233, 255, 279, 287, 301
Hydropathy: 37, 61, 74, 98, 114, 133, 153, 176, 203, 230, 253, 276, 285, 309
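The agreement between the two lists of break positions can be quantified with a little arithmetic. The sketch below assumes that the i-th MI break corresponds to the i-th hydropathy break (both lists have 14 entries); the slide does not state this pairing explicitly.

```python
# Break positions as listed on the slide.
mi_breaks = [39, 63, 76, 94, 115, 136, 155, 176, 205, 233, 255, 279, 287, 301]
hydropathy_breaks = [37, 61, 74, 98, 114, 133, 153, 176, 203, 230, 253, 276, 285, 309]

# Signed offset of each MI break from its assumed hydropathy counterpart.
offsets = [m - h for m, h in zip(mi_breaks, hydropathy_breaks)]
abs_offsets = [abs(o) for o in offsets]

mean_abs_offset = sum(abs_offsets) / len(abs_offsets)
max_abs_offset = max(abs_offsets)

print(offsets)          # [2, 2, 2, -4, 1, 3, 2, 0, 2, 3, 2, 3, 2, -8]
print(max_abs_offset)   # 8
```

Under this pairing the breaks differ by about 2.6 residues on average, with the largest gap (8 residues) at the final break.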

The changes in association measure values correspond to feature boundaries
• Goal: automatically detect partition points based on association measures

Partitioning algorithm
• Cluster adjacent association values
  • each group is represented by its mean value
• Calculate standard deviation of values over all clusters
• Locate partition points in data based on:
  • deviation from mean
  • [change between adjacent values]

Parameters
• Cluster adjacent association values
  • each group is represented by its mean value
  • parameter: window size for computing mean
• Calculate standard deviation of values over all clusters
• Locate partition points in data based on:
  • deviation from mean
  • [change between adjacent values]
  • parameter: cutoff distance from mean for a value to be considered "extreme"
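The steps and the two parameters above can be sketched as follows. This is one possible reading of the slides, not the author's implementation. It assumes non-overlapping windows, a cutoff expressed in standard deviations, and that only windows falling below the overall mean are flagged, since low association marks candidate partition points. The bracketed "change between adjacent values" criterion is omitted. All function and parameter names are hypothetical.

```python
from statistics import mean, pstdev

def window_means(values, window):
    """Group adjacent values into non-overlapping windows; return each group's mean."""
    return [mean(values[i:i + window]) for i in range(0, len(values), window)]

def partition_points(values, window=3, cutoff=1.0):
    """Return indices of candidate partition points in `values`.

    A window is flagged when its mean lies more than `cutoff` standard
    deviations below the mean of all window means (assumed interpretation).
    """
    means = window_means(values, window)
    overall = mean(means)
    spread = pstdev(means)
    points = []
    for k, m in enumerate(means):
        if spread > 0 and (overall - m) > cutoff * spread:
            start = k * window
            group = values[start:start + window]
            # Report the position of the lowest value inside the flagged window.
            points.append(start + group.index(min(group)))
    return points

# Toy association profile (assumption, for illustration only).
toy = [5, 5, 5, 5, 5, 5, 0, 0, 0, 5, 5, 5]
print(partition_points(toy, window=3, cutoff=1.0))  # [6]
print(partition_points(toy, window=3, cutoff=2.0))  # []
```

Raising the cutoff makes the test stricter and yields fewer partition points, while a larger window smooths the profile before the test is applied.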

Effect of cutoff threshold on partitioning in opsd_human using mutual information

Effect of window size on partitioning in opsd_human using mutual information

GPCR: different subfamilies
• Class A Rhodopsin like
  • Amine
  • Peptide
  • Hormone protein
  • (Rhodopsin)
    • Rhodopsin Vertebrate
      • Rhodopsin Vertebrate type 1
      • Rhodopsin Vertebrate type 2
      • Rhodopsin Vertebrate type 3
      • Rhodopsin Vertebrate type 4
      • Rhodopsin Vertebrate type 5
    • Rhodopsin Arthropod
    • Rhodopsin Mollusc
    • Rhodopsin Other
  • Olfactory
  • Prostanoid
  • Nucleotide-like
  • Cannabis
  • Platelet activating factor
  • Gonadotropin-releasing hormone
  • Thyrotropin-releasing hormone & Secretagogue
  • Melatonin
  • Viral
  • Lysosphingolipid & LPA (EDG)
  • Leukotriene B4 receptor
  • Class A Orphan/other
• Class B Secretin like
• Class C Metabotropic glutamate / pheromone
• Class D Fungal pheromone
• Class E cAMP receptors (Dictyostelium)
• Frizzled/Smoothened family

GPCR: different subfamilies
Size      Hierarchy
717755    GPCR
371134      Class A
48393         Rhodopsin
33543           Vertebrate
20314             Vertebrate 1
348                 opsd_human
39724       Class B
20930       Class C

Structure of curve is preserved even when the dataset is small.

In progress / Future work
• Set parameters of partition algorithm automatically
• Apply to other sources of data, types of features
• Group amino acids into sub-classes
• Quantify the effect of training set information content and training set size
