Partitioning Sequences Based on Association Measures

Deborah Weisser, Carnegie Mellon University

Presentation Transcript


  1. Partitioning Sequences Based on Association Measures Deborah Weisser Carnegie Mellon University

  2. Feature boundaries • Need to know form and function of protein sequences to understand complex biological systems • Not possible to determine features or functions directly • Estimate feature positions by indirect laboratory experiments, e.g. hydrophobicity • Use statistical measures of association to determine feature boundaries

  3. Feature boundaries • Proteins are composed of adjacent, non-overlapping features: • helical, cytoplasmic, periplasmic, extracellular, intracellular, etc. • GPCR proteins have a fixed feature pattern, although feature positions are known for only one member of the family, Rhodopsin (opsd_human)

  4. Goal: Statistically determine feature boundaries in sequences of amino acids S H D E G C L S S E P K P R K Q S D S S T

  5. Association measures S H D E G C L S S E P K P R K Q S D S S T • 2.5 is a measure of the strength of the association between P and R

  6. Association measures S H D E G C L S S E P K P R K Q S D S S T 1.1 4.8 0.3 1.2 4.5 6.2 1.2 3.7 0.7 3.4 5.2 0.8 1.1 5.5 2.3 4.1 0.2 2.5 1.8 1.1 6.2

  7. Association measures S H D E G C L S S E P K P R K Q S D S S T 1.1 4.8 0.3 1.2 4.5 6.2 1.2 3.7 0.7 3.4 5.2 0.8 1.1 5.5 2.3 4.1 0.2 2.5 1.8 1.1 6.2 4.2 • Adjacent pairs with low association measures are candidates for partition points.

  8. Association measures are used to quantify correlations between adjacent amino acids • Yule’s Q statistic • Mutual information
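As a rough illustration of the two measures named on this slide, the sketch below computes Yule's Q and a per-pair mutual information score for adjacent residues. The slides do not give the exact formulas used, so the 2x2 contingency table for Yule's Q and the use of pointwise mutual information are assumptions, and the function names (`pair_counts`, `yules_q`, `pointwise_mi`) are hypothetical.

```python
import math
from collections import Counter


def pair_counts(sequences):
    """Count adjacent (left, right) residue pairs across all sequences."""
    counts = Counter()
    for seq in sequences:
        for left, right in zip(seq, seq[1:]):
            counts[(left, right)] += 1
    return counts


def _margins(counts, x, y):
    """Return (pairs x->y, pairs starting with x, pairs ending with y, all pairs)."""
    total = sum(counts.values())
    a = counts[(x, y)]
    row_x = sum(n for (left, _), n in counts.items() if left == x)
    col_y = sum(n for (_, right), n in counts.items() if right == y)
    return a, row_x, col_y, total


def yules_q(counts, x, y):
    """Yule's Q for 'x is immediately followed by y' (assumed 2x2 table)."""
    a, row_x, col_y, total = _margins(counts, x, y)
    b = row_x - a            # x followed by something other than y
    c = col_y - a            # y preceded by something other than x
    d = total - a - b - c    # neither x on the left nor y on the right
    denom = a * d + b * c
    return (a * d - b * c) / denom if denom else 0.0


def pointwise_mi(counts, x, y):
    """log2 of p(x, y) / (p(x) p(y)); an assumed per-pair form of MI."""
    a, row_x, col_y, total = _margins(counts, x, y)
    if a == 0:
        return float("-inf")  # the pair never occurs in the training data
    return math.log2(a * total / (row_x * col_y))


# Toy training data (hypothetical): the slide's example sequence only.
counts = pair_counts(["SHDEGCLSSEPKPRKQSDSST"])
print(yules_q(counts, "P", "R"), pointwise_mi(counts, "P", "R"))
```

In practice the counts would come from a large training set of sequences, and each adjacent pair in the target sequence would be scored with the resulting table.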

  9. MI breaks vs. hydropathy breaks [Figure: membrane diagram of helices I to VII with cytoplasmic (cp), transmembrane (helices), and extracellular (ec) domains; panels labeled "MI breaks" and "Hydropathy breaks"] • MI: 39, 63, 76, 94, 115, 136, 155, 176, 205, 233, 255, 279, 287, 301 • Hydropathy: 37, 61, 74, 98, 114, 133, 153, 176, 203, 230, 253, 276, 285, 309
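One way to read the two lists of break positions on this slide is to compare them entry by entry. The sketch below does that, assuming the n-th MI break corresponds to the n-th hydropathy break (the slide does not state this pairing explicitly); the helper name `break_offsets` is hypothetical.

```python
# Break positions as listed on the slide.
mi_breaks = [39, 63, 76, 94, 115, 136, 155, 176, 205, 233, 255, 279, 287, 301]
hydropathy_breaks = [37, 61, 74, 98, 114, 133, 153, 176, 203, 230, 253, 276, 285, 309]


def break_offsets(predicted, reference):
    """Signed difference between breaks paired by list position (assumed pairing)."""
    if len(predicted) != len(reference):
        raise ValueError("break lists must have the same length")
    return [p - r for p, r in zip(predicted, reference)]


offsets = break_offsets(mi_breaks, hydropathy_breaks)
abs_offsets = [abs(o) for o in offsets]

print("signed offsets:", offsets)
print("mean absolute offset:", sum(abs_offsets) / len(abs_offsets))
print("largest absolute offset:", max(abs_offsets))
```

Under this pairing, most MI breaks fall within a few residues of the corresponding hydropathy break, with the last pair (301 vs. 309) the furthest apart.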

  10. The changes in association measure values correspond to feature boundaries • Goal: automatically detect partition points based on association measures

  11. Partitioning algorithm • Cluster adjacent association values • each group is represented by its mean value • Calculate standard deviation of values over all clusters • Locate partition points in data based on: • deviation from mean • [change between adjacent values]

  12. Parameters • Cluster adjacent association values • each group is represented by its mean value (parameter: window size for computing mean) • Calculate standard deviation of values over all clusters • Locate partition points in data based on: • deviation from mean (parameter: cutoff distance from mean for a value to be considered “extreme”) • [change between adjacent values]
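The steps and parameters above can be sketched roughly as follows. This is a minimal illustration and not the author's implementation: it assumes non-overlapping windows of fixed size, measures the cutoff in units of the standard deviation of the window means, flags deviations in either direction, and omits the optional "change between adjacent values" criterion. The names `window_means` and `partition_points` are hypothetical.

```python
from statistics import mean, pstdev


def window_means(values, window):
    """Group adjacent values into consecutive windows: (start_index, mean) per group."""
    if window < 1:
        raise ValueError("window must be at least 1")
    return [(i, mean(values[i:i + window])) for i in range(0, len(values), window)]


def partition_points(values, window=2, cutoff=1.0):
    """Return start indices of windows whose mean is 'extreme'.

    A window is extreme when its mean lies more than `cutoff` standard
    deviations (computed over all window means) from the overall mean.
    """
    if not values:
        return []
    groups = window_means(values, window)
    means = [m for _, m in groups]
    center = mean(means)
    spread = pstdev(means)
    if spread == 0:
        return []  # all windows identical: nothing stands out
    return [start for start, m in groups if abs(m - center) > cutoff * spread]


# Hypothetical association values with a dip in the middle.
values = [5.0, 5.0, 5.0, 5.0, 0.0, 0.0, 5.0, 5.0]
print(partition_points(values, window=2, cutoff=1.0))
```

A larger `window` smooths out isolated low values, while a larger `cutoff` yields fewer partition points; a one-sided test (flagging only low means) would be a natural variant, since low association values are the partition candidates.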

  13. Effect of cutoff threshold on partitioning in opsd_human using mutual information

  14. Effect of window size on partitioning in opsd_human using mutual information

  15. GPCR: different subfamilies • Class A Rhodopsin like • Amine • Peptide • Hormone protein • Rhodopsin • Rhodopsin Vertebrate • Rhodopsin Vertebrate type 1 • Rhodopsin Vertebrate type 2 • Rhodopsin Vertebrate type 3 • Rhodopsin Vertebrate type 4 • Rhodopsin Vertebrate type 5 • Rhodopsin Arthropod • Rhodopsin Mollusc • Rhodopsin Other • Olfactory • Prostanoid • Nucleotide-like • Cannabis • Platelet activating factor • Gonadotropin-releasing hormone • Thyrotropin-releasing hormone & Secretagogue • Melatonin • Viral • Lysosphingolipid & LPA (EDG) • Leukotriene B4 receptor • Class A Orphan/other • Class B Secretin like • Class C Metabotropic glutamate / pheromone • Class D Fungal pheromone • Class E cAMP receptors (Dictyostelium) • Frizzled/Smoothened family

  16. GPCR: different subfamilies • Size by hierarchy level: • GPCR: 717755 • Class A: 371134 • Rhodopsin: 48393 • Vertebrate: 33543 • Vertebrate 1: 20314 • opsd_human: 348 • Class B: 39724 • Class C: 20930

  17. Structure of curve is preserved even when the dataset is small.

  18. In progress / Future work • Set parameters of partition algorithm automatically • Apply to other sources of data, types of features • Group amino acids into sub-classes • Quantify the effect of training set information content and training set size.
