
Protein and gene model inference based on statistical modeling in k-partite graphs

Sarah Gerster, Ermir Qeli, Christian H. Ahrens, and Peter Bühlmann. Problem description: given peptides and scores/probabilities, infer the set of proteins present in the sample.


Presentation Transcript


  1. Protein and gene model inference based on statistical modeling in k-partite graphs. Sarah Gerster, Ermir Qeli, Christian H. Ahrens, and Peter Bühlmann

  2. Problem Description • Given peptides and scores/probabilities, infer the set of proteins present in the sample. [Figure: peptides PERFGKLMQK, MLLTDFSSAWCR, TGYIPPPLJMGKR, FFRDESQINNR; Proteins A, B, C]

  3. Previous Approaches
     • N-peptides rule
     • ProteinProphet (Nesvizhskii et al. 2003. Anal Chem)
        • Assumes peptide scores are correct.
     • Nested mixture model (Li et al. 2010. Ann Appl Statist)
        • Rescores peptides while doing the protein inference
        • Does not allow shared peptides
        • Peptide scores are independent
     • Hierarchical statistical model (Shen et al. 2008. Bioinformatics)
        • Allows for shared peptides
        • Assumes PSM scores for the same peptide are independent
        • Impractical on normal datasets
     • MSBayesPro (Li et al. 2009. J Comput Biol)
        • Uses peptide detectabilities to determine peptide priors.
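The N-peptides rule is the simplest of these approaches and can be sketched in a few lines. This is a minimal illustration, assuming the common reading of the rule (accept a protein when at least N distinct peptides above a score cutoff map to it). The function name `n_peptides_rule`, the cutoff, the scores, and the peptide-to-protein mapping are all hypothetical; the slides do not give them.

```python
from collections import defaultdict


def n_peptides_rule(peptide_to_proteins, peptide_scores, n=2, min_score=0.9):
    """Return the proteins supported by at least n peptides scoring >= min_score."""
    support = defaultdict(set)
    for peptide, proteins in peptide_to_proteins.items():
        # Only confidently identified peptides count as evidence.
        if peptide_scores.get(peptide, 0.0) >= min_score:
            for protein in proteins:
                support[protein].add(peptide)
    return {protein for protein, peps in support.items() if len(peps) >= n}


# Hypothetical mapping and scores, invented for illustration only.
peptide_to_proteins = {
    "PERFGKLMQK": ["A"],
    "MLLTDFSSAWCR": ["A", "B"],
    "TGYIPPPLJMGKR": ["B"],
    "FFRDESQINNR": ["C"],
}
peptide_scores = {
    "PERFGKLMQK": 0.95,
    "MLLTDFSSAWCR": 0.92,
    "TGYIPPPLJMGKR": 0.40,
    "FFRDESQINNR": 0.99,
}

accepted = n_peptides_rule(peptide_to_proteins, peptide_scores, n=2)
print(sorted(accepted))  # ['A']
```

Note that the rule treats every peptide above the cutoff as equally certain and counts a shared peptide in full for each protein it maps to.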

  4. Markovian Inference of Proteins and Gene Models (MIPGEM) • Inclusion of shared/degenerate peptides in the model. • Treats peptide scores/probabilities as random values • Model allows dependence of peptide scores. • Inference of gene models

  5. Why scores as random values? [Figure: peptides PERFGKLMQK, MLLTDFSSAWCR, TGYIPPPLJMGKR, FFRDESQINNR; Proteins A, B, C]

  6. Building the bipartite graph

  7. Shared peptides
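The graph construction named on these two slides can be sketched as follows. This is a minimal illustration, assuming the input is a mapping from each peptide to the proteins whose sequences contain it; the function names and the toy data are hypothetical.

```python
from collections import defaultdict


def build_bipartite_graph(peptide_to_proteins):
    """Return adjacency maps for both sides of the peptide-protein graph."""
    peptide_neighbors = {pep: set(prots) for pep, prots in peptide_to_proteins.items()}
    protein_neighbors = defaultdict(set)
    for pep, prots in peptide_neighbors.items():
        for prot in prots:
            protein_neighbors[prot].add(pep)
    return peptide_neighbors, dict(protein_neighbors)


def shared_peptides(peptide_neighbors):
    """Peptides with edges to more than one protein (shared/degenerate peptides)."""
    return {pep for pep, prots in peptide_neighbors.items() if len(prots) > 1}


# Hypothetical toy input: peptide 2 occurs in both proteins.
toy_mapping = {
    "peptide1": ["A"],
    "peptide2": ["A", "B"],
    "peptide3": ["B"],
}
pep_nb, prot_nb = build_bipartite_graph(toy_mapping)
print(sorted(prot_nb["A"]))             # ['peptide1', 'peptide2']
print(sorted(shared_peptides(pep_nb)))  # ['peptide2']
```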

  8. Definitions • Let p_i be the score/probability of peptide i. I is the set of all peptides. • Let Z_j be the indicator variable for protein j. J is the set of all proteins.

  9. Simple Probability Rules

  10. Bayes Rule • Prior: probability of the protein being present • Likelihood: probability of observing these peptide scores given that the protein is present • Evidence: joint probability of seeing these peptide scores
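In the notation of the Definitions slide, the three annotated terms fit the standard form of Bayes' rule. The following is a reconstruction under that assumption, not necessarily the exact formula shown on the slide:

```latex
\[
P\bigl(Z_j = 1 \mid \{p_i\}_{i \in I}\bigr)
  = \frac{P\bigl(\{p_i\}_{i \in I} \mid Z_j = 1\bigr)\, P(Z_j = 1)}
         {P\bigl(\{p_i\}_{i \in I}\bigr)}
\]
```

Here P(Z_j = 1) is the prior, the conditional in the numerator is the probability of the observed peptide scores given that protein j is present, and the denominator is the joint probability of the peptide scores.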

  11. Assumptions • Prior probabilities of proteins are independent • Dependencies can be included with a little more effort. • This does not mean that proteins are independent.

  12. Assumptions • Connected components are independent
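Because components are treated as independent, inference can run on one connected component at a time. Below is a minimal sketch of splitting the peptide-protein graph into components with a breadth-first search; the function name and the toy mapping are hypothetical.

```python
from collections import deque


def connected_components(peptide_to_proteins):
    """Split a peptide-protein graph into (peptides, proteins) components."""
    # Tag nodes with their side so a peptide and a protein can share a name.
    adjacency = {}
    for pep, prots in peptide_to_proteins.items():
        pep_node = ("peptide", pep)
        adjacency.setdefault(pep_node, set())
        for prot in prots:
            prot_node = ("protein", prot)
            adjacency[pep_node].add(prot_node)
            adjacency.setdefault(prot_node, set()).add(pep_node)

    seen = set()
    components = []
    for start in adjacency:
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        peptides, proteins = set(), set()
        while queue:
            node = queue.popleft()
            kind, name = node
            (peptides if kind == "peptide" else proteins).add(name)
            for neighbor in adjacency[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        components.append((peptides, proteins))
    return components


# Hypothetical toy input: proteins A and B are linked through peptide2.
toy_graph = {
    "peptide1": ["A"],
    "peptide2": ["A", "B"],
    "peptide3": ["B"],
    "peptide4": ["C"],
}
components = connected_components(toy_graph)
print(len(components))  # 2
```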

  13. Assumptions • Peptide scores are independent given their neighboring proteins. • Ne(i) is the set of proteins connected to peptide i in the graph. • I_r is the set of peptides belonging to the r-th connected component • R(I_r) is the set of proteins connected to peptides in I_r

  14. Assumptions • Conditional peptide probabilities are modeled by a mixture model. • The specific mixture model they use is based on the peptide scores used (from PeptideProphet).
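Taken together, the assumptions on slides 11 to 14 allow protein posteriors to be computed by brute force on a small component. The sketch below rests on assumptions the transcript does not state: a peptide score follows a density `f1` when at least one neighboring protein is present and `f0` otherwise, and both densities are simple hypothetical stand-ins for the PeptideProphet-based mixture. The function name, the flat prior of 0.5, and the toy data are likewise illustrative.

```python
from itertools import product


def f0(x):
    # Hypothetical density on [0, 1] for peptides with no present neighbor
    # protein: low scores are more likely.
    return 2.0 * (1.0 - x)


def f1(x):
    # Hypothetical density on [0, 1] for peptides with at least one present
    # neighbor protein: high scores are more likely.
    return 2.0 * x


def protein_posteriors(peptide_to_proteins, scores, prior=0.5):
    """P(Z_j = 1 | scores) for one component, by enumerating all configurations."""
    proteins = sorted({p for prots in peptide_to_proteins.values() for p in prots})
    total = 0.0
    present_mass = dict.fromkeys(proteins, 0.0)
    for config in product([0, 1], repeat=len(proteins)):
        z = dict(zip(proteins, config))
        # Independent protein priors.
        weight = 1.0
        for prot in proteins:
            weight *= prior if z[prot] else 1.0 - prior
        # Peptide scores are independent given their neighboring proteins.
        for pep, prots in peptide_to_proteins.items():
            any_present = any(z[prot] for prot in prots)
            weight *= f1(scores[pep]) if any_present else f0(scores[pep])
        total += weight
        for prot in proteins:
            if z[prot]:
                present_mass[prot] += weight
    return {prot: present_mass[prot] / total for prot in proteins}


# Hypothetical component: peptide2 is shared between proteins A and B.
toy_component = {"peptide1": ["A"], "peptide2": ["A", "B"], "peptide3": ["B"]}
toy_scores = {"peptide1": 0.9, "peptide2": 0.8, "peptide3": 0.1}
posteriors = protein_posteriors(toy_component, toy_scores)
print(posteriors["A"] > posteriors["B"])  # True
```

In this toy case the high-scoring shared peptide is already explained by protein A, so protein B, whose only unique peptide scores poorly, ends up with a low posterior. The enumeration is exponential in the number of proteins and is only practical for small components.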

  15. Bayes Rule • Prior: probability of the protein being present • Likelihood: probability of observing these peptide scores given that the protein is present • Evidence: joint probability of seeing these peptide scores

  16. Joint peptide score distribution • Assumption: peptides in different components are independent • I_r is the set of peptides in component r • R(I_r) is the set of proteins connected to peptides in I_r
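Under the stated assumptions (independent components, independent protein priors, and peptide scores that are independent given their neighboring proteins), the joint distribution factorizes as sketched below. This is a reconstruction from the listed assumptions, not the slide's exact formula:

```latex
\begin{align*}
P\bigl(\{p_i\}_{i \in I}\bigr)
  &= \prod_{r} P\bigl(\{p_i\}_{i \in I_r}\bigr), \\
P\bigl(\{p_i\}_{i \in I_r}\bigr)
  &= \sum_{\{z_j\}_{j \in R(I_r)}}
     \Bigl[\prod_{j \in R(I_r)} P(Z_j = z_j)\Bigr]
     \Bigl[\prod_{i \in I_r} P\bigl(p_i \mid \{z_j\}_{j \in Ne(i)}\bigr)\Bigr]
\end{align*}
```

The sum runs over all present/absent configurations of the proteins in R(I_r), so its cost grows exponentially with the number of proteins in the component.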

  17. Conditional Probability • Mixture model

  18. Conditional Probability • Mixture model

  19. f_1(x): pdf of P(p_i | {z_j}) [Figure label: median]

  20. Choosing b1 and b2 • Seek to maximize the log likelihood of observing the peptide scores.

  21. Choosing b1 and b2 • It turns out:

  22. Conditional Protein Probabilities

  23. Conditional Protein Probabilities

  24. Conditional Protein Probabilities (NEC Correction)

  25. Conditional Protein Probabilities

  26. Conditional Protein Probabilities

  27. Conditional Protein Probabilities

  28. Shared Peptides

  29. Shared Peptides

  30. Shared Peptides • If the shared peptide has p_i ≥ median

  31. Shared Peptides • If the shared peptide has p_i < median

  32. Gene Model Inference

  33. Gene Model Inference • Assume a gene model, X, has only protein sequences which belong to the same connected component. [Figure: Peptides 1 to 4, Proteins A and B, Gene X]

  34. Gene Model Inference • Assume a gene model, X, has only protein sequences which belong to the same connected component. • R(X) is the set of proteins with edges to X. • I_r(X) is the set of peptides with edges to proteins with edges to X

  35. Gene Model Inference • Gene model, X, has proteins from different connected components of the peptide-protein graph. [Figure: Peptides 1 to 4, Proteins A and B, Gene X]

  36. Gene Model Inference • Gene model, X, has proteins from different connected components of the peptide-protein graph. • R_l(X) is the set of proteins with edges to X in component l. • I_l(X) is the set of peptides with edges to proteins with edges to X in component l.
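One way to turn these definitions into a gene model probability is sketched below. It assumes that a gene model counts as present when at least one of its proteins is present, which the transcript does not state explicitly, and it uses the independence of connected components to multiply across components. Treat it as a sketch under those assumptions, not as the slides' formula.

```latex
\begin{align*}
\text{Single component:}\quad
P\bigl(X = 1 \mid \{p_i\}_{i \in I_r(X)}\bigr)
  &= 1 - P\bigl(Z_j = 0 \;\; \forall j \in R(X) \mid \{p_i\}_{i \in I_r(X)}\bigr) \\
\text{Several components:}\quad
P\bigl(X = 1 \mid \{p_i\}\bigr)
  &= 1 - \prod_{l} P\bigl(Z_j = 0 \;\; \forall j \in R_l(X) \mid \{p_i\}_{i \in I_l(X)}\bigr)
\end{align*}
```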

  37. Datasets • Mixture of 18 purified proteins • Mixture of 49 proteins (Sigma49) • Drosophila melanogaster • Saccharomyces cerevisiae (~4200 proteins) • Arabidopsis thaliana (~4580 gene models)

  38. Comparisons with other tools • Small datasets with a known answer [Figure panels: Mix of 18 proteins; Sigma49]

  39. Comparisons with other tools • One-hit wonders [Figure panels: Sigma49 with one-hit wonders; Sigma49 without one-hit wonders]

  40. Comparison with other tools • Arabidopsis thaliana dataset has many proteins with high sequence similarity.

  41. Splice isoforms

  42. Conclusion + Criticism • Developed a model for protein and gene model inference. • Comparisons with other tools do not justify the model's complexity: • The preference for a small false-positive (FP) rate at the expense of many false negatives (FN) is not shared by all applications. • Discards some useful information, such as the number of spectra per peptide. • Assumptions of parsimony from pruning may be too aggressive.
