Bottom-Up Proteomics Data collection

Bottom-Up ProteomicsData collection Ruedi Aebersold, Ph.D Institute of Molecular Systems Biology, ETH-Zürich; Faculty of Science, University of Zürich

The proteome: The ensemble of all biochemical reactions

Steps of Bottom-up proteomics protein identifications protein sample Database Protein level A B C D A B C Peptide grouping/ validation enzymatic digestion Peptide level Quantitation Validation Database search peptide mixture peptide identifications LC/MS/MS MS/MS spectrum level MS/MS spectra Protein Inference assumptions?? -- many possible, none is right

The proteome as seen by a mass spectrometer: Possibly:10exp6- 10exp8 features m/z 1200 1100 1000 900 800 700 600 500 400 min 0 10 20 30 40 50 60 70 80 90 100 110

Slicing and dicing the proteome

The global (quantitative) analysis of the proteins expressed in a cell at a time Enumerate all the components of a proteome Detect dynamic changes in proteome following external or internal perturbations - Proteome as Database -Analytic chemistry slant - Proteomics as Biol. or clin. Assay - Biology slant Proteome analyzed once Proteome analyzed multiple (infinite) times Proteomics: Haynes P, Gygi S, Figeys D, and Aebersold R. (1998) Proteome analysis: biological assay or data archive? Electrophoresis 19:1862-1871

The global (quantitative) analysis of the proteins expressed in a cell at a time Enumerate all the components of a proteome Detect dynamic changes in proteome following external or internal perturbations Proteome as database: Proteomics as Biol. or clin. assay: Proteome analyzed once Proteome analyzed multiple (infinite) times Proteomics:

Human PeptideAtlas 2013-2015 14,274 13,230 +34 +3 +4 +110 +377 +516 2013 2014 2015 2013 2014 2015 2013 2013

Human PeptideAtlas 2013-2015 14,274 13,230 +34 +3 +4 +110 +377 +516 True new identifications or statistical noise? 2013 2014 2015 2013 2014 2015 2013 2013

Open questions re: Proteome catalogue… • When have we reached an endpoint in proteome cataloguing? • Why do we reach apparent saturation before hitting all predicted ORF’s? • What are relevant endpoints? (one representative per ORF?, all proteoforms? Other?) • How do we quantify proteins? • How do errors propagate in large datasets and how do we control FDR at peptide and protein level? • How do we best complete the catalogue? • What (biology) can we learn from the (complete) catalogue?

The global (quantitative) analysis of the proteins expressed in a cell at a time Enumerate all the components of a proteome Detect dynamic changes in proteome following external or internal perturbations Proteome as database: Proteomics as Biol. or clin. assay: Proteome analyzed once Proteome analyzed multiple (infinite) times Proteomics:

Data and the scientific method Scientists are trained to recognize that correlation is not causation, that no conclusions should be drawn simply on the basis of correlation between X and Y (it could just be a coincidence). Instead, you must understand the underlying mechanisms that connect the two. Once you have a model, you can connect the data sets with confidence. Data without a model is just noise. But faced with massive data, this approach to science — hypothesize, model, test — is becoming obsolete. There is now a better way. Petabytes allow us to say: "Correlation is enough." We can stop looking for models. We can analyze the data without hypotheses about what it might show. We can throw the numbers into the biggest computing clusters the world has ever seen and let statistical algorithms find patterns where science cannot.

Data matrix supporting analyses via association • Accurately and reproducibly quantifyproteotypes across samples and conditions • Conditions: • Clinical cohorts • Time courses • Dosage courses • Samples with structured genomes Conditions 1-n Proteins 1--n

Open questions re: Association studies • How many proteins are enough? • Which ones? • How precisely do proteins need to be quantified? • Which peptides are best suited to quantify a protein? • Should proteins be considered as independent actors (like transcripts) or as parts of modules? • What factors affect protein modules and how? • How do errors propagate in large datasets and how do we control FDR at peptide and protein level? • Are data reliable, robust and accessible enough? • Data integration, dissemination of methods.

Bottom-Up Proteomics Data collection

Bottom-Up Proteomics Data collection

Presentation Transcript

Bottom-up parsing

Proteomics Data analysis

Bottom-up Parsing

Bottom Up Production

Bottom-up Control

BOTTOM-UP PARSING

Bottom up Parsing

Bottom-up Technology

Bottom-up Technology

Bottom-up Data Analysis

Bottom-up Parsing

Bottom-Up Parsing

Bottom up parsing

Bottom-Up Parser

Bottom Up Parsing

Bottom-Up Concurrency

Bottom-up Parsing

Bottom-Up Parsing

Why Bottom-Up ?

Parsing Bottom Up

Bottom Up ROFO

Bottom-up parsing