Serum Diagnosis of Chronic Fatigue Syndrome (CFS) Using Array-based Proteomics

Pingzhao Hu, W Le, S Lim, B Xing, CMT Greenwood and J Beyene • Hospital for Sick Children Research Institute and University of Toronto • The Sixth International Conference for the Critical Assessment of Microarray Data Analysis (CAMDA 2006), Duke University, Durham, NC, U.S.A., June 8-9, 2006

Outline • Objectives • Data Set • Methods & Results 3.1 – Preprocessing (identification of biomarkers) 3.2 – Classification model • Conclusions

Objectives • Identify biomarkers for CFS/CFS-like diseases using SELDI-TOF MS technology • Evaluate performance of the identified biomarkers to distinguish patients with CFS/CFS-like from healthy people • Determine the best experimental protocol for large sample studies by choosing the best Fraction/Chip/Laser Energy combinations

Data Set • Each spectrum has ~30000 m/z values for high laser energy and ~20000 m/z values for low laser energy (note: fractions f1 and f2 have more m/z values). • Each combination (Fraction/Chip/Laser energy) includes 144 spectra ((31+32+9)*2, i.e. 72 samples with two replicates each) • f1 and f2 were not analyzed because their numbers of spectra and m/z values differ from those of the other fractions • QC samples were not analyzed here

Data Analysis Pipeline • Preprocessing • Baseline subtraction (already done) • Trimming low m/z values • Normalization • Peak finding and alignment • Quantification of aligned peaks • Merging replicate samples • Classification • Do a 10-fold cross-validation (CV) • For each step of CV • Split samples of preprocessed data into training and test sets • Perform biomarker selection on the training set using t-tests • Build a prediction model on the training set • kernel-based K-nearest neighbor (KNN) classifier • Evaluate performance using the test set
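The cross-validation loop in the pipeline can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the shuffling seed and the `fit_and_score` callback are all hypothetical. The point it shows is that biomarker selection and model fitting see only the training indices of each fold.

```python
import random

def k_fold_indices(n_samples, n_folds=10, seed=0):
    """Shuffle sample indices and deal them into n_folds nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::n_folds] for i in range(n_folds)]

def cross_validate(n_samples, fit_and_score, n_folds=10, seed=0):
    """Call fit_and_score(train_idx, test_idx) once per fold and collect the scores.

    fit_and_score is a hypothetical callback: it should select biomarkers and
    fit the classifier on train_idx only, then return the accuracy on test_idx.
    """
    folds = k_fold_indices(n_samples, n_folds, seed)
    scores = []
    for held_out in range(n_folds):
        test_idx = folds[held_out]
        train_idx = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        scores.append(fit_and_score(train_idx, test_idx))
    return scores
```

With the 72 merged samples of one condition, each fold holds 7 or 8 test samples.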

Trimming low m/z values • Low laser energy allows peaks in the low mass range to be well visualized • High laser energy improves visualization of peaks in the high mass range • Many studies (e.g. Baggerly et al. 2003) indicate that there is a noisy m/z region near the lower limit where the machine cannot record stably. • For these reasons, we trimmed low m/z values using the following thresholds: • For the low laser energy condition, we removed m/z values below 100 • For the high laser energy condition, we removed m/z values below 2000
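A minimal sketch of the trimming step, assuming each spectrum is held as two parallel lists of m/z values and intensities (this layout and the function name are illustrative assumptions; the thresholds are the ones stated on the slide):

```python
# Lower m/z cutoffs from the slide, keyed by laser energy condition.
MZ_CUTOFF = {"low": 100.0, "high": 2000.0}

def trim_low_mz(mz, intensity, laser_energy):
    """Drop all points whose m/z value is below the cutoff for this laser energy."""
    cutoff = MZ_CUTOFF[laser_energy]
    kept = [(m, x) for m, x in zip(mz, intensity) if m >= cutoff]
    return [m for m, _ in kept], [x for _, x in kept]
```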

Global Normalization (Li 2005) • Given a spectrum with intensities Xi (i=1,...,n) over all n m/z values, the normalized intensities Xinorm are computed as Xinorm = s*Xi, where s = (median of the total intensities among all spectra) / (total intensity of the current spectrum). • Multiplying the raw intensities by the factor s sets the total intensity of every spectrum to the median total (the mean could be used instead), so the compared spectra share a common scale
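The scaling can be illustrated with a few lines of Python. This is a sketch under the assumption that every spectrum is a list of intensities with a positive total; the function name is hypothetical and this is not the PROcess implementation.

```python
from statistics import median

def global_normalize(spectra):
    """Scale each spectrum so that its total intensity equals the median total.

    spectra: list of intensity lists, one list per spectrum.
    Assumes every spectrum has a positive total intensity.
    """
    totals = [sum(spectrum) for spectrum in spectra]
    target = median(totals)
    # s = target / total is the per-spectrum factor from the slide.
    return [[x * target / total for x in spectrum]
            for spectrum, total in zip(spectra, totals)]
```

After normalization all spectra have the same total intensity, so differences in overall signal strength between runs no longer dominate peak comparisons.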

Peak finding -- Why • The height of a peak at a given m/z value indicates the presence and the approximate amount of the corresponding protein or peptide in the sample • However, not every peak at an m/z value is related to a protein or even a part of a protein • We need to search for those peaks that may represent a protein or a part of a protein

Peak finding -- Algorithm (Tuszynski 2006)

Peak Alignment -- Why [Figure: peak R1 at m/z value L1, peak R2 at m/z value L2, and their common location L3] • Assume two peaks, R1 at m/z value L1 and R2 at m/z value L2, are detected in two different spectra. • It is known that the m/z value of the same peak may shift slightly between spectra (0.1%-0.3%). • The shift must be adjusted so that peaks within the given shift interval (say, m/z * (1-0.2%,1+0.2%)) are aligned to the same m/z value • The objective of alignment is to estimate the common m/z value L3 of the peaks within the given shift interval across spectra

Alignment -- Algorithm: Maximal cliques & real representations (Li 2005, Gentleman 2001) [Figure: peaks R1-R9 grouped into cliques, with the aligned peak centers marked; one grouping is labeled "Not a maximum clique"] • Find maximal cliques: {1,2}, {3,4,5,6}, {7,8}, {8,9} • Real representations: Find the common m/z region for each maximal clique and estimate the aligned peak centers using maximum likelihood estimation (MLE)
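The clique step can be sketched with a sweep over interval endpoints. This is an illustration only, not the PROcess code: it assumes each detected peak is given the shift interval m/z * (1-0.2%,1+0.2%) and that two peaks are connected when their intervals overlap. The function names are hypothetical, and the MLE of the peak center is not reproduced; only the common m/z region of each clique is computed.

```python
def maximal_cliques(peak_mz, shift=0.002):
    """Return maximal groups of peaks whose shift intervals mutually overlap.

    peak_mz: list of peak m/z values; indices into this list identify peaks.
    """
    events = []
    for idx, m in enumerate(peak_mz):
        events.append((m * (1 - shift), 0, idx))  # interval opens
        events.append((m * (1 + shift), 1, idx))  # interval closes
    events.sort()  # at equal positions, opens come before closes
    active, cliques, grew = set(), [], False
    for _, kind, idx in events:
        if kind == 0:
            active.add(idx)
            grew = True
        else:
            # The active set is a maximal clique if it grew since the last report.
            if grew:
                cliques.append(sorted(active))
                grew = False
            active.remove(idx)
    return cliques

def common_region(peak_mz, clique, shift=0.002):
    """Intersection of the shift intervals of all peaks in one clique."""
    low = max(peak_mz[i] * (1 - shift) for i in clique)
    high = min(peak_mz[i] * (1 + shift) for i in clique)
    return low, high
```

A peak may belong to two cliques, as peak 8 does in the slide's example, when it overlaps two neighbors that do not overlap each other.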

Quantify aligned peaks for individual spectra • Each aligned peak location (m/z) can be treated as an interval, m/z * (1-0.2%,1+0.2%) • For each individual spectrum, the intensity at an aligned peak location is quantified as the maximum raw intensity within that interval [Figure legend: Black: raw m/z values with intensities; Red: aligned peak m/z value, which has no associated peak in the raw data and is assigned the quantified intensity; Blue: left and right limits of the interval around the aligned peak]
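A minimal sketch of this quantification rule, assuming a spectrum stored as parallel lists of m/z values and intensities. The function name is hypothetical, and returning 0.0 when no raw point falls inside the interval is an assumption of this sketch, not something stated on the slide.

```python
def quantify_peak(mz, intensity, center, shift=0.002):
    """Maximum raw intensity inside center * (1 - shift, 1 + shift).

    Returns 0.0 if the spectrum has no raw point in the interval (an assumption).
    """
    low, high = center * (1 - shift), center * (1 + shift)
    window = [x for m, x in zip(mz, intensity) if low <= m <= high]
    return max(window) if window else 0.0
```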

Merging replicate samples • After quantifying the aligned peaks in all individual spectra, we averaged the intensities of the two replicates for each sample. • The averaged intensities were used to build our prediction model.
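The averaging can be sketched as below, assuming each quantified spectrum is a `(sample_id, peak_intensities)` pair with the same aligned peaks in the same order. The data layout and the function name are illustrative assumptions.

```python
from collections import defaultdict

def merge_replicates(spectra):
    """Average the peak intensities of all replicates that share a sample id.

    spectra: list of (sample_id, peak_intensities) pairs.
    Returns a dict mapping sample_id to the averaged intensity vector.
    """
    grouped = defaultdict(list)
    for sample_id, peaks in spectra:
        grouped[sample_id].append(peaks)
    return {sample_id: [sum(column) / len(column) for column in zip(*replicates)]
            for sample_id, replicates in grouped.items()}
```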

Predictor -- K-Nearest Neighbor (KNN) Method [Figure: a new point x with its neighborhoods for k=1 and k=6] • To classify a new input vector (observation) v, examine the k closest training data points to v and assign v to the most frequently occurring class • The neighborhood is defined by a mathematical distance measure • Deficiency: • The individual points in a neighborhood may have very different similarities to v (distances from v), but they all have the same influence on the prediction

Predictor -- Kernel-based KNN Method (Hechenbichler and Schliep 2004) • To classify a new observation v, examine the k+1 nearest neighbors of v according to Euclidean distance (d) • The (k+1)st neighbor is used to standardize the k smallest distances: D(i) = D(v, v(i)) = d(v, v(i)) / d(v, v(k+1)), i=1,...,k • Transform the normalized distance D(i) into a weight using a Gaussian kernel function K(.): w(i) = K(D(i)) • Assign a prediction label to v by weighted majority vote, y_hat = argmax_r sum_{i=1..k} w(i) * I(y(i) = r), where y can be either CFS/CFS-like (r=1) or NON-CFS (r=0) • k is implicitly hidden in the weights: if k is too large, it is effectively adjusted to a smaller value, since a small number of neighbors with large weights dominate the other neighbors (very small weights have no influence on the prediction) • We set k=3 in this study
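The weighting scheme described on this slide can be sketched as follows. This is a simplified illustration, not the kknn package: it assumes a plain Gaussian kernel exp(-D^2/2) with no further rescaling, needs at least k+1 training points, and uses a hypothetical function name.

```python
import math

def kernel_knn_predict(train_x, train_y, v, k=3):
    """Predict the label of v by a kernel-weighted vote among its k nearest neighbors.

    train_x: list of feature vectors; train_y: list of labels (e.g. 0 or 1).
    Requires at least k + 1 training points.
    """
    by_distance = sorted((math.dist(x, v), y) for x, y in zip(train_x, train_y))
    nearest = by_distance[:k + 1]
    scale = nearest[k][0]  # distance to the (k+1)st neighbor
    votes = {}
    for d, y in nearest[:k]:
        standardized = d / scale if scale > 0 else 0.0
        # Gaussian kernel; the constant factor is dropped because it does not
        # change which class collects the largest total weight.
        weight = math.exp(-standardized ** 2 / 2)
        votes[y] = votes.get(y, 0.0) + weight
    return max(votes, key=votes.get)
```

Closer neighbors receive weights nearer to 1, so they count more than neighbors that lie almost as far away as the (k+1)st point.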

Results • We first define some terms used in this section • Condition: Experimental protocol (Fraction/Chip/Laser energy) • Biomarkers: The aligned peaks • Differentially expressed biomarkers: Aligned peaks with t-test p-values less than 0.05
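The selection of differentially expressed peaks can be sketched in plain Python. The slides do not say which t-test variant was used, so the pooled-variance two-sample test below is an assumption, as are the function names. The p-value is obtained by numerically integrating the t density after the substitution x = sqrt(df) * tan(theta), and each peak is assumed to vary in at least one group so the pooled variance is nonzero.

```python
import math
from statistics import mean, variance

def two_sided_p(t, df, steps=2000):
    """Two-sided p-value of a t statistic with integer df (Simpson's rule)."""
    theta = math.atan(abs(t) / math.sqrt(df))
    h = (math.pi / 2 - theta) / steps

    def integrand(angle):
        return max(math.cos(angle), 0.0) ** (df - 1)

    total = integrand(theta) + integrand(math.pi / 2)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * integrand(theta + i * h)
    log_k = math.lgamma((df + 1) / 2) - math.lgamma(df / 2) - 0.5 * math.log(math.pi)
    return min(1.0, 2.0 * math.exp(log_k) * total * h / 3.0)

def t_test(a, b):
    """Pooled-variance two-sample t-test; returns (t, p)."""
    na, nb = len(a), len(b)
    df = na + nb - 2
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / df
    t = (mean(a) - mean(b)) / math.sqrt(pooled * (1 / na + 1 / nb))
    return t, two_sided_p(t, df)

def select_biomarkers(cases, controls, alpha=0.05):
    """Indices of aligned peaks whose t-test p-value is below alpha.

    cases, controls: lists of per-sample peak-intensity vectors.
    """
    selected = []
    for j, (a, b) in enumerate(zip(zip(*cases), zip(*controls))):
        _, p = t_test(a, b)
        if p < alpha:
            selected.append(j)
    return selected
```

Inside cross-validation, `cases` and `controls` would contain training samples only, so the test fold never influences which peaks are selected.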

The number of biomarkers identified in each condition, and the number significant (p<0.05): High Laser Energy • Note 1: Only conditions (of 32 conditions in total) with at least 2 differentially expressed peaks (p<=0.05) are listed

The number of biomarkers identified in each condition, and the number significant (p<0.05), continued: Low Laser Energy

Comments • Using low laser energy, 13 conditions (Fraction/Chip) yielded at least two differentially expressed peaks/biomarkers (p<=0.05) • Using high laser energy, only 5 conditions (Fraction/Chip) yielded at least two differentially expressed peaks/biomarkers (p<=0.05)

Performance of the kernel-based KNN predictors using selected biomarkers in each condition • Note 1: The number of biomarkers selected in each of the 10 cross-validations • Only conditions with accuracy above 60% are listed

Biomarkers used in building the prediction model for condition: H50, Low laser energy, F6 • * 9 of 299 peaks (after alignment) have p-values less than 0.05 • ** The number of times the biomarker was picked in the 10 CV folds • Three m/z value ranges seem to be interesting: • 499-503 • 526-528 • 7784-7785

Conclusions • Based on our analysis, the best combination (laser energy, chip and fraction) appears to be • Low laser energy/H50/Fraction 6 • We identified 9 differentially expressed biomarkers (p-value<=0.05), which are located in 3 m/z value ranges: • 499-503 • 526-528 • 7784-7785 • Using 14 biomarkers identified from the combination, our predictor can reach ~80% accuracy.

Limitations • For all combinations of experimental protocol, we used the same m/z shift interval (m/z * (1-0.2%,1+0.2%)). A better choice may be obtained by estimating the interval for each combination from the QC samples • We did not take the multiple testing issue into account in this analysis

Acknowledgements • We used the following R packages to perform the analysis in this study: • caMassClass (Jarek Tuszynski) • PROcess (Xiaochun Li) • kknn (Klaus Hechenbichler and Klaus Schliep) • This research was supported by funding from the Ontario Genomics Institute and Genome Canada, through the Centre for Applied Genomics.
