Presentation Transcript


  1. Yu Shyr (石 瑜), Ph.D. May 14, 2008 China Medical University Yu.Shyr@vanderbilt.edu The Biostatistical & Bioinformatics Challenges in the High Dimensional Data Derived from High Throughput Assays: Today and Tomorrow

  2. Vanderbilt University 泛德堡大學

  3. US News & World Report (America's Best Colleges, 2007)

  4. US News & World Report (America's Best Colleges, 2007)

  5. Tennessee, the “Volunteer State”

  6. Nashville, TN- “Music City, USA!”

  7. Vanderbilt University

  8. Vanderbilt University • A private, nonsectarian, coeducational research university in Nashville, TN. • Established in 1873 by shipping and rail magnate Cornelius Vanderbilt. • Enrolls 11,000 students in ten schools annually. • Ranks 18th in the nation among national research universities. • Also has several research facilities and a world-renowned medical center. • Famous alumni include former Vice President Al Gore.

  9. Vanderbilt University Medical Center

  10. VUMC • A collection of several hospitals and clinics associated with Vanderbilt University in Nashville, Tennessee. • In 2003, it was placed on the Honor Roll of the nation's best hospitals. • The medical school was ranked 17th in the nation among research-oriented medical schools and in the ISI top 5 for research impact in clinical medicine and pharmacology.

  11. Vanderbilt-Ingram Cancer Center

  12. Vanderbilt-Ingram Cancer Center • Only NCI-designated Comprehensive Cancer Center in Tennessee and one of only 39 in the United States • Nearly 300 investigators in seven research programs • More than $190 million in annual research funding • Among the top 10 in competitively awarded NCI grant support

  13. Vanderbilt-Ingram Cancer Center • Ranks 20th in the nation and is consistently ranked among the best places for cancer care by U.S. News & World Report. • One of a select few centers to hold agreements with the NCI to conduct Phase I and Phase II clinical trials, where innovative therapies are first evaluated in patients.

  14. Department of Biostatistics • Created by the School of Medicine at Vanderbilt University in September 2003. • The Dean and other senior medical school faculty are committed to providing outstanding collaborative support in biostatistics to clinical and basic scientists, and to developing a graduate program in biostatistics that will train outstanding collaborative scientists and focus on the methods of modern applied statistics.

  15. High Dimensional Data • The major challenge in high throughput experiments, e.g., microarray data, MALDI-TOF data, SELDI-TOF data, or shotgun proteomic data, is that the data is often high dimensional. • When the number of dimensions reaches thousands or more, the computational time for the pattern recognition algorithms can become unreasonable. This can be a problem, especially when some of the features are not discriminatory.

  16. High Dimensional Data • The irrelevant features may cause a reduction in the accuracy of some algorithms. For example (Witten 1999), experiments with a decision tree classifier have shown that adding a random binary feature to standard datasets can deteriorate the classification performance by 5 - 10%. • Furthermore, in many pattern recognition tasks, the number of features represents the dimension of a search space: the larger the number of features, the greater the dimension of the search space, and the harder the problem.

  17. Outcome Measurement: MALDI-TOF

  18. Reflex MALDI-TOF Mass Spectrometer [instrument schematic: nitrogen laser (337 nm), laser optics, MALDI target, ion grid, ion mirror, TOF analyzer, microchannel detector]

  19. Time-of-Flight Mass Spectrometry (TOF-MS) [linear TOF schematic: ionizing probe (start), accelerating voltage ±U, ions M1 < M2 < M3 reaching the ion detector (MCP) at times t1 < t2 < t3]
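As background for the schematic above, the standard time-of-flight relation (a textbook identity, not taken from the slides) explains why lighter ions arrive earlier:

```latex
% Ions of mass m and charge z accelerated through potential U gain kinetic energy zeU,
% so over a field-free drift length L the arrival time grows with sqrt(m/z):
\[
  \tfrac{1}{2}\, m v^{2} = z e U
  \quad\Longrightarrow\quad
  t = \frac{L}{v} = L \sqrt{\frac{m}{2\, z e U}} \;\propto\; \sqrt{m/z},
\]
% hence t_1 < t_2 < t_3 for M_1 < M_2 < M_3 in the linear TOF sketch.
```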

  20. Issues in the Analysis of High-Throughput Experiment • Experiment Design • Measurement • Preprocessing ♦ Baseline Correction, Normalization ♦ Profile Alignment, Feature Selection, Denoising • QCA (Quality Control Assessment) • Feature Selection • Classification

  21. Issues in the Analysis of High-Throughput Experiment • Computational Validation ♦ Estimate the classification error rate ♦ Bootstrapping, k-fold validation, leave-one-out validation • Significance Testing of the Achieved Classification Error • Validation – blind test cohort • Validation – laboratory technology, e.g., RT-PCR, pathway analysis • Reporting the results – graphics & tables

  22. Preprocessing • Mass Spectrometry (MS) can generate high throughput protein profiles for biomedical applications. A consistent, sensitive, and robust MS data preprocessing method is highly desirable because subsequent analyses are determined by the preprocessing output. • The preprocessing goal is to extract and quantify the common features across the spectra. • We propose a new comprehensive MALDI-TOF MS data preprocessing method using feedback concepts associated with several new algorithms. • This new package successfully resolves many conventional difficulties, such as removing m/z measurement error, objectively setting de-noising parameters, and defining common features across spectra.

  23. Basic Descriptions of the Data Preprocessing: Registration, Denoising, Baseline Correction, Normalization, Peak Selection, Peak Alignment or Binning. Math Model for MS Data Preprocessing • From a mathematical point of view, an MS spectrum is a signal function defined on a time or m/z domain. An observed MS signal is often modeled as the superposition of three components, f(x) = B(x) + N·S(x) + e(x), where f(x) is the observed signal, B(x) is a slowly varying "baseline" artifact, S(x) is the "true" signal (peaks) to be extracted, N is the normalization factor, and e(x) represents noise.
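As a concrete illustration of this additive model, the sketch below simulates a spectrum as f(x) = B(x) + N·S(x) + e(x). It is an illustrative example only, not the authors' code; the Gaussian peak shapes, exponential baseline, and noise level are assumptions.

```python
import numpy as np

def simulate_spectrum(mz, peak_centers, peak_heights, width=2.0,
                      norm_factor=1.0, noise_sd=0.05, seed=0):
    """Simulate f(x) = B(x) + N*S(x) + e(x) on an m/z grid.

    Assumed component shapes (illustration only):
      B(x): slowly decaying exponential baseline
      S(x): sum of Gaussian peaks
      e(x): i.i.d. Gaussian noise
    """
    rng = np.random.default_rng(seed)
    baseline = 5.0 * np.exp(-mz / mz.max())                      # B(x)
    signal = sum(h * np.exp(-0.5 * ((mz - c) / width) ** 2)      # S(x)
                 for c, h in zip(peak_centers, peak_heights))
    noise = rng.normal(0.0, noise_sd, size=mz.size)              # e(x)
    return baseline + norm_factor * signal + noise               # f(x)

# Example: three peaks on an m/z grid from 1000 to 2000
mz = np.linspace(1000, 2000, 5000)
f = simulate_spectrum(mz, peak_centers=[1200, 1500, 1800],
                      peak_heights=[3.0, 5.0, 2.0])
```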

  24. Math Model for MS Data Preprocessing • The preprocessing goal is to identify, quantify, and match peaks across spectra. • Several modern algorithms, such as wavelets, splines, and the nonparametric local maximum likelihood estimate (NLMLE), are successfully applied throughout the processing system. • The feedback steps optimize the calibration and peak-picking procedures automatically.

  25. Raw data

  26. General steps (1) Calibration: calibration based on multiple identified peaks (linear shifts on the time domain) and the peak shape (convolution); in the process, all spectra become aligned. (2) Quantification: baseline correction (splines) => normalization (TIC) => area-based peak quantification. (3) Feature Extraction: denoising (wavelets) => peak selection (local maxima) => common peak finding across spectra (NLMLE). (4) Feedback: optimally choosing calibration peaks and setting feature extraction parameters.

  27. Flowchart of the Preprocessing Procedure [stages: Raw data, Calibration, Alignment, Baseline Correction, Normalization, De-noising, Peak Detection, Peak Distribution, Common Feature Detection, Results]

  28. Convolution-Based Calibration Algorithm 1. Known-peak simulation (choose peaks with high prevalence across spectra (80%) and a clear pattern, selected by feedback). 2. Convolve each spectrum with the simulated known peak shape (Gaussian or Beta); the maximum occurs where the two peak shapes match best. 3. The linear shift that makes the multiple peaks match best is the optimal shift. Note: all processing is done on the time domain.
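A minimal sketch of the convolution/correlation idea for a single calibration peak, assuming a Gaussian template and a global shift in time-domain samples (the actual package matches multiple peaks chosen by feedback; the function and parameter names here are illustrative):

```python
import numpy as np

def estimate_shift(spectrum, ref_index, peak_sigma=5, max_shift=50):
    """Estimate the linear shift (in samples) aligning `spectrum` to a known
    reference peak expected at index `ref_index`, by correlating the spectrum
    with a simulated Gaussian peak shape."""
    # Simulated known-peak template (Gaussian; a Beta shape could be used instead)
    x = np.arange(-4 * peak_sigma, 4 * peak_sigma + 1)
    template = np.exp(-0.5 * (x / peak_sigma) ** 2)

    # The correlation is maximal where observed and simulated peak shapes match best
    corr = np.correlate(spectrum, template, mode="same")

    # Search only within +/- max_shift samples of the expected peak position
    lo = max(ref_index - max_shift, 0)
    hi = min(ref_index + max_shift + 1, spectrum.size)
    best = lo + int(np.argmax(corr[lo:hi]))
    return best - ref_index   # positive: the peak sits to the right of where expected

# Each spectrum would then be shifted by -estimate_shift(...) samples; with several
# calibration peaks, the shift maximizing the summed match would be chosen.
```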

  29. Pre-Calibration

  30. Post-Calibration 1. Accurate m/z peak positions (as theoretical). 2. Less variation in peak positions. 3. Easy to handle large datasets in batch mode.

  31. Pre-Calibration

  32. Post-Calibration

  33. Baseline Correction & Normalization • The baseline is generally considered an artificial bias in the signal. • We propose that the baseline may be caused by delayed charge release. • We apply quadratic splines to the local minima, found by sliding windows, to obtain a continuous baseline curve. • Trimmed total ion current (TIC) normalization.
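A hedged sketch of these two steps using SciPy. The window length, spline handling, and the 5% trimming fraction are illustrative assumptions, not values stated in the talk.

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

def correct_baseline(intensity, window=250):
    """Fit a quadratic spline (k=2) through sliding-window local minima
    and subtract it as the baseline estimate."""
    minima = []
    for start in range(0, intensity.size, window):
        seg = intensity[start:start + window]
        minima.append(start + int(np.argmin(seg)))          # one local minimum per window
    minima = np.asarray(minima)
    spline = UnivariateSpline(minima, intensity[minima], k=2, s=0)
    baseline = spline(np.arange(intensity.size))
    return np.maximum(intensity - baseline, 0.0)

def tic_normalize(intensity, trim=0.05):
    """Trimmed total-ion-current normalization: divide by the summed intensity
    after discarding the top `trim` fraction of points (one possible reading
    of 'trimmed TIC')."""
    cutoff = np.quantile(intensity, 1.0 - trim)
    tic = intensity[intensity <= cutoff].sum()
    return intensity / tic
```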

  34. Baseline Data Before Correction

  35. Baseline Corrected Data

  36. Wavelet Denoising • Wavelets underlie the FBI's image coding standard for digitized fingerprints and can successfully reproduce the true signal by removing noise at specific energy levels. • The wavelet method has been used to denoise signals in a wide variety of contexts. • The wavelet method analyzes the data in both the time and frequency domains to extract more useful information. • An adaptive stationary discrete wavelet denoising method, which is shift-invariant and efficient in denoising, is applied in our research.

  37. Denoising strategy • The stationary discrete wavelet denoising method is shift-invariant and offers both good reconstruction performance and smoothness. • The adaptive denoising method is based on the noise distribution: we set different threshold values at different mass intervals and frequency levels. • Parameters (decomposition levels and thresholds) are determined by the feedback information.
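A minimal shift-invariant denoising sketch with PyWavelets. The wavelet family, decomposition level, and the universal threshold with a per-level MAD noise estimate are assumptions; the talk's adaptive, mass-interval-specific thresholds chosen by feedback are not reproduced here.

```python
import numpy as np
import pywt

def swt_denoise(signal, wavelet="sym8", level=5):
    """Stationary (undecimated) discrete wavelet denoising with soft thresholding."""
    n = len(signal)
    pad = (-n) % (2 ** level)                      # SWT needs length divisible by 2^level
    padded = np.pad(signal, (0, pad), mode="edge")

    coeffs = pywt.swt(padded, wavelet, level=level)
    thresholded = []
    for approx, detail in coeffs:
        sigma = np.median(np.abs(detail)) / 0.6745            # robust noise estimate (MAD)
        thr = sigma * np.sqrt(2.0 * np.log(detail.size))      # universal threshold
        thresholded.append((approx, pywt.threshold(detail, thr, mode="soft")))

    return pywt.iswt(thresholded, wavelet)[:n]
```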

  38. DWT Decomposition

  39. Denoised Data

  40. Peak list across spectra

  41. Kernel Density Estimation
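Slides 40-43 suggest pooling the per-spectrum peak locations and examining their density; the sketch below uses a Gaussian kernel density estimate whose local maxima mark candidate common-peak m/z positions. This is a simplified stand-in, since the talk's common-peak method is the NLMLE; the bandwidth and grid spacing are assumptions.

```python
import numpy as np
from scipy.stats import gaussian_kde
from scipy.signal import find_peaks

def common_peaks(peak_mz_lists, grid_step=0.5, bw=0.01):
    """Pool per-spectrum peak m/z values and return the modes of their
    kernel density estimate as candidate common-peak locations."""
    pooled = np.concatenate(peak_mz_lists)                     # all detected peaks, all spectra
    grid = np.arange(pooled.min(), pooled.max() + grid_step, grid_step)
    density = gaussian_kde(pooled, bw_method=bw)(grid)
    modes, _ = find_peaks(density)                             # local maxima of the density
    return grid[modes], density[modes]
```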

  42. Peak distribution without high-quality preprocessing

  43. Peak distribution with high-quality preprocessing

  44. Peak Selection

  45. Peak Selection

  46. Preprocessing on one spectrum after calibration • Read in the spectrum as two columns: m/z values and corresponding intensities. • Apply the adaptive stationary discrete wavelet transform for denoising. • Estimate the baseline with sliding-window splines and subtract it. • Apply total ion current normalization over the whole spectrum. • Local maxima contribute to the peak list across spectra.
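For the last bullet, a hedged sketch of local-maximum peak picking on a denoised, baseline-corrected, normalized spectrum; the signal-to-noise cutoff is an illustrative choice, not a value from the talk.

```python
import numpy as np
from scipy.signal import find_peaks

def pick_peaks(mz, intensity, min_snr=3.0):
    """Return local maxima whose height exceeds min_snr times a robust
    (MAD-based) noise estimate of the processed spectrum."""
    noise = np.median(np.abs(intensity - np.median(intensity))) / 0.6745
    idx, _ = find_peaks(intensity, height=min_snr * noise)
    return mz[idx], intensity[idx]                 # these feed the peak list across spectra
```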

  47. Expression Profiles [panels: day 1 - day 4]

  48. The Results from the Cluster Analysis
