Eason Cheng Sep 18, 2014

Group Meeting Presentation Eason Cheng Sep 18, 2014

CrossNorm: a novel normalization strategy for microarray data in cancer

Outline • Background and Introduction • Method and Datasets • Results • Conclusion and Discussion

Background • Purpose of preprocessing/normalization: to remove system variation (noise) while keeping biological variation (information for analysis). [Diagram labels: Gene chip; System variation; Biological variation; Gene expression profile]

Background 3 steps of preprocessing (noise removal): • Background Correction • Remove local artifacts and “noise” (within arrays) • measurements are not so affected by neighboring measurements • Normalization • Remove array effects (among arrays) • measurements from different arrays are comparable • Summarization • Combine probe (gene segment) intensities across arrays • final measurement represents gene expression level

Background Choice makes a difference: • MAS 5.0 • dChip • GCRMA • RMA • Convolution Background Correction • Quantile Normalization • http://en.wikipedia.org/wiki/Quantile_normalization • Tukey’s Median Polish
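As a rough illustration of the quantile normalization step listed above, here is a minimal sketch in Python with NumPy. It assumes a probes-by-arrays matrix of intensities and breaks ties by position rather than averaging them, which established implementations may handle differently. The function name `quantile_normalize` and the toy matrix are illustrative only and do not come from the slides.

```python
import numpy as np

def quantile_normalize(x):
    """Force every column (array) of a probes-by-arrays matrix to share
    one intensity distribution: the mean of the sorted columns."""
    x = np.asarray(x, dtype=float)
    # Sort order of the probes within each array.
    order = np.argsort(x, axis=0, kind="stable")
    sorted_x = np.take_along_axis(x, order, axis=0)
    # Reference distribution: mean intensity at each rank across arrays.
    reference = sorted_x.mean(axis=1)
    # Write the reference value for each rank back to the probe's position.
    out = np.empty_like(x)
    np.put_along_axis(out, order, reference[:, None], axis=0)
    return out

# Toy data (made up): 4 probes measured on 3 arrays.
x = np.array([[2.0, 4.0, 4.0],
              [5.0, 14.0, 8.0],
              [4.0, 8.0, 6.0],
              [3.0, 9.0, 5.0]])
q = quantile_normalize(x)
print(q)
```

After this step every array has exactly the same set of values, so any global shift between arrays is removed, whether it was technical or biological.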

Background

Background Assumptions: • Only a few genes are differentially expressed (DE) • Upward and downward expression-level changes are balanced • All arrays are forced to have the same probe intensity distribution. Complicated disease? Cancer?

Background Is the assumption valid for cancer? Figure 1. Box plot of sample median values before normalization in control (white) and cancer (grey) sample groups for each dataset.

Background Is the assumption valid for cancer? Table 1. Comparison of sample medians of raw signal intensities between cancer and normal groups. (10/18)

Background The influence of over-normalization

Background We should note that: • Gene expression tends to be extensively up-regulated in cancers. • Effective signals naturally exist in the raw data. • The assumptions underlying most current normalization algorithms may not hold true.

Background (Motivation) Assumptions: • Only a few genes are differentially expressed (DE) • Upward and downward expression-level changes are balanced • All arrays are forced to have the same probe intensity distribution. X X X Complicated disease? Cancer? The assumptions are NOT reasonable for cancer studies. We propose a novel normalization strategy.

Method Requirements for novel methods: • Keep the property of extensive up-regulation • Do not over-normalize • CrossNorm: Cross Normalization • LVS: Least Variation Set Normalization

Method: CrossNorm [Figure: cross quantile normalization; profiles of the C and D arrays before and after CrossQuan] • Keeps the rank order within an individual; avoids over-normalization between conditions.

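Method • Let $C_1, \dots, C_m$ be the expression profiles of the control arrays, and let $D_1, \dots, D_n$ be the expression profiles of the disease arrays. The $C_i$'s and $D_j$'s have the same length $p$ (the number of genes). CrossNorm for the paired case, where $m = n$: • Form a matrix with columns $(C_i^T, D_i^T)^T$, $i = 1, \dots, n$, so that each column stacks a control profile on its paired disease profile; • Normalize the columns with any approach you intend, such as quantile normalization, to obtain a matrix with columns $(C_i'^T, D_i'^T)^T$; • Obtain the final normalized control arrays as $C_i'$ and the disease arrays as $D_i'$.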
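The paired procedure above can be sketched in Python with NumPy. This is a minimal sketch under stated assumptions, not the authors' implementation: it assumes that each matrix column is formed by stacking a control profile on top of its paired disease profile (a reading of the slide's diagram), and it uses a simple quantile normalization with positional tie-breaking as the column normalizer. The names `cross_norm` and `quantile_normalize`, and the simulated data, are hypothetical.

```python
import numpy as np

def quantile_normalize(x):
    """Give every column the mean sorted-column distribution (ties broken by position)."""
    x = np.asarray(x, dtype=float)
    order = np.argsort(x, axis=0, kind="stable")
    reference = np.take_along_axis(x, order, axis=0).mean(axis=1)
    out = np.empty_like(x)
    np.put_along_axis(out, order, reference[:, None], axis=0)
    return out

def cross_norm(control, disease, normalize=quantile_normalize):
    """Paired CrossNorm sketch: inputs are genes-by-pairs matrices of equal shape."""
    control = np.asarray(control, dtype=float)
    disease = np.asarray(disease, dtype=float)
    if control.shape != disease.shape:
        raise ValueError("paired case needs control and disease of equal shape")
    n_genes = control.shape[0]
    # Assumption: column i stacks control profile i on disease profile i.
    stacked = np.vstack([control, disease])
    normalized = normalize(stacked)
    # Split each normalized column back into its control and disease halves.
    return normalized[:n_genes], normalized[n_genes:]

# Simulated log-intensities (made up): disease arrays shifted upward globally.
rng = np.random.default_rng(0)
control = rng.normal(8.0, 1.0, size=(501, 3))
disease = rng.normal(9.0, 1.0, size=(501, 3))
c_norm, d_norm = cross_norm(control, disease)
print(np.median(c_norm, axis=0), np.median(d_norm, axis=0))
```

Because control and disease values from one individual share a column, normalization equalizes the stacked columns across individuals but leaves the ordering between the two conditions within an individual untouched, so a global upward shift in the disease arrays survives.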

Data sets Affymetrix Spike-in data sets: • Spike-in Human Genome U133 data set • Spike-in DrosGenome1 data set Real-world cancer data sets: 18 cancer data sets collected from

Data sets Affymetrix Spike-in Data Set: 1) Spike-in Human Genome U133 data set • Based on a Latin square experiment with 42 arrays • 42 spiked-in genes overall, at concentrations ranging from 0.0 to 512 pM • Each concentration was run in three replicates • Each array contains 22,283 probes.

Data sets Affymetrix Spike-in Data Set: 2) Spike-in DrosGenome1 data set • A set of 14,010 probe sets • 3,866 were assigned a known concentration fold change (FC) • 2,535 probe sets were assigned an unchanged concentration (FC = 1) • 1,331 had an FC greater than 1, ranging from 1.2 to 4 (FC > 1) • 10,144 empty probe sets • not spiked in at any concentration (removed in the project).

Data sets Cancer Datasets:

Result HG U133

Result DrosGenome1

Result Figure 1. Box plot of sample median values after CrossNorm in control (white) and cancer (grey) sample groups for each dataset.

Result • Identifying differentially expressed (DE) genes: • Fold change (FC) with different thresholds. Assessment of reproducibility: • Percentage of Overlapping Genes (POG) • POG measures the overlapping genes as a percentage of the total number of genes in the two gene sets. • Direction Consistency (DC) ratio • The DC ratio is the proportion of genes that have the same regulation direction in both gene sets.
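The two reproducibility scores can be sketched as follows. The slide does not give explicit formulas, so this sketch makes two loudly labeled assumptions: POG is taken as twice the number of shared genes divided by the combined size of the two DE lists, and the DC ratio is computed over the shared genes only. The helper names (`de_genes`, `pog`, `dc_ratio`) and the toy values are hypothetical, and the fold-change rule assumes log2 intensities.

```python
import numpy as np

def de_genes(control, disease, gene_ids, fc_threshold=2.0):
    """Map each DE gene to +1 (up) or -1 (down) using a fold-change cutoff.
    Inputs are genes-by-arrays matrices of log2 intensities."""
    log_fc = np.mean(disease, axis=1) - np.mean(control, axis=1)
    cutoff = np.log2(fc_threshold)
    return {g: (1 if d > 0 else -1)
            for g, d in zip(gene_ids, log_fc) if abs(d) >= cutoff}

def pog(a, b):
    """Assumed definition: shared genes relative to the combined list sizes."""
    shared = a.keys() & b.keys()
    total = len(a) + len(b)
    return 2 * len(shared) / total if total else 0.0

def dc_ratio(a, b):
    """Assumed definition: fraction of shared genes with the same direction."""
    shared = a.keys() & b.keys()
    if not shared:
        return 0.0
    return sum(a[g] == b[g] for g in shared) / len(shared)

# Toy log2 data (made up): 3 genes, 2 arrays per group.
genes = ["g1", "g2", "g3"]
control = [[1.0, 1.0], [5.0, 5.0], [3.0, 3.0]]
disease = [[3.0, 3.0], [4.5, 4.5], [1.0, 1.0]]
print(de_genes(control, disease, genes))

# Two hypothetical DE lists from independent data sets of the same cancer.
a = {"g1": 1, "g2": -1, "g3": 1}
b = {"g2": -1, "g3": -1, "g4": 1}
print(pog(a, b), dc_ratio(a, b))
```

Other definitions of POG normalize by the length of one list instead of both, so the formula should be checked against the analysis it is meant to reproduce.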

Result Table 2. (a) The consistency statistic of data sets for ESCC and Pancreatic cancer. • (b) The consistency statistic of data sets for ESCC and Pancreatic cancer. • DC: Direction Consistency; POG: Percentage of Overlapping Genes

Conclusion • CrossNorm is a modification of existing normalization methods for processing microarray data sets with global shifts across samples. • It makes the most of the raw signal and maintains the regulation direction. • CrossNorm outperforms global normalizations, as well as the already well-performing LVS normalization approach, in differential analysis with a high degree of biological variation.

Conclusion • CrossNorm fully utilizes the biological signal in the raw data rather than artificially presetting parameters or predefining the proportion of assumed housekeeping genes, as LVS does. • Its application is not restricted to cancer studies; it also suits research comparing tissues or developmental stages, as genes are expected to show high variation in both cases. • The strategy could also be extended to all sorts of baseline normalizations.

Discussion Identifying the regulation direction of genes is of vital importance for subsequent biological analysis: • expression correlation of gene products • regulatory relations between miRNA and target mRNA • detecting the regulation direction of oncogenes and tumor suppressor genes.

Future work CrossNorm is a robust and unbiased procedure that could help us better understand expression differences among samples. • Correlation study • miRNA data • RNA-seq data • Preprocessing of published data sets

Q & A THANK YOU!
