
Eason Cheng Sep 18, 2014


Presentation Transcript


  1. Group Meeting Presentation Eason Cheng Sep 18, 2014

  2. CrossNorm: a novel normalization strategy for microarray data in cancer

  3. Outline • Background and Introduction • Method and Datasets • Results • Conclusion and Discussion

  4. Outline • Background and Introduction • Method and Datasets • Results • Conclusion and Discussion

  5. Background • Purpose of preprocessing/normalization: to remove system variation (noise) while keeping biological variation (information for analysis) [Diagram labels: Gene Chip, System variation, Biological variation, Gene expression profile]

  6. Background 3 steps of preprocessing (noise removal): • Background Correction • Remove local artifacts and “noise” (within arrays) • measurements are not so affected by neighboring measurements • Normalization • Remove array effects (among arrays) • measurements from different arrays are comparable • Summarization • Combine probe (gene segment) intensities across arrays • final measurement represents gene expression level

  7. Background 3 steps of preprocessing (noise removal): • Background Correction • Remove local artifacts and “noise” (within arrays) • measurements are not so affected by neighboring measurements • Normalization • Remove array effects (among arrays) • measurements from different arrays are comparable • Summarization • Combine probe (gene segment) intensities across arrays • final measurement represents gene expression level

  8. Background Choice makes a difference: • MAS 5.0 • dChip • GCRMA • RMA • Convolution Background Correction • Quantile Normalization • http://en.wikipedia.org/wiki/Quantile_normalization • Tukey’s Median Polish

  9. Background

  10. Background Assumptions: • Only a few genes are Differentially Expressed (DE) • Balanced upward and downward expression level changes • Forcing all arrays to have the same probe intensity distribution. Complicated disease? Cancer?
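To make the third assumption concrete, here is a minimal sketch of quantile normalization in plain Python. It is an illustration only, not the implementation used in the talk: the function name `quantile_normalize` and the toy intensities are assumptions, and ties are broken by position rather than averaged.

```python
def quantile_normalize(columns):
    """Quantile-normalize equal-length arrays (one list of intensities per array)."""
    n = len(columns[0])
    # Reference distribution: mean across arrays of the r-th smallest value.
    sorted_cols = [sorted(col) for col in columns]
    reference = [sum(col[r] for col in sorted_cols) / len(columns) for r in range(n)]
    normalized = []
    for col in columns:
        # order[r] is the index of the r-th smallest value (ties broken by position).
        order = sorted(range(n), key=lambda i: col[i])
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = reference[rank]
        normalized.append(out)
    return normalized


# Toy data (hypothetical): the second array is a globally up-shifted copy of the first.
control = [1.0, 2.0, 3.0]
cancer = [2.0, 4.0, 6.0]
norm_control, norm_cancer = quantile_normalize([control, cancer])
```

After normalization both toy arrays hold exactly the same values, so a global shift between the two groups is erased. That is the behaviour the following slides question for cancer data.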

  11. Background Is the assumption valid for cancer? Figure 1. Box plot of sample median values before normalization in the control (white) and cancer (grey) sample groups for each dataset.

  12. Background Is the assumption valid for cancer? Table 1. Comparison of sample medians of raw signal intensities between the cancer and normal groups. (10/18)

  13. Background The influence of over-normalization

  14. Background We should note that: • Gene expression tends to show excessive up-regulation in cancers. • Effective signals naturally exist in the raw data. • The assumptions underlying most current normalization algorithms may not hold true.

  15. Background (Motivation) Assumptions (each marked with an X on the slide): • Only a few genes are Differentially Expressed (DE) • Balanced upward and downward expression level changes • Forcing all arrays to have the same probe intensity distribution. Complicated disease? Cancer? The assumptions are NOT reasonable for cancer studies. We propose a novel normalization strategy.

  16. Outline • Background and Introduction • Method and Datasets • Results • Conclusion and Discussion

  17. Method Requirements for a novel method: • Keep the property of extensive up-regulation • Do not over-normalize • CrossNorm: Cross Normalization • LVS: Least Variation Set Normalization

  18. Method: CrossNorm [Figure: cross quantile normalization; panels "Profile" and "Profile after CrossQuan", arrays labelled C and D]

  19. Method: CrossNorm [Figure: cross quantile normalization; panels "Profile" and "Profile after CrossQuan", arrays labelled C and D] • Keep the rank order within an individual; avoid over-normalization between conditions.

  20. Method • Let $c_1, \dots, c_m$ be the expression profiles of the control arrays, and let $d_1, \dots, d_n$ be the expression profiles of the disease arrays. The $c_i$'s and $d_j$'s have the same length $p$ (the number of genes). CrossNorm for the paired case where $m = n$: • Form a matrix with columns $(c_i^T, d_i^T)^T$, $i = 1, \dots, n$; • Normalize the columns with any approach you choose, such as quantile normalization, to obtain a matrix with columns $(\tilde{c}_i^T, \tilde{d}_i^T)^T$; • Obtain the final normalized control arrays as $\tilde{c}_i$ and the disease ones as $\tilde{d}_i$.
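The paired procedure on this slide can be sketched as follows, under the reading that each control profile is stacked on top of its paired disease profile to form one long column, the columns are normalized together, and each column is then split back into its control and disease halves. The function names, the choice of quantile normalization as the inner method, and the toy values are all assumptions made for illustration; this is not the authors' code.

```python
def quantile_normalize(columns):
    """Quantile-normalize equal-length columns (ties broken by position)."""
    n = len(columns[0])
    sorted_cols = [sorted(col) for col in columns]
    reference = [sum(col[r] for col in sorted_cols) / len(columns) for r in range(n)]
    normalized = []
    for col in columns:
        order = sorted(range(n), key=lambda i: col[i])
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = reference[rank]
        normalized.append(out)
    return normalized


def cross_norm_paired(controls, diseases, normalize=quantile_normalize):
    """Hypothetical helper: CrossNorm for paired control/disease arrays."""
    if len(controls) != len(diseases):
        raise ValueError("paired CrossNorm needs as many control as disease arrays")
    p = len(controls[0])  # number of genes
    # Step 1: one column per pair, control values followed by disease values.
    stacked = [list(c) + list(d) for c, d in zip(controls, diseases)]
    # Step 2: normalize the stacked columns with the chosen method.
    normalized = normalize(stacked)
    # Step 3: split each normalized column back into control and disease parts.
    return [col[:p] for col in normalized], [col[p:] for col in normalized]


# Toy data (hypothetical): disease arrays are globally higher than controls.
controls = [[1.0, 2.0, 3.0], [1.5, 2.5, 3.5]]
diseases = [[4.0, 5.0, 9.0], [3.0, 6.0, 8.0]]
norm_controls, norm_diseases = cross_norm_paired(controls, diseases)
```

Because control and disease values of one individual are ranked together in a single column, the normalized disease arrays in this toy example remain higher overall than their paired controls, while the two pairs end up on a common scale.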

  21. Data sets Affymetrix spike-in data sets: • Spike-in Human Genome U133 data set • Spike-in DrosGenome1 data set Real-world cancer data sets: 18 cancer data sets collected from

  22. Data sets Affymetrix Spike-in Data Set: 1) Spike-in Human Genome U133 dataset • based on a latin-square experiment with 42 arrays • overall 42 spiked-in genes at various concentrations ranging from 0.0 to 512 pM. • Each concentration was performed with three replicates • each array contains 22,283 probes.

  23. Data sets Affymetrix Spike-in Data Set: 2) Spike-in DrosGenome1 data set • A set of 14,010 probe sets • 3,866 probe sets were assigned a known concentration fold change (FC). • 2,535 probe sets were assigned an unchanged concentration (FC=1) • 1,331 probe sets have FC greater than 1, ranging from 1.2 to 4 (FC>1) • 10,144 empty probe sets • not spiked in at any concentration (removed in the project).

  24. Data sets Cancer Datasets:

  25. Outline • Background and Introduction • Method and Datasets • Results • Conclusion and Discussion

  26. Result HG U133

  27. Result DrosGenome1

  28. Result DrosGenome1

  29. Result Figure 1. Box plot of sample median values after CrossNorm in the control (white) and cancer (grey) sample groups for each dataset.

  30. Result • Identifying Differentially Expressed (DE) genes: • Fold Change (FC) with different thresholds. Assessment of reproducibility: • Percentage of Overlap Gene (POG) • POG is a score measuring the percentage of overlapping genes accounting for the total number of the two gene sets. • Direction Consistency (DC) ratio • DC ratio is the ratio of the genes that had the same regulation direction for both gene sets.
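The two reproducibility scores can be sketched as below. The slide does not give exact formulas, so this block assumes one possible reading: POG is the number of shared DE genes divided by the number of distinct genes in the two lists combined, and the DC ratio is the fraction of shared genes whose regulation direction agrees. The function names, the fold-change rule on mean intensities, the default threshold of 2, and the gene symbols are all illustrative assumptions.

```python
def de_by_fold_change(control_mean, disease_mean, threshold=2.0):
    """Call DE genes by fold change; returns {gene: +1 (up) or -1 (down)}."""
    calls = {}
    for gene, ctrl in control_mean.items():
        fc = disease_mean[gene] / ctrl  # assumes positive, non-log intensities
        if fc >= threshold:
            calls[gene] = 1
        elif fc <= 1.0 / threshold:
            calls[gene] = -1
    return calls


def pog(calls_a, calls_b):
    """Assumed POG: shared DE genes over all distinct DE genes in both lists."""
    a, b = set(calls_a), set(calls_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def dc_ratio(calls_a, calls_b):
    """Assumed DC ratio: share of overlapping genes with the same direction."""
    shared = set(calls_a) & set(calls_b)
    if not shared:
        return 0.0
    same = sum(1 for g in shared if calls_a[g] == calls_b[g])
    return same / len(shared)


# Hypothetical DE calls from two data sets of the same cancer type.
study_a = {"TP53": -1, "MYC": 1, "EGFR": 1}
study_b = {"MYC": 1, "EGFR": -1, "KRAS": 1}
pog_score = pog(study_a, study_b)      # 2 shared genes out of 4 distinct genes
dc_score = dc_ratio(study_a, study_b)  # MYC agrees, EGFR disagrees
```

If POG is instead meant relative to the summed or averaged list lengths, only the denominator in `pog` changes.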

  31. Result Table 2. (a) The consistency statistic of data sets for ESCC and Pancreatic cancer. • (b) The consistency statistic of data sets for ESCC and Pancreatic cancer. • DC: Direction Consistency; POG: Percent of Overlap Gene

  32. Outline • Background and Introduction • Method and Datasets • Results • Conclusion and Discussion

  33. Conclusion • CrossNorm is a modification of existing normalization methods to process microarray data sets with global shifts over samples. • It makes the most of the raw signal and maintains the regulation direction. • CrossNorm outperforms global normalizations, as well as the already well-performing LVS normalization approach, when it comes to differential analysis with a high degree of biological variation.

  34. Conclusion • CrossNorm fully utilizes the biological signal in the raw data rather than artificially presetting parameters or predefining the proportion of assumed housekeeping genes, as LVS does. • Its application is not restricted to cancer studies; it also suits research comparing tissues and developmental stages, as genes are expected to have high variation in both cases. • The strategy could also be extended to all sorts of baseline normalizations.

  35. Discussion Identifying the regulation direction of genes is of vital importance for subsequent biological analysis, such as: • expression correlation of gene products • regulatory relations between miRNA and target mRNA • detecting the regulation direction of oncogenes and tumor suppressor genes.

  36. Future work CrossNorm is a robust and unbiased procedure that could help us better understand expression differences among samples. • Correlation study • miRNA data • RNA-seq data • preprocessing of published data sets

  37. Q & A THANK YOU!
