On the Importance of Data Cleansing and Pre-processing for Genomic Studies

Raymond Ng

Computer Science, UBC

(ICapture and BC Cancer Research)

My Key Genomic Projects
  • Better biomarkers in transplantation (Genome Canada)
  • Rational chemotherapy selection for Non-Small Cell lung cancer (Genome Canada)
  • Nanosilver effects on amphibian wildlife using novel molecular assays (NSERC)
  • Frog Sentinel species comparative “omics” for the environment (Genome BC)
Overview: Better Biomarkers in Transplantation
  • Vital organ failure a leading cause of premature death world-wide
  • Organ transplantation restores life and health to over 40,000 patients per year
  • Post-transplants not clear sailing:
    • Immunosuppressive drugs cause infection, cancer, diabetes, heart diseases, and kidney failures
    • Transplant failure and treatment complications consume enormous health care resources
BiT Overall Objectives
  • Identify effective and widely applicable markers that…
    • predict rejection or immune accommodation of solid organ transplants
    • diagnose acute and chronic rejection
    • forecast the response to therapies that individual transplant recipients receive
BiT Analysis Pipeline
  • PAXgene Whole Blood
  • Affymetrix HG U133 Plus 2 Microarrays
  • Normalization and Pre-filtering
  • Univariate feature selection
  • Classifier building + Pathway analysis
  • Biomarker Panel
  • Data reduction along the pipeline: 54,675 probe sets (~38,500 genes), then ~15,000 probe sets, then ~200 probe sets
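
The pre-filtering and univariate feature selection stages can be illustrated with a minimal sketch. It assumes the expression values are already normalized, uses a variance cutoff as the pre-filter and a Welch t-statistic as the univariate score; both choices, the function names, and the toy probe-set names are assumptions for illustration, not the filters used in the BiT pipeline.

```python
import statistics

def prefilter(expr, min_variance):
    """Keep probe sets whose variance across samples is at least min_variance."""
    return {probe: values for probe, values in expr.items()
            if statistics.variance(values) >= min_variance}

def welch_t(a, b):
    """Welch t-statistic between two groups of expression values."""
    se = (statistics.variance(a) / len(a) + statistics.variance(b) / len(b)) ** 0.5
    if se == 0:
        return 0.0  # no within-group spread; treated as uninformative in this sketch
    return (statistics.mean(a) - statistics.mean(b)) / se

def top_probe_sets(expr, labels, k):
    """Rank probe sets by |t| between label 1 and label 0 samples; return the top k."""
    scores = {}
    for probe, values in expr.items():
        group1 = [v for v, y in zip(values, labels) if y == 1]
        group0 = [v for v, y in zip(values, labels) if y == 0]
        scores[probe] = abs(welch_t(group1, group0))
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Hypothetical toy data: three probe sets measured on six samples.
expr = {
    "probe_a": [8.0, 8.2, 7.9, 5.1, 5.0, 4.8],  # differs between the two groups
    "probe_b": [6.0, 6.0, 6.1, 6.0, 6.1, 6.0],  # nearly constant
    "probe_c": [5.0, 7.0, 6.0, 6.5, 5.5, 6.2],  # variable but not discriminating
}
labels = [1, 1, 1, 0, 0, 0]

filtered = prefilter(expr, min_variance=0.1)
print(sorted(filtered))
print(top_probe_sets(filtered, labels, k=1))
```

On real data the same two steps would be applied to tens of thousands of probe sets, with the cutoffs chosen to reach the target panel size.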

Complex Data Types

Common Theme: data cleansing and pre-processing

My Studies on Cleansing and Pre-processing
  • Clinical: “Detecting potential labeling errors in microarrays by data perturbation,” Bioinformatics 2006 (Malossini, Blanzieri)
  • mRNA: “MDQC: a new quality assessment method for microarrays based on quality control reports,” Bioinformatics 2007 (Cohen-Freue, Hollander et al.)
  • DNA: “Modelling Recurrent DNA Copy Number Alterations in array CGH Data,” Bioinformatics 2007 (Shah, Murphy, Lam)
  • Proteomics: “Linking Protein Groups Across Multiple Experiments,” in preparation (Cohen-Freue et al.)
A. Detecting Potential Labeling Errors [MBN06]
  • Biomedical data can be very noisy:
    • Laboratory environment could change
    • Diagnostic decisions are not completely objective
    • Different “gold-standards” are used for grading
  • Essential to check for label (e.g., grade of rejection) consistency
A. Our Approach
  • Propose a leave-one-out perturbed classification matrix:
    • Flip the label of each training sample in turn and compare the resulting classifier with the classifier trained on the original training set
  • Look for differences in the sets obtained from Support Vector Machines
    • E.g., sample A is suspected of being mislabeled if flipping A’s label increases accuracy
  • Effectiveness shown on 3 real microarray data sets with ground truths
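
The label-flipping idea above can be sketched in a few lines. This is a simplified illustration, not the published method: it swaps the Support Vector Machine for a nearest-centroid classifier so that it needs only the standard library, it scores each flip by leave-one-out accuracy, and all function names and the toy data are assumptions.

```python
def centroid(points):
    """Component-wise mean of a list of equal-length feature vectors."""
    return [sum(col) / len(points) for col in zip(*points)]

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def predict(train_x, train_y, x):
    """Nearest-centroid prediction (a stand-in for the SVM named on the slide)."""
    best_label, best_d = None, float("inf")
    for label in sorted(set(train_y)):
        members = [p for p, y in zip(train_x, train_y) if y == label]
        d = sq_dist(centroid(members), x)
        if d < best_d:
            best_label, best_d = label, d
    return best_label

def loo_accuracy(xs, ys):
    """Leave-one-out accuracy: predict each sample from all the others."""
    hits = 0
    for i in range(len(xs)):
        hits += predict(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:], xs[i]) == ys[i]
    return hits / len(xs)

def suspects(xs, ys):
    """Indices whose label flip (0 <-> 1) raises leave-one-out accuracy."""
    base = loo_accuracy(xs, ys)
    flagged = []
    for i in range(len(xs)):
        flipped = ys[:i] + [1 - ys[i]] + ys[i + 1:]
        if loo_accuracy(xs, flipped) > base:
            flagged.append(i)
    return flagged

# Hypothetical toy data: sample 3 sits in the class-0 cluster but is labeled 1.
xs = [[1.0, 1.1], [0.9, 1.0], [1.2, 0.8], [1.1, 1.0],
      [3.0, 3.1], [2.9, 3.0], [3.1, 2.8], [3.0, 3.2]]
ys = [0, 0, 0, 1, 1, 1, 1, 1]
print(suspects(xs, ys))
```

A flagged sample is only a candidate for review; whether the label is actually wrong has to be confirmed against the clinical record.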
B. MDQC: Our Approach to Microarray Quality Control [FZN+07]
  • collapses all values in QC reports into measures to assess the quality of the array, the sample, and the RNA
  • measures the distances of each array to an “average” array in the study, adjusting for covariances
  • accounts for interrelation among measures to identify outlying arrays that are not evident from inspection of each one in isolation
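
A distance to an "average" array that adjusts for covariances is a Mahalanobis distance, which can be sketched for two quality measures as follows. The sketch uses the plain sample mean and covariance for brevity (a robust estimator may be preferable, since outliers distort both), and the function name and the two unnamed QC measures in the toy data are assumptions.

```python
def mahalanobis_2d(rows):
    """Mahalanobis distance of each (x, y) row to the sample mean.

    Uses the sample covariance matrix [[sxx, sxy], [sxy, syy]] and its
    closed-form 2x2 inverse.
    """
    n = len(rows)
    mx = sum(x for x, _ in rows) / n
    my = sum(y for _, y in rows) / n
    sxx = sum((x - mx) ** 2 for x, _ in rows) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in rows) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in rows) / (n - 1)
    det = sxx * syy - sxy ** 2
    distances = []
    for x, y in rows:
        dx, dy = x - mx, y - my
        d2 = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
        distances.append(d2 ** 0.5)
    return distances

# Hypothetical QC report values: two measures that normally rise together.
# The last array is within the observed range of each measure on its own,
# but it breaks the relationship between them.
qc = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.9),
      (5.0, 5.1), (6.0, 6.0), (3.0, 5.0)]
distances = mahalanobis_2d(qc)
worst = distances.index(max(distances))
print(worst, round(distances[worst], 2))
```

The toy data show why per-measure thresholds can miss such an array: neither of its two values is extreme, yet its distance is the largest in the set.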
[Figure: three panels, Sample Quality, Chip Quality and RNA Quality, each plotted against Sample (0 to 200); arrays labelled in the plots include 13-2, 13-3, 13-4, 13-5, 13-6, 17-6, 19-1, 21-4, 25-5, 302-7, 317-10 and 320-1]

B. MDQC Advantages
  • Performs a multidimensional analysis and does not require absolute thresholds (which are often arbitrary)
  • Easy to implement and visualize, and computationally inexpensive (as compared with Affy PLM)
  • Can suggest potential sources of problems and possible batch effects
Possible Batch Effects

[Figure: plot annotated with “Some of the samples from batch 9” and “Batches 1, 2, 3 and 4”]

C. DNA Copy Number Analysis with CGH arrays [SNLM06,07]
  • Segments of DNA that get duplicated (gains) or deleted (losses)
  • Chromosomal aberrations are being used to form signatures
    • Chemotherapy selection for NSCLC
    • Staging in cancer (e.g., lung and oral)
Computational Challenges
  • Noisy signals
  • Spatial dependence between adjacent clones
  • Outliers
    • Systematic errors
    • Copy number polymorphisms
C. Our Approach
  • Use a Hidden Markov Model (HMM) to capture spatial dependency between clones
  • Use a Gaussian mixture model to model the outliers separately from the inliers
    • Outliers have no spatial dependence
  • Use prior knowledge about locations of CNPs to ‘inform’ the model about possible locations of outliers
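
The combination of an HMM over clones with a separate outlier component can be sketched as Viterbi decoding with a two-part emission density. This is a simplified illustration under stated assumptions, not the model from the cited papers: the three states, their means, the variances, the transition probability and the outlier probabilities are fixed by hand rather than learned, and all names are hypothetical. The optional `outlier_probs` argument is one possible way to raise the outlier probability at clones that overlap known CNPs.

```python
import math

STATES = ("loss", "neutral", "gain")
MEANS = {"loss": -0.5, "neutral": 0.0, "gain": 0.5}  # assumed log2-ratio means

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def emission(x, state, outlier_prob, sigma=0.15, outlier_sigma=1.0):
    """Two-component mixture: a state-specific inlier density plus a broad
    outlier density that is the same for every state, so an outlier carries
    no information about the underlying copy-number state."""
    inlier = normal_pdf(x, MEANS[state], sigma)
    outlier = normal_pdf(x, 0.0, outlier_sigma)
    return (1.0 - outlier_prob) * inlier + outlier_prob * outlier

def viterbi(ratios, stay=0.95, outlier_probs=None):
    """Most likely state path; `stay` is the self-transition probability."""
    if outlier_probs is None:
        outlier_probs = [0.05] * len(ratios)
    switch = (1.0 - stay) / (len(STATES) - 1)

    def log_trans(a, b):
        return math.log(stay if a == b else switch)

    # Uniform initial distribution (a constant, so it is left out of the scores).
    score = {s: math.log(emission(ratios[0], s, outlier_probs[0])) for s in STATES}
    back = []
    for x, eps in zip(ratios[1:], outlier_probs[1:]):
        new_score, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: score[p] + log_trans(p, s))
            new_score[s] = score[prev] + log_trans(prev, s) + math.log(emission(x, s, eps))
            pointers[s] = prev
        back.append(pointers)
        score = new_score
    state = max(STATES, key=score.get)
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]

# Hypothetical log2 ratios: a neutral region with one outlying clone (1.4),
# followed by a four-clone gain and a return to neutral.
ratios = [0.02, -0.05, 0.03, 1.4, 0.0, 0.52, 0.48, 0.55, 0.47, -0.03, 0.04]
print(viterbi(ratios))
```

Because the outlier density is shared across states, an isolated extreme clone is explained as an outlier and the decoded path follows its neighbours, whereas a run of shifted clones is decoded as a genuine state change.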
D. From Peptides to Proteins

[Figure: peptides p1 to p12 (including p1’) mapped to proteins A to H, illustrating four cases: Distinct, Indistinguishable, Subset, and Shared Peptide]


D. Protein Groups
  • Which protein is present in the sample?
  • Form protein groups where proteins in a group are not distinguishable

[Figure: peptides p5 to p9 mapped to proteins C, D, E and F]
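
One way to form such groups is to treat two proteins as indistinguishable when they are supported by exactly the same set of identified peptides. The sketch below applies that rule; the function name is hypothetical, and the peptide-to-protein assignments are invented for illustration because the edges of the original figure are not recoverable from the transcript.

```python
from collections import defaultdict

def protein_groups(protein_to_peptides):
    """Group proteins that are supported by an identical set of peptides."""
    groups = defaultdict(list)
    for protein, peptides in protein_to_peptides.items():
        groups[frozenset(peptides)].append(protein)
    return sorted(sorted(members) for members in groups.values())

# Hypothetical evidence: C and D share exactly the same peptides,
# while E and F overlap in p8 but are still distinguishable.
evidence = {
    "C": {"p5", "p6"},
    "D": {"p5", "p6"},
    "E": {"p7", "p8"},
    "F": {"p8", "p9"},
}
print(protein_groups(evidence))  # [['C', 'D'], ['E'], ['F']]
```

Subset and shared-peptide cases need additional rules beyond exact equality; this sketch only covers the indistinguishable case.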

D. Linking Protein Groups Across Different Sets: a Challenge

[Figure: proteins G and H in Set 1 and proteins G, H and I in Set 2, with peptides p4, p10, p11 and p12; “divide” and “merge” arrows contrast two protein groups with one protein group]

Proteins G and H may not be identified in Set 2
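
The difficulty can be made concrete with a naive baseline that links a group in one set to every group in the other set with which it shares a protein. This is only an illustration of the problem, not the linking algorithm referred to in the talk; the function name and the group memberships are assumptions.

```python
def link_groups(groups_a, groups_b):
    """Return (index_a, index_b, shared_proteins) for every pair of groups
    from two experiments that have at least one protein in common."""
    links = []
    for i, group_a in enumerate(groups_a):
        for j, group_b in enumerate(groups_b):
            shared = sorted(set(group_a) & set(group_b))
            if shared:
                links.append((i, j, shared))
    return links

# Hypothetical memberships: G and H form two groups in one set
# but fall into a single group, together with I, in the other.
set1 = [["G"], ["H"]]
set2 = [["G", "H", "I"]]
print(link_groups(set1, set2))  # [(0, 0, ['G']), (1, 0, ['H'])]
```

Both groups of the first set link to the same group of the second, so a one-to-one correspondence between groups does not exist and a merge or divide decision is needed.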

Conclusions
  • We propose an algorithm to solve the protein group linking problem (submitted)
  • We discussed some approaches to cleansing and pre-processing for clinical, mRNA, DNA and proteomics
  • In our biomarker studies, dealing with these issues has proven to be significant to the analysis