
On the Importance of Data Cleansing and Pre-processing for Genomic Studies

Raymond Ng

Computer Science, UBC

(ICapture and BC Cancer Research)


My Key Genomic Projects

  • Better biomarkers in transplantation (Genome Canada)

  • Rational chemotherapy selection for Non-Small Cell lung cancer (Genome Canada)

  • Nanosilver effects on amphibian wildlife using novel molecular assays (NSERC)

  • Frog Sentinel species comparative “omics” for the environment (Genome BC)


Overview: Better Biomarkers in Transplantation

  • Vital organ failure is a leading cause of premature death worldwide

  • Organ transplantation restores life and health to over 40,000 patients per year

  • The post-transplant course is not clear sailing:

    • Immunosuppressive drugs cause infection, cancer, diabetes, heart disease, and kidney failure

    • Transplant failure and treatment complications consume enormous health care resources


BiT Overall Objectives

  • Identify effective and widely applicable markers that…

    • predict rejection or immune accommodation of solid organ transplants

    • diagnose acute and chronic rejection

    • forecast the response to therapies that individual transplant recipients receive


BiT Analysis Pipeline

  • PAXgene Whole Blood

  • Affymetrix HG U133 Plus 2 Microarrays

  • Normalization and Pre-filtering

  • Univariate feature selection

  • Classifier building + Pathway analysis

  • Biomarker Panel

Feature counts along the pipeline: 54,675 probe sets (~38,500 genes), then ~15,000 probe sets, then ~200 probe sets
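The slide names the pre-filtering and univariate feature selection stages without saying which filter or test is used. The following is a minimal sketch under stated assumptions: a variance filter stands in for pre-filtering and Welch's t-statistic stands in for the univariate test. The function names, probe names, and toy values are all hypothetical.

```python
import math
import statistics

def variance_prefilter(expr, keep):
    """Keep the `keep` probe sets with the highest variance across samples (assumed filter)."""
    ranked = sorted(expr, key=lambda p: statistics.variance(expr[p]), reverse=True)
    return ranked[:keep]

def welch_t(a, b):
    """Welch's t-statistic for two groups of expression values."""
    se = math.sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
    return (statistics.mean(a) - statistics.mean(b)) / se if se > 0 else 0.0

def univariate_select(expr, labels, probes, top):
    """Rank probe sets by |t| between the two label groups and keep the top ones."""
    def score(p):
        a = [v for v, y in zip(expr[p], labels) if y == 1]
        b = [v for v, y in zip(expr[p], labels) if y == 0]
        return abs(welch_t(a, b))
    return sorted(probes, key=score, reverse=True)[:top]

# Toy data: probe set -> expression values for six samples (hypothetical numbers).
expr = {
    "probe_flat":  [5.0, 5.0, 5.1, 5.0, 5.1, 5.0],
    "probe_noisy": [2.0, 8.0, 3.0, 7.5, 2.5, 8.5],
    "probe_diff":  [9.0, 9.2, 8.9, 4.1, 4.0, 3.8],
}
labels = [1, 1, 1, 0, 0, 0]  # hypothetical two-class labels, one per sample

kept = variance_prefilter(expr, keep=2)                   # drops the nearly constant probe
selected = univariate_select(expr, labels, kept, top=1)   # keeps the most discriminative probe
print(kept, selected)
```

In a real study the same two steps would shrink tens of thousands of probe sets to a few hundred before any classifier is built.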


Complex Data Types

Common Theme: data cleansing and pre-processing


My Studies on Cleansing and Pre-processing

  • Clinical: “Detecting potential labeling errors in microarrays by data perturbation,” Bioinformatics 2006 (Malossini, Blanzieri)

  • mRNA: “MDQC: a new quality assessment method for microarrays based on quality control reports,” Bioinformatics 2007 (Cohen-Freue, Hollander et al.)

  • DNA: “Modelling Recurrent DNA Copy Number Alterations in array CGH Data,” Bioinformatics 2007 (Shah, Murphy, Lam)

  • Proteomics: “Linking Protein Groups Across Multiple Experiments,” in preparation (Cohen-Freue et al.)


A. Detecting Potential Labeling Errors [MBN06]

  • Biomedical data can be very noisy:

    • Laboratory environment could change

    • Diagnostic decisions are not completely objective

    • Different “gold-standards” are used for grading

  • Essential to check for label (e.g., grade of rejection) consistency


A. Our Approach

  • Propose a leave-one-out perturbed classification matrix:

    • Flip the label of each training sample in turn and compare the resulting classifier with the classifier trained on the original training set

  • Look for differences in the sets obtained from Support Vector Machines

    • E.g., sample A is suspected of being mislabeled if flipping A’s label increases accuracy

  • Effectiveness shown on 3 real microarray data sets with ground truths
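The label-flipping idea can be illustrated with a small sketch. It assumes binary labels (0/1) and swaps in a nearest-centroid classifier for the Support Vector Machines named on the slide, purely to stay self-contained. It is not the published algorithm, and the function names and toy points are hypothetical.

```python
def centroid(points):
    """Component-wise mean of a list of feature vectors."""
    return [sum(c) / len(points) for c in zip(*points)]

def sqdist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def predict(train_x, train_y, x):
    """Nearest-centroid classifier (a stand-in for the SVM used in the source)."""
    cents = {
        label: centroid([p for p, y in zip(train_x, train_y) if y == label])
        for label in set(train_y)
    }
    return min(cents, key=lambda label: sqdist(cents[label], x))

def loo_accuracy(xs, ys):
    """Leave-one-out accuracy measured against the labels in `ys`."""
    hits = 0
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        hits += predict(train_x, train_y, xs[i]) == ys[i]
    return hits / len(xs)

def suspects(xs, ys):
    """Flag samples whose label flip raises leave-one-out accuracy."""
    base = loo_accuracy(xs, ys)
    flagged = []
    for i in range(len(xs)):
        flipped = ys[:i] + [1 - ys[i]] + ys[i + 1:]
        if loo_accuracy(xs, flipped) > base:
            flagged.append(i)
    return flagged

# Toy data: two well-separated clusters; sample 3 sits in the first cluster
# but carries the second cluster's label (hypothetical values).
xs = [[0, 0], [0, 1], [1, 0], [1, 1], [5, 5], [5, 6], [6, 5], [6, 6]]
ys = [0, 0, 0, 1, 1, 1, 1, 1]
flagged = suspects(xs, ys)
print(flagged)
```

Flipping a correctly labeled sample makes the classifier worse, while flipping the mislabeled one makes the data consistent, which is the signal the approach looks for.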


B. MDQC: Our Approach to Microarray Quality Control [FZN+07]

  • collapses all values in QC reports into measures to assess the quality of the array, the sample, and the RNA

  • measures the distance of each array to an “average” array in the study, adjusting for covariances

  • accounts for interrelation among measures to identify outlying arrays that are not evident from inspection of each measure in isolation
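A distance to an "average" array that adjusts for covariances is a Mahalanobis-type distance. The sketch below shows the idea with the plain sample mean and covariance; the published method may well use robust estimators instead, so treat this as an assumption-laden illustration. The two QC measures and their values are hypothetical.

```python
import math

def invert(m):
    """Invert a small square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(m)
    aug = [row[:] + [float(i == j) for j in range(n)] for i, row in enumerate(m)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def mahalanobis_distances(rows):
    """Distance of each array's QC vector to the mean vector, scaled by the covariance."""
    n, p = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - mean[j] for j in range(p)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(p)]
           for i in range(p)]
    inv = invert(cov)
    dists = []
    for c in centered:
        d2 = sum(c[i] * inv[i][j] * c[j] for i in range(p) for j in range(p))
        dists.append(math.sqrt(max(d2, 0.0)))
    return dists

# Two strongly correlated QC measures per array (hypothetical values).
# The last array is within the normal range on each measure taken alone,
# but it breaks the correlation between them.
qc = [[1, 1.0], [2, 2.1], [3, 2.9], [4, 4.2], [5, 4.8], [2, 4.0]]
dists = mahalanobis_distances(qc)
worst = max(range(len(qc)), key=dists.__getitem__)
print(worst, [round(d, 2) for d in dists])
```

The toy example mirrors the third bullet: the outlying array is only visible once the relationship between the measures is taken into account.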


[Figure: three panels (Sample Quality, Chip Quality, RNA Quality), each plotted against Sample on the x-axis (0 to 200). Labeled arrays: 21-4, 13-3, 320-1, 13-4, 13-6, 13-2, 19-1, 13-5, 317-10, 17-6, 302-7, 25-5.]


B. MDQC Advantages

  • Performs a multidimensional analysis and does not require absolute thresholds (which are often arbitrary)

  • Easy to implement and visualize, and computationally inexpensive (as compared with Affy PLM)

  • Can suggest potential sources of problems and possible batch effects


Possible Batch Effects

[Figure annotations: some of the samples from batch 9; batches 1, 2, 3 and 4]


C. DNA Copy Number Analysis with CGH Arrays [SNLM06,07]

  • Segments of DNA that get duplicated (gains) or deleted (losses)

  • Chromosomal aberrations are being used to form signatures

    • Chemotherapy selection for NSCLC

    • Staging in cancer (e.g., lung and oral)


Computational Challenges

  • Noisy signals

  • Spatial dependence between adjacent clones

  • Outliers

    • Systematic errors

    • Copy number polymorphisms


C. Two State-of-the-Art Methods

[Figure panels: MERGELEVELS, Base-HMM]


C. Our Approach

  • Use a Hidden Markov Model (HMM) to capture spatial dependency between clones

  • Use a Gaussian mixture model to model the outliers separately from the inliers

    • Outliers have no spatial dependence

  • Use prior knowledge about locations of CNPs to ‘inform’ the model about possible locations of outliers
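One simple way to picture the combination of an HMM with a separate outlier model is Viterbi decoding where every state's emission is a mixture of a state-specific Gaussian (inliers) and a wide, state-independent Gaussian (outliers). This is a sketch under stated assumptions, not the authors' model. The state means, variances, mixing weight, and transition probabilities are all hypothetical, and no CNP prior is included.

```python
import math

STATES = ["loss", "neutral", "gain"]
MEANS = {"loss": -0.5, "neutral": 0.0, "gain": 0.5}  # assumed log2-ratio levels

def normal_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

def log_emission(x, state, sigma=0.15, eps=0.05, sigma_out=1.0):
    """Mixture emission: the outlier part is the same for every state,
    so an outlying clone carries no evidence about the copy number state."""
    inlier = normal_pdf(x, MEANS[state], sigma)
    outlier = normal_pdf(x, 0.0, sigma_out)
    return math.log((1 - eps) * inlier + eps * outlier)

def viterbi(obs, stay=0.98):
    """Most likely state path; `stay` encodes dependence between adjacent clones."""
    switch = (1 - stay) / (len(STATES) - 1)

    def log_t(a, b):
        return math.log(stay if a == b else switch)

    score = {s: math.log(1 / len(STATES)) + log_emission(obs[0], s) for s in STATES}
    back = []
    for x in obs[1:]:
        prev, score, ptr = score, {}, {}
        for s in STATES:
            best = max(STATES, key=lambda r: prev[r] + log_t(r, s))
            score[s] = prev[best] + log_t(best, s) + log_emission(x, s)
            ptr[s] = best
        back.append(ptr)
    path = [max(STATES, key=score.get)]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]

# Toy clone ratios: a neutral stretch with one spike (index 4), then a gained segment.
ratios = [0.0, 0.1, -0.1, 0.0, 1.5, 0.05, -0.05, 0.5, 0.6, 0.4, 0.55, 0.5]
path = viterbi(ratios)
print(path)
```

Because the spike is absorbed by the outlier component, it does not trigger a spurious state change, while the sustained shift at the end is called as a gain. Prior knowledge about CNP locations could plausibly enter by raising `eps` at known positions, though that extension is not shown here.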


Example Results

[Figure panel: LSP-HMM]



D. From Peptides to Proteins

[Figure: peptides p1 to p12 mapped to proteins A to H, illustrating four cases: Distinct, Indistinguishable, Subset, and Shared Peptide]


D. Protein Groups

[Figure: peptides p5 to p9 mapped to proteins C, D, E, and F]

  • Which protein is present in the sample?

  • Form protein groups where proteins in a group are not distinguishable
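Forming groups of indistinguishable proteins can be sketched as grouping proteins by their exact set of identified peptides. The peptide-to-protein mapping below is hypothetical and is not taken from the slide's figure.

```python
from collections import defaultdict

def protein_groups(peptides_by_protein):
    """Group proteins whose identified peptide sets are identical,
    since the peptide evidence cannot tell such proteins apart."""
    by_evidence = defaultdict(list)
    for protein, peptides in peptides_by_protein.items():
        by_evidence[frozenset(peptides)].append(protein)
    return {tuple(sorted(members)): set(peps) for peps, members in by_evidence.items()}

# Hypothetical identifications: C and D are supported by exactly the same peptides.
identified = {
    "C": {"p5", "p6"},
    "D": {"p5", "p6"},
    "E": {"p7"},
    "F": {"p8", "p9"},
}
groups = protein_groups(identified)
for members, peptides in sorted(groups.items()):
    print(members, sorted(peptides))
```

Each key of the result is one protein group; a group with more than one member means the sample evidence supports "C or D" rather than either protein individually.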


D. Linking Protein Groups Across Different Sets: a Challenge

[Figure: peptides p4, p10, p11, p12 and proteins G, H, I shown for Set 1 and Set 2; annotations: two protein groups, one protein group, divide, merge]

  • Proteins G and H may not be identified in Set 2
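The slides do not describe the linking algorithm itself. As a naive baseline only, and explicitly not the authors' method, groups from different experiments can be linked whenever they share a protein, and the connected components taken as linked clusters. The set names and group contents below are hypothetical.

```python
def link_groups(sets_of_groups):
    """Link protein groups from different experiments that share at least one protein.
    Returns clusters of (set name, group) pairs found by union-find."""
    nodes = [(name, frozenset(g)) for name, groups in sets_of_groups.items() for g in groups]
    parent = list(range(len(nodes)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(len(nodes)):
        for j in range(i + 1, len(nodes)):
            different_sets = nodes[i][0] != nodes[j][0]
            if different_sets and nodes[i][1] & nodes[j][1]:
                parent[find(i)] = find(j)

    clusters = {}
    for i, node in enumerate(nodes):
        clusters.setdefault(find(i), []).append(node)
    return list(clusters.values())

# Hypothetical case: two separate groups in one experiment, one merged group in the other.
experiments = {
    "set1": [{"G"}, {"H"}],
    "set2": [{"G", "H", "I"}],
}
linked = link_groups(experiments)
print(linked)
```

The toy case shows why the problem is hard: groups that are separate in one experiment collapse into a single cluster once the other experiment is considered, so a practical method needs a rule for when to divide or merge.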


Conclusions

  • We propose an algorithm to solve the protein group linking problem (submitted)

  • We discussed some approaches to cleansing and pre-processing for clinical, mRNA, DNA and proteomics data

  • In our biomarker studies, dealing with these issues has proven to be significant to the analysis


Thank You!

