Comparative analysis of missing value imputation
Download
1 / 25

Outline - PowerPoint PPT Presentation


  • 106 Views
  • Uploaded on

Comparative analysis of missing value imputation methods to improve clustering and interpretation of microarray experiments. Magalie Celton 1 , 2 , Alain Malpertuy 3 , Gaëlle Lelandais 1 , 4 and Alexandre G de Brevern 1 , 4 , ∗

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

PowerPoint Slideshow about ' Outline' - yvonne


An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript

Comparative analysis of missing value imputationmethods to improve clustering and interpretationof microarray experiments

Magalie Celton1,2, Alain Malpertuy3 ,Gaëlle Lelandais1,4 and Alexandre G de Brevern1,4,∗

1INSERM UMR-S 726, Equipe de Bioinformatique Génomique et Moléculaire(EBGM),

DSIMB, Université Paris Diderot - Paris 7, 2, place Jussieu, 75005,France

MEI-HUEI CHU/NCKU


Outline
Outline

  • Introduction

    • background

    • datasets & methods

    • experiments

  • Analysis

  • Discussion

  • Conclusion

MEI-HUEI CHU/NCKU


Introduction background
Introduction(background)

  • In a previous study, it shows the interest of k-Nearest Neighbour approach for restoring the missing gene expression values, and its positive impact of the gene clustering by hierarchical algorithm.

  • Since numerous replacement methods have been proposed to

    impute missing values (MVs) for microarray data. In this study, twelve

    different usable methods and their influence on the quality of gene

    clustering are evaluated.

MEI-HUEI CHU/NCKU


Introduction datasets methods
Introduction(datasets & methods)

  • datasets

MEI-HUEI CHU/NCKU


Introduction datasets methods1
Introduction(datasets & methods)

  • methods

MEI-HUEI CHU/NCKU


Introduction experiments
Introduction(experiments)

  • experiments

    • missing value imputation

    • extreme value imputation

    • clustering

MEI-HUEI CHU/NCKU


Analysis
Analysis

  • principle of the method

MEI-HUEI CHU/NCKU


Analysis missing value imputation
Analysis(missing value imputation)

  • example 1

MEI-HUEI CHU/NCKU


Analysis missing value imputation1
Analysis(missing value imputation)

  • example 2

MEI-HUEI CHU/NCKU


Analysis missing value imputation2
Analysis(missing value imputation)

  • example 3

MEI-HUEI CHU/NCKU


Analysis missing value imputation3
Analysis(missing value imputation)

  • example 4

MEI-HUEI CHU/NCKU


Analysis missing value imputation4
Analysis(missing value imputation)

  • Rank the methods in term of efficiency. Roughly, they can be identified to three groups:

    • EM_array, LSI_array, LSI_combined and LSI_adaptative

    • BPCA, Row Mean, LSI_gene and LLSI

    • kNN, SkNN and EM_gene.

MEI-HUEI CHU/NCKU


Analysis extreme value imputation
Analysis(extreme value imputation)

  • 1% of the microarray measurements with the highest absolute values.

  • example: ζ = 10% corresponds to 10% of the extreme missing values, so 0.1% of the values of the dataset.

MEI-HUEI CHU/NCKU


Analysis extreme value imputation1
Analysis(extreme value imputation)

MEI-HUEI CHU/NCKU


Analysis extreme value imputation2
Analysis(extreme value imputation)

  • all the replacement methods decrease in effectiveness for the estimate of the extreme values.

  • Performance of the methods also greatly depends on the used dataset and especially in agreement with previous observation.

MEI-HUEI CHU/NCKU


Analysis clustering
Analysis(clustering)

  • Imputations of missing values have been used both to do hierarchical clustering(with seven different algorithms) and k-means.

  • Two index for clustering results comparison

    • CPP

    • CAR

MEI-HUEI CHU/NCKU


Analysis clustering1
Analysis(clustering)

  • Hierarchical clustering

    • using the normalized Euclidean distance

    • seven different algorithms:

      • average linkage

      • complete linkage

      • median linkage

      • McQuitty

      • centroid linkage

      • single linkage

      • ward minimum variance

MEI-HUEI CHU/NCKU


Analysis clustering2
Analysis(clustering)

  • Conserved Pairs Proportion(CPP)

    • RC(reference clustering): the result of clustering with the data sets without

      missing values.

    • GC(generated clustering): the result of clustering with the data sets having

      missing values.

    • CPP: the maximal proportion of genes belonging to two clusters.

  • Clustering Agreement Ratio(CAR)

MEI-HUEI CHU/NCKU


Analysis clustering3
Analysis(clustering)

MEI-HUEI CHU/NCKU


Analysis clustering4
Analysis(clustering)

  • For every hierarchical clustering methods the CPP values are different, but the general tendencies remain the same:

    • imputation of small rate ζ of MVs has always a strong impact on the CPP values.

    • the CPP values slowly decreased with the increased of ζ.

  • common trends can be found between the quality of the imputation method and the gene cluster stability.

MEI-HUEI CHU/NCKU


Discussion
Discussion

  • EM_array is clearly the most efficient methods we tested.

  • As expected, the imputation quality is greatly affected by the rate of missing data, but surprisingly it is also related to the kind of data. BPCA is a perfect illustration.

  • The efficiency of Row_Mean (and Row_Average) is surprisingly good in regards to the simplicity of the methodology used (with the exception of Gheat dataset).

MEI-HUEI CHU/NCKU


Discussion1
Discussion

  • Even if kNN is the most popular imputation method; it is one of the less efficient, compared to other methods tested in this study. It is particularly striking when analyzing the extreme values.

MEI-HUEI CHU/NCKU


Discussion2
Discussion

  • Tuikkala and co-workers have focused interestingly on the GO term class and use k-means.

    • six different methods were tested but not the methods found the most efficient by our approach.

    • with less simulation per missing value rates and less missing value rates.

    • their conclusion about the quality of BPCA.

MEI-HUEI CHU/NCKU


Conclusion
Conclusion

  • More than 6.000.000 independent simulations have assessed the quality of 12 imputation methods on five very different biological datasets. The EM_array approach constitutes one efficient method for restoring the missing expression gene values, with a lower estimation error level. Nonetheless, the presence of MVs even at a low rate is a major factor of gene cluster instability. This study highlights the need for a systematic assessment of imputation methods .A noticeable point is the specific influence of some biological dataset.

MEI-HUEI CHU/NCKU


The end
The end

Thank you for listening!

MEI-HUEI CHU/NCKU


ad