Comparative analysis of missing value imputation
This presentation is the property of its rightful owner.
Sponsored Links
1 / 25

Outline PowerPoint PPT Presentation


  • 67 Views
  • Uploaded on
  • Presentation posted in: General

Comparative analysis of missing value imputation methods to improve clustering and interpretation of microarray experiments. Magalie Celton 1 , 2 , Alain Malpertuy 3 , Gaëlle Lelandais 1 , 4 and Alexandre G de Brevern 1 , 4 , ∗

Download Presentation

Outline

An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Presentation Transcript


Outline

Comparative analysis of missing value imputationmethods to improve clustering and interpretationof microarray experiments

Magalie Celton1,2, Alain Malpertuy3 ,Gaëlle Lelandais1,4 and Alexandre G de Brevern1,4,∗

1INSERM UMR-S 726, Equipe de Bioinformatique Génomique et Moléculaire(EBGM),

DSIMB, Université Paris Diderot - Paris 7, 2, place Jussieu, 75005,France

MEI-HUEI CHU/NCKU


Outline

Outline

  • Introduction

    • background

    • datasets & methods

    • experiments

  • Analysis

  • Discussion

  • Conclusion

MEI-HUEI CHU/NCKU


Introduction background

Introduction(background)

  • In a previous study, it shows the interest of k-Nearest Neighbour approach for restoring the missing gene expression values, and its positive impact of the gene clustering by hierarchical algorithm.

  • Since numerous replacement methods have been proposed to

    impute missing values (MVs) for microarray data. In this study, twelve

    different usable methods and their influence on the quality of gene

    clustering are evaluated.

MEI-HUEI CHU/NCKU


Introduction datasets methods

Introduction(datasets & methods)

  • datasets

MEI-HUEI CHU/NCKU


Introduction datasets methods1

Introduction(datasets & methods)

  • methods

MEI-HUEI CHU/NCKU


Introduction experiments

Introduction(experiments)

  • experiments

    • missing value imputation

    • extreme value imputation

    • clustering

MEI-HUEI CHU/NCKU


Analysis

Analysis

  • principle of the method

MEI-HUEI CHU/NCKU


Analysis missing value imputation

Analysis(missing value imputation)

  • example 1

MEI-HUEI CHU/NCKU


Analysis missing value imputation1

Analysis(missing value imputation)

  • example 2

MEI-HUEI CHU/NCKU


Analysis missing value imputation2

Analysis(missing value imputation)

  • example 3

MEI-HUEI CHU/NCKU


Analysis missing value imputation3

Analysis(missing value imputation)

  • example 4

MEI-HUEI CHU/NCKU


Analysis missing value imputation4

Analysis(missing value imputation)

  • Rank the methods in term of efficiency. Roughly, they can be identified to three groups:

    • EM_array, LSI_array, LSI_combined and LSI_adaptative

    • BPCA, Row Mean, LSI_gene and LLSI

    • kNN, SkNN and EM_gene.

MEI-HUEI CHU/NCKU


Analysis extreme value imputation

Analysis(extreme value imputation)

  • 1% of the microarray measurements with the highest absolute values.

  • example: ζ = 10% corresponds to 10% of the extreme missing values, so 0.1% of the values of the dataset.

MEI-HUEI CHU/NCKU


Analysis extreme value imputation1

Analysis(extreme value imputation)

MEI-HUEI CHU/NCKU


Analysis extreme value imputation2

Analysis(extreme value imputation)

  • all the replacement methods decrease in effectiveness for the estimate of the extreme values.

  • Performance of the methods also greatly depends on the used dataset and especially in agreement with previous observation.

MEI-HUEI CHU/NCKU


Analysis clustering

Analysis(clustering)

  • Imputations of missing values have been used both to do hierarchical clustering(with seven different algorithms) and k-means.

  • Two index for clustering results comparison

    • CPP

    • CAR

MEI-HUEI CHU/NCKU


Analysis clustering1

Analysis(clustering)

  • Hierarchical clustering

    • using the normalized Euclidean distance

    • seven different algorithms:

      • average linkage

      • complete linkage

      • median linkage

      • McQuitty

      • centroid linkage

      • single linkage

      • ward minimum variance

MEI-HUEI CHU/NCKU


Analysis clustering2

Analysis(clustering)

  • Conserved Pairs Proportion(CPP)

    • RC(reference clustering): the result of clustering with the data sets without

      missing values.

    • GC(generated clustering): the result of clustering with the data sets having

      missing values.

    • CPP: the maximal proportion of genes belonging to two clusters.

  • Clustering Agreement Ratio(CAR)

MEI-HUEI CHU/NCKU


Analysis clustering3

Analysis(clustering)

MEI-HUEI CHU/NCKU


Analysis clustering4

Analysis(clustering)

  • For every hierarchical clustering methods the CPP values are different, but the general tendencies remain the same:

    • imputation of small rate ζ of MVs has always a strong impact on the CPP values.

    • the CPP values slowly decreased with the increased of ζ.

  • common trends can be found between the quality of the imputation method and the gene cluster stability.

MEI-HUEI CHU/NCKU


Discussion

Discussion

  • EM_array is clearly the most efficient methods we tested.

  • As expected, the imputation quality is greatly affected by the rate of missing data, but surprisingly it is also related to the kind of data. BPCA is a perfect illustration.

  • The efficiency of Row_Mean (and Row_Average) is surprisingly good in regards to the simplicity of the methodology used (with the exception of Gheat dataset).

MEI-HUEI CHU/NCKU


Discussion1

Discussion

  • Even if kNN is the most popular imputation method; it is one of the less efficient, compared to other methods tested in this study. It is particularly striking when analyzing the extreme values.

MEI-HUEI CHU/NCKU


Discussion2

Discussion

  • Tuikkala and co-workers have focused interestingly on the GO term class and use k-means.

    • six different methods were tested but not the methods found the most efficient by our approach.

    • with less simulation per missing value rates and less missing value rates.

    • their conclusion about the quality of BPCA.

MEI-HUEI CHU/NCKU


Conclusion

Conclusion

  • More than 6.000.000 independent simulations have assessed the quality of 12 imputation methods on five very different biological datasets. The EM_array approach constitutes one efficient method for restoring the missing expression gene values, with a lower estimation error level. Nonetheless, the presence of MVs even at a low rate is a major factor of gene cluster instability. This study highlights the need for a systematic assessment of imputation methods .A noticeable point is the specific influence of some biological dataset.

MEI-HUEI CHU/NCKU


The end

The end

Thank you for listening!

MEI-HUEI CHU/NCKU


  • Login