Multiclass classification of microarray data with repeated measurements: application to cancer

Ka Yee Yeung & Roger E Bumgarner, Genome Biology 2003, 4:R83.

Presentation Transcript


  1. Multiclass classification of microarray data with repeated measurements: application to cancer Ka Yee Yeung & Roger E Bumgarner Genome Biology 2003, 4:R83

  2. Sample Classification
     • Use gene expression measurements from microarray experiments to classify biological samples (e.g. types of tumors).
     • Goals
        • Utilize repeated measurements
        • Multiclass classification
        • Remove redundancy
        • No assumption of distribution

  3. Shrunken Centroid Classification
     • Feature selection
        • Consider features individually
        • Calculate the overall centroid and each class centroid
        • "Shrink" each class centroid toward the overall centroid by the amount Δ
        • Compare shrunken class centroids to the overall centroid
        • If significantly different, the feature is predictive for the class
        • Estimate the optimum Δ using 10-fold cross-validation
     • Classification
        • Calculate the standardized, squared difference of the sample to each shrunken class centroid for the selected features
        • Assign to the class with the nearest centroid
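The two steps on this slide can be sketched in a few lines of NumPy. This is a simplified illustration in the spirit of nearest shrunken centroids, not the authors' implementation. The function names, the use of the median pooled standard deviation as a stabilising offset `s0`, the per-class factor `m`, and the omission of class priors are all assumptions made for the sketch.

```python
import numpy as np

def fit_shrunken_centroids(X, y, delta):
    """Fit on X (n_samples x n_genes) with labels y and shrinkage amount delta."""
    classes = np.unique(y)
    n = X.shape[0]
    overall = X.mean(axis=0)  # overall centroid
    centroids = np.array([X[y == k].mean(axis=0) for k in classes])
    n_k = np.array([(y == k).sum() for k in classes])

    # Pooled within-class standard deviation of each gene.
    ss = sum(((X[y == k] - centroids[j]) ** 2).sum(axis=0)
             for j, k in enumerate(classes))
    s = np.sqrt(ss / (n - len(classes)))
    s0 = np.median(s)  # assumed offset that guards against tiny variances

    # Standardized difference between each class centroid and the overall one.
    m = np.sqrt(1.0 / n_k + 1.0 / n)
    scale = m[:, None] * (s + s0)[None, :]
    d = (centroids - overall) / scale

    # Soft thresholding: move every difference toward zero by delta.
    d_shrunk = np.sign(d) * np.maximum(np.abs(d) - delta, 0.0)
    return {
        "classes": classes,
        "shrunken": overall + scale * d_shrunk,
        "gene_scale": s + s0,
        # A gene is kept if at least one class centroid survives shrinkage.
        "selected": np.any(d_shrunk != 0.0, axis=0),
    }

def predict(model, X_new):
    """Assign each row of X_new to the class with the nearest shrunken centroid."""
    sel = model["selected"]
    Xs = X_new[:, sel]
    C = model["shrunken"][:, sel]
    w = model["gene_scale"][sel] ** 2
    # Standardized squared distance of every sample to every class centroid.
    dist = (((Xs[:, None, :] - C[None, :, :]) ** 2) / w).sum(axis=2)
    return model["classes"][dist.argmin(axis=1)]
```

In practice Δ would be chosen by fitting over a grid of candidate values and keeping the one with the lowest cross-validated error, as the slide describes. If Δ is large enough that no gene is selected, this sketch falls back to the first class.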

  4. Redundancy & Error Estimation
     • Uncorrelated Shrunken Centroid (USC)
        • Removes redundant genes
        • For each set of relevant genes
           • Compute pairwise correlations
           • Remove the less relevant gene from each pair with correlation above a given threshold
        • Use cross-validation to determine the best pair (shrinkage factor, correlation threshold)
     • Error Weighted Uncorrelated SC (EWUSC)
        • The standard deviation of the sample mean is used to down-weight the most variable genes and experiments
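The redundancy-removal step of USC can be illustrated with a small greedy filter. This is a sketch under stated assumptions rather than the published procedure: the function name, the `relevance` score (any per-gene ranking, such as the largest shrunken difference), and the greedy most-relevant-first order are hypothetical choices. It also assumes no gene is constant across samples, since the correlation would then be undefined.

```python
import numpy as np

def remove_correlated_genes(X, relevance, rho_max):
    """Return sorted column indices of X kept after dropping redundant genes.

    X         : (n_samples, n_genes) expression matrix of the relevant genes
    relevance : (n_genes,) score, higher means more relevant (assumed input)
    rho_max   : correlation threshold above which a pair counts as redundant
    """
    corr = np.corrcoef(X, rowvar=False)  # gene-by-gene Pearson correlations
    order = np.argsort(-np.asarray(relevance), kind="stable")  # most relevant first
    kept = []
    for g in order:
        # Keep a gene only if it is not too correlated with any more
        # relevant gene that has already been kept.
        if all(corr[g, h] <= rho_max for h in kept):
            kept.append(int(g))
    return sorted(kept)
```

The shrinkage amount and `rho_max` would then be tuned jointly by cross-validation, which matches the "best pair" wording on the slide.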

  5. Experiments
     • Datasets
        • Synthetic datasets, varying:
           • Biological noise level
           • Technical noise level
           • Number of repeated measurements
           • Percent of relevant genes
        • Real datasets
           • Multiple tumor dataset: 7,129 genes, 123 samples, 11 classes (types of tumors)
           • Breast cancer dataset: 25,000 genes, 97 samples, 2 classes (good or poor prognosis)
     • Evaluation criteria
        • Prediction accuracy
        • Number of relevant features selected
        • Feature stability

  6. Synthetic data results
     • Removing redundant genes (USC)
        = Similar accuracy
        + Using same or fewer genes
     • Error weighting results on synthetic datasets
        • Two types of error defined
           • Technical noise: variation over repeated measurements (λ)
              • Low (1) or high (5, 10)
              + Handled "technical noise" well (similar accuracy, fewer genes)
           • Biological noise: signal-to-noise ratio (α)
              • 20 to 1, 2 to 1, or 1 to 1
              • Accuracy was worse with increased "biological noise", despite an increased number of repeated measurements
     • Criticism
        • Noise is the same over the entire dataset; it should vary for different genes
        • Each dataset would have some high signal-to-noise genes

  7. Real Data Results
     • Removing redundant genes (USC)
        = Similar, but varying accuracy
        + Using many fewer genes
     • Error weighting on real datasets
        • Multiple tumor data
           + Improved accuracy
           + Improved feature stability
           = Using similar number of genes
        • Breast cancer data
           + Improved accuracy
           = Similar feature stability
           – Using increased number of genes
