
COMPUTATIONAL GENOME ANALYSIS PROJECT: Microarray data analysis

  1. COMPUTATIONAL GENOME ANALYSIS PROJECT: Microarray data analysis Müge Erdoğmuş Zeynep Işık

  2. DISEASE: EMPHYSEMA • Emphysema is a lung disease that belongs to the group of diseases known as chronic obstructive pulmonary disease (COPD). • The ability of the lungs to expel air is diminished in patients with emphysema. • The lungs lose their elasticity and thus become less contractile. • In emphysema, the lung tissues responsible for supporting the physical shape and function of the lungs are damaged.

  3. EMPHYSEMA • The lung tissue around the smaller airways (the bronchioles and the alveoli) is the target of the destruction. • Normally the lungs are very elastic and spongy, but not in emphysema!

  4. Causes of Emphysema • Alpha-1-antitrypsin deficiency • Cigarette smoking 1) It damages the lung tissue in various ways: the cells in the airway responsible for clearing mucus and other secretions are impaired by cigarette smoke. 2) Enhanced mucus secretion -> a rich source of food for bacteria, while the immune cells are weakened by cigarette smoke in their fight against infection. Destructive enzymes released by the immune cells -> loss of the proteins associated with elasticity.

  5. The Microarray Experiment • The gene expression dataset is composed of 30 samples retrieved from NCBI’s Gene Expression Omnibus. • The RNA transcripts used for measuring the expression signals were taken from Homo sapiens. • 18 slides -> severely emphysematous tissue removed at LVRS (lung volume reduction surgery); 12 slides -> normal or mildly emphysematous lung tissue taken from smokers with nodules suspicious for lung cancer. • More than 33,000 of the best-characterized human genes are represented in the dataset, covered by about 1,000,000 unique oligonucleotide features.

  6. Data Analysis • Read XLS Files -> Normalization -> Outlier Removal -> Check Normal Dist. -> Hypothesis Testing -> PCA / Correlation Based Feature Reduction -> Classification

  7. Data Normalization • Why do we normalize data? • To scale the data for reasonable comparisons • To map the expression values of each probe into the [0,1] range • Without disturbing the underlying distribution of the data
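
A minimal sketch of the per-probe [0,1] scaling described above, written in Python/NumPy for illustration (the original analysis presumably used MATLAB/PRTools; the function and variable names here are ours). Min-max scaling is monotone, so it does not disturb the shape of each probe's distribution across the samples.

```python
import numpy as np

def minmax_normalize(expr):
    """Scale each probe (row) of an expression matrix into the [0, 1] range.

    expr: array of shape (n_probes, n_samples)."""
    lo = expr.min(axis=1, keepdims=True)
    hi = expr.max(axis=1, keepdims=True)
    rng = np.where(hi > lo, hi - lo, 1.0)  # guard against flat probes (division by zero)
    return (expr - lo) / rng
```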

  8. Outlier Removal • Why do we need to remove outliers? • They can distort the mean of the data -> wrong clusterings, wrong significance values • What is an outlier? • Expression values that are three or more standard deviations away from the mean are outliers • Replace outliers with the mean of the remaining expression values for the probe • Detected outliers in 5331 probes
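
A sketch of the three-standard-deviation rule described on this slide, again in Python/NumPy (names are illustrative, not the authors' code):

```python
import numpy as np

def replace_outliers(expr, n_sd=3.0):
    """For each probe (row), values n_sd or more standard deviations away from
    the probe mean are treated as outliers and replaced with the mean of the
    remaining values for that probe."""
    cleaned = expr.copy()
    for i, row in enumerate(expr):
        mu, sd = row.mean(), row.std()
        outliers = np.abs(row - mu) >= n_sd * sd
        if outliers.any() and not outliers.all():
            cleaned[i, outliers] = row[~outliers].mean()
    return cleaned
```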

  9. Clustering Before Feature Reduction • Clustering on samples using all existing features to see how bad the situation is • K-means clustering with Euclidean distance • For selection of the initial k centers • KCENTRES algorithm • selects k center objects from the distance matrix such that the distance between the most distant object and the center that is closest to that object is minimized
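
KCENTRES is a PRTools routine; a common way to approximate the same criterion (centres chosen from a distance matrix so that the most remote object stays as close as possible to its nearest centre) is a greedy farthest-first pass, sketched below in Python. This is only an approximation of that idea, not the PRTools implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

def kcentres_init(dist, k, seed=0):
    """Greedy farthest-first selection of k centre objects from a distance matrix."""
    rng = np.random.default_rng(seed)
    centres = [int(rng.integers(dist.shape[0]))]     # start from a random object
    for _ in range(k - 1):
        nearest = dist[:, centres].min(axis=1)       # distance of each object to its nearest centre
        centres.append(int(nearest.argmax()))        # the most remote object becomes a new centre
    return np.array(centres)

# Usage sketch: pass the selected objects as initial centroids to k-means
# (samples: (n_samples, n_features); dist: pairwise Euclidean distances).
# km = KMeans(n_clusters=2, init=samples[kcentres_init(dist, 2)], n_init=1).fit(samples)
```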

  10. Clustering Before Feature Reduction

  11. Clustering Before Feature Reduction • Really low clustering performance • Data set is noisy • Reduce the features so as to keep only the probes that are really significant • Differentially expressed among classes • ...can be used to differentiate between the two classes of samples

  12. Searching for Normally Distributed Probes • Why do we need to identify which probes are normally distributed and which are not? • To apply different tests of significance • Lilliefors test (95% confidence) • Tests the normality of the distribution by examining the signal intensity data for a probe • Modified version of the Kolmogorov-Smirnov test • No need to specify the parameters of the underlying distribution of the data • it approximates the underlying distribution • 14787 probes have a normal distribution • 7428 probes do not have a normal distribution
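
For reference, the Lilliefors test is available in Python via statsmodels; a sketch of the per-probe screening (variable names are illustrative):

```python
import numpy as np
from statsmodels.stats.diagnostic import lilliefors

def split_by_normality(expr, alpha=0.05):
    """Partition probes into those whose intensities look normally distributed
    (Lilliefors p-value >= alpha) and those that do not."""
    normal, non_normal = [], []
    for i, row in enumerate(expr):
        _, pval = lilliefors(row, dist='norm')
        (normal if pval >= alpha else non_normal).append(i)
    return np.array(normal), np.array(non_normal)
```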

  13. Tests of Significance • For probes that have a normal distribution • T-test or Z-test • 30 samples -> t-test (95% confidence) • For probes that do not have a normal distribution • Non-parametric test • Wilcoxon rank-sum test (95% confidence) • ... sorts all intensity values for a probe • ... gives each intensity value a rank • ... sums up the ranks of the signal values for both classes • ... compares the sums to decide whether the two samples come from the same distribution.
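
A sketch of this two-branch testing scheme with SciPy (two-sample t-test for the normally distributed probes, Wilcoxon rank-sum otherwise); the masks and index arrays are illustrative names, not the authors' code:

```python
import numpy as np
from scipy.stats import ttest_ind, ranksums

def differentially_expressed(expr, is_emphysema, normal_idx, alpha=0.05):
    """Return indices of probes whose intensities differ between the two classes.

    expr: (n_probes, n_samples); is_emphysema: boolean mask over the samples;
    normal_idx: probes that passed the Lilliefors normality test."""
    normal_set = set(int(i) for i in normal_idx)
    selected = []
    for i in range(expr.shape[0]):
        a, b = expr[i, is_emphysema], expr[i, ~is_emphysema]
        test = ttest_ind if i in normal_set else ranksums
        _, pval = test(a, b)
        if pval < alpha:
            selected.append(i)
    return np.array(selected)
```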

  14. Tests of Significance • 2339 of 22215 probes are differentially expressed • eliminated 19876 probes

  15. Clustering After Feature Reduction • Clustering on samples using only the differentially expressed features to see whether the feature reduction proved to be useful • K-means clustering with the same procedure

  16. Clustering After Feature Reduction

  17. Further Feature Reduction • 2339 features is a high number • Try to reduce number of features while preserving clustering accuracy • Two methods • Correlation based feature reduction • Principal component analysis

  18. Correlation Based Feature Reduction • Extract uncorrelated features • Cluster the uncorrelated features • K-means • SOM • Prior to clustering, find the value of k from hierarchical clustering • Different distance metrics • Different hierarchical clustering methods • From each cluster of each clustering, select certain genes • These genes will be used for classification

  19. Correlation Based Feature Reduction • Extract uncorrelated features • Find the correlation matrix • Keep one of the highly correlated features and remove the others • Highly correlated = correlation > 85% • Keep the feature that is closest to all other features • 1689 probes are left in our feature set
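
One way to realize this pruning in Python is a greedy pass over the probe-probe correlation matrix, visiting the "most central" probes first so that the feature closest to all the others in a correlated group is the one kept. This is a sketch of the idea; the slides do not spell out the exact procedure.

```python
import numpy as np

def drop_correlated(expr, threshold=0.85):
    """Greedy pruning: whenever two kept probes correlate above the threshold,
    only one survives.  Probes are visited in order of mean |correlation| to all
    others, so the most central member of each correlated group tends to be kept.
    Returns the indices of the surviving probes."""
    corr = np.abs(np.corrcoef(expr))      # probe-by-probe correlation matrix
    centrality = corr.mean(axis=1)        # closeness of each probe to all other probes
    kept = []
    for i in np.argsort(-centrality):
        if all(corr[i, j] <= threshold for j in kept):
            kept.append(int(i))
    return np.array(sorted(kept))
```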

  20. Correlation Based Feature Reduction • Prior to clustering find value of k from hierarchical clustering • Different distance metrics • Manhattan • Euclidean • Mahalanobis • Chebyshev • Correlation coefficients • Different hierarchical clustering methods • Complete linkage (max distance clustering) • Average linkage (average distance clustering)
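
The same sweep over distance metrics and linkage methods can be reproduced with SciPy's hierarchical clustering; `expr_de` below stands for the matrix of the 1689 remaining probes and is an assumed name:

```python
import matplotlib.pyplot as plt
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, dendrogram

metrics = ['cityblock', 'euclidean', 'mahalanobis', 'chebyshev', 'correlation']
methods = ['complete', 'average']          # max-distance and average-distance linkage

for metric in metrics:
    dists = pdist(expr_de, metric=metric)  # condensed probe-probe distance matrix
    for method in methods:
        tree = linkage(dists, method=method)
        dendrogram(tree, no_labels=True)
        plt.title(f'{method} linkage, {metric} distance')
        plt.show()
```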

  21. Correlation Based Feature Reduction Complete Linkage with Manhattan Distance

  22. Correlation Based Feature Reduction Average Linkage with Manhattan Distance

  23. Correlation Based Feature Reduction Complete Linkage with Euclidean Distance

  24. Correlation Based Feature Reduction Average Linkage with Euclidean Distance

  25. Correlation Based Feature Reduction Complete Linkage with Chebyshev Distance

  26. Correlation Based Feature Reduction Average Linkage with Chebyshev Distance

  27. Correlation Based Feature Reduction Complete Linkage with Correlation Coeff.

  28. Correlation Based Feature Reduction Average Linkage with Correlation Coeff.

  29. Correlation Based Feature Reduction • The complete linkage method is more successful in separating the clusters • the probes that are very similar to each other are put into the same clusters, and the ones that are very different are put into different clusters • we focused on the trees formed using complete linkage with Euclidean distance and correlation coefficients as distance metrics • From which level should we cut the trees? • Examine the trees

  30. Correlation Based Feature Reduction

  31. Correlation Based Feature Reduction

  32. Correlation Based Feature Reduction • cut from a level above the lower bound lines • have a small number of clusters • insufficient to explain the closeness of probes. • cut from a level below the upper bound lines • have a high number of clusters • forcing the clustering algorithm to divide clusters that consist of very similar samples • lower bound = 40 clusters • upper bound = 95 clusters

  33. Correlation Based Feature Reduction • To find the optimal k value • run k-means for each k value between 40 and 95 • pick the k that produces the clustering of highest quality • High quality = small intra-cluster distance • k is found to be 80 • store the clustering result produced when k = 80 • run SOM clustering with a 9x9 map and store the clustering result
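
A sketch of the k search with scikit-learn, using the mean distance to the cluster centre as the quality score named on the slide (note that this raw score tends to shrink as k grows, so in practice it is usually inspected as an elbow curve rather than blindly minimized); `expr_de` is the assumed 1689-probe matrix. A 9x9 SOM can be fitted analogously, for example with the `minisom` package.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_quality(data, labels, centers):
    """Mean distance of the probes to their own cluster centre (lower = tighter)."""
    return np.mean(np.linalg.norm(data - centers[labels], axis=1))

best = None
for k in range(40, 96):                     # cut levels suggested by the dendrograms
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(expr_de)
    score = cluster_quality(expr_de, km.labels_, km.cluster_centers_)
    if best is None or score < best[0]:
        best = (score, k, km)
```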

  34. Correlation Based Feature Reduction • How do we select the signature probes from the clusters? • select the ones that are most significant • How many probes do we select from each cluster? • select the “n” most significant probes from each cluster, where “n” depends on the quality of the cluster • Quality of a cluster = intra-cluster distance • Quality value is high -> intra-cluster similarity is low -> cluster is loose • Quality value is low -> intra-cluster similarity is high -> cluster is tight • Take more probes from clusters that are loose in order to represent those clusters better
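
A sketch of the quality-weighted selection described above; the total budget and the exact weighting are assumptions, since the slides only state that loose clusters contribute more probes. `pvalues` stands for the per-probe p-values from the significance tests.

```python
import numpy as np

def select_signature_probes(data, labels, centers, pvalues, total=150):
    """From each cluster take the most significant probes (lowest p-values),
    allocating more picks to loose clusters (large mean distance to the centre)."""
    ks = np.unique(labels)
    quality = np.array([np.linalg.norm(data[labels == k] - centers[k], axis=1).mean()
                        for k in ks])
    share = quality / quality.sum()              # loose clusters get a larger share
    picked = []
    for k, frac in zip(ks, share):
        members = np.where(labels == k)[0]
        n = max(1, int(round(frac * total)))
        picked.extend(members[np.argsort(pvalues[members])][:n])
    return np.array(sorted(set(picked)))
```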

  35. Correlation Based Feature Reduction • From the clusters formed by k-means • 144 probes • From the clusters formed by SOM • 141 probes • we formed another set of 88 probes that appear both among the probes selected from the k-means clustering and among those selected from the SOM clustering (named the common probe set)

  36. Principal Component Analysis • The set of statistically significant probes is directly given to prtools' pca function • 99%, 90% and 85% data preservation • Number of resulting principal components
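
The original analysis used PRTools' pca in MATLAB; the equivalent variance-retention idea in scikit-learn looks like the sketch below, where `X` is an assumed (30 samples x significant probes) matrix:

```python
from sklearn.decomposition import PCA

for keep in (0.99, 0.90, 0.85):
    pca = PCA(n_components=keep)           # a fraction keeps just enough components
    scores = pca.fit_transform(X)          # sample coordinates in PC space
    print(f'{keep:.0%} variance kept -> {pca.n_components_} principal components')
```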

  37. Classification • Feature sets from correlation based feature reduction that can be used by the classifiers • K-means probe set • SOM probe set • Common probe set • Algorithms • Linear classifier • Support vector machine • 1-nearest neighbor classifier • 3-nearest neighbor classifier

  38. Classification • 30 samples in our data set -> k-fold cross validation • bias caused by the random selection of samples for the training and testing sets • repeat the classification 100 times for each classifier • report the average classification error
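
A sketch of the repeated cross-validation protocol with scikit-learn stand-ins for the four classifiers (the fold count of 5 is an illustrative choice, and the "linear classifier" is approximated here by linear discriminant analysis):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

classifiers = {
    'linear': LinearDiscriminantAnalysis(),
    'svm':    SVC(kernel='linear'),
    '1-nn':   KNeighborsClassifier(n_neighbors=1),
    '3-nn':   KNeighborsClassifier(n_neighbors=3),
}

# X: (30, n_features) feature matrix, y: class labels (severe vs. mild/normal).
# Repeating the k-fold split 100 times with different shuffles averages out the
# bias from any single random train/test partition.
for name, clf in classifiers.items():
    errors = [1.0 - cross_val_score(clf, X, y,
                                    cv=StratifiedKFold(5, shuffle=True, random_state=rep)).mean()
              for rep in range(100)]
    print(f'{name}: mean classification error {np.mean(errors):.3f}')
```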

  39. Classification (k-means probe set)

  40. Classification (k-means probe set)

  41. Classification (SOM probe set)

  42. Classification (common probe set)

  43. Classification • Three sets of principal components from feature reduction with principal component analysis can be used by the classifiers • Algorithms • Linear classifier • Support vector machine • 1-nearest neighbor classifier • 3-nearest neighbor classifier

  44. Classification (PCA 99%)

  45. Classification (PCA 90%)

  46. Classification (PCA 85%)

  47. Classification • in all cases the support vector machine provides us with the best classification results

  48. Classification • the performance of the support vector machines that utilize the principal component sets is worse than that of the ones that utilize the probe sets formed by the correlation based feature reduction method • the performance of the support vector classifiers that utilize the k-means, SOM and common probe sets is more or less similar • they classify the samples with 99% accuracy on average • use the set of common probes as signature genes • the aim is to reduce the number of features without sacrificing classification performance

  49. Final Feature Reduction • 88 features is still a high number • use Fisher’s linear discriminant to further reduce the number of features • the resulting signature gene set consists of 26 probes • the classification performance even improved when the number of probes was reduced • able to classify the 30 samples with 99.7-100% accuracy on average
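
The slides do not detail how Fisher's linear discriminant was turned into a feature selector; one common reading is a per-feature Fisher score (between-class separation over within-class spread), from which the top-ranked probes are kept. A sketch under that assumption, with `X_common` as the assumed 30x88 common-probe matrix:

```python
import numpy as np

def fisher_scores(X, y):
    """Per-feature Fisher criterion: squared distance between the class means
    divided by the sum of the within-class variances (higher = more discriminative)."""
    c0, c1 = X[y == 0], X[y == 1]
    num = (c0.mean(axis=0) - c1.mean(axis=0)) ** 2
    den = c0.var(axis=0) + c1.var(axis=0) + 1e-12
    return num / den

scores = fisher_scores(X_common, y)
signature = np.argsort(-scores)[:26]       # the slides report a final set of 26 probes
```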

  50. Final classification
