
Introduction to Time-Course Gene Expression Data. STAT 675, R Guerra, April 21, 2008. Outline: the data; clustering (nonparametric, model based); a case study; a new model.

## Introduction to Time-Course Gene Expression Data


**Introduction to Time-Course Gene Expression Data**
STAT 675, R Guerra, April 21, 2008

**Outline**
• The Data
• Clustering: nonparametric, model based
• A case study
• A new model

**The Data**
• DNA microarrays: collections of microscopic DNA spots, often representing single genes, attached to a solid surface

**The Data**
• Gene expression changes over time due to environmental stimuli or changing needs of the cell
• Measuring gene expression against time leads to time-course data sets

**Time-Course Gene Expression**
• Each row represents a single gene
• Each column represents a single time point
• These data sets can be massive, covering many genes simultaneously

**Time-Course Gene Expression**
• k-means to clustering
• “in the budding yeast Saccharomyces cerevisiae clustering gene expression data groups together efficiently genes of known similar function, and we find a similar tendency in human data…” Eisen et al. (1998)

**Clustering Expression Data**
• When these data sets first became available, it was common to cluster them using non-parametric techniques such as K-means and hierarchical clustering

**Yeast Data Set**
• Spellman et al. (1998) measured mRNA levels in yeast (Saccharomyces cerevisiae)
• 18 equally spaced time points
• Of 6300 genes, nearly 800 were categorized as cell-cycle regulated
• A subset of 433 genes with no missing values is commonly used in papers detailing new time-course methods
• The original and follow-up papers clustered genes using K-means and hierarchical clustering

**Spellman et al. (1998)**
[Figure: yeast cell cycle expression data. Rows = genes, row labels = cell cycle; columns = time points, column labels = experiments.]

**Yeast Data Set (Spellman et al.)**
[Figure: K-means and hierarchical clusterings of the yeast data.]
Which method gives the “right” result?

**Non-Parametric Clustering**
• Data curves
• Apply distance metric to get distance matrix
• Cluster

**Issues with Non-Parametric Clustering**
• Technical
  • Require the number of clusters to be chosen a priori
  • Do not take into account the time-ordering of the data
  • Hard to incorporate covariate data, e.g., gene ontology
• The yeast analysis chose the number of clusters based on the number of cell cycle groups, with no statistical validation showing that these were the best clustering assignments

**Model-Based Clustering**
• In response to the limitations of nonparametric methods, model-based methods were proposed
  • Time series
  • Spline methods
  • Hidden Markov models
  • Bayesian clustering models
• Little consensus over which method is “best” for clustering time-course data

**K-Means Clustering**
Relocation method: the number of clusters is pre-determined and curves can change clusters at each iteration
• Initially, data are assigned at random to k clusters
• A centroid is computed for each cluster
• Each curve is reassigned to the cluster whose centroid is closest to it
• The algorithm repeats until there is no further change in the assignment of data to clusters
• The Hartigan rule is used to select the “optimal” number of clusters

**K-means: Hartigan Rule**
• n curves; let k1 = k groups and k2 = k + 1 groups.
• If E1 and E2 are the sums of the within-cluster sums of squares for k1 and k2 respectively, then add the extra group if: $\left(\frac{E_1}{E_2} - 1\right)(n - k_1 - 1) > 10$

**K-means: Distance Metric**
• Euclidean distance
• Pearson correlation

**K-means: Starting Chains**
• Initially, data are randomly assigned to k clusters, but this initial choice of k cluster centers can affect the final clustering
• The R implementation of K-means lets the user choose the “number of initial starting chains”; the run with the smallest sum of within-cluster sums of squares is the one given as output

**K-Means: Starting Chains**
• For j = 1 to B
  • Random assignment j
  • k clusters
  • wj = within-cluster sum of squares
• End j
• Pick the clustering with min(wj)

**Hierarchical Clustering**
• Hierarchical clustering is an addition or subtraction method.
• Initially each curve is assigned its own cluster
• The two closest clusters are joined into one branch to create a clustering tree
• The clustering tree stops growing when the algorithm terminates via a stopping rule

**Hierarchical Clustering**
• Nearest neighbor: the distance between two clusters is the minimum of all distances between pairs of curves, one from each cluster
• Furthest neighbor: the distance between two clusters is the maximum of all distances between pairs of curves, one from each cluster
• Average linkage: the distance between two clusters is the average of all distances between pairs of curves, one from each cluster

**Hierarchical Clustering**
• Normally the algorithm stops at a pre-determined number of clusters or when the distance between two clusters reaches some pre-determined threshold
• There is no universal stopping rule of thumb for finding an optimal number of clusters with this algorithm.

**Model-Based Clustering**
• Many methods use mixture models, with splines or piecewise polynomial functions approximating the curves
• Can better incorporate covariate information

**Models using Splines**
• Time-course profiles are assumed to be observations from some underlying smooth expression curve
• Each data curve is represented as the sum of:
  • A smooth population mean spline (dependent on time and cluster assignment)
  • A spline function representing individual (gene) effects
  • Gaussian measurement noise

**Model-based clustering and data transformations for gene expression data**
• Yeung et al. (2001), Bioinformatics, 17:977-987
• MCLUST software

**Validation Methods**
• L(C) is the maximized log-likelihood for the model with C clusters, m is the number of independent parameters to be estimated and n is the number of genes
• Strikes a balance between goodness-of-fit and model complexity
• The non-model-based methods have no such validation method

**Comparison of Methods**
• Ma et al. (2006)
• Smoothing Spline Clustering (SSClust)
• Simulation study
• SSClust performed better than MCLUST and the nonparametric methods
• Comparison: misclassification rates

**Functional Form of Ma et al. (2006) Simulation Cluster Centers**

**MR and OSR**
• Misclassification Rate (MR)
• Overall Success Rate (OSR)
• To calculate the OSR, the MR is computed only for the cases in which the correct number of clusters is found

**Comparison of Methods**
• From the Ma et al. (2006) paper.

**SSClust Methods Paper**
• Concluded that SSClust was the superior clustering method
• Looking at the data, the differences in scale between the four true curves are large
• Typical time-course clusters differ in location and spread, but not in scale to this extreme
• Their conclusions are based on a data set that is not representative of the type of data this clustering method would be used for

**Alternative Simulation**
Functional form for five cluster centers

**Example of SSClust Breaking Down**
Linear curves are joined while sine curves are arbitrarily split into 2 clusters

**Simulation Configuration**
• Distance metric
  • Euclidean or Pearson
• Number of curves
  • Small (100), Large (3000)
• Resolution of time points
  • 13 or 25 time points
  • Evenly spaced or unevenly spaced
• Types of underlying curves
  • Small (4), Large (8)

**Simulation Configuration**
• Distribution of curves across clusters
  • Equally distributed versus unequally distributed
• Noise level
  • Small (< 0.5*SD of the data set)
  • Large (> 0.5*SD of the data set)
• For these cases, found the misclassification rates and the percentage of times that the correct number of clusters was found

**Conclusions from Simulations**
• MCLUST performed better than SSClust and K-means in terms of misclassification rate and finding the correct number of clusters
• Clustering methods were affected by the level of noise but, in general, not by the number of curves, the number of time points or the distribution of curves across clusters

**Comparison based on Real Data**
• Applied these same clustering techniques to real data
• Different methods found different numbers of clusters for each real data set

**Simulations Based on Real Data**
• Start with real data, like the yeast data set
• Cluster the data using a given clustering method
• Perturb the original data (add noise at each point)
• Evaluate how different the new clustering is from the original clustering
• Use MR and OSR
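
The perturbation check in the last slide (cluster, add noise, re-cluster, compare with MR) can be sketched in Python. This is a minimal illustration under stated assumptions, not the procedure used in the presentation: the curves are synthetic stand-ins for the yeast data, the function names (`kmeans`, `perturb`, `misclassification_rate`) are hypothetical, K-means starts from randomly sampled curves rather than a random assignment, and the MR is taken as the smallest disagreement over all relabelings of the clusters, which is only practical for small k.

```python
import itertools
import math
import random


def kmeans(curves, k, n_starts=10, rng=None, max_iter=100):
    """K-means with several random starts; returns (labels, within-cluster SS)."""
    rng = rng or random.Random(0)
    best_labels, best_wss = None, math.inf
    for _start in range(n_starts):
        # Assumption: initial centers are k randomly sampled curves.
        centers = [list(c) for c in rng.sample(curves, k)]
        labels = None
        for _step in range(max_iter):
            new_labels = [
                min(range(k), key=lambda j: math.dist(x, centers[j]))
                for x in curves
            ]
            if new_labels == labels:  # no curve changed cluster: converged
                break
            labels = new_labels
            for j in range(k):
                members = [x for x, lab in zip(curves, labels) if lab == j]
                if members:  # keep the old center if a cluster empties
                    centers[j] = [sum(col) / len(members) for col in zip(*members)]
        wss = sum(math.dist(x, centers[lab]) ** 2 for x, lab in zip(curves, labels))
        if wss < best_wss:  # keep the start with the smallest within-cluster SS
            best_labels, best_wss = labels, wss
    return best_labels, best_wss


def perturb(curves, sd, rng):
    """Add independent Gaussian noise with standard deviation sd at each point."""
    return [[v + rng.gauss(0.0, sd) for v in x] for x in curves]


def misclassification_rate(reference, labels, k):
    """Smallest fraction of disagreeing labels over all relabelings of `labels`."""
    best = len(reference)
    for perm in itertools.permutations(range(k)):
        errors = sum(perm[lab] != ref for ref, lab in zip(reference, labels))
        best = min(best, errors)
    return best / len(reference)


# Hypothetical stand-in data: 40 curves at 18 time points, two underlying shapes.
rng = random.Random(1)
times = [t / 17 for t in range(18)]
shapes = [lambda t: math.sin(2 * math.pi * t), lambda t: 2 * t - 1]
truth = [i % 2 for i in range(40)]
curves = [[shapes[g](t) + rng.gauss(0, 0.1) for t in times] for g in truth]

original, _ = kmeans(curves, k=2, rng=rng)      # cluster the original data
noisy = perturb(curves, sd=0.1, rng=rng)        # add noise at each point
perturbed, _ = kmeans(noisy, k=2, rng=rng)      # re-cluster the perturbed data
mr = misclassification_rate(original, perturbed, k=2)
print(f"MR between original and perturbed clusterings: {mr:.3f}")
```

Repeating the perturb and re-cluster steps many times, at several noise levels, would give a distribution of MR values for each clustering method rather than a single number.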
