
4. Gene Expression Data Analysis


EECS 600: Systems Biology & Bioinformatics

Instructor: Mehmet Koyuturk

- Clustering
- How are genes related in terms of their expression under different conditions?

- Differential gene expression
- Which genes are affected by change in condition, tissue, disease?

- Classification (supervised analysis)
- Given the expression profile of a gene, can we assign a function to it?
- Given the expression levels of several genes in a sample, can we characterize the type of sample (e.g., cancerous or normal)?

- Regulatory network inference
- How do genes regulate each other's expression to orchestrate cellular function?

- Group similar items together
- Clustering genes based on their expression profiles
- We can measure the expression of multiple genes in multiple samples
- Genes that are functionally related should have similar expression profiles

- Gene expression profile
- A vector (or a point) in multi-dimensional space, where each dimension corresponds to a sample
- Clustering of multi-dimensional real-valued data is a well-studied problem

Expression levels of 2,000 genes in 22 normal and 40 tumor colon tissues (Alon et al., PNAS, 1999)

- Functional annotation
- If a gene with unknown function is clustered together with genes that perform a particular function, then that gene is likely to be associated with that function

- Identification of regulatory motifs
- If a group of genes is co-regulated, their regulation is likely modulated by similar transcription factors. By looking for common elements in the neighborhood of the coding sequences of genes in a cluster, we can identify regulatory motifs and their locations (promoters)

- Modular analysis

Gene expression matrix: m genes (rows) by n samples (columns)

- Generally, m >> n
- $m = O(10^3)$
- $n = O(10^1)$

- Each row is an n-dimensional vector
- Expression profile

- How do we decide which genes are similar to each other?
- Euclidean distance
- Manhattan distance

- Minkowski distance
- General version of Euclidean, Manhattan, etc.
- p is a parameter
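
As a minimal sketch of how the parameter p selects the distance, the function below computes the Minkowski distance between two expression profiles; p = 1 gives the Manhattan distance and p = 2 the Euclidean distance. The function name and the toy profiles are illustrative assumptions, not taken from the slides.

```python
def minkowski_distance(x, y, p=2.0):
    """Minkowski distance of order p between two equal-length profiles."""
    if len(x) != len(y):
        raise ValueError("profiles must cover the same samples")
    if p < 1:
        raise ValueError("p must be at least 1 to obtain a metric")
    # (sum_k |x_k - y_k|^p)^(1/p)
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


# Two made-up expression profiles measured over three samples.
gene_a = [1.0, 2.0, 3.0]
gene_b = [4.0, 6.0, 3.0]

manhattan = minkowski_distance(gene_a, gene_b, p=1)  # 3 + 4 + 0
euclidean = minkowski_distance(gene_a, gene_b, p=2)  # sqrt(9 + 16 + 0)
```

Larger values of p give more weight to the single sample in which the two profiles differ most.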

- If we want to measure the distance between directions rather than absolute magnitudes, it may be necessary to standardize the mean and variance of expression levels for each gene

- The similarity between the variation of two random variables
- A vector is treated as a set of samples drawn from a random variable
- Covariance

- Pearson correlation coefficient
- Pearson correlation is equal to the cosine of the angle between (equivalently, the inner product of) normalized expression profiles, i.e., profiles centered to zero mean and scaled to unit norm
- Pearson correlation is normalized
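
The equivalence between Pearson correlation and the cosine of normalized profiles can be checked numerically. This is a small sketch with made-up profiles; "normalized" is taken here to mean centered to zero mean and scaled to unit Euclidean norm, which is an assumption about the slide's intent.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)  # undefined for constant profiles


def normalize(x):
    """Center a profile to zero mean and scale it to unit norm."""
    mean = sum(x) / len(x)
    centered = [a - mean for a in x]
    norm = math.sqrt(sum(c * c for c in centered))
    return [c / norm for c in centered]


def inner_product(u, v):
    return sum(a * b for a, b in zip(u, v))


# Made-up profiles over four samples.
profile_x = [1.0, 2.0, 3.0, 4.0]
profile_y = [2.0, 4.0, 6.0, 9.0]

r = pearson(profile_x, profile_y)
cosine = inner_product(normalize(profile_x), normalize(profile_y))
```

Because both vectors have unit norm after normalization, their inner product is exactly the cosine of the angle between them.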

- Euclidean distance (normalized) and Pearson correlation coefficient are closely related
- These are the two most commonly used proximity measures in gene expression data analysis
- Without loss of generality, we will use to denote the distance between two expression profiles
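
The relationship between normalized Euclidean distance and Pearson correlation can be made explicit with a short derivation. As an assumption for this sketch, each profile is standardized to zero mean and unit (population) variance, giving $\tilde{x}$ and $\tilde{y}$ with $\sum_k \tilde{x}_k^2 = \sum_k \tilde{y}_k^2 = n$, so that $r(x, y) = \frac{1}{n}\sum_k \tilde{x}_k \tilde{y}_k$:

```latex
\begin{aligned}
\|\tilde{x} - \tilde{y}\|^2
  &= \sum_{k=1}^{n} \tilde{x}_k^2 + \sum_{k=1}^{n} \tilde{y}_k^2
     - 2 \sum_{k=1}^{n} \tilde{x}_k \tilde{y}_k \\
  &= n + n - 2n\, r(x, y) \\
  &= 2n \bigl(1 - r(x, y)\bigr)
\end{aligned}
```

Under this normalization, a small distance corresponds to a correlation close to 1, so ranking gene pairs by either measure gives the same order.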

- Pearson is vulnerable to outliers
- If two genes have very high expression in a single sample, that sample may dominate the correlation and make the two expression profiles appear highly correlated
- Jackknife correlation: Compute n correlations, each with one dimension (sample) left out, and take the minimum among them

- Pearson is not robust for non-Gaussian distributions
- Spearman’s rank order correlation coefficient: Rank expression levels, replace each expression level with its rank
- More robust against outliers
- Considerable loss of information
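
The two robust alternatives can be sketched as follows. The profiles are made up so that a single outlier sample drives the plain Pearson correlation; the helper names are illustrative, and ties in the ranking are handled by average ranks, which is one common convention rather than something the slides specify.

```python
import math


def pearson(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def jackknife_correlation(x, y):
    """Minimum Pearson correlation over all leave-one-sample-out profiles."""
    return min(
        pearson(x[:i] + x[i + 1:], y[:i] + y[i + 1:]) for i in range(len(x))
    )


def ranks(values):
    """Ranks 1..n, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        average_rank = (start + end) / 2.0 + 1.0
        for pos in range(start, end + 1):
            result[order[pos]] = average_rank
        start = end + 1
    return result


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# The last sample is an outlier in both genes; without it, y = 1.5 - x.
gene_x = [1.0, 1.2, 0.9, 1.1, 10.0]
gene_y = [0.5, 0.3, 0.6, 0.4, 9.0]

plain = pearson(gene_x, gene_y)
jackknife = jackknife_correlation(gene_x, gene_y)
rank_based = spearman(gene_x, gene_y)
```

Here the plain correlation is close to 1 only because of the shared outlier, while the jackknife value exposes the negative relationship among the remaining samples.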

- Hierarchical clustering
- Group genes into a tree (a.k.a. dendrogram), so that each branch of the tree corresponds to a cluster
- Higher branches correspond to coarser clusters

- Partitioning
- Partition genes into several groups so that similar genes will be in the same partition

- Direction of clustering
- Bottom-up (agglomerative): Start from individual genes, join them into groups until only one group is left
- Top-down (divisive): Start with one group consisting of all genes, keep partitioning groups until each group contains exactly one gene
- Agglomerative clustering is computationally less expensive
- Why?

- Hierarchical clustering methods are greedy
- Once a decision is made, it cannot be undone

- Start with m clusters: Each cluster contains one gene
- At each step, choose two clusters that are closest (or most correlated), merge them
- How do we evaluate the distance between two clusters?
- Single-linkage: If clusters contain two very close genes, then the clusters are close to each other

- Complete linkage: Two clusters are close to each other only if all genes inside them are close to each other
- Group average: Two clusters are close to each other if their centers are close to each other
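
A naive sketch of agglomerative clustering with the three linkage rules is given below. It uses Euclidean distance and recomputes all cluster distances at every step, so it is meant only to illustrate the merge rule. Note one assumption: the "average" option here averages all pairwise gene distances between two clusters, which is a common reading of group average and differs slightly from comparing cluster centers.

```python
import math


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def cluster_distance(c1, c2, profiles, linkage):
    """Distance between two clusters given as lists of gene indices."""
    pairwise = [euclidean(profiles[i], profiles[j]) for i in c1 for j in c2]
    if linkage == "single":      # closest pair of genes
        return min(pairwise)
    if linkage == "complete":    # farthest pair of genes
        return max(pairwise)
    if linkage == "average":     # mean over all gene pairs
        return sum(pairwise) / len(pairwise)
    raise ValueError("unknown linkage: " + linkage)


def agglomerative(profiles, num_clusters, linkage="average"):
    """Merge the two closest clusters until num_clusters remain."""
    clusters = [[i] for i in range(len(profiles))]
    merges = []  # (cluster a, cluster b, distance) in merge order
    while len(clusters) > num_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = cluster_distance(clusters[a], clusters[b], profiles, linkage)
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return clusters, merges


# Five made-up genes measured in two samples.
toy_profiles = [[0, 0], [0, 1], [5, 5], [5, 6], [10, 0]]
final_clusters, merge_history = agglomerative(toy_profiles, 2, linkage="single")
```

Running the loop down to a single cluster and keeping `merge_history` yields the information needed to draw the dendrogram.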

- Recursive bipartitioning
- Find an “optimal” partitioning of the genes into two clusters
- Recursively work on each partition
- Since the number of clusters is an issue for partitioning based clustering algorithms, the magic number 2 solves a lot of problems

- May be computationally expensive
- The problem is “global”
- At every level of the tree, we have to work on all of the genes
- If tree is imbalanced, there might be as many as m levels

- With a reasonable stopping criterion, it may be considered a partition-based clustering method as well

- Find groups of genes such that genes in each group are similar to each other, while being somewhat less similar to those in other clusters
- Easily interpretable
- Especially for large datasets (as compared to hierarchical clustering)

- Clustering is “unsupervised”, so generally we do not have prior knowledge of how many clusters underlie the data
- It is very difficult to partition data into an “unknown” number of clusters
- Most algorithms assume that K (number of clusters) is known
- Try different values of K, find the one that results in best clustering
- Very expensive

- Genes do not have a single function
- Most genes might be involved in different processes, so their expression profiles might demonstrate similarities with different genes in different contexts
- Can we allow a gene to be included in more than one cluster?

- Allowing overlaps between clusters poses additional challenges
- To what extent do we allow overlaps? (We definitely don’t want to identify two identical clusters)

- Assign weights to each gene-cluster pair, showing the extent (or likelihood) of the gene belonging to the cluster
- Difficult interpretation
- Partitioning is a special case of fuzzy clustering, where the weights are restricted to binary values
- Hierarchical clustering is also “fuzzy” in some sense
- Continuous relaxation might alleviate computational complexity as well

- The most famous clustering algorithm
- Given K, find K disjoint clusters such that the total intracluster variation is minimized

Cluster mean:

Intracluster variation:

Total intracluster variation:
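
One standard way to write these three quantities is sketched below. The notation is an assumption made here: $x_i$ is the expression profile of gene i, $C_k$ is the set of genes in cluster k, and variation is measured by squared Euclidean distance.

```latex
\mu_k = \frac{1}{|C_k|} \sum_{i \in C_k} x_i
\qquad
V(C_k) = \sum_{i \in C_k} \| x_i - \mu_k \|^2
\qquad
V = \sum_{k=1}^{K} V(C_k)
```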

K-Means is an iterative algorithm that alters parameters based on each other’s values until no improvement is possible

1. Choose K expression profiles randomly, designate each of them as the center of one of the K clusters

2. Assign each gene to a cluster

2.1. Each gene is assigned to the cluster whose center is closest to its profile

3. Redetermine cluster centers

4. If any gene was moved, go back to Step 2, else stop
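
The four steps can be sketched in a few lines. This is a minimal illustration that assumes squared Euclidean distance and a seeded random initialization; the `seed` and `max_iter` parameters and the rule of keeping the old center when a cluster becomes empty are choices made here, not part of the slides.

```python
import random


def squared_distance(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))


def k_means(profiles, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    # Step 1: pick K profiles at random as the initial cluster centers.
    centers = [list(p) for p in rng.sample(profiles, k)]
    assignment = [None] * len(profiles)
    for _ in range(max_iter):
        # Step 2: assign each gene to the cluster with the closest center.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in profiles
        ]
        # Step 4: stop as soon as no gene changes cluster.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 3: recompute each center as the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(profiles, assignment) if a == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centers


# Six made-up genes forming two well-separated groups.
profiles = [
    [0.0, 0.1], [0.2, 0.0], [0.1, 0.2],
    [5.0, 5.1], [5.2, 4.9], [4.9, 5.0],
]
labels, centers = k_means(profiles, k=2)
```

The result depends on the random starting centers in general, so in practice the algorithm is often run several times and the run with the lowest total intracluster variation is kept.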

- Just like K-means, we have K clusters, but this time they are organized into a map
- Often a 2D grid
- We want to organize clusters so that similar clusters will be in proximity in the map
- A way of visualizing in low-dimensional (2D) space

- Just like K-means, each cluster is associated with a weight vector
- It was the cluster center in K-means

- Each weight vector is first initialized randomly to some gene’s expression profile

- At each step, a gene is selected at random
- The distance between the gene’s expression profile and each cluster’s weight vector is calculated, and the cluster with closest weight vector becomes the winner
- The winner’s and its neighbors’ (according to the 2D mapping) weight vectors are adjusted to represent the gene’s expression profile better
- $C_j$ is the winner cluster for gene i at time t
- $\alpha$ is a decreasing function of time, $\theta$ is the neighborhood function
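
A single training step of a self-organizing map can be sketched as follows. The update used here, w_j ← w_j + α · θ · (x_i − w_j), is the usual form of the rule; the Gaussian neighborhood function, the `radius` parameter, and the fixed value of α are assumptions made for this illustration.

```python
import math


def som_step(weights, grid, profile, alpha, radius):
    """Update all weight vectors in place for one gene; return the winner."""
    # Winner: the cluster whose weight vector is closest to the profile.
    winner = min(
        range(len(weights)),
        key=lambda j: sum((w - x) ** 2 for w, x in zip(weights[j], profile)),
    )
    for j in range(len(weights)):
        # Neighborhood function: decays with distance from the winner
        # measured on the 2D map, not in expression space.
        grid_d2 = sum((a - b) ** 2 for a, b in zip(grid[j], grid[winner]))
        theta = math.exp(-grid_d2 / (2.0 * radius ** 2))
        # Move the weight vector toward the gene's expression profile.
        weights[j] = [
            w + alpha * theta * (x - w) for w, x in zip(weights[j], profile)
        ]
    return winner


# Three clusters laid out on a 1 x 3 map, with made-up weight vectors.
weights = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]]
grid = [(0, 0), (0, 1), (0, 2)]
winner = som_step(weights, grid, profile=[1.2, 0.8], alpha=0.5, radius=1.0)
```

In a full training loop, genes are drawn at random and both α and the radius are decreased over time so that the map settles.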

- Nodes represent genes
- Weighted edges between nodes represent proximity (correlation) between genes’ expression profiles
- This is indeed a way of predicting interactions between genes

- Partition the graph into heavy subgraphs
- Maximize total weight (number of edges) inside a cluster
- Minimize total weight (number of edges) between clusters

- Heuristic algorithms
- CLICK: Recursive min-cut
- CAST: Iterative improvement one by one for each cluster

- Loss of information?

- Generating model
- Each cluster is associated with a distribution (that generates expression profiles for associated genes) specified by model parameters
- The probability that a gene belongs to a cluster is specified by hidden parameters

- Expectation Maximization (EM) algorithm
- Start with a guess of model parameters
- E-step: Compute expected values of hidden parameters based on model parameters
- M-step: Based on hidden parameters, estimate model parameters to maximize the likelihood of observing the data at hand, iterate
- K-means is a special case

- In general, we want to maximize intra-cluster similarity, while minimizing inter-cluster similarity
- Homogeneity, separation
- Based on the proximity metric

- Reference partition
- Information on “true clusters” that comes from a different source (apart from expression data)
- Molecular annotation (e.g., Gene Ontology)
- Jaccard coefficient, sensitivity, specificity

- Cluster annotation
- Processes that are significantly enriched in a cluster

- Heterogeneity (or homogeneity in reverse direction)
- How similar are the genes in one cluster?

- Separation
- How dissimilar are different clusters?

- Good clustering: low heterogeneity (high homogeneity), high separation

- Overall heterogeneity
- Overall separation
- How do these change with respect to number of clusters?
- Can we optimize these values to choose the best number of clusters?

- A statistical criterion for evaluating a model
- Penalizes model complexity (number of free parameters to be estimated)
- k is the number of free parameters in the model, which increases with the number of clusters
- RSS is the “total error” in the model

- Trade off the number of clusters against the value of the optimization function to choose the best number of clusters
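
One criterion that matches this description is the Bayesian information criterion (BIC). Assuming that is the criterion meant, and assuming Gaussian errors, a common form is shown below; N, the number of observations, is a symbol introduced for this sketch.

```latex
\mathrm{BIC} = N \ln\!\left( \frac{\mathrm{RSS}}{N} \right) + k \ln N
```

The first term decreases as more clusters reduce the total error, while the second term grows with the number of free parameters, so the number of clusters with the lowest value balances the two.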

- If there is information about “ground truth” from an independent source, we can compare our clustering to such reference partitioning
- Pairwise assessment
- Let $C_{ij} = 1$ if gene i and gene j are assigned to the same cluster by the clustering algorithm, 0 otherwise
- Let $R_{ij} = 1$ if gene i and gene j are in the same cluster according to the reference partition, 0 otherwise

- Rand index (symmetric)
- Jaccard coefficient (sparse)
- Minkowski measure (sparse)
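
The three measures can be computed from counts of gene pairs, as in the sketch below. The formulas follow the commonly used pairwise definitions, which is an assumption about what the slide showed; the label lists are made up.

```python
import math
from itertools import combinations


def pair_counts(clustering, reference):
    """Count gene pairs by whether each partition puts them together."""
    both = only_clustering = only_reference = neither = 0
    for i, j in combinations(range(len(clustering)), 2):
        same_c = clustering[i] == clustering[j]
        same_r = reference[i] == reference[j]
        if same_c and same_r:
            both += 1
        elif same_c:
            only_clustering += 1
        elif same_r:
            only_reference += 1
        else:
            neither += 1
    return both, only_clustering, only_reference, neither


def rand_index(both, only_c, only_r, neither):
    # Symmetric: agreements on "together" and on "apart" both count.
    return (both + neither) / (both + only_c + only_r + neither)


def jaccard(both, only_c, only_r, neither):
    # Ignores pairs that are apart in both partitions.
    return both / (both + only_c + only_r)


def minkowski_measure(both, only_c, only_r, neither):
    # Disagreements relative to pairs co-clustered in the reference;
    # lower is better, 0 means perfect agreement.
    return math.sqrt((only_c + only_r) / (both + only_r))


# Cluster labels for six genes (made-up).
clustering = [0, 0, 0, 1, 1, 1]
reference = [0, 0, 1, 1, 2, 2]
counts = pair_counts(clustering, reference)
```

Because most gene pairs are in different clusters under both partitions, the Rand index tends to be dominated by the "neither" count, which is why the other two measures leave it out.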

- Clustering results in groups of genes that are co-expressed (or co-regulated)
- For each group, can we tell something about the biological phenomenon that underlies our observation (their co-expression)?

- We have partial knowledge on the function of many individual genes
- Gene Ontology, COG (Clusters of Orthologous Groups), PFAM (Protein Domain Families)

- Taking a statistical approach, we can assign function to each group of genes
- A function popular in a cluster is associated with that cluster

- Ontology: Study of being (e.g., conceptualization)
- Gene Ontology is an attempt to develop a standardized library of cellular function
- Unified view of life: Processes, structures, and functions recur in diverse organisms

- Three concepts of Gene Ontology
- Biological process: A recognized series of events or molecular functions (e.g., cell cycle, development, metabolism)
- Molecular function: What does a gene’s product do? (e.g., binding, enzyme activity, receptor activity)
- Cellular component: Localization within the cell (e.g., membrane, nucleus, ubiquitin ligase complex)

- Gene Ontology is hierarchical
- A process might have subprocesses
- Seed maturation is part of seed development

- A process might be described at different levels of detail
- Seed dormancy is a(n example of) seed maturation

- Same for function and component

- Gene Ontology terms are related to each other via “is a” and “part of” relationships
- If process A is part of process B, then A is B’s child (B is A’s parent); B involves A
- If function C is a function D, then C is D’s child; C is a more detailed specification of D

- Gene Ontology is hierarchical, but the hierarchy is not a tree; it is represented by a directed acyclic graph (DAG)
- A GO term can have multiple parents (and obviously a GO term might (should?) have multiple children)

- GO-based annotation assigns GO terms to a gene
- A gene might have multiple functions, can be involved in multiple processes
- Multiple genes might be associated with the same function, multiple genes take part in a process

- True-path rule
- If a gene is annotated with a term, then it is also annotated with that term's parents (and consequently, all of its ancestors)

- How does the number of genes associated with each term change as we go down the GO DAG?
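
The true-path rule amounts to closing each gene's annotations under the ancestor relation of the DAG. The sketch below uses a toy DAG whose term names echo the seed example; the edges and the gene name are hypothetical and are not taken from the actual Gene Ontology.

```python
def ancestors(term, parents):
    """All terms reachable from `term` by following parent links."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def apply_true_path_rule(direct_annotations, parents):
    """Extend each gene's direct terms with all of their ancestors."""
    full = {}
    for gene, terms in direct_annotations.items():
        closure = set(terms)
        for term in terms:
            closure |= ancestors(term, parents)
        full[gene] = closure
    return full


# Toy DAG (hypothetical edges): a term may have more than one parent.
toy_parents = {
    "seed dormancy": ["seed maturation", "dormancy"],
    "seed maturation": ["seed development"],
}
direct = {"geneA": {"seed dormancy"}}
annotations = apply_true_path_rule(direct, toy_parents)
```

Since every annotation is propagated upward, terms near the root collect at least as many genes as any of their descendants.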

- There are |C| genes in a cluster C
- |T| genes are associated with GO term t
- |C ∩ T| genes are in C and are associated with t
- What is the association between cluster C and term t?
- If we chose a random cluster of |C| genes, how likely is it that at least this many (|C ∩ T|) of them would be associated with t?
- What is the probability of this observation?

- Statistical significance based on hypergeometric distribution

- We have n items, m of which are good
- If we choose r items from the entire set of items at random, what is the probability that at least k of them will be good?
- n is the number of genes in the organism
- m=|T|, r=|C|, k= |C ∩ T|
- The lower p is, the more likely that there is an underlying association between the term and the cluster (the term is significantly enriched in the cluster)
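
The tail probability described here is that of the hypergeometric distribution and can be computed directly with exact binomial coefficients. A minimal sketch, using the symbols n, m, r, and k from the list; the function name and the small numbers are made up for illustration (`math.comb` requires Python 3.8 or later).

```python
from math import comb


def enrichment_p_value(n, m, r, k):
    """P(at least k good items) when r of n items are drawn at random
    without replacement and m of the n items are good."""
    total = comb(n, r)
    tail = sum(
        comb(m, i) * comb(n - m, r - i)  # exactly i good items drawn
        for i in range(k, min(m, r) + 1)
    )
    return tail / total


# 10 genes in total, 4 annotated with the term, a cluster of 3 genes,
# 2 of which carry the term.
p = enrichment_p_value(n=10, m=4, r=3, k=2)
```

For genome-scale values of n the binomial coefficients become very large integers, which Python handles exactly; libraries with a dedicated hypergeometric survival function are a common alternative.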

- How specific (general) is the annotation we attach to a cluster?
- If a cluster is larger, then it might correspond to a more general process
- Some processes might be over-represented in the study set
- How do we find the best location of a cluster in GO hierarchy?

- Parent-child annotation
- Condition the probability of enrichment of a term in a cluster on the enrichment of its parent terms in the cluster
- The gene space is defined as the set of genes that are associated with t’s parents

- The p-value for a single term provides an estimate of the probability of having the observed number of genes attached to that particular term
- We have many terms; even if the likelihood of enrichment is small for any particular term, it might be very probable that some term will be enriched as much as observed in the cluster

- We have to account for all hypotheses being tested simultaneously
- Bonferroni correction: Apply union rule, add all p-values
- Which terms should we consider while correcting for multiple hypotheses for a single term?
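
A common way to apply the Bonferroni correction is to multiply each term's p-value by the number of terms tested and cap the result at 1. The sketch below assumes that every term in the dictionary counts as one tested hypothesis; the term names, p-values, and the 0.05 cutoff are made up.

```python
def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capping at 1."""
    num_tests = len(p_values)
    return {term: min(1.0, p * num_tests) for term, p in p_values.items()}


# Raw enrichment p-values for three hypothetical terms in one cluster.
raw = {"term A": 0.001, "term B": 0.02, "term C": 0.4}
adjusted = bonferroni(raw)
significant = [term for term, p in adjusted.items() if p <= 0.05]
```

The correction is conservative: the more terms are counted as tested, the harder it is for any single term to remain significant, which is why the choice of terms to include matters.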

- How well does a significantly enriched term represent a cluster?
- How many of the genes in the cluster are attached to the term?
- How many of the genes attached to the term are in the cluster?

- For term t that is significantly enriched in cluster C
- Specificity: |C ∩ T|/|C|, a.k.a. precision
- Sensitivity: |C ∩ T|/|T|, a.k.a. recall

- A particular process might be active in certain conditions
- A group of genes might be expressed (or up-regulated, suppressed, co-regulated, etc.) in only a subset of samples
- They might behave almost independently under other conditions

- Clustering is a global approach
- Each gene is a point in the space defined by all samples
- How about points that are clustered in a subspace?

- Biclustering: While clustering genes, also choose a set of dimensions (samples) that provides best clustering
- and vice versa
- a.k.a. co-clustering, subspace clustering, ...
- This is a much harder problem, because you are not only trying to find groups of points that are close to each other in multi-dimensional space, but also trying to identify a subspace in which groups are more evident

- Sample/tissue classification for diagnosis
- Samples with leukemia show specific characteristics for a subset of genes

- Identification of co-regulated genes
- Certain sets of genes exhibit coherent activations under specific conditions (while behaving more or less arbitrarily with respect to each other under other conditions)

- Functional annotation
- Biological processes, functional classes are overlapping
- Different sets of samples reveal different functional relationships

- A cluster of genes is defined with respect to a cluster of samples and vice versa
- The clusters are not necessarily exclusive or exhaustive
- A gene/condition may belong to more than one cluster
- A gene/condition may not belong to any cluster at all

- Biclusters are not “perfect”
- Noise
- Statistical inference becomes particularly important

- Given a gene expression matrix A with gene set G and sample set S, a bicluster is defined by a subset of genes I and a subset of samples J
- General idea: A bicluster is a “good” one if $A_{IJ}$, the submatrix defined by I and J, has some coherence (low variance, low rank, similar ordering of rows, etc.)
- The biclustering problem can be defined as one of finding a single bicluster in the entire gene expression matrix, or as one of extracting all biclusters (with some restriction on the relationship between biclusters)

- Just like symmetric matrices, which can be modeled as arbitrary graphs, rectangular matrices can be modeled using bipartite graphs
- With proper definition of edge weights, biclustering can be posed as the problem of finding “heavy” subgraphs

- Low-variance (constant) bicluster
- Ideal bicluster:
- Minimize bicluster variance

- Low-rank (constant row, constant column, coherent values) bicluster
- Ideal constant row:
- Ideal constant column:
- General rank-one bicluster:
- Define residue for each value:
- Minimize mean squared residue
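
The mean squared residue can be sketched as follows, assuming the widely used definition in which the residue of an entry is the entry minus its row mean, minus its column mean, plus the overall mean of the bicluster. The matrix and the function name are made up for illustration.

```python
def mean_squared_residue(matrix, rows, cols):
    """Mean squared residue of the submatrix given by rows and cols."""
    sub = [[matrix[i][j] for j in cols] for i in rows]
    num_rows, num_cols = len(rows), len(cols)
    row_means = [sum(row) / num_cols for row in sub]
    col_means = [
        sum(sub[i][j] for i in range(num_rows)) / num_rows
        for j in range(num_cols)
    ]
    overall_mean = sum(row_means) / num_rows
    total = 0.0
    for i in range(num_rows):
        for j in range(num_cols):
            # Residue: deviation from the additive row + column model.
            residue = sub[i][j] - row_means[i] - col_means[j] + overall_mean
            total += residue ** 2
    return total / (num_rows * num_cols)


# Rows 0-2 follow a perfect additive pattern; row 3 does not.
expression = [
    [1.0, 2.0, 3.0],
    [2.0, 3.0, 4.0],
    [5.0, 6.0, 7.0],
    [9.0, 0.0, 1.0],
]
coherent = mean_squared_residue(expression, rows=[0, 1, 2], cols=[0, 1, 2])
with_noise = mean_squared_residue(expression, rows=[0, 1, 2, 3], cols=[0, 1, 2])
```

A score of zero means every row is a shifted copy of every other row over the chosen columns, which covers constant, constant-row, and constant-column biclusters as special cases.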

- Not all expression levels are available for each gene/sample pair
- A solution is to replace missing values (random values, gene mean, sample mean, regression)

- Generalize the definitions of row, column, and bicluster means to handle missing values implicitly
- Occupancy threshold: A bicluster is one with an adequate number of (non-missing) values in each row and column

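- The expression of a gene in one sample may be thought of as a superposition of contributions from multiple biclusters
- Plaid model: $a_{ij} = \sum_{k} \theta_{ijk}\,\rho_{ik}\,\kappa_{jk}$
- $\theta_{ijk}$: contribution of bicluster k to the expression value of the ith gene in the jth sample
- $\rho_{ik}$ and $\kappa_{jk}$ (generally binary) specify the membership of row i and column j in the kth bicluster, respectively
- Minimize $\sum_{i,j} \bigl(a_{ij} - \sum_{k} \theta_{ijk}\,\rho_{ik}\,\kappa_{jk}\bigr)^2$
- $\theta_{ijk}$ is defined to reflect “bicluster type”: $\mu_k$, $\mu_k + \alpha_{ik}$, $\mu_k + \beta_{jk}$, $\mu_k + \alpha_{ik} + \beta_{jk}$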

- A bicluster is defined to be one with coherent ordering of the values on rows and/or columns (as compared to values themselves)
- Order-preserving submatrix (OPSM)
- A submatrix is order preserving if there is an ordering of its columns such that the sequence of values in every row is strictly increasing

- Gene expression motifs (xMOTIFs)
- The expression level of a gene is conserved across a subset of conditions if the gene is in the same “state” in each of the conditions
- An xMOTIF is a subset of genes that are simultaneously conserved across a subset of samples
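
Checking whether a given submatrix is order preserving is straightforward: if any column ordering makes every row strictly increasing, it must be the ordering obtained by sorting the first row, so only that ordering needs to be tested. A small sketch with made-up values; finding the best such submatrix is the hard part and is not attempted here.

```python
def is_order_preserving(matrix, rows, cols):
    """True if one ordering of cols makes every selected row strictly increasing."""
    if not rows:
        return True
    # The only candidate ordering is the one that sorts the first row.
    first = rows[0]
    order = sorted(cols, key=lambda j: matrix[first][j])
    return all(
        matrix[i][order[t]] < matrix[i][order[t + 1]]
        for i in rows
        for t in range(len(order) - 1)
    )


# Rows 0-2 rank the columns identically (col 0 < col 2 < col 1); row 3 does not.
values = [
    [1.0, 5.0, 3.0],
    [10.0, 40.0, 20.0],
    [0.1, 0.9, 0.5],
    [3.0, 2.0, 1.0],
]
opsm = is_order_preserving(values, rows=[0, 1, 2], cols=[0, 1, 2])
not_opsm = is_order_preserving(values, rows=[0, 3], cols=[0, 1, 2])
```

Note that the rows of an order-preserving submatrix can differ widely in magnitude; only the relative order of the values matters.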

- Quantize gene expression matrix to binary values
- SAMBA: A 1 corresponds to a significant change in the expression value
- PROXIMUS: A 1 means that the gene is “expressed” in the corresponding sample

- A bicluster is a “dense submatrix”, i.e., one with significantly more 1’s than one would expect
- Bipartite graph model: Bicliques, heavy subgraphs
- It is possible to statistically quantify the density of a submatrix
- Log-likelihood:
- p-value:

- Enumeration
- Go for it!

- Greedy algorithms
- Make a locally optimal choice at every step

- Divide and conquer
- Solve problem recursively

- Alternating iterative heuristics
- Fix one dimension, solve for other, alternate iteratively

- Model-based parameter estimation
- e.g., EM algorithm

- m rows, n columns in the matrix
- $2^m \times 2^n$ possible biclusters in total
- Not doable in realistic amounts of time
- Is it really necessary?

- Put some restriction on size of biclusters
- SAMBA models the problem as one of finding heavy subgraphs in a bipartite graph
- Key assumption is sparsity: Nodes of the bipartite graph have bounded degree
- Find K heavy bipartite subgraphs (biclusters) with bounded degree enumeration
- Refine them to optimize overlap and add/remove nodes that improve bicluster quality

- Basic idea: Refine existing biclusters by adding/removing genes/samples to improve the objective function
- Generally, quite fast
- How to choose initial biclusters?
- How to jump over bad local optima? (Global awareness, Hill-climbing)

- Optimization function: mean-squared residue
- Node deletion: Start with a large bicluster, keep removing genes/samples that contribute most to total residue
- Node addition: Start with a small bicluster, keep adding genes/samples that contribute least to total residue
- Repeat these alternately to improve global awareness
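
The node deletion phase can be sketched as a simple greedy loop: while the mean squared residue exceeds a threshold, remove the single row or column with the largest average squared residue. This is a simplified illustration; the threshold `delta`, the tie-breaking rule, and the stopping guard for one-row or one-column submatrices are assumptions, and node addition is not shown.

```python
def residue_scores(matrix, rows, cols):
    """Mean squared residue plus per-row and per-column average residues."""
    row_mean = {i: sum(matrix[i][j] for j in cols) / len(cols) for i in rows}
    col_mean = {j: sum(matrix[i][j] for i in rows) / len(rows) for j in cols}
    overall = sum(row_mean.values()) / len(rows)
    squared = {
        (i, j): (matrix[i][j] - row_mean[i] - col_mean[j] + overall) ** 2
        for i in rows
        for j in cols
    }
    msr = sum(squared.values()) / (len(rows) * len(cols))
    row_score = {i: sum(squared[i, j] for j in cols) / len(cols) for i in rows}
    col_score = {j: sum(squared[i, j] for i in rows) / len(rows) for j in cols}
    return msr, row_score, col_score


def node_deletion(matrix, delta):
    """Start from the whole matrix and delete the worst row or column
    until the mean squared residue drops to delta or below."""
    rows = list(range(len(matrix)))
    cols = list(range(len(matrix[0])))
    while len(rows) > 1 and len(cols) > 1:
        msr, row_score, col_score = residue_scores(matrix, rows, cols)
        if msr <= delta:
            break
        worst_row = max(rows, key=row_score.get)
        worst_col = max(cols, key=col_score.get)
        if row_score[worst_row] >= col_score[worst_col]:
            rows.remove(worst_row)
        else:
            cols.remove(worst_col)
    return rows, cols


# Rows 0-2 are additive shifts of each other; row 3 is incoherent.
data = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 3.0, 4.0, 5.0],
    [4.0, 5.0, 6.0, 7.0],
    [9.0, 1.0, 8.0, 2.0],
]
kept_rows, kept_cols = node_deletion(data, delta=0.01)
```

Because each deletion is chosen greedily, the result is a local optimum, which is the motivation for alternating deletion with addition.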

- If biclusters are identified one by one, we should make sure that we do not identify the same bicluster again and again
- Masking discovered biclusters: Fill bicluster with random values
- First identify disjoint biclusters, then grow them to capture overlaps

- Flexible Overlapped Biclustering (FLOC)
- Generate K initial biclusters
- Make decision from the gene/sample perspective (as compared to bicluster perspective): Choose the best (maximum gain) action for each gene

- Assume K gene clusters, L sample clusters
- Notice that this is a little counter-intuitive, we do not have well-defined biclusters, we rather have clusters of genes and samples, and each pair of gene and sample clusters defines a bicluster

- R: $m \times k$ gene clustering matrix, C: $n \times l$ sample clustering matrix
- R(i,k)=1 if gene i belongs to cluster k (actually, columns are normalized to have unit norm)

- Minimize total residue:

- We can show that
- Batch iteration
- Given R, compute
- (mxl matrix) serves as a prototype for column clusters
- For each column, find the column of that is closest to that column, update the corresponding entry of C accordingly
- Once C is fixed, repeat the same for rows to compute R from

- Converges to a local minimum of the objective function

- Recall that an order preserving submatrix (OPSM) is one such that all rows have their entries in the same order
- Growing partial models
- Fix the extremes first
- The idea: Columns with very high or low values are more informative for identifying rows that support the assumed linear order
- Start with all (1,1) partial models, i.e., only consider the preservation of the first and last elements, keep the best ones
- Expand these to obtain (2,1) models, then (2,2) until we have (s/2, s/2) models, s being the number of columns in target bicluster

- Block clustering (a.k.a. direct clustering)
- Recursive bipartitioning
- Sort rows according to their mean, choose a row such that the total variance above and below the row is minimized
- Do the same for columns
- Pick the row or column that results in minimum intra-cluster variances, split matrix into two based on that row or column
- Continue splitting recursively

- One problem is that once two rows/columns go to different biclusters, they can never come together
- Gap Statistics: Find a large number of biclusters, then recombine

- Normalize matrix on both dimensions
- Independent scaling of rows and columns
- Here, R and C are diagonal matrices that contain row and column means, respectively

- Bistochastization
- Goal: Rows will add up to a constant (or will have constant norm), columns will add up to a separate constant
- Repeat independent scaling of rows and columns until stability is reached

- The residual of the entire matrix is also normalized in the sense that both rows and columns have zero mean
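
Bistochastization can be sketched as repeated alternate scaling of rows and columns. The version below assumes a matrix with strictly positive entries and uses row and column sums (rather than norms) as the quantities to equalize; the iteration count and target constants are choices made for this illustration.

```python
def bistochastize(matrix, iterations=100):
    """Alternately rescale rows and columns until all row sums agree
    and all column sums agree (assumes strictly positive entries)."""
    a = [[float(v) for v in row] for row in matrix]
    m, n = len(a), len(a[0])
    for _ in range(iterations):
        # Scale every row so that it sums to n.
        for i in range(m):
            row_sum = sum(a[i])
            a[i] = [v * n / row_sum for v in a[i]]
        # Scale every column so that it sums to m.
        for j in range(n):
            col_sum = sum(a[i][j] for i in range(m))
            for i in range(m):
                a[i][j] *= m / col_sum
    return a


# A small made-up positive matrix: 2 genes by 3 samples.
scaled = bistochastize([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
row_sums = [sum(row) for row in scaled]
col_sums = [sum(scaled[i][j] for i in range(2)) for j in range(3)]
```

The targets n and m are chosen so that the grand total is the same whether it is summed by rows or by columns, which is what allows both constraints to hold at once.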

- Singular value decomposition
- The nonzero eigenvalues of the matrices $A^TA$ and $AA^T$ (say, $\sigma^2$) are the same
- Each $\sigma$ is called a singular value of A, and the corresponding eigenvectors of $AA^T$ and $A^TA$ are called the left and right singular vectors
- If $\sigma_1$ is the largest singular value of A, with $A^TAv_1 = \sigma_1^2 v_1$ and $AA^Tu_1 = \sigma_1^2 u_1$, then $\sigma_1 u_1 v_1^T$ is the best rank-one approximation to A, i.e., $\|A - \sigma uv^T\|_2$ is minimized by $\sigma_1$, $u_1$, and $v_1$ (over all pairs of unit-norm vectors)

- Consequently, the entries of u and v are ordered in such a way that similar rows have similar values on u, similar columns have similar values on v
- Split matrix based on u and v
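
The leading singular triplet can be approximated with power iteration, and the signs of the entries of u and v then suggest a split of rows and columns. This is a minimal sketch on a made-up rank-one matrix; power iteration and sign-based splitting are illustrative choices, and in practice a library SVD routine would normally be used.

```python
import math


def top_singular_triplet(a, iterations=200):
    """Approximate the largest singular value and its singular vectors.
    Assumes A is nonzero and the start vector is not orthogonal to v1."""
    m, n = len(a), len(a[0])
    v = [1.0 / math.sqrt(n)] * n
    sigma = 0.0
    for _ in range(iterations):
        # u = A v, normalized to unit length.
        u = [sum(a[i][j] * v[j] for j in range(n)) for i in range(m)]
        norm_u = math.sqrt(sum(x * x for x in u))
        u = [x / norm_u for x in u]
        # v = A^T u, normalized; its length converges to sigma_1.
        v = [sum(a[i][j] * u[i] for i in range(m)) for j in range(n)]
        sigma = math.sqrt(sum(x * x for x in v))
        v = [x / sigma for x in v]
    return sigma, u, v


# Rank-one matrix with two row groups and two column groups (made-up).
block = [
    [2.0, 2.0, -2.0],
    [2.0, 2.0, -2.0],
    [-2.0, -2.0, 2.0],
    [-2.0, -2.0, 2.0],
]
sigma1, u1, v1 = top_singular_triplet(block)
row_groups = [x > 0 for x in u1]   # split rows by the sign of u
col_groups = [x > 0 for x in v1]   # split columns by the sign of v
```

Rows with the same sign in u and columns with the same sign in v together outline the blocks of the matrix.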
