Computational analysis of genome-wide expression data

Paul Pavlidis

Columbia Genome Center

pp175@columbia.edu


Lecture overview

  • Microarray technology and applications

  • How the data is collected and what you get.

  • “High level” analysis methods: applied to the study of human sarcoma.

    • Supervised and unsupervised learning.

    • Feature selection.

  • Method for applying biological prior knowledge.

  • (If there is time) Further applications of the technology.


Review: gene expression

  • DNA → pre-mRNA → mRNA → protein

  • Many potential steps for regulation.

  • Many genes are differentially transcribed according to:

    • tissues

    • cell types

    • in various disease, physiological, and developmental states.

  • mRNA is easy to quantify using hybridization assays.

  • protein levels harder to measure in a high-throughput assay.


Microarrays

  • Thousands of small (20-200 µm) spots of DNA probes on a glass slide.

  • Use to measure gene expression (RNA) levels of 10,000-20,000+ genes in parallel.

    • Old way: Northern blots etc. let you measure one gene at a time.

  • Generally give only relative expression information.

    • Methods exist for getting absolute measurements (SAGE, calibrated arrays)

  • Yields a type of snapshot of the molecular state of the sample.


Applications for microarrays

  • Diagnosis: molecular ‘portraits’ of disease

    • Preventative medicine - early detection.

    • Personalized treatment.

    • Refined diagnosis and prognosis.

  • Disease/phenotype characterization: What genes are affected by condition X? (esp. complex traits)

  • Gene expression regulation network elucidation: what happens to the expression of gene Y if you knock out gene X?

  • Mutation and polymorphism detection

  • Genome analysis: gene finding/structure determination

  • Sets a model for high-throughput technologies:

    • Protein arrays

    • Post-translational modification assay arrays

    • Protein interaction arrays


Why microarrays are of interest to computational biologists

  • Can generate a lot of data very quickly (compared to what biologists usually deal with).

    • Many studies include 50 or more samples, i.e., roughly 1,000,000 data points (50 arrays × ~20,000 genes).

  • Messier than sequence data (in interesting ways)

    • Continuous rather than discrete

    • Many more sources of variability, including biological.

  • Can ask questions you can’t ask of sequence data.


Microarray technology

Oligo (one-channel)

Spotted (two-color)



Common experimental designs

  • Compare an experimental condition to a control condition (hopefully with replication).

    • mutant vs. wild type.

    • diseased vs. normal.

  • Assemble a compendium of sample/conditions or time points.

    • Stages of the cell cycle.

    • Different tumor samples.


Computation and microarrays I: Data acquisition and preprocessing

  • Scan of fluorescent-labeled array to get an initial image

  • → Spot-finding

  • → Signal and background determination

  • → Calculation of expression measure for each spot

    • (ratiometric array) → expression ratio

    • (Affymetrix array) → ‘signal’ – combine signals from multiple probes for a gene.

  • → Normalization (correct for non-biological systematic errors)

  • → Expression data matrix

  • All of these steps present statistical/analytical/computational problems.


Values for genes on one array

  • Two color arrays: ratio of “red” to “green” intensities. Usually expressed as log(R/G).

  • One-color arrays (Affymetrix): “signal” – just a relative measure of expression of the gene.

  • Either way we have a number that represents the expression level of the gene.

  • To keep things simple, for two-color arrays a common reference sample R is often used for all arrays.

    A/R B/R C/R D/R E/R etc.

  • Equivalent for one color arrays:

    A B C D E
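
As a rough illustration of the two-color value, the sketch below computes a log ratio for a few spots. The base-2 logarithm, the function name `log_ratio`, and the intensities are assumptions made for the example; the slide only says log(R/G).

```python
import math

def log_ratio(red, green):
    """Return log2(red / green) for one spot; both intensities must be positive."""
    if red <= 0 or green <= 0:
        raise ValueError("intensities must be positive after background correction")
    return math.log2(red / green)

# Hypothetical background-corrected (red, green) intensities for three spots.
spots = {"geneA": (2000.0, 500.0), "geneB": (300.0, 300.0), "geneC": (150.0, 600.0)}
ratios = {gene: log_ratio(r, g) for gene, (r, g) in spots.items()}
```

On the log scale a four-fold increase and a four-fold decrease sit symmetrically around zero, which is one reason ratios are usually reported this way.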


False color images of spotted array

  • Overlay of two scans of the slide

  • Compares the two samples

  • Green = less relative expression

  • Red = more relative expression

  • Yellow = equal expression

  • Dimmer colors = lower expression levels.


Normalizing two-color arrays

  • Due to imbalances in dye labeling, the signals for the two colors are rarely “balanced”.

  • There are many other sources of non-biological systematic error.

  • Normalization attempts to correct for this.

  • More complicated than it sounds because of multiple error sources.

  • All types of arrays have to be normalized before array-array comparisons are possible.

[Figure: array data before and after normalization]
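
One of the simplest normalization schemes is to shift all log ratios on an array so that their median is zero. The sketch below shows that idea only; it is an assumed, minimal scheme (the name `median_center` and the numbers are made up), not necessarily the method used for the data in this lecture, and it corrects only a global dye imbalance.

```python
import statistics

def median_center(log_ratios):
    """Subtract the array-wide median so the typical log ratio becomes zero."""
    center = statistics.median(log_ratios)
    return [value - center for value in log_ratios]

# Hypothetical log ratios from one array with a dye bias of about +1.
raw = [0.9, 1.1, 1.0, 3.0, -0.5]
normalized = median_center(raw)
```

The assumption behind this choice is that most genes do not change between the two samples, so the bulk of the log ratios should sit near zero.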


Combined data for multiple arrays: Expression data matrix for an experiment

  • The matrix entry at (i, j) is the expression level of gene i in experiment j.


Computation and microarrays II: Some ‘High level’ analysis methods

  • Clustering (of genes and/or samples).

  • Statistical analysis to identify ‘changed’ genes or correlate genes with experimental factors.

    • Analysis of variance

    • Regression

  • Supervised learning of sample or gene categories.

    • e.g., support vector machine, k-nearest neighbor, etc.

  • Data mining and statistical methods such as:

    • Principal components/singular value decomposition

    • Multidimensional scaling

    • Visualization

  • Prediction of gene-gene interactions (pathways/networks)


An extended example: Microarray Analysis of Human Sarcoma

Collaboration with

Memorial Sloan Kettering Cancer Center


What is sarcoma?

  • Sarcoma = “fleshy growth”

  • Tumors of connective tissues, nerves, fat, muscle, etc. (i.e., mesoderm- and neuroectoderm-derived tissues).

  • 1% of new cancer cases.

  • Many cases (30%?) are referred to MSKCC

  • Other major types of cancer:

    • carcinomas, which come from epithelial cell lineages.

    • leukemia and lymphoma (blood and lymph derived)


Nine types of sarcoma studied - as delineated by histology and (*) molecular markers

  • liposarcoma (lipo) - fat

  • dedifferentiated liposarcoma (lipodediff)

  • pleomorphic liposarcoma (lipopleo)

  • round cell liposarcoma (roundcell)*

  • fibrosarcoma (fibro) - fibroblast

  • leiomyosarcoma (leio) - smooth muscle

  • malignant fibrous histiocytoma (MFH) - pleomorphic

  • synovial sarcoma (synovial) - joint-cell like*

  • gastrointestinal stromal tumor (GIST) * - c-kit: Gleevec

  • clear cell sarcoma (clearcell) - pigmented *

    Are these types distinguishable at the level of RNA expression?


Stripped-down version of the approach

  • Affymetrix GeneChips (12,500 genes/ESTs) run on RNA from each sample.

  • Final data set has 52 samples.

  • Clustering - Which types of tumor cluster together well?

  • Feature selection + SVM - Which types are learnable? Which samples are misclassified? What genes distinguish each class?


Clustering

  • Unsupervised learning

  • Goal: identify groupings in the data.

  • Based on some notion of similarity of the profiles being clustered

    • Euclidean distance

    • Correlation

    • Manhattan distance

    • many others…

  • Algorithms:

    • Hierarchical

    • k-means

    • self-organizing maps
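
The choice of similarity measure matters as much as the choice of algorithm. The sketch below implements the three measures named above in plain Python; the function names and the two profiles are hypothetical and chosen so that the profiles have the same shape but different magnitudes.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two expression profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Sum of absolute differences between two expression profiles."""
    return sum(abs(x - y) for x, y in zip(a, b))

def pearson(a, b):
    """Pearson correlation coefficient; 1.0 means identical shape."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

profile_1 = [1.0, 2.0, 3.0, 4.0]
profile_2 = [10.0, 20.0, 30.0, 40.0]  # same shape, ten times the magnitude

d_euclid = euclidean(profile_1, profile_2)
d_manhattan = manhattan(profile_1, profile_2)
r = pearson(profile_1, profile_2)
```

Here the two distance measures report the profiles as far apart, while the correlation is essentially 1: correlation groups genes by the shape of their profile and ignores its magnitude.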


Clustering genes

  • Hypothesis: genes with related functions will cluster together (are “coexpressed”).

  • True for some classes of genes, but not generally true.

    • “Function” is too broad a term, and we don’t always expect genes with related functions to be coexpressed.

    • Seems to apply most often to ‘housekeeping’ functions.

    • Probably the most overapplied method in microarray analysis.

  • Genes which are coexpressed are potentially coregulated, but not necessarily.


Clustering samples

  • When the samples are ‘pseudoreplicates’ (i.e., samples from individual patients), clustering may reveal known or unknown classes.

  • In the context of the sarcoma data, we can see:

    • If histologically defined classes cluster on the basis of gene expression.

    • Which classes are most similar to other classes.

    • If there are previously unrecognized subtypes within a histologically defined class.


Sarcoma clustering results


The alternative to clustering: Supervised learning

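[Diagram: a training set (genes × experiments, with known class membership) goes into a learner, which produces a model. The model then serves as a predictor: given a test set of “unknowns” (genes × experiments), it outputs a predicted class for each experiment.]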


With supervised learning, can ask:

  • Which classes are recognizable as such (“learnable”).

  • When classification errors are made, are they telling us something about the label on the test sample?

    • We can identify ‘mislabeled samples’ this way.

  • Potentially allows us to make diagnoses and predictions for new samples.

  • Obviously we need to have classes defined first.


Support vector machines

[Figure: positive (+) and negative (-) examples scattered on either side of a separating plane]

Locate a plane that separates positive from negative examples.

Focus on the examples closest to the boundary.
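
To make the two statements above concrete, here is a minimal sketch of a linear soft-margin SVM trained by sub-gradient descent on the hinge loss. It is a teaching sketch under stated assumptions, not the SVM software used in the sarcoma study: the function names, learning rate, penalty, and toy points are all hypothetical, and there is no kernel.

```python
def train_linear_svm(points, labels, epochs=200, rate=0.1, penalty=0.01):
    """Fit w, b for the rule sign(w.x + b) by sub-gradient descent on hinge loss."""
    w = [0.0] * len(points[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(points, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            # The penalty term shrinks w, which favours a wide margin.
            w = [wi * (1.0 - rate * penalty) for wi in w]
            if margin < 1.0:
                # Only examples inside the margin move the plane.
                w = [wi + rate * y * xi for wi, xi in zip(w, x)]
                b += rate * y
    return w, b

def predict(w, b, x):
    """Return +1 or -1 depending on which side of the plane x falls."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1

# Hypothetical, well-separated two-gene profiles for two classes.
points = [(2.0, 2.0), (3.0, 3.0), (2.0, 3.0), (-2.0, -2.0), (-3.0, -3.0), (-2.0, -3.0)]
labels = [1, 1, 1, -1, -1, -1]
w, b = train_linear_svm(points, labels)
predictions = [predict(w, b, x) for x in points]
```

Note that examples far from the boundary never trigger the update inside the `if`; only the examples closest to the plane determine where it ends up.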


Feature selection

  • For tissue classification, each gene is a feature.

  • Most genes are not informative about the classes we are learning – they aren’t relevant.

  • Many genes are not even expressed in the tissue assayed.

  • Including too many ‘noisy’ features degrades learning performance.

  • It is common to attempt to identify features that will be most useful for learning.

  • Those features (genes) are also the ones which are most associated with the particular tumor type.


Selecting genes with a t-test

μi = mean expression value in class i

ni = number of examples in class i

v = pooled variance across both classes

Other methods exist:

Analysis of variance, t-test variants, non-parametric methods, etc.
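
The formula itself did not survive in this transcript. Assuming it was the standard two-class Student's t statistic, t = (μ1 - μ2) / sqrt(v (1/n1 + 1/n2)), the per-gene score can be sketched as follows. The function name `pooled_t` and the expression values are hypothetical.

```python
import math
import statistics

def pooled_t(class_1, class_2):
    """Student's t for one gene: difference of class means over the pooled standard error."""
    n1, n2 = len(class_1), len(class_2)
    mu1, mu2 = statistics.mean(class_1), statistics.mean(class_2)
    # Pooled variance v: both sample variances weighted by their degrees of freedom.
    v = ((n1 - 1) * statistics.variance(class_1) +
         (n2 - 1) * statistics.variance(class_2)) / (n1 + n2 - 2)
    return (mu1 - mu2) / math.sqrt(v * (1.0 / n1 + 1.0 / n2))

# Hypothetical expression values for one gene in two tumor classes.
t = pooled_t([5.0, 6.0, 7.0], [1.0, 2.0, 3.0])
```

Genes would then be ranked by the absolute value of t (or the corresponding p-value), and the top-ranked genes kept as features.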


Features selected: GIST (an easy class)

  • Light colors = higher expression

  • Student’s t-test

  • In decreasing order of p-value


Features selected: MFH (a harder class)

Welch’s t-test

Student’s t-test

Fisher’s disc.


Hold-one-out cross-validation of SVM with feature selection

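Samples 1...52

1: hold out one sample

2: select features (e.g., genes 22, 4,..,13) from the remaining samples

3: train SVM on the selected features

4: classify held-out sample, applying the same features to the test data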
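
The important detail in this procedure is that features are selected inside the loop, using only the training samples, so the held-out sample never influences which genes are chosen. The sketch below follows the four steps, but swaps the SVM for a nearest-centroid classifier to stay short; that substitution, the t-like score, all function names, and the toy data are assumptions for illustration.

```python
import statistics

def t_like_score(values, labels):
    """Absolute difference of class means scaled by the summed class spreads."""
    pos = [v for v, y in zip(values, labels) if y == 1]
    neg = [v for v, y in zip(values, labels) if y == -1]
    spread = statistics.pstdev(pos) + statistics.pstdev(neg) + 1e-9
    return abs(statistics.mean(pos) - statistics.mean(neg)) / spread

def select_features(samples, labels, k):
    """Rank genes on the training samples only and return the indices of the top k."""
    n_genes = len(samples[0])
    scores = [t_like_score([s[g] for s in samples], labels) for g in range(n_genes)]
    return sorted(range(n_genes), key=lambda g: scores[g], reverse=True)[:k]

def nearest_centroid(train, labels, test, genes):
    """Stand-in classifier (not an SVM): pick the class whose mean profile is closer."""
    def centroid(cls):
        rows = [s for s, y in zip(train, labels) if y == cls]
        return [statistics.mean(r[g] for r in rows) for g in genes]
    def dist(center):
        return sum((test[g] - c) ** 2 for g, c in zip(genes, center))
    return 1 if dist(centroid(1)) <= dist(centroid(-1)) else -1

def leave_one_out(samples, labels, k):
    """Return the fraction of held-out samples that are classified correctly."""
    correct = 0
    for i in range(len(samples)):
        train = samples[:i] + samples[i + 1:]              # 1: hold out one sample
        train_labels = labels[:i] + labels[i + 1:]
        genes = select_features(train, train_labels, k)    # 2: select features
        guess = nearest_centroid(train, train_labels, samples[i], genes)  # 3 and 4
        correct += int(guess == labels[i])
    return correct / len(samples)

# Hypothetical data: gene 0 separates the classes, gene 1 is noise.
samples = [[5.0, 0.3], [6.0, 0.9], [5.5, 0.1], [1.0, 0.8], [1.5, 0.2], [0.5, 0.6]]
labels = [1, 1, 1, -1, -1, -1]
accuracy = leave_one_out(samples, labels, k=1)
```

Selecting features once on all samples and then cross-validating would leak information from the held-out sample and give an optimistic error estimate.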


SVM results

[Figure: number of occurrences of true positives and false positives plotted against log2(number of features), shown for both the original and the “refined” classes.]


What about all those genes we selected?

  • What can we say about the genes which distinguish particular classes?

  • They are presumably telling us something about the biology of the tumors:

    • Hints about causes?

    • Drug targets?

    • Provide markers for diagnosis or targeting drugs?

  • An expert can look at the results and come up with a story about each gene, for each class.

  • Can we do this computationally?


Making use of biological prior knowledge

  • There is a huge biological knowledge base - we know a lot about many genes.

  • How can we layer this information on the gene expression data?

  • Would like to make seamless, optimal use of:

    • Gene function knowledge

    • Genetic mapping information

    • Sequence data – from other species too.

    • etc., etc., etc.


Relating expression to gene function

  • Given a microarray data set, often want to ask:

    Are there any functional commonalities among the genes which were affected?

  • Typical approach:

    • “This cluster contains a lot of ribosomal protein genes”.

    • “If you look at the citric acid cycle genes, a lot of them changed during the experiment”.

  • More efficient, structured alternative: Class scoring.


Class scoring: basic idea

Given: Expression data and functional annotations (class labels) for the genes.

Task: Find the interesting gene classes.

Solution: Give each class a score.

  • “Class” is any biologically meaningful set of genes.

  • “Semi-supervised”

  • Scores can be generated in multiple ways.


Sources of annotations

  • Gene Ontology: a controlled vocabulary for describing gene function

    • http://www.geneontology.org

    • Most of the major species genome databases are adopting it for annotation.

  • MIPS catalog of yeast genes (likely to be superseded by GO)

    • http://mips.gsf.de/

  • Both are hierarchies of terms.

  • Unfortunately, currently not all genes are annotated well (or at all).


GO example

(Browser at http://www.godatabase.org/cgi-bin/go.cgi)


Two gene class scoring methods

What makes a gene class “interesting”?

  • Similarity of the expression profiles.

    • Correlation score: Are expression profiles of the genes in the class similar? (Akin to clustering)

  • Significant effects of experimental treatments.

    • Experiment score: Do genes in the class have good group-comparison statistics (t-test, ANOVA, etc.)?


Correlation score: details

  • Get the data for one class.

  • Measure the correlations between genes in the class. (Pearson correlation coefficient)

  • Take the average of the correlations as the score for the class.

  • Calculate p-value for the class

  • Repeat steps 1-4 for many classes.

  • Classes with best p-values are ‘most interesting’.

[Diagram: data → class data → n*(n-1)/2 pairwise correlations → average → score → p-value]
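
The raw correlation score for one class can be sketched in a few lines. This is an illustrative sketch, not the lecturer's software: the function names and the three-gene class are hypothetical, and the conversion to a p-value is left out.

```python
import math
from itertools import combinations

def pearson(a, b):
    """Pearson correlation coefficient between two expression profiles."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def correlation_score(profiles):
    """Average Pearson correlation over all n*(n-1)/2 gene pairs in the class."""
    pairs = list(combinations(profiles, 2))
    return sum(pearson(a, b) for a, b in pairs) / len(pairs)

# Hypothetical class of three genes measured in four experiments.
coexpressed = [[1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0], [0.5, 1.0, 1.5, 2.0]]
score = correlation_score(coexpressed)
n_pairs = len(list(combinations(coexpressed, 2)))
```

Because these three profiles are exact multiples of one another, every pairwise correlation is 1 and so is the class score; a class of unrelated genes would average close to 0.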


Correlation score: example

  • Yeast data

  • Fermentation, class correlation 0.46, p < 10^-5

  • Morphogenesis, class correlation ~0, p ~ 1


Experiment score: details

  • Perform an analysis of variance, t-test, or other appropriate statistical test on each gene in the data set, yielding a score (p-value) for each gene.

  • Get the statistics data for a class.

  • The average of the -log p-values for the genes in the class is the score for the class.

  • Convert the raw score into a p-value.

  • Repeat 2-4 for many classes.

  • Classes with best p-values are ‘most interesting’.

[Diagram: data → test each gene → gene p-values → stats for class → average → score → p-value]
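
The raw experiment score is just an average of log-transformed per-gene p-values. The sketch below assumes base-10 logarithms (the slide does not state the base) and uses made-up p-values; the function name is hypothetical.

```python
import math

def experiment_score(p_values):
    """Average of -log10(p) over the genes in one class."""
    return sum(-math.log10(p) for p in p_values) / len(p_values)

# Hypothetical per-gene p-values from a group-comparison test.
strong_class = experiment_score([1e-5, 1e-4, 1e-6])  # genes that differ between groups
weak_class = experiment_score([0.5, 0.1, 0.9])       # genes that mostly do not
```

Taking the logarithm before averaging keeps a single very small p-value from being swamped by several unremarkable ones, while still rewarding classes where many genes respond.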


Experiment score: example

  • One-way ANOVA performed on data from three leukemia types (Data from Golub et al.)

    T-cell receptor: ave -log (p-value) = 4.6, p < 10^-5

    Transferases: ave -log (p-value) = ~1.5, p ~ 1

[Figure: expression of the genes in each of the two classes across ALL-B-cell, ALL-T-cell, and AML samples]


Converting raw scores into p-values

How likely are we to get a class with a given score by chance?

  • The distribution of scores is affected by the class size.

  • Use empirical measurement of score distributions generated by random samples of the data.
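
A minimal sketch of the empirical approach: draw many random gene sets of the same size as the class, score each one, and report how often a random set scores at least as well. The function name, the number of trials, the fixed seed, and the toy scores are assumptions for illustration.

```python
import random

def empirical_p_value(class_score, gene_scores, class_size, trials=1000, seed=0):
    """Fraction of random gene sets of the same size scoring at least as well."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        sample = rng.sample(gene_scores, class_size)
        if sum(sample) / class_size >= class_score:
            hits += 1
    # Add one to both counts so the estimate is never exactly zero.
    return (hits + 1) / (trials + 1)

# Hypothetical per-gene scores (e.g., -log p); the class holds the three best genes.
gene_scores = [0.1 * i for i in range(20)]
class_score = sum(gene_scores[-3:]) / 3
p = empirical_p_value(class_score, gene_scores, class_size=3)
```

Sampling sets of the same size is what accounts for the class-size effect: the average of a small random set varies much more than the average of a large one, so the same raw score can be unremarkable for a small class and significant for a large one.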


Correlation and experiment scores are largely complementary

Mouse brain region data (Sandberg et al.)


Correlation score method tends to select ‘housekeeping’ classes

  • Across different experimental designs, organisms, and tissues:

    • Ribosomal proteins/protein synthesis

    • Mitochondrial energy production/TCA cycle

    • RNA processing/spliceosome

    • Protein degradation/proteasome

  • Suggests that:

    • These genes are always tightly coregulated.

    • Results of correlation score are not specific to the experiment at hand – “biological noise”.


Experiment score tends to select classes that are relevant to the specific experimental situation

  • T-cell receptors, immune system for leukemia.

  • Synaptic transmission, myelination, ion channels for brain region data.

  • True for the yeast data but plenty of overlap with the correlation score.

  • This is observed in additional data sets we have analyzed: Cancer, obesity, human brain, etc.

  • Additional “unexpected” classes are also identified in many experiments.


Class scoring on the refined MFH class


Class scoring for GIST


Issues for the future

  • Using biological prior knowledge to make the most of microarray data.

  • How to combine and interrelate various genome-wide data types, including microarrays.

  • Extracting information about genetic networks from the data.

    • Need more data!

  • Can microarray data be used the way sequence data is used, or is it too messy?

    • Comparing data between labs, platforms, and organisms.

  • How will protein arrays and other developing technologies fit in?


A few resources

  • Stanford Microarray Database

    • http://genome-www5.stanford.edu/MicroArray/SMD/

  • Whitehead Institute: Cancer Genome Research

    • http://www-genome.wi.mit.edu/cancer/

  • NCBI gene expression omnibus

    • http://www.ncbi.nlm.nih.gov/geo/

  • Some software is available from my website:

    • http://rbp1sun.cpmc.columbia.edu/

  • Gene Ontology:

    • http://www.geneontology.org


Further applications and methods: examples from the literature

  • Determination of gene structure

  • Finding genetic regulatory pathways


Using arrays to determine gene structure

  • Computational approaches are currently inadequate.

    • First/last exons often incorrectly predicted.

    • Alternative splicing very difficult to detect.

  • Experimental approaches can be very effective, but the genome is very large...

  • One answer: genome-scanning and tiling arrays.

    see Shoemaker, et al., Nature 2001 (genome issue)


Genome tiling arrays


Exon assay arrays

  • Arrays contain probes for known and predicted exons.

  • Assays run on 60 different human tissues.

  • Can be used to determine the accuracy of computational predictions.

  • Can detect alternative splicing.


Finding pathways (or pieces of them)

  • Popular data set: Mutations in 300 known yeast genes (not all of known function)

  • Microarray analysis of each of the 300 strains

  • We can ask, which genes are affected by knocking out gene X?

  • One example: Pe’er et al., ISMB 2001 (using data from Hughes et al., Cell 2000)


Overview of data


Example of a subnetwork found

