Bioinformatics

Lecture 8 • Analyzing Microarray Data
Bioinformatics
Dr. Aladdin Hamwieh, Khalid Al-shamaa, Abdulqader Jighly
Aleppo University, Faculty of Technical Engineering, Department of Biotechnology
2010-2011

A microarray can monitor many genes at once. A DNA microarray is an inert, solid, flat, and transparent surface (e.g., a microscope slide) onto which 20,000 to 60,000 short DNA probes of specified sequences are tethered in an ordered layout. Each probe corresponds to a particular short section of a gene, so a single gene is covered by several probes that span different parts of the gene sequence.

Repositories of Microarray Studies
• Because microarrays are so widely used, data repositories have flourished worldwide. Large databases of gene expression include:
• The Gene Expression Omnibus (GEO), hosted by the National Center for Biotechnology Information (NCBI)
• Stanford Microarray Database (SMD)
• For plants: Plant Expression Database (PLEXdb)

DNA microarrays measure RNA abundance with either one channel (one color) or two channels (two colors).
• Affymetrix GeneChip arrays have one channel and use a single fluorescent dye, either red Cy5 or green Cy3.
• Stanford microarrays have two channels: by competitive hybridization they measure the relative expression under a given condition (labeled with the red fluorescent dye Cy5) compared to its control (labeled with the green fluorescent dye Cy3).

Workflow of a microarray study, in order:
• Biological question (differentially expressed genes, sample class prediction, etc.)
• Experimental design
• Microarray experiment
• 16-bit TIFF files
• Image analysis: (Rfg, Rbg), (Gfg, Gbg)
• Normalization: R, G
• Estimation, testing, clustering, discrimination
• Biological verification and interpretation

Microarray Experiment
• Isolate mRNA
• Make a labelled cDNA library
• Apply the labelled cDNA to the slide
• Scan the slide
• Clean up the image
• Extract the data
• Analyse the data

Results
The colors denote the degree of expression in the experimental versus the control cells. The legend distinguishes these cases:
• Gene not expressed in control or in experimental cells
• Mostly in control cells
• Only in control cells
• Mostly in experimental cells
• Only in experimental cells
• Same in both cells

Let us turn to the analysis and its mathematical problems. We now have many images that contain a huge amount of information, so we have to: 1. clean up the image; 2. extract our data.

Image analysis
• The raw data from a cDNA microarray experiment consist of pairs of image files (16-bit TIFFs), one for each of the dyes.
• Image analysis is required to extract measures of the red and green fluorescence intensities for each spot on the array.

Steps in image analysis
1. Addressing. Estimate the location of spot centers.
2. Segmentation. Classify pixels as foreground (signal) or background.
3. Information extraction. For each spot on the array and each dye:
• foreground intensities;
• background intensities;
• quality measures.

Why do we calculate the background intensities?
• Motivation behind background adjustment: a spot's measured fluorescence intensity includes a contribution that is not specifically due to the hybridization of the target to the probe but to something else, e.g., the chemical treatment of the slide or autofluorescence. We want to estimate and remove this unwanted contribution.

Quantification of expression
For each spot on the slide we calculate:
Red intensity R = Rfg - Rbg
Green intensity G = Gfg - Gbg
where fg = foreground and bg = background.
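
The subtraction above can be sketched in a few lines of Python. This is only an illustration: the function name `background_correct` and the intensity values are hypothetical, not taken from the lecture.

```python
def background_correct(fg, bg):
    """Return foreground minus background intensity for each spot."""
    if len(fg) != len(bg):
        raise ValueError("fg and bg must hold one value per spot")
    return [f - b for f, b in zip(fg, bg)]

# Hypothetical intensities for three spots (illustrative values only).
r_fg, r_bg = [1200.0, 800.0, 450.0], [200.0, 150.0, 100.0]
g_fg, g_bg = [700.0, 900.0, 300.0], [200.0, 100.0, 125.0]

red = background_correct(r_fg, r_bg)      # R = Rfg - Rbg
green = background_correct(g_fg, g_bg)    # G = Gfg - Gbg
```

Note that a spot whose background exceeds its foreground would get a negative intensity here; how to treat such spots is a separate decision that this sketch does not make.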

cDNA gene expression data
Data on p genes for n samples. Rows are genes, columns are mRNA samples.

Gene   sample1  sample2  sample3  sample4  sample5  …
1       0.46     0.30     0.80     1.51     0.00    ...
2      -0.10     0.49     0.24     0.06     0.46    ...
3       0.15     0.74     0.04     0.10     0.20    ...
4      -0.45    -1.03    -0.79    -0.56    -0.32    ...
5      -0.06     1.06     1.35     1.09    -1.09    ...

Each entry is a gene expression level, e.g. the expression level of gene 5 in mRNA sample 4 = log2(Red intensity / Green intensity).
Legend: up-regulated gene, down-regulated gene, unchanged expression.
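
As a small sketch of how one table entry is computed, the following function takes the log2 ratio of two background-corrected intensities. The name `log_ratio` and the choice to return `None` for non-positive intensities are assumptions made for this example, not part of the lecture.

```python
import math

def log_ratio(red, green):
    """log2(red / green) for one spot.

    Returns None when either intensity is not positive, because the
    logarithm is undefined there (an assumption of this sketch).
    """
    if red <= 0 or green <= 0:
        return None
    return math.log2(red / green)

up = log_ratio(1000.0, 500.0)      # red twice as strong as green
down = log_ratio(500.0, 1000.0)    # red half as strong as green
same = log_ratio(300.0, 300.0)     # equal intensities
```

With this definition a positive value means more red than green signal, a negative value means less, and zero means equal signal in both channels.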

Normalization
• Why?
• To correct for systematic differences between samples on the same slide, or between slides, that do not represent true biological variation between samples, for example:
• Dye activity
• Dye quantity
• Scanning parameters
• Location on the array
• Air bubbles

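Self-self hybridizations
How do we know normalization is necessary?
• By examining self-self hybridizations: we label one sample from the same tissue with both dyes, Cy3 and Cy5. Both channels should then give the same signal, so any systematic difference between them reveals dye bias.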
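
The lecture does not name a specific normalization method. One simple option, shown here only as an illustration, is global median centering: in a self-self hybridization the true log ratios should all be zero, so the median of the observed log ratios estimates a global dye bias that can be subtracted. The function name and the data below are hypothetical.

```python
import statistics

def median_center(log_ratios):
    """Estimate a global dye bias as the median log ratio and subtract it.

    This is global median centering, one simple normalization among many;
    it is an assumption of this sketch, not the lecture's stated method.
    """
    bias = statistics.median(log_ratios)
    return bias, [m - bias for m in log_ratios]

# Hypothetical log2(R/G) values from a self-self hybridization.
self_self = [0.30, 0.25, 0.35, 0.20, 0.40]
bias, centered = median_center(self_self)
```

This only removes a constant shift. Biases that depend on intensity or on location on the array need other methods.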

Homogeneity and Separation Principles
• Homogeneity: elements within a cluster are close to each other
• Separation: elements in different clusters are far apart from each other
• …clustering is not an easy task! Given these points, a clustering algorithm might make two distinct clusters as follows

Bad Clustering
This clustering violates both the Homogeneity and Separation principles:
• Close distances between points in separate clusters
• Far distances between points in the same cluster

Good Clustering
This clustering satisfies both the Homogeneity and Separation principles
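
One rough way to check the two principles numerically is to compare the largest distance inside any cluster with the smallest distance between clusters. The sketch below does this with Euclidean distance; the function names and the point sets are hypothetical and chosen only for illustration.

```python
import math
from itertools import combinations

def max_within(clusters):
    """Largest distance between two points of the same cluster."""
    return max(math.dist(a, b)
               for c in clusters for a, b in combinations(c, 2))

def min_between(clusters):
    """Smallest distance between two points of different clusters."""
    return min(math.dist(a, b)
               for c1, c2 in combinations(clusters, 2)
               for a in c1 for b in c2)

# Two tight groups of points, clustered sensibly ...
good = [[(0, 0), (0, 1), (1, 0)], [(10, 10), (10, 11), (11, 10)]]
# ... and the same points with cluster membership mixed up.
bad = [[(0, 0), (10, 10), (0, 1)], [(1, 0), (10, 11), (11, 10)]]

good_ok = max_within(good) < min_between(good)
bad_ok = max_within(bad) < min_between(bad)
```

Here `good_ok` is true and `bad_ok` is false. The comparison is a strict criterion: many reasonable clusterings of real data will not satisfy it, so it is a diagnostic rather than a definition of a good clustering.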

Clustering Techniques
• Agglomerative: start with every element in its own cluster, and iteratively join clusters together
• Divisive: start with one cluster and iteratively divide it into smaller clusters
• Hierarchical: organize elements into a tree; leaves represent genes, and the length of the paths between leaves represents the distances between genes. Similar genes lie within the same subtrees

Hierarchical Clustering
(Figure: hierarchical clustering of nine elements labeled 1 to 9)

Hierarchical Clustering Algorithm
The algorithm takes an n x n matrix d of pairwise distances between points as input.
• Hierarchical Clustering (d, n)
•   Form n clusters, each with one element
•   Construct a graph T by assigning one vertex to each cluster
•   while there is more than one cluster
•     Find the two closest clusters C1 and C2
•     Merge C1 and C2 into a new cluster C with |C1| + |C2| elements
•     Compute the distance from C to all other clusters
•     Add a new vertex C to T and connect it to vertices C1 and C2
•     Remove the rows and columns of d corresponding to C1 and C2
•     Add a row and column to d corresponding to the new cluster C
•   return T
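
A minimal Python sketch of this pseudocode follows. The pseudocode leaves open how the distance from the new cluster C to the other clusters is computed; this sketch assumes single linkage (the distance between the closest pair of members), which is only one of several common choices. The tree T is returned as a list of merges, and the four example points are hypothetical.

```python
def hierarchical_clustering(d, n):
    """Agglomerative clustering on an n x n distance matrix d.

    Returns the tree T as a list of (new_vertex, child1, child2) merges.
    Assumption: single linkage (cluster distance = closest pair of members).
    """
    clusters = {i: [i] for i in range(n)}      # vertex id -> member points
    dist = {(i, j): d[i][j] for i in range(n) for j in range(i + 1, n)}
    tree = []
    next_id = n
    while len(clusters) > 1:
        c1, c2 = min(dist, key=dist.get)       # the two closest clusters
        merged = clusters.pop(c1) + clusters.pop(c2)
        # Drop every distance that involves C1 or C2.
        dist = {p: v for p, v in dist.items() if c1 not in p and c2 not in p}
        # Distance from the new cluster C to each remaining cluster.
        for other, members in clusters.items():
            dist[(other, next_id)] = min(d[a][b] for a in merged for b in members)
        clusters[next_id] = merged
        tree.append((next_id, c1, c2))
        next_id += 1
    return tree

# Four hypothetical points on a line, at positions 0, 1, 5 and 7.
positions = [0, 1, 5, 7]
d = [[abs(a - b) for b in positions] for a in positions]
tree = hierarchical_clustering(d, len(positions))
```

Points 0 and 1 merge first (vertex 4), then points 2 and 3 (vertex 5), and finally the two resulting clusters (vertex 6). Recomputing linkage from the original matrix keeps the sketch short; a production implementation would update distances incrementally.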

K-Means Clustering Problem: Formulation
• Input: a set V consisting of n points, and a parameter k
• Output: a set X consisting of k points (cluster centers) that minimizes the squared error distortion d(V,X) over all possible choices of X
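
The slide does not spell out d(V,X). A common definition, assumed here, is the mean over all points of the squared Euclidean distance to the nearest center. The sketch below computes it for hypothetical points and centers.

```python
import math

def squared_error_distortion(points, centers):
    """Mean squared distance from each point to its closest center.

    Assumes Euclidean distance and averaging over the n points; the
    lecture does not state its exact definition of d(V, X).
    """
    return sum(min(math.dist(v, x) for x in centers) ** 2
               for v in points) / len(points)

# Hypothetical data: two pairs of points, 10 units apart.
V = [(0, 0), (0, 2), (10, 0), (10, 2)]
two_centers = squared_error_distortion(V, [(0, 1), (10, 1)])
one_center = squared_error_distortion(V, [(5, 1)])
```

With one center per pair every point is at distance 1 from its center, so the distortion is 1. A single center in the middle gives a distortion of about 26.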

1-Means Clustering Problem: an Easy Case
• Input: a set V consisting of n points
• Output: a single point x (cluster center) that minimizes the squared error distortion d(V,x) over all possible choices of x
The 1-Means Clustering problem is easy. However, it becomes very difficult (NP-hard) for more than one center. An efficient heuristic method for K-Means clustering is the Lloyd algorithm.
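
The easy case has a closed-form answer: for squared Euclidean distance, the minimizing point is the centroid (coordinate-wise mean) of V. The sketch below computes it and compares the distortion with that of another candidate point; the function names and data are hypothetical, and the distortion is assumed to be the mean squared Euclidean distance.

```python
def centroid(points):
    """Coordinate-wise mean of the points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def distortion(points, x):
    """Mean squared Euclidean distance from the points to a single center x."""
    return sum(sum((a - b) ** 2 for a, b in zip(v, x))
               for v in points) / len(points)

V = [(0, 0), (2, 0), (4, 6)]      # hypothetical points
c = centroid(V)                   # the 1-means solution
at_centroid = distortion(V, c)
elsewhere = distortion(V, (2.0, 3.0))
```

Any center other than the centroid, such as (2, 3) here, gives a strictly larger distortion.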

K-Means Clustering: Lloyd Algorithm
(Figure: illustration with points labeled x1, x2, x3)

K-Means Clustering: Lloyd Algorithm
• Lloyd Algorithm
•   Arbitrarily assign the k cluster centers
•   while the cluster centers keep changing
•     Assign each data point to the cluster Ci corresponding to the closest cluster representative (center) (1 ≤ i ≤ k)
•     After the assignment of all data points, compute new cluster representatives according to the center of gravity of each cluster; that is, for every cluster C the new cluster representative is (∑ v) / |C|, summing over all v in C
*This may lead to merely a locally optimal clustering.
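
The loop above can be sketched in Python as follows. The initial centers are passed in by the caller rather than chosen arbitrarily, so that the example is reproducible. The `max_iter` cap and the rule that an empty cluster keeps its old center are assumptions of this sketch, and the data are hypothetical.

```python
import math

def lloyd(points, centers, max_iter=100):
    """Lloyd's heuristic for k-means, starting from the given centers.

    Returns (centers, clusters). An empty cluster keeps its previous
    center (an assumption; the pseudocode does not cover this case).
    """
    centers = [tuple(c) for c in centers]
    clusters = [[] for _ in centers]
    for _ in range(max_iter):
        # Assignment step: each point goes to its closest center.
        clusters = [[] for _ in centers]
        for v in points:
            i = min(range(len(centers)), key=lambda j: math.dist(v, centers[j]))
            clusters[i].append(v)
        # Update step: each center moves to its cluster's center of gravity.
        new_centers = [
            tuple(sum(coords) / len(c) for coords in zip(*c)) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
        if new_centers == centers:     # centers stopped changing
            break
        centers = new_centers
    return centers, clusters

V = [(0, 0), (0, 2), (10, 0), (10, 2)]
good_centers, _ = lloyd(V, [(0, 0), (10, 0)])
bad_centers, _ = lloyd(V, [(0, 0), (0, 2)])
```

The two runs illustrate the footnote about local optima. Starting from (0, 0) and (10, 0), the algorithm finds the centers (0, 1) and (10, 1). Starting from (0, 0) and (0, 2), it stops at (5, 0) and (5, 2), a clustering that no longer changes but has a much larger distortion.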

Thank you
