Correspondence analysis applied to microarray data

1 / 21

# Correspondence analysis applied to microarray data - PowerPoint PPT Presentation

Correspondence analysis applied to microarray data. Kurt Fellenberg C. Hausernedikt Brorsrt Neutzner', Jo( rg D. Hoheiselartin Vingron http://www.dkfz-heidelberg.de/funct_genome/PDF-Files/PNAS-98-(2001)-10781.pdf www.pnas.org/cgi/doi/10.1073/pnas.181597298. Principal Component Analysis.

I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.

## Correspondence analysis applied to microarray data

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
Correspondence analysis applied to microarray data
• Kurt Fellenberg C. Hausernedikt Brorsrt Neutzner', Jo( rg D. Hoheiselartin Vingron
• http://www.dkfz-heidelberg.de/funct_genome/PDF-Files/PNAS-98-(2001)-10781.pdf
• www.pnas.org/cgi/doi/10.1073/pnas.181597298
Principal Component Analysis
• Given N data vectors from k-dimensions, find c <= k orthogonal vectors that can be best used to represent data
• The original data set is reduced to one consisting of N data vectors on c principal components (reduced dimensions)
• Each data vector is a linear combination of the c principal component vectors
• Project on the subspace which preserve the most of the data variability:
Correspondence analysis
• CA= PCA for categorical variables

Example:Dataset X -27 dog species 7 categorical variables

Name Height Weight Speed Intelligence Affection Agresivity Function

- + ++ - + ++ - + ++ - + ++ - + - + C H U (company,Hunt,Utility)

1. Boxer 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 1 1 0 0

… =X

27.Caniche 1 0 0 1 0 0 0 1 0 0 0 1 0 1 1 0 1 0 0

CA study dependence between 2 categorical variables

-Height,Function

Works on crossable N

Height/Function C H U C H U marginals

- 6 1 0| 7 n11 n12 n13 |n1+ -

+ 3 2 0| 5 N= n21 n22 n23 |n2+ +

++ 1 6 8| 15 n31 n32 n33 |n3+ ++

- - - - - - - - - - - - - - - - - - - - - - - - - -

10 9 8| 27 n+1 n+2 n+3 |n marginals

• Lines of crossable (categories of first variable) are seen as distribution in the space of distributions over the second variables category (dimension =# categories for second)
• Distance between points (distribution) – mutual information (KL distance)
• variable)distance
• Projection on the subspace that preserve the most of the “variability”
• Each category
Correspondence analysis
• Divide each line by its total

Height/Function C H U C H U

- 6/7 1/7 0/7| 7 n11/n1+ n12/n1+ n13/n1+

+ 3/5 2/5 0/5| 5 n21/n2+ n22/n2+ n23/n2+

++ 1/5 6/5 8/5| 15 n31/n3+ n32/n3+ n33/n3+

Each row become a point in probability space over the categories of the second variable (conditional distributions given the category value of the first variable)

# points =#categories of first variable =m

dimension of space = #number of categories of second variable l

distance between points (probabilities) –weighted Euclidian distance –low when indep

(can transform data and work with usual Euclidian distance)

Correspondence analysis
• Each row I considered with weight=
• Measure for total variability I= chi-square statistic =measure of

dependence between the two variables

CA –visualization of the cell that contribute most to dependence: if an n_ih has an outstanding value then both line i and column h will be far from g in the same direction

Correspondence analysis

Dimension reduction –project (in norm chi2) on the subspace that preserves the most of the variability (dependence)

New variable =linear combinations of the initial ones

Like in PCA -solutions in term of eigenvalues/eigenvectors of N

-eigenvalue –gives proportion of variability preserved

-measures for how well each point is represented in the subspace

-measures for contribution of each point/category in determining the optimal subspace -subspace “meaning”

Height/Function C H U CA1 CA2 C H U

- 6/7 1/7 0/7| 7 1.10 -.92 n11/n1+ n12/n1+ n13/n1+

+ 3/5 2/5 0/5| 5 0.85 1.23 n21/n2+ n22/n2+ n23/n2+

++ 1/15 6/15 8/15| 15 - 0.84 0.02 n31/n3+ n32/n3+ n33/n3+

Close CA1 (and good points representation in subspace) means similar category (ease to visualize-identify similar categories of the first variable (Height) in the low dimensional plot) (-+ height have similar function)

If join two “identical categories” the chi2- distance do not change

Repeat everything for transpose(N)

Height/Function C H U Function/Height - + ++ CA1 CA2

- 6 1 0| C 6/10 3/10 1/10 1.04 -.10

+ 3 2 0| H 1/9 2/9 6/9 -0.32 .43

++ 1 6 8| U 0/8 0/8 0/9 -0.94 -.37

10 9 8

Each column become a point in probability space over the categories of the first variable (conditional distributions given the category value of the second variable)

# points =#categories of second variable =l

dimension of space = #number of categories of first variable m

CA- New variables =linear combinations of the initial ones preserving dependence best

Close CA1 (and good points representation in subspace) means similar category (ease to visualize-identify similar categories of the second variable (Function) in the low dimensional plot) (U,H functions have similar heights)

Overlap the two plots

Function/Height - + ++ CA1 CA2

C 6/10 3/10 1/101.04 -.10

H 1/9 2/9 6/9 -0.32 .43

U 0/8 0/8 0/9 -0.94 -.37

CA1 1.10 .85 -.84

CA2 -.92 1.93 0.02

CA value in one plot are (up to a scale) weighted means of CA values in the second plot with weight corresponding to the conditional probability:

1.04= (6/10*(-.92)+3/10*1.93+1/10*0.02)*constant

Include “standard coordinates” =virtual rows concentrated on one column (1 0 0) (0 1 0) (0 0 1)

Categories of different variable close to the extreme of the axes and to each other are highly correlated:

Utility dogs are big; Company dogs are small

If reorder the rows and columns by first CA – generally cells with high values go on diagonalHeight/Function C H U CA1 - 6 1 0| 1.10 + 3 2 0| .85 ++ 1 6 8| -.84 CA1 1.04 -.32 -.94

Extension
• Treat X as N (crossable for two possible variable with 27 respective 6 categories

Name Height Function

- + ++ C H U - + ++ C H U CA1 CA2

1. Boxer 0 1 0 1 0 0 0 1/2 0 ½ 0 0 .45 .88

… =X

27.Caniche 1 0 0 1 0 0 ½ 0 0 ½ 0 0 .91 .02

CA1 1.2 .85 -.84 1.04 -.32 -.4

Plot from transpose(X) identical to overlapped plots above

New plot from X – extra points for each dog race

Relationship Height/Function- Dog race: Canish is small dog for company

Multiple correspondence analysis

Use the whole X (all the variables) as crosstable

Name Height Weight Speed Intelligence Affection Agresivity Function

- + ++ - + ++ - + ++ - + ++ - + - + C H U

1 . Boxer 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 1 1 0 0

… =X

27.Caniche 1 0 0 1 0 0 0 1 0 0 0 1 0 1 1 0 1 0 0

CA1 .32 .60 -.89 -35 .37 -.34 -.84 .78 .40 -.43

CA2 -1.04 .89 .37 -.81 .29 .46 -.29 .27 .19 -.21

Discovering association rules (based on correlation):

Company dogs are small, with high affectivity

Utility dogs are big, fast, aggressive

Hunt dogs are very intelligent

Use them for classification

What Is Association Mining?
• Association rule mining:
• Finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories.
• Frequent pattern: pattern (set of items, sequence, etc.) that occurs frequently in a database [AIS93]
• Motivation: finding regularities in data
• What products were often purchased together? — Beer and diapers?!
• What are the subsequent purchases after buying a PC?
• What kinds of DNA are sensitive to this new drug?
• Can we automatically classify web documents?

Customer

Customer

Customer

Basic Concepts: Frequent Patterns and Association Rules
• Itemset X={x1, …, xk}
• Find all the rules XYwith min confidence and support
• support, s, probability that a transaction contains XY
• confidence, c,conditional probability that a transaction having X also contains Y.

Transaction-id Items bought

10 A, B, C

20 A, C

30 A, D

• B, E, F
• Let min_support = 50%, min_conf = 50%:
• A  C (50%, 66.7%)
• C  A (50%, 100%)
Algorithms for scalable mining of (single-dimensional Boolean) association rules in transactional databases
• Apriori pruning principle: If there is any itemset which is infrequent, its superset should not be generated/tested!
• Method:
• generate length (k+1) candidate itemsets from length k frequent itemsets,
• test the candidates against DB
• Challenges
• Multiple scans of transaction database
• Huge number of candidates
• Tedious workload of support counting for candidates
• Construct FP-tree From A Transaction Database
• For each frequent item, construct its conditional pattern-base, and then its conditional FP-tree
• Repeat the process on each newly created conditional FP-tree
• Until the resulting FP-tree is empty, or it contains only one path—single path will generate all the combinations of its sub-paths, each of which is a frequent pattern
Association-Based Classification
• Several methods for association-based classification
• ARCS: Quantitative association mining and clustering of association rules (Lent et al’97)
• It beats C4.5 in (mainly) scalability and also accuracy
• Associative classification: (Liu et al’98)
• It mines high support and high confidence rules in the form of “cond_set => y”, where y is a class label
• CAEP (Classification by aggregating emerging patterns) (Dong et al’99)
• Emerging patterns (EPs): the itemsets whose support increases significantly from one class to another
• Mine Eps based on minimum support and growth rate
Table 1. Cell-cycle data as used in analysis
• The raw intensity data as obtained from ais imaging software (Imaging Research, St. Catherines, ON, Canada) were normalized
• The normalized data matrix was filtered for genes with positive minmax separation for at least one of the conditions under study (2).
• The data were shifted to a positive range by adding the minimum + 1
• alpha0 alpha7 alpha14 alpha21 alpha28 alph1a35 …
• (M/G1) (M/G1) (G1) (G1) (S) (S) ..
• YHR126C 5.81 5.73 6.01 5.48 5.37 5.23 …
• YOR066W 5.62 5.81 6.02 5.28 5.02 5.23 …
• hxt4 5.78 6.21 6.02 5.5 5.58 5.21 …
• PCL9 4.64 5.39 4.89 5.19 4.96 5.62 …
• mcm3 5.38 5.8 6.13 5.74 4.52 5.22 …
• . . . . . . . . …
• * 800 genes X 73 hybridizations
• 4 cell-cycle arrest methods of hybridization (18-alpha,24-cdc15,17-cdc28,14-elu)
• Samples from each method are drawn and their cell-cycle phase had been classified –5 classes
• * link toward database with information (meaning, functionality etc) for each gene provided

Each cell-cycle phase colored differently

(M/G1),(G1),(S), (G2),(M)

-can see that hybridization separate according to their cell-cycle phase

(one phase = one region of the plot)

- G1 phase strongly associated with histone gene cluster

- cdc15-30 hybridization classified yellow behave green

(located in green region)

-cdc15-70

-cdc15-80

suggest improper phase classification for these samples

(check with the profiles –proves correct)