Correspondence analysis applied to microarray data
  • Kurt Fellenberg, Nicole C. Hauser, Benedikt Brors, Albert Neutzner, Jörg D. Hoheisel, Martin Vingron
  • http://www.dkfz-heidelberg.de/funct_genome/PDF-Files/PNAS-98-(2001)-10781.pdf
  • www.pnas.org/cgi/doi/10.1073/pnas.181597298
Principal Component Analysis
  • Given N data vectors in k dimensions, find c ≤ k orthogonal vectors that best represent the data
    • The original data set is reduced to one consisting of N data vectors on c principal components (reduced dimensions)
  • Each data vector is a linear combination of the c principal component vectors
  • Project onto the subspace that preserves most of the data variability
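The projection step above can be sketched with NumPy; this is an illustrative implementation, not code from the presentation (the data matrix is randomly generated):

```python
import numpy as np

def pca_project(X, c):
    """Project an N x k data matrix X onto its top-c principal components."""
    Xc = X - X.mean(axis=0)                       # center each dimension
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:c]                           # c orthonormal direction vectors
    scores = Xc @ components.T                    # N x c reduced representation
    explained = (s[:c] ** 2).sum() / (s ** 2).sum()  # fraction of variability kept
    return scores, components, explained

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5)) @ rng.normal(size=(5, 5))  # correlated toy data
scores, comps, frac = pca_project(X, 2)
```

Each row of `scores` is the c-dimensional representation of the corresponding data vector; `frac` is the proportion of total variance the projection preserves.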
Correspondence analysis
  • CA = PCA for categorical variables

Example: dataset X – 27 dog species, 7 categorical variables

Name         Height   Weight   Speed    Intellig.  Affect.  Aggress.  Function
             - + ++   - + ++   - + ++   - + ++     - +      - +       C H U (Company, Hunt, Utility)
1. Boxer     0 1 0    0 1 0    0 1 0    0 1 0      0 1      0 1       1 0 0
...                                                                            = X
27. Caniche  1 0 0    1 0 0    0 1 0    0 0 1      0 1      1 0       1 0 0

CA studies the dependence between 2 categorical variables

- e.g. Height and Function

Works on the crosstable N

Height/Function    C    H    U  | marginals
-                  6    1    0  |    7              n11  n12  n13 | n1+
+                  3    2    0  |    5         N =  n21  n22  n23 | n2+
++                 1    6    8  |   15              n31  n32  n33 | n3+
--------------------------------+----------
marginals         10    9    8  |   27              n+1  n+2  n+3 | n

  • Lines of the crosstable (categories of the first variable) are seen as distributions in the space of distributions over the second variable's categories (dimension = # categories of the second variable)
  • Distance between points (distributions) – mutual information (KL distance)
  • Projection onto the subspace that preserves most of the "variability"
Correspondence analysis
  • Divide each line by its total

Height/Function    C     H     U                    C        H        U
-                 6/7   1/7   0/7  |  7         n11/n1+  n12/n1+  n13/n1+
+                 3/5   2/5   0/5  |  5         n21/n2+  n22/n2+  n23/n2+
++                1/15  6/15  8/15 | 15         n31/n3+  n32/n3+  n33/n3+

Each row becomes a point in the probability space over the categories of the second variable (the conditional distribution given the category value of the first variable)

# points = # categories of the first variable = m

dimension of the space = # categories of the second variable = l

distance between points (probabilities) – weighted Euclidean distance – low when the variables are independent

(one can transform the data and work with the usual Euclidean distance)
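As a sketch (not from the presentation), the row profiles and the weighted Euclidean (chi-square) distance for the Height × Function table can be computed as follows; the 1/column-mass weighting used here is the standard CA convention:

```python
import numpy as np

# Height x Function contingency table from the slides
N = np.array([[6, 1, 0],
              [3, 2, 0],
              [1, 6, 8]], dtype=float)
n = N.sum()
row_profiles = N / N.sum(axis=1, keepdims=True)  # conditional distributions per row
col_masses = N.sum(axis=0) / n                   # weights 10/27, 9/27, 8/27

def chi2_dist(p, q, c):
    """Chi-square distance: Euclidean distance weighted by 1/column mass."""
    return np.sqrt((((p - q) ** 2) / c).sum())

d_minus_plus = chi2_dist(row_profiles[0], row_profiles[1], col_masses)
```

On this table the "-" and "+" height profiles come out closer to each other than either is to "++", matching the interpretation on the later slides.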

Correspondence analysis
  • Each row i is considered with weight = ni+/n (its marginal frequency)
  • Measure of total variability (inertia): I = chi-square statistic / n = a measure of the dependence between the two variables

CA – visualization of the cells that contribute most to the dependence: if a cell nih has an outstanding value, then both row i and column h will lie far from the centroid g, in the same direction

Correspondence analysis

Dimension reduction –project (in norm chi2) on the subspace that preserves the most of the variability (dependence)

New variable =linear combinations of the initial ones

Like in PCA -solutions in term of eigenvalues/eigenvectors of N

-eigenvalue –gives proportion of variability preserved

-measures for how well each point is represented in the subspace

-measures for contribution of each point/category in determining the optimal subspace -subspace “meaning”

Height/Function    C     H     U            CA1    CA2
-                 6/7   1/7   0/7  |  7     1.10  -0.92
+                 3/5   2/5   0/5  |  5     0.85   1.23
++                1/15  6/15  8/15 | 15    -0.84   0.02

Categories close on CA1 (given good representation of the points in the subspace) are similar – this makes it easy to visualize and identify similar categories of the first variable (Height) in the low-dimensional plot (the - and + heights have similar Function profiles)

If two "identical" categories are merged, the chi-square distances do not change
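A minimal sketch of the decomposition, assuming the standard formulation of CA as an SVD of the matrix of standardized residuals (the slides phrase the solution in terms of eigenvalues/eigenvectors of N; the two formulations are equivalent up to scaling):

```python
import numpy as np

# Height x Function crosstable from the slides
N = np.array([[6, 1, 0],
              [3, 2, 0],
              [1, 6, 8]], dtype=float)
n = N.sum()
P = N / n
r = P.sum(axis=1)                                    # row masses
c = P.sum(axis=0)                                    # column masses
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))   # standardized residuals
U, s, Vt = np.linalg.svd(S)
row_coords = (U * s) / np.sqrt(r)[:, None]           # principal row coordinates
col_coords = (Vt.T * s) / np.sqrt(c)[:, None]        # principal column coordinates
total_inertia = (s ** 2).sum()                       # equals chi-square / n

# cross-check against the classical chi-square statistic
expected = np.outer(N.sum(axis=1), N.sum(axis=0)) / n
chi2 = ((N - expected) ** 2 / expected).sum()
```

Axis signs are arbitrary, so the coordinates may differ from the slides' values by a sign flip; the mass-weighted means of the principal coordinates are zero, the total inertia equals χ²/n, and each set of coordinates is (up to the factor 1/singular value) a weighted mean of the other, which is the transition relation the overlap slide uses.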

Repeat everything for transpose(N)

Height/Function   C   H   U      Function/Height    -     +     ++        CA1    CA2
-                 6   1   0      C                 6/10  3/10  1/10       1.04  -0.10
+                 3   2   0      H                 1/9   2/9   6/9       -0.32   0.43
++                1   6   8      U                 0/8   0/8   8/8       -0.94  -0.37
                 10   9   8

Each column becomes a point in the probability space over the categories of the first variable (the conditional distribution given the category value of the second variable)

# points = # categories of the second variable = l

dimension of the space = # categories of the first variable = m

CA – new variables = linear combinations of the initial ones, preserving the dependence best

Categories close on CA1 (given good representation of the points in the subspace) are similar – this makes it easy to visualize and identify similar categories of the second variable (Function) in the low-dimensional plot (the U and H functions have similar Height profiles)

Overlap the two plots

Function/Height    -     +     ++       CA1    CA2
C                 6/10  3/10  1/10      1.04  -0.10
H                 1/9   2/9   6/9      -0.32   0.43
U                 0/8   0/8   8/8      -0.94  -0.37

CA1 (rows)   1.10   0.85  -0.84
CA2 (rows)  -0.92   1.23   0.02

The CA values in one plot are (up to a scale) weighted means of the CA values in the other plot, with weights given by the conditional probabilities:

1.04 = (6/10 * 1.10 + 3/10 * 0.85 + 1/10 * (-0.84)) * constant

Include "standard coordinates" = virtual rows concentrated on one column: (1 0 0), (0 1 0), (0 0 1)

Categories of different variables that lie close to the extremes of the axes and close to each other are highly correlated:

Utility dogs are big; Company dogs are small

(see also the gene classification example)

If we reorder the rows and columns by their first CA coordinate, the cells with high values generally move onto the diagonal:

Height/Function    C      H      U      CA1
-                  6      1      0      1.10
+                  3      2      0      0.85
++                 1      6      8     -0.84
CA1               1.04  -0.32  -0.94

Extension
  • Treat X as N (a crosstable of two variables with 27 and 6 categories, respectively)

Name          Height    Function      Height         Function        CA1   CA2
              -  +  ++  C  H  U       -    +   ++    C    H   U
1. Boxer      0  1  0   1  0  0       0   1/2  0     1/2  0   0      0.45  0.88
...                                                                           = X
27. Caniche   1  0  0   1  0  0       1/2  0   0     1/2  0   0      0.91  0.02

CA1           1.10  0.85  -0.84       1.04  -0.32  -0.94

The plot from transpose(X) is identical to the overlapped plots above

A new plot from X adds extra points, one for each dog breed

Relationship Height/Function – dog breed: the Caniche is a small company dog

Multiple correspondence analysis

Use the whole X (all the variables) as the crosstable

Name         Height   Weight   Speed    Intellig.  Affect.  Aggress.  Function
             - + ++   - + ++   - + ++   - + ++     - +      - +       C H U
1. Boxer     0 1 0    0 1 0    0 1 0    0 1 0      0 1      0 1       1 0 0
...                                                                           = X
27. Caniche  1 0 0    1 0 0    0 1 0    0 0 1      0 1      1 0       1 0 0

CA1   0.32  0.60  -0.89  -0.35  0.37  -0.34  -0.84  0.78  0.40  -0.43
CA2  -1.04  0.89   0.37  -0.81  0.29   0.46  -0.29  0.27  0.19  -0.21

Discovering association rules (based on correlation):

Company dogs are small, with high affectivity

Utility dogs are big, fast, aggressive

Hunt dogs are very intelligent

Use them for classification
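Multiple CA is a plain CA run on the full indicator matrix. A minimal sketch on a tiny made-up indicator matrix (not the dog data; the one-hot layout is an assumption for illustration), using the standard standardized-residual SVD formulation of CA:

```python
import numpy as np

# Hypothetical indicator matrix: 3 individuals, two categorical variables
# one-hot coded (3 categories + 2 categories)
X = np.array([[1, 0, 0, 1, 0],
              [0, 1, 0, 0, 1],
              [0, 0, 1, 0, 1]], dtype=float)

# MCA = correspondence analysis of the indicator matrix
P = X / X.sum()
r = P.sum(axis=1)                                    # individual masses
c = P.sum(axis=0)                                    # category masses
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))   # standardized residuals
U, s, Vt = np.linalg.svd(S, full_matrices=False)
ind_coords = (U * s) / np.sqrt(r)[:, None]           # one point per individual
cat_coords = (Vt.T * s) / np.sqrt(c)[:, None]        # one point per category
```

Plotting `ind_coords` and `cat_coords` together gives the joint display the slides use to read off associations between categories (and between individuals and categories).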

What Is Association Mining?
  • Association rule mining:
    • Finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories.
    • Frequent pattern: pattern (set of items, sequence, etc.) that occurs frequently in a database [AIS93]
  • Motivation: finding regularities in data
    • What products were often purchased together? — Beer and diapers?!
    • What are the subsequent purchases after buying a PC?
    • What kinds of DNA are sensitive to this new drug?
    • Can we automatically classify web documents?
(Figure: Venn diagram of customers who buy beer, customers who buy diapers, and customers who buy both)
Basic Concepts: Frequent Patterns and Association Rules
  • Itemset X = {x1, …, xk}
  • Find all rules X → Y with minimum confidence and support
    • support, s: the probability that a transaction contains X ∪ Y
    • confidence, c: the conditional probability that a transaction containing X also contains Y

Transaction-id   Items bought
10               A, B, C
20               A, C
30               A, D
40               B, E, F

  • Let min_support = 50%, min_conf = 50%:
    • A → C (50%, 66.7%)
    • C → A (50%, 100%)
Algorithms for scalable mining of (single-dimensional Boolean) association rules in transactional databases
  • Apriori pruning principle: If there is any itemset which is infrequent, its superset should not be generated/tested!
  • Method:
    • generate length (k+1) candidate itemsets from length k frequent itemsets,
    • test the candidates against DB
  • Challenges
    • Multiple scans of transaction database
    • Huge number of candidates
    • Tedious workload of support counting for candidates
  • Construct FP-tree From A Transaction Database
    • For each frequent item, construct its conditional pattern-base, and then its conditional FP-tree
    • Repeat the process on each newly created conditional FP-tree
    • Until the resulting FP-tree is empty, or it contains only one path—single path will generate all the combinations of its sub-paths, each of which is a frequent pattern
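The Apriori method described above can be sketched compactly (illustrative code, not from the presentation):

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Level-wise frequent-itemset mining with Apriori pruning."""
    n = len(transactions)
    items = sorted({i for t in transactions for i in t})

    def sup(itemset):
        return sum(itemset <= t for t in transactions) / n

    # L1: frequent single items
    frequent = [{frozenset([i]) for i in items if sup(frozenset([i])) >= min_support}]
    k = 1
    while frequent[-1]:
        prev = frequent[-1]
        # join step: combine length-k frequent sets into length-(k+1) candidates
        candidates = {a | b for a in prev for b in prev if len(a | b) == k + 1}
        # prune step: every k-subset of a candidate must itself be frequent
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, k))}
        # test the surviving candidates against the database
        frequent.append({c for c in candidates if sup(c) >= min_support})
        k += 1
    return [lvl for lvl in frequent if lvl]

transactions = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}]
levels = apriori(transactions, 0.5)
```

On the slide's four transactions with min_support = 50%, this yields the frequent 1-itemsets {A}, {B}, {C} and the single frequent 2-itemset {A, C}.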
Association-Based Classification
  • Several methods for association-based classification
    • ARCS: Quantitative association mining and clustering of association rules (Lent et al’97)
      • It beats C4.5 in (mainly) scalability and also accuracy
    • Associative classification: (Liu et al’98)
      • It mines high support and high confidence rules in the form of “cond_set => y”, where y is a class label
    • CAEP (Classification by aggregating emerging patterns) (Dong et al’99)
      • Emerging patterns (EPs): the itemsets whose support increases significantly from one class to another
      • Mine EPs based on minimum support and growth rate
Table 1. Cell-cycle data as used in analysis
  • The raw intensity data as obtained from the AIS imaging software (Imaging Research, St. Catharines, ON, Canada) were normalized
  • The normalized data matrix was filtered for genes with a positive min–max separation for at least one of the conditions under study (2).
  • The data were shifted to a positive range by adding the (absolute value of the) minimum + 1
            alpha0   alpha7   alpha14  alpha21  alpha28  alpha35  …
            (M/G1)   (M/G1)   (G1)     (G1)     (S)      (S)      …
  YHR126C    5.81     5.73     6.01     5.48     5.37     5.23    …
  YOR066W    5.62     5.81     6.02     5.28     5.02     5.23    …
  hxt4       5.78     6.21     6.02     5.5      5.58     5.21    …
  PCL9       4.64     5.39     4.89     5.19     4.96     5.62    …
  mcm3       5.38     5.8      6.13     5.74     4.52     5.22    …
  …
  • 800 genes × 73 hybridizations
  • 4 cell-cycle arrest methods for the hybridizations (18 alpha, 24 cdc15, 17 cdc28, 14 elu)
  • Samples drawn from each method had their cell-cycle phase classified into 5 classes
  • A link to a database with information (meaning, functionality, etc.) for each gene is provided
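The positive-range shift described in the preprocessing bullets can be sketched as follows (the data matrix here is made up for illustration):

```python
import numpy as np

# Hypothetical normalized expression matrix (genes x hybridizations)
data = np.array([[-2.3, 0.1, 1.4],
                 [-0.7, 0.9, -1.2]])

# Shift to a positive range by adding |minimum| + 1,
# so the smallest value becomes exactly 1
shifted = data - data.min() + 1
```

This keeps all pairwise differences between entries unchanged while making every value positive, as CA requires nonnegative input.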

Each cell-cycle phase is colored differently: (M/G1), (G1), (S), (G2), (M)

- One can see that the hybridizations separate according to their cell-cycle phase (one phase = one region of the plot)

- The G1 phase is strongly associated with the histone gene cluster

- The hybridizations cdc15-30 (classified yellow but behaving green, i.e. located in the green region), cdc15-70 and cdc15-80 suggest an improper phase classification for these samples (checking against the expression profiles proves this correct)