
From Correlation to liquid association





Presentation Transcript


  1. From Correlation to liquid association • Why is correlation so widely used? • Positive correlation and negative correlation • What is the limitation of similarity analysis? • The concept of liquid association

  2. Liquid Association (LA) • Two examples • A challenge

  3. Liquid Association (LA) • LA is a generalized notion of association for describing a certain kind of ternary relationship between variables in a system (Li 2002, PNAS). • Green points represent four conditions for cellular state 1. • Red points represent four conditions for cellular state 2. • Blue points represent the transit state between cellular states 1 and 2. • (X,Y) forms an LA. The profiles of genes X and Y are displayed in a scatter plot. Important! The correlation between X and Y is 0.

  4. The correlation coefficient has been used by Gauss, Bravais, Edgeworth … • Its sweeping impact on data analysis is due to Galton (1822-1911), “Typical laws of heredity in man” • Karl Pearson modified and popularized its use. • A building block in multivariate analysis, of which clustering, classification, and dimension reduction are recurrent themes

  5. A note on correlation • Given two variables, X and Y, the concept of correlation between them has several variations. • Correlation is intended to measure the degree of coordinated change. • The original definition by Galton used the median as the center

  6. Regression, correlation • Sir Francis Galton (1822-1911), half-cousin of Charles Darwin, was an English Victorian polymath, anthropologist, eugenicist, tropical explorer, geographer, inventor, meteorologist, proto-geneticist, psychometrician, and statistician. He was knighted in 1909. • Galton invented the use of the regression line (Bulmer 2003, p. 184), and was the first to describe and explain the common phenomenon of regression toward the mean, which he first observed in his experiments on the size of the seeds of successive generations of sweet peas. • Bivariate normal

  7. Provide an introduction to regression and inverse regression. • Galton’s data. • Positive: height vs. speed • Negative: weight vs. speed

  8. Microarray

  9. Low level analysis • Convert an image into a number representing the ratio of the expression levels between the red and green channels • Color bias • Spatial, tip, and spot effects • Background noise • cDNA and oligonucleotide arrays (Affymetrix) (see review article) • Raw data processing packages: RMA (Robust Multi-array Average), dChip, MAS5 • Chip-to-chip normalization (set the “average expression level” in each chip to a common value in a certain sense) • Recommendation: try more than one method.
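
The chip-to-chip normalization bullet can be illustrated with a minimal sketch. It assumes the simplest reading of "common value": every chip is rescaled so that its mean equals the average of the chip means. This is not what RMA, dChip, or MAS5 do internally; the function name and the toy numbers are hypothetical.

```python
def scale_chips_to_common_mean(chips):
    """Rescale each chip so that all chips share the same mean expression.

    chips: list of chips, each a list of positive expression values.
    Assumption: the common target is the average of the per-chip means.
    """
    chip_means = [sum(chip) / len(chip) for chip in chips]
    target = sum(chip_means) / len(chip_means)
    return [[value * target / mean for value in chip]
            for chip, mean in zip(chips, chip_means)]


# Hypothetical toy data: the second chip is ten times brighter overall.
raw = [[1.0, 2.0, 3.0], [10.0, 20.0, 30.0]]
normalized = scale_chips_to_common_mean(raw)
print(normalized)
```

After scaling, both toy chips have the same mean, so the remaining differences between them are relative rather than absolute. Trying a second method, as the slide recommends, would mean swapping this function for another normalization.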

  10. Gene-expression data (n genes × p conditions)
               cond1   cond2   …   condp
      gene 1   x11     x12     …   x1p
      gene 2   x21     x22     …   x2p
      …
      gene n   xn1     xn2     …   xnp

  11. Yeast Cell Cycle (adapted from Molecular Cell Biology, Darnell et al.)

  12. Many examples are based on exploratory multivariate data analysis

  13. Review • Linkage: Eisen et al., Alon et al. • K-means: Tavazoie et al. • Self-organizing map: Tamayo et al. • SVD: Holter et al.; Alter, Brown, Botstein • Support vector machine (classification)

  14. Finding gene clusters • K-means versus self-organizing map • How to fine-tune user-specified parameters: some theoretical guidance is needed • What is a cluster? Criteria needed • Normal mixture, (hidden) indicator • PLAID model (Lazzeroni and Owen, Statistica Sinica 2002) • PCA plot, projection pursuit, grand tour • MDS (bi-plot for categorical responses, showing both cases (genes) and variables (different clustering methods), displaying results from many different clustering procedures) • GAP (generalized association plot; Chun-Houh Chen, cchen@stat.sinica.edu.tw)

  15. Seriation and row-column sorting • Hierarchical clustering • Others • Generalized association plot (Chen 2001) • Sharp boundaries may be artifacts due to “clever” permutation

  16. Gene profiles and correlation • Rationale behind massive gene expression analysis: genes with a high degree of expression similarity are likely to be functionally related and may participate in common pathways. They may be co-regulated by common upstream regulatory factors. • Pearson's correlation coefficient, a simple way of describing the strength of linear association between a pair of random variables, has become the most popular measure of gene expression similarity. 1. Cluster analysis: average linkage, self-organizing map, K-means, ... 2. Classification: nearest neighbor, linear discriminant analysis, support vector machine, … 3. Dimension reduction methods: PCA (SVD)

  17. The correlation coefficient (CC) has been used by Gauss, Bravais, Edgeworth … • Its sweeping impact on data analysis is due to Galton (1822-1911), “Typical laws of heredity in man” • Karl Pearson modified and popularized its use. • A building block in multivariate analysis, of which clustering, classification, and dimension reduction are recurrent themes • As a statistician, how can you ignore the time order? (Isn’t it true that the use of sample correlation relies on the assumption that the data are i.i.d.?)

  18. Gene profiles and correlation • Not studying cell-cycle regulation: time order is not considered here; but ... • Synchronized cells enhance cellular signals. Time points: Start, 10 min, 20 min, 30 min, 40 min, 50 min, 60 min. Tasks at those time points: task 1, task 2, task 2, task 2, task 2, task 3, task 4 • Suppose gene A and gene B participate in tasks 2 and 3, but not in tasks 1 or 4. A positive correlation is expected. (Figure: scatter plot of gene B and gene A.)

  19. Example: scatterplot matrix of MCM1, MCM2, MCM3, MCM4, MCM5, MCM6, MCM7 • The tight association among the six genes MCM2, ..., MCM7 is in sharp contrast to the association between each of them and MCM1. • It turns out that the gene products of MCM2, ..., MCM7 form a hexameric complex that binds chromatin. It is part of the pre-replicative complex, an assembly of proteins that forms at origins of DNA replication between late M phase and the G1/S transition and includes other proteins believed to act in DNA replication initiation. • MCM1 is a transcription factor of the MADS (Mcm1p, Agamous, Deficiens, SRF) box family, which recruits coregulatory proteins for both gene activation and repression at a variety of loci. • Note that MCM2-MCM7 fall into the "MCM" cluster of 34 cell-cycle-regulated genes identified by Spellman et al. (1998).

  20. Negative correlation: • Gene A (transcription factor) inactivates gene B • Gene A and gene B are involved in conjugate tasks • “Correlation is enough!” ??? Till I read the following from page 462 of Lodish et al. (1995)

  21. The thyroid hormone receptor differs functionally from the glucocorticoid receptor in two important respects: it binds to its DNA response elements in the absence of hormone, and the bound protein represses transcription rather than activating it. When thyroid hormone binds to the thyroid hormone receptor, the receptor is converted from a repressor to an activator. • Gene A = gene that produces THR (thyroid hormone receptor) • Gene B = gene regulated by THR • THR alone represses B • THR + HM (hormone) activates B

  22. Expression levels of A and B can be either positively or negatively correlated, depending on the thyroid hormone level. If, during an experiment, the hormone level fluctuates as organisms try to accomplish different tasks, and if we cannot tell what the tasks are, then ..... Of course, the book is not talking about yeast there. However, pairwise similarity is not enough! • THR alone represses B • THR + HM activates B
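
A toy calculation makes the point concrete. The sketch below uses made-up numbers (not data from any experiment) in which gene B falls as gene A rises when the hormone is absent and rises with A when the hormone is present. The variable names are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical expression of gene A (the receptor) in eight conditions.
gene_a = [1, 2, 3, 4, 1, 2, 3, 4]
# Hormone absent in the first four conditions, present in the last four.
hormone_present = [False] * 4 + [True] * 4
# Without hormone the receptor represses B; with hormone it activates B.
gene_b = [a if present else 5 - a
          for a, present in zip(gene_a, hormone_present)]

r_absent = pearson(gene_a[:4], gene_b[:4])    # hormone absent
r_present = pearson(gene_a[4:], gene_b[4:])   # hormone present
r_pooled = pearson(gene_a, gene_b)            # hormone state unknown
print(r_absent, r_present, r_pooled)
```

In this toy setting the correlation is perfectly negative without hormone and perfectly positive with it, yet the pooled correlation is zero, so a pairwise similarity search that cannot see the hormone state would miss the relationship.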

  23. Transcription factors: proteins that bind to DNA • Activators; repressors

  24. Why does clustering make sense biologically? The rationale is that genes with a high degree of expression similarity are likely to be functionally related: • they may form a structural complex, • may participate in common pathways, • may be co-regulated by common upstream regulatory elements. Simply put, profile similarity implies functional association

  25. However, the converse is not true • The expression profiles of the majority of functionally associated genes are in fact uncorrelated • Microarray is too noisy • Biology is complex

  26. Why no correlation? • Proteins rarely work alone • Proteins have multiple functions • Different biological processes or pathways have to be synchronized • Competing use of finite resources: metabolites, hormones, … • Protein modification: phosphorylation, proteolysis, shuttling, … • Transcription factors serving both as activators and repressors

  27. Going subtle: protein modification • Histone inhibits transcription • To activate transcription, the lysine side chain must be acetylated. Weaver (2001)

  28. Corepressor: histone deacetylase • Thyroid hormone • Coactivator: histone acetyltransferase

  29. Mathematical modeling: a nightmare • (Diagram; recoverable labels: current and next mRNA; observed; hidden; protein kinase; ATP, GTP, cAMP, etc.; cytoplasm, nucleus, mitochondria, vacuolar localization; DNA methylation, chromatin structure; nutrients (carbon, nitrogen sources); temperature; water; fitness; function.) • Statistical methods become useful

  30. What is LA? Concept of “mediator”

  31. Schematic illustration of LA

  32. From CC to liquid association (LA) • Two examples • A challenge

  33. Example 1. Positive-to-negative • X = ARP4, Y = LAS17, Z = MCM1 • Corr = 0 in each plot • (A) For low Z (marked points in A), X and Y are co-expressed • (B) For high Z (marked points in B), X and Y are contra-expressed • Arp4: protein that interacts with core histones, member of the NuA4 histone acetyltransferase complex; actin-related protein • Las17: component of the cortical actin cytoskeleton

  34. Example 2. Negative-to-positive • X = QCR9, Y = ROX1, Z = MCM1 • Corr = 0 in each plot • (A) For low Z (marked points in A), X and Y are contra-expressed • (B) For high Z (marked points in B), X and Y are co-expressed • Rox1: heme-dependent transcriptional repressor of hypoxic genes including CYC7 (iso-2-cytochrome c) and ANB1 (translation initiation, ribosome) • Qcr9: ubiquinol cytochrome c reductase subunit 9

  35. A Challenge • Which genes behave like that? • Can we identify all of them? • N = 5878 ORFs • N choose 3 ≈ 33.8 billion triplets to inspect
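
The triplet count can be checked directly. This is only the arithmetic behind the slide's figure, using the ORF count quoted there; the variable names are illustrative.

```python
from math import comb

n_orfs = 5878                  # number of yeast ORFs quoted on the slide
n_triplets = comb(n_orfs, 3)   # unordered triplets of distinct genes
print(f"{n_triplets:,} triplets (about {n_triplets / 1e9:.1f} billion)")
```

The count treats each triplet as unordered; it does not distinguish which of the three genes plays the role of Z.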

  36. Liquid Association (LA) • LA is a generalized notion of association for describing a certain kind of ternary relationship between variables in a system (Li 2002, PNAS). • Green points represent four conditions for cellular state 1. • Red points represent four conditions for cellular state 2. • Blue points represent the transit state between cellular states 1 and 2. • (X,Y) forms an LA. The profiles of genes X and Y are displayed in a scatter plot. Important! The correlation between X and Y is 0.

  37. Statistical theory for LA • X, Y, Z are random variables with mean 0 and variance 1 • Corr(X,Y) = E(XY) = E(E(XY|Z)) = E g(Z), where g(z) = E(XY|Z=z) • g(z) is an ideal summary of the association pattern between X and Y when Z = z • g’(z) = derivative of g(z) • Definition. The LA of X and Y with respect to Z is LA(X,Y|Z) = E g’(Z)

  38. Statistical theory-LA • Theorem. If Z is standard normal, then LA(X,Y|Z) = E(XYZ) • Proof. By Stein’s Lemma: E g’(Z) = E g(Z)Z = E(E(XY|Z)Z) = E(XYZ) • Additional mathematical properties: • bounded by third moment • LA = 0 if jointly normal • transformation
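
For readers who want the step behind Stein's Lemma, the following is a sketch of the usual integration-by-parts argument. It assumes g(z) = E(XY | Z = z) is differentiable, that the expectations exist, and that g(z)φ(z) vanishes as z → ±∞, where φ is the standard normal density (so φ'(z) = −zφ(z)). These regularity conditions are assumptions of the sketch, not statements from the slides.

```latex
\begin{align*}
E\,g'(Z)
  &= \int_{-\infty}^{\infty} g'(z)\,\varphi(z)\,dz
   = \Big[\, g(z)\,\varphi(z) \Big]_{-\infty}^{\infty}
     - \int_{-\infty}^{\infty} g(z)\,\varphi'(z)\,dz \\
  &= \int_{-\infty}^{\infty} z\,g(z)\,\varphi(z)\,dz
   = E\big(Z\,g(Z)\big), \\[4pt]
E\big(Z\,g(Z)\big)
  &= E\big(Z\,E(XY \mid Z)\big)
   = E\big(E(XYZ \mid Z)\big)
   = E(XYZ).
\end{align*}
```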

  39. Normality? • Convert each gene expression profile by taking the normal score transformation • LA(X,Y|Z) is then estimated by the average of the triplet product of the three gene profiles: (x1y1z1 + x2y2z2 + …) / n
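
The two steps above can be sketched in a few lines. Several details are assumptions of this sketch rather than statements from the slides: the normal score of an observation is taken as the standard normal quantile of rank/(n+1), ties are assumed absent, and the scores are re-standardized to mean 0 and variance 1 before the triple product is averaged. The function names and the toy profiles are hypothetical.

```python
from statistics import NormalDist


def normal_scores(values):
    """Replace each value by the standard normal quantile of rank/(n+1).

    Assumes there are no ties among the values.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    scores = [0.0] * n
    for rank, index in enumerate(order, start=1):
        scores[index] = NormalDist().inv_cdf(rank / (n + 1))
    return scores


def standardize(values):
    """Center to mean 0 and scale to variance 1 (divisor n)."""
    n = len(values)
    mean = sum(values) / n
    sd = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / sd for v in values]


def la_score(x, y, z):
    """Average triple product of the transformed profiles."""
    xs, ys, zs = (standardize(normal_scores(p)) for p in (x, y, z))
    return sum(a * b * c for a, b, c in zip(xs, ys, zs)) / len(xs)


# Hypothetical profiles over eight conditions: X and Y move together
# when Z is low (first four) and in opposite directions when Z is high.
z_profile = [1, 2, 3, 4, 5, 6, 7, 8]
x_profile = [1, 8, 3, 6, 2, 7, 4, 5]
y_profile = [1, 8, 3, 6, 7, 2, 5, 4]
print(la_score(x_profile, y_profile, z_profile))
```

In this toy example the score is negative, matching a change from co-expression at low Z to contra-expression at high Z. Swapping the two halves of the Y profile would give a positive score.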

  40. Figure 3. Organization chart for incorporating LA with similarity-based methods. Co-expressed genes found by profile similarity analysis can be pooled together to obtain a consensus profile for LA-scouting. Likewise, the genes identified through the LA system can be further analyzed for patterns of clustering. For some applications, the scouting variable may come from external sources related to the expression profiles. SVD: singular value decomposition; PCA: principal component analysis.

  41. How does LA work in yeast? Urea cycle/arginine biosynthesis

  42. Yeast Cell Cycle (adapted from Molecular Cell Biology, Darnell et al.) • Most visible event

  43. Expression masters • Glycolysis/gluconeogenesis • Glycolysis: converts glucose to pyruvate in 10 enzymatic reactions, generating 2 mol of ATP per mol of glucose • Gluconeogenesis: biosynthesis of glucose from noncarbohydrate precursors • A total of 35 yeast genes related to either pathway (MIPS)

  44. Longevity: an episode • A recent Science article reveals the role of a yeast gene, SCH9, in aging (Fabrizio et al. 2001) • A part of an intensified effort to establish “common processes control life-span of most organisms, including worms, flies, and possibly mammals” (Strass, 2001) • Use SCH9 as Z, find LAPs (liquid association pairs) in the LA-ID database

  45. ARG1, ARG2: a LAP of SCH9 • ARG1 encodes argininosuccinate synthetase: L-citrulline + aspartate + ATP = AMP + pyrophosphate + N(omega)-(L-arginino)succinate • ARG2 encodes acetylglutamate synthase: L-glutamate + N2-acetyl-L-ornithine = L-ornithine + N-acetyl-L-glutamate • Only 40 LAPs are saved in the LA-ID database • 40 in 17 million • Pathway of arginine biosynthesis/urea cycle • Figure of ARG1, ARG2, SCH9 (labels: citrulline, glutamate)
