Emerging causal inference problems in molecular systems biology


Yi Liu, Ph.D.

Beijing Jiaotong University

The presented work was mainly collaborated with:

Prof. Jing-Dong Jackie Han, Dr. Nan Qiao, Dr. Wei Zhang

@ CAS -Max Planck partner Institute for Computational Biology

Prof. Min Liu, Dr. Jin’e Li

@ Institute of Genetics & Developmental Biology, CAS


Outline

  • Background

    Mining biological knowledge from the big data generated by the Next Generation Sequencing (NGS) Technology

  • Examples of causal inference problems in biology

    1) Inferring causal relationships between transcription factors, epigenetic modifications and gene expression level from heterogeneous deep sequencing data sets

    2) Reverse-engineering the Yeast genetic regulatory network from deletion-mutant gene expression data

    3) Discovering subtypes of ovarian cancer and uncovering key molecular signatures that distinguish these subtypes.


The need for integrating heterogeneous functional genomic data sets

Yi Liu* and Jing-Dong J. Han*. Application of Bayesian networks on large-scale biological data. Frontiers in Biology, 2010, 5(2):98-104.


SeqSpider: A new Bayesian network inference algorithm enabling integrative analysis of deep sequencing data

Y Liu, N Qiao et al., Cell Research (2013)

Thanks to Prof. Jing-Dong Han for contributing to the slides on this topic.


Limitation of traditional BN learning approaches

In traditional Bayesian network (BN) structure learning approaches, each node must take a discrete value.

The only exception is the linear-Gaussian case. However, this parameterization is still very restrictive.


Profiled signature of deep sequencing data

Deep sequencing data have distinctive profiled signatures along the chromosomes, especially at the gene promoter regions.

However, there is no way to utilize such information in traditional BN learning algorithms.

Figure panels: H3K4me3 profile; mRNA profile (Liu et al., Nucleic Acids Res, 2010)


Profiles of hESC regulators around TSSs

In this work, we infer causal relationships between transcription factors, epigenetic modifications and gene expression level in human/mouse embryonic stem cells.


Heterogeneous data types in systems biology

Worse still, a single systems biology investigation can involve heterogeneous data types.

Handling multiple data types simultaneously in BN structure learning is not a trivial task.


Kernel-based surrogate dependency measures

In this work, we use the Kernel Generalized Variance (F. Bach, JMLR 2002) to quantify the joint dependence between heterogeneous variables. It replaces the mutual information-like measures in BN learning.
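As a rough illustration of how a kernel-based dependence score can stand in for mutual information, here is a minimal sketch of a regularized two-variable Kernel Generalized Variance. It follows the commonly cited form from the kernel ICA literature; the Gaussian kernel, the bandwidth `sigma`, the regularizer `kappa` and all function names are assumptions chosen for this example, not settings taken from the slides.

```python
import numpy as np


def centered_gaussian_gram(x, sigma=1.0):
    """Centered Gaussian (RBF) Gram matrix of a 1-D sample (assumed kernel)."""
    x = np.asarray(x, dtype=float).reshape(-1, 1)
    gram = np.exp(-((x - x.T) ** 2) / (2.0 * sigma ** 2))
    n = gram.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    return centering @ gram @ centering


def kgv(gram_x, gram_y, kappa=2e-2):
    """Regularized KGV of two centered Gram matrices; larger means more dependent."""
    n = gram_x.shape[0]
    ridge = 0.5 * n * kappa * np.eye(n)
    # R = K (K + (n * kappa / 2) I)^-1 shrinks each Gram matrix toward zero.
    r_x = gram_x @ np.linalg.inv(gram_x + ridge)
    r_y = gram_y @ np.linalg.inv(gram_y + ridge)
    cross = r_x @ r_y
    # KGV = -1/2 * log det(I - (R_x R_y)(R_x R_y)^T)
    _, logdet = np.linalg.slogdet(np.eye(n) - cross @ cross.T)
    return -0.5 * logdet


rng = np.random.default_rng(0)
x = rng.normal(size=80)
y_dependent = x + 0.1 * rng.normal(size=80)   # nearly a copy of x
y_independent = rng.normal(size=80)           # unrelated to x

score_dependent = kgv(centered_gaussian_gram(x), centered_gaussian_gram(y_dependent))
score_independent = kgv(centered_gaussian_gram(x), centered_gaussian_gram(y_independent))
```

Because the score only needs Gram matrices, swapping in a different kernel per data type changes nothing in `kgv` itself.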


Kernels for heterogeneous types of data

When using the Kernel Generalized Variance (F. Bach, JMLR 2002) to quantify the joint dependence between heterogeneous variables, we only need to define a kernel for each type of data.

Discrete Data:

Real-valued Data:

For vectored (profiled) Data, we define:


The L1-RPS kernel


Motivation of the L1-RPS kernel

Bin-to-bin distances (such as the Euclidean distance) are not ideal for measuring the discrepancy between two sequence tag profiles.

The Earth Mover’s Distance (EMD) computes the minimum mass transportation effort needed to ‘deform’ one profile into another.

The L1-RPS distance is equivalent to EMD when the two profiles have equal mass. In other cases, it also quantifies the total mass difference between the two profiles, which EMD does not.
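The contrast between bin-to-bin and transport-style distances can be seen on a toy profile. The sketch below assumes that the L1-RPS distance is the L1 norm of the difference between the running (cumulative) sums of two profiles; that reading is an assumption based on the stated equivalence to EMD for equal-mass profiles. The function names and example profiles are hypothetical.

```python
import numpy as np


def bin_to_bin_l2(p, q):
    """Euclidean distance computed bin by bin."""
    return float(np.linalg.norm(np.asarray(p, float) - np.asarray(q, float)))


def l1_running_sum_distance(p, q):
    """L1 norm of the difference of cumulative sums (assumed form of L1-RPS)."""
    diff = np.asarray(p, float) - np.asarray(q, float)
    return float(np.abs(np.cumsum(diff)).sum())


peak = [0, 5, 0, 0, 0, 0]
shift_one = [0, 0, 5, 0, 0, 0]   # same peak moved by one bin
shift_far = [0, 0, 0, 0, 5, 0]   # same peak moved by three bins
smaller = [0, 3, 0, 0, 0, 0]     # same position, less mass

# Bin-to-bin distance cannot tell a small shift from a large one.
same_l2 = bin_to_bin_l2(peak, shift_one) == bin_to_bin_l2(peak, shift_far)

# The running-sum distance grows with the shift: 5 units of mass moved 1 or 3 bins.
d_near = l1_running_sum_distance(peak, shift_one)   # 5.0
d_far = l1_running_sum_distance(peak, shift_far)    # 15.0

# With unequal mass the distance is still non-zero, reflecting the mass difference.
d_mass = l1_running_sum_distance(peak, smaller)
```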


Data Preprocessing: Profile clustering

For noise reduction, we use the cluster centers of the input data, instead of each individual gene, as the training data for the BN learning algorithm.
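A minimal sketch of this preprocessing step, using plain Lloyd-style k-means as a stand-in (it is not the Super k-means algorithm, which the slides do not spell out). The function name, the toy data and the seeded random initialization are assumptions made for illustration.

```python
import numpy as np


def kmeans_centers(profiles, k, n_iter=50, seed=0):
    """Plain k-means; returns the cluster centers and each profile's label."""
    rng = np.random.default_rng(seed)
    data = np.asarray(profiles, dtype=float)
    centers = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared Euclidean distance from every profile to every center.
        dists = ((data[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        new_centers = np.array([
            data[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    # Recompute labels so they match the returned centers.
    dists = ((data[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return centers, dists.argmin(axis=1)


# Six hypothetical gene profiles forming two obvious groups.
gene_profiles = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
centers, labels = kmeans_centers(gene_profiles, k=2)

# The two centers, rather than the six genes, would be passed on as training rows.
training_rows = centers
```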


Super k-means vs. k-means++ / Cluster 3.0

We propose the Super k-means algorithm to perform clustering. It yields tighter clusters than the k-means algorithm (in Cluster 3.0) and the k-means++ algorithm.

Better clustering quality is necessary for a good final BN learning result.


The consensus PDAG network with feedbacks

Human Embryonic Stem Cells

We relax the acyclic constraint and perform an additional structure search after BN learning to find potential feedback edges (as in learning a dependency network), since feedbacks are important and ubiquitous in biology.


Perfect ROC in Cross Validation


ROC of alternative approaches


Alternative clustering approaches for preprocessing

Cluster 3.0

Affinity Propagation


Alternative Kernels for BN learning


CD4+ T Cell network


Mouse ESC network


The proposed hub role of H3K4me3 in ESCs


Functional Dissection of Regulatory Models Using Gene Expression Data of Deletion Mutants

J Li, Y Liu et al., PLoS Genetics (2013)


Gene Expression Data of Deletion Mutants

In this table, each column represents a deletion mutant strain, and each row measures the expression changes of a specific gene: ‘1’ means up-regulation, ‘-1’ means down-regulation and ‘0’ means no specific change.


Inferring Genetic Regulatory Networks

Our goal is to infer a genetic regulatory network among the deletion mutant genes …

However, traditional Bayesian network learning approaches failed …

Why?

Because the dominant value in the deletion mutant gene expression data set is ‘0’, which occurs orders of magnitude more often than the ‘1’ and ‘-1’ values.

With traditional BN-learning metrics, such as K2, BDeu and BIC/MDL, the huge intra-similarities between ‘0’s overwhelm the true regulatory signals.
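The effect of the dominant ‘0’s can be seen on a toy example. The sketch below does not use K2, BDeu or BIC/MDL. It uses a simple matching score, which is an assumed stand-in, to show how agreement on ‘0’s hides the informative entries. The gene vectors and function names are hypothetical.

```python
def simple_matching(a, b):
    """Fraction of strains in which two genes have the same ternary value."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def matching_on_responsive(a, b):
    """Same score, restricted to strains where at least one gene is non-zero."""
    pairs = [(x, y) for x, y in zip(a, b) if x != 0 or y != 0]
    return sum(x == y for x, y in pairs) / len(pairs) if pairs else 0.0


n_strains = 1000
gene_a = [0] * n_strains
gene_b = [0] * n_strains
# The two genes respond in the same five strains, but in opposite directions.
gene_a[0:5] = [1, 1, -1, 1, -1]
gene_b[0:5] = [-1, -1, 1, -1, 1]

score_all = simple_matching(gene_a, gene_b)                 # 0.995, driven by 0-0 matches
score_responsive = matching_on_responsive(gene_a, gene_b)   # 0.0 once 0-0 pairs are ignored
```

The two genes look almost identical under the first score although they never agree where either one actually responds.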


The DM_BN Kernel

To overcome this problem, we resort to kernel-based BN learning.

To this end, we propose the DM_BN kernel.

The key insight is to block the intra-similarities between ‘0’s:


Incorporating a priori causal information

We also use a template matrix to incorporate the a priori knowledge from deletion-mutant experiments into BN learning.

If Gene B is in the (influence) target list of Gene A, but not the reverse, we set (i, j) = 1 and (j, i) = 0 in the template matrix (with i and j indexing Gene A and Gene B) to prohibit the appearance of B->A in the BN.

In this way, the template matrix constrains the set of plausible edges in a DAG.
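The template rule can be sketched as follows. The slide does not say how pairs with no knowledge, or with targets in both directions, are filled in. Leaving both directions allowed in those cases is an assumption of this sketch, as are the function names and the toy target lists.

```python
def build_template(genes, targets):
    """template[i][j] == 1 means the edge genes[i] -> genes[j] may appear in the DAG."""
    index = {gene: i for i, gene in enumerate(genes)}
    n = len(genes)
    # Assumption: every edge except a self-loop is allowed until knowledge says otherwise.
    template = [[0 if i == j else 1 for j in range(n)] for i in range(n)]
    for gene_a, a_targets in targets.items():
        for gene_b in a_targets:
            if gene_a in targets.get(gene_b, set()):
                continue  # mutual targets: no direction is ruled out (assumption)
            i, j = index[gene_a], index[gene_b]
            template[i][j] = 1  # A -> B stays allowed
            template[j][i] = 0  # B -> A is prohibited
    return template


def edge_allowed(template, genes, parent, child):
    return template[genes.index(parent)][genes.index(child)] == 1


genes = ["A", "B", "C"]
# Hypothetical target lists: deleting A affects B; B and C affect each other.
targets = {"A": {"B"}, "B": {"C"}, "C": {"B"}}
template = build_template(genes, targets)
```

A structure search would then consult `edge_allowed` before proposing any edge.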

Finally, to convert a DAG to a PDAG after BN learning, we must resort to Meek’s rules [Meek, 1995], rather than Chickering’s algorithm [Chickering, 1995], to judge the reversibility of each edge.


High quality of the networks inferred by DM_BN


Correctness of edge directions with/without using templates

Without using the template matrix, the DM_BN kernel leads to ~80% accuracy in the de novo inference of edge directionalities, which is statistically significant compared to random guessing.


The inferred Yeast regulatory network

Online acyclicity checking is implemented to enable learning large networks.
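The slides do not describe how the online check works. One simple way to test acyclicity as edges are added, shown here purely as an assumed illustration, is to refuse an edge parent -> child whenever the parent is already reachable from the child. The class and method names are hypothetical.

```python
class IncrementalDag:
    """Directed graph that rejects any edge insertion that would create a cycle."""

    def __init__(self, nodes):
        self.children = {node: set() for node in nodes}

    def _reaches(self, start, goal):
        """Depth-first search: is goal reachable from start?"""
        stack, seen = [start], set()
        while stack:
            node = stack.pop()
            if node == goal:
                return True
            if node in seen:
                continue
            seen.add(node)
            stack.extend(self.children[node])
        return False

    def try_add_edge(self, parent, child):
        """Add parent -> child unless it closes a cycle; report whether it was added."""
        if parent == child or self._reaches(child, parent):
            return False
        self.children[parent].add(child)
        return True


dag = IncrementalDag(["A", "B", "C"])
added_ab = dag.try_add_edge("A", "B")   # True
added_bc = dag.try_add_edge("B", "C")   # True
added_ca = dag.try_add_edge("C", "A")   # False: would close A -> B -> C -> A
added_ac = dag.try_add_edge("A", "C")   # True: a shortcut, not a cycle
```

Checking only the new edge avoids re-validating the whole graph after every search move.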


Integrating Genomic, Epigenomic, and Transcriptomic Features Reveals Modular Signatures Underlying Poor Prognosis in Ovarian Cancer

W Zhang, Y Liu et al., Cell Reports (2013)

Thanks to Dr. Wei Zhang for contributing to the slides on this topic.


The Cancer Genome Atlas (TCGA)

http://cancergenome.nih.gov/


Summary of the Ovarian cancer data in TCGA

The copy number segmentation data were mapped to the positions of genes and miRNAs.

Normalization:

Value_norm = (Value_raw - Median_controls) / STD_patients
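The normalization can be sketched for a single feature as follows. Whether the standard deviation is the population or the sample version is not stated, so the population form used here is an assumption, as are the function name and the toy values.

```python
from statistics import median, pstdev


def normalize_feature(patient_values, control_values):
    """Center on the control median and scale by the patients' standard deviation."""
    center = median(control_values)
    scale = pstdev(patient_values)  # assumption: population standard deviation
    if scale == 0:
        raise ValueError("patient values have zero spread; cannot normalize")
    return [(value - center) / scale for value in patient_values]


# Hypothetical raw values of one feature.
patients = [2, 4, 4, 4, 5, 5, 7, 9]   # population standard deviation is 2
controls = [1, 3, 5]                  # median is 3
normalized = normalize_feature(patients, controls)
# -> [-0.5, 0.5, 0.5, 0.5, 1.0, 1.0, 2.0, 3.0]
```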


Scientific Questions

By combining the clinical and heterogeneous high-throughput data, can we discover ovarian cancer subtypes whose outcomes are different?

Can we find active regulatory pathways in the subtypes that could explain their different prognoses?


Selecting the Ovarian Cancer Hazard Factors

To investigate which features are related to the prognosis of ovarian cancer, we first used the Cox proportional hazards model to perform a regression analysis between each feature and the patients’ survival time.

In total we selected 4,526 features as hazard factors (P < 0.05), including 1,651 genes’ expression changes, 455 genes’ promoter DNA methylation changes, 140 miRNAs’ expression changes, and the CNAs of 2,191 genes and 89 miRNAs.
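For reference, a per-feature (univariate) Cox proportional hazards model is usually written as below. The symbols are notation introduced here for illustration: h_0(t) is the unspecified baseline hazard, x_j the value of feature j and \beta_j its coefficient. A feature counts as a hazard factor when the null hypothesis \beta_j = 0 is rejected at P < 0.05. The slides do not say which test produced the P value.

```latex
h(t \mid x_j) = h_0(t)\,\exp(\beta_j x_j), \qquad H_0:\ \beta_j = 0
```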


De novo discovery of ovarian cancer subtypes by adaptive clustering


Signatures of the 7 subtypes of Ovarian Cancer

These signatures were identified using the Wilcoxon rank-sum test.
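A minimal sketch of a two-sided Wilcoxon rank-sum test for one feature, comparing patients inside a subtype with those outside it. It uses the normal approximation without a tie correction, which is a simplification chosen for brevity. The function names and toy values are hypothetical.

```python
import math


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2.0 + 1.0
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks


def rank_sum_test(group_a, group_b):
    """Return (z, two-sided p) from the normal approximation, no tie correction."""
    n_a, n_b = len(group_a), len(group_b)
    ranks = average_ranks(list(group_a) + list(group_b))
    w = sum(ranks[:n_a])  # rank sum of the first group
    mean_w = n_a * (n_a + n_b + 1) / 2.0
    sd_w = math.sqrt(n_a * n_b * (n_a + n_b + 1) / 12.0)
    z = (w - mean_w) / sd_w
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p


# Hypothetical feature values: inside one subtype vs. all other patients.
z_sep, p_sep = rank_sum_test([1, 2, 3, 4, 5], [6, 7, 8, 9, 10])    # fully separated
z_mix, p_mix = rank_sum_test([1, 3, 5, 7, 9], [2, 4, 6, 8, 10])    # interleaved
```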


Enriched terms of subtype 2-specific up-regulated genes

These terms, such as cell adhesion, TGF-beta binding, angiogenesis and positive regulation of cell proliferation, are related to tumorigenesis and metastasis.


Comparing the survival curves between subtype 2 and stage IV patients

The 5-year survival rate of subtype 2 was even worse than that of tumor stage IV.


The cancer knowledge base

The hallmarks of cancer (Hanahan & Weinberg 2011)

Used to filter out signature genes that are not drivers of cancer.


The interaction network of signature genes


Thanks

  • Q & A?