ChIP-seq and its applications in GRN construction

ChIP-seq and its applications in GRN construction Jin Chen 2012 Fall CSE891-001

Layout • Genome-scale evidence from microarray measurements may be used to identify regulatory interactions between TFs and targets • Hu et al used a genetic approach to identify targets of transcription factors in Yeast and reconstruct a functional regulatory network • Reimand et al re-analyzed Hu’s data using improved statistical techniques

Hu et al’s work • Grew each of 263 transcription factor knockout strains and compared mRNA expression of each of these strains with a wildtype strain using microarrays • Defined unrefined transcription factor target network as the cumulative set of significantly differentially expressed genes in each deletion strain. • There was overlap between transcription factor targets identified in the unrefined network and targets identified by ChIP-chip

2-level Refinement • First level of network refinement • If TF A activated TF B and gene M, B activated gene M, and if the confidence of A regulating gene M was lower than for B regulating gene M, then the regulation of gene M by A was presumed to be indirect and was therefore erased • Additional refinement step • Similar to previous step, except that the indirect edge that was removed bridged a three-step direct interaction series at the preceding level, resulting in a level 3 refined network • Note that the logical consistency for regulatory edges was maintained at all times
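A minimal sketch of the first refinement level, assuming the network is stored as a dictionary from (TF, target) pairs to a confidence score where higher means more confident. The function name `refine_network`, the data layout, and the toy scores are illustrative assumptions, not Hu et al's implementation, and the sign of regulation (activation vs. repression) is ignored here for brevity.

```python
def refine_network(edges):
    """Remove edges A -> M that are presumed indirect.

    edges: dict mapping (tf, target) -> confidence (higher = more confident).
    An edge A -> M is dropped when some B exists with A -> B and B -> M,
    and A -> M is less confident than B -> M.
    """
    refined = dict(edges)
    for (a, m), conf_am in edges.items():
        for (b, m2), conf_bm in edges.items():
            # Only consider a different regulator B of the same target M.
            if m2 != m or b == a or b == m:
                continue
            # A must also regulate B for the path A -> B -> M to exist.
            if (a, b) in edges and conf_am < conf_bm:
                refined.pop((a, m), None)
    return refined


# Toy network with made-up confidence values (hypothetical data).
toy = {("A", "B"): 0.9, ("A", "M"): 0.4, ("B", "M"): 0.8, ("A", "N"): 0.7}
print(sorted(refine_network(toy)))
```

The second refinement level would apply the same test to paths that are one step longer; it is omitted from this sketch.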

Hu et al’s work • When the transcription factor bound to a promoter was deleted, the expression of the downstream gene was much more likely to be affected than the background • Expression from promoters that were detectably occupied by a single TF was even more likely to be affected by deletion of that potentially major or sole TF • Thus, there was significant overlap between binding targets defined by ChIP-chip and functional targets defined by TF deletion

Hu et al’s work – problems • However, Hu et al’s study used relatively dated and insensitive approaches for microarray data processing • As a result, the published P-values and target-gene rankings are likely to be unreliable • P-values were not corrected for multiple testing • Background and print-tip correction were missing from the normalization • Reimand et al re-analyzed the same dataset with state-of-the-art software and obtained a much larger network Reimand et al, Nucleic Acids Research, 2010, Vol. 38, No. 14 pp 4768–4777

False Discovery Rate • The false discovery rate (FDR) is a statistical measure used in multiple hypothesis testing to correct for multiple comparisons. The q-value is defined to be the FDR analogue of the p-value • FDR is the expected proportion of false positives among all significant hypotheses • For example, if 1000 observations were experimentally predicted to be different, and the FDR for these observations was 0.1, then 100 of them would be expected to be false positives • FDR is determined from the observed p-value distribution, and hence adapts to the number of records tested
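The slide does not say which FDR procedure is meant. As one common choice, the sketch below computes Benjamini-Hochberg adjusted p-values, which play the role of q-values; the function name `bh_qvalues` and the example p-values are assumptions made for illustration only.

```python
def bh_qvalues(pvalues):
    """Benjamini-Hochberg adjusted p-values for a list of raw p-values."""
    m = len(pvalues)
    # Indices of the p-values in ascending order.
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        qvalues[i] = running_min
    return qvalues


# Hypothetical p-values for four genes.
pvals = [0.01, 0.04, 0.03, 0.20]
qvals = bh_qvalues(pvals)
significant = [i for i, q in enumerate(qvals) if q <= 0.05]
print(qvals, significant)
```

With an FDR cut-off of 0.05, a gene is reported only if its adjusted value is at most 0.05, so about 5% of the reported genes are expected to be false positives.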

Redo the Preprocessing • Microarrays were normalized using the VSN package, including print-tip and background correction • Differential expression was calculated using a moderated eBayes t-test as implemented in the limma Bioconductor package • An FDR cut-off of 0.05 was used to detect significant differential gene expression Reimand et al, Nucleic Acids Research, 2010, Vol. 38, No. 14 pp 4768–4777

Re-analyze TF binding data • DNA–protein interactions derived from ChIP-chip experiments were obtained, and those with a P-value < 0.001 were considered • A set of ‘trusted’ position weight matrices (PWMs) for 72 regulatory factors was derived by running the PROCSE and PhyloGibbs algorithms on a set of experimentally derived TF binding sites from SCPD • These PWMs were then used to scan multiple alignments of each intergenic region in yeast with the orthologous regions of four other yeast species Reimand et al, Nucleic Acids Research, 2010, Vol. 38, No. 14 pp 4768–4777

Re-analyze knockout expression and ChIP binding data • Overlap between TF-binding and TF knockout data • Collect binding sites for 142 TFs, comprising 5,188 ChIP-chip interactions and 17,091 motif predictions • Calculate the intersection between the list of differentially expressed genes from the TF knockout and targets identified by ChIP-chip or binding-site predictions • 2,230 regulation relations Reimand et al, Nucleic Acids Research, 2010, Vol. 38, No. 14 pp 4768–4777
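The intersection step can be sketched with plain set operations. The function name `regulatory_relations`, the dictionary layout, and the placeholder TF and gene names are assumptions for illustration; they are not data from Reimand et al.

```python
def regulatory_relations(de_genes, bound_targets):
    """Return (tf, gene) pairs supported by both knockout and binding data.

    de_genes: dict TF -> set of genes differentially expressed in its knockout.
    bound_targets: dict TF -> set of genes bound by the TF according to
    ChIP-chip or binding-site predictions.
    """
    relations = set()
    for tf, de in de_genes.items():
        shared = de & bound_targets.get(tf, set())
        relations |= {(tf, gene) for gene in shared}
    return relations


# Placeholder identifiers (hypothetical, not real yeast data).
de = {"TF1": {"g1", "g2", "g3"}, "TF2": {"g4"}}
bound = {"TF1": {"g2", "g3", "g5"}, "TF3": {"g4"}}
print(sorted(regulatory_relations(de, bound)))
```

A TF with no binding evidence contributes nothing, which is why `TF2` yields no relation in the toy input.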

Re-analyze knockout expression and ChIP binding data • Checked the expression levels of the TFs • Intuitively, one expects the TF under consideration to have lower expression in the mutant strain than in the wild-type strain • The data confirm this for 155 TFs • 78 TFs display a negative fold change at statistically non-significant levels • 36 TFs are lethal Reimand et al, Nucleic Acids Research, 2010, Vol. 38, No. 14 pp 4768–4777

Re-analyze knockout expression and ChIP binding data • Examine functional annotations of differentially expressed genes • As most TFs are considered to regulate distinct cellular processes, their target genes should be associated with a coherent set of molecular and biological functions • Used g:Profiler to identify GO, KEGG and Reactome pathway annotations • Across all TF knockouts, this analysis has a higher score than the original analysis Reimand et al, Nucleic Acids Research, 2010, Vol. 38, No. 14 pp 4768–4777

SUMMARY - exploring biological networks

Topology Approaches • What comes next after constructing biological networks? • First of all, simple approaches • Degree, betweenness, clustering coefficient, topological coefficient, shortest path • Shared neighbors, neighborhood connectivity, closeness centrality

Clustering Coefficient • The clustering coefficient is a measure of the degree to which nodes in a graph tend to cluster together • Clustering coefficient (local version): do my neighbors connect with each other? • Evidence suggests that in most real-world networks, nodes tend to create tightly knit groups characterized by a relatively high density of ties • $C_i = \frac{2 e_i}{k_i (k_i - 1)}$, where $k_i$ is the number of neighbors of node i and $e_i$ is the number of connected pairs between all neighbors of node i Luciano da F. Costa, Francisco A. Rodrigues, Alexandre S. Cristino. Complex networks: the key to systems biology. Genet. Mol. Biol. vol.31 no.3. 2008; http://med.bioinf.mpi-inf.mpg.de/netanalyzer
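A small sketch of the local clustering coefficient for an undirected graph, assuming the usual form 2·e_i / (k_i·(k_i − 1)) and an adjacency-set representation. The function name `local_clustering` and the toy graph are illustrative assumptions.

```python
from itertools import combinations


def local_clustering(adj, node):
    """Local clustering coefficient of `node` in an undirected graph.

    adj: dict mapping each node to the set of its neighbors.
    """
    neighbors = adj[node] - {node}
    k = len(neighbors)
    if k < 2:
        # Fewer than two neighbors: no neighbor pair exists, report 0.
        return 0.0
    # e counts the pairs of neighbors that are themselves connected.
    e = sum(1 for u, v in combinations(neighbors, 2) if v in adj[u])
    return 2.0 * e / (k * (k - 1))


# Toy graph: triangle a-b-c plus a pendant node d attached to a.
graph = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}
for n in sorted(graph):
    print(n, local_clustering(graph, n))
```

Returning 0 for nodes with fewer than two neighbors is a convention chosen for this sketch; some tools leave the value undefined instead.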

Average Clustering Coefficient Distribution • Define the function C(k) as the average clustering coefficient of all nodes with k links • For many real networks, $C(k) \sim k^{-1}$ • Nodes with only a few links have a high C(k) and belong to highly interconnected small modules • By contrast, the highly connected hubs have a low C(k), with their role being to link different, and otherwise not communicating, modules
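To illustrate C(k), the sketch below groups nodes by degree and averages their local clustering coefficients. The function name `clustering_by_degree` and the toy hub graph are assumptions; nodes with fewer than two neighbors are skipped because their coefficient is not defined by a neighbor pair.

```python
from collections import defaultdict
from itertools import combinations


def clustering_by_degree(adj):
    """Return {k: average clustering coefficient of nodes with k links}."""
    by_degree = defaultdict(list)
    for node, nbrs in adj.items():
        k = len(nbrs)
        if k < 2:
            continue
        # Count connected pairs among the neighbors of this node.
        e = sum(1 for u, v in combinations(nbrs, 2) if v in adj[u])
        by_degree[k].append(2.0 * e / (k * (k - 1)))
    return {k: sum(vals) / len(vals) for k, vals in sorted(by_degree.items())}


# Toy graph: hub h links two small modules {a, b} and {c, d}.
graph = {
    "h": {"a", "b", "c", "d"},
    "a": {"h", "b"},
    "b": {"h", "a"},
    "c": {"h", "d"},
    "d": {"h", "c"},
}
print(clustering_by_degree(graph))
```

In this toy graph the low-degree module members have C(2) = 1 while the hub has C(4) = 1/3, matching the qualitative trend described on the slide.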

Closeness centrality • Closeness centrality is a measure of how many steps are required to access every other node from a given node • Closeness centrality: how long will it take information to spread from a given node to the other reachable nodes in the network? • It is commonly written (following Freeman) as $C_C(i) = 1 / \sum_{t \in V \setminus \{i\}} d_G(i, t)$, where $d_G(i, t)$ is the length of the shortest path from i to t, and V is the set of nodes in G Freeman, 1978; Opsahl et al., 2010; Wasserman and Faust, 1994
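A sketch of closeness centrality on an unweighted graph, assuming Freeman's form (the reciprocal of the summed shortest-path lengths) and restricting the sum to reachable nodes. The function name `closeness` and the toy path graph are illustrative assumptions; other tools normalize differently, for example by the average rather than the sum of distances.

```python
from collections import deque


def closeness(adj, source):
    """Closeness of `source`: 1 / sum of shortest-path lengths to reachable nodes.

    adj: dict mapping each node to the set of its neighbors (unweighted graph).
    """
    # Breadth-first search gives shortest-path lengths in an unweighted graph.
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    total = sum(dist.values())
    # An isolated node reaches nothing; report 0 by convention.
    return 1.0 / total if total > 0 else 0.0


# Toy path graph a - b - c.
path = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
for n in sorted(path):
    print(n, closeness(path, n))
```

The middle node b scores highest because it reaches every other node in a single step.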

Distribution of closeness centrality • Closeness centrality is successful in distinguishing the important members of the community • Its distribution resembles a normal curve, while the other centrality measures have long-tailed distributions similar to a power law

Limitations of simple approaches • Study each node/edge individually; cannot apply enrichment study • Topology study only; difficult to integrate other knowledge • Nodes with high scores are not necessarily key genes/proteins • Alternative: study a group of genes simultaneously

Advanced approaches • Dense subgraph detection • Network motif detection • Graph clustering • Graph classification • etc.

Dense subgraph detection Software available at http://zhoulab.usc.edu/CODENSE/

Dense subgraph detection • A subgraph is considered coherent and dense if and only if every edge is well supported, and its corresponding second-order graph is dense CODENSE

Network Motif Detection

Perform a graph join operation to find repeated size-k graphs • Join each tree with its cousins to produce frequent motif candidates C_k [Figure: example trees t4_1 and t4_2]

Graph Clustering • Graph clustering is an organization process with the goal to put similar nodes together; the result is a partition of the network into a set of communities • MCL algorithm is a fast and scalable unsupervised cluster algorithm for graphs based on simulation of stochastic flow in graphs, available at http://www.micans.org/mcl Van Dongen, S. (2000) Graph Clustering by Flow Simulation. PhD Thesis, University of Utrecht, The Netherlands
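The core loop of MCL alternates expansion (squaring the column-stochastic matrix) with inflation (raising entries to a power and renormalizing). The pure-Python sketch below is a simplified illustration, not the implementation distributed at micans.org: the function names, the fixed iteration count, the default inflation of 2.0, and the 1e-6 threshold are all assumptions.

```python
def normalize(m):
    """Scale every column of a square matrix so that it sums to 1."""
    n = len(m)
    col_sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / col_sums[j] for j in range(n)] for i in range(n)]


def matmul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mcl(adj_matrix, inflation=2.0, iterations=50):
    """Simplified Markov clustering; returns a set of frozensets of node indices."""
    n = len(adj_matrix)
    # Add self-loops so every column has a nonzero sum, then normalize.
    m = [[float(adj_matrix[i][j]) + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    m = normalize(m)
    for _ in range(iterations):
        m = matmul(m, m)                                  # expansion
        m = normalize([[x ** inflation for x in row] for row in m])  # inflation
    # Each nonzero row lists the nodes attracted to that row's node.
    clusters = set()
    for row in m:
        members = frozenset(j for j, x in enumerate(row) if x > 1e-6)
        if members:
            clusters.add(members)
    return clusters


# Toy graph: two disconnected triangles (nodes 0-2 and 3-5).
A = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
print(sorted(sorted(c) for c in mcl(A)))
```

A production run would test for convergence instead of using a fixed number of iterations and would use sparse matrices; higher inflation values generally produce finer clusters.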

Graph Clustering [Figure: an input graph and the resulting graph clusters]
