Mixed model analysis to discover cis-regulatory haplotypes in A. thaliana

Fanghong Zhang*, Stijn Vansteelandt*, Olivier Thas*, Marnik Vuylsteke#

* Ghent University; # VIB (Flanders Institute for Biotechnology)

- Genetic background
- Objectives
- Data
- Methodology
- Results
- Conclusions

- Regulation of gene expression is affected either in:
- Cis: affecting the expression of only one of the two alleles in a heterozygous individual;
- Trans: affecting the expression of both alleles in a heterozygous individual.

Genetic background

- Why search for Cis-regulatory variants?
- “low hanging fruit”: window is a small genomic region
- Fast screening for markers in LD with expression trait.
- How to search for Cis-regulatory variants?
- Using the GASED (Genome-wide Allelic Specific Expression Difference) approach (Kiekens et al., 2006)
- Based on a diallel design, which is very popular in plant breeding for estimating GCA (general combining ability) and SCA (specific combining ability)

Genetic Background

- What is the GASED approach?
- The expression of a gene in an F1 hybrid, the kth offspring of the cross i × j, can be written as: (c: cis-element, t: trans-element)

[Equation not recovered. Slide annotations on its terms: kth offspring of cross i j; from parent i; from parent j; from both (cross-terms); genotypic variation. Special cases shown: homozygous; no trans-effect; cis-effect present.]

A cis-regulatory divergence completely explains the difference between two parental lines

Objectives of this study

- Using mixed model analysis to discover cis-regulated Arabidopsis genes
- Based on the GASED approach, to partition the between-F1-hybrid genotypic variation in mRNA abundance into additive and non-additive variance components, in order to differentiate between cis- and trans-regulatory changes and to assign allele-specific expression differences to cis-regulatory variation.

- To find the associated haplotypes (sets of SNPs) for the selected cis-regulated genes.
- Systematic surveys of cis-regulatory variation to identify “superior alleles”.

Flow chart

Data contains all expressed genes (25527 genes)

- Step I: choose genes with significant genotypic variation
- Step II: choose genes from Step I with no trans-regulatory variation
- Step III: choose genes from Step II displaying significant allelic imbalance attributable to cis-regulatory variation
- Step IV: choose genes from Step III showing significant association with the haplotype blocks found

Data

- Data acquisition:
- Scan the arrays
- Quantitate each spot
- Subtract noise from background
- Normalize
- Export table

Data for us to analyze

Methodology - Step I

Mixed-Model Equations

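Full model: $y_{klnm} = \mu + \text{dye}_k + \text{replicate}_l + \text{genotype}_n + \text{array}_m + \text{error}_{klnm}$

Reduced model: $y_{klnm} = \mu + \text{dye}_k + \text{replicate}_l + \text{array}_m + \text{error}_{klnm}$

Gene X: $y$ = expression values. Slide annotations on the model terms: FIXED effects, RANDOM effect, Residual.

- $\text{error} \sim N(0, \Sigma_e)$, $\Sigma_e = I_{220}\sigma^2_e$; $\text{array} \sim N(0, \Sigma_a)$, $\Sigma_a = I_{110}\sigma^2_a$
- $\text{genotype} \sim N(0, \Sigma_{genotype})$, $\Sigma_{genotype} = G = K\sigma^2_g$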

Methodology - Step I

Mixed-Model Equations

K = 55 x 55 marker-based relatedness matrix:

- $p_{ij}$ and $q_{ij}$ are the allele frequencies of the jth allele at the ith locus
- $n_i$ is the number of alleles at the ith locus (i.e. $n_i = 2$)
- $m$ is the number of loci (i.e. $m$ = 210,205)

Rogers (1972); Reif et al. (2005)

Melchinger et al. (1991)
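The formula for K did not survive on this slide. The symbol definitions and the Rogers (1972) citation are consistent with Rogers' distance, $d = \frac{1}{m}\sum_{i=1}^{m}\sqrt{\frac{1}{2}\sum_{j=1}^{n_i}(p_{ij}-q_{ij})^2}$, turned into a similarity as $1 - d$. The sketch below assumes exactly that; the function names and the toy genotypes are hypothetical and are not taken from the presentation.

```python
import math

def rogers_distance(p, q):
    """Rogers' distance between two genotypes (assumed form of the lost formula).

    p and q are lists over loci; each entry lists the allele
    frequencies at that locus (they sum to 1).
    """
    m = len(p)
    total = 0.0
    for p_i, q_i in zip(p, q):
        total += math.sqrt(0.5 * sum((a - b) ** 2 for a, b in zip(p_i, q_i)))
    return total / m

def relatedness_matrix(genotypes):
    """Similarity matrix with entries 1 - Rogers' distance (an assumption)."""
    n = len(genotypes)
    return [[1.0 - rogers_distance(genotypes[a], genotypes[b]) for b in range(n)]
            for a in range(n)]

# Toy data: two inbred parents and their F1 hybrid at three biallelic loci.
parent_a = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
parent_b = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
hybrid = [[(x + y) / 2 for x, y in zip(la, lb)]
          for la, lb in zip(parent_a, parent_b)]

K = relatedness_matrix([parent_a, parent_b, hybrid])
```

In this toy case the hybrid is closer to each parent than the parents are to each other, which is the behaviour one would expect from a relatedness matrix over parents and F1 hybrids.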

Methodology - Step I

Multiple testing correction

Gene X: likelihood ratio test (REML), $\mathrm{LRT} \sim 0.5\chi^2_0 + 0.5\chi^2_1$ → p-value → adjusted q-value (FDR)

25527 Genes
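As a minimal sketch of how a p-value can be read off the 50:50 mixture of $\chi^2_0$ and $\chi^2_1$: the tail probability of $\chi^2_1$ is $\mathrm{erfc}(\sqrt{x/2})$, and the mixture halves it. The function names are hypothetical, and the convention that an LRT of zero maps to p = 0.5 is an assumption chosen to match the null distribution described for the adjusted π0.

```python
import math

def chi2_1_sf(x):
    """Upper tail probability P(chi2 with 1 df > x), via the error function."""
    return math.erfc(math.sqrt(x / 2.0))

def mixture_pvalue(lrt):
    """p-value under 0.5*chi2(0) + 0.5*chi2(1).

    The chi2(0) component is a point mass at zero and contributes nothing
    to the tail, so lrt = 0 gives p = 0.5 (assumed convention).
    """
    lrt = max(lrt, 0.0)  # guard against tiny negative values from optimisation
    return 0.5 * chi2_1_sf(lrt)
```

With this convention the p-values never exceed 0.5, so they are not uniform on (0, 1) under the null.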

FDR: false discovery rate

How many of the called positives are false?

5% FDR means 5% of the called positives are false positives

Storey et al. (2002): q-value to represent FDR

Estimate the proportion of features that are truly null:

We use adjusted q-value to represent FDR

Methodology - Step I

Multiple testing correction

Storey et al. estimate $\pi_0 = m_0/m$ under the assumption that the true null p-values are uniformly distributed on (0, 1).

We estimate $\pi_{0,adj} = m_0/m$ under the assumption that, of the true null p-values, 50% are uniformly distributed on (0, 0.5) and 50% are exactly 0.5.
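For orientation, here is a sketch of the standard Storey-type estimate of π0 with a single fixed threshold λ, followed by the usual step-up q-value computation. This is the unadjusted version; the adjusted estimator used in the presentation is not recoverable from the slides and is not reproduced here. The function names and the default `lam` are assumptions.

```python
def storey_pi0(pvalues, lam=0.5):
    """Storey-type estimate: share of p-values above lam, rescaled by (1 - lam)."""
    m = len(pvalues)
    return min(1.0, sum(p > lam for p in pvalues) / (m * (1.0 - lam)))

def qvalues(pvalues, pi0=None):
    """q-value of each test: the smallest estimated FDR at which it is called."""
    m = len(pvalues)
    if pi0 is None:
        pi0 = storey_pi0(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so each q-value is monotone in p.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pi0 * m * pvalues[i] / rank)
        q[i] = running_min
    return q
```

For p-values that cannot exceed 0.5, a threshold below 0.5 would be needed, since no p-value lies above `lam=0.5`.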

Methodology - Step II

Mixed-Model Equations

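Full model: $y_{klijm} = \mu + \text{dye}_k + \text{replicate}_l + \text{gca}_i + \text{gca}_j + \text{sca}_{ij} + \text{array}_m + \text{error}_{klijm}$

Reduced model: $y_{klijm} = \mu + \text{dye}_k + \text{replicate}_l + \text{gca}_i + \text{gca}_j + \text{array}_m + \text{error}_{klijm}$

Gene X: $y$ = expression values. Slide annotations on the model terms: FIXED effects, RANDOM effect, Residual.

L is the Cholesky decomposition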
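The slide does not say which matrix is decomposed. A common use in mixed models, assumed here, is to factor a relatedness matrix as $K = LL^T$, so that correlated random effects $u \sim N(0, K\sigma^2)$ can be written as $u = Lz$ with independent $z \sim N(0, I\sigma^2)$. The small routine below is a plain Cholesky factorisation written for illustration; the 2 × 2 matrix is made up.

```python
import math

def cholesky(a):
    """Lower-triangular L with L * L^T = a, for a symmetric positive-definite a."""
    n = len(a)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(a[i][i] - s)
            else:
                L[i][j] = (a[i][j] - s) / L[j][j]
    return L

# Hypothetical relatedness matrix for two genotypes.
K = [[1.0, 0.5],
     [0.5, 1.0]]
L = cholesky(K)
```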

Methodology - Step II

Multiple testing correction

Gene X: likelihood ratio test (REML), $\mathrm{LRT} \sim 0.5\chi^2_0 + 0.5\chi^2_1$ → p-value → qa-value (FNR)

20976 Genes

- FNR: false non-discovery rate (Genovese et al., 2002)
- How many of the called negatives are false?
- 5% FNR means 5% of the called negatives are false negatives
- Since we are interested in selecting genes called negative for the $\text{sca}_{ij}$ effect, we control FNR instead of FDR

We use qa-value to represent FNR

Methodology - Step II

Multiple testing correction

False non-discovery rate (FNR) :

π0 is the estimate of the proportion of features that are truly null

Methodology - Step III

Mixed-Model Equations

Model: $y_{klijm} = \mu + \text{dye}_k + \text{replicate}_l + \text{gca}_i + \text{gca}_j + \text{array}_m + \text{error}_{klijm}$

Gene X: test 45 pairs?

$g_1 = g_2$? $g_1 = g_3$? $g_1 = g_4$? ... $g_1 = g_{10}$? $g_2 = g_3$? $g_2 = g_4$? $g_2 = g_5$? ... $g_2 = g_{10}$? ... $g_9 = g_{10}$?

- Two-sample dependent t-test
- Non-standard p-value: the true null p-values are not uniformly distributed on (0, 1)
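The count of 45 comparisons follows from taking every unordered pair among the ten effects g1 to g10. A short sketch, assuming the ten effects correspond to ten parental lines (the labels are illustrative):

```python
from itertools import combinations

# Ten effects g1..g10, as listed on the slide.
parents = [f"g{i}" for i in range(1, 11)]

# Every unordered pair is one hypothesis of the form "g_a = g_b?".
pairs = list(combinations(parents, 2))

n_pairs = len(pairs)                    # 10 * 9 / 2
n_genotypes = len(parents) + n_pairs    # parents plus one F1 hybrid per pair
```

If each pair also defines one F1 cross, ten parents plus 45 hybrids give 55 genotypes, which agrees with the 55 x 55 size of K; that reading is an inference from the numbers, not something the slides state.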

Methodology - Step III

Multiple testing correction

Gene X: two-sample t-test on BLUPs → simulate the H0 distribution from the real data → simulation-based p-value → q-value (FDR)

1380 Genes
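A simulation-based p-value can be sketched as the share of simulated null statistics that are at least as extreme as the observed one. How the null statistics were simulated in the study is not described on the slides; the function below only shows the final counting step, with a hypothetical name and the common "+1" correction that keeps the p-value away from zero.

```python
def empirical_pvalue(observed, null_stats):
    """Two-sided empirical p-value from simulated null test statistics."""
    b = len(null_stats)
    extreme = sum(abs(t) >= abs(observed) for t in null_stats)
    # Count the observed statistic itself as one draw from the null.
    return (extreme + 1) / (b + 1)
```

For example, with nine simulated statistics of which two exceed the observed value in absolute size, the function returns 3/10.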

Methodology - Step IV

Mixed-Model Equations

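Full model: $y_{klim} = \mu + \text{dye}_k + \text{replicate}_l + [\text{term missing in source}] + \text{genotype}_i + \text{array}_m + \text{error}_{klim}$

Reduced model: $y_{klim} = \mu + \text{dye}_k + \text{replicate}_l + \text{genotype}_i + \text{array}_m + \text{error}_{klim}$

Gene X (cis-regulated): $y$ = expression values. Slide annotations on the model terms: FIXED effects, RANDOM effect, Residual.

[Diagram: a gene on a chromosome with nearby tag SNPs: SNP1 SNP2 SNP3 ... SNPi]

- $\text{genotype} \sim N(0, \Sigma_{genotype})$, $\Sigma_{genotype} = G = K\sigma^2_g$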

Methodology - Step IV

Multiple testing correction

Gene X (cis-regulated): likelihood ratio test (ML), $\mathrm{LRT} \sim \chi^2_{2n}$, where n is the number of SNPs → p-value → q-value (FDR)

836 Genes
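Because the reference distribution has an even number of degrees of freedom, 2n, its tail probability has a closed form: $P(\chi^2_{2n} > x) = e^{-x/2}\sum_{k=0}^{n-1}\frac{(x/2)^k}{k!}$. The sketch below evaluates it with the standard library only; the function name is hypothetical.

```python
import math

def chi2_sf_even_df(x, n):
    """Upper tail probability of a chi-square with 2*n degrees of freedom."""
    half = x / 2.0
    term = 1.0    # (x/2)^k / k! for k = 0
    total = 1.0
    for k in range(1, n):
        term *= half / k
        total += term
    return math.exp(-half) * total
```

With a single SNP (n = 1) this reduces to $e^{-x/2}$, the tail of a chi-square with 2 degrees of freedom.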

Results

Data contains all expressed genes (25527 genes)

- Step I: adjusted q-value < 0.0005 → 20979 genes
- Step II: adjusted qa-value < 0.01 → 1328 genes
- Step III: q-value < 0.01 → 972 genes
- Step IV: q-value < 0.01 → 859 genes

Results

- Among all 25527 genes, 20979 genes have significant genotypic variation (q-value < 0.0005). (Step I)
- Among these 20979 genes, 1328 genes show no trans-regulatory effect (qa-value < 0.01). (Step II)
- Among these 1328 genes, 972 genes show significantly different allelic expression (q-value < 0.01); these 972 genes are identified as cis-regulated. (Step III)
- We confirm our discovery for these 972 cis-regulated genes in Step IV:
- an allelic expression difference caused by a cis-regulatory variant implies a nearby polymorphism (SNP) in LD that controls expression;
- We indeed found that 96.5% of the selected cis-regulated genes have associated polymorphisms (haplotype blocks) nearby.

Conclusions

- The mixed-model approach used here for association mapping, with the kinship matrix included, is more appropriate than other recent methods for identifying cis-regulated genes (p-values are more reliable).
- In each step, statistical significance is specified more accurately by controlling the appropriate error rate (FDR or FNR).
- Using simulation-based p-values when testing differences between random effects increases the power to detect association.
- A comprehensive analysis of gene expression variation in plant populations has been described.
- Using this mixed-model analysis strategy, a detailed characterization of both the genetic and the positional effects in the genome is provided.
- This detailed statistical analysis provides a robust and useful framework for the future analysis of gene expression variation in large sample sizes.
- Advanced statistical methods look promising in identifying interesting discoveries in genetics.

Many thanks for your attention!