Mixed model analysis to discover cis-regulatory haplotypes in A. thaliana

Fanghong Zhang*, Stijn Vansteelandt*, Olivier Thas*, Marnik Vuylsteke#
* Ghent University
# VIB (Flanders Institute for Biotechnology)

Overview • Genetic background • Objectives • Data • Methodology • Results • Conclusions

Genetic background
• Regulation of gene expression can be affected in two ways:
• Cis: affecting the expression of only one of the two alleles in a heterozygous individual;
• Trans: affecting the expression of both alleles in a heterozygous individual.

Genetic background
• Why search for cis-regulatory variants?
• “Low-hanging fruit”: the search window is a small genomic region
• Fast screening for markers in LD with the expression trait
• How to search for cis-regulatory variants?
• Using the GASED (Genome-wide Allelic Specific Expression Difference) approach (Kiekens et al., 2006)
• Based on a diallel design, which is widely used in plant breeding to estimate GCA (general combining ability) and SCA (specific combining ability)

Genetic background
• What is the GASED approach?
• The expression of a gene in an F1 hybrid, taken as the kth offspring of the cross i × j, can be written in terms of cis-elements (c) and trans-elements (t).
• [Equation not recovered from the slide. Its annotated parts were: contribution from parent i; contribution from parent j; contribution from both parents (cross-terms); genotypic variation; and the special cases "homozygous", "no trans-effect" and "cis-effect".]
• A cis-regulatory divergence completely explains the difference between two parental lines.

Objectives of this study
• Use mixed model analysis to discover cis-regulated Arabidopsis genes.
• Based on the GASED approach, partition the F1 hybrid genotypic variation in mRNA abundance into additive and non-additive variance components, in order to differentiate between cis- and trans-regulatory changes and to assign allele-specific expression differences to cis-regulatory variation.
• Find the associated haplotypes (sets of SNPs) for the selected cis-regulated genes.
• Survey cis-regulatory variation systematically to identify “superior alleles”.

Flow chart
Data contain all expressed genes (25527 genes)
Step I: Choose genes with significant genotypic variation
Step II: From the Step I genes, choose those with no trans-regulatory variation
Step III: From the Step II genes, choose those displaying significant allelic imbalance attributable to cis-regulatory variation
Step IV: From the Step III genes, choose those showing significant association with the identified haplotype blocks

Data
• Data acquisition:
• Scan the arrays
• Quantitate each spot
• Subtract background noise
• Normalize
• Export a table of data for analysis

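Methodology - Step I
Mixed-Model Equations
Gene X (expression values $y$):
Full model: $y_{klnm} = \mu + \mathrm{dye}_k + \mathrm{replicate}_l + \mathrm{genotype}_n + \mathrm{array}_m + \mathrm{error}_{klnm}$
Reduced model: $y_{klnm} = \mu + \mathrm{dye}_k + \mathrm{replicate}_l + \mathrm{array}_m + \mathrm{error}_{klnm}$
• FIXED effects: $\mu$, dye, replicate; RANDOM effects: genotype, array; residual: error
• $\mathrm{error} \sim N(0, \Sigma_e)$, $\Sigma_e = I_{220}\,\sigma^2_e$; $\mathrm{array} \sim N(0, \Sigma_a)$, $\Sigma_a = I_{110}\,\sigma^2_a$
• $\mathrm{genotype} \sim N(0, \Sigma_{genotype})$, $\Sigma_{genotype} = G = K\sigma^2_g$
• $K$ = 55 x 55 marker-based relatedness matrix, calculated as $1 - d_R$, where $d_R$ is Rogers' distance (Rogers, 1972; Reif et al., 2005)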

Methodology - Step I
Mixed-Model Equations
$K$ = 55 x 55 marker-based relatedness matrix, computed from Rogers' distance, where:
• $p_{ij}$ and $q_{ij}$ are the allele frequencies of the jth allele at the ith locus in the two genotypes being compared
• $n_i$ is the number of alleles at the ith locus (i.e. $n_i = 2$)
• $m$ is the number of loci (i.e. $m$ = 210,205)
Rogers (1972); Melchinger et al. (1991); Reif et al. (2005)
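The distance formula itself is not shown in the text above. The standard form of Rogers' distance that matches the symbols defined there is $d_R = \frac{1}{m}\sum_{i=1}^{m}\sqrt{\tfrac{1}{2}\sum_{j=1}^{n_i}(p_{ij}-q_{ij})^2}$. The sketch below computes it and the relatedness $1 - d_R$ on toy data; the function names and the three toy lines are illustrative assumptions, not the authors' code or data.

```python
import math

def rogers_distance(p, q):
    """Rogers' distance between two genotypes.

    p[i][j] and q[i][j] are the frequencies of allele j at locus i
    in the two genotypes being compared.
    """
    m = len(p)  # number of loci
    total = 0.0
    for p_i, q_i in zip(p, q):
        total += math.sqrt(0.5 * sum((a - b) ** 2 for a, b in zip(p_i, q_i)))
    return total / m

def relatedness_matrix(genotypes):
    """K[a][b] = 1 - d_R(a, b) for every pair of genotypes."""
    n = len(genotypes)
    return [[1.0 - rogers_distance(genotypes[a], genotypes[b]) for b in range(n)]
            for a in range(n)]

# Toy data (hypothetical): 3 inbred lines, 4 biallelic loci
line_a = [[1, 0], [1, 0], [0, 1], [1, 0]]
line_b = [[1, 0], [0, 1], [0, 1], [1, 0]]
line_c = [[0, 1], [0, 1], [1, 0], [0, 1]]
K = relatedness_matrix([line_a, line_b, line_c])
```

With biallelic inbred lines each locus contributes either 0 or 1 to the sum, so $d_R$ reduces to the fraction of loci at which the two lines differ.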

Methodology - Step I
Multiple testing correction
Gene X (25527 genes): likelihood ratio test (REML), $\mathrm{LRT} \sim 0.5\,\chi^2(0) + 0.5\,\chi^2(1)$ → p-value → adjusted q-value (FDR)
• FDR: false discovery rate
• How many of the called positives are false?
• 5% FDR means 5% of the called positives are false positives
• Storey et al. (2002): the q-value represents the FDR
• Estimate the proportion of features that are truly null ($\pi_0$)
• We use an adjusted q-value to represent the FDR
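As a minimal sketch of how a p-value follows from the 50:50 mixture of $\chi^2(0)$ and $\chi^2(1)$: the $\chi^2(0)$ component is a point mass at zero, so the tail probability is half the $\chi^2(1)$ tail. The function names are illustrative, and the convention that LRT = 0 gives p = 0.5 is an assumption chosen to match the null distribution described for the adjusted q-value.

```python
import math

def chi2_1_sf(x):
    """P(chi-square with 1 df >= x), via the complementary error function."""
    return math.erfc(math.sqrt(x / 2.0))

def mixture_pvalue(lrt):
    """p-value of an LRT statistic under 0.5*chi2(0) + 0.5*chi2(1)."""
    lrt = max(lrt, 0.0)  # guard against tiny negative values from the optimiser
    return 0.5 * chi2_1_sf(lrt)

p_at_zero = mixture_pvalue(0.0)      # variance component estimated at its boundary
p_at_crit = mixture_pvalue(3.841459) # the usual 5% critical value of chi2(1)
```

Under this reference distribution the p-value is half of what a naive $\chi^2(1)$ test would report, and no p-value can exceed 0.5.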

Methodology - Step I
Multiple testing correction
• Storey et al. estimate $\pi_0 = m_0/m$ under the assumption that the true null p-values are uniformly distributed on (0, 1).
• We estimate $\pi_{0,adj} = m_0/m$ under the assumption that 50% of the true null p-values are uniformly distributed on (0, 0.5) and the other 50% are exactly 0.5.
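For orientation, here is a sketch of the standard Storey estimator of $\pi_0$ at a fixed threshold $\lambda$ and the q-values built from it. This is the textbook version, not the adjusted estimator used in this talk, whose formula is not given; the function names, the choice $\lambda = 0.25$ and the toy p-values are assumptions.

```python
def storey_pi0(pvalues, lam=0.25):
    """Standard Storey estimate: #{p > lam} / (m * (1 - lam)), capped at 1.

    lam must stay below 0.5 for mixture-based p-values, because none of
    them can exceed 0.5.
    """
    m = len(pvalues)
    pi0 = sum(1 for p in pvalues if p > lam) / (m * (1.0 - lam))
    return min(pi0, 1.0)

def qvalues(pvalues, pi0):
    """q-value of each p-value: running minimum of pi0 * m * p / rank."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    q = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, pi0 * m * pvalues[i] / rank)
        q[i] = running_min
    return q

# Toy p-values (hypothetical)
pvals = [0.001, 0.01, 0.02, 0.3, 0.4, 0.5, 0.5, 0.5]
pi0 = storey_pi0(pvals)
q = qvalues(pvals, pi0)
```

A gene is then called significant when its q-value falls below the chosen FDR level.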

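Methodology - Step II
Mixed-Model Equations
Gene X (expression values $y$):
Full model: $y_{klijm} = \mu + \mathrm{dye}_k + \mathrm{replicate}_l + \mathrm{gca}_i + \mathrm{gca}_j + \mathrm{sca}_{ij} + \mathrm{array}_m + \mathrm{error}_{klijm}$
Reduced model: $y_{klijm} = \mu + \mathrm{dye}_k + \mathrm{replicate}_l + \mathrm{gca}_i + \mathrm{gca}_j + \mathrm{array}_m + \mathrm{error}_{klijm}$
• FIXED effects: $\mu$, dye, replicate; RANDOM effects: gca, sca, array; residual: error
• $L$ is the Cholesky decomposition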

Methodology - Step II
Multiple testing correction
Gene X: likelihood ratio test (REML), $\mathrm{LRT} \sim 0.5\,\chi^2(0) + 0.5\,\chi^2(1)$ → p-value → $q_a$-value (FNR)
• 20976 genes
• FNR: false non-discovery rate (Genovese et al., 2002)
• How many of the called negatives are false?
• 5% FNR means 5% of the called negatives are false negatives
• Since we are interested in selecting genes that are negative for the $\mathrm{sca}_{ij}$ effect, we control the FNR instead of the FDR
• We use the $q_a$-value to represent the FNR

Methodology - Step II
Multiple testing correction
False non-discovery rate (FNR): $\pi_0$ is the estimate of the proportion of features that are truly null
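To make the difference between the two error rates concrete, the sketch below computes the realised false discovery proportion and false non-discovery proportion for a toy set of calls with known truth. In Step II the genes that are kept are the non-rejected ones, which is why the second quantity is the one to control. The function names and the toy labels are illustrative assumptions.

```python
def false_discovery_proportion(is_null, rejected):
    """Share of rejected hypotheses (called positives) that are truly null."""
    n_rejected = sum(1 for r in rejected if r)
    false_pos = sum(1 for null, r in zip(is_null, rejected) if r and null)
    return false_pos / n_rejected if n_rejected else 0.0

def false_nondiscovery_proportion(is_null, rejected):
    """Share of non-rejected hypotheses (called negatives) that are truly non-null."""
    n_accepted = sum(1 for r in rejected if not r)
    false_neg = sum(1 for null, r in zip(is_null, rejected) if not r and not null)
    return false_neg / n_accepted if n_accepted else 0.0

# Toy example (hypothetical): 8 genes with known truth and test outcome
is_null  = [True, True, True, False, False, False, False, True]
rejected = [False, False, True, True, True, False, False, False]

fdp = false_discovery_proportion(is_null, rejected)
fnp = false_nondiscovery_proportion(is_null, rejected)
```

The FDR and FNR are the expectations of these two proportions over repeated experiments.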

Methodology - Step III
Mixed-Model Equations
Gene X:
Model: $y_{klijm} = \mu + \mathrm{dye}_k + \mathrm{replicate}_l + \mathrm{gca}_i + \mathrm{gca}_j + \mathrm{array}_m + \mathrm{error}_{klijm}$
• Test 45 pairs: $g_1 = g_2$? $g_1 = g_3$? $g_1 = g_4$? … $g_1 = g_{10}$? $g_2 = g_3$? $g_2 = g_4$? $g_2 = g_5$? … $g_2 = g_{10}$? … $g_9 = g_{10}$?
• Two-sample dependent t-test
• Non-standard p-value: the true null p-values are not uniformly distributed on (0, 1)
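The count of 45 follows from comparing every unordered pair among the ten effects $g_1, \dots, g_{10}$: $\binom{10}{2} = 45$. A short sketch that enumerates the comparisons (the labels are illustrative):

```python
from itertools import combinations

# Labels for the ten effects compared on the slide: g1 ... g10
parents = [f"g{i}" for i in range(1, 11)]

# Every unordered pair (g_i, g_j) with i < j is one hypothesis g_i = g_j
pairs = list(combinations(parents, 2))
n_tests = len(pairs)
```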

Methodology - Step III
Multiple testing correction
Gene X: two-sample t-test on BLUPs → simulate the H0 distribution from the real data → simulation-based p-value → q-value (FDR)
• 1380 genes
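A minimal sketch of a simulation-based p-value: the observed test statistic is ranked against statistics generated under H0. How the talk simulates H0 from the real data is not described, so the standard-normal draws below are only a stand-in, and the function name and the +1 correction are assumptions.

```python
import random

def empirical_pvalue(observed, null_stats):
    """Two-sided empirical p-value with the +1 correction.

    Counts how many simulated null statistics are at least as extreme
    as the observed one.
    """
    exceed = sum(1 for s in null_stats if abs(s) >= abs(observed))
    return (exceed + 1) / (len(null_stats) + 1)

# Stand-in null distribution (hypothetical): 9999 standard-normal draws
rng = random.Random(0)
null_stats = [rng.gauss(0.0, 1.0) for _ in range(9999)]
p_sim = empirical_pvalue(3.5, null_stats)
```

The +1 correction keeps the p-value strictly positive, so its smallest possible value is 1 / (B + 1) for B simulated statistics.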

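Methodology - Step IV
Mixed-Model Equations
Gene X (cis-regulated), with tag SNPs SNP1, SNP2, SNP3, …, SNPi on the gene's chromosome:
Full model: $y_{klim} = \mu + \mathrm{dye}_k + \mathrm{replicate}_l + \text{(tag-SNP effects)} + \mathrm{genotype}_i + \mathrm{array}_m + \mathrm{error}_{klim}$
Reduced model: $y_{klim} = \mu + \mathrm{dye}_k + \mathrm{replicate}_l + \mathrm{genotype}_i + \mathrm{array}_m + \mathrm{error}_{klim}$
• FIXED effects: $\mu$, dye, replicate, tag SNPs; RANDOM effects: genotype, array; residual: error
• $\mathrm{genotype} \sim N(0, \Sigma_{genotype})$, $\Sigma_{genotype} = G = K\sigma^2_g$
• $K$ = 55 x 55 marker-based relatedness matrix
• $\mathrm{array} \sim N(0, \Sigma_a)$, $\Sigma_a = I_{110}\,\sigma^2_a$; $\mathrm{error} \sim N(0, \Sigma_e)$, $\Sigma_e = I_{220}\,\sigma^2_e$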

Methodology - Step IV
Multiple testing correction
Gene X (cis-regulated): likelihood ratio test (ML), $\mathrm{LRT} \sim \chi^2(2n)$, where $n$ is the number of SNPs → p-value → q-value (FDR)
• 836 genes
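Because the degrees of freedom $2n$ are always even, the $\chi^2$ tail probability has a closed form: $P(\chi^2_{2n} \ge x) = e^{-x/2}\sum_{k=0}^{n-1}\frac{(x/2)^k}{k!}$. The sketch below uses it to turn an LRT statistic into a p-value; the function names and the example values are illustrative assumptions.

```python
import math

def chi2_sf_even_df(x, df):
    """P(chi-square with an even number of df >= x), closed form."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("df must be a positive even integer")
    half = x / 2.0
    term = 1.0   # (x/2)^k / k! for k = 0
    total = 1.0
    for k in range(1, df // 2):
        term *= half / k
        total += term
    return math.exp(-half) * total

def step4_pvalue(lrt, n_snps):
    """p-value of the ML likelihood ratio test with 2 * n_snps df."""
    return chi2_sf_even_df(lrt, 2 * n_snps)

# Example (hypothetical): a haplotype block tagged by 2 SNPs, so 4 df
p_block = step4_pvalue(9.487729036781154, 2)
```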

Results
Data contain all expressed genes (25527 genes)
Step I: adjusted q-value < 0.0005 → 20979 genes
Step II: adjusted $q_a$-value < 0.01 → 1328 genes
Step III: q-value < 0.01 → 972 genes
Step IV: q-value < 0.01 → 859 genes

Results
• Among all 25527 genes, 20979 have significant genotypic variation (q-value < 0.0005). (Step I)
• Among these 20979 genes, 1328 have no trans-regulatory effect ($q_a$-value < 0.01). (Step II)
• Among these 1328 genes, 972 show significantly different allelic expression (q-value < 0.01); these 972 genes are identified as cis-regulated. (Step III)
• We confirm these 972 cis-regulated genes in Step IV:
• An allelic expression difference caused by a cis-regulatory variant implies a nearby polymorphism (SNP), in LD, that controls expression;
• We indeed found that 96.5% of the selected cis-regulated genes have associated polymorphisms (haplotype blocks) nearby.

Conclusions
• The mixed-model approach used here for association mapping, with the kinship matrix included, is more appropriate than other recent methods for identifying cis-regulated genes (its p-values are more reliable).
• The statistical method in each step is controlled more accurately to specify statistical significance (via the FDR and FNR).
• Using simulation-based p-values when testing differences between random effects increases the power to detect association.
• A comprehensive analysis of gene expression variation in plant populations has been described.
• This mixed-model analysis strategy provides a detailed characterization of both the genetic and the positional effects in the genome.
• This detailed statistical analysis provides a robust and useful framework for future analyses of gene expression variation in large samples.
• Advanced statistical methods look promising for making interesting discoveries in genetics.

Many thanks for your attention!
