DATA ANALYSIS

DATA ANALYSIS Module Code: CA660 Lecture Block 7

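Examples in Genomics and Trait Models
• Genetic traits may be controlled by a number of genes, usually unknown.
• Taking "genetic effect" as one genotypic term, a simple model is
  $y_{ij} = \mu + G_i + \varepsilon_{ij}$
  where $y_{ij}$ is the trait value for genotype i in replication j, $\mu$ is the mean, $G_i$ the genetic effect for genotype i and $\varepsilon_{ij}$ the errors.
• If we assume Normality (and want random effects), $G_i \sim N(0, \sigma_G^2)$ and $\varepsilon_{ij} \sim N(0, \sigma_e^2)$. Assuming also zero covariance between genetic effects and error, the phenotypic variance is $\sigma_P^2 = \sigma_G^2 + \sigma_e^2$.
Note: If the same genotype is replicated b times in an experiment, with phenotypic means used, the error variance is averaged over b: $\sigma_{\bar{P}}^2 = \sigma_G^2 + \sigma_e^2 / b$.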

Example - Trait Models contd.
• What about Environment and GE interactions? Extension to the simple model, with i = genotype, j = environment, k = block within environment:
  $y_{ijk} = \mu + E_j + B_{k(j)} + G_i + (GE)_{ij} + \varepsilon_{ijk}$
ANOVA Table: randomized blocks within environment; number of sets/blocks within each environment = b = replications. Focus is on the genotype effect.

Source        dof            Expected MSQ
Environment   e-1            know there are differences
Blocks        (b-1)e         again, know there are differences
Genotypes     g-1            $\sigma_e^2 + b\sigma_{GE}^2 + be\sigma_G^2$
GE            (g-1)(e-1)     $\sigma_e^2 + b\sigma_{GE}^2$
Error         (b-1)(g-1)e    $\sigma_e^2$

Note: individuals are blocked within each of multiple environments, so the environmental effect is intrinsic to the error. The model form is standard, but the only meaningful comparisons are within environment, hence the form of the random error = population variance = $\sigma_e^2$; the random effects of interest come from the additional variances and their ratios. Genotypic effects are measured within blocks.

Example contd.
• HERITABILITY = ratio of genotypic to phenotypic variance: $H^2 = \sigma_G^2 / \sigma_P^2$
• Depending on the relationship among genotypes, the interpretation of the genotypic variance differs. It may contain additive, dominance and other interaction variances (the above = heritability in the broad sense).
• For some experimental or mating schemes, an additive genetic variance $\sigma_A^2$ may be calculated. Narrow (specific) sense heritability is then $h^2 = \sigma_A^2 / \sigma_P^2$
• Again, if phenotypic means are used, we obtain a mean-based heritability for b replications; for the simple model (no GE term) this is $H^2_{\bar{y}} = \sigma_G^2 / (\sigma_G^2 + \sigma_e^2 / b)$.
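A minimal sketch of how the variance components and both heritabilities could be estimated for the simple one-way random-effects model, using the usual method-of-moments ANOVA estimator $\hat{\sigma}_G^2 = (MS_G - MS_E)/b$. The function name, the balanced-data layout and the toy numbers are assumptions for illustration only, not data from the lecture.

```python
def broad_heritability(data):
    """data: list of g genotypes, each a list of b replicate trait values (balanced)."""
    g = len(data)
    b = len(data[0])
    grand = sum(sum(row) for row in data) / (g * b)
    means = [sum(row) / b for row in data]
    # Mean squares for genotypes and for error (one-way ANOVA)
    ms_g = b * sum((m - grand) ** 2 for m in means) / (g - 1)
    ms_e = sum((y - m) ** 2 for row, m in zip(data, means) for y in row) / (g * (b - 1))
    # E[MS_G] = sigma_e^2 + b * sigma_G^2, so solve for sigma_G^2 (truncated at 0)
    var_g = max((ms_g - ms_e) / b, 0.0)
    h2_single = var_g / (var_g + ms_e)      # single-observation basis
    h2_mean = var_g / (var_g + ms_e / b)    # genotype-mean basis, error averaged over b
    return var_g, ms_e, h2_single, h2_mean


# Hypothetical toy data: 3 genotypes, 2 replications each
toy = [[10, 12], [14, 16], [18, 20]]
var_g, var_e, h2_single, h2_mean = broad_heritability(toy)
print(var_g, var_e, round(h2_single, 4), round(h2_mean, 4))
```

The mean-based value is never smaller than the single-observation value, because replication shrinks the error contribution to the phenotypic variance.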

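Extended Example - Two related traits
Have
  $y_{1ij} = \mu_1 + G_{1i} + \varepsilon_{1ij}$
  $y_{2ij} = \mu_2 + G_{2i} + \varepsilon_{2ij}$
where 1 and 2 denote traits, i the gene and j an individual in the population. Then y is the trait value, $\mu$ the overall mean, G the genetic effect and $\varepsilon$ the random error.
To quantify the relationship between the two traits, use the variance-covariance matrices for phenotypic ($\Sigma_p$), genetic ($\Sigma_g$) and environmental ($\Sigma_e$) effects, each of the form
  $\Sigma = \begin{pmatrix} \sigma_1^2 & \sigma_{12} \\ \sigma_{12} & \sigma_2^2 \end{pmatrix}$
So the correlations between traits in terms of phenotypic, genetic and environmental effects are
  $\rho_p = \sigma_{p12} / (\sigma_{p1}\sigma_{p2})$, $\rho_g = \sigma_{g12} / (\sigma_{g1}\sigma_{g2})$, $\rho_e = \sigma_{e12} / (\sigma_{e1}\sigma_{e2})$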

MAXIMUM LIKELIHOOD ESTIMATION
• Recall general points: Estimation; definition of the Likelihood function for a vector of parameters $\theta$ and set of values x, $L(\theta) = \prod_i f(x_i; \theta)$. Find the most likely value of $\theta$ = maximise the Likelihood fn. Also defined the Log-likelihood (Support fn. $S(\theta)$) and its derivative, the Score, together with the Information content per observation, which for a single-parameter likelihood is given by
  $I(\theta) = E\left[\left(\frac{\partial \log f(x;\theta)}{\partial \theta}\right)^2\right] = -E\left[\frac{\partial^2 \log f(x;\theta)}{\partial \theta^2}\right]$
• Why MLE? (Need to know the underlying distribution.) Properties: consistency; sufficiency; asymptotic efficiency (linked to variance); unique maximum; invariance and, hence, most convenient parameterisation; usually MVUE; amenable to conventional optimisation methods.

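VARIANCE, BIAS & CONFIDENCE
• Variance of an Estimator - usual form $Var(\hat{\theta}) = E[(\hat{\theta} - E(\hat{\theta}))^2]$, or for k independent estimates $\widehat{Var}(\hat{\theta}) = \sum_{i=1}^{k} (\hat{\theta}_i - \bar{\theta})^2 / (k-1)$
• For a large sample, the variance of the MLE can be approximated by $Var(\hat{\theta}) \approx 1 / (n I(\hat{\theta}))$; it can also be estimated empirically, using re-sampling techniques.
• Variance of a linear function (of several estimates) is a common need in genomics analysis, e.g. heritability: $Var(\sum_i a_i \hat{\theta}_i) = \sum_i a_i^2 Var(\hat{\theta}_i) + 2 \sum_{i<j} a_i a_j Cov(\hat{\theta}_i, \hat{\theta}_j)$
• Recall the Bias of the Estimator, $Bias(\hat{\theta}) = E(\hat{\theta}) - \theta$; then the Mean Square Error is defined to be $MSE(\hat{\theta}) = E[(\hat{\theta} - \theta)^2]$, which expands to $Var(\hat{\theta}) + [Bias(\hat{\theta})]^2$, so we have the basis for C.I. and tests of hypothesis.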

COMMONLY-USED METHODS of obtaining MLE
• Analytical - solving $dL/d\theta = 0$ or $d \log L / d\theta = 0$ when simple solutions exist
• Grid search or likelihood profile approach
• Newton-Raphson iteration methods
• EM (expectation and maximisation) algorithm
N.B. Use the Log-likelihood, because:
• it has its maximum at the same $\theta$ value as the Likelihood
• it is easier to compute
• there is a close relationship between the statistical properties of the MLE and the Log-likelihood

METHODS in brief
Analytical: recall the Binomial example earlier
• Example: For the Normal, the MLEs of mean and variance (taking derivatives w.r.t. mean and variance separately) are $\hat{\mu} = \bar{x}$ and $\hat{\sigma}^2 = \frac{1}{N}\sum_i (x_i - \bar{x})^2$, i.e. equivalent to the sample mean and the actual variance (divisor N). The variance estimate is unbiased if the mean is known, biased if not.
• Invariance: one-to-one relationships are preserved
• Used: when the MLE has a simple solution
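As a small illustration of the Normal example above, the sketch below computes the analytical MLEs alongside the usual unbiased variance (divisor N-1). The function name and the sample values are hypothetical.

```python
def normal_mle(xs):
    """Return (MLE of mean, MLE of variance with divisor N, unbiased variance with divisor N-1)."""
    n = len(xs)
    mean = sum(xs) / n
    ss = sum((x - mean) ** 2 for x in xs)  # sum of squares about the sample mean
    return mean, ss / n, ss / (n - 1)


# Hypothetical sample
mean, var_mle, var_unbiased = normal_mle([2, 4, 4, 4, 5, 5, 7, 9])
print(mean, var_mle, round(var_unbiased, 3))
```

The MLE of the variance is smaller than the unbiased estimate by the factor (N-1)/N, which is the bias incurred by estimating the mean from the same data.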

Methods for MLE's contd.
Grid Search - Computational
Plot likelihood or log-likelihood vs parameter. Various features:
• Relative Likelihood = Likelihood / Max. Likelihood (ML set = 1). The peak of the R.L. can be identified visually or sought algorithmically.
• e.g. plotting likelihood over the parameter space range can give 2 peaks, symmetrical around $\theta = 0.5$: this is the likelihood profile for the well-known mixed linkage phase problem in linkage analysis. If we e.g. constrain $0 \le \theta \le 0.5$, the MLE = R.F. between genes (possible mixed linkage phase).

contd.
• Graphic/numerical implementation - take an initial estimate of $\theta$; the direction of search is determined by evaluating the likelihood at both sides of $\theta$. The search takes the direction giving an increase. Initial search increments are large, e.g. 0.1; then, when the likelihood change starts to decrease or become negative, stop and refine the increment.
• Multiple peaks - can miss the global maximum; computationally intensive
• Multiple parameters - grid search. Interpretation of likelihood profiles can be difficult.
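The search procedure above can be sketched as follows: start from an initial estimate, step in whichever direction increases the log-likelihood, and shrink the increment once no neighbour improves. The binomial-type log-likelihood with counts a and b, the starting value and the tolerance are assumptions chosen for illustration.

```python
import math


def log_lik(theta, a=8, b=2):
    # Illustrative log-likelihood: b "successes" and a "failures" (assumed form)
    return b * math.log(theta) + a * math.log(1 - theta)


def grid_search(f, lo=0.0, hi=1.0, start=0.5, step=0.1, tol=1e-6):
    """Local search with a shrinking increment; may miss the global peak if f is multimodal."""
    theta = start
    while step > tol:
        moved = True
        while moved:
            moved = False
            # Evaluate the function on both sides of the current estimate
            for cand in (theta - step, theta + step):
                if lo < cand < hi and f(cand) > f(theta):
                    theta = cand
                    moved = True
                    break
        step /= 10  # no neighbour improves: refine the increment
    return theta


theta_hat = grid_search(log_lik)
print(round(theta_hat, 4))
```

Because the search only ever moves uphill from its starting point, different starting values should be tried when several peaks are suspected.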

Example
• Recall Exs 2, Q. 8. Data used to show a linkage relationship between a marker and a "rust-resistant" gene. Escapes = individuals who are susceptible, but show no disease (rust) phenotype under experimental conditions. So define $\alpha$ and $\theta$ as the proportion of escapes and the R.F. respectively. $1 - \alpha$ is the penetrance for the disease trait, i.e. P{individual with susceptible genotype has disease phenotype}. Purpose of the experiment: typically to estimate the R.F. between marker and gene.
• Use: Support function = Log-Likelihood, $S(\alpha, \theta) = \log L(\alpha, \theta)$

Example contd.
• Set the 1st derivatives (Scores) w.r.t. $\alpha$ and $\theta$ equal to 0. The expected value of the Score w.r.t. $\theta$ is zero (see analogies in classical sampling/hypothesis testing). Similarly for $\alpha$. Here, however, there is no simple analytical solution, so we cannot solve directly for either.
• Using a grid search, the likelihood maximum is located numerically, giving the estimates $\hat{\alpha}$ and $\hat{\theta}$.
• In general, this type of experiment tests H0: independence between marker and gene, and H0: no escapes. It uses Likelihood Ratio Test statistics (the ML equivalent of $\chi^2$).
• N.B. Moment estimates solve a slightly different problem, because they use no information on expected frequencies (not the same as the MLE).

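MLE Estimation Methods contd.
Newton-Raphson Iteration
Have Score($\theta$) = 0 from previously. N-R consists of replacing the Score by the linear terms of its Taylor expansion, so if $\theta''$ is a solution and $\theta'$ = 1st guess,
  $Score(\theta'') \approx Score(\theta') + (\theta'' - \theta') \frac{d\,Score(\theta')}{d\theta} = 0 \Rightarrow \theta'' = \theta' - Score(\theta') \Big/ \frac{d\,Score(\theta')}{d\theta}$
Repeat with $\theta''$ replacing $\theta'$. Each iteration fits a parabola to the (log-)likelihood function.
• Problems - multiple peaks, zero Information, extreme estimates
• Multiple parameters - need matrix notation, where the S matrix e.g. has elements = derivatives of $S(\theta, \alpha)$ w.r.t. $\theta$ and $\alpha$ respectively. Similarly, the Information matrix I has terms of the form $-E[\partial^2 S / \partial\theta\,\partial\alpha]$. Estimates are
  $\begin{pmatrix} \theta'' \\ \alpha'' \end{pmatrix} = \begin{pmatrix} \theta' \\ \alpha' \end{pmatrix} + I^{-1} S$
Note: the Score is the 1st derivative of the Log-likelihood; the Information is minus the expected 2nd derivative, which equals the variance of the Score.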
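A minimal single-parameter sketch of the Newton-Raphson update, using the observed information (minus the 2nd derivative of the log-likelihood) in place of the expected information. The binomial-type log-likelihood $b\log\theta + a\log(1-\theta)$, the counts and the starting value are assumptions for illustration; its MLE is known analytically, $b/(a+b)$, which makes the iteration easy to check.

```python
def score(theta, a, b):
    # 1st derivative of b*log(theta) + a*log(1 - theta)
    return b / theta - a / (1 - theta)


def observed_information(theta, a, b):
    # Minus the 2nd derivative of the same log-likelihood
    return b / theta ** 2 + a / (1 - theta) ** 2


def newton_raphson(a, b, theta=0.3, tol=1e-10, max_iter=50):
    for _ in range(max_iter):
        new_theta = theta + score(theta, a, b) / observed_information(theta, a, b)
        if abs(new_theta - theta) < tol:
            return new_theta
        theta = new_theta
    return theta


print(round(newton_raphson(a=8, b=2), 6))
```

A poor starting value can push the update outside (0, 1); this is one form of the "extreme estimates" problem, and a practical implementation would guard against it.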

Methods contd.
Expectation-Maximisation Algorithm - iterative. Used for incomplete data. (Much genomic data fits this situation, e.g. linkage analysis with marker genotypes of F2 progeny. Usually 9 categories are observed for a 2-locus, 2-allele model, but 16 = complete info., while 14 give info. on linkage. Some are hidden, but if the linkage parameter is known, expected frequencies can be predicted, as you know, and the complete data restored using expectation.)
• Steps: (1) Expectation estimates the statistics of the complete data, given the observed incomplete data.
• (2) Maximisation uses the estimated complete data to give the MLE.
• Iterate till it converges (no further change)

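E-M contd. Implementation
• An initial guess, $\theta'$, is chosen (e.g. $\theta' = 0.25$, say, = R.F.).
• Taking this as "true", the complete data are estimated by distributional statements, e.g. P(individual is recombinant, given observed genotype) for R.F. estimation.
• The MLE estimate $\theta''$ is computed.
• This, for the R.F., is the sum of (expected) recombinants / N.
• Thus the MLE, for $f_i$ the observed count of genotype class i, is $\theta'' = \frac{1}{N} \sum_i f_i \, P(R \mid G_i, \theta')$
• Convergence: $\theta'' = \theta'$, or $|\theta'' - \theta'|$ less than a small tolerance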
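The E and M steps can be sketched on a simpler, classic linkage-type multinomial rather than the full F2 model of the lecture: four observed classes with probabilities $(1/2 + \theta/4,\ (1-\theta)/4,\ (1-\theta)/4,\ \theta/4)$, where the first class is a mixture of a hidden part with probability 1/2 and a hidden part with probability $\theta/4$. This model, the counts (125, 18, 20, 34) and the function name are a swapped-in textbook illustration, not the lecture's data.

```python
def em_linkage(counts, theta=0.5, tol=1e-8, max_iter=1000):
    """EM for cell probabilities (1/2 + t/4, (1-t)/4, (1-t)/4, t/4)."""
    y1, y2, y3, y4 = counts
    for _ in range(max_iter):
        # E-step: expected number in the hidden theta/4 part of the first class
        x2 = y1 * (theta / 4) / (0.5 + theta / 4)
        # M-step: binomial MLE of theta from the completed data
        new_theta = (x2 + y4) / (x2 + y2 + y3 + y4)
        if abs(new_theta - theta) < tol:
            return new_theta
        theta = new_theta
    return theta


print(round(em_linkage((125, 18, 20, 34)), 4))
```

The same pattern applies to R.F. estimation: the E-step replaces hidden recombinant counts by their expectations under the current $\theta'$, and the M-step is the simple "recombinants / N" estimate on the completed data.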

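LIKELIHOOD: C.I. and H.T.
• Likelihood Ratio Test - c.f. with $\chi^2$.
• The principal advantage of G is power, as unknown parameters are involved in the hypothesis test.
  Have: $L(\hat{\theta}_A)$, the likelihood of $\theta$ taking the value $\hat{\theta}_A$ which maximises it, i.e. its MLE, and $L(\theta_N)$, the likelihood under H0: $\theta = \theta_N$ (e.g. $\theta_N = 0.5$)
• Form of the L.R. Test Statistic: $G = -2 \log [L(\theta_N) / L(\hat{\theta}_A)]$ or, conventionally, $G = 2 \log [L(\hat{\theta}_A) / L(\theta_N)]$; choose whichever is easier to interpret.
• Distribution of G ~ approx. $\chi^2$ (d.o.f. = difference in dimension of the parameter spaces for $L(\hat{\theta}_A)$, $L(\theta_N)$)
• Goodness of Fit: notation as for $\chi^2$, $G = 2 \sum_{i=1}^{n} O_i \log (O_i / E_i)$, $G \sim \chi^2_{n-1}$
• Independence: notation again as for $\chi^2$, $G = 2 \sum_i \sum_j O_{ij} \log (O_{ij} / E_{ij})$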

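Power - Example extended
• Under H0: $\theta = \theta_N = 0.5$, $G \sim \chi^2_1$ approximately.
• At level of significance $\alpha = 0.05$, suppose the true $\theta = \theta_1 = 0.2$, so if n = 25, $E\{G\} = 2n[\theta_1 \log(\theta_1 / 0.5) + (1 - \theta_1) \log((1 - \theta_1) / 0.5)] \approx 9.6$ (e.g. in genomics this might apply where R.F. = 0.2 between two genes, as opposed to 0.5). Natural logs are used, though either is possible in practice; hence the generic form "Log" rather than Ln here. Assume Ln throughout for genetic/genomic examples unless otherwise indicated.
• The rejection region at the 0.05 level is $G > \chi^2_{0.05, 1} = 3.84$
• If we sketch the curves, P{LRTS falls in the acceptance region} = 0.13 = Prob. of a false negative when the actual value of $\theta$ = 0.2
• If the sample size is increased, e.g. n = 50, E{G} = 19 and it is easy to show that P{False negative} = 0.01
• Generally: power for these tests is given by $1 - \beta = P\{\chi^2_{df}(\lambda) > \chi^2_{\alpha, df}\}$, a non-central $\chi^2$ with non-centrality parameter $\lambda$ taken as E{G}.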
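The false-negative probabilities quoted above (about 0.13 for n = 25 and 0.01 for n = 50) can be checked with a short sketch. It assumes that G is treated as a non-central $\chi^2$ with 1 d.o.f. and non-centrality parameter equal to E{G}; for 1 d.o.f. that variable is $(Z + \sqrt{\lambda})^2$ with Z standard Normal, so only the Normal c.d.f. is needed. The function names are hypothetical.

```python
import math


def norm_cdf(z):
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))


def expected_g(n, theta1, theta0=0.5):
    """Expected LRTS when the true value is theta1 and H0 says theta0 (natural logs)."""
    return 2 * n * (theta1 * math.log(theta1 / theta0)
                    + (1 - theta1) * math.log((1 - theta1) / (1 - theta0)))


def false_negative_prob(ncp, z_crit=1.959964):
    """P{G <= 3.84} for a 1-d.o.f. non-central chi-square with parameter ncp."""
    root = math.sqrt(ncp)
    return norm_cdf(z_crit - root) - norm_cdf(-z_crit - root)


for n in (25, 50):
    lam = expected_g(n, theta1=0.2)
    beta = false_negative_prob(lam)
    print(n, round(lam, 2), round(beta, 3), round(1 - beta, 3))
```

Doubling the sample size doubles the expected statistic, which is why the false-negative rate falls so sharply between n = 25 and n = 50.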

Likelihood C.I.'s - method
• Example: Consider the following Likelihood function
  $L(\theta) = \theta^b (1 - \theta)^a$
  where $\theta$ is the unknown parameter and a, b are observed counts
• For 4 data sets observed, A: (a,b) = (8,2), B: (a,b) = (16,4), C: (a,b) = (80,20), D: (a,b) = (400,100)
• Likelihood estimates can be plotted vs possible parameter values, with MLE = peak value, e.g. MLE = 0.2, Lmax = 0.0067 for A, and Lmax = 0.000045 for B, etc.
Set A: Log Lmax - Log L = Log(0.0067) - Log(0.00091) = 2 gives the $\approx$ 95% C.I., so $\theta$ = (0.035, 0.496), corresponding to L = 0.00091, is the $\approx$ 95% C.I. for A. Similarly, manipulating this expression, the Likelihood value corresponding to the $\approx$ 95% confidence interval is given as L = Lmax / 7.389 (since $e^2 = 7.389$).
Note: Usually plot Log-likelihood vs parameter, rather than Likelihood. As sample size increases, the C.I. becomes narrower and $\approx$ symmetric.
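A minimal sketch of the likelihood-interval method above, assuming the likelihood has the form $\theta^b (1-\theta)^a$ (this reproduces Lmax = 0.0067 at $\theta$ = 0.2 for set A). It scans a fine grid and keeps every $\theta$ whose log-likelihood lies within 2 units of the maximum. The grid resolution and function names are assumptions; a fine grid can place the upper limit for set A slightly above the value read from a plot.

```python
import math


def log_lik(theta, a, b):
    return b * math.log(theta) + a * math.log(1 - theta)


def likelihood_interval(a, b, drop=2.0, steps=10000):
    """Return (MLE, lower, upper) where log L >= log Lmax - drop, via a grid scan."""
    mle = b / (a + b)
    cutoff = log_lik(mle, a, b) - drop
    inside = [i / steps for i in range(1, steps)
              if log_lik(i / steps, a, b) >= cutoff]
    return mle, inside[0], inside[-1]


for label, (a, b) in {"A": (8, 2), "B": (16, 4), "C": (80, 20), "D": (400, 100)}.items():
    mle, lower, upper = likelihood_interval(a, b)
    print(label, mle, round(lower, 3), round(upper, 3))
```

Running the four data sets side by side shows the interval narrowing and becoming more symmetric about 0.2 as the counts grow.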

Multiple Populations: Extensions to G - Example
• Recall Mendel's data (earlier) and the extensions to $\chi^2$ for the same. In brief:

                  Round              Wrinkled
Plant           O        E         O        E         G      dof   p-value
1               45       42.75     12       14.25     0.49   1     0.49
2                                                     0.09   1     0.77
3                                                     0.10   1     0.75
4                                                     1.30   1     0.26
5                                                     0.01   1     0.93
6                                                     0.71   1     0.40
7                                                     0.79   1     0.38
8                                                     0.63   1     0.43
9                                                     1.06   1     0.30
10                                                    0.17   1     0.68
Total           336                101                5.34   10
Pooled          336      327.75    101      109.25    0.85   1     0.36
Heterogeneity                                         4.50   9     0.88
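The G values in the table can be reproduced from the observed and expected counts with $G = 2 \sum O \log(O/E)$. The sketch below does this for Plant 1 and for the pooled counts, and obtains the heterogeneity G by subtraction from the tabulated total. The 1-d.o.f. p-value uses the identity $P\{\chi^2_1 > g\} = \mathrm{erfc}(\sqrt{g/2})$; function names are hypothetical, and small rounding differences from the table are to be expected.

```python
import math


def g_statistic(observed, expected):
    """Log-likelihood ratio goodness-of-fit statistic (natural logs)."""
    return 2 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)


def p_value_1dof(g):
    """Upper-tail probability of a chi-square with 1 d.o.f."""
    return math.erfc(math.sqrt(g / 2))


g_plant1 = g_statistic([45, 12], [42.75, 14.25])
g_pooled = g_statistic([336, 101], [327.75, 109.25])
g_total = 5.34  # sum of the ten per-plant G values, taken from the table
g_het = g_total - g_pooled  # heterogeneity, on 10 - 1 = 9 d.o.f.

print(round(g_plant1, 2), round(p_value_1dof(g_plant1), 2))
print(round(g_pooled, 2), round(p_value_1dof(g_pooled), 2))
print(round(g_het, 2))
```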

Multiple Populations - summary
• Parallels $\chi^2$
• Partitions, therefore: $G_{Total} = \sum_{i=1}^{p} G_i$ with p(n-1) d.o.f., $G_{Pooled}$ with n-1 d.o.f., and $G_{heterogeneity} = G_{Total} - G_{Pooled}$ with (p-1)(n-1) d.o.f. (n = no. classes, p = no. populations)
Example: Recall Backcross (AaBb x aabb) - Goodness of fit (2-locus model). For each of 4 crosses, a Total GoF statistic can be calculated according to the expected segregation ratio 1:1:1:1 (assumes no segregation distortion for both loci and no linkage between loci). For each locus, GoF is calculated using marginal counts, assuming each genotype segregates 1:1. The difference between the Total and the 2 individual locus GoF statistics is the Log-LRTS (or chi-squared statistic) contributed by association/linkage between the 2 loci.

Example: Marker Screening
Screening for polymorphism (different detectable alleles): look at the stages involved. A genomic map is based on genome variation at locations (from molecular assay or traditional trait observations).
(1) Screening polymorphic genetic markers is experimental step 1: usually assay a large number of possible genetic markers in a small progeny set = random sample of the mapping population. If a marker does not show polymorphism for the set of progeny, then the marker is non-informative and will not be used for data analysis.

Example contd.
(2) Progeny size for screening is based on power, convenience etc.
• False positive = monomorphic marker determined to be polymorphic. Rare, since a monomorphic marker cannot produce segregating genotypes if these are determined accurately.
• False negatives: rate is high, particularly for a small sample, e.g. for markers segregating 1:1 in (i) backcross, recombinant inbred lines, doubled haploid lines, or segregating 1:2:1 in (ii) F2 with codominant markers.
So, e.g.
(i) P{sampling all individuals with the same genotype} = $2(0.5)^n$
(ii) P{false negative for a single marker, n = 5} = $2(0.25)^5 + 0.5^5 = 0.0332$
Hence power curves as before.
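The two false-negative probabilities above are simply the chances that every sampled individual falls in the same genotypic class. A small sketch, with hypothetical function names, tabulates them for a few progeny sizes; the 1:2:1 class frequencies for the codominant F2 case are the ones implied by the formula $2(0.25)^n + 0.5^n$.

```python
def p_false_negative_1to1(n):
    """All n individuals share one of two equally likely genotypes (e.g. backcross)."""
    return 2 * 0.5 ** n


def p_false_negative_f2_codominant(n):
    """All n individuals share one genotype when classes have frequencies 1/4, 1/2, 1/4."""
    return 2 * 0.25 ** n + 0.5 ** n


for n in (3, 5, 10):
    print(n, round(p_false_negative_1to1(n), 4), round(p_false_negative_f2_codominant(n), 4))
```

Both probabilities fall geometrically with n, so even a modest increase in the screening set sharply reduces the chance of discarding a truly polymorphic marker.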

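Example contd. S.R. 1:1 vs 3:1 - use LRTS
• Detection of departure from S.R. of 1:1:
  $G = 2\left[O_1 \log\frac{O_1}{0.5n} + O_2 \log\frac{O_2}{0.5n}\right]$
  where n = sample size and $O_1$, $O_2$ are the observed counts of the 2 genotypic classes.
• For a true S.R. of 3:1, with $O_1$ the genotypic frequency of the dominant genotype ($O_1 \approx 0.75n$, $O_2 \approx 0.25n$), the T.S. parametric value is approx.
  $E\{G\} \approx 2n[0.75 \log(1.5) + 0.25 \log(0.5)] = 0.2616n$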

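Example contd.
• To reject a S.R. of 1:1 at the 0.05 significance level, a Log-LRTS of at least 3.84 (critical value for rejection) is required.
• Statistical power: $1 - \beta = P\{\chi^2_1(\lambda = 0.2616n) > 3.84\}$
• For n = 15, then, $\lambda \approx 0.2616 \times 15 \approx 3.92$, and the power is $P\{\chi^2_1(3.92) > 3.84\}$
• For a power of 90%, n $\approx$ 40 is needed
• If the problem is expressed the other way, i.e. calculating the expected LRTS for rejecting a 3:1 S.R. when the true value is S.R. 1:1, this is 0.2877n and n $\approx$ 35 is needed.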
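A sketch of the sample-size reasoning above, assuming the LRTS is approximated by a 1-d.o.f. non-central $\chi^2$ whose non-centrality parameter is the expected statistic (0.2616n, or 0.2877n for the reverse problem). The function names are hypothetical, and the exact n found by the search depends on this approximation, so it should land close to, but not necessarily exactly on, the rounded figures quoted on the slide.

```python
import math


def norm_cdf(z):
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))


def expected_g_per_obs(true_probs, null_probs):
    """Expected LRTS contributed by one observation (natural logs)."""
    return 2 * sum(t * math.log(t / h) for t, h in zip(true_probs, null_probs))


def power(n, per_obs, z_crit=1.959964):
    """P{1-d.o.f. non-central chi-square with ncp = n * per_obs exceeds 3.84}."""
    root = math.sqrt(n * per_obs)
    return 1 - (norm_cdf(z_crit - root) - norm_cdf(-z_crit - root))


def min_sample_size(per_obs, target=0.9):
    n = 1
    while power(n, per_obs) < target:
        n += 1
    return n


true_3to1_null_1to1 = expected_g_per_obs([0.75, 0.25], [0.5, 0.5])
true_1to1_null_3to1 = expected_g_per_obs([0.5, 0.5], [0.75, 0.25])
print(round(true_3to1_null_1to1, 4), round(true_1to1_null_3to1, 4))
print(round(power(15, true_3to1_null_1to1), 2), min_sample_size(true_3to1_null_1to1))
```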

Maximum Likelihood Benefits
• Good Confidence Intervals: coverage probability realised and interval biologically meaningful
• MLE: good estimator of a C.I.
  - MSE consistent
  - Absence of bias does not "stand alone"; minimum variance is important
  - Asymptotically Normal
  - Precise for large samples
  - Biological inference valid
  - Biological range realistic
