Data analysis
DATA ANALYSIS - PowerPoint PPT Presentation


PowerPoint Slideshow about ' DATA ANALYSIS' - nicole

Presentation Transcript
Data analysis


Module Code: CA660

Lecture Block 7

Examples in Genomics and Trait Models

  • Genetic traits may be controlled by a number of genes, usually unknown.

    Taking the “genetic effect” as one genotypic term, a simple model is

    $y_{ij} = \mu + G_i + \varepsilon_{ij}$

    where $y_{ij}$ is the trait value for genotype i in replication j, $\mu$ is the mean, $G_i$ the genetic effect for genotype i and $\varepsilon_{ij}$ the errors.

  • If we assume Normality (and want random effects) and assume zero covariance between genetic effects and error, the phenotypic variance is

    $\sigma^2_P = \sigma^2_G + \sigma^2_e$

    Note: If the same genotype is replicated b times in an experiment and phenotypic means are used, the error variance is averaged over b: $\sigma^2_{\bar{P}} = \sigma^2_G + \sigma^2_e / b$.

Example - Trait Models contd.

  • What about Environment and G×E interactions? Extension to the simple model.

    ANOVA Table: Randomized blocks within environment; the number of sets/blocks within each environment = b = replications. Focus is on the genotype effect.

    Source         dof            Expected MSQ
    Environment    e-1            (know there are differences)
    Blocks         (b-1)e         (again, know there are differences)
    Genotypes      g-1            $\sigma^2_e + b\,\sigma^2_{GE} + be\,\sigma^2_G$
    GE             (g-1)(e-1)     $\sigma^2_e + b\,\sigma^2_{GE}$
    Error          (b-1)(g-1)e    $\sigma^2_e$

    Note: individuals are blocked within each of multiple environments, so the environmental effect is intrinsic to the error. The model form is standard, but the only meaningful comparisons are within environment, hence the form of the random error = population variance = $\sigma^2_e$; so the random effects of interest come from the additional variances and ratios.

Genotypic effects measured within blocks

Example contd.

  • HERITABILITY = ratio of genotypic to phenotypic variance: $H^2 = \sigma^2_G / \sigma^2_P$

  • Depending on the relationship among genotypes, the interpretation of the genotypic variance differs. It may contain additive, dominance and other interaction variances

    (the above = heritability in the broad sense).

  • For some experimental or mating schemes, an additive genetic variance $\sigma^2_A$ may be calculated. Narrow (specific) sense heritability is then $h^2 = \sigma^2_A / \sigma^2_P$

  • Again, if phenotypic means are used, obtain a mean-based heritability for b replications; for the simple model this is $\sigma^2_G / (\sigma^2_G + \sigma^2_e / b)$.
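
As a rough illustration of how variance components and heritability might be estimated from an ANOVA of this kind, the sketch below equates mean squares to their expectations. It assumes the usual random-effects expectations (genotype MSQ = $\sigma^2_e + b\sigma^2_{GE} + be\sigma^2_G$, GE MSQ = $\sigma^2_e + b\sigma^2_{GE}$, error MSQ = $\sigma^2_e$); the function names and the mean-square values are hypothetical, not taken from the lecture.

```python
def variance_components(ms_g, ms_ge, ms_e, b, e):
    """Method-of-moments estimates, equating mean squares to their expectations."""
    var_e = ms_e
    var_ge = (ms_ge - ms_e) / b
    var_g = (ms_g - ms_ge) / (b * e)
    return var_g, var_ge, var_e


def broad_heritability(var_g, var_ge, var_e):
    """Genotypic variance over phenotypic variance on a single-plot basis."""
    return var_g / (var_g + var_ge + var_e)


def mean_based_heritability(var_g, var_ge, var_e, b, e):
    """Heritability of genotype means over e environments and b replications."""
    return var_g / (var_g + var_ge / e + var_e / (b * e))


# Hypothetical mean squares for b = 2 replications in each of e = 3 environments.
b, e = 2, 3
var_g, var_ge, var_e = variance_components(50.0, 14.0, 6.0, b, e)
h2 = broad_heritability(var_g, var_ge, var_e)
h2_mean = mean_based_heritability(var_g, var_ge, var_e, b, e)
print(var_g, var_ge, var_e, h2, round(h2_mean, 4))
```

Averaging over replications and environments shrinks the non-genetic terms, which is why the mean-based value is larger than the single-plot value for the same components.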

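Extended Example - Two related traits

$y_{1ij} = \mu_1 + G_{1i} + \varepsilon_{1ij}$, $y_{2ij} = \mu_2 + G_{2i} + \varepsilon_{2ij}$

where 1 and 2 denote traits, i the gene and j an individual in the population. Then y is the trait value, $\mu$ the overall mean, G the genetic effect and $\varepsilon$ the random error.

To quantify the relationship between the two traits, use the variance-covariance matrices for phenotypic, $\Sigma_p$, genetic, $\Sigma_g$, and environmental, $\Sigma_e$, effects, e.g. $\Sigma_p = \begin{pmatrix} \sigma^2_{p1} & \sigma_{p12} \\ \sigma_{p12} & \sigma^2_{p2} \end{pmatrix}$, and similarly for $\Sigma_g$ and $\Sigma_e$.

So the correlations between traits in terms of phenotypic, genetic and environmental effects are:

$\rho_p = \sigma_{p12} / (\sigma_{p1}\sigma_{p2})$, $\rho_g = \sigma_{g12} / (\sigma_{g1}\sigma_{g2})$, $\rho_e = \sigma_{e12} / (\sigma_{e1}\sigma_{e2})$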


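  • Recall general points: Estimation, definition of the Likelihood function for a vector of parameters $\theta$ and a set of values x: $L(\theta) = \prod_i f(x_i; \theta)$.

    Find the most likely value of $\theta$ = maximise the Likelihood fn.

    Also defined the Log-likelihood (Support fn. $S(\theta) = \log L(\theta)$) and its derivative, the Score, together with the Information content per observation, which for a single-parameter likelihood is given by

    $I(\theta) = E[(\partial S / \partial\theta)^2] = -E[\partial^2 S / \partial\theta^2]$, with S taken for a single observation.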

  • Why MLE? (Need to know underlying distribution).

    Properties: Consistency; sufficiency; asymptotic efficiency (linked to variance); unique maximum; invariance and, hence most convenient parameterisation; usually MVUE; amenable to conventional optimisation methods.


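  • Variance of an Estimator - usual form $Var(\hat\theta) = E[(\hat\theta - E(\hat\theta))^2]$, or

    $\hat\sigma^2_{\hat\theta} = \sum_{i=1}^{k} (\hat\theta_i - \bar\theta)^2 / (k-1)$ for k independent estimates

  • For a large sample, the variance of the MLE can be approximated by $Var(\hat\theta) \approx 1 / (n I(\hat\theta))$;

    can also estimate empirically, using re-sampling techniques.

  • Variance of a linear function (of several estimates) is a common need in genomics analysis, e.g. heritability: $Var(\sum_i a_i \hat\theta_i) = \sum_i a_i^2 Var(\hat\theta_i) + 2 \sum_{i<j} a_i a_j Cov(\hat\theta_i, \hat\theta_j)$

  • Recall the Bias of the Estimator, $Bias(\hat\theta) = E(\hat\theta) - \theta$;

    then the Mean Square Error is defined to be $MSE(\hat\theta) = E[(\hat\theta - \theta)^2]$, which

    expands to $Var(\hat\theta) + [Bias(\hat\theta)]^2$,

    so we have the basis for C.I. and tests of hypothesis.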


  • Analytical - solving $dL/d\theta = 0$ or $dS/d\theta = 0$ when simple solutions exist

  • Grid search or likelihood profile approach

  • Newton-Raphson iteration methods

  • EM (expectation and maximisation) algorithm

    N.B. Use the Log-likelihood, because its maximum is at the same $\theta$ value as that of the Likelihood,

    it is easier to compute,

    and there is a close relationship between the statistical properties of the MLE

    and the Log-likelihood.

METHODS in brief

Analytical : - recall Binomial example earlier

  • Example: For the Normal, the MLEs of mean and variance (taking derivatives w.r.t. mean and variance separately) are equivalent to the sample mean and the actual variance (i.e. divisor N): $\hat\mu = \bar{x}$, $\hat\sigma^2 = \sum_i (x_i - \bar{x})^2 / N$. The variance MLE is unbiased if the mean is known, biased if not.

  • Invariance: One-to-one relationships preserved

  • Used: when the MLE has a simple solution
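
A small sketch of the analytical Normal case, comparing the MLE of the variance (divisor N) with the usual unbiased estimate (divisor N-1). The function name and the sample values are hypothetical, chosen only for illustration.

```python
def normal_estimates(xs):
    """Return (MLE of mean, MLE of variance, unbiased variance) for a sample."""
    n = len(xs)
    mean = sum(xs) / n
    ss = sum((x - mean) ** 2 for x in xs)  # sum of squares about the sample mean
    return mean, ss / n, ss / (n - 1)


sample = [4.1, 5.0, 5.9, 6.2, 4.8]  # hypothetical trait values
mean, var_mle, var_unbiased = normal_estimates(sample)
print(mean, var_mle, var_unbiased)
```

The two variance estimates differ by the factor (N-1)/N, so the bias of the MLE vanishes as N grows.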

Methods for MLE’s contd.

Grid Search – Computational

Plot likelihood or log-likelihood vs parameter. Various features

  • Relative Likelihood = Likelihood / Max. Likelihood (ML set = 1).

    The peak of the R.L. can be visually identified or sought algorithmically, e.g.

    a plot of likelihood over the parameter space range gives 2 peaks, symmetrical around $\theta = 0.5$, in the likelihood profile for the well-known mixed linkage phase problem in linkage analysis.

    If we constrain, e.g., $\theta \le 0.5$, the MLE = R.F. between genes (possible mixed linkage phase).


  • Graphic/numerical implementation - take an initial estimate of $\theta$; the direction of search is determined by evaluating the likelihood at both sides of $\theta$.

    The search takes the direction giving an increase. Initial search increments are large, e.g. 0.1; when the likelihood change starts to decrease or becomes negative, stop and refine the increment.

  • Multiple peaks - can miss the global maximum; computationally intensive

  • Multiple parameters - grid search. Interpretation of likelihood profiles can be difficult.
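
The search procedure described above (start from an initial estimate, look at both sides, move uphill, then refine the increment) might be sketched as follows for a single parameter. The binomial log-likelihood and the counts (3 recombinants among 20 progeny) are assumptions made for illustration; `grid_search` is a hypothetical helper, not a library routine.

```python
import math


def log_lik(theta, r, n):
    """Binomial log-likelihood (support) for r recombinants among n progeny."""
    return r * math.log(theta) + (n - r) * math.log(1.0 - theta)


def grid_search(support, theta, step=0.1, tol=1e-6, lower=1e-9, upper=1.0 - 1e-9):
    """Move to the better neighbour while the support increases, else refine the step."""
    while step > tol:
        here = support(theta)
        left = max(lower, theta - step)
        right = min(upper, theta + step)
        s_left, s_right = support(left), support(right)
        if s_left > here and s_left >= s_right:
            theta = left
        elif s_right > here:
            theta = right
        else:
            step /= 10.0  # no improvement on either side: refine the increment
    return theta


r, n = 3, 20  # hypothetical counts
theta_hat = grid_search(lambda t: log_lik(t, r, n), theta=0.5)
print(round(theta_hat, 4))
```

With a single-peaked likelihood this settles near r/n; with multiple peaks it would stop at whichever peak is uphill from the starting value, which is the weakness noted in the list above.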


  • Recall Exs 2, Q. 8.

    Data used to show a linkage relationship between a marker and a “rust-resistant” gene.

    Escapes = individuals who are susceptible, but show no disease (rust) phenotype under experimental conditions. So define $\alpha$ and $\theta$ as the proportion of escapes and the R.F. respectively.

    $1 - \alpha$ is the penetrance for the disease trait, i.e.

    P{individual with susceptible genotype has disease phenotype}.

    Purpose of expt.-typically to estimate R.F. between marker and gene.

  • Use: Support function = Log-Likelihood

Example contd.

  • Setting the 1st derivatives (Scores) w.r.t. each parameter equal to 0: the expected value of the Score w.r.t. $\theta$ is zero (see analogies in classical sampling/hypothesis testing); similarly for the proportion of escapes. Here, however, there is no simple analytical solution, so we cannot solve directly for either.

  • Using grid search, likelihood reaches maximum at

  • In general, this type of experiment tests H0: independence between marker and gene, and H0: no escapes

    Uses Likelihood Ratio Test statistics (the MLE equivalent of $\chi^2$).

  • N.B: Moment estimates solve slightly different problem, because no info. on expected frequencies, - (not same as MLE)

MLE Estimation Methods contd.

Newton-Raphson Iteration

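Have Score($\theta$) = 0 from previously. N-R consists of replacing the Score by the linear terms of its Taylor expansion, so if $\theta''$ is a solution and $\theta'$ = 1st guess,

$Score(\theta'') \approx Score(\theta') + (\theta'' - \theta')\, dScore(\theta')/d\theta = 0$, giving $\theta'' = \theta' + Score(\theta') / I(\theta')$ with $I(\theta') = -dScore(\theta')/d\theta$

Repeat with $\theta''$ replacing $\theta'$

Each iteration fits a parabola to the

Likelihood Fn.

  • Problems - multiple peaks, zero Information, extreme estimates

  • Multiple parameters - need matrix notation, where the S matrix, e.g. for two parameters $\theta$ and $\alpha$, has elements = derivatives of $S(\theta, \alpha)$ w.r.t. $\theta$ and $\alpha$ respectively. Similarly, the Information matrix has terms of the form $-\partial^2 S / \partial\theta\,\partial\alpha$ (or their expectations)

     Estimates are $(\theta'', \alpha'')^T = (\theta', \alpha')^T + I^{-1} S$, with S and I evaluated at $(\theta', \alpha')$




Variance of Log-L, i.e. $S(\theta)$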
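
A minimal single-parameter sketch of the Newton-Raphson update, $\theta'' = \theta' + Score(\theta')/I(\theta')$, using the observed information (minus the second derivative of the log-likelihood). The binomial likelihood and the counts are assumptions for illustration only; in this case the analytical answer r/n is available, which makes the iteration easy to check.

```python
def score(theta, r, n):
    """First derivative of the binomial log-likelihood."""
    return r / theta - (n - r) / (1.0 - theta)


def observed_information(theta, r, n):
    """Minus the second derivative of the binomial log-likelihood."""
    return r / theta ** 2 + (n - r) / (1.0 - theta) ** 2


def newton_raphson(r, n, theta=0.25, tol=1e-10, max_iter=50):
    """Iterate theta'' = theta' + Score/Information until the change is below tol."""
    for iteration in range(1, max_iter + 1):
        new_theta = theta + score(theta, r, n) / observed_information(theta, r, n)
        if abs(new_theta - theta) < tol:
            return new_theta, iteration
        theta = new_theta
    raise RuntimeError("Newton-Raphson did not converge")


r, n = 3, 20  # hypothetical: 3 recombinants among 20 progeny
theta_hat, n_iter = newton_raphson(r, n)
variance = 1.0 / observed_information(theta_hat, r, n)  # large-sample variance
print(round(theta_hat, 6), n_iter, round(variance, 6))
```

The inverse of the information at the solution gives the large-sample variance of the estimate, here equal to $\hat\theta(1-\hat\theta)/n$.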

Methods contd.

Expectation-Maximisation Algorithm - Iterative. Incomplete data

(Much genomic data fits this situation, e.g. linkage analysis with marker genotypes of F2 progeny. Usually 9 categories are observed for a 2-locus, 2-allele model, but 16 = complete info., while 14 give info. on linkage. Some are hidden, but if the linkage parameter is known, expected frequencies can be predicted - as you know - and the complete data restored using expectation).

  • Steps: (1) Expectation estimates statistics of the complete data, given the observed incomplete data.

  • (2) Maximisation uses the estimated complete data to give the MLE.

  • Iterate till converges (no further change)

E-M contd.


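  • Initial guess, $\theta'$, chosen (e.g. = 0.25, say, for the R.F.).

  • Taking this as “true”, the complete data is estimated by distributional statements, e.g. P(individual is recombinant, given observed genotype) for R.F. estimation.

  • MLE estimate $\theta''$ computed.

  • This, for the R.F., is the sum of recombinants / N.

  • Thus the MLE, for $f_i$ the observed count in class i, is $\theta'' = \frac{1}{N} \sum_i f_i \, P(R \mid i, \theta')$

    Convergence when $\theta'' = \theta'$ or $|\theta'' - \theta'| <$ a small tolerance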
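
The E and M steps might look as follows for one simple incomplete-data case: an F2 from a coupling-phase double heterozygote with codominant markers, where only the double-heterozygote class is ambiguous (it carries either 0 or 2 recombinant gametes). This is a sketch under those assumptions; the class counts are hypothetical, and recombinants are counted per gamete, so the divisor is 2N rather than N.

```python
def em_recombination(n_zero, n_one, n_two, n_dh, theta=0.25, tol=1e-10, max_iter=500):
    """EM estimate of the recombination fraction from F2 genotype class counts.

    n_zero, n_one, n_two: individuals known to carry 0, 1, 2 recombinant gametes.
    n_dh: double heterozygotes, carrying 0 or 2 recombinant gametes (hidden).
    """
    n_total = n_zero + n_one + n_two + n_dh
    for iteration in range(1, max_iter + 1):
        # E-step: P(double heterozygote carries two recombinant gametes | theta)
        p_double = theta ** 2 / (theta ** 2 + (1.0 - theta) ** 2)
        expected_recombinants = n_one + 2 * n_two + 2 * n_dh * p_double
        # M-step: proportion of recombinant gametes among all 2N gametes
        new_theta = expected_recombinants / (2 * n_total)
        if abs(new_theta - theta) < tol:
            return new_theta, iteration
        theta = new_theta
    raise RuntimeError("EM did not converge")


# Hypothetical counts, set to the expected class sizes for theta = 0.2 and N = 100.
theta_hat, n_iter = em_recombination(n_zero=32, n_one=32, n_two=2, n_dh=34)
print(round(theta_hat, 6), n_iter)
```

Because the counts were constructed from $\theta = 0.2$, the iteration should settle at that value, which gives a simple check that the E-step weights are right.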


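  • Likelihood Ratio Test - c.f. with $\chi^2$.

  • Principal advantage of G is Power, as unknown parameters are involved in the hypothesis test.

    Have: the Likelihood of $\theta$ taking a value $\theta_A$ which maximises

    it, i.e. its MLE, and the likelihood under H0: $\theta_N$ (e.g. $\theta_N = 0.5$)

  • Form of L.R. Test Statistic

    $LR = L(\theta_N) / L(\theta_A)$

    or, conventionally

    $G = 2 \log [L(\theta_A) / L(\theta_N)]$ - choose; easier to interpret.

  • Distribution of G ~ approx. $\chi^2$ (d.o.f. = difference in dimension of parameter spaces for $L(\theta_A)$, $L(\theta_N)$)

  • Goodness of Fit: notation as for $\chi^2$, $G = 2 \sum_{i=1}^{n} O_i \log(O_i / E_i)$, G ~ $\chi^2_{n-1}$

  • Independence: notation again as for $\chi^2$, $G = 2 \sum_i \sum_j O_{ij} \log(O_{ij} / E_{ij})$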

Power-Example extended

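  • Under H0: $\theta = \theta_N = 0.5$

  • At level of significance $\alpha$ = 0.05, suppose true $\theta = \theta_1 = 0.2$, so if n = 25, $E\{G\} = 2n[\theta_1 \log(\theta_1 / 0.5) + (1 - \theta_1) \log((1 - \theta_1) / 0.5)] \approx 9.6$

    (e.g. in genomics this might apply where R.F. = 0.2 between two genes, as opposed to 0.5. Natural logs are used, though either is possible in practice; hence the generic form “Log” rather than Ln here. Assume Ln throughout for genetic/genomic examples unless otherwise indicated.)

  • Rejection region at the 0.05 level is $G > \chi^2_{1,\,0.05} = 3.84$

  • If we sketch the curves, P{LRTS falls in the acceptance region} = 0.13,

    = Prob. of a false negative when the actual value of $\theta$ = 0.2

  • If the sample size is increased, e.g. n = 50, E{G} = 19 and it is easy to show that P{False negative} = 0.01

  • Generally: Power for these tests is given by $1 - \beta = P\{\chi^2_{\nu}(\lambda) > \chi^2_{\nu,\alpha}\}$, a non-central $\chi^2$ with non-centrality $\lambda \approx E\{G\}$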
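
The false-negative probabilities quoted above can be approximated with a short calculation. The sketch assumes that the expected G is used as the non-centrality parameter of a 1-d.o.f. non-central $\chi^2$, which for 1 d.o.f. can be written as $(Z + \sqrt{\lambda})^2$ with Z standard Normal; the function names are hypothetical.

```python
import math


def norm_cdf(z):
    """Standard Normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def expected_g(n, theta_true, theta_null=0.5):
    """Expected LRTS for a two-class (binomial) test of theta_null when theta_true holds."""
    return 2.0 * n * (
        theta_true * math.log(theta_true / theta_null)
        + (1.0 - theta_true) * math.log((1.0 - theta_true) / (1.0 - theta_null))
    )


def false_negative_1df(noncentrality, critical=3.84):
    """P{non-central chi-square (1 d.o.f.) falls in the acceptance region}."""
    d, c = math.sqrt(noncentrality), math.sqrt(critical)
    power = norm_cdf(d - c) + norm_cdf(-d - c)
    return 1.0 - power


for n in (25, 50):
    eg = expected_g(n, theta_true=0.2)
    print(n, round(eg, 2), round(false_negative_1df(eg), 3))
```

Under this approximation the values come out close to the 0.13 and 0.01 given above for n = 25 and n = 50.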

Likelihood C. I.’s - method

  • Example: Consider the following Likelihood function

    $L(\theta) = (1 - \theta)^a \, \theta^b$

    $\theta$ is the unknown parameter; a, b observed counts

  • For 4 data sets observed,

    A: (a,b) = (8,2), B: (a,b) = (16,4), C: (a,b) = (80,20), D: (a,b) = (400,100)

  • Likelihood estimates can be plotted vs possible parameter values, with MLE = peak value.

    e.g. MLE = 0.2, Lmax = 0.0067 for A, and Lmax = 0.000045 for B etc.

    Set A: Log Lmax - Log L = Log(0.0067) - Log(0.00091) = 2 gives $\approx$ 95% C.I.

    so $\theta$ = (0.035, 0.496) corresponding to L = 0.00091, $\approx$ 95% C.I. for A.

    Similarly, manipulating this expression, the Likelihood value corresponding to the $\approx$ 95% confidence interval is given as L = Lmax / 7.389 (since $e^2$ = 7.389)

    Note: Usually plot Log-likelihood vs parameter, rather than Likelihood.

    As sample size increases, the C.I. becomes narrower and $\approx$ symmetric
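
A likelihood interval of this kind can be read off a grid of log-likelihood values. The sketch below assumes the likelihood has the form $L(\theta) = (1-\theta)^a \theta^b$, which reproduces the MLE of 0.2 and Lmax of 0.0067 quoted for data set A, and takes all $\theta$ whose log-likelihood lies within 2 units of the maximum. The function names are hypothetical, and a grid read-off will only agree with values read from a plot to about two decimal places.

```python
import math


def log_lik(theta, a, b):
    """Log of the assumed likelihood (1 - theta)^a * theta^b."""
    return a * math.log(1.0 - theta) + b * math.log(theta)


def likelihood_interval(a, b, drop=2.0, steps=1000):
    """Return (MLE, lower, upper): theta values within `drop` log units of the maximum."""
    mle = b / (a + b)
    cutoff = log_lik(mle, a, b) - drop
    inside = [i / steps for i in range(1, steps) if log_lik(i / steps, a, b) >= cutoff]
    return mle, min(inside), max(inside)


for label, (a, b) in {"A": (8, 2), "B": (16, 4), "C": (80, 20), "D": (400, 100)}.items():
    mle, lower, upper = likelihood_interval(a, b)
    print(label, mle, lower, upper, round(upper - lower, 3))
```

Running the four data sets side by side shows the interval narrowing, and becoming more nearly symmetric about 0.2, as the counts increase.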

Multiple Populations: Extensions to G -Example

  • Recall Mendel’s data - earlier - and Extensions to $\chi^2$ for same

    In brief:

                     Round            Wrinkled
    Plant            O      E         O      E         G      dof   p-value
    1                45     42.75     12     14.25     0.49   1     0.49
    2                                                  0.09   1     0.77
    3                                                  0.10   1     0.75
    4                                                  1.30   1     0.26
    5                                                  0.01   1     0.93
    6                                                  0.71   1     0.40
    7                                                  0.79   1     0.38
    8                                                  0.63   1     0.43
    9                                                  1.06   1     0.30
    10                                                 0.17   1     0.68
    Total            336              101              5.34   10
    Pooled           336    327.75    101    109.25    0.85   1     0.36
    Heterogeneity                                      4.50   9     0.88
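
The G values for plant 1 and for the pooled data can be checked directly from the observed and expected counts in the table, using $G = 2 \sum O \log(O/E)$. This is a small sketch with hypothetical function names; the 1-d.o.f. p-value uses the identity $P\{\chi^2_1 > g\} = \mathrm{erfc}(\sqrt{g/2})$, and the total G of 5.34 is taken from the table rather than recomputed, since the per-plant counts are not all listed.

```python
import math


def g_statistic(observed, expected):
    """Likelihood ratio goodness-of-fit statistic G = 2 * sum(O * ln(O / E))."""
    return 2.0 * sum(o * math.log(o / e) for o, e in zip(observed, expected))


def p_value_1df(g):
    """Upper-tail probability of a chi-square with 1 d.o.f."""
    return math.erfc(math.sqrt(g / 2.0))


g_plant1 = g_statistic([45, 12], [42.75, 14.25])
g_pooled = g_statistic([336, 101], [327.75, 109.25])
g_heterogeneity = 5.34 - g_pooled  # total G taken from the table

print(round(g_plant1, 2), round(p_value_1df(g_plant1), 2))
print(round(g_pooled, 2), round(p_value_1df(g_pooled), 2))
print(round(g_heterogeneity, 2))
```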

Multiple Populations - summary

  • Parallels the $\chi^2$ case

  • Partitions: $G_{Total} = \sum_j G_j$ over populations, with p(n-1) d.o.f., and $G_{Pooled}$ with n-1 d.o.f., therefore

    $G_{Total} = G_{Pooled} + G_{heterogeneity}$

    and $G_{heterogeneity} = G_{Total} - G_{Pooled}$, with (p-1)(n-1) d.o.f. (n = no. classes, p = no. populations)

    Example: Recall Backcross (AaBb x aabb) - Goodness of fit (2-locus model).

    For each of 4 crosses, a Total GoF statistic can be calculated according to the expected segregation ratio 1:1:1:1 (assumes no segregation distortion for both loci and no linkage between loci).

    For each locus, GoF is calculated using marginal counts, assuming each genotype segregates 1:1.

    The difference between the Total and the 2 individual locus GoF statistics is the Log-LRTS (or chi-squared statistic) contributed by association/linkage between the 2 loci.

Example: Marker Screening

Screening for Polymorphism - (different detectable alleles) – look at stages involved.

Genomic map –based on genome variation at locations (from molecular assay or traditional trait observations).

(1) Screening polymorphic genetic markers is experimental step 1

- usually assay a large number of possible genetic markers in a small progeny set = random sample of the mapping population.

(If a marker does not show polymorphism for the set of progeny, then the marker is non-informative; it will not be used for data analysis.)

Example contd.

(2) Progeny size for screening - based on power, convenience etc.,

e.g. False positive = monomorphic marker determined to be polymorphic. Rare, since monomorphic markers (m-m) cannot produce segregating genotypes if these are determined accurately.

False negatives: high, particularly for a small sample, e.g. for markers segregating 1:1 - (i) Backcross, recombinant inbred lines, doubled haploid lines - or 1:2:1 - (ii) F2 with codominant markers.

So, e.g.

(i) P{sampling all individuals with same genotype} = $2(0.5)^n$

(ii) P{false negative for single marker, n = 5} = $2(0.25)^5 + 0.5^5 = 0.0332$

Hence Power curves as before.
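
The false-negative probabilities above are easy to tabulate against progeny size. The sketch below uses the two expressions from the slide, $2(0.5)^n$ for 1:1 segregation and $2(0.25)^n + 0.5^n$ for an F2 with codominant markers; the function names and the 5% threshold are hypothetical choices for illustration.

```python
def false_negative_1to1(n):
    """P{all n progeny share one genotype} for a marker segregating 1:1."""
    return 2 * 0.5 ** n


def false_negative_f2_codominant(n):
    """P{all n progeny share one genotype} for a codominant F2 marker (1:2:1)."""
    return 2 * 0.25 ** n + 0.5 ** n


def min_progeny(false_negative, max_rate):
    """Smallest progeny size whose false-negative probability is at most max_rate."""
    n = 1
    while false_negative(n) > max_rate:
        n += 1
    return n


print(false_negative_1to1(5), round(false_negative_f2_codominant(5), 4))
print(min_progeny(false_negative_1to1, 0.05), min_progeny(false_negative_f2_codominant, 0.05))
```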

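Example contd. S.R. 1:1 vs 3:1 - use LRTS

  • Detection of departure from S.R. of 1:1

    $G = 2[O_1 \log(O_1 / (0.5n)) + O_2 \log(O_2 / (0.5n))]$

    n = sample size, $O_1$, $O_2$ observed counts of the 2 genotypic classes.

  • For true S.R. 3:1, with $O_1$ the genotypic frequency of the dominant genotype, the T.S. parametric value is approx. $2n[0.75 \log(0.75/0.5) + 0.25 \log(0.25/0.5)] = 0.2616n$

Example contd.

  • To reject a S.R. of 1:1 at the 0.05 significance level, a LogLRTS of at least 3.84 (critical value for rejection) is required.

  • Statistical Power = P{LRTS $\ge$ 3.84 | true S.R. is 3:1}

  • For n = 15 then, the expected LRTS is 0.2616 × 15 $\approx$ 3.92, just above the critical value, so power is only about 0.5

  • For a power of 90%, n $\approx$ 40 needed

  • If the problem is expressed the other way, i.e. calculating the Expected LRTS (for rejecting a 3:1 S.R. when the true value is S.R. 1:1), this is 0.2877n and n $\approx$ 35 needed.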
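
The expected LRTS per individual and the sample sizes above can be reproduced approximately as follows. This is a sketch that assumes the expected LRTS acts as the non-centrality parameter of a 1-d.o.f. non-central $\chi^2$; the function names are hypothetical, and the sample sizes it returns are close to, but not exactly, the rounded values on the slide.

```python
import math


def norm_cdf(z):
    """Standard Normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def expected_lrts_per_individual(p_true, p_null):
    """2 * sum(p_true * ln(p_true / p_null)) over the genotypic classes."""
    return 2.0 * sum(t * math.log(t / h) for t, h in zip(p_true, p_null))


def power_1df(noncentrality, critical=3.84):
    """P{non-central chi-square (1 d.o.f.) exceeds the critical value}."""
    d, c = math.sqrt(noncentrality), math.sqrt(critical)
    return norm_cdf(d - c) + norm_cdf(-d - c)


def sample_size_for_power(per_individual, target=0.9):
    """Smallest n whose approximate power reaches the target."""
    n = 1
    while power_1df(per_individual * n) < target:
        n += 1
    return n


c_31_vs_11 = expected_lrts_per_individual([0.75, 0.25], [0.5, 0.5])  # true 3:1, test 1:1
c_11_vs_31 = expected_lrts_per_individual([0.5, 0.5], [0.75, 0.25])  # true 1:1, test 3:1
print(round(c_31_vs_11, 4), round(c_11_vs_31, 4))
print(round(power_1df(c_31_vs_11 * 15), 2))
print(sample_size_for_power(c_31_vs_11), sample_size_for_power(c_11_vs_31))
```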

Maximum Likelihood Benefits

  • Good Confidence Intervals

    Coverage probability realised and interval biologically meaningful

  • MLE Good estimator of a CI

    MSE consistent

    Absence of Bias

    - does not “stand-alone” – minimum variance important

    Asymptotically Normal

    Precise – large sample

    Biological inference valid

    Biological range realistic