Explore models and procedures for genomic and pedigree evaluation in animal breeding. Learn about single marker and multiple marker approaches, distributions for marker effects, and the importance of pseudo-data in genomic selection.
Joint records & genomic & pedigree evaluation Andrés Legarra#, Ignacio Aguilar * † #UR 631, SAGA, Castanet-Tolosan 31326 France *Animal and Dairy Science Department, University of Georgia, Athens 30602 †Instituto Nacional de Investigación Agropecuaria, Las Brujas 90200, Uruguay andres.legarra@toulouse.inra.fr with help from, I. Misztal, D.L. Johnson, T. Lawlor, M. Toro, JM Elsen, O. Christensen, L. Varona, P. Van Raden, G. de los Campos and others Zaragoza, 10/11/2009
Financing • Eadgene • AMASGEN • Holstein Association of America
Plan • Short review of models for genomic selection • The two-(three) step procedure • The genomic relationship matrix • …and its extensions to include pedigree • Performance of one-step evaluation
Single marker • Assume there is a marker in complete LD with a QTL • For example, the polymorphism in DGAT1 which increases fat yield • Use a linear model to estimate its effect • yi= marker effect in animal i + noise
Base model • y= Za + e • Z= incidence matrix of marker effects • a= marker effect • e=residuals 3 individuals, 1 marker with 4 alleles • This can be solved, for example, by least squares
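The least-squares solution of the base model can be sketched as follows. This is only an illustration: the genotypes and phenotypes are made up to match the "3 individuals, 1 marker with 4 alleles" layout, and with more alleles than records the solution is not unique (`numpy.linalg.lstsq` returns the minimum-norm one).

```python
import numpy as np

# Hypothetical data: 3 individuals, 1 marker with 4 alleles.
# Each row of Z counts the copies of each allele carried by the animal.
Z = np.array([
    [1, 1, 0, 0],  # animal 1 carries alleles 1 and 2
    [1, 0, 1, 0],  # animal 2 carries alleles 1 and 3
    [0, 1, 0, 1],  # animal 3 carries alleles 2 and 4
], dtype=float)
y = np.array([10.0, 12.0, 9.0])  # made-up phenotypes

# Least squares for y = Za + e. There are 4 unknowns and 3 records,
# so lstsq picks the minimum-norm solution among the infinitely many.
a_hat, residuals, rank, _ = np.linalg.lstsq(Z, y, rcond=None)
fitted = Z @ a_hat
```

Because the fit is exact here, the example also shows why least squares stops being useful once the number of marker effects exceeds the number of records.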
Single marker • This is fine if we know which markers are good predictors of which genes • But this is rarely the case
Whole genome • The simplest approach is to extend single-marker association analysis • Do multiple-marker regression • Why multiple? To account for all genes (markers) simultaneously • Works well only with dense markers! • Because to trace QTLs correctly we need some markers in LD with them
Multiple marker additive model • y= Za + e • Z= incidence matrix of marker effects • a= marker effect • e=residuals 4 individuals, 2 markers each 2 alleles in 1st marker 4 alleles in 2nd marker
A priori distributions for marker effects • Several distributions have been proposed • Normal (Meuwissen et al., Genetics 2001; Van Raden JDS 2008) • Mixture of normals (Van Raden JDS 2008) • BayesA, BayesB (Meuwissen et al. 2001) • Lasso (Yi and Xu, 2008, Genetics 179: 1045; De los Campos, Genetics, 2009; Park and Casella, Journal of the American Statistical Association) • No clear proof (from data) that any one is superior • I will use normal: $a \sim N(0, I\sigma^2_a)$
Useful parameterizations • value of « 1 » allele = 0 • value of « 2 » allele = ai, where ai is the effect of the SNP at that locus • « 11 » = 0 • « 12 » = ai • « 22 » = 2ai
Useful parameterizations • value of « 1 » allele = -0.5 ai • value of « 2 » allele = +0.5 ai, where ai is the effect of the SNP at that locus • « 11 » = -ai • « 12 » = 0 • « 22 » = +ai
Useful parameterizations • value of « 1 » allele = -pi ai • value of « 2 » allele = (1-pi) ai, where ai is the effect of the SNP at that locus, and pi is the frequency of allele 2 • This results in a centered Z matrix (E(Za)=0 for any a) • « 11 » = -2piai • « 12 » = ai-2piai • « 22 » = 2ai-2piai • How do we choose p?
Useful parameterizations • Different parameterizations do not give the same result • This is different from quantitative genetics theory • « Old » Falconerian genes are fixed, and constant terms are absorbed by the mean • But now SNPs are random effects
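The three allele codings on the slides above can be sketched in a few lines. The function name `code_genotypes`, the scheme labels and the genotype counts are all hypothetical; when `p` is not supplied, the sketch assumes it is estimated from the genotyped sample itself, which is only one possible answer to "how do we choose p?".

```python
import numpy as np

def code_genotypes(counts, scheme, p=None):
    """Turn counts of allele « 2 » (0, 1, 2) into a Z matrix.

    scheme: "012", "-101" or "centered" (i.e. -2p, 1-2p, 2-2p).
    p: frequency of allele « 2 » per locus; if None it is estimated
       from the sample (an assumption of this sketch).
    """
    counts = np.asarray(counts, dtype=float)
    if scheme == "012":
        return counts.copy()
    if scheme == "-101":
        return counts - 1.0
    if scheme == "centered":
        if p is None:
            p = counts.mean(axis=0) / 2.0
        return counts - 2.0 * np.asarray(p, dtype=float)
    raise ValueError(f"unknown scheme: {scheme}")

# Hypothetical genotypes: 4 animals (rows) x 3 SNPs (columns).
counts = [[0, 1, 2],
          [1, 1, 0],
          [2, 0, 1],
          [1, 2, 1]]

Z_012 = code_genotypes(counts, "012")
Z_101 = code_genotypes(counts, "-101")
Z_cen = code_genotypes(counts, "centered")
```

With sample frequencies, every column of the centered Z has mean zero, which is the sample analogue of E(Za)=0 for any a.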
Plan • Short review of models for genomic selection • The two-(three) step procedure • The genomic relationship matrix • …and its extensions to include pedigree • Performance of one-step evaluation
Why 2-step procedure • BayesB and A and mixtures and Lasso are fine, but only some animals are genotyped • Do they have data? • This limits practical applications • Need to get pseudo-data for genotyped animals
Inferring genotypes • Genotypes in some individuals can be inferred, only to some extent • Peeling • Peeling unilocus • Pseudo-peeling multi-locus • Gengler’s gene content prediction • They work well only a few generations back (forward) unless we genotype more individuals with low-density SNPs and then use (2) (Habier et al., Genetics 182: 343)
Pseudo-data • So we need pseudo-data • EBV’s • DYD’s
Pseudo-data • EBV’s • The problem with EBV’s is that they already share information among individuals • e.g., a dam’s EBV combines own yield + parent average + progeny contribution • But then we are including the genetic contribution of the parents, and thus the SNP effects that we want to estimate
Pseudo-data • Also, EBV’s are correlated • The correlation depends on the amount of data and distribution across fixed effects and families • EBVs of two cows are correlated, for example, if they belong to the same herd, even if they are not related
Pseudo-data • DYD’s avoid part of these problems (Van Raden and Wiggans, 1991) • DYD = daughter yield deviation • Record of the daughter, corrected for environmental effects and the dam’s EBV • Thus DYD = 0.5 BV sire + Mendelian sampling • E(DYD)=0.5 BV sire • YD’s exist for cows • YD = record - environmental effects
Pseudo-data • Problems of DYD’s / YD’s • YD’s are not very reliable and are subject to preferential treatment • YD’s and DYD’s are less correlated than EBV’s, yet still correlated, and their variances (=accuracies) are very hard to estimate. This leads to serious problems (Neuner et al., 2008, 2009) • Hard to define for some species/traits • We accept regular BLUP with pedigree that we don’t like
2-3-step procedure • Get pseudo-data from pedigree-BLUP • USA (Van Raden et al., JDS 92:16) • Run genomic evaluations with DYDs • Combine with pedigree-BLUP • FR (Guillaume et al. JDS 91:2520; also • http://www.inst-elevage.asso.fr/html1/IMG/pdf_CR_0972128-JT_13_oct_2009.pdf) • Run joint « QTLs – additive infinitesimal » BLUP evaluation with DYDs • Need variance component estimates (difficult to compute with 20-35 QTLs they’re using)
Real problem of Pseudo-data • Extremely complex procedure • Loss of generality • We analyse a subset of the population. • Thus, ungenotyped dams (or daughters) of a bull do not benefit from its improved accuracy.
Plan • Short review of models for genomic selection • The two-(three) step procedure • The genomic relationship matrix • …and its extensions to include pedigree • Performance of one-step evaluation
The genomic relationship matrix • Remember y = Za + noise (phenotype = sum of SNP effects + noise) • But we can say g = Za (genetic value = sum of SNP effects). This is a breeding value (= 2 EPD) in the literal sense • Then it follows that • $\mathrm{Var}(g) = ZZ'\sigma^2_a$ • Standardizing by a constant k • $\mathrm{Var}(g) = (ZZ'/k)\,\sigma^2_u = G\sigma^2_u$, with $G = ZZ'/k$ and $\sigma^2_u = k\sigma^2_a$ • where $\sigma^2_u$ is « the » additive variance
The genomic relationship matrix • G : matrix of pseudo-relationship or « genomic relationships » • Also, a « molecular relationship matrix » • ZZ’ : « looks like » number of shared SNP alleles among two individuals
Assume all possible genotype combinations at a locus and the 0/1/2 parameterization; the covariance at that locus is:
        11   12   22
  11     0    0    0
  12     0    1    2
  22     0    2    4
The genomic relationship matrix • ZZ’ differs depending on how we parameterize Z • Parameterizations are • -1,0,1 • 0,1,2 • -2p, 1-2p, 2-2p
The genomic relationship matrix • For example, assume two individuals are • (11,12,11,22,11) • (22,11,12,22,12) • zz’ (their covariance) is • 4 with 0, 1, 2 • 0 with -1, 0, 1 • -1.75 with -2p, 1-2p, 2-2p • all are correct (yet different) since each represents a valid linear model!
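The three values in this example can be checked numerically. The sketch below codes each genotype as the count of allele « 2 »; the -1.75 is reproduced under the assumption that p is the allele frequency computed from these two individuals alone, which the slide does not state explicitly.

```python
import numpy as np

# Genotypes as counts of allele « 2 »: 11 -> 0, 12 -> 1, 22 -> 2.
ind1 = np.array([0, 1, 0, 2, 0], dtype=float)  # (11,12,11,22,11)
ind2 = np.array([2, 0, 1, 2, 1], dtype=float)  # (22,11,12,22,12)

# 0, 1, 2 coding
zz_012 = ind1 @ ind2

# -1, 0, 1 coding
zz_101 = (ind1 - 1.0) @ (ind2 - 1.0)

# -2p, 1-2p, 2-2p coding, with p estimated per locus from the
# two individuals themselves (assumption of this sketch).
p = (ind1 + ind2) / 4.0
zz_centered = (ind1 - 2.0 * p) @ (ind2 - 2.0 * p)

print(zz_012, zz_101, zz_centered)  # 4.0 0.0 -1.75
```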
The genomic relationship matrix • How do we get the variance of SNP effects from a polygenic variance? • The formula assumes HW, linkage equilibrium of SNPs (which is false) Gianola et al. (2009) • This formula is (in HW) equal to trace(ZZ’)/ number of individuals in data • k is not the number of SNPs
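The scaling constant can be illustrated with simulated genotypes. The formula itself did not survive on the slide, so this sketch assumes the scaling commonly attributed to Van Raden (2008), k = 2 Σ p_i (1 - p_i), together with the centered coding; the sample size, number of SNPs and allele frequencies are arbitrary. It checks the statement that, under Hardy-Weinberg, k is close to trace(ZZ')/number of individuals.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 1000, 50                               # animals, SNPs (arbitrary)
p_true = rng.uniform(0.1, 0.9, size=m)        # hypothetical allele frequencies

# Genotypes drawn under Hardy-Weinberg and linkage equilibrium.
M = rng.binomial(2, p_true, size=(n, m)).astype(float)

p = M.mean(axis=0) / 2.0                      # sample frequency of allele « 2 »
Z = M - 2.0 * p                               # centered coding: -2p, 1-2p, 2-2p

k = 2.0 * np.sum(p * (1.0 - p))               # assumed scaling constant
trace_per_animal = np.sum(Z ** 2) / n         # trace(ZZ') / number of individuals

G = Z @ Z.T / k                               # genomic relationship matrix
mean_diagonal = np.mean(np.diag(G))           # should be close to 1
```

Note that k depends on the allele frequencies and is generally much smaller than the number of SNPs m, in line with the remark that k is not the number of SNPs.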
The genomic relationship matrix • Elements in G are related to « true » (IBD) relationships • Why? • Two guys share the same allele at the marker because they have a common ancestor (perhaps beyond pedigree founders)
The genomic relationship matrix • E(G)=A (relationship matrix) + a constant matrix (Habier et al. 2007 Genetics 177:1389) • If we use the parameterization -2p, 1-2p, 2-2p, then the constant is 0 (Van Raden, 2008, JDS 91:4414) • « If » p is the frequency in the founders • Otherwise the genotyped animals « are » the base population (Oliehoek et al., Genetics 173:483)
The genomic relationship matrix • E(G)=A (relationship matrix) • A is an average relationship, deviations of which do exist • G more informative than A. • Two fullsibs might have a correlation of 0.6 or 0.4 • You need many markers to get these « fine relationships »
Example • This is the chromosome of a sire • These are sons • In the infinitesimal model, each son receives exactly half of the sire.
Example • This is the chromosome of a sire • These are FOUR sons • In reality, two sons are identical, and the other two are very different from the first two but alike among themselves.
The genomic relationship matrix • G can be used for evaluation: • Same results as fitting marker effects a • Some nice properties • Pseudo-reliabilities from inverse • Smaller set of equations • Can use old programs
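The equivalence between fitting marker effects and using G can be checked on a toy data set. This is a minimal sketch with simulated genotypes and phenotypes, no fixed effects, and variance components assumed known; all numerical values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 6, 20                                   # 6 animals, 20 SNPs (arbitrary)
M = rng.integers(0, 3, size=(n, m)).astype(float)
Z = M - M.mean(axis=0)                         # centered genotype codes
y = rng.normal(size=n)                         # made-up (pre-corrected) phenotypes

sigma2_a, sigma2_e = 0.05, 1.0                 # assumed known variances
lam = sigma2_e / sigma2_a

# 1) Fit SNP effects a (ridge regression / SNP-BLUP): m equations.
a_hat = np.linalg.solve(Z.T @ Z + lam * np.eye(m), Z.T @ y)
g_snp = Z @ a_hat

# 2) Use the genomic relationship matrix (GBLUP): n equations.
k = float(m)                                   # any positive scaling works here
G = Z @ Z.T / k
sigma2_u = k * sigma2_a                        # so that G*sigma2_u = ZZ'*sigma2_a
g_gblup = sigma2_u * G @ np.linalg.solve(
    sigma2_u * G + sigma2_e * np.eye(n), y)
```

Both routes return the same genetic values; the second solves a system whose size is the number of animals rather than the number of SNPs, which is the "smaller set of equations" when markers outnumber genotyped animals.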
Plan • Short review of models for genomic selection • The two-(three) step procedure • The genomic relationship matrix • …and its extensions to include pedigree • Performance of one-step evaluation
Proposals for overall relationship matrix (Legarra et al., 2009 JDS 92:4656; Christensen & Lund, EAAP 2009) • Not a big loss in assuming normality for SNP effects (Cole et al., Van Raden et al.) • G is then easy to construct • Can we include G in the relationship matrix? • If we construct an overall relationship matrix with good properties, then we can just do BLUP with all data and animals
Proposals for overall relationship matrix (Legarra et al., 2009 JDS 92:4656; Christensen & Lund, EAAP 2009; also Misztal et al., JDS 92:4648) • Naive • Modification for progeny • Overall modification
Naive proposal (Legarra et al., 2009; Gianola & De los Campos, 2008) • Let • 1 : ungenotyped animals • 2 : genotyped animals
Naive proposal • Modification • 1 : do not touch • 2 : plug G (=K-1 in G&dlC) • Not guaranteed to be positive definite • Incoherent • Sons (parents) of two animals correlated in G might not be correlated themselves in A11
Naive proposal • Does not work • Assume 2 are bulls and 1 are cows • Then bulls’ EBV can be computed as • and G is of no use… • This is because Ag is not a valid covariance matrix, as assumed by selection index or BLUP • Ignoring G when computing covariances between 1 and 2, or among individuals in 1, is wrong • Here Z = incidence matrix of animals (not of SNPs!)
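A tiny numerical case shows how the plug-in matrix can stop being a valid covariance matrix. The pedigree (one ungenotyped offspring of two genotyped, unrelated parents) and the genomic relationship of -0.6 between the parents are hypothetical and deliberately extreme.

```python
import numpy as np

# Pedigree relationships; order: offspring (ungenotyped), sire, dam (genotyped).
A = np.array([[1.0, 0.5, 0.5],
              [0.5, 1.0, 0.0],
              [0.5, 0.0, 1.0]])

# Hypothetical genomic relationships between the two parents.
G = np.array([[1.0, -0.6],
              [-0.6, 1.0]])

# Naive proposal: replace the genotyped block and touch nothing else.
naive = A.copy()
naive[1:, 1:] = G

eig_A = np.linalg.eigvalsh(A)
eig_G = np.linalg.eigvalsh(G)
eig_naive = np.linalg.eigvalsh(naive)   # contains a negative eigenvalue
```

A and G are each positive definite, yet the patched matrix is not: the offspring keeps a covariance of 0.5 with both parents even though G says the parents are strongly negatively related.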
Modification for progeny • Assume all parents (« 1 ») are genotyped and use G • Use Quaas’ 1988 tracing of BV’s • Use average transmissions of 0.5 • Each son is half parents + mendelian samplings
Proposals for overall relationship matrix • Matrix becomes
What are these T’s and P’s • P: you are half your parents. One row of the pedigree file • T: you have ½ of the genes of your parents, ¼ of your grandparents, 1/8 of your great-grandparents and so on • Can be computed from the pedigree file through recursion • D: Mendelian samplings
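The roles of P, T and D can be sketched on a small pedigree. The pedigree below is hypothetical, and the sketch assumes the standard decomposition A = T D T' with T = (I - P)^(-1) and no inbred parents; it inverts (I - P) directly instead of using the recursion mentioned on the slide, which is fine only for toy sizes.

```python
import numpy as np

# Hypothetical pedigree, parents listed before offspring; -1 = unknown parent.
sire = [-1, -1, -1, 0, 0, 3]
dam = [-1, -1, -1, 1, 2, 2]
n = len(sire)

# P: each row holds 0.5 in the columns of the animal's known parents.
P = np.zeros((n, n))
for i in range(n):
    for parent in (sire[i], dam[i]):
        if parent >= 0:
            P[i, parent] += 0.5

# T: fraction of genes traced back to each ancestor (1/2, 1/4, 1/8, ...).
T = np.linalg.inv(np.eye(n) - P)

# D: Mendelian sampling variances, assuming no inbred parents:
# 1 for founders, 0.75 with one known parent, 0.5 with both known.
d = np.array([1.0 - 0.25 * ((sire[i] >= 0) + (dam[i] >= 0)) for i in range(n)])
D = np.diag(d)

A = T @ D @ T.T   # pedigree relationship matrix
```

For instance, animal 5 traces ¼ of its genes to founder 0 (a grandparent) and ½ to founder 2 (its dam), and the half-sibs 3 and 4 have a relationship of 0.25.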
Proposals for overall relationship matrix • Matrix becomes • Correct, positive definite if all founders genotyped • Otherwise incoherent « backwards » • Animals correlated in G might have uncorrelated ancestors • Not very practical because it is complicated and does not account for ancestors
Overall modification • What would we do « in the old times »? • Compute breeding values for whatever animals • Then use the « classical » selection index based on pedigree
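Overall modification • So then: the breeding values of ungenotyped animals are written as a selection index on those of the genotyped animals, plus a residual with the variance of the selection index (under normality) • and from this we can construct the joint covariance matrix of ungenotyped and genotyped animals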
Overall modification • This leads to: • (Semi) positive definite (by construction) • No obvious incoherences • Identical to Ap if all founders genotyped
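The construction can be sketched from the selection-index argument: regress ungenotyped animals (1) on genotyped animals (2) with coefficients A12 A22^(-1), give the genotyped animals variance G, and keep the pedigree-based residual variance A11 - A12 A22^(-1) A21. The function name `build_h` and the numbers are hypothetical; the resulting blocks are an assumption of this sketch, since the matrix itself did not survive on the slide.

```python
import numpy as np

def build_h(A11, A12, A22, G):
    """Joint covariance of ungenotyped (1) and genotyped (2) animals."""
    B = A12 @ np.linalg.inv(A22)           # selection-index coefficients
    H11 = A11 - B @ A12.T + B @ G @ B.T    # residual variance + index variance
    H12 = B @ G
    return np.block([[H11, H12],
                     [H12.T, G]])

# Toy case: one ungenotyped offspring of two genotyped, unrelated parents.
A11 = np.array([[1.0]])
A12 = np.array([[0.5, 0.5]])
A22 = np.eye(2)
G = np.array([[1.0, -0.6],
              [-0.6, 1.0]])                # hypothetical genomic relationships

H = build_h(A11, A12, A22, G)
H_if_G_equals_A22 = build_h(A11, A12, A22, A22)
A_full = np.block([[A11, A12], [A12.T, A22]])
```

In this toy case the matrix stays positive definite even with a strongly negative genomic relationship between the parents, and it collapses back to the pedigree matrix when G equals A22.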