Linear mixed models (LMMs) are becoming quite popular in population genetics. I have seen them used primarily in two settings: 1) in Visscher's work, to estimate the fraction of phenotypic variance explained by all the common SNPs; 2) in association studies, to correct for population structure. The two uses are closely related; let's discuss them in turn.
Using LMM to estimate the component of phenotypic variance explained by common SNPs.
The essential model is

$$y = X\beta + g + \epsilon,$$

where

- $y$ is the N-by-1 vector of phenotypes (N = # of samples).
- $X$ is the N-by-d matrix of fixed covariates, which includes all the non-genetic covariates.
- $\beta$ is the d-by-1 vector of fixed effects.
- $g \sim N(0, \sigma_g^2 K)$ is the random effect, and $K$ is the N-by-N matrix of genetic similarities.
- $\epsilon \sim N(0, \sigma_e^2 I)$ is the iid noise.
Integrating out $g$, an equivalent formulation is to write the log-likelihood as

$$\log p(y) = -\tfrac{1}{2}\log|V| - \tfrac{1}{2}(y - X\beta)^T V^{-1}(y - X\beta) - \tfrac{N}{2}\log 2\pi, \qquad V = \sigma_g^2 K + \sigma_e^2 I.$$

To find the MLE parameters, $\beta$ can be solved for analytically as usual (generalized least squares). Finding the optimal combination of $\sigma_g^2$ and $\sigma_e^2$ is a non-convex optimization. In practice this is done with a grid search, running a local gradient ascent from each point in the grid. In the end, the fraction $\sigma_g^2 / (\sigma_g^2 + \sigma_e^2)$ tells us the fraction of phenotypic variance explained by genetic similarity.
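As a concrete sketch of that optimization (plain numpy, not how any real package such as GCTA implements its REML search): for a fixed ratio $h^2 = \sigma_g^2/(\sigma_g^2 + \sigma_e^2)$, both $\beta$ and the total variance can be profiled out analytically, which reduces the whole problem to a one-dimensional search over $h^2$. The function names here are made up for illustration.

```python
import numpy as np

def profile_loglik(y, X, K, h2):
    """Profile log-likelihood at the ratio h2 = s2g / (s2g + s2e).
    beta and the total variance are profiled out analytically, so
    only the ratio h2 needs a numerical search."""
    N = len(y)
    H = h2 * K + (1.0 - h2) * np.eye(N)   # correlation structure at this h2
    Hinv = np.linalg.inv(H)
    beta = np.linalg.solve(X.T @ Hinv @ X, X.T @ Hinv @ y)  # GLS for beta
    r = y - X @ beta
    s2 = (r @ Hinv @ r) / N               # MLE of the total variance
    _, logdet = np.linalg.slogdet(H)
    return -0.5 * (N * np.log(2 * np.pi * s2) + logdet + N)

def estimate_h2(y, X, K, grid=np.linspace(0.01, 0.99, 99)):
    """Grid search for the h2 that maximizes the profile likelihood."""
    lls = [profile_loglik(y, X, K, h) for h in grid]
    return float(grid[int(np.argmax(lls))])
```

A gradient-based refinement could then be started from the best grid point, as described above.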
Another way to view the LMM is to write it as $y = X\beta + Wu + \epsilon$, where $u$ is an M-by-1 vector of SNP effects and $W$ is the normalized N-by-M genotype matrix, M being the number of SNPs. The connection with the formulation above is that $K = WW^T/M$ (so $\sigma_g^2 = M\sigma_u^2$). $u$ is a vector of random effects, $u \sim N(0, \sigma_u^2 I)$, because we are not trying to estimate its MLE values as we do for the fixed effects $\beta$; we are only interested in estimating its variance $\sigma_u^2$. If we treated $u$ as fixed effects, we would have to estimate M parameters, which leads to disastrous overfitting. Treating it as random creates only one additional parameter, $\sigma_u^2$. Note that $WW^T/M$ is just one way to compute the similarity matrix; there are other ways to construct $K$.
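For concreteness, here is a minimal numpy construction of $K = WW^T/M$. This is a sketch: it simply standardizes each SNP column, whereas real pipelines often center by allele frequency and scale by $\sqrt{2p(1-p)}$ instead.

```python
import numpy as np

def grm(G):
    """Genetic relationship matrix K = W W^T / M from an N-by-M genotype
    matrix G with entries in {0, 1, 2}.  Each SNP column is centered and
    scaled to unit variance; a common alternative scales by sqrt(2p(1-p))
    using each SNP's allele frequency p."""
    W = (G - G.mean(axis=0)) / G.std(axis=0)   # normalize each SNP column
    return W @ W.T / G.shape[1]                # average over M SNPs
```

With this normalization the diagonal of $K$ averages exactly 1, and off-diagonal entries measure relatedness between pairs of samples.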
Ok, so what's the intuition behind the random effects? I find it helpful to think in terms of conditional probabilities. Suppose individual $i$ has some genetic disposition for the phenotype, i.e. $g_i > 0$. If individual $j$ is genetically related to $i$, then $K_{ij}$ is close to 1, and conditionally $g_j$ is also likely to be positive.
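This intuition can be checked numerically. Below is a toy simulation (the 3-by-3 $K$ is made up, with $K_{01} = 0.5$ standing in for a pair of close relatives): repeated draws of $g \sim N(0, K)$ show the correlation that the random effect induces between related individuals.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy K: individuals 0 and 1 are related (K_01 = 0.5), individual 2 is not.
K = np.array([[1.0, 0.5, 0.0],
              [0.5, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
L = np.linalg.cholesky(K)
g = L @ rng.normal(size=(3, 20000))   # 20000 independent draws of g ~ N(0, K)
corr = np.corrcoef(g)
# corr[0, 1] should come out near 0.5, corr[0, 2] near 0:
# knowing g_0 > 0 shifts our belief about g_1, but tells us nothing about g_2.
```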
Visscher and colleagues have applied LMMs in exactly this way to estimate that genetic similarities can account for 45% of the variation in height and 17% of the variation in BMI. Note that the LMM estimates narrow-sense heritability, i.e. the variance component due to additive genetic effects. This is best seen in the formulation $y = X\beta + Wu + \epsilon$, where the random effects contribute additively through $Wu$.
Using LMM to correct for population structure in GWAS.
The typical GWAS association test has the form $y = x_i\beta_i + \epsilon$, where $x_i$ is the genotype vector of SNP $i$. (Ok, assume that effects due to age and sex have already been regressed out.) To account for population structure among the samples, one approach is to use an LMM as before: $y = x_i\beta_i + g + \epsilon$ with $g \sim N(0, \sigma_g^2 K)$. Note that the variance due to relatedness, $\sigma_g^2$, is then estimated separately for each SNP and may come out differently for different SNPs. A more efficient approach is to estimate $\sigma_g^2$ and $\sigma_e^2$ once from the null model $y = g + \epsilon$ and then keep $V = \sigma_g^2 K + \sigma_e^2 I$ fixed when testing each SNP.
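A sketch of that two-stage idea (a hypothetical helper, not the actual code of any package such as EMMAX): once $V$ is fixed from the null fit, $V^{-1}$ is computed a single time, and each SNP test reduces to a generalized least squares regression.

```python
import numpy as np

def gls_snp_test(y, x, Vinv):
    """Test one SNP under y = x*b + g + e, with the covariance
    V = s2g*K + s2e*I held fixed from the null-model fit, so Vinv is
    computed once and reused for every SNP in the scan.  Returns the
    GLS effect-size estimate and its z-score."""
    xVx = x @ Vinv @ x
    b = (x @ Vinv @ y) / xVx        # GLS estimate of the SNP effect
    se = np.sqrt(1.0 / xVx)         # its standard error under V
    return b, b / se
```

Compared with refitting $\sigma_g^2$ and $\sigma_e^2$ for every SNP, only one matrix inverse (or decomposition) is needed for the whole genome scan.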
Thoughts and questions.
- Seems like how $K$ is constructed makes a difference. Visscher uses $K = WW^T/M$, where I presume $W$ is built from the set of SNPs after pruning for haplotype structure. That seems reasonable. But how would other ways of constructing $K$ affect the estimation of the variance components?
- Comparison with PCA. Seems that the matrix $K$ captures a lot more fine-grained, and non-linear, information than the top PC components. If we used more PC components, would the results converge, or are there non-linearities in the genetic similarity that can't be captured by PCA?
- LMM tells us how much phenotypic variance can be explained by genetic similarities. But narrow-sense heritability should be a mixture of the overall genetic similarities, plus the presence/absence of a handful of specific loci which have abnormally large effects. So isn't a more accurate way to estimate narrow-sense heritability $y = X\beta + Zb + Wu + \epsilon$? Here the columns of $Z$ are the genotypes of the known significant loci (from GWAS, for example), $b$ is their vector of fixed effects, and $u$ is the random effect.
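On the PCA question, one way to make the comparison concrete (a numpy sketch on simulated genotypes; the sample sizes are arbitrary): the top principal components are the leading eigenvectors of the same similarity matrix $K$, so correcting with $k$ PCs amounts to using a rank-$k$ approximation of $K$, while the LMM uses the full-rank matrix.

```python
import numpy as np

rng = np.random.default_rng(3)
G = rng.binomial(2, 0.4, size=(100, 400)).astype(float)  # simulated genotypes
W = (G - G.mean(axis=0)) / G.std(axis=0)                 # normalize SNPs
K = W @ W.T / W.shape[1]                                 # similarity matrix
evals, evecs = np.linalg.eigh(K)                         # ascending eigenvalues
# Rank-10 piece of K: exactly what correcting for the top 10 PCs can "see".
K10 = evecs[:, -10:] @ np.diag(evals[-10:]) @ evecs[:, -10:].T
frac = evals[-10:].sum() / evals.sum()   # similarity mass in the top 10 PCs
```

The residual $K - K_{10}$ is the fine-grained relatedness that a 10-PC correction ignores but the LMM still models.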
Useful references (mostly in the supp. methods of papers):