Linear Mixed Models

Linear Mixed Models (LMM) are becoming quite popular in population genetics. I have seen them used primarily in two settings: 1) in Visscher’s work to estimate the fraction of phenotypic variance explained by all the common SNPs; 2) in association studies to correct for population structure. The two are closely related; let’s discuss them in turn.

Using LMM to estimate the component of phenotypic variance explained by common SNPs.

The essential model is $y = X\beta + u + \epsilon$ where

• y is the N-by-1 vector of phenotypes (N=# of samples).
• X is the N-by-d matrix of fixed covariates, which includes all the non-genetic covariates.
• $\beta$ is the d-by-1 vector of fixed effects.
• $u \sim N(0, \sigma_g^2 K)$ is the random effect, and $K$ is the N-by-N matrix of genetic similarities.
• $\epsilon \sim N(0, \sigma_e^2 I )$ is the iid noise.

Integrating out $u$, an equivalent formulation is to write the log-likelihood as $LL(\sigma_g^2, \sigma_e^2, \beta) = \log N(y \mid X\beta,\ \sigma_g^2 K + \sigma_e^2 I)$. To find the MLE parameters, $\beta$ can be solved for analytically as usual (by generalized least squares, given the variance components). Finding the optimal combination of $\sigma_g^2$ and $\sigma_e^2$ is a non-convex optimization. In practice this is done via a grid search, with a local optimization started from each grid point. In the end, the fraction $\frac{\sigma_g^2}{\sigma_e^2+\sigma_g^2}$ tells us the fraction of phenotypic variance explained by genetic similarity.
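The grid search above can be sketched in a few lines of numpy. This is a minimal illustration on synthetic data, not any paper's implementation: it reparametrizes by $h^2 = \sigma_g^2/(\sigma_g^2+\sigma_e^2)$, profiles out $\beta$ (by GLS) and the overall scale analytically, and scans a grid over $h^2$. All names and simulation values are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 200, 500

# Simulate standardized genotypes and a phenotype with true h2 = 0.5.
W = rng.standard_normal((N, M))
W = (W - W.mean(0)) / W.std(0)
K = W @ W.T / M
v = rng.standard_normal(M) * np.sqrt(0.5 / M)   # per-SNP random effects
y = W @ v + rng.standard_normal(N) * np.sqrt(0.5)
X = np.ones((N, 1))                              # intercept only

def log_lik(h2, y, X, K):
    """Profile log-likelihood at heritability h2: beta is solved by GLS
    and the total variance scale is profiled out analytically."""
    N = len(y)
    V = h2 * K + (1 - h2) * np.eye(N)   # covariance, up to an overall scale
    L = np.linalg.cholesky(V)
    yt = np.linalg.solve(L, y)          # whiten y and X
    Xt = np.linalg.solve(L, X)
    beta, *_ = np.linalg.lstsq(Xt, yt, rcond=None)
    r = yt - Xt @ beta
    s2 = (r @ r) / N                    # MLE of the overall scale
    _, logdet = np.linalg.slogdet(V)
    return -0.5 * (N * np.log(2 * np.pi * s2) + logdet + N)

# Grid search over h2 = sigma_g^2 / (sigma_g^2 + sigma_e^2).
grid = np.linspace(0.01, 0.99, 99)
h2_hat = grid[np.argmax([log_lik(h, y, X, K) for h in grid])]
print(f"estimated h2 = {h2_hat:.2f}")
```

A real implementation would refine the grid maximum with a local optimizer, and would exploit the eigendecomposition of $K$ so the likelihood at each grid point costs $O(N^2)$ rather than $O(N^3)$ (the FaST-LMM trick).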

Another way to view LMM is to think of it as $y=X\beta+Wv+\epsilon$, where $v \sim N(0, \sigma_v^2I)$ is a vector of SNP effects and $W$ is the normalized N-by-M genotype matrix, with M the number of SNPs. The connection with the above is that $K=WW'/M$. Here $v$ is a vector of random effects because we are not trying to estimate its MLE values as we do for the fixed effects $\beta$; we are only interested in estimating its variance $\sigma_v^2$. If we treated $v$ as fixed effects, we would have to estimate M parameters, which leads to disastrous overfitting. Treating it as random creates only one additional parameter, $\sigma_v^2$. Note that $K=WW'/M$ is just one way to compute the similarity matrix; there are other ways to construct $K$.
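Constructing $K=WW'/M$ is simple once the genotypes are standardized. A hedged sketch, with made-up allele frequencies: each SNP column (counts in {0, 1, 2}) is centered by $2p$ and scaled by $\sqrt{2p(1-p)}$, the usual normalization.

```python
import numpy as np

rng = np.random.default_rng(1)
N, M = 100, 1000

# Hypothetical allele frequencies and raw genotype counts in {0, 1, 2}.
p = rng.uniform(0.05, 0.5, size=M)
G = rng.binomial(2, p, size=(N, M)).astype(float)

# Standardize each SNP: subtract 2p, divide by sqrt(2p(1-p)).
W = (G - 2 * p) / np.sqrt(2 * p * (1 - p))

# Genetic relationship matrix; its diagonal averages to ~1 by construction.
K = W @ W.T / M
print(K.shape)   # (100, 100)
```

In practice $p$ is estimated from the sample rather than known, and rare SNPs are often filtered first to avoid dividing by a near-zero variance.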

Ok, so what’s the intuition behind the random effects? I find it helpful to think in terms of conditional probabilities. Suppose individual $i$ has some genetic disposition for the phenotype, i.e. $u_i>0$. If $j$ is genetically related to $i$, then $K(i,j)$ is close to 1, and conditional on $u_i > 0$, $u_j$ is also likely to be positive: for standardized $K$ (unit diagonal), $E[u_j \mid u_i] = K(i,j)\, u_i$.

Visscher and colleagues have applied LMM in exactly this way to estimate that genetic similarities at common SNPs can account for 45% of the variation in height. Note that LMM estimates narrow-sense heritability, i.e. the variance component due to additive genetic effects. This is best seen in the formulation $y=X\beta+Wv+\epsilon$, where the random effects contribute additively through $Wv$.

Using LMM to correct for population structure in GWAS.

The typical GWAS association test has the form $y= X_i\beta_i + \epsilon$, where $X_i$ is the genotype of SNP $i$. (Assume that effects due to age, sex, etc. have already been regressed out.) To account for population structure among the samples, one approach is to use LMM as before: $y=X_i\beta_i + u + \epsilon$ with $u \sim N(0, \sigma_{g,i}^2 K)$. Note that the variance due to relatedness, $\sigma_{g,i}^2$, is estimated separately for each SNP and may differ across SNPs. A more efficient approach is to estimate $\sigma_g^2$ once under the null model $y= u + \epsilon$, $u \sim N(0, \sigma_g^2 K)$, and then set $\sigma_{g,i}^2 = \sigma_g^2$ $\forall i$.
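With the variance components fixed, each per-SNP test reduces to generalized least squares: whiten $y$ and the genotype by the Cholesky factor of $V = \sigma_g^2 K + \sigma_e^2 I$, then run ordinary regression. A minimal sketch on synthetic data (the variance components are plugged in rather than estimated, and all values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
N, M = 300, 50

# Synthetic standardized genotypes, kinship, and a null phenotype.
W = rng.standard_normal((N, M))
K = W @ W.T / M
V = 0.4 * K + 0.6 * np.eye(N)   # assume sigma_g^2, sigma_e^2 already estimated
y = rng.multivariate_normal(np.zeros(N), V)

L = np.linalg.cholesky(V)

def lmm_assoc(x, y, L):
    """GLS test of one SNP: whiten by the Cholesky factor of V, then OLS."""
    xt = np.linalg.solve(L, np.column_stack([np.ones(len(y)), x]))
    yt = np.linalg.solve(L, y)
    beta, *_ = np.linalg.lstsq(xt, yt, rcond=None)
    r = yt - xt @ beta
    se = np.sqrt((r @ r) / (len(y) - 2) * np.linalg.inv(xt.T @ xt)[1, 1])
    return beta[1], beta[1] / se   # effect size and z-statistic

beta, z = lmm_assoc(W[:, 0], y, L)
print(f"beta = {beta:.3f}, z = {z:.2f}")
```

Since the Cholesky factor of $V$ is computed once and reused for every SNP, each test costs only a triangular solve, which is what makes the "estimate $\sigma_g^2$ once" shortcut efficient.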

Thoughts and questions.

1. Seems like how $K$ is constructed makes a difference. Visscher uses $K=WW'/M$ where I presume $W$ is the set of SNPs after pruning for haplotype structure. That seems reasonable. But how would other ways of constructing $K$ affect the estimation of variance components?
2. Comparison with PCA. It seems that the matrix $K$ captures much more fine-grained, and non-linear, information than the top principal components. If we use more PCs, would the results converge, or are there non-linearities in the genetic similarity that can’t be captured by PCA?
3. LMM tells us how much phenotypic variance can be explained by genetic similarities. But narrow-sense heritability should be a mixture of the overall genetic similarities, plus the presence/absence of a handful of specific loci with abnormally large effects. So isn’t a more accurate way to estimate narrow-sense heritability via $y=X_{strong}\beta + u + \epsilon$? Here $X_{strong}$ are the genotypes of the known significant loci (from GWAS, for example) and $u$ is the random effect.

Useful references (mostly in the supp. methods of papers):

http://www.nature.com/ng/journal/v42/n4/pdf/ng.548.pdf

http://www.cs.toronto.edu/~jenn/papers/FaST-LMM-2011.pdf

http://www.nature.com/nmeth/journal/v9/n6/extref/nmeth.2037-S1.pdf

http://www.nature.com/nrg/journal/v11/n7/pdf/nrg2813.pdf