Linear Mixed Models

Linear Mixed Models (LMM) are becoming quite popular in population genetics. I have seen them used primarily in two settings: 1) in Visscher’s work to estimate the fraction of phenotypic variance explained by all the common SNPs; 2) in association studies to correct for population structure. The two uses are closely related; let’s discuss them in turn.

Using LMM to estimate the component of phenotypic variance explained by common SNPs. 

The essential model is y = X\beta + u + \epsilon where

  • y is the N-by-1 vector of phenotypes (N=# of samples).
  • X is the N-by-d matrix of fixed covariates, which includes all the non-genetic covariates.
  • \beta is the d-by-1 vector of fixed effects.
  • u \sim N(0, \sigma_g^2 K) is the random effect, and K is the N-by-N matrix of genetic similarities.
  • \epsilon \sim N(0, \sigma_e^2 I ) is the iid noise.
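To make the model concrete, here is a minimal simulation sketch in Python. The sample size, covariates, and variance components are all made up, and K is built from random “genotypes” purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 500, 3                       # hypothetical sample size and # covariates

# A toy similarity matrix K (PSD, diagonal ~1) built from random "genotypes".
G = rng.standard_normal((N, 800))
K = G @ G.T / 800

X = np.column_stack([np.ones(N), rng.standard_normal((N, d - 1))])
beta = np.array([1.0, 0.5, -0.3])   # made-up fixed effects
sigma_g2, sigma_e2 = 2.0, 1.0       # made-up variance components

# Draw u ~ N(0, sigma_g2 * K) via a Cholesky factor (small jitter for stability).
L = np.linalg.cholesky(K + 1e-6 * np.eye(N))
u = np.sqrt(sigma_g2) * (L @ rng.standard_normal(N))
eps = np.sqrt(sigma_e2) * rng.standard_normal(N)
y = X @ beta + u + eps
```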

Integrating out u, an equivalent formulation is to write the log-likelihood as LL(\sigma_g^2, \sigma_e^2, \beta) = \log N(y \mid X\beta,\; \sigma_g^2 K + \sigma_e^2 I). To find the MLE parameters, \beta can be solved for analytically as usual, but finding the optimal combination of \sigma_g^2 and \sigma_e^2 is a non-convex optimization. In practice this is done via a grid search, with a local gradient ascent from each point in the grid. In the end, the fraction \frac{\sigma_g^2}{\sigma_g^2+\sigma_e^2} tells us the fraction of phenotypic variance explained by genetic similarity.
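A sketch of this fit, assuming a simple grid over the heritability fraction h = \sigma_g^2/(\sigma_g^2+\sigma_e^2), with \beta and the total variance profiled out analytically at each grid point. A single eigendecomposition of K makes every grid evaluation cheap; fit_lmm and its argument names are my own, not from any particular package:

```python
import numpy as np

def fit_lmm(y, X, K, grid=np.linspace(0.01, 0.99, 99)):
    """Grid search over h = sigma_g^2 / (sigma_g^2 + sigma_e^2); beta and
    the total variance are profiled out analytically at each grid point."""
    N = len(y)
    s, U = np.linalg.eigh(K)           # K = U diag(s) U'
    yt, Xt = U.T @ y, U.T @ X          # rotate: covariance becomes diagonal
    best = None
    for h in grid:
        w = h * s + (1 - h)            # eigenvalues of h*K + (1-h)*I
        Xw = Xt / w[:, None]
        beta = np.linalg.solve(Xt.T @ Xw, Xw.T @ yt)   # GLS estimate of beta
        r = yt - Xt @ beta
        sigma2 = np.mean(r**2 / w)     # profiled total variance
        ll = -0.5 * (N * np.log(sigma2) + np.sum(np.log(w)) + N)
        if best is None or ll > best[0]:
            best = (ll, h, sigma2, beta)
    ll, h, sigma2, beta = best
    return {"h2": h, "sigma_g2": h * sigma2, "sigma_e2": (1 - h) * sigma2,
            "beta": beta, "loglik": ll}
```

The rotation by the eigenvectors of K is what makes this practical: the decomposition is done once, after which each grid evaluation only involves diagonal covariances.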

Another way to view the LMM is as y = X\beta + Wv + \epsilon, where v \sim N(0, \sigma_v^2 I) is a vector of SNP effects and W is the normalized N-by-M genotype matrix (M is the number of SNPs). The connection with the above is that K = WW'/M. v is a vector of random effects because we are not trying to estimate its MLE values as we do for the fixed effects \beta; we are only interested in estimating its variance \sigma_v^2. If we treated v as fixed effects, we would have to estimate M parameters, which leads to disastrous overfitting. Treating it as random adds only one parameter, \sigma_v^2. Note that K = WW'/M is just one way to compute the similarity matrix; there are other ways to construct K.
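A sketch of the standard normalization that produces W and K = WW'/M from a raw 0/1/2 genotype matrix. The function name grm is mine, and the snippet assumes no missing genotypes and no monomorphic SNPs:

```python
import numpy as np

def grm(G):
    """Genetic relationship matrix K = W W'/M from an N-by-M genotype
    matrix G with entries 0/1/2 (minor-allele counts).  Each column is
    standardized to mean 0 and (approximately) unit variance, giving the
    normalized matrix W from the text."""
    p = G.mean(axis=0) / 2.0                    # estimated allele frequencies
    W = (G - 2 * p) / np.sqrt(2 * p * (1 - p))  # center and scale each SNP
    return W @ W.T / G.shape[1]
```

With this normalization the diagonal of K is close to 1, so \sigma_g^2 is on the same scale as the phenotypic variance components.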

Ok, so what’s the intuition behind the random effects? I find it helpful to think in terms of conditional probabilities. Suppose individual i has some genetic predisposition for the phenotype, i.e. u_i > 0. If j is genetically related to i, then K(i,j) is close to 1, and the conditional distribution of u_j given u_i > 0 is shifted upward, so u_j is also likely to be positive.
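This intuition is easy to check numerically. For a pair (u_i, u_j), conditional-normal algebra gives E[u_j \mid u_i] = (K(i,j)/K(i,i))\, u_i, so with a hypothetical K(i,j) = 0.9 (close relatives), u_j > 0 should be much more likely than not once we condition on u_i > 0:

```python
import numpy as np

# For two individuals i, j, the pair (u_i, u_j) is bivariate normal with
# Cov(u_i, u_j) = sigma_g^2 * K[i, j].  Conditional-normal algebra gives
# E[u_j | u_i] = (K[i, j] / K[i, i]) * u_i, so if K[i, j] is near 1 and
# u_i > 0, then u_j is likely positive too.
sigma_g2 = 1.0
Kii, Kjj, Kij = 1.0, 1.0, 0.9       # hypothetical pair of close relatives
cov = sigma_g2 * np.array([[Kii, Kij], [Kij, Kjj]])
rng = np.random.default_rng(0)
u = rng.multivariate_normal([0.0, 0.0], cov, size=100_000)
frac = np.mean(u[u[:, 0] > 0, 1] > 0)   # estimate of P(u_j > 0 | u_i > 0)
```

For correlation 0.9 the exact value is 1/2 + \arcsin(0.9)/\pi \approx 0.86, far above the 0.5 we would get for unrelated individuals.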

Visscher and colleagues have applied LMMs in exactly this way to estimate that genetic similarities account for about 45% of the variance in height and 17% of the variance in body mass index. Note that the LMM estimates narrow-sense heritability, i.e. the variance component due to additive genetic effects. This is best seen in the formulation y = X\beta + Wv + \epsilon, where the random effects contribute additively through Wv.

Using LMM to correct for population structure in GWAS. 

The typical GWAS association test has the form y = X_i\beta_i + \epsilon, where X_i is the genotype at SNP i. (Assume that effects due to age, sex, etc. have already been regressed out.) To account for population structure among the samples, one approach is to use an LMM as before: y = X_i\beta_i + u + \epsilon with u \sim N(0, \sigma_{g,i}^2 K). Note that the variance due to relatedness, \sigma_{g,i}^2, is estimated separately for each SNP and may differ across SNPs. A more efficient approach is to estimate \sigma_g^2 once from the null model y = u + \epsilon, u \sim N(0, \sigma_g^2 K), and then set \sigma_{g,i}^2 = \sigma_g^2 for all i.
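A sketch of the efficient per-SNP test, assuming \sigma_g^2 and \sigma_e^2 have already been estimated once and are reused for every SNP (the approach taken by methods like EMMAX). The name lmm_assoc is mine, and covariates (including the intercept) are assumed already regressed out of both y and the genotype columns:

```python
import numpy as np

def lmm_assoc(y, G, K, sigma_g2, sigma_e2):
    """Per-SNP association test under y = x_i beta_i + u + eps with
    u ~ N(0, sigma_g2 K): generalized least squares with the covariance
    Sigma = sigma_g2 K + sigma_e2 I fixed once and reused for every SNP.
    Assumes covariates (including the intercept) were regressed out of
    y and of each genotype column beforehand."""
    s, U = np.linalg.eigh(K)
    w = sigma_g2 * s + sigma_e2           # eigenvalues of Sigma
    yt = (U.T @ y) / np.sqrt(w)           # whiten: Cov(yt) = I
    betas, zscores = [], []
    for i in range(G.shape[1]):
        xt = (U.T @ G[:, i]) / np.sqrt(w)
        xtx = xt @ xt
        b = xt @ yt / xtx                 # OLS in the whitened space
        zscores.append(b * np.sqrt(xtx))  # b / se, se = 1/sqrt(x' Sigma^-1 x)
        betas.append(b)
    return np.array(betas), np.array(zscores)
```

As with the variance-component fit, the expensive eigendecomposition is done once; each SNP then costs only a matrix-vector product.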

Thoughts and questions.

  1. Seems like how K is constructed makes a difference. Visscher uses K=WW'/M where I presume W is the set of SNPs after pruning for haplotype structure. That seems reasonable. But how would other ways of constructing K affect the estimation of variance components?
  2. Comparison with PCA. It seems that the matrix K captures much more fine-grained, and non-linear, information than the top principal components. If we used more PC components, would the results converge, or are there non-linearities in the genetic similarity that can’t be captured by PCA?
  3. LMM tells us how much phenotypic variance can be explained by genetic similarities. But narrow-sense heritability should be a mixture of the overall genetic similarities plus the presence/absence of a handful of specific loci with abnormally large effects. So isn’t a more accurate way to estimate narrow-sense heritability y = X_{strong}\beta + u + \epsilon? Here X_{strong} are the genotypes of the known significant loci (from GWAS, for example) and u is the random effect.


