Surrogate variable analysis.

Here’s a very common scenario: we want to understand the relationship between X and Y, where X is a set of measured variables (gene expression levels, in the usual application) and Y is the phenotype/status. This is complicated by hidden covariates H that also affect X, and we don’t know what these are. The generative model is, schematically, $X = Y + H + \epsilon$, and we can’t see H. There are several approaches here.

  1. Pretend H does not exist and run the linear regression $X = \beta Y + \epsilon$. If H is correlated with Y, this naive model credits H’s effect on X to Y, so it can report that Y is associated with X regardless of the true association.
  2. Run PCA on X, remove its top principal components to get X’, then run $X' = \beta Y + \epsilon$. The top principal components may partially contain true effects of Y, so this approach loses power to detect true associations. Underfitting.
  3. Run PCA on the residuals of the regression in 1, $X - \hat{\beta} Y$, and use the top principal components as estimates of H. The residuals are orthogonal to Y by construction, so this cannot detect the part of H that overlaps with Y. That part is still credited to Y, which results in overfitting.
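To make the three approaches concrete, here is a small simulation sketch. Everything in it is an assumption made for illustration: the function names (`simulate`, `y_slopes`, `pca`), the explicit coefficient vectors `beta` and `gamma` (the model above is written without them), the correlation `rho` between H and Y, and the choice to use a single principal component. It also adds an "oracle" fit that adjusts for the true H, which is only possible because the data are simulated.

```python
import numpy as np

def simulate(n_samples=200, n_genes=500, n_true=50, rho=0.6, seed=0):
    """Draw X = y*beta + h*gamma + noise, with the hidden h correlated with y."""
    rng = np.random.default_rng(seed)
    y = rng.standard_normal(n_samples)
    h = rho * y + np.sqrt(1 - rho**2) * rng.standard_normal(n_samples)
    beta = np.zeros(n_genes)
    beta[:n_true] = 1.0                    # only the first n_true columns depend on y
    gamma = rng.standard_normal(n_genes)   # every column depends on h
    noise = rng.standard_normal((n_samples, n_genes))
    X = np.outer(y, beta) + np.outer(h, gamma) + noise
    return X, y, h, beta

def y_slopes(X, y, covariates=None):
    """OLS coefficient on y for each column of X (intercept included)."""
    cols = [np.ones_like(y), y] + ([] if covariates is None else [covariates])
    design = np.column_stack(cols)
    coef, *_ = np.linalg.lstsq(design, X, rcond=None)
    return coef[1]

def pca(M):
    """SVD of the column-centred matrix: scores U, singular values s, loadings Vt."""
    return np.linalg.svd(M - M.mean(axis=0), full_matrices=False)

X, y, h, beta = simulate()
null = beta == 0                           # columns with no true dependence on y
Xc = X - X.mean(axis=0)

# Approach 1: ignore H.
b_naive = y_slopes(X, y)

# Approach 2: strip the top PC of X, then regress on y.
U, s, Vt = pca(X)
X_stripped = Xc - np.outer(U[:, 0] * s[0], Vt[0])
b_pca = y_slopes(X_stripped, y)

# Approach 3: top PC of the residuals, used as an extra covariate.
yc = y - y.mean()
residuals = Xc - np.outer(yc, b_naive)
U_res, _, _ = pca(residuals)
b_resid = y_slopes(X, y, covariates=U_res[:, 0])

# Oracle for reference: adjust for the true H, which is unobservable in practice.
b_oracle = y_slopes(X, y, covariates=h)

for name, b in [("naive", b_naive), ("PCA on X", b_pca),
                ("PCA on residuals", b_resid), ("oracle", b_oracle)]:
    print(f"{name:17s} mean |slope| on null columns: {np.abs(b[null]).mean():.2f}"
          f"   mean slope on true columns: {b[~null].mean():.2f}")
```

One thing the sketch makes visible: because the residual component is orthogonal to Y, adding it as a covariate leaves the fitted coefficients on Y exactly as in approach 1. Only the residual variance changes, so the part of H that overlaps with Y is untouched.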

SVA is an ad hoc method that sits between 2 and 3. Like 3, it first computes principal components of the residuals. Then, for each principal component direction, it finds the subset of genes that are significantly correlated with this direction. The surrogate variable is then built from the original (not residualized) values of that subset of genes. Because it is no longer forced to be orthogonal to Y, the surrogate direction can overlap with Y, which lessens the overfitting of 3.
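The procedure can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions, not the published algorithm or the API of any SVA package. It handles a single surrogate variable and picks genes by a fixed count (the hypothetical parameter `n_select`) rather than by a significance test. It also assumes the surrogate is the principal component of the selected genes' original values that best matches the residual direction. The simulated data and the names `simulate`, `residual_eigengene` and `surrogate_variable` are likewise made up for the example.

```python
import numpy as np

def simulate(n_samples=200, n_genes=500, n_true=50, rho=0.6, seed=0):
    """Draw X = y*beta + h*gamma + noise, with the hidden h correlated with y."""
    rng = np.random.default_rng(seed)
    y = rng.standard_normal(n_samples)
    h = rho * y + np.sqrt(1 - rho**2) * rng.standard_normal(n_samples)
    beta = np.zeros(n_genes)
    beta[:n_true] = 1.0
    gamma = rng.standard_normal(n_genes)
    X = (np.outer(y, beta) + np.outer(h, gamma)
         + rng.standard_normal((n_samples, n_genes)))
    return X, y, h

def corr(a, b):
    """Pearson correlation between two 1-D arrays."""
    return np.corrcoef(a, b)[0, 1]

def residual_eigengene(X, y):
    """Top principal component (one value per sample) of the residuals of X on y."""
    Xc, yc = X - X.mean(axis=0), y - y.mean()
    slopes = yc @ Xc / (yc @ yc)
    U, _, _ = np.linalg.svd(Xc - np.outer(yc, slopes), full_matrices=False)
    return U[:, 0]

def surrogate_variable(X, y, n_select=100):
    """One surrogate variable; n_select stands in for a significance-based cut-off."""
    Xc = X - X.mean(axis=0)
    u = residual_eigengene(X, y)
    # Correlation of each original gene with the residual direction.
    gene_cors = np.array([corr(Xc[:, j], u) for j in range(Xc.shape[1])])
    selected = np.argsort(-np.abs(gene_cors))[:n_select]
    # PCA on the ORIGINAL values of the selected genes (not the residuals).
    U_r, _, _ = np.linalg.svd(Xc[:, selected], full_matrices=False)
    # Keep the component that best matches the residual direction.
    match = np.array([abs(corr(U_r[:, k], u)) for k in range(U_r.shape[1])])
    return U_r[:, np.argmax(match)], u

X, y, h = simulate()
sv, u = surrogate_variable(X, y)
print(f"residual PC:  |corr with y| = {abs(corr(u, y)):.2f}, "
      f"|corr with h| = {abs(corr(u, h)):.2f}")
print(f"surrogate:    |corr with y| = {abs(corr(sv, y)):.2f}, "
      f"|corr with h| = {abs(corr(sv, h)):.2f}")
```

In this toy setup the residual direction has no correlation with Y by construction, while the surrogate built from the original values is free to correlate with Y and so tracks the hidden H more closely.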
