## Surrogate variable analysis

Here’s a very common scenario: we want to understand the relationship between X and Y, where X is a matrix of measurements (e.g. gene expression) and Y is the phenotype/status. This is complicated by the fact that there are hidden covariates that also affect X, and we don’t know what these are. The generative model is $X = Y + H + \epsilon$, and we can’t see H. There are several approaches here.
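The generative model above can be simulated directly; a minimal sketch, where the sample/gene counts and the strength of the Y–H correlation are made-up illustrative values:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 500                                # samples, genes (illustrative sizes)

y = rng.integers(0, 2, size=n).astype(float)   # observed phenotype/status
h = 0.7 * y + rng.normal(size=n)               # hidden covariate, partly correlated with y

beta = rng.normal(size=p)                      # true effect of y on each gene
gamma = rng.normal(size=p)                     # effect of the hidden covariate on each gene

# X = Y + H + noise, one column per gene
X = np.outer(y, beta) + np.outer(h, gamma) + rng.normal(scale=0.5, size=(n, p))
print(X.shape)  # (100, 500)
```

Because h is correlated with y, genes with large gamma will appear associated with y even if their beta is zero, which is the confounding problem the approaches below try to address.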

1. Pretend H does not exist and run the linear regression $X = \beta Y + \epsilon$. If H correlates with Y, this naive model can report an association between X and Y regardless of whether a true one exists.
2. Run PCA on X, remove its top principal components to get X’, then run $X’ = \beta Y + \epsilon$. The top principal components may partially capture true effects of Y, so this approach loses power to detect true associations. Underfitting.
3. Run PCA on the residual $X - \hat{\beta} Y$ after regressing X on Y. The residual is orthogonal to Y by construction, so this cannot detect the part of H that overlaps Y, and results in overfitting.
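Approaches 2 and 3 can be sketched with plain numpy; the data, sizes, and number of removed components `k` here are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, k = 100, 500, 2
y = rng.normal(size=n)
X = np.outer(y, rng.normal(size=p)) + rng.normal(size=(n, p))

# Approach 2: remove the top-k principal components of X itself
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
X_approach2 = Xc - (U[:, :k] * s[:k]) @ Vt[:k]   # may also strip true Y effects

# Approach 3: PCA on the residual after regressing each gene on Y
Y = np.column_stack([np.ones(n), y])             # design matrix with intercept
beta_hat, *_ = np.linalg.lstsq(Y, X, rcond=None)
R = X - Y @ beta_hat                             # residual matrix, orthogonal to Y
U_r, s_r, Vt_r = np.linalg.svd(R, full_matrices=False)
pcs = U_r[:, :k]                                 # candidate hidden directions
```

The orthogonality of `R` to the design matrix is exactly why approach 3 cannot recover any part of H that overlaps Y.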

SVA is an ad hoc method that sits between 2 and 3. Like 3, it first computes principal components of the residuals. Then, for each principal component direction, it finds a subset of genes significantly correlated with that direction, and rebuilds the direction from those genes in the original (non-residualized) data. That becomes the surrogate variable. The idea is that this surrogate direction can now overlap with Y, which lessens the overfitting of approach 3.
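The steps above can be sketched for a single surrogate variable. This is a simplified illustration, not the full SVA algorithm: the fixed correlation cutoff stands in for SVA's per-gene significance test, and all sizes and signal strengths are made up:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 100, 500
y = rng.normal(size=n)
h = 0.5 * y + rng.normal(size=n)                 # hidden factor overlapping y
X = (np.outer(y, rng.normal(size=p))
     + np.outer(h, rng.normal(size=p))
     + rng.normal(size=(n, p)))

# 1. Residuals after regressing out the observed covariate y
Y = np.column_stack([np.ones(n), y])
beta_hat, *_ = np.linalg.lstsq(Y, X, rcond=None)
R = X - Y @ beta_hat

# 2. Top principal direction of the residuals
U, s, Vt = np.linalg.svd(R, full_matrices=False)
pc = U[:, 0]

# 3. Genes whose residual profile correlates strongly with this direction
#    (an arbitrary cutoff of 0.3 replaces a proper significance test)
corr = (R - R.mean(0)).T @ (pc - pc.mean())
corr /= R.std(0) * pc.std() * n + 1e-12
selected = np.abs(corr) > 0.3

# 4. Surrogate variable: leading direction of the ORIGINAL data restricted
#    to the selected genes, so it is free to overlap with y
Xs = X[:, selected]
U_s, _, _ = np.linalg.svd(Xs - Xs.mean(0), full_matrices=False)
sv = U_s[:, 0]
```

The surrogate variable `sv` would then be included alongside Y as a covariate in the final regression, in place of the unobserved H.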