## Science-wise FDR

I spent part of today engrossed in the animated discussion here about what fraction of published results from biomedical studies are false. A 2005 paper by Ioannidis, based on simulations, argued that 80% are false. A 2013 paper by Leek fitted a two-component mixture model to empirical p values and estimated that “only” 14% are false. The natural questions are: 1. What is the correct answer? 2. Is this a useful exercise? 3. What can we do in science to improve the true positive rate?

There are several reasonable criticisms of Leek’s paper, most having to do with model assumptions. The distribution of p values is modeled by $\pi f_0 + (1-\pi)f_1$, where $f_0$ is the p value distribution under the null hypothesis, $f_1$ is the p value distribution of the true signals, and $\pi$ is the fraction of results that come from the null. Given $f_0$ and $f_1$, we can use EM to estimate $\pi$, which gives the 14% estimate (or 80% estimate or whatever). Of course, the result we get depends crucially on what parametric form we assume for $f_0$ and $f_1$. Leek assumes that $f_0 = U(0,0.05)$ and that $f_1$ is a beta distribution.
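To make the EM step concrete, here is a minimal sketch in Python. It is not Leek's actual fitting procedure. I am assuming, purely for illustration, that $f_1$ is a Beta$(a, 1)$ density truncated to $(0, 0.05)$ with a known shape $a$, so that only $\pi$ is estimated. The names `estimate_pi` and `simulate`, and the choice of $a$, are mine and do not come from either paper.

```python
import random

CUTOFF = 0.05  # only p values below this threshold are modeled


def f0(p):
    """Null density: uniform on (0, CUTOFF)."""
    return 1.0 / CUTOFF


def f1(p, a):
    """Assumed signal density: Beta(a, 1) truncated to (0, CUTOFF).

    For a < 1 this piles mass near zero. The shape a is a hypothetical,
    fixed parameter chosen only for illustration.
    """
    return a * p ** (a - 1) / CUTOFF ** a


def estimate_pi(pvals, a, n_iter=100, pi=0.5):
    """EM for the null weight pi in pi * f0 + (1 - pi) * f1, with f0, f1 known."""
    for _ in range(n_iter):
        # E-step: posterior probability that each p value came from the null.
        w = [pi * f0(p) / (pi * f0(p) + (1 - pi) * f1(p, a)) for p in pvals]
        # M-step: the new pi is the average null responsibility.
        pi = sum(w) / len(w)
    return pi


def simulate(n, pi_true, a, rng):
    """Draw n p values from the mixture (used only to sanity-check the EM)."""
    out = []
    for _ in range(n):
        is_null = rng.random() < pi_true
        u = 1.0 - rng.random()  # in (0, 1], avoids p == 0
        if is_null:
            out.append(CUTOFF * u)
        else:
            # Inverse CDF of the truncated Beta(a, 1): F(p) = (p / CUTOFF) ** a
            out.append(CUTOFF * u ** (1.0 / a))
    return out


if __name__ == "__main__":
    rng = random.Random(0)
    pvals = simulate(10000, pi_true=0.3, a=0.3, rng=rng)
    print("estimated pi:", estimate_pi(pvals, a=0.3))
```

When the data really come from the assumed mixture, the estimate lands close to the true $\pi$. Rerunning `estimate_pi` with a different `a` than the one used in `simulate` is a quick way to see how much the answer moves when the assumed form of $f_1$ is wrong.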

Whoa. Why those distributions? Leek cites a few older genomics papers to justify using a Beta to model the p value distribution of the true signals. I dug through those papers and, as far as I can tell, none of them actually shows on real data why a single Beta is the appropriate model! The Leek and Storey q-value paper is non-parametric and does not assume the Beta distribution at all. This seems to be a case where a modeling choice propagates through the literature and becomes folklore.

The assumption that $f_0 = U(0,0.05)$ seems more standard and reasonable. While this might be true in principle, the reported p value distribution from null studies may not be uniform, because of p value hacking and selection bias. Imagine that the null model is completely true (e.g. no difference between two groups). Then 1% of the time we would get a p value between 0.05 and 0.06, just by chance. Now suppose the analyst drops every digit past the second decimal place for a p value in this range, so all $p \in [0.05, 0.06)$ are reported as 0.05. Then the reported p values have a point mass at 0.05 and are no longer uniform! Hacking is one example of a bias. Another is when someone tests several comparisons and reports only the minimum p value. So even if the underlying distribution is uniform, the reported distribution, which is what we observe empirically, can be non-uniform.
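Both mechanisms are easy to simulate. The sketch below draws p values under a true null and applies the two reporting rules described above. The function names and the choice of five comparisons are my own, hypothetical choices for illustration.

```python
import random


def report_truncated(p):
    """Analyst drops extra digits just above the threshold: [0.05, 0.06) -> 0.05."""
    return 0.05 if 0.05 <= p < 0.06 else p


def report_min_of_k(rng, k):
    """Analyst runs k independent null comparisons and reports only the smallest p."""
    return min(rng.random() for _ in range(k))


def frac_significant(pvals, alpha=0.05):
    """Fraction of reported p values that count as significant (p <= alpha)."""
    return sum(p <= alpha for p in pvals) / len(pvals)


if __name__ == "__main__":
    rng = random.Random(1)
    n = 50000
    honest = [rng.random() for _ in range(n)]       # true null: p ~ U(0, 1)
    truncated = [report_truncated(p) for p in honest]
    selected = [report_min_of_k(rng, k=5) for _ in range(n)]
    print("honest   :", frac_significant(honest))
    print("truncated:", frac_significant(truncated))
    print("min of 5 :", frac_significant(selected))
```

In expectation, honest reporting gives 5% significant results, truncation gives 6%, and reporting the minimum of five independent comparisons gives $1 - 0.95^5 \approx 22.6\%$. None of the nulls are false, yet the reported distribution below 0.05 is no longer the one the mixture model assumes.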

All this suggests that estimating the science-wise false discovery rate, while a noble goal, is futile if we rely only on reported p value distributions without looking at follow-up replication studies. This is a case where the role of statistics is really to clarify the model assumptions rather than to uncover any ground truth.

Gelman argues that looking at p value significance is not even the right question to ask, since null hypotheses are never exactly true! It’s more meaningful to look at S(ign) errors and M(agnitude) errors. Lastly, in practice, scientists should be more skeptical of published results, and journals should view replication studies (as well as negative studies) more favorably.
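To make the sign and magnitude errors concrete, here is a small simulation sketch. I am assuming a normally distributed estimate with known standard error and a two-sided test at the 5% level. The function name and the example numbers are hypothetical, not taken from Gelman's work.

```python
import random


def sign_and_magnitude_errors(true_effect, se, n_sims, rng, z_crit=1.96):
    """Simulate replications of a study and summarize the significant ones.

    Returns (power, type_s, type_m) where
      power  = fraction of replications reaching significance,
      type_s = fraction of significant estimates with the wrong sign,
      type_m = average |estimate| among significant ones divided by |true_effect|.
    Assumes true_effect != 0 and estimate ~ Normal(true_effect, se).
    """
    significant = []
    for _ in range(n_sims):
        est = rng.gauss(true_effect, se)
        if abs(est) > z_crit * se:
            significant.append(est)
    if not significant:
        raise ValueError("no significant replications; increase n_sims")
    power = len(significant) / n_sims
    wrong_sign = sum((e > 0) != (true_effect > 0) for e in significant)
    type_s = wrong_sign / len(significant)
    type_m = sum(abs(e) for e in significant) / len(significant) / abs(true_effect)
    return power, type_s, type_m


if __name__ == "__main__":
    rng = random.Random(2)
    # A small true effect measured noisily: effect is half a standard error.
    power, type_s, type_m = sign_and_magnitude_errors(0.5, 1.0, 100000, rng)
    print("power:", power, "type S:", type_s, "type M:", type_m)
```

With a small effect and a noisy measurement, every significant estimate must exceed $1.96 \cdot \text{se}$ in absolute value, so the significant results overstate the true effect by construction, and a nontrivial share of them point in the wrong direction.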