As a byproduct of getting into the science-wise FDR debate, I read Stephen Goodman’s interesting note, “A comment on replication, p values, and evidence.” It’s short and sweet.
When we see a result with a p value of 0.05, we tend to think it is quite significant and unlikely to be wrong. But if we perform the exact same experiment again, we have roughly a 50% chance of getting a non-significant result! Assume we have two Gaussian populations and the difference in means has a Z-score of exactly 1.96, so that the "true" p value is exactly 0.05. Then the Z-statistic of a replicate experiment is a standard Gaussian centered at 1.96, so with probability 0.5 the replicate's Z will fall below 1.96 and fail to reach significance.
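That 50% figure is easy to check by simulation: draw replicate Z-statistics from a Gaussian centered at 1.96 and count how often they clear the 1.96 cutoff. A minimal sketch (the function name `replication_rate` is mine, and only the upper tail is counted, which is a very good approximation here):

```python
import random

def replication_rate(true_z=1.96, crit=1.96, n=200_000, seed=0):
    """Fraction of replicate Z-statistics ~ N(true_z, 1) that
    exceed the significance cutoff crit (upper tail only)."""
    rng = random.Random(seed)
    hits = sum(rng.gauss(true_z, 1) > crit for _ in range(n))
    return hits / n

print(round(replication_rate(), 2))  # close to 0.5
```

With 200,000 draws the Monte Carlo error is about 0.001, so the estimate lands very close to one half.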
I did a simple experiment to confirm this. In the plot, the x axis shows the observed -log10 p value, and the y axis shows the probability that an effect whose true p value is that size would come out significant in a replicate (i.e., the power). We see that at a p value of 0.001, only around 80% of the time would we get a significant result. The flip side is that if we design an experiment to have high power to discover a true effect (say, y axis = 0.99), then we should expect to see a very significant p value, around 0.00001.
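The same curve can be sketched analytically under the Gaussian model above. This is my own hedged reconstruction, not the note's code: it assumes a two-sided test at alpha = 0.05, counts only the upper tail, and the helper names `power_at` and `required_p` are mine (the simulation behind the plot may differ in its details):

```python
from statistics import NormalDist

std = NormalDist()                    # standard Gaussian
CRIT = std.inv_cdf(1 - 0.05 / 2)      # ~1.96 for alpha = 0.05, two-sided

def power_at(p_true, crit=CRIT):
    """Probability that a replicate of an effect whose 'true'
    two-sided p value is p_true comes out significant."""
    z_true = std.inv_cdf(1 - p_true / 2)
    return 1 - std.cdf(crit - z_true)

def required_p(target_power, crit=CRIT):
    """The 'true' two-sided p value needed so that a replicate
    is significant with probability target_power."""
    z_true = crit + std.inv_cdf(target_power)
    return 2 * (1 - std.cdf(z_true))

print(round(power_at(0.05), 2))       # 0.5: the 50% replication chance
for p in (0.01, 0.001, 1e-5):
    print(f"p = {p:g}: power = {power_at(p):.2f}")
print(f"power 0.99 needs p ~ {required_p(0.99):.0e}")
```

At a true p value of 0.05 this recovers the 50% figure exactly, and designing for 99% power pushes the required p value down to roughly the 0.00001 scale mentioned above.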