Bayesian vs. Traditional Statistics
I think the traditional approach to statistics, using statistical significance tests, rests on a conceptual confusion.
Aside: I am no statistician. I’m just a philosopher. So you would probably be justified in assuming at this point that I’m wrong. But then you would be mistaken.
Background
You’ve collected some data. You see that variables A and B are correlated in the data. You wonder if this is due to chance, or if there is a genuine connection of some kind between A and B.
The “Null Hypothesis” is the hypothesis that it’s just chance; A and B aren’t systematically connected.
A “Type 1 Error” occurs when you mistakenly reject the null hypothesis. (You think A and B are connected but they aren’t.)
How do you decide whether to reject the null hypothesis (and thus accept some causal hypothesis)? Traditionally, you suppose that the null hypothesis is true and, based on that assumption, calculate how likely it would be that you would observe a correlation of at least the magnitude you in fact observed. That’s the “p-value.” You then set some fairly low threshold, such as 0.05 or (if you’re extra careful) 0.01; the threshold is called “alpha”. If p is less than alpha, you get to reject the null hypothesis and report that you found a “statistically significant” effect, and journalists can now announce to the world that “scientists have discovered” that A and B are causally connected. (This includes the possibility of a third factor, C, that affects both A and B.)
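In code, the procedure looks roughly like this. (A minimal sketch in Python, with made-up data and a simple permutation test standing in for whatever test a real study would use; the variable names and the alpha of 0.05 are just for illustration.)

# A sketch of a traditional significance test; the data and the permutation
# test are stand-ins for whatever a real study would actually use.
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=100)
b = 0.2 * a + rng.normal(size=100)   # made-up data with a weak genuine connection

observed_r = np.corrcoef(a, b)[0, 1]

# Suppose the null hypothesis is true (A and B are unrelated), and estimate how
# often chance alone would produce a correlation at least this large in magnitude.
null_rs = [np.corrcoef(rng.permutation(a), b)[0, 1] for _ in range(10_000)]
p_value = np.mean([abs(r) >= abs(observed_r) for r in null_rs])

alpha = 0.05
print(f"r = {observed_r:.3f}, p = {p_value:.4f}")
print("statistically significant" if p_value < alpha else "not significant")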
Why do they get to announce this? I think the rationale is something like this: “Look, you can never get absolute certainty. It’s enough that we have a high probability for our theory. Since our result was statistically significant, by definition, that means it is highly unlikely that it’s due to chance; so it’s highly likely that there is some sort of causal connection.” Thus, here is an actual quotation from a web site explaining basic statistical concepts:
“Alpha is the maximum probability that we have a type I error. For a 95% confidence level, the value of alpha is 0.05. This means that there is a 5% probability that we will reject a true null hypothesis. In the long run, one out of every twenty hypothesis tests that we perform at this level will result in a type I error.”
(https://www.thoughtco.com/difference-between-type-i-and-type-ii-errors-3126414, accessed 10/2017)
A Confusion
So what’s the confusion I’m complaining about? It is a confusion between
1. The probability that we make a Type 1 error, given that the null hypothesis is true, and
2. The probability that we make a Type 1 error, given that we reject the null hypothesis.
Or:
1’. Out of all the cases in which the null hypothesis is true, how often do we mistakenly reject it? and
2’. Out of all the cases in which we reject the null hypothesis, how often is that a mistake?
Alpha, the threshold of statistical significance, answers 1/1’. It does not answer 2/2’. And 2 or 2’ is what actually matters – that is what would tell you how confident you should be that A and B are genuinely connected, when you hear about the results of the scientific study.
That’s a little bit abstract. Here is an illustration (from the famous medical researcher John Ioannidis, http://journals.plos.org/plosmedicine/article?id=10.1371/journal.pmed.0020124). Say a scientist, John, is looking for genes that cause schizophrenia. Assume that there are 10,000 genes to be tested, of which only 5 are actually causally connected to schizophrenia. John uses a statistical significance threshold of 0.05. What happens?
Let’s assume that the 5 genes that actually cause schizophrenia pass the test. In addition, there are 9,995 genes that have no effect (the “null hypothesis” is true for them). Of these, about 5% will pass the statistical significance test anyway, because that’s what the 0.05 threshold means. So you get about (0.05)(9,995) = 500 of these genes that pass the test. So in the end you have 5 true positives, and 500 false positives.
Now to return to the two questions above:
1’. Out of all the cases in which the null hypothesis is true, how often do we mistakenly reject it?
Answer: 5%. (500/9995)
2’. Out of all the cases in which we reject the null hypothesis, how often is that a mistake?
Answer: 99%. (500/505)
Notice that 5% is really different from 99%. That shows that questions 1’ and 2’ are importantly different.
2’ is what actually matters: when someone tells you, “Hey, gene X passed the statistical significance test for being connected to schizophrenia,” how confident should you be that gene X is really causally related to schizophrenia? About 1%; there is a 99% probability, in this example, that that result is due to chance. This is part of Ioannidis’ explanation for (as the title of his paper announces) “why most published research findings are false.”
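If you want to check the arithmetic, here it is spelled out in a few lines of Python. (The numbers 10,000, 5, and 0.05, and the assumption that all 5 real genes pass the test, are carried over from the illustration above.)

total_genes = 10_000
real_genes = 5                        # genes actually connected to schizophrenia
alpha = 0.05                          # significance threshold

null_genes = total_genes - real_genes            # 9,995 genes with no real effect
true_positives = real_genes                      # assume every real gene passes the test
false_positives = alpha * null_genes             # about 500 null genes pass by chance

# 1'. Out of the cases where the null hypothesis is true, how often do we reject it?
print(false_positives / null_genes)                           # 0.05, i.e. 5%

# 2'. Out of the cases where we reject the null hypothesis, how often is it a mistake?
print(false_positives / (false_positives + true_positives))   # about 0.99, i.e. 99%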
I’m under the impression that the conceptual confusion I just identified is common, and that it’s key to why people think that traditional statistical inference is cogent. The web site I quoted above says, “Alpha is the maximum probability that we have a type I error”. If that were true, it would be a reasonable justification for relying on significance tests. But it’s wrong. In my example, the probability that we have made a type I error, given that we rejected the null hypothesis, is 0.99, even though alpha is only 0.05.
[Based on a FB post from 10/2017. Added, 6/29/19:]
Philosophical Points
Whence comes the traditional approach to statistics, relying on significance tests? It derives from empiricism plus the "streetlamp fallacy".
According to Bayes' Theorem, when you have some hypothesis H and evidence E, you can calculate the probability that the hypothesis is true given that evidence as:
P(h|e) = P(h)*P(e|h) / P(e)
(The probability of h given e = the initial probability of h, times the probability of e given h, divided by the initial probability of e.)
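To see the formula in action, here is the gene example from earlier run through Bayes' Theorem, in Python. (The prior of 5/10,000 and the assumption that every real gene passes the test, so that P(e|h) = 1, come from the example above; treating P(e|~h) as equal to alpha is an idealization.)

# Bayes' Theorem applied to the gene example: h = "this gene causes schizophrenia",
# e = "this gene passed the significance test".
p_h = 5 / 10_000                # prior probability that a given gene is a real cause
p_e_given_h = 1.0               # assume every real gene passes the test
p_e_given_not_h = 0.05          # chance of passing the test when the null is true (alpha)

p_e = p_h * p_e_given_h + (1 - p_h) * p_e_given_not_h   # P(e), by total probability
p_h_given_e = p_h * p_e_given_h / p_e                   # Bayes' Theorem

print(p_h_given_e)              # about 0.0099, roughly the 1% figure from before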
P(h) and P(e), the "prior probabilities", have to be given somehow, independently of the evidence e. So when evaluating a theory in light of some evidence, it looks like you need some initial, non-logical information other than that evidence. You might think you can just rely on some earlier evidence, but the point applies generally, to any piece of evidence you consider. I.e., assessing the earlier evidence would also require prior probabilities, etc. So there is a threat that we'll be forced to rely on something a priori.
Empiricists believe that all knowledge must be based on evidence that comes from observation. So they don't like relying on mysterious "prior probabilities". So they came up with a way of doing statistics without having to talk about the prior probabilities of theories. That's the traditional approach to statistics: Since we can't objectively assess prior probabilities, let's just rely on conditional probabilities, P(e|h) and P(e|~h). (The p-value is basically the probability of e given the null hypothesis.)
What I call the streetlamp fallacy is based on the following joke: a drunk is looking for something under a streetlamp. You approach him:
You: What are you doing?
Drunk: Looking for my keys.
You: Where did you lose them?
Drunk: Over there. [Points to dark alley.]
You: Then why are you looking here?
Drunk: The light is much better here.
That is like the argument that, since we don't have a good way of objectively assessing prior probabilities, we should base scientific reasoning entirely on significance tests. The difficulty of assessing P(h) doesn't stop Bayes' Theorem from being correct, so it doesn't stop the prior probability from being essential to assessing the justification for a theory.