The misbehaviour of mistakes

The Comedy of Errors – with apologies to William Shakespeare in the week of his 450th birthday

There are two ways that mistakes can happen when you carry out an experiment to test a hypothesis. Experiments usually have two possible outcomes: accepting a “null” hypothesis, meaning you conclude that the experiment does not provide sufficient evidence to challenge its truth, and rejecting the null hypothesis, meaning you conclude that it does.

Type 1 errors, otherwise known as “false positives”, are when you think there is evidence for rejecting the null hypothesis (eg deciding there actually is something wrong with a smear test) when there isn’t. Type 2 errors, otherwise known as “false negatives”, are when you accept the null hypothesis when you really shouldn’t (eg telling someone they are all clear when they are not).
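To pin the terminology down, here is a minimal sketch in Python (the smear-test framing comes from the example above; the function and variable names are purely illustrative):

    # Illustrative only: the four possible outcomes of a yes/no test,
    # using the smear-test example above.
    def classify(test_positive: bool, actually_ill: bool) -> str:
        if test_positive and actually_ill:
            return "true positive"                  # null correctly rejected
        if test_positive and not actually_ill:
            return "Type 1 error (false positive)"  # scared unnecessarily
        if not test_positive and actually_ill:
            return "Type 2 error (false negative)"  # wrongly given the all-clear
        return "true negative"                      # null correctly accepted

    print(classify(test_positive=True, actually_ill=False))  # Type 1 error (false positive)
    print(classify(test_positive=False, actually_ill=True))  # Type 2 error (false negative)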

Saddam Hussein once famously said “I would rather kill my friends in error, than allow my enemies to live”. This suggests that he was far more concerned about Type 2 errors than Type 1 errors.

He is not alone in this.

A recent, widely reported academic paper published in Nature Medicine claimed to have a test that “predicted phenoconversion to either amnestic mild cognitive impairment or Alzheimer’s disease within a 2–3 year timeframe with over 90% accuracy”.

The latest statistics from the Alzheimer’s Society suggest that around 1 in 14, or 7%, of over-65s will develop Alzheimer’s. Probably not all of these people will contract the disease within 3 years, but let’s assume for the sake of argument that they will. Even so, this means that, out of 1,000 people over 65, 70 will develop Alzheimer’s within 3 years and 930 will not.

Applying the 90% accuracy rate allows us to detect 63 of the 70 people who actually will get Alzheimer’s, leaving 7 cases missed where people go on to develop the disease. However, the bigger problem, the Type 1 error that Saddam Hussein was not so bothered about, is that 10% of the 930 people who will not get Alzheimer’s will be told that they will. That is 93 people scared unnecessarily.

So 63 + 93 = 156 people will test positive, of which only 63 (ie 40%) will actually develop Alzheimer’s within three years. The “over 90%” accuracy rate becomes only a 40% accuracy rate amongst the people testing positive, a figure known as the positive predictive value.
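The arithmetic can be laid out as a short Python sketch (the inputs are the assumptions above: a 7% three-year incidence, with 90% sensitivity and 90% specificity standing in for the headline “90% accuracy”):

    # Base-rate calculation for the Alzheimer's example above.
    population  = 1_000
    prevalence  = 0.07   # ~1 in 14 over-65s develop Alzheimer's
    sensitivity = 0.90   # fraction of future cases the test flags
    specificity = 0.90   # fraction of non-cases the test clears

    will_develop     = population * prevalence              # 70
    will_not_develop = population - will_develop            # 930

    true_positives  = sensitivity * will_develop            # 63 detected
    false_negatives = will_develop - true_positives         # 7 missed (Type 2 errors)
    false_positives = (1 - specificity) * will_not_develop  # 93 scared unnecessarily (Type 1)

    ppv = true_positives / (true_positives + false_positives)
    print(f"Positive predictive value: {ppv:.0%}")          # ~40%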

In statistical tests more generally, if the likelihood of a false positive is less than 5%, the evidence that the hypothesis is true is commonly described as “statistically significant”. In 2005 John Ioannidis, an epidemiologist from Stanford University, published a paper arguing that most published research findings are probably false. This was because of three things often not highlighted in the reporting of research: the statistical power of the study (ie the probability of not making a Type 2 error, or false negative), how unlikely the hypothesis being tested is to start with, and the bias in favour of testing new hypotheses over replicating previous results.

As an example, suppose we test 1,000 hypotheses of which 100 are actually true, using a 5% significance level. A study with power of 0.8 will find 80 of the true ones, missing 20 because of false negatives. Of the 900 hypotheses that are wrong, up to 5% – ie 45 of them – could be accepted as right because of the permissible level of Type 1 errors or false positives. So you have 80 + 45 = 125 positive results, of which 36% are incorrect. If the statistical power is closer to the level some research findings have suggested of around 0.4, you would have 40 + 45 = 85 positive results, of which 53% would be incorrect, supporting Professor Ioannidis’ claim even before you get onto the other problems he mentions.
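As a sketch, the same sums in Python (inputs as above: 1,000 hypotheses, 100 of them true, a 5% significance level, and the two power levels):

    # Share of positive results that are wrong (the false discovery rate).
    def share_of_positives_wrong(n_hypotheses=1_000, n_true=100, alpha=0.05, power=0.8):
        true_positives  = power * n_true                   # real effects found
        false_positives = alpha * (n_hypotheses - n_true)  # Type 1 errors allowed
        return false_positives / (true_positives + false_positives)

    print(f"power 0.8: {share_of_positives_wrong(power=0.8):.0%} of positives wrong")  # 36%
    print(f"power 0.4: {share_of_positives_wrong(power=0.4):.0%} of positives wrong")  # 53%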

We would have got much more reliable results if we had just focused on the negative results in these examples. With a power of 0.8, we would get 20 false negatives and 855 true negatives (the 900 wrong hypotheses less the 45 false positives), ie around 2% of the negative results are incorrect. With a power of 0.4, we would get 60 false negatives and the same 855 true negatives, ie still less than 7% of the negative results are incorrect. Unfortunately negative results account for just 10–30% of published scientific literature, depending on the discipline. This bias may be growing: a study of 4,600 papers from across the sciences conducted by Daniele Fanelli of the University of Edinburgh found that the proportion of positive results increased by over 22% between 1990 and 2007.
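The corresponding sketch for the negative results (same assumed inputs as before):

    # Share of negative results that are wrong (the false omission rate).
    def share_of_negatives_wrong(n_hypotheses=1_000, n_true=100, alpha=0.05, power=0.8):
        false_negatives = (1 - power) * n_true                   # real effects missed
        true_negatives  = (1 - alpha) * (n_hypotheses - n_true)  # 855 correctly rejected
        return false_negatives / (false_negatives + true_negatives)

    print(f"power 0.8: {share_of_negatives_wrong(power=0.8):.1%} of negatives wrong")  # ~2.3%
    print(f"power 0.4: {share_of_negatives_wrong(power=0.4):.1%} of negatives wrong")  # ~6.6%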

So, if you are looking to the scientific literature to support an argument you want to advance, be careful. It may not be as positive as it seems.
