
Statistical manipulation to get publishable results

I love data. It’s amazing the sort of “discoveries” I can make given a data set and computer statistical package. It’s just so easy to search for relationships and test their statistical significance. Maybe relationships which we feel are justified by our experience – or even new ones we hadn’t thought of previously.

It’s a lot of fun. Here’s a tool readers can use to explore a data set involving information on US political leadership and the US economy: Hack Your Way To Scientific Glory. (The image above shows the tool, but it’s only an image. WordPress won’t allow me to embed the site, but you can access it by clicking on the image.)

Try searching for relationships between political leadership and the economy. If you can find a relationship with a p-value < 0.05 you might feel the urge to publish your findings. After all, p-values < 0.05 seem to be the gold standard for scientific journals these days.

Statistical manipulation a big problem in published science

Problem is, by playing with this data you could produce statistically significant relationships that “show” both Republicans and Democrats hurt the economy, or that both are good for the economy. It’s simply a matter of choosing the appropriate factors to define political leadership and appropriate factors to measure the economic situation.

The process is called p-hacking or data dredging. Time spent playing with this tool should convince you that it is easy to confirm one’s own political biases about political leadership and political parties using statistical techniques. It should also convince you this is very bad science. But, unfortunately, it happens. Even respectable journals will publish papers reporting relationships obtained by p-hacking, provided a p-value of less than 0.05 can be shown.
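The effect is easy to reproduce with data that contains no real relationships at all. The following is a minimal Python sketch, not the code behind the tool: the function names (`pearson_r`, `approx_p_value`, `dredge`), the 10 × 10 grid of "leadership" and "economy" variables, and the 60 observations are all illustrative assumptions, and the p-value uses the Fisher z approximation rather than an exact test.

```python
import math
import random
from statistics import NormalDist


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def approx_p_value(r, n):
    """Approximate two-sided p-value for r, using the Fisher z transformation."""
    z = math.atanh(r) * math.sqrt(n - 3)
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def dredge(n_obs=60, n_leadership=10, n_economy=10, seed=1):
    """Correlate every 'leadership' variable with every 'economy' variable.

    All variables are independent random noise (an assumption of this
    sketch), so any "significant" result is a false positive.
    """
    rng = random.Random(seed)
    leadership = [[rng.gauss(0, 1) for _ in range(n_obs)] for _ in range(n_leadership)]
    economy = [[rng.gauss(0, 1) for _ in range(n_obs)] for _ in range(n_economy)]
    return [approx_p_value(pearson_r(x, y), n_obs) for x in leadership for y in economy]


p_values = dredge()
hits = sum(p < 0.05 for p in p_values)
print(f"{hits} of {len(p_values)} noise-only pairings reach p < 0.05")
print(f"smallest p-value found: {min(p_values):.4f}")
```

With 100 pairings tested at the 0.05 level, roughly five "significant" results are expected by chance alone. Reporting only those, and staying quiet about the rest, is the essence of data dredging.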

The article Science Isn’t Broken: It’s just a hell of a lot harder than we give it credit for includes the p-hacking tool and discusses how widespread the problem is in the published scientific literature. It also describes the concern that statisticians and scientists have about this sort of publication.

The author says:

“The variables in the data sets you used to test your hypothesis had 1,800 possible combinations. Of these, 1,078 yielded a publishable p-value, but that doesn’t mean they showed that which party was in office had a strong effect on the economy. Most of them didn’t.

The p-value reveals almost nothing about the strength of the evidence, yet a p-value of 0.05 has become the ticket to get into many journals. “The dominant method used [to evaluate evidence] is the p-value,” said Michael Evans, a statistician at the University of Toronto, “and the p-value is well known not to work very well.”

Statistical manipulation and p-hacking in fluoride studies

In my articles on the way scientific papers relating to fluoridation are misrepresented, I have often referred to the misleading use of p-values to argue that a study is very strong or a relationship important. Paul Connett, head of the Fluoride Action Network (FAN), often uses that argument (see, for example, Connett fiddles the data on fluoride, Connett misrepresents the fluoride and IQ data yet again, and Anti-fluoridation campaigners often use statistical significance to confirm bias).

But I have noticed that p-hacking and data dredging are real problems with some of the more recent studies of fluoride and IQ. This matters partly because these papers are being published by some reputable journals, and partly because some reviewers and scientific readers seem completely unaware of the problem and are therefore uncritically taking some of the claimed findings at face value.

I have gone through some recent papers on this issue and pulled out the factors used to represent child cognitive abilities and to represent F exposure or intake. These are listed below for 7 papers and a thesis.

| Study | Cognitive factor | F exposure |
| --- | --- | --- |
| Malin & Till (2015) | ADHD prevalence in US states | Fluoridation extent in US states |
| Thomas (2014) | WASI; Bayley Infant Scales of Development-II (BSID-II) | Blood plasma F; Concurrent child urinary F |
| Bashash et al., (2017) | CGI | Concurrent child urinary FSG |
| Bashash et al., (2018) | ADHD; CRS scores; 3CPT scores | |
| Thomas et al., (2018) | MDI | MUFCr |
| Green et al., (2019) | FSIQ* boys; PIQ* boys; VIQ ns | Estimated F intake by mother |
| Riddell et al., (2019) | SDQ hyperactive/inattentive score; ADHD – parent-reported or questionnaire | Water F |
| Till et al., (2020) | FSIQ | Water F |
| Santa-Marina et al., (2019) | Perceptual-manipulative scale; verbal function; general cognitive | |

Footnotes (see papers for full information):
MUF – Prenatal maternal urinary F
MUFCr – Prenatal maternal urinary F adjusted using creatinine concentration
MUFSG – Prenatal maternal urinary F adjusted using specific gravity
Concurrent child urinary FSG – child urinary F at the time of IQ assessment adjusted using specific gravity
CGI – general cognitive index
FSIQ – Full-Scale IQ
PIQ – Performance IQ
VIQ – Verbal IQ
MDI – Mental development index
WASI – Wechsler Abbreviated Scale of Intelligence

As you can see, just like the political leadership/economy example illustrated in the p-hacking tool above, there is a range of both cognitive measurements and fluoride exposure factors which can be cherry-picked to produce the “right” answer (or confirm one’s bias). Most of these studies can also select from up to three cohorts. So it’s not surprising that relationships can be found to support the argument that fluoride has a negative effect on child cognitive abilities. But we can also find statistically significant relationships to support the argument that fluoride has a positive effect on cognitive abilities. Or, alternatively, that fluoride has no effect at all on cognitive abilities.
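To put rough numbers on how quickly these choices multiply, here is a small Python sketch. The counts are hypothetical and chosen only for illustration; they are not taken from any of the papers listed. It also assumes the analyses are independent, which real analyses of overlapping data are not, so treat the result as an indication rather than an exact figure.

```python
def familywise_error_rate(n_tests, alpha=0.05):
    """Chance of at least one p < alpha across n_tests independent tests
    when no real effect exists anywhere."""
    return 1.0 - (1.0 - alpha) ** n_tests


def bonferroni_threshold(n_tests, alpha=0.05):
    """Per-test threshold that keeps the overall false-positive rate near alpha."""
    return alpha / n_tests


# Hypothetical counts for illustration only (assumptions, not from the papers).
cognitive_measures = 3   # e.g. FSIQ, PIQ, VIQ
exposure_measures = 3    # e.g. urinary F, water F, estimated intake
cohorts = 3
subgroups = 3            # e.g. all children, boys only, girls only

n_analyses = cognitive_measures * exposure_measures * cohorts * subgroups
print(f"possible analyses: {n_analyses}")
print(f"chance of at least one false positive: {familywise_error_rate(n_analyses):.1%}")
print(f"Bonferroni-corrected threshold: p < {bonferroni_threshold(n_analyses):.5f}")
```

Under these assumptions, a researcher with this many analyses to choose from is very likely to find at least one p-value below 0.05 even if fluoride has no effect at all. A correction such as Bonferroni guards against this, but only if every analysis that was tried is counted.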

Another warning sign is that the relationships that are cited (and which have p-values < 0.05) are all extremely weak and explain only a few per cent of the variance in the data. While the complete statistical analyses are not given in most of the papers (another big problem in published research) the figures show a very high scatter in the data and the quoted confidence intervals confirm this.
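The gap between "statistically significant" and "meaningful" can be shown with a little arithmetic. The Python sketch below holds a hypothetical correlation fixed at r = 0.1 (so it explains 1% of the variance) and changes only the sample size. The value of r and the sample sizes are assumptions for illustration, not figures from the papers, and the p-value is again the Fisher z approximation.

```python
import math
from statistics import NormalDist


def approx_p_value(r, n):
    """Approximate two-sided p-value for a Pearson r from n observations,
    using the Fisher z transformation."""
    z = math.atanh(r) * math.sqrt(n - 3)
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


r = 0.1  # hypothetical weak correlation, explaining r**2 = 1% of the variance
for n in (50, 500, 5000):
    print(f"n = {n:5d}   variance explained = {r * r:.1%}   p = {approx_p_value(r, n):.2g}")
```

The relationship is equally weak in every row, yet the p-value drops below 0.05 once the sample is large enough. A low p-value therefore says nothing about how much of the scatter in the data the relationship accounts for.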

Even where p < 0.05 the data can be extremely scattered and the relationship so weak as to be meaningless. Figure 1 from Till et al., (2020)

Yet another warning sign is that when relationships are reported, they hold only for particular cognitive factors or particular fluoride exposure factors, and these differ from study to study. And again, they may only be true for one sex or for a limited age group.


Geoff Cumming wrote in A Primer on p Hacking that:

“Statisticians have a saying: if you torture the data enough, they will confess.”

We should always remember this when reading papers which rely on low p-values to support a relationship. I think this is a big problem in a lot of published science but it is certainly a problem with the fluoride-IQ research currently being published.

The real take-home message from this particular research is that all the reported relationships are extremely weak, the data has been “tortured,” and it is easy to select parameters to produce a relationship with a p-value < 0.05 to confirm a bias.

In fact, the results from these studies are contradictory, confusing and extremely weak. They may be useful to political activists who have biases to confirm or ideological agendas to promote, but they are not sufficient to influence public health policy.
