There has been widespread scientific criticism of the recently published Canadian fluoride-IQ study of Green et al. (2019). Most recently Dr. René F. Najera (a Doctor of Public Health, an epidemiologist and biostatistician) has critiqued the statistical analysis. He finds a number of faults and concludes by hoping “public health policy is not done based on this paper:”
“It would be a terrible way to do public health policy. Scientific discovery and established scientific facts are reproducible and verifiable, and they are based on better study designs and stronger statistical outcomes than this.”
Dr. Najera’s critique of the biostatistics is in his article The Hijacking of Fluorine 18.998, Part Three. This follows his previous critiques (Part 1 and Part 2) of the epidemiological issues, which I reviewed in Fluoridation – A new fight against scientific misinformation.
Dr. Najera starts by stressing the important role of biostatistics in epidemiological studies. After all the planning and measurement:
“.. we hand off the data to biostatisticians, or we do the work with biostatisticians. Doing this assures us that we are measuring our variables correctly and that all associations we see are not due to chance. Or, if chance had something to do with it, we recognize it and minimize the factors that lead to chance being a factor in and of itself.”
I agree completely. In my experience statisticians play a critical role in research and should be involved even at the planning stage. Further, I think the involvement of experienced biostatisticians is invaluable. Too often I see papers where the authors themselves relied on their own naive statistical analyses rather than calling on experience. Perhaps they are being protective of their own confirmation bias.
The specific study Dr. Najera critiques is:
Green, R., Lanphear, B., Hornung, R., Flora, D., Martinez-Mier, E. A., Neufeld, R., … Till, C. (2019). Association Between Maternal Fluoride Exposure During Pregnancy and IQ Scores in Offspring in Canada. JAMA Pediatrics, 1–9.
For my other comments on the Canadian fluoride/IQ research see:
- If at first you don’t succeed . . . statistical manipulation might help
- Politics of science – making a silk purse out of a sow’s ear
- More expert comments on the Canadian fluoride-IQ paper
- An evidence-based discussion of the Canadian fluoride/IQ study
- Fluoridation – A new fight against scientific misinformation
No comparison group
One problem with this study is that a number of mother-child pairs were excluded and, in the end, the sample used was not representative of the Canadian population. Najera summarises the main finding of the study as “that children of mothers who ingested fluoride during pregnancy had 4 IQ points lower for each 1 mg of fluoride consumed by the mother:”
“If you’re asking yourself, “Compared to whom?” you are on the right track. There was no comparison group. Women who did not consume tap water or lived outside a water treatment zone were not included, and that’s something I discussed in the previous post. What the authors did was a linear regression based on the data, and not much more.”
In fact, although the sample used was unrepresentative, the study did compare the IQs of children whose mothers had lived in fluoridated and nonfluoridated areas. There was no statistically significant difference – an important fact which was not discussed at all in the paper. This table was extracted from the paper’s Table 1.
What about that regression?
While ignoring the mean values for fluoridated and nonfluoridated areas the authors relied on regression analyses to determine an effect.
But if you look at the data in their Figure 3A reproduced below you can see problems:
“. . . you can see that the average IQ of a child for a mother consuming 1.5 mg of fluoride is about 100. You also see that only ONE point is representing that average. That in itself is a huge problem because the sample size is small, and these individual measurements are influencing the model a lot, specially if their value is extreme. Because we’re dealing with averages, any extreme values will have a disproportionate influence on the average value.”
Several scientific commenters on this paper have noted this problem, which is important because it should have been dealt with in the statistical analysis:
“When biostatisticians see these extreme values popping up, we start to think that the sample is not what you would call “normally distributed.” If that is the case, then a linear regression is not exactly what we want to do. We want to do other statistical analyses and present them along with the linear regressions so that we can account for a sample that has a large proportion of extreme values influencing the average. Is that the case with the Green study? I don’t know. I don’t have access to the full dataset. But you can see that there are some extreme values for fluoride consumption and IQ. A child had an IQ of 150, for example. And a mother consumed about 2.5 milligrams of fluoride per liter of beverage. Municipal water systems aim for 0.7 mg per liter in drinking water, making this 2.5 mg/L really high.”
No one suggests such outliers be removed from the analyses (although the authors did remove some). But they “should be looked at closely, through statistical analysis that is not just a linear regression.”
This is frustrating because, while the authors did not present such an analysis, they hint that it was considered (without producing any results) when they say:
“Residuals from each model had approximately normal distributions, and their Q-Q plots revealed no extreme outliers. Plots of residuals against fitted values did not suggest any assumption violations and there were no substantial influential observations as measured by Cook distance. Including quadratic or natural-log effects of MUFSG or fluoride intake did not significantly improve the regression models. Thus, we present the more easily interpreted estimates from linear regression models.”
As Dr. Najera comments, this is “.. worrisome because that is all they presented. They didn’t present the results from other models or from their sensitivity analysis.”
Scientific commenters are beginning to demand that the authors make the data available so they can check for themselves. My own testing with the data I extracted from the figure does show that the data is not normally distributed. Transformation produced a normal distribution of the data but the relationship was far weaker than for a straight linear regression. Did the authors reject transformations simply because they “did not significantly improve the regression models”?
That suggests confirmation bias to me.
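As a rough illustration of what a transformation does to skewed data, here is a minimal sketch using synthetic log-normal values. These are hypothetical values, not the study data or my extracted data, and the log-normal shape is an assumption chosen only to show the effect. Skewness is used as a simple indicator of departure from a symmetric, normal-like distribution.

```python
import math
import random
from statistics import fmean, pstdev

def skewness(values):
    """Skewness: about 0 for symmetric data, positive for a long right tail."""
    mean = fmean(values)
    sd = pstdev(values)
    return fmean(((v - mean) / sd) ** 3 for v in values)

random.seed(1)
# Synthetic right-skewed "exposure" values (log-normal); hypothetical, not study data.
exposure = [random.lognormvariate(-0.7, 0.5) for _ in range(500)]
log_exposure = [math.log(v) for v in exposure]

print(f"skewness, raw values: {skewness(exposure):.2f}")
print(f"skewness, log values: {skewness(log_exposure):.2f}")
```

A formal check would use a normality test and Q-Q plots on the real data. The point here is only that a simple transformation can remove most of the skew, so the effect of such a transformation on the fitted relationship deserves to be reported.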
In their public promotions, the authors and their supporters never mention confidence intervals (CIs) – perhaps because the story does not look so good when they are considered. Most of the media coverage has also ignored these CIs.
Much is made of the IQ score of boys dropping by 4.49 points with a 1 mg/L increase in the mother’s urinary fluoride, but:
“Based on this sample, the researchers are 95% confident that the true drop in IQ in the population they’re studying is between 0.6 points and 8.38 points. (That’s what the 95% CI, confidence interval, means.)”
In other words:
“In boys, the change is as tiny as 0.6 and as huge as 8.38 IQ points.”
For girls the change:
“is between -2.51 (a decrease) and 7.36 (an increase). It is because of that last 95% CI that they say that fluoride ingestion is not associated with a drop in IQ in girls. In fact, they can’t even say it’s associated with an increase. It might even be a 0 IQ change in girls.”
Dr. Najera asks:
“Is this conclusive? In my opinion, no. It is not conclusive because that is a huge range for both boys and girls, and the range for girls overlaps 0, meaning that there is a ton of statistical uncertainty here.”
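The reasoning about these intervals can be written out as a small sketch. It uses the 95% CIs quoted above, expressed as regression coefficients (negative means a drop in IQ). The standard error is backed out with a normal approximation, which is my assumption; the authors' exact calculation may differ slightly.

```python
from statistics import NormalDist

def approx_se_from_ci(lower, upper, level=0.95):
    """Approximate standard error implied by a CI (normal approximation)."""
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for a 95% CI
    return (upper - lower) / (2 * z)

def ci_includes_zero(lower, upper):
    """True if 'no change at all' is compatible with the interval."""
    return lower <= 0 <= upper

# 95% CIs as quoted in the text above (IQ points per 1 mg/L urinary fluoride).
intervals = {
    "boys": (-8.38, -0.60),
    "girls": (-2.51, 7.36),
}

for group, (lower, upper) in intervals.items():
    se = approx_se_from_ci(lower, upper)
    print(f"{group}: width={upper - lower:.2f}, approx SE={se:.2f}, "
          f"includes zero={ci_includes_zero(lower, upper)}")
```

The interval for boys excludes zero, but only just, and both intervals are wide compared with the size of the effect being claimed.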
This is why the epidemiological design used by the authors is worrying. For example:
“The whole thing about not including women who did not drink tap water is troubling since we know that certain drinks have higher concentrations of fluoride in them. If they didn’t drink tap water, what are the odds that they drank those higher-fluoride drinks, and what was the effect of that?”
This comes on top of the problems with the regression models used.
Transformation to normalise the data and inclusion of other important factors may have produced a non-significant relationship and there would be no need for this discussion and speculation.
What about those other important factors?
Green et al (2019) included other factors (besides maternal urinary fluoride) in their statistical model. This “adjustment” helps check that the main factor under consideration is still statistically significant when other factors are included. In this case, the coefficient (and CIs) for the linear association for boys was reduced from -5.01 (-9.06 to -0.97) for fluoride alone to -4.49 (-8.38 to -0.60) when the other factors were included. These other factors were race/ethnicity, maternal education, “city”, and HOME score (quality of home environment).
Dr. Najera questions the way other factors, or covariates, were selected for inclusion in the final model. He says:
“The authors also did something that is very interesting. They left covariates (the “other” factors) in their model if their p-value was 0.20. A p-value tells you the probability that the results you are observing are by chance. In this case, they allowed variables to stay in their mathematical model if the model said that there was as much as a 1 in 5 chance that the association being seen is due to chance alone. The usual p-value for taking out variables is 0.05, and even that might be a little too liberal.
Not only that, but the more variables you have in your model, the more you mess with the overall p-value of your entire model because you’re going to find a statistically significant association (p-value less than 0.05) if you throw enough variables in there. Could this be a case of P Hacking, where researchers allow more variables into the model to get that desired statistical significance? I hope not.”
Good point. I myself was surprised at the use of such a large p-value for selection. And, although the study treats fluoride as the main factor and inclusion of the other factors reduces the linear coefficient for maternal urinary fluoride, I do wonder why more emphasis was not put on these other factors which may contribute more to the IQ effect than does fluoride.
Perhaps this paper should have concentrated on the relationship of child IQ with race or maternal education rather than with fluoride.
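Dr. Najera's point about finding significance "if you throw enough variables in there" can be illustrated with simple arithmetic. The sketch below assumes independent tests and no true effects at all, which is a simplification and not a description of the Green et al (2019) model. It computes the chance of at least one p-value falling below the threshold purely by chance.

```python
def chance_of_false_positive(n_tests, alpha=0.05):
    """Probability of at least one 'significant' result among n_tests
    independent tests when there is no real effect in any of them."""
    return 1 - (1 - alpha) ** n_tests

for n_tests in (1, 5, 10, 14, 20):
    p = chance_of_false_positive(n_tests)
    print(f"{n_tests:2d} tests at alpha=0.05: {p:.0%} chance of a false positive")

# The same calculation with the more liberal 0.20 threshold used for retaining covariates.
print(f"5 tests at alpha=0.20: {chance_of_false_positive(5, alpha=0.20):.0%}")
```

Under these assumptions, by the time 14 independent tests have been run at the 0.05 level, a spurious "significant" finding is more likely than not.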
Padding out to overcome the poor explanation of IQ variance
Another point about the inclusion of these covariates. As well as possibly improving the statistical significance of the final model they may also make the model look better in terms of the ability to explain the variance in IQ (which is very large – see figure above).
In my first critique of the Green et al (2019) paper (If at first you don’t succeed . . . statistical manipulation might help) I pointed out that the reported relationship for boys, although statistically significant, explained very little of the variance in IQ. I found only 1.3% of the variance was explained – using data I had digitally extracted from the figure. This was based on the R-squared value for the linear regression analysis.
Unfortunately, the authors did not provide information like R-squared values for their regression analysis (poor peer review in my opinion) – that is why I, and others, were forced to extract what data we could from the figures and estimate our own. Later I obtained more information from Green’s MA thesis describing this work (Prenatal Fluoride Exposure and Neurodevelopmental Outcomes in a National Birth Cohort). Here she reported an R-squared value of 4.7%. Bigger than my 1.3% (my analysis suffered from not having all the data) but still very small. According to Nau’s (2017) discussion of the meaning of R-squared values (What’s a good value for R-squared?), ignoring the coefficient determined by Green et al (2019) (5.01) and relying only on the constant in the relationship would produce a predicted value of IQ almost as good (out by only about 2%).
That is, simply taking the mean IQ value (about 114.1 according to the figure above) for the data would be almost as good as using the relationship for any reasonable maternal urinary fluoride value and OK for practical purposes.
But look at the effect of including other factors in the model. Despite lowering the coefficient of the relationship for fluoride it drastically increases the R-squared value. Green reported a value of 22.0% for her final model. Still not great but a hell of a lot better than 4.7%.
Perhaps the inclusion of so many other factors in a multiple regression makes the final model look much better – and perhaps that perception is unjustly transferred to the relationship with fluoride.
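The R-squared values mentioned above can be translated into how much a model reduces the typical prediction error compared with simply using the mean IQ. The sketch below uses the standard relationship that the residual standard deviation is sqrt(1 − R²) times the standard deviation of the outcome (ignoring small degrees-of-freedom corrections), which is the idea behind Nau's discussion. The function name is my own.

```python
import math

def sd_reduction(r_squared):
    """Fractional reduction in the standard deviation of prediction errors,
    relative to predicting the mean, for a model with the given R-squared."""
    return 1 - math.sqrt(1 - r_squared)

# R-squared values quoted in the text above.
models = {
    "fluoride only, my extracted data": 0.013,
    "fluoride only, Green's thesis": 0.047,
    "final model with covariates, Green's thesis": 0.220,
}

for label, r2 in models.items():
    print(f"{label}: R-squared={r2:.1%}, error reduced by {sd_reduction(r2):.1%}")
```

On this basis the fluoride-only relationship reduces the prediction error by only about 2%, while the final model with covariates manages a little under 12%. Most of that improvement comes from the covariates, not from fluoride.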
Are other more important factors missed?
Almost certainly – and that could drastically alter the conclusions we draw from this data. The problem is that fluoride can act as a proxy for other factors. City location and size are just one aspect to consider.
In my paper Fluoridation and attention deficit hyperactivity disorder a critique of Malin and Till (2015), I showed that inclusion of altitude as a risk-modifying factor completely removed any statistical significance from the relationship between ADHD prevalence and fluoridation – despite the fact that Malin & Till (2015) had reported a significant relationship with R-squared values over 30%!
So you can see the problem. Even though authors may list a number of factors or covariates they “adjusted” for, important risk-modifying factors may well be ignored in such studies. This is not to say that inclusion of them “proves” causation any more than it does for fluoride. But if their inclusion leads to the disappearance of the relationship with fluoride one should no longer claim there is one (reviewers related to the group involved in the Green et al., 2019 study still cite Malin & Till 2015 as if their reported relationship is still valid).
In effect, the authors acknowledge this with their statement:
“Nonetheless, despite our comprehensive array of covariates included, this observational study design could not address the possibility of other unmeasured residual confounding.”
Dr. Najera summarises his impression of the Green et al (2019) study in these words:
“The big idea of these three blog posts was to point out to you that this study is just the latest study that tries very hard to tie a bad outcome (lower IQ) to fluoride, but it really failed to make that case from the epidemiological and biostatistical approaches that the researcher took, at least in my opinion. Groups were left out that shouldn’t. Outliers were left in without understanding them better. A child with IQ of 150 was left in, along with one mother-child pair of a below-normal IQ and very high fluoride, pulling the averages in their respective directions. The statistical approach was a linear regression that lumped in all of the variables instead of accounting for different levels of those variables in the study group. (A multi-level analysis that allowed for the understanding of the effects of society and environment along with the individual factors would have been great. The lack of normality in the distribution of outcome and exposure variables hint at a different analysis, too.)”