“Crusade Against Multiple Regression Analysis” – don’t throw baby out with bathwater


Richard Nisbett is a professor of psychology and co-director of the Culture and Cognition Program at the University of Michigan.

Edge has an interesting talk about the problems of research relying on regression analyses (see The Crusade Against Multiple Regression Analysis). Unfortunately, it has some important faults – not the least being Nisbett’s use of the term “multiple regression” when he is really complaining about simple regression analysis.

Professor Richard Nisbett quite rightly points out that many studies using regression analysis are worthless, even misleading – even, as he suggests, “quite damaging.”  Damaging because these studies get reported in the popular media and their faulty conclusions are “taken as gospel” by many readers. Nisbett says:

“I hope that in the future, if I’m successful in communicating with people about this, there’ll be a kind of upfront warning in New York Times articles: These data are based on multiple regression analysis. This would be a sign that you probably shouldn’t read the article because you’re quite likely to get non-information or misinformation.

Knowing that the technique is terribly flawed and asking yourself—which you shouldn’t have to do because you ought to be told by the journalist what generated these data—if the study is subject to self-selection effects or confounded variable effects, and if it is, you should probably ignore them. What I most want to do is blow the whistle on this and stop scientists from doing this kind of thing. As I say, many of the very best social psychologists don’t understand this point.

I want to do an article that will describe, similar to the way I have done now, what the problem is. I’m going to work with a statistician who can do all the formal stuff, and hopefully we’ll be published in some outlet that will reach scientists in all fields and also act as a kind of “buyer beware” for the general reader, so they understand when a technique is deeply flawed and can be alert to the possibility that the study they’re reading has the self-selection or confounded-variable problems that are characteristic of multiple regression.”

I really hope he does work with a statistician who can explain to him the mistakes he is making.  The fact that he raises the issue of “confounded-variable problems” shows he is really talking about simple regression analysis. This problem can be reduced by increasing the types and numbers of comparisons performed in an analysis – that is, by using multiple regression analysis, the very technique he makes central to his attack!
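As a hedged illustration of that point (my own toy simulation, not data from either article), the sketch below generates an outcome driven entirely by a confounder. A simple regression of the outcome on the exposure finds a spurious effect; a multiple regression that also includes the confounder finds essentially none:

```python
# Illustrative simulation (an assumption of mine, not from the article):
# a confounder z drives both the exposure x and the outcome y, so x has
# NO true effect on y.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

z = rng.normal(size=n)             # confounder (e.g. income)
x = z + rng.normal(size=n)         # exposure, partly driven by the confounder
y = 2.0 * z + rng.normal(size=n)   # outcome: depends on z, not on x

# Simple regression of y on x: the slope is biased away from the true value (0)
simple_slope = np.polyfit(x, y, 1)[0]

# Multiple regression of y on x AND z: x's coefficient is close to the truth
X = np.column_stack([np.ones(n), x, z])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
multiple_slope = coef[1]

print(simple_slope)    # roughly 1.0 – a spurious "effect" of x
print(multiple_slope)  # roughly 0.0 – confounder controlled for
```

The simulation only shows that multiple regression can remove bias from a *measured* confounder; an unmeasured one would still bias both analyses.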

The self-selection problem

Nisbett gives a couple of examples of the self-selection problem:

“A while back, I read a government report in The New York Times on the safety of automobiles. The measure that they used was the deaths per million drivers of each of these autos. It turns out that, for example, there are enormously more deaths per million drivers who drive Ford F150 pickups than for people who drive Volvo station wagons. Most people’s reaction, and certainly my initial reaction to it was, “Well, it sort of figures—everybody knows that Volvos are safe.”

Let’s describe two people and you tell me who you think is more likely to be driving the Volvo and who is more likely to be driving the pickup: a suburban matron in the New York area and a twenty-five-year-old cowboy in Oklahoma. It’s obvious that people are not assigned their cars. We don’t say, “Billy, you’ll be driving a powder blue Volvo station wagon.” Because of this self-selection problem, you simply can’t interpret data like that. You know virtually nothing about the relative safety of cars based on that study.

I saw in The New York Times recently an article by a respected writer reporting that people who have elaborate weddings tend to have marriages that last longer. How would that be? Maybe it’s just all the darned expense and bother—you don’t want to get divorced. It’s a cognitive dissonance thing.

Let’s think about who makes elaborate plans for expensive weddings: people who are better off financially, which is by itself a good prognosis for marriage; people who are more educated, also a better prognosis; people who are richer; people who are older—the later you get married, the more likelihood that the marriage will last, and so on.”
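Nisbett’s car example can be mimicked with a toy simulation (an illustrative assumption of mine, not real crash data): even when the vehicle has no effect whatsoever on crash risk, self-selection of riskier drivers into one vehicle makes that vehicle look more dangerous:

```python
# Toy self-selection simulation: two equally safe vehicles look different in
# crash rate per driver purely because riskier drivers prefer one of them.
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

risk = rng.uniform(size=n)                  # each driver's innate riskiness
drives_pickup = risk > rng.uniform(size=n)  # riskier drivers tend to pick the pickup

# Crash probability depends ONLY on the driver, never on the vehicle
crash = rng.uniform(size=n) < 0.01 + 0.05 * risk

pickup_rate = crash[drives_pickup].mean()
volvo_rate = crash[~drives_pickup].mean()

# The pickup shows a clearly higher crash rate, despite identical vehicle safety
print(pickup_rate > volvo_rate)
```

Comparing the two group rates here tells you about the drivers, not the vehicles – which is exactly why raw deaths-per-driver figures cannot establish relative car safety.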

You get the idea. But how many academic studies rely on regression analysis of data from a self-selected sample of people? The favourite groups for many studies are psychology undergraduates at universities!

Confounded variable problem

I have, in past articles, discussed some examples of this related to fluoride and community water fluoridation.

See also: Prof. Nisbett’s “Crusade” Against Regression

Conclusions 

Simple regression analyses are too prone to confirmation bias, and Nisbett should have chosen his words more carefully. Multiple regression is not a silver bullet – but it is far better than a simple correlation analysis. Replication and proper peer review at all research and publication stages also help. And we should always be aware of these and other limitations of exploratory statistical analysis. Ideally, such analyses should serve only as a guide for future, more controlled, studies.

Unfortunately, simple correlation studies are widespread, and reporters seem to see them as easy material for their mainstream media articles. This is dangerous because it gives such limited studies more influence on readers, and their actions, than they really warrant. And in the psychological and health fields there are ideologically motivated groups who will promote such poor-quality studies because they fit their own agenda.


10 responses to ““Crusade Against Multiple Regression Analysis” – don’t throw baby out with bathwater”

  1. Many of the many problems Nisbett points out don’t really have anything to do with whether or not regression analysis itself has problems. It seems that he’s touching on the issue of researchers confusing model exploration with model confirmation or validation, on the issue of publication bias (and by extension, the multiple comparison problem), on the issue of frankly ridiculous data dredging… but blaming regression for any of these misses the point for all of them.


  2. I agree, Alex. But he is correct that there are lots of articles out there whose authors seem to be fooled by simple regression analyses without understanding their problems.


  3. I experienced a similar thing from a person who proclaimed he should know what he is talking about because of being related to Thomas Szaz.

    His idea of partial correlation was, in a situation where you have correlated a number of variables with one another to make an exploratory table, to just take the result between two variables and call that partial.

    We have to go through a lot of put-downs in this business.


  4. As Alex C suggests, Nisbett may be thinking “multiple regression” means doing multiple (simple) regressions.

    The actual “multiple regression” attempts to imagine what relative strength various combined input factors have on an outcome.

    In a somewhat limited fashion, partial correlation does that too.


  5. Read the article, Brian. Nisbett is not talking about “multiple (simple) regressions.”


  6. Brian – you say “We have to go through a lot of put-downs in this business.”

    What business are you in?


  7. I’ve been in this business in the last 24 hours, Ken: https://www.geneticliteracyproject.org/2016/01/28/story-behind-seralinis-disappearing-gmos-toxic-study-journal-published/#comment-2484435918

    A few put-downs, and people taking things as “personal insults” when the question makes them uncomfortable, such as about their funding.


  8. Nisbett: “I hope that in the future, if I’m successful in communicating with people about this, there’ll be a kind of upfront warning in New York Times articles: These data are based on multiple regression analysis.”

    “analysis”? Or is that the NYT spelling checker, and he actually said “analyses”?

    “It’s obvious that people are not assigned their cars. We don’t say, “Billy, you’ll be driving a powder blue Volvo station wagon.” Because of this self-selection problem, you simply can’t interpret data like that.”

    What he may be getting at is the difference between experimental and observational research. That distinction is something I had to explain to a couple of people attacking Seralini on the GLP URL (last comment).

    Observational research is what you are left with if you cannot manipulate the situation with control and experimental groups – as with earthquakes and epidemics. In observational research you need to look for other possible confounding factors. I hope his advisers tell him that, rather than telling him to go only for experimental research. Maybe he is just trying to get people through that point without much thinking from them.


  9. Ken: I’d say it’s more a problem of researchers (or the media) being fooled by simple regression analysis without understanding regression analysis. OLS works under a variety of assumptions: homoscedasticity of error terms (this all seems to be about cross-sectional data, so autocorrelation shouldn’t appear), linear independence of regressors, that the desired quantity to be minimized is the mean square error, that (typically) the error terms should add to zero and also be normally distributed, and – of particular relevance to a lot of these examples – that samples are IID (there are more too, like whether the regression is even the correct specification). Do researchers check these model assumptions? Probably not. But if the model assumptions aren’t actually valid then the problem doesn’t lie with regression but with the researcher who tried to use it.

    And then of course we have the whole issue of (mis)interpreting p-values! Strictly speaking they’re just the probability of observing a value so “extreme” (more appropriately, so supportive of the alternative hypothesis) given that the null hypothesis is true. But if they’re interpreted as probabilities that the null is true (or worse, that the alternative is false), then there will be problems when we start to consider things like unaccounted variables that actually have a role in the data-generating process. And the self-selection problem, which is basically one way IID can fail, will help to make p-values invalid anyway.

    But to distill my thesis again: I think these problems arise because regression is improperly applied or interpreted. Maybe one can say that these are “problems with regression”, but why blame the hammer because people think everything is a nail?


  10. For what it’s worth, I would say that these problems would exist with multiple regression too. Multiple regression can be just as invalid as simple regression if, especially, the model is not correctly specified, i.e. if not all relevant variables are taken into account (and if irrelevant variables are taken into account). Model specification is not a simple process either; choosing variables should depend a lot on theory and quite a bit less on whether included variables happen to have observed significant effects. When it comes to understanding what sorts of things impact human behavior, as in a lot of these examples, I believe I am not too far off in saying there’s not a lot of useful guiding theory for model building, so that issues abound is not very surprising to me.

