Researchers Fool Themselves: Water and Cognition

A recent paper about the effect of water on cognition illustrates a common way that researchers overstate the strength of the evidence, apparently fooling themselves. Psychology researchers at the University of East London and the University of Westminster did an experiment in which subjects ate and drank nothing from 9 pm onward and came to the testing room the next morning. Each subject came in twice, in different weeks. On both visits they were given something to eat; on only one visit were they also given water to drink. Half of the subjects were given water on the first week, half on the second. The researchers then gave subjects a battery of cognitive tests.

One result makes sense: subjects were faster on a simple reaction time test (press button when you see a light) after being given water, but only if they were thirsty. Apparently thirst slows people down. Maybe it’s distracting.

The other result emphasized by the authors doesn’t make sense: Water made subjects worse at a task called Intra-Extra Dimensional Set Shift. The task provided two measures (total trials and total errors) but the paper gives results only for total trials. The omission is not explained. (I asked the first author about this by email; she did not explain the omission.) On total trials, subjects given water did worse, p = 0.03. A surprising result: after persons go without water for quite a while, giving them water makes them worse.

This p value is not corrected for the number of tests done. A table of results shows that 14 different measures were used. There was a main effect of water on two of them. One was the simple reaction time result; the other was the IED Stages Completed (IED = intra/extra dimensional) result. It is likely that the effect of water on simple reaction time was a “true positive” because the effect was influenced by thirst. In contrast, the IED Stages Completed effect wasn’t reliably influenced by thirst. Putting the simple reaction time result aside, there are 13 p values for the main effect of water; one is weakly reliable (p = 0.03). If you do 20 independent tests, purely by chance you are likely to get p < 0.05 at least once even when there are no true effects. Taken together, there is no good reason to believe that water had main effects aside from the simple reaction time test. The paper would make a good question for an elementary statistics class (“Question: If 13 tests are independent, and there are no true effects present, how likely is it that at least one will be p = 0.03 or better by chance? Answer: 1 – (0.97^13) = 0.33”).
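The arithmetic above can be checked with a few lines of Python. This is only a sketch under the same assumptions as the classroom question (independent tests, no true effects). The function names are made up for illustration, and the Šidák threshold is an extra that the paper does not use: it is the per-test cutoff that would hold the overall false-positive rate at 0.05.

```python
def prob_at_least_one(n_tests: int, threshold: float) -> float:
    """Chance that at least one of n independent null tests gives p <= threshold."""
    return 1.0 - (1.0 - threshold) ** n_tests


def sidak_threshold(n_tests: int, overall_alpha: float = 0.05) -> float:
    """Per-test cutoff that keeps the chance of any false positive at overall_alpha."""
    return 1.0 - (1.0 - overall_alpha) ** (1.0 / n_tests)


if __name__ == "__main__":
    # 13 tests, asking for p = 0.03 or better on at least one of them.
    print(round(prob_at_least_one(13, 0.03), 2))  # 0.33
    # 20 tests at the usual 0.05 level.
    print(round(prob_at_least_one(20, 0.05), 2))  # 0.64
    # Cutoff each of 13 tests would need for an overall rate of 0.05.
    print(round(sidak_threshold(13), 4))
```

With 13 tests, the per-test cutoff comes out far below 0.03, so under these assumptions the IED result would not survive correction.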

I wrote to the first author (Caroline Edmonds) about this several days ago. My email asked two questions. She replied but failed to answer the question about number of tests. Her answer was written in haste; maybe she will address this question later.

A better analysis would have started by assuming that the 14 measures are unlikely to be independent. It would have done (or used) a factor analysis that condensed the 14 measures into (say) three factors. Then the researchers could ask if water affected each of the three factors. Far fewer tests, far more independent tests, far harder to fool yourself or cherry-pick.
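To make the idea of condensing measures concrete, here is a deliberately simplified sketch. It is not a factor analysis: instead of estimating factors from the data, it assumes the grouping of measures is chosen in advance, standardizes each measure, and averages within each group to get one composite score per subject. The measure names, the grouping, and the scores are all invented for illustration, and measures where lower is better (such as reaction times) would need their sign flipped before being averaged with measures where higher is better.

```python
from statistics import mean, pstdev


def zscores(values):
    """Standardize one measure so measures on different scales can be averaged."""
    centre = mean(values)
    spread = pstdev(values)
    if spread == 0:
        # A measure with no variation carries no information; score it as zero.
        return [0.0] * len(values)
    return [(v - centre) / spread for v in values]


def composite_scores(measures, groups):
    """Average standardized measures within each pre-specified group.

    measures: {measure name: [one score per subject]}
    groups:   {composite name: [measure names]}  (hypothetical grouping)
    """
    standardized = {name: zscores(vals) for name, vals in measures.items()}
    return {
        composite: [mean(row) for row in zip(*(standardized[n] for n in names))]
        for composite, names in groups.items()
    }


if __name__ == "__main__":
    # Made-up scores for four subjects on four hypothetical measures.
    measures = {
        "simple_rt": [310, 295, 330, 305],
        "choice_rt": [450, 430, 480, 445],
        "memory_a": [12, 15, 11, 14],
        "memory_b": [20, 24, 19, 22],
    }
    groups = {
        "speed": ["simple_rt", "choice_rt"],
        "memory": ["memory_a", "memory_b"],
    }
    for name, scores in composite_scores(measures, groups).items():
        print(name, [round(s, 2) for s in scores])
```

Each composite would then get a single water versus no-water comparison, so the number of tests drops from 14 to the number of composites.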

The problem here (many tests, with no correction for their number and no analysis that uses far fewer tests) is common, but the analysis I suggest is, in experimental psychology papers, very rare. (I’ve never seen it.) Factor analysis is taught as part of survey psychology (psychology research that uses surveys, such as personality research), not as part of experimental psychology. In the statistics textbooks I’ve seen, the problem of too many tests, and how to correct for or reduce the number of tests, isn’t emphasized. Perhaps it is a research methodology example of Gresham’s Law: methods that make it easier to find what you want (differences with p < 0.05) drive out better methods.

Thanks to Allan Jackson.

7 Responses to “Researchers Fool Themselves: Water and Cognition”

  1. Alex Chernavsky Says:

    In my (limited) experience, even people who claim to be experts in statistics often don’t know what they’re talking about. I remember working on a paper back in graduate school. The lead author was trying to determine whether to use the Kolmogorov-Smirnov test, the Mann-Whitney test, or something else altogether. The people we consulted did not exactly fill us with confidence in their abilities. (Of course, maybe we just asked the wrong people.)

    Seth: I have never used the Kolmogorov-Smirnov test nor the Mann-Whitney test and expect never to do so.

  2. David Johnston Says:

    KS is to MW as sameness of distribution is to sameness of means.

    KS is very handy in my line of work, MW not so much.

    Seth: KS is much less sensitive than other tests.

  3. Three Pipe Problem Says:

    Can you recommend any resources for people who may have a decent math background and know some very basics about statistics and are looking to improve their insight in a time effective manner?

    Seth: Their insight about what?

  4. Kevin Miller Says:

    I think a better and, to my mind, more convincing approach is to require replication. Finding the same effect in two independent samples is pretty powerful. Controls for multiple comparisons are a mixed bag, and some of them can really be too strict. Multiple-experiment papers, which are the norm in most cognitive psychology journals, typically include a replication as part of later experiments, and I find that to be more convincing than statistical controls for multiple comparisons. Of course, neither controls for other issues such as non-representativeness of stimuli or subjects, but such is life in this vale of tears.

    Seth: The first person not to fool is yourself — preferably without the enormous cost of repeating the experiment. You don’t seem to be taking the cost of replication into account. Or maybe you are thinking like a reader rather than as a practitioner. If you have two or three independent tests, the adjustments for multiple comparisons are realistic. You really can set the overall p value to 0.05. Sure, replication is a good idea, too. In multiple experiment papers, nothing is said about experiments that didn’t work out — there is no statement like “we are not omitting relevant evidence” — so it is less than obvious what repetition means. Of course, when authors omit relevant evidence they are deceiving others, not themselves. What’s interesting about this example is that the researchers appear to have deceived themselves.

  5. Mark Says:

    Completely agree that replication is key (in experimental studies, that is… Replication is a less reliable tool in non-randomised studies because bias is systematic), and is a much more important consideration than multiple testing. Any “significant” result might be a type 1 error, and a p-value doesn’t give any indication how likely that might be. Replication is the only way to rule that out (which is why FDA typically requires at least 2 phase 3 trials).

  6. John Eels Says:

    @Three Pipe Problem, that’s a good question. I’d like to run simple experiments and test for significance. I studied psychology and had courses in statistics. I don’t remember much. I’d like to brush up on basic concepts in a playful style: Self experimentation 101 with suggested experiments (week one eat 60g butter daily, week two don’t, week three like week one again and so on, and test reaction time or whatever concept daily; or the connection between amount of coffee consumed and self-measured sleep quality; or between sleep quality and seeing people in the evening). Stuff gets so much more interesting if it has real life value.

  7. gwern Says:

    > A better analysis would have started by assuming that the 14 measures are unlikely to be independent. It would have done (or used) a factor analysis that condensed the 14 measures into (say) three factors. Then the researchers could ask if water affected each of the three factors. Far fewer tests, far more independent tests, far harder to fool yourself or cherry-pick.

    Where would these 3 factors come from? If they’re being estimated from the data, without any previous reason to expect particular factorization, and then fed into the tests with the expectation we have reduced our false positive problem, that makes me feel uneasy and wonder how kosher this procedure really is.

    Incidentally, as far as the Mann-Whitney goes, I’ve used it twice recently when I wanted to compare 2 groups but they were blatantly not normally distributed (in http://www.gwern.net/Google%20Alerts#it-was-mid-2011 and http://www.gwern.net/Anchoring#article-effect). Normality is common, sure, but so is non-normality; if one never feels the need to use it, perhaps one is simply shoehorning t-tests where they don’t belong.

    Seth: Check out “data transformation” (e.g., log transformation) the next time you encounter non-normal data.