The Blindness of Scientists: The Problem isn’t False Positives, It’s Undetected Positives

Suppose you have a car that can only turn right. Someone says, “Your car turns right too much.” You might wonder why they don’t see the bigger problem (it can’t turn left).

This happens in science today. People complain about how well the car turns right, failing to notice (or at least say) it can’t turn left. Just as a car should turn both right and left, scientists should be able to (a) test ideas and (b) generate ideas worth testing. Tests are expensive. To be worth the cost of testing, an idea needs a certain plausibility. In my experience, few scientists have clear ideas about how to generate ideas plausible enough to test. The topic is not covered in any statistics text I have seen — the same books that spend many pages on how to test ideas.

Apparently not noticing the bigger problem, scientists sometimes complain that this or that finding “fails to replicate”. My former colleague Danny Kahneman is an example. He complained that priming effects were not replicating. Implicit in a complaint that Finding X fails to replicate is a complaint about testing: if you complain that X fails to replicate, you are saying that something was wrong with the tests that established X.

There is a connection between replication failure and failure to generate ideas worth testing. If you cannot generate new ideas, you are forced to test old ideas. You cannot test an old idea exactly — that would be boring/repetitive. So you give an old idea a slight tweak and test the variation. For example, someone has shown that X is true in North America. You ask if X is true in South America. You hope you haven’t tweaked X too much. No idea is true everywhere, except maybe in physics, so as this process continues — it goes on for decades — the tested ideas gradually become less true and the experimental effects get weaker.

This is what happened in the priming experiments that Kahneman complained about. At the core of priming — the priming effects studied 30 years ago — is a true phenomenon. After reading “doctor” it becomes easier to decide that “nurse” is a word, for example. This was followed by 30 years of drift away from word recognition. Not knowing how to generate new ideas worth testing, social psychologists have ended up studying weak effects (recent priming effects) that are random walks away from strong effects (old priming effects). The weak effects cannot bear the professional weight they are asked to carry (people’s careers rest on them) and sometimes collapse (“failure to replicate”). Sheena Iyengar, a Columbia Business School professor and social psychologist, received a major award (best dissertation) for, and wrote a book about, a new effect that has turned out to be very close to non-existent.
Inability to generate ideas — to understand how to do so — means that what appear to be new ideas (not just variations of old ideas) are more likely to be mistakes. I have no idea whether Iyengar’s original effect was true or not. I am sure, however, that it was weak and made little sense.

Statistics textbooks ignore the problem. They say nothing about how to generate ideas worth testing. I haven’t asked statisticians about this, but they might respond in one of two ways:

1. That’s someone else’s problem. Statistics is about what to do with data after you gather it. That makes as much sense as teaching someone how to land a plane but not how to take off.

2. That’s what exploratory data analysis is for. If I said “Exploratory data analysis can only identify effects of factors that the researcher decided to vary or track, which is expensive. What about other factors?” they’d be baffled, I believe. In my experience, exploratory data analysis = full analysis of your data. (Many people do only a small fraction, such as 10%, of all reasonable analyses of their data.) Full analysis is better than partial analysis, but calling it a way to find new ideas overlooks the fact that professional scientists study the same factors over and over.

I suppose many scientists feel the gap acutely. I did. I became interested in self-experimentation most of all because it generated new ideas at a much higher rate (per year) than my professional experiments with rats. At first I had no idea why, but as it kept happening — my self-experimentation generated one new idea after another — I came to believe that by accident I was doing something “right”. I was doing something that fit a general rule of how to generate ideas, even though I didn’t know what the general rule was.

The sciences I know about (psychology and nutrition) have great trouble coming up with new ideas. The paleo movement is a response to stagnation in the field of nutrition. The Shangri-La Diet shows what a new idea looks like in the area of weight control. The failure of nutritionists to study fermented foods is ongoing. Stagnation in psychology can be seen in the fact that antidepressants remain heavily prescribed, many years after the introduction of Prozac (my work on morning faces and mood suggests a much different approach), lack of change in treatments for bipolar disorder over the last 50 years (again, my morning-faces work suggests another approach), and in the failure of social psychologists to discover any big new effects in the last ten years. 

 

Here is the secret to idea generation: Cheaper tests. To find ideas plausible enough to be worth testing with Test X, you need a way of testing ideas that is cheaper than Test X. The cheaper your test, the larger the region of cause-effect space you can explore. Let’s say Test Y is cheaper than Test X. With Test Y, you can explore more of cause-effect space than you can explore with Test X. In the region unexplored by Test X, you can find points (cause-effect relationships) that pass Test Y. They are worth testing with Test X. My self-experimentation generated new ideas worth testing with more expensive tests because it was much cheaper than existing tests. Via self-experimentation, I could test many ideas too implausible or too expensive to be tested conventionally. Even cheaper than a self-experiment was simply monitoring myself — tracking my sleep, for example. Again and again, this generated ideas worth testing via self-experimentation. I did what all scientists should do: use cheaper tests to generate ideas worth testing with more expensive tests.
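The economics of the two-test strategy can be sketched with a back-of-the-envelope calculation. Everything numerical here is an illustrative assumption of mine, not from the post: a 1% base rate of true effects among candidate ideas, a cheap Test Y with 80% sensitivity and 90% specificity, and an expensive Test X (assumed definitive) that costs 100 times as much.

```python
def ideas_found(budget, cost_x, cost_y=None, base_rate=0.01,
                sens=0.8, spec=0.9):
    """Expected number of true effects confirmed by expensive Test X
    under a fixed budget, with an optional cheap screening Test Y.
    All rates are hypothetical, for illustration only."""
    if cost_y is None:
        # Test X alone: every candidate idea gets the expensive test.
        return (budget / cost_x) * base_rate
    # Fraction of ideas that pass the cheap screen (true and false positives).
    pass_rate = base_rate * sens + (1 - base_rate) * (1 - spec)
    # Average cost per idea: cheap test always, expensive test only on passers.
    cost_per_idea = cost_y + pass_rate * cost_x
    ideas_screened = budget / cost_per_idea
    # True effects that both pass Test Y and are confirmed by Test X.
    return ideas_screened * base_rate * sens
```

With these assumed numbers, a budget of 10,000 spent on Test X alone confirms about 1 true effect, while screening with Test Y first confirms nearly 7: the cheap test lets you explore a much larger region of cause-effect space for the same money.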

 

7 Responses to “The Blindness of Scientists: The Problem isn’t False Positives, It’s Undetected Positives”

  1. August Says:

    Cheaper tests are good, unless the cheaper tests are computer models. We need cheaper real world tests. Didn’t you post a long time ago about the error rate of DNA computer models? I learned about the dubious nature of computer modeling via the intersection of politics and climatology.

    Seth: Computer models aren’t tests of theories, they are theories.

  2. Tom Says:

    A good illustration of your point:
    http://well.blogs.nytimes.com/2013/09/10/myths-surround-breakfast-and-weight/

  3. kxmoore Says:

I wonder if this test is replicable. The implications are huge considering the explosion of obesity and its related health effects.

    http://www.ncbi.nlm.nih.gov/pubmed/22863169

  4. Wil B Says:

    Experience has shown, I believe, that this is a particularly glaring problem in the context of drug testing and development in the pharmaceutical industry. Much of this is surely brought about by conflicts of interest (and a certain amount of corruption) which are endemic in this exceedingly profit-driven industry.

    Another issue is the common disregard of true scientific method (Popper anyone?) in the planning and conduct of testing because so much of the “testing” is done with profits in mind. Shouldn’t the object (protocol) for whatever is being tested be to disprove a particular hypothesis, rather than allowing biases to creep in so that a certain desired result can be realized and money can be made? A failure to disprove a hypothesis would be much more honest and convincing, wouldn’t it?

Seth: If “profit” includes career advancement, salary increase, grant renewal, and so on, I think about 100% of testing is done with profit in mind. Almost always, some results are more profitable than others.

  5. Kim Øyhus, physicist Says:

Tests confirming a theory will give less and less confirmation for more and more work.
Tests falsifying a theory are much more efficient.

This means that good, cheap scientific results will consist of lots of partially confirmed theories, with many more falsified theories discarded.

    Seth: “Tests falsifying a theory are much more efficient.” Hmm. You don’t have a choice. You can’t choose whether your test will support or falsify your theory. That depends on the results. However, I do think that tests of ideas of intermediate plausibility produce more information — more shift in belief — than tests of ideas that are very high or very low in plausibility. But that’s just the average shift in belief. More interesting is what the distributions look like.
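The claim in this reply — that ideas of intermediate plausibility yield the biggest average shift in belief — can be checked with a small Bayesian calculation. This is my sketch, not from the post: assume a binary test whose result matches the truth with some fixed accuracy, and measure the expected absolute change from prior to posterior.

```python
def expected_shift(prior, accuracy=0.8):
    """Expected absolute change in belief after one binary test whose
    result matches the truth with probability `accuracy` (an assumed
    symmetric-error model, for illustration)."""
    # Probability the test comes out positive.
    p_pos = prior * accuracy + (1 - prior) * (1 - accuracy)
    # Posterior belief after each outcome, by Bayes' rule.
    post_pos = prior * accuracy / p_pos
    post_neg = prior * (1 - accuracy) / (1 - p_pos)
    # Average the belief shift over the two possible outcomes.
    return (p_pos * abs(post_pos - prior)
            + (1 - p_pos) * abs(post_neg - prior))
```

With 80% accuracy, a 50/50 idea moves belief by 0.30 on average, while ideas at 5% or 95% plausibility move it by less than 0.06: the expected shift peaks at intermediate plausibility, as the reply says.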

  6. August Says:

    I agree, but they are ‘testing’ against computer models and presenting the results as evidence to the public. I’ve even heard of it locally, in biochem. My coworker’s former boss was doing research into proteins related to cancer. They kept trying to get him to test against computer models. They closed his lab down eventually, and he had to go work as someone else’s assistant.

  7. Wil B Says:

    Seth said (re the scientific approach of attempting to disprove a hypothesis vs. trying to prove the hypothesis): You don’t have a choice. You can’t choose whether your test will support or falsify your theory. That depends on the results.

    Perhaps we are discussing different categories and goals of “testing.” Staying with the category of drug development and testing for purposes of illustration, let’s say a company’s hypothesis is that compound X, being developed in a laboratory, is generally effective to treat Y disease in humans and will not have any bad side effects. Employing the usual testing methods, a study group of investigators administers compound X to human subjects with disease Y to see if the compound cures (or ameliorates) the disease. If a significant number of the subjects are cured (or get better) after taking the compound, and do not appear to suffer debilitating side effects, the hypothesis is “proven,” the results are positive, and the company will make a lot of money selling it to patients with disease Y.

    But can we be certain that the testing and record keeping was conducted honestly and without bias in advance; i.e., free of undue influence and a strong desire to reach a certain result? Wouldn’t it be better to at least have a separate, independent arm of the study (if for no other reason than to keep the results-oriented group of investigators honest) examining and tracking the same test subjects and maintaining their own data with the opposite “goal” of falsifying the first group’s finding?

    If the latter group fails in its effort to falsify the first group’s results, that’s fine because the company (and the FDA) could have much more confidence that the compound will, in fact, be a safe and efficacious product. In this instance the failure to falsify was in reality a positive result!

    In this context, which method of drug development, testing (and perhaps ultimate approval) would most doctors and end users prefer?

    Wil B.