Recently the New York Times published a fascinating article by John Schwartz in the science section about how two teenagers discovered that a lot of raw fish sold in New York is mislabeled. Unfortunately, the article contained two big mistakes:
1. The teenagers’ results were dismissed as unconvincing because the sample size (10 stores and 4 sushi restaurants) was, according to Schwartz, too small. For many purposes the sample was large enough, if their sampling method was good.
2. The sampling method wasn’t described. Without knowing how the stores and restaurants were chosen, it’s impossible to know to what population the results apply. This was like reviewing a car and not saying the price.
In an email to the Times I pointed out the first mistake:
Your article titled “Fish Tale Has DNA Hook” by John Schwartz, which appeared in your August 22, 2008 issue, has two serious errors:
1. The article states: “The sample size is too small to serve as an indictment of all New York fishmongers and restaurateurs.” To whom the results apply — whom they “indict” — depends on the sampling method used — how the teenagers decided what businesses to check. Sample size has almost nothing to do with it. This was the statistician John Tukey’s complaint about the Kinsey Report. The samples were large but the sampling method was terrible — so it didn’t matter that the samples were large.
2. The article states: “the results are unlikely to be a mere statistical fluke.” It’s unclear what this means. In particular, I have no idea what it would mean that the results are “a mere statistical fluke.” The error rate of the lab where the teenagers sent the fish to be identified is probably very low.
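The Tukey point in item 1 can be made concrete with a quick simulation. This is only a sketch: the population of 1,000 businesses, the two groups, and their mislabeling rates are all made up for illustration and have nothing to do with the teenagers’ data. It compares a small simple random sample of 14 with a much larger sample drawn only from an unrepresentative group.

```python
import random

def compare_sampling(trials=2000, seed=0):
    """Mean absolute error of two ways of estimating a mislabeling rate."""
    rng = random.Random(seed)
    # Hypothetical population (1 = sells mislabeled fish, 0 = does not).
    # Each of 200 "convenient" businesses mislabels with probability 0.60,
    # each of 800 others with probability 0.15. These numbers are invented.
    convenient = [1 if rng.random() < 0.60 else 0 for _ in range(200)]
    others = [1 if rng.random() < 0.15 else 0 for _ in range(800)]
    population = convenient + others
    true_rate = sum(population) / len(population)

    small_random_error = 0.0
    large_biased_error = 0.0
    for _ in range(trials):
        # Good method, small sample: every business is equally likely.
        small = rng.sample(population, 14)
        # Bad method, large sample: only the convenient businesses.
        large = rng.sample(convenient, 150)
        small_random_error += abs(sum(small) / 14 - true_rate)
        large_biased_error += abs(sum(large) / 150 - true_rate)

    return true_rate, small_random_error / trials, large_biased_error / trials

true_rate, small_err, large_err = compare_sampling()
print(f"true rate: {true_rate:.2f}")
print(f"mean error, random sample of 14:  {small_err:.2f}")
print(f"mean error, biased sample of 150: {large_err:.2f}")
```

With these made-up numbers the biased sample of 150 lands much farther from the true rate, on average, than the random sample of 14. That is the sense in which sample size has almost nothing to do with whom the results apply to.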
In retrospect the second error is “serious” only if incomprehensibility is serious. Maybe not. I should have pointed out the failure to describe the sampling protocol but didn’t.
I got the following reply from Schwartz:
Thank you for your note about my article, “Fish Tale Has DNA Hook,” which appeared in the newspaper on Friday. You state that the story misstated the importance of sampling size as “an indictment of all New York fishmongers and restaurateurs.” Although you are certainly correct in stating that poor methodology can undercut work performed using even the largest samples, it is also ill advised to try to establish broad conclusions from a very small sample. The fact that mislabeling occurred one in four pieces of seafood from 14 restaurants and shops in no way allows us to conclude that 25 percent of fish sold in New York or in the United States is mislabeled. And that is all I was trying to say with the reference to sample size was that while the girls’ experiment shows that some mislabeling has occurred, their work cannot say how much of it goes on or whether any given restaurant or shop is mislabeling its products. Similarly, when I wrote that it is unlikely the findings are a “statistical fluke,” I merely meant that while it is possible that Kate and Louisa found the only 8 restaurants and shops in New York City that mislabel their products, that is not likely, and so the possibility that the practice is widespread should not be discounted. And, of course, I hope you can forgive the pun.
Thanks again for taking the time read the article and respond to it, and I hope that you will find more to like in other stories that I write.
Uh-oh. The email was as mistaken as the article, although it did clear up what “statistical fluke” meant. I wrote again:
Thanks for your reply. I’m sorry to say that you still have things more or less completely wrong.
“Their work cannot say how much of it goes on or whether any given restaurant or shop is mislabeling its products.” Wrong. [Except for the obvious point that the survey does not supply info about particular places.] I don’t know what sampling protocol they used — how they chose the restaurants and fish sellers. (This is another big problem with your article, that you didn’t state how they sampled.) Maybe they used a really good sampling protocol, one that gave each restaurant and fish seller an equal chance of being in the sample. If so, then their work can indeed “say how much [mislabeling] goes on.” They can give an estimate and put confidence intervals around that estimate. Just like the Gallup poll does.
Somewhere you got the idea that big samples are a lot better than small ones. Sometimes you do need a big sample — if you want to predict the outcome of a close election, for example. But for many things you don’t need a big sample to answer the big questions. And this is one of those cases. There is no need to know with any precision how much mislabeling goes on. If it’s above 50%, it’s a major scandal, if it’s 10-50% it’s a minor scandal, if it’s less than 10%, it’s not a scandal at all. And the study you described in your article probably puts the estimate firmly in the minor scandal category. In contrast to your “it’s cute but doesn’t really tell us anything” conclusion quite the opposite is probably true (if their sampling procedure was good): It probably tells us most of what we want to know. You’re making the same mistake Alfred Kinsey made: He thought a big sample was wonderful. As John Tukey told him, he was completely wrong. Tukey said he’d rather have a sample of 3, well-chosen.
Thanks for explaining what you meant by “statistical fluke.” You may not realize you are breaking new ground here. Scientists wonder all the time if their results are “a statistical fluke.” What they mean by this is that they’ve done an experiment and have two groups, A (treated) and B (untreated) and wonder if the measured difference between them — there is always some difference — could be due to chance, that is, is a statistical fluke. In your example of the mislabeled fish there are not two groups — this is why your usage is mysterious. I have never seen the phrase used the way you used it. And I think that the readers of the Times already realized, without your saying so, that it is exceptionally unlikely that these were the only fish sellers in New York that mislabeled fish.
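To show what “give an estimate and put confidence intervals around that estimate” could look like here, below is a minimal sketch using the Wilson score interval for a proportion. It rests on assumptions the article does not confirm: that the businesses were a simple random sample, and that observations are independent. The 8-of-14 figure is the one implied by Schwartz’s email. The article gives only “one in four” for the fish themselves, so the count of 40 pieces is hypothetical, chosen just to match that ratio.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Approximate 95% Wilson score interval for a proportion (z = 1.96)."""
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width

# Implied by Schwartz's email: 8 of the 14 businesses sold mislabeled fish.
low, high = wilson_interval(8, 14)
print(f"businesses selling mislabeled fish: {low:.0%} to {high:.0%}")

# HYPOTHETICAL count: 10 mislabeled pieces out of 40 (matches "one in four").
# The real number of pieces is not given in this post.
low, high = wilson_interval(10, 40)
print(f"pieces mislabeled (assumed n = 40): {low:.0%} to {high:.0%}")
```

Under these assumptions the business-level interval runs from roughly 33% to 79%, and the hypothetical piece-level interval from roughly 14% to 40%, which sits inside the 10-50% band. Pieces bought at the same store are not really independent, so an honest interval for the piece-level rate would be somewhat wider than this sketch suggests.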
Schwartz replied:
I understand your points, and certainly see the difference between a small-but-helpful sample and a large-but-useless sample. but four restaurants simply cannot represent the variety of dining establishments in New York City. Four restaurants, ten markets.
I also realize that you must think I am thickheaded to keep at this, but I will certainly keep in mind your points in the future and will try not make facile references to small and large samples when the principles are, as you state, more complicated than that.
To be continued. My original post about this article.