In an earlier post I asked 15 questions about Zelinski et al. (2011) (“Improvement in memory with plasticity-based adaptive cognitive training: results of the 3-month follow-up”), a study done to measure the efficacy of the brain training sold by Posit Science. The study asked if the effects of training were detectable three months after it stopped. Henry Mahncke, the head of Posit Science, recently sent me answers to a few of my questions.
Most of my questions he declined to answer. He didn’t answer them, he said, because they contained “innuendo”. My questions were ordinary tough (or “critical”) questions. Their negative slant was not at all hidden (in contrast to innuendo). For the questions he didn’t answer, he substituted less critical questions. I give a few examples below. Unwillingness to answer tough questions about a study raises doubts about it.
His answers raised more doubts. From his answer to Question 7, I learned that although the investigators gave their subjects the whole RBANS, (a) they failed to report the results from the visual subtests and (b) these unreported results did not support their conclusions. Mahncke says this result was not reported “due to lack of publication space.” The original paper did not say that some results were omitted due to lack of space. I assume all favorable results were reported. To report all favorable results but omit some unfavorable results is misleading.
To further explain the omission, Mahncke says
We used the auditory measures as the primary outcome measure because we hypothesized that cognitive domains [by "cognitive domains" he means the cognitive gains due to training -- Seth] would be restricted to the trained sensory domain, in this case the auditory system. [emphasis added]
He doesn’t say he believed the gains would be greater with auditory stimuli, he says he believed they would be restricted to auditory stimuli. The Posit Science website says their training increases “memory”, “intelligence”, “focus” and “thinking speed”. None of these are restricted to the auditory system — far from it. Unless I am misunderstanding something, the head of Posit Science doesn’t believe the main claims of the Posit Science website.
Why Mahncke fails to see a difference between methods (Question 13) and results (Question 14), fails to see a difference between methods (Question 11) and discussion (Question 15), and gives a one-word answer (“yes”) to Question 12, I cannot say. In each case, however, he errs on the side of not answering.
My overall conclusion is that this study does not support Posit Science claims. The main measure (RBANS auditory subtests) didn’t show significant retention. A closely related set of measures (RBANS visual subtests) didn’t show significant retention. A third set of measures (“secondary composite measure”) did show retention, but the p value was not corrected for multiple tests. When the p value is corrected for multiple tests, the secondary composite measure may not show significant retention. Because of the large number of subjects (more than 500), repeated failure to find significant retention under presumably near-optimal conditions (e.g., 1 hour/day of training) suggests that the training effect, after three months without training, is small or zero.
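To make the multiple-tests point concrete, here is a minimal sketch of a Bonferroni correction. The p values are hypothetical (the paper's exact values aren't reproduced here); the point is only that a nominally significant result near 0.05 can fail once the threshold is divided by the number of tests.

```python
# Illustrative only: the p values below are hypothetical, not taken
# from the IMPACT paper. This sketches how a Bonferroni correction
# can turn a nominally significant result non-significant.

def bonferroni(p_values, alpha=0.05):
    """Return (adjusted alpha, list of (p, significant?)) under Bonferroni."""
    m = len(p_values)
    adjusted_alpha = alpha / m
    return adjusted_alpha, [(p, p < adjusted_alpha) for p in p_values]

# Suppose three outcome families were tested (say, primary auditory,
# visual, and the secondary composite) and the composite alone gave
# an uncorrected p = 0.03.
adj, results = bonferroni([0.20, 0.45, 0.03])
print(adj)       # 0.05 / 3, roughly 0.0167
print(results)   # p = 0.03 no longer clears the corrected threshold
```

Under this (hypothetical) correction, none of the three comparisons would be significant, which is the worry raised about the secondary composite measure.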
I assume that Posit Science sponsored this study because they believed it was unrealistic for subjects to spend 1 hour/day for the rest of their lives doing their training. One hour/day was realistic for a while, yes, but not forever. So subjects will stop. Will the gains last? was the question. Apparently the answer is no.
If Mahncke has any response to this, I will post it.
This is another illustration of why personal science (science done for your own benefit, rather than as a job) is important. Professional scientists are under pressure to get certain results. This study is an example. Mahncke was a co-author. Someone employed by Posit Science is under pressure to get results that benefit Posit Science. (I am not saying Mahncke was affected by this pressure.) A personal scientist is not under pressure to get certain results. For example, if I study the effect of tetracycline (an antibiotic) on my acne, I simply want to know if it helps. Both possible answers (yes and no) are equally acceptable. We may need personal scientists to get unbiased answers.
Here are my original questions along with Mahncke’s answer or lack of answer.
1. Isn’t it correct that after three months there was no longer reliable improvement due to training according to the main measure that was chosen by you (the investigators) in advance? If so, shouldn’t that have been the main conclusion (e.g., in the abstract and final paragraph)?
[Seth: Here is Mahncke's substitute question: "Why do you conclude that “Training effects were maintained but waned over the 3-month no-contact period” given that the “previously significant improvements became non-significant at the 3-month follow-up for the primary outcome”?"]
2. The training is barely described. The entire description is this: “a brain plasticity-based computer program designed to improve the speed and accuracy of auditory information processing and to engage neuromodulatory systems.” To learn more, readers are referred to a paper that is not easily available — in particular, I could not find it on the Posit Science website. Because the training is so briefly described, I was unable to judge how much the outcome tests differ from the training tasks. This made it impossible for me to judge how much the training generalizes to other tasks — which is the whole point. Why wasn’t the training better described?
[Seth: Here is Mahncke's substitute question: "Could you describe the training program in more depth, to help judge the similarity between the training exercises and the cognitive outcome measures?"]
3. What was the “ET [experimental treatment] processing speed exercise”?
The processing speed exercise is a time order judgment task in which two brief auditory frequency modulated sweeps are presented, either of which may sweep up or down in frequency. The subject must identify each sweep in the correct order (i.e., up/up, down/down, up/down, down/up). The inter-stimulus interval is adaptively manipulated to determine a threshold for reliable task performance. Note that this is not a reaction time task. The characteristics of the sweeps are chosen to match the frequency modulated sweeps common in stop consonant sounds (like /ba/ or /da/). Older listeners generally show strong correlations between processing speed, speech reception accuracy, and memory, which led us to the hypothesis that improving core processing speed in this way would contribute to improving memory. This approach is discussed extensively in “Brain plasticity and functional losses in the aged: scientific bases for a novel intervention” available at http://www.ncbi.nlm.nih.gov/pubmed/17046669
3. [continued] It sounds like a reaction-time task. People will get faster at any reaction-time task if given extensive practice on that task. How is such improvement relevant to daily life? If it is irrelevant, why is it given considerable attention (one of the paper’s four graphs)?
4. According to Table 2, the CSRQ (Cognitive Self-Report Questionnaire) questions showed no significant improvement in trainees’ perceptions of their own daily cognitive functioning, although the p value was close to 0.05. Given the large sample size (~500), this failure to find significant improvement suggests the self-report improvements were small or zero. Why wasn’t this discussed? Is the amount of improvement suggested by Posit Science’s marketing consistent with these results?
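[Seth: A back-of-the-envelope calculation makes the "small or zero" inference concrete. The group sizes below are hypothetical (roughly 250 per group out of ~500 subjects, which the line above does not specify); with samples that large, a two-sided p value hovering near 0.05 corresponds to a small standardized effect.]

```python
# Hypothetical group sizes, for illustration only: with large samples,
# a p value near 0.05 implies a small standardized effect size.
import math

def effect_size_from_t(t, n_per_group):
    """Approximate Cohen's d for a two-sample t statistic with equal groups."""
    return t * math.sqrt(2.0 / n_per_group)

# A two-sided p just above 0.05 corresponds to |t| just under ~1.96
# at large degrees of freedom; with ~250 per group that implies:
d = effect_size_from_t(1.96, 250)
print(round(d, 2))  # about 0.18 -- "small" by the usual conventions
```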
5. Is it possible that the improvement subjects experienced was due to the acquisition of strategies for dealing with rapidly presented auditory material, and especially for focusing on the literal words (rather than on their meaning, as may be the usual approach taken in daily life)? If so, is it possible that the skills being improved have little value in daily life, explaining the lack of effect on the CSRQ?
6. In the Methods section, you write “In the a priori data analysis plan for the IMPACT Study, it was hypothesized that the tests constituting the secondary outcome measure would be more sensitive than the RBANS given their larger raw score ranges and sensitivity to cognitive aging effects.” Do the initial post-training tests (measurements of the training effect soon after training ended) support this hypothesis? Why aren’t the initial post-training results described so that readers can see for themselves if this hypothesis is plausible? If you thought the “secondary outcome measure would be more sensitive than the RBANS” why wasn’t the secondary outcome measure the primary measure?
In a large-scale clinical trial such as IMPACT, it is considered best practice to pick as the primary outcome measure a measure that has been employed in earlier studies. We had used the RBANS in two previous studies (references 8 and 17 in the paper). While we had seen significant results in both studies, it was also clear from those studies that the RBANS had ceiling effects in cognitively intact populations that would limit the statistical sensitivity of the measure. For example, the RBANS list recall measure had 10 words, and a reasonable portion of participants get all 10 correct at baseline, leaving no room for improvement regardless of the efficacy of the intervention. Given that observation, we added measures to the IMPACT study that we hypothesized would be more sensitive. For example, the RAVLT has 15 words, leaving more room for improvement and fewer ceiling effects. [It is unclear that more words = more sensitivity. It depends on the words -- Seth] However, since we had not used those measures in previous studies, we decided to define these new measures as secondary outcome measures in the data analysis plan. This issue is discussed in depth in the methods section of the main training effect paper (reference 6), and of course that’s where all of the initial post-training results you mention are described. This improved sensitivity of the secondary outcome measures was quite evident in the post-training data; however, for reasons of publication length we did not discuss it in that paper. The comparative data would make an interesting publication, and one that might be helpful to other researchers in this field.
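[Seth: The ceiling-effect claim itself is easy to check with a toy simulation. All numbers below are hypothetical, not IMPACT data; the point is that the same underlying improvement registers as a smaller measured score change on a shorter, ceiling-limited test.]

```python
# A toy simulation (hypothetical numbers, not IMPACT data) showing how
# a ceiling compresses measured gains: an identical true improvement
# produces a smaller average score change on a 10-item test than on a
# 15-item test when many subjects start near the maximum.
import random

random.seed(0)

def mean_gain(n_items, true_ability=8.0, improvement=2.0, n_subjects=10000):
    """Mean measured gain on an n_items-word recall test, capped at n_items."""
    total = 0.0
    for _ in range(n_subjects):
        ability = random.gauss(true_ability, 2.0)
        before = min(n_items, max(0.0, ability))
        after = min(n_items, max(0.0, ability + improvement))
        total += after - before
    return total / n_subjects

print(mean_gain(10))  # gain shrunk well below 2.0 by the 10-item ceiling
print(mean_gain(15))  # close to the true improvement of 2.0
```

Whether this mechanism favors the RAVLT over the RBANS in practice depends, as noted above, on the actual baseline score distributions, not just the number of words.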
7. The primary outcome measure was some of the RBANS (Repeatable Battery for the Assessment of Neuropsychological Status). Did subjects take the whole RBANS or only part of it? If they took the whole RBANS, what were the results with the rest of the RBANS (the subtests not included in the primary outcome measure)?
Participants took the entire RBANS. We used the auditory measures as the primary outcome measure because we hypothesized that cognitive domains [by "domains" he means "gains" -- Seth] would be restricted to the trained sensory domain, in this case the auditory system. Interestingly, there was a significant effect on the overall RBANS measure, however there was no significant effect on a composite of the RBANS visual measures. This interesting result was not included in our papers for reasons of publication length.
[Seth: As I said earlier, a surprising answer.]
8. The data analysis refers to a “secondary composite measure”. Why that particular composite and not any of the many other possible composite measures? Were other secondary composite measures considered? If so, were p values corrected for this?
The measures used were the Rey Auditory Verbal Learning Test total score (sum of trials 1–5) and word list delayed recall, Rivermead Behavioral Memory Test immediate and delayed recall, and Wechsler Memory Scale letter-number sequencing and digit span backwards tests. These measures were chosen a priori as more sensitive than their RBANS cognate measures, and a priori we conservatively chose to integrate all 6 into a single composite measure. Individual test scores are all shown in table 2. This issue is discussed in depth in the methods section of the main training effect paper (reference 6). It’s straightforward to evaluate what the effects shown on other potential composites would be simply from inspecting the individual test data in table 2. In the methods section of the main training effect paper (reference 6), we discuss our approach to multiple comparisons, where we state “A single primary outcome measure (RBANS Memory/Attention) was predefined to conserve an overall alpha level of 0.05. No corrections for multiple comparisons were made on the secondary measures.” I can see that it would have been helpful to re-iterate that statement in the 2011 paper, and my apologies for the oversight.
[Seth: He doesn't answer my question "were other secondary measures considered?"]
9. If Test A resembles training more closely than Test B, Test A should show more effect of training (at any retention interval) than Test B. In this case Test A = the RBANS auditory subtests and Test B = the secondary composite measure. In contrast to this prediction, you found that Test B showed a clearer training effect (in terms of p value) than Test A. Why wasn’t this anomaly discussed (beyond what was said in the Methods section)?
10. Were any tests given the subjects not described in this report? If there were other tests, why were their results not described?
All outcome measures performed in the study are reported in the publication.
[Seth: I have no idea how this answer is consistent with (a) the subjects took the visual subtests of the RBANS and (b) the paper fails to report the results of those tests (see answer to Question 7). The paper does not say that the subjects took the visual subtests of the RBANS.]
11. The secondary composite measure is composed of several memory tests and called “Overall Memory”. The Posit Science website says their training will not only help you “remember more” but also “think faster” and “focus better”. Why weren’t tests of thinking speed (different from the training tasks) and focus included in the assessment?
12. Do the results support the idea that the training causes trainees to “focus better”?
[Seth: That's his whole answer.]
13. The Posit Science homepage suggests that their training increases “intelligence”. Was intelligence measured in this study?
At the time we designed IMPACT, we were focused on establishing the effect of the training on memory, as the most common complaint of people with general cognitive difficulties. As IMPACT was in progress, Jaeggi et al. published their very interesting paper on the effect of N-back training on measures of intelligence, where they stated that improving working memory was likely to improve measures of intelligence. It would be quite interesting to repeat the IMPACT study with those or other measures of intelligence, given the improvements in working memory documented in IMPACT. The statement on the Posit Science web page relates to the Jaeggi et al. paper, given that the Posit training program (BrainHQ) includes N-back training.
13 (continued). If not, why not?
[Seth: In Question 12, Mahncke failed to explain his answer about focus ("yes") apparently because I left out "if yes, please explain how". In this question, he dislikes my inclusion of "if not, why not?"]
14. Do the results support the idea that the training causes trainees to become more intelligent?
This question appears to be redundant with 13.
[Seth: Question 13 asked: Was intelligence measured? (A methods question.) This question asked: What about the results? Do they support claims about intelligence? (A results question.)]
15. The only test of thinking speed included in the assessment appears to be a reaction-time task that was part of the training. Are you saying that getting faster on one reaction-time task after lots of practice with that task shows that your training causes trainees to “think faster”?
This question appears to be redundant with 11.
[Seth: Question 11 was a methods question. This is a question about what the results mean -- a discussion question. I still have no idea why Posit Science says their training causes trainees to "think faster" or why I should care that their subjects get faster on a laboratory task after lots of practice.]