The Gino-Colada Affair

Ulrich Schimmack

Sep 30

There is no doubt that social psychology and its applied fields like behavioral economics and consumer psychology have a credibility problem. Many of the findings cannot be replicated because they were obtained with questionable research practices or p-hacking. QRPs are statistical tricks that help researchers to obtain p-values below the necessary threshold to claim a discovery (p < .05). To be clear, although lay people and undergraduate students consider these practices to be deceptive, fraudulent, and unscientific, they are not considered fraudulent by researchers, professional organizations, funding agencies, or universities. Demonstrating that a researchers used QRPs to obtain significant results is easy-peasy, undermines the credibility of their work, but they can keep their jobs because it is not (yet) illegal to use these practices.

The Gino-Harvard scandal is different because the DataColada team claimed that they found "four studies for which we had accumulated the strongest evidence of fraud" and that they "believe that many more Gino-authored papers contain fake data." To lay people, it can be hard to understand the difference between allowed QRPs and forbidden fraud or data manipulation. An example of QRPs, could be selectively removing extreme values so that the difference between two groups becomes larger (e.g., removing extremely low depression scores from a control group to show a bigger treatment effect). Outright data manipulation would be switching participants with low scores from the control group to the treatment group and vice versa.

DataColada used features of the excel spreadsheet that contained the data to claim that the data were manually manipulated.

The focus is on six rows that have a strong influence on the results for all three dependent variables that were reported in the article, namely cheated or not, overreporting of performance, and deductions.

Based on the datasheet, participants in the sign-at-the-top condition (1) in rows 67, 68, and 69, did not cheat and therewith also did not overreport performance, and had very low deductions an independent measure of cheating. In contrast, participants in rows 70, 71, and 72 all cheated, had moderate amounts of overreporting, and very high deductions.

Yadi, yadi, yada, yesterday Gino posted a blog post that responded to these accusations. Personally, the most interesting rebuttal was the claim that there was no need to switch rows because the study results hold even without the flagged rows.

"Finally, recall the lack of motive for the supposed manipulation: If you re-run the entire study excluding all of the red observations (the ones that should be considered "suspicious" using Data Colada's lens), the findings of the study still hold. Why would I manipulate data, if not to change the results of a study?"

This argument makes sense to me because fraud appears to be the last resort for researchers who are eager to present a statistically significant results. After all, nobody claims that there was no data collection as in some cases by Diederik Stapel, who committed blatant fraud around the time this article in question was published and the use of questionable research practices was rampant. When researchers conduct an actual study, they probably hope to get the desired result without QRPs or fraud. As significance requires luck, they may just hope to get lucky. When this does not work, they can use a few QRPs. When this does not work, they can just shelf the study and try again. All of this would be perfectly legal by current standards of research ethics. However, if the results are close and it is not easy to collect more data to hope for better results), it may be tempting to change a few labels of conditions to reach p < .05. And the accusation here (there are other studies) is that only 6 (or a couple more) rows were switched to get significance. However, Gino claims that the results were already significant and I agree that it makes no sense for somebody to temper with data, if the p-value is already below .05.

However, Gino did not present evidence that the results hold without the contested cases. So, I downloaded the data and took a look.

First, I was able to reproduce the published result of an ANOVA with the three conditions as categorical predictor variable and deductions as outcome variable.

In addition, the original article reported that the differences between the experimental "signature-on-top" and each of the two control conditions ("signature-on-bottom", "no signature") were significant. I also confirmed these results.

Now I repeated the analysis without rows 67 to 72. Without the six contested cases, the results are no longer statistically significant, F(2, 92) = 2.96, p = .057.

Interestingly, the comparisons of the experimental group with the two control groups were statistically significant.

Combining the two control groups and comparing it to the experimental group and presenting the results as a planned contrast would also have produced a significant result.

However, these results do not support Gino's implication that the same analysis that was reported in the article would have produced a statistically significant result, p < .05, without the six contested cases. Moreover, the accusation is that she switched rows with low values to the experimental condition and rows with high values to the control condition. To simulate this scenario, I recoded the contested rows 67-69 as signature-at-the-bottom and 70-72 as signature-at-the-top and repeated the analysis. In this case, there was no evidence that the group means differed from each other, F(2,98) = 0.45, p = .637.

Conclusion

Experimental social psychology has a credibility crisis because researchers were (and still are) allowed to use many statistical tricks to get significant results or to hide studies that didn't produce the desired results. The Gino scandal is only remarkable because outright manipulation of data is the only ethics violations that has personal consequences for researchers when it can be proven. Lack of evidence that fraud was committed or lack of fraud do not imply that results are credible. For example, the results in Study 2 are meaningless even without fraud because the null-hypothesis was rejected with a confidence interval that had a value close to zero as a plausible value. While the article claims to show evidence of mediation, the published data alone show that there is no empirical evidence for this claim even if p < .05 was obtained without p-hacking or fraud. Misleading claims based on weak data, however, do not violate any ethics guidelines and are a common, if not essential, part of a game called social psychology.

This blog post only examined one minor question. Gino claimed that she did not have to manipulate data because the results were already significant.

My results suggest that this claim lacks empirical support. A key result was only significant under the assumption that the data were not manipulated. Of course, this finding does not warrant the conclusion that the data were tempered with to get statistical significance. We have to wait to get the answer to this 25 million dollar question.