Loken and Gelman wrote a confusing article about the interpretation of effect sizes in studies with small samples and selection for significance. They compare random measurement error to a backpack and the outcome of a study to running speed. Common sense suggests that the same individual under identical conditions would run faster without a backpack than with one. The same prediction follows from psychometric theory, which holds that random measurement error attenuates population effect sizes, making it harder to obtain statistical significance and producing, on average, weaker effect size estimates.
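In concrete terms, a minimal sketch of the classical attenuation formula (using the error standard deviation of 0.5 and the true slope of .15 from the simulation code below) shows how the slope shrinks to the value of roughly .12 that the annotations refer to:

# minimal sketch, not part of the original code: classical attenuation of a slope by random error in x
true_slope <- 0.15
rel_x <- 1 / (1 + 0.5^2)    # reliability of x: true variance / (true + error variance) = .8
true_slope * rel_x          # expected observed slope, roughly .12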
The key point of Loken and Gelman's article is to suggest that this intuition fails under some conditions: "Should we assume that if statistical significance is achieved in the presence of measurement error, the associated effects would have been stronger without noise? We caution against the fallacy."
To support their claim that common sense is a fallacy under certain conditions, they present the results of a simple simulation study. After some concerns about their conclusions were raised, Loken and Gelman shared the actual code of their simulation study. In this blog post, I share the code with annotations (###US) and reproduce their results. I also show that their results are based on selecting for significance only for the measure with random measurement error (with a backpack) and not for the measure without random measurement error (no backpack). Reversing the selection shows that selection for significance without measurement error produces stronger effect sizes even more often than selection for significance with a backpack. Thus, it is not a fallacy to assume that we would all run faster without a backpack, holding all other factors equal. However, a runner with a heavy backpack and tailwinds might run faster than a runner without a backpack facing strong headwinds. While this is true, the influence of wind on performance makes it difficult to see the influence of the backpack. Under identical conditions, backpacks slow people down and random measurement error attenuates effects.

Loken and Gelman's misleading comparison, which favored the studies with random measurement error, may explain why many applied researchers falsely assume that a study with random measurement error is just as likely or even more likely to produce a statistically significant result than a study without measurement error.
Hopefully, this figure makes it clear that Loken and Gelman's article can easily be misunderstood and that it is not a fallacy to assume that random measurement error attenuates effect size estimates and the power to obtain significant results. Running with a heavy backpack will always slow you down, unless it is a jet pack, but that would be akin to p-hacking.
Sometimes You Can Be Faster With a Heavy Backpack
Annotated Original Code
### This is the final code used for the simulation studies posted by Andrew Gelman on his blog
### Comments are highlighted with my initials ###US
# First just the original two plots, high power N = 3000, low power N = 50, true slope = .15
r <- .15
sims<-array(0,c(1000,4))
xerror <- 0.5
yerror<-0.5
for (i in 1:1000) {
x <- rnorm(50,0,1)
y <- r*x + rnorm(50,0,1)
###US this is a sloppy way to simulate a correlation of r = .15
###US The proper code is r*x + rnorm(50,0,1)*sqrt(1-r^2)
###US However, with the specific value of r = .15, the difference is trivial
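### for reference (not in the original code): with y = r*x + rnorm(n,0,1), the implied population
### correlation is r/sqrt(1+r^2) = .148 rather than .150, while the population slope remains .15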
###US However, it does raise some concerns about expertise
xx<-lm(y~x)
sims[i,1]<-summary(xx)$coefficients[2,1]
x<-x + rnorm(50,0,xerror)
y<-y + rnorm(50,0,yerror)
xx<-lm(y~x)
sims[i,2]<-summary(xx)$coefficients[2,1]
x <- rnorm(3000,0,1)
y <- r*x + rnorm(3000,0,1)
xx<-lm(y~x)
sims[i,3]<-summary(xx)$coefficients[2,1]
x<-x + rnorm(3000,0,xerror)
y<-y + rnorm(3000,0,yerror)
xx<-lm(y~x)
sims[i,4]<-summary(xx)$coefficients[2,1]
}
plot(sims[,2] ~ sims[,1],ylab="Observed with added error",xlab="Ideal Study")
abline(0,1,col="red")
plot(sims[,4] ~ sims[,3],ylab="Observed with added error",xlab="Ideal Study")
abline(0,1,col="red")
###US There is no major issue with graphs 1 and 2.
###US They merely show that high sampling error produces large uncertainty in the estimates.
###US The small attenuation effect of r = .15 vs. r = .12 is overwhelmed by sampling error
###US The real issue is the simulation of selection for significance in the third graph
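### rough sketch, not part of the original code: with N = 50 and roughly standardized variables,
### the standard error of the slope is about 1/sqrt(N), which dwarfs the .03 attenuation gap
### between the slope without error (.15) and the attenuated slope with error (about .12)
1 / sqrt(50)    # sampling uncertainty of the slope estimate, roughly .14
.15 - .12       # attenuation due to measurement error, .03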
# third graph
# run 2000 regressions at points between N = 50 and N = 3050
r <- .15
propor <-numeric(31)
propor.reversed.selection <-numeric(31) ###US This line of code is added to illustrate the biased selection for significance
powers<-seq(50,3050,100)
###US It is sloppy to refer to sample sizes as powers.
###US In between subject studies, the power to produce a true positive result
###US is a function of the population correlation and the sample size
###US With population correlations fixed at r = .15 or r = .12, sample size is the
###US only variable that influences power
###US However, power varies from alpha to 1 and it would be interesting to compare the
###US power of studies with r = .15 and r = .12 to produce a significant result.
###US The claim that "one would always run faster without a backpack"
###US could be interpreted as a claim that it is always easier to obtain a
###US significant result without measurement error, r = .15, than with measurement error, r = .12
###US This claim can be tested with Loken and Gelman's simulation by computing
###US the percentage of significant results obtained without and with measurement error
###US Loken and Gelman do not show this comparison of power.
###US The reason might be the confusion of sample size with power.
###US While sample sizes are held constant, power varies as a function of the population correlations
###US without, r = .15, and with, r = .12, measurement error.
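### illustrative sketch, not part of the original code: the power comparison described above,
### i.e. the proportion of significant slopes without (true slope .15) and with (attenuated slope
### of about .12) measurement error; the sample size of N = 100 is an arbitrary assumed value
power.sim <- function(n, r = .15, xerror = .5, yerror = .5, nsim = 1000) {
  sig <- matrix(0, nsim, 2)
  for (k in 1:nsim) {
    x <- rnorm(n)
    y <- r * x + rnorm(n)
    sig[k, 1] <- summary(lm(y ~ x))$coefficients[2, 4] < .05    # significant without measurement error
    xe <- x + rnorm(n, 0, xerror)
    ye <- y + rnorm(n, 0, yerror)
    sig[k, 2] <- summary(lm(ye ~ xe))$coefficients[2, 4] < .05  # significant with measurement error
  }
  colMeans(sig)  # proportion significant: first entry without error, second entry with error
}
power.sim(100)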
xerror<-0.5
yerror<-0.5
j = 1
i = 1
for (j in 1:31) {
sims<-array(0,c(1000,4))
for (i in 1:1000) {
x <- rnorm(powers[j],0,1)
y <- r*x + rnorm(powers[j],0,1)
###US the same sloppy simulation of population correlations as before
xx<-lm(y~x)
sims[i,1:2]<-summary(xx)$coefficients[2,1:2]
x<-x + rnorm(powers[j],0,xerror)
y<-y + rnorm(powers[j],0,yerror)
xx<-lm(y~x)
sims[i,3:4]<-summary(xx)$coefficients[2,1:2]
}
###US The code is the same as before, it just adds variation in sample sizes
###US The crucial aspect to understand figure 3 is the following code that
###US compares the results for the paired outcomes without and with measurement error
# find significant observations (t test > 2) and then check proportion
temp<-sims[abs(sims[,3]/sims[,4])> 2,]
propor[j] <- table((abs(temp[,3]/temp[,4])> abs(temp[,1]/temp[,2])))[2]/length(temp[,1])
###US the use of t > 2 is sloppy and unnecessary.
###US summary(lm) gives the exact p-values that could be used to select for significance
###US summary(xx)$coefficients[2,4] < .05
###US However, this does not make a substantial difference
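### illustrative sketch, not part of the original code: the same selection based on exact p-values
### recovered from the stored slope and standard error (df = N - 2) instead of the t > 2 shortcut
p.sig <- 2 * pt(-abs(sims[, 3] / sims[, 4]), df = powers[j] - 2)
temp.exact <- sims[p.sig < .05, ]  # temp.exact could replace temp above with essentially the same result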
###US The crucial part of this code is that it uses the outcomes of the simulation
###US with random measurement error to select for significance
###US As outcomes are paired, this means that the code sometimes selects outcomes
###US in which sampling error produces significance with random measurement error
###US but not without measurement error.
###US As a result, the simulation compares apples (with random error always significant)
###US to pears (without measurement error, sometimes not significant).
###US This is a strange way to compare outcomes with and without measurement error
###US Obviously, the opposite result would be obtained when oranges are compared to tangerines
###US That is, the following code would show a higher proportion without error than with error
temp<-sims[abs(sims[,1]/sims[,2])> 2,]
propor.reversed.selection[j] <- table((abs(temp[,1]/temp[,2])> abs(temp[,3]/temp[,4])))[2]/length(temp[,1])
print(j)
###US we could also add comparisons that are more meaningful and avoid the one-sided selection for significance
}
###US the plot code had to be modified slightly to have matching y-axes
###US I also added a title
title = "Original Loken and Gelman Code"
plot(powers,propor,type="l",
ylim=c(0,1),main=title, ### added code
xlab="Sample Size",ylab="Prop where error slope greater",col="blue")
###US We can now plot the two outcomes in the same figure
###US The original color was blue. I used red for the reversed selection
par(new=TRUE)
plot(powers,propor.reversed.selection,type="l",
ylim=c(0,1), ### added code
xlab="Sample Size",ylab="Prop where error slope greater",col="firebrick2")
###US adding a legend
legend(1500,.9,legend=c("with backpack only sig. \n shown in article \n ",
"without backpack only sig. \n added by me"),pch=c(15),
pt.cex=2,col=c("blue","firebrick2"))
###US adding a horizontal line at 50%
abline(h=.5,lty=2)
###US The results reproduce Loken and Gelman's finding.
###US The finding shows that selection for significance inflates effect sizes,
###US and that this effect can overwhelm the attenuation effect of random measurement error.
###US Thus, under some conditions, more than 50% of the effect size estimates with random measurement error
###US can be larger than the corresponding estimates without measurement error, which were not subjected to the same selection for significance.
###US However,...
###US this is a silly comparison of apples with pears.
###US It remains true that running without a backpack leads to better performance
###US than running with a backpack, even with strong tailwinds and in a sprint.
###US Random measurement error attenuates population effect sizes.
###US This lowers power and produces smaller effect size estimates,
###US EVEN IN SMALL SAMPLES WITH SELECTION FOR SIGNIFICANCE.
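### illustrative sketch, not part of the original code: a symmetric check of this last claim in which
### each condition is selected on its own significance; N = 50 and 10000 replications are assumed values
n <- 50; nsim <- 10000; r <- .15; xerror <- .5; yerror <- .5   # same population values as above
b.ideal <- b.noisy <- p.ideal <- p.noisy <- numeric(nsim)
for (k in 1:nsim) {
  x <- rnorm(n)
  y <- r * x + rnorm(n)
  out <- summary(lm(y ~ x))$coefficients
  b.ideal[k] <- out[2, 1]; p.ideal[k] <- out[2, 4]
  xe <- x + rnorm(n, 0, xerror)
  ye <- y + rnorm(n, 0, yerror)
  out <- summary(lm(ye ~ xe))$coefficients
  b.noisy[k] <- out[2, 1]; p.noisy[k] <- out[2, 4]
}
mean(p.ideal < .05)                 # power without measurement error
mean(p.noisy < .05)                 # power with measurement error
mean(abs(b.ideal[p.ideal < .05]))   # mean significant slope estimate without measurement error
mean(abs(b.noisy[p.noisy < .05]))   # mean significant slope estimate with measurement error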