Why I worry experimental social science is headed in the wrong
direction
CHRISTOPHER BLATTMAN
I joke with my graduate students that they need to get as many
technical skills as possible while they're PhD students, because the moment
they graduate it’s a slow decline into obsolescence. And of course
by “joke” I mean “cry on the inside because it’s true”.
Take experiments. Every year the technical bar gets raised.
Some days my field feels like an arms race to make each experiment
more thorough and technically impressive, with more and more
attention to formal theories, structural models, pre-analysis
plans, and (most recently) multiple hypothesis testing. The list
goes on. In part we push because we want to do better work. Plus, how
else to get published in the best places and earn the respect of
your peers?
It seems to me that all of this is pushing social scientists
to produce better quality experiments and more accurate answers.
But it’s also raising the size and cost and time of any one
experiment.
This should lead to fewer, better experiments. Good, right?
I’m not sure. Fewer studies is a problem if you think that the
generalizability of any one experiment is very small. What you want
is many experiments across many places and populations, which help
triangulate an answer.
The funny thing is, after all that pickiness about getting the
perfect causal result, we then apply it in the most unscientific
way possible. One example is deworming. It’s only a slight
exaggeration to say that one randomized trial on the shores of Lake
Victoria in Kenya led some of the best development economists to
argue we need to deworm the world. I make the same mistake all the
time.
We are not exceptional. All of us—all humans—generalize from
small samples of salient personal experiences. Social scientists do
it with one or two papers. Usually ones they wrote
themselves.
The latest thing that got me thinking in this vein is an
amazing new paper by Alwyn Young. The brave masochist spent three
years re-analyzing more than 50 experiments published in several
major economics journals, and argues that more than half the
regressions that claim statistically significant results don’t
actually have them.
My first reaction was “This is amazingly cool and important.”
My second reaction was “We are doomed.”
Here’s the abstract:
I follow R.A. Fisher’s The Design of Experiments, using
randomization statistical inference to test the null hypothesis of
no treatment effect in a comprehensive sample of 2003 regressions
in 53 experimental papers drawn from the journals of the American
Economic Association.
Randomization tests reduce the number of regression
specifications with statistically significant treatment effects by
30 to 40 percent. An omnibus randomization test of overall
experimental significance that incorporates all of the regressions
in each paper finds that only 25 to 50 percent of experimental
papers, depending upon the significance level and test, are able to
reject the null of no treatment effect whatsoever. Bootstrap
methods support and confirm these results.
The basic story is this. First, papers often look at more than
one treatment and many outcomes. There are so many tests that some
are bound to look statistically significant. What’s more, when you
see a significant effect of a treatment on one outcome (like
earnings), you are more likely to see a significant effect on
a related outcome (like consumption), and if you treat these like
independent tests you overstate the significance of the
results.
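To see how fast this bites, here is a minimal simulation of my own (the sample size, outcome count, and correlation are invented; the 40 outcomes only echo Young's roughly 40 regressions per paper):

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_outcomes, n_sims = 200, 40, 2000  # hypothetical sizes, not from any real study

papers_with_a_hit = 0
for _ in range(n_sims):
    treat = rng.integers(0, 2, n)            # random assignment; the true effect is zero
    latent = rng.normal(size=(n, 1))         # a shared factor makes the outcomes correlated
    y = 0.7 * latent + rng.normal(size=(n, n_outcomes))
    t, c = y[treat == 1], y[treat == 0]
    diff = t.mean(0) - c.mean(0)
    se = np.sqrt(t.var(0, ddof=1) / len(t) + c.var(0, ddof=1) / len(c))
    papers_with_a_hit += (np.abs(diff / se) > 1.96).any()  # any outcome "significant"?

# share of null "papers" reporting at least one significant effect:
print(papers_with_a_hit / n_sims)  # far above the nominal 5 percent
```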
Second, the ordinary statistics most people use to estimate
treatment effects are biased in favor of finding a result. When we
cluster standard errors or make other corrections, we rely on
assumptions that simply don’t apply to experimental samples.
One way to deal with this is something called Randomization
Inference. You take your sample, with its actual outcomes. You
engage in a thought experiment, where you randomly assign treatment
thousands of times, and generate a treatment effect each time. Most
of these imaginary randomizations will generate no significant
treatment effect. Some will. You then look at your actual treatment
effects, compare them to the distribution of potential treatment
effects, and ask “what are the chances I would get these treatment
effects by chance?”
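In code, the idea is only a few lines. Here is a bare-bones sketch of my own for a simple two-arm design and a difference in means (`randomization_p_value` is a made-up helper name; real designs with blocking or clustering need the re-randomization to mimic the actual assignment mechanism):

```python
import numpy as np

def randomization_p_value(treat, y, n_draws=10000, seed=0):
    """Randomization-inference p-value for a difference in means.

    Under the sharp null of no effect for anyone, outcomes stay fixed
    and only the assignment is hypothetical, so we re-draw the
    assignment many times and ask how often a difference at least as
    large as the observed one shows up by chance.
    """
    rng = np.random.default_rng(seed)
    observed = y[treat == 1].mean() - y[treat == 0].mean()
    hits = 0
    for _ in range(n_draws):
        fake = rng.permutation(treat)  # one hypothetical assignment
        hits += abs(y[fake == 1].mean() - y[fake == 0].mean()) >= abs(observed)
    return hits / n_draws              # two-sided p-value under the sharp null

# toy data: 100 units, half treated, a true effect of 0.5 SD
rng = np.random.default_rng(1)
treat = rng.permutation(np.repeat([0, 1], 50))
y = rng.normal(size=100) + 0.5 * treat
print(randomization_p_value(treat, y))
```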
RI has been around for a while, but very few experimenters in
economics have adopted it. It’s more common but still unusual in
political science. Here is Jed Friedman with a short intro. The
main textbook is by Don Green and I recommend it to newcomers and
oldcomers alike. I have been reading it all week and it’s a
beautiful book.
Alwyn Young has very usefully asked what happens if we apply
RI (and other methods) to existing papers. I don’t completely buy
his conclusion that half the experiments are actually not
statistically significant. Young analyzed about 2,000 regressions
across 53 papers, or roughly 40 regressions per paper. Not all
regressions are equal. Some outcomes we don’t expect treatment to
affect, for instance. So Young’s tests are probably too stringent.
Pre-analysis plans are designed to help fix this problem. But he
has a good point. And work like this will raise the bar for
experiments going forward.
But I don’t want to get into that. Rather, I want to talk
about why this trend worries me.
I predict that, to get published in top journals, experimental
papers are going to be expected to confront the multiple treatments
and multiple outcomes problem head on.
This means that experiments starting today that do not tackle
this issue will find it harder to get into major journals in five
years.
I think this could mean that researchers are going to start to
reduce the number of outcomes and treatments they plan to test, or
at least prioritize some tests over others in pre-analysis
plans.
I think it could also push experimenters to increase
sample sizes, to be able to meet these more strenuous standards. If
so, I’d expect this to reduce the quantity of field experiments
that get done.
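A back-of-the-envelope power calculation shows why. The inputs below are illustrative, not drawn from any particular study, and the Bonferroni-style correction is just one crude way the multiple-testing bar might be met:

```python
from statsmodels.stats.power import TTestIndPower

power = TTestIndPower()
# illustrative inputs: a small effect (0.2 SD) at 80 percent power
for alpha in (0.05, 0.005):  # 0.005 ~ a Bonferroni correction for 10 tests
    n = power.solve_power(effect_size=0.2, alpha=alpha, power=0.8)
    print(f"alpha = {alpha}: about {n:.0f} subjects per arm")
# tightening alpha tenfold raises the required sample by roughly 70 percent
```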
Experiments are probably the field’s most expensive kind of
research, so any increase in demands for statistical power or
technical improvements could have a disproportionately large effect
on the number of experiments that get done.
This will probably put field experiments even further out of
the reach of younger scholars or sole authors, pushing the field to
larger and more team-based work.
I also expect that higher standards will be disproportionately
applied to experiments. So in some sense it will raise the bar for
some work over others. Younger and junior scholars will have
stronger incentives to do observational work.
On some level, this will make everyone more careful about what
is and is not statistically significant. More precision is a good
thing. But at what cost?
Well for one, I expect it to make experiments a little more
rote and boring.
I can tell you from experience it is excruciating to polish
these papers to the point that a top journal and its exacting
referees will accept them. I appreciate the importance of this
polish, but I have a hard time believing the current state is the
optimal allocation of scholarly effort. The opportunity cost of
time is huge.
Also, all of this is fighting over fairly ad hoc thresholds of
statistical significance. Rather than think of this as “we’re
applying a common standard to all our work more correctly”, you
could instead think of this as “we’re elevating the bar for
believing certain types of results over others”.
Finally, and most importantly to me, if you think that the
generalizability of any one field experiment is low, then a large
number of smaller but less precise experiments in different places
is probably better than a smaller number of large, very precise
studies.
There’s no problem here if you think that a large number of
slightly biased studies are worse than a smaller number of unbiased
and more precise studies. But I’m not sure that’s true. My bet is
that it’s false. Meanwhile, the momentum of technical advance is
pushing us in the direction of fewer studies.
I don’t see a way to change the professional incentives. I
think the answer so far has been “raise more money for experiments
so that the profession will do more of them.” This is good. But
surely there are better answers than just throwing more fuel on the
fire.
Incentives for technical advances in external rather than just
internal validity strike me as the best investment right now.
Journal editors could play a role too, rewarding the study of scale
ups and replications (effectiveness trials) as much as the new and
counter-intuitive findings (efficacy trials).
Of course, every plea for academic change ends with “more
money for us” and “journal editors should change their
preferences.” This is a sign of either lazy or hopeless thinking.
Or, in my case, both.
I welcome ideas from readers, because to me the danger is
this: That all the effort to make experiments more transparent and
accurate in the end instead limits how well we understand the
world, and that a reliance on too few studies makes our theory and
judgment and policy worse rather than better.