Monday, June 22, 2015

Independent backups of researcher data

Open data is important for healthy scientific research. A recent journal article by Vines et al. examines how readily research data can be obtained from authors. Sadly, they find that
[R]esearch data cannot be reliably preserved by individual researchers, and further demonstrates the urgent need for policies mandating data sharing via public archives.
Most policy mandates are for newly published research, but what about data from existing publications? Is there a way that independent parties could try to preserve what material is available?

One idea is to have an established group systematically look for data that is available only on a researcher's personal or institutional website and post it in a repository (either at the journal or separately at a place like Dataverse). The group should have some reputation so that there is trust that the data was not manipulated.

Second, the group could try to acquire data even when it is not available for download by asking the authors. This would take much more time and have a low response rate, but it still might be worth the investment. The group could also accept material from those who had already requested and received replication materials from the authors, which would require little effort on the group's part.

Wednesday, June 17, 2015

Stand-alone Pre-analysis Plans for observational studies

The push for research transparency in social science has included advocating for pre-analysis plans (PAPs), in which researchers detail what they will do before seeing the data. These plans provide many benefits; the most obvious are that (a) they rule out specification searching, so p-values are more believable, and (b) because they register research before the findings are known, they help in understanding the body of work that goes unpublished. PAPs are increasingly used for verifiably prospective research (e.g. RCTs, lab experiments, and studies using non-open data that requires an application). Their use for observational studies of existing data is, however, controversial (see Dal-Re et al. 2014). The primary challenge is that outsiders cannot verify that the analysis plan was actually written before the research was undertaken.

While this challenge is likely not fully solvable, it can potentially be mitigated by restructuring incentives. The research designer could be separated from the research implementer, with each earning professional credit for their independent contribution. Notably, the research designer would earn credit for publishing their PAP in a well-ranked, PAP-accepting journal, independent of whether the results of the implementation are significant. This does not entirely remove the element of trust, so published PAPs would likely come from those the community trusts, such as senior researchers. It does, however, free the research implementer of this burden of trust. Indeed, once the PAP is published, many people could work to implement it independently, in a fashion similar to replication. Registering intent to implement could help coordinate efforts.

Disaggregating the pieces of research could increase the efficiency of the process in a number of ways. Researchers with ideas but without implementation skills (programming skills or access to data) could publish stand-alone PAPs without the potentially costly process of matching with others who have the necessary skills. Even for those who could implement the research themselves (or with close colleagues), there are likely benefits to specializing. Researchers often have ideas that they think should be pursued but that are not a priority for them. Often these ideas are picked up by colleagues or advisees, but quickly writing up a PAP could be another outlet that would confer visible benefit (see related post). On the other side, implementing PAPs could be a good way to train graduate students. Many graduate students say that the hardest part of research is coming up with a good idea. Stand-alone PAPs would be especially helpful in areas with a lot of researcher subjectivity, such as structural modeling.

In order to suitably track credit, there could be a norm to cite the PAP whenever one would cite the implementation paper.

Taking this to the extreme, the scientific process could be unbundled even further. For example, a scientist currently has an incentive to "sell" their research (or tell a convincing "story") in submissions to journals, possibly at the expense of being honest about the complexity of the findings. What if the science writer were somehow separated and judged on how well they communicated existing research findings? This could be assessed if research data and methods were open.

Thursday, June 11, 2015

Getting versions of Stata modules

For reproducible research it is important to be able to reproduce your analysis environment. For Stata analyses, one needs to list the versions of the Stata modules used. This is especially important because the SSC archive does not store previous versions of modules. Ideally, one stores the module files with the project, but it is still useful to record the modules' versions. Stata modules are not required to list a version number (NB: this is separate from which -version- of Stata the module requests). Many do, though, and when they do one can often find the number in the text of the ado file (use -viewsource command.ado- for this). If not, one can fall back on a date. Modules on SSC are required to list their "distribution date". If you are installing modules from a version control server (e.g. GitHub), then the install date is sufficient. Stata records this information in a "stata.trk" file at the base of each module directory tree (e.g. PLUS); which trees exist depends on your adopath. To extract it you can use the following simple script.

$ grep -E "^([SND]|d Distribution)" stata.trk
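In the stata.trk file, lines beginning with S, N, and D record each package's source, name, and installation date, and "d Distribution-Date" lines carry the distribution date, so the command above pulls out just those fields.

The same information can also be inspected from within Stata. Here is a minimal sketch, with estout used only as an illustrative placeholder; substitute the modules your project actually uses.

* list the system directories (e.g. PLUS) where modules and stata.trk live
sysdir

* show where an installed module was found, plus any *! version comment at the top of its ado file
which estout

* view the full ado source to search for a version number or date
viewsource estout.ado

* list installed packages along with their package descriptions
ado dir

Saving this output (or, better, the ado files themselves) alongside the project makes it easier to reconstruct the analysis environment later.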