Better than Standard Errors | Brian Quistorff: 2015

Sunday, September 13, 2015

Compiling Stata plugins

Here are some notes on how to compile plugins for Stata. The official plugin page is a bit out of date.

Linux: Add option -fPIC. If you want to build a 32-bit plugin on a 64-bit system then you will need the 32-bit version of libc (e.g. on Debian install g++-multilib and libc6-dev-i386).

Windows: You will need the mingw-w64 compiler (unless you just want 32-bit in which case you can use the original MinGW). You can get this in several ways, but since I already had Cygwin I just installed the mingw64-x86_64-* packages (for 32 bit you can use mingw64-i686-*). You will then use the x86_64-w64-mingw32-g++ command rather than g++ -mno-cygwin. Add the -static flag in the link step otherwise Stata might give a "could not load plugin" due to unfound DLLs (e.g. I used C++ classes and I found using Procmon that after reading my plugin file Stata couldn't find libgcc_s_seh-1.dll).

Mac: Install the Xcode command-line tools. I added the -shared option and removed the -bundle option (-bundle can't be used with dylibs). For modern Xcode (since version 5.0) then you have to use clang/clang++.

Wednesday, September 02, 2015

assign_treatment: A Tool for Stratified Randomization

This blog post by David McKenzie and Miriam Bruhn talks about assigning treatments at random when stratification variables cause cells (unique combinations of stratification variable values) to have uneven number of methods. The provided Stata do-file is hard-coded for a six treatments and is a bit difficult to adapt for another number. I've made a module that does the assignment for varying numbers of treatments as well of providing different methods of assigning the remainder units ("misfits") from each cell.

The three main methods provided all assign the misfits at random and achieve cell-level balance (counts differ at most by 1):

full - This is the same method as McKenzie and Bruhn. The misfits from each cell are separately randomized to any combination of treatments. This can cause the overall number of units per treatment to be unbalanced even though they will be balanced at the cell-level level.
reduction - This method achieves overall balance as well as better balance for specified stratification variables. It does this by limiting misfits to be assigned to a "wrapped" interval of treatments (e.g. (2,3,4) or (6,1,2) if there are six treatments) and then having those intervals dovetail together.
full-obalance - This method allows misfits to be assigned to any combination of treatments and achieves overall balance. It does so by assigning units one at a time to fill repeating slots of (1,...,T,1,...T,...,1..). At each stage it keeps track of possible units that could fill a spot (without causing two from the same cell to have the same treatment). It randomly picks one and then attempts to fill the next spot (while giving a slight weight to trying to fit first misfits from cells with many misfits). If filling a spot is impossible the algorithm backs up to the last point where there was a choice and tries a new option.

To install the module, run
. net install assign_treatment, from(https://raw.github.com/bquistorff/Stata-modules/master/a/) replace

Tests are provided in here.

Tuesday, August 18, 2015

Continuous Integration with Stata

Continuous Integration is a workflow that frequently merges code changes and automatically runs tests. This can be especially helpful when working in teams. Several free services exist, such as Travis-CI which provides Linux build infrastructure and integrates with GitHub. Basically, you instruct Travis how to setup a Linux environment for your project and then direct it what tests to run. Travis will then automate this, checking your repository at every push.

First off, you'll obviously need a Stata license that allows for this kind of work. The single-user license allows for 3 "seats", or you may have a network license with extra capacity. In Travis, you should go to the Settings for your repo and "Limit concurrent builds" to 1 (or whatever is permitted by you license).

Travis works each time off of clean Linux machine images so your setup will have to allow Travis to install Stata automatically. Travis allows you to store private passwords on the test infrastructure, so you could either host the Stata files on a password protected machine or you could encrypt the files with a password (e.g. with gpg --symmetric). Either way, Travis will download files repeatedly so you may want to make them as small as possible. I've found the best way is to take an existing install and then strip out anything unnecessary, including documentation (all *.pdf, *.sthlp, *.ihlp, *.key, and *.dta which are almost all example data), graphical tools (*.mnu, *.png, *.dlg, *.idlg), and alternate executables. Compress this folder with xz. Then upload the files to a web server that allows for command-line download (if you are using a general hosting site, you could check out plowshare) making sure to take care of security.

Now you can setup your .travis.yml to setup the machine.

Extract the password needed for you Stata files (you can do this by installing the Travis client on your local machine, and then use it to encrypt as an environment variable).
Download and set-up the Stata files.
Add the Stata folder to the PATH (it needs to be able to find the license file).

Finally, add commands to test your code:

Obviously, you already have lots of code to test your programs, right? :)
In order for Travis to know if there was a problem, the executable that runs the tests should return a non-zero error code upon failure. The Stata executable will not do this (!), but you can use a simple wrapper like statab.sh. If you download this, remember you will have to make it executable.
The tests can be run either from a master do file, or be called independently from a script or makefile.

The more I use Stata the more it valuable to me it becomes!

Monday, June 22, 2015

Indepedent backups of researcher data

Open data is important for healthy scientific research. A recent journal article by Vines et al. studies how available data is from authors. Sadly, they find that

[R]esearch data cannot be reliably preserved by individual researchers, and further demonstrates the urgent need for policies mandating data sharing via public archives.

Most policy mandates are for newly published research, but what about data from existing publications? Is there a way that independent parties could try to preserve what material is available?

One idea is to have an established group systematically look for data that is only available on a researcher's personal/institutional website and to post in a repository (either at the journal or separately at a place like dataverse). The group should have some reputation so there is trust that the data was not manipulated.

Secondly, the group could try to acquire data even when the data is not available for download. They could ask authors. This would take much more time and have a low response rate, but still might be worth the investment. The group could also accept material from those who had already requested and received replication materials from the authors. This would require little effort on the part of the group.

Wednesday, June 17, 2015

Stand-alone Pre-analysis Plans for observational studies

The push for research transparency in social science has included advocating for pre-analysis plans (PAPs) where researchers detail what they will do before seeing the data. These plans provide many benefits, the most obvious ones are (a) it rules out specification searching so that p-values are more believable, and (b) since it registers research before the findings it helps in understanding the body of work done that is not published. PAPs are increasingly used for verifiably prospective research (e.g. RCTs, lab experiments, and using non-open data that requires application). Their use for observational studies on existing research are, however, controversial (see Dal-Re et al. 2014). The primary challenge is that outsiders can not verify that the analysis plan was actually written before the research was undertaken.

While this challenge is likely not fully solvable, it can potentially be mitigated by restructuring incentives. The research designer could be separated from the research implementer and each earn professional credit for their independent contribution. Notably, the research designer would earn credit for publishing their PAP in a well-ranked, PAP-accepting journal independent of whether the results of the implementation are significant. This does not entirely remove the element of trust, so it is likely that published PAPs will be from those that the community trusts, such as a senior researcher. It frees, however, the research implement of this burned of trust. Indeed, once the PAP is published, many people could work to implement it independently, in a fashion similar to replication. Registering intent to implement could help coordinate efforts.

Disaggregating the pieces of research could increase the efficiency of the process in a number of ways. Researchers with ideas but without implementation skills (programming skills or access to data) can publish stand-alone PAPs without the potentially costly process of matching with others with the necessary skills. Even for those that could implement the research themselves (or with close colleagues) there are likely benefits to specializing. Researchers often have ideas that they think should be pursued but which is not a priority for them. Often these ideas are captured by colleagues or advisees, but quickly writing up a PAP could be another outlet that would confer visible benefit (see related post). On the other side, implementing PAPs could be a good way to train graduate students. Many graduate students state that the hardest part of research is coming up with a good idea. These PAPs would be especially helpful in areas where there is a lot of researcher subjectivity like structural models.

In order to suitably track credit, there could be a norm to cite the PAP whenever one would cite the implementation paper.

Taking this to the extreme, the scientific process could be even more unlinked. For example, a scientist currently has an incentive to "sell" their research (or tell a convince "story") in submissions to journals, possibly at the expense of being honest about the complexity of the findings. What if the science writer was somehow separated and judged on how well they communicated existing research findings? This could be judged if research data and methods were open.

Thursday, June 11, 2015

Getting versions of Stata modules

For reproducible research it is important to be able to reproduce your analysis environment. For Stata analyses, one needs to list the version of Stata modules used. This is especially important as the ssc archive does not store previous versions of modules. Ideally, one is storing the module files with a project, but it is still useful to have a measure of the version of the modules. Stata modules are not required to list a version number (NB: this is separate from which -version- of Stata the module is requesting). Many modules do, and if so one can often find them in the text of the ado files (use -viewsource command.ado- for this). If not, one can use the version date. Modules on SSC are required to list their "distribution date" . If you are installing modules from a version control server (e.g. Github) then the install date is sufficient. Stata keeps this information in a "stata.trk" folder at the base of each module directory tree (e.g. PLUS) which depends on your adopath. To get this information you can use the following simple script.

$ cat stata.trk | grep "^$[SND]\|d Distribution$"

Wednesday, April 22, 2015

Code Management in Dropbox

Dropbox is not the ideal version control system, but some people do not want to make the leap to a full VCS (like git or svn which full software developers use). Given that one should not make the perfect the enemy of the good, what simple tips can Stata-base economists follow that will make their lives easier? Here are some tips that will be helpful on projects that last a while (longer than short-term memory or the involvement of an assistant) or involve multiple people. As always, one should have the over-arching goal that the project be reproducible (at almost all points, you can follow an explicit recipe and produce all required outputs). While my tips will help with this, you may require some project-specific solutions, but it is well worth the effort.

Backup: Structure your folders so that a minimal number can be used to reproduce everything in the project. A common approach is to have a code folder (which changes often) and a data/raw folder (which should almost never change). (This means all "manual" actions like cleaning should be embedded in code!) With this approach one can store a one-time backup of the raw data and then store period snapshots of the code folder. At key times (e.g. after every new draft or presentation) zip up the code folder and archive it untouched. Since all code is periodically backed up, you can get ride of no longer used code.
Avoid editing conflicts: If multiple people are editing the same files here is a tip to reduce friction. If you start editing a file, make an innocuous edit and save it. Then, before editing a file, check the "last modified" time. If someone else edited the file recently, chat to negotiate access (your team should decide what "recently" means).
Branching: If you want to a make a series of changes that will leave a key file in an unusable state then you should make a "branch". You can do this (a) in Dropbox by duplicating the files (including unstable outputs) with modified names (have a convention here), or (b) copying the project out of Dropbox for editing. Try to make changes small enough that they can be merged back into the main file quickly. The longer files diverge the harder this will be.
Syncing: For intermediate files only needed internally by one piece of code, you can reduce syncing by creating those files in your `c(tmpdir)' (with -tempfile- or just named explicitly) rather than in your project folder.

Here are other simple general tips that can be applied to Dropbox:

No one's full path should be written into the code. Code should be run given Stata is in the appropriate working directory (often the root of the project folder or in the code subfolder). A nice benefit is that if the project folder is copied out of Dropbox it should run easily. If you have machine-specific configuration details, those should be defined on the machine in environment variables and brought into Stata using -local env_var : environment ENV_VAR-
Have a local code/ado folder where you store all the needed modules. In the header of your code you should include a script that sets code/ado as PERSONAL and then restricts your adopath to just PERSONAL;BASE. You may want to -net set ado PERSONAL- and -mata: mata mlib index- also.
Learn how to use a visual text diff/merge utility like the one that comes with TortoiseSVN. This will make it easy to compare files between editors and across time.
If possible, do not use spaces in file names (use "_")

What other tips do you all have?

Friday, January 30, 2015

Using Yik Yak to quantify aspects of student life?

Just listened to a nice piece on Reply All about Yik Yak, which is app where users can post messages anonymously and see the messages of those nearby (10 miles). The podcast mentioned how it was useful in showing written evidence of racism that previously was hard to prove. It got me thinking that if there was a way to get the feeds from different universities then one could analyze these using Natural Language Processing routines to get metrics for different schools. What percentage of anonymous posts at your school are about race, violence, sex, etc.? Could be something prospective students might want to know.

Wednesday, January 28, 2015

Prediction markets for human capital investment

While there has been a growing literature on prediction markets, most of the markets have been confined to politics and macro-economics. These are great to have, but I would like to see markets that could help with more decisions that typical people make, such as when, and in what field to get training. The information that people have is usually about the current economy (this industry currently has a tight labor market) or employment projections by the government. What could be better would be a set of markets about future industry-specific indicators such as vacancy rates. If they were started at least 2-3 years before maturity then it could be used by people in college to help pick a major or by workers deciding whether to get training in a new field. The market might have to be "seeded" by the government, but it could be worth it.

Tuesday, January 20, 2015

Getting latexmk working within LyX

If you would like to add latexmk as an export option in Lyx, here are the two basic steps to do in Tools -> Options.

1) In File Handling -> File Format, hit "New" and then put in these options, then "Apply".

2) Then in File Handling -> Converter, click on any existing converter, change the options to the below, and then click "Add", then "Save".

Now you should be able to File->Export->PDF (Latexmk). Hope that works for you.

If you are on Windows, then you might have problems with Perl versions (Lyx has one, MiKTeK another). Initially, I was getting that perl couldn't find File::Glob in @INC when latexmk was run. Turns out that Lyx was running it's own version of perl 5.18.2 and had @INC set to just the Lyx/lib (which oddly did have File/Glob.pm). I installed Strawberry Perl same version (needed that!) then set the PERL5LIB to the Strawberry perl folder. Then it works fine.

This was on LyX 2.1.2, updated MiKTeX, and Windows 8.

Running Stata 12 on Ubuntu 14.04 (Trusty Tahr)

After upgrading my linux machine I realized that I could no longer run my copy of Stata 12 GUI. It game the error:

./xstata-se: error while loading shared libraries: libgtksourceview-1.0.so.0: cannot open shared object file: No such file or directory

These older versions of libraries aren't in the normal Ubuntu repositories anymore, so this is a simple work around to get Stata working again. (I'm on a 64 bit machine so change amd64 to the appropriate)

Install libgtksourceview2-2.0
sudo ln -s /usr/lib/libgtksourceview-2.0.so.0 /usr/lib/libgtksourceview-1.0.so.0
Download libgnomecups1.0-1 0.2.3-4 from here. Install it with dpkg.
Download libgnomeprint2.2-data_2.18.8-3ubuntu1_all.deb and libgnomeprint2.2-0_2.18.8-3ubuntu1_amd64.deb from here. Install them at the same time with dpkg.

Happy estimating.