Tuesday, August 18, 2015

Continuous Integration with Stata

Continuous Integration is a workflow in which code changes are merged frequently and tests are run automatically. This can be especially helpful when working in teams. Several free services exist, such as Travis-CI, which provides Linux build infrastructure and integrates with GitHub. Basically, you instruct Travis how to set up a Linux environment for your project and then tell it which tests to run. Travis then automates this, running the tests on every push to your repository.

First off, you'll obviously need a Stata license that allows for this kind of work. The single-user license allows for 3 "seats", or you may have a network license with extra capacity. In Travis, go to the Settings for your repo and set "Limit concurrent builds" to 1 (or whatever is permitted by your license).

Travis starts each build from a clean Linux machine image, so your setup will have to let Travis install Stata automatically. Travis allows you to store private passwords on the test infrastructure, so you can either host the Stata files on a password-protected machine or encrypt the files with a password (e.g. with gpg --symmetric). Either way, Travis will download the files repeatedly, so you may want to make them as small as possible. I've found the best way is to take an existing install and strip out anything unnecessary, including documentation (all *.pdf, *.sthlp, *.ihlp, *.key, and *.dta files, which are almost all example data), graphical tools (*.mnu, *.png, *.dlg, *.idlg), and alternate executables. Compress this folder with xz. Then upload the files to a web server that allows command-line download (if you are using a general hosting site, check out plowshare), making sure to take care of security.
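As a rough sketch of the trimming and compression (the install path and the exact list of file types are illustrative, so adjust them to your own install):

$ cp -r /usr/local/stata13 stata-minimal
$ find stata-minimal \( -name '*.pdf' -o -name '*.sthlp' -o -name '*.ihlp' -o -name '*.key' -o -name '*.dta' -o -name '*.mnu' -o -name '*.png' -o -name '*.dlg' -o -name '*.idlg' \) -delete
$ tar -cf stata-minimal.tar stata-minimal && xz -9 stata-minimal.tar
$ gpg --symmetric stata-minimal.tar.xz   # optional: encrypt with a passphrase before uploading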

Now you can set up your .travis.yml to configure the machine.

  1. Extract the password needed for your Stata files (you can do this by installing the Travis client on your local machine and using it to encrypt the password as an environment variable).
  2. Download and set up the Stata files.
  3. Add the Stata folder to the PATH (it needs to be able to find the license file).
Finally, add commands to test your code:
  • Obviously, you already have lots of code to test your programs, right? :)
  • In order for Travis to know whether there was a problem, the executable that runs the tests should return a non-zero exit code upon failure. The Stata executable will not do this (!), but you can use a simple wrapper like statab.sh. If you download it, remember to make it executable.
  • The tests can be run either from a master do file, or be called independently from a script or makefile.
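Putting those pieces together, a .travis.yml might look roughly like the sketch below. The URL, file names, and the tests/run_all.do entry point are placeholders, the encrypted STATA_PW variable is whatever you set up with the Travis client, and you should check which decryption flags your gpg version needs.

language: generic
env:
  global:
    - secure: "....."          # encrypted STATA_PW, created with the Travis client
install:
  - wget -q "https://example.com/stata-minimal.tar.xz.gpg"
  - gpg --batch --passphrase "$STATA_PW" -o stata-minimal.tar.xz -d stata-minimal.tar.xz.gpg
  - tar -xJf stata-minimal.tar.xz
  - export PATH="$PWD/stata-minimal:$PATH"   # the license file sits in this folder
script:
  - ./statab.sh do tests/run_all.do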
The more I use Stata, the more valuable it becomes to me!

Monday, June 22, 2015

Independent backups of researcher data

Open data is important for healthy scientific research. A recent journal article by Vines et al. studies how readily data can be obtained from authors. Sadly, they find that
[R]esearch data cannot be reliably preserved by individual researchers, and further demonstrates the urgent need for policies mandating data sharing via public archives.
Most policy mandates are for newly published research, but what about data from existing publications? Is there a way that independent parties could try to preserve what material is available?

One idea is to have an established group systematically look for data that is only available on a researcher's personal/institutional website and to post it in a repository (either at the journal or separately, at a place like Dataverse). The group should have some reputation so there is trust that the data was not manipulated.

Secondly, the group could try to acquire data even when it is not available for download. They could ask authors directly. This would take much more time and have a low response rate, but it still might be worth the investment. The group could also accept material from those who had already requested and received replication materials from the authors. This would require little effort on the part of the group.

Wednesday, June 17, 2015

Stand-alone Pre-analysis Plans for observational studies

The push for research transparency in social science has included advocating for pre-analysis plans (PAPs), where researchers detail what they will do before seeing the data. These plans provide many benefits; the most obvious are (a) they rule out specification searching, so that p-values are more believable, and (b) because research is registered before the findings are known, they help in understanding the body of work that goes unpublished. PAPs are increasingly used for verifiably prospective research (e.g. RCTs, lab experiments, and work using non-open data that requires application). Their use for observational studies on existing data is, however, controversial (see Dal-Re et al. 2014). The primary challenge is that outsiders cannot verify that the analysis plan was actually written before the research was undertaken.

While this challenge is likely not fully solvable, it can potentially be mitigated by restructuring incentives. The research designer could be separated from the research implementer, and each could earn professional credit for their independent contribution. Notably, the research designer would earn credit for publishing their PAP in a well-ranked, PAP-accepting journal regardless of whether the results of the implementation are significant. This does not entirely remove the element of trust, so published PAPs will likely come from those the community trusts, such as senior researchers. It does, however, free the research implementer of this burden of trust. Indeed, once the PAP is published, many people could work to implement it independently, in a fashion similar to replication. Registering intent to implement could help coordinate efforts.

Disaggregating the pieces of research could increase the efficiency of the process in a number of ways. Researchers with ideas but without implementation skills (programming skills or access to data) could publish stand-alone PAPs without the potentially costly process of matching with others who have the necessary skills. Even for those who could implement the research themselves (or with close colleagues), there are likely benefits to specializing. Researchers often have ideas that they think should be pursued but which are not priorities for them. Often these ideas are taken up by colleagues or advisees, but quickly writing up a PAP could be another outlet that would confer visible benefit (see related post). On the other side, implementing PAPs could be a good way to train graduate students. Many graduate students say that the hardest part of research is coming up with a good idea. These PAPs would be especially helpful in areas with a lot of researcher subjectivity, like structural models.

In order to suitably track credit, there could be a norm to cite the PAP whenever one would cite the implementation paper.

Taking this to the extreme, the scientific process could be even further unbundled. For example, a scientist currently has an incentive to "sell" their research (or tell a convincing "story") in submissions to journals, possibly at the expense of being honest about the complexity of the findings. What if the science writer were somehow separated and judged on how well they communicated existing research findings? This could be judged if research data and methods were open.

Thursday, June 11, 2015

Getting versions of Stata modules

For reproducible research, it is important to be able to reproduce your analysis environment. For Stata analyses, one needs to list the versions of the Stata modules used. This is especially important as the SSC archive does not store previous versions of modules. Ideally, one stores the module files with a project, but it is still useful to have a measure of the version of each module. Stata modules are not required to list a version number (NB: this is separate from which -version- of Stata the module requests). Many modules do, and if so one can often find it in the text of the ado file (use -viewsource command.ado- for this). If not, one can use the version date. Modules on SSC are required to list their "distribution date". If you are installing modules from a version control server (e.g. GitHub), then the install date is sufficient. Stata keeps this information in a "stata.trk" file at the base of each module directory (e.g. PLUS), which depends on your adopath. To get this information you can use the following simple script.

$ cat stata.trk | grep "^\([SND]\|d Distribution\)"
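For a module installed from SSC, the filtered output looks something like the following (the module name and dates here are made up for illustration). Roughly, the S/N/D lines record the source, the package name, and the install date, while the Distribution-Date line is the module's own version date.

S http://fmwww.bc.edu/repec/bocode/e
N examplemod.pkg
D 12 Mar 2015
d Distribution-Date: 20150301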

Wednesday, April 22, 2015

Code Management in Dropbox

Dropbox is not the ideal version control system, but some people do not want to make the leap to a full VCS (like git or svn, which full-time software developers use). Given that one should not make the perfect the enemy of the good, what simple tips can Stata-based economists follow to make their lives easier? Here are some tips that will be helpful on projects that last a while (longer than short-term memory or the involvement of an assistant) or involve multiple people. As always, one should have the over-arching goal that the project be reproducible (at almost all points, you can follow an explicit recipe and produce all required outputs). My tips will help with this; you may also need some project-specific solutions, but it is well worth the effort.

  1. Backup: Structure your folders so that a minimal number can be used to reproduce everything in the project. A common approach is to have a code folder (which changes often) and a data/raw folder (which should almost never change). (This means all "manual" actions like cleaning should be embedded in code!) With this approach one can store a one-time backup of the raw data and then store periodic snapshots of the code folder. At key times (e.g. after every new draft or presentation) zip up the code folder and archive it untouched. Since all code is periodically backed up, you can get rid of code that is no longer used.
  2. Avoid editing conflicts: If multiple people are editing the same files here is a tip to reduce friction. If you start editing a file, make an innocuous edit and save it. Then, before editing a file, check the "last modified" time. If someone else edited the file recently, chat to negotiate access (your team should decide what "recently" means).
  3. Branching: If you want to make a series of changes that will leave a key file in an unusable state, then you should make a "branch". You can do this (a) in Dropbox by duplicating the files (including unstable outputs) with modified names (have a convention here), or (b) by copying the project out of Dropbox for editing. Try to make changes small enough that they can be merged back into the main file quickly. The longer files diverge, the harder this will be.
  4. Syncing: For intermediate files needed internally by only one piece of code, you can reduce syncing by creating those files in your `c(tmpdir)' (with -tempfile- or just named explicitly) rather than in your project folder (see the sketch below).
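As a minimal Stata sketch of tip 4 (file and macro names are illustrative):

tempfile working
save "`working'"          // written under c(tmpdir), so Dropbox never syncs it
* ... later steps in the same run ...
use "`working'", clear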

Here are other simple general tips that can be applied to Dropbox:

    1. No one's full path should be written into the code. Code should be run with Stata in the appropriate working directory (often the root of the project folder or the code subfolder). A nice benefit is that if the project folder is copied out of Dropbox, it should still run easily. If you have machine-specific configuration details, those should be defined on the machine in environment variables and brought into Stata using -local env_var : environment ENV_VAR- (see the sketch after this list).
    2. Have a local code/ado folder where you store all the needed modules. In the header of your code, include a script that sets code/ado as PERSONAL and then restricts your adopath to just PERSONAL;BASE (also sketched after this list). You may want to -net set ado PERSONAL- and -mata: mata mlib index- as well.
    3. Learn how to use a visual text diff/merge utility like the one that comes with TortoiseSVN. This will make it easy to compare files between editors and across time.
    4. If possible, do not use spaces in file names (use "_" instead).
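    As a sketch of tips 1 and 2 combined, a project code header might look like the following (the environment variable name is illustrative, and the commands assume Stata was started in the project root):

    sysdir set PERSONAL "`c(pwd)'/code/ado"
    adopath - PLUS
    adopath - SITE
    adopath - OLDPLACE
    adopath - "."
    net set ado PERSONAL
    mata: mata mlib index
    * machine-specific details come from the environment, not hard-coded paths
    local data_root : environment PROJECT_DATA_ROOT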
    What other tips do you all have?

    Friday, January 30, 2015

    Using Yik Yak to quantify aspects of student life?

    Just listened to a nice piece on Reply All about Yik Yak, an app where users can post messages anonymously and see the messages of those nearby (within about 10 miles). The podcast mentioned how it was useful in showing written evidence of racism that previously was hard to prove. It got me thinking that if there were a way to get the feeds from different universities, then one could analyze them using Natural Language Processing routines to get metrics for different schools. What percentage of anonymous posts at your school are about race, violence, sex, etc.? That could be something prospective students might want to know.

    Wednesday, January 28, 2015

    Prediction markets for human capital investment

    While there has been a growing literature on prediction markets, most of the markets have been confined to politics and macroeconomics. These are great to have, but I would like to see markets that could help with more of the decisions that typical people make, such as when, and in what field, to get training. The information people have is usually about the current economy (e.g. this industry currently has a tight labor market) or employment projections from the government. What could be better would be a set of markets about future industry-specific indicators, such as vacancy rates. If they were started at least 2-3 years before maturity, they could be used by people in college to help pick a major, or by workers deciding whether to get training in a new field. The market might have to be "seeded" by the government, but it could be worth it.

    Tuesday, January 20, 2015

    Getting latexmk working within LyX

    If you would like to add latexmk as an export option in LyX, here are the two basic steps, done in Tools -> Preferences.
    1) In File Handling -> File Formats, hit "New", put in these options, then "Apply".
    2) Then in File Handling -> Converters, click on any existing converter, change the options to the below, click "Add", then "Save".

    Now you should be able to File->Export->PDF (Latexmk). Hope that works for you.

    If you are on Windows, you might have problems with Perl versions (LyX has one, MiKTeX another). Initially, I was getting an error that Perl couldn't find File::Glob in @INC when latexmk was run. It turns out that LyX was running its own version of Perl 5.18.2 and had @INC set to just the LyX/lib directory (which, oddly, did have File/Glob.pm). I installed Strawberry Perl of the same version (that was needed!) and then set PERL5LIB to the Strawberry Perl folder. Then it worked fine.
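    Setting the variable persistently can be done from a command prompt, roughly like this (the path is illustrative and depends on where Strawberry Perl was installed):

    setx PERL5LIB "C:\Strawberry\perl\lib"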

    This was on LyX 2.1.2, updated MiKTeX, and Windows 8.

    Running Stata 12 on Ubuntu 14.04 (Trusty Tahr)

    After upgrading my Linux machine I realized that I could no longer run my copy of the Stata 12 GUI. It gave the error:
    ./xstata-se: error while loading shared libraries: libgtksourceview-1.0.so.0: cannot open shared object file: No such file or directory
    These older versions of the libraries aren't in the normal Ubuntu repositories anymore, so here is a simple workaround to get Stata working again. (I'm on a 64-bit machine, so change amd64 to the appropriate architecture.)
    1. Install libgtksourceview2-2.0
    2. sudo ln -s /usr/lib/libgtksourceview-2.0.so.0 /usr/lib/libgtksourceview-1.0.so.0
    3. Download libgnomecups1.0-1 0.2.3-4 from here. Install it with dpkg.
    4. Download libgnomeprint2.2-data_2.18.8-3ubuntu1_all.deb and libgnomeprint2.2-0_2.18.8-3ubuntu1_amd64.deb from here. Install them at the same time with dpkg.
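    Roughly, steps 3 and 4 come down to commands like these (the .deb file names are approximate and depend on the exact versions you download):

    sudo dpkg -i libgnomecups1.0-1_0.2.3-4_amd64.deb
    sudo dpkg -i libgnomeprint2.2-data_2.18.8-3ubuntu1_all.deb libgnomeprint2.2-0_2.18.8-3ubuntu1_amd64.deb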
    Happy estimating.

    Tuesday, November 25, 2014

    Graphs with log-scales in the positive and negative domains

    Often, to show data that is highly dispersed, one will compress the data by graphing its log. One downside is that this only works if the data is all-positive or all-negative (if you use \(-\ln(-x)\)). If your data contains zero and/or points in both domains, then you have to do something else. Here is a simple extension that uses a linear function around zero to smoothly connect a log function and its opposite. $$f(x)= \begin{cases} \ln(x) & \text{if }x>e\\ x/e & \text{if }-e\leq x\leq e\\ -\ln(-x) & \text{if }x<-e \end{cases}$$ The function is log-linear-log ("trilog").
    You can get a simple Stata utility -trilog- from here to make this transformation and create axis labels.
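    If you just want the transformation itself (without the axis-label handling that -trilog- provides), a hand-rolled Stata version might look like this, with y as an illustrative variable name:

    gen double y_tri = cond(y > exp(1), ln(y), ///
                       cond(y < -exp(1), -ln(-y), y/exp(1)))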
    Another intuitive extension would be to shift the log and its opposite toward zero, such as $$f(x)= \begin{cases} \ln(x+1) & \text{if }x\geq0\\ -\ln(-x+1) & \text{if }x<0 \end{cases}$$ The downside is that equal proportional changes are no longer reflected as equal distances.

    Wednesday, November 19, 2014

    Using make with Stata

    Having a makefile helps automate a lot of tasks with a project.

    • Generating different formats of logs, tex tables, and gphs (including versions of the figures with no titles), and removing orphan files if the primary ones are removed.
    • Normalizing logs, gphs, and dtas prior to committing.
    • Generating PDFs of Lyx files.
    • Updating the project mlib when mata files change.
    • Installing updated versions of required packages.
    • Running Stata in batch mode, knowing the dependencies between code files (and setting up a gateway file so that on Windows modules can run shell commands).
    • Dealing with SVN commands.
    Here is a stub version. It uses statab.sh, cli_build_proj_mlib.do, gph2fmt.ado, cli_gph_eps.do, cli-install-module.do, and cli_smcl_log.do (plus the normalizing ones).
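    As a rough illustration of the core idea (not the linked stub): one pattern rule per output type, assuming the statab.sh wrapper is invoked as ./statab.sh do <dofile> and the folder names are illustrative.

    DOFILES := $(wildcard code/*.do)
    LOGS    := $(notdir $(DOFILES:.do=.log))

    .PHONY: all
    all: $(LOGS)

    %.log: code/%.do
	./statab.sh do $<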

    Edit 2015-01-28: I've posted a project template with updated versions of these (and better makefiles as my skill improves) at GitHub

    Version control for research projects

    While I had used version control on earlier projects, I didn't start using version control for collaborative research projects until reading Code and Data for the Social Sciences: A Practitioner’s Guide by Matthew Gentzkow and Jesse M. Shapiro. If you haven't read it, it's a good read (I agree with the general guidelines and most of the specifics).

    The first decision is which files to version. I version the dta, gph, tex, and log files.  I chose not to version easily generated files such as different formats of outputted figures and tables and instead generate them automatically using my Makefile. I normalize the dta, gph, and log files before committing them so that changes are noted only if real content has changed.

    Some miscellaneous tools: rm-non-svn.sh, svn_batch_rename.sh.


    Stata wishlist

    Here's what I wish Stata would add:

    1. Primary output files (dtas and gphs) should be reproducible byte-for-byte. Primarily this requires being able to zero out the timestamps and any junk padding.
    2. Make PDF and PNG exporting of figures available on console Unix.
    3. Shell commands should work in Windows batch-mode.
    4. All built-in commands should return values through the return classes (e.g. r()) so that they can be used programmatically.
    5. Also, allow the Windows do-editor to automatically word wrap (this is a main reason why people I know use other editors).
    6. The program should set a positive return code on error. When running Stata in batch-mode the program always exits with 0 (success) even if the program had an error.
    Hopefully, some of these will be available for version 1415.

    Monday, October 13, 2014

    Module to sort Stata matrix by column

    I noticed recently that the module to sort Stata matrices by a column (matsort) incorrectly handles row names with spaces. This error is partly due to the limited Stata functions available for handling matrices prior to version 9 (when Mata was introduced). I've quickly made a bug-free replacement -matrixsort- that you can download from my Github repository.

    Sunday, October 12, 2014

    Bookmarklets

    Here are two bookmarklets that I created recently:

    1. CleanupSingleDoc, which I use (with Printliminator) to save a simple version of a web page (removing text links, iframes, etc.) in order to save the content as an ePub (using Calibre).
    2. Connect via ResearchPort, which appends the University of Maryland proxy suffix for authentication.

    Friday, October 10, 2014

    Managing sets of files produced by different configurations

    With statistical estimation you often run the same programs multiple times with slightly different options. If these programs produce file outputs, you can have some trouble managing them all. Here are some tools I use. One building block is a global suffix macro, ${extra_f_suff}, that gets appended to output file names (see point 2 below).

    Tracking sets

    The first task is tracking the files produced. In general you want to have files names that include the option. Here, I employ two strategies depending on the work.

    1. For real runs, I usually want to collect the names of all files produced so that they can be checked and then stored/deleted/etc. Therefore I have wrapper functions for saving dtas, logs, graphs, tables, and text snippets that append the name of the file they are writing to a separate text file.
    2. A standard option that I always include is a "testing" switch. This is useful when I just want to test whether a small change causes an error. It does the bare minimum for a program (limits the number of observations, reduces the number of repetitions, etc.). It also sets a global extra_file_suffix="_testing" which is appended to all file names at the point of file writing (easier than passing a testing option through several layers of programs).
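    As a rough Stata sketch of these two ideas together (the program, global, and file names are illustrative, not the actual wrappers I use):

    global extra_file_suffix ""            // the testing switch sets this to "_testing"
    capture program drop save_and_track
    program define save_and_track
        args basename
        local fname "`basename'${extra_file_suffix}.dta"
        save "`fname'", replace
        tempname fh                        // append the file name to a manifest
        file open `fh' using "file_list.txt", write append
        file write `fh' "`fname'" _n
        file close `fh'
    end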

    Manipulating sets

    If you can build a list of files (either because the names were saved or because you can generate them with something like find . -name "*.blah"), then here are some handy tools for dealing with them.

    $ cat file_list | xargs rm # delete
    #manage them via svn (similar options for git):
    $ cat file_list | svn add --targets -
    $ cat file_list | svn remove --keep-local --targets -
    $ cat file_list | svn commit -m ""  --targets -

    Noting the primary key in Stata files

    Most tables in databases have a primary key defined, and this can be a help with Stata files too. If you have a primary key defined by a single variable, you can use xtset (or tsset if it's a time variable). If you have a composite key and one of its components is a time variable, you can use xtset/tsset. Otherwise, you should have a consistent way of noting it. One way is to store it as a dta characteristic, such as:
    . char _dta[key] keyvar(s)
    See also isid.
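    For example, with an illustrative composite key of firm_id and year:

    . isid firm_id year
    . char _dta[key] firm_id year
    . display "`_dta[key]'"

    -isid- errors out if the variables do not uniquely identify the observations, and the characteristic can be recovered later via macro expansion as shown.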

    Thursday, August 28, 2014

    Parallel processing in Stata: Example code

    Here are some files from a presentation I did about running parallel operations in Stata (beyond what Stata/MP does). They include a simple presentation covering the basic idea and example code for doing parallel bootstrap estimation. They use the parallel Stata module (I prefer the dev version).
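    The basic pattern is roughly the following, using the auto dataset as a stand-in (check the module's help file for the exact syntax of your version):

    . sysuse auto, clear
    . parallel setclusters 4
    . parallel bs, reps(1000): regress price mpg weight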

    Wednesday, August 27, 2014

    Easy LyX Beamer products

    When making presentations in LyX with the Beamer template, one often wants to make three PDFs every time: slides, handouts, and handouts+notes. Here are some scripts to do this. You will need Cygwin installed if you are on Windows.

    Command-line version:
    In Windows, I've added this functionality to the right-click menu for LyX files. It requires the slightly modified script below. To edit the right-click menu I use the FileMenu Tools utility. In a new entry, set the program as a shell (I've tested Git's -- "C:\Program Files (x86)\Git\bin\sh.exe" -- and Cygwin's -- "C:\cygwin64\bin\mintty.exe"). For arguments, put "absolute/path/to/beamer_outputs.sh %FILENAME1%".
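    As a minimal sketch of the idea (not the actual scripts, which also handle the notes variant), assuming lyx and latexmk are on the PATH:

    f="${1%.lyx}"
    lyx -e pdflatex "$f.lyx"                       # export $f.tex
    latexmk -pdf -jobname="${f}_slides" "$f.tex"   # slides
    printf '\\PassOptionsToClass{handout}{beamer}\\input{%s.tex}\n' "$f" > "${f}_handout.tex"
    latexmk -pdf "${f}_handout.tex"                # handouts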

    Monday, August 25, 2014

    Storing Stata project dependencies

    Newer versions of an environment can break existing code, so it is often helpful to maintain access to the specific versions you use. For the Stata environment this is particularly important: the SSC archive doesn't store previous versions of modules, so you should store them in your project folder. To ensure that a project uses only the locally-stored Stata programs, I set the shell environment variable S_ADO to "\"<project_base>/code/ado/\";BASE".
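    For example (the path and do-file are illustrative), in the shell before launching Stata:

    $ export S_ADO='"/path/to/project/code/ado/";BASE'
    $ stata-se -b do code/main.do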

    The process is a bit more work if a module has machine-specific files (e.g. compiled plugins) and you want your project to run on different platforms. If you're working across platforms, you should have your code stored in some kind of version control repository (e.g. Subversion or Git). For modules with machine-specific files, you can't store the installed files in the repo since they differ between machines. Instead, you store the installation files in the repo and then do a local install on each machine. To store the installation files locally, find the URL of the pkg file (e.g. http://fmwww.bc.edu/repec/bocode/s/synth.pkg) and use store-mod-install-files.sh. To install locally, you then point -net install- at the stored files.
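    As a rough sketch (the directory layout is illustrative, and synth is just the example module from above; the stored directory must contain the .pkg file and the files it lists):

    . net set ado "`c(pwd)'/code/ado"
    . net install synth, from("`c(pwd)'/code/module-sources/s") replace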

    A more general solution will reinstall the files whenever the installation files are updated. For this we use make and generate the makefile. The file dependencies are stored in the pkg files, so you can use gen-makefile.sh (along with statab.sh and cli-install-module.do) to scan for such files, extract the dependencies, and generate the makefile. You will then be able to type:
    $ make all_modules
    to update/install all the modules.

    For the paranoid: the Stata program itself is closed-source, so specific versions may be unavailable in the future. Stata states that you can use the version command to enable dated interpretation in newer versions. If you are not satisfied with that, store the installation files yourself (preferably for an open-source system like Linux).