
Tuesday, August 18, 2015

Continuous Integration with Stata

Continuous Integration is a workflow that frequently merges code changes and automatically runs tests. This can be especially helpful when working in teams. Several free services exist, such as Travis-CI, which provides Linux build infrastructure and integrates with GitHub. Basically, you instruct Travis how to set up a Linux environment for your project and tell it which tests to run; Travis then automates this, running the tests on every push to your repository.

First off, you'll obviously need a Stata license that allows for this kind of work. The single-user license allows for 3 "seats", or you may have a network license with extra capacity. In Travis, go to the Settings for your repo and "Limit concurrent builds" to 1 (or whatever is permitted by your license).

Travis starts each build from a clean Linux machine image, so your setup will have to let Travis install Stata automatically. Travis allows you to store private passwords on the test infrastructure, so you can either host the Stata files on a password-protected machine or encrypt the files with a password (e.g. with gpg --symmetric). Either way, Travis will download the files repeatedly, so you may want to make them as small as possible. I've found the best way is to take an existing install and strip out anything unnecessary, including documentation (all *.pdf, *.sthlp, *.ihlp, *.key, and *.dta files, which are almost all example data), graphical tools (*.mnu, *.png, *.dlg, *.idlg), and alternate executables. Compress this folder with xz. Then upload the files to a web server that allows for command-line download (if you are using a general hosting site, you could check out plowshare), making sure to take care of security.
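For example, the strip-compress-encrypt step might look like this (the paths and Stata version are illustrative):

    $ cp -r /usr/local/stata13 stata13          # work on a copy of an existing install
    $ cd stata13
    $ find . \( -name '*.pdf' -o -name '*.sthlp' -o -name '*.ihlp' \
                -o -name '*.key' -o -name '*.dta' -o -name '*.mnu' \
                -o -name '*.png' -o -name '*.dlg' -o -name '*.idlg' \) -delete
    $ cd .. && tar cJf stata13.tar.xz stata13   # xz-compressed tarball
    $ gpg --symmetric stata13.tar.xz            # prompts for a passphrase; writes stata13.tar.xz.gpg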

Now you can set up your .travis.yml to configure the machine (a sketch of the full file appears below).

  1. Extract the password needed for your Stata files (install the Travis client on your local machine and use it to store the password as an encrypted environment variable, e.g. travis encrypt STATA_PASS=... --add).
  2. Download and set up the Stata files.
  3. Add the Stata folder to the PATH (Stata needs to be able to find the license file).
Finally, add commands to test your code:
  • Obviously, you already have lots of code to test your programs, right? :)
  • In order for Travis to know if there was a problem, the executable that runs the tests should return a non-zero exit code upon failure. The Stata executable will not do this (!), but you can use a simple wrapper like statab.sh. If you download this, remember to make it executable (chmod +x).
  • The tests can be run either from a master do file, or be called independently from a script or makefile.
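Putting the pieces together, a minimal .travis.yml might look like this. It is only a sketch: the download URL, file names, and run_tests.do are illustrative, and the secure line stands in for whatever travis encrypt generates:

    language: generic
    env:
      global:
        - secure: "..."                    # the encrypted STATA_PASS variable
    install:
      - wget https://example.com/stata13.tar.xz.gpg
      - gpg --batch --passphrase "$STATA_PASS" -o stata13.tar.xz -d stata13.tar.xz.gpg
      - tar xJf stata13.tar.xz
      - export PATH=$PWD/stata13:$PATH     # Stata must be on the PATH to find its license file
    script:
      - ./statab.sh do code/run_tests.do   # wrapper exits non-zero if any test fails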
The more I use Stata, the more valuable it becomes to me!

Wednesday, April 22, 2015

Code Management in Dropbox

Dropbox is not an ideal version control system, but some people do not want to make the leap to a full VCS (like git or svn, which full-time software developers use). Given that one should not make the perfect the enemy of the good, what simple tips can Stata-based economists follow that will make their lives easier? Here are some tips that will be helpful on projects that last a while (longer than short-term memory or the involvement of an assistant) or involve multiple people. As always, one should have the over-arching goal that the project be reproducible: at almost all points, you can follow an explicit recipe and produce all required outputs. My tips will help with this; you may also need some project-specific solutions, but it is well worth the effort.

  1. Backup: Structure your folders so that a minimal number of them can be used to reproduce everything in the project. A common approach is to have a code folder (which changes often) and a data/raw folder (which should almost never change). (This means all "manual" actions like cleaning should be embedded in code!) With this approach one can store a one-time backup of the raw data and then store periodic snapshots of the code folder. At key times (e.g. after every new draft or presentation) zip up the code folder and archive it untouched (see the one-liner after this list). Since all code is periodically backed up, you can get rid of code that is no longer used.
  2. Avoid editing conflicts: If multiple people are editing the same files, here is a tip to reduce friction. Before editing a file, check its "last modified" time; if someone else edited it recently, chat to negotiate access (your team should decide what "recently" means). When you do start editing, make an innocuous edit and save immediately, so that others can see the file is in use.
  3. Branching: If you want to make a series of changes that will leave a key file in an unusable state, then you should make a "branch". You can do this (a) in Dropbox, by duplicating the files (including unstable outputs) with modified names (have a convention here), or (b) by copying the project out of Dropbox for editing. Try to make changes small enough that they can be merged back into the main file quickly; the longer files diverge, the harder this will be.
  4. Syncing: For intermediate files needed internally by only one piece of code, you can reduce syncing by creating those files in your `c(tmpdir)' (with -tempfile- or just named explicitly) rather than in your project folder.
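For the snapshot step in tip 1, a one-liner run from the project root is enough (the archive name and layout are illustrative):

    $ zip -r "backups/code_$(date +%F)_draft2.zip" code/   # e.g. backups/code_2015-04-22_draft2.zip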

Here are other simple general tips that can be applied to Dropbox:

    1. No one's full path should be written into the code. Code should be written to run assuming Stata is already in the appropriate working directory (often the root of the project folder or the code subfolder). A nice benefit is that if the project folder is copied out of Dropbox it should still run easily. If you have machine-specific configuration details, define those on each machine in environment variables and bring them into Stata using -local env_var : environment ENV_VAR-.
    2. Have a local code/ado folder where you store all the needed modules. In the header of your code, include a script that sets code/ado as PERSONAL and then restricts your adopath to just PERSONAL;BASE. You may also want to -net set ado PERSONAL- and -mata: mata mlib index-. (A sketch of such a header follows this list.)
    3. Learn how to use a visual text diff/merge utility like the one that comes with TortoiseSVN. This will make it easy to compare files between editors and across time.
    4. If possible, do not use spaces in file names (use "_" instead).
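Here is a sketch of such a header for tips 1 and 2, shown from the shell. The file name, the MYPROJ_SCRATCH variable, and the folder layout are all illustrative:

    $ cd ~/Dropbox/myproject/code && cat _header.do
    * bring machine-specific settings in from the environment
    local scratch : environment MYPROJ_SCRATCH
    * use the project's local ado folder and trim the adopath down
    sysdir set PERSONAL "`c(pwd)'/ado/"
    adopath - SITE
    adopath - PLUS
    adopath - OLDPLACE
    net set ado PERSONAL
    mata: mata mlib index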
    What other tips do you all have?

    Wednesday, November 19, 2014

    Using make with Stata

    Having a makefile helps automate a lot of tasks with a project.

    • Generating different formats of logs, tex tables, and gphs (including making versions of the figures that have no titles), and removing orphan files if the primary ones are removed.
    • Normalizing logs, gphs, and dtas prior to committing.
    • Generating PDFs of LyX files.
    • Updating the project mlib when mata files change.
    • Installing updated versions of packages when their installation files change.
    • Running Stata in batch mode, knowing the dependencies between code files (and setting up a gateway file so that on Windows modules can run shell commands).
    • Dealing with SVN commands.
    Here is a stub version. It uses statab.sh, cli_build_proj_mlib.do, gph2fmt.ado, cli_gph_eps.do, cli-install-module.do, and cli_smcl_log.do (plus the normalizing ones).
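    As a flavor of what the stub automates, here is a minimal sketch of a rule for batch runs (the layout and the statab.sh invocation are illustrative, not the stub's actual contents; recipe lines must start with a tab):

    LOGS := $(patsubst code/%.do,logs/%.log,$(wildcard code/*.do))

    all: $(LOGS)

    # re-run a do-file in batch mode whenever it changes, then file its log
    logs/%.log: code/%.do
    	./statab.sh do $< && mv $(*F).log logs/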

    Edit 2015-01-28: I've posted a project template with updated versions of these (and better makefiles as my skill improves) at GitHub.

    Version control for research projects

    While I had used version control on earlier projects, I didn't start using it for collaborative research projects until reading Code and Data for the Social Sciences: A Practitioner’s Guide by Matthew Gentzkow and Jesse M. Shapiro. If you haven't read it, it's a good read (I agree with the general guidelines and most of the specifics).

    The first decision is which files to version. I version the dta, gph, tex, and log files. I chose not to version easily generated files, such as different formats of the outputted figures and tables, and instead generate them automatically using my Makefile. I normalize the dta, gph, and log files before committing them so that changes are noted only if real content has changed.

    Some miscellaneous tools: rm-non-svn.sh, svn_batch_rename.sh.


    Friday, October 10, 2014

    Managing sets of files produced by different configurations

    With statistical estimation you often run the same programs multiple times with slightly different options. If these programs produce file outputs, you can have some trouble managing them all. Here are some tools I use; they revolve around a global suffix, ${extra_f_suff}, that gets appended to output file names.

    Tracking sets

    The first task is tracking the files produced. In general you want file names that include the options used. Here, I employ two strategies depending on the work.

    1. For real runs, I usually want to collect the names of all files produced so that they can be checked and then stored/deleted/etc. Therefore I have wrapper functions for saving dtas, logs, graphs, tables, and text snippets that append the name of the file they are writing to a separate manifest file (see the sketch after this list).
    2. A standard option that I always include is a "testing" switch. This is useful when I want to just test whether a small change causes an error. It does the bare minimum for a program (limits the number of observations, reduces the number of repetitions, etc.). It also sets the global ${extra_f_suff} to "_testing", which is appended to all file names at the point of file writing (easier than passing a testing option through several layers of programs).
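    A sketch of such a wrapper, shown from the shell; the program name, the manifest file, and the fixed save options are all illustrative:

    $ cat code/ado/save_dta.ado
    program define save_dta
        args fname
        * append the suffix at the last moment, then record what was written
        local out "`fname'${extra_f_suff}.dta"
        save "`out'", replace
        tempname fh
        file open `fh' using "file_manifest.txt", write append
        file write `fh' "`out'" _n
        file close `fh'
    end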

    Manipulating sets

    If you can build a list of files (either because their names were saved, or via something like find . -name '*.blah'), then here are some handy tools for dealing with them.
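
    For example, you might first collect the outputs of testing runs (pattern illustrative):

    $ find . -name '*_testing*' > file_list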

    $ cat file_list | xargs rm # delete
    # manage them via svn (similar options for git):
    $ cat file_list | svn add --targets -
    $ cat file_list | svn remove --keep-local --targets -
    $ cat file_list | svn commit -m ""  --targets -

    Monday, August 25, 2014

    Storing Stata project dependencies

    Newer versions of an environment can break existing code, so it is often helpful to maintain access to the specific versions you use. This is particularly important for the Stata environment: the SSC archive doesn't store previous versions of modules, so you should store them in your project folder. To ensure that a project uses only the locally-stored Stata programs, I set the shell environment variable S_ADO to "\"<project_base>/code/ado/\";BASE".
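    For example, from the project base (the statab.sh invocation and do-file name are illustrative):

    $ cd ~/Dropbox/myproject                    # the project base
    $ export S_ADO="\"$PWD/code/ado/\";BASE"    # Stata searches only the local ado folder and BASE
    $ ./statab.sh do code/run_all.do            # this run can only see locally-stored modules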

    The process is a bit more work if a module has machine-specific files (e.g. compiled plugins) and you want to allow your project to run on different platforms. If you're working across platforms, you should have your code stored in some kind of version control repository (e.g. subversion or git). For modules with machine-specific files you can't store the installed files in the repo, since they differ across machines. Instead you store the installation files in the repo and then do a local install on each machine. To store the installation files locally, find the URL of the pkg file (e.g. http://fmwww.bc.edu/repec/bocode/s/synth.pkg) and use store-mod-install-files.sh. To install locally, do something like the following.
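    A sketch of the local install, using synth as in the example above (the ado-src folder and the statab.sh invocation are illustrative; -net set ado- and -net install- are the actual Stata commands):

    $ cat code/install_local.do
    net set ado "`c(pwd)'/code/ado/"
    net install synth, from("`c(pwd)'/code/ado-src/s/") replace
    $ ./statab.sh do code/install_local.do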

    A more general solution will reinstall the files whenever the installation files are updated. For this we use make and generate the makefile. The file dependencies are stored in the pkg files, so you can use gen-makefile.sh (and statab.sh and cli-install-module.do) to scan for such files, extract the dependencies, and generate the makefile. You will then be able to type:
    $ make all_modules
    to update/install all the modules.

    For the paranoid: The Stata program itself is closed-source, so specific versions may be unavailable in the future. Stata states that you can use the version command to enable dated interpretation in newer versions. If you are not satisfied by that, store the installation files yourself (preferably for an open-source system like Linux).

    Wednesday, September 11, 2013

    Automated Statistical Reports in MS Office

    I prefer LyX to Microsoft Office when writing academic work, but sometimes MS Word/PowerPoint is necessary. My one requirement for the workflow is that manual fiddling isn't required. This is a bit challenging because Office products can't show PS/PDF material inline. You can include an EPS as a graphic, and it can print fine (if you print to a PDF or have a PS printer), which is acceptable for Word, but inline it will only show an unavoidably ugly "preview", which doesn't work for PowerPoint. For presentations, output to a raster format (like PNG); for tables you can possibly output to EMF.

    Generating EPS/PNG files from graphics is built into most stats software (R, Stata), so the only hard part is converting tables. I'll assume you can generate these as tex and start there (so as to minimize differences between Office- and LaTeX-generated reports). The scripts below were tested on Windows 7 with Cygwin, with the programs in step 4 placed in my path.
    1. Export a table to mytable.frag.tex.
    2. Wrap that tex fragment in a minimal "standalone" document.
      • >echo \documentclass[varwidth=true, border=10pt]{standalone} > mytable.tex
      • >echo \begin{document} >> mytable.tex
      • >cat mytable.frag.tex  >> mytable.tex
      • >echo \end{document} >> mytable.tex
    3. Make your table into a "standalone" PDF (the PDF will be just the size of the table). Your TeX distribution will automatically download the 'standalone' package.
      • >pdflatex mytable.tex
    4. Convert the PDF to the output required.
      • >pdftops -eps mytable.pdf mytable.eps
        • Could also do this in Ghostscript like below.
      • >pstoedit -f emf mytable.pdf mytable.emf
        • This will change the fonts on you. If you have Greek letters this conversion might fail, in which case you need to add -adt, which turns letters into polygons (not perfect, but OK). You can also convert with Adobe Illustrator, which will also do font conversion.
      • >gswin64c -dSAFER -dBATCH -dNOPAUSE -sDEVICE=pnggray -r600 -sOutputFile=mytable.png mytable.pdf
    There are two options for inserting the images. The first is the easiest, but it uses absolute paths for images, so it won't work well if you share the files with other users (e.g. in Dropbox) or move the project folder.

    • Use normal image insertion but use the dropdown next to "Insert" and select "Insert and Link". Some VBA scripts are available that will convert the absolute path to relative paths (e.g. here).
    • Insert pictures using field codes: Insert -> Quick Parts -> IncludePicture. Then put in a relative path like pics\pic1.png and select "Data not stored with document" (and you probably want to check resizing both horizontally and vertically). If you later want to convert the links to embedded pictures (e.g. if you send the document out), remove the "Data not ..." option by removing "\d" from the field code, and then make a formatting edit (like going to Format Picture and changing the Layout wrapping style).

    Now when the linked files are updated by your programs, they will be updated in Office. If the document is open, select an image (or select all) and press F9 to refresh.

    Edit 2016-01-22: Expanded part about relative paths to images.

    Also note that EMFs might not display correctly in PowerPoint on a Mac (the same goes for MathType equations). You can print pptxs to PDFs, but then you will lose any animations.