
Monday, August 30, 2021

Cross-fitting in Stata

Cross-fitting is a method for producing unbiased (honest) predictions and residuals from a model. It is commonly used these days with machine learning models, since they typically have flexibility (and hence bias) built in. Cross-fitting can be seen as the first part of cross-validation: the data are split into K folds, and predictions for fold k come from a model trained on all the data except fold k. When K equals the sample size, this generates jackknife predictions (whose residuals are approximately unbiased).

You can install it from my Stata-modules GitHub repo. It encapsulates both the fit and prediction step and generates either predictions or residuals (if you specify the outcome).


//net install crossfit, from(https://raw.github.com/bquistorff/Stata-modules/master/c/) replace
sysuse auto
crossfit price_hat_oos, k(5): reg price mpg
crossfit price_resid_oos, k(5) outcome(price): reg price mpg
ereturn list //additionally get MSE, MAE, R2

The crossfold package does almost everything needed, except that it is non-trivial to get the predictions/residuals back as a new variable (especially when there is a global 'if' clause). Maybe one day we should merge!

Random Forest implementation in Stata

I recently needed a Random Forest implementation on a slightly older version of Stata and found the choices quite lacking. 

  1. crtrees (deck) is a Stata-native implementation, but I ran into confusing errors when running it. I tried to fix them, but the code looks like it has been sent through a code obfuscator!
  2. Stata's native interface with Python wasn't available to me since I was using Stata 15.
  3. rforest is a binding to a Java implementation. I couldn't install Java, and since most machine learning these days happens in R or Python, Java is an odd language choice anyway.

I then stumbled upon rcall, which allows calling R programs. R was available on my platform, so I made a simple Stata binding to the fast ranger package in R. You can install it from my Stata-modules GitHub repo. An R process is spun up and all the work is done with a single command, so it encapsulates both the fit and prediction steps (producing either standard or out-of-bag predictions).


//net install ranger, from(https://raw.github.com/bquistorff/Stata-modules/master/r/) replace
sysuse auto
ranger price mpg turn, predict(price_hat_oos)



Sunday, September 13, 2015

Compiling Stata plugins

Here are some notes on how to compile plugins for Stata. The official plugin page is a bit out of date.

Linux: Add option -fPIC. If you want to build a 32-bit plugin on a 64-bit system then you will need the 32-bit version of libc (e.g. on Debian install g++-multilib and libc6-dev-i386).

Windows: You will need the mingw-w64 compiler (unless you just want 32-bit, in which case you can use the original MinGW). You can get this in several ways, but since I already had Cygwin I just installed the mingw64-x86_64-* packages (for 32-bit you can use mingw64-i686-*). You will then use the x86_64-w64-mingw32-g++ command rather than g++ -mno-cygwin. Add the -static flag in the link step, otherwise Stata might give a "could not load plugin" error due to DLLs it cannot find (e.g. I used C++ classes and found, using Procmon, that after reading my plugin file Stata could not find libgcc_s_seh-1.dll).

Mac: Install the Xcode command-line tools. I added the -shared option and removed the -bundle option (-bundle can't be used with dylibs). For modern Xcode (since version 5.0) you have to use clang/clang++.
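
To make this concrete, here is a rough sketch of the compile commands I would expect on each platform, assuming the standard stplugin.c/stplugin.h scaffolding from Stata's plugin page and a source file myplugin.cpp (the file names and the -DSYSTEM defines are assumptions — check stplugin.h for the value your version expects):

# Linux (64-bit)
$ g++ -shared -fPIC -DSYSTEM=OPUNIX stplugin.c myplugin.cpp -o myplugin.plugin
# Windows (64-bit, mingw-w64 installed via Cygwin); -static avoids missing-DLL errors
$ x86_64-w64-mingw32-g++ -shared -static stplugin.c myplugin.cpp -o myplugin.plugin
# Mac (clang++, Xcode command-line tools)
$ clang++ -shared -DSYSTEM=APPLEMAC stplugin.c myplugin.cpp -o myplugin.plugin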

Wednesday, September 02, 2015

assign_treatment: A Tool for Stratified Randomization

This blog post by David McKenzie and Miriam Bruhn talks about assigning treatments at random when the stratification variables produce cells (unique combinations of stratification variable values) whose sizes are not evenly divisible by the number of treatments. The provided Stata do-file is hard-coded for six treatments and is a bit difficult to adapt to another number. I've made a module that does the assignment for varying numbers of treatments, as well as providing different methods for assigning the leftover units ("misfits") from each cell (a small sketch of the misfit problem follows the list below).

The three main methods provided all assign the misfits at random and achieve cell-level balance (counts differ at most by 1):

  • full - This is the same method as McKenzie and Bruhn. The misfits from each cell are separately randomized to any combination of treatments. This can cause the overall number of units per treatment to be unbalanced even though they will be balanced at the cell level.
  • reduction - This method achieves overall balance as well as better balance for specified stratification variables. It does this by limiting misfits to be assigned to a "wrapped" interval of treatments (e.g. (2,3,4) or (6,1,2) if there are six treatments) and then having those intervals dovetail together.
  • full-obalance - This method allows misfits to be assigned to any combination of treatments and achieves overall balance. It does so by assigning units one at a time to fill repeating slots of (1,...,T,1,...,T,...). At each stage it keeps track of the units that could fill a slot (without causing two from the same cell to have the same treatment). It randomly picks one and then attempts to fill the next slot (while giving a slight preference to first fitting misfits from cells with many misfits). If filling a slot is impossible the algorithm backs up to the last point where there was a choice and tries a new option.

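To make the misfit problem concrete, here is a small sketch that counts the misfits in each cell (the three-treatment count and the stratification variables are just illustrative):

sysuse auto, clear
egen cell = group(foreign rep78), missing
bysort cell: gen cell_size = _N
gen n_misfits = mod(cell_size, 3)   // leftover units if 3 treatments are assigned evenly within the cell
egen tag = tag(cell)
list cell cell_size n_misfits if tag, sep(0)
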
To install the module, run
. net install assign_treatment, from(https://raw.github.com/bquistorff/Stata-modules/master/a/) replace

Tests are provided here.

Tuesday, August 18, 2015

Continuous Integration with Stata

Continuous Integration is a workflow that frequently merges code changes and automatically runs tests. This can be especially helpful when working in teams. Several free services exist, such as Travis-CI, which provides Linux build infrastructure and integrates with GitHub. Basically, you instruct Travis how to set up a Linux environment for your project and then tell it which tests to run. Travis then automates this, checking your repository at every push.

First off, you'll obviously need a Stata license that allows for this kind of work. The single-user license allows for 3 "seats", or you may have a network license with extra capacity. In Travis, you should go to the Settings for your repo and "Limit concurrent builds" to 1 (or whatever is permitted by your license).

Travis starts each build from a clean Linux machine image, so your setup will have to allow Travis to install Stata automatically. Travis allows you to store private passwords on the test infrastructure, so you could either host the Stata files on a password-protected machine or encrypt the files with a password (e.g. with gpg --symmetric). Either way, Travis will download the files repeatedly, so you may want to make them as small as possible. I've found the best way is to take an existing install and then strip out anything unnecessary, including documentation (all *.pdf, *.sthlp, *.ihlp, *.key, and *.dta files, which are almost all example data), graphical tools (*.mnu, *.png, *.dlg, *.idlg), and alternate executables. Compress this folder with xz. Then upload the files to a web server that allows for command-line download (if you are using a general hosting site, you could check out plowshare), making sure to take care of security.
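
As a rough sketch (the install path and layout are assumptions; adjust to your copy of Stata), the stripping and compressing might look like:

$ cp -r /usr/local/stata15 stata-min
$ cd stata-min
$ find . -type f \( -name '*.pdf' -o -name '*.sthlp' -o -name '*.ihlp' \
    -o -name '*.key' -o -name '*.dta' -o -name '*.mnu' -o -name '*.png' \
    -o -name '*.dlg' -o -name '*.idlg' \) -delete
$ cd .. && tar -cJf stata-min.tar.xz stata-min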

Now you can set up your .travis.yml to configure the machine.

  1. Extract the password needed for your Stata files (you can do this by installing the Travis client on your local machine and then using it to encrypt the password as an environment variable).
  2. Download and set-up the Stata files.
  3. Add the Stata folder to the PATH (it needs to be able to find the license file).
Finally, add commands to test your code:
  • Obviously, you already have lots of code to test your programs, right? :)
  • In order for Travis to know if there was a problem, the executable that runs the tests should return a non-zero error code upon failure. The Stata executable will not do this (!), but you can use a simple wrapper like statab.sh (a rough sketch of such a wrapper follows this list). If you download this, remember that you will have to make it executable.
  • The tests can be run either from a master do file, or be called independently from a script or makefile.
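
Here is a minimal sketch of what such a wrapper can do — run the do-file in batch mode and grep the log for a Stata error code. The actual statab.sh may differ, and the executable name (stata-se) and log location are assumptions:

#!/bin/bash
# run a do-file in batch mode and fail if the log shows a Stata error code
set -e
dofile="$1"
stata-se -e do "$dofile"
log="$(basename "$dofile" .do).log"
if grep -q '^r([0-9]*);' "$log"; then
    tail -n 20 "$log"
    exit 1
fi
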
The more I use Stata, the more valuable it becomes to me!

Thursday, June 11, 2015

Getting versions of Stata modules

For reproducible research it is important to be able to reproduce your analysis environment. For Stata analyses, one needs to list the versions of the Stata modules used. This is especially important as the SSC archive does not store previous versions of modules. Ideally, one stores the module files with the project, but it is still useful to have a record of the module versions. Stata modules are not required to list a version number (NB: this is separate from which -version- of Stata the module requests). Many modules do, and if so one can often find it in the text of the ado files (use -viewsource command.ado- for this). If not, one can use the version date. Modules on SSC are required to list their "distribution date". If you are installing modules from a version control server (e.g. GitHub) then the install date is sufficient. Stata keeps this information in a "stata.trk" file at the base of each module directory tree (e.g. PLUS), which depends on your adopath. To get this information you can use the following simple script.

# S = source URL, N = package name, D = install date, "d Distribution-Date" = the module's own version date
$ cat stata.trk | grep "^\([SND]\|d Distribution\)"

Wednesday, April 22, 2015

Code Management in Dropbox

Dropbox is not the ideal version control system, but some people do not want to make the leap to a full VCS (like git or svn, which full-time software developers use). Given that one should not make the perfect the enemy of the good, what simple tips can Stata-based economists follow that will make their lives easier? Here are some tips that will be helpful on projects that last a while (longer than short-term memory or the involvement of an assistant) or involve multiple people. As always, one should have the over-arching goal that the project be reproducible (at almost all points, you can follow an explicit recipe and produce all required outputs). While my tips will help with this, you may require some project-specific solutions, but it is well worth the effort.

  1. Backup: Structure your folders so that a minimal number of them can be used to reproduce everything in the project. A common approach is to have a code folder (which changes often) and a data/raw folder (which should almost never change). (This means all "manual" actions like cleaning should be embedded in code!) With this approach one can store a one-time backup of the raw data and then store periodic snapshots of the code folder. At key times (e.g. after every new draft or presentation) zip up the code folder and archive it untouched. Since all code is periodically backed up, you can get rid of no-longer-used code.
  2. Avoid editing conflicts: If multiple people are editing the same files here is a tip to reduce friction. If you start editing a file, make an innocuous edit and save it. Then, before editing a file, check the "last modified" time. If someone else edited the file recently, chat to negotiate access (your team should decide what "recently" means).
  3. Branching: If you want to make a series of changes that will leave a key file in an unusable state then you should make a "branch". You can do this (a) in Dropbox by duplicating the files (including unstable outputs) with modified names (have a convention here), or (b) by copying the project out of Dropbox for editing. Try to make changes small enough that they can be merged back into the main file quickly. The longer files diverge the harder this will be.
  4. Syncing: For intermediate files only needed internally by one piece of code, you can reduce syncing by creating those files in your `c(tmpdir)' (with -tempfile- or just named explicitly) rather than in your project folder.

Here are other simple general tips that can be applied to Dropbox:

  1. No one's full path should be written into the code. Code should be run assuming Stata is in the appropriate working directory (often the root of the project folder or the code subfolder). A nice benefit is that if the project folder is copied out of Dropbox it should still run easily. If you have machine-specific configuration details, those should be defined on the machine in environment variables and brought into Stata using -local env_var : environment ENV_VAR-
  2. Have a local code/ado folder where you store all the needed modules. In the header of your code you should include a script that sets code/ado as PERSONAL and then restricts your adopath to just PERSONAL;BASE (a rough sketch follows this list). You may want to -net set ado PERSONAL- and -mata: mata mlib index- also.
  3. Learn how to use a visual text diff/merge utility like the one that comes with TortoiseSVN. This will make it easy to compare files between editors and across time.
  4. If possible, do not use spaces in file names (use "_").
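
A rough sketch of such a header, assuming the do-file is run from the project root (the exact adopath entries you need to drop may differ by setup):

* point PERSONAL at the project's local ado folder
sysdir set PERSONAL "`c(pwd)'/code/ado/"
* strip the adopath down to just PERSONAL and BASE
adopath - PLUS
adopath - SITE
adopath - OLDPLACE
adopath - "."
* have -net install- and the mata library index use the local folder too
net set ado PERSONAL
mata: mata mlib index
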
What other tips do you all have?

Tuesday, January 20, 2015

Running Stata 12 on Ubuntu 14.04 (Trusty Tahr)

After upgrading my Linux machine I realized that I could no longer run my copy of the Stata 12 GUI. It gave the error:
./xstata-se: error while loading shared libraries: libgtksourceview-1.0.so.0: cannot open shared object file: No such file or directory
These older versions of libraries aren't in the normal Ubuntu repositories anymore, so here is a simple workaround to get Stata working again. (I'm on a 64-bit machine, so change amd64 as appropriate.) A rough shell transcription follows the steps.
  1. Install libgtksourceview2-2.0
  2. sudo ln -s /usr/lib/libgtksourceview-2.0.so.0 /usr/lib/libgtksourceview-1.0.so.0
  3. Download libgnomecups1.0-1 0.2.3-4 from here. Install it with dpkg.
  4. Download libgnomeprint2.2-data_2.18.8-3ubuntu1_all.deb and libgnomeprint2.2-0_2.18.8-3ubuntu1_amd64.deb from here. Install them at the same time with dpkg.
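
As a rough transcription of the above (the package and .deb file names follow the steps as written and may need adjusting for your system):

$ sudo apt-get install libgtksourceview2-2.0
$ sudo ln -s /usr/lib/libgtksourceview-2.0.so.0 /usr/lib/libgtksourceview-1.0.so.0
$ sudo dpkg -i libgnomecups1.0-1_0.2.3-4_amd64.deb
$ sudo dpkg -i libgnomeprint2.2-data_2.18.8-3ubuntu1_all.deb libgnomeprint2.2-0_2.18.8-3ubuntu1_amd64.deb
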
Happy estimating.

Tuesday, November 25, 2014

Graphs with log-scales in the positive and negative domains

Often, to show data that are highly dispersed, one will compress the data by graphing their log. One downside of this is that it only works if the data are all positive or all negative (using \(-\ln(-x)\)). If your data contain zero and/or points in both domains then you have to do something else. Here is a simple extension that uses a linear function around zero to smoothly connect a log function and its opposite: $$f(x)= \begin{cases} \ln(x) & \text{if }x>e\\ x/e & \text{if }-e\leq x\leq e\\ -\ln(-x) & \text{if }x<-e \end{cases}$$ The function is log-linear-log ("trilog").
You can get a simple Stata utility -trilog- from here to make this transformation and create axis labels.
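
As a hedged sketch of the transformation itself for a variable y (the actual -trilog- utility, which also handles axis labels, may differ):

* log-linear-log ("trilog") transform of y
gen double y_tri = cond(y >  exp(1),  ln(y),   ///
                   cond(y < -exp(1), -ln(-y),  ///
                                      y/exp(1)))
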
Another intuitive extension would be to shift the log and its opposite closer to zero, such as $$f(x)= \begin{cases} \ln(x+1) & \text{if }x\geq0\\ -\ln(-x+1) & \text{if }x<0 \end{cases}$$ The downside of this is that equal proportional changes are no longer reflected as equal distance changes.

Wednesday, November 19, 2014

Using make with Stata

Having a makefile helps automate a lot of tasks in a project.

  • Generating different formats of logs, tex tables, and gphs (including making versions of the figures that have no titles), and removing orphan files if the primary ones are removed.
  • Normalizing logs, gphs, and dtas prior to committing.
  • Generating PDFs of Lyx files.
  • Updating the project mlib when mata files change.
  • Installing updated versions of packages that have to be installed.
  • Running Stata in batch mode, knowing the dependencies between code files (and setting up a gateway file so that on Windows modules can run shell commands).
  • Dealing with SVN commands.
Here is a stub version. It uses statab.sh, cli_build_proj_mlib.do, gph2fmt.ado, cli_gph_eps.do, cli-install-module.do, cli_smcl_log.do (plus the normalizing ones).

Edit 2015-01-28: I've posted a project template with updated versions of these (and better makefiles as my skill improves) on GitHub.

Stata wishlist

Here's what I wish Stata would add:

  1. Primary output files (dtas and gphs) should be reproducible byte-for-byte. Primarily this requires being able to zero out the timestamps and any junk padding.
  2. Make PDF and PNG exporting of figures available on console Unix.
  3. Shell commands should work in Windows batch mode.
  4. All built-in commands should return values through the return classes (e.g. r()) so that they can be used programmatically.
  5. Allow the Windows do-editor to automatically word wrap (this is a main reason why people I know use other editors).
  6. The program should set a positive return code on error. When running Stata in batch mode the program always exits with 0 (success) even if the program had an error.
Hopefully, some of these will be available for version 1415.

Monday, October 13, 2014

Module to sort Stata matrix by column

I noticed recently that the module to sort Stata matrices by a column (matsort) incorrectly handles row names with spaces. This error is partly due to the limited Stata functions available for handling matrices prior to version 9 (when Mata was introduced). I've quickly made a bug-free replacement -matrixsort- that you can download from my GitHub repository.

Friday, October 10, 2014

Managing sets of files produced by different configurations

With statistical estimation you often run the same programs multiple times with slightly different options. If these programs produce file outputs you can have some trouble managing them all. Here are some tools I use.

Tracking sets

The first task is tracking the files produced. In general you want file names that include the options. Here, I employ two strategies depending on the work.

  1. For real runs, I usually want to collect the names of all files produced so that they can be checked and then stored/deleted/etc. Therefore I have wrapper functions for saving dtas, logs, graphs, tables, and text snippets that append the name of the file they are writing to a separate text file (a rough sketch of such a wrapper follows this list).
  2. A standard option that I always include is a "testing" switch. This is useful when I want to just test whether a small change causes an error. It does the bare minimum for a program (limits the number of observations, reduces the number of repetitions, etc.). It also sets a global extra_file_suffix="_testing" which is appended to all file names at the point of file writing (easier than passing a testing option through several layers of programs).
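
Here is a hypothetical sketch of such a wrapper for dta files (the program name, the tracking file, and the reliance on the extra_file_suffix global are all illustrative):

program define save_tracked
    * save a dataset, applying the testing suffix, and log the file name written
    syntax anything(name=fname) [, replace]
    local fullname `"`fname'${extra_file_suffix}"'
    save `"`fullname'"', `replace'
    tempname fh
    file open `fh' using "produced_files.txt", write append text
    file write `fh' `"`fullname'.dta"' _n
    file close `fh'
end

A call like -save_tracked output/results, replace- then both applies the suffix and records what was written.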

Manipulating sets

If you can build a list of files (either because they were saved as above or because you build one with find | grep) then here are some handy tools for dealing with them.

$ cat file_list | xargs rm # delete
# manage them via svn (similar options for git):
$ cat file_list | svn add --targets -
$ cat file_list | svn remove --keep-local --targets -
$ cat file_list | svn commit -m ""  --targets -

Noting the primary key in Stata files

Most tables in databases have a primary key defined, and this can be a help with Stata files too. If your primary key is a single variable then you can use xtset (or tsset if it's a time variable). If you have a composite key and one of the components is a time variable you can use xtset/tsset. Otherwise, you should have a consistent way of noting it. One way is to store it as a dta characteristic, such as:
. char _dta[key] keyvar(s)
See also isid.
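
As a small usage sketch (the variable names are illustrative), the characteristic can then be read back via macro substitution and checked with isid:

. char _dta[key] county year
. isid `_dta[key]'    // errors if county and year do not uniquely identify observations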

Thursday, August 28, 2014

Parallel processing in Stata: Example code

Here are some files from a presentation I gave about running parallel operations in Stata (beyond what Stata/MP does). They include a simple presentation covering the basic idea and example code for doing parallel bootstrap estimation. They use the parallel Stata module (I prefer the dev version).

Monday, August 25, 2014

Storing Stata project dependencies

Newer versions of an environment can break existing code, so it is often helpful to maintain access to the specific versions you use. For the Stata environment this is particularly important: the SSC archive doesn't store previous versions of modules, so you should store them in your project folder. To ensure that a project uses only the locally-stored Stata programs, I set the shell environment variable S_ADO to "\"<project_base>/code/ado/\";BASE".

The process is a bit more work if a module has machine-specific files (e.g. compiled plugins) and you want to allow your project to run on different platforms. If you're working across platforms you should have your code stored in some kind of version control repository (e.g. Subversion or git). For modules with machine-specific files you can't store the installed files in the repo since they differ between machines. Instead you store the installation files in the repo and then do a local install on each machine. To store the installation files locally, find the URL of the pkg files (e.g. http://fmwww.bc.edu/repec/bocode/s/synth.pkg) and use store-mod-install-files.sh. To install locally, point -net install- at the locally stored files with its -from()- option.
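
A hedged sketch of that local install (the folder where the installation files are kept is an assumption of this example):

* install synth from installation files stored inside the project
net install synth, from("`c(pwd)'/code/ado_src/s") replace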

A more general solution will reinstall the files when the installation files are updated. For this we use make and generate the makefile. The file dependencies are stored in the pkg files, so you can use gen-makefile.sh (along with statab.sh and cli-install-module.do) to scan for such files, extract the dependencies, and generate the makefile. You will then be able to type:
$ make all_modules
to update/install all the modules.

For the paranoid: The Stata program itself is closed-source, so specific versions may be unavailable in the future. Stata states that you can use the version command to enable dated interpretation in newer versions. If you are not satisfied by that, store the installation files yourself (preferably for an open-source system like Linux).

Monday, March 17, 2014

Stata's serset file format

Sersets are minimal versions of datasets in Stata. They are commonly embedded in .gph graphics files to store the data so that graphs can be reproduced by Stata. Sersets are sometimes also written directly to files. The file format is undocumented by Stata but very similar to Stata dta formats like v115 and previous ones. If you need to access the data in a serset file or a .gph then here is the basic format. The below assumes that you are generally familiar with the dta v115 standard. Let nvar be the number of variables and nobs the number of observations. The length of each block is in brackets.

  1. [16] "sersetreadwrite" (null terminated).
  2. [1] Either 0x2 for versions 11-13 (gph format 3) or 0x3 for version 14 (gph format 4).
  3. [1] I think this is a byte-order field, so 0x2 for lohi (standard Windows/Linux machines), otherwise 0x1.
  4. [4] Number of variables.
  5. [4] Number of observations.
  6. [nvar] The typlist. A byte characterizing the type of each variable.
  7. [nvar x 54] The varlist. Null-terminated strings of the variable names. For versions >=14 this is nvar x 150.
  8. [nvar x 49] The fmtlist. Null-terminated strings of the variable formats. For versions >=14 this is nvar x 57.
  9. [nvar x 8] The maximums. A double for each variable giving that variable's non-missing maximum. If the variable is a string then the double's missing value is used.
  10. [nvar x 8] The minimums. Same as above, except for the minimums.
  11. [nobs x sum(sizeof each variable type)] Data: all data from an observation are contiguous.
Edit 2016-01-31: Figured out more about the bytes immediately after "sersetreadwrite" and about how version 14 is different.

Thursday, September 05, 2013

Modifying Stata library routines

I recently couldn't work out why some error messages were being generated by Mata's optimize(). I found that it was quite easy to modify the library routines to include some debugging print-outs. Broadly, we want to create modified routines that mask the existing ones. Here are the basic steps (a rough sketch follows):

  1. Find the mata source file (in my case deriv.mata) in the Stata folder, copy it to your own directory, and make your changes.
  2. Turn it into a personal library. See the basic template at -help mata_mlib-.
  3. Have the personal library called instead of the built-in routines. The -mata mlib index- command from the previous step outputs the order in which the libraries are searched for a function.
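
A hedged sketch of these steps (the library name is illustrative, and the edits from step 1 are assumed already made to the local copy of deriv.mata):

* compile the modified routines into memory
do deriv.mata
* build a personal Mata library from them (check -mata mlib index- for the search order)
mata: mata mlib create lmydebug, dir(PERSONAL) replace
mata: mata mlib add lmydebug *()
mata: mata mlib index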

Enjoy your modified version. When you're all done you can use -mata mlib index- again to get the default ordering.