Monday, August 30, 2021

Cross-fitting in Stata

Cross-fitting is a method for producing unbiased (honest) predictions and residuals from a model. It is commonly used these days with machine learning models, since they typically have flexibility and bias built in, so their in-sample predictions can be misleading. Cross-fitting can be seen as the first part of cross-validation: the data are split into K folds, and the predictions for fold k come from a model trained on all the data except fold k. When K equals the sample size, this yields jackknife (leave-one-out) predictions, whose residuals are approximately unbiased.

The crossfit command does this; you can install it from my Stata-modules GitHub repo. It encapsulates both the fit and prediction steps and generates either predictions or, if you specify the outcome, residuals.


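// One-time install (uncomment to run):
//net install crossfit, from(https://raw.github.com/bquistorff/Stata-modules/master/c/) replace
sysuse auto
// Out-of-fold predictions from a 5-fold cross-fit
crossfit price_hat_oos, k(5): reg price mpg
// Out-of-fold residuals: also pass the outcome variable
crossfit price_resid_oos, k(5) outcome(price): reg price mpg
ereturn list //additionally get MSE, MAE, R2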

The crossfold package does almost everything needed, except that it's non-trivial to get the predictions/residuals as a new variable (especially when there's a global 'if' clause). Maybe one day we should merge!
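For intuition, the fold-by-fold procedure described above can be sketched by hand with built-in Stata commands. This is only an illustration of the idea, not the actual code inside crossfit; the seed and the names u, fold, fold_pred, price_hat_manual and price_resid_manual are made up for the example.

```stata
// Manual K-fold cross-fitting sketch (illustrative names, not crossfit's internals)
sysuse auto, clear
set seed 12345
local K 5

// Randomly assign each observation to one of K (nearly) equal-sized folds
generate double u = runiform()
sort u
generate byte fold = mod(_n - 1, `K') + 1

// Predict each fold from a model fit on the other K-1 folds
generate double price_hat_manual = .
forvalues k = 1/`K' {
    quietly regress price mpg if fold != `k'
    quietly predict double fold_pred if fold == `k'
    quietly replace price_hat_manual = fold_pred if fold == `k'
    drop fold_pred
}

// Out-of-fold residuals
generate double price_resid_manual = price - price_hat_manual
summarize price_hat_manual price_resid_manual
```

Every observation gets exactly one prediction, and it always comes from a regression that never saw that observation. Setting K to the number of observations would give the leave-one-out case.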

Random Forest implementation in Stata

I recently needed a Random Forest implementation for a slightly older version of Stata and found the choices quite lacking:

  1. crtrees (deck) is a Stata-native implementation, but I ran into confusing errors when running it. I tried to fix them, but the code looks like it has been sent through a code obfuscator!
  2. Stata's native interface with Python wasn't available to me since I was using Stata 15.
  3. rforest is a binding to a Java implementation. I couldn't install Java, and since most machine learning these days happens in R or Python, Java is a very odd language choice today.

I then stumbled upon rcall, which allows calling R programs from Stata. R was available on my platform, so I made a simple Stata binding to the fast ranger package in R. You can install it from my Stata-modules GitHub repo. A single command spins up an R process and does all the work, so it encapsulates both the fit and prediction steps (either standard or out-of-bag predictions).


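// One-time install (uncomment to run):
//net install ranger, from(https://raw.github.com/bquistorff/Stata-modules/master/r/) replace
sysuse auto
// Fit a random forest of price on mpg and turn; store the predictions in price_hat_oos
ranger price mpg turn, predict(price_hat_oos)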
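Once the predictions are in a variable, a quick accuracy check needs only built-in Stata commands. The sketch below assumes the ranger module above is installed and R is available, and that price_hat_oos holds out-of-bag predictions as its name suggests; rf_sq_err is an illustrative name.

```stata
// Assumes the ranger module from this post is installed and R is available
sysuse auto, clear
ranger price mpg turn, predict(price_hat_oos)

// Root mean squared error of the predictions (rf_sq_err is an illustrative name)
generate double rf_sq_err = (price - price_hat_oos)^2
quietly summarize rf_sq_err, meanonly
display "RMSE: " sqrt(r(mean))

// Squared correlation between actual and predicted price
quietly correlate price price_hat_oos
display "Squared correlation: " r(rho)^2
```

If the predictions are instead the standard in-sample ones, these numbers will look better than the forest's true out-of-sample performance.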