Monday, August 30, 2021

Cross-fitting in Stata

Cross-fitting is a method for producing unbiased ("honest") predictions or residuals from a model. It is commonly used these days with Machine Learning models, since they typically have flexibility and bias built in, which makes their in-sample predictions too optimistic. Cross-fitting can be seen as the first part of cross-validation: the data are split into K folds, and the predictions for fold k come from a model trained on all the data except fold k. When K equals the sample size, this yields jackknife (leave-one-out) predictions, whose residuals are approximately unbiased.
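To make the fold logic concrete, here is a minimal by-hand sketch of 5-fold cross-fitting on the auto data. It only illustrates the idea and is not necessarily how any packaged command implements it; the variable names (u, fold, price_hat_manual, price_resid_manual) and the seed are made up for this example.

```stata
* Manual 5-fold cross-fitting sketch (illustrative only)
sysuse auto, clear
set seed 12345                               // arbitrary seed, for a reproducible split
generate double u = runiform()
sort u                                       // shuffle the observations
generate byte fold = mod(_n - 1, 5) + 1      // assign folds 1..5 in near-equal sizes
generate double price_hat_manual = .

forvalues k = 1/5 {
    * Fit on every fold except k
    quietly regress price mpg if fold != `k'
    * Predict only for the held-out fold k
    tempvar p
    quietly predict double `p' if fold == `k', xb
    quietly replace price_hat_manual = `p' if fold == `k'
    drop `p'
}

* Out-of-sample residuals
generate double price_resid_manual = price - price_hat_manual
summarize price_hat_manual price_resid_manual
```

Each observation's prediction comes from a regression that never saw that observation, which is what makes the resulting residuals "honest".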

You can install the crossfit command from my Stata-modules GitHub repo. It encapsulates both the fit and prediction steps and generates either predictions or, if you specify the outcome, residuals.


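// Install once (uncomment to run):
//net install crossfit, from(https://raw.github.com/bquistorff/Stata-modules/master/c/) replace
sysuse auto
// Out-of-sample predictions from 5-fold cross-fitting
crossfit price_hat_oos, k(5): reg price mpg
// Out-of-sample residuals (requires specifying the outcome)
crossfit price_resid_oos, k(5) outcome(price): reg price mpg
ereturn list // additionally get MSE, MAE, R2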

The crossfold package does almost everything needed, except that it's non-trivial to get the predictions/residuals as a new variable (especially when there's a global 'if' clause). Maybe one day we should merge!
