Friday, April 22, 2022

Alleviating bias from testing pre-trends in event study designs

Jonathan Roth has a recent paper, Pre-test with Caution: Event-study Estimates After Testing for Parallel Trends, dealing with event study designs that rely on the "parallel trends" assumption (i.e., in the absence of treatment, the treated units would have moved parallel to the control units). He makes the case that (1) many tests for parallel trends in the pre-treatment period in the literature have low power, and (2) there are often correlations between the coefficients in the pre-period (where you test for parallel trends) and the post-period (where you estimate treatment effects). Choosing to move forward with the analysis only when you pass tests on the former can then bias the latter. This is a subtle case of the common problem of data-dependent decision making that affects many modelling tasks.

In a follow-up paper, Roth and Ashesh Rambachan advocate that the analyst look at the range of plausible deviations from parallel pre-treatment trends to form bounds on the treatment effect estimates. This is a very sensible suggestion, though it can be a hard case to make if people think the bounds will be too wide and few others use them.

From my own work in machine learning, my natural reaction to data-dependent analysis is to think about sample splitting. The standard idea of splitting treated and control units into "test assumptions" and "effect estimation" subsamples would work, but it would worsen the power problem. Another approach is possible once you see why the separate time coefficients are correlated. In a model with i.i.d. data, it is because each time coefficient is an offset from a base period, so the data from the base period enter all the temporal coefficients. The correlations can therefore be alleviated by separating the samples in time. One would use all but one period of the pre-treatment data to test for pre-trend differences (at a small cost in power) and use the remaining periods for effect estimation. Now the two sets of estimates use different base periods and so are no longer correlated. There is still a problem if there is serial correlation in the outcome after accounting for controls (e.g., serial correlation in the error), though it will be smaller than in the naive analysis because you are increasing the gap between the two samples.
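To see the mechanics, here is a small Monte Carlo sketch in Python. It uses simple difference-in-means event-study coefficients with i.i.d. noise (an illustration of the base-period logic, not Roth's actual procedure): sharing a base period induces correlation between pre- and post-period coefficients, while giving each sample its own base period removes it.

```python
import numpy as np

# Each draw: i.i.d. treated-minus-control gaps for periods t = -4, ..., 0.
# An event-study coefficient for period t is gap[t] - gap[base].
rng = np.random.default_rng(0)
gaps = rng.normal(size=(20000, 5))  # columns: t = -4, -3, -2, -1, 0

# Naive analysis: one base period (t = -1) for everything.
pre_shared = gaps[:, 1] - gaps[:, 3]    # coef for t = -3, base t = -1
post_shared = gaps[:, 4] - gaps[:, 3]   # coef for t =  0, base t = -1
corr_shared = np.corrcoef(pre_shared, post_shared)[0, 1]

# Split in time: the pre-trend test uses base t = -4;
# effect estimation uses base t = -1.
pre_split = gaps[:, 1] - gaps[:, 0]     # coef for t = -3, base t = -4
post_split = gaps[:, 4] - gaps[:, 3]    # coef for t =  0, base t = -1
corr_split = np.corrcoef(pre_split, post_split)[0, 1]

# corr_shared comes out around 0.5 (both coefficients contain -gap[-1]);
# corr_split comes out around 0.
```

With i.i.d. noise the shared-base correlation is exactly 0.5 in theory (the two coefficients each have variance 2 and covariance 1 through the common base term); with serial correlation in the errors the split version would retain some residual correlation, as noted above.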

The longer version of the pre-trends paper actually has fancier modifications (Appendix B) that try to reduce the bias even without sample splitting. Maybe these will see some take-up.

In the end, Roth's papers force one to remember that, fundamentally, parallel trends is an untestable assumption. There often happens to be some related data that can be helpful, but we should still think carefully about it like we would other assumptions (e.g., exclusion restrictions).


Monday, August 30, 2021

Cross-fitting in Stata

Cross-fitting is a method to produce unbiased ("honest") predictions/residuals from a model. It is commonly used these days with machine learning models, since their built-in flexibility makes in-sample predictions overfit. Cross-fitting can be seen as the initial part of cross-validation: the data are split into K folds, and predictions for fold k are made using a model trained on all data but fold k. When K equals the sample size, this generates jackknife predictions (whose residuals are approximately unbiased).
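The procedure itself is only a few lines. Here is a minimal Python analogue using plain OLS (an illustration of the idea, not the package's internals; the toy data are made up to mirror the reg price mpg example below):

```python
import numpy as np

def crossfit_predict(X, y, k=5, seed=0):
    """Out-of-fold predictions from OLS: each fold is predicted
    by a model fit on all the *other* folds."""
    n = len(y)
    rng = np.random.default_rng(seed)
    folds = rng.permutation(n) % k           # random fold assignment
    Xc = np.column_stack([np.ones(n), X])    # add an intercept
    yhat = np.empty(n)
    for i in range(k):
        train, test = folds != i, folds == i
        beta, *_ = np.linalg.lstsq(Xc[train], y[train], rcond=None)
        yhat[test] = Xc[test] @ beta
    return yhat

# Toy data loosely mimicking `reg price mpg`
rng = np.random.default_rng(1)
mpg = rng.uniform(10, 40, 200)
price = 12000 - 200 * mpg + rng.normal(0, 500, 200)
price_hat_oos = crossfit_predict(mpg, price, k=5)
resid = price - price_hat_oos  # honest residuals
```

Because each observation is predicted by a model that never saw it, the residuals are not mechanically shrunk the way in-sample residuals are.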

My crossfit command encapsulates both the fit and prediction steps and generates either predictions or residuals (if you specify the outcome). You can install it from my Stata-modules GitHub repo.


//net install crossfit, from(https://raw.github.com/bquistorff/Stata-modules/master/c/) replace
sysuse auto
crossfit price_hat_oos, k(5): reg price mpg
crossfit price_resid_oos, k(5) outcome(price): reg price mpg
ereturn list //additionally get MSE, MAE, R2

The crossfold package does almost everything needed, except that it's non-trivial to get the predictions/residuals as a new variable (especially when there's a global 'if' clause). Maybe one day we should merge!

Random Forest implementation in Stata

I recently needed a Random Forest implementation on a slightly older version of Stata and found the choices quite lacking. 

  1. crtrees (deck) is a Stata-native implementation, but I ran into confusing errors when running it. I tried to fix them, but the code looks like it has been sent through a code obfuscator!
  2. Stata's native interface with Python wasn't available to me since I was using Stata 15.
  3. rforest is a binding to a Java implementation. I couldn't install Java, and since most machine learning these days happens in R or Python, Java is an odd language choice today.

I then stumbled upon rcall, which allows calling R programs from Stata. R was available on my platform, so I made a simple Stata binding to the fast ranger package in R. You can install it from my Stata-modules GitHub repo. An R program is spun up and all the work is done with a single command, so it encapsulates both the fit and prediction steps (either standard or out-of-bag predictions).


//net install ranger, from(https://raw.github.com/bquistorff/Stata-modules/master/r/) replace
sysuse auto
ranger price mpg turn, predict(price_hat_oos)



Thursday, July 29, 2021

Getting Stata automation working without administrative privileges

"You have no power here!"

If you're using Stata on Windows, Stata Automation is really powerful. It allows you to efficiently use Jupyter notebooks via stata_kernel (or IPyStata's automation interface). The "installation", however, typically requires admin rights. If you have a non-admin user, here's how to do it.

Installation consists of registering the Stata type library in the Windows Registry. The default process adds entries under the HKEY_CLASSES_ROOT root, which is not user-editable. With a bit of tweaking, however, these entries work when installed under HKEY_CURRENT_USER, which is user-editable. Here is my modified reg entry for Stata 15. It worked for both the SE and MP flavors, and possibly works for other versions. You can copy it to the target system, edit the Stata path if different, rename it from .txt to .reg, and import it using regedit.exe.[1]
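The gist of the tweak is that Windows merges HKEY_CURRENT_USER\Software\Classes into the class view, so entries written under HKEY_CLASSES_ROOT can instead be placed there. A hypothetical before/after (the GUID and key names here are placeholders, not the real Stata entries):

```
; Before: written by /Register, requires admin
[HKEY_CLASSES_ROOT\stata.StataOLEApp\CLSID]
@="{00000000-0000-0000-0000-000000000000}"

; After: per-user equivalent, no admin needed
[HKEY_CURRENT_USER\Software\Classes\stata.StataOLEApp\CLSID]
@="{00000000-0000-0000-0000-000000000000}"
```

The same prefix substitution is applied to each key in the exported .reg file.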

Here are the steps I took, in case one needs to replicate the procedure (e.g., if the above does not work for you). I did the registration on a system where I did have admin rights and logged what was added to the registry using NirSoft's RegFromApp:

RegFromApp.exe /AutoSave "modified.reg" "original.reg" /RunProcess "C:\Program Files (x86)\Stata15\StataMP-64.exe" /ProcessParams "/Register"

Then I followed these instructions to modify modified.reg.[2] Hope that helps!

[1] If you don't have access to Regedit then you can try C:\Windows\System32\reg.exe import modified.reg.

[2] Note that when I registered, unregistered, and re-registered, the second registration was missing some keys in HKCU\SOFTWARE\Classes\: AppID\stata.EXE, AppID\<GUID from previous>, stata.StataOLEApp.1, stata.StataOLEApp.1\CLSID, stata.StataOLEApp, stata.StataOLEApp\CLSID, stata.StataOLEApp\CurVer.

Saturday, July 07, 2018

A safer way to allow gambling

With the recent decision by the Supreme Court to allow states to legalize sports betting, states should think through possible ways to legalize gambling while reducing the negative consequences, such as gambling addiction. I think one of the keys is separating gambling from temptation as much as possible. We should worry less about someone who has planned a gambling trip with a known monetary limit than about someone who gets caught up in the excitement of gambling, loses too much, and tries irrationally to win it back by gambling more.

I propose a mechanism that enforces planning and limits. Gambling regulation would create special-purpose gambling accounts where people could deposit money. Casinos and other gambling establishments would not accept cash; instead, they could only accept money from these accounts, and only money that had been in the account for a certain minimum amount of time (e.g., five days). A mechanism would be needed to prevent people from trading chips for cash inside the casino, but this is likely possible with cash cards, authentication, and oversight. There have been some attempts along these lines: several countries have instituted something similar for electronic gambling. More research would obviously be best.
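The core bookkeeping for such an account is simple. A toy sketch (the class, its interface, and the five-day figure are all illustrative; a real system would also track spend-downs against aged funds):

```python
from datetime import date, timedelta

HOLDING_DAYS = 5  # illustrative minimum "cooling-off" period

class GamblingAccount:
    """Sketch of the proposed account: deposits only become spendable
    at a casino once they have aged HOLDING_DAYS in the account."""

    def __init__(self):
        self.deposits = []  # list of (deposit_date, amount) pairs

    def deposit(self, amount, on):
        self.deposits.append((on, amount))

    def available(self, today):
        """Total funds deposited at least HOLDING_DAYS ago."""
        cutoff = today - timedelta(days=HOLDING_DAYS)
        return sum(amt for d, amt in self.deposits if d <= cutoff)

acct = GamblingAccount()
acct.deposit(200, on=date(2018, 7, 1))
acct.deposit(300, on=date(2018, 7, 6))
# On July 7 only the July 1 deposit has aged five days,
# so only 200 is spendable; by July 11 all 500 is.
avail = acct.available(date(2018, 7, 7))
```

The point of the delay is that a gambler chasing losses tonight cannot convert fresh money into chips until days later, when the heat of the moment has passed.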

Obviously, casinos prefer unrestricted gambling, but they may support this type of regulation if it is the only way to operate in a state. Likely the biggest roadblock would be entrenched interests, such as state lotteries and localities (e.g., Native American reservations) that have already legalized gambling. Overall, if we can drastically reduce the negative consequences of gambling in the legal/main market, then we should try to move people away from worse venues, such as the black market, given that many people don't always operate rationally.

Thursday, February 08, 2018

FinTech: Service to optimize credit card payments

Many people carry credit balances across multiple cards. The best general strategy for allocating a given amount toward monthly payments is to make the minimum payment on each card and then allocate the remainder to the card with the highest interest rate. Multiple studies have shown, however, that people do not do this, but instead follow simple heuristics such as dividing payments in proportion to each card's balance. This suggests that a FinTech startup could offer a service that manages payments optimally. It could incorporate more complex concerns like not maxing out certain cards, being smart about card-specific rewards/charges, and managing a family portfolio. Also, consumers have high levels of anxiety about debt, so a comforting intermediary might be preferred.
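The optimal rule (often called the "avalanche" method) is easy to state in code. Here is a minimal sketch with a made-up card schema (the field names and figures are illustrative, not any real API):

```python
def allocate_payment(cards, budget):
    """Minimum payments first, then the remainder to the highest-APR
    card, spilling over to the next-highest once a balance is cleared.
    Each card is a dict with name, balance, apr, min_payment."""
    payments = {c["name"]: min(c["min_payment"], c["balance"]) for c in cards}
    remaining = max(0, budget - sum(payments.values()))
    for c in sorted(cards, key=lambda c: c["apr"], reverse=True):
        extra = min(remaining, c["balance"] - payments[c["name"]])
        payments[c["name"]] += extra
        remaining -= extra
    return payments

cards = [
    {"name": "A", "balance": 3000, "apr": 0.24, "min_payment": 60},
    {"name": "B", "balance": 5000, "apr": 0.15, "min_payment": 100},
]
pay = allocate_payment(cards, 500)
# All 340 of the post-minimum remainder goes to card A (24% APR):
# pay == {"A": 400, "B": 100}
```

Contrast this with the balance-proportional heuristic from the studies, which would send most of the 500 to card B despite its lower rate.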

Sunday, November 12, 2017

Blinded voters guides

A common practice when hiring is to "blind" the resumes of candidates at initial evaluation stages so that attributes such as race/ethnicity, gender, and age do not unduly influence the initial process. I would like a similar "screen" for voters guides for local elections where I may be unfamiliar with the candidates. The guide would present candidates as unnamed #1, #2, ..., and then give the pros/cons (e.g., policy stances, competence, etc.) for each. The viewer could read the information and then later choose to reveal the names (and associated demographics) of the candidates.

Monday, September 12, 2016

Guidelines for making a Stata package

Here are some guidelines for making a well behaved Stata package (ado):
  1. Provide all relevant output programmatically, not just textually. Another package may want to work with yours.
  2. Provide a way to check the version of your package.
  3. Use version (so your code runs under the interpretation rules it was written for).
  4. The package file should have a line like 'd Distribution-Date: 01jan2000' so that the package can be updated from adoupdate.
  5. Make clear in the help file what side effects your program may have (globals, characteristics, mata objects, scalars, incrementing of the RNG state by using randomization functions, files left around). Cleanup after yourself if you can, especially after errors (use preserve, tempfiles/tempvars/tempnames, and capture blocks with cleanup sections).
  6. List your package dependencies. You can do this in the package file, but you should also mention this in the help file (and say if you autoinstall dependencies). 
  7. If you capture the output from a long running process, make sure to allow for exit if the break key was pressed (check _rc==1).
  8. Use _assert. It makes error checking code much nicer to read.
  9. Provide tests (such as static source-code checks and unit tests) with good coverage. Relatedly, your help material should include useful examples.
  10. Put a brief description/authors in *! initial lines above the command in the ado file (used by which). Don't make this a full changelog, put that in a changelog file.
  11. If you use compiled plugins or mata mlibs, provide the source code. This can help users determine solutions to errors.
  12. If the package is estimating statistical routines then include references for the procedure and make very explicit details of your algorithm.
  13. Make sure your program works when either the current directory or the temp directory contains spaces.
Extra-special niceties:
  1. Have an option so that relevant output (such as displayed text or saved files) is not dependent on machine-specific characteristics (e.g. directory, # processors, speed/time). This allows for log files to be compared across runs to check for real differences using text comparison methods.
  2. Provide a way to access previous versions of the package and a place to post errors. You can easily do this by hosting your project on a development website like GitHub.
  3. Provide a way to cite the programs.
  4. Provide a centralized place for listing issues/bug reports/enhancement requests/etc. (e.g., at GitHub).
  5. If you provide a web-based development site (e.g. at GitHub) provide a version of your help in HTML (see the log2html Stata package).