Tuesday, November 25, 2014

Graphs with log-scales in the positive and negative domains

Highly dispersed data are often compressed for display by graphing their log. One downside is that this only works if the data are all positive or all negative (in which case you can use \(-\ln(-x)\)). If your data contain zero or points in both domains, then you have to do something else. Here is a simple extension that uses a linear function around zero to smoothly connect a log function and its opposite. $$f(x)= \begin{cases} \ln(x) & \text{if }x>e\\ x/e & \text{if }-e\leq x\leq e\\ -\ln(-x) & \text{if }x<-e \end{cases}$$ At \(x=\pm e\) the pieces agree in both value (\(\pm 1\)) and slope (\(1/e\)), so the joins are smooth. The function is log-linear-log ("trilog").
You can get a simple Stata utility -trilog- from here to make this transformation and create axis labels.
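For readers who want to try the transformation outside Stata, here is a minimal sketch of the piecewise definition above in Python. It is not the -trilog- utility itself, and the function name `trilog` is only borrowed here for illustration.

```python
import math


def trilog(x: float) -> float:
    """Log-linear-log transform: ln(x) above e, x/e on [-e, e], -ln(-x) below -e."""
    if x > math.e:
        return math.log(x)
    if x < -math.e:
        return -math.log(-x)
    return x / math.e


# Transform a few values spanning both domains and zero
for value in (-1000.0, -10.0, -1.0, 0.0, 1.0, 10.0, 1000.0):
    print(f"{value:>8} -> {trilog(value):.3f}")
```

Outside the linear band the transform is an ordinary log, so equal proportional changes there still map to equal distances on the axis.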
Another intuitive extension would be to shift the log and its opposite closer to zero, such as $$f(x)= \begin{cases} \ln(x+1) & \text{if }x\geq0\\ -\ln(-x+1) & \text{if }x<0 \end{cases}$$ The downside of this is that equal proportional changes are no longer reflected as equal distances.
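The downside of the shifted version can be checked numerically. The sketch below (in Python, with the hypothetical name `signed_log1p`) implements the shifted transform and measures how far a doubling of x moves the transformed value at different magnitudes.

```python
import math


def signed_log1p(x: float) -> float:
    """ln(x + 1) for x >= 0 and -ln(-x + 1) for x < 0."""
    return math.copysign(math.log1p(abs(x)), x)


# Each step doubles x, yet the distance between transformed values differs
steps = [signed_log1p(2.0 * v) - signed_log1p(v) for v in (0.5, 5.0, 50.0)]
for v, step in zip((0.5, 5.0, 50.0), steps):
    print(f"doubling {v:>4} moves the transformed value by {step:.3f}")
```

Under a plain log every doubling would move the value by ln(2), about 0.693. Here the step is smaller near zero and only approaches ln(2) as x grows.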

Wednesday, November 19, 2014

Using make with Stata

Having a makefile helps automate a lot of tasks with a project.

  • Generating different formats of logs, tex tables, and gphs (including making versions of the figs that have no titles). And removing orphan files if the primary ones are removed.
  • Normalizing logs, gphs, and dtas prior to committing.
  • Generating PDFs of Lyx files.
  • Updating the project mlib when mata files change.
  • Installing updated versions of packages that have to be installed.
  • Running Stata in batch mode, knowing the dependencies between code files (and setting up a gateway file so that on Windows modules can run shell commands).
  • Handling SVN commands.
Here is a stub version. It uses statab.sh, cli_build_proj_mlib.do, gph2fmt.ado, cli_gph_eps.do, cli-install-module.do, and cli_smcl_log.do (plus the normalizing ones).

Edit 2015-01-28: I've posted a project template with updated versions of these (and better makefiles as my skill improves) at GitHub

Version control for research projects

While I had used version control on earlier projects, I didn't start using version control for collaborative research projects until reading Code and Data for the Social Sciences: A Practitioner’s Guide by Matthew Gentzkow and Jesse M. Shapiro. If you haven't read it, it's a good read (I agree with the general guidelines and most of the specifics).

The first decision is which files to version. I version the dta, gph, tex, and log files. I chose not to version easily generated files, such as the different output formats of figures and tables, and instead generate them automatically using my Makefile. I normalize the dta, gph, and log files before committing them so that changes are noted only if real content has changed.

Some miscellaneous tools: rm-non-svn.sh, svn_batch_rename.sh.

Stata wishlist

Here's what I wish Stata would add:

  1. Primary output files (dtas and gphs) should be able to be reproduced byte-for-byte. Primarily this requires being able to zero-out the timestamps and zero-out any junk padding.
  2. Make PDF and PNG exporting of figures available on console Unix.
  3. Shell commands should work in Windows batch-mode.
  4. All built-in commands should return values through the return classes (e.g. r()) so that they can be used programmatically.
  5. Also, allow the Windows do-editor to automatically word wrap (this is a main reason why people I know use other editors).
  6. The program should set a nonzero return code on error. Currently, when running Stata in batch mode, the program always exits with 0 (success) even if the program had an error.
Hopefully, some of these will be available for version 14 (or 15).
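Until item 6 is available, one workaround is to have a wrapper inspect the batch log after Stata exits. The sketch below is in Python and is not the author's statab.sh. It assumes that a failed do-file leaves a return-code line such as `r(601);` on its own line in the log, and the function name `stata_return_code` is hypothetical.

```python
import re

# Assumption: Stata writes the error code alone on a line, e.g. "r(601);"
ERROR_LINE = re.compile(r"^r\((\d+)\);\s*$", re.MULTILINE)


def stata_return_code(log_text: str) -> int:
    """Return the last Stata error code found in the log text, or 0 if none."""
    codes = ERROR_LINE.findall(log_text)
    return int(codes[-1]) if codes else 0


# Illustrative log tail (made up for this example)
sample_log = "file missing_file.dta not found\nr(601);\n\nend of do-file\nr(601);\n"
print(stata_return_code(sample_log))
```

A calling script could then exit with 1 whenever the result is nonzero. Passing the Stata code through directly is risky because shell exit statuses are limited to 0 through 255.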