Tuesday, November 25, 2014

Graphs with log-scales in the positive and negative domains

Often, to show highly dispersed data, one will compress it by graphing its log. One downside of this is that it only works if the data are all positive, or all negative (if you use \(-\ln(-x)\)). If your data contain zero and/or points in both domains then you have to do something else. Here is a simple extension that uses a linear function around zero to smoothly connect a log function and its opposite: $$f(x)= \begin{cases} \ln(x) & \text{if }x>e\\ x/e & \text{if }-e\leq x\leq e\\ -\ln(-x) & \text{if }x<-e \end{cases}$$ The pieces match in both value and slope at \(x=\pm e\) (each side equals \(\pm 1\) with slope \(1/e\)), so the function is log-linear-log ("trilog").
You can get a simple Stata utility -trilog- from here to make this transformation and create axis labels.
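For intuition, here is a minimal sketch of the transformation itself in Stata (not the actual -trilog- code; x and y are illustrative variable names):
. gen double y = cond(x > exp(1), ln(x), cond(x < -exp(1), -ln(-x), x/exp(1)))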
Another intuitive extension would be to shift the log and its opposite toward zero, such as $$g(x)= \begin{cases} \ln(x+1) & \text{if }x\geq0\\ -\ln(1-x) & \text{if }x<0 \end{cases}$$ The downside of this is that equal proportional changes are no longer reflected as equal distances.

Wednesday, November 19, 2014

Using make with Stata

Having a makefile helps automate a lot of tasks in a project.

  • Generating different formats of logs, tex tables, and gphs (including versions of the figures without titles), and removing orphan files when the primary ones are removed.
  • Normalizing logs, gphs, and dtas prior to committing.
  • Generating PDFs of Lyx files.
  • Updating the project mlib when mata files change.
  • Installing updated versions of packages that have to be installed.
  • Running Stata in batch mode, knowing the dependencies between code files (and setting up a gateway file so that on Windows modules can run shell commands).
  • Dealing with SVN commands.
Here is a stub version. It uses statab.sh, cli_build_proj_mlib.do, gph2fmt.ado, cli_gph_eps.do, cli-install-module.do, and cli_smcl_log.do (plus the normalizing ones).
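To give a flavor, here is a minimal sketch of one such rule (the paths and the statab.sh interface are illustrative; the real stub is more complete):

LOGS := $(patsubst %.do,%.log,$(wildcard code/*.do))

%.log: %.do
	./statab.sh $<   # re-run a do-file in batch mode when it changes

logs: $(LOGS)
.PHONY: logs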

Edit 2015-01-28: I've posted a project template with updated versions of these (and better makefiles as my skill improves) at GitHub

Version control for research projects

While I had used version control on earlier projects, I didn't start using version control for collaborative research projects until reading Code and Data for the Social Sciences: A Practitioner’s Guide by Matthew Gentzkow and Jesse M. Shapiro. If you haven't read it, it's a good read (I agree with the general guidelines and most of the specifics).

The first decision is which files to version. I version the dta, gph, tex, and log files.  I chose not to version easily generated files such as different formats of outputted figures and tables and instead generate them automatically using my Makefile. I normalize the dta, gph, and log files before committing them so that changes are noted only if real content has changed.

Some miscellaneous tools: rm-non-svn.sh, svn_batch_rename.sh.


Stata wishlist

Here's what I wish Stata would add:

  1. Primary output files (dtas and gphs) should be reproducible byte-for-byte. Primarily this requires being able to zero out the timestamps and any junk padding.
  2. Make PDF and PNG exporting of figures available on console Unix.
  3. Shell commands should work in Windows batch mode.
  4. All built-in commands should return values through the return classes (e.g. r()) so that they can be used programmatically.
  5. Allow the Windows do-file editor to word-wrap automatically (this is a main reason why people I know use other editors).
  6. The program should exit with a nonzero return code on error. When running Stata in batch mode, it always exits with 0 (success) even if the do-file had an error.
Hopefully, some of these will be available in version 14 or 15.

Monday, October 13, 2014

Module to sort Stata matrix by column

I noticed recently that the module to sort Stata matrices by a column (matsort) incorrectly handles row names with spaces. This error is partly due to the limited Stata functions available for handling matrices prior to version 9 (when Mata was introduced). I've quickly made a bug-free replacement -matrixsort- that you can download from my Github repository.

Sunday, October 12, 2014

Bookmarklets

Here are two bookmarklets that I created recently:

  1. CleanupSingleDoc, which I use (with Printliminator) to save a simplified version of a web page (removing text links, iframes, etc.) in order to save the content as an ePub (using Calibre).
  2. Connect via ResearchPort, which appends the University of Maryland proxy suffix for authentication.

Friday, October 10, 2014

Managing sets of files produced by different configurations

With statistical estimation you often run the same programs multiple times with slightly different options. If these programs produce file outputs, you can have some trouble managing them all. Here are some tools I use; one recurring piece is a global file-name suffix, ${extra_f_suff}.

Tracking sets

The first task is tracking the files produced. In general you want file names that include the options used. Here, I employ two strategies depending on the work.

  1. For real runs, I usually want to collect the names of all files produced so that they can be checked and then stored/deleted/etc. Therefore I have wrapper functions for saving dtas, logs, graphs, tables, and text snippets that append the name of the file they are writing to a separate text file (see the sketch after this list).
  2. A standard option that I always include is a "testing" switch. This is useful when I want to just test whether a small change causes an error. It does the bare minimum for a program (limits the number of observations, reduces the number of repetitions, etc.). It also sets a global extra_file_suffix="_testing", which is appended to all file names at the point of writing (easier than passing a testing option through several layers of programs).
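A minimal sketch of such a wrapper (the names save_tracked and produced_files.txt are hypothetical; my real wrappers do more):

program define save_tracked
    // save a dta (with the testing suffix, if any) and record its name
    args fname
    save "`fname'${extra_file_suffix}.dta", replace
    tempname fh
    file open `fh' using "produced_files.txt", write append text
    file write `fh' "`fname'${extra_file_suffix}.dta" _n
    file close `fh'
end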

Manipulating sets

If you can build a list of files (either because they were saved as above, or from something like find . -name '*.blah') then here are some handy tools for dealing with them.

$ cat file_list | xargs rm # delete
# manage them via svn (similar options exist for git):
$ cat file_list | svn add --targets -
$ cat file_list | svn remove --keep-local --targets -
$ cat file_list | svn commit -m ""  --targets -

Noting the primary key in Stata files

Most tables in databases have a primary key defined, and this can be a help with Stata files too. If you have a primary key defined by a single variable, you can use xtset (or tsset if it's a time variable). If you have a composite key where one component is a time variable, you can use xtset/tsset. Otherwise, you should have a consistent way of noting it. One way is to store it as a dta characteristic, such as:
. char _dta[key] keyvar(s)
See also isid.
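For example, with a hypothetical firm-year panel:
. char _dta[key] firm_id year
. isid firm_id year
. display "key: `: char _dta[key]'"
Here isid errors out if the listed variables do not uniquely identify the observations, and the extended macro function retrieves the stored key.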

Thursday, August 28, 2014

Parallel processing in Stata: Example code

Here are some files from a presentation I did about running parallel operations in Stata (beyond what Stata-MP does). They include a simple presentation covering the basic idea and example code for doing parallel bootstrap estimation. They use the parallel Stata module (I prefer the dev version).
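A minimal sketch of the idea, assuming the -parallel- module is installed (check its help file, as the syntax has shifted across versions):
. sysuse auto
. parallel setclusters 4
. parallel bs, reps(1000): regress price mpg weight
This splits the bootstrap replications across four child Stata instances instead of running them serially.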

Wednesday, August 27, 2014

Easy LyX Beamer products

When making presentations in LyX with the Beamer template, one often wants to make three PDFs every time: slides, handouts, and handouts+notes. Here are some scripts to do this. You will need Cygwin installed if you are on Windows.

Command-line version:
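The actual script is linked above; as a rough sketch of the approach (the sed-based option injection and file naming are illustrative, and the real script likely differs):

#!/bin/sh
# Build slides, handout, and handout+notes PDFs from one LyX Beamer file.
f="${1%.lyx}"
lyx -e pdflatex "$f.lyx"   # export $f.tex
cp "$f.tex" "$f-slides.tex" && pdflatex -jobname "$f-slides" "$f-slides.tex"
for v in "handout" "handout,notes=show"; do
  out="$f-$(echo "$v" | tr ',=' '--')"
  sed "s/\\\\documentclass\[/\\\\documentclass[$v,/" "$f.tex" > "$out.tex"
  pdflatex -jobname "$out" "$out.tex"
done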
In Windows I've added this functionality to the right-click menu for LyX files. It requires the slightly modified script below. To edit the right-click menu I use the FileMenu Tools utility. In a new entry, set the program to a shell (I've tested Git's -- "C:\Program Files (x86)\Git\bin\sh.exe" -- and Cygwin's -- "C:\cygwin64\bin\mintty.exe"). For the arguments, put "absolute/path/to/beamer_outputs.sh %FILENAME1%".

Monday, August 25, 2014

Storing Stata project dependencies

Newer versions of an environment can break existing code, so it is often helpful to maintain access to the specific versions you use. For the Stata environment this is particularly important: the SSC archive doesn't store previous versions of modules, so you should store them in your project folder. To ensure that a project uses only the locally stored Stata programs, I set the shell environment variable S_ADO to "\"<project_base>/code/ado/\";BASE".
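For example, before launching Stata (the embedded quotes guard against spaces in the path):
$ export S_ADO="\"<project_base>/code/ado/\";BASE"
Inside Stata, adopath will then show the resulting search path.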

The process is a bit more work if a module has machine-specific files (e.g. compiled plugins) and you want to allow your project to run on different platforms. If you're working across platforms you should have your code stored in some kind of version control repository (e.g. subversion or git). For modules with machine-specific files you can't store the installed files in the repo, since they differ between machines. Instead you store the installation files in the repo and then do a local install on each machine. To store the installation files locally, find the URL of the pkg files (e.g. http://fmwww.bc.edu/repec/bocode/s/synth.pkg) and use store-mod-install-files.sh. To install locally, do something like the following:
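A minimal sketch of the local install (the ado_install directory name is illustrative; adjust to wherever the pkg files were stored):
. net set ado "<project_base>/code/ado"
. net install synth, from("<project_base>/code/ado_install") replace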

A more general solution will reinstall the files whenever the installation files are updated. For this we use make and generate the makefile. The file dependencies are stored in the pkg files, so you can use gen-makefile.sh (and statab.sh and cli-install-module.do) to scan for such files, extract the dependencies, and generate the makefile. You will then be able to type:
$ make all_modules
to update/install all the modules.

For the paranoid: the Stata program itself is closed-source, so specific versions may be unavailable in the future. Stata states that you can use the version command to enable dated interpretation in newer versions. If you are not satisfied by that, store installation files yourself (preferably for an open-source system like Linux).

Thursday, July 31, 2014

Posting abandoned research ideas

Researchers often hold onto sub-par research projects for their option value. An idea might not seem worthwhile now, but maybe it will in the future (it might become feasible with better data or a theoretical breakthrough, or might become more interesting if combined with another insight). But periodically this option value drops: after professors get tenure, they're probably much less likely to care about these long shots. Sometimes these ideas are lobbed at students, but they are often not picked up. I think that at the points where the option value drops, professors should post some of their least-likely-to-be-worked-on ideas somewhere. It could be their own blog, or maybe an aggregator that specializes in abandoned research ideas.

Incentives and supply curves

I recently had some thoughts about when exchanges between individuals are not exploitative. (I'm going to leave aside the issue of coercion, though that is obviously related.) Examples of "problematic" exchanges would be price hikes after events (like a storm) and blackmail; I think these are roughly in order of decreasing popularity. I'm also going to steer clear of "repugnant markets" like organ sale.

The simplest way to think about what happens in an exchange is that there is some surplus (as defined relative to the parties' disagreement point or BATNA) that should be shared evenly between them. This is the result from Nash bargaining. This setup, though, discounts the effects on possible future transactions, which is why many economists favor few price restrictions after calamities: if you don't allow people to profit, there won't be any supply response, and the next time it happens (or later in the current aftermath) the supply won't be improved. So in each case we are weighing the desire for equality against the desire to improve the world in the future. For price gouging we are weighing a negative (short-term inequality) against a positive (improved future supply). Reasonable people could give different weights to these concerns and come to different conclusions (some events may be so particular that we don't think suppliers profiting now would cause any future benefit). Blackmail is interesting because it appears to be a negative in both the short term and the long term. The long term is negative because if blackmail were morally permissible, it would incentivize people to violate the privacy of others.

Readings:
  1. Strings Attached: Untangling the Ethics of Incentives by Ruth W. Grant
  2. Questions for Free-Market Moralists by Amia Srinivasan (nytimes.com)
  3. Organ Sale - Stanford Encyclopedia of Philosophy

Thursday, June 19, 2014

Determining demand for unfinished products

Once a product has moved from the idea stage to the prototype stage, Kickstarter and similar sites are great ways to determine demand. But what about products that are still at the idea stage (but which are obviously feasible)?

The domain I was thinking about recently was audiobooks. I bet there are lots of potential long-tail audiobooks (especially older books) that aren't getting made because of uncertainty about demand. The existing systems to determine demand using the internet (that I'm aware of) likely don't allow many audiobooks to be published.
  • When an author tries to determine demand, they will sometimes use a crowdfunding site (e.g. Indiegogo, which is very permissive in the types of projects funded) to gauge it. But this is very rare, and I think authors of older books are very unlikely to do this.
  • Amazon.com allows people to note their interest in an audiobook version on the book page. But Amazon doesn't know how serious you are, since it's practically costless to click that link. They also only use the data to make audiobooks through Audible.com, which has high production costs (voice actors on ACX might potentially do it for less).
What would be great is a site where people could monetarily show their interest (similar to Indiegogo) in any potential audiobook, and the site would track progress toward the goal. It would be more like bounty programs, since the end goal isn't specified and the producer hasn't been determined yet. Then entrepreneurs could contact authors about audio rights (or authors could do it themselves). The site could even have freeing the copyright as the end goal, like unglue.it for audiobooks. If that were the case, voice actors from LibriVox might record the audio for free and the payment would just secure the audio rights.

Monday, March 17, 2014

Stata's serset file format

Sersets are minimal versions of datasets in Stata. They are commonly embedded in .gph graphics files to store the data so that graphs can be reproduced by Stata; sersets are sometimes also written directly to files. The file format is undocumented by Stata but very similar to dta formats like v115 and previous ones. If you need to access the data in a serset file or a .gph, here is the basic format. The below assumes that you are generally familiar with the dta v115 standard. Let nvar be the number of variables and nobs the number of observations. The length of each block is in brackets.

  1. [16] "sersetreadwrite" (null terminated).
  2. [1] either "0x2" for versions 11-13 (gph format 3) or "0x3" for version 14 (gph format 4). 
  3. [1] I think this is a byte-order field, so 0x2 for lohi (standard Windows/Linux machines), otherwise 0x1.
  4. [4] Number of variables
  5. [4] Number of observations
  6. [nvar] The typlist. A byte characterizing the type of each variable.
  7. [nvar x 54] The varlist. Null-terminated strings of the variable names. For versions >=14 this is 150.
  8. [nvar x 49] The fmtlist. Null-terminated strings of the variable formats. For versions >=14 this is 57.
  9. [nvar x 8] The maximums. A double for each variable representing that variable's non-missing maximum. If the variable is a string then the double's missing value is used.
  10. [nvar x 8] The minimums. Same as above except for the minimums.
  11. [nobs x sum(sizeof each variable type)] Data: All data from an observation are contiguous.
Edit 2016-01-31: Figured out more about the bytes immediately after "sersetreadwrite" and about how version 14 is different.
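Putting the header pieces together, here is a minimal Mata sketch of reading them (the format is unofficial, so treat this as a starting point; example.srs is an illustrative file name):

mata:
fh = fopen("example.srs", "r")
magic = fread(fh, 16)                      // "sersetreadwrite" plus null
C = bufio()
ver   = fbufget(C, fh, "%1bu")             // 0x2 (v11-13) or 0x3 (v14)
order = fbufget(C, fh, "%1bu")             // byte order: 1=hilo, 2=lohi
bufbyteorder(C, order)
nvar  = fbufget(C, fh, "%4bu")
nobs  = fbufget(C, fh, "%4bu")
typlist = fbufget(C, fh, "%1bu", 1, nvar)  // one type byte per variable
printf("nvar=%g, nobs=%g\n", nvar, nobs)
fclose(fh)
end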

Thursday, February 27, 2014

Bigfoot module for LyX

For the lazy, here is a module adding support for the bigfoot package to LyX (outlined here). Find your user directory from Help > About LyX and put the file in "<User directory>/layouts/". Then do Tools > Reconfigure.
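If you'd rather roll your own, a LyX module is just a small text file. A minimal sketch along these lines (the Format number must match your LyX version, and the \DeclareNewFootnote line is an illustrative bit of bigfoot setup):

#\DeclareLyXModule{Bigfoot}
#DescriptionBegin
#Loads the bigfoot package for stacked footnote series.
#DescriptionEnd

Format 35
Preamble
	\usepackage{bigfoot}
	\DeclareNewFootnote{B}
EndPreamble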

Friday, January 24, 2014

PDFs that allow toggling between images

Authors sometimes want their documents to display different images in different settings. For instance, color charts often become indecipherable when printed in gray-scale, so a separate version with patterns might be preferable. One simple way to achieve this in PDFs is with Optional Content Groups (OCGs). See this PDF for a working example: it has a link that the user can click to toggle images in the PDF. The file was produced using LyX with the tikz and ocg-p LaTeX packages. This zip contains source materials for those who would like to create their own. OCGs don't use JavaScript, so this method should be fairly portable.
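The example itself used ocg-p; as a minimal sketch of the idea with the closely related ocgx package (whose toggle syntax I'm more confident reproducing), and without the tikz overlay the real example uses to stack both images in the same spot:

\documentclass{article}
\usepackage{ocgx}
\begin{document}
\switchocg{col gray}{Click here to toggle versions.}

\begin{ocg}{Color version}{col}{1}%
(color chart goes here)
\end{ocg}%
\begin{ocg}{Gray-scale version}{gray}{0}%
(patterned gray-scale chart goes here)
\end{ocg}
\end{document}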