Monday, September 12, 2016

Guidelines for making a Stata package

Here are some guidelines for making a well-behaved Stata package (ado):
  1. Provide all relevant output programmatically, not just textually. Another package may want to work with yours.
  2. Provide a way to check the version of your package.
  3. Use the version command so that your code keeps behaving the same way under later Stata releases.
  4. The package file should have a line like 'd Distribution-Date: 01jan2000' so that the package can be updated by adoupdate.
  5. Make clear in the help file what side effects your program may have (globals, characteristics, mata objects, scalars, incrementing of the RNG state by using randomization functions, files left around). Clean up after yourself if you can, especially after errors (use preserve, tempfiles/tempvars/tempnames, and capture blocks with cleanup sections).
  6. List your package dependencies. You can do this in the package file, but you should also mention this in the help file (and say if you autoinstall dependencies). 
  7. If you capture the output from a long-running process, make sure to allow for exit if the break key was pressed (check _rc==1).
  8. Use _assert. It makes error checking code much nicer to read.
  9. Provide tests (such as static source code checks and unit checks) with good coverage. Relatedly, your help material should include useful examples.
  10. Put a brief description/authors in *! initial lines above the command in the ado file (used by which). Don't make this a full changelog; put that in a changelog file.
  11. If you use compiled plugins or mata mlibs, provide the source code. This can help users determine solutions to errors.
  12. If the package implements statistical estimation routines, include references for the procedure and describe the details of your algorithm explicitly.
  13. Make sure your program works when either the current directory or the tempdir contains spaces.
Extra-special niceties:
  1. Have an option so that relevant output (such as displayed text or saved files) is not dependent on machine-specific characteristics (e.g. directory, # processors, speed/time). This allows for log files to be compared across runs to check for real differences using text comparison methods.
  2. Provide a way to access previous versions of the package and a place to post errors. You can easily do this by hosting your project on a development website like GitHub.
  3. Provide a way to cite the programs.
  4. Provide a centralized place for listing issues/bug reports/enhancement requests/etc. (e.g. at GitHub).
  5. If you provide a web-based development site (e.g. at GitHub), provide a version of your help in HTML (see the log2html Stata package).
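Guideline 4 above is easy to check as part of the static source checks from guideline 9. Here is a minimal sketch in Python (not Stata), assuming the package file is plain text with one directive per line; the function name, the example package contents, and the file name mypkg.ado are hypothetical, and the date string is simply copied from the guideline.

```python
import re

# Matches a package-file description line such as "d Distribution-Date: 01jan2000".
DIST_DATE = re.compile(r"^d\s+Distribution-Date:\s*(\S+)")


def distribution_date(pkg_text):
    """Return the Distribution-Date value found in .pkg text, or None if absent."""
    for line in pkg_text.splitlines():
        match = DIST_DATE.match(line)
        if match:
            return match.group(1)
    return None


# Illustrative package file contents (hypothetical package).
example_pkg = """v 3
d mypkg: an illustrative package
d Distribution-Date: 01jan2000
f mypkg.ado
"""

print(distribution_date(example_pkg))
```

A test script could fail the build whenever this returns None for a package file in the repository.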

Monday, May 02, 2016

LyX template for UMD dissertation format

The following is a LyX template for formatting PhD dissertations for the University of Maryland, College Park. The umdthesis.cls file is a slightly modified version of thesis.cls from here (and renamed so it doesn't conflict with existing thesis files). Instructions for using the LyX file are in a comment at the top of that file.

Files:

Saturday, April 23, 2016

"Auto-edit failed" errors in LyX

A nice feature of LyX is that you can edit \included (or \inputed) files from the Insert→File→Child Document dialog window. I was recently having a problem, however, where I could edit file types that had Notepad as the default handler but not those that had Notepad++ as the default handler. I would get dialog messages saying "Error: Cannot edit file \n Auto-edit file file_path.tex failed". If this has happened to you, here's the reason and the fix.

Unless the child document is a .lyx file, LyX will check to see which handler is associated with text files in Tools→Preferences→File Handling→File Formats, "Plain Text". If this is set to "Custom:auto", then LyX will call Win32's ShellExecute() with the action/verb set to edit. The problem with my setup was that if you merely set Notepad++ as the default handler of a certain file type (by right-clicking on a file of that type and selecting Open with→Choose Another App→Use this App to open all XXX files), then the application isn't registered as the handler for the edit verb. Possible solutions:
  1. Set Notepad++ to be the editor for all included files. Do this by setting the edit command for Plain Text files to "C:\Program Files (x86)\Notepad++\notepad++.exe". This is what I did.
  2. LyX could be patched so that if the first call to ShellExecute() in osWin32.cpp:autoOpenFile() with the edit verb fails with error code SE_ERR_NOASSOC (=31), it tries again with a verb of NULL.
  3. The user can register an edit verb with the default file handler in the registry. See here and here.
General reference for this type of problem.

Thursday, February 18, 2016

More efficient news reading

Nassim Taleb famously doesn't read the news. One of his reasons is that much of the news is inconsequential. This is both because news outlets need to publish even on slow days (the BBC has on only one day stated "[t]here is no news") and because it is difficult to tell at the time what will be important, especially from just headlines. Existing metrics to help determine if something of note actually happened include article view/share counts (e.g. RSS feeds of the "top 10" articles) and links (Google News), but these are insufficient. A better metric would be the price in a financial market (such as a prediction market), where one is available. Imagine a news service that tracked markets you cared about and, if there was a large and non-temporary price change, sent you the top stories from around that time period. One could imagine that the user could use a set of news search terms and the link to a market to set up an alert themselves. This could also be a platform for deciding about new prediction markets to set up. The target end-users would be general news readers, not traders, as there'd obviously be a bit of a lag.

Sunday, January 31, 2016

Reproducible PDFs from LaTeX

I like using version control, such as git, to store project files. When working with others it is often convenient to commit generated files, such as PDFs from LyX files. PDFs, however, may be different for inconsequential reasons, making them more inconvenient with version control systems.

I have patched the pdftex executables from MiKTeX so that if you are on Windows (x64), the PDFs can be identical. The patch removes the creation date, modification date, and trailer ID so that the PDF file is constant across runs. It also removes the file path of included graphics so that if the file is built somewhere else (e.g. on a collaborator's machine) the PDFs will still be identical (assuming they are built using the same LaTeX setup; it would be nice to improve this). No changes are needed to your lyx/tex files.

To install:

  1. Back up your current pdftex executables (in C:\Program Files (x86)\MiKTeX 2.9\miktex\bin): MiKTeX209-pdftex.dll, miktex-pdftex.exe, and pdftex.exe.
  2. Download this zip file and unpack the new versions of those three files into the above directory.
  3. Be careful when updating the MiKTeX executables, since an update may replace the patched files (updating packages is fine).
I patched the version of pdftex that came with MiKTeX 2.9 (2015-08-24). It was primarily cobbled together from existing changes to the pdftex source code (svn access). See the diffs when changing revisions 222->223, 723->724, and 727->728. Hopefully, these changes will eventually be mainlined into the LaTeX distributions.
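If you want to see whether a given PDF still carries the entries that the patch removes, a rough check is to search the raw bytes for them. This is only a sketch in Python under stated assumptions: it looks at uncompressed parts of the file, the function name is hypothetical, and it is not part of the patch itself.

```python
import re

# Byte patterns for the PDF entries that typically differ between otherwise
# identical builds: the two timestamps and the trailer ID.
VOLATILE = {
    "CreationDate": re.compile(rb"/CreationDate\s*\("),
    "ModDate": re.compile(rb"/ModDate\s*\("),
    "ID": re.compile(rb"/ID\s*\["),
}


def volatile_entries(pdf_bytes):
    """Return the sorted names of run-dependent entries found in the raw PDF bytes."""
    return sorted(name for name, pattern in VOLATILE.items() if pattern.search(pdf_bytes))


# Synthetic fragments standing in for real files (illustrative only).
unpatched = b"<< /Producer (pdfTeX) /CreationDate (D:20160131120000) >> trailer << /ID [<ab> <cd>] >>"
patched = b"<< /Producer (pdfTeX) >> trailer << /Size 10 >>"

print(volatile_entries(unpatched))
print(volatile_entries(patched))
```

For a real file you would pass in the result of `open(path, "rb").read()`; an empty list suggests the PDF is a good candidate for byte-for-byte comparison across builds.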

Sunday, September 13, 2015

Compiling Stata plugins

Here are some notes on how to compile plugins for Stata. The official plugin page is a bit out of date.

Linux: Add option -fPIC. If you want to build a 32-bit plugin on a 64-bit system then you will need the 32-bit version of libc (e.g. on Debian install g++-multilib and libc6-dev-i386).

Windows: You will need the mingw-w64 compiler (unless you just want 32-bit, in which case you can use the original MinGW). You can get this in several ways, but since I already had Cygwin I just installed the mingw64-x86_64-* packages (for 32-bit you can use mingw64-i686-*). You will then use the x86_64-w64-mingw32-g++ command rather than g++ -mno-cygwin. Add the -static flag in the link step, otherwise Stata might give a "could not load plugin" error due to unfound DLLs (e.g. I used C++ classes and I found using Procmon that after reading my plugin file Stata couldn't find libgcc_s_seh-1.dll).

Mac: Install the Xcode command-line tools. I added the -shared option and removed the -bundle option (-bundle can't be used with dylibs). With modern Xcode (since version 5.0) you have to use clang/clang++.

Wednesday, September 02, 2015

assign_treatment: A Tool for Stratified Randomization

This blog post by David McKenzie and Miriam Bruhn talks about assigning treatments at random when stratification variables cause cells (unique combinations of stratification variable values) to have numbers of units that are not multiples of the number of treatments. The provided Stata do-file is hard-coded for six treatments and is a bit difficult to adapt for another number. I've made a module that does the assignment for varying numbers of treatments as well as providing different methods of assigning the remainder units ("misfits") from each cell.

The three main methods provided all assign the misfits at random and achieve cell-level balance (counts differ at most by 1):

  • full - This is the same method as McKenzie and Bruhn. The misfits from each cell are separately randomized to any combination of treatments. This can cause the overall number of units per treatment to be unbalanced even though they will be balanced at the cell level.
  • reduction - This method achieves overall balance as well as better balance for specified stratification variables. It does this by limiting misfits to be assigned to a "wrapped" interval of treatments (e.g. (2,3,4) or (6,1,2) if there are six treatments) and then having those intervals dovetail together.
  • full-obalance - This method allows misfits to be assigned to any combination of treatments and achieves overall balance. It does so by assigning units one at a time to fill repeating slots of (1,...,T,1,...,T,...). At each stage it keeps track of possible units that could fill a spot (without causing two from the same cell to have the same treatment). It randomly picks one and then attempts to fill the next spot (while giving a slight weight to trying to fit first misfits from cells with many misfits). If filling a spot is impossible, the algorithm backs up to the last point where there was a choice and tries a new option.
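To make the idea of the "full" method concrete, here is a minimal sketch in Python. It is not the module's code (the module is written in Stata); the function name, arguments, and example data are hypothetical. Within each cell, complete blocks of units get every treatment equally often, and the leftover misfits are each sent to a different randomly chosen treatment.

```python
import random
from collections import Counter, defaultdict


def assign_full(cells, n_treatments, seed=0):
    """Assign treatments 1..n_treatments given each unit's cell label.

    Returns a list of treatments in the same order as `cells`.
    """
    rng = random.Random(seed)
    by_cell = defaultdict(list)
    for unit, cell in enumerate(cells):
        by_cell[cell].append(unit)

    assignment = [None] * len(cells)
    for units in by_cell.values():
        rng.shuffle(units)  # random order within the cell
        n_misfits = len(units) % n_treatments
        n_full = len(units) - n_misfits
        # Complete blocks: every treatment gets the same number of units.
        for i, unit in enumerate(units[:n_full]):
            assignment[unit] = i % n_treatments + 1
        # Misfits: distinct treatments drawn at random, one unit each.
        misfit_treatments = rng.sample(range(1, n_treatments + 1), n_misfits)
        for unit, treatment in zip(units[n_full:], misfit_treatments):
            assignment[unit] = treatment
    return assignment


cells = ["a"] * 7 + ["b"] * 5          # two cells with 7 and 5 units
treatments = assign_full(cells, 3, seed=1)
print(Counter(zip(cells, treatments)))  # counts per (cell, treatment)
print(Counter(treatments))              # overall counts can be unbalanced
```

Because each cell's misfits are drawn independently, the overall counts per treatment can drift apart, which is the motivation for the other two methods.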

To install the module, run
. net install assign_treatment, from(https://raw.github.com/bquistorff/Stata-modules/master/a/) replace

Tests are provided here.

Tuesday, August 18, 2015

Continuous Integration with Stata

Continuous Integration is a workflow that frequently merges code changes and automatically runs tests. This can be especially helpful when working in teams. Several free services exist, such as Travis-CI, which provides Linux build infrastructure and integrates with GitHub. Basically, you instruct Travis how to set up a Linux environment for your project and then tell it what tests to run. Travis will then automate this, checking your repository at every push.

First off, you'll obviously need a Stata license that allows for this kind of work. The single-user license allows for 3 "seats", or you may have a network license with extra capacity. In Travis, you should go to the Settings for your repo and "Limit concurrent builds" to 1 (or whatever is permitted by your license).

Travis works each time off of clean Linux machine images so your setup will have to allow Travis to install Stata automatically. Travis allows you to store private passwords on the test infrastructure, so you could either host the Stata files on a password protected machine or you could encrypt the files with a password (e.g. with gpg --symmetric). Either way, Travis will download files repeatedly so you may want to make them as small as possible. I've found the best way is to take an existing install and then strip out anything unnecessary, including documentation (all *.pdf, *.sthlp, *.ihlp, *.key, and *.dta which are almost all example data), graphical tools (*.mnu, *.png, *.dlg, *.idlg), and alternate executables. Compress this folder with xz. Then upload the files to a web server that allows for command-line download (if you are using a general hosting site, you could check out plowshare) making sure to take care of security. 

Now you can set up your .travis.yml to configure the machine.

  1. Extract the password needed for your Stata files (you can do this by installing the Travis client on your local machine, and then using it to encrypt the password as an environment variable).
  2. Download and set-up the Stata files.
  3. Add the Stata folder to the PATH (it needs to be able to find the license file).
Finally, add commands to test your code:
  • Obviously, you already have lots of code to test your programs, right? :)
  • In order for Travis to know if there was a problem, the executable that runs the tests should return a non-zero error code upon failure. The Stata executable will not do this (!), but you can use a simple wrapper like statab.sh. If you download this, remember you will have to make it executable.
  • The tests can be run either from a master do file, or be called independently from a script or makefile.
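The wrapper idea can be sketched as follows. This is not statab.sh; it is a Python illustration under two assumptions that you should check on your setup: that `stata -b do name.do` writes name.log to the current directory, and that a failing run leaves a line like `r(9);` in that log. The function names and the `stata` command name are placeholders.

```python
import re
import subprocess
import sys
from pathlib import Path

# Assumption: Stata reports errors in the log as a line of the form "r(198);".
ERROR_LINE = re.compile(r"^r\((\d+)\);\s*$", re.MULTILINE)


def log_error_code(log_text):
    """Return the first Stata error code found in the log text, or 0 if none."""
    match = ERROR_LINE.search(log_text)
    return int(match.group(1)) if match else 0


def run_do_file(do_file, stata_cmd="stata"):
    """Run a do-file in batch mode and return the error code read from its log."""
    subprocess.run([stata_cmd, "-b", "do", do_file], check=False)
    log_path = Path(Path(do_file).stem + ".log")
    return log_error_code(log_path.read_text(errors="replace"))


if __name__ == "__main__" and len(sys.argv) > 1:
    # Exit with 1 on any Stata error: raw codes above 255 would wrap around.
    sys.exit(1 if run_do_file(sys.argv[1]) else 0)
```

Note that a test which deliberately shows an error with -capture noisily- would also leave an `r(#);` line in the log, so a wrapper like this works best when the test do-files only print error codes on real failures.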
The more I use Stata, the more valuable to me it becomes!

Monday, June 22, 2015

Independent backups of researcher data

Open data is important for healthy scientific research. A recent journal article by Vines et al. studies how available data is from authors. Sadly, they find that
[R]esearch data cannot be reliably preserved by individual researchers, and further demonstrates the urgent need for policies mandating data sharing via public archives.
Most policy mandates are for newly published research, but what about data from existing publications? Is there a way that independent parties could try to preserve what material is available?

One idea is to have an established group systematically look for data that is only available on a researcher's personal/institutional website and post it in a repository (either at the journal or separately at a place like dataverse). The group should have some reputation so there is trust that the data was not manipulated.

Secondly, the group could try to acquire data even when the data is not available for download. They could ask authors. This would take much more time and have a low response rate, but still might be worth the investment. The group could also accept material from those who had already requested and received replication materials from the authors.  This would require little effort on the part of the group.

Wednesday, June 17, 2015

Stand-alone Pre-analysis Plans for observational studies

The push for research transparency in social science has included advocating for pre-analysis plans (PAPs) where researchers detail what they will do before seeing the data. These plans provide many benefits. The most obvious ones are that (a) they rule out specification searching so that p-values are more believable, and (b) since they register research before the findings, they help in understanding the body of work done that is not published. PAPs are increasingly used for verifiably prospective research (e.g. RCTs, lab experiments, and using non-open data that requires application). Their use for observational studies on existing data is, however, controversial (see Dal-Re et al. 2014). The primary challenge is that outsiders cannot verify that the analysis plan was actually written before the research was undertaken.

While this challenge is likely not fully solvable, it can potentially be mitigated by restructuring incentives. The research designer could be separated from the research implementer and each earn professional credit for their independent contribution. Notably, the research designer would earn credit for publishing their PAP in a well-ranked, PAP-accepting journal independent of whether the results of the implementation are significant. This does not entirely remove the element of trust, so it is likely that published PAPs will be from those that the community trusts, such as a senior researcher. It does, however, free the research implementer of this burden of trust. Indeed, once the PAP is published, many people could work to implement it independently, in a fashion similar to replication. Registering intent to implement could help coordinate efforts.

Disaggregating the pieces of research could increase the efficiency of the process in a number of ways. Researchers with ideas but without implementation skills (programming skills or access to data) can publish stand-alone PAPs without the potentially costly process of matching with others with the necessary skills. Even for those that could implement the research themselves (or with close colleagues) there are likely benefits to specializing. Researchers often have ideas that they think should be pursued but which are not a priority for them. Often these ideas are captured by colleagues or advisees, but quickly writing up a PAP could be another outlet that would confer visible benefit (see related post). On the other side, implementing PAPs could be a good way to train graduate students. Many graduate students state that the hardest part of research is coming up with a good idea. These PAPs would be especially helpful in areas where there is a lot of researcher subjectivity, like structural models.

In order to suitably track credit, there could be a norm to cite the PAP whenever one would cite the implementation paper.

Taking this to the extreme, the scientific process could be even more unlinked. For example, a scientist currently has an incentive to "sell" their research (or tell a convincing "story") in submissions to journals, possibly at the expense of being honest about the complexity of the findings. What if the science writer was somehow separated and judged on how well they communicated existing research findings? This could be judged if research data and methods were open.

Thursday, June 11, 2015

Getting versions of Stata modules

For reproducible research it is important to be able to reproduce your analysis environment. For Stata analyses, one needs to list the versions of the Stata modules used. This is especially important as the SSC archive does not store previous versions of modules. Ideally, one is storing the module files with a project, but it is still useful to have a measure of the version of the modules. Stata modules are not required to list a version number (NB: this is separate from which -version- of Stata the module is requesting). Many modules do, and if so one can often find it in the text of the ado files (use -viewsource command.ado- for this). If not, one can use the version date. Modules on SSC are required to list their "distribution date". If you are installing modules from a version control server (e.g. GitHub) then the install date is sufficient. Stata keeps this information in a "stata.trk" file at the base of each module directory tree (e.g. PLUS), the location of which depends on your adopath. To get this information you can use the following simple script.

$ cat stata.trk | grep "^\([SND]\|d Distribution\)"
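If you would rather have the same information as structured records (say, to write it into a project log), here is a rough Python equivalent of the grep above. It assumes each installed package starts with an "S" (source) line followed by "N" (package name), "D" (install date), and possibly a "d Distribution-Date:" line; the function name and the sample contents below are illustrative, not taken from a real install.

```python
def installed_modules(trk_text):
    """Collect source, name, install date, and distribution date per package."""
    modules = []
    for line in trk_text.splitlines():
        if line.startswith("S "):
            modules.append({"source": line[2:].strip()})  # new package record
        elif not modules:
            continue  # header lines before the first package record
        elif line.startswith("N "):
            modules[-1]["name"] = line[2:].strip()
        elif line.startswith("D "):
            modules[-1]["installed"] = line[2:].strip()
        elif line.startswith("d Distribution-Date:"):
            modules[-1]["distribution_date"] = line.split(":", 1)[1].strip()
    return modules


# Illustrative contents in the shape described above (hypothetical package).
sample_trk = """* Do not erase or edit this file
S http://example.org/stata/m
N mymodule.pkg
D 11 Jun 2015
U 1
d 'MYMODULE': an illustrative module
d Distribution-Date: 20150602
f m/mymodule.ado
e
"""

for module in installed_modules(sample_trk):
    print(module)
```

Modules without a distribution date simply lack that key, in which case the install date is the fallback described above.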

Wednesday, April 22, 2015

Code Management in Dropbox

Dropbox is not the ideal version control system, but some people do not want to make the leap to a full VCS (like git or svn, which full-time software developers use). Given that one should not make the perfect the enemy of the good, what simple tips can Stata-based economists follow that will make their lives easier? Here are some tips that will be helpful on projects that last a while (longer than short-term memory or the involvement of an assistant) or involve multiple people. As always, one should have the over-arching goal that the project be reproducible (at almost all points, you can follow an explicit recipe and produce all required outputs). While my tips will help with this, you may require some project-specific solutions, but it is well worth the effort.

  1. Backup: Structure your folders so that a minimal number can be used to reproduce everything in the project. A common approach is to have a code folder (which changes often) and a data/raw folder (which should almost never change). (This means all "manual" actions like cleaning should be embedded in code!) With this approach one can store a one-time backup of the raw data and then store periodic snapshots of the code folder. At key times (e.g. after every new draft or presentation) zip up the code folder and archive it untouched. Since all code is periodically backed up, you can get rid of code that is no longer used.
  2. Avoid editing conflicts: If multiple people are editing the same files, here is a tip to reduce friction. Before editing a file, check its "last modified" time. If someone else edited the file recently, chat to negotiate access (your team should decide what "recently" means). When you start editing a file, make an innocuous edit and save it so that others can see it is in use.
  3. Branching: If you want to make a series of changes that will leave a key file in an unusable state then you should make a "branch". You can do this (a) in Dropbox by duplicating the files (including unstable outputs) with modified names (have a convention here), or (b) by copying the project out of Dropbox for editing. Try to make changes small enough that they can be merged back into the main file quickly. The longer files diverge, the harder this will be.
  4. Syncing: For intermediate files only needed internally by one piece of code, you can reduce syncing by creating those files in your `c(tmpdir)' (with -tempfile- or just named explicitly) rather than in your project folder.
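The snapshot step in tip 1 is easy to script so that it actually gets done. Here is a minimal sketch in Python, assuming the project has a code subfolder as described; the function name, the archive naming scheme, and the folder names in the demo are all hypothetical.

```python
import shutil
import tempfile
import zipfile
from datetime import date
from pathlib import Path


def snapshot_code(project_dir, archive_dir, label, today=None):
    """Zip the project's code folder into archive_dir and return the zip's path."""
    archive_dir = Path(archive_dir)
    archive_dir.mkdir(parents=True, exist_ok=True)
    stamp = (today or date.today()).isoformat()
    base = archive_dir / f"code_{stamp}_{label}"
    # root_dir/base_dir make the paths inside the zip start with "code/".
    made = shutil.make_archive(str(base), "zip", root_dir=str(project_dir), base_dir="code")
    return Path(made)


# Demo on a throwaway project so the sketch can be run as-is.
with tempfile.TemporaryDirectory() as tmp:
    project = Path(tmp) / "project"
    (project / "code").mkdir(parents=True)
    (project / "code" / "analysis.do").write_text("display 1\n")
    archive = snapshot_code(project, Path(tmp) / "archive", "draft1", today=date(2015, 4, 22))
    archive_name = archive.name
    with zipfile.ZipFile(archive) as zf:
        archived_files = zf.namelist()

print(archive_name, archived_files)
```

Keeping the archive folder outside of the code folder means a snapshot never contains earlier snapshots.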

Here are other simple general tips that can be applied to Dropbox:

    1. No one's full path should be written into the code. Code should be run given Stata is in the appropriate working directory (often the root of the project folder or in the code subfolder). A nice benefit is that if the project folder is copied out of Dropbox it should run easily. If you have machine-specific configuration details, those should be defined on the machine in environment variables and brought into Stata using -local env_var : environment ENV_VAR-
    2. Have a local code/ado folder where you store all the needed modules. In the header of your code you should include a script that sets code/ado as PERSONAL and then restricts your adopath to just PERSONAL;BASE. You may want to -net set ado PERSONAL- and -mata: mata mlib index- also. 
    3. Learn how to use a visual text diff/merge utility like the one that comes with TortoiseSVN. This will make it easy to compare files between editors and across time.
    4. If possible, do not use spaces in file names (use "_" instead).
    What other tips do you all have?

    Friday, January 30, 2015

    Using Yik Yak to quantify aspects of student life?

    Just listened to a nice piece on Reply All about Yik Yak, which is an app where users can post messages anonymously and see the messages of those nearby (within 10 miles). The podcast mentioned how it was useful in showing written evidence of racism that previously was hard to prove. It got me thinking that if there was a way to get the feeds from different universities then one could analyze these using Natural Language Processing routines to get metrics for different schools. What percentage of anonymous posts at your school are about race, violence, sex, etc.? Could be something prospective students might want to know.

    Wednesday, January 28, 2015

    Prediction markets for human capital investment

    While there has been a growing literature on prediction markets, most of the markets have been confined to politics and macro-economics. These are great to have, but I would like to see markets that could help with more decisions that typical people make, such as when, and in what field to get training. The information that people have is usually about the current economy (this industry currently has a tight labor market) or employment projections by the government. What could be better would be a set of markets about future industry-specific indicators such as vacancy rates. If they were started at least 2-3 years before maturity then it could be used by people in college to help pick a major or by workers deciding whether to get training in a new field. The market might have to be "seeded" by the government, but it could be worth it.

    Tuesday, January 20, 2015

    Getting latexmk working within LyX

    If you would like to add latexmk as an export option in LyX, here are the two basic steps to do in Tools -> Preferences.
    1) In File Handling -> File Format, hit "New" and then put in these options, then "Apply".
    2) Then in File Handling -> Converter, click on any existing converter, change the options to the below, and then click "Add", then "Save".

    Now you should be able to File->Export->PDF (Latexmk). Hope that works for you.

    If you are on Windows, then you might have problems with Perl versions (LyX has one, MiKTeX another). Initially, I was getting an error that perl couldn't find File::Glob in @INC when latexmk was run. It turns out that LyX was running its own version of perl (5.18.2) and had @INC set to just the LyX/lib folder (which oddly did have File/Glob.pm). I installed the same version of Strawberry Perl (the versions need to match!) and then set PERL5LIB to the Strawberry Perl folder. Then it worked fine.

    This was on LyX 2.1.2, updated MiKTeX, and Windows 8.

    Running Stata 12 on Ubuntu 14.04 (Trusty Tahr)

    After upgrading my Linux machine I realized that I could no longer run the GUI of my copy of Stata 12. It gave the error:
    ./xstata-se: error while loading shared libraries: libgtksourceview-1.0.so.0: cannot open shared object file: No such file or directory
    These older versions of the libraries aren't in the normal Ubuntu repositories anymore, so this is a simple workaround to get Stata working again. (I'm on a 64-bit machine, so change amd64 to the appropriate architecture.)
    1. Install libgtksourceview2-2.0
    2. sudo ln -s /usr/lib/libgtksourceview-2.0.so.0 /usr/lib/libgtksourceview-1.0.so.0
    3. Download libgnomecups1.0-1 0.2.3-4 from here. Install it with dpkg.
    4. Download libgnomeprint2.2-data_2.18.8-3ubuntu1_all.deb and libgnomeprint2.2-0_2.18.8-3ubuntu1_amd64.deb from here. Install them at the same time with dpkg.
    Happy estimating.

    Tuesday, November 25, 2014

    Graphs with log-scales in the positive and negative domains

    Often to show data that is highly dispersed one will compress the data by graphing its log. One downside of this is that it only works if the data is all-positive or all-negative (if you use \(-\ln(-x)\)). If your data contains zero and/or points in both domains then you have to do something else. Here is a simple extension that uses a linear function around zero to smoothly connect a log function and its opposite. $$f(x)= \begin{cases} \ln(x) & \text{if }x>e\\ x/e & \text{if }-e\leq x\leq e\\ -\ln(-x) & \text{if }x<-e \end{cases}$$ The function is log-linear-log ("trilog"). The pieces meet at \(x=\pm e\), where both the values (\(\pm 1\)) and the slopes (\(1/e\)) agree.
    You can get a simple Stata utility -trilog- from here to make this transformation and create axis labels.
    Another intuitive extension would be to shift the log and its opposite closer to zero, such as $$g(x)= \begin{cases} \ln(x+1) & \text{if }x\geq0\\ -\ln(-x+1) & \text{if }x<0 \end{cases}$$ The downside of this is that equal proportional changes are no longer reflected as equal distance changes.
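For those not using Stata, both transformations are only a few lines. Here is a minimal sketch in Python; the function names are mine and this is not the -trilog- utility itself (it also does not produce axis labels).

```python
import math


def trilog(x):
    """Log-linear-log transform: linear on [-e, e], logarithmic outside."""
    if x > math.e:
        return math.log(x)
    if x < -math.e:
        return -math.log(-x)
    return x / math.e


def shifted_log(x):
    """Alternative transform: the log and its opposite shifted to meet at zero."""
    return math.log(x + 1) if x >= 0 else -math.log(-x + 1)


for value in (-1000.0, -math.e, -1.0, 0.0, 1.0, math.e, 1000.0):
    print(value, trilog(value), shifted_log(value))
```

With trilog, multiplying a large value by a constant always moves the point by the same distance, which is the property the shifted version gives up.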

    Wednesday, November 19, 2014

    Using make with Stata

    Having a makefile helps automate a lot of tasks with a project.

    • Generating different formats of logs, tex tables, and gphs (including making versions of the figs that have no titles). And removing orphan files if the primary ones are removed.
    • Normalizing logs, gphs, and dtas prior to committing.
    • Generating PDFs of Lyx files.
    • Updating the project mlib when mata files change.
    • Installing updated versions of packages that have to be installed.
    • Running Stata in batch mode, knowing the dependencies between code files (and setting up a gateway file so that on Windows modules can run shell commands).
    • Dealing with SVN commands.
    Here is a stub version. It uses statab.sh, cli_build_proj_mlib.do, gph2fmt.ado, cli_gph_eps.do, cli-install-module.do, cli_smcl_log.do (plus the normalizing ones).

    Edit 2015-01-28: I've posted a project template with updated versions of these (and better makefiles as my skill improves) at GitHub

    Version control for research projects

    While I had used version control on earlier projects, I didn't start using version control for collaborative research projects until reading Code and Data for the Social Sciences: A Practitioner’s Guide by Matthew Gentzkow and Jesse M. Shapiro. If you haven't read it, it's a good read (I agree with the general guidelines and most of the specifics).

    The first decision is which files to version. I version the dta, gph, tex, and log files.  I chose not to version easily generated files such as different formats of outputted figures and tables and instead generate them automatically using my Makefile. I normalize the dta, gph, and log files before committing them so that changes are noted only if real content has changed.

    Some miscellaneous tools: rm-non-svn.sh, svn_batch_rename.sh.


    Stata wishlist

    Here's what I wish Stata would add:

    1. Primary output files (dtas and gphs) should be able to be reproduced byte-for-byte. Primarily this requires being able to zero-out the timestamps and zero-out any junk padding.
    2. Make PDF and PNG exporting of figures available on console Unix.
    3. Shell commands should work in Windows batch-mode.
    4. All built-in commands should return values through the return classes (e.g. r()) so that they can be used programmatically
    5. Also, allow the Windows do-editor to automatically word wrap (this is a main reason why people I know use other editors).
    6. The program should set a positive return code on error. When running Stata in batch-mode the program always exits with 0 (success) even if the program had an error.
    Hopefully, some of these will be available for version 14 or 15.