Friday, October 10, 2014

Managing sets of files produced by different configurations

With statistical estimation you often run the same programs multiple times with slightly different options. If these programs produce output files, you can have some trouble managing them all. Here are some tools I use, including a global file-name suffix (described below).

Tracking sets

The first task is tracking the files produced. In general you want file names that include the options. Here, I employ two strategies depending on the work.

  1. For real runs, I usually want to collect the names of all files produced so that they can be checked and then stored/deleted/etc. Therefore I have wrapper functions for saving dtas, logs, graphs, tables, and text snippets that append the name of the file they are writing to a separate text file.
  2. A standard option that I always include is a "testing" switch. This is useful when I just want to test whether a small change causes an error. It does the bare minimum for a program (limits the number of observations, reduces the number of repetitions, etc.). It also sets a global extra_file_suffix="_testing" which is appended to all file names at the point of file writing (easier than passing a testing option through several layers of programs).
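
A minimal Python sketch of both strategies, for illustration only. The wrappers described above are Stata programs; the names tracked_name, save_tracked, and EXTRA_FILE_SUFFIX are hypothetical stand-ins, not the originals.

```python
import os

# Hypothetical stand-in for the global suffix described above.
# Set it to "_testing" for a test run, leave it empty for a real run.
EXTRA_FILE_SUFFIX = ""


def tracked_name(path, suffix=None):
    """Return path with the suffix inserted just before the extension."""
    if suffix is None:
        suffix = EXTRA_FILE_SUFFIX
    root, ext = os.path.splitext(path)
    return f"{root}{suffix}{ext}"


def save_tracked(path, text, manifest, suffix=None):
    """Write text to the suffixed path and append that path to the manifest."""
    out = tracked_name(path, suffix)
    with open(out, "w") as fh:
        fh.write(text)
    # One file name per line, so the list can later be piped to other tools.
    with open(manifest, "a") as fh:
        fh.write(out + "\n")
    return out
```

Because the suffix is applied only at the point of writing, callers never need to know whether they are in a test run.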

Manipulating sets

If you can build a list of files (either because the names were saved or because you ran something like find . | grep '\.blah$') then here are some handy tools for dealing with them.

$ cat file_list | xargs rm # delete
#manage them via svn (similar options for git):
$ cat file_list | svn add --targets -
$ cat file_list | svn remove --keep-local --targets -
$ cat file_list | svn commit -m ""  --targets -
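
If the file names might contain spaces, the delete step can also be done line by line. This is a hedged Python sketch; delete_listed is a hypothetical helper, and it assumes one file name per line in the list.

```python
import os


def delete_listed(manifest):
    """Delete every file named in manifest; return the names that were missing."""
    missing = []
    with open(manifest) as fh:
        for line in fh:
            name = line.rstrip("\n")
            if not name:
                continue  # skip blank lines
            try:
                os.remove(name)
            except FileNotFoundError:
                missing.append(name)
    return missing
```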

Noting the primary key in Stata files

Most tables in databases have a primary key defined and this can be a help with Stata files too. If you have a primary key defined by a single variable then you can use xtset (or tsset if it's a time variable). If you have a composite key and one of its variables is a time variable you can use xtset/tsset. Otherwise, you should have a consistent way of recording it. One way is to store it as a dta characteristic, such as:
. char _dta[key] keyvar(s)
See also isid.

Thursday, August 28, 2014

Parallel processing in Stata: Example code

Here are some files from a presentation I did about running parallel operations in Stata (beyond what Stata-MP does). They include a simple presentation covering the basic idea and example code for doing parallel bootstrap estimation. They use the parallel Stata module (I prefer the dev version).

Wednesday, August 27, 2014

Easy LyX Beamer products

When making presentations in LyX with the Beamer template, one often wants to make three PDFs every time: slides, handouts, and handouts+notes. Here are some scripts to do this. You will need Cygwin installed if you are on Windows.

Command-line version:
In Windows I've added this functionality to the right-click menu for LyX files. It requires the slightly modified script below. To edit the right-click menu I use the FileMenu Tools utility. In a new entry, set the program as a shell (I've tested git -- "C:\Program Files (x86)\Git\bin\sh.exe" and cygwin -- "C:\cygwin64\bin\mintty.exe"). For arguments put "absolute/path/to/beamer_outputs.sh %FILENAME1%".

Monday, August 25, 2014

Storing Stata project dependencies

Newer versions of an environment can break existing code so it is often helpful to maintain access to the specific versions you use. For the Stata environment this is particularly important. The SSC archive doesn't store previous versions of modules so you should store them in your project folder. To ensure that a project is using only the locally-stored Stata programs, I set the shell environment variable S_ADO to "\"<project_base>/code/ado/\";BASE".

The process is a bit more work if a module has machine-specific files (e.g. compiled plugins) and you want to allow your project to run on different platforms. If you're working across platforms you should have your code stored in some kind of version control repository (e.g. subversion or git). For modules with machine-specific files you can't store the installed files in the repo since they differ between machines. Instead you store the installation files in the repo and then do a local install on each machine. To store the installation files locally, find the URL of the pkg files (e.g. http://fmwww.bc.edu/repec/bocode/s/synth.pkg) and use store-mod-install-files.sh. To install locally, do the following

A more general solution will reinstall the files when the installation files are updated. For this we use make and generate the makefile. The file dependencies are stored in the pkg files so you can use gen-makefile.sh (and statab.sh and cli-install-module.do) to scan for such files, extract the dependencies, and generate the makefile. You will then be able to type:
$ make all_modules
to update/install all the modules.

For the paranoid: The Stata program itself is closed-source, so specific versions may be unavailable in the future. Stata states that you can use the version command to enable dated-interpretation in newer versions. If you are not satisfied by that, store installation files yourself (preferably for an open-source system like Linux).

Thursday, July 31, 2014

Posting abandoned research ideas

Researchers often hold onto sub-par research projects for their option value. An idea might not seem worthwhile now, but maybe it will in the future (it might become feasible with better data or a theoretical breakthrough, or might become more interesting if combined with another insight). But periodically, this option value drops. After a professor gets tenure they're probably much less likely to care about these long shots. Sometimes these ideas are lobbed at students, but they are often not picked up. I think that at these points where the option values drop, professors should post some of their least-likely-to-be-worked-on ideas somewhere. It could be their own blog, or maybe an aggregator that specializes in abandoned research ideas.

Incentives and supply curves

I recently had thoughts about when exchanges between individuals are not exploitative. (I'm going to leave aside the issue of coercion though that is obviously related). Examples of "problematic" exchanges would be price hikes after events (like a storm) and blackmail. I think these are roughly in order of decreasing popularity. I'm going to steer clear of "repugnant markets" like organ sale.

The simplest way to think about what happens in an exchange is that there is some surplus from the transaction (defined relative to each party's disagreement point or BATNA) that should be shared evenly between the parties. This is the result from Nash bargaining. This setup, though, discounts the effects on possible future transactions. This is why many economists favor few price restrictions after calamities. If you don't allow people to profit, there won't be any supply response and the next time it happens (or later in the current aftermath) the supply won't be improved. So in each case we are weighing the desire for equality against the desire to improve the world in the future. For price gouging we are then weighing a negative (short-term inequality) and a positive (improved future supply). Reasonable people could give different weights to these concerns and come to different conclusions (some events may be so particular that we don't think suppliers profiting now would cause any future benefit). Blackmailing is interesting because it appears to be a negative in both the short term and the long term. The long term is negative because if blackmailing were morally permissible it would incentivize people to violate the privacy of others.

Readings:
  1. Strings Attached: Untangling the Ethics of Incentives by Ruth W. Grant
  2. Questions for Free-Market Moralists by Amia Srinivasan (nytimes.com)
  3. Organ Sale - Stanford Encyclopedia of Philosophy

Thursday, June 19, 2014

Determining demand for unfinished products

Once a product has moved from the idea stage to the prototype stage, Kickstarter and similar sites are great ways to determine demand. But what about products that are still at the idea stage (but which are obviously feasible)?

The domain I was thinking about recently was audiobooks. I bet there are lots of potential long-tail audiobooks (especially older books) that aren't getting made because of uncertainty about demand. The existing systems to determine demand using the internet (that I'm aware of) likely don't allow many audiobooks to be published.
  • When an author tries to determine demand they will sometimes use a crowdfunding site (e.g. indiegogo, which is very permissive in the types of projects funded) to gauge demand. But this is very rare and I think authors of older books are very unlikely to do this.
  • Amazon.com allows people to note their interest in an audiobook version on the book page. But Amazon doesn't know how serious you are since it's practically costless to click that link. They also only use the data to make audiobooks through Audible.com, which has a high production cost (potentially voice actors on ACX might do it for less).
What would be great is a site where people could monetarily show their interest (similar to indiegogo) in any potential audiobook and the site would track progress toward the goal. It would be more like bounty programs since the end-goal isn't specified and the producer hasn't been determined yet. Then entrepreneurs could contact authors about audio rights (or authors could do it themselves). The site could even have free copyright as the end-goal, like unglue.it for audiobooks. If that were the case, voice actors from LibriVox might record the audio for free and the payment would just be to secure the audio rights.

Monday, March 17, 2014

Stata's serset file format

Sersets are minimal versions of datasets in Stata. They are commonly embedded in .gph graphics files to store the data so that graphs can be reproduced by Stata. Sersets are sometimes also written directly to files. The file format is undocumented by Stata but very similar to Stata dta formats like v115 and previous ones. If you need to access the data in a serset file or .gph then here is the basic format. The below assumes that you are generally familiar with the dta v115 standard. Let nvar be the number of variables and nobs the number of observations. The length of each block is in brackets.

  1. [16] "sersetreadwrite" (null terminated).
  2. [1] either "0x2" for versions 11-13 (gph format 3) or "0x3" for version 14 (gph format 4). 
  3. [1] I think this is a byte-order field, so 0x2 for lohi (standard Windows/Linux machines), otherwise 0x1.
  4. [4] Number of variables
  5. [4] Number of observations
  6. [nvar] The typlist. A byte characterizing the type of each variable.
  7. [nvar x 54] The varlist. Null-terminated strings of the variable names, 54 bytes per variable. For versions >=14 the width is 150.
  8. [nvar x 49] The fmtlist. Null-terminated strings of the variable formats, 49 bytes per variable. For versions >=14 the width is 57.
  9. [nvar x 8] The maximums. A double for each variable representing that variable's non-missing maximum. If the variable is a string then the double's missing value is used.
  10. [nvar x 8] The minimums. Same as above except for the minimums.
  11. [nobs x sum(sizeof each variable type)] Data: All data from an observation are contiguous.
Edit 2016-01-31: Figured out more about the bytes immediately after "sersetreadwrite" and about how version 14 is different.
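
The layout above can be read with a few lines of Python. This is a sketch under stated assumptions, not a verified reader. It assumes the two counts are 4-byte integers in the byte order given by the third field, and that the 0x3 marker selects the wider name and format fields. read_serset_header is a hypothetical name.

```python
import struct

MAGIC = b"sersetreadwrite\x00"  # 15 characters plus the terminating null


def _cstr(raw):
    """Decode a null-terminated string stored in a fixed-width field."""
    return raw.split(b"\x00", 1)[0].decode("utf-8", errors="replace")


def read_serset_header(buf):
    """Parse everything before the data block; buf must start at the marker."""
    if buf[:16] != MAGIC:
        raise ValueError("buffer does not start with the serset marker")
    version, byteorder = buf[16], buf[17]
    # Per the post, 0x2 is believed to mean lohi (little-endian).
    end = "<" if byteorder == 2 else ">"
    # Assumption: both counts are 4-byte integers in that byte order.
    nvar, nobs = struct.unpack_from(end + "ii", buf, 18)
    # Assumption: marker 0x3 (version 14) uses the 150/57 field widths.
    name_w, fmt_w = (150, 57) if version == 3 else (54, 49)
    pos = 26
    typlist = list(buf[pos:pos + nvar])
    pos += nvar
    names = [_cstr(buf[pos + i * name_w:pos + (i + 1) * name_w])
             for i in range(nvar)]
    pos += nvar * name_w
    formats = [_cstr(buf[pos + i * fmt_w:pos + (i + 1) * fmt_w])
               for i in range(nvar)]
    pos += nvar * fmt_w
    maxs = struct.unpack_from(f"{end}{nvar}d", buf, pos)  # maximums come first
    pos += 8 * nvar
    mins = struct.unpack_from(f"{end}{nvar}d", buf, pos)
    pos += 8 * nvar
    return {"version": version, "nvar": nvar, "nobs": nobs,
            "typlist": typlist, "names": names, "formats": formats,
            "max": maxs, "min": mins, "data_offset": pos}
```

For a serset embedded in a .gph file, one would first locate the marker (for example with data.index(MAGIC)) and pass the slice starting there. The returned data_offset is where the observation block begins.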

Thursday, February 27, 2014

Bigfoot module for LyX

For the lazy, here is a module that adds support for the bigfoot package to LyX (outlined here). Find your User directory from Help > About LyX and then put the file in "<User directory>/layouts/". Then do Tools > Reconfigure.

Friday, January 24, 2014

PDFs that allow toggling between images

Authors sometimes want their documents to be able to display different images in different settings. For instance, color charts often become indecipherable when printed in gray-scale so a separate version with patterns might be preferable. One simple way to achieve this in PDFs is with Optional Content Groups. See this PDF for a working example. It has a link that a user can click to toggle images in the PDF. The file was produced using LyX with the tikz and ocg-p LaTeX packages. This zip contains source materials for those who would like to create their own. OCGs don't use JavaScript so this method should be fairly portable.

Saturday, December 21, 2013

Combating projection bias

One manifestation of projection bias is where you think that your views won't change with time. For example, young people often think they will never want to be married or have children. I think a neat idea for a website would be to summarize from long-running surveys to see which opinions or beliefs change with age and which do not. You could even break out the data by subgroup so that someone could see, for example that "of the people with similar education and income levels who held my view, X% changed their mind". This might help when thinking about long-term commitments such as living location, jobs, and partners.

Wednesday, September 11, 2013

Automated Statistical Reports in MS Office

I prefer LyX to Microsoft Office when writing academic works, but sometimes MS Word/PowerPoint is necessary. My one requirement for the workflow is that manual fiddling isn't required. This is a bit challenging because Office products can't show PS/PDF material inline. You can include an EPS as a graphic, which prints OK (if you print to a PDF or have a PS printer). That is fine for Word, but inline it only shows an unavoidably ugly "preview", which doesn't work for PowerPoint. For presentations, you should output to a raster format (like PNG); for tables you can possibly output to EMF.

Generating EPS/PNG files from graphics is built into most stats software (R, Stata), so the only hard part is converting tables. I'll assume you can generate these as TeX and start there (so as to minimize differences between Office- and LaTeX-generated reports). The scripts below were tested on Windows 7 with Cygwin, and I had placed the programs used in step four in my path.
  1. Export a table to mytable.frag.tex.
  2. Wrap that tex fragment in a minimal "standalone" document.
    • >echo \documentclass[varwidth=true, border=10pt]{standalone} > mytable.tex
    • >echo \begin{document} >> mytable.tex
    • >cat mytable.frag.tex  >> mytable.tex
    • >echo \end{document} >> mytable.tex
  3. Make your table into a "standalone" PDF (the PDF will be just the size of the table). If your TeX distribution supports on-the-fly installation (e.g. MiKTeX), it will download the 'standalone' package automatically.
    • >pdflatex mytable.tex
  4. Convert the PDF to the output required.
    • >pdftops -eps mytable.pdf mytable.eps
      • Could also do this in Ghostscript like below.
    • >pstoedit -f emf mytable.pdf mytable.emf
      • This will change the fonts on you. If you have Greek letters this conversion might fail in which case you need to add -adt which turns letters into polygons (not perfect but OK). You can also convert with Adobe Illustrator which will also do font conversion.
    • >gswin64c -dSAFER -dBATCH -dNOPAUSE -sDEVICE=pnggray -r600 -sOutputFile=mytable.png mytable.pdf
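
Steps 2 through 4 can also be scripted so that no manual fiddling is needed. The Python sketch below only builds the wrapper document and the command lines listed above. The function names are hypothetical, and actually running the commands (for example with subprocess.run) assumes the same programs are on your path.

```python
def wrap_standalone(fragment, border="10pt"):
    """Return a complete LaTeX document that contains only the table fragment."""
    return (
        f"\\documentclass[varwidth=true, border={border}]{{standalone}}\n"
        "\\begin{document}\n"
        f"{fragment.rstrip()}\n"
        "\\end{document}\n"
    )


def conversion_commands(stem):
    """Return the command lines from steps 3 and 4 for the file stem given."""
    return [
        ["pdflatex", f"{stem}.tex"],
        ["pdftops", "-eps", f"{stem}.pdf", f"{stem}.eps"],
        ["pstoedit", "-f", "emf", f"{stem}.pdf", f"{stem}.emf"],
        ["gswin64c", "-dSAFER", "-dBATCH", "-dNOPAUSE", "-sDEVICE=pnggray",
         "-r600", f"-sOutputFile={stem}.png", f"{stem}.pdf"],
    ]
```

Writing wrap_standalone(open("mytable.frag.tex").read()) to mytable.tex and then running each entry of conversion_commands("mytable") in order reproduces the manual steps.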
There are two options for inserting. The first is the easiest but uses absolute paths for images, so it won't work well when sharing the files with other users (e.g. in Dropbox) or when moving the project folder.

  • Use normal image insertion but use the dropdown next to "Insert" and select "Insert and Link". Some VBA scripts are available that will convert the absolute path to relative paths (e.g. here).
  • Insert pictures using field codes. Insert -> Quick Parts -> IncludePicture. Then put in a relative path like pics\pic1.png and select "Data not stored with document" (and you probably want to check resizing both horizontally and vertically). If you want to convert to pictures (if you send it out), then remove the "Data not ..." by removing "\d" from the field code, and make a formatting edit (like going to Format Picture and then changing the Layout wrapping style).

Now when the linked files are updated by programs, they will be updated in Office. If the document is open then select an image (or use select all) and press F9.

Edit 2016-01-22: Expanded part about relative paths to images.

Also note that EMFs might not display correctly in PowerPoint on a Mac (same for MathType equations). You can print pptx files to PDF but then you will lose any animations.

Thursday, September 05, 2013

Modifying Stata library routines

I recently couldn't understand why some error messages were being generated by mata's optimize(). I found that it was quite easy to modify the routines to include some debugging print-outs. Broadly we want to create modified routines that mask the existing ones. Here are the basic steps:

  1. Find the mata file (in my case deriv.mata) in the Stata folder, copy it to your own directory, and make changes.
  2. Turn it into a personal library. See the basic template at -help mata_mlib-.
  3. Have the personal library called instead of the builtin routines. The -mata mlib index- command from the previous step prints the order in which the libraries are searched for a function.

Enjoy your modified version. When you're all done you can use -mata mlib index- again to get the default ordering.

Wednesday, June 26, 2013

Strategic voting and Oxford-style debates

In determining which side wins a debate, one variant of Oxford-style debates has people vote about a proposition before and after hearing the debate. The side that sways the most voters towards its position wins. For example, if an audience votes initially that they are (60% For, 20% Against, 20% Undecided) and afterward they are (65% For, 30% Against, 5% Undecided) then the Against side wins (they gained 10 percentage points while the other side only gained 5). This system, though, incentivizes dishonesty in the first round. If I believe I am unlikely to change my mind during the debate, I should vote opposite my current view in the initial vote.
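
The scoring rule is simple enough to write down. Here is a small Python sketch; debate_winner is a hypothetical name, and the second scenario in the usage note is made up purely to illustrate the incentive.

```python
def debate_winner(before, after):
    """Return (winner, swing_for, swing_against) from vote shares in percent.

    before and after are dicts with integer shares under "For" and "Against".
    """
    swing_for = after["For"] - before["For"]
    swing_against = after["Against"] - before["Against"]
    if swing_for > swing_against:
        winner = "For"
    elif swing_against > swing_for:
        winner = "Against"
    else:
        winner = "Tie"
    return winner, swing_for, swing_against


# The example from the text: Against gains 10 points, For gains 5.
print(debate_winner({"For": 60, "Against": 20}, {"For": 65, "Against": 30}))
# → ('Against', 5, 10)
```

Now suppose (hypothetically) that 10 points' worth of committed For supporters had voted Against in the first round, so the initial tally was (50, 30) with the same final tally of (65, 30). Then debate_winner({"For": 50, "Against": 30}, {"For": 65, "Against": 30}) returns ('For', 15, 0), flipping the result without anyone changing their mind.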

To see if strategic voting is a problem, one could take a debate's audience and have half of the people's votes be noted but not counted towards the winner. If the counts of the two groups differ, then something might be going on.

The harder question is how do you incentivize truth telling for those that switch.
Edit: One way might be to just have a vote at the end but have a prediction market (for the final vote) running throughout the debate. The winner of the debate is the side that moved the market price the most.

Wednesday, July 18, 2012

The alphabet soup of fried chicken places


There is a wide variety of fried chicken restaurants in Dhaka, Bangladesh. There are so many that someone mentioned trying to find one for each letter of the alphabet. So here's a list of "_FC".

A: American
B: Best
C: Cripsy, California, CP (unsure what CP stands for)
D: Dhaka
E: 
F: Fortuna
G: Golden, Good
H: Happy
I:
J:
K: Kentucky, Kok Kok
L: Lime's
M: Macdanald?
N: (in Mohakhali, but unsure what N stands for)
O:
P:
Q:
R:
S: Southern (not sure they still exist...)
T: Taitan
U:
V: Vhoot (means "ghost" in Bengali)
W: (in Banani, but unsure what W stands for)
X:
Y:
Z:

Not bad at over 50% of the letters!

Tuesday, July 17, 2012

DC metro station interior information

On my normal WMATA routes, I (like many other commuters) know where to get on the train so that I end up next to the exit at my destination (this is possible because the trains almost always stop at the front of the platform). But if I'm going to an unfamiliar station I can't shorten the walk at the destination station by walking a bit at my departure station. Knowing the location on the platform of the escalators/stairs/elevators could save time at those stations which only have a single exit on one end. I imagine this type of information would be especially useful for those with impaired mobility. WMATA comes close to providing this info by allowing one to find out about the status of escalators/elevators. It provides an accompanying description (e.g. "Escalator between street and mezzanine") but it's sadly not specific enough.

I've been told that WMATA does have this information digitally so hopefully they'd be able to provide it. Otherwise, the information could be user-gathered. The Washington Post did that with their Metro cell phone service map, which was a publicly-editable custom map on Google Maps. Another option would be to have users edit OpenStreetMap. Or possibly one of the smartphone apps (which would benefit from this info) could provide an option for user-submitted information. Anyone want to champion one of these options?

(And for the smartphone app developers out there, could you integrate the direction of station exits like StationMasters does? It's a small point but kind of nice.)

Sunday, July 15, 2012

Crowdsourcing journal/data information

In academia, researchers often work with and learn from data and methods that have been used before. I think there are new projects that could fill existing gaps in this process and therefore speed up learning and research by reducing redundancy. I'll highlight three here. If they already exist, please let me know. If they don't, hopefully someone could start the project and allow crowd-sourcing of this information (could be as simple as setting up a wiki and collecting links to places where some of this data already exists):

  1. Data scripts: Publicly available datasets are often not in the best of shape. They often need to be cleaned up, labeled, linked to other data sources, or processed in standard ways. Additionally, there may be external information about the quality or other facts that should be documented and understood by researchers. Documentation and scripts (in multiple languages) would be the goal here. See for example asdfree.com.
  2. Study replication: Researchers often try to replicate existing studies in order to understand a method, for a class project, or to conduct extensions. As many authors do not contribute accompanying code for working with the data, there is a lot of reverse-engineering that has to be done. Sharing replication code would not only save time, but also disseminate important information about the implicit assumptions in papers.
  3. Typo corrections: Ever puzzled over an equation in a journal article and gone through the trouble of finding that there is a typo in it? Published material is never perfect, and unless it is a large mistake, authors normally don't post corrections. But noting small (non-controversial) corrections could still save a lot of time. Obvious spelling mistakes are not worth the time to correct, but even if conclusions aren't overturned it is still helpful to correct intermediate steps.
    Edit: PubPeer seems to provide this function.

Saturday, July 14, 2012

Smartphone app for verbal confidential agreements

A professor recently cautioned a graduate-student class not to share research ideas too liberally as sometimes professors steal the ideas of students. It is helpful to get input on ideas, however, so using a mechanism such as an NDA would seem appropriate in preventing this. I would imagine that the most vulnerable type of interaction is the face-to-face meeting. In this case what would be nice is to audio-record a quick declaration of confidentiality and then the subsequent conversation. If an IP lawyer made a short script that could be read and recorded then a smartphone app could be created that would play the "this is a confidential conversation" audio and then start recording. Any takers?

Wednesday, July 11, 2012

StatWeave in LyX

While it's pretty easy to integrate R into LyX documents (using Sweave or knitr) I hadn't found any native way of integrating Stata code. I've hacked StatWeave (which allows Stata, R, and a bunch of other stat languages) to work in my LyX (in Windows 7 with LyX 2.0.4) and thought that others might like this as well. Here are the steps:

  1. Copy statweave.module to the "Resources/layouts" of your user directory (see Help→About LyX).
  2. Open LyX and go to Tools→Reconfigure.
  3. Restart LyX and go to Tools→Preferences.
  4. On the left choose File Handling→File Formats.
  5. Click the New... button and then fill in Format: StatWeave, check Document format, Short Name: statweave, Extension: swv. Then Apply.
  6. On the left choose File Handling→Converters.
  7. Using the drop-down menus, create a new one from "StatWeave" to "LaTeX (pdflatex)" with the command "statweave --target tex $$i" (make the changes and then click "Add" in the upper right). Then Save.
Once you get that set up you should be able to compile: stata-test-bq.lyx. Notes:
  • LyX disables the enter key from inserting a line-break in insets (if anyone knows why, or knows a workaround, please let me know). You can paste them, though, so copying several lines of code into a Stata Code Chunk inset works fine (if you are writing it in LyX then just copy a line break from somewhere and paste as needed).
  • If you don't want to use the custom insets you can always just use plain TeX code inserts (e.g. \begin{Statacode} ... \end{Statacode} or \Stataexpr{...}).
  • The StatWeave manual is helpful for the non-LyX part.
  • When configuring your Stata executable do not include "do " before "%codefile%".
  • You may need to edit the MiKTeX path given that the default ones are for version 2.6.
  • If you want to use relative paths for input files then you can pass the directory of the LyX file to Stata as an environment variable. In the converter replace the statweave command with "set orig_PWD=$$r && statweave --target tex $$i" (then Modify→Save). Then in Stata you can pick up that directory with -global orig_PWD =trim("`:environment orig_PWD'")- (it will have the final slash). Don't cd around in the program or StatWeave won't know where its graphics files are for conversion.
Suggestions are very much appreciated.

Update: I've made a newer version that just focuses on Stata and allows for code block options. You can now easily show figures by right-clicking in the block, choosing opts, and then putting in a string like fig, height=4.5in, width=9in, dispw=4in