Monday, March 17, 2014

Stata's serset file format

Sersets are minimal versions of datasets in Stata. They are commonly embedded in .gph graphics files to store the data so that Stata can reproduce the graph, and they are sometimes also written directly to files. Stata does not document the file format, but it is very similar to the Stata dta formats, v115 and earlier. If you need to access the data in a serset file or a .gph file, here is the basic layout. The description below assumes that you are generally familiar with the dta v115 standard. Let nvar be the number of variables and nobs the number of observations. The length of each block, in bytes, is given in brackets.

  1. [16] "sersetreadwrite" (null terminated).
  2. [1] A version byte: either 0x2 for versions 11-13 (gph format 3) or 0x3 for version 14 (gph format 4).
  3. [1] I think this is a byte-order field: 0x2 for lohi (little-endian, as on standard Windows/Linux machines), otherwise 0x1.
  4. [4] Number of variables.
  5. [4] Number of observations.
  6. [nvar] The typlist. A byte characterizing the type of each variable.
  7. [nvar x 54] The varlist. Null-terminated strings of the variable names, 54 bytes per name. For versions >=14 each name takes 150 bytes instead.
  8. [nvar x 49] The fmtlist. Null-terminated strings of the variable formats, 49 bytes per format. For versions >=14 each format takes 57 bytes instead.
  9. [nvar x 8] The maximums. A double for each variable representing that variable's non-missing maximum. If the variable is a string then the double's missing value is used.
  10. [nvar x 8] The minimums. Same as the maximums, except that each double holds the variable's non-missing minimum.
  11. [nobs x sum(sizeof each variable type)] The data. All values from an observation are stored contiguously, one observation after another.
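
The layout above can be sketched as a small Python reader. This is only an illustration under a few assumptions that the list does not confirm: the typlist codes and sizes are taken to be the dta v115 ones (1-244 for strings of that length, 251-255 for byte, int, long, float and double), the two counts are read as signed 4-byte integers, and names are decoded as UTF-8 for the version 14 layout and Latin-1 otherwise. The function name parse_serset and the keys of the returned dict are made up for this example. Numeric missing values are not translated; they come back as the raw stored numbers.

```python
import struct

MAGIC = b"sersetreadwrite\x00"  # block 1: 15 characters plus the null terminator
# Assumption: numeric type codes and sizes follow dta v115.
NUMERIC = {251: "b", 252: "h", 253: "i", 254: "f", 255: "d"}


def parse_serset(buf):
    """Parse a serset held in `buf` (bytes) into a plain dict."""
    if buf[:16] != MAGIC:
        raise ValueError("missing sersetreadwrite signature")
    version, order = buf[16], buf[17]                 # blocks 2 and 3
    endian = "<" if order == 2 else ">"               # 0x2 = lohi (little-endian)
    name_len, fmt_len = (150, 57) if version >= 3 else (54, 49)
    encoding = "utf-8" if version >= 3 else "latin-1"  # assumption, see lead-in

    def cstr(raw):
        # Keep everything before the first null byte.
        return raw.split(b"\x00", 1)[0].decode(encoding, errors="replace")

    def strings(start, width):
        return [cstr(buf[start + i * width:start + (i + 1) * width])
                for i in range(nvar)]

    nvar, nobs = struct.unpack_from(endian + "ii", buf, 18)  # blocks 4 and 5
    pos = 26
    typlist = list(buf[pos:pos + nvar])               # block 6
    pos += nvar
    names = strings(pos, name_len)                    # block 7
    pos += nvar * name_len
    formats = strings(pos, fmt_len)                   # block 8
    pos += nvar * fmt_len
    maxs = list(struct.unpack_from(f"{endian}{nvar}d", buf, pos))  # block 9
    pos += 8 * nvar
    mins = list(struct.unpack_from(f"{endian}{nvar}d", buf, pos))  # block 10
    pos += 8 * nvar

    # Build one struct format describing a whole observation (block 11).
    row_fmt = endian
    for t in typlist:
        if t in NUMERIC:
            row_fmt += NUMERIC[t]
        elif 1 <= t <= 244:
            row_fmt += f"{t}s"                        # fixed-width string
        else:
            raise ValueError(f"unknown type code {t}")
    row_size = struct.calcsize(row_fmt)

    rows = []
    for _ in range(nobs):
        values = struct.unpack_from(row_fmt, buf, pos)
        rows.append(tuple(cstr(v) if isinstance(v, bytes) else v
                          for v in values))
        pos += row_size

    return {"version": version, "types": typlist, "names": names,
            "formats": formats, "max": maxs, "min": mins, "rows": rows}
```

For a serset written directly to a file, the whole file contents can be passed in as bytes. For a .gph file, the serset would first have to be located inside the file; searching for the "sersetreadwrite" string and slicing from there is one guess at how to do that, which I have not verified.
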

Edit 2016-01-31: Figured out more about the bytes immediately after "sersetreadwrite" and about how version 14 is different.