Column Orientation
The census microfiles are quite large, and reading the full file to obtain a few variables can be tiresome. It takes about 7 minutes to read 1940.dta into Stata at the NBER. Since most users need only a fraction of the 190 available variables, considerable time can be saved by reading only the variables that are needed. While Stata offers variable and observation selection on the -use- command line, that doesn't make reading any faster: all the values are read, and only the selected variables are kept. This is inevitable for a package whose binary format holds the data row after row.
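To see why row-major storage forces a full read, consider a toy binary file that, like a .dta file, stores one fixed-width record per observation. This is an illustrative sketch only (a made-up three-variable layout, not Stata's actual file format): extracting a single variable still means touching every record.

```python
import struct, io

# Toy row-major file: each record packs (age, sex, race) as three bytes,
# roughly how a row-oriented format lays out one observation after another.
records = [(35, 1, 2), (42, 0, 1), (17, 1, 1)]
f = io.BytesIO(b"".join(struct.pack("3B", *r) for r in records))

# To extract just "age" we must still read every 3-byte record --
# the other variables are interleaved with the one we want.
ages = []
while True:
    rec = f.read(3)          # one full record per observation
    if not rec:
        break
    age, sex, race = struct.unpack("3B", rec)
    ages.append(age)         # keep only the selected variable

print(ages)                  # [35, 42, 17]
```

Selecting variables on the command line changes only the last line of the loop; the reads themselves are unavoidable.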
Parquet
Widely used and standardized formats from the HPC community, such as Parquet, store data in column order: all the values for one variable are stored contiguously, followed by all the values for the next variable, and so on. With this storage layout a single variable can be read by seeking to the location on disk where the variable starts and reading only to the end of that variable.
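The column layout can be sketched the same way (again a toy format for illustration, not actual Parquet, which adds row groups, metadata, and compression): one seek and one short read retrieve a whole variable.

```python
import struct, io

# Toy column-major file: all ages, then all sexes, then all races.
ages, sexes, races = [35, 42, 17], [1, 0, 1], [2, 1, 1]
n = len(ages)
f = io.BytesIO(struct.pack(f"{n}B", *ages) +
               struct.pack(f"{n}B", *sexes) +
               struct.pack(f"{n}B", *races))

# Reading only "sex": seek past the age column and read n bytes,
# never touching the race column at all.
f.seek(n)                          # ages occupy the first n bytes
sex = struct.unpack(f"{n}B", f.read(n))
print(list(sex))                   # [1, 0, 1]
```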
Mauricio Caceres Bravo has posted code for Stata-Parquet, which can read and write Parquet files, but it makes two copies of the full dataset in the process of writing a Parquet file. The additional memory required makes it difficult to use with very large datasets. We do have Parquet versions of the Census-IPUMS files, which demonstrate the superior compression achieved with column storage: 84GB for the compressed .dta files against 23GB for Parquet. If you install Stata-Parquet you can load a few variables with this command:
parquet use age race sex using /home/data/census-ipums/current/1940.parquet
As of this writing we have been unable to make -stata-parquet- work for all users without individual installs.
One-variable .dta files
As a stopgap, and to demonstrate the advantages of column storage, I have created directories with each variable in a separate .dta file. The user can merge these together to create a working file. For example:

cd /home/data/census-ipums/current/dta-column/1940
use age
merge 1:1 _n using sex
merge 1:1 _n using race

loads a three-variable workspace. There is an ado file -ceniprc.ado- that can simplify this:

ceniprc 1940 age sex race

brings in the selected variables in one command.
An advantage of column storage is that compression is much more efficient. The compressed single .dta file is 22GB, while the sum of the 193 individual variable files (also compressed) is only 9.4GB.
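The compression gain is easy to reproduce in miniature. The sketch below is illustrative only (it uses synthetic data and zlib, not Stata's or Parquet's actual compressors): compressing the same census-like records once in row order and once in column order shows that grouping each variable's values together gives the compressor longer runs of similar bytes to work with.

```python
import random, zlib

random.seed(0)
n = 50_000
# Synthetic records: a varied age, a binary sex code, a low-cardinality race code.
age  = bytes(random.randint(0, 90) for _ in range(n))
sex  = bytes(random.randint(0, 1)  for _ in range(n))
race = bytes(random.choice((1, 1, 1, 2, 3)) for _ in range(n))

# Row-major: interleave the three variables record by record.
row_major = bytes(b for rec in zip(age, sex, race) for b in rec)
# Column-major: each variable stored contiguously.
col_major = age + sex + race

row_size = len(zlib.compress(row_major, 9))
col_size = len(zlib.compress(col_major, 9))
print(row_size, col_size)   # the column-ordered copy compresses smaller
```

The repetitive sex and race columns compress to almost nothing once they are no longer interleaved with the high-entropy age values.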
However, the Stata -merge- command is slower than -use-: while the -use- statement for the first variable takes only 5 seconds, each variable merged after the first takes about 28 seconds to load. At those rates the breakeven point in wall-clock time is about 15 variables. With purpose-built software this could be much better.
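The breakeven point follows directly from the measured timings above (a 420-second full read, 5 seconds for the first -use-, and about 28 seconds for each subsequent -merge-):

```python
# Measured costs from the text: full read ~7 min; the column approach pays
# 5 s for the first variable plus ~28 s for each additional merge.
FULL_READ = 7 * 60          # seconds to -use- the whole 1940.dta
FIRST, MERGE = 5, 28        # seconds for first -use- and each -merge-

def column_time(k):
    """Wall-clock seconds to assemble k variables from one-variable files."""
    return FIRST + MERGE * (k - 1)

# Largest k for which merging still beats reading the full file.
breakeven = max(k for k in range(1, 200) if column_time(k) < FULL_READ)
print(breakeven, column_time(breakeven))   # 15 variables, 397 seconds
```

At 16 variables the merge route takes 425 seconds and the full read wins.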
Conclusion
In a sense, the row-major storage format chosen by Stata is surprising. While natural in a database, where it allows convenient update of a single record, it offers little advantage to Stata, which never allows updates other than by rewriting the entire file.
Our next step should be a new version of Stata-Parquet that calls the Parquet API directly, rather than using Python, NumPy and Arrow intermediaries.
Daniel Feenberg
19 September 2019