[ANN] ReadStatTables.jl v0.2.2 is faster than all related packages for reading Stata files

ReadStatTables.jl is a package for reading data files from Stata, SAS and SPSS into Julia tables. The package has improved substantially in recent releases. Since the v0.2.0 rewrite, it no longer relies on ReadStat.jl for parsing data files and instead calls the ReadStat C library directly. This gives access to the full functionality of ReadStat and brings a significant performance improvement.

With the v0.2.2 release, multithreaded reading is supported and turned on by default for the Stata .dta, SAS .sas7bdat and SPSS .sav formats. Benchmark results here show that ReadStatTables.jl reads Stata .dta files significantly faster than all well-recognized open-source packages, including pandas and all other packages based on ReadStat (e.g., haven). I expect the same to hold for SAS and SPSS files, but I have not verified it.
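
For anyone who wants to try the multithreaded reader, here is a minimal sketch. The file name data.dta and the helper read_default_and_serial are hypothetical, and I am assuming the ntasks keyword of readstat is how the number of concurrent reading tasks is controlled (check the docstring of readstat for your installed version). Julia needs to be started with more than one thread, for example julia --threads=4, for the default to actually read in parallel.

```julia
using ReadStatTables

# Hypothetical helper: read the same file with the default settings and
# again with a single task, so the two code paths can be compared.
function read_default_and_serial(path::AbstractString)
    # Default: the number of tasks is chosen by the package
    tb_default = readstat(path)
    # Assumption: ntasks = 1 requests a single reading task
    tb_serial = readstat(path; ntasks = 1)
    return tb_default, tb_serial
end

path = "data.dta"  # hypothetical file name; replace with your own file
if isfile(path)
    tb_default, tb_serial = read_default_and_serial(path)
    println("Threads available: ", Threads.nthreads())
    display(tb_default)
else
    println("No file found at ", path, "; nothing was read.")
end
```

The returned table follows the Tables.jl interface, so it can be passed on to other table types as usual.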


How fast is Stata in comparison?

I am not entirely sure what the comparable way to measure that in Stata would be. If I put a timer directly around the use datafile statement in Stata, the elapsed time is of the same order of magnitude as that measured for ReadStatTables.jl. However, if I repeatedly turn the timer on and off in a forvalues loop inside a program, the average time is less than a quarter of what is measured in a single run. So the open-source solutions are probably still slower than Stata, but the gap is smaller now.
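
The single-run versus repeated-run distinction can be made explicit on the Julia side too. The sketch below is only an illustration: single_vs_repeated is a hypothetical helper, and the workload is a stand-in that reads a temporary file of random bytes rather than a real .dta file. In Julia the first call may include compilation and a cold file cache; whether the gap I see in Stata has a similar cause is not something I have established.

```julia
# Hypothetical helper: time one call, then the mean over several repeated calls.
function single_vs_repeated(f; reps::Integer = 10)
    # First call: may include compilation time and a cold file cache
    single = @elapsed f()
    total = 0.0
    for _ in 1:reps
        # Later calls: typically run against warm caches
        total += @elapsed f()
    end
    return (single = single, mean_repeated = total / reps)
end

# Stand-in workload: a temporary file of random bytes.
# Replace the closure with e.g. () -> readstat("data.dta") for a real test.
path, io = mktemp()
write(io, rand(UInt8, 10^6))
close(io)

result = single_vs_repeated(() -> read(path); reps = 5)
println(result)
rm(path)
```

Reporting both numbers side by side makes it clearer which one a benchmark figure refers to.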
