I have just tagged a new major version of SASLib. I took the chance to migrate version number to 1.x, as the library/API has been stable and unchanged for quite a while.
The main updates are:
Fully adopt Julia 1.x by removing old v0.6-specific code and adding Project.toml.
Implement Tables.jl interface for better integration with the rest of data ecosystem
An updated performance test results against Python/Pandas and ReadStat C-library
The performance results made me take another look at ReadStat.jl. Turns out there was something like a type instability festival happening in the most inner loop I just released a new version that fixes that and should give much better performance across the board.
I reran the benchmark in SASLib with that new version, and I think the only test where SASLib was still faster than ReadStat was the data_misc/numeric_1000000_2.sas7bdat test (but I did this hastily, would be great if you could rerun the results in your repo!). But maybe that is a bit of an apples and oranges comparison, because SASLib doesn’t handle missing values, it just hands them down as NaN, right?
I think the problem is that ReadStat’s logic for missing values detection runs always, even if the file doesn’t have missing values in it. As far as I could tell one can’t tell beforehand in the file format whether a column has missing values in it, right? Essentially one has to add a check for every single value that is read that a) checks for NaN, and b) checks for these declared missing ranges (or was that for a different file format that is also supported by ReadStat?). But I think those checks then also run for files that actually don’t have any missing values in them, at least that is what I believe is happening in ReadStat.