SASLib v1.1.0 release

Hi all,

I have just tagged a new major version of SASLib. I took the chance to move the version number to 1.x, as the library and its API have been stable and unchanged for quite a while.

The main updates are:

  • Fully adopt Julia 1.x by removing old v0.6-specific code and adding Project.toml.
  • Implement the Tables.jl interface for better integration with the rest of the data ecosystem.
  • Updated performance test results against Python/Pandas and the ReadStat C library.

A sneak peek at the latest performance comparison results:

Test                            Result
py_jl_homimp_50.md              30x faster than Python/Pandas
py_jl_numeric_1000000_2_100.md  10x faster than Python/Pandas
py_jl_productsales_100.md       50x faster than Python/Pandas
py_jl_test1_100.md              120x faster than Python/Pandas
py_jl_topical_30.md             30x faster than Python/Pandas

Being able to read files in chunks would be awesome!

It already supports reading a file in chunks (see the Incremental Reading section of the README).

Are you referring to the multi-threaded aspect per issue #5 or the random access idea per issue #35?

If I recall correctly, the issue is that when reading large files in chunks, it somehow reads the whole file first and then chunks it. It’s really slow.
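To make the difference concrete, here is a minimal Python sketch of the two behaviours being described: decoding everything and then slicing, versus decoding only one chunk per iteration. This is not SASLib's code and not the sas7bdat layout; the `ROW` format (two little-endian float64 columns) and both function names are assumptions made up for illustration.

```python
import io
import struct

# Hypothetical fixed-width row: two little-endian float64 columns.
ROW = struct.Struct("<2d")

def read_all_then_chunk(stream, chunk_rows):
    """Eager: decode the entire stream first, then slice it into chunks."""
    data = stream.read()
    rows = [ROW.unpack_from(data, off) for off in range(0, len(data), ROW.size)]
    for start in range(0, len(rows), chunk_rows):
        yield rows[start:start + chunk_rows]

def read_incrementally(stream, chunk_rows):
    """Lazy: read and decode only chunk_rows rows per iteration."""
    while True:
        buf = stream.read(ROW.size * chunk_rows)
        if not buf:
            break
        yield [ROW.unpack_from(buf, off) for off in range(0, len(buf), ROW.size)]

# Ten rows of sample data held in memory instead of a real file.
payload = b"".join(ROW.pack(float(i), float(i) * 2) for i in range(10))
eager = list(read_all_then_chunk(io.BytesIO(payload), 4))
lazy = list(read_incrementally(io.BytesIO(payload), 4))
```

Both produce the same chunks, but the eager version pays for the whole file before the first chunk is available, which would match the slowness described above if that is indeed what happens.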

See https://github.com/tk3369/SASLib.jl/issues/50

Awesome!

The performance results made me take another look at ReadStat.jl. Turns out there was something like a type instability festival happening in the innermost loop :slight_smile: I just released a new version that fixes that and should give much better performance across the board.

I reran the benchmark in SASLib with that new version, and I think the only test where SASLib was still faster than ReadStat was the data_misc/numeric_1000000_2.sas7bdat test (but I did this hastily, would be great if you could rerun the results in your repo!). But maybe that is a bit of an apples and oranges comparison, because SASLib doesn’t handle missing values, it just hands them down as NaN, right?

I believe ReadStat doesn’t support binary compressed SAS? (RDC compression)

I think that is so. ReadStat.jl just exposes whatever the ReadStat C library does.

I can definitely re-run the performance test again!

Yes, SASLib doesn’t deal with missing data yet. An enhancement request is logged here: https://github.com/tk3369/SASLib.jl/issues/43

The data_misc/numeric_1000000_2.sas7bdat file does not contain any missing values, so it isn’t necessarily a bad test.

I think the problem is that ReadStat’s missing-value detection always runs, even if the file doesn’t have missing values in it. As far as I could tell, the file format gives no way to know beforehand whether a column has missing values in it, right? Essentially one has to add a check for every single value read that a) checks for NaN, and b) checks for these declared missing ranges (or was that for a different file format that is also supported by ReadStat?). But I think those checks then also run for files that don’t actually have any missing values in them; at least that is what I believe is happening in ReadStat.
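As a rough illustration of the per-value cost described above, here is a small Python sketch. It is not ReadStat's actual logic; `decode_column` and its `missing_ranges` parameter are hypothetical names, and the point is only that both checks run for every value whether or not the column contains anything missing.

```python
import math

def decode_column(values, missing_ranges=()):
    """Replace missing entries with None.

    missing_ranges is a hypothetical list of declared (low, high)
    ranges whose values count as missing.
    """
    out = []
    for v in values:
        # a) NaN check, executed for every value
        if math.isnan(v):
            out.append(None)
        # b) declared-range check, also executed for every non-NaN value
        elif any(lo <= v <= hi for lo, hi in missing_ranges):
            out.append(None)
        else:
            out.append(v)
    return out

# A column without missing values still goes through both checks.
clean = decode_column([1.0, 2.5, 3.0])
dirty = decode_column([1.0, float("nan"), 99.0], missing_ranges=[(99.0, 99.0)])
```

A reader that simply passes NaN through, as SASLib is described as doing, skips this branch entirely, which is one reason the comparison may not be like for like.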