SASLib v1.1.0 release

tk3369 · January 4, 2020, 7:19am

Hi all,

I have just tagged a new major version of SASLib. I took the chance to migrate the version number to 1.x, as the library/API has been stable and unchanged for quite a while.

The main updates are:

Fully adopt Julia 1.x by removing old v0.6-specific code and adding Project.toml.
Implement the Tables.jl interface for better integration with the rest of the data ecosystem.
Updated performance test results against Python/Pandas and the ReadStat C library.

A sneak peek at the latest performance comparison results:

Test	Result
py_jl_homimp_50.md	30x faster than Python/Pandas
py_jl_numeric_1000000_2_100.md	10x faster than Python/Pandas
py_jl_productsales_100.md	50x faster than Python/Pandas
py_jl_test1_100.md	120x faster than Python/Pandas
py_jl_topical_30.md	30x faster than Python/Pandas

xiaodai · January 4, 2020, 10:33am

Being able to read files in chunks would be awesome!

tk3369 · January 4, 2020, 10:45pm

It already supports reading files in chunks (see the Incremental Reading section of the README).

Are you referring to the multi-threaded aspect per issue #5 or the random access idea per issue #35?

xiaodai · January 4, 2020, 10:56pm

If I recall correctly, the issue is that when reading large files in chunks, it somehow reads the whole file first and then chunks it. It’s really slow.

See https://github.com/tk3369/SASLib.jl/issues/50
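To make the distinction concrete, here is a minimal sketch of what reading lazily in chunks means, written in Python purely for illustration. It does not reflect SASLib's internals or the real sas7bdat layout (which is page-based and may be compressed); the record layout and the `read_chunks` name are hypothetical.

```python
import io
import struct

# Hypothetical layout: each row is a single little-endian float64.
RECORD = struct.Struct("<d")

def read_chunks(stream, rows_per_chunk):
    """Yield lists of rows, pulling only one chunk's worth of bytes per step.

    Assumes the stream holds a whole number of records.
    """
    while True:
        data = stream.read(RECORD.size * rows_per_chunk)
        if not data:
            break  # end of stream
        yield [row[0] for row in RECORD.iter_unpack(data)]

# Build a small in-memory "file" with five rows and read it two rows at a time.
stream = io.BytesIO(b"".join(RECORD.pack(float(i)) for i in range(5)))
chunks = list(read_chunks(stream, rows_per_chunk=2))
```

Because the generator only calls `stream.read` when the next chunk is requested, the cost of getting the first chunk does not depend on the file size. A reader that loads everything first and slices afterwards pays the full cost before returning anything.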

davidanthoff · January 6, 2020, 6:35pm

Awesome!

The performance results made me take another look at ReadStat.jl. Turns out there was something like a type instability festival happening in the innermost loop. I just released a new version that fixes that and should give much better performance across the board.

I reran the benchmark in SASLib with that new version, and I think the only test where SASLib was still faster than ReadStat was the data_misc/numeric_1000000_2.sas7bdat test (but I did this hastily, would be great if you could rerun the results in your repo!). But maybe that is a bit of an apples and oranges comparison, because SASLib doesn’t handle missing values, it just hands them down as NaN, right?

xiaodai · January 6, 2020, 11:54pm

I believe ReadStat doesn’t support binary compressed SAS? (RDC compression)

davidanthoff · January 6, 2020, 11:56pm

I think that is so. ReadStat.jl just exposes whatever the ReadStat C library does.

tk3369 · January 7, 2020, 7:22am

I can definitely re-run the performance test!

Yes, SASLib doesn’t deal with missing data yet. An enhancement request was logged here https://github.com/tk3369/SASLib.jl/issues/43

The file data_misc/numeric_1000000_2.sas7bdat does not contain any missing values, so it isn’t necessarily a bad test.

davidanthoff · January 7, 2020, 5:10pm

I think the problem is that ReadStat’s logic for missing value detection always runs, even if the file doesn’t have missing values in it. As far as I could tell, one can’t tell beforehand from the file format whether a column has missing values in it, right? Essentially one has to add a check for every single value that is read that a) checks for NaN, and b) checks for these declared missing ranges (or was that for a different file format that is also supported by ReadStat?). But I think those checks then also run for files that don’t actually have any missing values in them; at least that is what I believe is happening in ReadStat.
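A rough sketch of the per-value check described above, in Python for illustration only. This is not ReadStat's actual code; `is_missing`, `read_column`, and the `missing_ranges` parameter are hypothetical names, and the range check is an assumption about how declared missing ranges would be applied.

```python
import math

def is_missing(value, missing_ranges=()):
    """Return True if value is NaN or lies inside a declared missing range."""
    if math.isnan(value):  # check (a): NaN
        return True
    # check (b): declared missing ranges, given as (low, high) pairs
    return any(low <= value <= high for low, high in missing_ranges)

def read_column(raw_values, missing_ranges=()):
    """Replace missing values with None.

    The check runs once per value, whether or not the column
    actually contains any missing values.
    """
    return [None if is_missing(v, missing_ranges) else v for v in raw_values]

column = read_column([1.0, float("nan"), 99.0, 2.5], missing_ranges=[(99.0, 99.0)])
clean = read_column([1.0, 2.0])  # no missing values, but every value is still checked
```

The second call shows the cost in question: a file without missing values still pays for both checks on every value, which is why a reader that simply passes NaN through can look faster on such a file.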
