ANN: SASLib.jl

Hello everyone!

I am happy to announce a new package, SASLib.jl. Please see the GitHub project page for more information. -Tom
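For anyone curious, basic usage looks roughly like this. It is a minimal sketch: the file name is a placeholder, and the shape of the returned value (a `Dict` of data plus metadata) matches the REPL output shown later in this thread.

```julia
using SASLib

# readsas reads the whole sas7bdat file and returns a Dict
# containing the column data plus file metadata.
x = readsas("mydata.sas7bdat")   # placeholder file name

x[:nrows], x[:ncols]   # dataset dimensions
x[:column_names]       # column names as strings
x[:data]               # Dict mapping column symbols to vectors
```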


Already tweeting about it

I’d be interested to know how this compares to StatFiles.jl.

I think StatFiles uses the ReadStat package, which does not support compressed sas7bdat files.

Right, I understand that there are some differences (compressed files, incremental reads). I just mean a performance comparison of read and write times, perhaps as a function of file size.

I work with large sas7bdat files and typically still end up doing half of that work in SAS, because I haven’t found the Julia data ecosystem ready for those yet. Hopefully the incremental reading capability of your package will help with that.

I see. The library can only read files, not write them. I would love to hear any feedback about performance if you can try it on your files.

@tbeason Hi, can you tell me more about your needs? I don’t mean to turn this forum into a sales opportunity, but my company makes a fast SAS reader. It can convert a large SAS dataset into CSV, in chunks or as a whole. I am trying to make a Julia wrapper around it, but I am waiting for more mature Julia-to-Rust interop. If you are interested, I can send you an alpha-testing copy in the form of a Julia package. Which OS are you using?

SAS itself isn’t the issue. The issues are all about trying to work with the data in Julia. If the SAS file is >10 GB, then I’m basically out of luck when it comes to using Julia, because loading the file and working with it is near impossible for me right now.

It’d be nice for this package to support column selection (I see the issue has already been opened).

OK, then JuliaDB.jl might be one to watch. Not sure how ready it is for your workload, as it feels quite early in development.

I don’t normally work with such big files. Are you dealing with mostly numerical data?

I generated an 8 GB file with 200,000,000 x 5 random numbers from SAS and read it into Julia. The file is physically stored on a corporate SAN, so it’s not the quickest. It completed successfully.

Reading a subset of columns is a high priority from my perspective. I’m already working on it, and it will be released soon.
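A rough sketch of how column selection might look once released. The keyword name below is an assumption on my part, not a released API, so treat it as illustrative only:

```julia
using SASLib

# Hypothetical: read only two of the five columns.
# The `include_columns` keyword is an assumption about the
# upcoming API, not a documented feature at the time of writing.
x = readsas("unif.sas7bdat", include_columns = [:m, :u])
```

Skipping the unwanted columns at read time would avoid both the parsing and the allocation cost for those columns, which matters a lot on multi-gigabyte files.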

shell> ls -l unif.sas7bdat
-rw-r--r-- 1 tkwong staff 8070946816 Dec 28 16:31 unif.sas7bdat

julia> @time x = readsas("unif.sas7bdat")
Read data set of size 200000000 x 5 in 132.04 seconds
132.041671 seconds (2.31 G allocations: 194.824 GiB, 8.55% gc time)
Dict{Symbol,Any} with 16 entries:
  :filename             => "unif.sas7bdat"
  :page_length          => 8192
  :file_encoding        => "ISO-8859-1"
  :system_endianness    => :LittleEndian
  :ncols                => 5
  :column_types         => DataType[Float64, Float64, Float64, Float64, Float64]
  :data                 => Dict{Any,Any}(Pair{Any,Any}(:m, [8.0, 5.0, 5.0, 7.0, 6.0, 7.0, 7.0, 6.0, 5.0, 5.0  …  6.0, 9.0, 8.0, 7.0, 7.0, 5.0, 5.0, 7.0, 9.0, 6.0]),Pair{An…
  :perf_type_conversion => 32.2385
  :page_count           => 985222
  :column_names         => String["m", "n", "k", "u", "x"]
  :column_symbols       => Symbol[:m, :n, :k, :u, :x]
  :column_lengths       => [8, 8, 8, 8, 8]
  :file_endianness      => :BigEndian
  :nrows                => 200000000
  :perf_read_data       => 97.5212
  :column_offsets       => [0, 8, 16, 24, 32]

julia> using DataFrames

julia> y = DataFrame(x[:data]);

julia> @time y = DataFrame(x[:data]);
  0.189854 seconds (69 allocations: 119.212 MiB, 80.54% gc time)

julia> showcols(y)
200000000×5 DataFrames.DataFrame
│ Col # │ Name │ Eltype  │ Missing │
│ 1     │ k    │ Float64 │ 0       │
│ 2     │ m    │ Float64 │ 0       │
│ 3     │ n    │ Float64 │ 0       │
│ 4     │ u    │ Float64 │ 0       │
│ 5     │ x    │ Float64 │ 0       │

Unrelated to SAS, I have an (unregistered) package that I am using for 40-100 GB of data. It is a thin wrapper around mmap, and it works if your elements are bits types (e.g. Int, Float64; you could make it work for fixed-width strings). Column selection is costless since mmap reads lazily.
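The mmap approach can be sketched roughly like this, using Julia’s standard Mmap library. The file name and element count are illustrative; the point is that mapping a column costs almost nothing upfront because pages are only loaded when accessed:

```julia
using Mmap

# Write one column of Float64 values to a flat binary file.
n = 1_000_000
open("col_u.bin", "w") do io        # illustrative file name
    write(io, rand(n))
end

# Mmap.mmap returns a Vector{Float64} backed by the file.
# Nothing is read from disk until elements are actually touched,
# so selecting a subset of columns (one file per column) is cheap.
u = open(io -> Mmap.mmap(io, Vector{Float64}, n), "col_u.bin")

sum(u)   # pages are faulted in lazily as the sum walks the array
```

With one file per column, “column selection” is just choosing which files to map, and the OS page cache handles the rest.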

It is definitely slower when the data has strings (which is expected, I guess).
I just tested it on a ‘real’ data set with 49 columns and 500k rows. It took 9.25 seconds, which is not too bad.
Page length was 65536, with 5,883 pages. The file on disk is 380 MB (142 MB compressed).

I am trying a larger file now (hoping memory does not max out) …

EDIT: I ran out of memory on a 14 GB file (with 32 GB of RAM).

Below are some numbers on a 3.8 GB file (uncompressed).
Note the encoding (not sure if that adds processing time).
Only two columns actually contain Chinese characters, though.

@time x = readsas(string(fld,fn))
Read data set of size 5000000 x 49 in 327.862 seconds
327.843503 seconds (649.90 M allocations: 53.104 GiB, 77.32% gc time)
Dict{Symbol,Any} with 16 entries:
  :filename             => "C:\\temp\\data5000k.sas7bdat"
  :page_length          => 65536
  :file_encoding        => "GB18030"
  :system_endianness    => :LittleEndian
  :ncols                => 49
  :column_types         => Type[Float64, String, String, String, String, String, String, String, Float64, Float64  …  Float64, Float64, Float64, String, Union{DateTime, Missings.Missing}, Union{DateTime, Missings.Missing}, Stri…
  :data                 => Dict{Any,Any}(Pair{Any,Any}(:a, [123.56, 32.7101, 62.9041, 328.045, 1833.33, 44.9167, 78.6032, 378.97, 5.65784, 999.92  …  98.8, 1928.94, 38.95, 588.73, 24.5418, 230.16, 762.6, -0.00428962,…
  :perf_type_conversion => 147.857
  :page_count           => 58824
  :column_names         => String["a", "b", ...
  :column_symbols       => Symbol[:a, :b, ...
  :column_lengths       => [32, 32, 32, 6, 16, 16, 32, 16, 40, 40  …  8, 8, 8, 8, 8, 8, 8, 8, 8, 8]
  :file_endianness      => :LittleEndian
  :nrows                => 5000000
  :perf_read_data       => 176.944
  :column_offsets       => [160, 192, 224, 256, 262, 278, 294, 326, 342, 382  …  80, 88, 96, 104, 112, 120, 128, 136, 144, 152]

Main> @time y = DataFrame(x[:data]);
  0.000122 seconds (206 allocations: 18.281 KiB)

It would be interesting to integrate with the DataStreams framework. That would allow streaming the data, selecting or transforming columns on the fly, and storing it elsewhere, e.g. in a SQL or JuliaDB on-disk database.

Hi @bernhard, I also realized that strings aren’t very efficiently processed. That may be the next thing to tackle. I appreciate your effort in testing this.

Good idea… let me log an issue and look into that later.