ANN: SASLib.jl

tk3369 · December 27, 2017, 10:29pm

Hello everyone!

I am happy to announce a new package, SASLib.jl. Please see the GitHub project page for more information. -Tom

https://github.com/tk3369/SASLib.jl

xiaodai · December 27, 2017, 11:57pm

Already tweeting about it

tbeason · December 28, 2017, 6:28pm

I’d be interested to know how this compares to https://github.com/davidanthoff/StatFiles.jl

tk3369 · December 28, 2017, 7:37pm

I think StatFiles uses the ReadStat package which does not support compressed sas7bdat files.

tbeason · December 28, 2017, 8:11pm

Right, I understand that there are some differences (compressed files, incremental reads). I just mean a performance comparison of read and write times, perhaps as a function of file size.

I work with large sas7bdat files and typically end up still doing half of that work in SAS because I haven’t found the Julia data ecosystem ready for those yet. Hopefully the incremental reading capability of your package will help with that.
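For readers unfamiliar with why incremental reads help here, a minimal sketch of chunk-wise aggregation follows. It does not use SASLib's API: `each_chunk` and `chunked_mean` are hypothetical helpers, and the in-memory vector only stands in for blocks that would come from an incremental file read.

```julia
# Hypothetical chunk source: yields successive row blocks of one column.
# In a real workflow each block would come from an incremental file read.
function each_chunk(data::AbstractVector, chunk_rows::Integer)
    return (view(data, i:min(i + chunk_rows - 1, length(data)))
            for i in 1:chunk_rows:length(data))
end

# Accumulate a mean while only looking at one chunk at a time.
function chunked_mean(chunks)
    total = 0.0
    count = 0
    for chunk in chunks
        total += sum(chunk)
        count += length(chunk)
    end
    return total / count
end

data = collect(1.0:10.0)               # small stand-in for a large column
m = chunked_mean(each_chunk(data, 3))  # processes blocks of 3, 3, 3 and 1 rows
```

The point of the design is that the running totals are the only state kept between blocks, so peak memory depends on the chunk size rather than on the file size.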

tk3369 · December 28, 2017, 9:14pm

I see. The library can only read files, not write them. I would love to hear any feedback about performance if you can try it on your files.

xiaodai · December 28, 2017, 9:39pm

@tbeason Hi, can you tell me more about your needs? I don’t mean to turn this forum into a sales opportunity, but my company makes a fast SAS reader. It can convert a large SAS dataset into CSV in chunks or as a whole. I am trying to make a Julia wrapper around it at the moment, but I am waiting for more mature Julia-to-Rust interop. If you are interested, I can send you an alpha testing copy in the form of a Julia package. Which OS are you using?

tbeason · December 28, 2017, 9:55pm

The SAS isn’t an issue. The issues are all about trying to work with the data in Julia. If the SAS file is >10 gigs then I’m basically out of luck when it comes to using Julia because loading the file and working with it is near impossible for me right now.

It’d be nice for this package to support column selection (I see the issue has already been opened).

xiaodai · December 28, 2017, 10:03pm

OK, then JuliaDB.jl might be one to watch. Not sure how ready it is for your workload, as it feels quite early in development.

tk3369 · December 29, 2017, 3:14am

I don’t normally work with such big files. Are you dealing with mostly numerical data?

I generated an 8 GB file with 200,000,000 x 5 random numbers from SAS and read it into Julia. The file is physically stored on corporate SAN storage, so it’s not the quickest. It completed successfully.
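As a rough size check, the figures reported in this post (row and column counts, column length, page length, page count, and the `ls -l` size) can be multiplied out. This is only arithmetic on the posted numbers; reading the leftover bytes as a file header is an assumption, not something the thread states.

```julia
# All constants below are copied from the readsas and ls -l output in this post.
nrows, ncols = 200_000_000, 5
bytes_per_value = 8                      # :column_lengths is 8 for every column
raw_bytes = nrows * ncols * bytes_per_value

page_length, page_count = 8192, 985_222
page_bytes = page_length * page_count    # space taken by data pages

file_bytes = 8_070_946_816               # size shown by ls -l
leftover = file_bytes - page_bytes       # bytes outside the counted pages (assumed: header)

println("raw data:   ", raw_bytes, " bytes")
println("data pages: ", page_bytes, " bytes")
println("leftover:   ", leftover, " bytes")
```

The raw values account for 8,000,000,000 bytes, and the counted pages for all but 8,192 bytes of the file, which is exactly one page length.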

Reading a subset of columns is a high priority from my perspective. I’m already working on it and it will be released soon.

shell> ls -l unif.sas7bdat
-rw-r--r-- 1 tkwong staff 8070946816 Dec 28 16:31 unif.sas7bdat

julia> @time x = readsas("unif.sas7bdat")
Read data set of size 200000000 x 5 in 132.04 seconds
132.041671 seconds (2.31 G allocations: 194.824 GiB, 8.55% gc time)
Dict{Symbol,Any} with 16 entries:
  :filename             => "unif.sas7bdat"
  :page_length          => 8192
  :file_encoding        => "ISO-8859-1"
  :system_endianness    => :LittleEndian
  :ncols                => 5
  :column_types         => DataType[Float64, Float64, Float64, Float64, Float64]
  :data                 => Dict{Any,Any}(Pair{Any,Any}(:m, [8.0, 5.0, 5.0, 7.0, 6.0, 7.0, 7.0, 6.0, 5.0, 5.0  …  6.0, 9.0, 8.0, 7.0, 7.0, 5.0, 5.0, 7.0, 9.0, 6.0]),Pair{An…
  :perf_type_conversion => 32.2385
  :page_count           => 985222
  :column_names         => String["m", "n", "k", "u", "x"]
  :column_symbols       => Symbol[:m, :n, :k, :u, :x]
  :column_lengths       => [8, 8, 8, 8, 8]
  :file_endianness      => :BigEndian
  :nrows                => 200000000
  :perf_read_data       => 97.5212
  :column_offsets       => [0, 8, 16, 24, 32]

julia> using DataFrames

julia> y = DataFrame(x[:data]);

julia> @time y = DataFrame(x[:data]);
  0.189854 seconds (69 allocations: 119.212 MiB, 80.54% gc time)

julia> showcols(y)
200000000×5 DataFrames.DataFrame
│ Col # │ Name │ Eltype  │ Missing │
├───────┼──────┼─────────┼─────────┤
│ 1     │ k    │ Float64 │ 0       │
│ 2     │ m    │ Float64 │ 0       │
│ 3     │ n    │ Float64 │ 0       │
│ 4     │ u    │ Float64 │ 0       │
│ 5     │ x    │ Float64 │ 0       │
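
Note that `:column_names` above lists the columns as m, n, k, u, x, while the DataFrame built from the `:data` Dict shows them as k, m, n, u, x, presumably because a Dict does not carry the file's column order. A minimal sketch of restoring file order with `:column_symbols` follows; the Dict here is a small hand-made stand-in with the same keys, not real `readsas` output.

```julia
# Hand-made stand-in mimicking the shape of the Dict shown above.
x = Dict{Symbol,Any}(
    :column_symbols => [:m, :n, :k],
    :data => Dict{Any,Any}(
        :m => [8.0, 5.0],
        :n => [1.0, 2.0],
        :k => [3.0, 4.0],
    ),
)

# Walk the columns in file order and pair each name with its values.
ordered = [name => x[:data][name] for name in x[:column_symbols]]

for (name, values) in ordered
    println(name, " => ", values)
end
```

An ordered vector of `name => values` pairs like this can then be handed to whatever table constructor is in use, instead of passing the unordered Dict directly.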

Tamas_Papp · December 29, 2017, 7:20am

Unrelated to SAS, I have an (unregistered) package that I am using for 40-100 GB of data. It is a thin wrapper around mmap, and works if your elements are bits types (e.g. Int, Float64; you could make it work for fixed-width strings). Column selection is costless since mmap reads lazily.
https://github.com/tpapp/LargeColumns.jl
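
To illustrate the idea of memory-mapped columns, here is a minimal sketch using only the `Mmap` standard library. It is not LargeColumns.jl's API; the one-file-per-column layout and the file name `col_x.bin` are assumptions made for the example.

```julia
using Mmap

# Write one column of Float64 values to its own binary file (assumed layout).
dir = mktempdir()
path = joinpath(dir, "col_x.bin")
n = 1_000
open(path, "w") do io
    write(io, collect(1.0:n))    # raw 8-byte values, no header
end

# Map the file as a Vector{Float64}. Nothing is copied up front;
# the OS pages data in only when elements are actually accessed.
io = open(path, "r")
col = Mmap.mmap(io, Vector{Float64}, n)
close(io)

total = sum(col)                 # touches the mapped pages on demand
println(total)
```

With one file per column, a column that is never mapped costs nothing, which is why selecting a subset of columns is effectively free in this layout.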

bernhard · December 29, 2017, 9:36am

It is definitely slower when the data has strings (which is expected I guess).
I just tested it on a ‘real’ 49 column and 500k rows data set. It took 9.25 seconds which is not too bad.
Page length was 65536, with 5883 pages. The file on disk is 380 MB (142 MB compressed).

I am trying a larger file now (hoping memory does not max out) …

EDIT: I ran out of memory on a 14 GB file (with 32 GB of RAM).

Below are some numbers on a 3.8 GB file (uncompressed).
Note the encoding (not sure if that adds processing time).
Only two columns actually contain some Chinese characters though.

@time x = readsas(string(fld,fn))
Read data set of size 5000000 x 49 in 327.862 seconds
327.843503 seconds (649.90 M allocations: 53.104 GiB, 77.32% gc time)
Dict{Symbol,Any} with 16 entries:
  :filename             => "C:\\temp\\data5000k.sas7bdat"
  :page_length          => 65536
  :file_encoding        => "GB18030"
  :system_endianness    => :LittleEndian
  :ncols                => 49
  :column_types         => Type[Float64, String, String, String, String, String, String, String, Float64, Float64  …  Float64, Float64, Float64, String, Union{DateTime, Missings.Missing}, Union{DateTime, Missings.Missing}, Stri…
  :data                 => Dict{Any,Any}(Pair{Any,Any}(:a, [123.56, 32.7101, 62.9041, 328.045, 1833.33, 44.9167, 78.6032, 378.97, 5.65784, 999.92  …  98.8, 1928.94, 38.95, 588.73, 24.5418, 230.16, 762.6, -0.00428962,…
  :perf_type_conversion => 147.857
  :page_count           => 58824
  :column_names         => String["a", "b", ...
  :column_symbols       => Symbol[:a, :b, ...
  :column_lengths       => [32, 32, 32, 6, 16, 16, 32, 16, 40, 40  …  8, 8, 8, 8, 8, 8, 8, 8, 8, 8]
  :file_endianness      => :LittleEndian
  :nrows                => 5000000
  :perf_read_data       => 176.944
  :column_offsets       => [160, 192, 224, 256, 262, 278, 294, 326, 342, 382  …  80, 88, 96, 104, 112, 120, 128, 136, 144, 152]

Main> @time y = DataFrame(x[:data]);
  0.000122 seconds (206 allocations: 18.281 KiB)

nalimilan · December 29, 2017, 10:31am

It would be interesting to integrate with the DataStreams framework. That would allow streaming the data, selecting or transforming columns on the fly, and storing it elsewhere, e.g. in a SQL or JuliaDB on-disk database.
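
A rough sketch of that streaming pattern (source, column selection, transform, sink) using a plain `Channel` is below. This is not the DataStreams API; `chunk_source`, `keep`, and the in-memory `sink` are hypothetical stand-ins for a file reader and a database table.

```julia
# Hypothetical source: emits row blocks as Dicts of column name => values.
function chunk_source(ch::Channel)
    put!(ch, Dict(:a => [1.0, 2.0], :b => ["x", "y"], :c => [10.0, 20.0]))
    put!(ch, Dict(:a => [3.0], :b => ["z"], :c => [30.0]))
end

keep = [:a, :c]                                   # columns selected on the fly
sink = Dict(name => Float64[] for name in keep)   # stand-in for an on-disk table

for chunk in Channel(chunk_source)
    for name in keep
        # Transform while streaming: here, scale every value by 2.
        append!(sink[name], 2 .* chunk[name])
    end
end

println(sink[:a])
println(sink[:c])
```

Column `:b` is never copied into the sink, and only one block is in flight at a time, which is the property that matters for files larger than memory.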

tk3369 · December 30, 2017, 12:43am

Hi @bernhard, I also realized that strings aren’t very efficiently processed. That may be the next thing to tackle. Appreciate your effort in testing this.

https://github.com/tk3369/SASLib.jl/issues/4

tk3369 · December 30, 2017, 12:45am

Good idea… let me log an issue and look into that later.
