How to save an array to disk in compressed form?

Hello, how do I save an array to disk in compressed form?

How about HDF5 or FITS formats? You may also consider JLD.
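
For HDF5 in particular, here is a minimal sketch with HDF5.jl (the file name, dataset name, and chunk/deflate settings are just illustrative; HDF5 compression requires chunked datasets):

using HDF5

A = rand(1000, 1000)

# write a chunked, deflate-compressed dataset
h5open("data.h5", "w") do file
    dset = create_dataset(file, "A", datatype(eltype(A)), dataspace(A);
                          chunk=(100, 100), deflate=3)
    write(dset, A)
end

# read it back
A2 = h5read("data.h5", "A")
A2 == A  # true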

I just asked almost the same question. :slight_smile: Is Append to zipped CSV file of any help? The compression is not much compared to a binary format, though. For a rough binary format, see https://docs.julialang.org/en/v1/stdlib/Serialization/. Also search for threads on this site, like Binary output, How store this variable into files, and reaload it?, or just search for HDF5 on this site.

If you store your array in a DataFrame (or any Tables.jl-compatible format), then you can use JDF.jl.
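
For example, a minimal sketch of the round trip, assuming JDF.jl's JDF.save / JDF.load API:

using DataFrames, JDF

df = DataFrame(a = rand(1_000_000), b = rand(1:10, 1_000_000))

JDF.save("df.jdf", df)                 # writes a compressed folder
df2 = DataFrame(JDF.load("df.jdf"))    # load it back as a DataFrame

df2 == df  # true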

If you don’t need interop with R then Blosc.jl is quite good.

using Blosc, Serialization

uncompressed = rand(1_000_000)
compressed = compress(uncompressed)      # Blosc-compressed bytes (Vector{UInt8})

serialize("somewhere.jls", compressed)   # write the compressed bytes to disk

# to read it back
compressed_read_back = deserialize("somewhere.jls")
decompressed = Blosc.decompress(Float64, compressed_read_back)

decompressed == uncompressed  # true

See this GitHub comment, which works for generic data, not just arrays (a minimal sketch of the same idea is below). Before running the code, first run

using TranscodingStreams, CodecZstd

Bear in mind that JDF does not support missing / nothing.
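
Not the code from that comment, but a minimal sketch of the same idea, i.e. Serialization through a Zstd-compressed stream (the file name and data are placeholders):

using Serialization, TranscodingStreams, CodecZstd

data = (E = rand(100), V = Dict(:a => 1, :b => 2))  # any serializable objects

# write: serialize through a Zstd-compressing stream
io = TranscodingStream(ZstdCompressor(), open("data.jls.zst", "w"))
serialize(io, data)
close(io)  # flushes and closes the underlying file too

# read: deserialize through a Zstd-decompressing stream
io = TranscodingStream(ZstdDecompressor(), open("data.jls.zst"))
data2 = deserialize(io)
close(io)

data2 == data  # true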

I find that Serialization + TranscodingStreams + CodecXz is a good combination; all the packages needed add little overhead.

using Downloads, TranscodingStreams, CodecXz, Serialization

# download an example xz-compressed, serialized file
xzFile = Downloads.download("https://github.com/PharosAbad/PharosAbad.github.io/raw/master/files/sp500.jls.xz")

# wrap the file in an Xz-decompressing stream and deserialize from it
io = open(xzFile)
io = TranscodingStream(XzDecompressor(), io)
E = deserialize(io)
V = deserialize(io)
close(io)

Now we compress the data:

# write: serialize through an Xz-compressing stream
xzFile = "/tmp/my-sp500.jls.xz"
io = open(xzFile, "w")
io = TranscodingStream(XzCompressor(), io)
serialize(io, E)
serialize(io, V)
close(io)

Blosc is a good compressor, one of the best lossless compressors, but you can do much better with a lossy compressor. Uniformly random bits shouldn't compress at all; rand, however, produces uniform floats in [0, 1), whose bit patterns aren't uniformly random, so they compress somewhat, and real-world data compresses even more.

For a lossy compressor that's available for Julia:

julia> using ZfpCompression
julia> A = rand(Float32,100,50);
julia> Ac = zfp_compress(A)

Note that the example uses Float32 right there; starting from half the size (or possibly even Float16) is useful, lossy or not, since the extra 32 bits often aren't valuable data. Note also that "reversible (lossless) compression is supported." You may want to compress and then decompress to Float64 for further processing, and there are useful tuning options (for lossy mode).
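
For instance, a sketch of the lossy mode; the tol keyword and the header-based zfp_decompress call are how I recall ZfpCompression.jl's API, so check its README:

julia> Ac_lossy = zfp_compress(A, tol=1e-3)  # fixed-accuracy (lossy) mode
julia> A2 = zfp_decompress(Ac_lossy)         # decompress back to a Float32 array
julia> maximum(abs.(A2 .- A))                # error stays within the tolerance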

I didn't immediately find Julia software for the state-of-the-art SZ3 (or SZ):

https://szcompressor.org/

or rather, not until I found this (though it seems very domain-specific):

While compression ratios range from 300x to more than 3,000x, our method outperforms the state-of-the-art compressor SZ3 in terms of weighted RMSE, MAE.

That may be true, but users may have other reasons. As for me, I compress my data because I am using BigFloat to maintain precision, and the raw size is usually 16 GB+.

I feel like BigFloat is almost always a mistake. It defaults to 256 bits of precision (you can set it higher or lower, and I find it likely that that flexibility, even if unused, makes it slower), so right there you use 8x more memory than Float32 (or 4x more than Float64) before compression, and it's also a performance killer.

And for what? I would at least consider the much faster intermediate Float128 (the fastest alternative to Float64 with more bits) from GitHub - JuliaMath/Quadmath.jl: Float128 and libquadmath for the Julia language
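
A trivial sketch of what using it looks like:

julia> using Quadmath
julia> x = Float128(1) / Float128(3)   # ~113-bit significand, much cheaper than BigFloat
julia> sizeof(x)                       # 16 bytes per number, vs 8 for a Float64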

In either case, no matter how many bits you use, you are not immune from catastrophic cancellation, i.e. loss of precision, so I would also consider ValidatedNumerics.jl and/or IntervalArithmetic.jl:

The final result is an interval that is guaranteed to contain the correct result, starting from the given initial data.

It's based on Float64 by default, but supports down to Float16, so each number is 2x64 down to 2x16 = 32 bits. It's always slower than the types it's built on (2x slower?), but in all cases likely much faster than BigFloat, and more accurate. I'm not sure whether it transparently supports compression: I believe Blosc should work, and ZfpCompression might, in case it sees e.g. the Float64s inside the struct, but note that you should then use its lossless mode (you could try the lossy mode, but you would likely destroy the guarantee the package gives you, though you might not be far off, and with some support it might be kept).
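
A tiny sketch of the idea, assuming a recent IntervalArithmetic.jl where intervals are constructed with interval and their bounds read with inf / sup:

julia> using IntervalArithmetic
julia> x = interval(1) / interval(3)   # an interval guaranteed to contain 1/3
julia> sup(x) - inf(x)                 # width of the enclosure: a couple of ulps of Float64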

I see this recent change in its release notes:

Breaking changes

  • Changed from using FastRounding.jl to RoundingEmulator.jl for the default rounding mode. #370

Note one other option, stochastic rounding:

As 1/3 is not exactly representable the rounding will be at 66.6% chance towards 0.33398438 and at 33.3% towards 0.33203125 such that in expectation the result is 0.33333… and therefore exact.

E.g. 1/3 (and 1/10) isn't exact in any binary floating point, which is why people may be tempted to use ever higher precision to approximate it better, but there's always an error, and it can grow. It's good to know that, with the above, even this is very viable: the BFloat16sr type it's based on is only 63% slower than Float64 with its inferior rounding.
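
A quick sketch of what "exact in expectation" means in practice, assuming the quote refers to StochasticRounding.jl's BFloat16sr type:

julia> using StochasticRounding, Statistics
julia> xs = [Float64(BFloat16sr(1) / BFloat16sr(3)) for _ in 1:100_000];
julia> mean(xs)   # close to 0.3333…, though each single result is one of the two BFloat16 neighbours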

See also (I just discovered this one): GitHub - AnderGray/IntervalUnionArithmetic.jl: An implementation of interval union arithmetic in Julia