How to save an array to disk in compressed form?

Hello, how do I save an array to disk in compressed form?

How about HDF5 or FITS formats? You may also consider JLD.
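
For HDF5 in particular, here is a minimal sketch with HDF5.jl (the file name, dataset name, and chunk/deflate settings are just illustrative; HDF5 compression requires chunked datasets):

using HDF5

A = rand(1000, 1000)

# write a chunked, deflate-compressed dataset
h5open("data.h5", "w") do file
    dset = create_dataset(file, "A", datatype(eltype(A)), dataspace(A);
                          chunk=(100, 100), deflate=3)
    write(dset, A)
end

# read it back
A2 = h5read("data.h5", "A")
A2 == A  # true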

I just asked almost the same question. :slight_smile: Is Append to zipped CSV file of any help? The compression is not much compared to a binary format, though. For a rough binary format, see https://docs.julialang.org/en/v1/stdlib/Serialization/. Also search for threads on this site, like Binary output, How store this variable into files, and reaload it?, or just search for HDF5 on this site.

If you store your array in a DataFrame (or any Tables.jl-compatible format), then you can use JDF.jl.
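
For example, a minimal sketch of the round trip, assuming JDF.jl's JDF.save / JDF.load API:

using DataFrames, JDF

df = DataFrame(a = rand(1_000_000), b = rand(1:10, 1_000_000))

JDF.save("df.jdf", df)                 # writes a compressed folder
df2 = DataFrame(JDF.load("df.jdf"))    # load it back as a DataFrame

df2 == df  # true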

If you don’t need interop with R then Blosc.jl is quite good.

using Blosc, Serialization

uncompressed = rand(1_000_000)
compressed = compress(uncompressed)      # Blosc-compressed bytes (Vector{UInt8})

serialize("somewhere.jls", compressed)   # write the compressed bytes to disk

# to read it back
compressed_read_back = deserialize("somewhere.jls")
decompressed = Blosc.decompress(Float64, compressed_read_back)

decompressed == uncompressed  # true

See this GitHub comment, which works for generic data, not just arrays (a minimal sketch of the same idea is below). Before running the code, first run

using TranscodingStreams, CodecZstd

Bear in mind that JDF does not support missing / nothing.
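
Not the code from that comment, but a minimal sketch of the same idea, i.e. Serialization through a Zstd-compressed stream (the file name and data are placeholders):

using Serialization, TranscodingStreams, CodecZstd

data = (E = rand(100), V = Dict(:a => 1, :b => 2))  # any serializable objects

# write: serialize through a Zstd-compressing stream
io = TranscodingStream(ZstdCompressor(), open("data.jls.zst", "w"))
serialize(io, data)
close(io)  # flushes and closes the underlying file too

# read: deserialize through a Zstd-decompressing stream
io = TranscodingStream(ZstdDecompressor(), open("data.jls.zst"))
data2 = deserialize(io)
close(io)

data2 == data  # true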

I find that Serialization + TranscodingStreams + CodecXz is a good combination; all the packages needed add little overhead.

using Downloads, TranscodingStreams, CodecXz, Serialization

# download an example xz-compressed, serialized file
xzFile = Downloads.download("https://github.com/PharosAbad/PharosAbad.github.io/raw/master/files/sp500.jls.xz")

# wrap the file in an Xz-decompressing stream and deserialize from it
io = open(xzFile)
io = TranscodingStream(XzDecompressor(), io)
E = deserialize(io)
V = deserialize(io)
close(io)

Now we compress the data:

# write: serialize through an Xz-compressing stream
xzFile = "/tmp/my-sp500.jls.xz"
io = open(xzFile, "w")
io = TranscodingStream(XzCompressor(), io)
serialize(io, E)
serialize(io, V)
close(io)

Blosc is a good compressor, one of the best lossless compressors, but you can do much better with a lossy compressor. Uniformly random bits shouldn't compress at all; rand, however, produces uniform floats in [0, 1), whose bit patterns aren't uniformly random, so they compress somewhat, and real-world data compresses even more.

For a lossy compressor that's available for Julia:

julia> using ZfpCompression
julia> A = rand(Float32,100,50);
julia> Ac = zfp_compress(A)

Note that the example uses Float32 right there; starting from half the size (or possibly even Float16) is useful, lossy or not, since the extra 32 bits often aren't valuable data. Note also that "reversible (lossless) compression is supported." You may want to compress and then decompress to Float64 for further processing, and there are useful tuning options (for lossy mode).
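
For instance, a sketch of the lossy mode; the tol keyword and the header-based zfp_decompress call are how I recall ZfpCompression.jl's API, so check its README:

julia> Ac_lossy = zfp_compress(A, tol=1e-3)  # fixed-accuracy (lossy) mode
julia> A2 = zfp_decompress(Ac_lossy)         # decompress back to a Float32 array
julia> maximum(abs.(A2 .- A))                # error stays within the tolerance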

I didn't immediately find Julia software for the state-of-the-art SZ3 (or SZ):

https://szcompressor.org/

or rather, not until I found this (though it seems very domain-specific):

While compression ratios range from 300x to more than 3,000x, our method outperforms the state-of-the-art compressor SZ3 in terms of weighted RMSE, MAE.

That may be true, but users may have other reasons. As for me, I compress my data because I am using BigFloat to maintain precision, and the raw size is usually 16 GB+.

I feel like BigFloat is almost always a mistake. It defaults to 256 bits of precision (you can set it higher or lower, and I find it likely that that flexibility, even if unused, makes it slower), so right there you use 8x more memory than Float32 (or 4x more than Float64) before compression, and it's also a performance killer.

And for what? I would at least consider the much faster intermediate Float128 (the fastest alternative to Float64 with more bits) from GitHub - JuliaMath/Quadmath.jl: Float128 and libquadmath for the Julia language
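
A trivial sketch of what using it looks like:

julia> using Quadmath
julia> x = Float128(1) / Float128(3)   # ~113-bit significand, much cheaper than BigFloat
julia> sizeof(x)                       # 16 bytes per number, vs 8 for a Float64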

In either case, no matter how many bits you use, you are not immune from catastrophic cancellation, i.e. loss of precision, so I would also consider ValidatedNumerics.jl and/or IntervalArithmetic.jl:

The final result is an interval that is guaranteed to contain the correct result, starting from the given initial data.

It's based on Float64 by default, but supports down to Float16, so each number is 2x64 down to 2x16 = 32 bits. It's always slower than the types it's built on (2x slower?), but in all cases likely much faster than BigFloat, and more accurate. I'm not sure whether it transparently supports compression: I believe Blosc should work, and ZfpCompression might, in case it sees e.g. the Float64s inside the struct, but note that you should then use its lossless mode (you could try the lossy mode, but you would likely destroy the guarantee the package gives you, though you might not be far off, and with some support it might be kept).
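
A tiny sketch of the idea, assuming a recent IntervalArithmetic.jl where intervals are constructed with interval and their bounds read with inf / sup:

julia> using IntervalArithmetic
julia> x = interval(1) / interval(3)   # an interval guaranteed to contain 1/3
julia> sup(x) - inf(x)                 # width of the enclosure: a couple of ulps of Float64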

I see this recent change in its release notes:

Breaking changes

  • Changed from using FastRounding.jl to RoundingEmulator.jl for the default rounding mode. #370

Note one other option, stochastic rounding:

As 1/3 is not exactly representable the rounding will be at 66.6% chance towards 0.33398438 and at 33.3% towards 0.33203125 such that in expectation the result is 0.33333… and therefore exact.

E.g. 1/3 (and 1/10) isn't exact in any binary floating point, which is why people may be tempted to use ever higher precision to approximate it better, but there's always an error, and it can grow. It's good to know that, with the above, even this is very viable: the BFloat16sr type it's based on is only 63% slower than Float64 with its inferior rounding.
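
A quick sketch of what "exact in expectation" means in practice, assuming the quote refers to StochasticRounding.jl's BFloat16sr type:

julia> using StochasticRounding, Statistics
julia> xs = [Float64(BFloat16sr(1) / BFloat16sr(3)) for _ in 1:100_000];
julia> mean(xs)   # close to 0.3333…, though each single result is one of the two BFloat16 neighbours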

See also (I just discovered this one): GitHub - AnderGray/IntervalUnionArithmetic.jl: An implementation of interval union arithmetic in Julia