The good news: Julia can beat the Python package “hdf5storage” when it comes to compressed file size.
But it is worth taking a closer look at the matter.
If you store data from Julia, the result should be at least as good as what hdf5storage produces.
Here is the related Python code:
import h5py
import hdf5storage
import numpy as np
matfiledata = {} # make a dictionary to store the MAT data in
matfiledata[u'c'] = np.arange(1, 1e7 + 1, dtype=float)  # 1..1e7 inclusive, i.e. 1e7 elements
hdf5storage.write(matfiledata, '.', 'file_name', matlab_compatible=True)
You can also use the same package from Julia via PyCall:
# create mat via python package
# install via conda:
# using Conda; Conda.add("h5py"); Conda.add("hdf5storage")
using PyCall
h5p = pyimport("h5py")
h5st = pyimport("hdf5storage")
c = rotl90(collect(1:1e7)[:, :])
matfiledata = Dict() # make a dictionary to store the MAT data in
matfiledata["c"] = c
h5st.write(matfiledata, ".", "file_name", matlab_compatible=true)
As far as I understand, it is useful to split the content into chunks.
If you would like to save channel data (vectors) in chunks, I guess you first
need to convert them into matrices with a single row (1×N).
On my machine, the deflate level only makes a difference in the range 1 to 4,
and 4 gives the best compression.
The deflate (compression) level only had a significant
impact when I used the Blosc compression method.
My example vector has 1e7 elements, and I varied the chunk size
in decades. Here are the results with the Blosc compression method:
- Chunk Size: 1e+3, File size: 10’274 KB
- Chunk Size: 1e+4, File size: 1’387 KB
- Chunk Size: 1e+5, File size: 493 KB
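To get a feeling for why the chunk size can matter, here is a minimal Python sketch that uses only the standard library. It is neither HDF5 nor Blosc: it simply deflates a float64 ramp chunk by chunk with zlib, on the assumption that a chunked filter pipeline compresses every chunk independently. The function name `compressed_size` and the shorter vector are my own choices for illustration, so the numbers it prints are not comparable to the file sizes above.

```python
import zlib
from array import array


def compressed_size(values: array, chunk_elems: int, level: int = 4) -> int:
    """Total size in bytes when every chunk is deflated independently."""
    total = 0
    for start in range(0, len(values), chunk_elems):
        chunk = values[start:start + chunk_elems].tobytes()
        total += len(zlib.compress(chunk, level))
    return total


# Hypothetical, smaller stand-in for the 1e7-element vector: 1.0, 2.0, ..., 1e5
ramp = array("d", (float(i) for i in range(1, 100_001)))
raw_bytes = len(ramp) * ramp.itemsize

# Vary the chunk size in decades, as in the experiment above
for chunk_elems in (1_000, 10_000, 100_000):
    size = compressed_size(ramp, chunk_elems)
    print(f"chunk {chunk_elems:>7}: {size / 1024:8.1f} KB (raw {raw_bytes / 1024:.1f} KB)")
```

Each chunk carries its own compressor state and header, so very small chunks tend to waste space; how strong the effect is depends on the data and on the compressor.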
I also compared the file size with the JLD2 format and with the output
of the MAT package:
- JLD2 package: 11’707 KB
- MAT package: 11’758 KB
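For context, the raw payload is 1e7 Float64 values, i.e. 80,000,000 bytes. The short Python sketch below relates the file sizes listed above to that raw size. It assumes that KB means 1024 bytes (as Windows Explorer reports it), which is not stated in the measurements, and the dictionary labels are just my shorthand.

```python
N_ELEMENTS = 10_000_000
BYTES_PER_FLOAT64 = 8
raw_kb = N_ELEMENTS * BYTES_PER_FLOAT64 / 1024  # assumption: 1 KB = 1024 bytes

# File sizes in KB as reported above
reported_kb = {
    "HDF5 + Blosc, chunk 1e3": 10_274,
    "HDF5 + Blosc, chunk 1e4": 1_387,
    "HDF5 + Blosc, chunk 1e5": 493,
    "JLD2": 11_707,
    "MAT": 11_758,
}

# Compression ratio = raw size / file size
ratios = {name: raw_kb / kb for name, kb in reported_kb.items()}
for name, ratio in ratios.items():
    print(f"{name:<26} {ratio:6.1f} : 1")
```

Under that assumption, the raw data is about 78’125 KB, so the JLD2 and MAT files are roughly a factor of 6 to 7 smaller, while the Blosc file with the largest chunk size is smaller by a factor of more than 150.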
Here is the code, if you would like to verify my results:
# test HDF5.jl
# https://github.com/JuliaIO/HDF5.jl
# https://juliaio.github.io/HDF5.jl/stable/
# https://morioh.com/p/5813a7232517
using HDF5
using FileIO, JLD2, MAT
using H5Zblosc # load in Blosc
using Printf
b_H5Zblosc = false
b_Sort_after_Chunk = true
n_elements = 1e7
n_deflate = 5
n_chunk = 1e5
if b_Sort_after_Chunk
    DIR_Export = raw"C:\data\data_export\HDF5" * "\\" * @sprintf("chunk%.0e", n_chunk)
else
    DIR_Export = raw"C:\data\data_export\HDF5" * "\\" * @sprintf("defl%i", n_deflate)
end
if b_H5Zblosc
    FN_HDF5_Compr = DIR_Export * "\\H5Zblosc" * @sprintf("_#%.0e", n_elements) *
        @sprintf("_defl%i", n_deflate) * @sprintf("_chunk%.0e", n_chunk) * "_compr.hd5"
else
    FN_HDF5_Compr = DIR_Export * "\\zlib" * @sprintf("_#%.0e", n_elements) *
        @sprintf("_defl%i", n_deflate) * @sprintf("_chunk%.0e", n_chunk) * "_compr.hd5"
end
FN_mat = DIR_Export * "\\MAT" * @sprintf("_#%.0e", n_elements) * ".mat"
FN_jld2_compr = DIR_Export * "\\jld2_compr" * @sprintf("_#%.0e", n_elements) * ".jld2"
FN_jld2 = DIR_Export * "\\jld2_uncmpr" * @sprintf("_#%.0e", n_elements) * ".jld2"
# ---
if !isdir(DIR_Export)
    mkpath(DIR_Export)
end
if n_elements > 1e8
    error("Too many elements")
end
if n_chunk > n_elements
    error("Chunk size larger than matrix size")
end
if n_deflate > 5
    error("Compression parameter too high!")
end
c = rotl90(collect(1:n_elements)[:, :]) # convert the vector to an N×1 matrix, then rotate it into a 1×N row matrix
# --- store data to file ---
println("--- start HDF5 compressed ---")
h5open(FN_HDF5_Compr, "w") do fid
    if b_H5Zblosc
        fid["c", chunk=(1, n_chunk), blosc=n_deflate] = c
    else
        fid["c", chunk=(1, n_chunk), compress=n_deflate] = c
    end
    attributes(fid)["c"] = "Unit: m"
end
if !isfile(FN_mat)
    # fid = matopen(FN_mat, "w", compress = true)
    # write(fid, "c", c)
    # close(fid)
    matwrite(FN_mat, Dict(
        "c" => c
    ); compress = true)
end
# --- jld2, not compressed ---
if !isfile(FN_jld2)
    jldsave(FN_jld2, false; c)
end
# --- jld2, compressed ---
if !isfile(FN_jld2_compr)
    jldsave(FN_jld2_compr, true; c)
end
println("--- END --- Deflation: ", n_deflate, " --- Chunk: ", n_chunk, " ---")
Follow-up:
The following code:
% export sample channel data to mat
c = [1 : 1e7]';
save('matlab_exported_sample_data_#1e7.mat', 'c')
results in the following file sizes for Octave v7.1 and MATLAB:
- Matlab (mat v7.3): 13’663 KB
- Octave (mat v7.0): 11’680 KB