Data Storage Quo Vadis under Julia: HDF5 - JLD2 - MAT

The good news: Julia can beat the Python package “hdf5storage” at data compression.
But it is worth taking a closer look at the matter.
Whatever you use to store data from Julia, the result should be better than, or at least as good as, what hdf5storage produces.
Here is the corresponding Python code:

import h5py
import hdf5storage
import numpy as np

matfiledata = {}  # make a dictionary to store the MAT data in
matfiledata['c'] = np.arange(1, 1e7 + 1, dtype=float)  # 1 ... 1e7 inclusive, like 1:1e7 in Julia
hdf5storage.write(matfiledata, '.', 'file_name', matlab_compatible=True)

You can also use the same package from Julia via PyCall:

# create mat via python package
# install via conda: 
# using Conda; Conda.add("h5py"); Conda.add("hdf5storage")
using PyCall

h5p = pyimport("h5py")
h5st = pyimport("hdf5storage")

c = rotl90(collect(1:1e7)[:, :])

matfiledata = Dict() # make a dictionary to store the MAT data in
matfiledata["c"] = c 
h5st.write(matfiledata, ".", "file_name", matlab_compatible=true)

As far as I understand, it is useful to split the content into chunks.
If you would like to save channel data (vectors) in chunks, I guess you
first need to convert them into single-row matrices (1×N).
On my machine the deflate level only makes a difference between 1 and 4,
and 4 gives the best compression.
It only had a significant impact on the file size when I used the
Blosc compression method.
My example vector has 1e7 elements, and I varied the chunk size
in decades. Here are the results with the Blosc compression method:

  1. Chunk Size: 1e+3, File size: 10’274 KB
  2. Chunk Size: 1e+4, File size: 1’387 KB
  3. Chunk Size: 1e+5, File size: 493 KB
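
A rough way to see why the chunk size matters: as far as I know, HDF5 runs its compression filter on every chunk separately, so small chunks pay a fixed overhead again and again. The sketch below imitates that with zlib from the Python standard library on a scaled-down vector of 1e5 Float64 values. It is not HDF5 and not Blosc, so the absolute numbers will differ from the list above; the helper name `chunked_compressed_size` and the chunk sizes are made up for illustration.

```python
import zlib
from array import array


def chunked_compressed_size(raw: bytes, chunk_bytes: int, level: int = 4) -> int:
    """Compress every chunk independently and return the summed size in bytes."""
    total = 0
    for start in range(0, len(raw), chunk_bytes):
        total += len(zlib.compress(raw[start:start + chunk_bytes], level))
    return total


# 100_000 consecutive Float64 values, a scaled-down stand-in for the 1e7 vector
values = array("d", (float(i) for i in range(1, 100_001)))
raw = values.tobytes()

sizes = {}
for n_chunk in (100, 1_000, 10_000):  # chunk size in elements
    sizes[n_chunk] = chunked_compressed_size(raw, n_chunk * values.itemsize)
    print(f"chunk {n_chunk:>6} elements: {sizes[n_chunk] / 1024:8.1f} KB")
```

With plain zlib the effect is much weaker than what I measured with Blosc, but the direction should be the same: fewer, larger chunks give a smaller total.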

I also compared the file size with the JLD2 format and with the output
of the MAT package:

  • JLD2 package: 11’707 KB
  • MAT package: 11’758 KB

Here is the code, if you would like to verify my results:

# test HDF5.jl
# https://github.com/JuliaIO/HDF5.jl
# https://juliaio.github.io/HDF5.jl/stable/
# https://morioh.com/p/5813a7232517
using HDF5
using FileIO, JLD2, MAT
using H5Zblosc # load in Blosc
using Printf

b_H5Zblosc = false
b_Sort_after_Chunk = true
n_elements = 1e7
n_deflate = 5
n_chunk = 1e5

if b_Sort_after_Chunk
    DIR_Export = raw"C:\data\data_export\HDF5" * "\\" * @sprintf("chunk%.0e", n_chunk)
else
    DIR_Export = raw"C:\data\data_export\HDF5" * "\\" * @sprintf("defl%i", n_deflate)
end
if b_H5Zblosc
    FN_HDF5_Compr = DIR_Export * "\\H5Zblosc" * @sprintf("_#%.0e", n_elements) *
                    @sprintf("_defl%i", n_deflate) * @sprintf("_chunk%.0e", n_chunk) * "_compr.hd5"
else
    FN_HDF5_Compr = DIR_Export * "\\zlib" * @sprintf("_#%.0e", n_elements) *
                    @sprintf("_defl%i", n_deflate) * @sprintf("_chunk%.0e", n_chunk) * "_compr.hd5"
end
FN_mat = DIR_Export * "\\MAT" * @sprintf("_#%.0e", n_elements) * ".mat"
FN_jld2_compr = DIR_Export * "\\jld2_compr" * @sprintf("_#%.0e", n_elements) * ".jld2"
FN_jld2 = DIR_Export * "\\jld2_uncmpr" * @sprintf("_#%.0e", n_elements) * ".jld2"

# ---
if ~isdir(DIR_Export)
    mkpath(DIR_Export)
end
if n_elements > 1e8
    error("Too many elements")
end
if n_chunk > n_elements
    error("Chunk size larger than matrix size")
end
if n_deflate > 5
    error("Compression Parameter too high!")
end

c = rotl90(collect(1:n_elements)[:, :]) # turn the vector into an n×1 matrix, then rotate it into a 1×n row matrix

# --- store data to file ---
println("--- start HDF5 compressed ---")
h5open(FN_HDF5_Compr, "w") do fid
    if b_H5Zblosc
        fid["c", chunk=(1, n_chunk), blosc=n_deflate] = c
    else
        fid["c", chunk=(1, n_chunk), compress=n_deflate] = c
    end
    attributes(fid)["c"] = "Unit: m"
end

if ~isfile(FN_mat)
    # fid = matopen(FN_mat, "w", compress = true)
    # write(fid, "c", c)
    # close(fid)
    matwrite(FN_mat, Dict(
        "c" => c
    ); compress = true)
end

# --- jld2, not compressed ---
if ~isfile(FN_jld2)
    jldsave(FN_jld2, false; c)
end

# --- jld2, compressed ---
if ~isfile(FN_jld2_compr)
    jldsave(FN_jld2_compr, true; c)
end

println("--- END --- Deflation: ", n_deflate, "  ---   Chunk: ", n_chunk,  "   ---")
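
To collect the file sizes afterwards, a small helper like the following could be used. It is only a sketch in Python (standard library only); the function name `file_sizes_kb` is hypothetical, and it assumes that 1 KB means 1024 bytes. The demo runs on a throwaway directory, so point it at your own export directory instead.

```python
import os
import tempfile


def file_sizes_kb(directory: str) -> dict:
    """Return {file name: size in KB} for all regular files in `directory`.

    Assumption: 1 KB = 1024 bytes.
    """
    sizes = {}
    for entry in sorted(os.listdir(directory)):
        path = os.path.join(directory, entry)
        if os.path.isfile(path):
            sizes[entry] = os.path.getsize(path) / 1024
    return sizes


# Demo on a temporary directory with one 2048-byte file
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "demo.bin"), "wb") as f:
        f.write(b"\0" * 2048)
    demo = file_sizes_kb(tmp)
    for name, kb in demo.items():
        print(f"{name}: {kb:,.0f} KB")
```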

Follow-up:
The following MATLAB/Octave code:

% export sample channel data to mat
c = [1 : 1e7]';
save('matlab_exported_sample_data_#1e7.mat', 'c')

results in the following file sizes for Octave v7.1 and MATLAB:

  • Matlab (mat v7.3): 13’663 KB
  • Octave (mat v7.0): 11’680 KB
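
To put all these numbers into perspective, they can be compared with the raw payload: 1e7 Float64 values take 8e7 bytes. The short Python sketch below computes the resulting compression ratios from the file sizes reported in this post. It assumes that 1 KB means 1024 bytes and it ignores any header overhead in the files, so take the ratios as rough estimates.

```python
N_ELEMENTS = 10_000_000  # 1e7 values, as in the examples above
BYTES_PER_VALUE = 8      # Float64 / double
raw_kb = N_ELEMENTS * BYTES_PER_VALUE / 1024  # assumption: 1 KB = 1024 bytes

# File sizes in KB as reported in this post
reported_kb = {
    "HDF5 Blosc, chunk 1e3": 10_274,
    "HDF5 Blosc, chunk 1e4": 1_387,
    "HDF5 Blosc, chunk 1e5": 493,
    "JLD2": 11_707,
    "MAT package": 11_758,
    "MATLAB (MAT v7.3)": 13_663,
    "Octave (MAT v7.0)": 11_680,
}

# Compression ratio = raw payload size / file size
ratios = {name: raw_kb / kb for name, kb in reported_kb.items()}

print(f"raw payload: {raw_kb:,.0f} KB")
for name, ratio in sorted(ratios.items(), key=lambda item: -item[1]):
    print(f"{name:<24} {reported_kb[name]:>7,} KB   ratio {ratio:6.1f}")
```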