I have a code that is divided into three parts: pre-processing, an iterative scheme, and post-processing. The pre-processing computes a matrix that can be very large, but this matrix is not used in the iterative scheme, which is the most time-consuming part of the whole code; it is only needed again in the post-processing phase. So I would like to know: is there a way to save this matrix to disk to reduce the memory cost, and which option would be best? Is there a ready-to-use Julia package for this task?
You could use the Serialization standard library’s serialize and deserialize functions. E.g.:
using Serialization
M = rand(200, 200);
serialize("matrix.bin", M)           # write the matrix to disk
x = open("matrix.bin", "r") do io
    deserialize(io)                  # read it back
end;
Thank you for your answer! Do you have any experience with the performance of Serialization? Have you ever tried the package DataDrop.jl (https://github.com/PetrKryslUCSD/DataDrop.jl), which stores numbers, matrices, and strings to disk and retrieves them again?
I haven’t benchmarked, but I would guess Serialization is faster than, or at least comparable to, DataDrop.
If the matrix just contains numbers, I would use something more portable than Serialization, for example the Arrow format (see the Arrow.jl User Manual).
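For a plain numeric matrix, something like this should work (a minimal sketch; I’m assuming Tables.jl to wrap the matrix, since Arrow stores tabular data, and the file name is just for illustration):

using Arrow, Tables

M = rand(200, 200)
Arrow.write("matrix.arrow", Tables.table(M))      # wrap the matrix as a table and write it
M2 = Tables.matrix(Arrow.Table("matrix.arrow"))   # read it back and rebuild the matrix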
Thank you! Yes, it is a real square matrix.
I’ve been testing with Arrow. Do I need to convert the matrix to a DataFrame?
Here is some simple HDF5.jl usage:
julia> using HDF5

julia> A = rand(16, 16)
16×16 Matrix{Float64}:
 0.221447   0.811268  0.353941   …  0.9793    0.242755   0.854743
 0.723401   0.720135  0.0715644     0.828959  0.330749   0.343337
 ⋮                               ⋱
 0.104203   0.233343  0.89632    …  0.983387  0.0355454  0.62741

julia> h5write("mydata.h5", "A", A)

julia> h5read("mydata.h5", "A")
16×16 Matrix{Float64}:
 0.221447   0.811268  0.353941   …  0.9793    0.242755   0.854743
 0.723401   0.720135  0.0715644     0.828959  0.330749   0.343337
 ⋮                               ⋱
 0.104203   0.233343  0.89632    …  0.983387  0.0355454  0.62741
I did some benchmarking and found the following results. A random matrix was saved and retrieved by each of the methods over a number of realizations (a sort of Monte Carlo simulation, to obtain the mean time and its standard deviation). Then I used the Student’s t distribution (alpha = 0.05) to compute a confidence interval for each mean and plotted it. As far as the experiments go, serialization seems to be the fastest approach.
matrix_handler.jl (6.7 KB)
Here is the code I used to do the experiments.
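(Roughly, the experiment boils down to something like the sketch below; this is not the attached script, and I’m assuming Distributions.jl for the t quantile.)

using Serialization, Statistics, Distributions

# time n save/load round trips and report the mean with a 95% t-based CI (alpha = 0.05)
function time_roundtrip(M, n)
    ts = [@elapsed begin
              serialize("matrix.bin", M)
              deserialize("matrix.bin")
          end for _ in 1:n]
    m, s = mean(ts), std(ts)
    half = quantile(TDist(n - 1), 0.975) * s / sqrt(n)
    (mean = m, ci = (m - half, m + half))
end

time_roundtrip(rand(1_000, 1_000), 30)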
Oh dear, is this a performance question? I was going for simple and standardized.
If you don’t use the matrix during the iterative scheme, why not calculate the matrix after the iterative scheme?
This matrix is the product of a Schur decomposition that must be calculated in the pre-processing step.
Would memory-mapping the matrix be useful here? I am imagining that if the matrix is memory-mapped when it is created but not used during the second stage of the calculation, its in-memory pages could be reclaimed for other purposes and then restored during the third stage.
Interesting! Do you have any working example in Julia?
julia> using Mmap

julia> x = Mmap.mmap(Matrix{Float64}, (10_000, 10_000));  # anonymous (not file-backed) memory map

then just write into it.
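If you want the matrix to live in a named file on disk across the phases, a file-backed map also works (a minimal sketch; the file name and sizes are just illustrative):

using Mmap, Random

# pre-processing: create a file-backed matrix and fill it in place
io = open("matrix.mmap", "w+")
M = Mmap.mmap(io, Matrix{Float64}, (10_000, 10_000))
rand!(M)         # stand-in for the real computation; fills M in place
Mmap.sync!(M)    # flush dirty pages to disk
close(io)

# post-processing: map the same file again
io = open("matrix.mmap", "r")
M2 = Mmap.mmap(io, Matrix{Float64}, (10_000, 10_000))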