I have a code that is divided into three parts: pre-processing, an iterative scheme, and post-processing. The pre-processing computes a matrix that can be very large, but this matrix is not used in the iterative scheme, which is the most time-consuming part of the whole code; it is only needed again in the post-processing phase. So I would like to know: is there a way to save this matrix to disk to reduce the memory cost, and which option would be best? Is there a ready-to-use Julia package for this task?
You could use the Serialization standard library’s serialize and deserialize functions. E.g.:
using Serialization
M = rand(200, 200);
serialize("matrix.bin", M)           # write the matrix to disk
x = open("matrix.bin", "r") do io
    deserialize(io)                  # read it back
end;
Thank you for your answer! Do you have any experience with the performance of Serialization? Have you ever tried the package DataDrop.jl (https://github.com/PetrKryslUCSD/DataDrop.jl), which stores numbers, matrices, and strings to disk and retrieves them again?
I haven’t benchmarked, but I would guess Serialization is faster than, or at least comparable to, DataDrop.
If the matrix just contains numbers, I would use something more portable than Serialization, for example the Arrow format (see the Arrow.jl User Manual).
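For a plain numeric matrix, something like this should work (a minimal sketch; I’m assuming Tables.jl to wrap the matrix, since Arrow stores tabular data, and the file name is just for illustration):

using Arrow, Tables

M = rand(200, 200)
Arrow.write("matrix.arrow", Tables.table(M))      # wrap the matrix as a table and write it
M2 = Tables.matrix(Arrow.Table("matrix.arrow"))   # read it back and rebuild the matrix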
Thank you! Yes, it is a real square matrix.
I’ve been testing with Arrow. Do I need to convert the matrix to a DataFrame?
Here is some simple HDF5.jl usage:
julia> using HDF5

julia> A = rand(16, 16)
16×16 Matrix{Float64}:
 0.221447   0.811268  0.353941   …  0.9793    0.242755   0.854743
 0.723401   0.720135  0.0715644     0.828959  0.330749   0.343337
 ⋮                               ⋱
 0.104203   0.233343  0.89632    …  0.983387  0.0355454  0.62741

julia> h5write("mydata.h5", "A", A)

julia> h5read("mydata.h5", "A")
16×16 Matrix{Float64}:
 0.221447   0.811268  0.353941   …  0.9793    0.242755   0.854743
 0.723401   0.720135  0.0715644     0.828959  0.330749   0.343337
 ⋮                               ⋱
 0.104203   0.233343  0.89632    …  0.983387  0.0355454  0.62741
I did some benchmarking and found the following results. A random matrix was saved and retrieved by each of the methods over a number of realizations (a sort of Monte Carlo simulation, to obtain the mean time and its standard deviation). Then I used the Student’s t distribution (alpha = 0.05) to compute a confidence interval for each mean and plotted it. As far as the experiments go, serialization seems to be the fastest approach.
matrix_handler.jl (6.7 KB)
Here is the code I used to do the experiments.
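(Roughly, the experiment boils down to something like the sketch below; this is not the attached script, and I’m assuming Distributions.jl for the t quantile.)

using Serialization, Statistics, Distributions

# time n save/load round trips and report the mean with a 95% t-based CI (alpha = 0.05)
function time_roundtrip(M, n)
    ts = [@elapsed begin
              serialize("matrix.bin", M)
              deserialize("matrix.bin")
          end for _ in 1:n]
    m, s = mean(ts), std(ts)
    half = quantile(TDist(n - 1), 0.975) * s / sqrt(n)
    (mean = m, ci = (m - half, m + half))
end

time_roundtrip(rand(1_000, 1_000), 30)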
Oh dear, is this a performance question? I was going for simple and standardized.
If you don’t use the matrix during the iterative scheme, why not calculate the matrix after the iterative scheme?
This matrix is the product of a Schur decomposition that must be calculated in the pre-processing step.
Would memory-mapping the matrix be useful here? I am imagining that if the matrix is memory-mapped when it is created but not used during the second stage of the calculation, its in-memory pages could be reclaimed for other purposes and then restored during the third stage.
Interesting! Do you have any working example in Julia?
julia> using Mmap

julia> x = Mmap.mmap(Matrix{Float64}, (10_000, 10_000));  # anonymous (not file-backed) memory map

then just write into it.
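If you want the matrix to live in a named file on disk across the phases, a file-backed map also works (a minimal sketch; the file name and sizes are just illustrative):

using Mmap, Random

# pre-processing: create a file-backed matrix and fill it in place
io = open("matrix.mmap", "w+")
M = Mmap.mmap(io, Matrix{Float64}, (10_000, 10_000))
rand!(M)         # stand-in for the real computation; fills M in place
Mmap.sync!(M)    # flush dirty pages to disk
close(io)

# post-processing: map the same file again
io = open("matrix.mmap", "r")
M2 = Mmap.mmap(io, Matrix{Float64}, (10_000, 10_000))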