I need advice on solving Ax=b with Mmap

Hello,

My topic title says it all: I am solving Ax=b (x = A\b).
I know that my system will become big, and I want to try Mmap to store my data. Here is the basic code:

using Mmap
N = 2_000
A = rand(N, N)
b = rand(N)

# Memory-mapped copy of A, backed by a file on disk
io_A = open("/tmp/matrixA.bin", "w+");
A_HD = Mmap.mmap(io_A, Matrix{Float64}, (N, N));
A_HD[:] = A

# Memory-mapped copy of b
io_b = open("/tmp/vectorB.bin", "w+");
b_HD = Mmap.mmap(io_b, Vector{Float64}, N);
b_HD[:] = b

# Memory-mapped destination for the solution x
io_x = open("/tmp/vectorX.bin", "w+");
x_HD = Mmap.mmap(io_x, Vector{Float64}, N);

x_HD[:] = A_HD \ b_HD
x = A \ b

println(x ≈ x_HD)

Looking at the memory consumption, I can tell that A_HD\b_HD is computed in RAM before being saved to disk. I would like to use only the disk for the whole operation. Is that possible? Or do the internals of the \ function not allow me to do that so easily?

Thank you

I've never worked with memory-mapped arrays in Julia, but I would guess that what you're trying to do is not possible. The \ operator would need to include an out-of-core factorization algorithm for dense matrices. I could be wrong, but I'm pretty sure it doesn't. A quick Google search for out-of-core factorization of dense matrices in Julia didn't turn up anything.

However, are you sure your matrices need to be dense? If you're able to work with sparse arrays, that will allow you to handle much larger problems.
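For instance, a structured sparse system (here a made-up tridiagonal one, purely to illustrate the scaling) with a million unknowns stores only a few million entries and solves in seconds, whereas the dense equivalent would need roughly 8 TB just to store the matrix:

using SparseArrays, LinearAlgebra

# Illustrative 1-D Laplacian-like system: ~3 nonzeros per row instead of N entries per row
N = 1_000_000
A = spdiagm(-1 => fill(-1.0, N-1), 0 => fill(2.0, N), 1 => fill(-1.0, N-1))
b = rand(N)
x = A \ b   # dispatches to a sparse factorization under the hood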

I am simulating a physical problem for which it is not possible to use a sparse matrix.

In the meantime I will try to implement a simple version of \ and see its caveats with Mmap.

You might find this article useful. It's not an answer for how to do the inverse, but rather an answer on how to avoid it: Don't invert that matrix

I would strongly suggest you reconsider whether there is any other way to solve your problem. Solving a linear system of size 10k on my machine takes about 1 GB of RAM and 10 s (with full multithreading). The time scales with memory size to the power 3/2, so if your matrix goes over your RAM (which I'm assuming is at least 4 GB), then it takes at least a minute. There usually are better ways, depending on your particular problem.
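To put rough numbers on that scaling (a back-of-the-envelope sketch only, extrapolating from the 10k / 1 GB / 10 s data point above and ignoring any extra out-of-core slowdown):

# Dense solve: memory ~ N^2, time ~ N^3 (illustrative estimate, not a benchmark)
estimate(N; N0=10_000, mem0_gb=1.0, time0_s=10.0) =
    (memory_gb = mem0_gb * (N / N0)^2, time_s = time0_s * (N / N0)^3)

estimate(20_000)    # ≈ 4 GB, ≈ 80 s
estimate(100_000)   # ≈ 100 GB, ≈ 10_000 s (roughly 3 hours, even if the disk were free)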

3 Likes

I strongly cannot reconsider :smile:

Firstly, because I need all the values of x for further analysis.
Secondly, because it is a challenge that my curiosity wants to solve.

My bet is to take a package written in Julia for x=A\b and add support for Mmap.
For example, Gaussian elimination is a common algorithm, and it does not seem to be an issue to create Mmap variables for it.

But I know that Gaussian elimination is slow, and I'm open to package suggestions.
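For reference, a plain Gaussian-elimination solver with partial pivoting looks like this (an illustrative sketch with a made-up name, not tuned for performance; note that every step sweeps the whole trailing submatrix, which is exactly the access pattern that hurts once A lives on disk):

using LinearAlgebra

# Solve A*x = b by Gaussian elimination with partial pivoting.
# A and b are overwritten; returns x.
function gauss_solve!(A::AbstractMatrix, b::AbstractVector)
    n = length(b)
    for k in 1:n-1
        # Pick the largest pivot in column k and swap rows
        p = k - 1 + argmax(abs.(view(A, k:n, k)))
        if p != k
            A[k, :], A[p, :] = A[p, :], A[k, :]
            b[k], b[p] = b[p], b[k]
        end
        # Eliminate the entries below the pivot
        for i in k+1:n
            m = A[i, k] / A[k, k]
            A[i, k+1:n] .-= m .* view(A, k, k+1:n)
            b[i] -= m * b[k]
        end
    end
    # Back substitution on the upper triangle
    x = similar(b)
    for i in n:-1:1
        s = zero(eltype(b))
        for j in i+1:n
            s += A[i, j] * x[j]
        end
        x[i] = (b[i] - s) / A[i, i]
    end
    return x
end

x = gauss_solve!(copy(A), copy(b))   # using A and b from the first post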

I think you could probably do this with an in-place factorization, for example F = lu!(A_HD) with an Mmap'd array A. I'm not sure this is completely allocation-free, but it's a place to start looking.


By the way, it's unlikely to be helpful to mmap your vector b_HD because it's so much smaller than the matrix.
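A minimal sketch of that in-place route, reusing the file (and N, b) from the first post; whether the factorization really avoids pulling everything through RAM is exactly the open question:

using Mmap, LinearAlgebra

io_A = open("/tmp/matrixA.bin", "r+")
A_HD = Mmap.mmap(io_A, Matrix{Float64}, (N, N))

F = lu!(A_HD)          # LU factors overwrite the mmap'd buffer (plus a small pivot vector)
x = ldiv!(F, copy(b))  # solve in place; the copy of b is overwritten with x
close(io_A)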

1 Like

I do a lot of linear algebra. As far as I know, Gaussian elimination, LU decomposition, etc., will all perform roughly the same on this data. At the size you're looking at (2000x2000) it is not necessary to mmap anything on a computer built in the past couple of decades, but of course feel free to do it anyway. If you implement a thing like this, note that the main way dense linear algebra is made fast on computers is to use block operations that correspond to the cache size. In your case, your "cache" is main memory and the "off-cache" is the HDD, so you should make blocks sized appropriately for your main memory. You can probably get some more speed by making blocks-of-blocks corresponding to your memory hierarchy.

So you'd have to implement a block-LU or block-GE algorithm to get good performance.
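To make that concrete, here is a sketch of a right-looking blocked LU (without pivoting, purely for illustration; a real out-of-core solver also needs pivoting and careful I/O scheduling). Each panel of `blocksize` columns is meant to fit in RAM:

using LinearAlgebra

# Right-looking blocked LU without pivoting; A is overwritten with its L and U factors.
# For the out-of-core case, each view(...) would be a window into the mmap'd file.
function blocked_lu!(A::AbstractMatrix, blocksize::Int)
    n = size(A, 1)
    for k in 1:blocksize:n
        kb = min(blocksize, n - k + 1)
        K = k:k+kb-1        # current panel
        R = k+kb:n          # trailing rows/columns
        lu!(view(A, K, K), NoPivot())                                 # factor diagonal block
        if !isempty(R)
            ldiv!(UnitLowerTriangular(view(A, K, K)), view(A, K, R))  # block row of U
            rdiv!(view(A, R, K), UpperTriangular(view(A, K, K)))      # block column of L
            mul!(view(A, R, R), view(A, R, K), view(A, K, R), -1, 1)  # Schur-complement update
        end
    end
    return A
end

The bulk of the work is the mul! update, one large matrix multiply per panel that streams through the trailing submatrix; that is what gives blocked algorithms their cache (or RAM-vs-disk) friendliness.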

Apart from these "direct solvers" you can try iterative solvers. The simplest are Jacobi and Gauss-Seidel, and the more advanced ones are GMRES and similar. However, these algorithms are not going to work well on a random matrix as in your example. Roughly speaking, iterative algorithms converge at a rate that depends on the distribution of eigenvalues of the matrix, and random matrices don't have good spectral properties.
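For illustration, a minimal Jacobi sweep needs only one matrix-vector product per iteration, i.e. a single sequential pass over A, which is a much friendlier access pattern for an mmap'd matrix than a factorization; but it converges only for suitable matrices (e.g. strictly diagonally dominant ones), not for a generic random A:

using LinearAlgebra

# Jacobi iteration: x_new = D \ (b - (A - D)*x), with D the diagonal of A (sketch only)
function jacobi(A, b; maxiter=1_000, tol=1e-8)
    x = zeros(eltype(b), length(b))
    d = diag(A)
    for _ in 1:maxiter
        x_new = (b .- (A * x .- d .* x)) ./ d
        norm(x_new - x) < tol * norm(x_new) && return x_new
        x = x_new
    end
    return x
end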

2 Likes

The 2000x2000 matrix was just an example.

I don't want the headache of convergence tests at this initial moment. So the keywords I should look for are 'block-LU' or 'block-GE'.
Thank you.

Mmap isn't magical. The data still has to exist, and if it doesn't all fit into memory at once, then the kernel is going to have to swap it back and forth from disk all the time, which is going to kill the performance. There's some chance that an algorithm might happen to operate on the matrix memory in a sufficiently local way that this won't be terrible, but that is not the case with most dense matrix algorithms.

1 Like

Internally it seems that we call LAPACK.getrf! for the LU decomposition of strided matrices, so it would depend on how well OpenBLAS's version of dgetrf optimizes for multiple levels of cache (the last level of "cache" here being the OS paging to disk).
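For context, that call chain can be reproduced directly with the wrappers (getrf! computes the pivoted LU, getrs! solves with the factors):

using LinearAlgebra

A = rand(4, 4); b = rand(4)

# Roughly what lu(A) and A \ b do for dense matrices
F, ipiv, info = LinearAlgebra.LAPACK.getrf!(copy(A))     # P*A = L*U, factors packed into F
x = LinearAlgebra.LAPACK.getrs!('N', F, ipiv, copy(b))   # solve using the factors
x ≈ A \ b                                                # true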

Here's a paper which discusses cache-oblivious algorithms for dense linear algebra, including dgetrf:

That has nothing to do with whether you use a sparse-matrix algorithm, which depends only on A being sparse (mostly zero, as is common for matrices that are very large); the full solution vector x is computed by sparse or iterative solvers.

Very large matrices that arise in practice are almost always either sparse or have some other structure that can be exploited to solve the system much more quickly than Gaussian elimination or similar. That's a good thing, because brute-force "dense" matrix solvers (which exploit no special structure of the matrix) simply don't scale to very large problems: such algorithms scale with the cube of the matrix size, which means that even if I give you 1000× more processing power, you can only solve a system that is 10× bigger. (You usually run out of memory first.)

There's no magic trick here that doesn't involve exploiting matrix structure.

5 Likes

Hello again, and I bring good news.

Using direct LAPACK calls solved my problem perfectly:

using Mmap, LinearAlgebra
N = 2_000
A = rand(ComplexF64, N, N)
b = rand(ComplexF64, N)

# Write A into a memory-mapped file
io_A = open("/tmp/matrixA.bin", "w+");
A_HD = Mmap.mmap(io_A, Matrix{ComplexF64}, (N,N));
A_HD[:] = A
close(io_A)
GC.gc()

# Reference solution in RAM
xRam = A\b

# Reopen the file and solve in place: gesv! overwrites A_HD with its LU factors
# and returns the solution as its first output
io_A = open("/tmp/matrixA.bin","r+")
A_HD = Mmap.mmap(io_A, Matrix{ComplexF64}, (N,N))
xHD = LinearAlgebra.LAPACK.gesv!(A_HD, b)[1]
close(io_A)

println(xHD ≈ xRam) # julia> true

To test the difference between RAM and HDD, here is another piece of code:

using Mmap, LinearAlgebra
# `@progress` comes from the Juno/ProgressLogging packages; a plain `for` loop works too.

N_range = [5_000, 10_000, 15_000, 20_000]

# Time the standard in-RAM solve, writing each matrix to disk for the second test
timeRam = zeros(length(N_range))
cc = 1
@progress for N in N_range
    global cc
    A = rand(ComplexF64, N, N)
    b = rand(ComplexF64, N)

    io_A = open("/tmp/matrixA_"*string(N)*".bin", "w+");
    A_HD = Mmap.mmap(io_A, Matrix{ComplexF64}, (N,N));
    A_HD[:] = A

    timeRam[cc] = @elapsed A\b;
    A = 1; GC.gc()
    cc = cc + 1
    close(io_A)
end

# Time the in-place solve on the memory-mapped matrices
timeDisk = zeros(length(N_range))
cc = 1
@progress for N in N_range
    global cc
    io_A = open("/tmp/matrixA_"*string(N)*".bin", "r+");
    A_HD = Mmap.mmap(io_A, Matrix{ComplexF64}, (N,N));
    b = rand(ComplexF64, N) # I don't mind recreating b for this test

    timeDisk[cc] = @elapsed LinearAlgebra.LAPACK.gesv!(A_HD, b)[1];
    cc = cc + 1
    close(io_A)
end

using Plots
plot(N_range, timeRam, markershape=:circle, label="RAM", scale=:log10)
plot!(N_range, timeDisk, markershape=:circle, label="HD", legend=:topleft)
plot!(xlabel="Matrix Size", ylabel="Elapsed Time")

[plot: elapsed time vs. matrix size, RAM vs. HD]

You can post your curve for comparison too :slightly_smiling_face:

1 Like

Are you sure you're exceeding RAM here? A 20000x20000 ComplexF64 matrix takes < 7 GB of memory, which most people have. If you use a memory-mapped array that fits into RAM, then it is equivalent to just reading all the data into a regular array.

(You don't need a low-level LAPACK call, either. You can work in place using high-level functions like lu! and ldiv!. I think you're just measuring the memory overhead of the out-of-place A \ b call.)

1 Like

My goal is to work with N = 200,000.

Those values were just to get a notion of the time consumption.

This isn't something I have any experience with, but I found it an interesting question and so also Googled a bit. There is a conference paper by the FLAME authors from a decade ago that claims their out-of-core dense linear algebra implementations achieved performance (in GFLOPS) comparable to their in-core algorithms. They used it to compute a 100,000x100,000 Cholesky factorization in ~1 hour on an 8-core machine.

1 Like

Benchmarking in-memory factorizations will not tell you anything about the cost of out-of-core problems in general. You have no idea how much slowdown you are about to experience (the LAPACK algorithms were not designed to run out of core, so I would be surprised to see them scale well).

2 Likes