I need advices to solve Ax=b with Mmap

Hellos

My topic title is clear, I am solving Ax=b (x=A\b).
I know that my system will become big, and I want to try Mmap to store my data. Here the basic code:

using Mmap
N = 2_000
A = rand(N, N)
b = rand(N)


io_A = open("/tmp/matrixA.bin", "w+");
A_HD = Mmap.mmap(io_A, Matrix{Float64}, (N,N));
A_HD[:] = A


io_b = open("/tmp/vectorB.bin", "w+");
b_HD = Mmap.mmap(io_b, Vector{Float64}, N);
b_HD[:] = b

io_x = open("/tmp/vectorX.bin", "w+");
x_HD = Mmap.mmap(io_x, Vector{Float64}, N);

x_HD[:] = A_HD\b_HD
x = A\b

println(x ≈ x_HD)

Looking at memory consumption, I can guess that A_HD\b_HD is stored on RAM before saved into HD. I would like to only use the disk for all the operation. Is is possible ? Or some internals of \ functions does not allow me do to that so easily ?

Thank you

I’ve never worked with memory mapped arrays in julia but I would guess that what you’re trying to do is not possible. The \ operator would need to include an out of core factorization algorithm for dense matrices. I could be wrong but I’m pretty sure it doesn’t. A quick google search for out of core factorization of dense matrices in julia didn’t turn up anything.

However, are you sure your matrices need to be dense? If you’re able to work with sparse arrays, that will allow you to handle much larger problems.

I simulate a physical problem that is not possible to use Sparse Matrix.

In the meantime I will try to implement a simple of \ and see its caveats for Mmap.

You might find this article useful. It’s not an answer for how to do the inverse, but rather an answer on how to avoid it. https://www.johndcook.com/blog/2010/01/19/dont-invert-that-matrix/

I would strongly suggest you reconsider if there are any other way to solve your problem. Solving a linear system of size 10k on my machine takes up about 1gb RAM and 10s (with full multithreading). The scaling wrt memory size is power 3/2, so if your matrix goes over your RAM (which I’m assuming is at least 4gbs), then it takes at least a minute. There usually are better ways, depending on your particular problem.

3 Likes

I strongly cannot reconsider :smile:

Firstly, because I need all the values of x for further analysis.
Secondly, because is a challenge that my curiosity wants to solve

My bet is to take a package written in julia for x=A\b and add features for Mmap.
For example, Gaussian elimination is a common algorithm, and does not seem an issue to create Mmap variables for it.

But I know that Gaussian elimination is slow, and I’m open for package suggestions

I think you could probably do this with an in-place factorization, for example F = lu!(A_HD) with an Mmap'd array A. I’m not sure this is completely allocation-free but it’s a place to start looking.

See also

By the way it’s unlikely to be helpful to mmap your vector b_HD because it’s so much smaller than the matrix.

1 Like

I do a lot of linear algebra. As far as I know, Gaussian Elimination/LU decomp., etc, will all perform roughly the same on this data. At the size you’re looking at (2000x2000) it is not necessary to mmap anything on a computer built in the past couple of decades but of course feel free to do it anyway. If you implement a thing like this, note that the main way that dense linear algebra is made to be fast on computers is to use block operations that correspond to the cache size. In your case, your “cache” is the main computer memory and the “off-cache” is the HDD so you should make blocks that are sizes appropriately for your main computer memory. You can probably get some more speed by making blocks-of-blocks corresponding to your memory hierarchy.

So you’d have to implement a block-LU or block-GE algorithm to get good performance.

Apart from these “direct solvers” you can try iterative solvers. The simplest are Jacobi and Gauss-Seidel, and the more advanced ones are GMRES and similar. However, these algorithms are not going to work well on a random matrix as in your example. Roughly speaking, iterative algorithms converge at a rate that depends on the distribution of eigenvalues of the matrix, and random matrices don’t have good spectral properties.

2 Likes

The 2000 matrix was just an example

I don’t want the headache of convergence tests at this initial moment. So the keywords that I should look for are ‘block-LU’ or ‘block-GE’
thank you

Mmap isn’t magical. The data still has to exist and if it doesn’t all fit into memory at once, then the kernel is going to have to swap it back and forth from disk all the time which is going to kill the performance. There’s some chance that an algorithm might happen to operate on the matrix memory in a sufficiently local way that this won’t be terrible, but that is not the case with most dense matrix algorithms.

1 Like

Internally it seems that we call LAPACK.getrf for the LU decomposition of strided matrices, so it would depend on how well OpenBLAS’s version of dgetrf optimizes for multiple levels of cache (the last level of cache here being the OS page table).

Here’s a paper which discusses cache oblivious algorithms for dense linear algebra, including dgetrf:

That has nothing to do with whether you use a sparse-matrix algorithm, which depends only on A being sparse (mostly zero, as is common for matrices that are very large) — the full solution vector x is computed by sparse or iterative solvers.

Very large matrices that arise in practice are almost always either sparse or have some other structure that can be exploited to solve the system much more quickly than Gaussian elimination or similar. That’s a good thing, because brute-force “dense” matrix solvers (which exploit no special structure of the matrix) simply don’t scale to very large problems — such algorithms scale with the cube of the matrix size, which means that even if I give you 1000× more processing power, you can only solve a system that is 10× bigger. (You usually run out of memory first.)

There’s no magic trick here that doesn’t involve exploiting matrix structure.

5 Likes

Hellos again, and I bring good news.

Using direct Lapack calls solved my problem perfectly:

using Mmap, LinearAlgebra
N = 2_000
A = rand(ComplexF64, N, N)
b = rand(ComplexF64, N)

io_A = open("/tmp/matrixA.bin", "w+");
A_HD = Mmap.mmap(io_A, Matrix{ComplexF64}, (N,N));
A_HD[:] = A
close(io_A)
GC.gc()

xRam = A\b

io_A = open("/tmp/matrixA.bin","r+")
A_HD = Mmap.mmap(io_A, Matrix{ComplexF64}, (N,N))
xHD = LinearAlgebra.LAPACK.gesv!(A_HD, b)[1]
close(io_A)

println(xHD ≈ xRam) # julia> true

To test the difference of RAM and HDD, another code :

N_range = [5_000,10_000, 15_000, 20_000]

timeRam = zeros(length(N_range))
cc = 1
@progress for N in N_range
    global cc
    A = rand(ComplexF64, N, N)
    b = rand(ComplexF64, N)

    io_A = open("/tmp/matrixA_"*string(N)*".bin", "w+");
    A_HD = Mmap.mmap(io_A, Matrix{ComplexF64}, (N,N));
    A_HD[:] = A

    timeRam[cc] = @elapsed A\b;
    A = 1; GC.gc()
    cc = cc + 1
    close(io_A)
end

timeDisk = zeros(length(N_range))
cc = 1
@progress for N in N_range
    global cc
    io_A = open("/tmp/matrixA_"*string(N)*".bin", "r+");
    A_HD = Mmap.mmap(io_A, Matrix{ComplexF64}, (N,N));
    b = rand(ComplexF64, N) # i dont mind to recreate b for this test
    
    timeDisk[cc] = @elapsed LinearAlgebra.LAPACK.gesv!(A_HD, b)[1];
    cc = cc + 1
    close(io_A)
end

using Plots
plot(N_range, timeRam, markershape=:circle, label="RAM", scale=:log10)
plot!(N_range, timeDisk, markershape=:circle, label="HD", legend=:topleft)
plot!(xlabel="Matrix Size", ylabel="Elapsed Time")

rapido

You can post your curve for comparison too :slightly_smiling_face:

1 Like

Are you sure you’re exceeding RAM here? A 20000x20000 ComplexF64 matrix takes < 7GB of memory, which most people have. If you use a memory-mapped array that fits into RAM then it is equivalent to just reading all the data into a regular array.

(You don’t need a low-level LAPACK call, either. You can work in place using high level functions like lu! and ldiv!. I think you’re just measuring the memory overhead for the out-of-place A \ b call.)

1 Like

My goal is to work with N=200 000.

Those values were to get a notion of time consumption.

This isn’t something I have any experience with, but I found it an interesting question and so also Googled a bit. There is a conference paper by the Flame authors from a decade ago that claims they were able to get comparable performance for out-of-core dense LA implementations (in GFLOPs) to their in-core algorithms. They used it to solve a 100,000x100,000 Cholesky factorization in ~1 hour on an 8-core machine.

1 Like

Benchmarking in-memory factorizations will not tell you anything about the cost of out-of-core problems in general. You have no idea how much slowdown you are about to experience (the LAPACK algorithms were not designed to run out of core, so I would be surprised to see them scale well).

1 Like