I need advices to solve Ax=b with Mmap

Noel_Araujo · September 11, 2019, 8:07pm

Hellos

My topic title is clear, I am solving Ax=b (x=A\b).
I know that my system will become big, and I want to try Mmap to store my data. Here the basic code:

using Mmap
N = 2_000
A = rand(N, N)
b = rand(N)


io_A = open("/tmp/matrixA.bin", "w+");
A_HD = Mmap.mmap(io_A, Matrix{Float64}, (N,N));
A_HD[:] = A


io_b = open("/tmp/vectorB.bin", "w+");
b_HD = Mmap.mmap(io_b, Vector{Float64}, N);
b_HD[:] = b

io_x = open("/tmp/vectorX.bin", "w+");
x_HD = Mmap.mmap(io_x, Vector{Float64}, N);

x_HD[:] = A_HD\b_HD
x = A\b

println(x ≈ x_HD)

Looking at memory consumption, I can guess that A_HD\b_HD is stored on RAM before saved into HD. I would like to only use the disk for all the operation. Is is possible ? Or some internals of \ functions does not allow me do to that so easily ?

Thank you

Pbellive · September 11, 2019, 8:41pm

I’ve never worked with memory mapped arrays in julia but I would guess that what you’re trying to do is not possible. The \ operator would need to include an out of core factorization algorithm for dense matrices. I could be wrong but I’m pretty sure it doesn’t. A quick google search for out of core factorization of dense matrices in julia didn’t turn up anything.

However, are you sure your matrices need to be dense? If you’re able to work with sparse arrays, that will allow you to handle much larger problems.

Noel_Araujo · September 11, 2019, 8:50pm

I simulate a physical problem that is not possible to use Sparse Matrix.

In the meantime I will try to implement a simple of \ and see its caveats for Mmap.

Oscar_Smith · September 11, 2019, 8:53pm

You might find this article useful. It’s not an answer for how to do the inverse, but rather an answer on how to avoid it. Don’t invert that matrix

antoine-levitt · September 11, 2019, 8:55pm

I would strongly suggest you reconsider if there are any other way to solve your problem. Solving a linear system of size 10k on my machine takes up about 1gb RAM and 10s (with full multithreading). The scaling wrt memory size is power 3/2, so if your matrix goes over your RAM (which I’m assuming is at least 4gbs), then it takes at least a minute. There usually are better ways, depending on your particular problem.

Noel_Araujo · September 12, 2019, 12:00am

I strongly cannot reconsider

Firstly, because I need all the values of x for further analysis.
Secondly, because is a challenge that my curiosity wants to solve

My bet is to take a package written in julia for x=A\b and add features for Mmap.
For example, Gaussian elimination is a common algorithm, and does not seem an issue to create Mmap variables for it.

But I know that Gaussian elimination is slow, and I’m open for package suggestions

c42f · September 12, 2019, 12:25am

I think you could probably do this with an in-place factorization, for example F = lu!(A_HD) with an Mmap’d array A. I’m not sure this is completely allocation-free but it’s a place to start looking.

See also

By the way it’s unlikely to be helpful to mmap your vector b_HD because it’s so much smaller than the matrix.

loisel · September 12, 2019, 12:39am

I do a lot of linear algebra. As far as I know, Gaussian Elimination/LU decomp., etc, will all perform roughly the same on this data. At the size you’re looking at (2000x2000) it is not necessary to mmap anything on a computer built in the past couple of decades but of course feel free to do it anyway. If you implement a thing like this, note that the main way that dense linear algebra is made to be fast on computers is to use block operations that correspond to the cache size. In your case, your “cache” is the main computer memory and the “off-cache” is the HDD so you should make blocks that are sizes appropriately for your main computer memory. You can probably get some more speed by making blocks-of-blocks corresponding to your memory hierarchy.

So you’d have to implement a block-LU or block-GE algorithm to get good performance.

Apart from these “direct solvers” you can try iterative solvers. The simplest are Jacobi and Gauss-Seidel, and the more advanced ones are GMRES and similar. However, these algorithms are not going to work well on a random matrix as in your example. Roughly speaking, iterative algorithms converge at a rate that depends on the distribution of eigenvalues of the matrix, and random matrices don’t have good spectral properties.

Noel_Araujo · September 12, 2019, 12:58am

The 2000 matrix was just an example

I don’t want the headache of convergence tests at this initial moment. So the keywords that I should look for are ‘block-LU’ or ‘block-GE’
thank you

StefanKarpinski · September 12, 2019, 1:25am

Mmap isn’t magical. The data still has to exist and if it doesn’t all fit into memory at once, then the kernel is going to have to swap it back and forth from disk all the time which is going to kill the performance. There’s some chance that an algorithm might happen to operate on the matrix memory in a sufficiently local way that this won’t be terrible, but that is not the case with most dense matrix algorithms.

c42f · September 12, 2019, 2:14am

Internally it seems that we call LAPACK.getrf for the LU decomposition of strided matrices, so it would depend on how well OpenBLAS’s version of dgetrf optimizes for multiple levels of cache (the last level of cache here being the OS page table).

Here’s a paper which discusses cache oblivious algorithms for dense linear algebra, including dgetrf:

stevengj · September 12, 2019, 2:39am

That has nothing to do with whether you use a sparse-matrix algorithm, which depends only on A being sparse (mostly zero, as is common for matrices that are very large) — the full solution vector x is computed by sparse or iterative solvers.

Very large matrices that arise in practice are almost always either sparse or have some other structure that can be exploited to solve the system much more quickly than Gaussian elimination or similar. That’s a good thing, because brute-force “dense” matrix solvers (which exploit no special structure of the matrix) simply don’t scale to very large problems — such algorithms scale with the cube of the matrix size, which means that even if I give you 1000× more processing power, you can only solve a system that is 10× bigger. (You usually run out of memory first.)

There’s no magic trick here that doesn’t involve exploiting matrix structure.

Noel_Araujo · September 13, 2019, 4:11pm

Hellos again, and I bring good news.

Using direct Lapack calls solved my problem perfectly:

using Mmap, LinearAlgebra
N = 2_000
A = rand(ComplexF64, N, N)
b = rand(ComplexF64, N)

io_A = open("/tmp/matrixA.bin", "w+");
A_HD = Mmap.mmap(io_A, Matrix{ComplexF64}, (N,N));
A_HD[:] = A
close(io_A)
GC.gc()

xRam = A\b

io_A = open("/tmp/matrixA.bin","r+")
A_HD = Mmap.mmap(io_A, Matrix{ComplexF64}, (N,N))
xHD = LinearAlgebra.LAPACK.gesv!(A_HD, b)[1]
close(io_A)

println(xHD ≈ xRam) # julia> true

To test the difference of RAM and HDD, another code :

N_range = [5_000,10_000, 15_000, 20_000]

timeRam = zeros(length(N_range))
cc = 1
@progress for N in N_range
    global cc
    A = rand(ComplexF64, N, N)
    b = rand(ComplexF64, N)

    io_A = open("/tmp/matrixA_"*string(N)*".bin", "w+");
    A_HD = Mmap.mmap(io_A, Matrix{ComplexF64}, (N,N));
    A_HD[:] = A

    timeRam[cc] = @elapsed A\b;
    A = 1; GC.gc()
    cc = cc + 1
    close(io_A)
end

timeDisk = zeros(length(N_range))
cc = 1
@progress for N in N_range
    global cc
    io_A = open("/tmp/matrixA_"*string(N)*".bin", "r+");
    A_HD = Mmap.mmap(io_A, Matrix{ComplexF64}, (N,N));
    b = rand(ComplexF64, N) # i dont mind to recreate b for this test
    
    timeDisk[cc] = @elapsed LinearAlgebra.LAPACK.gesv!(A_HD, b)[1];
    cc = cc + 1
    close(io_A)
end

using Plots
plot(N_range, timeRam, markershape=:circle, label="RAM", scale=:log10)
plot!(N_range, timeDisk, markershape=:circle, label="HD", legend=:topleft)
plot!(xlabel="Matrix Size", ylabel="Elapsed Time")

rapido

You can post your curve for comparison too

stevengj · September 13, 2019, 9:07pm

Are you sure you’re exceeding RAM here? A 20000x20000 ComplexF64 matrix takes < 7GB of memory, which most people have. If you use a memory-mapped array that fits into RAM then it is equivalent to just reading all the data into a regular array.

(You don’t need a low-level LAPACK call, either. You can work in place using high level functions like lu! and ldiv!. I think you’re just measuring the memory overhead for the out-of-place A \ b call.)

Noel_Araujo · September 13, 2019, 9:15pm

My goal is to work with N=200 000.

Those values were to get a notion of time consumption.

isaacsas · September 13, 2019, 10:15pm

This isn’t something I have any experience with, but I found it an interesting question and so also Googled a bit. There is a conference paper by the Flame authors from a decade ago that claims they were able to get comparable performance for out-of-core dense LA implementations (in GFLOPs) to their in-core algorithms. They used it to solve a 100,000x100,000 Cholesky factorization in ~1 hour on an 8-core machine.

stevengj · September 13, 2019, 11:14pm

Benchmarking in-memory factorizations will not tell you anything about the cost of out-of-core problems in general. You have no idea how much slowdown you are about to experience (the LAPACK algorithms were not designed to run out of core, so I would be surprised to see them scale well).

Topic		Replies	Views
Use of Memory-mapped I/O General Usage memory , memory-allocation	9	2966	September 5, 2019
Efficient multiplication of out-of-memory sparse matrix and an in-memory dense vector General Usage	7	810	August 19, 2019
Writing an array too large to store in memory Performance question , jld	2	2222	March 16, 2018
Linear solver \(A, B) performance vs Matlab A\b General Usage	32	7777	May 21, 2017
Write to the file the structure of the form (key => value) followed by reading using Mmap.mmap General Usage mmap	12	902	December 16, 2018

I need advices to solve Ax=b with Mmap

Related topics