Optimizing Linear Algebra Code?

Hello everyone! I’ve been trying to convert some simple functions that were originally implemented in Matlab into Julia. They revolve around vector/matrix and matrix/matrix multiplication.

My question stems from trying to understand how best to optimize some code for both speed and efficiency. Following the performance tips, it's recommended to pre-allocate the outputs, which would cut down on the amount of memory allocated each time the function is called, possibly by using mul!() or matmul!(). But in this example I saw no real difference in the run time of each function, or even in the number of allocations. Ideally I should be pre-allocating to avoid hitting garbage collection, but even in this simple example it appears not to help.

Is my only real option for a speedup to switch over to GPUs, or is there a subtler, more efficient way to perform these calculations? I've included a small example of the code with benchmarks from my system.

using BenchmarkTools
using LinearAlgebra

n = 1000
pts = 256

m = [pts pts]

Test_mat = randn(ComplexF64,n,prod(m))
Z = randn(typeof(Test_mat[1,1]),m[1],m[2])

# forward operator: Enc * vec(x)
function Afwd(x,Enc)
    return Enc*vec(x)
end

# adjoint operator: Enc' * b, reshaped back to m[1] x m[2]
function Aadj(b,Enc,m)
    return reshape(BLAS.gemv('C',Enc,b),(m[1],m[2]))
end

# normal operator: Enc' * (Enc * vec(x))
function Aadj_Afwd(x, Enc, m)
    return Aadj(Afwd(x,Enc),Enc,m)
end

test = Afwd(vec(Z),Test_mat);

@btime Afwd($Z, $Test_mat)
@btime Aadj($test, $Test_mat, $m)
@btime Aadj_Afwd($Z, $Test_mat, $m)

56.917 ms (3 allocations: 15.83 KiB)
52.040 ms (4 allocations: 1.00 MiB)
104.370 ms (7 allocations: 1.02 MiB)

You are right, it does not make a difference. Here is a preallocated, simplified version if anyone else is curious.

Edit: @mcabbott pointed out on Slack that the time spent on allocations is probably dwarfed by the computation time here, which is why preallocating does not help.

I don’t think we need gemv here, for example; calling mul! with the adjoint is equivalent.

using BenchmarkTools
using LinearAlgebra
function Afwd(x,Enc)
    return Enc*x
end

function Aadj(b,Enc,m)
    return reshape(BLAS.gemv('C',Enc,b),(m[1], m[2]))
end

function Aadj_Afwd(x, Enc, m)
    return Aadj(Afwd(x,Enc),Enc,m)
end

function Aadj_Afwd_new(cache_n,cache_m,x, Enc, m)
    mul!(cache_n,Enc,x)        # forward multiply into the length-n cache
    mul!(cache_m,Enc',cache_n) # adjoint multiply into the length-prod(m) cache
    return reshape(cache_m,(m[1],m[2]))
end

function bench()
    n = 1000
    pts = 256
    
    m = [pts pts]
    
    Test_mat = randn(ComplexF64,n,prod(m))
    Z = vec(randn(typeof(Test_mat[1,1]),m[1],m[2]))
    test = Afwd(vec(Z),Test_mat)
    cache_m = zeros(ComplexF64,prod(m))
    cache_n = zeros(ComplexF64,n)

    @assert all(Aadj_Afwd(Z, Test_mat, m) .== Aadj_Afwd_new(cache_n,cache_m,Z, Test_mat, m)) #test new function

    @btime Aadj_Afwd($Z, $Test_mat, $m)
    @btime Aadj_Afwd_new($cache_n,$cache_m,$Z, $Test_mat, $m)
end

bench()

Are you comparing to something (e.g. Matlab/NumPy)? This is a pretty big computation. That might be close to the upper limit on a CPU.
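As a rough back-of-the-envelope check (assuming the matvec is memory-bandwidth bound and guessing a sustained bandwidth of roughly 25 GB/s for a typical machine), just streaming the matrix through memory once already costs tens of milliseconds, which is right where these timings land:

n, pts = 1000, 256
bytes = n * pts^2 * sizeof(ComplexF64)   # ≈ 1.05 GB for the encoding matrix alone
bytes / 25e9                             # ≈ 0.042 s to read it from memory once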

You can get rid of the allocations if you do the vec operation outside the function (or set Z = randn(typeof(Test_mat[1,1]),m[1]*m[2])) and then create an empty result, result = Vector{ComplexF64}(undef,n). Your function can then be replaced by mul!(result,Test_mat,Z). That way I got zero allocations, but the computation time was effectively the same (and I didn’t try to count the allocation of the result vector, which, if you’re only doing this once, should be accounted for).
So it looks like the allocations aren’t your problem here.
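Concretely, the zero-allocation forward multiply I mean looks roughly like this (reusing the names from the first post, with the input vec’d outside the function and the output preallocated):

using LinearAlgebra

Zv = randn(typeof(Test_mat[1,1]), m[1]*m[2])   # already-vec'd input, built outside the function
result = Vector{ComplexF64}(undef, n)          # preallocated output

mul!(result, Test_mat, Zv)                     # in-place Test_mat * Zv, zero allocations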
On a side note, if you do have a GPU and it handles Float64 well, doing this specific operation on it is trivial in Julia. So if that’s an option, try it out!

1 Like

I was curious, so I tested it on the GPU:

using CUDA
using LinearAlgebra
using BenchmarkTools
n = 1000;
pts = 256;
m = [pts pts];
Test_mat = randn(ComplexF64,n,prod(m));
Z = randn(typeof(Test_mat[1,1]),m[1]*m[2]);
result = zeros(ComplexF64,n);
Test_mat_GPU=CuArray(Test_mat);
Z_GPU = CuArray(Z);
result_GPU = CuArray(result);

@btime mul!($result,$Test_mat,$Z);
  20.288 ms (0 allocations: 0 bytes)

@btime CUDA.@sync mul!($result_GPU,$Test_mat_GPU,$Z_GPU);
  2.924 ms (5 allocations: 96 bytes)

So that’s roughly a 7x speedup (compared to a really good CPU), not counting the time to move the data to the GPU (the CuArray lines), which is significant. So again, it’s only worth it if you run this a bunch of times, but a factor of 7 is not bad.
I do have access to a ridiculous GPU, so YMMV.

1 Like

@pjentsch0 Thank you for working through this little example. These three basic operations form the foundation of a more complex algorithm that was originally written in Matlab. Looking at @profview, my algorithm spends most of its time performing these operations, so I figured this was a reasonable place to start. I’ve included below the profiling results from Matlab for the same functions. Without sharing all of the code, there is a significant run-time difference between my Matlab version and the Julia implementation, so I believe I need to hunt down more optimizations: Matlab runs the entire program in 51 ms, whereas my current implementation takes 2.2 s for the same problem size. It is important to note that all of the benchmarking results I’ve included here were run on the same laptop.


I was hoping to get at least comparable results to the serial Matlab version to ensure that I have a good grasp of the foundations before I dive into GPU programming.

@Ribeiro Thank you as well for working through this example! I’ll try to implement your recommendations for a zero-allocation version of these functions in the hope of better serial performance. As I mentioned earlier, I would like to get this running on a GPU, since that makes the most sense for this problem, but I was hoping to have a reasonable serial version first, as loading these large matrices onto a GPU can be slightly slow. Do you have any experience or recommendations on implementing these types of matrix problems using Nvidia’s Tensor Cores and TensorFloat-32?

Note that if you don’t need Float64 precision, Float32 should be at least 2x faster (and up to 64x faster on GTX/RTX GPUs).
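For example, a minimal single-precision version of the forward multiply above might look like this (variable names are just illustrative; the actual speedup depends on your hardware):

using BenchmarkTools
using LinearAlgebra

n, pts = 1000, 256
Enc32 = randn(ComplexF32, n, pts^2)    # single-precision encoding matrix
x32   = randn(ComplexF32, pts^2)
out32 = Vector{ComplexF32}(undef, n)

@btime mul!($out32, $Enc32, $x32)      # compare against the ComplexF64 timings above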

Thank you for the suggestion. I was trying to structure my code such that the precision is set by Enc, to allow for more flexibility. But why is Float32 so much faster, beyond a simple factor of 2, when running on GTX/RTX cards?
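Something along these lines is what I mean by letting Enc set the precision (make_caches is just an illustrative helper, sketched against the preallocated version above):

using LinearAlgebra

# allocate the two caches with whatever element type Enc carries
function make_caches(Enc)
    T = eltype(Enc)            # ComplexF64, ComplexF32, ...
    n, nm = size(Enc)
    return zeros(T, n), zeros(T, nm)
end

cache_n, cache_m = make_caches(Test_mat)   # reuses Test_mat from the earlier examples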

Short answer is that Nvidia wants to sell Quadros, and gaming mostly doesn’t need Float64, so they intentionally limit Float64 performance on the consumer cards.

1 Like

Did you try MKL.jl (JuliaLinearAlgebra/MKL.jl on GitHub, the Intel MKL linear algebra backend for Julia)?
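On Julia 1.7+ it should just be a matter of loading the package before doing the linear algebra, for example:

using MKL              # swaps the default BLAS backend to Intel MKL
using LinearAlgebra

BLAS.get_config()      # should now list MKL among the loaded BLAS libraries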

1 Like

The things you are timing here are mostly calls to BLAS libraries; I think very little is actually being done by Julia (or by Matlab) itself. You might be using two different BLAS libraries, and depending on your computer you may see improvements from MKL.jl.

Matlab runs the entire program in 51ms where as my current implementation takes 2.2sec

This still seems like a huge difference; the different libraries are not a factor of 40 apart. Something else must be wrong. It could be something weird like this: on sufficiently strange array types you get a very different fallback implementation, which profiling would (I think) still show as time within mul!.

julia> A=rand(100,100); B=rand(100,100);

julia> @btime $A * $B;
  32.958 μs (2 allocations: 78.20 KiB)

julia> C = convert(Array{Real}, B); summary(C)
"100×100 Matrix{Real}"

julia> @btime $A * $C;
  23.732 ms (3050002 allocations: 46.62 MiB)

julia> A * C ≈ A * B
true
1 Like

I haven’t tried MKL.jl, but I looked into Octavian.jl early on and ran into some issues with vector/matrix multiplication. Without knowing more about how Matlab handles this specific calculation, it seems reasonable to assume it is as simple as a different BLAS library being called.

Do you have any recommendations on how to hunt this down? I’ve tried using @code_warntype to look for issues and to keep everything type stable using ComplexF64. Looking at the benchmarking results posted here, though, I am not too surprised by the overall run-time difference. For example, the longest benchmark in Julia was:

Compared to the Matlab profile, which takes 0.021 s to run 50 times, or 420 µs per call. I do think using a GPU is the way to go, but I’m still surprised by the current difference.

There is definitely something wrong here. Anything more than a factor of 2 difference is suspicious.

Can you make an absolutely minimal example in both Matlab and Julia, showing both the code and the benchmarking? Right now there is a bit too much code and too many things going on in your example, so I tried to condense it.

Here is my MWE, using Julia 1.6 (OpenBLAS) and Matlab R2021a (MKL). Does it capture your situation?

Julia:

using BenchmarkTools

N, M = 1000, 256
A = rand(ComplexF64, N, M^2)
B = rand(ComplexF64, M, M)

testmul(X, Y) = X * vec(Y)

Matlab:

N = 1000;
M = 256;
A = rand(N, M^2) + 1i * rand(N, M^2);
B = rand(M,M) + 1i * rand(M,M);

Benchmarks:

jl> @btime testmul($A, $B);
  39.787 ms (3 allocations: 15.83 KiB)

>> timeit(@()A * B(:))
ans =
    0.0410

These speeds are basically identical.

(BTW: Instead of typeof(Test_mat[1,1]), it is more idiomatic to use eltype(Test_mat), e.g. randn(eltype(Test_mat), m[1], m[2]).)

1 Like

@DNF You are indeed correct that something screwy is going on, but I agree with everyone here that it’s not an issue with BLAS. As per your suggestion, I stripped down the functions that use these multiplications, and directly comparing Matlab to Julia I now see a ~4.5x speed improvement when using Julia, which makes me think I have some other junk slowing down my program.

I need to do some more digging to find the slowdown in the main program, but I wanted to thank everyone for your help with this!!

3 Likes