Help using CUDA to exponentiate Matrix

I am trying to use CUDA to speed up the process of finding the exponential of a matrix.

For some reason, all the steps are fast except the code written inside a loop

using LinearAlgebra
using SparseArrays
using CUDA
using CUDA.CUSPARSE

n = 15_000;
A = sprand(n,n,6/n);

A_gpu = CuArray(A)

function expCU(A_gpu::CuArray{Float64,2};threshold=1e-6)
    rows = LinearAlgebra.checksquare(A_gpu);
    P = spzeros(rows,rows);
    P[diagind(P)] = ones(rows);

    mat_norm=norm(A_gpu,Inf);

    # Scaling to ensure matrix doesn't blow up ensuring significant speed up
    scaling_factor = nextpow(2,mat_norm);
    A_gpu = A_gpu./scaling_factor;
    
    P_gpu = CuArray{Float64,2}(P);

    next_term = copy(P_gpu);
    n = 1;
    delta=norm(next_term,Inf);
    nonzero_tol = 1e-12;

    while delta>threshold
        next_term=(1/n)*A_gpu*next_term;
        # timing this step
        @time delta=norm(next_term,Inf);
        P_gpu = P_gpu.+ next_term; n=n+1;
    end

    # Squaring of P to generate correct P
    for n=1:log2(scaling_factor)
        P_gpu=P_gpu*P_gpu;
    end

    return P_gpu
end

expCU(A_gpu);

I have found that the bottleneck is the step timed in the above code. The output for a specific run is

0.375602 seconds (78 allocations: 3.281 KiB)
0.359730 seconds (78 allocations: 3.281 KiB)
0.359744 seconds (78 allocations: 3.281 KiB)
0.359743 seconds (78 allocations: 3.281 KiB)
0.359767 seconds (78 allocations: 3.281 KiB)
0.363671 seconds (78 allocations: 3.281 KiB)
0.363688 seconds (78 allocations: 3.281 KiB)
0.363686 seconds (78 allocations: 3.281 KiB)
0.359806 seconds (78 allocations: 3.281 KiB)
0.363658 seconds (78 allocations: 3.281 KiB)
0.363692 seconds (78 allocations: 3.281 KiB)
0.363676 seconds (78 allocations: 3.281 KiB)
0.359748 seconds (78 allocations: 3.281 KiB)

Now when I try running the step as is on my terminal (REPL) I get the following time

0.029465 seconds (79 allocations: 3.297 KiB)

This means that this step is significantly slower when inside the loop for whatever reason. When I try timing the steps without the loop (manually) it is pretty fast. I am pretty new to GPU computing so I’d be really grateful if someone can explain what I am doing wrong. Thanks!

Hi!

Such norm operation is slow, because it is a reducing operation (basically, it finds the maximum absolute value of the array), which is hard to execute in parallel.

What I recommend is just reducing the number of norm calls as much as possible. Something like this:

function expCU2(A_gpu::CuArray{Float64,2};threshold=1e-6)
    rows = LinearAlgebra.checksquare(A_gpu);
    P = spzeros(rows,rows);
    P[diagind(P)] = ones(rows);

    mat_norm=norm(A_gpu,Inf);

    # Scaling to ensure matrix doesn't blow up ensuring significant speed up
    scaling_factor = nextpow(2,mat_norm);
    A_gpu = A_gpu./scaling_factor;
    mat_norm /= scaling_factor
    
    P_gpu = CuArray{Float64,2}(P);

    next_term = copy(P_gpu);
    n = 1;
    delta=one(mat_norm);
    nonzero_tol = 1e-12;

    while delta>threshold
        next_term=(1/n)*A_gpu*next_term;
        # timing this step
        delta *= mat_norm / sqrt(n) # Not accurate but OK heuristics
        P_gpu = P_gpu.+ next_term
        n += 1
    end
    # Squaring of P to generate correct P
    for n=1:log2(scaling_factor)
        P_gpu=P_gpu*P_gpu;
    end

    return P_gpu
end

These are the benchmark results I got:

julia> using BenchmarkTools

julia> @btime expCU(A_gpu);
  849.408 ms (4459 allocations: 7.90 MiB)

julia> @btime expCU2(A_gpu);
  29.819 ms (3246 allocations: 7.84 MiB)

Also, speaking of norm timing, it seems that it is because you allocate new memory for next_term each iteration. You may want to try creating next_term_buffer = similar(next_term) before the loop and

mul!(next_term_buffer, next_term, 1/n, 0)
next_term_buffer, next_term = next_term, next_term_buffer

in the loop.

There are some fast matrix exponential methods in ExponentialUtiltiies.jl that are CUDA-compatible that don’t use squaring and scaling in order to avoid the performance pitfalls of that.

Thanks, this helped. Changing the norm calculation in my code did speed things up. Also, I’ve realized that because the input matrix is so large I need to take care of memory usage. For this, I edited the part inside my for loop:

next_term_buffer = similar(next_term)
mul!(next_term_buffer, next_term, A_gpu)
mul!(next_term, next_term_buffer, 1/n)
CUDA.unsafe_free!(next_term_buffer)