Help using CUDA to exponentiate a matrix

I am trying to use CUDA to speed up computing the exponential of a matrix.

For some reason, all the steps are fast except for the code inside the while loop:

using LinearAlgebra
using SparseArrays
using CUDA
using CUDA.CUSPARSE

n = 15_000;
A = sprand(n,n,6/n);

A_gpu = CuArray(A)  # note: this converts the sparse matrix to a dense matrix on the GPU

function expCU(A_gpu::CuArray{Float64,2};threshold=1e-6)
    rows = LinearAlgebra.checksquare(A_gpu);
    P = spzeros(rows,rows);
    P[diagind(P)] = ones(rows);

    mat_norm=norm(A_gpu,Inf);

    # Scale A down by a power of two so the series converges quickly; undone by squaring at the end
    scaling_factor = nextpow(2,mat_norm);
    A_gpu = A_gpu./scaling_factor;
    
    P_gpu = CuArray{Float64,2}(P);

    next_term = copy(P_gpu);
    n = 1;
    delta=norm(next_term,Inf);

    while delta>threshold
        next_term=(1/n)*A_gpu*next_term;
        # timing this step
        @time delta=norm(next_term,Inf);
        P_gpu = P_gpu .+ next_term; n = n + 1;
    end

    # Undo the scaling by repeated squaring: exp(A) = exp(A/2^k)^(2^k)
    for n=1:log2(scaling_factor)
        P_gpu=P_gpu*P_gpu;
    end

    return P_gpu
end

expCU(A_gpu);

I have found that the bottleneck is the step timed in the code above. The output for a specific run is:

0.375602 seconds (78 allocations: 3.281 KiB)
0.359730 seconds (78 allocations: 3.281 KiB)
0.359744 seconds (78 allocations: 3.281 KiB)
0.359743 seconds (78 allocations: 3.281 KiB)
0.359767 seconds (78 allocations: 3.281 KiB)
0.363671 seconds (78 allocations: 3.281 KiB)
0.363688 seconds (78 allocations: 3.281 KiB)
0.363686 seconds (78 allocations: 3.281 KiB)
0.359806 seconds (78 allocations: 3.281 KiB)
0.363658 seconds (78 allocations: 3.281 KiB)
0.363692 seconds (78 allocations: 3.281 KiB)
0.363676 seconds (78 allocations: 3.281 KiB)
0.359748 seconds (78 allocations: 3.281 KiB)

Now, when I run the same step on its own in the REPL, I get the following time:

0.029465 seconds (79 allocations: 3.297 KiB)

This means that this step is significantly slower inside the loop, for whatever reason. When I time the steps manually, outside the loop, they are pretty fast. I am pretty new to GPU computing, so I'd be really grateful if someone could explain what I am doing wrong. Thanks!


Hi!

Such a norm operation is slow because it is a reduction (basically, it finds the maximum absolute value of the whole array), which is harder to execute in parallel than elementwise operations. There is also a timing subtlety: CUDA kernels launch asynchronously, so @time on the norm also absorbs the wait for the preceding multiplication to finish, which is why the same line looks fast when run on its own in the REPL.
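You can see this directly with a small standalone test (my sketch, not from your code); CUDA.@sync makes the cost land on the line that actually incurs it:

using CUDA, LinearAlgebra

B = CUDA.rand(Float64, 8192, 8192);
C = CUDA.rand(Float64, 8192, 8192);

CUDA.@sync B * C;        # warm-up: compile the kernels before timing

@time D = B * C;         # looks instant: the multiply is only queued here
@time norm(D, Inf);      # looks slow: the reduction first waits for the multiply
@time CUDA.@sync B * C;  # with an explicit sync, the multiply's real cost appears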

What I recommend is just reducing the number of norm calls as much as possible. Something like this:

function expCU2(A_gpu::CuArray{Float64,2};threshold=1e-6)
    rows = LinearAlgebra.checksquare(A_gpu);
    P = spzeros(rows,rows);
    P[diagind(P)] = ones(rows);

    mat_norm=norm(A_gpu,Inf);

    # Scale A down by a power of two so the series converges quickly; undone by squaring below
    scaling_factor = nextpow(2,mat_norm);
    A_gpu = A_gpu./scaling_factor;
    mat_norm /= scaling_factor
    
    P_gpu = CuArray{Float64,2}(P);

    next_term = copy(P_gpu);
    n = 1;
    delta=one(mat_norm);

    while delta>threshold
        next_term=(1/n)*A_gpu*next_term;
        # estimate the term's norm on the CPU instead of paying for a GPU reduction
        delta *= mat_norm / sqrt(n) # not exact, but a conservative heuristic
        P_gpu = P_gpu .+ next_term
        n += 1
    end

    # Undo the scaling by repeated squaring: exp(A) = exp(A/2^k)^(2^k)
    for n=1:log2(scaling_factor)
        P_gpu=P_gpu*P_gpu;
    end

    return P_gpu
end

These are the benchmark results I got:

julia> using BenchmarkTools

julia> @btime expCU(A_gpu);
  849.408 ms (4459 allocations: 7.90 MiB)

julia> @btime expCU2(A_gpu);
  29.819 ms (3246 allocations: 7.84 MiB)

Also, speaking of the norm timing, part of it seems to be because you allocate new memory for next_term on each iteration. You may want to try creating next_term_buffer = similar(next_term) before the loop and putting

mul!(next_term_buffer, A_gpu, next_term, 1/n, 0)
next_term_buffer, next_term = next_term, next_term_buffer

in the loop.
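Put together, the loop body might look like this (my sketch assembled from the pieces above; the 5-argument mul! writes (1/n)*A_gpu*next_term straight into the buffer, with no allocation):

next_term_buffer = similar(next_term)
while delta > threshold
    mul!(next_term_buffer, A_gpu, next_term, 1/n, 0) # buffer = (1/n) * A_gpu * next_term
    next_term_buffer, next_term = next_term, next_term_buffer
    delta *= mat_norm / sqrt(n) # same heuristic as above
    P_gpu = P_gpu .+ next_term
    n += 1
end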


There are some fast matrix exponential methods in ExponentialUtilities.jl that are CUDA-compatible and that avoid scaling and squaring, precisely to sidestep its performance pitfalls.
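For example, something along these lines (a sketch, assuming a dense CuArray input; exponential! is the package's entry point for dense matrices and its method keyword selects the algorithm, so check the docs for the GPU-friendly choices):

using CUDA, ExponentialUtilities

A_gpu = CuArray(rand(Float64, 1_000, 1_000))
E = exponential!(copy(A_gpu)) # exponential! overwrites its argument, hence the copy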


Thanks, this helped. Changing the norm calculation in my code did speed things up. Also, I've realized that because the input matrix is so large, I need to take care of memory usage. For this, I edited the part inside my while loop:

next_term_buffer = similar(next_term)    # scratch buffer for this iteration
mul!(next_term_buffer, next_term, A_gpu) # buffer = next_term * A_gpu (powers of A commute)
mul!(next_term, next_term_buffer, 1/n)   # next_term = buffer / n, in place
CUDA.unsafe_free!(next_term_buffer)      # return the scratch memory to CUDA's pool