GPU parallelization for a very large system of ODEs

Hi all,

I need to simulate a very large system of ODEs (~10^5 states). At every time step in the simulation I need to update a matrix using the current value of the state vector.

Here is an MWE of the matrix-updating function, which needs to run at every time step of the ODE integration.

using BenchmarkTools

n = 1000
coefficients = rand(n^2, 4);
state = ones(n);
randidxs = rand(1:n, n^2, 4);
result = zeros(n^2);

function viewmultsum!(result, coefficients, state, randidxs)
    # for each row i: result[i] = sum_j coefficients[i, j] * state[randidxs[i, j]]
    @views sum!(result, coefficients .* state[randidxs])
end;

@benchmark viewmultsum!(result, coefficients, state, randidxs)
BenchmarkTools.Trial: 357 samples with 1 evaluation.
 Range (min … max):  10.177 ms … 54.770 ms  β”Š GC (min … max):  0.00% … 30.50%
 Time  (median):     14.702 ms              β”Š GC (median):     0.00%
 Time  (mean Β± Οƒ):   14.001 ms Β±  3.392 ms  β”Š GC (mean Β± Οƒ):  14.57% Β± 12.51%

      ▁▄▆▄▄                         β–‚β–ˆβ–„β–‚                       
  β–„β–†β–ˆβ–†β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–†β–„β–„β–‚β–…β–‚β–„β–„β–ƒβ–„β–ƒβ–‚β–ƒβ–β–ƒβ–„β–ƒβ–β–ƒβ–†β–†β–†β–„β–‡β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–†β–†β–„β–ƒβ–ƒβ–ƒβ–β–„β–ƒβ–„β–ƒβ–β–β–…β–β–‚β–β–‚β–β–β–„ β–„
  10.2 ms         Histogram: frequency by time        19.7 ms <

 Memory estimate: 30.52 MiB, allocs estimate: 2.

In the real simulation, result is an n x n matrix and I calculate result * state at every time step (again with n > 1e5).

Given the huge size of the state vector I want to accelerate the computation of result as much as I can. I’ve read the Introduction to GPU programming in the CUDA.jl docs but I’m still unsure if GPU acceleration is the right approach.

In particular, I'm concerned that the array indexing with state[randidxs] won't be very performant on GPU hardware. Unfortunately, the equations I am simulating don't seem to allow a more structured access pattern into the state vector. I'm also concerned that, with n ~ 1e5 and hence result (an n x n matrix of Float32s) being at least several dozen GB, memory transfer overhead will undo any gains from parallel computation on GPUs, which have relatively little RAM.

So my specific questions are:

  1. Does the unstructured access of state using state[randidxs] necessarily mean performance will be significantly degraded on GPUs?
  2. Does the large size of the result array (tens of GB) mean that data transfer overhead will kill the performance gains from moving to GPUs? It seems the largest GPU RAM is about 80 GB, and result may be bigger than that as n gets very large.
  3. Are there alternative approaches I should investigate before fully committing to the GPU route?

Thank you!

That’s fine.

If you keep the entire computation on the GPU you should be fine. That said, if your operations are all only O(n) like shown in your example, then you’re not likely to see that much of a speedup.
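
For concreteness, "keeping everything on the GPU" would look roughly like the following with CUDA.jl. This is a minimal, untested sketch of the same broadcast as your MWE (the d_ names are just placeholders, and the gather d_state[d_randidxs] still allocates a temporary on the device):

using CUDA

# Allocate/copy everything to the device once, outside the integration loop
d_coefficients = CuArray{Float32}(coefficients)
d_state        = CuArray{Float32}(state)
d_randidxs     = CuArray(randidxs)
d_result       = CUDA.zeros(Float32, n^2)

# Same pattern as the CPU MWE: d_state[d_randidxs] is a gather that runs on
# the GPU, and sum! reduces along the second dimension
sum!(d_result, d_coefficients .* d_state[d_randidxs])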

What is your actual equation?

Thanks for the response!

If you keep the entire computation on the GPU you should be fine. That said, if your operations are all only O(n) like shown in your example, then you’re not likely to see that much of a speedup.

Regarding the computations being only O(n) and hence not likely to be sped up: my understanding was that moving the computation to the GPU would parallelize viewmultsum!

function viewmultsum!(result, coefficients, state, randidxs)
    @views sum!(result, coefficients .* state[randidxs])
end;

and therefore roughly reduce the computational time by a factor proportional to the number of GPU cores. Are you suggesting I need to write some kind of kernel that is explicitly parallel instead of using broadcasting? For example, something like the following (run with 4 threads here), but with a much larger number of threads:

n = 1000
coefficients = rand(n^2, 4);
state = ones(n);
randidxs = rand(1:n, n^2, 4);
result = zeros(n^2);

function loopmultsum!(result, coefficients, state, randidxs)
    is, js = axes(coefficients)
    for j in js
        Threads.@threads for i in is
            c, r = coefficients[i, j], randidxs[i, j]
            s = state[r]
            # the first column initializes result[i]; later columns accumulate
            result[i] = (j == first(js)) ? c * s : muladd(c, s, result[i])
        end
    end
end;

@benchmark loopmultsum!(result, coefficients, state, randidxs)
BenchmarkTools.Trial: 1193 samples with 1 evaluation.
 Range (min … max):  3.011 ms … 41.851 ms  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     3.878 ms              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   4.181 ms Β±  1.893 ms  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

          β–‚β–ˆβ–„                                                 
  β–ƒβ–†β–†β–…β–…β–ƒβ–„β–†β–ˆβ–ˆβ–ˆβ–ˆβ–…β–„β–„β–„β–…β–ƒβ–„β–ƒβ–ƒβ–ƒβ–ƒβ–ƒβ–ƒβ–‚β–‚β–‚β–ƒβ–‚β–‚β–‚β–‚β–‚β–‚β–‚β–ƒβ–‚β–‚β–β–‚β–‚β–‚β–‚β–β–‚β–β–‚β–‚β–β–‚β–β–β–‚β–‚β–‚β–β–‚ β–ƒ
  3.01 ms        Histogram: frequency by time        8.05 ms <

 Memory estimate: 8.59 KiB, allocs estimate: 95.

I wasn’t sure if the nested looping would degrade GPU performance.
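
For concreteness, is a hand-written kernel like the following what you have in mind? This is only an untested sketch, assuming the arrays have already been copied to the device as CuArrays (the d_ names are placeholders):

using CUDA

d_coefficients = CuArray{Float32}(coefficients)
d_state        = CuArray{Float32}(state)
d_randidxs     = CuArray(randidxs)
d_result       = CUDA.zeros(Float32, n^2)

# One GPU thread per row of coefficients: each thread gathers its four
# state entries, multiplies by the coefficients, and writes the sum to result[i]
function multsum_kernel!(result, coefficients, state, randidxs)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(result)
        acc = zero(eltype(result))
        for j in 1:size(coefficients, 2)
            acc += coefficients[i, j] * state[randidxs[i, j]]
        end
        result[i] = acc
    end
    return nothing
end

# Launch with enough blocks to cover all n^2 rows
threads = 256
blocks = cld(length(d_result), threads)
@cuda threads=threads blocks=blocks multsum_kernel!(d_result, d_coefficients, d_state, d_randidxs)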

And regarding

What is your actual equation?

The actual equation is du/dt = (D + R) u + c. Where D is a constant diagonal matrix, R is the result from the code above reshaped to be n x n, and c is a constant vector. R is not actually filled by randomly indexing state (i.e. u) as above, but I don’t think getting into the details of how state is indexed would be helpful. Suffice it to say that instead of the randidxs array above, I have another integer filled array trueidxs that I use to index state. In short the equation is a large, nonlinear (due to R) system of first order ODEs.
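
In code, the right-hand side I am integrating would look roughly like this. It is just a sketch: the parameter tuple p and the names are hypothetical, Rflat is the length-n^2 buffer that viewmultsum! fills, and R is the same memory reshaped to n x n.

using LinearAlgebra

# du/dt = (D + R(u)) u + c, with R rebuilt from the current state at every call.
# D is stored as a vector of diagonal entries and c is a constant vector.
function rhs!(du, u, p, t)
    D, c, Rflat, R, coefficients, trueidxs = p
    viewmultsum!(Rflat, coefficients, u, trueidxs)  # refresh R from the current state u
    mul!(du, R, u)                                  # du = R * u
    du .+= D .* u .+ c                              # add the diagonal part and the constant
    return nothing
end

# e.g. p = (D, c, Rflat, reshape(Rflat, n, n), coefficients, trueidxs)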

It would, but you need quite a bit of compute for GPUs to be faster than CPUs for O(n) calculations. They just aren’t great at things with that scaling.

No, GPU cores are much slower than CPU cores, so that mental math is miscalibrated. For a larger discussion, see Doing small network scientific machine learning in Julia 5x faster than PyTorch.


At N=1000 you start to break even with CPU on Float32 operations. With Float64, the breakeven is closer to N=10000 because of how much slower the FPUs are on GPUs.

Can you express R as a sparse matrix, or more directly as code? This looks like a PDE discretization, and using dense arrays is generally slow for that. How dense is R?


Thanks for the reply!

It would, but you need quite a bit of compute for GPUs to be faster than CPUs for O(n) calculations. They just aren’t great at things with that scaling.

I think I see what you mean about GPUs not being great for calculations with O(n) scaling. In my case, though, I thought the algorithm that fills result is O(n^2), since result has n^2 entries (n x n once reshaped). Maybe the for-loop version of the code makes that clearer (copied below for reference). Let me know if I am mistaken.

n = 1000
coefficients = rand(n^2, 4);
state = ones(n);
randidxs = rand(1:n, n^2, 4);
result = zeros(n^2);

function loopmultsum!(result, coefficients, state, randidxs)
    is, js = axes(coefficients)
    for j in js
        Threads.@threads for i in is
            c, r = coefficients[i, j], randidxs[i, j]
            s = state[r]
            result[i] = (j == first(js)) ? c * s : muladd(c, s, result[i])
        end
    end
end;

Regarding the sparsity of R and how to construct it as code: it is constructed very similarly to result in viewmultsum! or loopmultsum! above. The only differences are the reshaping and the specific values in coefficients and randidxs. Concretely, to get R I write:

n = 1000
coefficients = rand(n^2, 4);
state = ones(n);
randidxs = rand(1:n, n^2, 4);
R = zeros(n, n);

function viewmultsum!(R, coefficients, state, randidxs)
    # reshape returns a view that shares R's memory, so summing into the
    # flattened view fills the n x n matrix R in place
    Rvec = reshape(R, :)
    @views sum!(Rvec, coefficients .* state[randidxs])
    return R
end;

It doesn’t seem there is much overhead associated with reshaping so I didn’t include that part in the original post.
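
As far as I can tell, reshape just returns a view that shares the parent's memory, so no copy is made. A quick check:

A = zeros(4, 4)
v = reshape(A, :)   # no copy; v aliases the memory of A
v[1] = 1.0
A[1, 1] == 1.0      # true, so sum! into the flat view fills A in place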

How dense is R?

It turns out that for the real problem the R matrix is pretty sparse, e.g.

count(iszero, R) / length(R) = 0.9778738175089268

for n ~ 9000 (and hence R having around 81M elements, since it is n x n). It also seems that R gets sparser as n grows. Are you suggesting there is some way to greatly reduce the computational demands using the sparsity of R? Does SparseArrays.jl have GPU support?

Thanks again.

That's not really sparse enough to warrant a sparse matrix; you generally want fewer than about 1 in 1,000 entries to be nonzero. So at the larger end of your problem sizes, where it gets sparser, it could make sense.

If it's sparse enough, then yes. And yes, SparseArrays has GPU support, and there are GPU sparse solvers. But sparse solvers on the GPU do not scale as well as the dense case does, so you need rather large and structured sparsity patterns for them to outperform the CPU.
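
If you do go that route, the pattern is roughly: build the sparse matrix on the CPU and move it over for the matrix-vector products. An untested sketch using CUDA.jl's CUSPARSE wrappers, with placeholder names:

using SparseArrays, LinearAlgebra, CUDA
using CUDA.CUSPARSE: CuSparseMatrixCSR

R_sparse = sparse(R)                    # keep only the nonzero entries
dR       = CuSparseMatrixCSR(R_sparse)  # move to the GPU in CSR format
d_u      = CuArray(state)
d_du     = CUDA.zeros(eltype(state), length(state))
mul!(d_du, dR, d_u)                     # sparse matrix-vector product via CUSPARSE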
