Hi all,
I need to simulate a very large system of ODEs (~10^5 states). At every time step in the simulation I need to update a matrix using the current value of the state vector.
Here is an MWE of the matrix updating function. This needs to be run at every timestep in the ODE integration.
```julia
using BenchmarkTools

n = 1000
coefficients = rand(n^2, 4);       # per-term coefficients
state = ones(n);                   # current state vector
randidxs = rand(1:n, n^2, 4);      # unstructured indices into `state`
result = zeros(n^2);

function viewmultsum!(result, coefficients, state, randidxs)
    # Gather state entries, scale them by the coefficients, and reduce
    # each row of the n^2 x 4 product into `result`.
    @views sum!(result, coefficients .* state[randidxs])
end;

@benchmark viewmultsum!(result, coefficients, state, randidxs)
```
```
BenchmarkTools.Trial: 357 samples with 1 evaluation.
 Range (min … max):  10.177 ms … 54.770 ms  ┊ GC (min … max): 0.00% … 30.50%
 Time  (median):     14.702 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   14.001 ms ±  3.392 ms  ┊ GC (mean ± σ):  14.57% ± 12.51%

 [histogram omitted]
  10.2 ms         Histogram: frequency by time          19.7 ms <

 Memory estimate: 30.52 MiB, allocs estimate: 2.
```
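Most of those 30.52 MiB come from the broadcast `coefficients .* state[randidxs]`, which materializes an `n^2 × 4` temporary before `sum!` reduces it. An explicit loop avoids that allocation entirely; here is a sketch (the name `viewmultsum_loop!` is mine, and I haven't benchmarked it carefully):

```julia
function viewmultsum_loop!(result, coefficients, state, randidxs)
    @inbounds for i in eachindex(result)
        acc = zero(eltype(result))
        # Accumulate the four terms directly rather than materializing
        # the n^2 x 4 broadcast temporary.
        for j in axes(coefficients, 2)
            acc += coefficients[i, j] * state[randidxs[i, j]]
        end
        result[i] = acc
    end
    return result
end
```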
In the real simulation `result` is an `n × n` matrix and I calculate `result * state` at every time step (again, `n > 1e5`).
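Concretely, one time step looks roughly like the sketch below, where `update_result!` stands in for the `viewmultsum!` logic above (reshaped to `n × n`) and `dstate` is a hypothetical preallocated output buffer:

```julia
using LinearAlgebra

update_result!(result, coefficients, state, randidxs)  # refresh the n x n matrix
mul!(dstate, result, state)                            # in-place matvec, no allocation
```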
Given the huge size of the state vector I want to accelerate the computation of `result` as much as I can. I've read the Introduction to GPU programming in the CUDA.jl docs, but I'm still unsure whether GPU acceleration is the right approach. In particular, I'm concerned that the unstructured indexing `state[randidxs]` won't be very performant on GPU hardware. Unfortunately, the equations I am simulating don't seem to allow a more structured access pattern into the `state` vector. I'm also concerned that with `n ~ 1e5`, `result` (an `n × n` matrix of `Float32`s) is at least several dozen GB, so memory-transfer overhead could wipe out the gains from parallel computation on GPUs, which have comparatively little RAM.
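For concreteness, here is the CUDA.jl port as I imagine it; this is an untested sketch that assumes all four arrays fit in device memory (I've used `Float32` to halve the footprint):

```julia
using CUDA  # untested sketch; assumes everything fits in device memory

d_coeffs = CuArray{Float32}(coefficients)
d_state  = CuArray{Float32}(state)
d_idxs   = CuArray(randidxs)
d_result = CUDA.zeros(Float32, size(coefficients, 1))

# Same operation as the CPU MWE. The gather d_state[d_idxs] is the
# unstructured access pattern I'm worried about.
sum!(d_result, d_coeffs .* d_state[d_idxs])
```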
So my specific questions are:
- Does the unstructured access of `state` using `state[randidxs]` necessarily mean performance will be significantly degraded on GPUs?
- Does the large size of the `result` array (10s of GB) mean that data-transfer overhead will kill any performance gains from moving to GPUs? It seems the largest GPU RAM is about 80 GB, and `result` may be bigger than that as `n` gets very large.
- Are there alternative approaches I should investigate before fully committing to the GPU route?
Thank you!