Hi all,
I need to simulate a very large system of ODEs (~10^5 states). At every time step in the simulation I need to update a matrix using the current value of the state vector.
Here is an MWE of the matrix-updating function. This needs to run at every time step of the ODE integration.
```julia
using BenchmarkTools

n = 1000
coefficients = rand(n^2, 4);
state = ones(n);
randidxs = rand(1:n, n^2, 4);
result = zeros(n^2);

function viewmultsum!(result, coefficients, state, randidxs)
    @views sum!(result, coefficients .* state[randidxs])
end;

@benchmark viewmultsum!(result, coefficients, state, randidxs)
```
```
BenchmarkTools.Trial: 357 samples with 1 evaluation.
 Range (min … max):  10.177 ms … 54.770 ms  ┊ GC (min … max):  0.00% … 30.50%
 Time  (median):     14.702 ms              ┊ GC (median):      0.00%
 Time  (mean ± σ):   14.001 ms ±  3.392 ms  ┊ GC (mean ± σ):  14.57% ± 12.51%

 [histogram omitted: frequency by time, 10.2 ms to 19.7 ms]

 Memory estimate: 30.52 MiB, allocs estimate: 2.
```
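The 30.52 MiB allocation is the temporary n^2 × 4 matrix that `coefficients .* state[randidxs]` materializes before `sum!` reduces it. For reference, here is a minimal allocation-free sketch of the same reduction as an explicit (threaded) loop, which I imagine is also closer to what a GPU kernel would look like:

```julia
function loopmultsum!(result, coefficients, state, randidxs)
    # Same row-wise reduction as viewmultsum!, but accumulating directly
    # instead of materializing the temporary n^2 x 4 broadcast result.
    Threads.@threads for i in axes(coefficients, 1)
        acc = 0.0
        @inbounds for j in axes(coefficients, 2)
            acc += coefficients[i, j] * state[randidxs[i, j]]
        end
        result[i] = acc
    end
    return result
end
```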
In the real simulation `result` is an n × n matrix and I calculate `result * state` at every time step (again, n > 1e5).
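At that scale I would of course do that product in place with `mul!`; roughly (placeholder names, small `n` for illustration):

```julia
using LinearAlgebra

n = 1_000            # illustration only; the real n is > 1e5
J     = rand(n, n)   # stands in for the full n × n `result` matrix
state = ones(n)
du    = zeros(n)     # preallocated output for the ODE right-hand side

# In-place matrix-vector product: du = J * state, no allocation.
mul!(du, J, state)
```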
Given the huge size of the state vector I want to accelerate the computation of `result` as much as I can. I've read the introduction to GPU programming in the CUDA.jl docs, but I'm still unsure whether GPU acceleration is the right approach.
In particular, I'm concerned that array indexing with `state[randidxs]` won't be very performant on GPU hardware; unfortunately, the equations I am simulating don't seem to allow a more structured access pattern into the state vector. I'm also concerned that with n ~ 1e5, `result` (an n × n matrix of `Float32`s) will be at least several dozen GB, so memory-transfer overhead could erase any gains from parallel computation on GPUs, which have relatively little RAM.
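For concreteness, the naive CUDA.jl port I have in mind just moves the arrays to the device and reuses the same broadcast (an untested sketch; the `d_`-prefixed names are mine):

```julia
using CUDA

# One-time uploads: coefficients and randidxs never change, so only `state`
# needs to be refreshed (or kept resident on the device) between time steps.
d_coefficients = CuArray{Float32}(coefficients)
d_state        = CuArray{Float32}(state)
d_randidxs     = CuArray(randidxs)
d_result       = CUDA.zeros(Float32, size(coefficients, 1))

function multsum_gpu!(result, coefficients, state, randidxs)
    # state[randidxs] becomes a gather kernel on CuArrays; whether those
    # scattered reads coalesce well is exactly my first question below.
    sum!(result, coefficients .* state[randidxs])
end

multsum_gpu!(d_result, d_coefficients, d_state, d_randidxs)
```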
So my specific questions are:
- Does the unstructured access of `state` using `state[randidxs]` necessarily mean performance will be significantly degraded on GPUs?
- Does the large size of the `result` array (tens of GB) mean that data-transfer overhead will kill any performance gains from moving to GPUs? It seems the largest GPU RAM is about 80 GB, and `result` may be bigger than that as `n` gets very large.
- Are there alternative approaches I should investigate before fully committing to the GPU route?
Thank you!
