Hi all. New Julia user here. I want to significantly increase performance and reduce the amount of memory allocated in a function similar to the MWE below. Basically I need to index into a vector using a matrix of indices, multiply the resulting elements elementwise by another matrix, and then sum across the columns of the result (one value per row).
using BenchmarkTools
n = 300
coefficients = rand(n^2, 4);
state = ones(n);
randidxs = rand(1:n, n^2, 4);
result = zeros(n^2);
function viewmultsum!(result, coefficients, state, randidxs)
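    # Multiply each coefficient by the randomly indexed state value, then sum
    # across the 4 columns into `result` (one entry per row of `coefficients`).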
    @views sum!(result, coefficients .* state[randidxs])
end;
@benchmark viewmultsum!(result, coefficients, state, randidxs)
BenchmarkTools.Trial: 5263 samples with 1 evaluation.
 Range (min … max):  581.567 μs …   9.461 ms  ┊ GC (min … max):  0.00% … 90.38%
 Time  (median):     760.413 μs               ┊ GC (median):     0.00%
 Time  (mean ± σ):   941.917 μs ± 660.843 μs  ┊ GC (mean ± σ):  17.49% ± 18.97%

 [histogram omitted: log(frequency) by time, 582 μs … 3.53 ms]
Memory estimate: 2.75 MiB, allocs estimate: 2.
I'll need to call a function similar to this several thousand times while integrating a very large system of ODEs (n > 1e4), so it is performance critical. I thought using @views and summing in place with sum! would mean that I don't have to allocate much memory. I only have 2 allocations in the MWE, but they seem quite large and I don't know exactly where they come from.
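For reference, the only allocation-free alternative I could think of is to fuse the elementwise multiply and the row sum into an explicit loop, roughly like the sketch below (loopmultsum! is just a name I made up, and I haven't verified this is the idiomatic approach):

# Sketch: fuse the multiply and row-sum into one loop so no temporary
# n^2-by-4 matrix is allocated by the broadcast.
function loopmultsum!(result, coefficients, state, randidxs)
    fill!(result, zero(eltype(result)))
    @inbounds for j in axes(coefficients, 2), i in axes(coefficients, 1)
        result[i] += coefficients[i, j] * state[randidxs[i, j]]
    end
    return result
end

Iterating with i innermost keeps the column-major accesses to coefficients and randidxs contiguous, although the state[randidxs[i, j]] lookups are of course still random.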
I know accessing the elements of state in a random manner is not ideal for cache performance, but in the real ODE system there is no structured way I can access its elements. In short, the poor cache performance may be unavoidable.
Is there any way I can reduce the allocations in the MWE? Or is the best bet for performance improvement to look at parallelization using hardware like GPUs? Any other performance tips/pointers are much appreciated.
Thanks!