Thanks for the hints @maleadt!
I might have oversimplified (and in another way overcomplicated) the problem. I actually have a struct with a large array whose update reads several other very large arrays, both within the same struct and in another one. So there are no race conditions in my real problem (my bad).
So it’s more like this (some representative numbers included):
```julia
for i in 1:length(struct1.bigVector1)          # parallelized, 100k points
    accumulator = 0.0
    for j in 1:length(struct1.bigVector2)      # 95k points
        accumulator += struct1.bigVector2[i] * struct1.bigVector3[j]  # actual math more complicated
    end
    for j in 1:length(struct2.bigVector4)      # 15k points
        accumulator += struct1.bigVector2[i] * struct2.bigVector4[j]
    end
    struct1.bigVector1[i] = accumulator
end
```
I am indeed only writing to the big vector once.
Knowing that there are in fact 4 or 5 arrays per struct with 15k–100k elements (often the elements are `SArray`s with 3 elements), all Float64 (I know…), do you still think the shared-memory allocation approach is worth it? If so, can you point me to some examples?
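For concreteness, here is the kind of shared-memory staging I think was meant, based on my (possibly wrong) understanding of CUDA.jl's `CuStaticSharedArray` — just a sketch with a hypothetical kernel that tiles `bigVector4` through shared memory, assuming 256 threads per block:

```julia
using CUDA

# Hypothetical kernel: each block cooperatively stages a 256-element tile
# of `v4` into shared memory, so all threads in the block reuse it from
# fast on-chip memory instead of re-reading global memory.
function tiled_kernel!(out, v2, v4)
    TILE = 256                                  # must match blockDim().x
    tile = CuStaticSharedArray(Float64, 256)    # per-block shared memory
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    acc = 0.0
    for base in 0:TILE:length(v4)-1
        j = base + threadIdx().x                # cooperative tile load
        if j <= length(v4)
            tile[threadIdx().x] = v4[j]
        end
        sync_threads()                          # tile fully loaded
        if i <= length(out)
            for t in 1:min(TILE, length(v4) - base)
                acc += v2[i] * tile[t]
            end
        end
        sync_threads()                          # done before next overwrite
    end
    if i <= length(out)
        out[i] = acc
    end
    return
end
```

No idea yet whether the arithmetic intensity here is high enough for this to pay off in my case.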
Going back to this issue: @cuda threads and blocks confusion — I tried experimenting with threads and blocks. Initially I used this:
```julia
kernel = @cuda launch=false myfun!(struct1, struct2)  # not sure this works, or if it needs a plain array as input so it can inspect its length
config = launch_configuration(kernel.fun)
threads = min(length(struct1.bigVector1), config.threads)
blocks = cld(length(struct1.bigVector1), threads)
CUDA.@sync kernel(struct1, struct2; threads=threads, blocks=blocks)
```
with, inside the kernel,

```julia
i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
if i <= length(struct1.bigVector1)
```
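For completeness, the whole kernel currently looks roughly like this (reconstructed from the loop above; `myfun!` is just my name for it):

```julia
function myfun!(struct1, struct2)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(struct1.bigVector1)          # guard against overshoot
        acc = 0.0
        for j in 1:length(struct1.bigVector2)   # 95k points
            acc += struct1.bigVector2[i] * struct1.bigVector3[j]
        end
        for j in 1:length(struct2.bigVector4)   # 15k points
            acc += struct1.bigVector2[i] * struct2.bigVector4[j]
        end
        struct1.bigVector1[i] = acc
    end
    return
end
```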
and ended up with 256 threads and 392 blocks. That took 40 s (vs. 60 s on the CPU). Keeping the same number of threads but setting blocks anywhere from 1 to 16 took under 4 s (with little variation across that range)! So I got a substantial speedup this way (no idea why). I'm still a bit lost about how I should set these numbers; I copied the current method from the CUDA.jl examples.
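Though now I wonder: with only 16 blocks of 256 threads (4096 threads total) and the `i <= length` guard, most of the 100k elements would simply never be computed, which might explain the "speedup". If I understand the CUDA.jl introduction correctly, a grid-stride loop is the usual way to let a small fixed block count still cover the whole array — a sketch (same assumed structs as above):

```julia
function myfun_stride!(struct1, struct2)
    # grid-stride loop: each thread processes every `stride`-th element,
    # so any threads/blocks combination covers the full array
    stride = blockDim().x * gridDim().x
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    while i <= length(struct1.bigVector1)
        acc = 0.0
        for j in 1:length(struct1.bigVector2)
            acc += struct1.bigVector2[i] * struct1.bigVector3[j]
        end
        for j in 1:length(struct2.bigVector4)
            acc += struct1.bigVector2[i] * struct2.bigVector4[j]
        end
        struct1.bigVector1[i] = acc
        i += stride
    end
    return
end
```

I should probably verify the results agree with the CPU version before trusting any of the timings.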
Any help is greatly appreciated!