Tips on writing kernels?

Hello!
I’m working on a problem where I have structs with big vectors and I have to compute stuff index vs index (so N² work).
Something like this:

for i in 1:length(bigVector1)
  for j in 1:length(bigVector2)
    bigVector1[i] += bigVector1[i]*bigVector2[j];
  end
  for j in 1:length(bigVector1)
    bigVector1[i] += bigVector1[i]*bigVector1[j];
  end
end

(but obviously much more complicated. The example above is just to illustrate the memory access.)
When porting this to the GPU, I parallelized the outer loop, but I think this is likely not an ideal GPU problem, as each GPU thread needs to look at everything I have in memory.
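
Concretely, the straight port looks roughly like this, with one GPU thread per outer index i (the kernel name and launch parameters are just for illustration, and I’m only showing the first inner loop):

function naive_kernel!(bigVector1, bigVector2)
  i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
  if i <= length(bigVector1)
    for j in 1:length(bigVector2)
      bigVector1[i] += bigVector1[i]*bigVector2[j]
    end
  end
  return nothing
end

# launched e.g. with
# @cuda threads=256 blocks=cld(length(bigVector1), 256) naive_kernel!(bigVector1, bigVector2)
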
Can anyone point me to some tips on how to write such kernels efficiently? This family of problems is quite large, so I’m guessing other people run into this as well. The examples I find online are very simple. So far, converting my CPU code to the GPU with no changes at all yielded a 30% speedup, which is something, but far from what I would like.
Thanks a lot!

If your outer loop is large enough, and you can spawn enough threads (to launch enough full blocks), you don’t necessarily need to parallelize any further but just optimize the current kernel.
Some quick thoughts:

  • Avoid writing to bigVector1 in each iteration of the first inner loop. You can cache the value in a local variable and only write it once; if you need to synchronize across blocks, split the outer loop into two kernels.
  • Each thread reads the entirety of bigVector2 in the first loop, one element at a time. Instead, you could have each block cache these reads in shared memory: allocate a block of N values, fill it cooperatively (with coalescable reads, i.e. shmem[threadIdx().x] = bigVector2[threadIdx().x + chunk_offset]), and use those cached values (see the sketch after this list).
  • The second loop seems racy: you’re writing to bigVector1 while other threads are reading from it.
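
Roughly, that shared-memory pattern looks like this. Just a sketch: out/a/b stand in for your vectors, the tile is a fixed 256 Float64 values matching the block size, the math is a placeholder, and CuStaticSharedArray assumes a recent CUDA.jl:

function tiled_kernel!(out, a, b)
  tile = CuStaticSharedArray(Float64, 256)   # one tile per block, size must equal blockDim().x
  i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
  x = i <= length(a) ? a[i] : 0.0            # hoisted, loop-invariant read
  acc = 0.0
  offset = 0
  while offset < length(b)
    # cooperatively stage one tile of b: consecutive threads read consecutive
    # elements, so these global loads coalesce
    j = offset + threadIdx().x
    tile[threadIdx().x] = j <= length(b) ? b[j] : 0.0
    sync_threads()
    # every thread then iterates over the cached tile in fast shared memory
    for k in 1:min(256, length(b) - offset)
      acc += x * tile[k]                     # placeholder for the real per-pair math
    end
    sync_threads()
    offset += 256
  end
  if i <= length(out)
    out[i] = acc
  end
  return nothing
end

Launch it with 256 threads per block (e.g. blocks = cld(length(out), 256)); the tile size has to match the block size you launch with.
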

Thanks for the hints @maleadt!
I might have oversimplified (and at the same time muddled) the problem. What I actually have is a struct with a large array that is computed from other very large arrays, both within the same struct and in another one. So there are no race conditions in my real problem (my bad).
So it’s more like this (some representative numbers included):

for i in 1:length(struct1.bigVector1) # parallelized, 100k points
  accumulator = 0.0;
  for j in 1:length(struct1.bigVector2) # 95k points
    accumulator += struct1.bigVector2[i]*struct1.bigVector3[j]; # actual math more complicated
  end
  for j in 1:length(struct2.bigVector4) # 15k points
    accumulator += struct1.bigVector2[i]*struct2.bigVector4[j];
  end
  struct1.bigVector1[i] = accumulator;
end

I am indeed only writing to the big vector once.
Knowing that there are in fact 4 or 5 arrays per struct with 15k-100k elements (often the elements are SArrays with 3 elements), all Float64 (I know…), do you still think it is worth doing the shared mem allocation thing? If so, can you point to some examples?

Going back to this issue (@cuda threads and blocks confusion), I tried messing with the threads/blocks settings. Initially I used this:

kernel = @cuda launch=false myfun!(struct1,struct2); # not sure this works or if it needs a simple array as an input so it can look at its length
config = launch_configuration(kernel.fun)
threads = min(length(struct1.bigVector1), config.threads)
blocks = cld(length(struct1.bigVector1), threads)
CUDA.@sync kernel(struct1,struct2; threads=threads, blocks=blocks);

function myfun!(struct1, struct2)
  i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
  if i <= length(struct1.bigVector1)
    ...
  end
  return nothing
end

and ended up with 256 threads, 392 blocks. This took 40s (vs 60s on the CPU). Keeping the same number of threads, but setting the blocks to 1-16 took under 4s (without much variation from 1 to 16)! So I got a substantial speedup this way (no idea why). I’m still a bit lost regarding how I should set these numbers. I copied the current method from the examples in CUDA.jl.
Any help is greatly appreciated!

Shared memory should help regardless of the launch configuration, because it lets you replace the non-coalescable single-element reads of bigVector3 across threads with large coalesced reads. An example of that is here: https://github.com/JuliaGPU/CUDA.jl/blob/766b39fd8e825d368aa3d6bc63473f16d64e0a99/src/mapreduce.jl#L52-L82

You should probably also hoist the read from bigVector2 out of the inner loops, because otherwise the compiler needs to prove that it can’t change. Here that’s pretty obvious, but I assume your code is still simplified.
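
Concretely, using the names from your snippet (just a sketch of the hoist):

x = struct1.bigVector2[i]   # read once per thread, outside the inner loops
accumulator = 0.0
for j in 1:length(struct1.bigVector3)
  accumulator += x * struct1.bigVector3[j]
end
for j in 1:length(struct2.bigVector4)
  accumulator += x * struct2.bigVector4[j]
end
struct1.bigVector1[i] = accumulator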

Regarding the launch configuration: the block number it returns is not the number you need to launch, because that obviously depends on the number of elements you want to process; rather, it’s a hint for how many blocks you’d launch to fully occupy the device. You can use it to, e.g., decide the granularity of your outer parallelization.
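
For example, one common pattern is to take the occupancy API’s numbers as an upper bound and use a grid-stride loop in the kernel, so every element gets processed no matter how many blocks you end up launching (sketch, with an illustrative kernel name):

function myfun_strided!(struct1, struct2)
  start = (blockIdx().x - 1) * blockDim().x + threadIdx().x
  stride = gridDim().x * blockDim().x
  for i in start:stride:length(struct1.bigVector1)
    # ... same per-element work as before ...
  end
  return nothing
end

kernel = @cuda launch=false myfun_strided!(struct1, struct2)
config = launch_configuration(kernel.fun)
threads = min(length(struct1.bigVector1), config.threads)
blocks = min(config.blocks, cld(length(struct1.bigVector1), threads))
CUDA.@sync kernel(struct1, struct2; threads=threads, blocks=blocks)

That way the launch configuration only affects performance, not which elements get processed.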

Ok, I’m clearly not paying enough attention. When I changed the number of threads/blocks, my code ended up not running through the entire array. So that’s why it was way faster than using the occupancy API. And clearly I continue to have no idea what I’m doing =P.
@maleadt thanks for the tips. I’ll try those out, since I really would like to speed things up a lot.