Tips on writing kernels?

Thanks for the hints @maleadt!
I may have oversimplified (and at the same time muddled) the problem in my first post. I actually have a struct with a large array that is computed from other very large arrays, some within the same struct and some in another one. So there are no race conditions in my real problem (my bad).
So it’s more like this (some representative numbers included):

for i in 1:length(struct1.bigVector1)  # parallelized, 100k points
  accumulator = 0.0
  for j in 1:length(struct1.bigVector2)  # 95k points
    accumulator += struct1.bigVector2[i] * struct1.bigVector3[j]  # actual math is more complicated
  end
  for j in 1:length(struct2.bigVector4)  # 15k points
    accumulator += struct1.bigVector2[i] * struct2.bigVector4[j]
  end
  struct1.bigVector1[i] = accumulator
end

I am indeed writing to each element of the big output vector only once.
Knowing that there are in fact 4 or 5 arrays per struct with 15k-100k elements (the elements are often SArrays of length 3), all Float64 (I know…), do you still think the shared-memory allocation is worth doing? If so, can you point me to some examples?
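
In case it helps frame the question, here is my rough understanding of what the shared-memory version would look like for the first inner loop. This is only a sketch I have not run: it assumes 256 threads per block and plain Float64 vectors, and the names out, v2, v3 are placeholders for my actual fields.

using CUDA

# Sketch: stage tiles of v3 through shared memory so all 256 threads in a
# block reuse the same loads instead of each re-reading v3 from global memory.
# out, v2, v3 are placeholders; 256 must match threads-per-block at launch.
function tiled!(out, v2, v3)
  tid = threadIdx().x
  i = (blockIdx().x - 1) * blockDim().x + tid
  tile = CuStaticSharedArray(Float64, 256)
  acc = 0.0
  x = i <= length(out) ? v2[i] : 0.0  # out-of-range threads still help load
  for base in 0:256:length(v3)-1
    j = base + tid
    tile[tid] = j <= length(v3) ? v3[j] : 0.0  # cooperative load of one tile
    sync_threads()
    for k in 1:min(256, length(v3) - base)
      acc += x * tile[k]  # accumulate against the staged tile
    end
    sync_threads()  # everyone done reading before the next load overwrites it
  end
  i <= length(out) && (out[i] = acc)
  return
end

Whether something like that actually beats just letting the cache handle these read-only broadcasts is exactly what I'm unsure about.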

Going back to this issue (@cuda threads and blocks confusion), I tried experimenting with the threads/blocks settings. Initially I used this:

kernel = @cuda launch=false myfun!(struct1, struct2)  # not sure this works, or whether it needs a plain array argument so it can inspect its length
config = launch_configuration(kernel.fun)
threads = min(length(struct1.bigVector1), config.threads)
blocks = cld(length(struct1.bigVector1), threads)
CUDA.@sync kernel(struct1, struct2; threads=threads, blocks=blocks)
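
On the struct question in that first comment: from what I read in the CUDA.jl documentation, kernel arguments either need to be isbits or need an Adapt.jl rule so their CuArray fields get converted to device arrays. Something like this sketch, where the struct and field names are just placeholders for mine:

using CUDA, Adapt

# Sketch: a parametric struct plus an Adapt rule, so that passing it to a
# kernel converts its CuArray fields into device-side arrays automatically.
struct MyData{V1,V2}
  bigVector1::V1
  bigVector2::V2
end
Adapt.@adapt_structure MyData

If that's right, length(struct1.bigVector1) should work inside the kernel, since the fields arrive as device arrays.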

function myfun!(struct1, struct2)
  i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
  if i <= length(struct1.bigVector1)
    # ... per-element accumulation as in the loop above ...
  end
  return
end

and ended up with 256 threads and 392 blocks. This took 40 s (vs. 60 s on the CPU). Keeping the same number of threads but setting blocks to anything between 1 and 16 took under 4 s (with little variation across that range)! So I got a substantial speedup this way, though I have no idea why. (With the bounds-checked kernel above, fewer blocks should mean fewer elements get processed at all, so I worry the fast runs are not actually doing the same work.) I'm still a bit lost regarding how I should set these numbers; I copied the current method from the examples in CUDA.jl.
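
In case those fast low-block runs were simply skipping elements, the next thing I plan to try is a grid-stride loop, which (as I understand it) keeps every element covered no matter how many blocks are launched. Untested sketch:

# Sketch: grid-stride variant of myfun!. Each thread walks through the
# vector in steps of the total thread count, so any blocks value gives a
# correct result and blocks becomes a performance knob, not a requirement.
function myfun_stride!(struct1, struct2)
  start = (blockIdx().x - 1) * blockDim().x + threadIdx().x
  stride = gridDim().x * blockDim().x
  for i in start:stride:length(struct1.bigVector1)
    # ... same per-element accumulation as above ...
  end
  return
end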
Any help is greatly appreciated!