@maleadt OK, looking at the code I think I understand how to use it. I tested it on a small case (length(someArray) = 404):
kernel = @cuda launch=false funGPU!(someArray)
config = launch_configuration(kernel.fun) # for one test this gave blocks=36, threads=256
threads = Base.min(length(someArray), config.threads) # min(404, 256), so still 256
blocks = cld(length(someArray), threads) # cld(404, 256) = 2
@btime CUDA.@sync kernel(someArray; threads=256, blocks=2) # 27 ms
@btime CUDA.@sync @cuda threads=256 blocks=2 funGPU!(someArray) # 27 ms
# but when I experimented with other values:
@btime CUDA.@sync kernel(someArray; threads=16, blocks=32) # 8 ms
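For reference, here is a minimal self-contained version of the pattern above. funGPU! here is a hypothetical stand-in for my real kernel (the actual arithmetic differs); the bounds check matters because threads * blocks (e.g. 256 * 2 = 512) can exceed length(someArray) (404):

using CUDA

# Hypothetical elementwise kernel standing in for the real funGPU!.
# The bounds check is needed because threads * blocks may exceed length(a).
function funGPU!(a)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(a)
        @inbounds a[i] = 2 * a[i]
    end
    return nothing
end

someArray = CUDA.rand(Float32, 404)

kernel  = @cuda launch=false funGPU!(someArray)
config  = launch_configuration(kernel.fun)
threads = min(length(someArray), config.threads)
blocks  = cld(length(someArray), threads)

CUDA.@sync kernel(someArray; threads, blocks)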
I’m probably still doing something wrong, since the configuration I thought I was supposed to use is clearly suboptimal here. Thoughts?
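For what it's worth, this is the (hypothetical) sweep I used to compare configurations, assuming the kernel and array from the snippet above; maybe it helps reproduce the numbers:

using CUDA, BenchmarkTools

# Hypothetical sweep over thread counts, assuming `kernel` and `someArray`
# are already defined as above; prints the minimum time per configuration.
for t in (16, 32, 64, 128, 256)
    b = cld(length(someArray), t)
    time_s = @belapsed CUDA.@sync $kernel($someArray; threads=$t, blocks=$b)
    println("threads=$t, blocks=$b: $(round(1e3 * time_s; digits=2)) ms")
end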
@simeonschaub That’s quite interesting. I like that it forces trig functions to use their CUDA. variants. From trying to read the source code, though, I think it uses the same math as above to compute threads and blocks, right? So the values would still be suboptimal for this case?
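To check my understanding of the rewrite: on a hypothetical kernel it would amount to something like the sketch below, swapping Base's sin for the CUDA. device variant (the rewrite itself is just my reading of the source, so I may be off):

# My (possibly wrong) reading of the trig rewrite, on a hypothetical kernel:
function trigGPU!(a)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(a)
        @inbounds a[i] = CUDA.sin(a[i])  # instead of sin(a[i])
    end
    return nothing
end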