I’m writing a CUDA kernel for DynamicExpressions here and was wondering what the best practices are for unit-testing it on CPU-only machines?
My current idea is to modify the GPU kernel so that I can manually specify the threads, like so:
function my_kernel(
    buffer, #= other data... =#
    # Override for unit testing:
    i=nothing,
)
    i = i === nothing ? (blockIdx().x - 1) * blockDim().x + threadIdx().x : i
    # do stuff with i
    return nothing  # kernels launched via @cuda must return nothing
end
and then have a branch of my code where I simply loop over the threads and blocks instead of using an @cuda launch, which would make it amenable to running on the CPU.
But I was just wondering what the best practices are here and whether there’s something obvious people use. For example, maybe there’s a @cuda cpuonly=true option that sets this up, or something like that?
For posterity, here’s how I set it up right now, using the kernel modification above (which is zero-cost):
if buffer isa CuArray
    @cuda(
        threads=num_threads, blocks=num_blocks,
        my_kernel(buffer, #= data... =# nothing)
    )
else
    Threads.@threads for i in 1:(num_threads * num_blocks)
        my_kernel(buffer, #= data... =# i)
    end
end
Have you considered using KernelAbstractions.jl instead? Then you’d just run the kernel as my_kernel(CPU(), N)(args...) or my_kernel(CUDABackend(), N)(args...), where N is the workgroup size.
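Something along these lines (an untested sketch; the kernel body here is just a placeholder for whatever you do with i):

using KernelAbstractions
using CUDA  # only needed for the GPU path

# Placeholder kernel; the body stands in for your real computation:
@kernel function my_ka_kernel!(buffer)
    i = @index(Global, Linear)
    buffer[i] = 2 * buffer[i]
end

# CPU execution (e.g. for test coverage), with a workgroup size of 64:
buffer = zeros(Float32, 1024)
my_ka_kernel!(CPU(), 64)(buffer; ndrange = length(buffer))
KernelAbstractions.synchronize(CPU())

# GPU execution of the very same kernel:
buffer_gpu = CUDA.zeros(Float32, 1024)
my_ka_kernel!(CUDABackend(), 64)(buffer_gpu; ndrange = length(buffer_gpu))
KernelAbstractions.synchronize(CUDABackend())

The ndrange keyword replaces the explicit threads/blocks arithmetic, and the same kernel definition compiles for both backends.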
Nope, didn’t even know about it - thanks for pointing it out! Any caveats to worry about here, or should I expect performance identical to using CUDA.jl directly?
Actually on second thought I’ll just stick with CUDA.jl for now. From reading the docs and trying an implementation, it seems like KernelAbstractions.jl is much less mature. I’ll try it again when they release v1.
I was scared off by the various “TODO” statements that are printed verbatim in the documentation, and by the fact that all the unit tests appear to be failing. But apart from those more superficial things, I did run into a few annoyances:
Couldn’t use @kernel on an anonymous function (a block function, like function (x); ...; end) even though this works with standard CUDA.jl. The error message for this was not helpful…
Couldn’t do @Const(a::Int); I needed to write it as @Const(a)::Int, which seems arbitrary. The error message for this was “I is not defined”, which led me on a wild-goose chase through my code, only to realize the macro was extracting the I from Int…
Also, the capital letters in that macro name seemed weird, especially since the other macros are lowercase.
Global, Linear, and the other options I’m supposed to pass to @index are not symbols exported from KA. They’re apparently just magic words that, if I type them into @index correctly, trigger some behavior? I found this very strange.
I think @index(:global) would be better, since it would be clear the argument is being used as a symbol and not as a type I’m supposed to import. (See the small sketch just below for the forms that do seem to be accepted.)
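For reference, the forms KA does seem to accept look roughly like this (a throwaway kernel for illustration, not my real code):

using KernelAbstractions

@kernel function scale!(y, @Const(x), alpha)
    # Global and Linear are bare identifiers that only @index understands:
    I = @index(Global, Linear)
    y[I] = alpha * x[I]
end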
Those were the main things I ran into. Given all of those sharp edges in the first hour, I didn’t want to get deeper into rewriting my code with it – SymbolicRegression.jl/PySR have way too many dependencies as it is, so CUDA.jl seems like the safer option for an initial taste of GPU support. But maybe I really should use KA here, I don’t know…
I agree that KA.jl isn’t at the same level of user-friendliness as CUDA.jl, but if you need CPU execution of kernels, it’s your best bet. We used to rely on PTX emulators to support CPU execution of CUDA.jl kernels, but those aren’t maintained anymore, so we removed that functionality.
That only works for simple element-wise kernels, though. As soon as you start introducing synchronization or local memory, both of which are essential to getting good GPU performance for nontrivial operations, that approach breaks down.
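For example, something like a block-level sum (a hypothetical sketch, not your kernel) depends on shared memory and sync_threads(), and a serial loop over i no longer models those semantics:

using CUDA

# Assumes blockDim().x == 256 (a power of two); `out` has one slot per block.
function block_sum_kernel!(out, x)
    tid = threadIdx().x
    shmem = CuStaticSharedArray(Float32, 256)
    shmem[tid] = x[(blockIdx().x - 1) * blockDim().x + tid]
    sync_threads()  # every thread in the block must reach this point together
    stride = blockDim().x ÷ 2
    while stride >= 1
        if tid <= stride
            shmem[tid] += shmem[tid + stride]
        end
        sync_threads()
        stride ÷= 2
    end
    if tid == 1
        out[blockIdx().x] = shmem[1]
    end
    return nothing
end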
Also, if you’re only writing kernels that do elementwise operations (or similar), and that are thus amenable to running on the CPU like that, it may be possible to rephrase the kernel using array operations, which are already portable to the CPU.
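For instance (the names here are just placeholders), a broadcast runs on whichever device owns the arrays:

using CUDA

f(x, y) = muladd(x, y, one(x))

a, b = rand(Float32, 1024), rand(Float32, 1024)
out = similar(a)
out .= f.(a, b)            # plain Arrays: runs on the CPU

a_d, b_d = CuArray(a), CuArray(b)
out_d = similar(a_d)
out_d .= f.(a_d, b_d)      # CuArrays: the same expression runs on the GPU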
Thanks. I should specify that I only want to run this kernel on the CPU for test coverage of it. Outside of tests, I would never execute that kernel on the CPU. (There is separate CPU-optimized code that is used for the regular CPU path.)
Sounds like I’m probably doing something dumb then… Here’s my current approach:
This outer loop is for going up the tree of operations:
Not sure there’s a smarter way to merge these kernels into a single CUDA launch here?