Unit-testing best practices for CUDA.jl?

I’m writing a CUDA kernel for DynamicExpressions and was wondering what the best practices are for unit-testing it on CPU-only machines.

My current idea is to modify the GPU kernel so that I can manually specify the threads, like so:

function my_kernel(buffer, #= data, =# i)
    # Override for unit testing: pass an explicit `i` to bypass the GPU thread index
    i = i === nothing ? (blockIdx().x - 1) * blockDim().x + threadIdx().x : i
    # do stuff with i
end

and then have a branch of my code where I simply loop through the threads and blocks, rather than the @cuda launches, which would therefore be amenable to running on the CPU.

I was just wondering what the best practices are here and whether there’s something obvious people use. For example, maybe there’s a @cuda cpuonly=true option that sets this up or something?

Tagging @maleadt :slight_smile:

For posterity, here’s how I set it up right now, using the kernel modification above (zero cost)

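if buffer isa CuArray
    # Real GPU launch: the kernel computes its own thread index
    @cuda(threads=num_threads, blocks=num_blocks,
        my_kernel(buffer, #= data, =# nothing))
else
    # CPU test path: loop over every (block, thread) index explicitly
    Threads.@threads for i in 1:(num_threads * num_blocks)
        my_kernel(buffer, #= data, =# i)
    end
end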

I then define a fake CuArray for testing:

struct FakeCuArray{T,N,A<:AbstractArray{T,N}} <: AbstractArray{T,N}
    a::A  # wrapped CPU array
end
Base.similar(x::FakeCuArray, dims::Integer...) = FakeCuArray(similar(x.a, dims...))
Base.getindex(x::FakeCuArray, i::Int...) = getindex(x.a, i...)
Base.setindex!(x::FakeCuArray, v, i::Int...) = setindex!(x.a, v, i...)
Base.size(x::FakeCuArray) = size(x.a)
Base.Array(x::FakeCuArray) = Array(x.a)

to_device(a, ::CuArray) = CuArray(a)
to_device(a, ::FakeCuArray) = FakeCuArray(a)

In the data loading part, I replace my CuArray calls with to_device, and define all my methods on Union{CuArray{T,2},FakeCuArray{T,N}}.

Seems to work fine and there’s parallelism in the Threads.@threads call for testing against data races.

But, I’m still not sure if this is the best way to set up CPU-only testing, so interested in advice.
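For anyone who wants to try the pattern above without a GPU, here is a minimal sketch that uses only Base Julia. The names `gpu_thread_index`, `scale_kernel!`, and `run_on_cpu!` are hypothetical stand-ins and not part of CUDA.jl; in the real kernel the `nothing` branch would compute the index from `blockIdx()`, `blockDim()`, and `threadIdx()` instead.

```julia
# Hypothetical stand-in for the CUDA index intrinsics, so this runs without CUDA.jl.
gpu_thread_index() = error("thread index is only available on the GPU")

# Kernel with an optional index override, mirroring the approach described above.
function scale_kernel!(out, x, factor, i::Union{Nothing,Int}=nothing)
    i = i === nothing ? gpu_thread_index() : i
    i > length(out) && return nothing  # guard for indices past the end of the array
    @inbounds out[i] = factor * x[i]
    return nothing
end

# CPU test path: iterate over every index a (blocks x threads) launch would cover.
function run_on_cpu!(out, x, factor; num_threads=4, num_blocks=cld(length(out), num_threads))
    Threads.@threads for i in 1:(num_threads * num_blocks)
        scale_kernel!(out, x, factor, i)
    end
    return out
end

x = collect(1.0:10.0)
out = zeros(10)
run_on_cpu!(out, x, 3.0; num_threads=4)
```

A test can then compare `out` against a plain CPU reference such as `3.0 .* x`. Starting Julia with several threads makes the loop actually run in parallel.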


Have you considered using KernelAbstractions.jl instead? Then you’d just run the kernel as my_kernel(CPU(), N)(args...) or my_kernel(CUDABackend(), N)(args...)
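As a rough illustration of that suggestion, the sketch below assumes KernelAbstractions.jl 0.9 is installed; `double_kernel!` is a hypothetical example kernel, not something from this thread's code. The same kernel definition is instantiated for a backend and workgroup size, then launched over an `ndrange`.

```julia
using KernelAbstractions

# Hypothetical elementwise kernel: out[i] = 2 * x[i]
@kernel function double_kernel!(out, @Const(x))
    i = @index(Global, Linear)
    @inbounds out[i] = 2 * x[i]
end

x = collect(1.0f0:8.0f0)
out = similar(x)

backend = CPU()  # with CUDA.jl loaded and CuArray inputs, this would be CUDABackend()
kernel! = double_kernel!(backend, 4)   # workgroup size of 4
kernel!(out, x; ndrange=length(x))     # launch over all elements
KernelAbstractions.synchronize(backend)
```

In a test suite, the CPU result can be checked against `2 .* x`, and the same test can be run with the GPU backend on machines that have one.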


Nope, didn’t even know about it - thanks for pointing it out! Any caveats to worry about here or should I expect identical performance to CUDA.jl directly?

Actually on second thought I’ll just stick with CUDA.jl for now. From reading the docs and trying an implementation, it seems like KernelAbstractions.jl is much less mature. I’ll try it again when they release v1.

What issue did you run into? I consider KA as fairly stable and mature and it is used by a number of folks.

I tend to not be a fan of arbitrary v1 and 0.9 has been a robust KA release and I don’t yet see a reason to go to 0.10 xD

I was scared off by various “TODO” statements which are printed verbatim in the documentation and the fact that all the unit tests seemed to be failing :sweat_smile: But apart from the more superficial things I did run into a few annoyances:

  1. Couldn’t use @kernel on an anonymous function (a block function, like function (x); ...; end) even though this works with standard CUDA.jl. The error message for this was not helpful…
  2. Couldn’t do @Const(a::Int), needed to write it as @Const(a)::Int which seems arbitrary. The error message for this was I is not defined which led me on a goose chase through my code only to realize the macro was extracting the I from Int
    • Also the capital letters for a macro seemed weird especially since the other macros are lower case
  3. Global, Linear, and other options which I am supposed to pass to @index are not exported symbols from KA. These are apparently just magic words that if I type correctly into the @index I guess it triggers a behavior? I found this very strange.
    • I think @index(:global) would be better since it’s clear it is being used as a symbol, and they are not types I am supposed to import.

Those were the main things I ran into. Given all of those sharp edges in the first hour I didn’t want to get deeper into rewriting my code with it – SymbolicRegression.jl/PySR have way too many dependencies as it is so CUDA seems like the safer option for an initial taste of GPU support. But maybe I should really use KA here, I don’t know…


I agree that KA.jl isn’t at the same level of user-friendliness of CUDA.jl, but if you need CPU execution of kernels, it’s your best bet. We used to rely on PTX emulators to support CPU execution with CUDA.jl kernels, but those aren’t maintained anymore, so we removed that functionality.

That only works for simple element-wise kernels though. As soon as you start introducing synchronization or local memory, both of which are essential to getting good GPU performance for nontrivial operations, that approach breaks down.

Also, if you’re only writing kernels that do elementwise operations (or similar), and are thus amenable to running on the CPU like that, it may be possible to rephrase the kernel using array operations which are already portable to the CPU.
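To make the array-operations point concrete, here is a small sketch, assuming the kernel body is a purely elementwise expression. `fused_op!` and its formula are hypothetical. Because it is written as a fused broadcast over `AbstractArray`s, the same function should run on a plain `Array` in tests and on a `CuArray` on the GPU.

```julia
# Hypothetical elementwise operation expressed as one fused broadcast.
# It works for any AbstractArray, so the CPU test path needs no custom kernel.
function fused_op!(out::AbstractArray, x::AbstractArray, y::AbstractArray)
    out .= cos.(x) .* y .+ one(eltype(out))
    return out
end

x = Float32[0, 1, 2, 3]
y = Float32[1, 2, 3, 4]
out = similar(x)
fused_op!(out, x, y)
```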

Thanks. I should specify that I only want to run this kernel on the CPU for testing coverage of it. But outside of tests, I would never execute that kernel on the CPU. (There is separate CPU-optimized code which is used for regular CPU stuff)

Sounds like I’m probably doing something dumb then… Here’s my current approach:

This outer loop is for going up the tree of operations:


Not sure there’s a smarter way to merge these kernels into a single CUDA launch here?

Actual kernel:

Thank you for taking the time to write up your experiences; high-quality feedback like that is very much appreciated.

I will take a look at 1 & 2 since those are footguns that need not be there.

3 is a personal preference thing. In the context of a macro Global is just a symbol. So :Global is a double quotation.

I was scared off by various “TODO” statements which are printed verbatim in the documentation

Yeah the docs really need a cleanup and filling in some of the blanks, now that the semantics have stayed stable for so long.

and the fact that all the unit tests seemed to be failing

Nightly is messing with one performance test, but yes agreed a user shouldn’t need to dig into what is failing.

Hi Valentin,
A quick question: is the Metal backend for KernelAbstractions.jl usable?

Let’s not take this thread too far off-topic, but the short answer is yes, it should be!
If not, I recommend filing an issue with Metal.jl
