I’m writing a CUDA kernel for DynamicExpressions here and was wondering what the best practices are for unit-testing it on CPU-only machines?
My current idea is to modify the GPU kernel so that I can manually specify the threads, like so:
function my_kernel(
    buffer, #= other data... =#
    # Override for unit testing:
    i=nothing,
)
    i = i === nothing ? (blockIdx().x - 1) * blockDim().x + threadIdx().x : i
    # do stuff with i
    return nothing  # kernels launched via @cuda must return nothing
end
and then have a branch of my code where I simply loop over the threads and blocks instead of using an @cuda launch, which would make it amenable to running on the CPU.
But I was just wondering what the best practices are here and whether there’s something obvious people use. For example, maybe there’s a @cuda cpuonly=true option that sets this up, or something like that?
For posterity, here’s how I set it up right now, using the kernel modification above (which is zero-cost):
if buffer isa CuArray
    @cuda(
        threads=num_threads, blocks=num_blocks,
        my_kernel(buffer, #= data... =# nothing)
    )
else
    Threads.@threads for i in 1:(num_threads * num_blocks)
        my_kernel(buffer, #= data... =# i)
    end
end
Have you considered using KernelAbstractions.jl instead? Then you’d just run the kernel as my_kernel(CPU(), N)(args...) or my_kernel(CUDABackend(), N)(args...), where N is the workgroup size.
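Something along these lines (an untested sketch; the kernel body here is just a placeholder for whatever you do with i):

using KernelAbstractions
using CUDA  # only needed for the GPU path

# Placeholder kernel; the body stands in for your real computation:
@kernel function my_ka_kernel!(buffer)
    i = @index(Global, Linear)
    buffer[i] = 2 * buffer[i]
end

# CPU execution (e.g. for test coverage), with a workgroup size of 64:
buffer = zeros(Float32, 1024)
my_ka_kernel!(CPU(), 64)(buffer; ndrange = length(buffer))
KernelAbstractions.synchronize(CPU())

# GPU execution of the very same kernel:
buffer_gpu = CUDA.zeros(Float32, 1024)
my_ka_kernel!(CUDABackend(), 64)(buffer_gpu; ndrange = length(buffer_gpu))
KernelAbstractions.synchronize(CUDABackend())

The ndrange keyword replaces the explicit threads/blocks arithmetic, and the same kernel definition compiles for both backends.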
Nope, didn’t even know about it - thanks for pointing it out! Any caveats to worry about here, or should I expect performance identical to using CUDA.jl directly?
Actually on second thought I’ll just stick with CUDA.jl for now. From reading the docs and trying an implementation, it seems like KernelAbstractions.jl is much less mature. I’ll try it again when they release v1.
I was scared off by the various “TODO” statements that are printed verbatim in the documentation, and by the fact that all the unit tests appear to be failing. But apart from those more superficial things, I did run into a few annoyances:
Couldn’t use @kernel on an anonymous function (a block function, like function (x); ...; end) even though this works with standard CUDA.jl. The error message for this was not helpful…
Couldn’t do @Const(a::Int); I needed to write it as @Const(a)::Int, which seems arbitrary. The error message for this was “I is not defined”, which led me on a wild-goose chase through my code, only to realize the macro was extracting the I from Int…
Also, the capital letters in that macro name seemed weird, especially since the other macros are lowercase.
Global, Linear, and the other options I’m supposed to pass to @index are not symbols exported from KA. They’re apparently just magic words that, if I type them into @index correctly, trigger some behavior? I found this very strange.
I think @index(:global) would be better, since it would be clear the argument is being used as a symbol and not as a type I’m supposed to import. (See the small sketch just below for the forms that do seem to be accepted.)
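For reference, the forms KA does seem to accept look roughly like this (a throwaway kernel for illustration, not my real code):

using KernelAbstractions

@kernel function scale!(y, @Const(x), alpha)
    # Global and Linear are bare identifiers that only @index understands:
    I = @index(Global, Linear)
    y[I] = alpha * x[I]
end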
Those were the main things I ran into. Given all of those sharp edges in the first hour, I didn’t want to get deeper into rewriting my code with it – SymbolicRegression.jl/PySR have way too many dependencies as it is, so CUDA.jl seems like the safer option for an initial taste of GPU support. But maybe I really should use KA here, I don’t know…
I agree that KA.jl isn’t at the same level of user-friendliness as CUDA.jl, but if you need CPU execution of kernels, it’s your best bet. We used to rely on PTX emulators to support CPU execution of CUDA.jl kernels, but those aren’t maintained anymore, so we removed that functionality.
That only works for simple element-wise kernels, though. As soon as you start introducing synchronization or local memory, both of which are essential to getting good GPU performance for nontrivial operations, that approach breaks down.
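For example, something like a block-level sum (a hypothetical sketch, not your kernel) depends on shared memory and sync_threads(), and a serial loop over i no longer models those semantics:

using CUDA

# Assumes blockDim().x == 256 (a power of two); `out` has one slot per block.
function block_sum_kernel!(out, x)
    tid = threadIdx().x
    shmem = CuStaticSharedArray(Float32, 256)
    shmem[tid] = x[(blockIdx().x - 1) * blockDim().x + tid]
    sync_threads()  # every thread in the block must reach this point together
    stride = blockDim().x ÷ 2
    while stride >= 1
        if tid <= stride
            shmem[tid] += shmem[tid + stride]
        end
        sync_threads()
        stride ÷= 2
    end
    if tid == 1
        out[blockIdx().x] = shmem[1]
    end
    return nothing
end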
Also, if you’re only writing kernels that do elementwise operations (or similar), and that are thus amenable to running on the CPU like that, it may be possible to rephrase the kernel using array operations, which are already portable to the CPU.
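For instance (the names here are just placeholders), a broadcast runs on whichever device owns the arrays:

using CUDA

f(x, y) = muladd(x, y, one(x))

a, b = rand(Float32, 1024), rand(Float32, 1024)
out = similar(a)
out .= f.(a, b)            # plain Arrays: runs on the CPU

a_d, b_d = CuArray(a), CuArray(b)
out_d = similar(a_d)
out_d .= f.(a_d, b_d)      # CuArrays: the same expression runs on the GPU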
Thanks. I should specify that I only want to run this kernel on the CPU for test coverage of it. Outside of tests, I would never execute that kernel on the CPU. (There is separate CPU-optimized code that is used for the regular CPU path.)
Sounds like I’m probably doing something dumb then… Here’s my current approach:
This outer loop is for going up the tree of operations:
Not sure there’s a smarter way to merge these kernels into a single CUDA launch here?