Atomic operations stop working when upgrading to CUDA.jl version 3.3

I am developing a library that uses CUDA.jl. Because of the new capabilities of CUDA.jl 3.3, I decided to upgrade to Julia 1.6, and now all the tests I run fail in kernels that depend on atomic operations.

I tried to reduce it to the most basic example, and it seems to be a problem with the pointer invocation.

Example:

using CUDA

function kernel(x)
    for i in 1:length(x)
        CUDA.atomic_add!(pointer(x, 1), 1)
    end
    return
end

x = CUDA.zeros(4)
@cuda kernel(x)

and I obtain an error like:

julia> @cuda kernel(x)
ERROR: InvalidIRError: compiling kernel kernel(CuDeviceVector{Float32, 1}) resulted in invalid LLVM IR
Reason: unsupported dynamic function invocation (call to atomic_add!)
Stacktrace:
 [1] atomic_arrayset
   @ ~/.julia/packages/CUDA/mVgLI/src/device/intrinsics/atomics.jl:498
 [2] atomic_arrayset
   @ ~/.julia/packages/CUDA/mVgLI/src/device/intrinsics/atomics.jl:480
 [3] macro expansion
   @ ~/.julia/packages/CUDA/mVgLI/src/device/intrinsics/atomics.jl:475
 [4] kernel
   @ REPL[6]:3
Stacktrace:
  [1] check_ir(job::GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{typeof(kernel), Tuple{CuDeviceVector{Float32, 1}}}}, args::LLVM.Module)
    @ GPUCompiler ~/.julia/packages/GPUCompiler/2WWTr/src/validation.jl:111
  [2] macro expansion
    @ ~/.julia/packages/GPUCompiler/2WWTr/src/driver.jl:319 [inlined]
  [3] macro expansion
    @ ~/.julia/packages/TimerOutputs/PZq45/src/TimerOutput.jl:226 [inlined]
  [4] macro expansion
    @ ~/.julia/packages/GPUCompiler/2WWTr/src/driver.jl:317 [inlined]
  [5] emit_asm(job::GPUCompiler.CompilerJob, ir::LLVM.Module; strip::Bool, validate::Bool, format::LLVM.API.LLVMCodeGenFileType)
    @ GPUCompiler ~/.julia/packages/GPUCompiler/2WWTr/src/utils.jl:62
  [6] cufunction_compile(job::GPUCompiler.CompilerJob)
    @ CUDA ~/.julia/packages/CUDA/mVgLI/src/compiler/execution.jl:313
  [7] cached_compilation(cache::Dict{UInt64, Any}, job::GPUCompiler.CompilerJob, compiler::typeof(CUDA.cufunction_compile), linker::typeof(CUDA.cufunction_link))
    @ GPUCompiler ~/.julia/packages/GPUCompiler/2WWTr/src/cache.jl:89
  [8] cufunction(f::typeof(kernel), tt::Type{Tuple{CuDeviceVector{Float32, 1}}}; name::Nothing, kwargs::Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ CUDA ~/.julia/packages/CUDA/mVgLI/src/compiler/execution.jl:288
  [9] cufunction(f::typeof(kernel), tt::Type{Tuple{CuDeviceVector{Float32, 1}}})
    @ CUDA ~/.julia/packages/CUDA/mVgLI/src/compiler/execution.jl:282
 [10] top-level scope
    @ ~/.julia/packages/CUDA/mVgLI/src/compiler/execution.jl:102
 [11] top-level scope
    @ ~/.julia/packages/CUDA/mVgLI/src/initialization.jl:52

My device and packages are the following:

┌ Info: System information:
│ CUDA toolkit 11.3.1, artifact installation
│ CUDA driver 11.3.0
│ NVIDIA driver 465.31.0
│ 
│ Libraries: 
│ - CUBLAS: 11.5.1
│ - CURAND: 10.2.4
│ - CUFFT: 10.4.2
│ - CUSOLVER: 11.1.2
│ - CUSPARSE: 11.6.0
│ - CUPTI: 14.0.0
│ - NVML: 11.0.0+465.31
│ - CUDNN: 8.20.0 (for CUDA 11.3.0)
│ - CUTENSOR: 1.3.0 (for CUDA 11.2.0)
│ 
│ Toolchain:
│ - Julia: 1.6.1
│ - LLVM: 12.0.0
│ - PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4, 6.5, 7.0
│ - Device capability support: sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75, sm_80
│ 
│ 1 device:
└   0: NVIDIA GeForce RTX 2070 SUPER (sm_75, 7.176 GiB / 7.792 GiB available)

Could anyone give me a clue?

Thanks,

Gabriel

It seems that the following modification solves the problem:

using CUDA

function kernel(x)
    for i in 1:length(x)
        # the value now matches the Float32 element type of x
        CUDA.atomic_add!(CUDA.pointer(x, 1), Float32(1))
    end
    return
end

x = CUDA.zeros(4)
@cuda kernel(x)
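
As a side note, here is a variant that avoids hard-coding Float32 (just a sketch; it assumes one(eltype(x)) yields a constant of the element type inside the kernel, and kernel_generic is only an illustrative name):

using CUDA

function kernel_generic(x)
    for i in 1:length(x)
        # one(eltype(x)) is a value of the array's element type (here Float32),
        # so it matches the type behind pointer(x, 1)
        CUDA.atomic_add!(pointer(x, 1), one(eltype(x)))
    end
    return
end

x = CUDA.zeros(4)
@cuda kernel_generic(x)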

But now I am uncertain about the reproducibility of my results, as this was a rather unexpected problem. Is it possible that in previous versions CUDA.jl took care of promoting the value to the corresponding type? If someone could give me an idea, I would be very grateful.

I don’t remember changing anything like that, so it was probably an unintended side effect of another change. Which version are you upgrading from?

It does seem like the atomic_add! function from Base also requires values to have the same type, so maybe we shouldn’t change it back. But an argument can be made for the @atomic macro to do automatic conversion; why aren’t you using that invocation anyway?
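
For reference, a minimal sketch of the macro invocation I mean (illustrative only, with a Float32 literal so no conversion is needed; kernel_macro is just a placeholder name):

using CUDA

function kernel_macro(x)
    for i in 1:length(x)
        # @atomic rewrites the indexed += into a call to the matching atomic
        # operation; 1f0 is already a Float32, so the types line up
        CUDA.@atomic x[1] += 1f0
    end
    return
end

x = CUDA.zeros(4)
@cuda kernel_macro(x)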

Hi @maleadt,

I was upgrading from Julia 1.5 and CUDA.jl 2.4.

The @atomic macro was giving me the same problem in the kernel above, which is why I switched to the function form.

This code gives errors:

x = CUDA.zeros(3)
function f(x)
    for i in 1:length(x)
        @atomic x[1] += 1.
    end
    return nothing
end
@cuda f(x)
x

This code works:

x = CUDA.zeros(3)
function f(x)
    for i in 1:length(x)
        @atomic x[1] += Float32(1.)
    end
    return nothing
end
@cuda f(x)
x

The @atomic macro seems to construct the appropriate pointer, but the value is still not promoted to the array's element type inside the kernel, so it has to be converted explicitly.
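
For completeness, a type-generic way to do that explicit promotion in the macro form (a sketch; it assumes convert and eltype work inside the kernel as they do on the host, and f_generic is just an illustrative name):

using CUDA

x = CUDA.zeros(3)
function f_generic(x)
    for i in 1:length(x)
        # convert the Float64 literal to the element type of x (Float32 here)
        # before handing it to @atomic
        @atomic x[1] += convert(eltype(x), 1.0)
    end
    return nothing
end
@cuda f_generic(x)
x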

I know, that’s what I meant by “an argument can be made for”. I’ve implemented that suggestion in pull request #990 (“Perform type conversions in at-atomic”) on JuliaGPU/CUDA.jl.