Broadcasting in CUDA kernels

Hello,

In the following example code there are two kernels, kernel1 and kernel2, which do exactly the same thing, but kernel1 uses broadcasting and, as a result, causes an error. Is this expected behaviour? Is it possible to use broadcasting in CUDA kernels?

Thank you.

using CUDA

# Copy b into a using broadcasting; this fails to compile for the GPU.
function kernel1(a, b)
    @. a = b
    return nothing
end


# Copy b into a with an explicit elementwise loop; this compiles fine.
function kernel2(a, b)
    for i in 1:length(a)
        a[i] = b[i]
    end
    return nothing
end


a = CUDA.zeros(10)
b = CUDA.ones(10)

# @cuda threads=256 kernel1(a, b)
# ERROR: LoadError: InvalidIRError: compiling kernel kernel1(CuDeviceVector{Float32, 1}, CuDeviceVector{Float32, 1}) resulted in invalid LLVM IR
# Reason: unsupported dynamic function invocation (call to typename(a::DataType) in Base at essentials.jl:304)

@cuda threads=256 kernel2(a, b)

Broadcasting and kernels can be seen as opposites: broadcasting eases your life so that you do not need to write kernels yourself. Broadcasting therefore cannot be used inside a kernel; inside a kernel you are processing scalar values. Broadcasting launches generic kernels that Julia deems fitting for the operation you are trying to perform. This would be a broadcasting solution that replaces kernel1:

using CUDA
a = CUDA.zeros(10)
b = CUDA.ones(10)
a .= b

And your kernel2 would be the kernel-based approach to solving the same problem. As you can see, there is rarely a reason to write your own kernels in CUDA.jl. In my experience it only comes up when you are not satisfied with the performance of the broadcasting solution.
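
If you want to check whether a hand-written kernel actually buys you anything, a quick comparison along these lines would do (a sketch, assuming you have BenchmarkTools installed):

using CUDA, BenchmarkTools

a = CUDA.zeros(10)
b = CUDA.ones(10)

# CUDA.@sync makes sure we time the full GPU execution,
# not just the asynchronous kernel launch.
@btime CUDA.@sync $a .= $b
@btime CUDA.@sync @cuda threads=256 kernel2($a, $b)

For an elementwise copy like this, the generic broadcast kernel should be just as fast as the hand-written one.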

In my actual code I have a two-dimensional array, and I want to perform a set of operations along one dimension while parallelizing along the other. Something like this:

using CUDA

function kernel1(a, b)
    id = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    stride = blockDim().x * gridDim().x
    N1, N2 = size(a)
    for i in id:stride:N1
        @. a[i, :] = b  # broadcasting along the second dimension fails to compile
    end
    return nothing
end


function kernel2(a, b)
    id = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    stride = blockDim().x * gridDim().x
    N1, N2 = size(a)
    for i in id:stride:N1
        for j in 1:N2
            a[i, j] = b[j]  # the explicit inner loop works
        end
    end
    return nothing
end

N1 = 100
N2 = 10

a = CUDA.zeros((N1, N2))
b = CUDA.ones(N2)

# @cuda threads=N1 kernel1(a, b)  # fails to compile, like kernel1 above

@cuda threads=N1 kernel2(a, b)

For simple cases like this, you can just use plain ol’ broadcasting (b' makes b a 1×N2 row, so it broadcasts along the first dimension):

julia> a .= b'
100×10 CuArray{Float32, 2}:
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0
[...]

If your actual kernel function can be written without referring to indices, you can broadcast that, too:

julia> f(x) = 2asin(x)
f (generic function with 1 method)

julia> @. a = f(b')
100×10 CuArray{Float32, 2}:
 3.14159  3.14159  3.14159  3.14159  3.14159  3.14159  3.14159  3.14159  3.14159  3.14159
 3.14159  3.14159  3.14159  3.14159  3.14159  3.14159  3.14159  3.14159  3.14159  3.14159
[...]

Thank you, but, unfortunately, my case is not so simple. :frowning_face:

Broadcasting inside kernels generally isn’t GPU compatible; it might work with e.g. StaticArrays, but generally you don’t want to do so, because it means each thread will be iterating over a number of elements (whereas you typically want to use a parallel reduction and share the work across threads, if possible).
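
For completeness, here is a minimal sketch of what the StaticArrays route could look like, assuming the inner dimension is small and known at compile time (10 here, to match the example; the name kernel_sv is made up). Broadcasting over an SVector compiles to a fixed-size, stack-allocated unrolled loop, which is why it can work on-device:

using CUDA, StaticArrays

function kernel_sv(a, b)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= size(a, 1)
        # Gather the row into a stack-allocated SVector of fixed length 10.
        v = SVector{10}(ntuple(j -> b[j], Val(10)))
        w = 2 .* v  # broadcasting over an SVector works on-device
        for j in 1:10
            a[i, j] = w[j]
        end
    end
    return nothing
end

a = CUDA.zeros(100, 10)
b = CUDA.ones(10)
@cuda threads=100 kernel_sv(a, b)

Note that each thread still processes all 10 elements of its row sequentially, which is exactly the pattern warned about above.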

As I tried to show in my second example, broadcasts can be useful when one deals with multidimensional arrays with parallelization along a specific axis. Another case is when you apply an external function with internal broadcasts along one of the array’s axes.
Would it be hard to fix this incompatibility by writing a specific broadcast style for CuDeviceArrays? I thought that, in the end, the broadcast is replaced by a corresponding loop.

That should be possible, yes.
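
Until such a broadcast style exists, you can hand-write the loop that the broadcast would presumably lower to. A sketch (the helper device_broadcast! and kernel3 are hypothetical names, not CUDA.jl API):

using CUDA

# Hypothetical helper: a sequential, per-thread "broadcast" written as
# the explicit loop that a device-side broadcast style would lower to.
@inline function device_broadcast!(f, dest, src)
    for j in eachindex(dest, src)
        @inbounds dest[j] = f(src[j])
    end
    return dest
end

function kernel3(a, b)
    id = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    stride = blockDim().x * gridDim().x
    N1 = size(a, 1)
    for i in id:stride:N1
        # Equivalent to `@. a[i, :] = 2 * b`, but as a plain loop over a view.
        device_broadcast!(x -> 2 * x, view(a, i, :), b)
    end
    return nothing
end

a = CUDA.zeros(100, 10)
b = CUDA.ones(10)
@cuda threads=100 kernel3(a, b)

The view keeps the indexing bookkeeping out of the inner loop, so the kernel body reads almost like the broadcast version.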