Broadcasting in CUDA kernels

Hello,

In the following example code there are two kernels, kernel1 and kernel2, which do exactly the same thing, but kernel1 uses broadcasting and, as a result, causes an error. Is this expected behaviour? Is it possible to use broadcasting in CUDA kernels?

Thank you.

using CUDA

# Copy b into a using broadcasting; this fails to compile for the GPU.
function kernel1(a, b)
    @. a = b
    return nothing
end


# Copy b into a with an explicit elementwise loop; this compiles fine.
function kernel2(a, b)
    for i in 1:length(a)
        a[i] = b[i]
    end
    return nothing
end


a = CUDA.zeros(10)
b = CUDA.ones(10)

# @cuda threads=256 kernel1(a, b)
# ERROR: LoadError: InvalidIRError: compiling kernel kernel1(CuDeviceVector{Float32, 1}, CuDeviceVector{Float32, 1}) resulted in invalid LLVM IR
# Reason: unsupported dynamic function invocation (call to typename(a::DataType) in Base at essentials.jl:304)

@cuda threads=256 kernel2(a, b)

Broadcasting and kernels can be seen as opposites: broadcasting eases your life so that you do not need to write kernels yourself. Broadcasting therefore cannot be used inside a kernel; inside a kernel you are processing scalar values. Broadcasting launches generic kernels that Julia deems fitting for the operation you are trying to perform. This would be a broadcasting solution that replaces kernel1:

using CUDA
a = CUDA.zeros(10)
b = CUDA.ones(10)
a .= b

And your kernel2 would be the kernel-based approach to solving the same problem. As you can see, there is rarely a reason to write your own kernels in CUDA.jl. In my experience it only comes up when you are not satisfied with the performance of the broadcasting solution.
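
If you want to check whether a hand-written kernel actually buys you anything, a quick comparison along these lines would do (a sketch, assuming you have BenchmarkTools installed):

using CUDA, BenchmarkTools

a = CUDA.zeros(10)
b = CUDA.ones(10)

# CUDA.@sync makes sure we time the full GPU execution,
# not just the asynchronous kernel launch.
@btime CUDA.@sync $a .= $b
@btime CUDA.@sync @cuda threads=256 kernel2($a, $b)

For an elementwise copy like this, the generic broadcast kernel should be just as fast as the hand-written one.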

In my actual code I have a two-dimensional array, and I want to perform a set of operations along one dimension while parallelizing along the other. Something like this:

using CUDA

function kernel1(a, b)
    id = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    stride = blockDim().x * gridDim().x
    N1, N2 = size(a)
    for i in id:stride:N1
        @. a[i, :] = b  # broadcasting along the second dimension fails to compile
    end
    return nothing
end


function kernel2(a, b)
    id = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    stride = blockDim().x * gridDim().x
    N1, N2 = size(a)
    for i in id:stride:N1
        for j in 1:N2
            a[i, j] = b[j]  # the explicit inner loop works
        end
    end
    return nothing
end

N1 = 100
N2 = 10

a = CUDA.zeros((N1, N2))
b = CUDA.ones(N2)

# @cuda threads=N1 kernel1(a, b)  # fails to compile, like kernel1 above

@cuda threads=N1 kernel2(a, b)

For simple cases like this, you can just use plain ol’ broadcasting (b' makes b a 1×N2 row, so it broadcasts along the first dimension):

julia> a .= b'
100×10 CuArray{Float32, 2}:
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0
[...]

If your actual kernel function can be written without referring to indices, you can broadcast that, too:

julia> f(x) = 2asin(x)
f (generic function with 1 method)

julia> @. a = f(b')
100×10 CuArray{Float32, 2}:
 3.14159  3.14159  3.14159  3.14159  3.14159  3.14159  3.14159  3.14159  3.14159  3.14159
 3.14159  3.14159  3.14159  3.14159  3.14159  3.14159  3.14159  3.14159  3.14159  3.14159
[...]

Thank you, but, unfortunately, my case is not so simple. :frowning_face:

Broadcasting inside kernels generally isn’t GPU compatible; it might work with e.g. StaticArrays, but generally you don’t want to do so, because it means each thread will be iterating over a number of elements (whereas you typically want to use a parallel reduction and share the work across threads, if possible).
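
For completeness, here is a minimal sketch of what the StaticArrays route could look like, assuming the inner dimension is small and known at compile time (10 here, to match the example; the name kernel_sv is made up). Broadcasting over an SVector compiles to a fixed-size, stack-allocated unrolled loop, which is why it can work on-device:

using CUDA, StaticArrays

function kernel_sv(a, b)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= size(a, 1)
        # Gather the row into a stack-allocated SVector of fixed length 10.
        v = SVector{10}(ntuple(j -> b[j], Val(10)))
        w = 2 .* v  # broadcasting over an SVector works on-device
        for j in 1:10
            a[i, j] = w[j]
        end
    end
    return nothing
end

a = CUDA.zeros(100, 10)
b = CUDA.ones(10)
@cuda threads=100 kernel_sv(a, b)

Note that each thread still processes all 10 elements of its row sequentially, which is exactly the pattern warned about above.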

As I tried to show in my second example, broadcasts can be useful when one deals with multidimensional arrays with parallelization along a specific axis. Another case is when you apply an external function with internal broadcasts along one of the array’s axes.
Would it be hard to fix this incompatibility by writing a specific broadcast style for CuDeviceArrays? I thought that, in the end, the broadcast is replaced by a corresponding loop.

That should be possible, yes.
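
Until such a broadcast style exists, you can hand-write the loop that the broadcast would presumably lower to. A sketch (the helper device_broadcast! and kernel3 are hypothetical names, not CUDA.jl API):

using CUDA

# Hypothetical helper: a sequential, per-thread "broadcast" written as
# the explicit loop that a device-side broadcast style would lower to.
@inline function device_broadcast!(f, dest, src)
    for j in eachindex(dest, src)
        @inbounds dest[j] = f(src[j])
    end
    return dest
end

function kernel3(a, b)
    id = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    stride = blockDim().x * gridDim().x
    N1 = size(a, 1)
    for i in id:stride:N1
        # Equivalent to `@. a[i, :] = 2 * b`, but as a plain loop over a view.
        device_broadcast!(x -> 2 * x, view(a, i, :), b)
    end
    return nothing
end

a = CUDA.zeros(100, 10)
b = CUDA.ones(10)
@cuda threads=100 kernel3(a, b)

The view keeps the indexing bookkeeping out of the inner loop, so the kernel body reads almost like the broadcast version.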