Simple CUDA kernel on matrix slower than running on CPU

I am trying to write a simple CUDA kernel that swaps pairs of columns in a square matrix and puts a sign on one of them:

using CUDA, BenchmarkTools

function swap_and_sign!(A::Matrix)
    n = size(A, 1) ÷ 2
    for i in 1:2n
        for j in 2:2:2n
            A[i,j], A[i,j-1] = A[i,j-1], -A[i,j]
        end
    end
    nothing
end

function swap_and_sign!(A::CuArray)
    n = size(A, 1) ÷ 2
    threads = n
    blocks = 2n
    @inline function kernel(data)
        i, j = blockIdx().x, 2 * threadIdx().x
        data[i,j], data[i,j-1] = data[i,j-1], -data[i,j]
        nothing
    end
    @cuda threads=threads blocks=blocks kernel(A)
    nothing
end

arr = rand(1000, 1000)
cuarr = arr |> cu


@btime CUDA.@sync swap_and_sign!(cuarr)
#  2.469 ms (19 allocations: 576 bytes)

@btime swap_and_sign!(arr)
#  1.563 ms (0 allocations: 0 bytes)

As you can see, even for a 1000×1000 matrix this is slower than running the same code on the CPU. Is there something more clever I could do with the indexing and the number of threads and blocks to speed things up, or is this performance expected?

For reference, the GPU is a GeForce MX250 (so not the beefiest) and the CPU an Intel i7-10850H @ 2.7 GHz.

If I remember correctly, there might be special things to do to properly benchmark GPU code besides just CUDA.@sync. But a quick fix you could try is switching your array to Float32; that’s usually the type GPUs are optimized for.
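For example, a minimal sketch of forcing the element type explicitly (the variable names are just for illustration):

using CUDA

arr = rand(1000, 1000)            # Float64 on the CPU
cuarr32 = CuArray{Float32}(arr)   # explicit Float32 upload
cuarr64 = CuArray(arr)            # keeps Float64, usually much slower on consumer GPUs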

There are many optimizations you could apply, from making sure there are no exceptions, to optimizing the iteration order to maximize coalescing, to tuning the launch configuration, etc. See Performance Tips · CUDA.jl for a couple of Julia-specific hints, but ultimately this is a CUDA optimization problem.
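As a rough illustration of the coalescing point (just a sketch, not benchmarked; the function name and the block size of 256 are arbitrary choices): give each thread a row index, so that consecutive threads in a block touch consecutive rows of the two columns, which keeps the column-major accesses coalesced, and guard the index instead of relying on an exact launch:

using CUDA

function swap_and_sign_coalesced!(A::CuArray)
    n = size(A, 1) ÷ 2
    function kernel(data)
        i = threadIdx().x + (blockIdx().x - 1) * blockDim().x  # row index, the fast axis
        j = 2 * blockIdx().y                                   # even column index
        if i <= 2n
            @inbounds data[i, j], data[i, j-1] = data[i, j-1], -data[i, j]
        end
        nothing
    end
    threads = 256
    @cuda threads=threads blocks=(cld(2n, threads), n) kernel(A)
    nothing
end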

That said, your GPU just seems extraordinarily underpowered here. On my workstation with high-end CPU and GPU hardware, I get 500 µs for the CPU version, and 36 µs for the GPU version.

What specifically are you thinking about? Just doing CUDA.@sync should be sufficient. There’s also the integrated profiler, of course (CUDA.@profile and CUDA.@bprofile), which shows a lot of additional information that could help optimize this.
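For example, something along these lines:

CUDA.@profile swap_and_sign!(cuarr)    # one-shot trace: kernel time vs. launch/host overhead
CUDA.@bprofile swap_and_sign!(cuarr)   # runs the expression repeatedly, similar in spirit to @btime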


Not sure; I have a distant memory of someone telling me BenchmarkTools was inappropriate, but I can’t find the source. It’s probably nothing.

I’d suggest

function swap_and_sign_transpose!(A::Matrix)
    n = size(A, 1) ÷ 2
    @inbounds for j in 2:2:2n
        for i in 1:2n
            A[i,j], A[i,j-1] = A[i,j-1], -A[i,j]
        end
    end
    nothing
end

if you want to improve the CPU’s performance (the loops are reordered so the inner loop walks down columns, which matches Julia’s column-major layout). Replace @inbounds with Polyester.@batch if you want multithreading, as in the sketch below.
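A minimal sketch of that variant (assuming Polyester.jl is installed; swap_and_sign_batch! is just a name I picked):

using Polyester

function swap_and_sign_batch!(A::Matrix)
    n = size(A, 1) ÷ 2
    @batch for jj in 1:n            # one column pair per task, no two tasks share columns
        j = 2jj
        for i in 1:2n
            @inbounds A[i,j], A[i,j-1] = A[i,j-1], -A[i,j]
        end
    end
    nothing
end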


Regarding making the type Float32: Sending the array to the GPU via arr |> cu seems to do that automatically. cuarr shows as CuArray{Float32, 2, CUDA.DeviceMemory}.
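For reference, a quick way to check the element type, and the constructor to use if one ever wanted to keep Float64 (just a sketch):

eltype(cu(arr))        # Float32 -- `cu` converts Float64 by default
eltype(CuArray(arr))   # Float64 -- the plain constructor preserves the element type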

It turned out it was indeed mostly the GPU being enormously underpowered. Running the same code on a proper GPU gave me numbers similar to yours, and further optimising as shown below made it even faster. But since the main culprit was the weak GPU and not actually something enormously stupid in the code, I am gonna accept that answer.

function swap_and_sign!(A::CuArray)
    n = size(A, 1) ÷ 2
    @inline function kernel_fun!(data)
        i = (blockIdx().x-1) * blockDim().x + threadIdx().x
        i_stride = gridDim().x * blockDim().x

        # multiply by 2 here so the index lands on the even columns.
        j0 = 2 * ((blockIdx().y-1) * blockDim().y + threadIdx().y)
        j_stride = 2 * gridDim().y * blockDim().y

        while i <= 2n
            j = j0   # reset the column index for every row this thread handles
            while j <= 2n
                data[i,j], data[i,j-1] = data[i,j-1], -data[i,j]
                j += j_stride
            end
            i += i_stride
        end
        nothing
    end
    kernel = @cuda launch=false kernel_fun!(A)
    config = launch_configuration(kernel.fun)

    # this seemed to give the faster iteration order
    threads_x = min(2n, config.threads)
    threads_y = min(fld(config.threads, threads_x), n) # put all remaining threads in here
    blocks_y = cld(n, threads_y)
    blocks_x = cld(2n, threads_x)
    threads = (threads_x, threads_y)
    blocks = (blocks_x, blocks_y)
    CUDA.@sync kernel(A; threads, blocks)
    nothing
end

EDIT: your improvement is the same speed as KA but with much more boilerplate.

I’m a big fan of KernelAbstractions.jl, which here is somehow almost 6 times faster.

using CUDA, BenchmarkTools, KernelAbstractions

function swap_and_sign!(A::CuArray)
    n = size(A, 1) ÷ 2
    threads = n
    blocks = 2n
    @inline function kernel(data)
        i, j = blockIdx().x, 2 * threadIdx().x
        data[i,j], data[i,j-1] = data[i,j-1], -data[i,j]
        nothing
    end
    @cuda threads=threads blocks=blocks kernel(A)
    nothing
end

function swap_and_sign_KA!(A::CuArray)
    backend = KernelAbstractions.get_backend(A)
    kernel! = swap_and_sign_kernel!(backend)
    kernel!(A, ndrange=(size(A, 1), size(A, 2) ÷ 2))
end

@kernel function swap_and_sign_kernel!(A)
    i, j = @index(Global, NTuple)
    j = 2 * j
    A[i,j], A[i,j-1] = A[i,j-1], -A[i,j]
end


arr = rand(1000, 1000);
cuarr = arr |> cu;

@btime CUDA.@sync swap_and_sign!($cuarr)
#    209.772 μs (18 allocations: 560 bytes)

@btime CUDA.@sync swap_and_sign_KA!($cuarr)
#  36.228 μs (44 allocations: 1.23 KiB)

Oh, that is indeed much cleaner! Thanks for pointing KernelAbstractions.jl out!
