Simple CUDA kernel on matrix slower than running on CPU

I am trying to write a simple CUDA kernel that swaps pairs of columns in a square matrix and puts a sign on one of them:

using CUDA, BenchmarkTools

function swap_and_sign!(A::Matrix)
    n = size(A, 1) ÷ 2
    for i in 1:2n
        for j in 2:2:2n
            A[i,j], A[i,j-1] = A[i,j-1], -A[i,j]
        end
    end
    nothing
end

function swap_and_sign!(A::CuArray)
    n = size(A, 1) ÷ 2
    threads = n
    blocks = 2n
    @inline function kernel(data)
        i, j = blockIdx().x, 2 * threadIdx().x
        data[i,j], data[i,j-1] = data[i,j-1], -data[i,j]
        nothing
    end
    @cuda threads=threads blocks=blocks kernel(A)
    nothing
end

arr = rand(1000, 1000)
cuarr = arr |> cu


@btime CUDA.@sync swap_and_sign!(cuarr)
#  2.469 ms (19 allocations: 576 bytes)

@btime swap_and_sign!(arr)
#  1.563 ms (0 allocations: 0 bytes)

As you can see, even for a 1000×1000 matrix this is slower than running the same code on the CPU. Is there something more clever I could do with the indexing and the number of threads and blocks to speed things up, or is this performance expected?

For reference, the GPU is a GeForce MX250 (so not the beefiest) and the CPU an Intel i7-10850H @ 2.7 GHz.

If I remember correctly, there might be special things to do to properly benchmark GPU code besides just CUDA.@sync. But a quick fix you could try is switching your array to Float32; that’s usually the type GPUs are optimized for.
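For example, a minimal sketch of forcing the element type explicitly (the variable names are just for illustration):

using CUDA

arr = rand(1000, 1000)            # Float64 on the CPU
cuarr32 = CuArray{Float32}(arr)   # explicit Float32 upload
cuarr64 = CuArray(arr)            # keeps Float64, usually much slower on consumer GPUs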

There are many optimizations you could apply, from making sure there are no exceptions, to optimizing the iteration order to maximize coalescing, to tuning the launch configuration, etc. See Performance Tips · CUDA.jl for a couple of Julia-specific hints, but ultimately this is a CUDA optimization problem.
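As a rough illustration of the coalescing point (just a sketch, not benchmarked; the function name and the block size of 256 are arbitrary choices): give each thread a row index, so that consecutive threads in a block touch consecutive rows of the two columns, which keeps the column-major accesses coalesced, and guard the index instead of relying on an exact launch:

using CUDA

function swap_and_sign_coalesced!(A::CuArray)
    n = size(A, 1) ÷ 2
    function kernel(data)
        i = threadIdx().x + (blockIdx().x - 1) * blockDim().x  # row index, the fast axis
        j = 2 * blockIdx().y                                   # even column index
        if i <= 2n
            @inbounds data[i, j], data[i, j-1] = data[i, j-1], -data[i, j]
        end
        nothing
    end
    threads = 256
    @cuda threads=threads blocks=(cld(2n, threads), n) kernel(A)
    nothing
end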

That said, your GPU just seems extraordinarily underpowered here. On my workstation with high-end CPU and GPU hardware, I get 500 µs for the CPU version, and 36 µs for the GPU version.

What specifically are you thinking about? Just doing CUDA.@sync should be sufficient. There’s also the integrated profiler, of course (CUDA.@profile and CUDA.@bprofile), which shows a lot of additional information that could help optimize this.
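For example, something along these lines:

CUDA.@profile swap_and_sign!(cuarr)    # one-shot trace: kernel time vs. launch/host overhead
CUDA.@bprofile swap_and_sign!(cuarr)   # runs the expression repeatedly, similar in spirit to @btime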


Not sure; I have a distant memory of someone telling me BenchmarkTools was inappropriate, but I can’t find the source. It’s probably nothing.

I’d suggest

function swap_and_sign_transpose!(A::Matrix)
    n = size(A, 1) ÷ 2
    @inbounds for j in 2:2:2n
        for i in 1:2n
            A[i,j], A[i,j-1] = A[i,j-1], -A[i,j]
        end
    end
    nothing
end

if you want to improve the CPU’s performance (the loops are reordered so the inner loop walks down columns, which matches Julia’s column-major layout). Replace @inbounds with Polyester.@batch if you want multithreading, as in the sketch below.
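A minimal sketch of that variant (assuming Polyester.jl is installed; swap_and_sign_batch! is just a name I picked):

using Polyester

function swap_and_sign_batch!(A::Matrix)
    n = size(A, 1) ÷ 2
    @batch for jj in 1:n            # one column pair per task, no two tasks share columns
        j = 2jj
        for i in 1:2n
            @inbounds A[i,j], A[i,j-1] = A[i,j-1], -A[i,j]
        end
    end
    nothing
end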


Regarding making the type Float32: Sending the array to the GPU via arr |> cu seems to do that automatically. cuarr shows as CuArray{Float32, 2, CUDA.DeviceMemory}.
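For reference, a quick way to check the element type, and the constructor to use if one ever wanted to keep Float64 (just a sketch):

eltype(cu(arr))        # Float32 -- `cu` converts Float64 by default
eltype(CuArray(arr))   # Float64 -- the plain constructor preserves the element type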

It turned out it was indeed mostly the GPU being enormously underpowered. Running the same code on a proper GPU gave me numbers similar to yours, and further optimising as shown below made it even faster. But since the main culprit was the weak GPU and not actually something enormously stupid in the code, I am gonna accept that answer.

function swap_and_sign!(A::CuArray)
    n = size(A, 1) ÷ 2
    @inline function kernel_fun!(data)
        i = (blockIdx().x-1) * blockDim().x + threadIdx().x
        i_stride = gridDim().x * blockDim().x

        # multiply by 2 here so the index lands on the even columns.
        j0 = 2 * ((blockIdx().y-1) * blockDim().y + threadIdx().y)
        j_stride = 2 * gridDim().y * blockDim().y

        while i <= 2n
            j = j0   # reset the column index for every row this thread handles
            while j <= 2n
                data[i,j], data[i,j-1] = data[i,j-1], -data[i,j]
                j += j_stride
            end
            i += i_stride
        end
        nothing
    end
    kernel = @cuda launch=false kernel_fun!(A)
    config = launch_configuration(kernel.fun)

    # this seemed to give the faster iteration order
    threads_x = min(2n, config.threads)
    threads_y = min(fld(config.threads, threads_x), n) # put all remaining threads in here
    blocks_y = cld(n, threads_y)
    blocks_x = cld(2n, threads_x)
    threads = (threads_x, threads_y)
    blocks = (blocks_x, blocks_y)
    CUDA.@sync kernel(A; threads, blocks)
    nothing
end

EDIT: your improvement is the same speed as KA but with much more boilerplate.

I’m a big fan of KernelAbstractions.jl, which here is somehow almost 6 times faster.

using CUDA, BenchmarkTools, KernelAbstractions

function swap_and_sign!(A::CuArray)
    n = size(A, 1) ÷ 2
    threads = n
    blocks = 2n
    @inline function kernel(data)
        i, j = blockIdx().x, 2 * threadIdx().x
        data[i,j], data[i,j-1] = data[i,j-1], -data[i,j]
        nothing
    end
    @cuda threads=threads blocks=blocks kernel(A)
    nothing
end

function swap_and_sign_KA!(A::CuArray)
    backend = KernelAbstractions.get_backend(A)
    kernel! = swap_and_sign_kernel!(backend)
    kernel!(A, ndrange=(size(A, 1), size(A, 2) ÷ 2))
end

@kernel function swap_and_sign_kernel!(A)
    i, j = @index(Global, NTuple)
    j = 2 * j
    A[i,j], A[i,j-1] = A[i,j-1], -A[i,j]
end


arr = rand(1000, 1000);
cuarr = arr |> cu;

@btime CUDA.@sync swap_and_sign!($cuarr)
#    209.772 μs (18 allocations: 560 bytes)

@btime CUDA.@sync swap_and_sign_KA!($cuarr)
#  36.228 μs (44 allocations: 1.23 KiB)

Oh, that is indeed much cleaner! Thanks for pointing KernelAbstractions.jl out!
