I wrote a small test program to learn a bit more about CUDA.jl. I find it hard to understand why the test3! function is so much slower than test1! and test2!, especially because this is the way one would write a CUDA kernel in C++/CUDA. Also, why is test1! slower than test2!?
I replied earlier, but as it turned out all the benchmarking was incorrect because we weren't calling CUDA.@sync. Both kernels and array programming are asynchronous, unless there's a copy from GPU to host, which triggers an implicit synchronize(). So I guess we were just measuring kernel launches.
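For example (a sketch, with test1! and a_gpu as defined in the original post):

using CUDA, BenchmarkTools

@btime test1!($a_gpu)            # only times the asynchronous kernel launch
@btime CUDA.@sync test1!($a_gpu) # waits for the kernel to actually finish
# A device-to-host copy such as Array(a_gpu) also forces an implicit synchronize()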
With CUDA.@sync inserted prior to each testn!(), I get the following results:
@btime test2!(a) evals=10 samples=100
@btime begin CUDA.@sync test1!(a_gpu) end evals=100 samples=1000
@btime begin CUDA.@sync test2!(a_gpu) end evals=100 samples=1000
@btime begin CUDA.@sync test3!(a_gpu) end evals=100 samples=1000
CPU test2!(a): 15.919 ms (1 allocation: 80 bytes)
test1!(a_gpu) 2.229 ms (82 allocations: 4.56 KiB)
test2!(a_gpu) 678.646 μs (38 allocations: 3.32 KiB)
test3!(a_gpu) 630.090 μs (17 allocations: 1.03 KiB)
So in actual fact your kernel is perfectly performant.
FWIW, I would have set a launch configuration that ensured consecutive threads in a block make sequential (coalesced) accesses to memory. As it turns out this doesn't make any difference in this case (possibly because the warp size matches your thread dimensions), but here's my (IMO slightly simpler) version anyway:
@inbounds function test4_kernel(a)
    # threadIdx().x indexes the first array dimension, which is the fastest-varying
    # one in Julia's column-major layout, so consecutive threads in a warp access
    # consecutive memory addresses
    i = threadIdx().x + 1
    j = (blockIdx().y - 1) * blockDim().y + threadIdx().y + 1
    k = blockIdx().z + 1
    a[i, j, k] *= 0.1
    return nothing
end

function test4!(a)
    threads = (256, 4, 1)
    blocks = (1, 64, 256)
    @cuda threads=threads blocks=blocks test4_kernel(a)
end
@btime begin CUDA.@sync test4!(a_gpu) end evals=100 samples=1000
test4!() 628.820 μs (17 allocations: 1.03 KiB)
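If you'd rather not hand-pick the thread and block counts, CUDA.jl's occupancy API can suggest a launch configuration (a sketch; you still have to map the suggested totals onto your 3-D index ranges yourself):

kernel = @cuda launch=false test4_kernel(a_gpu)
config = launch_configuration(kernel.fun)  # suggested threads per block and number of blocks
@show config.threads config.blocks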
It’s a good question. Normal Julia expands a[m:n] .*= b to a[m:n] .= a[m:n] .* b. Julia fuses both of those dot operations during broadcasting into a single loop over the indices m:n.
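Roughly speaking, the fusion means the multiply and the assignment happen in one pass, with no intermediate array for the product (a sketch using the same placeholder names a, b, m and n, ignoring bounds checks):

for idx in m:n
    a[idx] = a[idx] * b
end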
I seem to recall the CUDA docs once explicitly stated that CUDA.jl does not fuse dot operations, which would mean the two operations run in separate stages, creating an intermediate array.
I’m doubting myself here, so perhaps someone more knowledgeable can answer.
I would caution that BenchmarkTools does not capture GPU memory allocations, but in this case we can see from the lowering that it indeed does not make a copy:
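One way to check on the CPU side is to look at the lowering (a sketch; the exact SSA output varies between Julia versions):

Meta.@lower a[m:n] .*= b
# The destination lowers to Base.dotview(a, m:n), i.e. a view, and
# Base.materialize!(...) then writes the broadcast result into that view in place.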