Performance difference among three CUDA kernels with same results

I wrote a small test program to learn a bit more about CUDA.jl. I find it hard to understand why the test3! function is so much slower than test1! and test2!, especially because it is written the way one would write a CUDA kernel in C++/CUDA. Also, why is test1! slower than test2!? These are the timings I get for test1!, test2!, and test3!, followed by the code:

  18.399 μs (67 allocations: 3.75 KiB)
  7.500 μs (24 allocations: 2.50 KiB)
  353.689 μs (3 allocations: 256 bytes)
using CUDA
using BenchmarkTools

function test1!(a)
    a[2:end-1, 2:end-1, 2:end-1] .*= 0.1
end

function test2!(a)
    av = @view a[2:end-1, 2:end-1, 2:end-1]
    av .*= 0.1
end

function test3_kernel!(a)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x + 1
    j = (blockIdx().y - 1) * blockDim().y + threadIdx().y + 1
    k = blockIdx().z + 1

    if (i < size(a, 1)) && (j < size(a, 2)) && (k < size(a, 3))
        @inbounds a[i, j, k] *= 0.1
    end
    return nothing
end

function test3!(a)
    blocks_size = (32, 32, 1)
    blocks_num = (
        (size(a, 1)-2) ÷ blocks_size[1],
        (size(a, 2)-2) ÷ blocks_size[2],
        (size(a, 3)-2) )

    CUDA.@sync begin
        @cuda threads=blocks_size blocks=blocks_num test3_kernel!(a)
    end
 
end

a = rand(258, 258, 258)
a_gpu = CuArray(a)

a_gpu1 = CuArray(a)
a_gpu2 = CuArray(a)
a_gpu3 = CuArray(a)

test1!(a_gpu1)
test2!(a_gpu2)
test3!(a_gpu3)

println(a_gpu1 ≈ a_gpu2)
println(a_gpu1 ≈ a_gpu3)
println(a_gpu2 ≈ a_gpu3)

@btime test1!($a_gpu)
@btime test2!($a_gpu)
@btime test3!($a_gpu)

I replied earlier, but as it turned out all the benchmarking was incorrect because we weren’t calling CUDA.@sync. Both kernels and array programming are asynchronous, unless there is a copy from the GPU to the host, which triggers an implicit synchronize(). So I guess we were just measuring kernel launches.
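As a minimal sketch of that implicit synchronization (reusing a_gpu2 from the original post; a_host is just a name for this example):

test2!(a_gpu2)          # the broadcast is queued on the GPU; this call returns immediately
a_host = Array(a_gpu2)  # the device-to-host copy blocks until the queued work has finished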

With CUDA.@sync inserted prior to each testn!(), I get the following results:

@btime test2!(a) evals=10 samples=100
@btime begin CUDA.@sync test1!(a_gpu) end evals=100 samples=1000
@btime begin CUDA.@sync test2!(a_gpu) end evals=100 samples=1000
@btime begin CUDA.@sync test3!(a_gpu) end evals=100 samples=1000

CPU test2!(a):  15.919 ms (1 allocation: 80 bytes)
test1!(a_gpu):   2.229 ms (82 allocations: 4.56 KiB)
test2!(a_gpu): 678.646 μs (38 allocations: 3.32 KiB)
test3!(a_gpu): 630.090 μs (17 allocations: 1.03 KiB)

So in actual fact your kernel is perfectly performant.

FWIW, I would have chosen a launch configuration that ensures sequential (coalesced) memory accesses across the block. As it turns out this doesn’t make any difference in this case (possibly because the warp size matches your thread dimensions), but here’s my (IMO slightly simpler) version anyway:

@inbounds function test4_kernel(a)
    i = threadIdx().x + 1
    j = (blockIdx().y - 1) * blockDim().y + threadIdx().y + 1
    k = blockIdx().z + 1

    a[i, j, k] *= 0.1

    return nothing
end

function test4!(a)
    threads = (256, 4, 1)
    blocks = (1, 64, 256)
    @cuda threads=threads blocks=blocks test4_kernel(a)
end
@btime begin CUDA.@sync test4!(a_gpu) end evals=100 samples=1000

test4!(a_gpu): 628.820 μs (17 allocations: 1.03 KiB)

Great! Thanks for that, this makes a lot more sense. That leaves only my second question: why is test2! faster than test1!?

test1! makes a full copy of the array slice before applying the kernel.
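A quick way to see this (an illustrative sketch, reusing a_gpu from the original post; exact numbers depend on the machine) is CUDA.@time, which also reports GPU allocations:

CUDA.@time test1!(a_gpu)   # reports a GPU allocation for the copied interior slice
CUDA.@time test2!(a_gpu)   # the @view version needs no GPU allocation for the slice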


But why is that so? I do not understand why a range on the left-hand side of the assignment requires a copy of the data.

It’s a good question. Normal Julia expands a[m:n] .*= b to a[m:n] .= a[m:n] .* b. Julia fuses both of those dot operations during broadcasting to create a single loop over the indices m:n.

I seem to recall the CUDA docs once explicitly stated that CUDA.jl does not fuse dot operations, which would mean these occur in two separate stages, creating an intermediate array.

I’m doubting myself here, so perhaps someone more knowledgeable can answer.


The right-hand side of a[m:n] .= a[m:n] .* b makes a copy, not a view, even on the CPU. You can avoid that by using
@views a[m:n] .*= b
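A quick CPU-side check (an illustrative sketch; the array x and the two helper functions are made up for this example):

using BenchmarkTools

x = rand(10^6)

mul_copy!(v) = (v[2:end-1] .*= 0.1)          # right-hand side v[2:end-1] materializes a copy
mul_view!(v) = (@views v[2:end-1] .*= 0.1)   # right-hand side is a view, so nothing is copied

@btime mul_copy!($x);   # expect an allocation for the temporary slice
@btime mul_view!($x);   # expect no temporary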

@photor I benchmarked it and there are zero allocations for the version without an explicit view. So I don’t think it’s allocating.


I would caution that BenchmarkTools does not capture GPU memory allocations, but in this case we can see from the lowering and the CUDA.@time calls below that the right-hand side getindex does indeed make a copy, while the dotview on the left-hand side does not:

julia> using CUDA

julia> a = CUDA.rand(10, 10, 10);

julia> Meta.@lower a[1:3, 2:4, 3:5] .*= 0.1
:($(Expr(:thunk, CodeInfo(
    @ none within `top-level scope`
1 ─ %1 = 1:3
│   %2 = 2:4
│   %3 = 3:5
│   %4 = Base.dotview(a, %1, %2, %3)
│   %5 = Base.getindex(a, %1, %2, %3)
│   %6 = Base.broadcasted(*, %5, 0.1)
│   %7 = Base.materialize!(%4, %6)
└──      return %7
))))

julia> Meta.@lower @views a[1:3, 2:4, 3:5] .*= 0.1
:($(Expr(:thunk, CodeInfo(
    @ none within `top-level scope`
1 ─ %1  = a
│         ##a#274 = %1
│   %3  = 1:3
│         i#275 = %3
│   %5  = 2:4
│         i#276 = %5
│   %7  = 3:5
│         i#277 = %7
│   %9  = Base.dotview(##a#274, i#275, i#276, i#277)
│   %10 = (Base.maybeview)(##a#274, i#275, i#276, i#277)
│   %11 = Base.broadcasted(*, %10, 0.1)
│   %12 = Base.materialize!(%9, %11)
└──       return %12
))))

julia> CUDA.@time Base.dotview(a, 1:3, 2:4, 3:5);
  0.000007 seconds (6 CPU allocations: 256 bytes)

julia> CUDA.@time a[1:3, 2:4, 3:5];
  0.017774 seconds (42 CPU allocations: 1.578 KiB) (1 GPU allocation: 108 bytes, 0.05% memmgmt time)
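For comparison (a sketch, not captured output), the @views form goes through maybeview instead of getindex, so no GPU allocation is needed for the slice:

CUDA.@time @views a[1:3, 2:4, 3:5] .*= 0.1;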