@inbounds slower

eldee · March 6, 2025, 1:27pm

Hi,

I’m encountering some situations in my GPU code where adding @inbounds increases execution time by 10%-20%. Of course it would be great if we never needed @inbounds, but on the other hand there are certainly also situations where adding it helps considerably.

Is there any theoretical reason why using @inbounds might negatively affect performance? Performing less work by omitting checks sounds like it should normally never harm performance, but perhaps there are some subtle incompatibilities, similar to the AVX512 / SIMD CPU situation discussed in the thread "@inbounds code slower than one without?". And how can one predict in advance whether adding @inbounds will be useful?

I'm struggling to extract a good MWE. In the code below, the difference in speed is small, but at least it is quite consistent. Hopefully any conclusions will still carry over to the full code.

using CUDA, Random, Statistics
using CUDA: i32

function main()
    inbounds_times = Float64[]
    noinbounds_times = Float64[]
    for i = 1:1001  # Takes around 1.5 minutes on my RTX 3070
        P = CUDA.rand(3, 16, 2_000_000)
        G = similar(P)
        M = similar(P)
        I = CuArray{Int32}(sort(@view randperm(size(P, 3))[1:500_000]))
        # We will use the equivalent of @view P[:,:,I].

        # Just in case there's some caching going on (though I doubt it), 
        # alternate the order of inbounds and noinbounds for a fair comparison
        if i % 2 == 0
            inbounds_time = @CUDA.elapsed @cuda threads=512 blocks=cld(length(I), 512) inbounds_kernel!(P, G, M, I)
            noinbounds_time = @CUDA.elapsed @cuda threads=512 blocks=cld(length(I), 512) noinbounds_kernel!(P, G, M, I)
        else
            noinbounds_time = @CUDA.elapsed @cuda threads=512 blocks=cld(length(I), 512) noinbounds_kernel!(P, G, M, I)
            inbounds_time = @CUDA.elapsed @cuda threads=512 blocks=cld(length(I), 512) inbounds_kernel!(P, G, M, I)
        end

        # Don't include the first runs
        if i >= 2
            push!(inbounds_times, inbounds_time)
            push!(noinbounds_times, noinbounds_time)
        end

        CUDA.unsafe_free!(P)
        CUDA.unsafe_free!(G)
        CUDA.unsafe_free!(M)
        CUDA.unsafe_free!(I)
    end
    println("Inbounds: ")
    println("\tMin: ", minimum(inbounds_times) * 1000)
    println("\tMedian: ", median(inbounds_times) * 1000)
    println("\tMean: ", mean(inbounds_times) * 1000)
    println()
    println("No inbounds: ")
    println("\tMin: ", minimum(noinbounds_times) * 1000)
    println("\tMedian: ", median(noinbounds_times) * 1000)
    println("\tMean: ", mean(noinbounds_times) * 1000)
end

function inbounds_kernel!(P, G, M, I)  
    idx = threadIdx().x + (blockIdx().x - 1i32) * blockDim().x
    @inbounds while idx <= length(I)
        i = I[idx] 

        for j = axes(P, 1)      # (I guess the compiler might automatically eliminate bounds-checking for P.)
            for k = axes(P, 2)  # (But explicitly using 1i32:3i32 and 1i32:16i32 yields similar timings.)

                g = G[k, j, i]    # (no memory coalescing)
                m = M[k, j, i]

                m = 0.9 * m + 0.1 * g

                M[k, j, i] = m
                P[k, j, i] -= m
            end
        end

        idx += blockDim().x * gridDim().x
    end
    return
end

function noinbounds_kernel!(P, G, M, I)
    idx = threadIdx().x + (blockIdx().x - 1i32) * blockDim().x
    while idx <= length(I)  # not @inbounds
        i = I[idx]

        for j = axes(P, 1)
            for k = axes(P, 2)
                g = G[k, j, i]
                m = M[k, j, i]

                m = 0.9 * m + 0.1 * g

                M[k, j, i] = m
                P[k, j, i] -= m
            end
        end

        idx += blockDim().x * gridDim().x
    end
    return
end

main()

Inbounds:
        Min: 25.07782354950905
        Median: 25.257023982703686
        Mean: 25.382457310333848

No inbounds:
        Min: 24.692736566066742
        Median: 24.902560748159885
        Mean: 25.018115950748324

versioninfo

julia> versioninfo()
Julia Version 1.11.2
Commit 5e9a32e7af (2024-12-01 20:02 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 8 × Intel(R) Core(TM) i7-7700K CPU @ 4.20GHz
  WORD_SIZE: 64
  LLVM: libLLVM-16.0.6 (ORCJIT, skylake)
Threads: 8 default, 0 interactive, 4 GC (on 8 virtual cores)
Environment:
  JULIA_NUM_THREADS = auto

julia> CUDA.versioninfo()
CUDA runtime 12.6, artifact installation
CUDA driver 12.7
NVIDIA driver 566.14.0

CUDA libraries:
- CUBLAS: 12.6.4
- CURAND: 10.3.7
- CUFFT: 11.3.0
- CUSOLVER: 11.7.1
- CUSPARSE: 12.5.4
- CUPTI: 2024.3.2 (API 24.0.0)
- NVML: 12.0.0+566.14

Julia packages:
- CUDA: 5.6.1
- CUDA_Driver_jll: 0.10.4+0
- CUDA_Runtime_jll: 0.15.5+0

Toolchain:
- Julia: 1.11.2
- LLVM: 16.0.6

1 device:
  0: NVIDIA GeForce RTX 3070 (sm_86, 6.619 GiB / 8.000 GiB available)

maleadt · March 7, 2025, 7:36am

There are many possible reasons (even though the behavior in itself is very unlikely), e.g., the bounds check may break loop unrolling, resulting in a loop-y kernel that uses fewer registers, and consequently has higher occupancy. The only way to verify such a hypothesis is to look at and compare the generated code.

I wouldn’t worry about it, it’s IME extremely unlikely that removing bounds checks will regress performance. Instead, I’d drive such decisions based on profile results. For example, when analyzing a kernel with NSight Compute, you may realize that register usage is a little high, and/or occupancy could be doubled by slight simplifications, prompting you to look at the generated code and e.g. force a loop unroll to not happen (by other means than introducing bounds checks).
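As a rough sketch of what comparing the generated code can look like in practice, the snippet below compiles a bounds-checked and an @inbounds variant of a toy kernel without launching them, prints their register counts, and dumps the PTX of each so the two can be diffed. The kernels `checked_kernel!` and `unchecked_kernel!` are hypothetical stand-ins, not the kernels from this thread, and the array size is arbitrary.

```julia
using CUDA

# Hypothetical stand-in kernels; substitute the kernels you want to compare.
function checked_kernel!(y, x)
    i = threadIdx().x + (blockIdx().x - 1) * blockDim().x
    if i <= length(y)
        y[i] = 2 * x[i]            # bounds-checked accesses
    end
    return
end

function unchecked_kernel!(y, x)
    i = threadIdx().x + (blockIdx().x - 1) * blockDim().x
    if i <= length(y)
        @inbounds y[i] = 2 * x[i]  # bounds checks elided
    end
    return
end

x = CUDA.rand(Float32, 1024)
y = similar(x)

for f in (checked_kernel!, unchecked_kernel!)
    # Compile without launching, then query the compiled kernel object.
    k = @cuda launch=false f(y, x)
    config = launch_configuration(k.fun)  # occupancy-based launch suggestion
    println(f, ": registers = ", CUDA.registers(k),
            ", suggested threads = ", config.threads)
end

# Print the PTX of each variant so that the two listings can be compared.
@device_code_ptx @cuda launch=false checked_kernel!(y, x)
@device_code_ptx @cuda launch=false unchecked_kernel!(y, x)
```

The register count and the suggested launch configuration only hint at occupancy; whether a difference matters for run time still has to be confirmed with a profiler.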

eldee · March 9, 2025, 1:22pm

Thanks for the insights!

I don’t have enough experience with PTX (or LLVM) to be able to properly understand the generated code, but the inbounds kernel is much shorter than the noinbounds kernel (less than half the length). I’m pretty sure the latter does not get unrolled (.pragma "nounroll";), but I don’t know what’s happening in the former.

inbounds PTX

// PTX CompilerJob of MethodInstance for inbounds_kernel!(::CuDeviceArray{Float32, 3, 1}, ::CuDeviceArray{Float32, 3, 1}, ::CuDeviceArray{Float32, 3, 1}, ::CuDeviceVector{Int32, 1}) for sm_86

//
// Generated by LLVM NVPTX Back-End
//

.version 8.5
.target sm_86
.address_size 64

        // .globl       _Z16inbounds_kernel_13CuDeviceArrayI7Float32Lx3ELx1EES_IS0_Lx3ELx1EES_IS0_Lx3ELx1EES_I5Int32Lx1ELx1EE // -- Begin function _Z16inbounds_kernel_13CuDeviceArrayI7Float32Lx3ELx1EES_IS0_Lx3ELx1EES_IS0_Lx3ELx1EES_I5Int32Lx1ELx1EE
                                        // @_Z16inbounds_kernel_13CuDeviceArrayI7Float32Lx3ELx1EES_IS0_Lx3ELx1EES_IS0_Lx3ELx1EES_I5Int32Lx1ELx1EE
.visible .entry _Z16inbounds_kernel_13CuDeviceArrayI7Float32Lx3ELx1EES_IS0_Lx3ELx1EES_IS0_Lx3ELx1EES_I5Int32Lx1ELx1EE(
        .param .align 8 .b8 _Z16inbounds_kernel_13CuDeviceArrayI7Float32Lx3ELx1EES_IS0_Lx3ELx1EES_IS0_Lx3ELx1EES_I5Int32Lx1ELx1EE_param_0[16],
        .param .align 8 .b8 _Z16inbounds_kernel_13CuDeviceArrayI7Float32Lx3ELx1EES_IS0_Lx3ELx1EES_IS0_Lx3ELx1EES_I5Int32Lx1ELx1EE_param_1[48],
        .param .align 8 .b8 _Z16inbounds_kernel_13CuDeviceArrayI7Float32Lx3ELx1EES_IS0_Lx3ELx1EES_IS0_Lx3ELx1EES_I5Int32Lx1ELx1EE_param_2[48],
        .param .align 8 .b8 _Z16inbounds_kernel_13CuDeviceArrayI7Float32Lx3ELx1EES_IS0_Lx3ELx1EES_IS0_Lx3ELx1EES_I5Int32Lx1ELx1EE_param_3[48],
        .param .align 8 .b8 _Z16inbounds_kernel_13CuDeviceArrayI7Float32Lx3ELx1EES_IS0_Lx3ELx1EES_IS0_Lx3ELx1EES_I5Int32Lx1ELx1EE_param_4[32]
)
{
        .reg .pred      %p<7>;
        .reg .b32       %r<12>;
        .reg .f32       %f<6>;
        .reg .b64       %rd<80>;
        .reg .f64       %fd<7>;

// %bb.0:                               // %conversion
        ld.param.u64    %rd58, [_Z16inbounds_kernel_13CuDeviceArrayI7Float32Lx3ELx1EES_IS0_Lx3ELx1EES_IS0_Lx3ELx1EES_I5Int32Lx1ELx1EE_param_4+24];
        mov.u32         %r6, %tid.x;
        mov.u32         %r7, %ctaid.x;
        mov.u32         %r1, %ntid.x;
        mad.lo.s32      %r8, %r7, %r1, %r6;
        add.s32         %r11, %r8, 1;
        cvt.s64.s32     %rd59, %r11;
        setp.lt.s64     %p1, %rd58, %rd59;
        @%p1 bra        $L__BB0_9;
        bra.uni         $L__BB0_1;
$L__BB0_9:                              // %L314
        ret;
$L__BB0_1:                              // %L44.lr.ph
        ld.param.u64    %rd55, [_Z16inbounds_kernel_13CuDeviceArrayI7Float32Lx3ELx1EES_IS0_Lx3ELx1EES_IS0_Lx3ELx1EES_I5Int32Lx1ELx1EE_param_4];
        ld.param.u64    %rd52, [_Z16inbounds_kernel_13CuDeviceArrayI7Float32Lx3ELx1EES_IS0_Lx3ELx1EES_IS0_Lx3ELx1EES_I5Int32Lx1ELx1EE_param_3+24];
        ld.param.u64    %rd51, [_Z16inbounds_kernel_13CuDeviceArrayI7Float32Lx3ELx1EES_IS0_Lx3ELx1EES_IS0_Lx3ELx1EES_I5Int32Lx1ELx1EE_param_3+16];
        ld.param.u64    %rd49, [_Z16inbounds_kernel_13CuDeviceArrayI7Float32Lx3ELx1EES_IS0_Lx3ELx1EES_IS0_Lx3ELx1EES_I5Int32Lx1ELx1EE_param_3];
        ld.param.u64    %rd46, [_Z16inbounds_kernel_13CuDeviceArrayI7Float32Lx3ELx1EES_IS0_Lx3ELx1EES_IS0_Lx3ELx1EES_I5Int32Lx1ELx1EE_param_2+24];
        ld.param.u64    %rd45, [_Z16inbounds_kernel_13CuDeviceArrayI7Float32Lx3ELx1EES_IS0_Lx3ELx1EES_IS0_Lx3ELx1EES_I5Int32Lx1ELx1EE_param_2+16];
        ld.param.u64    %rd43, [_Z16inbounds_kernel_13CuDeviceArrayI7Float32Lx3ELx1EES_IS0_Lx3ELx1EES_IS0_Lx3ELx1EES_I5Int32Lx1ELx1EE_param_2];
        ld.param.u64    %rd2, [_Z16inbounds_kernel_13CuDeviceArrayI7Float32Lx3ELx1EES_IS0_Lx3ELx1EES_IS0_Lx3ELx1EES_I5Int32Lx1ELx1EE_param_1+24];
        ld.param.u64    %rd39, [_Z16inbounds_kernel_13CuDeviceArrayI7Float32Lx3ELx1EES_IS0_Lx3ELx1EES_IS0_Lx3ELx1EES_I5Int32Lx1ELx1EE_param_1+16];
        ld.param.u64    %rd37, [_Z16inbounds_kernel_13CuDeviceArrayI7Float32Lx3ELx1EES_IS0_Lx3ELx1EES_IS0_Lx3ELx1EES_I5Int32Lx1ELx1EE_param_1];
        mov.u32         %r9, %nctaid.x;
        mul.lo.s32      %r3, %r1, %r9;
        mul.lo.s64      %rd60, %rd2, %rd39;
        shl.b64         %rd12, %rd60, 2;
        shl.b64         %rd13, %rd39, 2;
        mul.lo.s64      %rd61, %rd52, %rd51;
        shl.b64         %rd14, %rd61, 2;
        shl.b64         %rd15, %rd51, 2;
        mul.lo.s64      %rd62, %rd46, %rd45;
        shl.b64         %rd16, %rd62, 2;
        shl.b64         %rd17, %rd45, 2;
        setp.lt.s64     %p2, %rd39, 1;
        setp.lt.s64     %p3, %rd2, 1;
        bra.uni         $L__BB0_2;
$L__BB0_8:                              // %L299
                                        //   in Loop: Header=BB0_2 Depth=1
        add.s32         %r11, %r11, %r3;
        cvt.s64.s32     %rd71, %r11;
        setp.ge.s64     %p6, %rd58, %rd71;
        @%p6 bra        $L__BB0_2;
        bra.uni         $L__BB0_9;
$L__BB0_2:                              // %L44
                                        // =>This Loop Header: Depth=1
                                        //     Child Loop BB0_4 Depth 2
                                        //       Child Loop BB0_6 Depth 3
        @%p2 bra        $L__BB0_8;
// %bb.3:                               // %L62.preheader
                                        //   in Loop: Header=BB0_2 Depth=1
        add.s32         %r10, %r11, -1;
        mul.wide.s32    %rd64, %r10, 4;
        add.s64         %rd65, %rd55, %rd64;
        ld.global.s32   %rd66, [%rd65];
        add.s64         %rd67, %rd66, -1;
        mul.lo.s64      %rd68, %rd12, %rd67;
        add.s64         %rd74, %rd37, %rd68;
        mul.lo.s64      %rd69, %rd14, %rd67;
        add.s64         %rd73, %rd49, %rd69;
        mul.lo.s64      %rd70, %rd16, %rd67;
        add.s64         %rd72, %rd43, %rd70;
        mov.u64         %rd75, 1;
        bra.uni         $L__BB0_4;
$L__BB0_7:                              // %L288
                                        //   in Loop: Header=BB0_4 Depth=2
        add.s64         %rd33, %rd75, 1;
        add.s64         %rd74, %rd74, %rd13;
        add.s64         %rd73, %rd73, %rd15;
        add.s64         %rd72, %rd72, %rd17;
        setp.ne.s64     %p5, %rd75, %rd39;
        mov.u64         %rd75, %rd33;
        @%p5 bra        $L__BB0_4;
        bra.uni         $L__BB0_8;
$L__BB0_4:                              // %L62
                                        //   Parent Loop BB0_2 Depth=1
                                        // =>  This Loop Header: Depth=2
                                        //       Child Loop BB0_6 Depth 3
        @%p3 bra        $L__BB0_7;
// %bb.5:                               // %L76.preheader
                                        //   in Loop: Header=BB0_4 Depth=2
        mov.u64         %rd76, %rd2;
        mov.u64         %rd77, %rd72;
        mov.u64         %rd78, %rd73;
        mov.u64         %rd79, %rd74;
$L__BB0_6:                              // %L76
                                        //   Parent Loop BB0_2 Depth=1
                                        //     Parent Loop BB0_4 Depth=2
                                        // =>    This Inner Loop Header: Depth=3
        ld.global.f32   %f1, [%rd77];
        ld.global.f32   %f2, [%rd78];
        cvt.f64.f32     %fd1, %f2;
        mul.f64         %fd2, %fd1, 0d3FECCCCCCCCCCCCD;
        cvt.f64.f32     %fd3, %f1;
        fma.rn.f64      %fd4, %fd3, 0d3FB999999999999A, %fd2;
        cvt.rn.f32.f64  %f3, %fd4;
        st.global.f32   [%rd78], %f3;
        ld.global.f32   %f4, [%rd79];
        cvt.f64.f32     %fd5, %f4;
        sub.f64         %fd6, %fd5, %fd4;
        cvt.rn.f32.f64  %f5, %fd6;
        st.global.f32   [%rd79], %f5;
        add.s64         %rd79, %rd79, 4;
        add.s64         %rd78, %rd78, 4;
        add.s64         %rd77, %rd77, 4;
        add.s64         %rd76, %rd76, -1;
        setp.ne.s64     %p4, %rd76, 0;
        @%p4 bra        $L__BB0_6;
        bra.uni         $L__BB0_7;
                                        // -- End function
}

noinbounds PTX

// PTX CompilerJob of MethodInstance for noinbounds_kernel!(::CuDeviceArray{Float32, 3, 1}, ::CuDeviceArray{Float32, 3, 1}, ::CuDeviceArray{Float32, 3, 1}, ::CuDeviceVector{Int32, 1}) for sm_86

//
// Generated by LLVM NVPTX Back-End
//

.version 8.5
.target sm_86
.address_size 64

        // .globl       _Z18noinbounds_kernel_13CuDeviceArrayI7Float32Lx3ELx1EES_IS0_Lx3ELx1EES_IS0_Lx3ELx1EES_I5Int32Lx1ELx1EE // -- Begin function _Z18noinbounds_kernel_13CuDeviceArrayI7Float32Lx3ELx1EES_IS0_Lx3ELx1EES_IS0_Lx3ELx1EES_I5Int32Lx1ELx1EE
.extern .func julia_throw_boundserror_20419
(
        .param .align 8 .b8 julia_throw_boundserror_20419_param_0[16]
)
;
                                        // @_Z18noinbounds_kernel_13CuDeviceArrayI7Float32Lx3ELx1EES_IS0_Lx3ELx1EES_IS0_Lx3ELx1EES_I5Int32Lx1ELx1EE
.visible .entry _Z18noinbounds_kernel_13CuDeviceArrayI7Float32Lx3ELx1EES_IS0_Lx3ELx1EES_IS0_Lx3ELx1EES_I5Int32Lx1ELx1EE(
        .param .align 8 .b8 _Z18noinbounds_kernel_13CuDeviceArrayI7Float32Lx3ELx1EES_IS0_Lx3ELx1EES_IS0_Lx3ELx1EES_I5Int32Lx1ELx1EE_param_0[16],
        .param .align 8 .b8 _Z18noinbounds_kernel_13CuDeviceArrayI7Float32Lx3ELx1EES_IS0_Lx3ELx1EES_IS0_Lx3ELx1EES_I5Int32Lx1ELx1EE_param_1[48],
        .param .align 8 .b8 _Z18noinbounds_kernel_13CuDeviceArrayI7Float32Lx3ELx1EES_IS0_Lx3ELx1EES_IS0_Lx3ELx1EES_I5Int32Lx1ELx1EE_param_2[48],
        .param .align 8 .b8 _Z18noinbounds_kernel_13CuDeviceArrayI7Float32Lx3ELx1EES_IS0_Lx3ELx1EES_IS0_Lx3ELx1EES_I5Int32Lx1ELx1EE_param_3[48],
        .param .align 8 .b8 _Z18noinbounds_kernel_13CuDeviceArrayI7Float32Lx3ELx1EES_IS0_Lx3ELx1EES_IS0_Lx3ELx1EES_I5Int32Lx1ELx1EE_param_4[32]
)
{
        .reg .pred      %p<22>;
        .reg .b32       %r<11>;
        .reg .f32       %f<16>;
        .reg .b64       %rd<172>;
        .reg .f64       %fd<19>;

// %bb.0:                               // %conversion
        ld.param.u64    %rd94, [_Z18noinbounds_kernel_13CuDeviceArrayI7Float32Lx3ELx1EES_IS0_Lx3ELx1EES_IS0_Lx3ELx1EES_I5Int32Lx1ELx1EE_param_4+24];
        mov.u32         %r5, %tid.x;
        mov.u32         %r6, %ctaid.x;
        mov.u32         %r1, %ntid.x;
        mad.lo.s32      %r7, %r6, %r1, %r5;
        add.s32         %r8, %r7, 1;
        cvt.s64.s32     %rd152, %r8;
        setp.lt.s64     %p1, %rd94, %rd152;
        @%p1 bra        $L__BB0_21;
        bra.uni         $L__BB0_1;
$L__BB0_21:                             // %L314
        ret;
$L__BB0_1:                              // %L31.lr.ph
        ld.param.u64    %rd93, [_Z18noinbounds_kernel_13CuDeviceArrayI7Float32Lx3ELx1EES_IS0_Lx3ELx1EES_IS0_Lx3ELx1EES_I5Int32Lx1ELx1EE_param_4+16];
        ld.param.u64    %rd91, [_Z18noinbounds_kernel_13CuDeviceArrayI7Float32Lx3ELx1EES_IS0_Lx3ELx1EES_IS0_Lx3ELx1EES_I5Int32Lx1ELx1EE_param_4];
        ld.param.u64    %rd90, [_Z18noinbounds_kernel_13CuDeviceArrayI7Float32Lx3ELx1EES_IS0_Lx3ELx1EES_IS0_Lx3ELx1EES_I5Int32Lx1ELx1EE_param_3+40];
        ld.param.u64    %rd88, [_Z18noinbounds_kernel_13CuDeviceArrayI7Float32Lx3ELx1EES_IS0_Lx3ELx1EES_IS0_Lx3ELx1EES_I5Int32Lx1ELx1EE_param_3+24];
        ld.param.u64    %rd87, [_Z18noinbounds_kernel_13CuDeviceArrayI7Float32Lx3ELx1EES_IS0_Lx3ELx1EES_IS0_Lx3ELx1EES_I5Int32Lx1ELx1EE_param_3+16];
        ld.param.u64    %rd85, [_Z18noinbounds_kernel_13CuDeviceArrayI7Float32Lx3ELx1EES_IS0_Lx3ELx1EES_IS0_Lx3ELx1EES_I5Int32Lx1ELx1EE_param_3];
        ld.param.u64    %rd84, [_Z18noinbounds_kernel_13CuDeviceArrayI7Float32Lx3ELx1EES_IS0_Lx3ELx1EES_IS0_Lx3ELx1EES_I5Int32Lx1ELx1EE_param_2+40];
        ld.param.u64    %rd82, [_Z18noinbounds_kernel_13CuDeviceArrayI7Float32Lx3ELx1EES_IS0_Lx3ELx1EES_IS0_Lx3ELx1EES_I5Int32Lx1ELx1EE_param_2+24];
        ld.param.u64    %rd81, [_Z18noinbounds_kernel_13CuDeviceArrayI7Float32Lx3ELx1EES_IS0_Lx3ELx1EES_IS0_Lx3ELx1EES_I5Int32Lx1ELx1EE_param_2+16];
        ld.param.u64    %rd79, [_Z18noinbounds_kernel_13CuDeviceArrayI7Float32Lx3ELx1EES_IS0_Lx3ELx1EES_IS0_Lx3ELx1EES_I5Int32Lx1ELx1EE_param_2];
        ld.param.u64    %rd78, [_Z18noinbounds_kernel_13CuDeviceArrayI7Float32Lx3ELx1EES_IS0_Lx3ELx1EES_IS0_Lx3ELx1EES_I5Int32Lx1ELx1EE_param_1+40];
        ld.param.u64    %rd76, [_Z18noinbounds_kernel_13CuDeviceArrayI7Float32Lx3ELx1EES_IS0_Lx3ELx1EES_IS0_Lx3ELx1EES_I5Int32Lx1ELx1EE_param_1+24];
        ld.param.u64    %rd75, [_Z18noinbounds_kernel_13CuDeviceArrayI7Float32Lx3ELx1EES_IS0_Lx3ELx1EES_IS0_Lx3ELx1EES_I5Int32Lx1ELx1EE_param_1+16];
        ld.param.u64    %rd73, [_Z18noinbounds_kernel_13CuDeviceArrayI7Float32Lx3ELx1EES_IS0_Lx3ELx1EES_IS0_Lx3ELx1EES_I5Int32Lx1ELx1EE_param_1];
        ld.param.u32    %r4, [_Z18noinbounds_kernel_13CuDeviceArrayI7Float32Lx3ELx1EES_IS0_Lx3ELx1EES_IS0_Lx3ELx1EES_I5Int32Lx1ELx1EE_param_0+8];
        ld.param.u64    %rd72, [_Z18noinbounds_kernel_13CuDeviceArrayI7Float32Lx3ELx1EES_IS0_Lx3ELx1EES_IS0_Lx3ELx1EES_I5Int32Lx1ELx1EE_param_0];
        max.s64         %rd14, %rd84, 0;
        max.s64         %rd15, %rd90, 0;
        max.s64         %rd16, %rd78, 0;
        mov.u32         %r9, %nctaid.x;
        mul.lo.s32      %r2, %r1, %r9;
        mul.lo.s64      %rd95, %rd76, %rd75;
        shl.b64         %rd17, %rd95, 2;
        shl.b64         %rd18, %rd75, 2;
        mul.lo.s64      %rd96, %rd88, %rd87;
        shl.b64         %rd19, %rd96, 2;
        shl.b64         %rd20, %rd87, 2;
        mul.lo.s64      %rd97, %rd82, %rd81;
        shl.b64         %rd21, %rd97, 2;
        shl.b64         %rd22, %rd81, 2;
        add.s64         %rd23, %rd76, 1;
        setp.lt.s64     %p5, %rd75, 1;
        setp.lt.s64     %p6, %rd76, 1;
        bra.uni         $L__BB0_2;
$L__BB0_20:                             // %L299
                                        //   in Loop: Header=BB0_2 Depth=1
        add.s32         %r10, %r2, %r3;
        cvt.s64.s32     %rd152, %r10;
        setp.ge.s64     %p21, %rd94, %rd152;
        @%p21 bra       $L__BB0_2;
        bra.uni         $L__BB0_21;
$L__BB0_2:                              // %L31
                                        // =>This Loop Header: Depth=1
                                        //     Child Loop BB0_6 Depth 2
                                        //       Child Loop BB0_9 Depth 3
                                        //       Child Loop BB0_28 Depth 3
                                        //       Child Loop BB0_15 Depth 3
        cvt.u32.u64     %r3, %rd152;
        setp.gt.s32     %p2, %r3, 0;
        setp.ge.s64     %p3, %rd93, %rd152;
        and.pred        %p4, %p2, %p3;
        @%p4 bra        $L__BB0_4;
        bra.uni         $L__BB0_3;
$L__BB0_4:                              // %L44
                                        //   in Loop: Header=BB0_2 Depth=1
        @%p5 bra        $L__BB0_20;
// %bb.5:                               // %L62.preheader
                                        //   in Loop: Header=BB0_2 Depth=1
        add.s64         %rd100, %rd152, -1;
        and.b64         %rd101, %rd100, 4294967295;
        shl.b64         %rd102, %rd101, 2;
        add.s64         %rd103, %rd91, %rd102;
        ld.global.s32   %rd104, [%rd103];
        add.s64         %rd105, %rd104, -1;
        mul.lo.s64      %rd106, %rd105, %rd82;
        mul.lo.s64      %rd107, %rd105, %rd88;
        mul.lo.s64      %rd108, %rd105, %rd76;
        mul.lo.s64      %rd25, %rd107, %rd87;
        mul.lo.s64      %rd26, %rd106, %rd81;
        mul.lo.s64      %rd27, %rd108, %rd75;
        mul.lo.s64      %rd109, %rd17, %rd105;
        add.s64         %rd36, %rd73, %rd109;
        mul.lo.s64      %rd110, %rd19, %rd105;
        add.s64         %rd34, %rd85, %rd110;
        mul.lo.s64      %rd111, %rd21, %rd105;
        add.s64         %rd32, %rd79, %rd111;
        mov.u64         %rd160, 1;
        mov.u64         %rd37, 0;
        mov.u64         %rd31, %rd26;
        mov.u64         %rd33, %rd25;
        mov.u64         %rd35, %rd27;
        bra.uni         $L__BB0_6;
$L__BB0_19:                             // %L288
                                        //   in Loop: Header=BB0_6 Depth=2
        add.s64         %rd160, %rd38, 1;
        add.s64         %rd37, %rd37, 1;
        add.s64         %rd36, %rd36, %rd18;
        add.s64         %rd35, %rd35, %rd75;
        add.s64         %rd34, %rd34, %rd20;
        add.s64         %rd33, %rd33, %rd87;
        add.s64         %rd32, %rd32, %rd22;
        add.s64         %rd31, %rd31, %rd81;
        setp.ne.s64     %p20, %rd38, %rd75;
        @%p20 bra       $L__BB0_6;
        bra.uni         $L__BB0_20;
$L__BB0_6:                              // %L62
                                        //   Parent Loop BB0_2 Depth=1
                                        // =>  This Loop Header: Depth=2
                                        //       Child Loop BB0_9 Depth 3
                                        //       Child Loop BB0_28 Depth 3
                                        //       Child Loop BB0_15 Depth 3
        mov.u64         %rd38, %rd160;
        @%p6 bra        $L__BB0_19;
// %bb.7:                               // %L76.preheader
                                        //   in Loop: Header=BB0_6 Depth=2
        mul.lo.s64      %rd112, %rd87, %rd37;
        add.s64         %rd113, %rd25, %rd112;
        min.s64         %rd114, %rd113, 0;
        neg.s64         %rd115, %rd114;
        mul.lo.s64      %rd116, %rd81, %rd37;
        add.s64         %rd117, %rd26, %rd116;
        min.s64         %rd118, %rd117, 0;
        neg.s64         %rd119, %rd118;
        max.u64         %rd120, %rd115, %rd119;
        mul.lo.s64      %rd121, %rd75, %rd37;
        add.s64         %rd122, %rd27, %rd121;
        min.s64         %rd123, %rd122, 0;
        neg.s64         %rd124, %rd123;
        max.u64         %rd125, %rd120, %rd124;
        min.u64         %rd39, %rd76, %rd125;
        setp.eq.s64     %p7, %rd39, 0;
        mov.u64         %rd169, 0;
        mov.u64         %rd163, 1;
        @%p7 bra        $L__BB0_26;
// %bb.8:                               // %L76.preloop.preheader
                                        //   in Loop: Header=BB0_6 Depth=2
        mov.u64         %rd169, 0;
        mov.u64         %rd164, %rd32;
        mov.u64         %rd165, %rd34;
        mov.u64         %rd166, %rd36;
$L__BB0_9:                              // %L76.preloop
                                        //   Parent Loop BB0_2 Depth=1
                                        //     Parent Loop BB0_6 Depth=2
                                        // =>    This Inner Loop Header: Depth=3
        .pragma "nounroll";
        add.s64         %rd137, %rd31, %rd169;
        setp.ge.u64     %p8, %rd137, %rd14;
        @%p8 bra        $L__BB0_10;
// %bb.22:                              // %L109.preloop
                                        //   in Loop: Header=BB0_9 Depth=3
        add.s64         %rd138, %rd33, %rd169;
        setp.ge.u64     %p9, %rd138, %rd15;
        @%p9 bra        $L__BB0_11;
// %bb.23:                              // %L147.preloop
                                        //   in Loop: Header=BB0_9 Depth=3
        ld.global.f32   %f1, [%rd164];
        ld.global.f32   %f2, [%rd165];
        cvt.f64.f32     %fd3, %f2;
        mul.f64         %fd4, %fd3, 0d3FECCCCCCCCCCCCD;
        cvt.f64.f32     %fd5, %f1;
        fma.rn.f64      %fd1, %fd5, 0d3FB999999999999A, %fd4;
        cvt.rn.f32.f64  %f3, %fd1;
        st.global.f32   [%rd165], %f3;
        add.s64         %rd139, %rd35, %rd169;
        setp.ge.u64     %p10, %rd139, %rd16;
        @%p10 bra       $L__BB0_12;
// %bb.24:                              // %L270.preloop
                                        //   in Loop: Header=BB0_9 Depth=3
        ld.global.f32   %f4, [%rd166];
        cvt.f64.f32     %fd6, %f4;
        sub.f64         %fd7, %fd6, %fd1;
        cvt.rn.f32.f64  %f5, %fd7;
        st.global.f32   [%rd166], %f5;
        add.s64         %rd169, %rd169, 1;
        add.s64         %rd166, %rd166, 4;
        add.s64         %rd165, %rd165, 4;
        add.s64         %rd164, %rd164, 4;
        setp.lt.u64     %p11, %rd169, %rd39;
        @%p11 bra       $L__BB0_9;
// %bb.25:                              // %preloop.exit.selector
                                        //   in Loop: Header=BB0_6 Depth=2
        setp.ge.u64     %p12, %rd169, %rd76;
        add.s64         %rd163, %rd169, 1;
        @%p12 bra       $L__BB0_19;
$L__BB0_26:                             // %preloop.pseudo.exit
                                        //   in Loop: Header=BB0_6 Depth=2
        min.s64         %rd126, %rd15, %rd113;
        sub.s64         %rd127, %rd15, %rd126;
        min.u64         %rd128, %rd76, %rd127;
        min.s64         %rd129, %rd14, %rd117;
        sub.s64         %rd130, %rd14, %rd129;
        min.u64         %rd131, %rd128, %rd130;
        min.s64         %rd132, %rd16, %rd122;
        sub.s64         %rd133, %rd16, %rd132;
        min.u64         %rd40, %rd131, %rd133;
        setp.ge.u64     %p13, %rd169, %rd40;
        @%p13 bra       $L__BB0_14;
// %bb.27:                              // %L76.preheader13
                                        //   in Loop: Header=BB0_6 Depth=2
        shl.b64         %rd161, %rd163, 2;
$L__BB0_28:                             // %L76
                                        //   Parent Loop BB0_2 Depth=1
                                        //     Parent Loop BB0_6 Depth=2
                                        // =>    This Inner Loop Header: Depth=3
        mov.u64         %rd42, %rd163;
        add.s64         %rd140, %rd32, %rd161;
        ld.global.f32   %f6, [%rd140+-4];
        add.s64         %rd141, %rd34, %rd161;
        ld.global.f32   %f7, [%rd141+-4];
        cvt.f64.f32     %fd8, %f7;
        mul.f64         %fd9, %fd8, 0d3FECCCCCCCCCCCCD;
        cvt.f64.f32     %fd10, %f6;
        fma.rn.f64      %fd11, %fd10, 0d3FB999999999999A, %fd9;
        cvt.rn.f32.f64  %f8, %fd11;
        st.global.f32   [%rd141+-4], %f8;
        add.s64         %rd142, %rd36, %rd161;
        ld.global.f32   %f9, [%rd142+-4];
        cvt.f64.f32     %fd12, %f9;
        sub.f64         %fd13, %fd12, %fd11;
        cvt.rn.f32.f64  %f10, %fd13;
        st.global.f32   [%rd142+-4], %f10;
        add.s64         %rd163, %rd42, 1;
        add.s64         %rd161, %rd161, 4;
        setp.lt.u64     %p14, %rd42, %rd40;
        @%p14 bra       $L__BB0_28;
// %bb.13:                              // %main.exit.selector
                                        //   in Loop: Header=BB0_6 Depth=2
        setp.ge.u64     %p15, %rd42, %rd76;
        @%p15 bra       $L__BB0_19;
$L__BB0_14:                             // %main.pseudo.exit
                                        //   in Loop: Header=BB0_6 Depth=2
        shl.b64         %rd170, %rd163, 2;
$L__BB0_15:                             // %L76.postloop
                                        //   Parent Loop BB0_2 Depth=1
                                        //     Parent Loop BB0_6 Depth=2
                                        // =>    This Inner Loop Header: Depth=3
        .pragma "nounroll";
        add.s64         %rd143, %rd31, %rd163;
        add.s64         %rd144, %rd143, -1;
        setp.ge.u64     %p16, %rd144, %rd14;
        @%p16 bra       $L__BB0_10;
// %bb.16:                              // %L109.postloop
                                        //   in Loop: Header=BB0_15 Depth=3
        add.s64         %rd145, %rd33, %rd163;
        add.s64         %rd146, %rd145, -1;
        setp.ge.u64     %p17, %rd146, %rd15;
        @%p17 bra       $L__BB0_11;
// %bb.17:                              // %L147.postloop
                                        //   in Loop: Header=BB0_15 Depth=3
        add.s64         %rd147, %rd32, %rd170;
        ld.global.f32   %f11, [%rd147+-4];
        add.s64         %rd148, %rd34, %rd170;
        ld.global.f32   %f12, [%rd148+-4];
        cvt.f64.f32     %fd14, %f12;
        mul.f64         %fd15, %fd14, 0d3FECCCCCCCCCCCCD;
        cvt.f64.f32     %fd16, %f11;
        fma.rn.f64      %fd2, %fd16, 0d3FB999999999999A, %fd15;
        cvt.rn.f32.f64  %f13, %fd2;
        st.global.f32   [%rd148+-4], %f13;
        add.s64         %rd149, %rd35, %rd163;
        add.s64         %rd150, %rd149, -1;
        setp.ge.u64     %p18, %rd150, %rd16;
        @%p18 bra       $L__BB0_12;
// %bb.18:                              // %L270.postloop
                                        //   in Loop: Header=BB0_15 Depth=3
        add.s64         %rd151, %rd36, %rd170;
        ld.global.f32   %f14, [%rd151+-4];
        cvt.f64.f32     %fd17, %f14;
        sub.f64         %fd18, %fd17, %fd2;
        cvt.rn.f32.f64  %f15, %fd18;
        st.global.f32   [%rd151+-4], %f15;
        add.s64         %rd163, %rd163, 1;
        add.s64         %rd170, %rd170, 4;
        setp.eq.s64     %p19, %rd23, %rd163;
        @%p19 bra       $L__BB0_19;
        bra.uni         $L__BB0_15;
$L__BB0_10:                             // %L106
        { // callseq 22, 0
        .reg .b32 temp_param_reg;
        .param .align 8 .b8 param0[16];
        st.param.b64    [param0+0], %rd72;
        st.param.b32    [param0+8], %r4;
        call.uni
        julia_throw_boundserror_20419,
        (
        param0
        );
        } // callseq 22
        trap;
        // begin inline asm
        exit;
        // end inline asm
$L__BB0_11:                             // %L144
        { // callseq 23, 0
        .reg .b32 temp_param_reg;
        .param .align 8 .b8 param0[16];
        st.param.b64    [param0+0], %rd72;
        st.param.b32    [param0+8], %r4;
        call.uni
        julia_throw_boundserror_20419,
        (
        param0
        );
        } // callseq 23
        trap;
        // begin inline asm
        exit;
        // end inline asm
$L__BB0_12:                             // %L226
        { // callseq 24, 0
        .reg .b32 temp_param_reg;
        .param .align 8 .b8 param0[16];
        st.param.b64    [param0+0], %rd72;
        st.param.b32    [param0+8], %r4;
        call.uni
        julia_throw_boundserror_20419,
        (
        param0
        );
        } // callseq 24
        trap;
        // begin inline asm
        exit;
        // end inline asm
$L__BB0_3:                              // %L41
        { // callseq 25, 0
        .reg .b32 temp_param_reg;
        .param .align 8 .b8 param0[16];
        st.param.b64    [param0+0], %rd72;
        st.param.b32    [param0+8], %r4;
        call.uni
        julia_throw_boundserror_20419,
        (
        param0
        );
        } // callseq 25
        trap;
        // begin inline asm
        exit;
        // end inline asm
                                        // -- End function
}

Additionally, inbounds_kernel! uses many fewer registers than noinbounds_kernel!

julia> CUDA.registers(@cuda launch=false inbounds_kernel!(P, G, M, I))
38

julia> CUDA.registers(@cuda launch=false noinbounds_kernel!(P, G, M, I))
54

NSight Compute also shows that inbounds_kernel! executes about half the instructions (15.328.188) compared to noinbounds_kernel! (30.093.820). Based on all of this, I would expect the inbounds version to perform better. Then again, the kernels are very memory-heavy (and absolutely not optimised in that sense). Also, for some reason noinbounds_kernel! has a higher L1/TEX Cache hit rate of 3.66% compared to 1.35% for inbounds_kernel!.

I find it very hard to draw any conclusions from this. I guess the main takeaway is that for optimal performance you should default to @inbounds while/for ..., when proven safe, and perhaps check if some hypothetical @nounroll @inbounds would work better? Though I imagine the results will be device (/architecture) dependent?
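A minimal CPU-side sketch of the "when proven safe" pattern mentioned above: validate the indices once up front, then skip the per-access checks inside the loop. The function name scale_selected! is made up for illustration and is not part of the kernels in this thread. In a GPU kernel the validation would presumably have to happen on the host, or be guaranteed by construction.

```julia
# Hypothetical helper: scale the selected columns of A by s.
function scale_selected!(A::AbstractMatrix, cols::AbstractVector{<:Integer}, s::Number)
    # One up-front check covers every access made in the loops below;
    # it throws a BoundsError if any entry of cols is not a valid column.
    checkbounds(A, :, cols)
    for j in cols
        # Safe to elide checks: i comes from axes(A, 1), j was validated above.
        @inbounds for i in axes(A, 1)
            A[i, j] *= s
        end
    end
    return A
end

A = ones(3, 4)
scale_selected!(A, [2, 4], 2.0)
```

Whether the @inbounds version is actually faster still has to be measured, as the rest of this thread shows.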

maleadt · March 10, 2025, 6:24am

It’s often much easier to look at the LLVM IR than at native code, if anything because it contains better source code references.

That makes me think something else is at play here. With excessive unrolling, I wouldn’t expect the amount of executed instructions to balloon (that much). I’d take a closer look at the LLVM IRs, which should be more readable and hopefully make it clear where the additional operations come from.
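For readers who want to follow this suggestion, a minimal sketch of dumping the LLVM IR of a kernel with CUDA.jl's reflection macros. The kernel axpy_kernel! is a hypothetical stand-in, not one of the kernels from this thread, and the snippet needs a CUDA-capable GPU.

```julia
using CUDA

# Hypothetical stand-in kernel; replace with the kernel under investigation.
function axpy_kernel!(y, x, a)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(y)
        @inbounds y[i] += a * x[i]
    end
    return nothing
end

x = CUDA.rand(Float32, 1_000_000)
y = CUDA.rand(Float32, 1_000_000)

# Compile without launching and print the LLVM IR for the kernel.
@device_code_llvm @cuda launch=false axpy_kernel!(y, x, 2f0)
```

The same pattern works with @device_code_ptx and @device_code_sass to compare the lower-level output of the two variants.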

eldee · March 10, 2025, 6:20pm

I’ve attached the generated LLVM code, which is too long to insert directly into a Discourse post. These are just .txt files, but since this is not an allowed extension, I’ve added .jl.
inbounds_llvm.txt.jl (13.9 KB)
noinbounds_llvm.txt.jl (40.8 KB)

Thanks for continuing the investigation!

maleadt · March 10, 2025, 9:05pm

Oh, I switched things around when looking at the inbounds/noinbounds PTX kernels. The generated code looks entirely as expected: with @inbounds, everything is simpler, uses fewer registers, and executes fewer instructions. I have no idea why that kernel would then proceed to execute slower than the one without @inbounds. Are you sure your measurements are correct? CUDA.@elapsed can be inaccurate if it includes CPU latency, can you try using CUDA.@profile (or report the timings from NSight Compute) instead? Note that you can also use CUDA.@bprofile to execute a kernel multiple times.
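A minimal sketch of the two profiling macros mentioned here, again with a hypothetical stand-in kernel (axpy_kernel!) rather than the kernels from this thread. It assumes a recent CUDA.jl with the integrated profiler and a CUDA-capable GPU. The time keyword is the one used later in this thread.

```julia
using CUDA

# Hypothetical stand-in kernel; replace with the kernel under investigation.
function axpy_kernel!(y, x, a)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(y)
        @inbounds y[i] += a * x[i]
    end
    return nothing
end

x = CUDA.rand(Float32, 1_000_000)
y = CUDA.rand(Float32, 1_000_000)
threads = 512
blocks = cld(length(y), threads)

# Single profiled launch: reports device-side kernel time.
single = CUDA.@profile @cuda threads=threads blocks=blocks axpy_kernel!(y, x, 2f0)
display(single)

# Repeated launches for about one second, with a time distribution per kernel.
# Note that y is updated in place on every launch.
repeated = CUDA.@bprofile time=1.0 @cuda threads=threads blocks=blocks axpy_kernel!(y, x, 2f0)
display(repeated)
```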

eldee · March 11, 2025, 8:32am

Yes, indeed, it’s very strange that noinbounds_kernel! would be faster than inbounds_kernel!.

For the original timing method I chose CUDA.@elapsed because it allows me to use different inputs every iteration, while within each iteration noinbounds_kernel! and inbounds_kernel! see the same data. For the timings below, I’m just using a single random (P, G, M, I). To be more precise, the noinbounds and inbounds variants of the kernel get the same data, so they can be compared, but the CUDA.@bprofile, Nsight Compute and CUDA.@profile runs each work on different data. (Additionally, in principle the kernels modify P and M in place, so running them multiple times as in @bprofile might result in Infs and/or NaNs. Luckily I messed up by using G = similar(P) instead of G = CUDA.rand(size(P)...), so that in practice G consists of zeros, and P and M (also zeros) don’t actually change.)

With all that out of the way, here are the results from the different timing methods:

CUDA.@profile with the internal profiler yields wildly varying results between runs, making it hard to draw conclusions
CUDA.@bprofile: fairly consistent 24.84 ms ± 0.2 for noinbounds_kernel!, 25.23 ms ± 0.25 for inbounds_kernel!
Nsight Compute: 25.40 ms for noinbounds_kernel!, 26.04 ms for inbounds_kernel!.
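To put these numbers in perspective, a small worked calculation of the relative slowdown of inbounds_kernel!, using only the mean timings listed above. With these inputs the slowdown comes out at roughly 1.6 % for CUDA.@bprofile and 2.5 % for Nsight Compute, well below the 10%-20% seen in the full code.

```julia
# Timings in milliseconds, copied from the list above.
timings = (
    bprofile = (noinbounds = 24.84, inbounds = 25.23),
    nsight   = (noinbounds = 25.40, inbounds = 26.04),
)

# Relative slowdown of the @inbounds kernel, in percent.
slowdown_percent(t) = 100 * (t.inbounds - t.noinbounds) / t.noinbounds

for (method, t) in pairs(timings)
    println(method, ": ", round(slowdown_percent(t); digits = 2), " %")
end
```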

MrBelette · March 24, 2025, 8:00pm

For comparison, here are timings from an RTX 4090 using julia -O3.
run 1:
Inbounds:
Min: 4.330495838075876
Median: 6.654975935816765
Mean: 7.263148283120245

No inbounds:
Min: 5.47020835801959
Median: 6.82158418931067
Mean: 7.452108647208661

run 2:
Inbounds:
Min: 4.324351903051138
Median: 8.699247613549232
Mean: 8.305402742698789

No inbounds:
Min: 5.510943941771984
Median: 8.863743394613266
Mean: 8.430402343627065

Mike

MrBelette · March 25, 2025, 1:52am

Interesting … I now see similar, perhaps more unusual, behavior.
Not sure why I’m seeing it now - hopefully I haven’t messed something up!

Inbounds:
Min: 6.000639870762825
Max: 7.096319925040007
Median: 6.125584011897445
Mean: 6.243158718571067

No inbounds:
Min: 2.627455862239003
Max: 3.330976003780961
Median: 2.7030720375478268
Mean: 2.752271997742355

Looking at the SASS code, I see that the inbounds kernel is unrolled and the noinbounds kernel isn’t.
I see additional code for predicate setting and potentially branching in the noinbounds code.

However, ultimately, these are kernels with only 3 instructions or so performing computational calculations. Almost everything is about memory calculations.

If I increase the number of threads to 1024 for the inbounds kernel and keep the noinbounds kernel at 512 threads, I see:
Inbounds:
Min: 2.711615990847349
Max: 32.5007364153862
Median: 5.138239823281765
Mean: 4.760638464475051

No inbounds:
Min: 2.6798078324645758
Max: 29.821792617440224
Median: 4.9677761271595955
Mean: 4.616520234150812

Perhaps this is what it takes to maximize performance for this memory bound kernel?

If I use CUDA.@bprofile with time=5.0, I get:
inbounds 512 threads:
Time (%) │ Total time │ Calls │ Time distribution
99.76% │ 4.99 s │ 826 │ 6.05 ms ± 0.13 ( 5.99 ‥ 7.94)

inbounds 1024 threads:
Time (%) │ Total time │ Calls │ Time distribution
99.48% │ 4.98 s │ 1807 │ 2.75 ms ± 1.67 ( 2.61 ‥ 29.73)

no inbounds 512 threads:
Time (%) │ Total time │ Calls │ Time distribution
99.22% │ 4.96 s │ 1840 │ 2.7 ms ± 1.28 ( 2.58 ‥ 28.96)

no inbounds 1024 threads:
Time (%) │ Total time │ Calls │ Time distribution
99.30% │ 4.97 s │ 1623 │ 3.06 ms ± 2.18 ( 2.84 ‥ 42.5)

no inbounds 756 threads:
Time (%) │ Total time │ Calls │ Time distribution
99.23% │ 4.96 s │ 2290 │ 2.17 ms ± 1.15 ( 2.05 ‥ 27.52)

My conclusion from this is that perhaps one needs to be careful with memory-bound kernels. The ptxas compiler will try to generate the most performant code, but it may still be up to the user to choose appropriate launch conditions.
It would be good if someone could confirm these observations.
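As a rough illustration of how launch conditions and register usage interact, here is a back-of-the-envelope estimate of theoretical occupancy. It is only a sketch under stated assumptions: the per-SM limits (65,536 registers, 1,536 resident threads, 16 resident blocks) are assumed values for compute capability 8.6 and need adjusting for other GPUs, register allocation granularity and shared memory are ignored, and the register counts 38 and 54 are the ones reported earlier in this thread for the RTX 3070. The function name theoretical_occupancy is made up for illustration.

```julia
# Fraction of the SM's thread slots that can be resident at once.
# The keyword defaults are assumed per-SM limits, not queried from a device.
function theoretical_occupancy(regs_per_thread, threads_per_block;
                               regs_per_sm = 65_536,
                               max_threads_per_sm = 1_536,
                               max_blocks_per_sm = 16)
    by_regs = div(regs_per_sm, regs_per_thread * threads_per_block)
    by_threads = div(max_threads_per_sm, threads_per_block)
    blocks = min(by_regs, by_threads, max_blocks_per_sm)
    return blocks * threads_per_block / max_threads_per_sm
end

for (name, regs) in (("inbounds", 38), ("noinbounds", 54)), threads in (512, 1024)
    occ = theoretical_occupancy(regs, threads)
    println(name, " @ ", threads, " threads: ", round(100 * occ; digits = 1), " %")
end
```

Under these assumptions the 38-register kernel fits three blocks of 512 threads per SM and the 54-register kernel only two, so theoretical occupancy alone would not explain the noinbounds kernel being faster. For a memory-bound kernel, higher occupancy does not necessarily mean higher throughput, which is consistent with having to tune the thread count empirically.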
