Bounds check outside loop affects loop performance

While playing around with a function, I noticed that adding a length check before the loop (si_check, an @assert on length(siarr)) had a drastic impact on its performance. For N <= 18 the check has no noticeable cost, and the emitted LLVM IR shows that most exceptions from @nextract ... have been elided. For N > 18 (my benchmarks use N = 56), using @inbounds improves performance, but adding the check slows the function down drastically: the median time goes from 27.0 μs to 68.5 μs. Apart from the exception branches, the output of @code_llvm is also very different between the two versions.
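For anyone who wants to repeat this kind of IR comparison, here is a minimal sketch on a toy pair of functions rather than the filter itself. The helpers llvm_ir and count_bounds_errors are hypothetical names made up for illustration, and the regex is an assumption, since the names of the bounds-error symbols in the IR differ between Julia versions.

```julia
using InteractiveUtils  # provides code_llvm outside the REPL

# Hypothetical helper: return the optimized LLVM IR of f for the given
# argument types as a String.
function llvm_ir(f, types)
    io = IOBuffer()
    code_llvm(io, f, types; debuginfo=:none)
    return String(take!(io))
end

# Hypothetical helper: count textual references to bounds-error throws.
# The pattern is an assumption; symbol names vary across Julia versions.
count_bounds_errors(ir) = count(r"bounds_?error", ir)

# Toy stand-ins, not the filter from this post.
checked(v, i) = v[i]
unchecked(v, i) = @inbounds v[i]

sig = (Vector{Float64}, Int)
n_checked = count_bounds_errors(llvm_ir(checked, sig))
n_unchecked = count_bounds_errors(llvm_ir(unchecked, sig))
println("bounds-error references: checked = ", n_checked,
        ", unchecked = ", n_unchecked)
```

Counting text matches is only a rough signal. It says nothing about how the rest of the loop was vectorized, so the full IR still has to be read side by side.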

Is this expected? Should I file an issue?

@generated function _filt_fir!(out, b::NTuple{N,T}, x, siarr, col) where {N,T}
    silen = N - 1
    si_end = Symbol(:si_, silen)
    SMALL_FILT_VECT_CUTOFF = 18
    si_check = N > SMALL_FILT_VECT_CUTOFF ? :(nothing) : :(@assert length(siarr) == $silen)

    q = quote
        $si_check
        Base.@nextract $silen si siarr
        for i in axes(x, 1)
            xi = x[i, col]
            val = muladd(xi, b[1], si_1)
            Base.@nexprs $(silen-1) j -> (si_j = muladd(xi, b[j+1], si_{j+1}))
            $si_end = b[N] * xi
            out[i, col] = val
        end
    end

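    if N > SMALL_FILT_VECT_CUTOFF
        # q.args[6] is the for loop; its body holds the statements at the even
        # indices (the odd ones are LineNumberNodes). Entries 2 and 10 are
        # `xi = x[i, col]` and `out[i, col] = val`, which get wrapped in @inbounds.
        loop_args = q.args[6].args[2].args
        for i in (2, 10)
            loop_args[i] = :(@inbounds $(loop_args[i]))
        end
    end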
    q
end
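To make the generated body easier to follow, here is a hand-written sketch of one loop iteration for N = 3, using the same Base.Cartesian macros. The name fir_step_demo is hypothetical and it is not part of the function above. It only shows what @nextract and @nexprs expand to.

```julia
# One filter step for a 3-tap filter (silen = 2), written out by hand.
function fir_step_demo(xi, b::NTuple{3,Float64}, siarr)
    # Expands to: si_1 = siarr[1]; si_2 = siarr[2]
    Base.Cartesian.@nextract 2 si siarr
    val = muladd(xi, b[1], si_1)
    # Expands to: si_1 = muladd(xi, b[2], si_2)
    Base.Cartesian.@nexprs 1 j -> (si_j = muladd(xi, b[j+1], si_{j+1}))
    si_2 = b[3] * xi
    return val, (si_1, si_2)
end

val, state = fir_step_demo(2.0, (1.0, 2.0, 3.0), [10.0, 20.0])
println(val, " ", state)
```

With these inputs the output value is 2 * 1 + 10 = 12, and the updated state is (2 * 2 + 20, 3 * 2) = (24, 6).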

Benchmarks with _filt_fir! exactly as written above. For N = 56 there is no assert ... in the body.

julia> x = rand(10_000); out = similar(x);

julia> a = 1.; b = Tuple(rand(56)); si = zeros(55);

julia> @benchmark _filt_fir!($out, $b, $x, $si, $1)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  26.900 μs … 112.600 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     27.000 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   27.320 μs ±   1.496 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ██ ▁        ▆ ▄                      ▁▄                      ▂
  ██▁█▁█▁▇▇▁▇▁█▁█▇▁▅▁▄▁▃▅▁▆▁▇▁▇▆▁▆▁▆▁▆▁██▁█▁█▁▇▇▁▇▁▇▁▇▆▁▆▁█▁▇▄ █
  26.9 μs       Histogram: log(frequency) by time      30.3 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Benchmark after changing si_check to an unconditional @assert ...

julia> @benchmark _filt_fir!($out, $b, $x, $si, $1)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  68.400 μs … 109.700 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     68.500 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   68.968 μs ±   1.968 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █▃                                         ▂▁                ▁
  ██▃▁▃▁▃▃▁▁▃▆▅▄▆▅▆▆▇▆▆▅▆▆▅▅▇▅▆▇▆▅▅▅▅▄▅▆▅▆▅▅▆██▅▆▄▆▅▆▆▇▆▄▄▅▄▆▅ █
  68.4 μs       Histogram: log(frequency) by time        76 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Platform details (AVX-512 instructions are used):

julia> versioninfo()
Julia Version 1.10.0
Commit 3120989f39 (2023-12-25 18:01 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 8 × 11th Gen Intel(R) Core(TM) i5-1135G7 @ 2.40GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, tigerlake)
  Threads: 11 on 8 virtual cores
Environment:
  JULIA_CONDAPKG_BACKEND = Null
  JULIA_NUM_THREADS = auto