While playing around with a function, I noticed that adding a bounds check (si_check) had a drastic impact on its performance. For N <= 18 there is no noticeable performance impact, and the emitted LLVM IR shows that most exceptions from @nextract ... have been elided. For N > 18, however (my benchmarks use N = 56), using @inbounds improves performance, while the bounds check drastically slows the function down. Apart from the exceptions, the output of @code_llvm is also very different between the two versions.
Is this expected? Should I file an issue?
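For anyone who wants to compare the IR of two variants side by side, here is a minimal sketch. It uses two hypothetical toy kernels (sum_checked and sum_inbounds, not the filter from this post) and captures the @code_llvm output as strings via code_llvm from InteractiveUtils. The line count and the search for a vector type are only rough proxies I picked for illustration.

```julia
using InteractiveUtils  # provides code_llvm outside the REPL

# Hypothetical toy kernels for illustration only.
function sum_checked(v)
    s = zero(eltype(v))
    for i in 1:length(v)
        s += v[i]            # ordinary, bounds-checked indexing
    end
    return s
end

function sum_inbounds(v)
    s = zero(eltype(v))
    for i in 1:length(v)
        s += @inbounds v[i]  # the compiler may skip the bounds check here
    end
    return s
end

# Capture the optimized IR as a string, without debug-info comments.
ir(f) = sprint(io -> code_llvm(io, f, (Vector{Float64},); debuginfo=:none))

ir_checked = ir(sum_checked)
ir_inbounds = ir(sum_inbounds)

# Rough proxies for "how different": number of IR lines and whether
# a SIMD vector of doubles appears anywhere in the IR.
for (name, s) in (("checked", ir_checked), ("inbounds", ir_inbounds))
    println(name, ": ", count(==('\n'), s), " lines, vector ops = ",
            occursin("x double>", s))
end
```

The same ir helper could be pointed at the two versions of the real function by passing the matching argument types.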
@generated function _filt_fir!(out, b::NTuple{N,T}, x, siarr, col) where {N,T}
    silen = N - 1
    si_end = Symbol(:si_, silen)
    SMALL_FILT_VECT_CUTOFF = 18
    # Length check on siarr, only emitted when N is at or below the cutoff
    si_check = N > SMALL_FILT_VECT_CUTOFF ? :(nothing) : :(@assert length(siarr) == $silen)
    q = quote
        $si_check
        Base.@nextract $silen si siarr
        for i in axes(x, 1)
            xi = x[i, col]
            val = muladd(xi, b[1], si_1)
            Base.@nexprs $(silen-1) j -> (si_j = muladd(xi, b[j+1], si_{j+1}))
            $si_end = b[N] * xi
            out[i, col] = val
        end
    end
    # For large N, wrap the x[i, col] load and the out[i, col] store in @inbounds.
    # q.args[6] is the for loop; entries 2 and 10 of its body are those two statements.
    if N > SMALL_FILT_VECT_CUTOFF
        loop_args = q.args[6].args[2].args
        for i in (2, 10)
            loop_args[i] = :(@inbounds $(loop_args[i]))
        end
    end
    q
end
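The indexing q.args[6].args[2].args relies on the line-number nodes that quote inserts before every statement. A minimal sketch with a reduced stand-in block (the symbols check, extract and update are hypothetical placeholders for the macro calls above) shows which statements indices 2 and 10 select:

```julia
# Reduced stand-in for the quoted body; nothing here is evaluated,
# the block is only inspected as an expression tree.
q = quote
    check
    extract
    for i in axes(x, 1)
        xi = x[i, col]
        val = muladd(xi, b[1], si_1)
        update
        si_end = b[N] * xi
        out[i, col] = val
    end
end

# Each statement in a quote block is preceded by a LineNumberNode,
# so the third statement (the for loop) sits at index 6.
loop = q.args[6]

# A :for expression holds the iteration spec in args[1] and the body in args[2].
loop_args = loop.args[2].args

println(loop_args[2])   # the load from x
println(loop_args[10])  # the store to out
```

Because these positions shift whenever a statement is added or removed, a change to the quoted body would need the indices (2, 10) updated as well.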
Benchmark with the function above unmodified, so there is no assert ... in the body:
julia> x = rand(10_000); out = similar(x);
julia> a = 1.; b = Tuple(rand(56)); si = zeros(55);
julia> @benchmark _filt_fir!($out, $b, $x, $si, $1)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  26.900 μs … 112.600 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     27.000 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   27.320 μs ±   1.496 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

 26.9 μs  Histogram: log(frequency) by time  30.3 μs <

Memory estimate: 0 bytes, allocs estimate: 0.
Benchmark after changing si_check to an unconditional @assert ...:
julia> @benchmark _filt_fir!($out, $b, $x, $si, $1)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  68.400 μs … 109.700 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     68.500 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   68.968 μs ±   1.968 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

 68.4 μs  Histogram: log(frequency) by time  76 μs <

Memory estimate: 0 bytes, allocs estimate: 0.
Platform details (uses AVX-512 instructions):
julia> versioninfo()
Julia Version 1.10.0
Commit 3120989f39 (2023-12-25 18:01 UTC)
Build Info:
Official https://julialang.org/ release
Platform Info:
OS: Windows (x86_64-w64-mingw32)
CPU: 8 × 11th Gen Intel(R) Core(TM) i5-1135G7 @ 2.40GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-15.0.7 (ORCJIT, tigerlake)
Threads: 11 on 8 virtual cores
Environment:
JULIA_CONDAPKG_BACKEND = Null
JULIA_NUM_THREADS = auto