When does @inbounds increase performance?

leespen1 · December 14, 2024, 12:16am

I am interested in making some optimizations in my software. I noticed many packages make use of the @inbounds macro, and the Julia manual mentions @inbounds in its performance tips section (Performance Tips · The Julia Language).

Before going through the effort of adding @inbounds to my code, I decided to write a script to test how much of an effect @inbounds has on performance:

using BenchmarkTools

function add_one!(x)
    for i in 1:length(x)
        x[i] += 1
    end
end

function add_one_inbounds!(x)
    for i in 1:length(x)
        @inbounds x[i] += 1
    end
end

function main()
    x = rand(1_000_000)
    println("Time taken for non-inbounds version:")
    @btime add_one!($x)
    println("\nTime taken for inbounds version:")
    @btime add_one_inbounds!($x)
end

main()

This is the output I got from running the script:

Time taken for non-inbounds version:
  92.003 μs (0 allocations: 0 bytes)

Time taken for inbounds version:
  91.842 μs (0 allocations: 0 bytes)

The speedup is negligible, and when I rerun the script the inbounds version is sometimes even slower than the non-inbounds version.

So overall, it seems to me that there is no noticeable speedup, certainly not enough to make me want to risk uncaught memory errors by using @inbounds, even in applications where performance is critical. And in this case the amount of work being done per index check is minimal: I am just adding 1 to each element. In real applications I would probably be doing several FLOPs per array access.

So why do people use @inbounds? Is my test flawed, and performance gains are actually much higher in realistic applications? Could someone provide a minimal example where using @inbounds noticeably increases performance?

xiaodai · December 14, 2024, 12:50am

FWIW I do see a difference

Time taken for non-inbounds version:
  515.100 μs (0 allocations: 0 bytes)

Time taken for inbounds version:
  446.500 μs (0 allocations: 0 bytes)

danielwe · December 14, 2024, 1:37am

No measurable difference means that the compiler was able to prove that all accesses are inbounds and thus remove the runtime check even when you didn’t ask for it. To observe the difference you need to hide the index values better from the compiler, for example like this:

using BenchmarkTools

function add_one!(x, indices)
    for i in indices
        x[i] += 1
    end
end

function add_one_inbounds!(x::AbstractVector, indices::StepRange)
    xifirst, xilast = firstindex(x), lastindex(x)
    ifirst, ilast = first(indices), last(indices)
    if !(xifirst <= ifirst <= xilast) || !(xifirst <= ilast <= xilast)
        throw(BoundsError(x, indices))
    end
    for i in indices
        @inbounds x[i] += 1
    end
end

function main()
    x = rand(1_000_000)
    iref = Ref(1:2:1_000_000)
    println("Time taken for non-inbounds version:")
    @btime add_one!($x, ($iref)[])
    println("\nTime taken for inbounds version:")
    @btime add_one_inbounds!($x, ($iref)[])
end

main()

julia> main()
Time taken for non-inbounds version:
  446.960 μs (0 allocations: 0 bytes)

Time taken for inbounds version:
  405.824 μs (0 allocations: 0 bytes)

In other words, @inbounds is useful in cases where you’re smarter than the compiler, that is, you know that all accesses are inbounds even though the compiler can’t see it.

It’s important to only use @inbounds in contexts where you know it’s never incorrect, otherwise you could easily trigger segfaults or worse. Hence the manual bounds checks at the top of add_one_inbounds!, as well as the type constraints that match the assumptions in those bounds checks.
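
A minimal sketch of the same "check once, then skip the per-access checks" idea, using Base's checkbounds to validate the whole range in a single call instead of hand-writing the comparison. The name add_one_checked_once! is made up for illustration, and the AbstractRange constraint is an assumption chosen so that the up-front check stays cheap.

```julia
# Hypothetical helper: validate every index once, then index without checks.
function add_one_checked_once!(x::AbstractVector, indices::AbstractRange{<:Integer})
    # Throws a BoundsError unless all indices are valid for x.
    # An empty range is accepted, since no element is accessed.
    checkbounds(x, indices)
    for i in indices
        @inbounds x[i] += 1
    end
    return x
end

x = zeros(10)
add_one_checked_once!(x, 1:2:9)     # touches x[1], x[3], ..., x[9]

# An out-of-range request fails before any element is modified.
try
    add_one_checked_once!(x, 1:2:11)
catch err
    println("caught: ", typeof(err))
end
```

The type constraint matters here: with an arbitrary index vector the up-front check would have to visit every index, which may cost as much as the checks it replaces.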

mbauman · December 14, 2024, 2:37am

The biggest differences will come when the bounds check is the last thing standing in the way of a SIMD optimization. That’s when you may see a 2x or 4x or more benefit — but again, only when you also managed to outsmart the compiler. And the compiler is getting pretty smart these days. Without SIMD in the mix, bounds checks — even if the compiler didn’t remove them for you — can often be surprisingly cheap (or even ~free) thanks to your processor’s branch predictor. It’s an extremely predictable branch!
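
A sketch of the kind of loop where this might apply, assuming a two-array reduction: the loop range comes from x, so the compiler has no proof that y[i] is valid, and the remaining check on y may block vectorization. The function names are made up for illustration, and whether you see any difference depends on the Julia version and CPU, so benchmark before relying on it.

```julia
# Checked version: y[i] keeps its bounds check because the loop range is derived from x only.
function dot_checked(x::Vector{Float64}, y::Vector{Float64})
    s = 0.0
    for i in eachindex(x)
        s += x[i] * y[i]
    end
    return s
end

# Unchecked version: one explicit length check up front makes @inbounds safe.
# @simd additionally lets the compiler reorder the floating-point sum,
# so results can differ from the checked version by rounding.
function dot_unchecked(x::Vector{Float64}, y::Vector{Float64})
    length(x) == length(y) || throw(DimensionMismatch("x and y must have equal length"))
    s = 0.0
    @inbounds @simd for i in eachindex(x)
        s += x[i] * y[i]
    end
    return s
end

x = collect(1.0:100.0)
y = fill(2.0, 100)
println(dot_checked(x, y), " ", dot_unchecked(x, y))
```

Running both through @btime with interpolated arguments, as in the scripts above, is the way to find out whether the check matters on a given machine.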

bertschi · December 14, 2024, 2:43pm

You probably know that anyways, but indexing via 1:length(x) does not always work as intended and thus, @inbounds can lie unexpectedly. (There have been several discussions around OffsetArrays for instance).
Thus, the proper fix to the above code would be using eachindex(x) or axes(x, 1) instead. From what I understand, the compiler can then infer @inbounds automatically and thus the code becomes faster and more general at the same time (as it should with a good abstraction).

giordano · December 14, 2024, 3:55pm

That’s not necessarily true, but it does happen in many cases. Using abstractions like eachindex or axes is a good idea anyway.
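
For reference, a small sketch of the eachindex form of the original loop. The name add_one_generic! is made up for illustration; the point is only that the loop asks the array for its own valid indices instead of assuming they are 1:length(x).

```julia
# Iterate over whatever indices the array itself reports as valid.
function add_one_generic!(x::AbstractArray)
    for i in eachindex(x)
        x[i] += 1
    end
    return x
end

v = zeros(4)
add_one_generic!(view(v, 2:3))   # works on a view; only v[2] and v[3] change

m = zeros(2, 2)
add_one_generic!(m)              # works on a matrix as well
println(v, " ", m)
```

For a vector, axes(x, 1) gives the same index range as eachindex(x).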

jishnub · December 14, 2024, 4:31pm

E.g. a case where this optimization doesn’t happen is

danielwe · December 14, 2024, 6:02pm

Using eachindex is always good, but what the OP’s example shows is that the compiler can optimize away bounds checks even when using 1:length(x), at least when x is a Vector. I think this may even happen in an LLVM loop unswitching pass, and I don’t think eachindex exists at the LLVM level.

leespen1 · December 15, 2024, 1:48am

I’m aware that using the above code could lead to incorrect behavior when the function argument is not a regular array. But my goal is to figure out when/if using @inbounds is beneficial, so I kept things simple. In fact, I was afraid that using eachindex(x) would lead to a compiler optimization like the one you mention. Using 1:length(x) was my attempt at preventing that from happening. But, as others have stated, it seems the compiler is smart enough to infer @inbounds even when using 1:length(x).

leespen1 · December 15, 2024, 3:15am

Thanks for all the advice everyone, this was very helpful!

I made the following update to my script to try to prevent the compiler from inferring @inbounds (and also to show the impact of @simd). I also ran it at different compiler optimization levels.

using BenchmarkTools

function add_one!(x, indices)
    for i in indices
        x[i] += 1.0
    end
end

function add_one_inbounds!(x, indices)
    for i in indices
        @inbounds x[i] += 1.0
    end
end


function add_one_simd!(x, indices)
    @simd for i in indices
        x[i] += 1.0
    end
end


function add_one_simd_inbounds!(x, indices)
    @simd for i in indices
        @inbounds x[i] += 1.0
    end
end

N = 10
x = rand(N)
rand_indices = rand(1:N, 100*N)
println("Non-inbounds version:")
display(@benchmark add_one!($x, $rand_indices))
println("\nInbounds version:")
display(@benchmark add_one_inbounds!($x, $rand_indices))
println("\nSimd (no inbounds) version:")
display(@benchmark add_one_simd!($x, $rand_indices))
println("\nSimd with inbounds version:")
display(@benchmark add_one_simd_inbounds!($x, $rand_indices))

Script Results (O1 Optimization)

Non-inbounds version:
BenchmarkTools.Trial: 10000 samples with 9 evaluations.
 Range (min … max):  2.917 μs …  4.357 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     2.942 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.958 μs ± 74.315 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▄█▇▇█▆▅▃  ▂▄▃▃▃▂                              ▁▁ ▁         ▂
  ████████▇▇███████▃▄▅▃▃▃▃▁▃▄▁▃▄▃▄▄▄▁▃▁▄▄▃▄▃▅▅▅▅██████▇▆▅▃▆▆ █
  2.92 μs      Histogram: log(frequency) by time      3.3 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Inbounds version:
BenchmarkTools.Trial: 10000 samples with 9 evaluations.
 Range (min … max):  2.409 μs …  5.131 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     2.436 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.447 μs ± 81.676 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

    ▁█▄                                                       
  ▃████▅▂▂▂▂▂▂▂▂▂▂▁▂▂▂▂▁▂▂▂▂▂▂▁▂▁▂▂▂▂▂▁▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂ ▂
  2.41 μs        Histogram: frequency by time        2.84 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Simd (no inbounds) version:
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
 Range (min … max):  1.995 μs …  3.896 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     2.007 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.021 μs ± 65.587 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▁█▁                                                         
  ████▇▄▃▂▂▂▂▂▁▂▂▁▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▂▂▂▂▁▂▂▂▁▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂ ▂
  1.99 μs        Histogram: frequency by time        2.33 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Simd with inbounds version:
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
 Range (min … max):  1.497 μs …  3.832 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     1.505 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.515 μs ± 78.196 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▇█▇▅▂                                                      ▂
  █████▆▅▆▅▆▄▁▃▃▃▄▃▄▃▁▃▃▄▄▁▃▁▃▁▁▄▁▃▁▃▁▄▁▃▃▁▃▁▄▅▄▄▆▆▆▆▄▅▇██▇▆ █
  1.5 μs       Histogram: log(frequency) by time     1.79 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Script Results (O2 Optimization)

Non-inbounds version:
BenchmarkTools.Trial: 10000 samples with 192 evaluations.
 Range (min … max):  514.146 ns … 753.661 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     520.198 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   525.612 ns ±  28.248 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▄█▅  ▄▃▁ ▁                                                  ▁ ▁
  ███▆██████▆▃▅▇▄▁▄▁▁▃▁▁▃▄▁▁▃▄▃▁▁▁▁▁▁▁▁▁▃▃▁▁▁▁▁▃▁▃▃▁▁▁▁▁▁▁▁▁▁▁█ █
  514 ns        Histogram: log(frequency) by time        725 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

Inbounds version:
BenchmarkTools.Trial: 10000 samples with 198 evaluations.
 Range (min … max):  444.323 ns …  1.088 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     454.086 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   474.327 ns ± 65.738 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▆██▃▃▄▄▃▁                                             ▁▂▁▁   ▂
  █████████████▇▇▇▆▆▆▆▆▆▆▇▆▆▆▆▅▅▅▅▆▅▅▄▅▅▅▅▁▅▄▅▅▁▆▅▆▇██▇██████▇ █
  444 ns        Histogram: log(frequency) by time       729 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

Simd (no inbounds) version:
BenchmarkTools.Trial: 10000 samples with 192 evaluations.
 Range (min … max):  511.531 ns …  1.013 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     516.443 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   520.321 ns ± 18.245 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

   ▃▆██▆▄▂       ▂▃▃▂▂▁▁      ▁                                ▂
  ▆████████▇▆███▇█████████▆█████▆▄▄▄▃▅▄▆▅▆▆▆▆▆▄▅▅▁▃▃▃▄▁▃▁▁▄▁▃▄ █
  512 ns        Histogram: log(frequency) by time       579 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

Simd with inbounds version:
BenchmarkTools.Trial: 10000 samples with 198 evaluations.
 Range (min … max):  446.091 ns …  1.033 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     456.263 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   458.399 ns ± 13.619 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

           ▁▂▃▆▇███▇▆▄▂▁            ▂▃▃▃▃▁                     ▂
  ▂▃▄▅█▇█▇▇█████████████▇▆▅▅▄▅▅▅▅▅▇█████████▆▆▆▆▅▆▅▅▄▃▆▅▆▇▆▇▇▇ █
  446 ns        Histogram: log(frequency) by time       484 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

Script Results (O3 Optimization)

Non-inbounds version:

BenchmarkTools.Trial: 10000 samples with 195 evaluations.
 Range (min … max):  484.600 ns … 702.913 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     495.082 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   497.254 ns ±   8.260 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

         ▂▄▅▆▆▇▇███▇▆▅▄▂▁           ▁▁▂▂▂▂▂▂▂▂▂▂▁▂▁▁▁▁          ▃
  ▃▁▄▄▆▇█████████████████▇▇▅▃▅▅▅▇▇███████████████████████▇▇▇▆▆▆ █
  485 ns        Histogram: log(frequency) by time        526 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

Inbounds version:
BenchmarkTools.Trial: 10000 samples with 192 evaluations.
 Range (min … max):  511.016 ns … 948.448 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     516.125 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   518.723 ns ±  11.223 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

       ▃█▇▃                                                      
  ▁▁▂▄▇█████▇▇▆▄▃▂▂▁▁▁▁▁▁▁▁▁▁▂▂▂▃▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
  511 ns           Histogram: frequency by time          548 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

Simd (no inbounds) version:
BenchmarkTools.Trial: 10000 samples with 195 evaluations.
 Range (min … max):  487.067 ns … 684.518 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     496.062 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   498.179 ns ±   7.463 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

         ▁▂▄▅▆▇███▇▇▆▅▄▃▁                ▁▁▂▁▂▁▂▂▂▂▂▁▁▁▁▂▁▁▁    ▃
  ▃▁▁▁▄▇██████████████████▆▆▅▃▅▁▁▅▁▅▅▇▆█████████████████████▇▇█ █
  487 ns        Histogram: log(frequency) by time        523 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

Simd with inbounds version:
BenchmarkTools.Trial: 10000 samples with 198 evaluations.
 Range (min … max):  445.384 ns … 561.561 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     455.404 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   456.835 ns ±   6.009 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

                 ▄█▆▄                                            
  ▁▁▁▁▁▁▂▂▂▂▂▂▂▄█████▇▅▃▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
  445 ns           Histogram: frequency by time          481 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

The results are clearest when using O1 optimization. There is a ~15% speedup from using @inbounds, a ~30% speedup from using @simd, and a ~50% speedup from using @simd and @inbounds.

When using O2 optimization, using @inbounds gives ~10% speedup. Using @simd but not @inbounds results in nearly the same performance as if no macros were used at all (~1% speedup), and using @simd and @inbounds results in ~13% speedup. So as before, @inbounds increases performance, and using @simd and @inbounds together increases performance the most, but overall the effects are much less noticeable than when using O1 optimization (and even the fastest O1 code is nearly 3x slower than the slowest O2 code).

For O3 optimization, using @inbounds but not @simd actually decreased performance by ~4%. On repeat runs it was sometimes faster, and I would conclude that overall there is not much difference. Using @inbounds and @simd was still fastest (nearly the same as the same code with O2 optimization).
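
The percentages above can be reproduced with a few lines of arithmetic. This is a sketch that assumes they are reductions in the mean time relative to the non-inbounds version; the post does not say which statistic was used, but the mean values from the benchmark output give figures close to the quoted ones. The helper name reduction is made up for illustration.

```julia
# Percent reduction in run time of `t` relative to the baseline `t0`.
# A negative value means `t` was slower than the baseline.
reduction(t0, t) = 100 * (1 - t / t0)

# Mean times copied from the benchmark output above (O1 in μs; O2 and O3 in ns).
means = (
    O1 = (base = 2.958,   inbounds = 2.447,   simd = 2.021,   simd_inbounds = 1.515),
    O2 = (base = 525.612, inbounds = 474.327, simd = 520.321, simd_inbounds = 458.399),
    O3 = (base = 497.254, inbounds = 518.723, simd = 498.179, simd_inbounds = 456.835),
)

for (level, t) in pairs(means)
    println(level,
            ": @inbounds ", round(reduction(t.base, t.inbounds); digits = 1), "%",
            ", @simd ", round(reduction(t.base, t.simd); digits = 1), "%",
            ", both ", round(reduction(t.base, t.simd_inbounds); digits = 1), "%")
end
```

Note that a 50% reduction in time corresponds to code that runs about 2x as fast, so "speedup" here means time saved rather than a throughput ratio.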

For those interested, I have included the native assembly code returned by @code_native syntax=:att add_one!(x, rand_indices) (when using O3 optimization). There are two vmovsd instructions and one vaddsd instruction. These use the xmm (SIMD) registers, but they are scalar instructions (the sd suffix stands for "scalar double"), so each one still handles a single element and the loop is not actually vectorized. You can also still see the bounds checking:

	leaq	-1(%r10), %rsi
	cmpq	%rdx, %rsi
	jae	.LBB0_5
[...]
.LBB0_5:                                # %L49
	movabsq	$j_throw_boundserror_4091, %rax
	leaq	-8(%rbp), %rsi
	movq	%r10, -8(%rbp)
	callq	*%rax
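
For anyone who wants to repeat this check without reading the whole listing, here is a rough sketch that renders the native code to a string and searches it for the bounds-error call. The helper name mentions_boundserror is made up, and the search string is an assumption based on the symbol name in the excerpt above; it may differ across Julia versions, and starting Julia with a --check-bounds option changes the outcome.

```julia
using InteractiveUtils  # provides code_native outside the REPL

function add_one!(x, indices)
    for i in indices
        x[i] += 1.0
    end
end

function add_one_inbounds!(x, indices)
    for i in indices
        @inbounds x[i] += 1.0
    end
end

# Hypothetical helper: true if the generated assembly mentions a bounds-error call.
function mentions_boundserror(f, argtypes)
    asm = sprint(io -> code_native(io, f, argtypes; debuginfo = :none))
    return occursin("boundserror", asm)
end

argtypes = (Vector{Float64}, Vector{Int})
println("add_one!:          ", mentions_boundserror(add_one!, argtypes))
println("add_one_inbounds!: ", mentions_boundserror(add_one_inbounds!, argtypes))
```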

add_one! code native

julia> @code_native syntax=:att add_one!(x, rand_indices)
	.text
	.file	"add_one!"
	.section	.rodata.cst8,"aM",@progbits,8
	.p2align	3, 0x0                          # -- Begin function julia_add_one!_4075
.LCPI0_0:
	.quad	0x3ff0000000000000              # double 1
	.text
	.globl	"julia_add_one!_4075"
	.p2align	4, 0x90
	.type	"julia_add_one!_4075",@function
"julia_add_one!_4075":                  # @"julia_add_one!_4075"
; Function Signature: add_one!(Array{Float64, 1}, Array{Int64, 1})
; ┌ @ /home/spencer/Research/QuantumGateDesign.jl/examples/inbounds.jl:4 within `add_one!`
# %bb.0:                                # %top
; │ @ /home/spencer/Research/QuantumGateDesign.jl/examples/inbounds.jl within `add_one!`
	#DEBUG_VALUE: add_one!:x <- [DW_OP_deref] $rdi
	#DEBUG_VALUE: add_one!:indices <- [DW_OP_deref] $rsi
	#DEBUG_VALUE: add_one!:indices <- [DW_OP_deref] 0
	#DEBUG_VALUE: add_one!:x <- [DW_OP_deref] 0
	pushq	%rbp
	movq	%rsp, %rbp
	subq	$16, %rsp
; │ @ /home/spencer/Research/QuantumGateDesign.jl/examples/inbounds.jl:5 within `add_one!`
; │┌ @ array.jl:891 within `iterate` @ array.jl:891
; ││┌ @ essentials.jl:11 within `length`
	movq	16(%rsi), %rax
; ││└
; ││┌ @ int.jl:513 within `<`
	testq	%rax, %rax
; ││└
	je	.LBB0_6
# %bb.1:                                # %L34
; ││┌ @ essentials.jl:917 within `getindex`
	movq	(%rsi), %rcx
; │└└
; │ @ /home/spencer/Research/QuantumGateDesign.jl/examples/inbounds.jl:6 within `add_one!`
; │┌ @ essentials.jl:916 within `getindex`
; ││┌ @ essentials.jl:11 within `length`
	movq	16(%rdi), %rdx
; │└└
; │ @ /home/spencer/Research/QuantumGateDesign.jl/examples/inbounds.jl:5 within `add_one!`
; │┌ @ array.jl:891 within `iterate` @ array.jl:891
; ││┌ @ essentials.jl:917 within `getindex`
	movq	(%rcx), %r10
; │└└
; │ @ /home/spencer/Research/QuantumGateDesign.jl/examples/inbounds.jl:6 within `add_one!`
; │┌ @ essentials.jl:916 within `getindex`
	leaq	-1(%r10), %rsi
	cmpq	%rdx, %rsi
	jae	.LBB0_5
# %bb.2:                                # %L52.lr.ph
	movabsq	$.LCPI0_0, %r10
	movq	(%rdi), %r8
	movl	$1, %r9d
	vmovsd	(%r10), %xmm0                   # xmm0 = mem[0],zero
	.p2align	4, 0x90
.LBB0_3:                                # %L71
                                        # =>This Inner Loop Header: Depth=1
; │└
; │┌ @ float.jl:491 within `+`
	vaddsd	(%r8,%rsi,8), %xmm0, %xmm1
; │└
; │┌ @ array.jl:976 within `setindex!`
	vmovsd	%xmm1, (%r8,%rsi,8)
; │└
; │ @ /home/spencer/Research/QuantumGateDesign.jl/examples/inbounds.jl:7 within `add_one!`
; │┌ @ array.jl:891 within `iterate`
; ││┌ @ int.jl:513 within `<`
	cmpq	%r9, %rax
; ││└
	je	.LBB0_6
# %bb.4:                                # %L104
                                        #   in Loop: Header=BB0_3 Depth=1
; ││┌ @ essentials.jl:917 within `getindex`
	movq	(%rcx,%r9,8), %r10
; │└└
; │ @ /home/spencer/Research/QuantumGateDesign.jl/examples/inbounds.jl:6 within `add_one!`
; │┌ @ essentials.jl:916 within `getindex`
	incq	%r9
	leaq	-1(%r10), %rsi
	cmpq	%rdx, %rsi
	jb	.LBB0_3
.LBB0_5:                                # %L49
	movabsq	$j_throw_boundserror_4091, %rax
	leaq	-8(%rbp), %rsi
	movq	%r10, -8(%rbp)
	callq	*%rax
.LBB0_6:                                # %L110
; │└
; │ @ /home/spencer/Research/QuantumGateDesign.jl/examples/inbounds.jl:7 within `add_one!`
	addq	$16, %rsp
	popq	%rbp
	retq
.Lfunc_end0:
	.size	"julia_add_one!_4075", .Lfunc_end0-"julia_add_one!_4075"
; └
                                        # -- End function
	.section	".note.GNU-stack","",@progbits

And below is the native assembly code for the remaining functions

add_one_inbounds! code native

julia> @code_native syntax=:att add_one_inbounds!(x, rand_indices)
	.text
	.file	"add_one_inbounds!"
	.section	.rodata.cst8,"aM",@progbits,8
	.p2align	3, 0x0                          # -- Begin function julia_add_one_inbounds!_4312
.LCPI0_0:
	.quad	0x3ff0000000000000              # double 1
	.text
	.globl	"julia_add_one_inbounds!_4312"
	.p2align	4, 0x90
	.type	"julia_add_one_inbounds!_4312",@function
"julia_add_one_inbounds!_4312":         # @"julia_add_one_inbounds!_4312"
; Function Signature: add_one_inbounds!(Array{Float64, 1}, Array{Int64, 1})
; ┌ @ /home/spencer/Research/QuantumGateDesign.jl/examples/inbounds.jl:10 within `add_one_inbounds!`
# %bb.0:                                # %top
; │ @ /home/spencer/Research/QuantumGateDesign.jl/examples/inbounds.jl within `add_one_inbounds!`
	#DEBUG_VALUE: add_one_inbounds!:x <- [DW_OP_deref] $rdi
	#DEBUG_VALUE: add_one_inbounds!:indices <- [DW_OP_deref] $rsi
	#DEBUG_VALUE: add_one_inbounds!:indices <- [DW_OP_deref] 0
	#DEBUG_VALUE: add_one_inbounds!:x <- [DW_OP_deref] 0
	pushq	%rbp
; │ @ /home/spencer/Research/QuantumGateDesign.jl/examples/inbounds.jl:11 within `add_one_inbounds!`
; │┌ @ array.jl:891 within `iterate` @ array.jl:891
; ││┌ @ essentials.jl:11 within `length`
	movq	16(%rsi), %rax
	movq	%rsp, %rbp
; ││└
; ││┌ @ int.jl:513 within `<`
	testq	%rax, %rax
; ││└
	je	.LBB0_4
# %bb.1:                                # %L34
; ││┌ @ essentials.jl:917 within `getindex`
	movq	(%rsi), %rcx
; │└└
; │ @ /home/spencer/Research/QuantumGateDesign.jl/examples/inbounds.jl:12 within `add_one_inbounds!`
; │┌ @ essentials.jl:917 within `getindex`
	movq	(%rdi), %rdx
	movabsq	$.LCPI0_0, %rdi
; │└
; │ @ /home/spencer/Research/QuantumGateDesign.jl/examples/inbounds.jl:11 within `add_one_inbounds!`
; │┌ @ array.jl:891 within `iterate` @ array.jl:891
; ││┌ @ essentials.jl:917 within `getindex`
	movq	(%rcx), %rsi
; │└└
; │ @ /home/spencer/Research/QuantumGateDesign.jl/examples/inbounds.jl:12 within `add_one_inbounds!`
; │┌ @ essentials.jl:917 within `getindex`
	vmovsd	-8(%rdx,%rsi,8), %xmm0          # xmm0 = mem[0],zero
; │└
; │┌ @ float.jl:491 within `+`
	vaddsd	(%rdi), %xmm0, %xmm0
; │└
; │┌ @ array.jl:976 within `setindex!`
	vmovsd	%xmm0, -8(%rdx,%rsi,8)
; │└
; │ @ /home/spencer/Research/QuantumGateDesign.jl/examples/inbounds.jl:13 within `add_one_inbounds!`
; │┌ @ array.jl:891 within `iterate`
; ││┌ @ int.jl:513 within `<`
	cmpq	$1, %rax
; ││└
	je	.LBB0_4
# %bb.2:                                # %L104.preheader
	vmovsd	(%rdi), %xmm0                   # xmm0 = mem[0],zero
	movl	$1, %esi
	.p2align	4, 0x90
.LBB0_3:                                # %L104
                                        # =>This Inner Loop Header: Depth=1
; ││┌ @ essentials.jl:917 within `getindex`
	movq	(%rcx,%rsi,8), %rdi
; ││└
; ││┌ @ int.jl:513 within `<`
	incq	%rsi
; │└└
; │ @ /home/spencer/Research/QuantumGateDesign.jl/examples/inbounds.jl:12 within `add_one_inbounds!`
; │┌ @ float.jl:491 within `+`
	vaddsd	-8(%rdx,%rdi,8), %xmm0, %xmm1
; │└
; │┌ @ array.jl:976 within `setindex!`
	vmovsd	%xmm1, -8(%rdx,%rdi,8)
; │└
; │ @ /home/spencer/Research/QuantumGateDesign.jl/examples/inbounds.jl:13 within `add_one_inbounds!`
; │┌ @ array.jl:891 within `iterate`
; ││┌ @ int.jl:513 within `<`
	cmpq	%rsi, %rax
; ││└
	jne	.LBB0_3
.LBB0_4:                                # %L110
; │└
	popq	%rbp
	retq
.Lfunc_end0:
	.size	"julia_add_one_inbounds!_4312", .Lfunc_end0-"julia_add_one_inbounds!_4312"
; └
                                        # -- End function
	.section	".note.GNU-stack","",@progbits

add_one_simd! code native

julia> @code_native syntax=:att add_one_simd!(x, rand_indices)
	.text
	.file	"add_one_simd!"
	.section	.rodata.cst8,"aM",@progbits,8
	.p2align	3, 0x0                          # -- Begin function julia_add_one_simd!_4339
.LCPI0_0:
	.quad	0x3ff0000000000000              # double 1
	.text
	.globl	"julia_add_one_simd!_4339"
	.p2align	4, 0x90
	.type	"julia_add_one_simd!_4339",@function
"julia_add_one_simd!_4339":             # @"julia_add_one_simd!_4339"
; Function Signature: add_one_simd!(Array{Float64, 1}, Array{Int64, 1})
; ┌ @ /home/spencer/Research/QuantumGateDesign.jl/examples/inbounds.jl:17 within `add_one_simd!`
# %bb.0:                                # %top
; │ @ /home/spencer/Research/QuantumGateDesign.jl/examples/inbounds.jl within `add_one_simd!`
	#DEBUG_VALUE: add_one_simd!:x <- [DW_OP_deref] $rdi
	#DEBUG_VALUE: add_one_simd!:indices <- [DW_OP_deref] $rsi
	#DEBUG_VALUE: add_one_simd!:indices <- [DW_OP_deref] 0
	#DEBUG_VALUE: add_one_simd!:x <- [DW_OP_deref] 0
	pushq	%rbp
	movq	%rsp, %rbp
	subq	$16, %rsp
; │ @ /home/spencer/Research/QuantumGateDesign.jl/examples/inbounds.jl:18 within `add_one_simd!`
; │┌ @ simdloop.jl:71 within `macro expansion`
; ││┌ @ simdloop.jl:51 within `simd_inner_length`
; │││┌ @ essentials.jl:11 within `length`
	movq	16(%rsi), %rax
; ││└└
; ││ @ simdloop.jl:72 within `macro expansion`
; ││┌ @ int.jl:83 within `<`
	testq	%rax, %rax
; ││└
	jle	.LBB0_4
# %bb.1:                                # %L11.lr.ph
	movabsq	$.LCPI0_0, %r9
	movq	(%rsi), %rcx
	movq	(%rdi), %rdx
	movq	16(%rdi), %rsi
	xorl	%r8d, %r8d
	vmovsd	(%r9), %xmm0                    # xmm0 = mem[0],zero
	.p2align	4, 0x90
.LBB0_2:                                # %L11
                                        # =>This Inner Loop Header: Depth=1
; ││ @ simdloop.jl:76 within `macro expansion`
; ││┌ @ simdloop.jl:54 within `simd_index`
; │││┌ @ essentials.jl:917 within `getindex`
	movq	(%rcx,%r8,8), %r9
; ││└└
; ││ @ simdloop.jl:77 within `macro expansion` @ /home/spencer/Research/QuantumGateDesign.jl/examples/inbounds.jl:19
; ││┌ @ essentials.jl:916 within `getindex`
	leaq	-1(%r9), %r10
	cmpq	%rsi, %r10
	jae	.LBB0_5
# %bb.3:                                # %L64
                                        #   in Loop: Header=BB0_2 Depth=1
; ││└
; ││┌ @ float.jl:491 within `+`
	vaddsd	-8(%rdx,%r9,8), %xmm0, %xmm1
; ││└
; ││ @ simdloop.jl:78 within `macro expansion`
; ││┌ @ int.jl:87 within `+`
	incq	%r8
; ││└
; ││ @ simdloop.jl:77 within `macro expansion` @ /home/spencer/Research/QuantumGateDesign.jl/examples/inbounds.jl:19
; ││┌ @ array.jl:976 within `setindex!`
	vmovsd	%xmm1, -8(%rdx,%r9,8)
; ││└
; ││ @ simdloop.jl:75 within `macro expansion`
; ││┌ @ int.jl:83 within `<`
	cmpq	%r8, %rax
; ││└
	jne	.LBB0_2
.LBB0_4:                                # %L71
; ││ @ simdloop.jl:76 within `macro expansion`
; ││┌ @ simdloop.jl:54 within `simd_index`
; │││┌ @ essentials.jl:916 within `getindex`
	addq	$16, %rsp
	popq	%rbp
	retq
.LBB0_5:                                # %L42
; ││└└
; ││ @ simdloop.jl:77 within `macro expansion` @ /home/spencer/Research/QuantumGateDesign.jl/examples/inbounds.jl:19
; ││┌ @ essentials.jl:916 within `getindex`
	movabsq	$j_throw_boundserror_4354, %rax
	leaq	-8(%rbp), %rsi
	movq	%r9, -8(%rbp)
	callq	*%rax
.Lfunc_end0:
	.size	"julia_add_one_simd!_4339", .Lfunc_end0-"julia_add_one_simd!_4339"
; └└└
                                        # -- End function
	.section	".note.GNU-stack","",@progbits

add_one_simd_inbounds! code native

julia> @code_native syntax=:att add_one_simd_inbounds!(x, rand_indices)
	.text
	.file	"add_one_simd_inbounds!"
	.section	.rodata.cst8,"aM",@progbits,8
	.p2align	3, 0x0                          # -- Begin function julia_add_one_simd_inbounds!_4357
.LCPI0_0:
	.quad	0x3ff0000000000000              # double 1
	.text
	.globl	"julia_add_one_simd_inbounds!_4357"
	.p2align	4, 0x90
	.type	"julia_add_one_simd_inbounds!_4357",@function
"julia_add_one_simd_inbounds!_4357":    # @"julia_add_one_simd_inbounds!_4357"
; Function Signature: add_one_simd_inbounds!(Array{Float64, 1}, Array{Int64, 1})
; ┌ @ /home/spencer/Research/QuantumGateDesign.jl/examples/inbounds.jl:24 within `add_one_simd_inbounds!`
# %bb.0:                                # %top
; │ @ /home/spencer/Research/QuantumGateDesign.jl/examples/inbounds.jl within `add_one_simd_inbounds!`
	#DEBUG_VALUE: add_one_simd_inbounds!:x <- [DW_OP_deref] $rdi
	#DEBUG_VALUE: add_one_simd_inbounds!:indices <- [DW_OP_deref] $rsi
	#DEBUG_VALUE: add_one_simd_inbounds!:indices <- [DW_OP_deref] 0
	#DEBUG_VALUE: add_one_simd_inbounds!:x <- [DW_OP_deref] 0
	pushq	%rbp
; │ @ /home/spencer/Research/QuantumGateDesign.jl/examples/inbounds.jl:25 within `add_one_simd_inbounds!`
; │┌ @ simdloop.jl:71 within `macro expansion`
; ││┌ @ simdloop.jl:51 within `simd_inner_length`
; │││┌ @ essentials.jl:11 within `length`
	movq	16(%rsi), %rax
	movq	%rsp, %rbp
; ││└└
; ││ @ simdloop.jl:72 within `macro expansion`
; ││┌ @ int.jl:83 within `<`
	testq	%rax, %rax
; ││└
	jle	.LBB0_3
# %bb.1:                                # %L11.lr.ph
	movabsq	$.LCPI0_0, %r8
	movq	(%rsi), %rcx
	movq	(%rdi), %rdx
	xorl	%esi, %esi
	vmovsd	(%r8), %xmm0                    # xmm0 = mem[0],zero
	.p2align	4, 0x90
.LBB0_2:                                # %L11
                                        # =>This Inner Loop Header: Depth=1
; ││ @ simdloop.jl:76 within `macro expansion`
; ││┌ @ simdloop.jl:54 within `simd_index`
; │││┌ @ essentials.jl:917 within `getindex`
	movq	(%rcx,%rsi,8), %rdi
; ││└└
; ││ @ simdloop.jl:78 within `macro expansion`
; ││┌ @ int.jl:87 within `+`
	incq	%rsi
; ││└
; ││ @ simdloop.jl:77 within `macro expansion` @ /home/spencer/Research/QuantumGateDesign.jl/examples/inbounds.jl:26
; ││┌ @ float.jl:491 within `+`
	vaddsd	-8(%rdx,%rdi,8), %xmm0, %xmm1
; ││└
; ││┌ @ array.jl:976 within `setindex!`
	vmovsd	%xmm1, -8(%rdx,%rdi,8)
; ││└
; ││ @ simdloop.jl:75 within `macro expansion`
; ││┌ @ int.jl:83 within `<`
	cmpq	%rsi, %rax
; ││└
	jne	.LBB0_2
.LBB0_3:                                # %L71
; ││ @ simdloop.jl:76 within `macro expansion`
; ││┌ @ simdloop.jl:54 within `simd_index`
; │││┌ @ essentials.jl:916 within `getindex`
	popq	%rbp
	retq
.Lfunc_end0:
	.size	"julia_add_one_simd_inbounds!_4357", .Lfunc_end0-"julia_add_one_simd_inbounds!_4357"
; └└└└
                                        # -- End function
	.section	".note.GNU-stack","",@progbits

I don’t know exactly why, but it seems that add_one_inbounds! has a few more vmovsd and vaddsd instructions than add_one_simd_inbounds! does (the first loop iteration appears to be handled separately before the main loop), and there is a branch that is implemented a bit differently in each function. Both of these appear to be related to the iteration over the indices, and I suspect these differences are what cause add_one_simd_inbounds! to be a little bit faster. Still, I’m surprised that adding @simd on top of @inbounds changes the generated code at all. In any case, I’ve dug into it far enough to be satisfied.

phma · December 15, 2024, 4:13am

Here’s a function from RotBitcount.jl in my package WringTwistree:

function rotBitcountSeq!(src::Vector{UInt8},dst::Vector{UInt8},mult::Integer)
  len=length(src)
  @assert len==length(dst) "rotBitcount: size mismatch"
  @assert src!==dst "rotBitcount: src and dst must be different"
  if len>0
    multmod=mod(mult,len*8)
  else
    multmod=mult
  end
  @inbounds bitcount=mapreduce(count_ones,+,src,init=0)
  if len>0
    rotcount=(bitcount*multmod)%(len*8)
  else
    rotcount=bitcount*multmod
  end
  byte=rotcount>>3
  bit=rotcount&7
  for i in 1:byte
    @inbounds dst[i]=(src[i+len-byte]<<bit) | (src[i+len-byte-1]>>(8-bit))
  end
  @inbounds dst[byte+1]=(src[1]<<bit) | (src[len]>>(8-bit))
  for i in byte+2:len
    @inbounds dst[i]=(src[i-byte]<<bit) | (src[i-byte-1]>>(8-bit))
  end
  bitcount
end

This function rotates a buffer of bytes by the number of one bits in it. It’s been months since I’ve worked on the code; IIRC the compiler (1.10, I’m using 1.11 now) could not tell that the index arithmetic in the second and fourth @inbounds statements always produces indices that are in bounds.

photor · February 12, 2025, 2:26pm

So how do you adjust the optimization level (O1 to O3) at run time?

abraemer · February 12, 2025, 4:19pm

I don’t think this is possible currently. I think @leespen1 just ran his script multiple times with different settings.

However, it would be very cool to have constructs like Common Lisp’s (declare (optimize (speed 3))) and similar.

mbauman · February 12, 2025, 4:29pm

You can do it on a per-module basis with Experimental.@optlevel
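
A minimal sketch of what that could look like. The module name LowOpt is made up for illustration. My understanding of the docstring is that the effective level is the minimum of the per-module setting and the -O flag given on the command line, so this can lower the level for a module but not raise it; treat that as an assumption and check the docstring for your Julia version.

```julia
module LowOpt  # hypothetical module, used only to scope the setting

# Ask for optimization level 1 for the code defined in this module.
Base.Experimental.@optlevel 1

function add_one!(x, indices)
    for i in indices
        x[i] += 1.0
    end
    return x
end

end # module LowOpt

x = zeros(3)
LowOpt.add_one!(x, [1, 1, 3])
println(x)

# The process-wide -O level, for comparison (internal field, not public API).
println("command-line optimization level: ", Base.JLOptions().opt_level)
```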

leespen1 · February 14, 2025, 1:00am

As @abraemer said, I don’t think it’s possible, but I’m by no means an expert.

I just ran the same script multiple times, with the -O1 flag, then the -O2 flag, then the -O3 flag.
