Thanks for all the advice everyone, this was very helpful!
I updated my script so that the compiler cannot prove on its own that the accesses are in bounds and skip the bounds checks (and also to show the impact of @simd). I also tried different compiler optimization levels. The script and the results for each level are below, followed by what I found.
using BenchmarkTools

# Plain loop: every x[i] access is bounds checked.
function add_one!(x, indices)
    for i in indices
        x[i] += 1.0
    end
end

# Same loop with bounds checking disabled.
function add_one_inbounds!(x, indices)
    for i in indices
        @inbounds x[i] += 1.0
    end
end

# Same loop with the @simd annotation, bounds checks kept.
function add_one_simd!(x, indices)
    @simd for i in indices
        x[i] += 1.0
    end
end

# Both @simd and @inbounds.
function add_one_simd_inbounds!(x, indices)
    @simd for i in indices
        @inbounds x[i] += 1.0
    end
end

N = 10
x = rand(N)
# Random indices, so the compiler cannot prove they are in bounds.
rand_indices = rand(1:N, 100*N)

println("Non-inbounds version:")
display(@benchmark add_one!($x, $rand_indices))
println("\nInbounds version:")
display(@benchmark add_one_inbounds!($x, $rand_indices))
println("\nSimd (no inbounds) version:")
display(@benchmark add_one_simd!($x, $rand_indices))
println("\nSimd with inbounds version:")
display(@benchmark add_one_simd_inbounds!($x, $rand_indices))
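Before comparing timings, it may be worth confirming that the four variants compute the same thing, including when an index repeats (which it does here, since 1000 indices are drawn from 1:10). This is only a minimal sketch: the function definitions are repeated from the script so the block runs by itself, and `variants`, `results`, and `x0` are names made up for illustration.

```julia
# Definitions repeated from the script so this block is self-contained.
function add_one!(x, indices)
    for i in indices
        x[i] += 1.0
    end
end
function add_one_inbounds!(x, indices)
    for i in indices
        @inbounds x[i] += 1.0
    end
end
function add_one_simd!(x, indices)
    @simd for i in indices
        x[i] += 1.0
    end
end
function add_one_simd_inbounds!(x, indices)
    @simd for i in indices
        @inbounds x[i] += 1.0
    end
end

N = 10
indices = rand(1:N, 100 * N)   # indices repeat, as in the benchmark
x0 = rand(N)

# Run each variant on its own copy of x0 (hypothetical helper names).
variants = (add_one!, add_one_inbounds!, add_one_simd!, add_one_simd_inbounds!)
results = map(variants) do f
    x = copy(x0)
    f(x, indices)
    x
end

# Check whether every variant agrees with the plain loop.
println(all(r -> r == results[1], results))
```

If this ever printed `false`, the timing comparison would not be comparing like with like.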
Script Results (O1 Optimization)
Non-inbounds version:
BenchmarkTools.Trial: 10000 samples with 9 evaluations.
Range (min … max): 2.917 μs … 4.357 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 2.942 μs ┊ GC (median): 0.00%
Time (mean ± σ): 2.958 μs ± 74.315 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
2.92 μs Histogram: log(frequency) by time 3.3 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
Inbounds version:
BenchmarkTools.Trial: 10000 samples with 9 evaluations.
Range (min … max): 2.409 μs … 5.131 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 2.436 μs ┊ GC (median): 0.00%
Time (mean ± σ): 2.447 μs ± 81.676 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
2.41 μs Histogram: frequency by time 2.84 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
Simd (no inbounds) version:
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
Range (min … max): 1.995 μs … 3.896 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 2.007 μs ┊ GC (median): 0.00%
Time (mean ± σ): 2.021 μs ± 65.587 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
1.99 μs Histogram: frequency by time 2.33 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
Simd with inbounds version:
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
Range (min … max): 1.497 μs … 3.832 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 1.505 μs ┊ GC (median): 0.00%
Time (mean ± σ): 1.515 μs ± 78.196 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
1.5 μs Histogram: log(frequency) by time 1.79 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
Script Results (O2 Optimization)
Non-inbounds version:
BenchmarkTools.Trial: 10000 samples with 192 evaluations.
Range (min … max): 514.146 ns … 753.661 ns ┊ GC (min … max): 0.00% … 0.00%
Time (median): 520.198 ns ┊ GC (median): 0.00%
Time (mean ± σ): 525.612 ns ± 28.248 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
514 ns Histogram: log(frequency) by time 725 ns <
Memory estimate: 0 bytes, allocs estimate: 0.
Inbounds version:
BenchmarkTools.Trial: 10000 samples with 198 evaluations.
Range (min … max): 444.323 ns … 1.088 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 454.086 ns ┊ GC (median): 0.00%
Time (mean ± σ): 474.327 ns ± 65.738 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
444 ns Histogram: log(frequency) by time 729 ns <
Memory estimate: 0 bytes, allocs estimate: 0.
Simd (no inbounds) version:
BenchmarkTools.Trial: 10000 samples with 192 evaluations.
Range (min … max): 511.531 ns … 1.013 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 516.443 ns ┊ GC (median): 0.00%
Time (mean ± σ): 520.321 ns ± 18.245 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
512 ns Histogram: log(frequency) by time 579 ns <
Memory estimate: 0 bytes, allocs estimate: 0.
Simd with inbounds version:
BenchmarkTools.Trial: 10000 samples with 198 evaluations.
Range (min … max): 446.091 ns … 1.033 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 456.263 ns ┊ GC (median): 0.00%
Time (mean ± σ): 458.399 ns ± 13.619 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
446 ns Histogram: log(frequency) by time 484 ns <
Memory estimate: 0 bytes, allocs estimate: 0.
Script Results (O3 Optimization)
Non-inbounds version:
BenchmarkTools.Trial: 10000 samples with 195 evaluations.
Range (min … max): 484.600 ns … 702.913 ns ┊ GC (min … max): 0.00% … 0.00%
Time (median): 495.082 ns ┊ GC (median): 0.00%
Time (mean ± σ): 497.254 ns ± 8.260 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
485 ns Histogram: log(frequency) by time 526 ns <
Memory estimate: 0 bytes, allocs estimate: 0.
Inbounds version:
BenchmarkTools.Trial: 10000 samples with 192 evaluations.
Range (min … max): 511.016 ns … 948.448 ns ┊ GC (min … max): 0.00% … 0.00%
Time (median): 516.125 ns ┊ GC (median): 0.00%
Time (mean ± σ): 518.723 ns ± 11.223 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
511 ns Histogram: frequency by time 548 ns <
Memory estimate: 0 bytes, allocs estimate: 0.
Simd (no inbounds) version:
BenchmarkTools.Trial: 10000 samples with 195 evaluations.
Range (min … max): 487.067 ns … 684.518 ns ┊ GC (min … max): 0.00% … 0.00%
Time (median): 496.062 ns ┊ GC (median): 0.00%
Time (mean ± σ): 498.179 ns ± 7.463 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
487 ns Histogram: log(frequency) by time 523 ns <
Memory estimate: 0 bytes, allocs estimate: 0.
Simd with inbounds version:
BenchmarkTools.Trial: 10000 samples with 198 evaluations.
Range (min … max): 445.384 ns … 561.561 ns ┊ GC (min … max): 0.00% … 0.00%
Time (median): 455.404 ns ┊ GC (median): 0.00%
Time (mean ± σ): 456.835 ns ± 6.009 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
445 ns Histogram: frequency by time 481 ns <
Memory estimate: 0 bytes, allocs estimate: 0.
The results are clearest when using O1 optimization. Compared with the plain loop, the mean time drops by ~17% with @inbounds, by ~32% with @simd, and by ~49% with both @simd and @inbounds.
When using O2 optimization, @inbounds gives a ~10% speedup (comparing mean times). Using @simd but not @inbounds results in nearly the same performance as if no macros were used at all (~1% speedup), and using @simd and @inbounds together results in a ~13% speedup. So as before, @inbounds increases performance, and using @simd and @inbounds together increases performance the most, but overall the effects are much less noticeable than when using O1 optimization (and even the fastest O1 code is nearly 3x slower than the slowest O2 code).
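Since the numbers depend so strongly on the optimization level (set with the `-O1`, `-O2`, or `-O3` command-line flag when starting Julia), it can help to have the script print the settings it is actually running under. A small sketch, assuming it is acceptable to read `Base.JLOptions()`, which is an internal, non-exported struct and may change between Julia versions:

```julia
# Base.JLOptions() is internal, so treat this as a convenience only.
opts = Base.JLOptions()

# opt_level mirrors the -O0 ... -O3 command-line flag.
println("optimization level: -O", opts.opt_level)

# check_bounds: 0 = default (@inbounds is respected),
#               1 = --check-bounds=yes, 2 = --check-bounds=no.
println("check_bounds setting: ", opts.check_bounds)
```

Printing this at the top of the benchmark output makes it harder to mix up which results came from which run.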
For O3 optimization, using @inbounds but not @simd actually decreased performance by ~4%. On repeat runs it was sometimes faster, so I would conclude that overall there is not much difference. Using @inbounds and @simd together was still fastest (nearly the same as the same code with O2 optimization).
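For those interested, I have included the native assembly code returned by @code_native syntax=:att add_one!(x, rand_indices) (when using O3 optimization). There are two vmovsd instructions and one vaddsd instruction. These operate on the xmm (SIMD) registers, but note that they are scalar instructions (the sd suffix stands for "scalar double"), so each one handles a single Float64; a vectorized loop would use packed instructions such as vaddpd instead. So the loop is not actually vectorized here. You can also still see the bounds checking: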
leaq -1(%r10), %rsi
cmpq %rdx, %rsi
jae .LBB0_5
[...]
.LBB0_5: # %L49
movabsq $j_throw_boundserror_4091, %rax
leaq -8(%rbp), %rsi
movq %r10, -8(%rbp)
callq *%rax
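Rather than reading the full listings by eye, one can also capture the assembly as a string and search it for the bounds-error call. This is a rough sketch under a couple of assumptions: `native` is a helper name invented here, and the check relies on the error path showing up under a symbol containing "boundserror" (as in the excerpt above), which may differ on other Julia versions or platforms.

```julia
using InteractiveUtils  # provides code_native outside the REPL

# Definitions repeated from the script so this block is self-contained.
function add_one!(x, indices)
    for i in indices
        x[i] += 1.0
    end
end
function add_one_inbounds!(x, indices)
    for i in indices
        @inbounds x[i] += 1.0
    end
end

# Capture the assembly as a String instead of printing it (hypothetical helper).
native(f) = sprint() do io
    code_native(io, f, (Vector{Float64}, Vector{Int}); syntax = :att, debuginfo = :none)
end

# Assumption: the bounds-error call appears under a name containing "boundserror".
for f in (add_one!, add_one_inbounds!)
    println(f, " contains a bounds-error call: ", occursin("boundserror", native(f)))
end
```

Note that starting Julia with `--check-bounds=yes` forces bounds checks even inside @inbounds, which would change what this reports.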
add_one! code native
julia> @code_native syntax=:att add_one!(x, rand_indices)
.text
.file "add_one!"
.section .rodata.cst8,"aM",@progbits,8
.p2align 3, 0x0 # -- Begin function julia_add_one!_4075
.LCPI0_0:
.quad 0x3ff0000000000000 # double 1
.text
.globl "julia_add_one!_4075"
.p2align 4, 0x90
.type "julia_add_one!_4075",@function
"julia_add_one!_4075": # @"julia_add_one!_4075"
; Function Signature: add_one!(Array{Float64, 1}, Array{Int64, 1})
; @ /home/spencer/Research/QuantumGateDesign.jl/examples/inbounds.jl:4 within `add_one!`
# %bb.0: # %top
; @ /home/spencer/Research/QuantumGateDesign.jl/examples/inbounds.jl within `add_one!`
#DEBUG_VALUE: add_one!:x <- [DW_OP_deref] $rdi
#DEBUG_VALUE: add_one!:indices <- [DW_OP_deref] $rsi
#DEBUG_VALUE: add_one!:indices <- [DW_OP_deref] 0
#DEBUG_VALUE: add_one!:x <- [DW_OP_deref] 0
pushq %rbp
movq %rsp, %rbp
subq $16, %rsp
; @ /home/spencer/Research/QuantumGateDesign.jl/examples/inbounds.jl:5 within `add_one!`
; @ array.jl:891 within `iterate` @ array.jl:891
; @ essentials.jl:11 within `length`
movq 16(%rsi), %rax
; @ int.jl:513 within `<`
testq %rax, %rax
je .LBB0_6
# %bb.1: # %L34
; @ essentials.jl:917 within `getindex`
movq (%rsi), %rcx
; @ /home/spencer/Research/QuantumGateDesign.jl/examples/inbounds.jl:6 within `add_one!`
; @ essentials.jl:916 within `getindex`
; @ essentials.jl:11 within `length`
movq 16(%rdi), %rdx
; @ /home/spencer/Research/QuantumGateDesign.jl/examples/inbounds.jl:5 within `add_one!`
; @ array.jl:891 within `iterate` @ array.jl:891
; @ essentials.jl:917 within `getindex`
movq (%rcx), %r10
; @ /home/spencer/Research/QuantumGateDesign.jl/examples/inbounds.jl:6 within `add_one!`
; @ essentials.jl:916 within `getindex`
leaq -1(%r10), %rsi
cmpq %rdx, %rsi
jae .LBB0_5
# %bb.2: # %L52.lr.ph
movabsq $.LCPI0_0, %r10
movq (%rdi), %r8
movl $1, %r9d
vmovsd (%r10), %xmm0 # xmm0 = mem[0],zero
.p2align 4, 0x90
.LBB0_3: # %L71
# =>This Inner Loop Header: Depth=1
; @ float.jl:491 within `+`
vaddsd (%r8,%rsi,8), %xmm0, %xmm1
; @ array.jl:976 within `setindex!`
vmovsd %xmm1, (%r8,%rsi,8)
; @ /home/spencer/Research/QuantumGateDesign.jl/examples/inbounds.jl:7 within `add_one!`
; @ array.jl:891 within `iterate`
; @ int.jl:513 within `<`
cmpq %r9, %rax
je .LBB0_6
# %bb.4: # %L104
# in Loop: Header=BB0_3 Depth=1
; @ essentials.jl:917 within `getindex`
movq (%rcx,%r9,8), %r10
; @ /home/spencer/Research/QuantumGateDesign.jl/examples/inbounds.jl:6 within `add_one!`
; @ essentials.jl:916 within `getindex`
incq %r9
leaq -1(%r10), %rsi
cmpq %rdx, %rsi
jb .LBB0_3
.LBB0_5: # %L49
movabsq $j_throw_boundserror_4091, %rax
leaq -8(%rbp), %rsi
movq %r10, -8(%rbp)
callq *%rax
.LBB0_6: # %L110
; @ /home/spencer/Research/QuantumGateDesign.jl/examples/inbounds.jl:7 within `add_one!`
addq $16, %rsp
popq %rbp
retq
.Lfunc_end0:
.size "julia_add_one!_4075", .Lfunc_end0-"julia_add_one!_4075"
# -- End function
.section ".note.GNU-stack","",@progbits
And below is the native assembly code for the remaining functions:
add_one_inbounds! code native
julia> @code_native syntax=:att add_one_inbounds!(x, rand_indices)
.text
.file "add_one_inbounds!"
.section .rodata.cst8,"aM",@progbits,8
.p2align 3, 0x0 # -- Begin function julia_add_one_inbounds!_4312
.LCPI0_0:
.quad 0x3ff0000000000000 # double 1
.text
.globl "julia_add_one_inbounds!_4312"
.p2align 4, 0x90
.type "julia_add_one_inbounds!_4312",@function
"julia_add_one_inbounds!_4312": # @"julia_add_one_inbounds!_4312"
; Function Signature: add_one_inbounds!(Array{Float64, 1}, Array{Int64, 1})
; @ /home/spencer/Research/QuantumGateDesign.jl/examples/inbounds.jl:10 within `add_one_inbounds!`
# %bb.0: # %top
; @ /home/spencer/Research/QuantumGateDesign.jl/examples/inbounds.jl within `add_one_inbounds!`
#DEBUG_VALUE: add_one_inbounds!:x <- [DW_OP_deref] $rdi
#DEBUG_VALUE: add_one_inbounds!:indices <- [DW_OP_deref] $rsi
#DEBUG_VALUE: add_one_inbounds!:indices <- [DW_OP_deref] 0
#DEBUG_VALUE: add_one_inbounds!:x <- [DW_OP_deref] 0
pushq %rbp
; @ /home/spencer/Research/QuantumGateDesign.jl/examples/inbounds.jl:11 within `add_one_inbounds!`
; @ array.jl:891 within `iterate` @ array.jl:891
; @ essentials.jl:11 within `length`
movq 16(%rsi), %rax
movq %rsp, %rbp
; @ int.jl:513 within `<`
testq %rax, %rax
je .LBB0_4
# %bb.1: # %L34
; @ essentials.jl:917 within `getindex`
movq (%rsi), %rcx
; @ /home/spencer/Research/QuantumGateDesign.jl/examples/inbounds.jl:12 within `add_one_inbounds!`
; @ essentials.jl:917 within `getindex`
movq (%rdi), %rdx
movabsq $.LCPI0_0, %rdi
; @ /home/spencer/Research/QuantumGateDesign.jl/examples/inbounds.jl:11 within `add_one_inbounds!`
; @ array.jl:891 within `iterate` @ array.jl:891
; @ essentials.jl:917 within `getindex`
movq (%rcx), %rsi
; @ /home/spencer/Research/QuantumGateDesign.jl/examples/inbounds.jl:12 within `add_one_inbounds!`
; @ essentials.jl:917 within `getindex`
vmovsd -8(%rdx,%rsi,8), %xmm0 # xmm0 = mem[0],zero
; @ float.jl:491 within `+`
vaddsd (%rdi), %xmm0, %xmm0
; @ array.jl:976 within `setindex!`
vmovsd %xmm0, -8(%rdx,%rsi,8)
; @ /home/spencer/Research/QuantumGateDesign.jl/examples/inbounds.jl:13 within `add_one_inbounds!`
; @ array.jl:891 within `iterate`
; @ int.jl:513 within `<`
cmpq $1, %rax
je .LBB0_4
# %bb.2: # %L104.preheader
vmovsd (%rdi), %xmm0 # xmm0 = mem[0],zero
movl $1, %esi
.p2align 4, 0x90
.LBB0_3: # %L104
# =>This Inner Loop Header: Depth=1
; @ essentials.jl:917 within `getindex`
movq (%rcx,%rsi,8), %rdi
; @ int.jl:513 within `<`
incq %rsi
; @ /home/spencer/Research/QuantumGateDesign.jl/examples/inbounds.jl:12 within `add_one_inbounds!`
; @ float.jl:491 within `+`
vaddsd -8(%rdx,%rdi,8), %xmm0, %xmm1
; @ array.jl:976 within `setindex!`
vmovsd %xmm1, -8(%rdx,%rdi,8)
; @ /home/spencer/Research/QuantumGateDesign.jl/examples/inbounds.jl:13 within `add_one_inbounds!`
; @ array.jl:891 within `iterate`
; @ int.jl:513 within `<`
cmpq %rsi, %rax
jne .LBB0_3
.LBB0_4: # %L110
popq %rbp
retq
.Lfunc_end0:
.size "julia_add_one_inbounds!_4312", .Lfunc_end0-"julia_add_one_inbounds!_4312"
# -- End function
.section ".note.GNU-stack","",@progbits
add_one_simd! code native
julia> @code_native syntax=:att add_one_simd!(x, rand_indices)
.text
.file "add_one_simd!"
.section .rodata.cst8,"aM",@progbits,8
.p2align 3, 0x0 # -- Begin function julia_add_one_simd!_4339
.LCPI0_0:
.quad 0x3ff0000000000000 # double 1
.text
.globl "julia_add_one_simd!_4339"
.p2align 4, 0x90
.type "julia_add_one_simd!_4339",@function
"julia_add_one_simd!_4339": # @"julia_add_one_simd!_4339"
; Function Signature: add_one_simd!(Array{Float64, 1}, Array{Int64, 1})
; @ /home/spencer/Research/QuantumGateDesign.jl/examples/inbounds.jl:17 within `add_one_simd!`
# %bb.0: # %top
; @ /home/spencer/Research/QuantumGateDesign.jl/examples/inbounds.jl within `add_one_simd!`
#DEBUG_VALUE: add_one_simd!:x <- [DW_OP_deref] $rdi
#DEBUG_VALUE: add_one_simd!:indices <- [DW_OP_deref] $rsi
#DEBUG_VALUE: add_one_simd!:indices <- [DW_OP_deref] 0
#DEBUG_VALUE: add_one_simd!:x <- [DW_OP_deref] 0
pushq %rbp
movq %rsp, %rbp
subq $16, %rsp
; @ /home/spencer/Research/QuantumGateDesign.jl/examples/inbounds.jl:18 within `add_one_simd!`
; @ simdloop.jl:71 within `macro expansion`
; @ simdloop.jl:51 within `simd_inner_length`
; @ essentials.jl:11 within `length`
movq 16(%rsi), %rax
; @ simdloop.jl:72 within `macro expansion`
; @ int.jl:83 within `<`
testq %rax, %rax
jle .LBB0_4
# %bb.1: # %L11.lr.ph
movabsq $.LCPI0_0, %r9
movq (%rsi), %rcx
movq (%rdi), %rdx
movq 16(%rdi), %rsi
xorl %r8d, %r8d
vmovsd (%r9), %xmm0 # xmm0 = mem[0],zero
.p2align 4, 0x90
.LBB0_2: # %L11
# =>This Inner Loop Header: Depth=1
; @ simdloop.jl:76 within `macro expansion`
; @ simdloop.jl:54 within `simd_index`
; @ essentials.jl:917 within `getindex`
movq (%rcx,%r8,8), %r9
; @ simdloop.jl:77 within `macro expansion` @ /home/spencer/Research/QuantumGateDesign.jl/examples/inbounds.jl:19
; @ essentials.jl:916 within `getindex`
leaq -1(%r9), %r10
cmpq %rsi, %r10
jae .LBB0_5
# %bb.3: # %L64
# in Loop: Header=BB0_2 Depth=1
; @ float.jl:491 within `+`
vaddsd -8(%rdx,%r9,8), %xmm0, %xmm1
; @ simdloop.jl:78 within `macro expansion`
; @ int.jl:87 within `+`
incq %r8
; @ simdloop.jl:77 within `macro expansion` @ /home/spencer/Research/QuantumGateDesign.jl/examples/inbounds.jl:19
; @ array.jl:976 within `setindex!`
vmovsd %xmm1, -8(%rdx,%r9,8)
; @ simdloop.jl:75 within `macro expansion`
; @ int.jl:83 within `<`
cmpq %r8, %rax
jne .LBB0_2
.LBB0_4: # %L71
; @ simdloop.jl:76 within `macro expansion`
; @ simdloop.jl:54 within `simd_index`
; @ essentials.jl:916 within `getindex`
addq $16, %rsp
popq %rbp
retq
.LBB0_5: # %L42
; @ simdloop.jl:77 within `macro expansion` @ /home/spencer/Research/QuantumGateDesign.jl/examples/inbounds.jl:19
; @ essentials.jl:916 within `getindex`
movabsq $j_throw_boundserror_4354, %rax
leaq -8(%rbp), %rsi
movq %r9, -8(%rbp)
callq *%rax
.Lfunc_end0:
.size "julia_add_one_simd!_4339", .Lfunc_end0-"julia_add_one_simd!_4339"
# -- End function
.section ".note.GNU-stack","",@progbits
add_one_simd_inbounds! code native
julia> @code_native syntax=:att add_one_simd_inbounds!(x, rand_indices)
.text
.file "add_one_simd_inbounds!"
.section .rodata.cst8,"aM",@progbits,8
.p2align 3, 0x0 # -- Begin function julia_add_one_simd_inbounds!_4357
.LCPI0_0:
.quad 0x3ff0000000000000 # double 1
.text
.globl "julia_add_one_simd_inbounds!_4357"
.p2align 4, 0x90
.type "julia_add_one_simd_inbounds!_4357",@function
"julia_add_one_simd_inbounds!_4357": # @"julia_add_one_simd_inbounds!_4357"
; Function Signature: add_one_simd_inbounds!(Array{Float64, 1}, Array{Int64, 1})
; @ /home/spencer/Research/QuantumGateDesign.jl/examples/inbounds.jl:24 within `add_one_simd_inbounds!`
# %bb.0: # %top
; @ /home/spencer/Research/QuantumGateDesign.jl/examples/inbounds.jl within `add_one_simd_inbounds!`
#DEBUG_VALUE: add_one_simd_inbounds!:x <- [DW_OP_deref] $rdi
#DEBUG_VALUE: add_one_simd_inbounds!:indices <- [DW_OP_deref] $rsi
#DEBUG_VALUE: add_one_simd_inbounds!:indices <- [DW_OP_deref] 0
#DEBUG_VALUE: add_one_simd_inbounds!:x <- [DW_OP_deref] 0
pushq %rbp
; @ /home/spencer/Research/QuantumGateDesign.jl/examples/inbounds.jl:25 within `add_one_simd_inbounds!`
; @ simdloop.jl:71 within `macro expansion`
; @ simdloop.jl:51 within `simd_inner_length`
; @ essentials.jl:11 within `length`
movq 16(%rsi), %rax
movq %rsp, %rbp
; @ simdloop.jl:72 within `macro expansion`
; @ int.jl:83 within `<`
testq %rax, %rax
jle .LBB0_3
# %bb.1: # %L11.lr.ph
movabsq $.LCPI0_0, %r8
movq (%rsi), %rcx
movq (%rdi), %rdx
xorl %esi, %esi
vmovsd (%r8), %xmm0 # xmm0 = mem[0],zero
.p2align 4, 0x90
.LBB0_2: # %L11
# =>This Inner Loop Header: Depth=1
; @ simdloop.jl:76 within `macro expansion`
; @ simdloop.jl:54 within `simd_index`
; @ essentials.jl:917 within `getindex`
movq (%rcx,%rsi,8), %rdi
; @ simdloop.jl:78 within `macro expansion`
; @ int.jl:87 within `+`
incq %rsi
; @ simdloop.jl:77 within `macro expansion` @ /home/spencer/Research/QuantumGateDesign.jl/examples/inbounds.jl:26
; @ float.jl:491 within `+`
vaddsd -8(%rdx,%rdi,8), %xmm0, %xmm1
; @ array.jl:976 within `setindex!`
vmovsd %xmm1, -8(%rdx,%rdi,8)
; @ simdloop.jl:75 within `macro expansion`
; @ int.jl:83 within `<`
cmpq %rsi, %rax
jne .LBB0_2
.LBB0_3: # %L71
; @ simdloop.jl:76 within `macro expansion`
; @ simdloop.jl:54 within `simd_index`
; @ essentials.jl:916 within `getindex`
popq %rbp
retq
.Lfunc_end0:
.size "julia_add_one_simd_inbounds!_4357", .Lfunc_end0-"julia_add_one_simd_inbounds!_4357"
# -- End function
.section ".note.GNU-stack","",@progbits
I don't know exactly why, but it seems that add_one_inbounds! has a few more vmovsd/vaddsd instructions than add_one_simd_inbounds! does (the first iteration appears to be handled separately before the loop), and there is a branch that is implemented a bit differently in each function. Both of these appear to be related to the iteration over the indices, and I suspect these differences are what cause add_one_simd_inbounds! to be a little bit faster. Still, I'm surprised that the compiler emits these instructions with @inbounds alone, yet the code with @inbounds and @simd is still different from the code with @inbounds alone. In any case, I've dug into this far enough to be satisfied.
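For anyone who wants to repeat the comparison without counting by hand, here is a rough sketch that counts the two mnemonics in the generated assembly. It assumes an x86-64 machine with AVX (on other CPUs the counts will simply be zero), and `native` is a helper name invented for this example. In the listings above the counts were 4 vmovsd and 2 vaddsd for add_one_inbounds!, versus 2 and 1 for add_one_simd_inbounds!; other Julia versions may generate different code.

```julia
using InteractiveUtils  # provides code_native outside the REPL

# Definitions repeated from the script so this block is self-contained.
function add_one_inbounds!(x, indices)
    for i in indices
        @inbounds x[i] += 1.0
    end
end
function add_one_simd_inbounds!(x, indices)
    @simd for i in indices
        @inbounds x[i] += 1.0
    end
end

# Capture the assembly as a String, without debug comments (hypothetical helper).
native(f) = sprint() do io
    code_native(io, f, (Vector{Float64}, Vector{Int}); syntax = :att, debuginfo = :none)
end

# Count occurrences of each mnemonic (x86-64 names taken from the listings above).
for f in (add_one_inbounds!, add_one_simd_inbounds!)
    asm = native(f)
    println(f, ": vmovsd = ", count("vmovsd", asm), ", vaddsd = ", count("vaddsd", asm))
end
```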