Note that if you hide the index from the compiler, `begin+` is actually faster. Computers are 0-based, so every 1-based index in Julia carries an implicit subtraction; the apparent addition of `begin+` just cancels that subtraction.
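To make the cancellation concrete, here is an illustrative sketch (not how Base actually lowers indexing; `load_1based`/`load_0based` are made-up names):

```julia
# unsafe_load(ptr, i) reads from ptr + (i - 1)*sizeof(Float64):
# the 1-based convention costs an implicit subtraction.
load_1based(ptr::Ptr{Float64}, i::Int) = unsafe_load(ptr, i)

# A 0-based read needs no correction: the address is just ptr + i*sizeof(Float64).
load_0based(ptr::Ptr{Float64}, i::Int) = unsafe_load(ptr + 8i)

# With A[begin + i] on a 1-based array, the compiler sees (1 + i) - 1 == i,
# so the addition and the subtraction cancel, leaving the 0-based form.
```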
```julia
julia> using StrideArraysCore, OffsetArrays

julia> A = rand(2,2,2,2,2);

julia> As = StrideArraysCore.zero_offsets(StrideArray(A));

julia> Ao = OffsetArrays.Origin(0)(A);

julia> foo(A,i,j,k,l,m) = @inbounds A[i,j,k,l,m]
foo (generic function with 1 method)

julia> bar(A,i,j,k,l,m) = @inbounds A[begin+i,begin+j,begin+k,begin+l,begin+m]
bar (generic function with 1 method)
```
```
julia> @code_native syntax=:intel debuginfo=:none foo(A, 1, 1, 1, 1, 1)
.text
.file "foo"
.globl julia_foo_469 # -- Begin function julia_foo_469
.p2align 4, 0x90
.type julia_foo_469,@function
julia_foo_469: # @julia_foo_469
# %bb.0: # %top
push rbp
mov rbp, rsp
mov r10, qword ptr [rdi]
mov r11, qword ptr [rdi + 24]
dec rdx
imul rdx, r11
imul r11, qword ptr [rdi + 32]
dec r9
imul r9, qword ptr [rdi + 48]
lea rax, [r8 + r9]
dec rax
imul rax, qword ptr [rdi + 40]
add rax, rcx
dec rax
imul rax, r11
add rdx, rsi
add rdx, rax
vmovsd xmm0, qword ptr [r10 + 8*rdx - 8] # xmm0 = mem[0],zero
pop rbp
ret
.Lfunc_end0:
.size julia_foo_469, .Lfunc_end0-julia_foo_469
# -- End function
.section ".note.GNU-stack","",@progbits
```
```
julia> @code_native syntax=:intel debuginfo=:none bar(A, 1, 1, 1, 1, 1)
.text
.file "bar"
.globl julia_bar_520 # -- Begin function julia_bar_520
.p2align 4, 0x90
.type julia_bar_520,@function
julia_bar_520: # @julia_bar_520
# %bb.0: # %top
push rbp
mov rbp, rsp
mov rax, qword ptr [rdi + 24]
imul rdx, rax
imul rax, qword ptr [rdi + 32]
mov r10, qword ptr [rdi]
imul r9, qword ptr [rdi + 48]
add r9, r8
imul r9, qword ptr [rdi + 40]
add r9, rcx
imul r9, rax
add rdx, rsi
add rdx, r9
vmovsd xmm0, qword ptr [r10 + 8*rdx] # xmm0 = mem[0],zero
pop rbp
ret
.Lfunc_end0:
.size julia_bar_520, .Lfunc_end0-julia_bar_520
# -- End function
.section ".note.GNU-stack","",@progbits
```
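To see the first claim in a timing rather than in assembly, a benchmark along these lines should work (a sketch assuming BenchmarkTools; the `Ref` interpolation is what hides the index values from the compiler, so the extra subtractions actually have to execute):

```julia
using BenchmarkTools

i1 = Ref(1);  # opaque 1-based indices for foo
i0 = Ref(0);  # opaque offsets for bar, so both load A[1,1,1,1,1]

@btime foo($A, $i1[], $i1[], $i1[], $i1[], $i1[]);
@btime bar($A, $i0[], $i0[], $i0[], $i0[], $i0[]);
```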
```
julia> @code_native syntax=:intel debuginfo=:none foo(As, 0, 0, 0, 0, 0)
.text
.file "foo"
.globl julia_foo_535 # -- Begin function julia_foo_535
.p2align 4, 0x90
.type julia_foo_535,@function
julia_foo_535: # @julia_foo_535
# %bb.0: # %top
push rbp
mov rbp, rsp
imul r9, qword ptr [rdi + 32]
add r9, r8
imul r9, qword ptr [rdi + 24]
add r9, rcx
imul r9, qword ptr [rdi + 16]
add r9, rdx
imul r9, qword ptr [rdi + 8]
add r9, rsi
mov rax, qword ptr [rdi]
vmovsd xmm0, qword ptr [rax + 8*r9] # xmm0 = mem[0],zero
pop rbp
ret
.Lfunc_end0:
.size julia_foo_535, .Lfunc_end0-julia_foo_535
# -- End function
.section ".note.GNU-stack","",@progbits
```
```
julia> @code_native syntax=:intel debuginfo=:none foo(Ao, 0, 0, 0, 0, 0)
.text
.file "foo"
.globl julia_foo_537 # -- Begin function julia_foo_537
.p2align 4, 0x90
.type julia_foo_537,@function
julia_foo_537: # @julia_foo_537
# %bb.0: # %top
push rbp
mov rbp, rsp
push r15
push r14
push r12
push rbx
mov r11, qword ptr [rdi + 8]
mov r10, qword ptr [rdi + 16]
mov r12, qword ptr [rdi + 24]
mov rbx, qword ptr [rdi + 32]
mov rax, qword ptr [rdi + 40]
mov rdi, qword ptr [rdi]
not r11
mov r14, qword ptr [rdi]
mov r15, qword ptr [rdi + 24]
not r10
add r10, rdx
imul r10, r15
imul r15, qword ptr [rdi + 32]
not r12
add r12, rcx
not rbx
add rbx, r8
not rax
add rax, r9
imul rax, qword ptr [rdi + 48]
add rax, rbx
imul rax, qword ptr [rdi + 40]
add rax, r12
imul rax, r15
add r11, rsi
add r11, r10
add r11, rax
vmovsd xmm0, qword ptr [r14 + 8*r11] # xmm0 = mem[0],zero
pop rbx
pop r12
pop r14
pop r15
pop rbp
ret
.Lfunc_end0:
.size julia_foo_537, .Lfunc_end0-julia_foo_537
# -- End function
.section ".note.GNU-stack","",@progbits
```
Note all the `dec` instructions with normal, 1-based indexing; they disappear with `begin+`. (The `- 8` displacement in `foo`'s final `vmovsd` is the remaining subtraction, folded into the addressing mode.)
Note also that `OffsetArray` keeps the offsets dynamic, so you actually get extra additions there (the `not`/`add` pairs above) – that's why it has even more instructions than regular arrays.
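You can see the dynamic offsets directly in the REPL (the `offsets` field is an internal detail of OffsetArrays, shown here only for illustration): they are stored as runtime values and do not appear in the type, so the compiled code must load and apply them.

```julia
julia> Ao.offsets   # runtime data: Origin(0) shifts each 1-based axis by -1
(-1, -1, -1, -1, -1)

julia> typeof(Ao)   # the offsets are not part of the type
OffsetArray{Float64, 5, Array{Float64, 5}}
```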
`StrideArraysCore.jl` allows for both compile-time-known offsets and runtime offsets, so here I used compile-time offsets to show that the indexing needs relatively few instructions.
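For comparison with the `OffsetArray` above, the zero-offset `StrideArray`'s offsets are static (a sketch; the `offsets` accessor comes from ArrayInterface, which StrideArraysCore builds on, and the exact printing may differ across versions):

```julia
julia> using ArrayInterface

julia> ArrayInterface.offsets(As)   # StaticInts: encoded in the type, free at runtime
(static(0), static(0), static(0), static(0), static(0))
```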
Note that I optimized the number of instructions in an indexing operation like this in StrideArraysCore 0.4.16 (released less than an hour ago); 0.4.15 and older will have a lot more instructions (but do still support compile-time and runtime offsets).
Note that the advantage of runtime offsets is avoiding code duplication: you don't need to compile several almost-identical versions of a function just because a few of these values change.
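For example, every origin of an `OffsetArray` shares a single type, so a function like `foo` is compiled exactly once no matter which offsets you use:

```julia
julia> typeof(OffsetArrays.Origin(0)(A)) === typeof(OffsetArrays.Origin(17)(A))
true
```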
In any loop, the optimizer should be able to hoist the offsetting out, meaning this is a cost you only really pay once per non-inlined function call. That is, it is extremely cheap.
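As a sketch of what that hoisting looks like (`sumall` is a made-up example, not from the post):

```julia
function sumall(A)
    s = zero(eltype(A))
    @inbounds for I in CartesianIndices(A)
        # Ao's offsets depend only on A, not on I, so LLVM can fold them into
        # the base pointer once, before the loop, instead of on every iteration.
        s += A[I]
    end
    return s
end

sumall(Ao)  # pays the offset arithmetic once, not once per element
```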
On the other hand, compile times can be arbitrarily expensive.
Dynamic offsets are a good choice if you're likely to use more than one offset value. If not, then of course fixing them at compile time is a little nicer.