# `@code_native` shows an algorithm that appears to be O(N), but benchmarking suggests that it's actually O(1)

I hope this is an OK category to put this under; I couldn't see anything that matched better. I'm trying to get better at reading assembly output by looking at some interesting optimisations, and I'm currently looking at the famous sum-of-`n`-integers optimisation. When I look at the sum of contiguous integers:

``````julia> @code_native debuginfo=:none syntax=:intel sum(i for i in 1:1000)
.text
mov     rcx, qword ptr [rdi]
mov     rdx, qword ptr [rdi + 8]
mov     rsi, rdx
sub     rsi, rcx
jge     L18
xor     eax, eax
ret
L18:
lea     rax, [rcx + 1]
imul    rax, rsi
mov     rdi, rcx
not     rdi
mulx    rsi, rdx, rsi
shld    rsi, rdx, 63
ret
nop     word ptr cs:[rax + rax]
``````

I can work through and figure out that the assembly is basically implementing the formula `(stop^2 - start^2 + start + stop)/2`, which is exactly what I expect.
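For reference, the formula can be checked directly in Julia (a quick sketch of my own; `closed_form` and `naive` are made-up names):

``````julia
# Check the closed form recovered from the assembly against a naive
# generator sum over the same contiguous range.
closed_form(start, stop) = (stop^2 - start^2 + start + stop) ÷ 2
naive(start, stop) = sum(i for i in start:stop)

@assert closed_form(1, 1000) == naive(1, 1000) == 500500
``````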

When I look at the assembly for a sum of strided integers, however, it seems to be a normal loop:

``````julia> @code_native debuginfo=:none syntax=:intel sum(i for i in 3:3:1000)
.text
mov     rcx, qword ptr [rdi + 16]
mov     rsi, qword ptr [rdi]
mov     r8, qword ptr [rdi + 8]
test    r8, r8
setg    al
cmp     rsi, rcx
setl    dl
cmp     rsi, rcx
je      L35
xor     al, dl
je      L35
xor     eax, eax
ret
L35:
lea     rdi, [rsi + r8]
sub     rcx, rsi
xor     edx, edx
nop
L48:
mov     rax, rsi
lea     rsi, [rdi + rdx]
cmp     rcx, rdx
jne     L48
ret
nop     word ptr cs:[rax + rax]
``````

which I took to mean that LLVM just didn't implement the optimisation for strided ranges. The strange thing is that benchmarking seems to show that this is actually a constant-time algorithm:

``````julia> time(minimum(@benchmark sum(i for i in 3:3:10)))
1.636

julia> time(minimum(@benchmark sum(i for i in 3:3:10000)))
1.637

julia> time(minimum(@benchmark sum(i for i in 3:3:10000000)))
1.636

julia> time(minimum(@benchmark sum(i for i in 3:3:10000000000)))
1.636
``````

but only when applied to native integers:

``````julia> time(minimum(@benchmark sum(i for i in $(3:3:big(10)))))
307.8089430894309

julia> time(minimum(@benchmark sum(i for i in $(3:3:big(100)))))
4518.714285714285

julia> time(minimum(@benchmark sum(i for i in $(3:3:big(1000)))))
46339.0

julia> time(minimum(@benchmark sum(i for i in $(3:3:big(10000)))))
467333.0
``````

So what’s happening here? Is `@code_native` showing the wrong output, or is the loop just so fast that it appears to be constant time?

Edit: Should probably mention:

``````julia> versioninfo()
Julia Version 1.6.1
Commit 6aaedecc44 (2021-04-23 05:59 UTC)
Platform Info:
OS: Linux (x86_64-pc-linux-gnu)
CPU: Intel(R) Core(TM) i5-7200U CPU @ 2.50GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-11.0.1 (ORCJIT, skylake)
``````

The compiler is likely defeating your benchmark, since it can probably just use constant propagation to optimize away the actual calculation. Try `@benchmark sum($(i for i in 3:3:10000))`, which should inhibit constant prop.
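To make the distinction concrete (a sketch with made-up function names, not from the posts above): when the range is a literal inside a function body, the compiler is free to fold the whole sum to a constant, but when it arrives as a runtime argument it has to do the work at run time.

``````julia
# f_const's range is known at compile time, so the whole sum can be
# folded into a single literal; f_arg only sees the range at runtime.
f_const() = sum(i for i in 3:3:10000)
f_arg(r) = sum(i for i in r)

@assert f_const() == f_arg(3:3:10000) == 16668333
``````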


The following cases A (loop) and B (constant) should be distinguished.

``````julia> VERSION
v"1.6.1"
``````

Case A (loop)

``````julia> f(ran) = sum(i for i in ran);
julia> @code_native debuginfo=:none syntax=:intel f(3:3:1000)
.text
push	rbp
mov	rbp, rsp
mov	r8, qword ptr [rcx + 16]
mov	rdx, qword ptr [rcx]
mov	r9, qword ptr [rcx + 8]
test	r9, r9
setg	al
cmp	rdx, r8
setl	cl
cmp	rdx, r8
je	L40
xor	al, cl
je	L40
xor	eax, eax
pop	rbp
ret
L40:
lea	r10, [rdx + r9]
sub	r8, rdx
xor	ecx, ecx
nop	word ptr cs:[rax + rax]
L64:
mov	rax, rdx
lea	rdx, [r10 + rcx]
cmp	r8, rcx
jne	L64
pop	rbp
ret
nop	word ptr cs:[rax + rax]
``````

Case B (constant)

``````julia> f_3_3_1000() = f(3:3:1000)
julia> @code_native debuginfo=:none syntax=:intel f_3_3_1000()
.text
push	rbp
mov	rbp, rsp
mov	eax, 166833
pop	rbp
ret
nop	dword ptr [rax + rax]
``````

The point of the benchmark was to show that the output of `@code_native` is not actually the same assembly code that is being run (or maybe it is and there’s something else going on, that’s actually the point of my question). The benchmark is working as intended, it’s `@code_native` that is being defeated in this case.

That’s very interesting. So it seems like there are certain optimisations that are not run at the “top-level” of `@code_native`, would that be fair to say? If so, any idea which optimisations those are?

I’ve just realised that I can write these sums as `sum(3:3:1000)`, in which case the assembly output does give the constant-time algorithm (well it gives a constant-time algorithm at least…):

``````julia> @code_native debuginfo=:none syntax=:intel sum(3:3:1000)
.text
push    rbx
mov     rbx, rdi
movabs  rax, offset length
call    rax
mov     rcx, rax
mov     rax, qword ptr [rbx]
test    cl, 1
jne     L48
lea     rsi, [rcx - 1]
mov     rdx, rcx
sar     rdx
imul    rdx, rsi
imul    rdx, qword ptr [rbx + 8]
jmp     L64
L48:
lea     rdx, [rcx - 1]
sar     rdx
imul    rdx, rcx
imul    rdx, qword ptr [rbx + 8]
L64:
imul    rax, rcx
pop     rbx
ret
nop     dword ptr [rax]
``````

I’m still intrigued as to why `@code_native sum(i for i in 3:3:1000)` only shows the `O(N)` algorithm even though it clearly does get optimised to the `O(1)` algorithm.
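Both spellings compute the same value, of course; the difference is purely which method is hit (a quick sketch): calling `sum` on the range dispatches to Base's `AbstractRange` method, while the generator version goes through the generic iteration-based fallback.

``````julia
# Same result either way; only the method that gets dispatched to
# (and hence the generated code) differs between the two spellings.
@assert sum(3:3:1000) == sum(i for i in 3:3:1000) == 166833
``````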

Not to derail the conversation, but an annotated `@code_native` for the const-fold O(1) case would be extremely helpful, if someone has the time.

It’s in ARM assembly, but I think it’s easier to read anyway.

``````julia> g() = sum(3:3:1000)
g (generic function with 1 method)

julia> g()
166833

julia> @code_native g()
.section	__TEXT,__text,regular,pure_instructions
; ┌ @ REPL:1 within `g`
mov	w0, #35761 //moves 35761 to register w0 which is 0x8BB1
movk	w0, #2, lsl #16 //moves 2 in front of w0 giving 0x28BB1 which is 166833
ret //returns
; └
``````
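The two moves can be decoded by hand (a sketch checking the arithmetic, not from the original post):

``````julia
# mov sets the low 16 bits; movk ... lsl #16 patches bits 16..31
# while keeping the rest, so the register ends up holding 166833.
lo = 35761   # mov  w0, #35761
hi = 2       # movk w0, #2, lsl #16
@assert lo + (hi << 16) == 166833
``````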

LLVM code

``````julia> @code_llvm g()
;  @ REPL:1 within `g`
define i64 @julia_g_369() #0 {
top:
ret i64 166833
}``````

Of note, `sum(3:3:1000)` actually isn’t a loop in Julia-land. Thanks to the magic of typing, we have the function

``````function sum(r::AbstractRange{<:Real})
    l = length(r)
    # note that a little care is required to avoid overflow in l*(l-1)/2
    return l * first(r) + (iseven(l) ? (step(r) * (l-1)) * (l>>1)
                                     : (step(r) * l) * ((l-1)>>1))
end
``````
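To see that this closed form is exact for the strided case too (a sketch; `range_sum` is just a local copy of the method above, renamed to avoid touching `Base.sum`):

``````julia
# Local copy of the Base range-sum formula for testing.
function range_sum(r)
    l = length(r)
    return l * first(r) + (iseven(l) ? (step(r) * (l-1)) * (l>>1)
                                     : (step(r) * l) * ((l-1)>>1))
end

# Compare against an explicit elementwise sum over the materialised range.
@assert range_sum(3:3:1000) == sum(collect(3:3:1000)) == 166833
``````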

The reason you aren’t seeing the constant propagation in `@code_native`, while it does happen in the benchmark, is that the benchmark interpolates the expression but `@code_native` runs in the global scope. Put it into a function and it shows the expected assembly.

``````julia> g() = sum(i for i in 3:3:10000)
g (generic function with 1 method)

julia> @code_native g()
.section	__TEXT,__text,regular,pure_instructions
; ┌ @ REPL:1 within `g`
mov	w0, #22189
movk	w0, #254, lsl #16
ret
; └

julia> @code_native debuginfo=:none sum(i for i in 3:3:10000)
.section	__TEXT,__text,regular,pure_instructions
ldp	x9, x10, [x0, #8]
ldr	x8, [x0]
cmp	x9, #0                          ; =0
cset	w11, gt
cmp	x8, x10
cset	w12, lt
b.eq	L44
eor	w11, w11, w12
tbz	w11, #0, L44
mov	x0, #0
ret
L44:
mov	x11, x8
L52:
mov	x0, x11
cmp	x10, x8
b.ne	L52
ret
``````
``````julia> @btime sum(i for i in $(Ref(3:3:10))[])
3.041 ns (0 allocations: 0 bytes)
18

julia> @btime sum(i for i in $(Ref(3:3:100))[])
12.387 ns (0 allocations: 0 bytes)
1683

julia> @btime sum(i for i in $(Ref(3:3:1000))[])
123.477 ns (0 allocations: 0 bytes)
166833
``````

This matches the assembly.
If we let the compiler specialize on the step, we’re back to O(1):

``````julia> @btime sum(i for i in 3:3:$(Ref(10))[])
2.666 ns (0 allocations: 0 bytes)
18

julia> @btime sum(i for i in 3:3:$(Ref(100))[])
2.708 ns (0 allocations: 0 bytes)
1683

julia> @btime sum(i for i in 3:3:$(Ref(1000))[])
2.666 ns (0 allocations: 0 bytes)
166833
``````

With corresponding O(1) assembly to match:

``````julia> g(N) =  sum(i for i in 3:3:N)
g (generic function with 2 methods)

julia> @code_native g(100) # AArch64
``````
``````        .section        __TEXT,__text,regular,pure_instructions
stp     x29, x30, [sp, #-16]!
ldr     x8, [x8, #2192]
mov     x2, x0
mov     w0, #3
mov     w1, #3
blr     x8
cmp     x0, #3                          ; =3
b.lt    L88
mov     x8, #6148914691236517205
mov     x9, #-2
mul     x8, x0, x8
mvn     x10, x8
sub     x8, x9, x8
mul     x9, x10, x8
umulh   x8, x10, x8
extr    x8, x8, x9, #1
add     x8, x8, x8, lsl #1
add     x8, x8, x0, lsl #1
sub     x0, x8, #3                      ; =3
ldp     x29, x30, [sp], #16
ret
L88:
mov     x0, xzr
ldp     x29, x30, [sp], #16
ret
``````
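Incidentally, the big magic number in that listing is LLVM replacing the division by the step with a fixed-point multiply (a sketch of the decoding, my own addition):

``````julia
# 6148914691236517205 == 0x5555555555555555 == ⌊2^64/3⌋, the fixed-point
# reciprocal of 3 used in the strength-reduced division by the step.
magic = UInt64(6148914691236517205)
@assert magic == 0x5555555555555555
@assert magic * UInt64(3) == typemax(UInt64)  # 3·magic ≡ -1 (mod 2^64)
``````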

It’s not about top level or not. `@code_native sum(i for i in 3:3:10000)` simply expands to `code_native(sum, (typeof(i for i in 3:3:10000),))`, so it is intentional that the code is specialized solely on the argument types without any constant propagation or propagation of other constraints. To get that, you can simply write:

``````code_native(()) do
    sum(i for i in 3:3:10000)
end
``````

I will also recommend [Cthulhu.jl](https://github.com/JuliaDebug/Cthulhu.jl) here for interactively exploring things like constant propagation.


Thank you, this seems like it should help a lot!