`@code_native` shows an algorithm that appears to be O(N), but benchmarking suggests that it's actually O(1)

I hope this is an OK category to put this under; I couldn’t see anything that matched better :grimacing:

I’m trying to get better at reading assembly output by looking at some interesting optimisations, and I’m currently looking at the famous sum-of-the-first-n-integers optimisation. When I look at the sum of a contiguous range of integers, I get:

julia> @code_native debuginfo=:none syntax=:intel sum(i for i in 1:1000)
        .text
        mov     rcx, qword ptr [rdi]
        mov     rdx, qword ptr [rdi + 8]
        mov     rsi, rdx
        sub     rsi, rcx
        jge     L18
        xor     eax, eax
        ret
L18:
        lea     rax, [rcx + 1]
        imul    rax, rsi
        mov     rdi, rcx
        not     rdi
        add     rdx, rdi
        mulx    rsi, rdx, rsi
        shld    rsi, rdx, 63
        add     rax, rcx
        add     rax, rsi
        ret
        nop     word ptr cs:[rax + rax]

I can work through and figure out that the assembly is basically implementing the formula (stop^2 - start^2 + start + stop)/2, which is exactly what I expect.
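
As a quick sanity check of that closed form (gauss_sum is just a throwaway name here, not anything from Base):

julia> gauss_sum(start, stop) = (stop^2 - start^2 + start + stop) ÷ 2;

julia> gauss_sum(1, 1000) == sum(1:1000) == 500500
true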

When I look at the assembly for a sum of strided integers, however, the assembly seems to be a normal loop

julia> @code_native debuginfo=:none syntax=:intel sum(i for i in 3:3:1000)
        .text
        mov     rcx, qword ptr [rdi + 16]
        mov     rsi, qword ptr [rdi]
        mov     r8, qword ptr [rdi + 8]
        test    r8, r8
        setg    al
        cmp     rsi, rcx
        setl    dl
        cmp     rsi, rcx
        je      L35
        xor     al, dl
        je      L35
        xor     eax, eax
        ret
L35:
        lea     rdi, [rsi + r8]
        add     rcx, r8
        sub     rcx, rsi
        xor     edx, edx
        nop
L48:
        mov     rax, rsi
        lea     rsi, [rdi + rdx]
        add     rsi, rax
        add     rdx, r8
        cmp     rcx, rdx
        jne     L48
        ret
        nop     word ptr cs:[rax + rax]

which I took to mean that LLVM just didn’t implement the optimisation for strided ranges. The strange thing is that benchmarking seems to show that this is actually a constant-time algorithm:

julia> time(minimum(@benchmark sum(i for i in 3:3:10)))
1.636

julia> time(minimum(@benchmark sum(i for i in 3:3:10000)))
1.637

julia> time(minimum(@benchmark sum(i for i in 3:3:10000000)))
1.636

julia> time(minimum(@benchmark sum(i for i in 3:3:10000000000)))
1.636

but only when applied to native integers

julia> time(minimum(@benchmark sum(i for i in $(3:3:big(10)))))
307.8089430894309

julia> time(minimum(@benchmark sum(i for i in $(3:3:big(100)))))
4518.714285714285

julia> time(minimum(@benchmark sum(i for i in $(3:3:big(1000)))))
46339.0

julia> time(minimum(@benchmark sum(i for i in $(3:3:big(10000)))))
467333.0

So what’s happening here? Is @code_native showing the wrong output, or is the loop just so fast that it appears to be constant time?

Edit: Should probably mention:

julia> versioninfo()
Julia Version 1.6.1
Commit 6aaedecc44 (2021-04-23 05:59 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i5-7200U CPU @ 2.50GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, skylake)

The compiler is likely defeating your benchmark, since it can probably just use constant propagation to optimize away the actual calculation. Try @benchmark sum($(i for i in 3:3:10000)), which should inhibit constant prop.


The following cases A (loop) and B (constant) should be distinguished.

julia> VERSION
v"1.6.1"

Case A (loop)

julia> f(ran) = sum(i for i in ran);
julia> @code_native debuginfo=:none syntax=:intel f(3:3:1000)
	.text
	push	rbp
	mov	rbp, rsp
	mov	r8, qword ptr [rcx + 16]
	mov	rdx, qword ptr [rcx]
	mov	r9, qword ptr [rcx + 8]
	test	r9, r9
	setg	al
	cmp	rdx, r8
	setl	cl
	cmp	rdx, r8
	je	L40
	xor	al, cl
	je	L40
	xor	eax, eax
	pop	rbp
	ret
L40:
	lea	r10, [rdx + r9]
	add	r8, r9
	sub	r8, rdx
	xor	ecx, ecx
	nop	word ptr cs:[rax + rax]
L64:
	mov	rax, rdx
	lea	rdx, [r10 + rcx]
	add	rdx, rax
	add	rcx, r9
	cmp	r8, rcx
	jne	L64
	pop	rbp
	ret
	nop	word ptr cs:[rax + rax]

Case B (constant)

julia> f_3_3_1000() = f(3:3:1000)
julia> @code_native debuginfo=:none syntax=:intel f_3_3_1000()
	.text
	push	rbp
	mov	rbp, rsp
	mov	eax, 166833
	pop	rbp
	ret
	nop	dword ptr [rax + rax]

The point of the benchmark was to show that the output of @code_native is not actually the same assembly code that is being run (or maybe it is and there’s something else going on; that’s actually the point of my question). The benchmark is working as intended; it’s @code_native that is being defeated in this case.

That’s very interesting. So it seems like there are certain optimisations that are not run at the “top-level” of @code_native; would that be fair to say? If so, any idea which optimisations those are?

I’ve just realised that I can write these sums as sum(3:3:1000), in which case the assembly output does give the constant-time algorithm (well it gives a constant-time algorithm at least…):

julia> @code_native debuginfo=:none syntax=:intel sum(3:3:1000)
        .text
        push    rbx
        mov     rbx, rdi
        movabs  rax, offset length
        call    rax
        mov     rcx, rax
        mov     rax, qword ptr [rbx]
        test    cl, 1
        jne     L48
        lea     rsi, [rcx - 1]
        mov     rdx, rcx
        sar     rdx
        imul    rdx, rsi
        imul    rdx, qword ptr [rbx + 8]
        jmp     L64
L48:
        lea     rdx, [rcx - 1]
        sar     rdx
        imul    rdx, rcx
        imul    rdx, qword ptr [rbx + 8]
L64:
        imul    rax, rcx
        add     rax, rdx
        pop     rbx
        ret
        nop     dword ptr [rax]

:man_facepalming:

I’m still intrigued about why @code_native sum(i for i in 3:3:1000) only shows the O(N) algorithm even though it clearly does get optimised to the O(1) algorithm.

Not to derail the conversation, but an annotated @code_native for the const-fold O(1) case would be extremely helpful, if someone has the time.

It’s in ARM assembly, but I think it’s easier to read anyway.

julia> g() = sum(3:3:1000)
g (generic function with 1 method)

julia> g()
166833

julia> @code_native g()
	.section	__TEXT,__text,regular,pure_instructions
; ┌ @ REPL[11]:1 within `g`
	mov	w0, #35761 // moves 35761 (0x8BB1) into register w0
	movk	w0, #2, lsl #16 // writes 2 into the upper 16 bits, giving 0x28BB1 = 166833
	ret // returns
; └

LLVM code

julia> @code_llvm g()
;  @ REPL[26]:1 within `g`
define i64 @julia_g_369() #0 {
top:
  ret i64 166833
}

Of note, sum(3:3:3000) actually isn’t a loop in Julia-land. Thanks to the magic of typing, we have the function

function sum(r::AbstractRange{<:Real})
    l = length(r)
    # note that a little care is required to avoid overflow in l*(l-1)/2
    return l * first(r) + (iseven(l) ? (step(r) * (l-1)) * (l>>1)
                                     : (step(r) * l) * ((l-1)>>1))
end
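
Just as a sketch, checking that this closed form agrees with a plain left-to-right accumulation on a few strided ranges:

julia> naive_sum(r) = foldl(+, r; init = zero(eltype(r)));  # plain O(N) accumulation

julia> all(sum(r) == naive_sum(r) for r in (3:3:1000, 5:7:10_000, 10:-2:-50))
true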

The reason you aren’t seeing the constant propagation in the @code_native output, even though it happens in the benchmark, is that the benchmark interpolates it while @code_native runs at global scope. Put it into a function and it shows the expected assembly:

julia> g() = sum(i for i in 3:3:10000)
g (generic function with 1 method)

julia> @code_native g()
	.section	__TEXT,__text,regular,pure_instructions
; ┌ @ REPL[12]:1 within `g`
	mov	w0, #22189
	movk	w0, #254, lsl #16
	ret
; └

julia> @code_native debuginfo=:none sum(i for i in 3:3:10000)
	.section	__TEXT,__text,regular,pure_instructions
	ldp	x9, x10, [x0, #8]
	ldr	x8, [x0]
	cmp	x9, #0                          ; =0
	cset	w11, gt
	cmp	x8, x10
	cset	w12, lt
	b.eq	L44
	eor	w11, w11, w12
	tbz	w11, #0, L44
	mov	x0, #0
	ret
L44:
	add	x10, x9, x10
	mov	x11, x8
L52:
	mov	x0, x11
	add	x8, x9, x8
	add	x11, x8, x11
	cmp	x10, x8
	b.ne	L52
	ret

If we hide the whole range behind a Ref so the compiler can’t constant-fold it, the timings scale with the length of the range:

julia> @btime sum(i for i in $(Ref(3:3:10))[])
  3.041 ns (0 allocations: 0 bytes)
18

julia> @btime sum(i for i in $(Ref(3:3:100))[])
  12.387 ns (0 allocations: 0 bytes)
1683

julia> @btime sum(i for i in $(Ref(3:3:1000))[])
  123.477 ns (0 allocations: 0 bytes)
166833

This matches the assembly.
If we let the compiler specialize on the step, we’re back to O(1):

julia> @btime sum(i for i in 3:3:$(Ref(10))[])
  2.666 ns (0 allocations: 0 bytes)
18

julia> @btime sum(i for i in 3:3:$(Ref(100))[])
  2.708 ns (0 allocations: 0 bytes)
1683

julia> @btime sum(i for i in 3:3:$(Ref(1000))[])
  2.666 ns (0 allocations: 0 bytes)
166833

With corresponding O(1) assembly to match:

julia> g(N) =  sum(i for i in 3:3:N)
g (generic function with 2 methods)

julia> @code_native g(100) # AArch64
        .section        __TEXT,__text,regular,pure_instructions
        stp     x29, x30, [sp, #-16]!
        adrp    x8, #0                          ; StepRange;
        ldr     x8, [x8, #2192]
        mov     x2, x0
        mov     w0, #3
        mov     w1, #3
        blr     x8
        cmp     x0, #3                          ; =3
        b.lt    L88
        mov     x8, #6148914691236517205
        mov     x9, #-2
        mul     x8, x0, x8
        mvn     x10, x8
        sub     x8, x9, x8
        mul     x9, x10, x8
        umulh   x8, x10, x8
        extr    x8, x8, x9, #1
        add     x8, x8, x8, lsl #1
        add     x8, x8, x0, lsl #1
        sub     x0, x8, #3                      ; =3
        ldp     x29, x30, [sp], #16
        ret
L88:
        mov     x0, xzr
        ldp     x29, x30, [sp], #16
        ret

It’s not about whether it’s at top level or not. @code_native sum(i for i in 3:3:10000) simply expands to code_native(sum, (typeof(i for i in 3:3:10000),)), so it is intentional that the code is specialized solely on the argument types, without any constant propagation or propagation of other constraints. To get that, you can simply write:

code_native(()) do
    sum(i for i in 3:3:10000)
end
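
And for reference, the spelled-out equivalent of what @code_native itself does (per the expansion mentioned above), which prints the un-propagated loop:

julia> gen = (i for i in 3:3:10000);

julia> code_native(sum, (typeof(gen),); debuginfo=:none)  # same O(N) loop as shown earlier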

I will also recommend Cthulhu.jl (GitHub - JuliaDebug/Cthulhu.jl: The slow descent into madness) for interactively exploring things like constant propagation.
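
For example (a minimal sketch; assumes Cthulhu is installed and uses its @descend entry point):

julia> using Cthulhu

julia> h() = sum(i for i in 3:3:10000);

julia> @descend h()   # interactively descend into the calls and see where constants propagate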


Thank you, this seems like it should help a lot!