Optimisations for loops of known size at compile time

Say I write a function like

function func(n, ::Val{N}) where N
    s = 0
    for i in 1:n
        for j in 1:N
            # some kind of logic here, e.g.
            s += i * j
        end
    end
    return s
end

does the language ever unroll the inner loop or somehow make use of the known size of the loop? Would it make sense to add an optimisation like this? What would have to be done to implement optimisations of this kind?

If we did have such optimisations as part of the language would they not allow for an elegant implementation of StaticArrays.jl? It seems like it would then be easy to implement statically sized versions of just about any type because you simply write code as you do normally and the compiler takes care of using the static size to improve performance.

It does:

julia> function unroll(::Val{N}) where N
           t = 0.0
           for i = 1:N
               t += exp(i)
           end
           return t
       end
unroll (generic function with 1 method)

julia> @code_native unroll(Val(0))
	.section	__TEXT,__text,regular,pure_instructions
	.build_version macos, 12, 0
	.globl	_julia_unroll_198               ## -- Begin function julia_unroll_198
	.p2align	4, 0x90
_julia_unroll_198:                      ## @julia_unroll_198
; β”Œ @ REPL[16]:1 within `unroll`
	.cfi_startproc
## %bb.0:                               ## %top
; β”‚ @ REPL[16]:6 within `unroll`
	vxorps	%xmm0, %xmm0, %xmm0
	retq
	.cfi_endproc
; β””
                                        ## -- End function
.subsections_via_symbols

julia> @code_native unroll(Val(1))
	.section	__TEXT,__text,regular,pure_instructions
	.build_version macos, 12, 0
	.section	__TEXT,__literal8,8byte_literals
	.p2align	3                               ## -- Begin function julia_unroll_200
LCPI0_0:
	.quad	0x4005bf0a8b145769              ## double 2.7182818284590451
	.section	__TEXT,__text,regular,pure_instructions
	.globl	_julia_unroll_200
	.p2align	4, 0x90
_julia_unroll_200:                      ## @julia_unroll_200
; β”Œ @ REPL[16]:1 within `unroll`
	.cfi_startproc
## %bb.0:                               ## %top
	movabsq	$LCPI0_0, %rax
	vmovsd	(%rax), %xmm0                   ## xmm0 = mem[0],zero
; β”‚ @ REPL[16]:6 within `unroll`
	retq
	.cfi_endproc
; β””
                                        ## -- End function
.subsections_via_symbols

julia> @code_native unroll(Val(2))
	.section	__TEXT,__text,regular,pure_instructions
	.build_version macos, 12, 0
	.section	__TEXT,__literal8,8byte_literals
	.p2align	3                               ## -- Begin function julia_unroll_202
LCPI0_0:
	.quad	0x3ff0000000000000              ## double 1
LCPI0_1:
	.quad	0x4000000000000000              ## double 2
	.section	__TEXT,__text,regular,pure_instructions
	.globl	_julia_unroll_202
	.p2align	4, 0x90
_julia_unroll_202:                      ## @julia_unroll_202
; β”Œ @ REPL[16]:1 within `unroll`
	.cfi_startproc
## %bb.0:                               ## %top
; β”‚ @ REPL[16]:4 within `unroll`
; β”‚β”Œ @ math.jl:1356 within `exp`
	pushq	%rbx
	.cfi_def_cfa_offset 16
	subq	$16, %rsp
	.cfi_def_cfa_offset 32
	.cfi_offset %rbx, -16
	movabsq	$_j_exp_204, %rbx
	movabsq	$LCPI0_0, %rax
	vmovsd	(%rax), %xmm0                   ## xmm0 = mem[0],zero
	callq	*%rbx
	vxorpd	%xmm1, %xmm1, %xmm1
; β”‚β””
; β”‚β”Œ @ float.jl:383 within `+`
	vaddsd	%xmm1, %xmm0, %xmm0
	vmovsd	%xmm0, 8(%rsp)                  ## 8-byte Spill
	movabsq	$LCPI0_1, %rax
	vmovsd	(%rax), %xmm0                   ## xmm0 = mem[0],zero
; β”‚β””
; β”‚β”Œ @ math.jl:1356 within `exp`
	callq	*%rbx
; β”‚β””
; β”‚β”Œ @ float.jl:383 within `+`
	vaddsd	8(%rsp), %xmm0, %xmm0           ## 8-byte Folded Reload
; β”‚β””
; β”‚ @ REPL[16]:6 within `unroll`
	addq	$16, %rsp
	popq	%rbx
	retq
	.cfi_endproc
; β””
                                        ## -- End function
.subsections_via_symbols

julia> @code_native unroll(Val(3))
	.section	__TEXT,__text,regular,pure_instructions
	.build_version macos, 12, 0
	.section	__TEXT,__literal8,8byte_literals
	.p2align	3                               ## -- Begin function julia_unroll_205
LCPI0_0:
	.quad	0x3ff0000000000000              ## double 1
LCPI0_1:
	.quad	0x4000000000000000              ## double 2
LCPI0_2:
	.quad	0x4008000000000000              ## double 3
	.section	__TEXT,__text,regular,pure_instructions
	.globl	_julia_unroll_205
	.p2align	4, 0x90
_julia_unroll_205:                      ## @julia_unroll_205
; β”Œ @ REPL[16]:1 within `unroll`
	.cfi_startproc
## %bb.0:                               ## %top
; β”‚ @ REPL[16]:4 within `unroll`
; β”‚β”Œ @ math.jl:1356 within `exp`
	pushq	%rbx
	.cfi_def_cfa_offset 16
	subq	$16, %rsp
	.cfi_def_cfa_offset 32
	.cfi_offset %rbx, -16
	movabsq	$_j_exp_207, %rbx
	movabsq	$LCPI0_0, %rax
	vmovsd	(%rax), %xmm0                   ## xmm0 = mem[0],zero
	callq	*%rbx
	vxorpd	%xmm1, %xmm1, %xmm1
; β”‚β””
; β”‚β”Œ @ float.jl:383 within `+`
	vaddsd	%xmm1, %xmm0, %xmm0
	vmovsd	%xmm0, 8(%rsp)                  ## 8-byte Spill
	movabsq	$LCPI0_1, %rax
	vmovsd	(%rax), %xmm0                   ## xmm0 = mem[0],zero
; β”‚β””
; β”‚β”Œ @ math.jl:1356 within `exp`
	callq	*%rbx
; β”‚β””
; β”‚β”Œ @ float.jl:383 within `+`
	vaddsd	8(%rsp), %xmm0, %xmm0           ## 8-byte Folded Reload
	vmovsd	%xmm0, 8(%rsp)                  ## 8-byte Spill
	movabsq	$LCPI0_2, %rax
	vmovsd	(%rax), %xmm0                   ## xmm0 = mem[0],zero
; β”‚β””
; β”‚β”Œ @ math.jl:1356 within `exp`
	callq	*%rbx
; β”‚β””
; β”‚β”Œ @ float.jl:383 within `+`
	vaddsd	8(%rsp), %xmm0, %xmm0           ## 8-byte Folded Reload
; β”‚β””
; β”‚ @ REPL[16]:6 within `unroll`
	addq	$16, %rsp
	popq	%rbx
	retq
	.cfi_endproc
; β””
                                        ## -- End function
.subsections_via_symbols

julia> @code_native unroll(Val(4))
	.section	__TEXT,__text,regular,pure_instructions
	.build_version macos, 12, 0
	.section	__TEXT,__literal8,8byte_literals
	.p2align	3                               ## -- Begin function julia_unroll_208
LCPI0_0:
	.quad	0x3ff0000000000000              ## double 1
LCPI0_1:
	.quad	0x4000000000000000              ## double 2
LCPI0_2:
	.quad	0x4008000000000000              ## double 3
LCPI0_3:
	.quad	0x4010000000000000              ## double 4
	.section	__TEXT,__text,regular,pure_instructions
	.globl	_julia_unroll_208
	.p2align	4, 0x90
_julia_unroll_208:                      ## @julia_unroll_208
; β”Œ @ REPL[16]:1 within `unroll`
	.cfi_startproc
## %bb.0:                               ## %top
; β”‚ @ REPL[16]:4 within `unroll`
; β”‚β”Œ @ math.jl:1356 within `exp`
	pushq	%rbx
	.cfi_def_cfa_offset 16
	subq	$16, %rsp
	.cfi_def_cfa_offset 32
	.cfi_offset %rbx, -16
	movabsq	$_j_exp_210, %rbx
	movabsq	$LCPI0_0, %rax
	vmovsd	(%rax), %xmm0                   ## xmm0 = mem[0],zero
	callq	*%rbx
	vxorpd	%xmm1, %xmm1, %xmm1
; β”‚β””
; β”‚β”Œ @ float.jl:383 within `+`
	vaddsd	%xmm1, %xmm0, %xmm0
	vmovsd	%xmm0, 8(%rsp)                  ## 8-byte Spill
	movabsq	$LCPI0_1, %rax
	vmovsd	(%rax), %xmm0                   ## xmm0 = mem[0],zero
; β”‚β””
; β”‚β”Œ @ math.jl:1356 within `exp`
	callq	*%rbx
; β”‚β””
; β”‚β”Œ @ float.jl:383 within `+`
	vaddsd	8(%rsp), %xmm0, %xmm0           ## 8-byte Folded Reload
	vmovsd	%xmm0, 8(%rsp)                  ## 8-byte Spill
	movabsq	$LCPI0_2, %rax
	vmovsd	(%rax), %xmm0                   ## xmm0 = mem[0],zero
; β”‚β””
; β”‚β”Œ @ math.jl:1356 within `exp`
	callq	*%rbx
; β”‚β””
; β”‚β”Œ @ float.jl:383 within `+`
	vaddsd	8(%rsp), %xmm0, %xmm0           ## 8-byte Folded Reload
	vmovsd	%xmm0, 8(%rsp)                  ## 8-byte Spill
	movabsq	$LCPI0_3, %rax
	vmovsd	(%rax), %xmm0                   ## xmm0 = mem[0],zero
; β”‚β””
; β”‚β”Œ @ math.jl:1356 within `exp`
	callq	*%rbx
; β”‚β””
; β”‚β”Œ @ float.jl:383 within `+`
	vaddsd	8(%rsp), %xmm0, %xmm0           ## 8-byte Folded Reload
; β”‚β””
; β”‚ @ REPL[16]:6 within `unroll`
	addq	$16, %rsp
	popq	%rbx
	retq
	.cfi_endproc
; β””
                                        ## -- End function
.subsections_via_symbols

Your example doesn’t exhibit unrolling because the compiler replaces the entire function with a closed-form solution. Specialized code gets generated for each N though, but that’s not very useful in this example.

6 Likes

Ah, cool. Then why is ntuple implemented via explicit unrolling for example?

Why not just write

ntuple(f::F, ::Val{N}) where {F,N} = Tuple(f(i) for i = 1:N)

Edit: my version is missing the ArgumentError but I think you get the point.

In general, how reliable is this behaviour? Can I generally write loops and trust the compiler to make use of the known size of the loop, or does this only sometimes happen, so that I need to check @code_native to be sure in each case?

Note that currently at least, this kind of unrolling only happens at the LLVM level. The fact that it only happens after type inference, constant propagation, DCE, inlining etc have already been run means that this unrolling can’t eliminate type instabilities inside your loop for example. That’s why ntuple doesn’t use a for-loop but instead does manual unrolling at the Julia level.

8 Likes