Here’s my Optimization For Julia Beginners:
(Mostly writing this to establish the limits of my own ignorance. A full, approachable, and really really cool intro to all of this is in the Self Guided Course to Compilers, which talks about all this stuff in detail and with very friendly explanations. Dead code and other goodies are in the lecture here.)
Part 1: LLVM Passes
We need to understand some beginner compiler concepts first to answer your questions.
Imagine this is your Julia code:
x = 3
x = 4
y = x + 1
You can spot that the code should result in `y == 5, x == 4`. That means the line `x = 3` is "dead", and a relatively simple analysis of the code would convert it into
x = 4
y = x + 1
(This is a contrived example, but we’ll build up to the other optimizations.)
An analysis that can look at the code and eliminate unneeded lines is called “dead code elimination”. If it was smart enough to figure out that you only needed the last result, the “entire program” could be shrunk further to
y = 5
in what would be called a “constant propagation” analysis. These different analyses are called passes in LLVM, and the general notion is to build data structures (a Control Flow Graph, or CFG, for example) about the code so that you can apply this automated reasoning and hopefully a) prove certain optimizations are correct and b) apply them to make the code faster.
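You can actually watch Julia (and LLVM) do this to our toy example. Here's a minimal sketch - `deadcode` is just a name I made up, and the exact output depends on your Julia version - but the gist is that the compiled code is literally just "return 5":
julia> function deadcode()
           x = 3      # dead store: overwritten before it's ever read
           x = 4
           y = x + 1
           return y
       end;

julia> @code_llvm debuginfo=:none deadcode()
# On a recent Julia this prints something along the lines of:
#   define i64 @julia_deadcode() {
#   top:
#     ret i64 5
#   }
# i.e. the dead store and the addition are both gone; only the constant 5 survives.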
These were simple optimizations in that you only need to reason "locally" about what happens in the code lines immediately around the addition operation, but you can imagine that this gets more and more complex as you add in loops, function boundaries, different compiler architectures…
Furthermore, there's a planning issue - some optimizations are wayyy more profitable to run after others. Inlining (copy-pasting the function code to where it's called) enables many other optimizations to happen after it (because you just brought code chunks into a local context, so the local passes are easier and cheaper!). The point is, the compiler now also has to worry about the scheduling, or ordering, of the optimization passes.
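As a tiny illustration (made-up function names; the output is paraphrased from what a recent Julia shows), you can watch the inliner at work with `@code_typed`:
julia> add_one(x) = x + 1;

julia> double_it(x) = 2 * add_one(x);

julia> @code_typed double_it(3)
# Roughly:
#   %1 = Base.add_int(x, 1)::Int64
#   %2 = Base.mul_int(2, %1)::Int64
#   return %2
# There is no call to add_one left - its body was copy-pasted (inlined) into double_it,
# which is what lets later local passes see and optimize across it.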
At a certain point, it becomes really hard to completely figure out which optimizations you can prove correct for the code (because you need to update and manipulate the CFG), and it's best to just have heuristics, or cost models, for when an optimization is profitable. (This is why you will hear about the inliner cost model being a big thing to know about when you are exploring those depths.) This is just the classic tradeoff of compilation time vs. run-time performance - it's just not profitable to analyze and prove all possible correct transformations of the code, so a certain set of heuristics is established that works "well enough".
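You can also nudge those heuristics yourself: Julia exposes `@inline` and `@noinline` to override the inliner's cost model for a particular method. A quick sketch (again with made-up names):
julia> @noinline add_one_heavy(x) = x + 1;   # pretend the cost model decided this is "too expensive" to inline

julia> double_heavy(x) = 2 * add_one_heavy(x);

julia> @code_typed double_heavy(3)
# Now the call survives as an `invoke` of add_one_heavy instead of being inlined,
# so follow-on local optimizations can no longer see across that function boundary.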
This is why you can set `--optimize` to `0`, `1`, `2`, or `3`. You're just telling the compiler how much effort it should put into giving you those tasty, tasty FLOP/s, at the cost of more compilation time/work.
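If you ever want to check what a running session was started with, one way (as far as I know) is to peek at `Base.JLOptions()`, which records the command-line flags:
# Started with: julia --optimize=0
julia> Base.JLOptions().opt_level
0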
Part 2: Answering your question
So what do `--compile=min --optimize=0` actually do?
- `--optimize=0` will reduce the number of passes that actually get applied down to a bare minimum set of profitable ones. Reading the LLVM Passes list is daunting - there are dozens of them. (Side note - the autovectorization pass is what gives your code SIMD speedups; it's just another analysis that gets run on your code.) It will also dial down heuristics like the inliner cost model so they fire less frequently, in order to avoid work (there being a trade-off between compilation time and optimized runtime). IIUC, anyway - someone else will probably correct me on this, as I couldn't find documentation in the manual for this flag.
- `--compile=min`, on the other hand, tries to optimize not for performance but for program size, e.g. reducing the amount of code generated and cached in the process. This is desirable when shipping binaries and executables, and it's obviously in tension with the inliner optimizations (which are, handwavingly, copy-pasting function code all over the place).
Together, those flags generally hint that you want to share programs with others and don't want to spend oodles of compilation time on them.
Here’s an example:
# Run with julia 1.6-rc1 --compile=min --optimize=0
julia> using BenchmarkTools

julia> function mysum(xs)
           res = 0
           for i in xs
               res += i
           end
           res
       end
mysum (generic function with 1 method)

julia> xs = collect(1:1000000);

julia> @btime mysum($xs);
  7.480 ms (6 allocations: 208 bytes)

# Run with normal Julia 1.6-rc1
julia> @btime mysum($xs);
  393.703 μs (0 allocations: 0 bytes)
And the emitted code is here:
# --compile=min --optimize=0 version
julia> @code_native debuginfo=:none mysum(collect(1:100))
.text
subq $88, %rsp
xorl %eax, %eax
movl %eax, %ecx
vxorps %xmm0, %xmm0, %xmm0
vmovdqa %xmm0, 64(%rsp)
movq $0, 80(%rsp)
movq %fs:0, %rdx
movq %rdx, %rsi
addq $-32768, %rsi # imm = 0x8000
movq $4, 64(%rsp)
leaq 72(%rsp), %r8
movq -32768(%rdx), %r9
movq %r9, 72(%rsp)
leaq 64(%rsp), %r9
movq %r9, -32768(%rdx)
movq $0, 80(%rsp)
movq %rdi, 80(%rsp)
movq %rdi, %rdx
movq %rdx, %r9
movq 8(%rdx), %rdx
cmpq %rdx, %rcx
setb %r10b
andb $1, %r10b
andb $1, %r10b
xorb $-1, %r10b
testb $1, %r10b
movb $1, %r10b
movq %rdi, 56(%rsp)
movq %rsi, 48(%rsp)
movq %r8, 40(%rsp)
movq %r9, 32(%rsp)
movq %rdx, 24(%rsp)
movb %r10b, 23(%rsp)
movq %rcx, 8(%rsp)
jne L193
xorl %eax, %eax
movq 32(%rsp), %rcx
movq (%rcx), %rdx
movq (%rdx), %rdx
movb %al, 23(%rsp)
movq %rdx, 8(%rsp)
L193:
movq 8(%rsp), %rax
movb 23(%rsp), %cl
xorl %edx, %edx
movl %edx, %esi
xorb $1, %cl
xorb $-1, %cl
testb $1, %cl
movl $2, %edi
movq 24(%rsp), %r8
movq 56(%rsp), %r9
movq %rsi, %r10
movq %r8, (%rsp)
movq %r9, -8(%rsp)
movq %rax, -16(%rsp)
movq %rdi, -24(%rsp)
movq %r10, -32(%rsp)
movq %rsi, -40(%rsp)
jne L533
L268:
movq -32(%rsp), %rax
movq -24(%rsp), %rcx
movq -16(%rsp), %rdx
movq -8(%rsp), %rsi
movq (%rsp), %rdi
addq %rdx, %rax
movq %rcx, %rdx
subq $1, %rdx
cmpq %rdi, %rdx
setb %r8b
andb $1, %r8b
andb $1, %r8b
xorb $-1, %r8b
testb $1, %r8b
movb $1, %r8b
movq %rcx, -48(%rsp)
movq %rax, -56(%rsp)
movq %rdx, -64(%rsp)
movq %rsi, -72(%rsp)
movq %rdi, -80(%rsp)
movq %r9, -88(%rsp)
movb %r8b, -89(%rsp)
jne L423
xorl %eax, %eax
movq 32(%rsp), %rcx
movq (%rcx), %rdx
movq -64(%rsp), %rsi
movq (%rdx,%rsi,8), %rdx
movq -48(%rsp), %rdi
addq $1, %rdi
movq 56(%rsp), %r8
movq %r8, -72(%rsp)
movq %rdx, -80(%rsp)
movq %rdi, -88(%rsp)
movb %al, -89(%rsp)
L423:
movb -89(%rsp), %al
movq -88(%rsp), %rcx
movq -80(%rsp), %rdx
movq -72(%rsp), %rsi
xorb $1, %al
xorb $-1, %al
testb $1, %al
movq -56(%rsp), %rdi
movq %rcx, -104(%rsp)
movq %rdx, -112(%rsp)
movq %rsi, -120(%rsp)
movq %rdi, -40(%rsp)
jne L533
movq -120(%rsp), %rax
movq 8(%rax), %rax
movq -120(%rsp), %rcx
movq -112(%rsp), %rdx
movq -104(%rsp), %rsi
movq -56(%rsp), %rdi
movq %rax, (%rsp)
movq %rcx, -8(%rsp)
movq %rdx, -16(%rsp)
movq %rsi, -24(%rsp)
movq %rdi, -32(%rsp)
jmp L268
L533:
movq -40(%rsp), %rax
movq 40(%rsp), %rcx
movq (%rcx), %rdx
movq 48(%rsp), %rsi
movq %rdx, (%rsi)
addq $88, %rsp
retq
nop
And here is the normal Julia version:
julia> @code_native debuginfo=:none mysum(collect(1:100))
.text
movq 8(%rdi), %rcx
testq %rcx, %rcx
je L195
movq (%rdi), %rdx
movq (%rdx), %rax
cmpq $1, %rcx
je L191
cmpq $2, %rcx
movl $2, %esi
cmovbeq %rsi, %rcx
leaq -1(%rcx), %r8
movl $1, %edi
cmpq $16, %r8
jb L170
movq %r8, %r9
andq $-16, %r9
leaq 1(%r9), %rdi
leaq 2(%r9), %rsi
vmovq %rax, %xmm0
vpxor %xmm1, %xmm1, %xmm1
xorl %eax, %eax
vpxor %xmm2, %xmm2, %xmm2
vpxor %xmm3, %xmm3, %xmm3
nopl (%rax,%rax)
L96:
vpaddq 8(%rdx,%rax,8), %ymm0, %ymm0
vpaddq 40(%rdx,%rax,8), %ymm1, %ymm1
vpaddq 72(%rdx,%rax,8), %ymm2, %ymm2
vpaddq 104(%rdx,%rax,8), %ymm3, %ymm3
addq $16, %rax
cmpq %rax, %r9
jne L96
vpaddq %ymm0, %ymm1, %ymm0
vpaddq %ymm0, %ymm2, %ymm0
vpaddq %ymm0, %ymm3, %ymm0
vextracti128 $1, %ymm0, %xmm1
vpaddq %xmm1, %xmm0, %xmm0
vpshufd $78, %xmm0, %xmm1 # xmm1 = xmm0[2,3,0,1]
vpaddq %xmm1, %xmm0, %xmm0
vmovq %xmm0, %rax
cmpq %r9, %r8
je L191
L170:
subq %rsi, %rcx
incq %rcx
L176:
addq (%rdx,%rdi,8), %rax
movq %rsi, %rdi
incq %rsi
decq %rcx
jne L176
L191:
vzeroupper
retq
L195:
xorl %eax, %eax
retq
nopw %cs:(%rax,%rax)
Aha! Those `vpaddq`-and-friends instructions on the `%ymm`/`%xmm` registers are vector operations on the SIMD registers. They only show up in the normal version, and their absence in the `--compile=min --optimize=0` version explains part of the slowdown.
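If reading assembly isn't your thing, the same difference is arguably easier to spot one level up with `@code_llvm`: autovectorized loops show up as operations on LLVM vector types like `<4 x i64>`. A rough sketch (output abridged and version-dependent):
julia> @code_llvm debuginfo=:none mysum(collect(1:100))
# Under the default optimization level the hot loop contains vector-typed instructions, e.g.:
#   %wide.load = load <4 x i64>, ...
#   %vec.phi   = phi <4 x i64> ...
#   ... add <4 x i64> ...
# Under --compile=min --optimize=0 no such vector types appear - the loop stays scalar,
# which matches the vpaddq-on-%ymm instructions we saw (and didn't see) above.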