Implications of --compile=min and --optimize=0, for dummies

I’m starting to build some command line interfaces for some projects, and trying to figure out the best practices. In Fredrik Ekre’s JuliaCon talk, one option he recommends is using --compile=min --optimize=0, but I’m trying to figure out what the implications / trade-offs of these flags are.

Certainly, for development purposes, they reduce lag time considerably; I tend to use ArgParse.jl, and in one example, julia myscript.jl --help takes 6 seconds without those flags and 0.3 seconds with them. A bunch of my scripts are just doing things like calling external programs and some file administration, and in those cases it also seems like a no-brainer.
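
To be concrete, the comparison I’m making is roughly this (myscript.jl is just a stand-in for one of my ArgParse-based scripts):

$ time julia myscript.jl --help                              # ~6 s, mostly startup/compilation
$ time julia --compile=min --optimize=0 myscript.jl --help   # ~0.3 s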

But if I’m actually using julia packages to do numerically-intensive work, am I going to be shooting myself in the foot? I’ve tried to do a bit of reading, but I really don’t understand the low-level stuff enough to make heads or tails of it. Is compile=min just deferring compilation that will happen anyway later on, so it makes no difference in something that isn’t interactive? Or is there a cost that will be paid over time? Can someone do a “optimization for dummies” explanation?

In the end, I’m sure I should just use PackageCompiler, but I’m still pretty early in development and that seems like a heavier lift than just passing some command line flags. It would be nice to have a better sense of the implications of these things, if someone that understands it can volunteer :slight_smile:

PS - yes, I also know DaemonMode.jl is an option and I’m playing with that too.

10 Likes

Here’s my Optimization For Julia Beginners:

(Mostly writing this to establish the limits of my own ignorance; a full, approachable, and really really cool intro to all of this is the Self Guided Course to Compilers, which talks about all this stuff in detail and with very friendly explanations. Dead code and other goodies are in the lecture here.)

Part 1: LLVM Passes

We need to understand some beginner compiler concepts first to answer your questions.

Imagine this is your Julia code:

x = 3
x = 4
y = x + 1

You can spot that the code should result in y == 5, x == 4. That means that the line x = 3 is “dead” and a relatively simple analysis of the code would convert it into

x = 4
y = x + 1

(This is a contrived example, but we’ll build up to the other optimizations.)

An analysis that can look at the code and eliminate unneeded lines is called “dead code elimination”. If it was smart enough to figure out that you only needed the last result, the “entire program” could be shrunk further to

y = 5

in what would be called a “constant propagation” analysis. These different analyses are called passes in LLVM, and the general notion is to build data structures (a Control Flow Graph, or CFG, for example) about the code so that you can apply this automated reasoning and hopefully a) prove certain optimizations are correct and b) apply them to make the code faster.
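
If you want to watch these two passes do their thing in Julia itself, here’s a minimal sketch (the exact printed IR varies between Julia versions, so the comments describe what you typically see rather than guaranteed output):

julia> function f()
           x = 3
           x = 4
           y = x + 1
           return y
       end;

julia> @code_typed f()                 # typed/optimized IR: usually little more than `return 5`

julia> @code_llvm debuginfo=:none f()  # LLVM IR: typically just something like `ret i64 5`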

These were simple optimizations in that you only need to reason “locally” about what happens in the code lines immediately around the addition operation, but you can imagine that this gets more and more complex as you add in loops, function boundaries, different compiler architectures…

Furthermore, there’s a planning issue - some optimizations are wayyy more profitable to run after others. Inlining (copy-pasting the function code to where it’s called) enables many other optimizations to happen after it (because you just brought code chunks into a local context, so the local passes are easier and cheaper!). Point is, the compiler now also has to worry about the scheduling, or ordering, of the optimization passes.
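
Here’s a tiny sketch of that “inlining unlocks other optimizations” point (again, the printed IR depends on the Julia version):

julia> double(x) = 2x;

julia> quadruple(x) = double(double(x));

julia> @code_typed quadruple(3)   # the calls to `double` are typically gone, inlined away

If you put @noinline in front of the definition of double, you can usually see the calls stick around instead.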

At a certain point, it becomes really hard to completely figure out which optimizations you can prove about the code (because you need to update and manipulate the CFG) and it’s best to just have heuristics, or cost models, about when an optimization is profitable. (This is why you will hear about the Inliner Cost model being a big thing to know about when you are exploring those depths.) This is just the classic tradeoff of compilation time vs run-time performance - it’s just not profitable to analyze and prove all possible correct transformations of the code, so a certain set of heuristics are established that work “well enough”.

This is why you can set --optimize to 0, 1, 2, or 3. You’re just telling the compiler how much effort it should put into giving you those tasty, tasty FLOP/s, at the cost of more compilation time/work.
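
If you want to check which level a running session was started with, one way (this is an internal struct, so field names may change between versions) is:

julia> Base.JLOptions().opt_level   # 2 by default, 0 if started with --optimize=0
2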

Part 2: Answering your question

So what do --compile=min --optimize=0 actually do?

  • --optimize=0 will reduce the number of passes that actually get applied to a bare minimum set of profitable ones. Reading the LLVM passes list is daunting; there are dozens of them. (Side note: the autovectorization pass is what gives your code SIMD speedups; it’s just another analysis that gets run on your code.) It will also tune heuristics like the inliner cost model to fire less often, in order to avoid work (there being a trade-off between compilation time and optimized runtime).

IIUC (but someone else will probably correct me on this, as I couldn’t find documentation in the manual for this flag):

  • --compile=min, on the other hand, tries to optimize not for performance but for program size, e.g. reducing the amount of code generated and cached in the process. This is desirable when shipping binaries and executables, and is obviously in tension with the inliner optimizations (which are, handwavingly, copy-pasting function code all over the place).

Together, those flags generally hint that you want to share programs with others and that you don’t want to spend oodles of compilation time on them.

Here’s an example:

# Run with Julia 1.6-rc1 --compile=min --optimize=0
julia> using BenchmarkTools

julia> function mysum(xs)
           res = 0
           for i in xs
               res += i
           end
           res
       end
mysum (generic function with 1 method)

julia> xs = collect(1:1000000);

julia> @btime mysum($xs);
  7.480 ms (6 allocations: 208 bytes)

# Run with normal Julia 1.6-rc1
julia> @btime mysum($xs);
  393.703 μs (0 allocations: 0 bytes)

And the emitted code is here:

# --compile=min --optimize=0 version
julia> @code_native debuginfo=:none mysum(collect(1:100))
        .text
        subq    $88, %rsp
        xorl    %eax, %eax
        movl    %eax, %ecx
        vxorps  %xmm0, %xmm0, %xmm0
        vmovdqa %xmm0, 64(%rsp)
        movq    $0, 80(%rsp)
        movq    %fs:0, %rdx
        movq    %rdx, %rsi
        addq    $-32768, %rsi                   # imm = 0x8000
        movq    $4, 64(%rsp)
        leaq    72(%rsp), %r8
        movq    -32768(%rdx), %r9
        movq    %r9, 72(%rsp)
        leaq    64(%rsp), %r9
        movq    %r9, -32768(%rdx)
        movq    $0, 80(%rsp)
        movq    %rdi, 80(%rsp)
        movq    %rdi, %rdx
        movq    %rdx, %r9
        movq    8(%rdx), %rdx
        cmpq    %rdx, %rcx
        setb    %r10b
        andb    $1, %r10b
        andb    $1, %r10b
        xorb    $-1, %r10b
        testb   $1, %r10b
        movb    $1, %r10b
        movq    %rdi, 56(%rsp)
        movq    %rsi, 48(%rsp)
        movq    %r8, 40(%rsp)
        movq    %r9, 32(%rsp)
        movq    %rdx, 24(%rsp)
        movb    %r10b, 23(%rsp)
        movq    %rcx, 8(%rsp)
        jne     L193
        xorl    %eax, %eax
        movq    32(%rsp), %rcx
        movq    (%rcx), %rdx
        movq    (%rdx), %rdx
        movb    %al, 23(%rsp)
        movq    %rdx, 8(%rsp)
L193:
        movq    8(%rsp), %rax
        movb    23(%rsp), %cl
        xorl    %edx, %edx
        movl    %edx, %esi
        xorb    $1, %cl
        xorb    $-1, %cl
        testb   $1, %cl
        movl    $2, %edi
        movq    24(%rsp), %r8
        movq    56(%rsp), %r9
        movq    %rsi, %r10
        movq    %r8, (%rsp)
        movq    %r9, -8(%rsp)
        movq    %rax, -16(%rsp)
        movq    %rdi, -24(%rsp)
        movq    %r10, -32(%rsp)
        movq    %rsi, -40(%rsp)
        jne     L533
L268:
        movq    -32(%rsp), %rax
        movq    -24(%rsp), %rcx
        movq    -16(%rsp), %rdx
        movq    -8(%rsp), %rsi
        movq    (%rsp), %rdi
        addq    %rdx, %rax
        movq    %rcx, %rdx
        subq    $1, %rdx
        cmpq    %rdi, %rdx
        setb    %r8b
        andb    $1, %r8b
        andb    $1, %r8b
        xorb    $-1, %r8b
        testb   $1, %r8b
        movb    $1, %r8b
        movq    %rcx, -48(%rsp)
        movq    %rax, -56(%rsp)
        movq    %rdx, -64(%rsp)
        movq    %rsi, -72(%rsp)
        movq    %rdi, -80(%rsp)
        movq    %r9, -88(%rsp)
        movb    %r8b, -89(%rsp)
        jne     L423
        xorl    %eax, %eax
        movq    32(%rsp), %rcx
        movq    (%rcx), %rdx
        movq    -64(%rsp), %rsi
        movq    (%rdx,%rsi,8), %rdx
        movq    -48(%rsp), %rdi
        addq    $1, %rdi
        movq    56(%rsp), %r8
        movq    %r8, -72(%rsp)
        movq    %rdx, -80(%rsp)
        movq    %rdi, -88(%rsp)
        movb    %al, -89(%rsp)
L423:
        movb    -89(%rsp), %al
        movq    -88(%rsp), %rcx
        movq    -80(%rsp), %rdx
        movq    -72(%rsp), %rsi
        xorb    $1, %al
        xorb    $-1, %al
        testb   $1, %al
        movq    -56(%rsp), %rdi
        movq    %rcx, -104(%rsp)
        movq    %rdx, -112(%rsp)
        movq    %rsi, -120(%rsp)
        movq    %rdi, -40(%rsp)
        jne     L533
        movq    -120(%rsp), %rax
        movq    8(%rax), %rax
        movq    -120(%rsp), %rcx
        movq    -112(%rsp), %rdx
        movq    -104(%rsp), %rsi
        movq    -56(%rsp), %rdi
        movq    %rax, (%rsp)
        movq    %rcx, -8(%rsp)
        movq    %rdx, -16(%rsp)
        movq    %rsi, -24(%rsp)
        movq    %rdi, -32(%rsp)
        jmp     L268
L533:
        movq    -40(%rsp), %rax
        movq    40(%rsp), %rcx
        movq    (%rcx), %rdx
        movq    48(%rsp), %rsi
        movq    %rdx, (%rsi)
        addq    $88, %rsp
        retq
        nop


And here is the normal Julia version:

julia> @code_native debuginfo=:none mysum(collect(1:100))
        .text
        movq    8(%rdi), %rcx
        testq   %rcx, %rcx
        je      L195
        movq    (%rdi), %rdx
        movq    (%rdx), %rax
        cmpq    $1, %rcx
        je      L191
        cmpq    $2, %rcx
        movl    $2, %esi
        cmovbeq %rsi, %rcx
        leaq    -1(%rcx), %r8
        movl    $1, %edi
        cmpq    $16, %r8
        jb      L170
        movq    %r8, %r9
        andq    $-16, %r9
        leaq    1(%r9), %rdi
        leaq    2(%r9), %rsi
        vmovq   %rax, %xmm0
        vpxor   %xmm1, %xmm1, %xmm1
        xorl    %eax, %eax
        vpxor   %xmm2, %xmm2, %xmm2
        vpxor   %xmm3, %xmm3, %xmm3
        nopl    (%rax,%rax)
L96:
        vpaddq  8(%rdx,%rax,8), %ymm0, %ymm0
        vpaddq  40(%rdx,%rax,8), %ymm1, %ymm1
        vpaddq  72(%rdx,%rax,8), %ymm2, %ymm2
        vpaddq  104(%rdx,%rax,8), %ymm3, %ymm3
        addq    $16, %rax
        cmpq    %rax, %r9
        jne     L96
        vpaddq  %ymm0, %ymm1, %ymm0
        vpaddq  %ymm0, %ymm2, %ymm0
        vpaddq  %ymm0, %ymm3, %ymm0
        vextracti128    $1, %ymm0, %xmm1
        vpaddq  %xmm1, %xmm0, %xmm0
        vpshufd $78, %xmm0, %xmm1               # xmm1 = xmm0[2,3,0,1]
        vpaddq  %xmm1, %xmm0, %xmm0
        vmovq   %xmm0, %rax
        cmpq    %r9, %r8
        je      L191
L170:
        subq    %rsi, %rcx
        incq    %rcx
L176:
        addq    (%rdx,%rdi,8), %rax
        movq    %rsi, %rdi
        incq    %rsi
        decq    %rcx
        jne     L176
L191:
        vzeroupper
        retq
L195:
        xorl    %eax, %eax
        retq
        nopw    %cs:(%rax,%rax)

Aha! Those vpaddq and friends instructions operating on the %xmm/%ymm registers are SIMD vector operations. Their absence in the --compile=min --optimize=0 version would explain part of the slowdown.

24 Likes

@kescobo I didn’t manage to find documentation on the --compile=min flag in the manual or in the REPL - file an issue maybe?

[henrique AreaDeTrabalho]$ julia --help-hidden
julia [switches] -- [programfile] [args...]
 --compile={yes|no|all|min}  Enable or disable JIT compiler, or request exhaustive compilation
 --output-o name           Generate an object file (including system image data)
 --output-ji name          Generate a system image data file (.ji)
 --output-unopt-bc name    Generate unoptimized LLVM bitcode (.bc)
 --output-jit-bc name      Dump all IR generated by the frontend (not including system image)
 --output-bc name          Generate LLVM bitcode (.bc)
 --output-asm name         Generate an assembly file (.s)
 --output-incremental=no   Generate an incremental output file (rather than complete)
 --trace-compile={stdout,stderr}
                           Print precompile statements for methods compiled during execution.
8 Likes

Obrigado! (Thank you!)

4 Likes

Yes, if you turn off optimization / compilation some workloads can be penalized to an extreme degree.

No, --compile=min turns compilation off.

3 Likes

This is confusing - code has to be compiled at some point to run, right?

Super helpful, thanks!

2 Likes

It can be interpreted

2 Likes

Python would disagree :P.

But seriously, it depends a bit on what you mean by “compile”. In Python you could argue that you compile to a bytecode which is then executed by the Python runtime. Same with Julia --compile=no: you compile the source code to some kind of lower-level code which is then executed by the Julia runtime. But with --compile=no you don’t have, for example, the LLVM part of the standard Julia compilation process, which is often a reason for great speedups.
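
A rough way to see the stages involved (just an illustration; the interpreter doesn’t literally execute these printouts):

julia> f(x) = x + 1;

julia> @code_lowered f(1)   # lowered IR, roughly the level an interpreter walks

julia> @code_llvm f(1)      # the LLVM stage, which interpreted code never reaches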

5 Likes

With --compile=min I tried:

julia> @code_native 1+1
.text
; ┌ @ int.jl:87 within `+'
	leaq	(%rdi,%rsi), %rax
	retq
	nopw	%cs:(%rax,%rax)
	nop
; └

and this result is the same, which is a bit confusing: either @code_native is just showing me what would have happened otherwise, or during “interpretation” it actually does compile (and maybe recompiles for each iteration of a loop, throwing out the older code generation, which would explain why this can be even slower than Python). Do you know which is true (and how I could have found out for myself)?

1 Like

Here is an example:

julia> function mysum(x)
           s = 0
           for v in x
               s += v
           end
           return s
       end
mysum (generic function with 1 method)

# default
julia> @time mysum(rand(10^6))
  0.004718 seconds (25 allocations: 7.649 MiB)

# --compile=min
julia> @time mysum(rand(10^6))
  5.036822 seconds (13.01 M allocations: 252.561 MiB, 1.86% gc time, 0.19% compilation time)

No, it does not compile every iteration. It is interpreting it. It isn’t strange that a language designed to be compiled (Julia) can be slower to interpret than a language designed to be interpreted (Python).
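
If you want to poke at interpretation without restarting Julia with different flags, JuliaInterpreter.jl gives a feel for the cost (it’s a separate package and not the exact machinery behind --compile=min, so treat this as an illustration):

julia> using JuliaInterpreter

julia> xs = rand(10^6);

julia> @time mysum(xs);              # compiled path

julia> @time @interpret mysum(xs);   # interpreted path, typically orders of magnitude slower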

9 Likes

Ahh, got it… sort of. Anyway, I understand enough that I have a better heuristic for when to use this in scripts.

I guess the final question (though I think I know the answer) is whether there’s any way to change this mid-execution. That is, it seems like the ideal script flow would be something like:

Interpreted, No optimization

  • interpret command line args
  • set up directories / file paths / logging options
  • make sure there aren’t any obvious problems

Switch to fully optimized / compiled mode

  • do hard stuff

I’m assuming that’s not possible, and I know there are efforts to do something kinda like this to get the best of both worlds, but I gather that’s a lot of work and isn’t on the immediate horizon. I suppose I could just launch a second julia process from inside the first one, but if there’s a switch that I can throw to avoid that complication, that would be nice :slight_smile:

I do remember there was talk of individual flags for packages inside a single script. Not sure what the outcome was, but surely that would have been awesome.

Put Plots on minimum settings and DataFrames on maximum and get the best of both worlds; it’s like a dream come true :slight_smile:

You can change it based on the module, for example:
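
Something like this, using the experimental per-module optimization level (it lives under Base.Experimental, so the details may change between versions; the module and function names here are just placeholders):

module CLIGlue

Base.Experimental.@optlevel 0   # don't spend time optimizing this module's code

# argument parsing, path setup, logging configuration, etc. would live here
run_setup() = nothing

end # module

The numerically heavy code can then live in other modules or packages, which keep the default optimization level.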

5 Likes

Regarding ArgParse, here is a PR that improves its latency a bit: reduce inference and optimization time (by turning it off) by KristofferC · Pull Request #104 · carlobaldassi/ArgParse.jl · GitHub

7 Likes

Ooh, that’s quite nice. Thanks for doing it!

1 Like

As @kristoffer.carlsson says, it’s module-specific. Since I imagine that ArgParse.jl doesn’t do much heavy numerical computation, these kinds of “walled optimization gardens” work out quite nicely in practice.

1 Like

Why can’t Julia cache the compiled code created the first time you run a command line tool, and re-use that in the second run?

It can if you use https://github.com/JuliaLang/PackageCompiler.jl. For the normal usage case, it is harder because code can be loaded in a different order, can invalidate earlier assumptions, etc. But it is slowly being chipped away at, most recently by all the invalidation-reduction work.
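
For reference, a sketch of that PackageCompiler route (API as of the 1.x releases; the image and script names are placeholders):

julia> using PackageCompiler

julia> create_sysimage([:ArgParse]; sysimage_path="argparse_sysimage.so")

and then start the script against that image:

$ julia --sysimage argparse_sysimage.so myscript.jl --help

You can also point the precompile_execution_file keyword at a representative script to bake more of the compiled code into the image.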

2 Likes

Yes, I know of PackageCompiler, and it’s nice, but it’s not particularly automatic. I’m thinking of this one guy making performance comparisons of some languages who was running scripts. Of course, he was always benchmarking the Julia script with compilation time included.

So a humble wish for a command line option to do that… Might not be easy.