Rethinking optimization: is lower OK as the default for all Julia code (e.g. for faster startup)?

Clearly, lower optimization wouldn’t always be faster, or we would always use the lower levels. As of version 1.5 it no longer needs to be one global setting, and since we can now tune it, I suggest changing the default. E.g. a real-world 27-line script of mine (which uses three packages, e.g. CSV.jl) runs for 20 sec. (17 sec. on julia-1.6-dev) on Julia’s defaults, which are 45% slower than needed; and on code from someone else I’ve gotten 3.7x the speed with the lowest non-default settings (not just -O0, but --compile=min on top of it), where -O0 or -O1 alone were only about 30% faster.

There are trade-offs: low optimization may be OK for 90% of code (this is why Python is popular), while the remaining 10%, some loops, needs to be fast. See my other posts further down on that, e.g. on the superoptimizer.

Laurence Tratt: What Challenges and Trade-Offs do Optimising Compilers Face? [emphasis mine]

I’m going to write about some thoughts that were initially inspired by a (somewhat controversial) talk by Daniel Bernstein. The talk itself, entitled “Death of Optimizing Compilers” (slides and audio), was presented at ETAPS 2015 […]
Because of that, optimising compilers are operating in a region between a problem that’s not worth solving (most parts of most programs) and a problem they can’t solve (the really performance critical parts)

One of the reasons Go is getting popular is fast compilation, and Go doesn’t even pay that cost every single time the code runs, as Julia does. So to compete, I suggest Julia’s default be changed from -O2 to -O1, or at least that people try it (or even -O0) for a while to see how it goes, e.g. with: alias julia="julia -O1"
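
If you want to check that the alias (or any -O flag) actually took effect, you can ask a running session; this uses Base.JLOptions(), and the 1 shown below just assumes the -O1 alias above:

julia> Base.JLOptions().opt_level   # the optimization level this session was started with
1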

For some context, there are two more levels besides -O1 and -O2, namely -O0 and -O3 (these only control the LLVM optimization passes that Julia uses; --compile is a separate tuning option), and for some reason -O3 is already not the Julia default. E.g. OpenBSD bans -O3 for C code as a matter of policy because it is considered risky, and I suspect the reason it isn’t the default for Julia may be similar.
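
If you want to try the levels on your own code, hyperfine (used further down) can compare them directly; script.jl here is just a placeholder for whatever script you care about:

$ hyperfine --warmup 1 'julia -O2 script.jl' 'julia -O1 script.jl' 'julia -O0 script.jl'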

There is a clear benefit for Julia, at least for some code. First, the default -O2:

$ ~/julia-1.4.0/bin/julia --startup-file=no
julia> @time using Revise
  5.002075 seconds (2.06 M allocations: 112.873 MiB, 0.48% gc time)

$ ~/julia-1.4.0/bin/julia -O1 --startup-file=no
julia> @time using Revise
  3.862468 seconds (2.06 M allocations: 112.873 MiB, 0.62% gc time)

$ ~/julia-1.4.0/bin/julia -O0 --startup-file=no
julia> @time using Revise
  2.799413 seconds (2.06 M allocations: 112.873 MiB, 0.84% gc time)



--compile=min is something to be aware of, but likely going too far:

$ ~/julia-1.4.0/bin/julia --compile=min -O0 --startup-file=no
julia> @time using Revise
  1.656072 seconds (1.06 M allocations: 62.919 MiB, 0.65% gc time)
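
To show why --compile=min is “likely going too far”: with compilation mostly off, tight loops run in the interpreter, and in my experience that can be orders of magnitude slower than compiled code. A toy example (purely illustrative, not from any package):

julia> function sumto(n)          # a tight loop: exactly the kind of code --compile=min hurts
           s = 0
           for i in 1:n
               s += i
           end
           return s
       end;

julia> @time sumto(10^7);         # fast when compiled; try the same under --compile=min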

With my julia-1.5-ea669c3d3e build I get better times than 1.4 at every level, and the lower levels are consistently faster.

On my 10-day-old Julia 1.6 master, the relative order is the same, but there is an absolute regression compared to 1.5 and 1.4 (possibly related to Jeff Bezanson’s remarks on LLVM "getting slower and slower"):

$ ~/julia-1.6.0-DEV-8f512f3f6d/bin/julia --startup-file=no
julia> @time using Revise
  5.430166 seconds (1.38 M allocations: 81.815 MiB, 0.46% gc time)

$ ~/julia-1.6.0-DEV-8f512f3f6d/bin/julia -O1 --startup-file=no
julia> @time using Revise
  3.347072 seconds (1.38 M allocations: 81.815 MiB, 0.76% gc time)

$ ~/julia-1.6.0-DEV-8f512f3f6d/bin/julia -O0 --startup-file=no
julia> @time using Revise
  3.353581 seconds (1.38 M allocations: 81.817 MiB, 0.74% gc time)


$ ~/julia-1.6.0-DEV-8f512f3f6d/bin/julia --compile=min -O0 --startup-file=no
julia> @time using Revise
  0.869692 seconds (748.52 k allocations: 48.743 MiB, 1.71% gc time)

The best-case startup times I’ve seen (it’s probably luck that 1.5 timed faster than 1.6; there’s a lot of variability between runs on my loaded machine, this is just interesting for the minimum achievable): -O0 and -O1 seem to make no difference for Julia itself, only once it starts running other code:

$ hyperfine 'julia-1.5 --startup-file=no -O0 --inline=yes -e ""'
Benchmark #1: julia-1.5 --startup-file=no -O0 --inline=yes -e ""
  Time (mean ± σ):     163.6 ms ±  27.3 ms    [User: 106.4 ms, System: 78.3 ms]
  Range (min … max):   142.4 ms … 241.4 ms    12 runs


$ hyperfine '~/julia-1.6.0-DEV-8f512f3f6d/bin/julia --startup-file=no -O1 --inline=yes -e ""'
Benchmark #1: ~/julia-1.6.0-DEV-8f512f3f6d/bin/julia --startup-file=no -O1 --inline=yes -e ""
  Time (mean ± σ):     167.9 ms ±  29.3 ms
  Range (min … max):   143.0 ms … 228.8 ms    13 runs


Compare that to hyperfine reporting a min. of 10.0 ms for Python 2 and 28.0 ms for Python 3, and:

$ hyperfine 'perl -e ""'
Benchmark #1: perl -e ""
  Time (mean ± σ):       3.3 ms ±   1.1 ms    [User: 2.1 ms, System: 1.2 ms]
  Range (min … max):     1.5 ms …  11.5 ms    870 runs
 
  Warning: Command took less than 5 ms to complete. Results might be inaccurate.
  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet PC without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.

Python 2 (not 3) has more than 10x better startup time. Startup time on its own is an interesting discussion, but let’s keep this one about start-to-finish time for real-world programs.

I’ve been making PRs to packages using the new module-level low-optimization trick, for better startup. But instead of applying it all over the place and deciding each time whether it’s appropriate (it isn’t always), maybe we should do it the other way around: default to lower optimization globally, and add the opposite, in effect -O3, to the few packages/modules (and, when possible, functions) that need it?
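
For reference, a sketch of the trick itself (MyPackage is a placeholder name; the pattern relies on Julia 1.5’s Base.Experimental.@optlevel):

module MyPackage

# Lower the LLVM optimization level for code in this module only (Julia 1.5+);
# the guard keeps the package loadable on older Julia versions.
if isdefined(Base, :Experimental) && isdefined(Base.Experimental, Symbol("@optlevel"))
    @eval Base.Experimental.@optlevel 1
end

# ... rest of the package code ...

end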

Hopefully, later on it could be in the users’ hands to decide (and at a different granularity than the module level), with something like @opt3 using Module.


How are you benchmarking to make sure that your PRs do not introduce runtime regressions?

I’m not (at least not yet; for Revise, I’m just using it with the change and haven’t seen a difference). It’s up to the maintainers, and I mention the possible slowdown in each PR (which I guess is why all but one are still open, and that one was closed unmerged).

It’s not like I suggest[ed] this for packages where I know it would hurt. Until this post, I’ve only done it where I think it will likely help (and I do see faster startup every time, so I find it likely to help all or most packages).

Jeff did this first for Plots.jl; since then, as far as I know, it has been done for PyCall.jl, so I suggested it for similar packages.

I was most cautious with JavaCall.jl (“MIGHT be a slowdown”) and got a “thank you for suggesting this. I’m looking into it.” I let the maintainers decide. Some are no-brainers, like RDatasets.jl, though that PR is still open with “this makes sense to me.”

While I do “suggest changing the default” in this post, I don’t mean right away. Julia 1.5 is around the corner, and it seems like a good release that I wouldn’t want to spoil.

It would be good for people to try out -O1 first, to be conservative, so that we can see the effect. I’m just one person and can’t test all packages. I do expect a slowdown for some code, but less often than people may think, and in those cases they can add -O2 back (selectively), or even -O3, or @avx etc.
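
For the hot loops that do need the speed, the opt-in route already exists; a sketch with LoopVectorization.jl (mysum is just an illustrative name, and the macro is called @avx at the time of writing):

using LoopVectorization        # provides @avx

function mysum(x)
    s = zero(eltype(x))
    @avx for i in eachindex(x) # explicitly vectorized hot loop
        s += x[i]
    end
    return s
end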

There’s even a superoptimizer for LLVM we could try: GitHub - google/souper: A superoptimizer for LLVM IR. Bernstein makes a good case for the “Death of Optimizing Compilers”; maybe I’m proposing going too far in the other direction?

About the superoptimizer:

If we can automatically derive compiler optimizations, we might be able to sidestep some of the substantial engineering challenges involved in creating and maintaining a high-quality compiler.

Don’t get me wrong, I like how Julia is as fast as C (while ironically it may feel slower than Python), and I like @code_native etc. It was made for technical computing/HPC, and I can see how people may think this suggestion of lower optimization is admitting defeat. But optimizing less by default frees up time that can be spent more wisely, e.g. on (always targeted) superoptimization.

@Elrod’s work with @avx may not be a superoptimizer in the strict sense, but it feels like one to me.

The way I see it, people are giving up on Julia for the kinds of work where e.g. Python (or Perl) is good, so we’re sacrificing world dominance for the smaller role of an HPC/Fortran environment/language replacement:

I would hate to throw the baby out with the bathwater and succeed at neither. It doesn’t have to be either-or. To me (though not to everyone), Julia has proven to be a Fortran (and MATLAB) replacement. It’s the only inherently fast interactive language I know, but we all know about the endless (well, until recently) time-to-first-plot issue, and the “short script” problem above is just the same thing. Most people aren’t going to accept Julia as a Python/MATLAB replacement until those issues go away, and they will persist to some degree with the current default.


I think the portion of compilation that gets “wasted” is a big part of this. It would be nice if you didn’t have to rely on PackageCompiler (which even then isn’t always perfect) to avoid having so much compilation repeated each time you restart the REPL/load a new script.
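
(For reference, relying on PackageCompiler looks roughly like this; the package list and sysimage path here are placeholders:)

using PackageCompiler

# Bake a package (Revise, as an example) into a custom system image,
# then start Julia with: julia --sysimage revise_sysimage.so
create_sysimage([:Revise]; sysimage_path="revise_sysimage.so")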


By “wasted” compilation (true in a sense) I think you mean “invalidations”, which you can fix to some degree:

Tim Holy is working on applying this to, at least, his other package Revise (one of his many); see the comment in that thread (and his upcoming blog post):

https://github.com/timholy/Revise.jl/pull/484#issuecomment-632872181

As this all seems rather complicated (and it’s only one part of the solution, since I think it only works for packages, not scripts), for now the easiest thing is to test and use whichever -O level works for you. I also opened an issue:

https://github.com/JuliaLang/julia/issues/36025

It seems to me that you stop all invalidations with --compile=min, but it can be very slow (though it is sometimes the fastest option).
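
If you want to look at the invalidations yourself, my understanding of the tooling (a sketch; the SnoopCompile API may still change, and CSV is just an example package) is roughly:

julia> using SnoopCompileCore

julia> invalidations = @snoopr using CSV;   # record what loading CSV invalidates

julia> using SnoopCompile

julia> invalidation_trees(invalidations)    # group and display the invalidations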

On JIT compilation (Just-in-time compilation - Wikipedia): “The application code is initially interpreted, but the JVM monitors which sequences of bytecode are frequently executed and translates them to machine code for direct execution on the hardware.”

Otherwise, a JIT compiler should operate on the Pareto rule, i.e. the Pareto principle (Pareto principle - Wikipedia): “… for many outcomes, roughly 80% of consequences come from 20% of the causes (the ‘vital few’).”
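
In Julia terms, the standard-library profiler is the obvious way to find that “vital few” before deciding where -O2/-O3 (or @avx) is actually worth it; a minimal sketch, where work() is a placeholder for your own code:

julia> using Profile

julia> work() = sum(sqrt, 1:10^8);   # placeholder workload

julia> work();                       # run once so compilation isn't what gets profiled

julia> @profile work()

julia> Profile.print()               # shows where the time actually goes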
