Clearly, lower optimization wouldn't always be faster, or we would always use the lower levels. As of version 1.5 it doesn't need to be one global setting, and since we can now tune it, I suggest changing the default. E.g. a real-world 27-line script of mine (that uses three packages, e.g. CSV.jl) runs for 20 sec. (17 sec. on julia-1.6-dev) on Julia's defaults, which are 45% slower; I've gotten 3.7x the speed with the non-default, lowest optimization settings (lower than `-O0` alone; with `--compile=min` additionally), on code from someone else, where `-O0` or `-O1` were only about 30% faster.
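For concreteness, the "lowest settings" I mean look like this (a sketch; `script.jl` is a stand-in for the 27-line script above):

```
# -O0 disables LLVM optimization; --compile=min additionally avoids most
# compilation, falling back to the interpreter where possible.
$ julia -O0 --compile=min --startup-file=no script.jl
```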
There are trade-offs: low optimization may be OK for 90% of code (one reason Python is popular), while the other 10%, e.g. some loops, needs to be fast. See my other posts further down on that, e.g. on superoptimizers.
Laurence Tratt: What Challenges and Trade-Offs do Optimising Compilers Face? [emphasis mine]

> I’m going to write about some thoughts that were initially inspired by a (somewhat controversial) talk by Daniel Bernstein. The talk itself, entitled “Death of Optimizing Compilers” (slides and audio), was presented at ETAPS 2015 […]
>
> Because of that, **optimising compilers are operating in a region between a problem that’s not worth solving (most parts of most programs) and a problem they can’t solve (the really performance critical parts)**
One of the reasons Go is getting popular is fast compilation, and Go programmers don't even pay that cost every single time they run the code, as Julia users do. So to compete, I suggest Julia's default be changed from `-O2` to `-O1`, or at least that people try it (or even `-O0`) for a while to see how it goes. E.g. do: `alias julia="julia -O1"`
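To see whether it helps your own code before changing the alias, a quick sketch with hyperfine (`myscript.jl` is a hypothetical script of yours):

```
# Compare the default -O2 against -O1 end-to-end on a real script.
$ hyperfine 'julia -O2 --startup-file=no myscript.jl' \
            'julia -O1 --startup-file=no myscript.jl'
```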
For some context, there are two more levels (controlling only the LLVM optimizations that Julia uses; plus the separate compile/tuning options): `-O0` and `-O3`, and for some reason `-O3` is already not the Julia default. E.g. OpenBSD bans `-O3` compilation for C code as a policy, because it's considered dangerous, and I think the reason it's non-default for Julia may be the same.
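If you're unsure which level a given session is running at, you can check from within Julia (a minimal sketch; `Base.JLOptions()` is internal, but the field exists at least on 1.4/1.5):

```julia
julia> Base.JLOptions().opt_level  # 2 is the default; 0, 1 and 3 are the alternatives
2
```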
There is a clear benefit for Julia, at least for some code, first showing the default `-O2`:
```
$ ~/julia-1.4.0/bin/julia --startup-file=no
julia> @time using Revise
5.002075 seconds (2.06 M allocations: 112.873 MiB, 0.48% gc time)

$ ~/julia-1.4.0/bin/julia -O1 --startup-file=no
julia> @time using Revise
3.862468 seconds (2.06 M allocations: 112.873 MiB, 0.62% gc time)

$ ~/julia-1.4.0/bin/julia -O0 --startup-file=no
julia> @time using Revise
2.799413 seconds (2.06 M allocations: 112.873 MiB, 0.84% gc time)
```
`--compile=min` is something to be aware of, but likely going too far:
```
$ ~/julia-1.4.0/bin/julia --compile=min -O0 --startup-file=no
julia> @time using Revise
1.656072 seconds (1.06 M allocations: 62.919 MiB, 0.65% gc time)
```
In my julia-1.5-ea669c3d3e I get better times at all levels compared to 1.4, and always better for the lower levels. In my 10-days-old master Julia 1.6, the relative order is the same, while there's an absolute regression (possibly related to Jeff Bezanson's remarks on LLVM "getting slower and slower") compared to 1.5 and 1.4:
```
$ ~/julia-1.6.0-DEV-8f512f3f6d/bin/julia --startup-file=no
julia> @time using Revise
5.430166 seconds (1.38 M allocations: 81.815 MiB, 0.46% gc time)

$ ~/julia-1.6.0-DEV-8f512f3f6d/bin/julia -O1 --startup-file=no
julia> @time using Revise
3.347072 seconds (1.38 M allocations: 81.815 MiB, 0.76% gc time)

$ ~/julia-1.6.0-DEV-8f512f3f6d/bin/julia -O0 --startup-file=no
julia> @time using Revise
3.353581 seconds (1.38 M allocations: 81.817 MiB, 0.74% gc time)

$ ~/julia-1.6.0-DEV-8f512f3f6d/bin/julia --compile=min -O0 --startup-file=no
julia> @time using Revise
0.869692 seconds (748.52 k allocations: 48.743 MiB, 1.71% gc time)
```
The best-case startups I've seen (there's a lot of variability between runs on my loaded machine, so it's probably luck that 1.5 timed faster than 1.6; it's just interesting to see the minimum achievable): `-O0` and `-O1` seem not to make a difference for Julia itself, only for when it starts running other code:
```
$ hyperfine 'julia-1.5 --startup-file=no -O0 --inline=yes -e ""'
Benchmark #1: julia-1.5 --startup-file=no -O0 --inline=yes -e ""
  Time (mean ± σ):   163.6 ms ± 27.3 ms   [User: 106.4 ms, System: 78.3 ms]
  Range (min … max): 142.4 ms … 241.4 ms  12 runs

$ hyperfine '~/julia-1.6.0-DEV-8f512f3f6d/bin/julia --startup-file=no -O1 --inline=yes -e ""'
Benchmark #1: ~/julia-1.6.0-DEV-8f512f3f6d/bin/julia --startup-file=no -O1 --inline=yes -e ""
  Time (mean ± σ):   167.9 ms ± 29.3 ms
  Range (min … max): 143.0 ms … 228.8 ms  13 runs
```
Compare that to hyperfine reporting a min. of 10.0 ms for Python 2 and 28.0 ms for Python 3, and:
```
$ hyperfine 'perl -e ""'
Benchmark #1: perl -e ""
  Time (mean ± σ):     3.3 ms ±  1.1 ms   [User: 2.1 ms, System: 1.2 ms]
  Range (min … max):   1.5 ms … 11.5 ms   870 runs

Warning: Command took less than 5 ms to complete. Results might be inaccurate.
Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet PC without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.
```
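Following hyperfine's own hint, warmup runs reduce that noise (`--warmup` is a standard hyperfine option):

```
# Untimed warmup runs fill disk caches before measurement starts.
$ hyperfine --warmup 3 'perl -e ""'
```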
Python 2 (not 3) has 10-times-plus better startup time, and while startup time itself is an interesting discussion, let's keep this discussion on start-to-finish time for some real-world programs.
I've been making PRs to packages applying the new module-level low-optimization trick, for better startup; but instead of doing it all over the place and deciding case by case whether it's appropriate (it isn't always), maybe we should do it the other way around: default to lower optimization, and add the similar-but-opposite, in effect `-O3`, to a few packages/modules (and where possible, functions)?
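For reference, the module-level trick uses `Base.Experimental.@optlevel`, new in 1.5; a minimal sketch of what such a PR adds (`MyPackage` is a placeholder name):

```julia
module MyPackage

# Request LLVM optimization level 1 for this module's code only; the
# effective level is capped by the command-line -O setting, so this is
# a per-module hint rather than a global override.
Base.Experimental.@optlevel 1

greet() = println("loads faster, runs with fewer optimizations")

end # module
```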
Hopefully later it could be in the hands of users to decide (and less granular than at the module level), something like `@opt3 using Module`.