Julia startup speed cut in half. Was: (Unofficial) Julia 1.9 for lower latency (startup)

The timing for the (official) 1.9 master shows 7.6% faster startup than Julia 1.7.0:

$ hyperfine '~/Downloads/julia-a60c76ea57/bin/julia -e ""'
Benchmark 1: ~/Downloads/julia-a60c76ea57/bin/julia -e ""
  Time (mean ± σ):     191.2 ms ±  14.4 ms    [User: 179.9 ms, System: 303.4 ms]
  Range (min … max):   168.6 ms … 213.0 ms    14 runs

$ hyperfine 'julia -e ""'
Benchmark 1: julia -e ""
  Time (mean ± σ):     206.0 ms ±  29.6 ms    [User: 165.5 ms, System: 122.3 ms]
  Range (min … max):   182.4 ms … 278.2 ms    10 runs

Since the “System” time is larger than the total wall time, it implies that more than one thread is used (2?). Note that with the `time` measurements below I get very different numbers for “sys”, so I assume hyperfine's figure might be time aggregated over, presumably, 2 threads.
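
One way to sanity-check the thread count (just a guess on my part; hyperfine seems to sum CPU time across all OS threads, and Julia may start a few internal ones regardless of the Julia thread setting):

$ julia -e 'println(Threads.nthreads())'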

How could the startup time be reduced further? I was thinking about compiling my own Julia (already done, not shown here) and throwing out as much as possible that isn't needed, such as LinearAlgebra, maybe Threads, and basically everything in Base not used by Julia itself.

Note that some stdlibs, e.g. Statistics and DelimitedFiles, have already been thrown out of the sysimage in Julia 1.9, or so I thought. Whatever the reason(s) for the speedup, that and/or something else, the sysimage is actually larger:

232731608 jún 25 17:00 sys.so
32830120 jún 25 16:48 libopenblas64_.0.3.20.so

vs in 1.7.0:
199483960 nóv 30 2021 sys.so
31736520 nóv 30 2021 libopenblas64_.0.3.13.so

Those are the big-ticket items: reduce sys.so, or eliminate e.g. LinearAlgebra/libopenblas64. I've yet to profile anything (I recall from a JuliaLang issue that it's been done). Can anyone tell me where to look in the code for removing e.g. that .so, what the best profiling tools are, or point me to that forgotten issue?

A. I'm thinking of doing this unofficial (breaking) "Julia 2.0" not as a hostile takeover, but to explore how much can and should be taken out while still being useful for scripts and benchmarks such as the Debian benchmarks game (some programs there require threads… at least one needs GMP/BigInt, but none need LinearAlgebra).

B. I'm also considering implementing some of the changes from the 2.0 milestone (any ideas?), those that seem sensible, at least if faster, and also removing Dict from Base, i.e. changing to a better (for Julia) unexported version. I suspect Base itself only needs small Dicts, not a scalable Dict implementation.
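
To illustrate what I mean by a small, non-scalable Dict (just a sketch of the idea, not the implementation I'd actually propose): a linear-scan map backed by a Vector of Pairs, which tends to beat a hash table for a handful of entries:

struct SmallDict{K,V}
    pairs::Vector{Pair{K,V}}
end
SmallDict{K,V}() where {K,V} = SmallDict{K,V}(Pair{K,V}[])

function Base.getindex(d::SmallDict, key)
    for p in d.pairs
        isequal(p.first, key) && return p.second
    end
    throw(KeyError(key))
end

function Base.setindex!(d::SmallDict{K,V}, value, key) where {K,V}
    for (i, p) in enumerate(d.pairs)
        if isequal(p.first, key)
            d.pairs[i] = Pair{K,V}(key, value)  # overwrite existing key
            return d
        end
    end
    push!(d.pairs, Pair{K,V}(key, value))       # new key: append
    return d
end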

$ time julia --startup-file=no -O0 -e "println(\"Hello world\")"
Hello world

real	0m0,216s
user	0m0,194s
sys	0m0,069s

$ time ~/Downloads/julia-a60c76ea57/bin/julia --startup-file=no -O0 -e "println(\"Hello world\")"
Hello world

real	0m0,192s
user	0m0,160s
sys	0m0,156s
$ hyperfine '~/Downloads/julia-a60c76ea57/bin/julia --startup-file=no -O0 -e "println(\"Hello world\")"'
Benchmark 1: ~/Downloads/julia-a60c76ea57/bin/julia --startup-file=no -O0 -e "println(\"Hello world\")"
  Time (mean ± σ):     190.1 ms ±  14.6 ms    [User: 177.9 ms, System: 291.8 ms]
  Range (min … max):   170.7 ms … 208.6 ms    14 runs

$ hyperfine 'julia --startup-file=no -O0 -e "println(\"Hello world\")"'
Benchmark 1: julia --startup-file=no -O0 -e "println(\"Hello world\")"
  Time (mean ± σ):     213.0 ms ±  17.8 ms    [User: 183.2 ms, System: 140.9 ms]
  Range (min … max):   184.9 ms … 236.4 ms    12 runs

Check this PR:


The sharpest constraint we have is that we cannot remove any stdlibs that are direct or indirect dependencies of Pkg. (Because otherwise, you cannot install any external packages.)

@DilumAluthge, why is that? I want to excise as much as possible (even Pkg). It seems that a stdlib may only depend on other stdlibs? Is that restriction really needed?

For (very simple) scripts (just as an experiment), I do not need Pkg (or the REPL). I WOULD still like to have Pkg available… just as an ordinary package. That seems hypothetically possible, except you would have the problem of installing it first… If it came with Julia (in the "mere aggregation" sense), it seems I should be able to use it by somehow pointing to it.

You list 30 dependencies of Pkg (actually 29…) that could all go (plus even LinearAlgebra), in particular (to avoid having to handle security issues in Julia itself) LibCURL_jll, LibGit2, LibSSH2_jll, MbedTLS_jll, MozillaCACerts_jll (and maybe Sockets too?), with the exception of (probably) Unicode (and maybe Random, because of Threads, which I'm conflicted about dropping).

This was discussed here (but I don't want to discuss it there, since it's not only about DelimitedFiles):

Recipe for moving things out while retaining history: JuliaAI/MLJOpenML.jl#1

Do you think getting any or all of those stdlibs out will help with startup time? Or do they have NO impact until actually used? If they do slow things down, is it because of their precompilation in the sysimage (which could be dropped while still keeping the stdlibs)?

“-10% … -19% with PGO+LTO” is nice (I will use it when the PR is merged), in addition to the 7.6% speedup already, but I'm aiming for something closer to the ~98% reduction in startup time you get by using Perl.

I know getting that far is unrealistic, but I want to know where the extra time is spent, when actually not compiling ANY code:

$ time ~/julia-1-9-DEV-a60c76ea57/bin/julia --startup-file=no --compile=min -e ""

vs.

$ time ~/perl -c ""

A lot of it is loading code from the sysimage.


Right, thanks, that's why I want to radically reduce it. It's just unclear to me whether the stdlibs go in there. I think not, except insofar as some of them are precompiled. E.g. openblas[64] is just a machine-code binary, but some wrapper code exists to use it, and that goes in there. Do you have any idea what might be the largest single factor?

Another angle, and a reason WHY loading the sysimage is slow, is that it's not fully compiled to machine code (I think, or at least historically; the same goes for packages). If that is changing (there is already a PR that stores machine code with it?), then it's unclear why it isn't much faster, if not instant:

For .so files, is using them just memory mapping, i.e. very quick, so that I don't need to worry about the size until I do something beyond the bare minimum I need to do first?

Julia now has a multi-microarchitecture sysimage (3 x86 targets); probably 4 separate x86 microarchitecture builds would be leaner (at least on disk):

  • x86-64-v1
  • x86-64-v2
  • x86-64-v3 (AVX2)
  • x86-64-v4 (AVX-512)
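
For example (a sketch; the exact target names and keywords depend on the Julia/PackageCompiler version), you can restrict to a single microarchitecture either when building Julia from source or when creating a custom sysimage:

$ make JULIA_CPU_TARGET="haswell"    # single target instead of the default multi-target spec

julia> using PackageCompiler
julia> create_sysimage(["Example"]; sysimage_path="lean.so", cpu_target="haswell")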

How about a ramdisk? :-) But seriously, and this is slightly off topic, so I hope you don't mind: do you know the current state of affairs regarding BLAS settings for Julia 1.8 and more recent versions? I recall reading about some significant planned changes (link), but when I skimmed the RC notes I couldn't find anything on this topic, hence the question.

I'm not up to speed on BLAS, and not looking into it much since I want to drop it (i.e. OpenBLAS; @Elrod has a substitute).

@ScottPJones may not be active here anymore, but he made a "Julia-lite" in 2015 (the jlite branch; he also has e.g. a more recent lite branch from 2016, and his master is non-lite). It wasn't used much as far as I know; he was ahead of his time…

He did drop LinearAlgebra/BLAS, from the sysimage at least. At first glance at his base/exports.jl it looked like he dropped e.g. Dates too, but then I saw he had just rearranged things, and LinAlg is still there.

So I'm looking at doing a more recent version of this (and maybe compiling it, and also timing some other old versions):
https://github.com/JuliaLang/julia/commit/67e716ce006398f459fdebac45baef6ea563e90b

Sounds like a great idea regarding the OpenBLAS substitute, and thanks for the info. As for OpenBLAS itself, I'll try to check on GitHub.

Building a sysimage without any stdlibs is very easy with PackageCompiler. Just use the filter_stdlibs argument and an empty project.
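
Roughly like this (a sketch; the project name here is just a placeholder):

julia> using Pkg; Pkg.activate("EmptyEnv")   # an empty project with no [deps]
julia> using PackageCompiler
julia> create_sysimage(String[]; filter_stdlibs=true, incremental=false, sysimage_path="MinimalSysimage.so")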


I had removed more than the things that were later moved out to stdlibs.
In particular, I removed almost anything that required extra libraries (besides LLVM, of course): BigInt and BigFloat, for example.
Regex support could really be moved out as well; while there are a few uses of regexes in Base Julia, they are generally very simple patterns that would actually be more efficient written without regexes.
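
For illustration (not the actual patterns used in Base), this is the sort of check I mean:

line = "#!/usr/bin/env julia"
occursin(r"^#!", line)      # regex version, goes through the PCRE engine
startswith(line, "#!")      # plain-string version, no regex machinery needed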

I hadn't done any more on this because, after most of the "kitchen sink" items were moved out to stdlibs, it became a lot easier to deal with Julia on things like the Raspberry Pi (also, faster, more powerful Pis came out :) )


[None of this needs 1.9, I’m timing with 1.8.1 currently.]

FYI: the search in the docs finds nothing for filter_stdlibs, only for filter_stdlibs=true (and then only under Apps). That seems like a bug in the search function, not in your package (or maybe it's down to the markup you use).

Would you be opposed to the official Julia binary download also including a non-default sysimage for scripting and benchmarks (I think Debian would approve of it, though maybe only if it's an official one), as I describe below?

For some benchmarks, e.g. the Debian benchmarks game, it's critical to reduce startup or we never make the top spot, and Julia's default sysimage blocks that. What I describe below would already put us in the top spot for many of its sub-benchmarks.

Anyway, I used your example, modified as shown below, and it only took a few minutes to build:
https://julialang.github.io/PackageCompiler.jl/stable/sysimages.html

julia> create_sysimage(["Example"]; filter_stdlibs=true, incremental=false, sysimage_path="Scripting_benchmark_Sysimage.so")

First the good news: it cuts startup latency in half, to 94.1 ms from 169.2 ms, and makes the sysimage 67% smaller, at 75 MB (and I'm not using the new sysimage strip option[s] in 1.9, so it could likely do better), according to:

$ hyperfine --min-runs 100 "julia -J Scripting_benchmark_Sysimage.so --startup-file=no --history-file=no --compile=min -e ''"

Doing something actual:

$ julia -J Scripting_benchmark_Sysimage.so --startup-file=no --history-file=no -e 'println("Hello world")'

though trivial, adds only 4.4 ms (or 9.3 ms with -O0).

Now for some strangeness/a bug, probably not relevant to scripting(?): I get

┌ Warning: REPL provider not available: using basic fallback
└ @ Base client.jl:424

and the julia> prompt is no longer green, but otherwise (all?) things seem to work in the REPL, until you exit, at which point you hit some infinite loop.

Some things, or maybe just rand(), do not work, and it would be better to fix that, at least if this were in the official download (I think rand() is the major exception; do you know how to change your example to include Random?), unless you do something like:

$ julia -J ExampleSysimage.so --startup-file=no --history-file=no -O0 -e 'using Random; println(rand())'

This is slow (and I guess many similar invocations are too):

$ time julia -J ExampleSysimage.so --startup-file=no --history-file=no -e 'using Random'

I stopped the web browser etc. while timing, but I'm using julia from juliaup, which adds some overhead; I forgot to run it directly, which is better: [see also $ whereis julia]

$ ~/.julia/juliaup/julia-1.8.1+0.x64/bin/julia …

It doesn’t make much sense to me to have an “official” binary that is only made to score better in some benchmark game.

Not so strange, since you don't have the REPL standard library in the sysimage anymore; Julia then falls back to a much simpler version of the REPL.

You need the Random stdlib for rand to work. Same as you need LinearAlgebra for matrix multiplication to work.


People like Julia for scripting, and that would be the main use of such a sysimage for most people.

You need the Random stdlib for rand to work.

Yes, I mean, is there a way to filter out all stdlibs except Random? It's a true-or-false flag, so I'm missing how. And if you filter out the LinearAlgebra stdlib, everything from it still works, right, it's just slower the first time? It's not likely to be used in (some) scripting. I think regex will work too (fast enough), but maybe it's better not to filter it out.

Just add it to the project that you use to create the sysimage.
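
Concretely, something like this (a sketch; the environment name is just a placeholder, and Random can be added like any other dependency even though it is a stdlib):

julia> using Pkg; Pkg.activate("SysimageEnv")
julia> Pkg.add(["Example", "Random"])
julia> using PackageCompiler
julia> create_sysimage(["Example", "Random"]; filter_stdlibs=true, incremental=false, sysimage_path="BenchmarkSysimage.so")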


Since I dropped all stdlibs, what's in the remaining 33% of the sysimage (Julia source code, or is that stored elsewhere?)? Or rather, are there any big-ticket items that can be further stripped?

Thanks, I tried it and the sysimage builds in under 5 minutes with Random in it (then I think it has full compatibility, but it may not be worth having it in even if you want to use rand).

The sysimage is of course larger, but not just slightly: it's 1.6 MB, or 2%, larger. And I also ran Julia this way when running PackageCompiler this time around (see under --help-hidden; though this was likely ineffective, maybe PackageCompiler needs to pass these itself, since it forks a separate julia process?):

$ julia --strip-metadata --strip-ir
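
If they do need to be passed to the julia process that PackageCompiler spawns (my guess), recent PackageCompiler versions appear to have a sysimage_build_args keyword for exactly that; an untested sketch:

julia> create_sysimage(["Example", "Random"]; filter_stdlibs=true, incremental=false,
                       sysimage_path="BenchmarkSysimage.so",
                       sysimage_build_args=`--strip-metadata --strip-ir`)  # note: stripping IR may limit compiling new code later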

Now:

$ time julia -J BenchmarkSysimage.so -E "rand()"
0.11315268297346925

real	0m0,455s

It's faster than without Random in the sysimage, but still slower than the default sysimage, for a reason I do not understand:

$ time julia -E "rand()"
0.30915648801695506

real	0m0,252s

For those who would like to do the same:

[deps]
Example = "7876af07-990d-54b4-ab0e-23690620f79a"
Random = "9a3f8284-a2c9-5f02-9a11-845980a1fd5c"  # This might be the wrong UUID; I needed one and for now just "randomly" googled for it.

Off-topic:
Strangely this is reliably 21% slower (default sysimage):

$ time julia -e "println(rand())"

than:
$ time julia -E "rand()"

while it reverses with the non-default sysimage (both are about 74% slower there):

$ time julia -J BenchmarkSysimage.so -E "rand()"
0.6843914768479171

real	0m0,446s

$ time julia -J BenchmarkSysimage.so -e "println(rand())"
0.4807361366759346

real	0m0,467s