Looking for advice on achieving faster startup times

(reposting from a Julia+ARM thread because this is not really ARM specific)

I am testing the feasibility of using Julia for an embedded Linux platform. Our current target hardware has a Cortex-A7-based SoC which can run the official 32-bit ARM binary release of Julia 1.6.1.

Other than the large storage requirements, the biggest problem we have found so far is the very long startup time for even relatively simple scripts. Initially this was largely caused by precompilation times for some standard library functions and operators. For example, the precompilation of an “inverse divide” operator was taking ~14 seconds :zzz: . For comparison, the same statement took ~5 seconds to precompile on a Raspberry Pi 4 (Cortex-A72), and slightly under 1 second on an x86_64 laptop.

We can remove a lot of the precompilation delay by using PackageCompiler.create_sysimage. But the run-time overhead remains uncomfortably high, on the order of 10 seconds for a test script with fewer than 20 lines, or about half of that if the filesystem data is cached in memory (which we cannot assume).
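
For reference, the create_sysimage call we are using looks roughly like this (package and script names are placeholders):

using PackageCompiler
create_sysimage(:MyPackage;                        # our application package
    sysimage_path = "MySysimage.so",
    precompile_execution_file = "test_script.jl")  # the same script we use for timing

The script is then started with julia -JMySysimage.so test_script.jl.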

Any suggestions on how to further speed up the startup/load time of Julia scripts/modules/libraries? Or any tricks for getting more out of PackageCompiler?

We would be particularly interested in ways to remove unused code from the binaries, which might help with load times, and would reduce the storage requirements. I was hoping that PackageCompiler.create_app might help with that, but it does not seem to improve the startup time. It does help a bit in reducing storage requirements, though.

The filter_stdlibs argument in create_sysimage and create_app sounds promising, but I have not figured out how to use it; create_sysimage fails with compilation errors when I try it, and the documentation talks about “potential pitfalls” without further details. Any hints or pointers on using filter_stdlibs?

5 Likes

I am letting PackageCompiler.create_sysimage handle the details of the compilation. I am giving it as precompile_execution_file the same script I am using for timing, so I was hoping that it would be precompiling anything that is significantly expensive, but that may not be the case. I have just checked, and there are a few statements that, when using the sysimage, still take on the order of 1-2 seconds (each) to run the first time they appear in the script, but not on subsequent calls. For example, the first call to print(Float64) still takes over 2 seconds when using the sysimage. The second such call takes <2 ms.
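
For what it is worth, this is a minimal way to reproduce the first-call vs. second-call measurement (a sketch, not the actual test script):

# Wall-clock timing of the first and second occurrence of the same statement.
t0 = time_ns()
print(Float64)
println("  first call:  ", (time_ns() - t0) / 1e9, " s")

t0 = time_ns()
print(Float64)
println("  second call: ", (time_ns() - t0) / 1e9, " s")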

I am not sure how much of that is due to precompilation, and how much due to non-cached vs. cached filesystem reads. Those times were measured after having run the script a few times, so I would expect that most of the filesystem reads are from cache, but I can’t be sure.

But at least half of the ~10 seconds that the script takes to run after rebooting the OS (i.e., with no filesystem caching) is spent before any of the script statements start to be evaluated/precompiled. This initial overhead is similar to the time it takes Julia to “run” an empty script after a reboot (about 6 seconds).

3 Likes

With julia -JYourSysImage.so --trace-compile=stderr you can print out every method that still has to be compiled at run time, i.e. everything that is not yet in the system image. Maybe the manual approach as described here allows you to include all of those functions in the system image.
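
For example, the traced statements can be captured in a file and fed back into the image build (file and package names are placeholders, not the exact commands from your setup):

# Step 1 (shell): run the workload with the current image and record what still compiles:
#   julia -JYourSysImage.so --trace-compile=precompile_extra.jl your_script.jl
# Step 2: build a new image that also includes those statements.
using PackageCompiler
create_sysimage(:YourPackage;
    sysimage_path = "YourSysImage.so",
    precompile_statements_file = "precompile_extra.jl")
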
But it is indeed surprising that your system image is not complete.

3 Likes

Thanks for the suggestion. That helped track down some issues. It seems that PackageCompiler and --trace-compile do not play nice with code invoked inside @time and @info macros. As long as we do not include any of those, everything seems to be precompiled.

After taking care of that problem, we have reduced the startup time of our tests to little more than the run time of an empty script, which, as I mentioned in my initial post, is around 6 seconds on the target hardware.

Still, any significant improvement on this empty-script run-time would be very valuable. Is it possible to build the Julia binary and libraries with different configuration options to gain speed? Maybe trading it off for size, or for optional debug features?

3 Likes

This is indeed interesting. Maybe there are some subtle differences in the code paths of these macros when used e.g. in an interactive session vs. from a script (possibly with the output being redirected to a file).

Maybe you can gain additional information by running a profiler like Valgrind:

$ valgrind --tool=callgrind  /home/abarth/opt/julia-1.6.0/bin/julia --eval 'println("foo")'
$ callgrind_annotate  callgrind.out.36227 | head -n 40
--------------------------------------------------------------------------------
Profile data file 'callgrind.out.36227' (creator: callgrind-3.15.0)
--------------------------------------------------------------------------------
I1 cache: 
D1 cache: 
LL cache: 
Timerange: Basic block 0 - 76677294
Trigger: Program termination
Profiled target:  /home/abarth/opt/julia-1.6.0/bin/julia --eval println("foo") (PID 36227, part 1)
Events recorded:  Ir
Events shown:     Ir
Event sort order: Ir
Thresholds:       99
Include dirs:     
User annotated:   
Auto-annotation:  off

--------------------------------------------------------------------------------
Ir          
--------------------------------------------------------------------------------
411,883,942  PROGRAM TOTALS

--------------------------------------------------------------------------------
Ir          file:function
--------------------------------------------------------------------------------
90,783,858  /buildworker/worker/package_linux64/build/src/staticdata.c:jl_restore_system_image_from_stream [/home/abarth/opt/julia-1.6.0/lib/julia/libjulia-internal.so.1.6]
31,110,076  /build/glibc-eX1tMB/glibc-2.31/elf/dl-lookup.c:do_lookup_x [/lib/x86_64-linux-gnu/ld-2.31.so]
22,314,812  /build/glibc-eX1tMB/glibc-2.31/elf/../sysdeps/generic/dl-hash.h:do_lookup_x
22,303,198  /build/glibc-eX1tMB/glibc-2.31/elf/dl-lookup.c:_dl_lookup_symbol_x [/lib/x86_64-linux-gnu/ld-2.31.so]
20,083,366  /buildworker/worker/package_linux64/build/src/flisp/flisp.c:apply_cl'2 [/home/abarth/opt/julia-1.6.0/lib/julia/libjulia-internal.so.1.6]
14,592,602  /build/glibc-eX1tMB/glibc-2.31/elf/dl-lookup.c:check_match [/lib/x86_64-linux-gnu/ld-2.31.so]
13,483,800  /build/glibc-eX1tMB/glibc-2.31/string/../sysdeps/x86_64/multiarch/strcmp-sse2-unaligned.S:__strcmp_sse2_unaligned [/lib/x86_64-linux-gnu/libc-2.31.so]
11,132,546  /build/glibc-eX1tMB/glibc-2.31/string/../sysdeps/x86_64/strcmp.S:strcmp [/lib/x86_64-linux-gnu/ld-2.31.so]
 8,864,148  /buildworker/worker/package_linux64/build/src/support/dtypes.h:jl_restore_system_image_from_stream
 8,766,219  ???:llvm::PMTopLevelManager::setLastUser(llvm::ArrayRef<llvm::Pass*>, llvm::Pass*) [/home/abarth/opt/julia-1.6.0/lib/julia/libLLVM-11jl.so]
 8,154,285  /buildworker/worker/package_linux64/build/src/flisp/flisp.c:symbol [/home/abarth/opt/julia-1.6.0/lib/julia/libjulia-internal.so.1.6]
 7,760,694  /build/glibc-eX1tMB/glibc-2.31/string/../sysdeps/x86_64/multiarch/../strchr.S:__strchr_sse2 [/lib/x86_64-linux-gnu/libc-2.31.so]
 5,710,520  /buildworker/worker/package_linux64/build/src/symbol.c:_jl_symbol [/home/abarth/opt/julia-1.6.0/lib/julia/libjulia-internal.so.1.6]
 5,206,551  /buildworker/worker/package_linux64/build/src/flisp/read.c:read_token [/home/abarth/opt/julia-1.6.0/lib/julia/libjulia-internal.so.1.6]
 5,108,943  /buildworker/worker/package_linux64/build/src/support/MurmurHash3.c:MurmurHash3_x64_128 [/home/abarth/opt/julia-1.6.0/lib/julia/libjulia-internal.so.1.6]

The filter_stdlibs option does sound promising, but I haven't used it myself. Maybe if you post the error message that you are seeing, or an MWE, somebody can help you with this.

FYI: There’s an issue in the julia repo about standard libraries; I commented there and linked to this thread.

1 Like

Quickly played around with the stdlibs. PackageCompiler.jl provides gather_stdlibs_project, which collects all standard libraries required by a given project. It does so by comparing the project’s manifest against the standard libraries in the current system image. The latter, in turn, are determined by intersecting the set of all available standard libraries with the modules present in the current image.

Function gather_stdlibs_project is used by create_sysimage when filter_stdlibs = true. So I created a dummy module with just two one-line functions (one printing “hello world”, the other adding its two arguments), and a script using this module and calling the two functions. I then created a new project/environment and added only this module. Running gather_stdlibs_project on that project gives empty arrays – correct.
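
For reference, the dummy package and the driver script look roughly like this (a sketch; the names match the ones used below):

# src/MyMinimalPackage.jl
module MyMinimalPackage

export hello, add_two

hello() = println("hello world")
add_two(x, y) = x + y

end

# call_all_functions.jl -- exercises both functions
using MyMinimalPackage
hello()
println(add_two(1, 2))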

The workflow then is to activate an environment that has PackageCompiler, do using PackageCompiler, and then activate the project from which the reduced sysimage is to be generated. Then, something like

PackageCompiler.create_sysimage(:MyMinimalPackage;
    sysimage_path="MyMinimalPackageSysimagePrecompile_StdLibFilter.so",
    precompile_execution_file="call_all_functions.jl",
    filter_stdlibs=true, incremental=false)

generates a new “base” sysimage (i.e. not incremental) without any standard libraries. On Windows 10, this produces a 65 MB sysimage, vs. 150 MB without removing the stdlibs.
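
For completeness, the environment switching before that call might look like this (environment and project paths are placeholders):

using Pkg
Pkg.activate("compiler_env")        # environment that has PackageCompiler installed
using PackageCompiler
Pkg.activate("MyMinimalProject")    # project whose manifest drives filter_stdlibs
# ...then run the create_sysimage call shown above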

Starting the Julia REPL with this image needs to precompile a REPL provider and falls back to a very basic one. The functions from the module are callable right away, as is everything from core Julia, but nothing more. Some stdlib functions that are normally available by default, e.g. rand or the backslash operator \, are not available anymore. You’d have to explicitly do using Random (or using LinearAlgebra, respectively), which will precompile those stdlibs and then also has to JIT-compile the functions.
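
For example, with the reduced image something like the following is needed before rand and \ work again (a sketch):

using Random, LinearAlgebra   # load the filtered-out stdlibs at run time

x = rand(3)                   # available again, from Random
A = rand(3, 3)
y = A \ x                     # matrix backslash, from LinearAlgebra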

Now the interesting question is how much this reduces the startup time on an embedded device. Could you try this, @enriquer?

2 Likes

Hrm, I didn’t get any errors; everything was straightforward. Said pitfalls (and possibly also your compilation errors?) might be those imported-by-default stdlib functions that one might not be aware of. So, if there is no dependency on, say, Random in your project, and thus this stdlib is not in the manifest, but you use rand in your package/code, this will at least produce a runtime error.

The errors were like this one:

ERROR: LoadError: MethodError: no method matching *(::Matrix{Int64}, ::Vector{Float64})
Closest candidates are:
  *(::Any, ::Any, ::Any, ::Any...) at operators.jl:560
  *(::Number, ::AbstractArray) at arraymath.jl:52
  *(::AbstractMatrix{var"#s79"} where var"#s79"<:Number) at abstractarraymath.jl:98

@asprionj figured out the problem:

Indeed, that was the cause of the errors I was seeing. The example was using some operators from LinearAlgebra without importing the modules explicitly. Adding explicit imports for LinearAlgebra and Random and updating Manifest.toml accordingly fixed the errors.
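
Concretely, the fix amounted to something like this (project path is a placeholder):

using Pkg
Pkg.activate("MyMinimalProject")
Pkg.add(["LinearAlgebra", "Random"])   # record the stdlibs in Project.toml / Manifest.toml

# ...and in the package/script itself:
using LinearAlgebra, Random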

1 Like

It does help!

These are the run times of the same example with different sysimage configurations. All the times are measured right after an OS reboot (subsequent runs go much faster due to filesystem caching).

  • No sysimage: 43 seconds
  • Regular sysimage: 7.3 seconds
  • filter_stdlibs sysimage: 5.4 seconds

All in all, the startup performance on the embedded system is now quite acceptable.

BTW, using filter_stdlibs was not completely straightforward. In addition to the problem with missing using statements mentioned in another reply, the initial create_sysimage call did not precompile everything. Using --trace-compile with the resulting image still showed some precompile statements. I had to switch to precompile_statements_file, and do multiple passes, appending the output of --trace-compile to the precompile statements after each create_sysimage pass, until --trace-compile stopped generating any output. I ended up having to run create_sysimage three times.
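
Roughly, each pass looked like this (file and package names are placeholders):

using PackageCompiler

statements = "precompile_statements.jl"
isfile(statements) || touch(statements)

for pass in 1:5   # in practice three passes were enough
    create_sysimage(:MyMinimalPackage;
        sysimage_path = "MyMinimalPackageSysimage.so",
        precompile_statements_file = statements,
        filter_stdlibs = true,
        incremental = false)

    # Run the test script with the fresh image and record anything that still compiles.
    trace = "trace_pass$(pass).jl"
    run(`julia -JMyMinimalPackageSysimage.so --trace-compile=$trace test_script.jl`)

    filesize(trace) == 0 && break   # nothing left to precompile

    # Append the newly traced statements and go around again.
    open(statements, "a") do io
        write(io, read(trace, String))
    end
end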

I did not observe the same issue when generating an image without filter_stdlibs and with incremental=true. In that case, running --trace-compile while using the initial image (first pass) did not generate any further precompile statements.

3 Likes

I assume you mean incremental=false?

I did not pass the incremental argument explicitly in that experiment, and I think it defaults to true. I only set it explicitly (to false) when I was using filter_stdlibs=true.