SIMD and compiled code

Hi,

Different processors have different SIMD capabilities, in particular different vector widths that the processor can operate on at once.
However, there are a few points w.r.t. code compilation which I am not sure about.

If I understood correctly, Julia automatically generates SIMD code optimized for the current machine. It is able to do so because the code is compiled on the individual machine.
Pre-compiled binaries for shipping, however, would either need to be compiled for the lowest common denominator (i.e. not using the SIMD capabilities of most modern CPUs to their full extent) or require explicit treatment of different SIMD capabilities at runtime, correct?
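
One way to check the first point on a given machine is to look at the native code Julia emits for a small loop. This is only a minimal sketch: `mysum` is a made-up function name, and the printed instructions depend entirely on the CPU it runs on.

```julia
using InteractiveUtils  # provides code_native outside the REPL

# Simple reduction; @simd allows reordering of the additions so that
# LLVM is free to vectorize the loop for the current CPU.
function mysum(xs::Vector{Float32})
    s = 0.0f0
    @inbounds @simd for i in eachindex(xs)
        s += xs[i]
    end
    return s
end

# CPU name as detected by Julia/LLVM on this machine
println("CPU: ", Sys.CPU_NAME)

# Print the machine code compiled for this CPU. On x86_64, wide vector
# registers (xmm/ymm/zmm) in the loop body indicate SIMD code.
code_native(stdout, mysum, Tuple{Vector{Float32}}; debuginfo = :none)
```

Running the same script on machines with different vector widths should, in principle, show different register sizes in the inner loop.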

What about the default Julia sysimage? Would it increase the runtime speed of (some) Base functionality if the sysimage were recompiled on the individual machine, taking its full SIMD capabilities into account?

I had a related question a few years back. There I asked whether it matters which compiler (icc vs gcc) one uses when building Julia. The answer was pretty much no.

This question is about the system image, which is not related to the compiler used to compile Julia itself. The default system image is simultaneously compiled for multiple microarchs to take advantage of the different CPU features, including SIMD.

If I compile Julia myself, is the resulting system image also compiled for multiple microarchs?

Update: I guess what I’m really asking is how independent the compilation of the system image is from the compilation of Julia itself. From https://docs.julialang.org/en/v1/devdocs/sysimg/ I take it that the targets can be specified by a make option “during system image compilation”.

Update2: In https://github.com/JuliaLang/julia/blob/master/Make.inc I found:

# JULIA_CPU_TARGET is the JIT-only complement to MARCH. Setting it explicitly is not generally necessary,
#    since it is set equal to MARCH by default

Does this imply that the system image obtained when building Julia from source is not for multiple microarchs?

AFAIU Julia supports an intermediate technique between the two you mention: it can bake into one sysimage several versions of the same code, compiled and optimized for different microarchitectures. It thereby achieves the best of both worlds: (almost-)fully optimized code and no run-time compilation latency (for what gets baked into the system image).

As explained in the devdocs, this is how the default sysimage that is shipped with official pre-built Julia releases is compiled.
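
The idea of picking one of several baked-in versions at load time can be sketched as follows. This is a toy illustration, not Julia's actual loader code: the `Target` type, the feature symbols and `select_target` are all hypothetical names chosen for the example.

```julia
# Toy model of multi-versioning: each target is a clone of the code
# that assumes a certain set of CPU features.
struct Target
    name::String
    required::Set{Symbol}   # features the clone relies on (illustrative)
end

# Ordered from least to most specific; the first entry is the fallback.
const TARGETS = [
    Target("generic",     Set{Symbol}()),
    Target("sandybridge", Set([:sse42, :avx])),
    Target("haswell",     Set([:sse42, :avx, :avx2, :fma])),
]

# Return the most specific target whose required features are all
# available on the host CPU.
function select_target(available::Set{Symbol}, targets = TARGETS)
    best = first(targets)
    for t in targets
        if issubset(t.required, available)
            best = t
        end
    end
    return best
end

# A CPU with AVX but without AVX2 gets the "sandybridge" clone.
println(select_target(Set([:sse42, :avx])).name)
```

A CPU that supports none of the listed features still works, because the empty requirement set of the generic target always matches.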

If you want to use PackageCompiler to produce custom system images for your own packages, you can achieve the same effect (i.e. produce sysimages that are both portable and optimized) by passing the cpu_target keyword argument to create_sysimage. For example, for x86_64 architectures:

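using PackageCompiler

create_sysimage(
    packages,   # placeholder: the packages to bake into the sysimage
    sysimage_path = "my_sysimage.so",
    # [other kw args]
    # one version of the code per target: generic baseline, Sandy Bridge, Haswell
    cpu_target = "generic;sandybridge,-xsaveopt,clone_all;haswell,-rdrnd,base(1)"
)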
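
To confirm that a session was really started with the custom sysimage (for example via `julia --sysimage my_sysimage.so`), something like the following sketch may help. It reads `Base.JLOptions()`, which is an internal structure rather than a documented stable API, and `opt_string` is a helper name made up for this example.

```julia
# Convert a C string pointer from JLOptions to a String ("" if unset).
opt_string(p) = p == C_NULL ? "" : unsafe_string(p)

opts = Base.JLOptions()

# Path of the system image this session was started with
sysimage_in_use = opt_string(opts.image_file)

# Value of the --cpu-target option for this session
cpu_target_flag = opt_string(opts.cpu_target)

println("sysimage:   ", sysimage_in_use)
println("cpu target: ", cpu_target_flag)
println("host CPU:   ", Sys.CPU_NAME)
```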

Thanks for your answers @yuyichao and @ffevotte!
I assume Fortran and C compilers work in the same way, i.e. they compile binaries for different microarchitectures, correct?