Global constants that cannot be defined until runtime because they depend on specifics of the CPU architecture

@elrod and I are running into problems when trying to make LoopVectorization.jl relocatable (for example, compiling a sysimage that includes LoopVectorization.jl on machine A, and then using that sysimage on machine B).

The basic idea is that we have global constants that depend on the specific CPU architecture.

As an example:

module Foo

import CpuId

struct IntelCpu end
struct OtherCpu end

const CPU_BRAND = if startswith(CpuId.cpubrand(), "Intel(R) ")
    IntelCpu()
else
    OtherCpu()
end

do_stuff() = do_stuff(CPU_BRAND)
do_stuff(::IntelCpu) = 1
do_stuff(::OtherCpu) = 1.0 

end # module

Unfortunately, the above code is not relocatable. If I compile my package Foo.jl into a sysimage or app on a computer with an Intel CPU, and then move that sysimage or app to a computer with a non-Intel CPU, bad things will happen.
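For example (an illustrative session, not captured output), on the non-Intel machine B the value baked in at build time still wins:

julia> Foo.do_stuff()  # machine B, non-Intel CPU, sysimage built on Intel machine A
1                      # IntelCpu() was baked into CPU_BRAND on machine A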

Since the information I need about the CPU architecture is not available until runtime, I figure I should move that logic into __init__. So I try this instead:

module Foo

import CpuId

struct IntelCpu end
struct OtherCpu end

do_stuff() = do_stuff(CPU_BRAND)
do_stuff(::IntelCpu) = 1
do_stuff(::OtherCpu) = 1.0 

function __init__()
    if startswith(CpuId.cpubrand(), "Intel(R) ")
        @eval const CPU_BRAND = IntelCpu()
    else
        @eval const CPU_BRAND = OtherCpu()
    end
    return nothing
end

end # module

Unfortunately, this will break precompilation. If I have a package Bar.jl that depends on Foo.jl, e.g. this:

module Bar

import Foo

end # module

When I try to do import Bar, I get this error:

julia> import Bar
[ Info: Precompiling Bar [f4235cf3-1c45-4253-b7f4-6bb3fb59c5c4]
ERROR: LoadError: InitError: Evaluation into the closed module `Foo` breaks incremental compilation because the side effects will not be permanent. This is likely due to some other module mutating `Foo` with `eval` during precompilation - don't do this.
Stacktrace:
  [1] eval
    @ ./boot.jl:369 [inlined]
  [2] __init__()
    @ Foo ~/Downloads/MWE-eval/Foo.jl/src/Foo.jl:10
  [3] _include_from_serialized(path::String, depmods::Vector{Any})
    @ Base ./loading.jl:670
  [4] _require_from_serialized(path::String)
    @ Base ./loading.jl:723
  [5] _require(pkg::Base.PkgId)
    @ Base ./loading.jl:1027
  [6] require(uuidkey::Base.PkgId)
    @ Base ./loading.jl:910
  [7] require(into::Module, mod::Symbol)
    @ Base ./loading.jl:897
  [8] include
    @ ./Base.jl:386 [inlined]
  [9] include_package_for_output(pkg::Base.PkgId, input::String, depot_path::Vector{String}, dl_load_path::Vector{String}, load_path::Vector{String}, concrete_deps::Vector{Pair{Base.PkgId, UInt64}}, source::Nothing)
    @ Base ./loading.jl:1209
 [10] top-level scope
    @ none:1
 [11] eval
    @ ./boot.jl:369 [inlined]
 [12] eval(x::Expr)
    @ Base.MainInclude ./client.jl:453
 [13] top-level scope
    @ none:1
during initialization of module Foo
in expression starting at /Users/dilum/Downloads/MWE-eval/Bar.jl/src/Bar.jl:1
ERROR: Failed to precompile Bar [f4235cf3-1c45-4253-b7f4-6bb3fb59c5c4] to /Users/dilum/.julia/compiled/v1.7/Bar/jl_Q6bC82.
Stacktrace:
 [1] error(s::String)
   @ Base ./error.jl:33
 [2] compilecache(pkg::Base.PkgId, path::String, internal_stderr::Base.TTY, internal_stdout::Base.TTY)
   @ Base ./loading.jl:1356
 [3] compilecache(pkg::Base.PkgId, path::String)
   @ Base ./loading.jl:1302
 [4] _require(pkg::Base.PkgId)
   @ Base ./loading.jl:1017
 [5] require(uuidkey::Base.PkgId)
   @ Base ./loading.jl:910
 [6] require(into::Module, mod::Symbol)
   @ Base ./loading.jl:897

Any ideas on how we might accomplish this?


For now, I’ve implemented an awful hack that I strongly suspect is not future-proof, but it works.

I define a lot of @generated functions that read global mutable state.
The parsed forms of these @generated functions are saved when precompiled, but hopefully none of the constants get baked in.

The package loads host-specific data in __init__(), and afterwards functions can query these @generated functions, which are then compiled with correct host information.
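As a rough sketch of the hack (hypothetical names, not the actual VectorizationBase.jl code; note that if anything compiles cpu_name() during precompilation, the precompile-time value gets baked in, which is exactly the fragility described next):

const HOST_CPU_NAME = Ref{String}("generic")  # filled in by __init__

@generated function cpu_name()
    # The generator runs when cpu_name() is first compiled; on the end
    # user's machine that happens after __init__, so the runtime value
    # is spliced in as a compile-time constant.
    return HOST_CPU_NAME[]  # a literal value is a valid generated body
end

function __init__()
    HOST_CPU_NAME[] = Sys.CPU_NAME
end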

This still requires that others limit the amount of compilation they do; e.g., they shouldn’t call any of these methods in a precompile script used by PackageCompiler.jl or in their own packages’ precompile scripts.
Someone improving their package’s latency by following this otherwise good suggestion could inadvertently make their package incompatible with redistribution by depending indirectly on VectorizationBase.jl.

It’s also unfortunate that the solution has to be “stop precompiling”, when I suspect the majority of users are still compiling all their code on their machine for their machine – they shouldn’t have to accept latency.

@yuyichao has been warning about this for some time.

I know all of this can be handled correctly by LLVM, and multiversioning means we can have our cake and eat it too in the LLVM world (i.e., get precompilation and still have target-specific code).
Is this a solved problem in the LLVM world, but an unsolvable one in the Julia world? I’d much rather not abandon LoopVectorization as a Julia project over this. That’s unlikely; it’s just the solution I think a few would suggest. :wink:


Option 1 (widely compatible, sub-optimal codegen, precompile-friendly)

Use a Ref{T} at global scope:

module Foo

import CpuId

abstract type AbstractCpuType end
struct UnknownCpu <: AbstractCpuType end
struct IntelCpu <: AbstractCpuType end
struct OtherCpu <: AbstractCpuType end

do_stuff() = do_stuff(CPU_BRAND[])  # dereference the Ref
do_stuff(::IntelCpu) = 1
do_stuff(::OtherCpu) = 1.0

const CPU_BRAND = Ref{AbstractCpuType}(UnknownCpu())

function __init__()
    if startswith(CpuId.cpubrand(), "Intel(R) ")
        CPU_BRAND[] = IntelCpu()
    else
        CPU_BRAND[] = OtherCpu()
    end
    return nothing
end

end # module

The type won’t be available at compile time, though; so if you’re hoping to make use of it for clever dispatching, you’re going to be breaking out into the dynamic environment a lot, because the compiler won’t know beforehand what kind of CPU it is:

julia> function do_lots_of_stuff()
           return Foo.do_stuff(Foo.CPU_BRAND[])
       end

julia> @code_warntype do_lots_of_stuff()
Variables
  #self#::Core.Const(do_lots_of_stuff)

Body::Union{Float64, Int64}
1 ─ %1 = Foo.do_stuff::Core.Const(Foo.do_stuff)
│   %2 = Foo.CPU_BRAND::Core.Const(Base.RefValue{Foo.AbstractCpuType}(Foo.OtherCpu()))
│   %3 = Base.getindex(%2)::Foo.AbstractCpuType
│   %4 = (%1)(%3)::Union{Float64, Int64}
└──      return %4

(Note that the compiler just sees an AbstractCpuType.)
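One common mitigation (my illustration, not part of the original option): branch on the concrete type once at the boundary, so the work inside the branch is compiled against a concrete type:

function do_lots_of_stuff_barrier()
    brand = Foo.CPU_BRAND[]
    if brand isa Foo.IntelCpu
        return Foo.do_stuff(brand)  # brand is known to be IntelCpu here
    else
        return Foo.do_stuff(brand)  # still abstract: one dynamic dispatch
    end
end

The return type is still a small Union, but putting the expensive work inside the isa branch confines the dynamic dispatch to a single check.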

Option 2 (widely compatible, optimal codegen, precompile-unfriendly)

You can use #265 tricks to regenerate everything downstream of your switching functions:

module Foo

import CpuId

abstract type AbstractCpuType end
struct UnknownCpu <: AbstractCpuType end
struct IntelCpu <: AbstractCpuType end
struct OtherCpu <: AbstractCpuType end

do_stuff() = do_stuff(CPU_BRAND())  # CPU_BRAND is now a function
do_stuff(::IntelCpu) = 1
do_stuff(::OtherCpu) = 1.0

CPU_BRAND() = UnknownCpu()

function __init__()
    if startswith(CpuId.cpubrand(), "Intel(R) ")
        @eval CPU_BRAND() = IntelCpu()
    else
        @eval CPU_BRAND() = OtherCpu()
    end
    return nothing
end

end # module

Note this can be very expensive, compile-time-wise! Anything that depends on the result of CPU_BRAND() will necessarily be recompiled after that @eval, which can be quite the performance penalty, and the recompiled code won’t be cached. You can instantiate CPU_BRAND with the “proper” choice at the top level and then re-set it afterwards (see the sketch below), but even if you’re not making any functional change, I’m pretty sure the @eval will still re-trigger recompilation of all dependent methods, purely because it defines a new method.
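A sketch of that top-level instantiation (my illustration; the choice is baked in at precompile time on the build machine, and the __init__ @eval re-triggers recompilation of dependents even when it defines an identical body):

# At module top level, replacing `CPU_BRAND() = UnknownCpu()`:
if startswith(CpuId.cpubrand(), "Intel(R) ")
    CPU_BRAND() = IntelCpu()
else
    CPU_BRAND() = OtherCpu()
end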

Proof of optimal codegen:

julia> using Foo
       function do_lots_of_stuff()
           return Foo.do_stuff(Foo.CPU_BRAND())
       end
do_lots_of_stuff (generic function with 1 method)

julia> @code_warntype do_lots_of_stuff()
Variables
  #self#::Core.Const(do_lots_of_stuff)

Body::Float64
1 ─ %1 = Foo.do_stuff::Core.Const(Foo.do_stuff)
│   %2 = Foo.CPU_BRAND::Core.Const(Foo.CPU_BRAND)
│   %3 = (%2)()::Core.Const(Foo.OtherCpu())
│   %4 = (%1)(%3)::Core.Const(1.0)
└──      return %4

Option 3 (v1.6+ compatible only, optimal codegen, precompile-kinda-friendly)

You can use Preferences.jl to bake these choices into your .ji files, then force recompilation if the stored choice doesn’t match the running host. Do something like:

module Foo

import CpuId
using Preferences

abstract type AbstractCpuType end
struct UnknownCpu <: AbstractCpuType end
struct IntelCpu <: AbstractCpuType end
struct OtherCpu <: AbstractCpuType end

do_stuff() = do_stuff(CPU_BRAND())  # CPU_BRAND is defined by set_cpu_brand below
do_stuff(::IntelCpu) = 1
do_stuff(::OtherCpu) = 1.0

function parse_cpu_brand(brand::String)
    if startswith(brand, "Intel(R) ")
        return IntelCpu()
    else
        return OtherCpu()
    end
end

# This function will force a new entry into the method table,
# triggering recompilation of everything that uses `CPU_BRAND()`
function set_cpu_brand(new_brand::String)
    @eval CPU_BRAND() = $(parse_cpu_brand(new_brand))
    @set_preferences!("cpubrand" => new_brand)
end

# Generate CPU_BRAND(); this is what will be stored in the `.ji` file.
# We load the CPU brand from the stored preferences if it exists,
# otherwise we just use the current host.  This is used so that we can
# use the precompilation preference checker to invalidate a `.ji` file
# if it doesn't match the currently-running host, by setting a different
# preference.
set_cpu_brand(@load_preference("cpubrand", CpuId.cpubrand()))

function __init__()
    curr_brand = CpuId.cpubrand()

    # Set a new preference if the current host doesn't match
    if parse_cpu_brand(curr_brand) != CPU_BRAND()
        @info("CPU brand mismatch detected; forcing re-compilation")
        set_cpu_brand(curr_brand)
    end
    return nothing
end

end # module

With this setup, we (1) bake a choice into the .ji file, (2) check that it matches in __init__(), (3) force recompilation (through the #265-based trick) if it doesn’t, and (4) record the choice as a preference, forcing generation of a new .ji file in the future.
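For reference, the recorded preference lands in the active project’s LocalPreferences.toml; since compile-time preferences are hashed into the .ji cache validation, changing it is what forces the new .ji file. The contents would look something like this (illustrative, with a placeholder for the real brand string):

julia> print(read("LocalPreferences.toml", String))
[Foo]
cpubrand = "<output of CpuId.cpubrand() on the new host>"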

Proof of optimal codegen:

julia> using Foo
       function do_lots_of_stuff()
           return Foo.do_stuff(Foo.CPU_BRAND())
       end
do_lots_of_stuff (generic function with 1 method)

julia> @code_warntype do_lots_of_stuff()
Variables
  #self#::Core.Const(do_lots_of_stuff)

Body::Float64
1 ─ %1 = Foo.do_stuff::Core.Const(Foo.do_stuff)
│   %2 = Foo.CPU_BRAND::Core.Const(Foo.CPU_BRAND)
│   %3 = (%2)()::Core.Const(Foo.OtherCpu())
│   %4 = (%1)(%3)::Core.Const(1.0)
└──      return %4

julia> Foo.set_cpu_brand("Intel(R) ")

julia> @code_warntype do_lots_of_stuff()
Variables
  #self#::Core.Const(do_lots_of_stuff)

Body::Int64
1 ─ %1 = Foo.do_stuff::Core.Const(Foo.do_stuff)
│   %2 = Foo.CPU_BRAND::Core.Const(Foo.CPU_BRAND)
│   %3 = (%2)()::Core.Const(Foo.IntelCpu())
│   %4 = (%1)(%3)::Core.Const(1)
└──      return %4

This is a slight abuse of the Preferences system, but recording choices that can invalidate .ji files is the only way I know to avoid paying the price of compilation every time you load a package from disk whose saved .ji file is incorrect.

Concerns about System images

Since I saw you were concerned about system images: any #265-based approach will work with system images, but you’ll pay the recompilation price every time you load, since the generated code is just wrong. There’s really no way around that; users must recompile if the compiled code is wrong. With all three approaches above you should get functional code, but (3) won’t actually affect anything, since there is no invalidation check when loading sys.so; unlike .ji files, we can’t invalidate the system image. :wink:


I think there are two decent choices:

  • Generate code for all architectures and pick the one to use at runtime (similar to BLAS; see the sketch below).
  • Have a way to target a specific architecture at sysimage compile time

This is similar to the cpu_target that PackageCompiler supports for the sysimage.

Of course, since there is a JIT, you can just JIT your code at startup, but I guess the assumption is that this isn’t really desired.
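A minimal sketch of the first suggestion (my illustration with hypothetical names, reusing the CpuId import from the earlier examples; real BLAS-style dispatch happens at a lower level, but the shape is the same):

# Compile one variant per architecture ahead of time, pick once at load:
do_stuff_intel() = 1
do_stuff_other() = 1.0

const DO_STUFF = Ref{Function}(do_stuff_other)

function __init__()
    DO_STUFF[] = startswith(CpuId.cpubrand(), "Intel(R) ") ? do_stuff_intel : do_stuff_other
end

do_stuff() = DO_STUFF[]()  # one dynamic call at the trampoline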


Wonderful suggestions, thank you!

I’d rather not choose option 1, as optimal codegen is these packages’ primary objective.

With the caveat that it requires 1.6, option 3 sounds strictly better than option 2.
Option 2 is still way better than my current @generated approach, but because the @generated approach seems to work well enough for now on 1.5, I think I’ll just go straight for Preferences.jl.
Abusing #265 is so much better than @generated as a solution that I’m a little embarrassed I didn’t think of it. I also definitely use @generated too often.

I’ll set defaults to the most likely values, e.g. assume AVX2 but no AVX512 for x86_64, so that if/when it gets built into a sysimage, the majority of users don’t have to recompile.
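A sketch of that defaulting (hypothetical names; the jl_generating_output guard is the one discussed further down in this thread):

use_avx512() = false  # conservative default: most x86_64 hosts lack AVX-512

function __init__()
    ccall(:jl_generating_output, Cint, ()) == 1 && return  # don't @eval while precompiling
    if occursin("+avx512f", host_feature_string())  # host_feature_string() is hypothetical
        @eval use_avx512() = true  # only the AVX-512 minority pays the recompilation
    end
end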

Both are good, but how can these be done at the Julia level?

Unfortunately, I don’t even know how to query information about the target, so LoopVectorization won’t work correctly if you start Julia with -Cnehalem, for example.
There is:

julia> run(pipeline(`nm -D $(Base.libllvm_path())`, `grep TargetMachineFeatureString`));
00000000022e4370 T LLVMGetTargetMachineFeatureString@@JL_LLVM_11.0

But I’m not sure how to call this function. I still don’t know much about how LLVM works (I should spend some time learning the ecosystem eventually), but I’d need a TargetMachine object. Presumably this is involved in the JIT somewhere?

So I’ve just been using

julia> run(pipeline(`nm -D $(Base.libllvm_path())`, `grep HostCPUFeatures`));
00000000022e4b00 T LLVMGetHostCPUFeatures@@JL_LLVM_11.0

which maybe Julia wraps specially, because all that’s needed as of Julia 1.6 is:

import Libdl

llvmlib_path = Base.libllvm_path()
libllvm = Libdl.dlopen(llvmlib_path)
gethostcpufeatures = Libdl.dlsym(libllvm, :LLVMGetHostCPUFeatures)
features_cstring = ccall(gethostcpufeatures, Cstring, ())
features = unsafe_string(features_cstring)  # e.g. "+sse2,+avx2,-avx512f,..."

It’d be great if this could take advantage of the sysimage’s multiversioning or otherwise look at and use the target architecture.


Note that this is the very reason I told you that this is the wrong level to do things.


The reason I didn’t really pursue @eval routes earlier was that when Dilum tried just evaling the definitions in __init__, we immediately found it didn’t work:

julia> using VectorizedRNG
[ Info: Precompiling VectorizedRNG [33b4df10-0173-11e9-2a0c-851a7edac40e]
ERROR: LoadError: InitError: Evaluation into the closed module `VectorizationBase` breaks incremental compilation because the side effects will not be permanent. This is likely due to some other module mutating `VectorizationBase` with `eval` during precompilation - don't do this.

I made a PR implementing the Preferences.jl approach (option 3), but got the above error whenever it tried to @eval during __init__,
meaning I can’t just @eval (or include) corrections. Being able to make corrections is essential for sysimages, as they will need to make the corrections every time.
If it’s just a precompile file, all that’s needed is Preferences to invalidate the precompile file.

But Requires.jl somehow works around this, so I’ll take a look at it.
EDIT: Solved by looking at Requires.jl:

function __init__()
    ccall(:jl_generating_output, Cint, ()) == 1 && return
    ...
end

is all it takes; i.e., if Julia is currently precompiling, don’t @eval.

Of course, somehow being able to use cpu_target or getting multi-versioning would be even better.

You gave other reasons too, but yes, it may be nice to eventually move the project to a lower level. It’s still experimental, though, with a fair bit more experimentation planned, where faster iteration (aided by my familiarity with Julia) is helpful.
Or maybe MLIR and related projects will just render it all obsolete (although in that case I’d wish I were more immersed in the LLVM and MLIR world).

On the other hand, if such problems are solvable in the Julia world, there are some advantages. I don’t know if being able to move quickly will ever cease to be an advantage.


I talked this over with @dilumaluthge a little more, and decided to go with option 2, because unless I’m misunderstanding something, Preferences.jl doesn’t (currently?) provide an advantage here.

The PR, as of writing this comment (it will change shortly, but the link should be to a static snapshot), works like this:

The module itself defines system-info functions as having the @load_preference value or, if @load_preference returns nothing, the dynamically determined value:

function define_cpu_name(cpu)
    @eval cpu_name() = $cpu
    @set_preferences!("cpu_name" => cpu)
end
define_cpu_name(@load_preference("cpu_name", Sys.CPU_NAME))

It @set_preferences!s these to save the values.

Then, in __init__(), it goes through each of these and checks whether the dynamic value matches the preference. If they match, it continues.
If they don’t match, it @evals to overwrite the old system-info function, and then @set_preferences!s to set the new value as the default:

function __init__()
    ccall(:jl_generating_output, Cint, ()) == 1 && return
    cpu = @load_preference("cpu_name")
    cpu == Sys.CPU_NAME || define_cpu_name(Sys.CPU_NAME)
end

Now, when loading the package, if the preference was wrong, we correct the function definition, and set a new preference so that VectorizationBase will re-precompile with the correct value the next time it is loaded.

What this would look like without Preferences.jl:
The module itself defines system-info functions as having the dynamically determined value:

function define_cpu_name(cpu)
    @eval cpu_name() = $cpu
end
define_cpu_name(Sys.CPU_NAME)

Then, in __init__(), it goes through each of these and checks whether the dynamic value matches the function definition. If they match, it continues; if they don’t, it @evals to overwrite the old system-info function:

function __init__()
    ccall(:jl_generating_output, Cint, ()) == 1 && return
    cpu_name() == Sys.CPU_NAME || define_cpu_name(Sys.CPU_NAME)
end

The advantage of Preferences.jl here is that it forces the .ji file to re-precompile, whereas a plain @eval does not.

I see three scenarios:

  1. Users install packages in the normal way.
  2. Users install packages in such a way that the same .ji file gets distributed to multiple systems, possibly with different architectures.
  3. Users receive Julia binaries, such as system images or apps that include VectorizationBase.jl.

Behavior is identical in cases 1 and 3. In case 1, the original precompile file is correct, and never needs updating, so the ability to force re-precompilation never comes into play. In case 3, it can’t re-precompile, so there is no difference.

In case 2, Preferences.jl is not necessarily preferable. It might be if users were receiving precompiled .ji files when adding packages. Currently, however, case 2 is most common on distributed file systems such as clusters. When such a cluster is heterogeneous, odds are the nodes will constantly redefine preferences, forcing re-compilation of VectorizationBase.jl and all of its dependents every time they are loaded. That would be worse and much slower than simply @evaling a few functions on every startup.
