How to encourage compiler optimizations?

I have two functions that contain an identical block of code foo followed by different blocks of code bar1 and bar2. I time these functions using our timing tool, which is essentially a derivative of _timer_expr from TimerOutputs.jl.

What I observe is that, depending on whether foo is followed by bar1 or bar2, the timings of

@trixi_timeit timer() "foo" begin
    foo
end
<bar1, bar2>

vary by a factor of three at the default optimization level 2, i.e., in one case the unchanged block of code foo is turned into machine code that is three times slower.

If, however, I run the code with --optimize=1, the timer outputs are equal (up to system noise).

How can I achieve the same optimization for both cases? Should I split foo into a separate function?

Compilation is done for functions, not individual code blocks. So if there is some kind of dependency between your foo and bar1 or bar2, the optimizations can differ. It's hard to say what exactly happens here without further details.
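As a minimal, hypothetical sketch of the effect (foo_block, wrapper1, and wrapper2 are made-up stand-ins, not code from this thread): since Julia optimizes whole method bodies, the instructions emitted for the shared part need not be identical in the two wrappers.

using InteractiveUtils  # for @code_llvm outside the REPL

# Hypothetical stand-ins for the shared block and the two tails:
foo_block(x) = sum(abs2, x)              # plays the role of foo

function wrapper1(x)
    s = foo_block(x)                     # may be inlined and co-optimized with the code below
    return s + minimum(x)                # plays the role of bar1
end

function wrapper2(x)
    s = foo_block(x)                     # may be optimized differently next to bar2
    return s * maximum(x)                # plays the role of bar2
end

# Compare the generated code; the part coming from foo_block
# can differ between the two wrappers:
@code_llvm wrapper1(rand(100))
@code_llvm wrapper2(rand(100))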


That is good to know!

Unfortunately, extracting foo into a separate function

function foo()
    # body of the shared block foo
end

@trixi_timeit timer() "foo" begin
    foo()
end
<bar1, bar2>

still gives the same results in terms of performance, i.e., the timing discrepancy persists.

Is it possible that in one case it is inlined and in the other case not? Did you try the @inline macro?

Alternatively, try @noinline. Note also that a code snippet or function can behave differently if the types of its variables differ.
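As a sketch of what that barrier would look like (the body here is a placeholder, not the actual foo from this thread):

# @noinline keeps foo a separate compilation unit, so the code at the
# call site (bar1 or bar2) cannot influence how this body is optimized.
@noinline function foo(du, k1, dt)
    for i in eachindex(du)
        k1[i] = du[i] * dt
    end
    return nothing
end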


For both bar1 and bar2 the outer function is not inlined.

Hm, @noinline function foo() gives the same performance as @inline function foo() and plain function foo().

So, then, are the arguments to foo(), and any global variables used, of the same type and size in both cases?

A small-enough reproducible example would be nice.

I have one (smaller) struct, struct1, and a superset of it, struct2, which stores some additional quantities.

foo() receives instances of struct1 and struct2, respectively, but all operations within foo() are performed on the fields that both structs share.
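A quick, hedged diagnostic at this point, assuming s1 and s2 are instances of struct1 and struct2 (hypothetical names): check whether foo is inferred equally well for both argument types, since abstract (red) entries in one of the outputs would explain diverging machine code.

using InteractiveUtils  # provides @code_warntype outside the REPL

@code_warntype foo(s1)  # s1::struct1, hypothetical instance
@code_warntype foo(s2)  # s2::struct2, hypothetical instance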

As long as you're not giving us a reproducer, this is just guesswork. However:

It’s expected that optimization would reorder code, that’s one of the things that optimizers do to make our code faster.

So, you’re asking “how to encourage optimizations”, but if you want reliable timers what you actually want to ask is “how to inhibit optimizations”.

With two different structs a lot of things can happen. Are they both mutable or both immutable? Are the fields you use identically declared in the structs, and concrete? A factor of 3 is quite a lot. It could be because with one of the structs it’s possible to vectorize the computation, or stack allocate, or a field you use is abstract in one of the structs, leading to bad performance. It’s hard to speculate about without an example.
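A few of these questions can be checked mechanically; a small sketch, again assuming s1 and s2 are instances of the two structs:

T1, T2 = typeof(s1), typeof(s2)          # concrete types of the instances

ismutabletype(T1) == ismutabletype(T2)   # both mutable or both immutable?
all(isconcretetype, fieldtypes(T1))      # every field concretely typed?
all(isconcretetype, fieldtypes(T2))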


Alright, although I am not sure whether this is helpful or not, my two structs are

# This struct is needed to fake 
# https://github.com/SciML/OrdinaryDiffEq.jl/blob/0c2048a502101647ac35faabd80da8a5645beac7/src/integrators/type.jl#L77
# This implements the interface components described at
# https://diffeq.sciml.ai/v6.8/basics/integrator/#Handing-Integrators-1
# which are used in Trixi.
mutable struct Integrator{RealT <: Real, uType, Params, Sol, F, Alg,
                          IntegratorOptions}
    u::uType
    du::uType
    u_tmp::uType
    t::RealT
    dt::RealT # current time step
    dtcache::RealT # Used for euler-acoustic coupling
    iter::Int # current number of time steps (iteration)
    p::Params # will be the semidiscretization from Trixi
    sol::Sol # faked
    f::F
    alg::Alg # This is our own class written above; Abbreviation for ALGorithm
    opts::IntegratorOptions
    finalstep::Bool # added for convenience
    # stages:
    k1::uType
    k_higher::uType
    k_S1::uType
    t_stage::RealT
end

and

# This struct is needed to fake 
# https://github.com/SciML/OrdinaryDiffEq.jl/blob/0c2048a502101647ac35faabd80da8a5645beac7/src/integrators/type.jl#L77
# This implements the interface components described at
# https://diffeq.sciml.ai/v6.8/basics/integrator/#Handing-Integrators-1
# which are used in Trixi.
mutable struct Multi_Integrator{RealT <: Real, uType, Params, Sol, F, Alg,
                                IntegratorOptions}
    u::uType
    du::uType
    u_tmp::uType
    t::RealT
    dt::RealT # current time step
    dtcache::RealT # Used for euler-acoustic coupling
    iter::Int # current number of time steps (iteration)
    p::Params # will be the semidiscretization from Trixi
    sol::Sol # faked
    f::F
    alg::Alg # This is our own class written above; Abbreviation for ALGorithm
    opts::IntegratorOptions
    finalstep::Bool # added for convenience
    # stages:
    k1::uType
    k_higher::uType
    k_S1::uType
    
    # Variables managing level-depending integration
    level_info_elements::Vector{Vector{Int64}}
    level_info_elements_acc::Vector{Vector{Int64}}
    level_info_interfaces_acc::Vector{Vector{Int64}}
    level_info_boundaries_acc::Vector{Vector{Int64}}
    level_info_boundaries_orientation_acc::Vector{Vector{Vector{Int64}}}
    level_info_mortars_acc::Vector{Vector{Int64}}
    level_u_indices_elements::Vector{Vector{Int64}}
    
    t_stage::RealT
    coarsest_lvl::Int64
    n_levels::Int64
end

where, out of the type parameters RealT <: Real, uType, Params, Sol, F, Alg, IntegratorOptions, only Alg (the type of the field alg) differs between the two structs.

The function corresponding to foo() is given by

function k1k2!(integrator, p, c)
    # Evaluate the right-hand side f into du
    integrator.f(integrator.du, integrator.u, p, integrator.t)

    # First stage: k1 = du * dt
    @threaded for i in eachindex(integrator.du)
        integrator.k1[i] = integrator.du[i] * integrator.dt
    end

    # Intermediate state for the second stage: u_tmp = u + c[2] * k1
    integrator.t_stage = integrator.t + c[2] * integrator.dt
    @threaded for i in eachindex(integrator.u)
        integrator.u_tmp[i] = integrator.u[i] + c[2] * integrator.k1[i]
    end
end

where @threaded is a macro essentially equivalent to @batch from Polyester.jl.
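One pattern that is sometimes worth trying with code like this (a hedged sketch, not a confirmed fix for this thread): hoist the fields of the mutable integrator into locals before the loops. Loads from a mutable struct generally cannot be cached across the f call or across the threaded loops, and locals are also friendlier to the closures that @threaded/@batch create:

function k1k2!(integrator, p, c)
    # Read the fields of the mutable struct into locals once, up front
    u, du, u_tmp = integrator.u, integrator.du, integrator.u_tmp
    k1, dt, t = integrator.k1, integrator.dt, integrator.t

    integrator.f(du, u, p, t)

    @threaded for i in eachindex(du)
        k1[i] = du[i] * dt
    end

    integrator.t_stage = t + c[2] * dt
    @threaded for i in eachindex(u)
        u_tmp[i] = u[i] + c[2] * k1[i]
    end
end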