How to encourage compiler optimizations?

I have two functions that contain an identical block of code foo followed by different blocks of code bar1 and bar2. I time these functions using our timing tool, which is essentially a derivative of _timer_expr from TimerOutputs.jl.

What I observe is that, depending on whether foo is followed by bar1 or bar2, the timings of

@trixi_timeit timer() "foo" begin
    foo
end
<bar1, bar2>

vary by a factor of three at the default optimization level 2, i.e., in one case the unchanged block of code foo is turned into machine code that is three times slower.

If, however, I run the code with --optimize=1, the timer outputs are equal (up to system noise).

How can I achieve the same optimization for both cases? Should I split foo into a separate function?

Compilation is done for functions, not individual code blocks. So if there is some kind of dependency between your foo and bar1 or bar2, the optimizations can differ. It's hard to say what exactly happens here without further details.
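As a minimal, hypothetical sketch of the effect (foo_block, wrapper1, and wrapper2 are made-up stand-ins, not code from this thread): since Julia optimizes whole method bodies, the instructions emitted for the shared part need not be identical in the two wrappers.

using InteractiveUtils  # for @code_llvm outside the REPL

# Hypothetical stand-ins for the shared block and the two tails:
foo_block(x) = sum(abs2, x)              # plays the role of foo

function wrapper1(x)
    s = foo_block(x)                     # may be inlined and co-optimized with the code below
    return s + minimum(x)                # plays the role of bar1
end

function wrapper2(x)
    s = foo_block(x)                     # may be optimized differently next to bar2
    return s * maximum(x)                # plays the role of bar2
end

# Compare the generated code; the part coming from foo_block
# can differ between the two wrappers:
@code_llvm wrapper1(rand(100))
@code_llvm wrapper2(rand(100))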


That is good to know!

Unfortunately, extracting foo into a separate function

function foo()
    # body of the shared block foo
end

@trixi_timeit timer() "foo" begin
    foo()
end
<bar1, bar2>

still gives the same results in terms of performance, i.e., the timing discrepancy persists.

Is it possible that in one case it is inlined and in the other case not? Did you try the @inline macro?

Alternatively, try @noinline. Note also that a code snippet or function can behave differently if the types of its variables differ.
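As a sketch of what that barrier would look like (the body here is a placeholder, not the actual foo from this thread):

# @noinline keeps foo a separate compilation unit, so the code at the
# call site (bar1 or bar2) cannot influence how this body is optimized.
@noinline function foo(du, k1, dt)
    for i in eachindex(du)
        k1[i] = du[i] * dt
    end
    return nothing
end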


For both bar1 and bar2 the outer function is not inlined.

Hm, @noinline function foo() gives the same performance as @inline function foo() and plain function foo().

So, then, are the arguments to foo(), and any global variables used, of the same type and size in both cases?

A small-enough reproducible example would be nice.

I have one (smaller) struct, struct1, and a superset of it, struct2, which stores some additional quantities.

foo() receives instances of struct1 and struct2, respectively, but all operations within foo() are performed on the fields that both structs share.
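A quick, hedged diagnostic at this point, assuming s1 and s2 are instances of struct1 and struct2 (hypothetical names): check whether foo is inferred equally well for both argument types, since abstract (red) entries in one of the outputs would explain diverging machine code.

using InteractiveUtils  # provides @code_warntype outside the REPL

@code_warntype foo(s1)  # s1::struct1, hypothetical instance
@code_warntype foo(s2)  # s2::struct2, hypothetical instance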

As long as you're not giving us a reproducer, this is just guesswork. However:

It’s expected that optimization would reorder code, that’s one of the things that optimizers do to make our code faster.

So, you’re asking “how to encourage optimizations”, but if you want reliable timers what you actually want to ask is “how to inhibit optimizations”.

With two different structs a lot of things can happen. Are they both mutable or both immutable? Are the fields you use identically declared in the structs, and concrete? A factor of 3 is quite a lot. It could be because with one of the structs it’s possible to vectorize the computation, or stack allocate, or a field you use is abstract in one of the structs, leading to bad performance. It’s hard to speculate about without an example.
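A few of these questions can be checked mechanically; a small sketch, again assuming s1 and s2 are instances of the two structs:

T1, T2 = typeof(s1), typeof(s2)          # concrete types of the instances

ismutabletype(T1) == ismutabletype(T2)   # both mutable or both immutable?
all(isconcretetype, fieldtypes(T1))      # every field concretely typed?
all(isconcretetype, fieldtypes(T2))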


Alright, although I am not sure whether this is helpful or not, my two structs are

# This struct is needed to fake 
# https://github.com/SciML/OrdinaryDiffEq.jl/blob/0c2048a502101647ac35faabd80da8a5645beac7/src/integrators/type.jl#L77
# This implements the interface components described at
# https://diffeq.sciml.ai/v6.8/basics/integrator/#Handing-Integrators-1
# which are used in Trixi.
mutable struct Integrator{RealT <: Real, uType, Params, Sol, F, Alg,
                          IntegratorOptions}
    u::uType
    du::uType
    u_tmp::uType
    t::RealT
    dt::RealT # current time step
    dtcache::RealT # Used for euler-acoustic coupling
    iter::Int # current number of time steps (iteration)
    p::Params # will be the semidiscretization from Trixi
    sol::Sol # faked
    f::F
    alg::Alg # This is our own class written above; Abbreviation for ALGorithm
    opts::IntegratorOptions
    finalstep::Bool # added for convenience
    # stages:
    k1::uType
    k_higher::uType
    k_S1::uType
    t_stage::RealT
end

and

# This struct is needed to fake 
# https://github.com/SciML/OrdinaryDiffEq.jl/blob/0c2048a502101647ac35faabd80da8a5645beac7/src/integrators/type.jl#L77
# This implements the interface components described at
# https://diffeq.sciml.ai/v6.8/basics/integrator/#Handing-Integrators-1
# which are used in Trixi.
mutable struct Multi_Integrator{RealT <: Real, uType, Params, Sol, F, Alg,
                                IntegratorOptions}
    u::uType
    du::uType
    u_tmp::uType
    t::RealT
    dt::RealT # current time step
    dtcache::RealT # Used for euler-acoustic coupling
    iter::Int # current number of time steps (iteration)
    p::Params # will be the semidiscretization from Trixi
    sol::Sol # faked
    f::F
    alg::Alg # This is our own class written above; Abbreviation for ALGorithm
    opts::IntegratorOptions
    finalstep::Bool # added for convenience
    # stages:
    k1::uType
    k_higher::uType
    k_S1::uType
    
    # Variables managing level-depending integration
    level_info_elements::Vector{Vector{Int64}}
    level_info_elements_acc::Vector{Vector{Int64}}
    level_info_interfaces_acc::Vector{Vector{Int64}}
    level_info_boundaries_acc::Vector{Vector{Int64}}
    level_info_boundaries_orientation_acc::Vector{Vector{Vector{Int64}}}
    level_info_mortars_acc::Vector{Vector{Int64}}
    level_u_indices_elements::Vector{Vector{Int64}}
    
    t_stage::RealT
    coarsest_lvl::Int64
    n_levels::Int64
end

where, out of the type parameters RealT <: Real, uType, Params, Sol, F, Alg, IntegratorOptions, only Alg (the type of the field alg) differs between the two structs.

The function corresponding to foo() is given by

function k1k2!(integrator, p, c)
    # Evaluate the right-hand side f into du
    integrator.f(integrator.du, integrator.u, p, integrator.t)

    # First stage: k1 = du * dt
    @threaded for i in eachindex(integrator.du)
        integrator.k1[i] = integrator.du[i] * integrator.dt
    end

    # Intermediate state for the second stage: u_tmp = u + c[2] * k1
    integrator.t_stage = integrator.t + c[2] * integrator.dt
    @threaded for i in eachindex(integrator.u)
        integrator.u_tmp[i] = integrator.u[i] + c[2] * integrator.k1[i]
    end
end

where @threaded is a macro essentially equivalent to @batch from Polyester.jl.
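One pattern that is sometimes worth trying with code like this (a hedged sketch, not a confirmed fix for this thread): hoist the fields of the mutable integrator into locals before the loops. Loads from a mutable struct generally cannot be cached across the f call or across the threaded loops, and locals are also friendlier to the closures that @threaded/@batch create:

function k1k2!(integrator, p, c)
    # Read the fields of the mutable struct into locals once, up front
    u, du, u_tmp = integrator.u, integrator.du, integrator.u_tmp
    k1, dt, t = integrator.k1, integrator.dt, integrator.t

    integrator.f(du, u, p, t)

    @threaded for i in eachindex(du)
        k1[i] = du[i] * dt
    end

    integrator.t_stage = t + c[2] * dt
    @threaded for i in eachindex(u)
        u_tmp[i] = u[i] + c[2] * k1[i]
    end
end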