Inlining behavior: constant prop. fails, I need help in understanding "why"

Datseris · November 4, 2018, 9:07am

MWE:

using StaticArrays, BenchmarkTools
struct DelayEmbedding{D}
    delays::SVector{D, Int}
end

@inline DelayEmbedding(D::Int) = DelayEmbedding(Val{D}())
@inline function DelayEmbedding(::Val{D}) where {D}
    idxs = [k for k in 1:D]
    return DelayEmbedding{D}(SVector{D, Int}(idxs...))
end

@generated function (r::DelayEmbedding{D})(s::AbstractArray{T}, i) where {D, T}
    gens = [:(s[i + r.delays[$k]]) for k=1:D]
    quote
        @inbounds return SVector{$D+1,T}(s[i], $(gens...))
    end
end

This works and does not allocate:

x = rand(10000);
e = DelayEmbedding(1)
@btime $e($x, 4);
  # 2.052 ns (0 allocations: 0 bytes)

Buuut, when I create a minimal wrapper function that instantiates DelayEmbedding, I get allocations:

@inline function reconstruct(s::AbstractVector{T}, D::Int) where {T}
  de::DelayEmbedding{D} = DelayEmbedding(Val{D}())

  c = 0.0
  for i in 1:100
    data = de(s, i)
    c += data[1]
  end

  return c
end

I now benchmark this reconstruct function:

D = 1
@btime reconstruct($x, $D);
  # 8.348 μs (304 allocations: 6.39 KiB)

@btime reconstruct($x, 1);
  # 736.273 ns (2 allocations: 112 bytes)

It is “weird” that if D is a literal the performance difference is massive…

To my understanding, it is possible to write such code as above in Julia 1.0 because of constant propagation. On the other hand it is obvious that I have not understood how it works, since my code fails.

Is there any way to make my code be as fast as in the second case, even when giving in a variable instead of literal?

foobar_lv2 · November 4, 2018, 9:57am

Your code is not copy-pasteable (first two blocks: using StaticArrays, BenchmarkTools, third block: globals tau and D).

Datseris · November 4, 2018, 12:37pm

Thanks, I’ve edited the code. I have also removed everything irrelevant, like the parameter τ. What is left is the least amount of code I need to replicate my problem.

kristoffer.carlsson · November 4, 2018, 3:21pm

This is how I usually benchmark constant propagation:

julia> function f(x)
           D = 1
           reconstruct(x, D)
       end
f (generic function with 1 method)

julia> @btime f($x)
  871.847 ns (2 allocations: 112 bytes)
50.48725909006644

Non-Contradiction · November 4, 2018, 3:39pm

The global variable is not a constant unless making it a constant,

const DD = 1;
@btime reconstruct($x, DD);

works for me.

Datseris · November 7, 2018, 11:09am

@kristoffer.carlsson so if I understood correctly, constant propagation does not work the way I thought. I thought it was about some values that are constant between functions and/or local scopes and are understood as “some kind of literals”. The value D in my case is e.g. coming from a for loop,

for D in [1, 2]
   @btime reconstruct(x, D)
end

Benchmarking the above I still don’t get the “good” performance, I only get the “bad” one.

So, to conclude, it is not possible to get the “good” performance if D is not a literal? In your above example D = 1 inside a function is equivalent with it being a literal.

@Non-Contradiction I don’t really know what you are talking about…

julia> const f = 2
2

julia> @btime reconstruct($x, $f);
  9.813 μs (305 allocations: 6.47 KiB)

gives the bad performance, not the good one.

Non-Contradiction · November 7, 2018, 1:58pm

Please pay attention, that in my code, @btime reconstruct($x, DD); is used, not @btime reconstruct($x, $DD);, and @btime reconstruct($x, DD); gives good performance, because DD is a constant already, so no need to use $DD, just as there is no need of $ before function names. But I don’t know what happens with $ before DD, it gives bad performance for me too.

Datseris · November 7, 2018, 2:41pm

Sorry, I overlooked this! Thanks for clarifying and illustrating!

foobar_lv2 · November 7, 2018, 4:35pm

Well, const-prop should not be relied upon anyway so the correct spelling of your construction could be

julia> @noinline function reconstruct_inner(s::AbstractVector{T}, de) where {T}
         c = 0.0
         for i in 1:100
           data = de(s, i)
           c += data[1]
         end

         return c
         end
julia> @noinline reconstruct_outer(s,D)=reconstruct_inner(s, DelayEmbedding(D))

julia> x = rand(10000); D=2;
julia> @btime reconstruct_outer($x,$D);
  6.846 μs (5 allocations: 208 bytes)

julia> @btime reconstruct($x,$D);
  14.249 μs (305 allocations: 6.47 KiB)

julia> de = DelayEmbedding(D);
julia> @btime reconstruct_outer($x,$D);
  5.661 μs (5 allocations: 208 bytes)


julia> de = DelayEmbedding(D);

julia> @btime reconstruct_inner($x,$de);
  112.078 ns (0 allocations: 0 bytes)

julia> @btime reconstruct($x,$D);
  13.708 μs (305 allocations: 6.47 KiB)

julia> @btime reconstruct($x,$2);
  13.818 μs (305 allocations: 6.47 KiB)

julia> @btime reconstruct($x,2);
  835.253 ns (3 allocations: 160 bytes)

In other words: You problem is that you switch between value-domain and type-domain, which const-prop can sometimes eliminate, but sometimes not. Just construct your delay-embedding in some outer function, make a `@noinline boundary to an inner function where the delay embedding is fixed and its type is part of the signature, and be happy.

If you can get away with this, consider

julia> struct DelayEmbedding2{D} end
julia> de2=DelayEmbedding2{D}();

julia> @generated function (r::DelayEmbedding2{D})(s::AbstractArray{T}, i) where {D, T}
           gens = [:(s[i + $k]) for k=1:D]
           quote
               @inbounds return SVector{$D+1,T}(s[i], $(gens...))
           end
           end

Datseris · November 8, 2018, 4:20pm

Thanks for the detailed reply foobar. Independent of your reply I have also come to the same conclusion (that I have to split my top level function), although I didn’t have time to answer before!

Now it is almost as fast as the “fastest” version and this is truly great!

Topic		Replies	Views
Understanding allocs/constant propagation issue? General Usage	7	270	January 4, 2023
Detect if constant General Usage constant-propagation	49	1711	June 16, 2023
Julia v1.8 -> v1.9 makes this code 1000x slower Performance	6	576	June 6, 2023
Benchmarking with interpolated variable faster than hardcoded literal value Performance	1	557	September 18, 2019
Optimising function with const instead of struct? Performance	9	747	July 7, 2018

Inlining behavior: constant prop. fails, I need help in understanding "why"

Related topics