Inlining behavior: constant prop. fails, I need help in understanding "why"


#1

MWE:

using StaticArrays, BenchmarkTools
struct DelayEmbedding{D}
    delays::SVector{D, Int}
end

@inline DelayEmbedding(D::Int) = DelayEmbedding(Val{D}())
@inline function DelayEmbedding(::Val{D}) where {D}
    idxs = [k for k in 1:D]
    return DelayEmbedding{D}(SVector{D, Int}(idxs...))
end

@generated function (r::DelayEmbedding{D})(s::AbstractArray{T}, i) where {D, T}
    gens = [:(s[i + r.delays[$k]]) for k=1:D]
    quote
        @inbounds return SVector{$D+1,T}(s[i], $(gens...))
    end
end

This works and does not allocate:

x = rand(10000);
e = DelayEmbedding(1)
@btime $e($x, 4);
  # 2.052 ns (0 allocations: 0 bytes)

Buuut, when I create a minimal wrapper function that instantiates DelayEmbedding, I get allocations:

@inline function reconstruct(s::AbstractVector{T}, D::Int) where {T}
  de::DelayEmbedding{D} = DelayEmbedding(Val{D}())

  c = 0.0
  for i in 1:100
    data = de(s, i)
    c += data[1]
  end

  return c
end

I now benchmark this reconstruct function:

D = 1
@btime reconstruct($x, $D);
  # 8.348 μs (304 allocations: 6.39 KiB)

@btime reconstruct($x, 1);
  # 736.273 ns (2 allocations: 112 bytes)

It is “weird” that if D is a literal the performance difference is massive…


To my understanding, it is possible to write such code as above in Julia 1.0 because of constant propagation. On the other hand it is obvious that I have not understood how it works, since my code fails.

Is there any way to make my code be as fast as in the second case, even when giving in a variable instead of literal?


#2

Your code is not copy-pasteable (first two blocks: using StaticArrays, BenchmarkTools, third block: globals tau and D).


#3

Thanks, I’ve edited the code. I have also removed everything irrelevant, like the parameter τ. What is left is the least amount of code I need to replicate my problem.


#4

This is how I usually benchmark constant propagation:

julia> function f(x)
           D = 1
           reconstruct(x, D)
       end
f (generic function with 1 method)

julia> @btime f($x)
  871.847 ns (2 allocations: 112 bytes)
50.48725909006644

#5

The global variable is not a constant unless making it a constant,

const DD = 1;
@btime reconstruct($x, DD);

works for me.


#6

@kristoffer.carlsson so if I understood correctly, constant propagation does not work the way I thought. I thought it was about some values that are constant between functions and/or local scopes and are understood as “some kind of literals”. The value D in my case is e.g. coming from a for loop,

for D in [1, 2]
   @btime reconstruct(x, D)
end

Benchmarking the above I still don’t get the “good” performance, I only get the “bad” one.

So, to conclude, it is not possible to get the “good” performance if D is not a literal? In your above example D = 1 inside a function is equivalent with it being a literal.

@Non-Contradiction I don’t really know what you are talking about…

julia> const f = 2
2

julia> @btime reconstruct($x, $f);
  9.813 μs (305 allocations: 6.47 KiB)

gives the bad performance, not the good one.


#7

Please pay attention, that in my code, @btime reconstruct($x, DD); is used, not @btime reconstruct($x, $DD);, and @btime reconstruct($x, DD); gives good performance, because DD is a constant already, so no need to use $DD, just as there is no need of $ before function names. But I don’t know what happens with $ before DD, it gives bad performance for me too.


#8

Sorry, I overlooked this! Thanks for clarifying and illustrating! :slight_smile:


#9

Well, const-prop should not be relied upon anyway so the correct spelling of your construction could be

julia> @noinline function reconstruct_inner(s::AbstractVector{T}, de) where {T}
         c = 0.0
         for i in 1:100
           data = de(s, i)
           c += data[1]
         end

         return c
         end
julia> @noinline reconstruct_outer(s,D)=reconstruct_inner(s, DelayEmbedding(D))

julia> x = rand(10000); D=2;
julia> @btime reconstruct_outer($x,$D);
  6.846 μs (5 allocations: 208 bytes)

julia> @btime reconstruct($x,$D);
  14.249 μs (305 allocations: 6.47 KiB)

julia> de = DelayEmbedding(D);
julia> @btime reconstruct_outer($x,$D);
  5.661 μs (5 allocations: 208 bytes)


julia> de = DelayEmbedding(D);

julia> @btime reconstruct_inner($x,$de);
  112.078 ns (0 allocations: 0 bytes)

julia> @btime reconstruct($x,$D);
  13.708 μs (305 allocations: 6.47 KiB)

julia> @btime reconstruct($x,$2);
  13.818 μs (305 allocations: 6.47 KiB)

julia> @btime reconstruct($x,2);
  835.253 ns (3 allocations: 160 bytes)

In other words: You problem is that you switch between value-domain and type-domain, which const-prop can sometimes eliminate, but sometimes not. Just construct your delay-embedding in some outer function, make a `@noinline boundary to an inner function where the delay embedding is fixed and its type is part of the signature, and be happy.

If you can get away with this, consider

julia> struct DelayEmbedding2{D} end
julia> de2=DelayEmbedding2{D}();

julia> @generated function (r::DelayEmbedding2{D})(s::AbstractArray{T}, i) where {D, T}
           gens = [:(s[i + $k]) for k=1:D]
           quote
               @inbounds return SVector{$D+1,T}(s[i], $(gens...))
           end
           end

#10

Thanks for the detailed reply foobar. Independent of your reply I have also come to the same conclusion (that I have to split my top level function), although I didn’t have time to answer before!

Now it is almost as fast as the “fastest” version and this is truly great!