Wow, I am amazed. Does anyone know what improvement was responsible for that?
An example that demonstrates that small-union optimization is cool, but still expensive:
julia> @noinline function mymax(init, x)
           s = init
           for i in 1:length(x)
               xi = x[i]
               s = s < xi ? xi : s
           end
           return s
       end
julia> using BenchmarkTools
julia> r = rand(Int, 1000);
julia> @btime mymax(-Inf32, r);
9.255 μs (1 allocation: 16 bytes)
julia> @btime mymax(0x00, r);
12.723 μs (1 allocation: 16 bytes)
julia> @btime mymax(-Inf, r);
2.731 μs (1 allocation: 16 bytes)
julia> @btime mymax(typemin(Int), r);
910.500 ns (1 allocation: 16 bytes)
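The three slower cases are exactly the ones where the accumulator s is inferred as a small Union inside the loop; only the typemin(Int) call is fully type-stable. This is easy to check yourself (output omitted here, since the exact printout depends on the Julia version):
julia> @code_warntype mymax(-Inf32, r)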
In theory, inference could figure out that the type of s in the loop is a state machine with the transitions typeof(init) -> typeof(init), typeof(init) -> eltype(x) and eltype(x) -> eltype(x), but without the transition eltype(x) -> typeof(init). Hence, the loop could be split into two loops, where the second is type-stable and the first has better type stability (the comparison s < xi has known types, but the result of the conditional expression is unstable, and a branch on its type is needed to decide whether to jump into the second loop).
That optimization appears to be missing or unreliable.
PS: The optimized version (that indeed avoids most of the penalty) would be
julia> @noinline function mymax_loopsplit(init, x)
           s = init
           i = 1
           # first loop: runs as long as s still has the type of init
           while i <= length(x) && s isa typeof(init)
               xi = x[i]
               s = s < xi ? xi : s
               i += 1
           end
           # second loop: here s isa eltype(x), so the body is type-stable
           while i <= length(x) && s isa eltype(x)
               xi = x[i]
               s = s < xi ? xi : s
               i += 1
           end
           return s
       end
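A quick sanity check that the split version computes the same result, and how one would time it (timing numbers omitted, since they depend on the machine and Julia version):
julia> mymax_loopsplit(-Inf32, r) == mymax(-Inf32, r)
true
julia> @btime mymax_loopsplit(-Inf32, r);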
It would be cool if the compiler could learn that (e.g., during inference, record the possible transitions of types, where initially everything is Union{}; then collapse strongly connected components, such that we get a nice DAG; now we can improve type stability by duplicating code and adding the requisite control flow. I’m not knowledgeable enough about compilers to have an opinion on how hard this would be in Julia, how expensive it would be, or what this optimization is called in the literature).
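To make the state-machine claim concrete, here is a small runtime illustration (this is not compiler code, and record_transitions is just a hypothetical helper written for this post): it records the type transitions that the accumulator actually goes through, and for the examples above the recorded set never contains the transition eltype(x) -> typeof(init), which is exactly what would justify the loop split.
julia> function record_transitions(init, x)
           transitions = Set{Pair{DataType,DataType}}()
           s = init
           for xi in x
               told = typeof(s)              # type of s before this step
               s = s < xi ? xi : s
               push!(transitions, told => typeof(s))
           end
           return transitions
       end
julia> record_transitions(0x00, r)   # never contains Int64 => UInt8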