Performance penalty of `>(1)` vs `x -> x > 1`?

Why is there a performance penalty when using `>(1)` instead of `x -> x > 1` in these examples? (It happens on both 1.8.5 and 1.9-rc3.)

julia> using BenchmarkTools

julia> mat = randn(1000, 1000);

julia> @btime sum(>(1), $mat)
  557.740 μs (0 allocations: 0 bytes)
158600

julia> @btime sum(x -> x > 1, $mat)
  303.875 μs (0 allocations: 0 bytes)
158600

julia> @btime count(>(1), $mat)
  289.280 μs (0 allocations: 0 bytes)
158600

julia> @btime count(x -> x > 1, $mat)
  247.086 μs (0 allocations: 0 bytes)
158600

I can’t answer the question, but if you use `count` instead of `sum`, both versions are fast, maybe even a touch faster (on 1.9-rc3).

Yes, `count` is faster. Initially I saw the performance difference with it as well, but when I tried again it was not consistent. The difference in `sum`, however, is large. The LLVM code is:

julia> @code_llvm sum(>(1), mat)
;  @ reducedim.jl:995 within `sum`
define i64 @julia_sum_911({ i64 }* nocapture noundef nonnull readonly align 8 dereferenceable(8) %0, {}* noundef nonnull align 16 dereferenceable(40) %1) #0 {
top:
; ┌ @ reducedim.jl:995 within `#sum#808`
; │┌ @ reducedim.jl:999 within `_sum`
; ││┌ @ reducedim.jl:999 within `#_sum#810`
; │││┌ @ reducedim.jl:357 within `mapreduce`
; ││││┌ @ reducedim.jl:357 within `#mapreduce#800`
; │││││┌ @ reducedim.jl:365 within `_mapreduce_dim`
        %2 = call i64 @j__mapreduce_913({ i64 }* nocapture nonnull readonly %0, {}* nonnull %1) #0
; └└└└└└
  ret i64 %2
}

julia> @code_llvm sum(x -> x > 1, mat)
;  @ reducedim.jl:995 within `sum`
define i64 @julia_sum_925({}* noundef nonnull align 16 dereferenceable(40) %0) #0 {
top:
; ┌ @ reducedim.jl:995 within `#sum#808`
; │┌ @ reducedim.jl:999 within `_sum`
; ││┌ @ reducedim.jl:999 within `#_sum#810`
; │││┌ @ reducedim.jl:357 within `mapreduce`
; ││││┌ @ reducedim.jl:357 within `#mapreduce#800`
; │││││┌ @ reducedim.jl:365 within `_mapreduce_dim`
        %1 = call i64 @j__mapreduce_927({}* nonnull %0) #0
; └└└└└└
  ret i64 %1
}

I get an even more striking difference between the first two:

julia> @btime sum(>(1), $mat)
  250.915 μs (0 allocations: 0 bytes)
159024

julia> @btime sum(x -> x > 1, $mat)
  76.234 μs (0 allocations: 0 bytes)
159024

and the difference persists if I place the expressions inside one additional layer of functions:

julia> f1(mat) = sum(>(1), mat)
f1 (generic function with 1 method)

julia> f2(mat) = sum(x -> x > 1, mat)
f2 (generic function with 1 method)

julia> @btime f1($mat)
  238.681 μs (0 allocations: 0 bytes)
159024

julia> @btime f2($mat)
  77.437 μs (0 allocations: 0 bytes)
159024

The problem appears to be related to the `Int` in `>(1)`, because with a float literal the first version is as fast as the second:

julia> @btime sum(>(1.0), $mat)
  77.888 μs (0 allocations: 0 bytes)
159024

OK, that’s enough to open an issue: performance penalty with >(Int(1)) in sum(>(1), array) · Issue #49696 · JuliaLang/julia · GitHub


Here’s a guess: the anonymous-function form can be inlined, so the `1` can be constant-propagated and converted to `1.0` by the compiler in order to compare it against the float array. The `>(1)` form goes through an intermediate struct; the captured `1` is not inlined, so the conversion to float happens on every comparison.
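For reference, `>(1)` is just sugar for a `Base.Fix2` callable struct that stores the `1` as a field; a small sketch of what the "intermediate struct" in the guess above looks like:

```julia
# >(1) builds a Base.Fix2 closure; the 1 is stored as a struct field,
# not baked into the compiled code as a constant.
f = >(1)
println(typeof(f))   # Base.Fix2{typeof(>), Int64}
println(f.x)         # 1 -- the captured Int64
println(f(2.5))      # evaluates 2.5 > 1, promoting the captured Int
```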

Does the difference disappear if you specify aggressive constprop?


I’ve still never seen a case in my life where `Base.@constprop :aggressive` had any effect. I’m convinced it was just put there to make us feel better.

julia> Base.@constprop :aggressive f1(mat) = sum(>(1), mat)
f1 (generic function with 1 method)

julia> @btime f1($mat)
  248.080 μs (0 allocations: 0 bytes)
158868

julia> @btime f2($mat)
  74.730 μs (0 allocations: 0 bytes)
158868
julia> function f(y, x::Val{T}) where T
           y > T
       end
f (generic function with 1 method)

julia> f(x) = Base.Fix2(f, Val(x))
f (generic function with 2 methods)

julia> mat = rand(1000, 1000);

julia> @btime sum(>(1), $mat)
  530.568 μs (0 allocations: 0 bytes)
0

julia> @btime sum(x-> x > 1, $mat)
  375.517 μs (0 allocations: 0 bytes)
0

julia> @btime sum(f(1), $mat)
  382.111 μs (0 allocations: 0 bytes)
0

julia> f(x) = Base.Fix2(f, Val(x))

julia> @btime sum(>(1), $mat)
  799.135 μs (0 allocations: 0 bytes)
julia> @btime sum(x-> x > 1, $mat)
  500.949 μs (0 allocations: 0 bytes)
julia> @btime sum(f(1), $mat)
  529.702 μs (0 allocations: 0 bytes)

julia> Base.isless(::Val{N},v::T) where {N,T} = isless(N,v)

julia> @btime sum(>(Val(1)), $mat);
  553.309 μs (0 allocations: 0 bytes)

Maybe,

Base.isless(::Val{N},v::T) where {N,T} = isless(N,v)

is a good definition.


Any approach based on `Val` seems like an awful kludge. It addresses this particular case, but (in my opinion) creates a bigger mess than it solves. We should not plan to write (and maintain!) `Val` versions of the hundreds of functions someone might want to curry, nor suffer a performance trap when someone more creative than us curries something we didn’t `Val`.

Further, `[Base.Fix1(+, Val(x)) for x in 1:3]` is a heterogeneous container that will incur dynamic dispatch to multiple independently compiled functions. The whole point of `Base.Fix1`/`Base.Fix2` is to make reproducible closures so we need fewer redundant anonymous functions (which the compiler must compile and maintain).

One should prefer `[Base.Fix1(+, x) for x in 1:3]` (the promotion problem showcased in this thread still needs to be fixed, but at least this container is homogeneous) or `[y -> x + y for x in 1:3]` (which is homogeneous and appears to be resistant to the promotion issue).
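The container eltypes for the two styles can be checked directly; a sketch (using `Fix1` as in the post above):

```julia
# Val-wrapped captures: each Val(x) has its own type, so the
# container's eltype is abstract and each call dynamically dispatches.
hetero = [Base.Fix1(+, Val(x)) for x in 1:3]
# Plain captures: one concrete closure type shared by all elements.
homo   = [Base.Fix1(+, x) for x in 1:3]

println(isconcretetype(eltype(hetero)))  # false
println(isconcretetype(eltype(homo)))    # true
```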


Is there a reason you’re suggesting `Fix1` instead of the `Fix2` that the OP’s code makes? I don’t know of a difference between them besides argument order.

Also, I don’t think there’s a way to avoid doing the promotion on every call while keeping `[>(y) for y in 1:3]` homogeneous. For all instances to share a type and its underlying method, the fixed value can’t be a constant, so the promotion has to wait for a runtime value.

From the perspective of the OP’s problem, this looks like a negative, but it’s really a tradeoff. `x -> x > 1` cannot share a type with `x -> x > 2`, so if you make and call many of these, you have to compile basically the same thing over and over. To compile only once, you could do something like `[x -> x > y for y in 1:3]`, but those need runtime promotions just like `>(1)`. I think we just have to accept that the `1` in `>(1)` is an argument, not a constant in a function body.

This case is not that bad, though: you can do the promotion in advance even if you don’t make `1` a constant, with `>(convert(eltype(mat), 1))`.
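A quick sketch of that up-front promotion: the `Fix2` then captures a `Float64`, so no per-call conversion is needed, and the result is unchanged.

```julia
mat = randn(4, 4)
g = >(convert(eltype(mat), 1))   # captures 1.0 as a Float64
println(typeof(g))               # Base.Fix2{typeof(>), Float64}
println(count(g, mat) == count(>(1), mat))  # same result either way
```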


The difference is not entirely unexpected to me. What I find curious is that `>(1)` is apparently not constant-propagated even when the whole `sum` is wrapped in another function.

The only difference between Base.Fix1 and Base.Fix2 is the argument position the captured value is placed in. Pay no heed to my choice of one or another.

In fact, it can:

julia> [x -> x > 1, x -> x > 2, x -> x > 3] # different functions with different types
3-element Vector{Function}:
 #13 (generic function with 1 method)
 #14 (generic function with 1 method)
 #15 (generic function with 1 method)

julia> [x -> x > y for y in 1:3] # same function with same type
3-element Vector{var"#20#22"{Int64}}:
 #20 (generic function with 1 method)
 #20 (generic function with 1 method)
 #20 (generic function with 1 method)

julia> [x -> x > y for y in 1:3] |> eltype |> isconcretetype
true

julia> [x -> x > y for y in Any[1,2.0,3//1]] # same function with different type
3-element Vector{var"#24#26"}:
 #24 (generic function with 1 method)
 #24 (generic function with 1 method)
 #24 (generic function with 1 method)

julia> [x -> x > y for y in Any[1,2.0,3//1]] |> eltype |> isconcretetype
false

Also note that the closure type `var"#20#22"{Int64}` has captured an `Int64`, not a `Float64`. But when compiled for a `Float64` argument, it manages to constant-propagate the `Int64` so that it can do the promotion at compile time. How it does this is a bit beyond me. It does seem that the best it could do is hoist the promotion, since the “same function with the same type” does indeed need multiple versions for different captured values. My only explanation would be that maybe anonymous functions inline better, so the hoisting can happen.

As one of the other posters points out, the issue is that the same hoisting does not occur for Base.Fix1/Base.Fix2 closures.

All that said, is there any rational reason why all these arguments stop applying if we use an `Int32`?

julia> mat = randn(1000*1000);

julia> @btime sum(>(1), $mat)
  453.168 μs (0 allocations: 0 bytes)
158368

julia> @btime sum(>(Int32(1)), $mat)
  273.104 μs (0 allocations: 0 bytes)
158368

julia> @btime sum(x -> x > Int32(1), $mat)
  267.993 μs (0 allocations: 0 bytes)
158368

I see a promotion in the `@code_llvm` output; could someone check my observation?

julia> @code_llvm Float64(1)
;  @ float.jl:146 within `Float64`
define double @julia_Float64_703(i64 signext %0) #0 {
top:
  %1 = sitofp i64 %0 to double
  ret double %1
}

julia> @code_llvm [x -> x > y for y in 1:3][1](2.5)
;  @ REPL[17]:1 within `#6`
define i8 @"julia_#6_677"([1 x i64]* nocapture nonnull readonly align 8 dereferenceable(8) %0, double %1) #0 {
top:
  %2 = getelementptr inbounds [1 x i64], [1 x i64]* %0, i64 0, i64 0
; ┌ @ operators.jl:382 within `>`
; │┌ @ float.jl:453 within `<`
; ││┌ @ float.jl:146 within `Float64`
     %3 = load i64, i64* %2, align 8
     %4 = sitofp i64 %3 to double
...

Sorry. I realized partway through that I was on the wrong track and didn’t manage to fully adjust what I was writing. What I mean is that it should be possible to do the conversion and load once, at the top of the loop, and then keep reusing the result throughout the loop. That should avoid any recurring cost of the mismatch. It’s not clear that the `Base.Fix2` version does that; I’d have to dig into the `@code_llvm` or maybe `@code_native` to be sure.

Not sure; the `@code_llvm` output does suggest that the `Fix2` instance performs a promotion that the anonymous function does not, and `@btime` reflects a significant performance difference. I’m guessing that your benchmark is additionally being affected by `sum`.

julia> @btime $(x -> x > Int32(1))(2.5)
  2.181 ns (0 allocations: 0 bytes)
true

julia> @btime $(>(Int32(1)))(2.5)
  3.245 ns (0 allocations: 0 bytes)
true

julia> @code_llvm (x -> x > Int32(1))(2.5)
;  @ REPL[39]:1 within `#17`
define i8 @"julia_#17_1151"(double %0) #0 {
top:
; ┌ @ operators.jl:382 within `>`
; │┌ @ promotion.jl:428 within `<` @ float.jl:412
    %1 = fcmp ogt double %0, 1.000000e+00
; └└
  %2 = zext i1 %1 to i8
  ret i8 %2
}

julia> @code_llvm >(Int32(1))(2.5)
;  @ operators.jl:1113 within `Fix2`
define i8 @julia_Fix2_1149({ i32 }* nocapture nonnull readonly align 4 dereferenceable(4) %0, double %1) #0 {
top:
; ┌ @ Base.jl:38 within `getproperty`
   %2 = getelementptr inbounds { i32 }, { i32 }* %0, i64 0, i32 0
; └
; ┌ @ operators.jl:382 within `>`
; │┌ @ promotion.jl:428 within `<`
; ││┌ @ promotion.jl:359 within `promote`
; │││┌ @ promotion.jl:336 within `_promote`
; ││││┌ @ number.jl:7 within `convert`
; │││││┌ @ float.jl:146 within `Float64`
        %3 = load i32, i32* %2, align 4
        %4 = sitofp i32 %3 to double
; ││└└└└
; ││ @ promotion.jl:428 within `<` @ float.jl:412
    %5 = fcmp olt double %4, %1
; └└
  %6 = zext i1 %5 to i8
  ret i8 %6
}

That doesn’t necessarily happen even if you’re looping over the same number. `2.5 .|> [x -> x > y for y in (1,1,1)]` and `2.5 .|> [>(y) for y in (1,1,1)]` both do runtime promotions. If the `1` isn’t a type parameter or a constant in a method body, it seems the type and code must accommodate other values, even if you always provide the same value at runtime.


There’s no way the system can know that a bunch of not-necessarily-identical `Base.Fix2{typeof(>), Int}` instances have the same `Int` value in the closure. My point is that if I call `sum(>(1), ::Vector{Float64})`, it should be able to access the single `Base.Fix2`’s captured `Int` once, convert it to `Float64` as it knows it must, and then loop over comparisons against that `Float64` (as opposed to promoting the `Int` to `Float64` before every comparison).
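A sketch of that hoisting, written by hand with a hypothetical `count_gt` helper (not part of Base): the captured `Int` is converted to the vector’s element type once, before the loop, so each iteration compares two values of the same type.

```julia
# Hypothetical helper: hoist the conversion of the captured value out
# of the loop, instead of promoting it on every comparison.
function count_gt(f::Base.Fix2{typeof(>)}, v::AbstractVector{T}) where {T}
    t = convert(T, f.x)   # one conversion, done up front
    s = 0
    for x in v
        s += (x > t)      # same-type comparison inside the loop
    end
    return s
end

v = randn(1000)
println(count_gt(>(1), v) == count(>(1), v))  # true
```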


Whatever is causing this slowdown, I don’t see it happen with a more naive summation:

# Julia Version 1.10.0-DEV.579
# Commit 76557783da (2023-02-11 02:16 UTC)

julia> x = randn(1000);

julia> function myintsum(f, x)
           s = 0
           for xx in x
               s += f(xx)
           end
           return s
       end
myintsum (generic function with 1 method)

julia> @btime sum(x->x>1,$x)
  67.042 ns (0 allocations: 0 bytes)
173

julia> @btime sum(>(1),$x)
  308.502 ns (0 allocations: 0 bytes)
173

julia> @btime myintsum(x->x>1,$x)
  63.878 ns (0 allocations: 0 bytes)
173

julia> @btime myintsum(>(1),$x)
  63.776 ns (0 allocations: 0 bytes)
173