Yes, count is faster. Initially I saw the performance difference with it as well, but now that I've tried again it doesn't seem consistent. The difference in sum is large, though. The LLVM code is:
julia> @code_llvm sum(>(1), mat)
; @ reducedim.jl:995 within `sum`
define i64 @julia_sum_911({ i64 }* nocapture noundef nonnull readonly align 8 dereferenceable(8) %0, {}* noundef nonnull align 16 dereferenceable(40) %1) #0 {
top:
; ┌ @ reducedim.jl:995 within `#sum#808`
; │┌ @ reducedim.jl:999 within `_sum`
; ││┌ @ reducedim.jl:999 within `#_sum#810`
; │││┌ @ reducedim.jl:357 within `mapreduce`
; ││││┌ @ reducedim.jl:357 within `#mapreduce#800`
; │││││┌ @ reducedim.jl:365 within `_mapreduce_dim`
%2 = call i64 @j__mapreduce_913({ i64 }* nocapture nonnull readonly %0, {}* nonnull %1) #0
; └└└└└└
ret i64 %2
}
julia> @code_llvm sum(x -> x > 1, mat)
; @ reducedim.jl:995 within `sum`
define i64 @julia_sum_925({}* noundef nonnull align 16 dereferenceable(40) %0) #0 {
top:
; ┌ @ reducedim.jl:995 within `#sum#808`
; │┌ @ reducedim.jl:999 within `_sum`
; ││┌ @ reducedim.jl:999 within `#_sum#810`
; │││┌ @ reducedim.jl:357 within `mapreduce`
; ││││┌ @ reducedim.jl:357 within `#mapreduce#800`
; │││││┌ @ reducedim.jl:365 within `_mapreduce_dim`
%1 = call i64 @j__mapreduce_927({}* nonnull %0) #0
; └└└└└└
ret i64 %1
}
Here’s a guess: the anonymous-function form can be inlined, so the 1 can be constant-propagated and converted to 1.0 by the compiler in order to compare it to the float array. The >(1) form goes through an intermediate struct, where the 1 is not a compile-time constant, so the conversion to float happens on every comparison.
Does the difference disappear if you specify aggressive constprop?
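One way to ask the compiler for that is `Base.@constprop`. A minimal sketch (the wrapper name `sum_gt` is hypothetical, not from this thread, and it's not certain this recovers the anonymous-function speed; that is exactly the open question):

```julia
# Hypothetical wrapper that opts its calls into aggressive constant propagation,
# so the literal 1 captured by >(1) has a better chance of being propagated.
Base.@constprop :aggressive sum_gt(mat) = sum(>(1), mat)

mat = rand(1000, 1000)
sum_gt(mat)  # counts entries greater than 1
```
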
julia> function f(y, x::Val{T}) where T
y > T
end
f (generic function with 1 method)
julia> f(x) = Base.Fix2(f, Val(x))
f (generic function with 2 methods)
julia> mat = rand(1000, 1000);
julia> @btime sum(>(1), $mat)
530.568 μs (0 allocations: 0 bytes)
0
julia> @btime sum(x-> x > 1, $mat)
375.517 μs (0 allocations: 0 bytes)
0
julia> @btime sum(f(1), $mat)
382.111 μs (0 allocations: 0 bytes)
0
Any approach based on Val seems like an awful kludge. It addresses this particular case, but (in my opinion) creates a bigger mess than it solves. We should not plan to write (and maintain!) Val versions of the hundreds of functions we can imagine someone wanting to curry, nor suffer a performance trap when someone is more creative than us and curries something we didn’t Val.
Further, [Base.Fix1(+,Val(x)) for x in 1:3] is a heterogeneous container that will yield dynamic dispatch to multiple independently-compiled functions. The whole point of Base.Fix1/Base.Fix2 is to make reproducible closures so we need fewer redundant anonymous functions (that the compiler must compile and maintain).
One should prefer [Base.Fix1(+,x) for x in 1:3] (the promotion problem showcased in this thread needs to be fixed, but at least this container is homogeneous) or [y->x+y for x in 1:3] (which is homogeneous and appears to be resistant to the promotion issue).
Is there a reason you’re suggesting Fix1 instead of the Fix2 that the OP’s code uses? I don’t know of any difference between them besides argument order.
Also, I don’t think there’s a way to avoid doing the promotion every call while keeping [>(y) for y in 1:3] homogeneous. For each instance to share a type and its underlying method, the fixed value can’t be a constant, so the promotion has to wait for a runtime value.
From the perspective of OP’s problem, this looks like a negative, but it’s really a tradeoff. x -> x > 1 cannot share a type with x -> x > 2, so if you make and call many of these, you have to compile basically the same thing over and over. To compile only once, you could do something like [x -> x > y for y in 1:3], but those need runtime promotions just like >(1). I think we just have to accept that the 1 in >(1) is an argument, not a constant in a function body.
This case is not that bad, though: you can do the promotion in advance even without making 1 a constant, via >(convert(eltype(mat), 1)).
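Concretely (a sketch, with `mat` a Float64 matrix as in the benchmarks above):

```julia
mat = rand(1000, 1000)
pred = >(convert(eltype(mat), 1))  # captures a Float64, so no per-element promotion
typeof(pred)                       # Base.Fix2{typeof(>), Float64}
sum(pred, mat)
```
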
The difference is not entirely unexpected to me. What I find curious is that apparently >(1) is also not constant propagated when the whole sum is wrapped into another function.
The only difference between Base.Fix1 and Base.Fix2 is the argument position the captured value is placed in. Pay no heed to my choice of one or another.
In fact, it can:
julia> [x -> x > 1, x -> x > 2, x -> x > 3] # different functions with different types
3-element Vector{Function}:
#13 (generic function with 1 method)
#14 (generic function with 1 method)
#15 (generic function with 1 method)
julia> [x -> x > y for y in 1:3] # same function with same type
3-element Vector{var"#20#22"{Int64}}:
#20 (generic function with 1 method)
#20 (generic function with 1 method)
#20 (generic function with 1 method)
julia> [x -> x > y for y in 1:3] |> eltype |> isconcretetype
true
julia> [x -> x > y for y in Any[1,2.0,3//1]] # same function with different type
3-element Vector{var"#24#26"}:
#24 (generic function with 1 method)
#24 (generic function with 1 method)
#24 (generic function with 1 method)
julia> [x -> x > y for y in Any[1,2.0,3//1]] |> eltype |> isconcretetype
false
Also note that the closure type var"#20#22"{Int64} has captured an Int64, not a Float64. But when compiled for a Float64 argument, it manages to constant-prop the Int64 so that it can do the promotion at compile time. How it does this is a bit beyond me. It does seem that the best it could do is to hoist the promotion, since the “same function with the same type” does indeed need multiple versions for different captured values. My only explanation would be that maybe anonymous functions inline better so the hoisting can happen.
As one of the other posters points out, the issue is that the same hoisting does not occur for Base.Fix1/Base.Fix2 closures.
Sorry. I realized partway through that I was on the wrong track and didn’t manage to fully adjust what I was writing. What I mean is that it should be possible to do the conversion and load once at the top of a loop and then keep using it throughout the loop. That should avoid any recurring cost of the mismatch. It’s not clear that the Base.Fix2 version does that. I’d have to dig into the @code_llvm or maybe @code_native to be sure.
Not sure, the @code_llvm does suggest that the Fix2 instance has a promotion that the anonymous function does not, and @btime reflects a significant performance difference. I’m guessing that your benchmark is additionally being affected by sum.
That doesn’t happen necessarily even if you’re looping over the same number. 2.5 .|> [x -> x > y for y in (1,1,1)] and 2.5 .|> [>(y) for y in (1,1,1)] both do runtime promotions. If the 1 isn’t a type parameter or a constant in a method body, it seems like the type and code must accommodate other values, even if you always provide the same value at runtime.
There’s no way the system can know that a bunch of not-necessarily-identical Base.Fix2{typeof(>),Int} instances share the same captured Int value. My point is that if I call sum(>(1), ::Vector{Float64}), it should be able to read the single Base.Fix2’s captured Int once, convert it to Float64 (as it knows it must), and then loop over comparisons against that Float64, rather than promoting the Int before every comparison.
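A hand-hoisted version of what the compiler could do (a sketch; `count_gt` is a hypothetical name, not from the thread):

```julia
# Hoist the conversion of the captured value out of the loop by hand.
function count_gt(pred::Base.Fix2{typeof(>)}, mat::Matrix{Float64})
    t = convert(Float64, pred.x)  # convert the captured value once, before the loop
    c = 0
    for v in mat
        c += v > t                # Float64-to-Float64 comparison; no per-element promotion
    end
    return c
end

count_gt(>(1), rand(1000, 1000))
```
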