Performance penalty of `>(1)` vs `x -> x > 1`?

Yeah, I get it now. I was too focused on the >(1) code; you were talking about inlining and loop-invariant code motion in the sum specialization, and that looks feasible without needing to make 1 a constant or type parameter.

Bear in mind, though, that anonymous functions are not more inlineable than Fix2: the types of x -> x + 1 and x -> x + 2 really are different, and neither is structurally equivalent to eltype([x -> x + y for y in 1:2]). The latter is structurally equivalent to Fix2, and it’s just as unoptimized:

julia> @btime sum($(x -> x > 1), $mat)
  342.123 μs (0 allocations: 0 bytes)
158130

julia> @btime sum($([x -> x + y for y in 1:2][1]), $mat)
  571.358 μs (0 allocations: 0 bytes)
158130

julia> @btime sum($(>(1)), $mat)
  571.219 μs (0 allocations: 0 bytes)
158130

The two items are the same type as each other when constructed via g1, g2 = [x -> x > y for y in 1:2] (typeof(g1) === var"#10#12"{Int64}, for example). But I don’t think that was your point…

What I did not notice earlier is that an isolated anonymous function really is different (typeof(x -> x > 1) === var"#15#16", for example, notice the lack of type parameter Int64). I had not anticipated that and, as your benchmark shows, g1 has the same performance problem as a Base.Fix2. Very interesting…


That’s because it is a closure over an integer, which implicitly behaves like a struct with an integer field. Fix2 is similar.
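That lowering can be sketched by hand (illustrative only; GreaterThan is a made-up name, not anything in Base):

```julia
# A closure like `let y = 1; x -> x > y end` lowers to roughly this:
# a callable struct whose field holds the captured value.
struct GreaterThan{T}
    y::T  # runtime field, just like Fix2's `x`
end
(g::GreaterThan)(x) = x > g.y

g = GreaterThan(1)
g(2.0)  # true, same as (>(1))(2.0)
```

Since the field's value is only known at runtime, the compiler sees the same problem here as with Fix2.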

I wonder what happens if we build a custom struct and inline the function for calling it?

Yes, that’s why I was harping on about constants. x -> x > 1 does not have to capture anything, and when you provide a Float64 argument, it can do the promotion to 1.0 at compile-time on its own, no need for inlining or hoisting in sum.
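If you do want a struct but also want the compiler to treat the threshold as a constant, one sketch is to move the (isbits) value into a type parameter. GtConst here is a hypothetical name, not a Base facility:

```julia
# Hypothetical sketch: the threshold lives in the type, so calling it is
# like calling `x -> x > 1` with a literal; nothing is captured at runtime.
struct GtConst{V} <: Function end
GtConst(v) = GtConst{v}()                # v must be an isbits value
(::GtConst{V})(x) where {V} = x > V      # V is a compile-time constant

sum(GtConst(1), randn(4, 4))  # specializes on the value 1 itself
```

The trade-off is a fresh specialization per threshold value, so this only makes sense for a handful of constants.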

I copied the code for Fix2 and added @inlines; it seems it’s not that simple.

julia> begin
       struct _Fix2{F,T} <: Function
           f::F
           x::T

           _Fix2(f::F, x) where {F} = new{F,Base._stable_typeof(x)}(f, x)
           _Fix2(f::Type{F}, x) where {F} = new{Type{F},Base._stable_typeof(x)}(f, x)
       end

       @inline (f::_Fix2)(y) = @inline f.f(y, f.x)
       end

julia> using BenchmarkTools; mat = randn(1000, 1000);

julia> @btime sum($(_Fix2(>, 1)), $mat)
  548.031 μs (0 allocations: 0 bytes)
158059

julia> @btime sum($(>(1)), $mat)
  535.217 μs (0 allocations: 0 bytes)
158059

None of this happens with Int32s. Why is comparing a float with an Int64 so much more complex than with an Int32? Is that expected, or is it an issue?

See:

julia> @code_llvm 1.0 > 1
;  @ operators.jl:369 within `>`
define i8 @"julia_>_191"(double %0, i64 signext %1) #0 {
top:
; ┌ @ float.jl:576 within `<`
; │┌ @ float.jl:159 within `Float64`
    %2 = sitofp i64 %1 to double
; │└
; │ @ float.jl:577 within `<` @ float.jl:535
   %3 = fcmp olt double %2, %0
; │ @ float.jl:577 within `<`
; │┌ @ float.jl:533 within `==`
    %4 = fcmp oeq double %2, %0
    %5 = fcmp oeq double %2, 0x43E0000000000000
; │└
; │┌ @ float.jl:335 within `unsafe_trunc`
    %6 = fptosi double %2 to i64
    %7 = freeze i64 %6
; │└
; │ @ float.jl:577 within `<` @ int.jl:83
   %8 = icmp sgt i64 %7, %1
; │ @ float.jl:577 within `<`
; │┌ @ bool.jl:39 within `|`
    %9 = or i1 %5, %8
; │└
; │┌ @ bool.jl:38 within `&`
    %10 = and i1 %4, %9
; │└
; │┌ @ bool.jl:39 within `|`
    %11 = or i1 %3, %10
; └└
  %12 = zext i1 %11 to i8
  ret i8 %12
}

julia> @code_llvm 1.0 > Int32(1)
;  @ operators.jl:369 within `>`
define i8 @"julia_>_193"(double %0, i32 signext %1) #0 {
top:
; ┌ @ promotion.jl:450 within `<`
; │┌ @ promotion.jl:381 within `promote`
; ││┌ @ promotion.jl:358 within `_promote`
; │││┌ @ number.jl:7 within `convert`
; ││││┌ @ float.jl:159 within `Float64`
       %2 = sitofp i32 %1 to double
; │└└└└
; │ @ promotion.jl:450 within `<` @ float.jl:535
   %3 = fcmp olt double %2, %0
; └
  %4 = zext i1 %3 to i8
  ret i8 %4
}

Yeah, I don’t know what %4 through %11 are doing; the remainder looks identical between the two calls (float promotion, float comparison).

With Int64, I get

julia> using BenchmarkTools

julia> mat = randn(1000, 1000);

julia> @btime sum(>(1), $mat)
  295.130 μs (0 allocations: 0 bytes)
158577

julia> @btime sum(x -> x > 1, $mat)
  276.217 μs (0 allocations: 0 bytes)
158577

julia> @btime count(>(1), $mat)
  265.832 μs (0 allocations: 0 bytes)
158577

julia> @btime count(x -> x > 1, $mat)
  267.103 μs (0 allocations: 0 bytes)
158577

My CPU has AVX512, which can efficiently convert Int64 to Float64 using SIMD instructions. CPUs with AVX2 but not AVX512 can only do this for Int32.

But I don’t think that should be related. The compiler should not have a problem hoisting a conversion out of a loop.
That’s what happened on my system anyway:

vector.ph:                                        ; preds = %L68.preheader
  %n.vec = and i64 %58, -32
  %59 = insertelement <8 x i64> <i64 poison, i64 0, i64 0, i64 0, i64 0, i64 0, i64 0, i64 0>, i64 %52, i64 0
  %broadcast.splatinsert = insertelement <8 x double> poison, double %36, i64 0
  %broadcast.splat = shufflevector <8 x double> %broadcast.splatinsert, <8 x double> poison, <8 x i32> zeroinitializer
  %broadcast.splatinsert34 = insertelement <8 x i1> poison, i1 %43, i64 0
  %broadcast.splat35 = shufflevector <8 x i1> %broadcast.splatinsert34, <8 x i1> poison, <8 x i32> zeroinitializer
  br label %vector.body

vector.body:                                      ; preds = %vector.body, %vector.ph
  %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
  %vec.phi = phi <8 x i64> [ %59, %vector.ph ], [ %89, %vector.body ]
  %vec.phi22 = phi <8 x i64> [ zeroinitializer, %vector.ph ], [ %90, %vector.body ]
  %vec.phi23 = phi <8 x i64> [ zeroinitializer, %vector.ph ], [ %91, %vector.body ]
  %vec.phi24 = phi <8 x i64> [ zeroinitializer, %vector.ph ], [ %92, %vector.body ]
  %60 = add i64 %54, %index
  %61 = getelementptr inbounds double, double* %29, i64 %60
  %62 = bitcast double* %61 to <8 x double>*
  %wide.load = load <8 x double>, <8 x double>* %62, align 8
  %63 = getelementptr inbounds double, double* %61, i64 8
  %64 = bitcast double* %63 to <8 x double>*
  %wide.load25 = load <8 x double>, <8 x double>* %64, align 8
  %65 = getelementptr inbounds double, double* %61, i64 16
  %66 = bitcast double* %65 to <8 x double>*
  %wide.load26 = load <8 x double>, <8 x double>* %66, align 8
  %67 = getelementptr inbounds double, double* %61, i64 24
  %68 = bitcast double* %67 to <8 x double>*
  %wide.load27 = load <8 x double>, <8 x double>* %68, align 8
  %69 = fcmp ogt <8 x double> %wide.load, %broadcast.splat
  %70 = fcmp ogt <8 x double> %wide.load25, %broadcast.splat
  %71 = fcmp ogt <8 x double> %wide.load26, %broadcast.splat
  %72 = fcmp ogt <8 x double> %wide.load27, %broadcast.splat
  %73 = fcmp oeq <8 x double> %wide.load, %broadcast.splat
  %74 = fcmp oeq <8 x double> %wide.load25, %broadcast.splat
  %75 = fcmp oeq <8 x double> %wide.load26, %broadcast.splat
  %76 = fcmp oeq <8 x double> %wide.load27, %broadcast.splat
  %77 = and <8 x i1> %73, %broadcast.splat35
  %78 = and <8 x i1> %74, %broadcast.splat35
  %79 = and <8 x i1> %75, %broadcast.splat35
  %80 = and <8 x i1> %76, %broadcast.splat35
  %81 = or <8 x i1> %69, %77
  %82 = or <8 x i1> %70, %78
  %83 = or <8 x i1> %71, %79
  %84 = or <8 x i1> %72, %80
  %85 = zext <8 x i1> %81 to <8 x i64>
  %86 = zext <8 x i1> %82 to <8 x i64>
  %87 = zext <8 x i1> %83 to <8 x i64>
  %88 = zext <8 x i1> %84 to <8 x i64>
  %89 = add <8 x i64> %vec.phi, %85
  %90 = add <8 x i64> %vec.phi22, %86
  %91 = add <8 x i64> %vec.phi23, %87
  %92 = add <8 x i64> %vec.phi24, %88
  %index.next = add nuw i64 %index, 32
  %93 = icmp eq i64 %index.next, %n.vec
  br i1 %93, label %middle.block, label %vector.body

The x -> x > 1 version was still faster for me.

vector.ph:                                        ; preds = %L68.preheader
  %n.vec = and i64 %58, -32
  %59 = insertelement <8 x i64> <i64 poison, i64 0, i64 0, i64 0, i64 0, i64 0, i64 0, i64 0>, i64 %52, i64 0
  %broadcast.splatinsert = insertelement <8 x double> poison, double %36, i64 0
  %broadcast.splat = shufflevector <8 x double> %broadcast.splatinsert, <8 x double> poison, <8 x i32> zeroinitializer
  %broadcast.splatinsert34 = insertelement <8 x i1> poison, i1 %43, i64 0
  %broadcast.splat35 = shufflevector <8 x i1> %broadcast.splatinsert34, <8 x i1> poison, <8 x i32> zeroinitializer
  br label %vector.body

vector.body:                                      ; preds = %vector.body, %vector.ph
  %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
  %vec.phi = phi <8 x i64> [ %59, %vector.ph ], [ %89, %vector.body ]
  %vec.phi22 = phi <8 x i64> [ zeroinitializer, %vector.ph ], [ %90, %vector.body ]
  %vec.phi23 = phi <8 x i64> [ zeroinitializer, %vector.ph ], [ %91, %vector.body ]
  %vec.phi24 = phi <8 x i64> [ zeroinitializer, %vector.ph ], [ %92, %vector.body ]
  %60 = add i64 %54, %index
  %61 = getelementptr inbounds double, double* %29, i64 %60
  %62 = bitcast double* %61 to <8 x double>*
  %wide.load = load <8 x double>, <8 x double>* %62, align 8
  %63 = getelementptr inbounds double, double* %61, i64 8
  %64 = bitcast double* %63 to <8 x double>*
  %wide.load25 = load <8 x double>, <8 x double>* %64, align 8
  %65 = getelementptr inbounds double, double* %61, i64 16
  %66 = bitcast double* %65 to <8 x double>*
  %wide.load26 = load <8 x double>, <8 x double>* %66, align 8
  %67 = getelementptr inbounds double, double* %61, i64 24
  %68 = bitcast double* %67 to <8 x double>*
  %wide.load27 = load <8 x double>, <8 x double>* %68, align 8
  %69 = fcmp ogt <8 x double> %wide.load, %broadcast.splat
  %70 = fcmp ogt <8 x double> %wide.load25, %broadcast.splat
  %71 = fcmp ogt <8 x double> %wide.load26, %broadcast.splat
  %72 = fcmp ogt <8 x double> %wide.load27, %broadcast.splat
  %73 = fcmp oeq <8 x double> %wide.load, %broadcast.splat
  %74 = fcmp oeq <8 x double> %wide.load25, %broadcast.splat
  %75 = fcmp oeq <8 x double> %wide.load26, %broadcast.splat
  %76 = fcmp oeq <8 x double> %wide.load27, %broadcast.splat
  %77 = and <8 x i1> %73, %broadcast.splat35
  %78 = and <8 x i1> %74, %broadcast.splat35
  %79 = and <8 x i1> %75, %broadcast.splat35
  %80 = and <8 x i1> %76, %broadcast.splat35
  %81 = or <8 x i1> %69, %77
  %82 = or <8 x i1> %70, %78
  %83 = or <8 x i1> %71, %79
  %84 = or <8 x i1> %72, %80
  %85 = zext <8 x i1> %81 to <8 x i64>
  %86 = zext <8 x i1> %82 to <8 x i64>
  %87 = zext <8 x i1> %83 to <8 x i64>
  %88 = zext <8 x i1> %84 to <8 x i64>
  %89 = add <8 x i64> %vec.phi, %85
  %90 = add <8 x i64> %vec.phi22, %86
  %91 = add <8 x i64> %vec.phi23, %87
  %92 = add <8 x i64> %vec.phi24, %88
  %index.next = add nuw i64 %index, 32
  %93 = icmp eq i64 %index.next, %n.vec
  br i1 %93, label %middle.block, label %vector.body

Seeing that the >(1) version has a lot of extra operations, maybe having a runtime value forces it to do a lot of extra checks, because of possible floating-point weirdness.

julia> @btime sum(Base.Fix2(@fastmath(>),1), $mat)
  272.517 μs (0 allocations: 0 bytes)
158577

julia> @btime sum(>(1), $mat)
  294.637 μs (0 allocations: 0 bytes)
158577

julia> @btime sum(x -> x > 1, $mat)
  278.829 μs (0 allocations: 0 bytes)
158577

julia> @btime sum(x -> @fastmath(x > 1), $mat)
  274.547 μs (0 allocations: 0 bytes)
158577

@fastmath helps.

Do you see a reason why these weirdnesses do not occur when comparing with an Int32?

julia> mat = randn(1000, 1000);

julia> @btime sum(>(1), $mat)
  213.943 μs (0 allocations: 0 bytes)
158263

julia> @btime sum(x -> x > 1, $mat)
  175.831 μs (0 allocations: 0 bytes)
158263

julia> @btime sum(>(Int32(1)), $mat)
  177.507 μs (0 allocations: 0 bytes)
158263

julia> @btime sum(x -> x > Int32(1), $mat)
  177.426 μs (0 allocations: 0 bytes)
158263

(AFAIK this CPU also supports AVX512.)

Exactly. As far as I understand, integers can be converted exactly to Float64, so I don’t see why all those checks would be necessary in either case.

Can you show me the LLVM? I used Cthulhu to get it.

I don’t think I know what I’m doing there, but at some inner call I got this:

;  @ int.jl:83 within `<`
define i8 @"julia_<_898"(i64 signext %0, i64 signext %1) #0 {
top:
  %2 = icmp slt i64 %0, %1
  %3 = zext i1 %2 to i8
  ret i8 %3
}

Turn optimizations on (o).

I’m sorry, I really don’t know what to do there. The Cthulhu docs only mention one “optimization” keyword, and it seems to be on by default.

I see, you meant “type (o)”. Is this what you want?

 • %1 = < constprop > getproperty(::Base.Fix2{typeof(>), Int64},::Core.Const(:f))::Core.Const(>)
   f::Base.Fix2{typeof(>), Int64}.x
   f.f(y::Float64, f::Base.Fix2{typeof(>), Int64}.x::Int64)
   ↩

;  @ operators.jl:1125 within `Fix2`
define i8 @julia_Fix2_1095({ i64 }* nocapture noundef nonnull readonly align 8 dereferenceable(8) %0, double %1) #0 {
top:
  %2 = call {}*** @julia.get_pgcstack()
  %3 = bitcast {}*** %2 to {}**
  %current_task = getelementptr inbounds {}*, {}** %3, i64 -13
  %4 = bitcast {}** %current_task to i64*
  %world_age = getelementptr inbounds i64, i64* %4, i64 14
; ┌ @ Base.jl:37 within `getproperty`
   %5 = getelementptr inbounds { i64 }, { i64 }* %0, i32 0, i32 0
; └
; ┌ @ operators.jl:369 within `>`
; │┌ @ float.jl:576 within `<`
; ││┌ @ float.jl:159 within `Float64`
     %6 = load i64, i64* %5, align 8
     %7 = sitofp i64 %6 to double
; ││└
; ││ @ float.jl:577 within `<` @ float.jl:535
    %8 = fcmp olt double %7, %1
; ││ @ float.jl:577 within `<`
; ││┌ @ float.jl:533 within `==`
     %9 = fcmp oeq double %7, %1
     %10 = fcmp oeq double %7, 0x43E0000000000000
; ││└
; ││┌ @ float.jl:335 within `unsafe_trunc`
     %11 = fptosi double %7 to i64
     %12 = freeze i64 %11
; ││└
; ││ @ float.jl:577 within `<` @ int.jl:83
    %13 = load i64, i64* %5, align 8
    %14 = icmp slt i64 %13, %12
; ││ @ float.jl:577 within `<`
; ││┌ @ bool.jl:39 within `|`
     %15 = or i1 %10, %14
; ││└
; ││┌ @ bool.jl:38 within `&`
     %16 = and i1 %9, %15
; ││└
; ││┌ @ bool.jl:39 within `|`
     %17 = or i1 %8, %16
; └└└
  %18 = zext i1 %17 to i8
  ret i8 %18
}

My fairly naive guess here is that Int64 has more unique values than Float64 can represent, so some truncation is necessary when converting an Int64 to a Float64. The situation is different with Int32: there are fewer bits and fewer unique values. Perhaps each Int32 value can be uniquely mapped to a Float64, so the promotion is easy.
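That guess is easy to spot-check (a quick sanity check, not a proof):

```julia
# Every Int32 round-trips exactly through Float64 …
@assert Int64(Float64(typemax(Int32))) == typemax(Int32)
@assert Int64(Float64(typemin(Int32))) == typemin(Int32)

# … but Int64 does not: 2^53 + 1 rounds to the same Float64 as 2^53.
Float64(2^53 + 1) == Float64(2^53)  # true
```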


Just learned that the first positive Int64 that is not mapped exactly to a Float64 is:

julia> 9007199254740993 == 9007199254740993.0
false

julia> 9007199254740992 == 9007199254740992.0
true

so yes, even though that value is big, guarding against it implies additional work in the comparison.

Effectively, the Float64 conversion of the two numbers is the same:

julia> float(9007199254740993)
9.007199254740992e15

julia> float(9007199254740992)
9.007199254740992e15

(so, effectively, comparing Int64s and floats is more complicated).

Anyway, this means that one can use the trick of Int32(VAL) if the value can be represented by an Int32.


Yes. Float64 has 53 bits of mantissa and can exactly represent all 32 bit integers but not all 64 bit integers.

# v1.8.0
# base/float.jl
for Ti in (Int64,UInt64,Int128,UInt128)
    for Tf in (Float32,Float64)
        @eval begin
            # stuff
            function <(x::$Tf, y::$Ti)
                fy = ($Tf)(y)
                (x < fy) | ((x == fy) & (fy < $(Tf(typemax(Ti)))) & (unsafe_trunc($Ti,fy) < y))
            end
            # stuff
        end
    end
end

I think this explains it. Comparisons between BitIntegers with 64+ bits and IEEEFloats with 32+ bits are more complicated than simple promotion. It appears this is to ensure correctness when comparing integers larger than maxintfloat to floats. It looks like Float16 has not been considered here, although it probably should be.

Promotion would yield an incorrect answer, as seen here:

julia> maxintfloat(Float64) == Int(maxintfloat(Float64))+1 |> Float64
true

julia> maxintfloat(Float64) == Int(maxintfloat(Float64))+1
false

Since the closure cannot know that the integer it captures is losslessly representable, it has to go through these extra checks. However, it seems permissible for LICM to hoist the (fy < $(Tf(typemax(Ti)))) & (unsafe_trunc($Ti,fy) < y) checks (which would evaluate to true & false for a value such as 1) and kill all the extra logic, but it does not. Perhaps that is because it would require a separate code path for integers beyond maxintfloat.
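One practical workaround, assuming you know the threshold is within maxintfloat(Float64) so the conversion is exact: hoist the conversion yourself by passing a Float64 threshold, which makes the hot loop a plain float comparison with none of the extra checks.

```julia
mat = randn(1000, 1000)

# Mixed Float64/Int64 comparison: every element pays for the extra checks.
n1 = sum(>(1), mat)

# One-time, exact conversion of the threshold (valid here since
# abs(1) <= maxintfloat(Float64)); the loop compares Float64 to Float64.
n2 = sum(>(1.0), mat)

n1 == n2  # true for any threshold that Float64 represents exactly
```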


To reiterate, this applies to anonymous closures too:

julia> using BenchmarkTools
       mat = randn(100, 100)

       @btime sum(x->x>1, $mat)
       @btime sum(let a=1; x->x>a end, $mat);
  907.692 ns (0 allocations: 0 bytes)
  3.450 μs (0 allocations: 0 bytes)

Related?

julia> @btime sum(>(1), $mat)
       @btime mapreduce(>(1), +, $mat)
       @btime mapreduce(>(1), +, $mat; init=0)
       @btime mapfoldl(>(1), +, $mat);
  3.450 μs (0 allocations: 0 bytes)
  3.438 μs (0 allocations: 0 bytes)
  865.517 ns (0 allocations: 0 bytes)
  868.421 ns (0 allocations: 0 bytes)

Interpolating causes a slowdown since it defeats const prop:

julia> @btime mapfoldl(>(1), +, $mat)
       @btime mapfoldl($(>(1)), +, $mat);
  865.517 ns (0 allocations: 0 bytes)
  3.225 μs (0 allocations: 0 bytes)