Anonymous function in count()

The slowdown of using something like count(z->(edges[1]<z<=edges[2]),x) is substantial, and can be avoided by wrapping it in another function (see below).

Is this the same issue as the 10x slowdown when passing a function as an argument?

using BenchmarkTools

x = randn(1_000_000)
edges = [0.0,1.0]

fn1(x,edges)  = count(z->(edges[1]<z<=edges[2]),x)        #slow
fn2(x)        = count(z->(0<z<=1),x)                      #quick
fn3(x,edges)  = fn3b(x,edges[1],edges[2])                 #quick
fn3b(x,lo,hi) = count(z->(lo<z<=hi),x)

@btime fn1($x,$edges)
@btime fn2($x)
@btime fn3($x,$edges)

This gives (on my notebook):
5.007 ms (0 allocations: 0 bytes)
335.300 μs (0 allocations: 0 bytes)
336.300 μs (0 allocations: 0 bytes)
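The fn3/fn3b split works because fn3b receives the bounds as plain Float64 arguments, so the anonymous function captures concrete values rather than the array. For what it's worth, a let block inside a single function achieves the same effect (a sketch with a name of my own, fn3c; I have not re-run the timings for it):

```julia
# Same idea as the fn3/fn3b barrier, but with a `let` block:
# the closure captures the concrete Float64 values lo and hi,
# not the array `edges` itself.
fn3c(x, edges) =
    let lo = edges[1], hi = edges[2]
        count(z -> lo < z <= hi, x)
    end
```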

I think what you’re seeing is the cost of reading these values from the array edges at each iteration. Maybe, since edges is mutable, the compiler does not want to assume its values remain the same. If you make it a tuple instead, it becomes fast:

julia> @btime fn1($x,$edges)
  4.360 ms (0 allocations: 0 bytes)
341122

julia> @btime fn1($x,$(Tuple(edges)))
  163.708 μs (0 allocations: 0 bytes)
341122

julia> @btime fn3($x,$edges)
  163.375 μs (0 allocations: 0 bytes)
341122

Not sure if it is really “at each iteration”. I looked at it with:

julia> @code_warntype fn1(x,edges)
Variables
  #self#::Core.Const(fn1)
  x::Vector{Float64}
  edges::Vector{Float64}
  #1::var"#1#2"{Vector{Float64}}

Body::Int64
1 ─ %1 = Main.:(var"#1#2")::Core.Const(var"#1#2")
│   %2 = Core.typeof(edges)::Core.Const(Vector{Float64})
│   %3 = Core.apply_type(%1, %2)::Core.Const(var"#1#2"{Vector{Float64}})
│        (#1 = %new(%3, edges))
│   %5 = #1::var"#1#2"{Vector{Float64}}
│   %6 = Main.count(%5, x)::Int64
└──      return %6

which doesn’t look like it; it looks more like a single costly overhead. But perhaps I am interpreting it wrong.

One way to check is to look at the scaling – the extra cost of the array compared to the tuple looks like it grows linearly with the number of iterations:

julia> x = rand(10^5); @belapsed(fn1($x, $edges)) - @belapsed(fn1($x, $(Tuple(edges))))
4.6791e-5

julia> x = rand(10^6); @belapsed(fn1($x, $edges)) - @belapsed(fn1($x, $(Tuple(edges))))
0.000467125

julia> x = rand(10^7); @belapsed(fn1($x, $edges)) - @belapsed(fn1($x, $(Tuple(edges))))
0.004648667

It seems that Julia fails to prove that the load from edges is loop-invariant, even though the predicate is a pure function. Adding @inbounds also doesn’t help, and LLVM fails to vectorize fn1. This is unfortunate, since LLVM should be able to do this trivially.

Interestingly, if you check the LLVM IR (I tested on Julia 1.6.1), you will find that after adding @inbounds, Julia can hoist the load of edges[1], but not edges[2], out of the loop. So one additional load is still needed for edges[2].
So if you try this:

fn4(x,edges,lo) = count(z->(@inbounds lo<z<=edges[2]),x) # no vectorization, slow
fn5(x,edges,hi) = count(z->(@inbounds edges[1]<z<=hi),x) # vectorization, fast
@code_llvm fn4(x,edges,0.0) # no vectorization; has an additional load
@code_llvm fn5(x,edges,1.0) # vectorized; has <4 x double>
@btime fn4($x,$edges,0.0)   # ~5 ms
@btime fn5($x,$edges,1.0)   # ~500 μs

fn5 is vectorized but fn4 is not. What happens here?

Maybe that’s because the comparison lo < z <= hi has a branch, so the lookup only happens sometimes. If you remove that, it’s much faster:

julia> Meta.@lower 1<2<3
:($(Expr(:thunk, CodeInfo(
    @ none within `top-level scope`
1 ─ %1 = 1 < 2
└──      goto #3 if not %1
2 ─ %3 = 2 < 3
└──      return %3
3 ─      return false
))))

julia> fn6(x,edges)  = count(z->((edges[1]<z) & (z<=edges[2])),x);

julia> @btime fn6($(randn(10^6)), $edges);
  631.042 μs (0 allocations: 0 bytes)

julia> fn6i(x,edges)  = count(z->@inbounds((edges[1]<z) & (z<=edges[2])),x);

julia> @btime fn6i($(randn(10^6)), $edges);
  164.208 μs (0 allocations: 0 bytes)

(Edit – added @inbounds since this makes a big difference.)

It seems that without the branch, LLVM can vectorize the code again. If you check the code of fn6 (with @inbounds), it’s as fast as the others (and vectorized)…

So the reason is that it’s unsafe to hoist the load to the head of the loop: in some cases the branch containing z<=edges[2] is never taken, and edges[2] might not exist. If the compiler moved the load out of the loop, it could cause an out-of-bounds access.

Consider this code:

function f(x::Vector{Int},y)
  z = 0
  for i in 1:100
    if y
      z += x[1]
    end
  end
  return z
end

Someone might invoke f as f(Int[],false), and if the compiler moved x[1] out of the loop, it would read an element of an empty array. Replacing the branch with a non-short-circuiting conditional rules out this possibility (since both operands are always evaluated) and lets the compiler optimize the loop.
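To make the contrast concrete, here is a branchless variant of f (a sketch; the name fbranchless is mine). ifelse evaluates both of its value arguments eagerly, so the load of x[1] happens unconditionally and hoisting it becomes legal; note this also changes the semantics for the empty-array case:

```julia
# Branchless variant: ifelse evaluates both value arguments eagerly,
# so x[1] is loaded on every iteration regardless of y, and the
# compiler may hoist the load out of the loop.
# Caveat: this changes semantics. f(Int[], false) returns 0, while
# fbranchless(Int[], false) throws a BoundsError, because x[1] is
# evaluated before ifelse is called.
function fbranchless(x::Vector{Int}, y)
    z = 0
    for i in 1:100
        z += ifelse(y, x[1], 0)
    end
    return z
end
```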

So, in conclusion, this has nothing to do with closures; it’s a compiler-optimization problem. What’s counter-intuitive here is that x<y<z uses short-circuit rather than eager evaluation.


Yes. This short-circuit behaviour is documented here, although perhaps it should also be in the docstring for < or >.

Actually, after reading the document you linked carefully, I see that it says

However, the order of evaluations in a chained comparison is undefined

rather than “short-circuit”. So maybe in this case the compiler should (theoretically?) still have the freedom to optimize the code: since evaluating edges[1] and edges[2] before the comparisons is allowed, that would permit hoisting the loads out of the loop.
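Indeed, doing that evaluation by hand recovers the fast path (a sketch; fn7 is a name I’m introducing, and it is essentially fn3 from above, so I have not re-benchmarked it):

```julia
# Evaluating the bounds once, before the comparisons, is exactly what
# the "undefined evaluation order" would permit the compiler to do.
# Doing it manually hoists the loads out of the loop by hand.
function fn7(x, edges)
    lo, hi = edges[1], edges[2]
    count(z -> lo < z <= hi, x)
end
```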