Performance of generator expressions and higher order functions

performance
memory-allocation

#1

Hello,

while trying to optimize some Julia code I wrote, I encountered several performance issues all following the same kind of pattern, related to using higher-order functions and generator expressions nested within an outer loop.

Below is a somewhat stupid example, but I think it illustrates my point well enough is a self-contained way. These are three implementations of the same algorithm, in which two loops are involved. All timings below were obtained using BenchmarkTools and julia -O3.

First variant: both loops are written with for loops.

function v1(n :: Int)
    prime = Array{Bool}(undef, n)
    fill!(prime, true)
    for i in 2:n
        for j in 2:i-1
            if i%j==0
                prime[i] = false
                break
            end
        end
    end
end
julia> @btime v1(1000)
  307.409 μs (1 allocation: 1.06 KiB)

Second variant: the inner loop has been replaced by a call to a higher-order function (any in this case) and a generator expression. Notice how the run time and number of allocations increase.

function v2(n :: Int)
    prime = Array{Bool}(undef, n)
    fill!(prime, true)
    for i in 2:n
        if any(i%j==0 for j in 2:i-1)
                prime[i] = false
        end
    end
end
julia> @btime v2(1000)
  4.536 ms (22201 allocations: 379.16 KiB)

I can imagine why this would be the case, since maybe temporary objects are needed to instantiate the generators, and the optimizer does not see how to avoid them. What puzzles me most is this third variant, in which both loops are using higher-order functions. Replacing the outer loop by a call to foreach gives the performance back!

function v3(n :: Int)
    prime = Array{Bool}(undef, n)
    fill!(prime, true)
    foreach(2:n) do i
        if any(i%j==0 for j in 2:i-1)
                prime[i] = false
        end
    end
end
julia> @btime v3(1000)
  313.988 μs (1 allocation: 1.06 KiB)

Does anybody know what happens in this case? Or am I doing anything wrong? Would there be a way to improve version 2 in such a way that it can use higher-order functions to replace the inner loop, but still not kill the performance when used in a “regular” (i.e. for or while) outer loop?

Thanks in advance


#2

You are unfortunately hitting one of the compiler performance “bugs” regarding variables that are closed over (https://github.com/JuliaLang/julia/issues/15276). It can be seen using @code_warntype that we have the dreaded Core.Box as described in that issue. We can fix it by helping julia a bit:

function v2(n :: Int)
    prime = Array{Bool}(undef, n)
    fill!(prime, true)
    for i in 2:n
        let i = i
        if any(i%j==0 for j in 2:i-1)
            prime[i] = false
        end
        end
    end
end
julia> @btime v2(1000)
  327.984 μs (1 allocation: 1.06 KiB)

#3

Thanks!

So this also explains why v3 works better: the do block creates a function taking i as argument, which has the same effect as introducing the let i=i block. Did I understand things correctly?

BTW, for future reference, following the bug you mentioned led me to the Performance of captured variable paragraph of the Julia manual, in which this problem is explained in more details.

Thanks again!


#4

I don’t get it. I used to believe that the following two versions are lowered identically (I just made all scopes and loops explicit):

julia> function v5a(n :: Int)
           prime = Array{Bool}(undef, n)
           fill!(prime, true)
           rr=2:n
           it = iterate(rr)
           @label loopstart
           (it === nothing) && @goto loopend
           let (i,s) = it
           if any(i%j==0 for j in 2:i-1)
               prime[i]=false
           end
           it = iterate(rr,s)
           @goto loopstart
           end
           @label loopend
           return prime
       end

julia> function v5b(n :: Int)
           prime = Array{Bool}(undef, n)
           fill!(prime, true)
           rr=2:n
           it = iterate(rr)
           @label loopstart
           (it === nothing) && @goto loopend
           let 
           (i,s) = it
           if any(i%j==0 for j in 2:i-1)
               prime[i]=false
           end
           it = iterate(rr,s)
           @goto loopstart
           end
           @label loopend
           return prime
       end

However,

julia> N=1000; v5a(N);v5b(N);@time v5a(N);@time v5b(N);
  0.000459 seconds (5 allocations: 1.219 KiB)
  0.007087 seconds (22.20 k allocations: 379.313 KiB)

Imo both variants should be identical after lowering (before inference)?

Or is there any difference in semantics between let a = b \n body end and let \n a = b \n body end, under the assumption that the identifier a is not present in the outer function scope (which is know during lowering!)?

If there is no semantic difference, then there is quite a bit of pre-inference optimization potential.


#5

Yeah.

julia> Meta.lower(Main, quote
           let x = x
               nothing
           end
       end)
:($(Expr(:thunk, CodeInfo(

2 1 ─ %1 = x                                                                                                                                                                                                                              │
  │        x = %1                                                                                                                                                                                                                         │
3 └──      return nothing                                                                                                                                                                                                                 │
))))

vs.

julia> Meta.lower(Main, quote
           let
               x = x
               nothing
           end
       end)
:($(Expr(:thunk, CodeInfo(
3 1 ─     x = x                                                                                                                                                                                                                           │
4 └──     return nothing                                                                                                                                                                                                                  │
))))

#6

This is indeed a very good point. From my understanding of the documentation of let, there should be no difference between both forms: if the relevant variable does not exist in the outer function scope, I don’t see what could be done except create a new location.

The only thing which would explain the difference in performance is if the compiler is not aware (yet) of the non-existence of the variable in the outer scope, and can thus not know whether the assignment will modify an existing location or create a new one… At least that is how I understand things.


That being said, this behavior might explain the difference between versions 2 and 3 above, but only if a for loop introduces a let block around its body. And I’m not sure this is what gets done. On the one hand, the documentation for the Iteration interface states that a for loop is translated into:

next = iterate(iter)
while next !== nothing
    (i, state) = next
    # body
    next = iterate(iter, state)
end

without mentioning any let block. On the other hand, the Scope of variables paragraph mentions that

for loops, while loops, and Comprehensions have the following behavior: any new variables introduced in their body scopes are freshly allocated for each loop iteration, as if the loop body were surrounded by a let block.

and the example just below that effectively shows that closures taken from the body loop capture different variable locations. So that now I’m confused as to why a let block surrounding the whole iteration body changes anything…

So I guess my question becomes: is there an implicit let block around a for loop body, and if so, what form does it take?


#7

But by assigning x = x, your example supposes that there exists an outer binding for the variable x (in the right-hand side). Whereas @foobar_lv2’s example supposed that there exist no outer binding, which is exactly why both forms of the let block should behave in the same way.

And indeed, there isn’t much difference between

julia> Meta.lower(Main, quote
    let x = b
        nothing
    end
end)

:($(Expr(:thunk, CodeInfo(
│2 1 ─     x = b
│3 └──     return nothing
))))

and

julia> Meta.lower(Main, quote
    let
        x = b
        nothing
    end
end)

:($(Expr(:thunk, CodeInfo(
│3 1 ─     x = b
│4 └──     return nothing
))))

(but I’m not sure exactly what the numbers on the left mean, or if they are relevant here…)

BTW, thanks for Meta.lower: I only knew of @code_lowered


#8

As @ffevotte already remarked, your example lowers mostly identically:

julia> Meta.lower(Main, quote
           let x = b
        nothing
           end
       end)
:($(Expr(:thunk, CodeInfo(
2 1 ─     x = b                                                                          │
3 └──     return nothing                                                                 │
))))

julia> Meta.lower(Main, quote
           let 
           x = b
        nothing
           end
       end)
:($(Expr(:thunk, CodeInfo(
3 1 ─     x = b                                                                          │
4 └──     return nothing                                                                 │
))))

So the question remains: Are there semantic differences? Or is it possible to do a pure AST optimization of an entire function that removes some closure-boxes by moving all scoped variables into explicit let-clauses?

If the latter, then shouldn’t this be done during lowering (ie as a very early compiler pass), instead of by macro or by hand?