Performance problem of count?


Writing code to count the number of inversions of a permutation, I was surprised that it allocates. Thus I wrote what I thought equivalent code with a for loop, and it does not allocate. Here is the code; why is count so slow?

function inversions(p::Vector{Int})
  for j in 2:length(p) 
    for i in 1:j-1 

function inversions1(p::Vector{Int})
  count(p[i]>p[j] for j in 2:length(p) for i in 1:j-1) 


and here is the timing:

@btime inversions1($p)
  514.890 ns (111 allocations: 3.44 KiB)

@btime inversions($p)
  39.365 ns (0 allocations: 0 bytes)


What is the best way to look at the source code for the count method in Julia?


@less count(...)


For me is was @less count("")


I’m not an expert on when or why an allocation happens, but it seems to me that in inversions1 you’re creating a generator expression depending on p (which may need allocation) and pass that to count, whereas in inversions all you need is the known length of the vector and two loops with the condition. A quick look at @code_lowered seems to confirm that:

julia> @code_lowered inversions1(p)
2 1 ─ %1  = Main.:(##4#5)
  │   %2  = (Core.typeof)(p)
  │   %3  = (Core.apply_type)(%1, %2)
  │         #4 = %new(%3, p)
  │   %5  = #4
  │   %6  = (Main.length)(p)
  │   %7  = 2:%6
  │   %8  = (Base.Generator)(%5, %7)
  │   %9  = (Base.Flatten)(%8)
  │   %10 = (Main.count)(%9)
  └──       return %10

julia> @code_lowered inversions(p)
2 1 ─       l = 0
3 │   %2  = (Main.length)(p)
  │   %3  = 2:%2
  │         #temp#@_4 = (Base.iterate)(%3)
  │   %5  = #temp#@_4 === nothing
  │   %6  = (Base.not_int)(%5)
  └──       goto #7 if not %6
  2 ┄ %8  = #temp#@_4
  │         j = (Core.getfield)(%8, 1)
  │   %10 = (Core.getfield)(%8, 2)
4 │   %11 = j - 1
  │   %12 = 1:%11
  │         #temp#@_6 = (Base.iterate)(%12)
  │   %14 = #temp#@_6 === nothing
  │   %15 = (Base.not_int)(%14)
  └──       goto #5 if not %15
  3 ┄ %17 = #temp#@_6
  │         i = (Core.getfield)(%17, 1)
  │   %19 = (Core.getfield)(%17, 2)
5 │   %20 = l
  │   %21 = (Base.getindex)(p, i)
  │   %22 = (Base.getindex)(p, j)
  │   %23 = %21 > %22
  │         l = %20 + %23
  │         #temp#@_6 = (Base.iterate)(%12, %19)
  │   %26 = #temp#@_6 === nothing
  │   %27 = (Base.not_int)(%26)
  └──       goto #5 if not %27
  4 ─       goto #3
  5 ─       #temp#@_4 = (Base.iterate)(%3, %10)
  │   %31 = #temp#@_4 === nothing
  │   %32 = (Base.not_int)(%31)
  └──       goto #7 if not %32
  6 ─       goto #2
8 7 ─       return l

This is also a great example for why shorter code not always means faster execution.


See also: which makes the implicit flatten faster. Using this PR your example looks like this:

julia> @btime inversions(p)
  57.567 ns (0 allocations: 0 bytes)
@btime inversions1(p)
  903.217 ns (111 allocations: 3.44 KiB)

# with new PR:
julia> @btime inversions1(p)
  238.940 ns (21 allocations: 640 bytes)

Its still not as fast as the manual written loop, but definitely a step forward.


Yes. The argument for selling Julia is that the cost of abstraction is negligible. We have to make this true.


I really don’t see the problem here - to me, the loop is much more clear and explicit about what’s going on than the generator expression. It does precisely what you expect without any hidden allocation, whereas the generator by lieu of existing already allocates. You can even go as far as to say that they are distinct by what they’re doing - the second version explicitly constructs that generator (maybe with some rewrite to return it, should all those permutations be needed lazily) whereas the first version only gives the static count of inversions.


In this case a generator without a double for loop does not allocate.

function inversions1(p::Vector{Int})
  count(p[j-1]>p[j] for j in 2:length(p)) 
julia> @btime inversions1(p)  
15.059 ns (0 allocations: 0 bytes)

I’m aware that this is a different calculation, but the point is, that a generator per se doesn’t necessarily allocate.


I’m not quite sure @btime is accurate here, since the @code_lowered does see a new:

julia> function inversions2(p::Vector{Int})
         count(p[j-1]>p[j] for j in 2:length(p))
inversions2 (generic function with 1 method)

julia> @code_lowered inversions2(p)
2 1 ─ %1 = Main.:(##7#8)
  │   %2 = (Core.typeof)(p)
  │   %3 = (Core.apply_type)(%1, %2)
  │        #7 = %new(%3, p)
  │   %5 = #7
  │   %6 = (Main.length)(p)
  │   %7 = 2:%6
  │   %8 = (Base.Generator)(%5, %7)
  │   %9 = (Main.count)(%8)
  └──      return %9

and the regular @time also sees allocations (even after precompilation):

julia> inversions2(p)

julia> @time inversions2(p)
  0.000003 seconds (4 allocations: 160 bytes)

My point is just that a “shorter” generator expression does not necessarily lead to the same code as writing an explicit loop getting just what you want. I suspect that in the OP what lead to the allocation was the call to Base.Flatten in the lowered code when generating that generator, which stems from the fact that there are two for in there.

@Jean_Michel It’s not that count allocates, but the generators you’re creating (and that are being flattened) allocate. In Julia, loops are fast so if you just need the count and don’t care about which pair of elements it’s true for, just writing the loop is probably going to be both faster and simpler to write.


I think that it is more subtle than this: the cost of many abstractions can be made negligible or in some cases zero, but this does not always hold and/or is not always done automatically (requiring a bit of prodding, or extra work).

Writing very fast code still requires some work and experience. Coming to Julia and expecting everything to be magically faster is a setup for disappointment.


Wouldn’t be usefult to propose then in “benchmark tips” to not use generators?

In other words - could we find examples where using generators could be useful and performant?

Topological sort (performance)

No, not at all! Why have a language feature when you’re discouraged from using it? For some things generator syntax is indeed more useful e.g. single inline for or when the runtime is dominated by something else or when you can reuse a generator for multiple things. The example here just doesn’t benefit from one of those cases, but rather the double for generator causes an allocation which dominates the runtime. Additionally, no one in this thread has tested whether that generator function has constant overhead or whether it is in O(n). Just 10 elements in p are surely not representative for the general case.

Topological sort (performance)
Topological sort (performance)

a) It is context dependent. In case running performance is not important it could be written with less keystrokes (ie could have better “finger” performance).
b) it is implementation dependent. Clever compiler could translate this count(p[j-1]>p[j] for j in 2:length(p)) without allocation, couldn’t it? Till than we probably have to warn people not to use it if performance is important.

Would you mind to add MWE? (I don’t want to disturb you and will really understand you don’t! :slight_smile: )


a) My post was very much in response to you saying “Wouldn’t be usefult [sic] to propose then in “benchmark tips” to not use generators?”. Of course running time is important. That’s why I mentioned the case where the runtime of the generator is dominated by some other thing. Please don’t twist what I said into something I did not say.

b) I don’t know, I’m not a compiler writer. I can imagine though that not unrolling and leaving that generator there could possibly have benefits e.g. lazy evaluation - the compiler doesn’t really know if evaluating that generator is a complex operation or not.


Did you try a sum() in place of count()?
possibly recommended by the >?reduce docstring


@code_lowered is just the lowered code. It is purely a syntactic transformation of the original code. Of course there will be new there since that is what is in the syntax of the code. It is the optimizer that has the job of eliminating those, so look at @code_typed instead.

Yes because you are running it in global scope

julia> @time 1+1
  0.000003 seconds (4 allocations: 160 bytes)


The generator is 10 times slower whatever the length:

julia> p=100:-1:1
julia> @btime inversions(p)
  4.400 μs (0 allocations: 0 bytes)

julia> @btime inversions1(p)
  48.518 μs (10101 allocations: 315.63 KiB)


sum has exactly the same problems as count. I did try


Many benchmarks tips reflect ‘bugs’ in the current version of Julia: like the disaster with closures,etc…
Since Julia promises the possibility to use functional style without cost, these must absolutely be fixed.
The functional code is superior, promising easy parallelization for instance (and less bugs).