Performance problem of count?

Jean_Michel · October 27, 2018, 7:54pm

Writing code to count the number of inversions of a permutation, I was surprised that it allocates. Thus I wrote what I thought equivalent code with a for loop, and it does not allocate. Here is the code; why is count so slow?

function inversions(p::Vector{Int})
  l=0
  for j in 2:length(p) 
    for i in 1:j-1 
      l+=(p[i]>p[j])
    end 
  end
  l
end

function inversions1(p::Vector{Int})
  count(p[i]>p[j] for j in 2:length(p) for i in 1:j-1) 
end

p=collect(10:-1:1)

and here is the timing:

@btime inversions1($p)
  514.890 ns (111 allocations: 3.44 KiB)

@btime inversions($p)
  39.365 ns (0 allocations: 0 bytes)

oliver · October 27, 2018, 8:14pm

What is the best way to look at the source code for the count method in Julia?

Jean_Michel · October 27, 2018, 8:15pm

@less count(...)

oliver · October 27, 2018, 8:16pm

For me is was @less count("")

Sukera · October 27, 2018, 8:32pm

I’m not an expert on when or why an allocation happens, but it seems to me that in inversions1 you’re creating a generator expression depending on p (which may need allocation) and pass that to count, whereas in inversions all you need is the known length of the vector and two loops with the condition. A quick look at @code_lowered seems to confirm that:

julia> @code_lowered inversions1(p)
CodeInfo(
2 1 ─ %1  = Main.:(##4#5)
  │   %2  = (Core.typeof)(p)
  │   %3  = (Core.apply_type)(%1, %2)
  │         #4 = %new(%3, p)
  │   %5  = #4
  │   %6  = (Main.length)(p)
  │   %7  = 2:%6
  │   %8  = (Base.Generator)(%5, %7)
  │   %9  = (Base.Flatten)(%8)
  │   %10 = (Main.count)(%9)
  └──       return %10
)

julia> @code_lowered inversions(p)
CodeInfo(
2 1 ─       l = 0
3 │   %2  = (Main.length)(p)
  │   %3  = 2:%2
  │         #temp#@_4 = (Base.iterate)(%3)
  │   %5  = #temp#@_4 === nothing
  │   %6  = (Base.not_int)(%5)
  └──       goto #7 if not %6
  2 ┄ %8  = #temp#@_4
  │         j = (Core.getfield)(%8, 1)
  │   %10 = (Core.getfield)(%8, 2)
4 │   %11 = j - 1
  │   %12 = 1:%11
  │         #temp#@_6 = (Base.iterate)(%12)
  │   %14 = #temp#@_6 === nothing
  │   %15 = (Base.not_int)(%14)
  └──       goto #5 if not %15
  3 ┄ %17 = #temp#@_6
  │         i = (Core.getfield)(%17, 1)
  │   %19 = (Core.getfield)(%17, 2)
5 │   %20 = l
  │   %21 = (Base.getindex)(p, i)
  │   %22 = (Base.getindex)(p, j)
  │   %23 = %21 > %22
  │         l = %20 + %23
  │         #temp#@_6 = (Base.iterate)(%12, %19)
  │   %26 = #temp#@_6 === nothing
  │   %27 = (Base.not_int)(%26)
  └──       goto #5 if not %27
  4 ─       goto #3
  5 ─       #temp#@_4 = (Base.iterate)(%3, %10)
  │   %31 = #temp#@_4 === nothing
  │   %32 = (Base.not_int)(%31)
  └──       goto #7 if not %32
  6 ─       goto #2
8 7 ─       return l
)

This is also a great example for why shorter code not always means faster execution.

laborg · October 28, 2018, 7:46am

See also: https://github.com/JuliaLang/julia/pull/29786 which makes the implicit flatten faster. Using this PR your example looks like this:

julia> @btime inversions(p)
  57.567 ns (0 allocations: 0 bytes)
@btime inversions1(p)
  903.217 ns (111 allocations: 3.44 KiB)

# with new PR:
julia> @btime inversions1(p)
  238.940 ns (21 allocations: 640 bytes)

Its still not as fast as the manual written loop, but definitely a step forward.

Jean_Michel · October 28, 2018, 9:19am

Yes. The argument for selling Julia is that the cost of abstraction is negligible. We have to make this true.

Sukera · October 28, 2018, 9:28am

I really don’t see the problem here - to me, the loop is much more clear and explicit about what’s going on than the generator expression. It does precisely what you expect without any hidden allocation, whereas the generator by lieu of existing already allocates. You can even go as far as to say that they are distinct by what they’re doing - the second version explicitly constructs that generator (maybe with some rewrite to return it, should all those permutations be needed lazily) whereas the first version only gives the static count of inversions.

laborg · October 28, 2018, 9:32am

In this case a generator without a double for loop does not allocate.

function inversions1(p::Vector{Int})
  count(p[j-1]>p[j] for j in 2:length(p)) 
end
julia> @btime inversions1(p)  
15.059 ns (0 allocations: 0 bytes)

I’m aware that this is a different calculation, but the point is, that a generator per se doesn’t necessarily allocate.

Sukera · October 28, 2018, 10:37am

I’m not quite sure @btime is accurate here, since the @code_lowered does see a new:

julia> function inversions2(p::Vector{Int})
         count(p[j-1]>p[j] for j in 2:length(p))
         end
inversions2 (generic function with 1 method)

julia> @code_lowered inversions2(p)
CodeInfo(
2 1 ─ %1 = Main.:(##7#8)
  │   %2 = (Core.typeof)(p)
  │   %3 = (Core.apply_type)(%1, %2)
  │        #7 = %new(%3, p)
  │   %5 = #7
  │   %6 = (Main.length)(p)
  │   %7 = 2:%6
  │   %8 = (Base.Generator)(%5, %7)
  │   %9 = (Main.count)(%8)
  └──      return %9
)

and the regular @time also sees allocations (even after precompilation):

julia> inversions2(p)
9

julia> @time inversions2(p)
  0.000003 seconds (4 allocations: 160 bytes)
9

My point is just that a “shorter” generator expression does not necessarily lead to the same code as writing an explicit loop getting just what you want. I suspect that in the OP what lead to the allocation was the call to Base.Flatten in the lowered code when generating that generator, which stems from the fact that there are two for in there.

@Jean_Michel It’s not that count allocates, but the generators you’re creating (and that are being flattened) allocate. In Julia, loops are fast so if you just need the count and don’t care about which pair of elements it’s true for, just writing the loop is probably going to be both faster and simpler to write.

Tamas_Papp · October 28, 2018, 11:50am

I think that it is more subtle than this: the cost of many abstractions can be made negligible or in some cases zero, but this does not always hold and/or is not always done automatically (requiring a bit of prodding, or extra work).

Writing very fast code still requires some work and experience. Coming to Julia and expecting everything to be magically faster is a setup for disappointment.

Liso · October 28, 2018, 12:22pm

Wouldn’t be usefult to propose then in “benchmark tips” to not use generators?

In other words - could we find examples where using generators could be useful and performant?

Sukera · October 28, 2018, 12:29pm

No, not at all! Why have a language feature when you’re discouraged from using it? For some things generator syntax is indeed more useful e.g. single inline for or when the runtime is dominated by something else or when you can reuse a generator for multiple things. The example here just doesn’t benefit from one of those cases, but rather the double for generator causes an allocation which dominates the runtime. Additionally, no one in this thread has tested whether that generator function has constant overhead or whether it is in O(n). Just 10 elements in p are surely not representative for the general case.

Liso · October 28, 2018, 12:52pm

a) It is context dependent. In case running performance is not important it could be written with less keystrokes (ie could have better “finger” performance).
b) it is implementation dependent. Clever compiler could translate this count(p[j-1]>p[j] for j in 2:length(p)) without allocation, couldn’t it? Till than we probably have to warn people not to use it if performance is important.

Would you mind to add MWE? (I don’t want to disturb you and will really understand you don’t! )

Sukera · October 28, 2018, 1:39pm

a) My post was very much in response to you saying “Wouldn’t be usefult [sic] to propose then in “benchmark tips” to not use generators?”. Of course running time is important. That’s why I mentioned the case where the runtime of the generator is dominated by some other thing. Please don’t twist what I said into something I did not say.

b) I don’t know, I’m not a compiler writer. I can imagine though that not unrolling and leaving that generator there could possibly have benefits e.g. lazy evaluation - the compiler doesn’t really know if evaluating that generator is a complex operation or not.

y4lu · October 28, 2018, 2:13pm

Did you try a sum() in place of count()?
possibly recommended by the >?reduce docstring

kristoffer.carlsson · October 28, 2018, 2:14pm

@code_lowered is just the lowered code. It is purely a syntactic transformation of the original code. Of course there will be new there since that is what is in the syntax of the code. It is the optimizer that has the job of eliminating those, so look at @code_typed instead.

Yes because you are running it in global scope

julia> @time 1+1
  0.000003 seconds (4 allocations: 160 bytes)
2

Jean_Michel · October 28, 2018, 3:02pm

The generator is 10 times slower whatever the length:

julia> p=100:-1:1
julia> @btime inversions(p)
  4.400 μs (0 allocations: 0 bytes)

julia> @btime inversions1(p)
  48.518 μs (10101 allocations: 315.63 KiB)

Jean_Michel · October 28, 2018, 3:03pm

sum has exactly the same problems as count. I did try

Jean_Michel · October 28, 2018, 3:06pm

Many benchmarks tips reflect ‘bugs’ in the current version of Julia: like the disaster with closures,etc…
Since Julia promises the possibility to use functional style without cost, these must absolutely be fixed.
The functional code is superior, promising easy parallelization for instance (and less bugs).

Topic		Replies	Views
Generators vs loops vs broadcasting: Calculate PI via Monte Carlo Sampling Performance	7	1501	June 20, 2021
Best way to count element pairs (x,y) satisfying a condition? Performance question	19	580	October 22, 2023
Prefered pattern for generators? Performance	12	949	January 3, 2019
Combinatorics.permutations creating a lot of allocations Performance question	6	150	July 14, 2025
Reducing allocations to optimize (to match Java speed) Performance	22	1341	December 10, 2018

Performance problem of count?

Related topics