Performance differences when using a mutable struct that contains an array


#1

Hi,

I found some strange performance issues in a simple piece of code that uses a mutable struct containing an array and an index into that array as fields. In the simplified example below I compare four different functions that fill all array elements with a constant.

const CELLS = 1024*1024

mutable struct Box
    v   :: Vector{Int}
    ind :: Int
end

function fill1!(box, n)
    # code without a helper function runs rather fast
    box.ind = 0
    for i = 1:CELLS
        box.ind += 1
        box.v[box.ind] = n
    end
end

function fill2!(box, n)
    # calling a small function to fill one element
    # -> this takes much longer
    box.ind = 0
    for i = 1:CELLS
        fill_elem!(box, n)
    end
end

function fill_elem!(box, n)
    box.ind += 1
    box.v[box.ind] = n
end

function fill3!(box, n)
    # subtle change to fill1!, but about double the runtime
    box.ind = 1
    for i = 1:CELLS
        box.v[box.ind] = n
        box.ind += 1
    end
end

function fill4!(v, n)
    # avoiding the mutable struct is the fastest
    ind = 0
    for i = 1:CELLS
        ind += 1
        v[ind] = n
    end
end

function main(method)  # benchmark
    v = zeros(Int, CELLS)
    box = Box(v, 1)
    if method == 1
        for n = 1:1000
            fill1!(box, n)
        end
    elseif method == 2
        for n = 1:1000
            fill2!(box, n)
        end
    elseif method == 3
        for n = 1:1000
            fill3!(box, n)
        end
    else
        for n = 1:1000
            fill4!(v, n)
        end
    end
end

In this example I observe very different run-times depending on which fill function is used. Probably I am missing something simple, but I cannot understand the reason for the performance differences, in particular why using the small helper function makes such a difference.
In a straightforward translation to C code, all four variants result in comparable run-times.

Of course I can always store the array and the index in separate variables (avoiding these issues), but what is wrong with putting them into a single struct?

@time main(1)  # 0.634483 seconds (10.90 k allocations: 8.557 MiB, 0.26% gc time)
@time main(2)  # 3.529522 seconds (7 allocations: 8.000 MiB)
@time main(3)  # 1.214788 seconds (7 allocations: 8.000 MiB)
@time main(4)  # 0.509509 seconds (7 allocations: 8.000 MiB, 0.23% gc time)

Note, this is not about the most efficient way to initialize an array. The code above is a contrived (simplified) example of some more involved code where I stumbled across these unexpected performance issues.


#2
1. Rather than writing your own benchmarking code, you should use BenchmarkTools.jl, which will give more precise answers for less work:
julia> using BenchmarkTools

julia> @btime fill1!($box, 1)
  741.349 μs (0 allocations: 0 bytes)

julia> @btime fill2!($box, 1)
  6.554 ms (0 allocations: 0 bytes)

julia> @btime fill3!($box, 1)
  1.338 ms (0 allocations: 0 bytes)

julia> @btime fill4!($(box.v), 1)
  638.872 μs (0 allocations: 0 bytes)
  2. It looks like you’re seeing some cost due to function calls. The compiler can inline functions automatically, but it doesn’t always do so. It can occasionally help to provide a hint that a function should be inlined (if that function is called very often within a fast loop):
julia> @inline function fill_elem!(box, n)
          box.ind += 1
          box.v[box.ind] = n
       end
fill_elem! (generic function with 1 method)

with that change, fill2! becomes as fast as fill1!:

julia> @btime fill2!($box, 1)
  740.249 μs (0 allocations: 0 bytes)

You may also want to play around with @inbounds to avoid bounds checks on your array accesses (but only when you’re absolutely sure the bounds will never be violated).
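For illustration, an @inbounds variant of fill1! from the first post might look like the sketch below (fill1_inbounds! is a made-up name; eliding the bounds checks is only safe here because the loop range 1:CELLS exactly matches the array length):

```julia
const CELLS = 1024*1024

mutable struct Box
    v   :: Vector{Int}
    ind :: Int
end

function fill1_inbounds!(box, n)
    box.ind = 0
    # @inbounds on the loop elides bounds checks for the
    # array accesses written directly in the loop body
    @inbounds for i = 1:CELLS
        box.ind += 1
        box.v[box.ind] = n
    end
    return box
end
```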


#3

Thanks a lot!

It never occurred to me that inlining could be the problem. In fact, @inline also solves the issue in my larger code.

Is there any rule of thumb for when inlining hints should be given to the compiler? Or is it just trial and error? I thought inlining hints were important for larger functions, but fill_elem! seems rather inconspicuous.

I never came across this issue before in my Julia code - but then, maybe I never realized that with this simple macro I could have gained big speed-ups.

On a side note, any ideas why fill3! is so much slower than fill1!? It just swaps the order of the element write and the index increment, so it should cost at most one extra assignment.


#4

If there was a truly simple rule of thumb then we’d make it a heuristic in the compiler for automatic inlining, so it’s hard to come up with something that doesn’t require some analysis.
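That said, a rough heuristic: very small functions called many times inside a hot inner loop are the most likely candidates for an @inline hint, and you can check whether the hint helped by re-timing or by inspecting @code_llvm for a remaining call instruction. A minimal sketch of the pattern (bump! and fill_with_helper! are hypothetical names mirroring fill_elem! from this thread):

```julia
# Tiny helper, hinted for inlining: stores n and returns the next index.
@inline bump!(v, ind, n) = (v[ind] = n; ind + 1)

function fill_with_helper!(v, n)
    ind = 1
    for i in eachindex(v)
        ind = bump!(v, ind, n)   # with inlining, this becomes a direct store
    end
    return v
end
```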


#5

If you add @inbounds annotations to the loop, you get a performance boost in both fill1! and fill3!, and they then have the same performance.


#6

If you add @inbounds then the difference goes away.

During compilation, LLVM attempts to prove that the accesses are always in bounds; if it can prove so, no code for bounds checking is emitted. LLVM also attempts to rotate the loops around, etc. The static bounds-check elimination is, however, very finicky. AFAIK it is more of a side effect of general optimizations (loop hoisting, dead code elimination), and the existing LLVM pass for this purpose (inductive range check elimination) is not even used in Julia (why?).

And by finicky I mean “depends on the phase of the moon”. So, which Julia version are you using? On my 0.6.2, fill1! was double the speed of fill3!.

That’s what I thought was happening at first. But then I decided to look at the @code_native output to see what is going on. Let’s look at the inner loop.

Fast version:

L80:
	leaq	1(%rdx), %rcx
	cmpq	%rax, %rdx
	jae	L125                                    #out-of-bounds error
	movq	%rsi, (%r10,%rdx,8)    # store n into box.v[i]
	cmpq	$1048575, %rdx          # imm = 0xFFFFF
	movq	%rcx, %rdx
	jne	L80
        #loop done
...
L125:
	movq	%rcx, 8(%r9)               #mutate box.ind
       #ordinary exception handling
...

Slow version:

L80:
	cmpq	%rax, %rdx
	jae	L125 #out-of-bounds error
	movq	%rsi, (%r10,%rdx,8)   # store n into box.v[i]
	leaq	2(%rdx), %rcx
	movq	%rcx, 8(%r9)               #mutate box.ind
	incq	%rdx
	cmpq	$1048576, %rdx          # imm = 0x100000
	jne	L80
        #loop done
...
L125:
      #ordinary exception handling

So we have bounds checks in both variants. Damn.

What actually happened is that the fast version figured out that the write to box.ind does not need to happen inside the loop, whereas the slow version dutifully writes the new index to memory on every iteration. Now, why the hell could the compiler figure this out in one case but not the other?

We know that it has to do with bounds checking (with @inbounds, the difference goes away). Due to the phase of the moon, one version got loop-rotated in a way that the memory write could obviously be pushed out of the loop, while this possibility was insufficiently obvious in the other variant.

I’ll now stop trying to read the optimizer’s mind (any better optimizer-therapists here?). Just remember: bounds checks mess up a lot of optimizations (complicated control flow! the compiler needs to guarantee that exactly the right mutations are visible and that exceptions are raised in the correct order). @inbounds is your friend.

Edit, postscript: I guess the take-away is that if your loop has multiple exits, then hoisting mutation out of it is unreliable.
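One way to make that hoisting explicit (and robust, without relying on @inbounds or the optimizer’s mood) is to keep the index in a local variable and store it back to the struct field once, after the loop. A sketch, assuming the Box definition from the first post (fill3_hoisted! is a made-up name):

```julia
const CELLS = 1024*1024

mutable struct Box
    v   :: Vector{Int}
    ind :: Int
end

function fill3_hoisted!(box, n)
    v = box.v        # local reference: no repeated field loads
    ind = 1
    for i = 1:CELLS
        v[ind] = n
        ind += 1
    end
    box.ind = ind    # single store to the field, matching fill3!'s final value
    return box
end
```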