# Performance differences when using a mutable struct that contains an array

Hi,

I found some strange performance behaviour in simple code using a mutable struct that contains an array and an index into the array as fields. In the simplified example below I compare four different functions that fill all array elements with a constant.

```julia
const CELLS = 1024*1024

mutable struct Box
    v   :: Vector{Int}
    ind :: Int
end

function fill1!(box, n)
    # code without helper function runs rather fast
    box.ind = 0
    for i = 1:CELLS
        box.ind += 1
        box.v[box.ind] = n
    end
end

function fill2!(box, n)
    # calling a small function to fill one element
    # -> this takes much longer
    box.ind = 0
    for i = 1:CELLS
        fill_elem!(box, n)
    end
end

function fill_elem!(box, n)
    box.ind += 1
    box.v[box.ind] = n
end

function fill3!(box, n)
    # subtle change to fill1!, but about double the runtime
    box.ind = 1
    for i = 1:CELLS
        box.v[box.ind] = n
        box.ind += 1
    end
end

function fill4!(v, n)
    # avoiding the mutable struct is the fastest
    ind = 0
    for i = 1:CELLS
        ind += 1
        v[ind] = n
    end
end

function main(method)  # benchmark
    v = zeros(Int, CELLS)
    box = Box(v, 1)
    if method == 1
        for n = 1:1000
            fill1!(box, n)
        end
    elseif method == 2
        for n = 1:1000
            fill2!(box, n)
        end
    elseif method == 3
        for n = 1:1000
            fill3!(box, n)
        end
    else
        for n = 1:1000
            fill4!(v, n)
        end
    end
end
```

In this example I observe very different run-times, depending on which fill function is used. Probably I'm missing something simple, but I cannot understand the reason for the performance differences, in particular why using the small helper function makes such a difference.
In a straightforward translation to C code, all four variants result in comparable run-times.

Of course I can always store the array and the index in separate variables (avoiding these issues), but what is wrong with putting them into a single struct?

```julia
@time main(1)  # 0.634483 seconds (10.90 k allocations: 8.557 MiB, 0.26% gc time)
@time main(2)  # 3.529522 seconds (7 allocations: 8.000 MiB)
@time main(3)  # 1.214788 seconds (7 allocations: 8.000 MiB)
@time main(4)  # 0.509509 seconds (7 allocations: 8.000 MiB, 0.23% gc time)
```

Note, this is not about the most efficient way to initialize an array. The code above is a contrived (simplified) example of some more involved code where I stumbled across these unexpected performance issues.

1. Rather than writing your own benchmarking code, you should use BenchmarkTools.jl, which will give more precise answers for less work:
```julia
julia> using BenchmarkTools

julia> @btime fill1!($box, 1)
  741.349 μs (0 allocations: 0 bytes)

julia> @btime fill2!($box, 1)
  6.554 ms (0 allocations: 0 bytes)

julia> @btime fill3!($box, 1)
  1.338 ms (0 allocations: 0 bytes)

julia> @btime fill4!($(box.v), 1)
  638.872 μs (0 allocations: 0 bytes)
```
1. It looks like you're seeing some cost due to function calls. The compiler can inline functions automatically, but it doesn't always do so. It can occasionally help to provide a hint that a function should be inlined (if that function is called very often within a fast loop):
```julia
julia> @inline function fill_elem!(box, n)
           box.ind += 1
           box.v[box.ind] = n
       end
fill_elem! (generic function with 1 method)
```

with that change, `fill2!` becomes as fast as `fill1!`:

```julia
julia> @btime fill2!($box, 1)
  740.249 μs (0 allocations: 0 bytes)
```

You may also want to play around with `@inbounds` to avoid bounds checks on your array accesses (but only when you're absolutely sure the bounds will never be violated).
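As a sketch of how the two hints combine (the `Box` definition is repeated here so the snippet is self-contained; the `@inbounds` annotation is only safe because the loop bound is tied to `length(box.v)`):

```julia
mutable struct Box
    v   :: Vector{Int}
    ind :: Int
end

# Hinted helper: @inline asks the compiler to inline the call,
# and @inbounds skips the bounds check on the array write.
@inline function fill_elem!(box, n)
    box.ind += 1
    @inbounds box.v[box.ind] = n
end

function fill2!(box, n)
    box.ind = 0
    for i = 1:length(box.v)   # loop bound guarantees in-bounds writes
        fill_elem!(box, n)
    end
end

box = Box(zeros(Int, 16), 0)
fill2!(box, 3)
println(all(x -> x == 3, box.v))  # true
```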


Thanks a lot!

It never came to my mind that inlining could be the problem. In fact, `@inline` also solves the issue with my larger code.

Is there any rule of thumb for when inlining hints should be given to the compiler? Or is this just trial and error? I thought inlining hints would be important for larger functions, but `fill_elem!` seems rather inconspicuous.

I never came across this issue before in my Julia code - but then, maybe I never realized that with this simple macro I could have gained big speed-ups.

On a side note, any ideas why `fill3!` is so much slower than `fill1!`? It just reverses the order in which the index is incremented, so it should be at most a single assignment longer.

If there were a truly simple rule of thumb, then we'd make it a heuristic in the compiler for automatic inlining, so it's hard to come up with something that doesn't require some analysis.
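One way to take the guesswork out of it is to inspect the optimized IR: if the callee's name still shows up in the typed code of the caller, the call was not inlined. A minimal sketch (the function names here are made up for illustration):

```julia
# A tiny callee and a caller; with the default heuristics a
# function this small is normally inlined automatically.
small_step(x) = 2x + 1
caller(x) = small_step(x) - 1

# code_typed returns the optimized, typed IR for the given
# argument types; search its string form for the callee's name.
ir = string(code_typed(caller, (Int,)))
println(occursin("small_step", ir) ? "not inlined" : "inlined")
```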


If you add `@inbounds` annotations to the loops, you get a performance boost in both `fill1!` and `fill3!`, and they have the same performance.


If you add `@inbounds` then the difference goes away.

During compilation, LLVM attempts to prove that the accesses are always in bounds; if it can prove so, then no code for bounds checking is emitted. LLVM also attempts to rotate the loops around, etc. The static bounds checking is, however, very, very finicky. Afaik it is more of a side effect of general optimizations (loop hoisting, dead code elimination), and the existing LLVM pass for this purpose (inductive range check elimination) is not even used in Julia (why?).

And by finicky I mean "depends on the phase of the moon". So, what Julia version are you using? On my 0.6.2, `fill1!` was double the speed of `fill3!`.

So that's what I thought was happening at first. But then I decided to look at the `@code_native` output to see what is going on. Let's see the inner loop.

Fast version:

```
L80:
	leaq	1(%rdx), %rcx
	cmpq	%rax, %rdx
	jae	L125                   # out-of-bounds error
	movq	%rsi, (%r10,%rdx,8)    # mutate box.v[box.ind]
	cmpq	$1048575, %rdx         # imm = 0xFFFFF
	movq	%rcx, %rdx
	jne	L80
# loop done
...
L125:
	movq	%rcx, 8(%r9)           # mutate box.ind
# ordinary exception handling
...
```

Slow version:

```
L80:
	cmpq	%rax, %rdx
	jae	L125                   # out-of-bounds error
	movq	%rsi, (%r10,%rdx,8)    # mutate box.v[box.ind]
	leaq	2(%rdx), %rcx
	movq	%rcx, 8(%r9)           # mutate box.ind
	incq	%rdx
	cmpq	$1048576, %rdx         # imm = 0x100000
	jne	L80
# loop done
...
L125:
# ordinary exception handling
```

So we have bounds checks in both variants. Damn.

The thing that actually happened is that the fast version figured out that writing to `box.ind` does not need to be done inside the loop, whereas the slow version dutifully writes the new index to memory in every iteration. Now, why the hell could the compiler figure this out in one case but not the other?

We know that it has to do with bounds checking (with `@inbounds`, the difference goes away). Due to the phase of the moon, one version got loop-rotated in a way that the memory write could obviously be pushed out of the loop, while this possibility was insufficiently obvious in the other variant.

I'll now stop trying to read the optimizer's mind (any better optimizer-therapist here?). Just remember: bounds checks mess up a lot of optimizations (complicated control flow! the compiler needs to guarantee that exactly the right mutations are visible and that exceptions are raised in the correct order). `@inbounds` is your friend.
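For completeness, here is a sketch of the annotated variant (the name `fill3_inbounds!` and the repeated `Box` definition are just for illustration; the annotation is only safe because `box.v` has exactly `CELLS` elements):

```julia
const CELLS = 1024*1024

mutable struct Box
    v   :: Vector{Int}
    ind :: Int
end

function fill3_inbounds!(box, n)
    box.ind = 1
    # @inbounds removes the bounds check, so LLVM is free to hoist
    # the store to box.ind out of the loop in either ordering.
    @inbounds for i = 1:CELLS
        box.v[box.ind] = n
        box.ind += 1
    end
end

box = Box(zeros(Int, CELLS), 1)
fill3_inbounds!(box, 5)
println(box.ind)  # CELLS + 1 after the loop
```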

Edit, postscript: I guess the take-away is that if your loop has multiple exits, then hoisting mutation out of it is unreliable.
