Immutable struct with an Array slower than a constant global Array

Hello,
I’m trying to write a Lattice Boltzmann Solver in Julia. My initial version strongly relies on global constants (mostly numeric values and Arrays). I’ve tried to use immutable structures that hold the state of my simulator, but found a significant difference in performance.

My minimal working example boils down to computing a sum over one of an array’s dimensions.
I see that when the function operates on a constant global array it is significantly faster than when the same data is accessed through a structure’s field.

I am comparing four approaches: `moments1()` and `moments2()` use nested loops but differ in which array they access (a structure’s field vs. a constant global array); `moments3()` and `moments4()` are analogous to the former two, but use the `Einsum` package instead of nested loops.

```julia
const H   = 20
const W   = 20
const f   = zeros(H, W, 9)
const rho = ones(H, W)

struct Lattice
    f :: Array{Float64, 3}
end

const lattice = Lattice(f)

function moments1()
    lf = lattice.f
    let n, m, i
        @inbounds for m = 1:W, n = 1:H
            local a = 0.0
            for i = 1:9 a += lf[n, m, i] end
            rho[n, m] = a
        end
    end
    return nothing
end

function moments2()
    let n, m, i
        @inbounds for m = 1:W, n = 1:H
            local a = 0.0
            for i = 1:9 a += f[n, m, i] end
            rho[n, m] = a
        end
    end
    return nothing
end

using Einsum
function moments3()
    lf = lattice.f
    @einsum rho[n, m] = lf[n, m, i]
    return nothing
end
function moments4()
    @einsum rho[n, m] = f[n, m, i]
    return nothing
end

using BenchmarkTools
@btime moments1() # 423.995 ns (0 allocations: 0 bytes)
@btime moments2() # 214.657 ns (0 allocations: 0 bytes)
@btime moments3() # 1.471 μs   (0 allocations: 0 bytes)
@btime moments4() # 214.623 ns (0 allocations: 0 bytes)
```

I inspected the LLVM code of all four methods, and of course they differ, but I don’t know the cause.
Is accessing the field really that big a deal, or is it something else that I am missing?

Something in there prevents SIMD, at least that’s what I see on my side.
I would avoid using any kind of globals and pass those arrays as function parameters, because I find that globals (other than true constants like gravity) make code harder to read, performance issues notwithstanding.


Yes, that was my first instinct (to pass the array as a parameter), but the results are the same:

```julia
function moments3(lf)
    @einsum rho[n, m] = lf[n, m, i]
    return nothing
end

@btime moments1(f) # 424.412 ns (0 allocations: 0 bytes)
@btime moments2()  # 214.563 ns (0 allocations: 0 bytes)
@btime moments3(f) # 1.471 μs   (0 allocations: 0 bytes)
@btime moments4()  # 214.641 ns (0 allocations: 0 bytes)
```

That still has to write to the global `rho` though, right? What about

```julia
function moments5!(rho, lf)
    @einsum rho[n, m] = lf[n, m, i]
    return nothing
end
```

I’ve also tried reorganizing the memory access pattern, in case it was preventing the compiler from using SIMD.
The strange thing is, now both approaches (constant global vs. array as parameter) take similar time (the global is slightly faster). However, both are slower than the previous memory layout with the global.

```julia
const H   = 20
const W   = 20
const f   = zeros(9, H, W)
const rho = ones(H, W)

struct Lattice
    f :: Array{Float64, 3}
end

const lattice = Lattice(f)

function moments1(lf)
    let n, m, i
        @inbounds for m = 1:W, n = 1:H
            local a = 0.0
            for i = 1:9 a += lf[i, n, m] end
            rho[n, m] = a
        end
    end
    return nothing
end

function moments2()
    let n, m, i
        @inbounds for m = 1:W, n = 1:H
            local a = 0.0
            for i = 1:9 a += f[i, n, m] end
            rho[n, m] = a
        end
    end
    return nothing
end

using Einsum
function moments3(lf)
    @einsum rho[n, m] = lf[i, n, m]
    return nothing
end
function moments4()
    @einsum rho[n, m] = f[i, n, m]
    return nothing
end

using BenchmarkTools
@btime moments1()  # 391.292 ns (0 allocations: 0 bytes)
@btime moments2(f) # 387.315 ns (0 allocations: 0 bytes)
@btime moments3()  # 1.467 μs   (0 allocations: 0 bytes)
@btime moments4(f) # 387.108 ns (0 allocations: 0 bytes)
```

Thanks, but it hasn’t changed the results.

Are you certain you ran these in a new session? The `moments3` function you’ve written up there takes an argument, while the one in your benchmark does not. The reverse is the case for `moments4`, and likewise for `moments1` and `moments2`.

I also don’t see any use of your `lattice` global variable anymore; where did that part of your comparison go?

Also, why have the `let` blocks there at all? They don’t seem necessary and introduce a new scope, which may interfere with things as well.

The results are OK; I was editing the reply and put the arguments on the wrong functions (`moments2` and `moments4` instead of `moments1` and `moments3`). Here is the latest code that I ran:

```julia
const H   = 20
const W   = 20
const f   = zeros(H, W, 9)
const rho = ones(H, W)

struct Lattice
    f :: Array{Float64, 3}
end

const lattice = Lattice(f)

function moments1!(rho, lattice)
    lf = lattice.f
    @inbounds for m = 1:W, n = 1:H
        local a = 0.0
        for i = 1:9 a += lf[n, m, i] end
        rho[n, m] = a
    end
    return nothing
end

function moments2!(rho)
    @inbounds for m = 1:W, n = 1:H
        local a = 0.0
        for i = 1:9 a += f[n, m, i] end
        rho[n, m] = a
    end
    return nothing
end

using Einsum
function moments3!(rho, lattice)
    lf = lattice.f
    @einsum rho[n, m] = lf[n, m, i]
    return nothing
end
function moments4!(rho)
    @einsum rho[n, m] = f[n, m, i]
    return nothing
end

using BenchmarkTools
@btime moments1!(rho, lattice) # 423.995 ns (0 allocations: 0 bytes)
@btime moments2!(rho)          # 215.017 ns (0 allocations: 0 bytes)
@btime moments3!(rho, lattice) # 1.483 μs   (0 allocations: 0 bytes)
@btime moments4!(rho)          # 215.487 ns (0 allocations: 0 bytes)
```

It does use the `lattice` structure, but when I pass only `lattice.f` the results are the same. I’ve removed the `let` block; I saw that `@einsum` creates one and hoped it would help.

I am able to match your `moments2` function while avoiding globals by using `SizedArray` from the StaticArrays package, so that the array dimensions are known at compile time:

```julia
using Einsum
using BenchmarkTools
using StaticArrays

const H   = 20
const W   = 20
const f   = zeros(H, W, 9)
const rho = ones(H, W)

function moments2()
    let n, m, i
        @inbounds for m = 1:W, n = 1:H
            local a = 0.0
            for i = 1:9 a += f[n, m, i] end
            rho[n, m] = a
        end
    end
    return nothing
end

# This function doesn't use any globals
function moments1!(rho, f)
    @inbounds for m ∈ axes(f, 2), n ∈ axes(f, 1)
        a = zero(eltype(f))
        for i ∈ axes(f, 3)
            a += f[n, m, i]
        end
        rho[n, m] = a
    end
    return nothing
end

@btime moments2()           # 276.107 ns (0 allocations: 0 bytes)
@btime moments1!($rho, $f)  # 1.971 μs   (0 allocations: 0 bytes)

# Now try passing SizedArrays to the same function
rho_s = SizedArray{Tuple{H, W}}(rho)
f_s   = SizedArray{Tuple{H, W, 9}}(f)

@btime moments1!($rho_s, $f_s)  # 277.280 ns (0 allocations: 0 bytes)
```

That’s interesting, thank you. Comparing the LLVM output, the code is almost identical. So, can the Julia compiler figure out the size of a constant global Array?

I’m not sure it can, and even if it can, `lattice.f` should access the exact same constant global array; even if `.f` were annotated abstractly as `::Array`, the `const lattice` can’t change `.f`, so its type is inferred. There really shouldn’t be a difference. Speaking of the original code, the `@code_warntype` outputs are nearly identical, and if you edit `moments2` to bind a local `ff = f` and do `a += ff[n, m, i]`, the `@code_warntype` only differs in the first line. Yet the `@code_llvm` turns out different and `moments1` is slower.
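For concreteness, here is what the `ff = f` edit described above would look like, as a sketch against the original first code block (same `const` definitions assumed):

```julia
const H   = 20
const W   = 20
const f   = zeros(H, W, 9)
const rho = ones(H, W)

# Sketch of the edit described above: bind the const global to a local name.
# `@code_warntype` then differs from the struct-field version only in its
# first line, yet the emitted LLVM still comes out different.
function moments2_aliased()
    ff = f  # local alias of the constant global array
    @inbounds for m = 1:W, n = 1:H
        a = 0.0
        for i = 1:9
            a += ff[n, m, i]
        end
        rho[n, m] = a
    end
    return nothing
end
```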


I wonder: how could one investigate whether this is intended behavior or not?

To be more precise, an `Array` by itself does not contain its size in its type, so I don’t expect the compiler to know the size. Is it possible that the compiler is smart enough to check the size of a `const` global Array where `N > 1`? Maybe, but I’d bet the sizes were already known from `const H`, `const W`, and the literal `i = 1:9`.

@jipolanco figured out that you can get the same performance as `moments2()` if you instead used no globals and put the size in the array’s type, so a good hypothesis is that `moments2()` was also optimized given compile-time size. It’s just strange that the nearly identical `moments1()` wasn’t.

Inscrutable compilation optimizations aside, I think the best takeaway here is that you can put compile-time information in types instead of `const` globals for the same optimization benefit. In your case the `const` globals are inputs, so to work on an array with a different size and values, you would have to manually edit the script, restart the session, rerun the script, and recompile all methods. In the non-`global` version, you only have to recompile a specific method for new `SizedArray`s. Don’t put information in types if it belongs in instances, though; if you’re generating arbitrarily many types rather than reusing a few, your overall performance can be worse because more time is spent compiling methods than reusing optimized compiled code.
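As a sketch of that takeaway (the `SizedLattice` name and its parameters `H`, `W`, `Q` are my own, not from the thread), the lattice dimensions can be carried in the struct’s type parameters instead of in `const` globals:

```julia
# Hypothetical sketch: carry the lattice dimensions as type parameters so
# each method specializes on (H, W, Q) at compile time, with no globals.
struct SizedLattice{H,W,Q}
    f::Array{Float64,3}
    function SizedLattice{H,W,Q}(f) where {H,W,Q}
        size(f) == (H, W, Q) || throw(DimensionMismatch("expected size ($H, $W, $Q)"))
        return new{H,W,Q}(f)
    end
end

function moments!(rho, lat::SizedLattice{H,W,Q}) where {H,W,Q}
    lf = lat.f
    # H, W, Q are compile-time constants here, like the const globals were.
    @inbounds for m = 1:W, n = 1:H
        a = 0.0
        for i = 1:Q
            a += lf[n, m, i]
        end
        rho[n, m] = a
    end
    return rho
end
```

A new lattice size then only triggers recompilation of `moments!` for that one parameter combination, e.g. `moments!(zeros(20, 20), SizedLattice{20,20,9}(rand(20, 20, 9)))`.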

SizedArrays aside, it is not good practice to supply the iteration limits like this, especially in combination with `@inbounds`. This is highly risky.

It is good practice to use `eachindex` or `axes`, and less good (but still better than the above) to use `size`.
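To make that concrete, here is a sketch (my own illustration, not code from the thread) of the `axes`-based pattern with an explicit shape check, so `@inbounds` can never run past a smaller array:

```julia
# Sketch: derive loop ranges from the arrays themselves rather than from
# global H/W limits; with hard-coded 1:W, 1:H plus @inbounds, passing a
# smaller array would silently read and write out of bounds.
function moments_safe!(rho, f)
    size(rho) == size(f)[1:2] || throw(DimensionMismatch("rho must match f's first two dims"))
    @inbounds for m in axes(f, 2), n in axes(f, 1)
        a = zero(eltype(f))
        for i in axes(f, 3)
            a += f[n, m, i]
        end
        rho[n, m] = a
    end
    return rho
end
```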


Without using StaticArrays, I was able to get the same performance using Tullio.jl and LoopVectorization.jl:

```julia
using Tullio, LoopVectorization

function moments3!(rho, f)
    # @turbo is from LoopVectorization
    @turbo for m ∈ axes(f, 2), n ∈ axes(f, 1)
        a = zero(eltype(f))
        for i ∈ axes(f, 3)
            a += f[n, m, i]
        end
        rho[n, m] = a
    end
    return rho
end

moments4!(rho, f) = (@tullio rho[n, m] = f[n, m, i])
```
```julia
julia> @btime moments3!($rho, $f);
  183.836 ns (0 allocations: 0 bytes)

julia> @btime moments4!($rho, $f);
  196.779 ns (0 allocations: 0 bytes)
```

With SizedArrays there are further gains:

```julia
julia> @btime moments3!($rho_s, $f_s);
  145.838 ns (0 allocations: 0 bytes)

julia> @btime moments4!($rho_s, $f_s);
  178.248 ns (0 allocations: 0 bytes)
```

BTW, and this is more of a style thing: it’s often advantageous to return the modified object from the method (see the `return` statement in `moments3!`). For example, it was a bit inconvenient to compare the outputs of the different functions, but with `return rho` it’s just

```julia
julia> moments3!(rho, f) == moments4!(rho, f)
true
```

As far as I know, Tullio.jl is far more actively maintained than Einsum.jl, and more performant, so it should be preferred. It also supports multi-threading and uses LoopVectorization for better SIMD.
