Immutable struct with an Array slower than a constant global Array

Hello,
I’m trying to write a Lattice Boltzmann Solver in Julia. My initial version strongly relies on global constants (mostly numeric values and Arrays). I’ve tried to use immutable structures that hold the state of my simulator, but found a significant difference in performance.

My minimal working example essentially computes a sum over one of the array’s dimensions.
I see that when the function operates on a constant global it is significantly faster than when the same data are accessed through a structure’s field.

I am comparing four approaches: moments1() and moments2() use nested loops but differ in the array they access (a structure’s field vs. a constant global array); moments3() and moments4() are analogous to the former two, but use the Einsum package instead of nested loops.

const H  = 20
const W  = 20
const f  = zeros(H, W, 9)
const rho = ones(H, W)

struct Lattice
    f  :: Array{Float64, 3}
end

const lattice = Lattice(f)

function moments1()
    lf = lattice.f
    let n, m, i
        @inbounds for m=1:W, n=1:H
                local a = 0.0
                for i=1:9 a += lf[n, m, i] end
                rho[n, m] = a
        end
    end
    return nothing
end

function moments2()
    let n, m, i
        @inbounds for m=1:W, n=1:H
                local a = 0.0
                for i=1:9 a += f[n, m, i] end
                rho[n, m] = a
        end
    end
    return nothing
end

using Einsum
function moments3()
    lf = lattice.f
    @einsum rho[n, m] = lf[n, m, i]
    return nothing
end
function moments4()
    @einsum rho[n, m] = f[n, m, i]
    return nothing
end

using BenchmarkTools
@btime moments1() # 423.995 ns (0 allocations: 0 bytes)
@btime moments2() # 214.657 ns (0 allocations: 0 bytes)
@btime moments3() # 1.471 μs (0 allocations: 0 bytes)
@btime moments4() # 214.623 ns (0 allocations: 0 bytes)

I inspected the LLVM code of all four methods, and of course they differ, but I don’t know the cause.
Is accessing the field really that big a deal, or is it something else that I am missing?

Something in there prevents SIMD, at least that’s what I see on my side.
I would avoid globals entirely and pass those arrays as function parameters, because I find that globals (other than true constants like gravity) make code harder to read, performance issues notwithstanding.


Yes, that was my first instinct (to pass the array as a parameter), but the results are the same:

function moments3(lf)
    @einsum rho[n, m] = lf[n, m, i]
    return nothing
end

@btime moments1(f) # 424.412 ns (0 allocations: 0 bytes)
@btime moments2() # 214.563 ns (0 allocations: 0 bytes)
@btime moments3(f) # 1.471 μs (0 allocations: 0 bytes)
@btime moments4() # 214.641 ns (0 allocations: 0 bytes)

That still has to write to the global rho though, right? What about

function moments5!(rho, lf)
    @einsum rho[n, m] = lf[n, m, i]
    return nothing
end

I’ve also tried to reorganize the memory access pattern, in case it was preventing the compiler from using SIMD.
Strangely, now both approaches (constant global vs. array as parameter) take similar time (the global is slightly faster), but both are slower than with the previous memory layout and the global.

const H  = 20
const W  = 20
const f  = zeros(9, H, W)
const rho = ones(H, W)

struct Lattice
    f  :: Array{Float64, 3}
end

const lattice = Lattice(f)

function moments1(lf)
    let n, m, i
        @inbounds for m=1:W, n=1:H
                local a = 0.0
                for i=1:9 a += lf[i, n, m] end
                rho[n, m] = a
        end
    end
    return nothing
end

function moments2()
    let n, m, i
        @inbounds for m=1:W, n=1:H
                local a = 0.0
                for i=1:9 a += f[i, n, m] end
                rho[n, m] = a
        end
    end
    return nothing
end

using Einsum
function moments3(lf)
    @einsum rho[n, m] = lf[i, n, m]
    return nothing
end
function moments4()
    @einsum rho[n, m] = f[i, n, m]
    return nothing
end

using BenchmarkTools
@btime moments1() # 391.292 ns (0 allocations: 0 bytes)
@btime moments2(f) # 387.315 ns (0 allocations: 0 bytes)
@btime moments3() # 1.467 μs (0 allocations: 0 bytes)
@btime moments4(f) # 387.108 ns (0 allocations: 0 bytes)

Thanks, but it hasn’t changed the results.

Are you certain you ran these in a new session? The moments3 function you’ve written up there takes an argument, while the one in your benchmark does not. The reverse is the case for moments4, and likewise for moments1 and moments2.

I also no longer see any use of your lattice global variable; where did that part of the comparison go?

Also, why have the let blocks there at all? They don’t seem necessary, and they introduce a new scope, which may interfere with things as well.

The results are OK; I was editing the reply and attached the arguments to the wrong functions (moments2 and moments4 instead of moments1 and moments3). Here is the last code that I ran:

const H  = 20
const W  = 20
const f  = zeros(H, W, 9)
const rho = ones(H, W)

struct Lattice
    f  :: Array{Float64, 3}
end

const lattice = Lattice(f)

function moments1!(rho, lattice)
    lf = lattice.f
    @inbounds for m=1:W, n=1:H
        local a = 0.0
        for i=1:9 a += lf[n, m, i] end
        rho[n, m] = a
    end
    return nothing
end

function moments2!(rho)
    @inbounds for m=1:W, n=1:H
        local a = 0.0
        for i=1:9 a += f[n, m, i] end
        rho[n, m] = a
    end
    return nothing
end

using Einsum
function moments3!(rho, lattice)
    lf = lattice.f
    @einsum rho[n, m] = lf[n, m, i]
    return nothing
end
function moments4!(rho)
    @einsum rho[n, m] = f[n, m, i]
    return nothing
end

using BenchmarkTools
@btime moments1!(rho, lattice) # 423.995 ns (0 allocations: 0 bytes)
@btime moments2!(rho)          # 215.017 ns (0 allocations: 0 bytes)
@btime moments3!(rho, lattice) # 1.483   μs (0 allocations: 0 bytes)
@btime moments4!(rho)          # 215.487 ns (0 allocations: 0 bytes)

It uses the lattice structure, but when I pass only lattice.f the results are the same. I’ve removed the let block; I saw that @einsum creates one and hoped it would help.
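For reference, the raw-array variant I mean is just this (moments1_raw! is only a name for this sketch; H and W are the consts from above):

```julia
const H = 20
const W = 20

# Same loop as moments1!, but taking the Array directly instead of the struct
function moments1_raw!(rho, lf)
    @inbounds for m = 1:W, n = 1:H
        a = 0.0
        for i = 1:9
            a += lf[n, m, i]
        end
        rho[n, m] = a
    end
    return nothing
end
```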

I am able to match your moments2 function while avoiding globals by using SizedArray from the StaticArrays package, so that the array dimensions are known at compile time:

using Einsum
using BenchmarkTools
using StaticArrays

const H  = 20
const W  = 20
const f  = zeros(H, W, 9)
const rho = ones(H, W)

function moments2()
    let n, m, i
        @inbounds for m=1:W, n=1:H
                local a = 0.0
                for i=1:9 a += f[n, m, i] end
                rho[n, m] = a
        end
    end
    return nothing
end

# This function doesn't use any globals
function moments1!(rho, f)
    @inbounds for m ∈ axes(f, 2), n ∈ axes(f, 1)
        a = zero(eltype(f))
        for i ∈ axes(f, 3)
            a += f[n, m, i]
        end
        rho[n, m] = a
    end
    return nothing
end

@btime moments2()  #  276.107 ns (0 allocations: 0 bytes)
@btime moments1!($rho, $f)  # 1.971 μs (0 allocations: 0 bytes)

# Now try passing SizedArrays to the same function
rho_s = SizedArray{Tuple{H, W}}(rho)
f_s = SizedArray{Tuple{H, W, 9}}(f)

@btime moments1!($rho_s, $f_s)  #  277.280 ns (0 allocations: 0 bytes)

That’s interesting, thank you. Comparing the LLVM output of the two, the code is almost identical. So can the Julia compiler figure out the size of a constant global Array?

I’m not sure it can, and even if it can, lattice.f should access the exact same constant global array; even if .f were annotated abstractly as ::Array, the const lattice can’t change .f, so its type is inferred. There really shouldn’t be a difference. As for the original code, the @code_warntype outputs are nearly identical, and if you edit moments2 to make a local ff = f and do a += ff[n, m, i], the @code_warntype only differs in the first line. Yet the @code_llvm turns out different, and moments1 is slower.
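Concretely, the edited moments2 I mean looks like this (a sketch reusing the original globals; moments2_ff is just a name for this variant):

```julia
const H   = 20
const W   = 20
const f   = zeros(H, W, 9)
const rho = ones(H, W)

# moments2, but reading through a local alias ff of the const global f
function moments2_ff()
    ff = f
    @inbounds for m = 1:W, n = 1:H
        a = 0.0
        for i = 1:9
            a += ff[n, m, i]
        end
        rho[n, m] = a
    end
    return nothing
end
```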


I wonder: how could one investigate whether this is intended behavior or not?

To be more precise, an Array by itself does not contain its size in its type, so I don’t expect the compiler to know the size. Is it possible that the compiler is smart enough to check the size of a const global Array where N>1? Maybe, but I’m betting the size was known from the const H, const W, and i=1:9.
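You can check at the REPL that the size really is not part of an Array’s type:

```julia
# The element type and dimensionality are part of the type; the extents are not
@assert typeof(zeros(20, 20, 9)) === Array{Float64, 3}
@assert typeof(zeros(2, 3, 4))   === Array{Float64, 3}  # different size, same type
```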

@jipolanco figured out that you can get the same performance as moments2() if you instead used no globals and put the size in the array’s type, so a good hypothesis is that moments2() was also optimized given compile-time size. It’s just strange that the nearly identical moments1() wasn’t.

Inscrutable compilation optimizations aside, I think the best takeaway here is that you can put compile-time information in types instead of const globals for the same optimization benefit. In your case the const globals are inputs, so to work on an array with a different size and values, you would have to manually edit the script, restart the session, rerun the script, and recompile all methods. In the non-global version, you only have to recompile a specific method for new SizedArrays. Don’t put information in types if they should be in instances, though; if you’re generating arbitrarily many types rather than reusing a few, your performance can be overall worse because more time is spent compiling methods than reusing optimized compiled code.
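As a minimal sketch of “size in the type” without StaticArrays (SizedLattice and moments! are hypothetical names for illustration):

```julia
# The extents live in the type parameters instead of const globals
struct SizedLattice{H, W, Q}
    f::Array{Float64, 3}
end

# H, W, and Q are compile-time constants inside any method
# specialized on a concrete SizedLattice{H, W, Q}
function moments!(rho, lat::SizedLattice{H, W, Q}) where {H, W, Q}
    lf = lat.f
    @inbounds for m = 1:W, n = 1:H
        a = 0.0
        for i = 1:Q
            a += lf[n, m, i]
        end
        rho[n, m] = a
    end
    return rho
end

lat = SizedLattice{20, 20, 9}(rand(20, 20, 9))
rho2 = moments!(zeros(20, 20), lat)
```

A new size means a new type and hence one new method specialization; reusing a few sizes keeps that cost small.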

SizedArrays aside, it is not good practice to supply the iteration limits like this, especially in combination with @inbounds. This is highly risky.

It is good practice to use eachindex or axes, and less good (but still better than the above) to use size.
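To illustrate the risk with a hypothetical sketch: if a hard-coded limit and the actual array ever disagree, @inbounds turns the out-of-bounds access into silent memory corruption instead of a BoundsError, whereas eachindex derives the bounds from the array itself:

```julia
# Risky: the limit is only valid if length(v) >= 20, and
# @inbounds suppresses the check that would catch a mismatch
function bad_sum(v)
    s = 0.0
    @inbounds for i = 1:20
        s += v[i]
    end
    return s
end

# Safe: the bounds come from the array, so @inbounds is provably fine
function safe_sum(v)
    s = 0.0
    @inbounds for i in eachindex(v)
        s += v[i]
    end
    return s
end
```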


Without using StaticArrays, I was able to get the same performance, using Tullio.jl and LoopVectorization.jl:

using Tullio, LoopVectorization

function moments3!(rho, f)
    # @turbo is from LoopVectorization
    @turbo for m ∈ axes(f, 2), n ∈ axes(f, 1)
        a = zero(eltype(f))
        for i ∈ axes(f, 3)
            a += f[n, m, i]
        end
        rho[n, m] = a
    end
    return rho
end

moments4!(rho, f) = (@tullio rho[n, m] = f[n, m, i])
julia> @btime moments3!($rho, $f);
  183.836 ns (0 allocations: 0 bytes)

julia> @btime moments4!($rho, $f);
  196.779 ns (0 allocations: 0 bytes)

With SizedArrays there are further gains:

julia> @btime moments3!($rho_s, $f_s);
  145.838 ns (0 allocations: 0 bytes)

julia> @btime moments4!($rho_s, $f_s);
  178.248 ns (0 allocations: 0 bytes)

BTW, and this is more of a style thing: it’s often advantageous to return the modified object from the method (see the return statement in moments3!). For example, it was a bit inconvenient to compare the outputs of the different functions, but with return rho it’s just

julia> moments3!(rho, f) == moments4!(rho, f)
true

As far as I know, Tullio.jl is far more actively maintained than Einsum.jl, and more performant, so it should be preferred. It also supports multi-threading, and uses LoopVectorization for better SIMD.
