Performance of mutable static arrays and compilation cost

PetrKryslUCSD · July 16, 2018, 6:59pm

Here is an example. Regular matrices and static arrays are used in identical loops. The static arrays are orders of magnitude slower. The code using static arrays also allocates. Any idea what’s going on?

The code:

module mbuffers13

using FinEtools
using StaticArrays

function test_tda(N, NLOOP)
    Kedim = N
    mdim = 2
    gradN = rand(N, 2)
    Ke = fill(0.0, Kedim, Kedim)
    multiplier = 2.0
    t = @elapsed for loop = 1:NLOOP
        for nx = 1:Kedim # Do: Ce  =  Ce + gradN*((Jac*w[j]))*gradN' ;
            @inbounds for px = 1:mdim
                a = (multiplier)*gradN[nx, px]
                @inbounds for mx = 1:nx # only the upper triangle
                    Ke[mx, nx] +=  gradN[mx, px] * a
                end
            end
        end
    end 
    return t ./ NLOOP
end 

function test_tsa(N, NLOOP)
    Kedim = N
    mdim = 2
    gradN = MMatrix{N, 2, Float64}(rand(N, 2))
    Ke = MMatrix{N, N, Float64}(fill(0.0, Kedim, Kedim))
    multiplier = 2.0
    t = @elapsed for loop = 1:NLOOP
        for nx = 1:Kedim # Do: Ce  =  Ce + gradN*((Jac*w[j]))*gradN' ;
            @inbounds for px = 1:mdim
                a = (multiplier)*gradN[nx, px]
                @inbounds for mx = 1:nx # only the upper triangle
                    Ke[mx, nx] +=  gradN[mx, px] * a
                end
            end
        end
    end 
    return t ./ NLOOP
end 

function test(N)
    println("N = $(N)")
    NLOOP = 10000
    @time tda = test_tda(N, NLOOP)
    @time tsa = test_tsa(N, NLOOP)
    vec([tda tsa])
end

end

using .mbuffers13

NS = [3, 9, 16, 25, 36] # , 225, 900
ts = []
for N in NS
    push!(ts, mbuffers13.test(N))
end 
@show ts

The results:

julia> include("test/test_buffers5.jl")
N = 3
  0.000173 seconds (2 allocations: 288 bytes)
  0.504165 seconds (1.55 M allocations: 54.700 MiB, 2.43% gc time)
N = 9
  0.001661 seconds (2 allocations: 960 bytes)
  0.411852 seconds (5.09 M allocations: 86.726 MiB, 2.85% gc time)
N = 16
  0.002520 seconds (2 allocations: 2.516 KiB)
  0.970008 seconds (14.67 M allocations: 239.903 MiB, 1.46% gc time)
N = 25
  0.003897 seconds (2 allocations: 5.594 KiB)
  2.162746 seconds (34.20 M allocations: 549.774 MiB, 1.07% gc time)
N = 36
  0.007959 seconds (2 allocations: 10.906 KiB)
 15.484236 seconds (69.35 M allocations: 1.083 GiB, 0.34% gc time)
ts = Any[[1.65101e-8, 1.11647e-5], [1.6589e-7, 2.52754e-5], [2.5164e-7, 6.97781e-5], [3.8924e-7, 0.000139628], [7.94341e-7, 0.000328319]]

Obtained with Version 0.7.0-beta2.12 on Windows.

kristoffer.carlsson · July 16, 2018, 7:05pm

If possible, please try to make examples that don’t require too many external packages. Edit: (actually doesn’t require FinEtools to run)

Anyway, the problem here is likely that N is a runtime parameter so the size of the MMatrix is not known statically. Either use a function barrier or change so that the size is a type parameter, e.g. function test_tsa(::Val{N}, NLOOP) where {N}

Fixing the type stability, I get e.g.

julia> @btime test_tda(8, 10000)
  718.643 μs (2 allocations: 832 bytes)
7.87577e-8

julia> @btime test_tsa(Val(8), 10000)
  454.158 μs (1 allocation: 144 bytes)
5.0467899999999994e-8

Also, note that you can do

    gradN = rand(MMatrix{N, 2, Float64})
    Ke = zero(MMatrix{N, N, Float64})

Elrod · July 16, 2018, 7:08pm

Mutable static arrays will allocate a little anyway, because they’re mutable structs. But the type instability is the #1 issue.

kristoffer.carlsson · July 16, 2018, 8:37pm

Julia 0.7 is much better at eliminating the allocation as long as the object doesn’t escape the function.

Elrod · July 16, 2018, 8:53pm

Oh wow, can’t believe I hadn’t noticed this yet:

julia> m = randn(8, 20);

julia> n = randn(8, 20);

julia> @inline function dot2(a, b)
           out = zero(eltype(a))
           @inbounds @simd for i ∈ eachindex(a)
               out += a[i] * b[i]
           end
           out
       end
dot2 (generic function with 1 method)

julia> function test_alloc(m, n)
           out = zero(eltype(m))
           @boundscheck @assert size(m) == size(n)
           @inbounds for i ∈ 1:size(m,2)
               @views out += dot2(m[:,i], n[:,i])
           end
           out
       end
test_alloc (generic function with 1 method)

julia> @benchmark @inbounds test_alloc($m, $n)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     77.093 ns (0.00% GC)
  median time:      77.795 ns (0.00% GC)
  mean time:        78.936 ns (0.00% GC)
  maximum time:     154.300 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     970

Views unfortunately still allocate with base dot product, but they don’t when the function is inlined. (I get 40 allocations when I replace @inline with @noinline).

I’m going to have to experiment with MVectors and MMatrices. I’d been using the likes of Base.Cartesian.@nexprs a lot. I think just creating the arrays will normally be a lot cleaner.

PetrKryslUCSD · July 17, 2018, 4:36am

Sorry, I forgot to delete the (unnecessary, the example did not need it) using FinEtools.

Thanks for the solution: I did not realize that for the static array to work N had to be a compile-time constant.

PetrKryslUCSD · July 17, 2018, 4:52am

Just to clarify: when the size of the MMatrix is a runtime variable, the compiler does not complain, the code runs, but the compiler cannot place the object on the stack (which makes the matrix much slower than a regular array). Is that correct? If so, what precisely is the difference in the composition of the MMatrix data structure. Does it become resizable when its size is not a type parameter? What exactly causes the trouble?

Elrod · July 17, 2018, 5:25am

If you create a function barrier, there’d be a single dynamic dispatch, and everything else would be fine (would definitely have to allocate for the creation of the matrix, though).

Without it, your code is not type stable.
It also is not a small union of N values, so I think inference just fails on the matrix, but you’d have to check with @code_warntype (maybe it still knows getindex returns Float64?).
It’s that type instability that is the problem, not the data structure itself (which is just a miracle struct holding an NTuple). With type unstable code, dispatches are resolved at run time, which also means the compiler doesn’t know what that function call returns, propagating that type instability.

And you definitely will not get any of the speed benefits, because those require the compiler to use the array’s dimensions to make informed decisions in automatically unrolling, vectorization, etc.

mohamed82008 · July 17, 2018, 5:41am

In case it wasn’t clear so far, the size of the MMatrix is part of its type. In particular, an M by N MMatrix of element type T has a type MArray{Tuple{M,N}, T, 2, M*N}, 2 being the dimension of the MArray since it is a matrix.

M = 2; N = 3; T = Float64
zero(MArray{Tuple{M,N}, T, 2, M*N})

PetrKryslUCSD · July 17, 2018, 5:47am

Precisely. So what is the difference between gradN and gradN2?

julia> N=5
5
julia> using StaticArrays
julia> gradN = MMatrix{N, 2, Float64}(rand(N, 2))
5×2 MArray{Tuple{5,2},Float64,2,10}:
 0.614984   0.000340008
 0.99798    0.522036
 0.31254    0.968401
 0.595764   0.912941
 0.0620653  0.70537
julia> gradN2 = MMatrix{5, 2, Float64}(rand(5, 2))
5×2 MArray{Tuple{5,2},Float64,2,10}:
 0.693747  0.617432
 0.573811  0.662767
 0.222344  0.859323
 0.301762  0.570657
 0.908096  0.601919

PetrKryslUCSD · July 17, 2018, 5:55am

Another instructive experiment.

module mbuffers13

using StaticArrays

function test_tda(N, NLOOP)
    Kedim = N
    mdim = 2
    gradN = rand(N, 2)
    Ke = fill(0.0, Kedim, Kedim)
    multiplier = 2.0
    t = @elapsed for loop = 1:NLOOP
        for nx = 1:Kedim # Do: Ce  =  Ce + gradN*((Jac*w[j]))*gradN' ;
            @inbounds for px = 1:mdim
                a = (multiplier)*gradN[nx, px]
                @inbounds for mx = 1:nx # only the upper triangle
                    Ke[mx, nx] +=  gradN[mx, px] * a
                end
            end
        end
    end 
    return t ./ NLOOP
end 

function test_tsa(::Val{N}, NLOOP) where {N}
    mdim = 2
    gradN = MMatrix{N, 2, Float64}(rand(N, 2))
    Ke = MMatrix{N, N, Float64}(fill(0.0, N, N))
    multiplier = 2.0
    t = @elapsed for loop = 1:NLOOP
        for nx = 1:N # Do: Ce  =  Ce + gradN*((Jac*w[j]))*gradN' ;
            @inbounds for px = 1:mdim
                a = (multiplier)*gradN[nx, px]
                @inbounds for mx = 1:nx # only the upper triangle
                    Ke[mx, nx] +=  gradN[mx, px] * a
                end
            end
        end
    end 
    return t ./ NLOOP
end 

function test_tsa2(N, NLOOP) 
    mdim = 2
    gradN = MMatrix{N, 2, Float64}(rand(N, 2))
    Ke = MMatrix{N, N, Float64}(fill(0.0, N, N))
    multiplier = 2.0
    function multiplyem(gradN, Ke, multiplier)
        t = @elapsed for loop = 1:NLOOP
            for nx = 1:N # Do: Ce  =  Ce + gradN*((Jac*w[j]))*gradN' ;
                @inbounds for px = 1:mdim
                    a = (multiplier)*gradN[nx, px]
                    @inbounds for mx = 1:nx # only the upper triangle
                        Ke[mx, nx] +=  gradN[mx, px] * a
                    end
                end
            end
        end 
        return t
    end 
    return multiplyem(gradN, Ke, multiplier) ./ NLOOP
end 

function test(N)
    println("N = $(N)")
    NLOOP = 100000
    @time tda = test_tda(N, NLOOP)
    @time tsa = test_tsa(Val(N), NLOOP)
    @time tsa2 = test_tsa2(N, NLOOP)
    vec([tda tsa tsa2])
end

end

using .mbuffers13

NS = [3, 9, 16, 25, 36, 49, 64] # , 225, 900
ts = []
for N in NS
    push!(ts, mbuffers13.test(N))
end 
@show ts

using Gaston
set(axis="loglog", plotstyle="linespoints", linewidth=2, pointsize = 1, color = "black", xlabel = "N", ylabel = "Time [microseconds]", grid="on", title = "")
f = figure()
# TS = [1.0e6 * t[1] for t in ts] # Time in Microseconds
# plot(NS, TS, legend = "Complete triangle")
TS = [1.0e6 * t[1] for t in ts] # Time in Microseconds
plot(NS, TS, legend = "Dynamic", gpcom = """set terminal wxt font ",6" """, box = "left top")
TS = [1.0e6 * t[2] for t in ts] # Time in Microseconds
plot!(NS, TS, legend = "Static" )
TS = [1.0e6 * t[3] for t in ts] # Time in Microseconds
plot!(NS, TS, legend = "Static 2" )
figure(f)

I am guessing that somehow the internal function multiplyem makes it possible for the compiler the reason about the arguments so that the performance is not a disaster. However it is not as fast as the case where I call test_tsa(Val(N), NLOOP).

The compile times are eye-opening:

N = 3
  0.003520 seconds (2 allocations: 288 bytes)
  0.091928 seconds (99.51 k allocations: 5.204 MiB)
  0.078540 seconds (41.66 k allocations: 2.063 MiB)
N = 9
  0.023951 seconds (2 allocations: 960 bytes)
  0.345833 seconds (118.31 k allocations: 5.870 MiB)
  0.140733 seconds (41.66 k allocations: 2.064 MiB)
N = 16
  0.068175 seconds (2 allocations: 2.516 KiB)
  0.409460 seconds (161.29 k allocations: 7.350 MiB, 2.24% gc time)
  0.086349 seconds (41.66 k allocations: 2.067 MiB)
N = 25
  0.081707 seconds (2 allocations: 5.594 KiB)
  0.583046 seconds (106.73 k allocations: 5.451 MiB)
  0.151244 seconds (41.66 k allocations: 2.073 MiB)
N = 36
  0.113666 seconds (2 allocations: 10.906 KiB)
  1.080065 seconds (111.66 k allocations: 5.636 MiB)
  0.151877 seconds (41.66 k allocations: 2.083 MiB)
N = 49
  0.167099 seconds (3 allocations: 19.766 KiB)
  4.040710 seconds (116.59 k allocations: 5.780 MiB, 0.36% gc time)
  0.258091 seconds (41.66 k allocations: 2.101 MiB)
N = 64
  0.297501 seconds (3 allocations: 33.219 KiB)
 26.949391 seconds (3.88 M allocations: 204.397 MiB, 0.57% gc time)
419.663937 seconds (43.97 k allocations: 2.276 MiB)

There is a lot of compilation time to amortize!

PetrKryslUCSD · July 17, 2018, 6:02am

By the way, this is what the bad case of (mis) using the static matrices looks like:

The code:

function test_tsabad(N, NLOOP)
    mdim = 2
    gradN = MMatrix{N, 2, Float64}(rand(N, 2))
    Ke = MMatrix{N, N, Float64}(fill(0.0, N, N))
    multiplier = 2.0
    t = @elapsed for loop = 1:NLOOP
        for nx = 1:N # Do: Ce  =  Ce + gradN*((Jac*w[j]))*gradN' ;
            @inbounds for px = 1:mdim
                a = (multiplier)*gradN[nx, px]
                @inbounds for mx = 1:nx # only the upper triangle
                    Ke[mx, nx] +=  gradN[mx, px] * a
                end
            end
        end
    end 
    return t ./ NLOOP
end

mohamed82008 · July 17, 2018, 6:50am

Let’s make a simpler example:

using StaticArrays

function make_matrix(N)
    return zero(MMatrix{N,N,Float64,N^2})
end

@generated function make_matrix_type_stable(::Val{N}) where N
    return quote
        return zero(MMatrix{$N,$N,Float64,$(N*N)})
    end
end

The first function above takes N as input and returns an N by N MMatrix of the expected type. The problem with this is that N is not known when the function is compiled, it is just a variable. So the same method is run when N is 1 or 2 or 3, etc. In other words, the compiler only has access to the information that N is of type Int, not its value.

So one workaround is to use ::Val{N} as an input argument instead, so the input will be Val{N}(). Val{3}() is an instance of the type Val{3}, so the compiler now has access to the value of N just from the types of the input arguments. Another workaround is to use ::Type{Val{N}} so you pass Val{N} directly not an instance of it if you find that cleaner.

Whichever workaround you use, once the compiler can tell the value of N from the types of the input, it will now specialize the function for each value of N. If this specialization is not strong enough, you can also specialize it yourself using the generated function feature. In the above example I used a generated function because the compiler in v0.6 had a hard time figuring out that N*N is known only from the types (i.e. at compile time) so the last type parameter was still not inferred properly. Using a generated function, I can go the extra mile and specialize the function however I want for the input types. In StaticArrays for example, generated functions are used to unroll the loop when doing multiplications customizing the function for each matrix size lower than some threshold. If N is higher than the threshold then a normal loop can be used, etc. So you get far more power with generated functions, and you don’t have to rely on the compiler and hints to the compiler for customization.

Val(N) is just a type unstable way of constructing Val{N}(). The idea of a function barrier is that outside the function, N can be a variable, but there is only 1 point of type instability, e.g. calling Val(N) and passing it to the function. Once inside the function, N is now type information and the function is specialized for it. Obviously N should not be changing too often and the function should not be called too often or this small type instability will not be small any more.

@code_warntype results:

@code_warntype make_matrix(3)

#=
Variables:
  #self# <optimized out>
  N::Int64

Body:
  begin 
      return (Main.zero)((Core.apply_type)(Main.MMatrix, N::Int64, N::Int64, Main.Float64, (Base.mul_int)(N::Int64, N::Int64)::Int64)::Type{StaticArrays.MArray{Tuple{_,_},Float64,2,_}} where _ where _ where _)::Any
  end::Any
=#

@code_warntype make_matrix_type_stable(Val{3}())

#=
Body:
  begin  # line 2:
      # meta: location In[30] # line 3:
      $(Expr(:inbounds, false))
      # meta: location C:\Users\user\.julia\v0.6\StaticArrays\src\linalg.jl zero 142
      # meta: location C:\Users\user\.julia\v0.6\StaticArrays\src\arraymath.jl zeros 3
      # meta: location C:\Users\user\.julia\v0.6\StaticArrays\src\arraymath.jl # line 12:
      SSAValue(2) = (Base.sitofp)(Float64, 0)::Float64
      # meta: location C:\Users\user\.julia\v0.6\StaticArrays\src\arraymath.jl _fill 36
      # meta: location C:\Users\user\.julia\v0.6\StaticArrays\src\arraymath.jl # line 39:
      SSAValue(3) = $(Expr(:new, StaticArrays.MArray{Tuple{3,3},Float64,2,9}, :((StaticArrays.tuple)(SSAValue(2), SSAValue(2), SSAValue(2), SSAValue(2), SSAValue(2), SSAValue(2), SSAValue(2), SSAValue(2), SSAValue(2))::NTuple{9,Float64})))
      # meta: pop location
      # meta: pop location
      # meta: pop location
      # meta: pop location
      # meta: pop location
      $(Expr(:inbounds, :pop))
      # meta: pop location
      return SSAValue(3)
  end::StaticArrays.MArray{Tuple{3,3},Float64,2,9}

=#

kristoffer.carlsson · July 17, 2018, 7:21am

There is no need to do that if you used the zero(MMatrix{N,N,T}) constructor (no need to specify the total number of elements).

Elrod · July 17, 2018, 8:30am

There is a lot of compilation time to amortize!

Yeah. Maybe I should create a bunch of PRs so that MMatrices better support large sizes.
For example,

julia> using StaticArrays, BenchmarkTools

julia> @time zero(MMatrix{150,150,Float64});
 15.781910 seconds (17.63 M allocations: 780.125 MiB, 2.30% gc time)

julia> @btime zero(MMatrix{150,150,Float64});
  14.356 μs (1 allocation: 175.88 KiB)

versus a simple for loop:

julia> function zerov2(::Type{MMatrix{M,N,T}}) where {M,N,T}
           out = MMatrix{M,N,T}(undef)
           @inbounds for i ∈ eachindex(out)
               out[i] = zero(T)
           end
           out
       end

julia> @time zerov2(MMatrix{150,150,Float64});
  0.087669 seconds (187.09 k allocations: 10.238 MiB)

julia> @btime zerov2(MMatrix{150,150,Float64});
  13.404 μs (1 allocation: 175.88 KiB)

The first run was about 180 times slower for the default version on my computer, and didn’t even lead to better run times after compiling.

For sizes greater than 150x150, the default version actually gave me segfaults.
The for loop on the other hand had no issues.

If you replace your MMatrix(fill()) and MMatrix(randn()) functions with simply creating uninitialized matrices and filling them with for loops like my example above, that should really help your compile time.

kristoffer.carlsson · July 17, 2018, 9:22am

They both compile to a memset so no compilation time in the first case leads to something useful.
Most unrolling should be fairly pointless for MMatrixes since LLVM should just unroll it by itself.

The different reduce functions (https://github.com/JuliaArrays/StaticArrays.jl/blob/master/src/mapreduce.jl#L71 & co) would also likely gain by just being rewritten as loops instead of forcing unrolling.

Topic		Replies	Views
Improve the performance of multiplication of an arbitrary number of matrices Performance	36	2543	May 16, 2018
Type stability issues when using StaticArrays Performance type-stability , staticarrays	24	1916	January 3, 2019
The cost of size() in for-loops Performance	11	1853	July 20, 2018
Strange performance of a loop Performance	37	3328	July 21, 2018
Static array over 1,024 elements is slow to compile? General Usage	11	3182	March 25, 2021

Performance of mutable static arrays and compilation cost

Related topics