# Great performance hit on (simple) functions

#1

I’m part of a group writting a Finite-Element-Solver for a project at our university.
We were totally fine writting the Julia program until we tried to optimise our code in terms of performance.

After initialisation we are going through a foor-loop filling a sparse matrix (“Assembly”). In every loop-cycle we call the sima3 function which consumes nearly 90% of the total computation time.

``````function stima3(vertices)
d = size(vertices,2)
G = \([ones(1,d+1);vertices'],[zeros(1,d);eye(d)])
return /(det([ones(1,d+1);vertices']) * *(G, G'), prod(1:d))
end
``````

The only idea we have is that there is a problem with memory allocation. But is there any posibility to boost this code’s performance?
After a few tests we think that the eye-function is part of the problem. But is this even possible because why should a eye-function consume that much resources?
(vertices is a Array{Float64,2}(3,2) and we want return a Array{Float64,2}(3,3))

#2

eye(d) creates a dxd matrix of Int’s every time it’s called. Then those Ints have to be converted to floats in the `[zeros(1,d);eye(d)]` command.

Note that things like `\` are binary operators, so you can write `A\B` instead of `\(A,B)` and have be much easier to read.

#3

Maybe it’s my fault but if write

``````display(eye(d))
``````

i get

``````2×2 Array{Float64,2}:
1.0  0.0
0.0  1.0
``````

So it should be Float64?

#4

Oh I guess it is floats… ignore me. But it’s still allocating a lot.

You might want to re-write the formula in terms of something that won’t allocate. Here’s an assembly implementation for a stiffness matrix based on bubble functions you might want to look at:

I wrote this before I “knew how to Julia”, so it’s vectorized (though with the broadcast changes this will be fixed by adding a few `.`'s in v0.6, and adding some `@view` around). You can make that non-allocating pretty easily though. But as you can see, re-working the algorithm to not require a (small) matrix multiplication is doable and let’s you get to a computationally better solution.

#5

Your code can be optimized by avoiding creating temporary matrices.
If this function is being used again and again you can create those matrix in advance and reuse it in the function.
Here is an example

``````verticles = rand(3,2)
n, d = size(verticles)
G1 = ones(n, d+1)
G2 = zeros(n, d)

for j in 1:d
G2[j+1, j] = 1.0
end

function stima3!(G1, G2, verticles)
n, d = size(verticles)
for i in 1:n
for j in 1:d
G1[i, j+1] = verticles[i,j]
end
end
G = At_ldiv_B(G1, G2)

return A_mul_Bt(G, G) * det(G1) / prod(1:d)
end
function stima3(vertices)
d = size(vertices,2)
G = \([ones(1,d+1);vertices'],[zeros(1,d);eye(d)])

return /(det([ones(1,d+1);vertices']) * *(G, G'), prod(1:d))
end

using BenchmarkTools
@benchmark stima3!(G1, G2, verticles)
@benchmark stima3(verticles)
``````

There is a 5X speed up.

``````julia> @benchmark stima3!(G1, G2, verticles)
BenchmarkTools.Trial:
samples:          10000
evals/sample:     9
time tolerance:   5.00%
memory tolerance: 1.00%
memory estimate:  1.58 kb
allocs estimate:  21
minimum time:     2.13 μs (0.00% GC)
median time:      2.29 μs (0.00% GC)
mean time:        2.56 μs (6.93% GC)
maximum time:     347.45 μs (97.55% GC)

julia> @benchmark stima3(verticles)
BenchmarkTools.Trial:
samples:          10000
evals/sample:     1
time tolerance:   5.00%
memory tolerance: 1.00%
memory estimate:  3.36 kb
allocs estimate:  51
minimum time:     14.05 μs (0.00% GC)
median time:      16.01 μs (0.00% GC)
mean time:        16.77 μs (2.37% GC)
maximum time:     4.06 ms (98.10% GC)
``````

The code can be further optimized.

#6

I agree with @Lanfeng that you are wasting most of your time allocating the matrices `[ones(1,d+1);vertices']` and `[zeros(1,d);eye(d)]`, and in an inefficient way. The first matrix you even allocate twice, for no reason.

If you for some reason do not like keeping two matrices `G1` and `G2` around, and you know for sure the dimensions of `vertices`, you can get pretty much the same speedup with this code:

``````function stima3b(vertices)
T = eltype(vertices)
G1 = T[1 1 1; 0 0 0; 0 0 0]
G2 = T[0 0; 1 0; 0 1]
fd = factorial(2)  # or just "2"
G1[2:3, :] = vertices'
G = G1 \ G2
return (G * G') * (det(G1) / fd)
end
``````
``````julia> @benchmark stima3(vert)
BenchmarkTools.Trial:
samples:          10000
evals/sample:     1
time tolerance:   5.00%
memory tolerance: 1.00%
memory estimate:  3.33 kb
allocs estimate:  51
minimum time:     13.99 μs (0.00% GC)
median time:      16.69 μs (0.00% GC)
mean time:        17.42 μs (2.28% GC)
maximum time:     4.02 ms (99.00% GC)

julia> @benchmark stima3b(vert)
BenchmarkTools.Trial:
samples:          10000
evals/sample:     9
time tolerance:   5.00%
memory tolerance: 1.00%
memory estimate:  1.84 kb
allocs estimate:  27
minimum time:     3.03 μs (0.00% GC)
median time:      3.26 μs (0.00% GC)
mean time:        3.65 μs (7.21% GC)
maximum time:     414.57 μs (97.47% GC)
``````

Now, if you really want to speed things up, you should look into the package `StaticArrays.jl`, which offers blazingly fast operations on small, fixed-size, arrays:

``````function stima3c{T}(vertices::SMatrix{3, 2, T})
G1 = @MMatrix T[1 1 1; 0 0 0; 0 0 0]
G2 = @SMatrix T[0 0; 1 0; 0 1]
fd = factorial(2)
G1[2:3, :] = vertices'
G = [G1\G2[:,1] G1\G2[:,2]]  # StaticArrays does not support '\' on matrices
return (G * G') * (det(G1) / fd)
end
``````

Here’s the benchmark for that:

``````julia> @benchmark stima3c(SMatrix{3,2}(vert))
BenchmarkTools.Trial:
samples:          10000
evals/sample:     911
time tolerance:   5.00%
memory tolerance: 1.00%
memory estimate:  176.00 bytes
allocs estimate:  3
minimum time:     114.00 ns (0.00% GC)
median time:      119.00 ns (0.00% GC)
mean time:        139.95 ns (9.73% GC)
maximum time:     2.96 μs (90.83% GC)
``````

Thats >100x speedup over the original code, and you can still optimize further, e.g. by letting `vert` be a `SMatrix` in the first place, avoid transposition, and maybe allocating `G1` and `G2` outside the function.

I fact, I would consider whether to use `StaticArrays.jl` throughout your code, since you can get dramatic speedups. Perhaps it is still a bit immature to let your entire codebase rely on it (don’t take my word for it), but you can at least use it here to fix your bottleneck.

One more note: Please use infix operators in the infix position. It makes your code so much easier to read Also, I don’t see any reason to use `A_mul_Bt(G, G)` et. al. over `G * G'`; the latter is much more readable, and should be as fast.

#7

`vcat` is, unfortunately, type-unstable at the moment and this explains the memory allocation you are seeing.

``````julia> f() = [ones(4); 3.4]
f (generic function with 1 method)

julia> @code_warntype f()
Variables:
#self#::#f

Body:
begin
return \$(Expr(:invoke, LambdaInfo for vcat(::Array{Float64,1}, ::Float64), :(Main.vcat), :(\$(Expr(:invoke, LambdaInfo for fill!(::Array{Float64,1}, ::Float64), :(Base.fill!), :((Core.ccall)(:jl_alloc_array_1d,(Core.apply_type)(Core.Array,Float64,1)::Type{Array{Float64,1}},(Core.svec)(Core.Any,Core.Int)::SimpleVector,Array{Float64,1},0,4,0)::Array{Float64,1}), :((Base.box)(Float64,(Base.sitofp)(Float64,1)))))), 3.4))
end::Any
``````

#8

You can get a better speed up from StaticArrays.jl if you get the sizes of the matrices from the type parameters:

``````using StaticArrays

function stima3(vertices::Matrix)
d = size(vertices,2)
G = \([ones(1,d+1);vertices'],[zeros(1,d);eye(d)])
return /(det([ones(1,d+1);vertices']) * *(G, G'), prod(1:d))
end

function stima3{N,D}(vertices::SMatrix{N,D})
U = [ones(SMatrix{1,N}); vertices']
# unfortunately StaticArrays doesn't define \(::SArray, ::SArray) yet.
G = inv(U) * [zeros(SMatrix{1,D}); eye(SMatrix{D,D})]
return (det(U) * G * G') / prod(1:D)
end

vertices = randn(3,2)
svertices = SMatrix{3,2}(vertices)

stima3(vertices)
stima3(svertices)

using BenchmarkTools

@benchmark stima3(\$vertices)
@benchmark stima3(\$svertices)
``````

which gives:

``````julia> @benchmark stima3(\$vertices)
BenchmarkTools.Trial:
samples:          10000
evals/sample:     1
time tolerance:   5.00%
memory tolerance: 1.00%
memory estimate:  3.36 kb
allocs estimate:  51
minimum time:     13.30 μs (0.00% GC)
median time:      20.35 μs (0.00% GC)
mean time:        28.24 μs (1.57% GC)
maximum time:     10.13 ms (0.00% GC)

julia> @benchmark stima3(\$svertices)
BenchmarkTools.Trial:
samples:          10000
evals/sample:     994
time tolerance:   5.00%
memory tolerance: 1.00%
memory estimate:  0.00 bytes
allocs estimate:  0
minimum time:     31.00 ns (0.00% GC)
median time:      32.00 ns (0.00% GC)
mean time:        33.08 ns (0.00% GC)
maximum time:     497.00 ns (0.00% GC)
``````

which is almost an 1000x speed up!

(also, note the complete lack of any memory allocations).

#9

This is cool. Is there some general advice on what format to use if I have a fixed size sparse array - use sparse or StaticArray? (sorry if this derails the conversation slightly).

#10

Yes these two formats optimized are for opposite use cases. You want to use sparse for large arrays and Static for small (like 3x3 matrices etc.) ones.

#11

Thanks! Is there anything better than a really clear recommendation?