Great performance hit on (simple) functions

question

#1

I’m part of a group writting a Finite-Element-Solver for a project at our university.
We were totally fine writting the Julia program until we tried to optimise our code in terms of performance.

After initialisation we are going through a foor-loop filling a sparse matrix (“Assembly”). In every loop-cycle we call the sima3 function which consumes nearly 90% of the total computation time.

function stima3(vertices)
  d = size(vertices,2)
  G = \([ones(1,d+1);vertices'],[zeros(1,d);eye(d)])
  return /(det([ones(1,d+1);vertices']) * *(G, G'), prod(1:d))
end

The only idea we have is that there is a problem with memory allocation. But is there any posibility to boost this code’s performance?
After a few tests we think that the eye-function is part of the problem. But is this even possible because why should a eye-function consume that much resources?
(vertices is a Array{Float64,2}(3,2) and we want return a Array{Float64,2}(3,3))


#2

eye(d) creates a dxd matrix of Int’s every time it’s called. Then those Ints have to be converted to floats in the [zeros(1,d);eye(d)] command.

Note that things like \ are binary operators, so you can write A\B instead of \(A,B) and have be much easier to read.


#3

Maybe it’s my fault but if write

display(eye(d))

i get

2×2 Array{Float64,2}:
 1.0  0.0
 0.0  1.0

So it should be Float64?


#4

Oh I guess it is floats… ignore me. But it’s still allocating a lot.

You might want to re-write the formula in terms of something that won’t allocate. Here’s an assembly implementation for a stiffness matrix based on bubble functions you might want to look at:

https://github.com/JuliaDiffEq/FiniteElementDiffEq.jl/blob/master/src/solvers/fem_assembly.jl#L32

I wrote this before I “knew how to Julia”, so it’s vectorized (though with the broadcast changes this will be fixed by adding a few .'s in v0.6, and adding some @view around). You can make that non-allocating pretty easily though. But as you can see, re-working the algorithm to not require a (small) matrix multiplication is doable and let’s you get to a computationally better solution.


#5

Your code can be optimized by avoiding creating temporary matrices.
If this function is being used again and again you can create those matrix in advance and reuse it in the function.
Here is an example

verticles = rand(3,2)
n, d = size(verticles)
G1 = ones(n, d+1)
G2 = zeros(n, d)

for j in 1:d
    G2[j+1, j] = 1.0
end

function stima3!(G1, G2, verticles)
    n, d = size(verticles)
    for i in 1:n
        for j in 1:d
            G1[i, j+1] = verticles[i,j]
        end
    end
    G = At_ldiv_B(G1, G2)
    
    return A_mul_Bt(G, G) * det(G1) / prod(1:d)
end
function stima3(vertices)
  d = size(vertices,2)
  G = \([ones(1,d+1);vertices'],[zeros(1,d);eye(d)])
  
  return /(det([ones(1,d+1);vertices']) * *(G, G'), prod(1:d))
end

using BenchmarkTools
@benchmark stima3!(G1, G2, verticles)
@benchmark stima3(verticles)

There is a 5X speed up.

julia> @benchmark stima3!(G1, G2, verticles)
BenchmarkTools.Trial: 
  samples:          10000
  evals/sample:     9
  time tolerance:   5.00%
  memory tolerance: 1.00%
  memory estimate:  1.58 kb
  allocs estimate:  21
  minimum time:     2.13 μs (0.00% GC)
  median time:      2.29 μs (0.00% GC)
  mean time:        2.56 μs (6.93% GC)
  maximum time:     347.45 μs (97.55% GC)

julia> @benchmark stima3(verticles)
BenchmarkTools.Trial: 
  samples:          10000
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%
  memory estimate:  3.36 kb
  allocs estimate:  51
  minimum time:     14.05 μs (0.00% GC)
  median time:      16.01 μs (0.00% GC)
  mean time:        16.77 μs (2.37% GC)
  maximum time:     4.06 ms (98.10% GC)

The code can be further optimized.


#6

I agree with @Lanfeng that you are wasting most of your time allocating the matrices [ones(1,d+1);vertices'] and [zeros(1,d);eye(d)], and in an inefficient way. The first matrix you even allocate twice, for no reason.

If you for some reason do not like keeping two matrices G1 and G2 around, and you know for sure the dimensions of vertices, you can get pretty much the same speedup with this code:

function stima3b(vertices)
    T = eltype(vertices)
    G1 = T[1 1 1; 0 0 0; 0 0 0]
    G2 = T[0 0; 1 0; 0 1]
    fd = factorial(2)  # or just "2"
    G1[2:3, :] = vertices'
    G = G1 \ G2
    return (G * G') * (det(G1) / fd)
end
julia> @benchmark stima3(vert)
BenchmarkTools.Trial: 
  samples:          10000
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%
  memory estimate:  3.33 kb
  allocs estimate:  51
  minimum time:     13.99 μs (0.00% GC)
  median time:      16.69 μs (0.00% GC)
  mean time:        17.42 μs (2.28% GC)
  maximum time:     4.02 ms (99.00% GC)

julia> @benchmark stima3b(vert)
BenchmarkTools.Trial: 
  samples:          10000
  evals/sample:     9
  time tolerance:   5.00%
  memory tolerance: 1.00%
  memory estimate:  1.84 kb
  allocs estimate:  27
  minimum time:     3.03 μs (0.00% GC)
  median time:      3.26 μs (0.00% GC)
  mean time:        3.65 μs (7.21% GC)
  maximum time:     414.57 μs (97.47% GC)

Now, if you really want to speed things up, you should look into the package StaticArrays.jl, which offers blazingly fast operations on small, fixed-size, arrays:

function stima3c{T}(vertices::SMatrix{3, 2, T})
    G1 = @MMatrix T[1 1 1; 0 0 0; 0 0 0]
    G2 = @SMatrix T[0 0; 1 0; 0 1]
    fd = factorial(2)
    G1[2:3, :] = vertices'
    G = [G1\G2[:,1] G1\G2[:,2]]  # StaticArrays does not support '\' on matrices
    return (G * G') * (det(G1) / fd)
end

Here’s the benchmark for that:

julia> @benchmark stima3c(SMatrix{3,2}(vert))
BenchmarkTools.Trial: 
  samples:          10000
  evals/sample:     911
  time tolerance:   5.00%
  memory tolerance: 1.00%
  memory estimate:  176.00 bytes
  allocs estimate:  3
  minimum time:     114.00 ns (0.00% GC)
  median time:      119.00 ns (0.00% GC)
  mean time:        139.95 ns (9.73% GC)
  maximum time:     2.96 μs (90.83% GC)

Thats >100x speedup over the original code, and you can still optimize further, e.g. by letting vert be a SMatrix in the first place, avoid transposition, and maybe allocating G1 and G2 outside the function.

I fact, I would consider whether to use StaticArrays.jl throughout your code, since you can get dramatic speedups. Perhaps it is still a bit immature to let your entire codebase rely on it (don’t take my word for it), but you can at least use it here to fix your bottleneck.

One more note: Please use infix operators in the infix position. It makes your code so much easier to read :slight_smile: Also, I don’t see any reason to use A_mul_Bt(G, G) et. al. over G * G'; the latter is much more readable, and should be as fast.


#7

vcat is, unfortunately, type-unstable at the moment and this explains the memory allocation you are seeing.

julia> f() = [ones(4); 3.4]
f (generic function with 1 method)

julia> @code_warntype f()
Variables:
  #self#::#f

Body:
  begin
      return $(Expr(:invoke, LambdaInfo for vcat(::Array{Float64,1}, ::Float64), :(Main.vcat), :($(Expr(:invoke, LambdaInfo for fill!(::Array{Float64,1}, ::Float64), :(Base.fill!), :((Core.ccall)(:jl_alloc_array_1d,(Core.apply_type)(Core.Array,Float64,1)::Type{Array{Float64,1}},(Core.svec)(Core.Any,Core.Int)::SimpleVector,Array{Float64,1},0,4,0)::Array{Float64,1}), :((Base.box)(Float64,(Base.sitofp)(Float64,1)))))), 3.4))
  end::Any

#8

You can get a better speed up from StaticArrays.jl if you get the sizes of the matrices from the type parameters:

using StaticArrays

function stima3(vertices::Matrix)
  d = size(vertices,2)
  G = \([ones(1,d+1);vertices'],[zeros(1,d);eye(d)])
  return /(det([ones(1,d+1);vertices']) * *(G, G'), prod(1:d))
end

function stima3{N,D}(vertices::SMatrix{N,D})
    U = [ones(SMatrix{1,N}); vertices']
    # unfortunately StaticArrays doesn't define \(::SArray, ::SArray) yet.
    G = inv(U) * [zeros(SMatrix{1,D}); eye(SMatrix{D,D})]
    return (det(U) * G * G') / prod(1:D)
end

vertices = randn(3,2)
svertices = SMatrix{3,2}(vertices)

stima3(vertices)
stima3(svertices)


using BenchmarkTools

@benchmark stima3($vertices)
@benchmark stima3($svertices)

which gives:

julia> @benchmark stima3($vertices)
BenchmarkTools.Trial: 
  samples:          10000
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%
  memory estimate:  3.36 kb
  allocs estimate:  51
  minimum time:     13.30 μs (0.00% GC)
  median time:      20.35 μs (0.00% GC)
  mean time:        28.24 μs (1.57% GC)
  maximum time:     10.13 ms (0.00% GC)

julia> @benchmark stima3($svertices)
BenchmarkTools.Trial: 
  samples:          10000
  evals/sample:     994
  time tolerance:   5.00%
  memory tolerance: 1.00%
  memory estimate:  0.00 bytes
  allocs estimate:  0
  minimum time:     31.00 ns (0.00% GC)
  median time:      32.00 ns (0.00% GC)
  mean time:        33.08 ns (0.00% GC)
  maximum time:     497.00 ns (0.00% GC)

which is almost an 1000x speed up!

(also, note the complete lack of any memory allocations).


#9

This is cool. Is there some general advice on what format to use if I have a fixed size sparse array - use sparse or StaticArray? (sorry if this derails the conversation slightly).


#10

Yes these two formats optimized are for opposite use cases. You want to use sparse for large arrays and Static for small (like 3x3 matrices etc.) ones.


#11

Thanks! Is there anything better than a really clear recommendation? :smile: