 # Why Julia and linear algebra module are incredibly slow, compare to C++ and Eigen

#1

Hi,
I developed a julia file to compare its performance with C++. I implemented the homography estimation based on DLT algortihm. In brief, I created a 8*9 matrix A from two bundle of four 3D points and solve Ah=0. So we just need to compute nullspace vectors of matrix A and finish.
For this one, I implemented the following julia code and one equivalence C++ code that you can find here. I tried to have same number of variables and function’s calls in both languages. the linear algebra part of C++ was implemented by eigen library.

``````using LinearAlgebra

function position_in_world(x::Float64, y::Float64)
target_width_px = 1123
target_height_px = 791
target_width_mm = 297
target_height_mm = 210.025
X = ((x + 0.5) / target_width_px) * target_width_mm - 0.5 * target_width_mm
Y = ((y + 0.5) / target_height_px) * target_height_mm - 0.5 * target_height_mm
Z = 0.0
return [X, Y, Z ]
end

# function position_in_world2!(x::Float64, y::Float64, vec::Array{Float64,1})
#       target_width_px = 1123
#       target_height_px = 791
#       target_width_mm = 297
#       target_height_mm = 210.025
#       vec = ((x + 0.5) / target_width_px) * target_width_mm - 0.5 * target_width_mm
#       vec = ((y + 0.5) / target_height_px) * target_height_mm - 0.5 * target_height_mm
#       vec = 0.0
# end

function π_projection_function(K::Array{Float64,2}, R::Array{Float64,2}, t::Array{Float64,1}, point::Array{Float64,1})
point_in_camera = R * point + t
point_in_image = K * point_in_camera
return point_in_image / point_in_camera[end]
end

function compute_homograpy_DLT(M, P)
A = zeros(Float64, 8, 9)
for i in 1:size(M)
A[2 * i - 1, 1:3] = M[i,:]
A[2 * i - 1, 7:9] = - P[i,1] * M[i,:]
A[2 * i, 4:6] = M[i,:]
A[2 * i, 7:9] = - P[i,2] * M[i,:]
end
return nullspace(A)
end

@timev begin
K = [ 1169.19630        0.0 652.98743;
0.0 1169.61014 528.83429;
0.0        0.0       1.0]

T = [0.961255 -0.275448 0.0108487    112.79;
0.171961  0.629936   0.75737  -217.627;
-0.21545  -0.72616  0.652895   1385.13]

R = T[1:3, 1:3]
t = T[:,end]

tl = position_in_world(0.0, 0.0)
tr = position_in_world(1123.0, 0.0)
br = position_in_world(1123.0, 791.0)
bl = position_in_world(0.0, 791.0)

# println("pos in world 2")
# @timev begin
# tl = Float64[0.0, 0.0, 0.0]
# tr = Float64[0.0, 0.0, 0.0]
# br = Float64[0.0, 0.0, 0.0]
# bl = Float64[0.0, 0.0, 0.0]
# position_in_world2!(0.0, 0.0,tl)
# position_in_world2!(1123.0, 0.0,tr)
# position_in_world2!(1123.0, 791.0,br)
# position_in_world2!(0.0, 791.0,bl)
# end

p1 = π_projection_function(K, R, t, tl)
p2 = π_projection_function(K, R, t, tr)
p3 = π_projection_function(K, R, t, br)
p4 = π_projection_function(K, R, t, bl)
P = [p1 p2 p3 p4]'

m1 = [0.0, 0.0, 1.0]
m2 = [1123.0, 0.0, 1.0]
m3 = [1123.0, 791.0, 1.0]
m4 = [0.0, 791.0, 1.0]
M = [m1 m2 m3 m4]'

res = compute_homograpy_DLT(M , P)
homography = reshape(res, (3,3))'
println("Homography: ", homography)
end
``````

And the result:

``````Julia (version 1.1.0)
Homography: [0.000244416 -0.000198718 0.915494; 2.16745e-5 8.80403e-5 0.40233; -5.35586e-8 -1.81231e-7 0.00140359]
0.073007 seconds (218.81 k allocations: 11.512 MiB)
elapsed time (ns): 73007284
bytes allocated:   12071266
pool allocs:       218781
non-pool GC allocs:33

C++ ( clang++ version: 7.0.1,  release-flags: "-march=native -O3 -DNDEBUG",  Eigen version: 3.3.4)
Homography:
0.000244416 -0.000198718     0.915494
2.16745e-05  8.80403e-05      0.40233
-5.35586e-08 -1.81231e-07   0.00140359
elapsed time (ns): 275887
``````

As can be seen, the homography results are same but c++ and eigen are approximately 250! times faster.
It is noticeable that I am beginner in julia and there is a possibility that I did not implement my code in efficient way. So the main question is, how can I improve the speed of my julia code?

#2

Before anything you should take a look at the Performance tips.

To get you started, you should really try to avoid all those temporary allocations. For example, a slice, like `M[i,:]`, on the rhs of an assignment allocates a whole new array. Here you might want to consider `@views`.

#3

I remember reading somewhere that Eigen is even faster than MKL in small matrix computations, however such advantage will disappear when the dimension of matrix increases. Could you try a larger matrix?

#4

Apart from pre-allocating, consider using `@btime` from `BenchmarkTools`, also looks like your test may have included in the compile time of the jit. You may try running `@btime` or just run your @timev block a second time to exclude the compile time. Refer to e.g here for tips on how to use `@btime`. Additionally, looks like your matrix size is small, you may try https://github.com/JuliaArrays/StaticArrays.jl

#5

You’re using `Eigen::Vector3d` for points in C++, yet you’re using the equivalent of `Eigen::VectorXd` in Julia (i.e., a `Vector{Float64}`). You’ll definitely want to use StaticArrays.

#6

Just wrapping your main code in a function and timing it with `@btime` from BenchmarkTools I got a runtime of 21us, which is several thousand times faster than what you’re seeing.

I also tried using StaticArrays, in which case runtime is entirely dominated by by the call to `nullspace` (btw. `nullspace` doesn’t work for static arrays, unfortunately): all the code, except `nullspace`, took 150ns to run, but then `nullspace(A)` takes 18us.

Here’s a simple benchmark without static arrays, but using BenchmarkTools:

``````function foo()
K = [ 1169.19630    0.0     652.98743;
0.0     1169.61014 528.83429;
0.0        0.0       1.0]

T = [0.961255 -0.275448 0.0108487  112.79;
0.171961  0.629936 0.75737   -217.627;
-0.21545  -0.72616  0.652895  1385.13]

R = T[:, SVector(1,2,3)]
t = T[:,end]

tl = position_in_world(0.0, 0.0)
tr = position_in_world(1123.0, 0.0)
br = position_in_world(1123.0, 791.0)
bl = position_in_world(0.0, 791.0)

p1 = π_projection_function(K, R, t, tl)
p2 = π_projection_function(K, R, t, tr)
p3 = π_projection_function(K, R, t, br)
p4 = π_projection_function(K, R, t, bl)
P = [p1 p2 p3 p4]'

M = [0.0    0.0 1.0;
1123.0    0.0 1.0;
1123.0 791.0 1.0;
0.0 791.0 1.0]

res = compute_homograpy_DLT(M, P)
homography = reshape(res, (3,3))'
return homography
end
``````

Then benchmark it:

``````julia> @btime bar()
20.643 μs (65 allocations: 14.48 KiB)
0.000244416  -0.000198718  0.915494
2.16745e-5    8.80403e-5   0.40233
-5.35586e-8   -1.81231e-7   0.00140359
``````

Edit: To clarify, the problem isn’t with your code, but with your benchmarking.

#7

I think the comment about StaticArrays may still stand for a thorough comparison. Eigen is pretty explicit about this sort of thing and I believe it is where libraries like tensorflow get a lot of their speed because I read earlier today about someone having trouble getting past the 32 dimension limit that’s hard coded into tensorflow.

#8

Yes it it’s not possible to make a super fast type stable `nullspace` because the output size depends on the elements rather than the dimension of the input. Eigen would have the same problem of course. Even so, it’s a pity that the `Base` version of `nullspace` doesn’t work out of the box with `SMatrix`. Here, I fixed this:

(Hmm. It would be possible to return an SMatrix wrapped in a view from `nullspace` if somebody cared enough to try it out. I’m not sure that would be useful though.)

#9

Would it help if the output were a `Vector` of `SVector{N}`s, where the length `N` is known?

#10

I don’t think so. But regardless of that, we don’t have a statically sized svd implementation anyway so there’s much more room for improvement there. Currently we fall back to `svd(::AbstractMatrix)` and simply wrap the output to make sure we preserve the size information. We could probably do much better by using `MMatrix` and passing that into `LinearAlgebra.LAPACK.gesdd!` directly for `BlasFloat` types, but nobody has bothered to dig in that deep yet.

#11

I think that 3x3 has a more or less manageable closed form, at least I have seen it hardcoded.

#12

To be honest, before this post, I didn’t know about the overhead of JIT in julia. I read all of your comments and fixed my code. First of all, the BenchmarkTools is applied to evaluate the execution speed and I had significant improvement after JIT warm-up. Secondly, all possible variables of type `AbstractArray` (dynamic memory allocation) are replaced with `SVector` and `SMatrix`. Finnaly, I used `@view` macro for `SubArray` slicing.

``````function compute_homograpy_DLT(M, P)
A = zeros(8,9)
for i in 1:size(M)
A[2 * i - 1, 1:3] = @view M[i,:]
A[2 * i - 1, 7:9] = - P[i,1] * @view M[i,:]
A[2 * i, 4:6] = @view M[i,:]
A[2 * i, 7:9] = - P[i,2] * @view M[i,:]
end
return nullspace(A)
end
``````

and

``````K = @SMatrix [ 1169.19630        0.0 652.98743;
0.0 1169.61014 528.83429;
0.0        0.0       1.0]
T = @SMatrix [0.961255 -0.275448 0.0108487    112.79;
0.171961  0.629936   0.75737  -217.627;
-0.21545  -0.72616  0.652895   1385.13]

R = @view T[1:3, 1:3]
t = @view T[:,end]

tl = position_in_world(0.0, 0.0)
tr = position_in_world(1123.0, 0.0)
br = position_in_world(1123.0, 791.0)
bl = position_in_world(0.0, 791.0)

p1 = π_projection_function(K, R, t, tl)
p2 = π_projection_function(K, R, t, tr)
p3 = π_projection_function(K, R, t, br)
p4 = π_projection_function(K, R, t, bl)
P = [p1 p2 p3 p4]'

M = @SMatrix [0.0     0.0    1.0;
1123.0  0.0    1.0;
1123.0  791.0  1.0;
0.0     791.0  1.0]
``````

and re-execute the code:

``````151.783 μs (133 allocations: 13.27 KiB)
``````

BUT, for a fair comparison between julia and eigen, I also modified my c++ code. For this one, I ran the code for 100,000 and the output average time is:

``````elapsed time (microseconds): 34.2004
``````

So the C++ code is still 5 times faster and I think the reason is coming from null space computation.

#13

I have never installed MKL and consequently, I don’t have any estimation about the performance of it. But for Eigen, there is one benchmark, which is done by eigen team. They evaluates the performance of Eigen for different dense decomposition methods, based on the size of input matrix.

#14

The following runs in ~11 microseconds on my machine:

``````using LinearAlgebra
using StaticArrays

function position_in_world(x::Float64, y::Float64)
target_width_px = 1123
target_height_px = 791
target_width_mm = 297
target_height_mm = 210.025
X = ((x + 0.5) / target_width_px) * target_width_mm - 0.5 * target_width_mm
Y = ((y + 0.5) / target_height_px) * target_height_mm - 0.5 * target_height_mm
Z = 0.0
return SVector(X, Y, Z)
end

function compute_homograpy_DLT(M, P)
T = promote_type(eltype(M), eltype(P))
A = zeros(T, 8, 9)
for i in 1:size(M, 1)
Mi = M[i,:]
A[2 * i - 1, 1:3] = Mi
A[2 * i - 1, 7:9] = - P[i,1] * Mi
A[2 * i, 4:6] = Mi
A[2 * i, 7:9] = - P[i,2] * Mi
end
return nullspace(A)
end

function π_projection_function(K::AbstractMatrix, R::AbstractMatrix, t::AbstractVector, point::AbstractVector)
point_in_camera = R * point + t
point_in_image = K * point_in_camera
return point_in_image / point_in_camera[end]
end

function foo()
K = @SMatrix [ 1169.19630    0.0     652.98743;
0.0     1169.61014 528.83429;
0.0        0.0       1.0]

T = @SMatrix [0.961255 -0.275448 0.0108487  112.79;
0.171961  0.629936 0.75737   -217.627;
-0.21545  -0.72616  0.652895  1385.13]

R = T[:, SOneTo(3)]
t = T[:,end]

tl = position_in_world(0.0, 0.0)
tr = position_in_world(1123.0, 0.0)
br = position_in_world(1123.0, 791.0)
bl = position_in_world(0.0, 791.0)

p1 = π_projection_function(K, R, t, tl)
p2 = π_projection_function(K, R, t, tr)
p3 = π_projection_function(K, R, t, br)
p4 = π_projection_function(K, R, t, bl)
P = [p1 p2 p3 p4]'

M = @SMatrix [0.0    0.0 1.0;
1123.0    0.0 1.0;
1123.0 791.0 1.0;
0.0 791.0 1.0]

res = compute_homograpy_DLT(M, P)
homography = reshape(res, (3,3))'
return homography
end

using BenchmarkTools
@btime foo()
``````

Note that everything up to `compute_homography_DLT` is completely constant-folded (computed once, at compile time), which is fair game for the compiler since `foo` takes no arguments and all of the functions called on the constants defined in `foo` are pure. In fact, ideally the whole thing would get constant-folded, but the creation of dynamically-sized `Array`s in `compute_homograpy_DLT` is preventing this from happening. It depends on your actual use case whether this is a good benchmark or not.

#15

In the previous evaluation, I did not remove the print line which you did. By the way, for a practical and more fair test, I also removed the print line from both julia and C++ codes.
Julia : ~ 17 micro
C++ : ~ 13 micro
and we can ignore this 4 microseconds. Acceptable. Thanks.