Why are Julia and the LinearAlgebra module so slow compared to C++ and Eigen?

Hi,
I wrote a Julia program to compare its performance with C++. I implemented homography estimation based on the DLT algorithm: in brief, I build an 8×9 matrix A from two bundles of four 3D points and solve Ah = 0, so all that remains is to compute the nullspace of A.
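For reference, this is the standard DLT system: each correspondence between a world point $\mathbf{m}_i$ (in homogeneous coordinates) and its projection $\mathbf{p}_i = (u_i, v_i, 1)^\top$ contributes two rows to $A$,

$$
\begin{pmatrix}
\mathbf{m}_i^\top & \mathbf{0}^\top & -u_i\,\mathbf{m}_i^\top \\
\mathbf{0}^\top & \mathbf{m}_i^\top & -v_i\,\mathbf{m}_i^\top
\end{pmatrix}
\mathbf{h} = \mathbf{0},
$$

where $\mathbf{h} \in \mathbb{R}^9$ stacks the rows of the homography $H$, so four correspondences give the 8×9 system $A\mathbf{h} = \mathbf{0}$.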
I implemented the Julia code below and an equivalent C++ version, which you can find here. I tried to use the same number of variables and function calls in both languages; the linear algebra in the C++ code is done with the Eigen library.

using LinearAlgebra

function position_in_world(x::Float64, y::Float64)
      target_width_px = 1123
      target_height_px = 791
      target_width_mm = 297
      target_height_mm = 210.025
      X = ((x + 0.5) / target_width_px) * target_width_mm - 0.5 * target_width_mm
      Y = ((y + 0.5) / target_height_px) * target_height_mm - 0.5 * target_height_mm
      Z = 0.0
      return [X, Y, Z ]
end

# function position_in_world2!(x::Float64, y::Float64, vec::Array{Float64,1})
#       target_width_px = 1123
#       target_height_px = 791
#       target_width_mm = 297
#       target_height_mm = 210.025
#       vec[1] = ((x + 0.5) / target_width_px) * target_width_mm - 0.5 * target_width_mm
#       vec[2] = ((y + 0.5) / target_height_px) * target_height_mm - 0.5 * target_height_mm
#       vec[3] = 0.0
# end


function π_projection_function(K::Array{Float64,2}, R::Array{Float64,2}, t::Array{Float64,1}, point::Array{Float64,1})
      point_in_camera = R * point + t
      point_in_image = K * point_in_camera
      return point_in_image / point_in_camera[end]
end


function compute_homograpy_DLT(M, P)
      A = zeros(Float64, 8, 9)
      for i in 1:size(M)[1]
            A[2 * i - 1, 1:3] = M[i,:]
            A[2 * i - 1, 7:9] = - P[i,1] * M[i,:]
            A[2 * i, 4:6] = M[i,:]
            A[2 * i, 7:9] = - P[i,2] * M[i,:]
      end
      return nullspace(A)
end

@timev begin
K = [1169.19630     0.0     652.98743;
        0.0     1169.61014  528.83429;
        0.0         0.0       1.0]

T = [ 0.961255  -0.275448  0.0108487   112.79;
      0.171961   0.629936  0.75737    -217.627;
     -0.21545   -0.72616   0.652895   1385.13]

R = T[1:3, 1:3]
t = T[:,end]

tl = position_in_world(0.0, 0.0)
tr = position_in_world(1123.0, 0.0)
br = position_in_world(1123.0, 791.0)
bl = position_in_world(0.0, 791.0)

# println("pos in world 2")
# @timev begin
# tl = Float64[0.0, 0.0, 0.0]
# tr = Float64[0.0, 0.0, 0.0]
# br = Float64[0.0, 0.0, 0.0]
# bl = Float64[0.0, 0.0, 0.0]
# position_in_world2!(0.0, 0.0,tl)
# position_in_world2!(1123.0, 0.0,tr)
# position_in_world2!(1123.0, 791.0,br)
# position_in_world2!(0.0, 791.0,bl)
# end


p1 = π_projection_function(K, R, t, tl)
p2 = π_projection_function(K, R, t, tr)
p3 = π_projection_function(K, R, t, br)
p4 = π_projection_function(K, R, t, bl)
P = [p1 p2 p3 p4]'

m1 = [0.0, 0.0, 1.0]
m2 = [1123.0, 0.0, 1.0]
m3 = [1123.0, 791.0, 1.0]
m4 = [0.0, 791.0, 1.0]
M = [m1 m2 m3 m4]'

res = compute_homograpy_DLT(M , P)
homography = reshape(res, (3,3))'
println("Homography: ", homography)
end

And the result:

Julia (version 1.1.0)
Homography: [0.000244416 -0.000198718 0.915494; 2.16745e-5 8.80403e-5 0.40233; -5.35586e-8 -1.81231e-7 0.00140359]
  0.073007 seconds (218.81 k allocations: 11.512 MiB)
elapsed time (ns): 73007284
bytes allocated:   12071266
pool allocs:       218781
non-pool GC allocs:33

C++ (clang++ 7.0.1, release flags: "-march=native -O3 -DNDEBUG", Eigen 3.3.4)
Homography: 
 0.000244416 -0.000198718     0.915494
 2.16745e-05  8.80403e-05      0.40233
-5.35586e-08 -1.81231e-07   0.00140359
elapsed time (ns): 275887

As can be seen, the homography results are the same, but C++ with Eigen is approximately 250 times faster!
Note that I am a beginner in Julia, so it is quite possible that I did not implement my code efficiently. The main question is: how can I improve the speed of my Julia code?

1 Like

Before anything else, you should take a look at the Performance tips.

To get you started, you should really try to avoid all those temporary allocations. For example, a slice like M[i,:] on the right-hand side of an assignment allocates a whole new array. Here you might want to consider @views.
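A minimal sketch of the difference (exact timings vary by machine, but the allocation counts are the point):

using BenchmarkTools

M = rand(4, 3)
@btime sum($M[2, :])        # the slice copies the row: allocates a temporary Vector
@btime sum(view($M, 2, :))  # a view just references the parent array: no copy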

7 Likes

I remember reading somewhere that Eigen is even faster than MKL for small matrix computations, but that advantage disappears as the matrix dimension increases. Could you try a larger matrix?
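Something along these lines would do it (just a sketch; a random m×(m+1) matrix almost surely has a one-dimensional nullspace, like the 8×9 system here):

using LinearAlgebra, BenchmarkTools

for m in (8, 64, 256)
    A = rand(m, m + 1)
    print(m, "×", m + 1, ": ")
    @btime nullspace($A)
end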

Apart from pre-allocating, consider using @btime from BenchmarkTools; it also looks like your test may have included the JIT compile time. You can run @btime, or simply run your @timev block a second time, to exclude compilation. Refer to, e.g., here for tips on how to use @btime. Additionally, since your matrices are small, you may want to try https://github.com/JuliaArrays/StaticArrays.jl
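To see the compile-time effect concretely, here is a toy sketch: the first call to a function includes JIT compilation, the second measures the steady-state runtime.

f(A, B) = sum(A * B)
A, B = rand(100, 100), rand(100, 100)
@time f(A, B)   # first call: compiles f for these argument types, so the time is inflated
@time f(A, B)   # second call: measures only the actual runtime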

4 Likes

You’re using Eigen::Vector3d for points in C++, yet you’re using the equivalent of Eigen::VectorXd in Julia (i.e., a Vector{Float64}). You’ll definitely want to use StaticArrays.
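That is, something like this (a sketch):

using StaticArrays

v = [1.0, 2.0, 3.0]          # Vector{Float64}: size known only at runtime, heap-allocated
s = SVector(1.0, 2.0, 3.0)   # SVector{3,Float64}: size encoded in the type, like Eigen::Vector3d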

12 Likes

Just wrapping your main code in a function and timing it with @btime from BenchmarkTools, I got a runtime of 21 µs, which is several thousand times faster than what you’re seeing.

I also tried using StaticArrays, in which case the runtime is entirely dominated by the call to nullspace (by the way, nullspace doesn’t work for static arrays, unfortunately): all the code except nullspace took 150 ns to run, but nullspace(A) takes 18 µs.

Here’s a simple benchmark without static arrays, but using BenchmarkTools:

function foo()
    K = [ 1169.19630    0.0     652.98743;
             0.0     1169.61014 528.83429;
             0.0        0.0       1.0]

    T = [0.961255 -0.275448 0.0108487  112.79;
         0.171961  0.629936 0.75737   -217.627;
        -0.21545  -0.72616  0.652895  1385.13]

    R = T[:, 1:3]
    t = T[:,end]

    tl = position_in_world(0.0, 0.0)
    tr = position_in_world(1123.0, 0.0)
    br = position_in_world(1123.0, 791.0)
    bl = position_in_world(0.0, 791.0)

    p1 = π_projection_function(K, R, t, tl)
    p2 = π_projection_function(K, R, t, tr)
    p3 = π_projection_function(K, R, t, br)
    p4 = π_projection_function(K, R, t, bl)
    P = [p1 p2 p3 p4]'

    M = [   0.0   0.0  1.0;
         1123.0   0.0  1.0;
         1123.0 791.0  1.0;
            0.0 791.0  1.0]

    res = compute_homograpy_DLT(M, P)
    homography = reshape(res, (3,3))'
    return homography
end

Then benchmark it:

julia> @btime foo()
  20.643 μs (65 allocations: 14.48 KiB)
3×3 Adjoint{Float64,Array{Float64,2}}:
  0.000244416  -0.000198718  0.915494  
  2.16745e-5    8.80403e-5   0.40233   
 -5.35586e-8   -1.81231e-7   0.00140359

Edit: To clarify, the problem isn’t with your code, but with your benchmarking.

8 Likes

I think the comment about StaticArrays still stands for a thorough comparison. Eigen is pretty explicit about fixed-size types, and I believe that is where libraries like TensorFlow get a lot of their speed; just today I read about someone having trouble getting past the 32-dimension limit that’s hard-coded into TensorFlow.

Yes, it’s not possible to make a super-fast, type-stable nullspace, because the output size depends on the matrix elements rather than on the dimensions of the input. Eigen would have the same problem, of course. Even so, it’s a pity that the Base version of nullspace doesn’t work out of the box with SMatrix. Here, I fixed this:

(Hmm. It would be possible to return an SMatrix wrapped in a view from nullspace if somebody cared enough to try it out. I’m not sure that would be useful though.)
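(In the meantime, the simplest workaround for anyone following along is to convert before the call; a sketch that loses the static sizing but works out of the box, and is not the linked fix:)

using LinearAlgebra, StaticArrays

nullspace_fallback(A::StaticMatrix) = nullspace(Matrix(A))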

3 Likes

Would it help if the output were a Vector of SVector{N}s, where the length N is known?

1 Like

I don’t think so. But regardless of that, we don’t have a statically sized SVD implementation anyway, so there’s much more room for improvement there. Currently we fall back to svd(::AbstractMatrix) and simply wrap the output to preserve the size information. We could probably do much better by using MMatrix and passing it into LinearAlgebra.LAPACK.gesdd! directly for BlasFloat types, but nobody has dug in that deep yet.
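A rough sketch of that idea (hypothetical code, not in any package; static_svd_sketch is made up here, and a real implementation would try to avoid the copies this one makes):

using LinearAlgebra, StaticArrays

function static_svd_sketch(A::SMatrix{M,N,Float64}) where {M,N}
    buf = Matrix(A)  # LAPACK.gesdd! needs a plain, mutable, strided matrix
    U, S, Vt = LinearAlgebra.LAPACK.gesdd!('A', buf)
    # wrap the results back into statically sized types
    return SMatrix{M,M}(U), SVector{min(M,N)}(S), SMatrix{N,N}(Vt)
end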

I think that the 3×3 case has a more or less manageable closed form; at least I have seen it hardcoded.

To be honest, before this post I didn’t know about Julia’s JIT overhead. I read all of your comments and fixed my code. First, I used BenchmarkTools to evaluate execution speed, and I saw a significant improvement after the JIT warm-up. Second, all possible variables of type AbstractArray (dynamic memory allocation) were replaced with SVector and SMatrix. Finally, I used the @view macro for SubArray slicing.

function compute_homograpy_DLT(M, P)
      A = zeros(8,9)
      for i in 1:size(M)[1]
            A[2 * i - 1, 1:3] = @view M[i,:]
            A[2 * i - 1, 7:9] = - P[i,1] * @view M[i,:]
            A[2 * i, 4:6] = @view M[i,:]
            A[2 * i, 7:9] = - P[i,2] * @view M[i,:]
      end
      return nullspace(A)
end
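(A side note: the right-hand sides like - P[i,1] * @view M[i,:] still allocate a temporary for the scalar–view product. Fusing the assignment with .= would avoid even those temporaries; a sketch:)

using LinearAlgebra

function compute_homograpy_DLT_fused(M, P)
    A = zeros(8, 9)
    @views for i in 1:size(M, 1)
        A[2i - 1, 1:3] .= M[i, :]
        A[2i - 1, 7:9] .= -P[i, 1] .* M[i, :]   # fused: no temporary for the product
        A[2i, 4:6]     .= M[i, :]
        A[2i, 7:9]     .= -P[i, 2] .* M[i, :]
    end
    return nullspace(A)
end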

and the setup code:

K = @SMatrix [1169.19630     0.0     652.98743;
                  0.0    1169.61014  528.83429;
                  0.0        0.0       1.0]
T = @SMatrix [ 0.961255  -0.275448  0.0108487   112.79;
               0.171961   0.629936  0.75737    -217.627;
              -0.21545   -0.72616   0.652895   1385.13]

R = @view T[1:3, 1:3]
t = @view T[:,end]

tl = position_in_world(0.0, 0.0)
tr = position_in_world(1123.0, 0.0)
br = position_in_world(1123.0, 791.0)
bl = position_in_world(0.0, 791.0)

p1 = π_projection_function(K, R, t, tl)
p2 = π_projection_function(K, R, t, tr)
p3 = π_projection_function(K, R, t, br)
p4 = π_projection_function(K, R, t, bl)
P = [p1 p2 p3 p4]'

M = @SMatrix [   0.0   0.0  1.0;
              1123.0   0.0  1.0;
              1123.0 791.0  1.0;
                 0.0 791.0  1.0]

and re-execute the code:

151.783 μs (133 allocations: 13.27 KiB)

BUT, for a fair comparison between Julia and Eigen, I also modified my C++ code. I ran it for 100,000 iterations, and the average time is:

elapsed time (microseconds): 34.2004

So the C++ code is still about 5 times faster, and I think the difference comes from the nullspace computation.
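(This is easy to check in isolation:)

using LinearAlgebra, BenchmarkTools

A = rand(8, 9)
@btime nullspace($A)   # the SVD-based nullspace call dominates the per-call cost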

I have never installed MKL, so I don’t have any estimate of its performance. But for Eigen there is a benchmark done by the Eigen team; it evaluates the performance of Eigen’s dense decomposition methods as a function of the input matrix size.

The following runs in ~11 microseconds on my machine:

using LinearAlgebra
using StaticArrays

function position_in_world(x::Float64, y::Float64)
    target_width_px = 1123
    target_height_px = 791
    target_width_mm = 297
    target_height_mm = 210.025
    X = ((x + 0.5) / target_width_px) * target_width_mm - 0.5 * target_width_mm
    Y = ((y + 0.5) / target_height_px) * target_height_mm - 0.5 * target_height_mm
    Z = 0.0
    return SVector(X, Y, Z)
end

function compute_homograpy_DLT(M, P)
    T = promote_type(eltype(M), eltype(P))
    A = zeros(T, 8, 9)
    for i in 1:size(M, 1)
        Mi = M[i,:]
        A[2 * i - 1, 1:3] = Mi
        A[2 * i - 1, 7:9] = - P[i,1] * Mi
        A[2 * i, 4:6] = Mi
        A[2 * i, 7:9] = - P[i,2] * Mi
    end
    return nullspace(A)
end

function π_projection_function(K::AbstractMatrix, R::AbstractMatrix, t::AbstractVector, point::AbstractVector)
    point_in_camera = R * point + t
    point_in_image = K * point_in_camera
    return point_in_image / point_in_camera[end]
end
                  
function foo()
    K = @SMatrix [1169.19630     0.0     652.98743;
                      0.0    1169.61014  528.83429;
                      0.0        0.0       1.0]

    T = @SMatrix [ 0.961255  -0.275448  0.0108487   112.79;
                   0.171961   0.629936  0.75737    -217.627;
                  -0.21545   -0.72616   0.652895   1385.13]

    R = T[:, SOneTo(3)]
    t = T[:,end]

    tl = position_in_world(0.0, 0.0)
    tr = position_in_world(1123.0, 0.0)
    br = position_in_world(1123.0, 791.0)
    bl = position_in_world(0.0, 791.0)

    p1 = π_projection_function(K, R, t, tl)
    p2 = π_projection_function(K, R, t, tr)
    p3 = π_projection_function(K, R, t, br)
    p4 = π_projection_function(K, R, t, bl)
    P = [p1 p2 p3 p4]'

    M = @SMatrix [   0.0   0.0  1.0;
                  1123.0   0.0  1.0;
                  1123.0 791.0  1.0;
                     0.0 791.0  1.0]

    res = compute_homograpy_DLT(M, P)
    homography = reshape(res, (3,3))'
    return homography
end

using BenchmarkTools
@btime foo()

Note that everything up to compute_homograpy_DLT is completely constant-folded (computed once, at compile time), which is fair game for the compiler, since foo takes no arguments and all of the functions called on the constants defined in foo are pure. In fact, ideally the whole thing would get constant-folded, but the creation of dynamically sized Arrays in compute_homograpy_DLT prevents this. Whether this makes for a good benchmark depends on your actual use case.
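(If you want to rule out constant folding in a zero-argument benchmark, the usual BenchmarkTools idiom is to hide the inputs behind a Ref and interpolate it:)

using BenchmarkTools

x = Ref(1.5)
@btime sin($x[])   # dereferencing the Ref keeps the input opaque to the compiler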

7 Likes

In my previous evaluation I had left the print line in, which you removed. For a practical and fairer test, I removed the print line from both the Julia and the C++ code:
Julia: ~17 µs
C++: ~13 µs
We can ignore this 4-microsecond difference. Acceptable. Thanks.

3 Likes