Why are Julia and the LinearAlgebra module so slow compared to C++ and Eigen?

Hi,
I wrote a Julia program to compare its performance with C++. I implemented homography estimation based on the DLT algorithm: in brief, I build an 8×9 matrix A from two bundles of four 3D points and solve Ah = 0, so all that remains is to compute the nullspace of A.
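For reference, this is the standard DLT system: each correspondence between a world point $\mathbf{m}_i$ (in homogeneous coordinates) and its projection $\mathbf{p}_i = (u_i, v_i, 1)^\top$ contributes two rows to $A$,

$$
\begin{pmatrix}
\mathbf{m}_i^\top & \mathbf{0}^\top & -u_i\,\mathbf{m}_i^\top \\
\mathbf{0}^\top & \mathbf{m}_i^\top & -v_i\,\mathbf{m}_i^\top
\end{pmatrix}
\mathbf{h} = \mathbf{0},
$$

where $\mathbf{h} \in \mathbb{R}^9$ stacks the rows of the homography $H$, so four correspondences give the 8×9 system $A\mathbf{h} = \mathbf{0}$.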
I implemented the Julia code below and an equivalent C++ version, which you can find here. I tried to use the same number of variables and function calls in both languages; the linear algebra in the C++ code is done with the Eigen library.

using LinearAlgebra

function position_in_world(x::Float64, y::Float64)
      target_width_px = 1123
      target_height_px = 791
      target_width_mm = 297
      target_height_mm = 210.025
      X = ((x + 0.5) / target_width_px) * target_width_mm - 0.5 * target_width_mm
      Y = ((y + 0.5) / target_height_px) * target_height_mm - 0.5 * target_height_mm
      Z = 0.0
      return [X, Y, Z ]
end

# function position_in_world2!(x::Float64, y::Float64, vec::Array{Float64,1})
#       target_width_px = 1123
#       target_height_px = 791
#       target_width_mm = 297
#       target_height_mm = 210.025
#       vec[1] = ((x + 0.5) / target_width_px) * target_width_mm - 0.5 * target_width_mm
#       vec[2] = ((y + 0.5) / target_height_px) * target_height_mm - 0.5 * target_height_mm
#       vec[3] = 0.0
# end


function π_projection_function(K::Array{Float64,2}, R::Array{Float64,2}, t::Array{Float64,1}, point::Array{Float64,1})
      point_in_camera = R * point + t
      point_in_image = K * point_in_camera
      return point_in_image / point_in_camera[end]
end


function compute_homograpy_DLT(M, P)
      A = zeros(Float64, 8, 9)
      for i in 1:size(M)[1]
            A[2 * i - 1, 1:3] = M[i,:]
            A[2 * i - 1, 7:9] = - P[i,1] * M[i,:]
            A[2 * i, 4:6] = M[i,:]
            A[2 * i, 7:9] = - P[i,2] * M[i,:]
      end
      return nullspace(A)
end

@timev begin
K = [1169.19630     0.0     652.98743;
        0.0     1169.61014  528.83429;
        0.0         0.0       1.0]

T = [ 0.961255  -0.275448  0.0108487   112.79;
      0.171961   0.629936  0.75737    -217.627;
     -0.21545   -0.72616   0.652895   1385.13]

R = T[1:3, 1:3]
t = T[:,end]

tl = position_in_world(0.0, 0.0)
tr = position_in_world(1123.0, 0.0)
br = position_in_world(1123.0, 791.0)
bl = position_in_world(0.0, 791.0)

# println("pos in world 2")
# @timev begin
# tl = Float64[0.0, 0.0, 0.0]
# tr = Float64[0.0, 0.0, 0.0]
# br = Float64[0.0, 0.0, 0.0]
# bl = Float64[0.0, 0.0, 0.0]
# position_in_world2!(0.0, 0.0,tl)
# position_in_world2!(1123.0, 0.0,tr)
# position_in_world2!(1123.0, 791.0,br)
# position_in_world2!(0.0, 791.0,bl)
# end


p1 = π_projection_function(K, R, t, tl)
p2 = π_projection_function(K, R, t, tr)
p3 = π_projection_function(K, R, t, br)
p4 = π_projection_function(K, R, t, bl)
P = [p1 p2 p3 p4]'

m1 = [0.0, 0.0, 1.0]
m2 = [1123.0, 0.0, 1.0]
m3 = [1123.0, 791.0, 1.0]
m4 = [0.0, 791.0, 1.0]
M = [m1 m2 m3 m4]'

res = compute_homograpy_DLT(M , P)
homography = reshape(res, (3,3))'
println("Homography: ", homography)
end

And the result:

Julia (version 1.1.0)
Homography: [0.000244416 -0.000198718 0.915494; 2.16745e-5 8.80403e-5 0.40233; -5.35586e-8 -1.81231e-7 0.00140359]
  0.073007 seconds (218.81 k allocations: 11.512 MiB)
elapsed time (ns): 73007284
bytes allocated:   12071266
pool allocs:       218781
non-pool GC allocs:33

C++ (clang++ 7.0.1, release flags: "-march=native -O3 -DNDEBUG", Eigen 3.3.4)
Homography: 
 0.000244416 -0.000198718     0.915494
 2.16745e-05  8.80403e-05      0.40233
-5.35586e-08 -1.81231e-07   0.00140359
elapsed time (ns): 275887

As can be seen, the homography results are the same, but C++ with Eigen is approximately 250 times faster!
Note that I am a beginner in Julia, so it is quite possible that I did not implement my code efficiently. The main question is: how can I improve the speed of my Julia code?

1 Like

Before anything else, you should take a look at the Performance tips.

To get you started, you should really try to avoid all those temporary allocations. For example, a slice like M[i,:] on the right-hand side of an assignment allocates a whole new array. Here you might want to consider @views.
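A minimal sketch of the difference (exact timings vary by machine, but the allocation counts are the point):

using BenchmarkTools

M = rand(4, 3)
@btime sum($M[2, :])        # the slice copies the row: allocates a temporary Vector
@btime sum(view($M, 2, :))  # a view just references the parent array: no copy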

7 Likes

I remember reading somewhere that Eigen is even faster than MKL for small matrix computations, but that advantage disappears as the matrix dimension increases. Could you try a larger matrix?
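Something along these lines would do it (just a sketch; a random m×(m+1) matrix almost surely has a one-dimensional nullspace, like the 8×9 system here):

using LinearAlgebra, BenchmarkTools

for m in (8, 64, 256)
    A = rand(m, m + 1)
    print(m, "×", m + 1, ": ")
    @btime nullspace($A)
end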

Apart from pre-allocating, consider using @btime from BenchmarkTools; it also looks like your test may have included the JIT compile time. You can run @btime, or simply run your @timev block a second time, to exclude compilation. Refer to, e.g., here for tips on how to use @btime. Additionally, since your matrices are small, you may want to try https://github.com/JuliaArrays/StaticArrays.jl
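To see the compile-time effect concretely, here is a toy sketch: the first call to a function includes JIT compilation, the second measures the steady-state runtime.

f(A, B) = sum(A * B)
A, B = rand(100, 100), rand(100, 100)
@time f(A, B)   # first call: compiles f for these argument types, so the time is inflated
@time f(A, B)   # second call: measures only the actual runtime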

4 Likes

You’re using Eigen::Vector3d for points in C++, yet you’re using the equivalent of Eigen::VectorXd in Julia (i.e., a Vector{Float64}). You’ll definitely want to use StaticArrays.
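That is, something like this (a sketch):

using StaticArrays

v = [1.0, 2.0, 3.0]          # Vector{Float64}: size known only at runtime, heap-allocated
s = SVector(1.0, 2.0, 3.0)   # SVector{3,Float64}: size encoded in the type, like Eigen::Vector3d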

12 Likes

Just wrapping your main code in a function and timing it with @btime from BenchmarkTools, I got a runtime of 21 µs, which is several thousand times faster than what you’re seeing.

I also tried using StaticArrays, in which case the runtime is entirely dominated by the call to nullspace (by the way, nullspace doesn’t work for static arrays, unfortunately): all the code except nullspace took 150 ns to run, but nullspace(A) takes 18 µs.

Here’s a simple benchmark without static arrays, but using BenchmarkTools:

function foo()
    K = [ 1169.19630    0.0     652.98743;
             0.0     1169.61014 528.83429;
             0.0        0.0       1.0]

    T = [0.961255 -0.275448 0.0108487  112.79;
         0.171961  0.629936 0.75737   -217.627;
        -0.21545  -0.72616  0.652895  1385.13]

    R = T[:, 1:3]
    t = T[:,end]

    tl = position_in_world(0.0, 0.0)
    tr = position_in_world(1123.0, 0.0)
    br = position_in_world(1123.0, 791.0)
    bl = position_in_world(0.0, 791.0)

    p1 = π_projection_function(K, R, t, tl)
    p2 = π_projection_function(K, R, t, tr)
    p3 = π_projection_function(K, R, t, br)
    p4 = π_projection_function(K, R, t, bl)
    P = [p1 p2 p3 p4]'

    M = [   0.0   0.0  1.0;
         1123.0   0.0  1.0;
         1123.0 791.0  1.0;
            0.0 791.0  1.0]

    res = compute_homograpy_DLT(M, P)
    homography = reshape(res, (3,3))'
    return homography
end

Then benchmark it:

julia> @btime foo()
  20.643 μs (65 allocations: 14.48 KiB)
3×3 Adjoint{Float64,Array{Float64,2}}:
  0.000244416  -0.000198718  0.915494  
  2.16745e-5    8.80403e-5   0.40233   
 -5.35586e-8   -1.81231e-7   0.00140359

Edit: To clarify, the problem isn’t with your code, but with your benchmarking.

8 Likes

I think the comment about StaticArrays still stands for a thorough comparison. Eigen is pretty explicit about fixed-size types, and I believe that is where libraries like TensorFlow get a lot of their speed; just today I read about someone having trouble getting past the 32-dimension limit that’s hard-coded into TensorFlow.

Yes, it’s not possible to make a super-fast, type-stable nullspace, because the output size depends on the matrix elements rather than on the dimensions of the input. Eigen would have the same problem, of course. Even so, it’s a pity that the Base version of nullspace doesn’t work out of the box with SMatrix. Here, I fixed this:

(Hmm. It would be possible to return an SMatrix wrapped in a view from nullspace if somebody cared enough to try it out. I’m not sure that would be useful though.)
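(In the meantime, the simplest workaround for anyone following along is to convert before the call; a sketch that loses the static sizing but works out of the box, and is not the linked fix:)

using LinearAlgebra, StaticArrays

nullspace_fallback(A::StaticMatrix) = nullspace(Matrix(A))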

3 Likes

Would it help if the output were a Vector of SVector{N}s, where the length N is known?

1 Like

I don’t think so. But regardless of that, we don’t have a statically sized SVD implementation anyway, so there’s much more room for improvement there. Currently we fall back to svd(::AbstractMatrix) and simply wrap the output to preserve the size information. We could probably do much better by using MMatrix and passing it into LinearAlgebra.LAPACK.gesdd! directly for BlasFloat types, but nobody has dug in that deep yet.
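A rough sketch of that idea (hypothetical code, not in any package; static_svd_sketch is made up here, and a real implementation would try to avoid the copies this one makes):

using LinearAlgebra, StaticArrays

function static_svd_sketch(A::SMatrix{M,N,Float64}) where {M,N}
    buf = Matrix(A)  # LAPACK.gesdd! needs a plain, mutable, strided matrix
    U, S, Vt = LinearAlgebra.LAPACK.gesdd!('A', buf)
    # wrap the results back into statically sized types
    return SMatrix{M,M}(U), SVector{min(M,N)}(S), SMatrix{N,N}(Vt)
end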

I think that the 3×3 case has a more or less manageable closed form; at least I have seen it hardcoded.

To be honest, before this post I didn’t know about Julia’s JIT overhead. I read all of your comments and fixed my code. First, I used BenchmarkTools to evaluate execution speed, and I saw a significant improvement after the JIT warm-up. Second, all possible variables of type AbstractArray (dynamic memory allocation) were replaced with SVector and SMatrix. Finally, I used the @view macro for SubArray slicing.

function compute_homograpy_DLT(M, P)
      A = zeros(8,9)
      for i in 1:size(M)[1]
            A[2 * i - 1, 1:3] = @view M[i,:]
            A[2 * i - 1, 7:9] = - P[i,1] * @view M[i,:]
            A[2 * i, 4:6] = @view M[i,:]
            A[2 * i, 7:9] = - P[i,2] * @view M[i,:]
      end
      return nullspace(A)
end
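(A side note: the right-hand sides like - P[i,1] * @view M[i,:] still allocate a temporary for the scalar–view product. Fusing the assignment with .= would avoid even those temporaries; a sketch:)

using LinearAlgebra

function compute_homograpy_DLT_fused(M, P)
    A = zeros(8, 9)
    @views for i in 1:size(M, 1)
        A[2i - 1, 1:3] .= M[i, :]
        A[2i - 1, 7:9] .= -P[i, 1] .* M[i, :]   # fused: no temporary for the product
        A[2i, 4:6]     .= M[i, :]
        A[2i, 7:9]     .= -P[i, 2] .* M[i, :]
    end
    return nullspace(A)
end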

and the setup code:

K = @SMatrix [1169.19630     0.0     652.98743;
                  0.0    1169.61014  528.83429;
                  0.0        0.0       1.0]
T = @SMatrix [ 0.961255  -0.275448  0.0108487   112.79;
               0.171961   0.629936  0.75737    -217.627;
              -0.21545   -0.72616   0.652895   1385.13]

R = @view T[1:3, 1:3]
t = @view T[:,end]

tl = position_in_world(0.0, 0.0)
tr = position_in_world(1123.0, 0.0)
br = position_in_world(1123.0, 791.0)
bl = position_in_world(0.0, 791.0)

p1 = π_projection_function(K, R, t, tl)
p2 = π_projection_function(K, R, t, tr)
p3 = π_projection_function(K, R, t, br)
p4 = π_projection_function(K, R, t, bl)
P = [p1 p2 p3 p4]'

M = @SMatrix [   0.0   0.0  1.0;
              1123.0   0.0  1.0;
              1123.0 791.0  1.0;
                 0.0 791.0  1.0]

and re-execute the code:

151.783 μs (133 allocations: 13.27 KiB)

BUT, for a fair comparison between Julia and Eigen, I also modified my C++ code. I ran it for 100,000 iterations, and the average time is:

elapsed time (microseconds): 34.2004

So the C++ code is still about 5 times faster, and I think the difference comes from the nullspace computation.
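(This is easy to check in isolation:)

using LinearAlgebra, BenchmarkTools

A = rand(8, 9)
@btime nullspace($A)   # the SVD-based nullspace call dominates the per-call cost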

I have never installed MKL, so I don’t have any estimate of its performance. But for Eigen there is a benchmark done by the Eigen team; it evaluates the performance of Eigen’s dense decomposition methods as a function of the input matrix size.

The following runs in ~11 microseconds on my machine:

using LinearAlgebra
using StaticArrays

function position_in_world(x::Float64, y::Float64)
    target_width_px = 1123
    target_height_px = 791
    target_width_mm = 297
    target_height_mm = 210.025
    X = ((x + 0.5) / target_width_px) * target_width_mm - 0.5 * target_width_mm
    Y = ((y + 0.5) / target_height_px) * target_height_mm - 0.5 * target_height_mm
    Z = 0.0
    return SVector(X, Y, Z)
end

function compute_homograpy_DLT(M, P)
    T = promote_type(eltype(M), eltype(P))
    A = zeros(T, 8, 9)
    for i in 1:size(M, 1)
        Mi = M[i,:]
        A[2 * i - 1, 1:3] = Mi
        A[2 * i - 1, 7:9] = - P[i,1] * Mi
        A[2 * i, 4:6] = Mi
        A[2 * i, 7:9] = - P[i,2] * Mi
    end
    return nullspace(A)
end

function π_projection_function(K::AbstractMatrix, R::AbstractMatrix, t::AbstractVector, point::AbstractVector)
    point_in_camera = R * point + t
    point_in_image = K * point_in_camera
    return point_in_image / point_in_camera[end]
end
                  
function foo()
    K = @SMatrix [1169.19630     0.0     652.98743;
                      0.0    1169.61014  528.83429;
                      0.0        0.0       1.0]

    T = @SMatrix [ 0.961255  -0.275448  0.0108487   112.79;
                   0.171961   0.629936  0.75737    -217.627;
                  -0.21545   -0.72616   0.652895   1385.13]

    R = T[:, SOneTo(3)]
    t = T[:,end]

    tl = position_in_world(0.0, 0.0)
    tr = position_in_world(1123.0, 0.0)
    br = position_in_world(1123.0, 791.0)
    bl = position_in_world(0.0, 791.0)

    p1 = π_projection_function(K, R, t, tl)
    p2 = π_projection_function(K, R, t, tr)
    p3 = π_projection_function(K, R, t, br)
    p4 = π_projection_function(K, R, t, bl)
    P = [p1 p2 p3 p4]'

    M = @SMatrix [   0.0   0.0  1.0;
                  1123.0   0.0  1.0;
                  1123.0 791.0  1.0;
                     0.0 791.0  1.0]

    res = compute_homograpy_DLT(M, P)
    homography = reshape(res, (3,3))'
    return homography
end

using BenchmarkTools
@btime foo()

Note that everything up to compute_homograpy_DLT is completely constant-folded (computed once, at compile time), which is fair game for the compiler, since foo takes no arguments and all of the functions called on the constants defined in foo are pure. In fact, ideally the whole thing would get constant-folded, but the creation of dynamically sized Arrays in compute_homograpy_DLT prevents this. Whether this makes for a good benchmark depends on your actual use case.
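(If you want to rule out constant folding in a zero-argument benchmark, the usual BenchmarkTools idiom is to hide the inputs behind a Ref and interpolate it:)

using BenchmarkTools

x = Ref(1.5)
@btime sin($x[])   # dereferencing the Ref keeps the input opaque to the compiler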

7 Likes

In my previous evaluation I had left the print line in, which you removed. For a practical and fairer test, I removed the print line from both the Julia and the C++ code:
Julia: ~17 µs
C++: ~13 µs
We can ignore this 4-microsecond difference. Acceptable. Thanks.

3 Likes