Of course, your code is a bit odd because it does the same thing at every iteration, so we could just do the computation once and multiply the resulting Arg1 by N. For the sake of comparison, however, I wrote another function that performs exactly the same computations.
The better memory management comes from pre-allocating the outputs and fusing the vectorized (broadcast) operations.
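The fusion point is worth a one-line illustration: in Julia, a chain of dotted operators compiles into a single loop, and with `.=` the result is written into an existing array, so the whole expression allocates nothing. A minimal sketch (the names `x` and `y` are just for illustration):

```julia
x = rand(1000)
y = similar(x)           # pre-allocated output buffer

# All the dots fuse into one loop over x, writing directly into y,
# so this line allocates nothing once y exists:
y .= 2 .* x .^ 2 .+ 1
# versus y = 2 .* x .^ 2 .+ 1, which allocates a new vector every time
```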
In addition I used a few specific linear algebra functions:
dot(matt1, matt2) to replace sum(matt1 .* matt2), which allocates the elementwise product before summing it
mul!(matt2, mat, mat) instead of matt2 = mat * mat, to avoid allocating the matrix product since I already know where I'm going to store it
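Both substitutions compute the same values; they just skip the temporary arrays. A quick sketch of the equivalences, using small throwaway matrices for illustration:

```julia
using LinearAlgebra

a = rand(4, 4)
b = rand(4, 4)

# dot(a, b) returns the same scalar as sum(a .* b), but without
# materializing the elementwise product a .* b first.
dot(a, b) ≈ sum(a .* b)

# mul!(c, a, b) writes the product a * b into an existing buffer c,
# instead of allocating a fresh matrix the way c = a * b does.
c = similar(a)
mul!(c, a, b)
c ≈ a * b
```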
julia> using BenchmarkTools, LinearAlgebra
julia> function better_vec_prod_aux!(mat, matt1, matt2, N, n)
           Arg1 = 0.0
           for i = 1:N
               vec1 = 1:n
               mat .= vec1 .* vec1'
               matt1 .= (mat .* mat) .* (1 / n^2)
               mul!(matt2, mat, mat)
               matt2 .*= 1 / n^2
               Arg1 = Arg1 + dot(matt1, matt2) / N
           end
           return Arg1
       end;
julia> function better_vec_prod(N, n)
           mat = Matrix{Float64}(undef, n, n)
           matt1 = Matrix{Float64}(undef, n, n)
           matt2 = Matrix{Float64}(undef, n, n)
           return better_vec_prod_aux!(mat, matt1, matt2, N, n)
       end;
julia> vec_prod(2, 5) == better_vec_prod(2, 5)
true
julia> @btime vec_prod(10, 1000);
3.820 s (154 allocations: 473.33 MiB)
julia> @btime better_vec_prod(10, 1000);
238.816 ms (6 allocations: 22.89 MiB)
julia> @btime better_vec_prod_aux!(mat, matt1, matt2, 10, 1000) setup = (
           n = 1000;
           mat = Matrix{Float64}(undef, n, n);
           matt1 = Matrix{Float64}(undef, n, n);
           matt2 = Matrix{Float64}(undef, n, n)
       );
247.210 ms (0 allocations: 0 bytes)
As you can see, in this version the only allocations happen before the loop, when the three matrices are created: better_vec_prod_aux! itself performs none. This is especially important when dealing with large matrices like yours (1000 x 1000).