Because matt1 and matt2 are (without purpose) defined before the loop? Otherwise, I don’t see it. (To be safe, declare vec1, mat, matt1 and matt2 as locals.)
Working on a blog post about it, but here’s the unpublished draft: https://github.com/JuliaLang/www.julialang.org/blob/65e25e87d01111cc45a477c8072c9f5dc5878a39/blog/2023/06/PSA-dont-use-threadid.md
Interesting. I always thought that each task in a for-loop was running separately, in the sense of not sharing local variables with the other iterations. Is that the problem in last instance?
Yes, they run separately. The problem is that to do a summation, you need to somehow merge those separate values, and the various ways people often try to merge those values is incorrect.
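A minimal sketch of a correct merge (my own illustration, not from the draft post): give each task its own accumulator, and combine the per-task partial results only after all tasks finish:

```julia
using Base.Threads: @spawn, nthreads

# Each task sums its own chunk into a task-local accumulator;
# the partial sums are merged only after every task has finished.
function threaded_sum(xs)
    chunks = Iterators.partition(eachindex(xs), max(1, length(xs) ÷ nthreads()))
    tasks = map(chunks) do chunk
        @spawn begin
            acc = zero(eltype(xs))
            for i in chunk
                acc += xs[i]
            end
            acc
        end
    end
    sum(fetch, tasks)
end
```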
But is the problem you mention arising here? If each thread is running separately and you store the result in Arg1[i], then you can sum Arg1 outside the loop.
Yes, that one is fine because you haven’t used threadid. But if you want to be efficient, you should avoid allocating an N-element vector Arg1 and can get away with nthreads() different elements, and that’s where people start getting themselves into trouble.
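For the record, the trouble usually looks like this (a deliberately broken sketch of the pattern the draft post warns against):

```julia
using Base.Threads: @threads, nthreads, threadid

# BROKEN sketch: the buf[threadid()] pattern. The read-modify-write on
# buf[threadid()] is not atomic, and since Julia 1.7 tasks can migrate
# between threads at yield points, so threadid() may change mid-iteration
# and updates can be silently lost.
function racy_sum(xs)
    buf = zeros(nthreads())
    @threads for i in eachindex(xs)
        buf[threadid()] += xs[i]  # unsafe: may give a wrong total
    end
    sum(buf)
end
```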
To get things back on topic, here’s the fastest version I could make (essentially just what @gdalle already wrote):
#+begin_src julia
using LinearAlgebra: mul!, ⋅
using BenchmarkTools: @btime

function vec_prod1(N, n)
    Arg = 0.0
    mat = Array{Float64}(undef, n, n)
    matt1 = Array{Float64}(undef, n, n)
    matt2 = Array{Float64}(undef, n, n)
    for i ∈ 1:N
        vec1 = 1:n
        mat .= vec1 .* vec1'
        matt1 .= (mat .* mat) ./ n^2
        matt2 .= mul!(matt2, mat, mat) ./ n^2  # mul! writes into matt2, then the broadcast rescales it in place
        Arg += (matt1 ⋅ matt2) / N
    end
    Arg
end
@btime vec_prod1(10, 1000)
#+end_src
#+RESULTS:
: 84.133 ms (6 allocations: 22.89 MiB)
: 2.094817739604174e19
I found this was actually faster than using explicit multithreading since the matrix multiplication is already multithreaded for 1000 x 1000
matrices and BLAS multithreading is very efficient.
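Relatedly, BLAS threading is configured separately from Julia’s own thread count; a quick sketch for checking or pinning it (using the LinearAlgebra stdlib):

```julia
using LinearAlgebra

# BLAS has its own thread pool, independent of Threads.nthreads().
# When adding explicit @spawn-level parallelism around BLAS calls,
# pinning BLAS to one thread avoids oversubscribing the cores.
nblas = BLAS.get_num_threads()
BLAS.set_num_threads(1)      # e.g. before an explicitly threaded section
BLAS.set_num_threads(nblas)  # restore afterwards
```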
Here’s the explicitly multithreaded version for comparison
#+begin_src julia
using LinearAlgebra: mul!, ⋅
using BenchmarkTools: @btime

function vec_prod3(N, n)
    chunks = Iterators.partition(1:N, max(1, N ÷ Threads.nthreads()))
    tasks = map(chunks) do chunk
        Threads.@spawn begin
            Arg = 0.0
            mat = Array{Float64}(undef, n, n)
            matt1 = Array{Float64}(undef, n, n)
            matt2 = Array{Float64}(undef, n, n)
            for i ∈ chunk
                vec1 = 1:n
                mat .= vec1 .* vec1'
                matt1 .= (mat .* mat) ./ n^2
                matt2 .= mul!(matt2, mat, mat) ./ n^2
                Arg += (matt1 ⋅ matt2) / N
            end
            Arg
        end
    end
    sum(fetch, tasks)
end
@btime vec_prod3(10, 1000)
#+end_src
#+RESULTS:
: 97.034 ms (84 allocations: 228.88 MiB)
: 2.094817739604174e19
Sorry, I don’t know why I said this was fine. You’ll still get race conditions from different threads overwriting mat, matt1 and matt2 (assuming you were actually calculating new values each iteration, instead of reusing the exact same values as is done here).
Now I see what you mean. The variables inside the loop weren’t local, because they were already defined outside the loop.
Yeah, redefining variables inside a function is error-prone. For example, for type stability:
#+begin_src julia
# STABLE
function type_stable()
    e = 1
    parameter() = e
    return e
end

# UNSTABLE: e is captured by the closure and then reassigned
function type_unstable1()
    e = 1
    parameter() = e
    e = 1
    return e
end

# UNSTABLE: the type declaration doesn't prevent the capture from being boxed
function type_unstable2()
    e = 1
    parameter() = e
    e::Int64 = 1
    return e
end
#+end_src
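One way to see the difference (my own sketch, using Base.return_types; the boxed capture from the issue linked below widens the inferred return type):

```julia
# Sketch: inspecting inferred return types. A local that is captured by a
# closure AND reassigned gets boxed (JuliaLang/julia#15276), so inference
# widens its type to Any.
function stable()
    e = 1
    parameter() = e
    return e
end

function unstable()
    e = 1
    parameter() = e
    e = 1
    return e
end

only(Base.return_types(stable, ()))    # Int
only(Base.return_types(unstable, ()))  # Any (the capture is boxed)
```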
Yeah, this is my least favorite issue in Julia: https://github.com/JuliaLang/julia/issues/15276. It’s one of the language’s most annoying performance gotchas.
Type instability isn’t the problem here, though; it’s a race condition: https://en.wikipedia.org/wiki/Race_condition
I know, I meant that redefining variables can lead to several problems (e.g. race conditions, type instability)
In this case, the variable is never reassigned; it’s just that the memory of the array it points to is overwritten.
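To illustrate that distinction (my own minimal sketch): the binding never changes, but two tasks write into the same array memory:

```julia
using Base.Threads: @spawn

a = zeros(3)          # the binding `a` is set once and never reassigned
t1 = @spawn (a .= 1)  # mutates the memory `a` points to
t2 = @spawn (a .= 2)  # mutates the same memory: a data race
wait(t1); wait(t2)
# each element ends up as 1.0 or 2.0 depending on scheduling
```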