Outperformed by Matlab

Julia 1.7.0-dev, Intel CPU with 4 physical cores; I will try with 1.6.3. What I still find strange is that, in my case, the fastest code is the one copied from Matlab.

No big change, except for Tullio, which allocates less on 1.6.3:

  1.586 s (20001 allocations: 943.88 KiB)
  1.495 s (30001 allocations: 1.07 MiB)
  510.505 ms (5 allocations: 6.59 KiB)
  228.873 ms (6 allocations: 6.61 KiB)
  217.190 ms (104 allocations: 5.67 KiB)
  152.197 ms (19 allocations: 197.54 MiB)
  166.631 ms (10 allocations: 188.00 MiB)

Did you re-enable @fastmath? I recently had the impression that it can pessimize code.

What is going on here? I am getting this:

julia> @btime normalEquationsMatlabJulia($C,$Cx,$x,$A,$B); 
  3.808 ms (13 allocations: 9.17 MiB)

I’m copying the complete Julia session here, because I think I’m going crazy:

% julia
               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.6.3 (2021-09-23)
 _/ |\__'_|_|_|\__'_|  |  Official https://julialang.org/ release
|__/                   |

julia> using LoopVectorization

julia> function  normalEquationsMatlabJulia(CC,Cx,A,B,x)
           batchSize = 10000;
           stencil = ones(1, size(A,2))
           # @fastmath @inbounds @views for n = 1:batchSize:size(A,1)
           @views for n = 1:batchSize:size(A,1)
               idx = min(n+batchSize-1,size(A,1))
               @turbo temp = repeat(A[n:idx,:],1,size(B,2)).*kron(B[n:idx,:],stencil )
               @turbo CC .= CC.+temp'*temp
               @turbo Cx .= Cx.+temp'*x[n:idx]
           end
           return Cx 
       end
normalEquationsMatlabJulia (generic function with 1 method)

julia> N = 10000;

julia> R = 20;

julia> M = 40;

julia> A = rand(N,M);

julia> B = rand(N,R);

julia> C = rand(M*R,M*R);

julia> x = rand(N,);

julia> Cx = rand(M*R,);  

julia> @btime normalEquationsMatlabJulia($C,$Cx,$x,$A,$B); 
  3.808 ms (13 allocations: 9.17 MiB)


Is the result correct? @goerch had some @assert checks to verify it (they pass when I run it). Also, how does the @btime of that function compare to the others on your machine?

Well, the assertions are not correct, because CxNormal just refers to the same array:

normal!(C,Cx,x,A,B_r);
CxNormal = Cx  # CxNormal is just the same array as Cx

normalTurbo!(C,Cx,x,A,B);
@assert Cx β‰ˆ CxNormal    # this will always be true

It should be CxNormal = copy(Cx).

Also, since the functions modify Cx in place, the different versions are not operating on the same input values. The original Cx should be reset (or copied) between calls; I am not sure how much that affects performance, but it may, and it is necessary for the assertions to be meaningful.
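
A minimal sketch of the pattern being suggested, using the names from the code above (reset the outputs and copy the reference result before comparing):

C = zeros(M*R, M*R); Cx = zeros(M*R)   # fresh outputs for the reference run
normal!(C, Cx, x, A, B_r)
CxNormal = copy(Cx)                    # a copy, so later calls cannot mutate the reference

C = zeros(M*R, M*R); Cx = zeros(M*R)   # reset before every other variant
normalTurbo!(C, Cx, x, A, B)
@assert Cx β‰ˆ CxNormal                  # now compares two distinct arrays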

4 Likes

Sorry for the mistake. I tried to correct my post above. As far as I understand, we have three different results:

  • normal!, normalTurbo! and normalTturbo! computing the same
  3.404 s (20004 allocations: 5.81 MiB)
  1.184 s (20004 allocations: 5.81 MiB)
  1.179 s (20004 allocations: 5.81 MiB)
  • normalEquations!, normalEquationsP!, normalEquationsMatlab and normalEquationsMatlabJulia computing the same
  385.193 ms (8 allocations: 4.90 MiB)
  479.380 ms (8 allocations: 4.90 MiB)
  182.132 ms (19 allocations: 197.54 MiB)
  209.968 ms (13 allocations: 192.88 MiB)
  β€’ normalEquationsTullio!
 99.472 ms (156 allocations: 4.90 MiB)

Sorry, I didn’t notice that before I replied. FWIW, I think that annotations far from their effect make such misinterpretations likely.

The temp array here is transposed in some of the versions and not in others. That probably explains most of the differences in the results.

Just added more validations: the problem seems to be more systematic; it also occurs with C.

1 Like

I started a thread about the innards of kron here.
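
For a quick illustration of why the argument order of kron matters (this is what produced the permuted results discussed below):

a = [1, 2]; b = [10, 20, 30]
kron(a, b)   # [10, 20, 30, 20, 40, 60] -- the second argument varies fastest
kron(b, a)   # [10, 20, 20, 40, 30, 60] -- the same entries in a different order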

1 Like

Thank you all for the input, and apologies again for the sloppy code. @goerch @lmiq I fixed the source of the different results: the results so far agreed only up to a permutation, because kron(A,B) was being called instead of kron(B,A). Also, the Matlab copy was missing broadcasting, which is actually needed. Below is @goerch’s code, modified accordingly, with @fastmath enabled (it seems to speed up the code for me):

using LoopVectorization
using Tullio
using BenchmarkTools


function normal!(C,Cx,x,A,B)
    temp = Vector{Float64}(undef,size(A,2)*size(B,2))
    @fastmath @views @inbounds for n = 1:size(A,1)
    # @views for n = 1:size(A,1)
        kron!(temp,B[n,:],A[n,:])
        C .= C.+ temp.*temp'
        Cx .= Cx .+ temp.*x[n]
    end
    return Cx
end


function normalTurbo!(C,Cx,x,A,B)
    temp = Vector{Float64}(undef,size(A,2)*size(B,2))
    @fastmath @views @inbounds for n = 1:size(A,1)
    # @views for n = 1:size(A,1)
        kron!(temp,B[n,:],A[n,:])
        @turbo C .= C.+ temp.*temp'
        @turbo Cx .= Cx .+ temp.*x[n]
    end
    return Cx
end


function normalTturbo!(C,Cx,x,A,B)
    temp = Vector{Float64}(undef,size(A,2)*size(B,2))
    @fastmath @views @inbounds for n = 1:size(A,1)
    # @views for n = 1:size(A,1)
        kron!(temp,B[n,:],A[n,:])
        @tturbo C .= C.+ temp.*temp'
        @tturbo Cx .= Cx .+ temp.*x[n]
    end
    return Cx
end


# swapped arguments
function normalEquations!(C,Cx,x,A,B)
    C4 = reshape(C, size(B,2),size(A,2),size(B,2),size(A,2))  # writing into original C
    C4x = reshape(Cx, size(B,2),size(A,2))
    @fastmath @inbounds @turbo for m = 1:size(B,2)
        # @inbounds @turbo for m = 1:size(B,2)
        for r = 1:size(A,2)
            for mm = 1:size(B,2)
                for rr = 1:size(A,2)
                    for n = 1:size(B,1)
                        temp =  A[n,r] * B[n,m]
                        C4[m,r,mm,rr] = C4[m,r,mm,rr] + temp * A[n,rr] * B[n,mm]
                        C4x[m,r] = C4x[m,r] + temp*x[n]
                    end
                end
            end
        end
    end
    return Cx
end

# swapped arguments
function normalEquationsP!(C,Cx,x,A,B)
    C4 = reshape(C, size(B,2),size(A,2),size(B,2),size(A,2))  # writing into original C
    C4x = reshape(Cx, size(B,2),size(A,2))
    @fastmath @inbounds @tturbo for m = 1:size(B,2)
        # @inbounds  @tturbo for m = 1:size(B,2)
        for r = 1:size(A,2)
            for mm = 1:size(B,2)
                for rr = 1:size(A,2)
                    for n = 1:size(B,1)
                        temp =  A[n,r] * B[n,m]
                        C4[m,r,mm,rr] = C4[m,r,mm,rr] + temp * A[n,rr] * B[n,mm]
                        C4x[m,r] = C4x[m,r] + temp*x[n]
                    end
                end
            end
        end
    end
    return Cx
end



# swapped arguments
function normalEquationsTullio!(C,Cx,x,A,B)
    C4 = reshape(C, size(B,2),size(A,2),size(B,2),size(A,2))  # writing into original C
    Cx4 = reshape(Cx, size(B,2),size(A,2))
    @tullio C4[j,i,l,k] = A[n,i] * B[n,j] * A[n,k] * B[n,l]
    @tullio Cx4[j,i] = A[n,i] * B[n,j] * x[n]
    return Cx
end


function  normalEquationsMatlab(CC,Cx,A,B,x)
    batchSize = 10000;
    @inbounds @fastmath for n = 1:batchSize:size(A,1)
        idx = min(n+batchSize-1,size(A,1));
        temp = repeat(A[n:idx,:],1,size(B,2)).*kron(B[n:idx,:], ones(1, size(A,2)));
        CC .= CC.+temp'*temp;
        Cx .= Cx.+temp'*x[n:idx];
    end
    return Cx
end


function  normalEquationsMatlabJulia(CC,Cx,A,B,x)
    batchSize = 10000;
    stencil = ones(1, size(A,2))
    @fastmath @inbounds @views for n = 1:batchSize:size(A,1)
    # @inbounds @views for n = 1:batchSize:size(A,1)
        idx = min(n+batchSize-1,size(A,1))
        @turbo temp = repeat(A[n:idx,:],1,size(B,2)).*kron(B[n:idx,:],stencil )
        @turbo CC .= CC.+temp'*temp
        @turbo Cx .= Cx.+temp'*x[n:idx]
    end
    return Cx
end


N = 10000;
R = 20;
M = 40;


A = rand(N,M);
B = rand(N,R);
x = rand(N,);

C = zeros(M*R,M*R);
Cx = zeros(M*R,);

A_r = PermutedDimsArray(permutedims(A), (2, 1))
B_r = PermutedDimsArray(permutedims(B), (2, 1))

normal!(C,Cx,x,A,B_r);
CxNormal = copy(Cx)


C = zeros(M*R,M*R);
Cx = zeros(M*R,);
normalTurbo!(C,Cx,x,A,B);
@assert Cx β‰ˆ CxNormal


C = zeros(M*R,M*R);
Cx = zeros(M*R,);
normalTturbo!(C,Cx,x,A,B);
@assert Cx β‰ˆ CxNormal


C = zeros(M*R,M*R);
Cx = zeros(M*R,);
normalEquations!(C,Cx,x,B_r,A);
@assert Cx β‰ˆ CxNormal


C = zeros(M*R,M*R);
Cx = zeros(M*R,);
normalEquationsP!(C,Cx,x,B,A);
@assert Cx β‰ˆ CxNormal


C = zeros(M*R,M*R);
Cx = zeros(M*R,);
normalEquationsTullio!(C,Cx,x,B,A);
@assert Cx β‰ˆ CxNormal


C = zeros(M*R,M*R);
Cx = zeros(M*R,);
normalEquationsMatlab(C,Cx,A,B,x);
@assert Cx β‰ˆ CxNormal


C = zeros(M*R,M*R);
Cx = zeros(M*R,);
normalEquationsMatlabJulia(C,Cx,A,B,x);
@assert Cx β‰ˆ CxNormal


@btime normal!($C,$Cx,$x,$A,$B_r);

@btime normalTurbo!($C,$Cx,$x,$A,$B);

@btime normalTturbo!($C,$Cx,$x,$A,$B);

@btime normalEquations!($C,$Cx,$x,$B_r,$A);

@btime normalEquationsP!($C,$Cx,$x,$B_r,$A);

@btime normalEquationsTullio!($C,$Cx,$x,$B_r,$A);

@btime normalEquationsMatlab($C,$Cx,$A,$B,$x);

@btime normalEquationsMatlabJulia($C,$Cx,$A,$B,$x);

Now all the asserts pass. In my case now the timings are the following:

  6.306 s (20001 allocations: 943.88 KiB)       #Normal
  1.114 s (20001 allocations: 943.88 KiB)       #NormalTurbo
  1.039 s (30001 allocations: 1.07 MiB)         #NormalTturbo
  382.551 ms (4 allocations: 224 bytes)         #Loop
  146.457 ms (5 allocations: 240 bytes)         #ParallelLoop
  176.421 ms (104 allocations: 5.67 KiB)        #Tullio
  115.667 ms (16 allocations: 192.65 MiB)       #Matlab
  139.217 ms (10 allocations: 188.00 MiB)       #MatlabJulia

Now the performance is already way better (β‰ˆ 1.5Γ— for me) thanks to the indexing (thanks @Jeff_Emanuel, @goerch, @baggepinnen). However, it is still slower than the Matlab version written in Julia (normalEquationsMatlab) and than the actual Matlab implementation (identical to normalEquationsMatlab except that it uses repmat instead of repeat), which takes 0.0869 seconds on my computer.
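
For anyone comparing the two languages: Julia’s repeat tiles an array the same way Matlab’s repmat does, so that part of the translation is one-to-one:

A = [1 2; 3 4]
repeat(A, 1, 2)   # 2Γ—4: [1 2 1 2; 3 4 3 4], the same as repmat(A, 1, 2) in Matlab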

1 Like

Hi @germinator,

thanks for the updated MWE. We still have to discuss the difference in our results. These are my latest numbers for your consolidated MWE:

  5.525 s (20001 allocations: 943.88 KiB)
  1.204 s (20001 allocations: 943.88 KiB)
  898.374 ms (30001 allocations: 1.07 MiB)
  483.086 ms (4 allocations: 224 bytes)
  105.213 ms (5 allocations: 240 bytes)
  97.093 ms (153 allocations: 8.70 KiB)
  185.628 ms (16 allocations: 192.65 MiB)
  207.268 ms (10 allocations: 188.00 MiB)

Using the naive

function _kron!(C::T1, A::T2, B::T3) where {T1 <: AbstractVector, T2 <: AbstractVector, T3 <: AbstractVector}
    for i = 1:size(A, 1)
        @views mul!(C[(i-1)*size(B,1)+1:i*size(B,1)], A[i], B)
    end
    C
end

where applicable I see

  3.458 s (1 allocation: 6.38 KiB)
  1.160 s (1 allocation: 6.38 KiB)
  911.221 ms (10001 allocations: 162.62 KiB)
  480.688 ms (4 allocations: 224 bytes)
  106.741 ms (5 allocations: 240 bytes)
  102.574 ms (153 allocations: 8.70 KiB)
  189.446 ms (16 allocations: 192.65 MiB)
  218.533 ms (10 allocations: 188.00 MiB)
1 Like

Hi @goerch, thanks for your updates. I have the same versions of Tullio and LoopVectorization.

(@v1.6) pkg> status Tullio
      Status `C:\Users\LocalAdmin\.julia\environments\v1.6\Project.toml`
  [bc48ee85] Tullio v0.3.2
(@v1.6) pkg> status LoopVectorization
      Status `C:\Users\LocalAdmin\.julia\environments\v1.6\Project.toml`
  [bdcacae8] LoopVectorization v0.12.98

I probably have dynamic frequency scaling on as well, although in the power settings I have the minimum and maximum processor states at 100%.

Now, when it comes to the results: could it simply be because you are running Julia with 6 threads? That would make you faster when using @tturbo and thus also @tullio. It would explain the similar timings for the kron!-based and Matlab-style functions, and the better scaling with @tturbo and @tullio. One thing that makes me wonder, however, is why in your results @tullio is faster than the loops with @tturbo, since @tullio is called twice in that function and many of the computations for C and Cx are shared and thus done twice. Would this mean that the looped version can still be improved further?

Also, do you happen to have a Matlab license? It would be interesting to see how it performs on your machine.
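
For reference, the thread counts in question can be checked with standard calls:

using LinearAlgebra
Threads.nthreads()       # number of Julia threads, fixed at startup (e.g. julia -t 6)
BLAS.get_num_threads()   # number of BLAS threads used by mul! and '*'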

As an experiment I ran the benchmark with both Julia and BLAS threads set to one (the latter via BLAS.set_num_threads(1)) and see:

  3.196 s (1 allocation: 6.38 KiB)
  1.091 s (1 allocation: 6.38 KiB)
  1.083 s (1 allocation: 6.38 KiB)
  431.337 ms (4 allocations: 224 bytes)
  389.771 ms (4 allocations: 224 bytes)
  394.443 ms (4 allocations: 224 bytes)
  555.561 ms (10 allocations: 188.00 MiB)
  557.053 ms (10 allocations: 188.00 MiB)

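For anyone reproducing this configuration, a minimal sketch of the setup (the exact startup invocation wasn’t shown, so the -t flag is an assumption):

# start Julia with a single thread, e.g.  julia -t 1
using LinearAlgebra
BLAS.set_num_threads(1)   # also limit the BLAS backend to a single thread
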
I’m curious to hear what you make of this. Interesting stuff could happen if someone runs this on a real workstation…

Sorry, no.

I don’t know which of the variants is the most interesting candidate for improvement. Both normalEquationsTullio! and normalEquationsMatlab look like clever solutions to the problem.

I’m sorry, I am not following all the details of the thread, but have you tried using MKL instead of OpenBLAS in the Julia implementations? That could explain the difference relative to Matlab, if the performance depends mainly on matrix multiplication.

6 Likes

I’d like to propose adding something like

function kr!(C, A, B)
    @fastmath @inbounds for n = 1:size(A,1)
        for i = 1:size(B,2)
            for j = 1:size(A,2)
                C[(i-1)*size(A,2)+j, n] = A[n,j] * B[n,i]
            end
        end
    end
end

function _normal!(C,Cx,x,A,B)
    temp = Matrix{Float64}(undef,size(A,2)*size(B,2),size(A,1))
    kr!(temp, A, B)
    mul!(C, temp, temp')
    mul!(Cx, temp, x)
    return Cx
end

to the comparison.

Here we go again. My latest MWE, switching to MKL,

using LinearAlgebra
using MKL
using LoopVectorization
using Tullio
using BenchmarkTools

# BLAS.set_num_threads(1)

function _kron(A::T1, B::T2) where {T1 <: AbstractVector, T2 <: AbstractVector}
    C = zeros(size(A,1) * size(B,1))
    @fastmath @views @inbounds for i = 1:size(A, 1)
        C[(i-1)*size(B,1)+1:i*size(B,1)] .= A[i] * B
    end
    C
end

function _kron!(C::T1, A::T2, B::T3) where {T1 <: AbstractVector, T2 <: AbstractVector, T3 <: AbstractVector}
    @fastmath @views @inbounds for i = 1:size(A, 1)
        mul!(C[(i-1)*size(B,1)+1:i*size(B,1)], A[i], B)
    end
    C
end


function normal!(C,Cx,x,A,B)
    temp = Vector{Float64}(undef,size(A,2)*size(B,2))
    @fastmath @views @inbounds for n = 1:size(A,1)
    # @views for n = 1:size(A,1)
        _kron!(temp,B[n,:],A[n,:])
        C .+= temp.*temp'
        Cx .+= temp.*x[n]
    end
    return Cx
end


function kr!(C, A, B)
    @fastmath @inbounds for n = 1:size(A,1)
        # @views C[1:size(A,2)*size(B,2), n:n] .= reshape(A[n,:] * B[n,:]', size(A,2) * size(B,2))
        # @views for i = 1:size(B, 2)
        #     C[(i-1)*size(A,2)+1:i*size(A,2), n:n] .= A[n,:] .* B[n,i]
        # end
        for i = 1:size(B,2)
            for j = 1:size(A,2)
                C[(i-1)*size(A,2)+j, n] = A[n,j] * B[n,i]
            end
        end
    end
end

function _normal!(C,Cx,x,A,B)
    temp = Matrix{Float64}(undef,size(A,2)*size(B,2),size(A,1))
    kr!(temp, A, B)
    mul!(C, temp, temp')
    mul!(Cx, temp, x)
    return Cx
end


function normalTurbo!(C,Cx,x,A,B)
    temp = Vector{Float64}(undef,size(A,2)*size(B,2))
    @fastmath @views @inbounds for n = 1:size(A,1)
    # @views for n = 1:size(A,1)
        _kron!(temp,B[n,:],A[n,:])
        @turbo C .+= temp.*temp'
        @turbo Cx .+= temp.*x[n]
    end
    return Cx
end


function normalTturbo!(C,Cx,x,A,B)
    temp = Vector{Float64}(undef,size(A,2)*size(B,2))
    @fastmath @views @inbounds for n = 1:size(A,1)
    # @views for n = 1:size(A,1)
        _kron!(temp,B[n,:],A[n,:])
        @tturbo C .+= temp.*temp'
        @tturbo Cx .+= temp.*x[n]
    end
    return Cx
end


# swapped arguments
function normalEquations!(C,Cx,x,A,B)
    C4 = reshape(C, size(B,2),size(A,2),size(B,2),size(A,2))  # writing into original C
    C4x = reshape(Cx, size(B,2),size(A,2))
    @fastmath @inbounds @turbo for m = 1:size(B,2)
        # @inbounds @turbo for m = 1:size(B,2)
        for r = 1:size(A,2)
            for mm = 1:size(B,2)
                for rr = 1:size(A,2)
                    for n = 1:size(B,1)
                        temp =  A[n,r] * B[n,m]
                        C4[m,r,mm,rr] = C4[m,r,mm,rr] + temp * A[n,rr] * B[n,mm]
                        C4x[m,r] = C4x[m,r] + temp*x[n]
                    end
                end
            end
        end
    end
    return Cx
end

# swapped arguments
function normalEquationsP!(C,Cx,x,A,B)
    C4 = reshape(C, size(B,2),size(A,2),size(B,2),size(A,2))  # writing into original C
    C4x = reshape(Cx, size(B,2),size(A,2))
    @fastmath @inbounds @tturbo for m = 1:size(B,2)
        # @inbounds  @tturbo for m = 1:size(B,2)
        for r = 1:size(A,2)
            for mm = 1:size(B,2)
                for rr = 1:size(A,2)
                    for n = 1:size(B,1)
                        temp =  A[n,r] * B[n,m]
                        C4[m,r,mm,rr] = C4[m,r,mm,rr] + temp * A[n,rr] * B[n,mm]
                        C4x[m,r] = C4x[m,r] + temp*x[n]
                    end
                end
            end
        end
    end
    return Cx
end


# swapped arguments
function normalEquationsTullio!(C,Cx,x,A,B)
    C4 = reshape(C, size(B,2),size(A,2),size(B,2),size(A,2))  # writing into original C
    Cx4 = reshape(Cx, size(B,2),size(A,2))
    @tullio C4[j,i,l,k] = A[n,i] * B[n,j] * A[n,k] * B[n,l]
    @tullio Cx4[j,i] = A[n,i] * B[n,j] * x[n]
    return Cx
end


function  normalEquationsMatlab(CC,Cx,A,B,x)
    batchSize = 10000;
    stencil = ones(1, size(A,2))
    @inbounds @fastmath @views for n = 1:batchSize:size(A,1)
        idx = min(n+batchSize-1,size(A,1));
        temp = repeat(A[n:idx,:],1,size(B,2)).*kron(B[n:idx,:], stencil)
        CC .+= temp'*temp;
        Cx .+= temp'*x[n:idx];
    end
    return Cx
end


function  normalEquationsMatlabJulia(CC,Cx,A,B,x)
    batchSize = 10000;
    stencil = ones(1, size(A,2))
    @fastmath @inbounds @views for n = 1:batchSize:size(A,1)
    # @inbounds @views for n = 1:batchSize:size(A,1)
        idx = min(n+batchSize-1,size(A,1))
        @turbo temp = repeat(A[n:idx,:],1,size(B,2)).*kron(B[n:idx,:],stencil )
        @turbo CC .+= temp'*temp
        @turbo Cx .+= temp'*x[n:idx]
    end
    return Cx
end


N = 10000;
R = 20;
M = 40;


A = rand(N,M);
B = rand(N,R);
x = rand(N,);

C = zeros(M*R,M*R);
Cx = zeros(M*R,);

A_r = PermutedDimsArray(permutedims(A), (2, 1))
B_r = PermutedDimsArray(permutedims(B), (2, 1))

normal!(C,Cx,x,A,B_r);
CNormal = copy(C)
CxNormal = copy(Cx)


C = zeros(M*R,M*R);
Cx = zeros(M*R,);
_normal!(C,Cx,x,A,B_r);
@assert C β‰ˆ CNormal
@assert Cx β‰ˆ CxNormal


C = zeros(M*R,M*R);
Cx = zeros(M*R,);
normalTurbo!(C,Cx,x,A,B);
@assert C β‰ˆ CNormal
@assert Cx β‰ˆ CxNormal


C = zeros(M*R,M*R);
Cx = zeros(M*R,);
normalTturbo!(C,Cx,x,A,B);
@assert C β‰ˆ CNormal
@assert Cx β‰ˆ CxNormal


C = zeros(M*R,M*R);
Cx = zeros(M*R,);
normalEquations!(C,Cx,x,B_r,A);
@assert C β‰ˆ CNormal
@assert Cx β‰ˆ CxNormal


C = zeros(M*R,M*R);
Cx = zeros(M*R,);
normalEquationsP!(C,Cx,x,B,A);
@assert C β‰ˆ CNormal
@assert Cx β‰ˆ CxNormal


C = zeros(M*R,M*R);
Cx = zeros(M*R,);
normalEquationsTullio!(C,Cx,x,B,A);
@assert C β‰ˆ CNormal
@assert Cx β‰ˆ CxNormal


C = zeros(M*R,M*R);
Cx = zeros(M*R,);
normalEquationsMatlab(C,Cx,A,B,x);
@assert C β‰ˆ CNormal
@assert Cx β‰ˆ CxNormal


C = zeros(M*R,M*R);
Cx = zeros(M*R,);
normalEquationsMatlabJulia(C,Cx,A,B,x);
@assert C β‰ˆ CNormal
@assert Cx β‰ˆ CxNormal


@btime normal!($C,$Cx,$x,$A,$B_r);

@btime _normal!($C,$Cx,$x,$A,$B);

@btime normalTurbo!($C,$Cx,$x,$A,$B);

@btime normalTturbo!($C,$Cx,$x,$A,$B);

@btime normalEquations!($C,$Cx,$x,$B_r,$A);

@btime normalEquationsP!($C,$Cx,$x,$B_r,$A);

@btime normalEquationsTullio!($C,$Cx,$x,$B_r,$A);

@btime normalEquationsMatlab($C,$Cx,$A,$B,$x);

@btime normalEquationsMatlabJulia($C,$Cx,$A,$B,$x);

resulting in

  3.705 s (1 allocation: 6.38 KiB)
  41.172 ms (2 allocations: 61.04 MiB)
  1.247 s (1 allocation: 6.38 KiB)
  935.492 ms (10001 allocations: 162.62 KiB)
  406.754 ms (4 allocations: 224 bytes)
  112.410 ms (5 allocations: 240 bytes)
  105.382 ms (152 allocations: 8.67 KiB)
  118.262 ms (10 allocations: 188.00 MiB)
  124.236 ms (10 allocations: 188.00 MiB)

So what is the difference when compared to Matlab now?

3 Likes

Here are my timings, since I’m procrastinating.

2.319 s (1 allocation: 6.38 KiB)
45.789 ms (2 allocations: 61.04 MiB)
626.227 ms (1 allocation: 6.38 KiB)
741.258 ms (10001 allocations: 162.62 KiB)
309.291 ms (4 allocations: 224 bytes)
38.514 ms (5 allocations: 240 bytes)
37.346 ms (860 allocations: 48.42 KiB)
87.815 ms (10 allocations: 188.00 MiB)
100.636 ms (10 allocations: 188.00 MiB)
Julia Version 1.6.2
Commit 1b93d53fc4 (2021-07-14 15:36 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: AMD Ryzen 9 5950X 16-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, generic)

That was 32 CPU threads, 8 BLAS threads, and without MKL.

3 Likes