Outperformed by Matlab

Hi, I forgot to mention that the system is regularized, so that X'X + M = X'y; that is why I am trying to compute these two quantities (I will edit the question).
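For readers following along, a minimal sketch of such a ridge-style regularized solve (the names, sizes, and the diagonal regularizer `M` here are illustrative assumptions, not from the thread):

```julia
using LinearAlgebra

# Hypothetical illustration of a regularized normal-equations solve:
# instead of X'X w = X'y, solve (X'X + M) w = X'y.
X = rand(100, 5)
y = rand(100)
M = 0.1 * I(5)               # assumed diagonal regularization term
w = (X' * X + M) \ (X' * y)  # solve the regularized system
```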

Hello, I am using the @views macro at the beginning of each (outermost) loop when slicing; I have read that it should be equivalent?
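For reference, a tiny self-contained check of the equivalence in question: `@views` in front of a loop rewrites the slices in its body into views, the same as writing `view(...)` by hand (function names here are made up for illustration):

```julia
# The two loops should be equivalent: @views turns A[n, :] into a view.
rowsum_views(A) = begin
    s = 0.0
    @views for n in 1:size(A, 1)
        s += sum(A[n, :])        # A[n, :] becomes a view, no copy
    end
    s
end

rowsum_explicit(A) = begin
    s = 0.0
    for n in 1:size(A, 1)
        s += sum(view(A, n, :))  # the hand-written equivalent
    end
    s
end

A = rand(4, 4)
rowsum_views(A) ≈ rowsum_explicit(A) ≈ sum(A)
```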

Hello, try calling it with the arguments A and B swapped (as in the example); it works for me with bounds checks enabled.

Sorry, my copy & paste mistake when adapting the MWE. Should I somehow edit/delete my previous comment?

No worries, just edit it perhaps? Apologies on my side for the lazy indexing in the function


Edit: corrected after @lmiq's observations…

Here is a modified MWE incorporating the proposals from the thread and @aplavin’s nice trick from Y.A.t.Q : Yet Another @threads Question - #13 by aplavin

using LinearAlgebra
using LoopVectorization
using Tullio
using BenchmarkTools


function normal!(C,Cx,x,A,B)
    temp = Vector{Float64}(undef,size(A,2)*size(B,2))
    # @fastmath @views @inbounds for n = 1:size(A,1)
    @views for n = 1:size(A,1)
        kron!(temp,A[n,:],B[n,:])
        C .= C .+ temp .* temp'
        Cx .= Cx .+ temp .* x[n]
    end
    return Cx
end


function normalTurbo!(C,Cx,x,A,B)
    temp = Vector{Float64}(undef,size(A,2)*size(B,2))
    # @fastmath @views @inbounds for n = 1:size(A,1)
    @views for n = 1:size(A,1)
        kron!(temp,A[n,:],B[n,:])
        @turbo C .= C .+ temp .* temp'
        @turbo Cx .= Cx .+ temp .* x[n]
    end
    return Cx
end


function normalTturbo!(C,Cx,x,A,B)
    temp = Vector{Float64}(undef,size(A,2)*size(B,2))
    # @fastmath @views @inbounds for n = 1:size(A,1)
    @views for n = 1:size(A,1)
        kron!(temp,A[n,:],B[n,:])
        @tturbo C .= C .+ temp .* temp'
        @tturbo Cx .= Cx .+ temp .* x[n]
    end
    return Cx
end


# swapped arguments
function normalEquations!(C,Cx,x,A,B)
    temp = Matrix{Float64}(undef,size(B,2),size(A,2))
    C4 = reshape(C, size(B,2),size(A,2),size(B,2),size(A,2))  # writing into original C
    C4x = reshape(Cx, size(B,2),size(A,2))
    # @fastmath @inbounds @turbo for m = 1:size(B,2)
    @turbo for m = 1:size(B,2)
        for r = 1:size(A,2)
            for mm = 1:size(B,2)
                for rr = 1:size(A,2)
                    for n = 1:size(B,1)
                        temp =  A[n,r] * B[n,m]
                        C4[m,r,mm,rr] = C4[m,r,mm,rr] + temp * A[n,rr] * B[n,mm]
                        C4x[m,r] = C4x[m,r] + temp*x[n]
                    end
                end
            end
        end
    end
    return Cx
end

# swapped arguments
function normalEquationsP!(C,Cx,x,A,B)
    temp = Matrix{Float64}(undef,size(B,2),size(A,2))
    C4 = reshape(C, size(B,2),size(A,2),size(B,2),size(A,2))  # writing into original C
    C4x = reshape(Cx, size(B,2),size(A,2))
    # @fastmath @inbounds @tturbo for m = 1:size(B,2)
    @tturbo for m = 1:size(B,2)
        for r = 1:size(A,2)
            for mm = 1:size(B,2)
                for rr = 1:size(A,2)
                    for n = 1:size(B,1)
                        temp =  A[n,r] * B[n,m]
                        C4[m,r,mm,rr] = C4[m,r,mm,rr] + temp * A[n,rr] * B[n,mm]
                        C4x[m,r] = C4x[m,r] + temp*x[n]
                    end
                end
            end
        end
    end
    return Cx
end


# swapped arguments
function normalEquationsTullio!(C,Cx,x,A,B)
    C4 = reshape(C, size(B,2),size(A,2),size(B,2),size(A,2))  # writing into original C
    Cx4 = reshape(Cx, size(B,2),size(A,2))
    @tullio C4[j,i,l,k] = A[n,i] * B[n,j] * A[n,k] * B[n,l]
    @tullio Cx4[j,i] = A[n,i] * B[n,j] * x[n]
    return Cx
end


function  normalEquationsMatlab(CC,Cx,A,B,x)
    batchSize = 10000;
    for n = 1:batchSize:size(A,1)
        idx = min(n+batchSize-1,size(A,1));
        temp = repeat(A[n:idx,:],1,size(B,2)).*kron(B[n:idx,:], ones(1, size(A,2)));
        CC .= CC.+temp'*temp;
        Cx .= Cx.+temp'*x[n:idx];
    end
    return Cx
end


function  normalEquationsMatlabJulia(CC,Cx,A,B,x)
    batchSize = 10000;
    stencil = ones(1, size(A,2))
    # @fastmath @inbounds @views for n = 1:batchSize:size(A,1)
    @views for n = 1:batchSize:size(A,1)
        idx = min(n+batchSize-1,size(A,1))
        @turbo temp = repeat(A[n:idx,:],1,size(B,2)).*kron(B[n:idx,:],stencil )
        @turbo CC .= CC.+temp'*temp
        @turbo Cx .= Cx.+temp'*x[n:idx]
    end
    return Cx
end


N = 10000;
R = 20;
M = 40;


A = rand(N,M);
B = rand(N,R);
C = rand(M*R,M*R);
x = rand(N,);
Cx = rand(M*R,);

A_r = PermutedDimsArray(permutedims(A), (2, 1))
B_r = PermutedDimsArray(permutedims(B), (2, 1))

Cp = copy(C)
Cxp = copy(Cx)
normal!(Cp,Cxp,x,A,B_r);
CNormal = copy(Cp)
CxNormal = copy(Cxp)

Cp = copy(C)
Cxp = copy(Cx)
normalTurbo!(Cp,Cxp,x,A,B);
@assert Cp ≈ CNormal
@assert Cxp ≈ CxNormal

Cp = copy(C)
Cxp = copy(Cx)
normalTturbo!(Cp,Cxp,x,A,B);
@assert Cp ≈ CNormal
@assert Cxp ≈ CxNormal

Cp = copy(C)
Cxp = copy(Cx)
normalEquations!(Cp,Cxp,x,B_r,A);
CNormalEquations = copy(Cp)
println(norm(CNormalEquations-CNormal))
CxNormalEquations = copy(Cxp)
println(norm(CxNormalEquations-CxNormal))

Cp = copy(C)
Cxp = copy(Cx)
normalEquationsP!(Cp,Cxp,x,B,A);
@assert Cp ≈ CNormalEquations
@assert Cxp ≈ CxNormalEquations

Cp = copy(C)
Cxp = copy(Cx)
normalEquationsTullio!(Cp,Cxp,x,B,A);
CNormalEquationsTullio = copy(Cp)
println(norm(CNormalEquationsTullio-CNormal))
println(norm(CNormalEquationsTullio-CNormalEquations))
CxNormalEquationsTullio = copy(Cxp)
println(norm(CxNormalEquationsTullio-CxNormal))
println(norm(CxNormalEquationsTullio-CxNormalEquations))

Cp = copy(C)
Cxp = copy(Cx)
normalEquationsMatlab(Cp,Cxp,A,B,x);
@assert Cp ≈ CNormalEquations
@assert Cxp ≈ CxNormalEquations

Cp = copy(C)
Cxp = copy(Cx)
normalEquationsMatlabJulia(Cp,Cxp,A,B,x);
@assert Cp ≈ CNormalEquations
@assert Cxp ≈ CxNormalEquations


@btime normal!(copy($C),copy($Cx),$x,$A,$B_r);

@btime normalTurbo!(copy($C),copy($Cx),$x,$A,$B);

@btime normalTturbo!(copy($C),copy($Cx),$x,$A,$B);

@btime normalEquations!(copy($C),copy($Cx),$x,$B_r,$A);

@btime normalEquationsP!(copy($C),copy($Cx),$x,$B_r,$A);

@btime normalEquationsTullio!(copy($C),copy($Cx),$x,$B_r,$A);

@btime normalEquationsMatlab(copy($C),copy($Cx),$A,$B,$x);

@btime normalEquationsMatlabJulia(copy($C),copy($Cx),$A,$B,$x);

resulting in

  3.404 s (20004 allocations: 5.81 MiB)
  1.184 s (20004 allocations: 5.81 MiB)
  1.179 s (20004 allocations: 5.81 MiB)
  385.193 ms (8 allocations: 4.90 MiB)
  479.380 ms (8 allocations: 4.90 MiB)
  99.472 ms (156 allocations: 4.90 MiB)
  182.132 ms (19 allocations: 197.54 MiB)
  209.968 ms (13 allocations: 192.88 MiB)

In normalEquationsP! I think you can get rid of the initial allocation of temp; you later assign it a scalar value, so there should be no need to allocate the matrix at the start. It might not make a noticeable difference, but it was the thing I found when having a quick look.
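A sketch of this cleanup (the function name is made up, and `@tturbo` plus the `Cx` accumulation are omitted so it runs stand-alone without LoopVectorization): since `temp` is rebound to a scalar inside the loop nest, the `Matrix{Float64}(undef, ...)` line before the loop is dead code and can simply be deleted.

```julia
using LinearAlgebra

# Serial sketch of the C accumulation from normalEquationsP!, with the
# dead `temp = Matrix{Float64}(undef, size(B,2), size(A,2))` line removed.
function accumulateC!(C, A, B)
    C4 = reshape(C, size(B,2), size(A,2), size(B,2), size(A,2))
    for m = 1:size(B,2), r = 1:size(A,2), mm = 1:size(B,2), rr = 1:size(A,2), n = 1:size(B,1)
        temp = A[n,r] * B[n,m]   # scalar; no preallocation needed
        C4[m,r,mm,rr] += temp * A[n,rr] * B[n,mm]
    end
    return C
end

A = rand(5, 3); B = rand(5, 2)
C = zeros(6, 6)
accumulateC!(C, A, B)   # accumulates the kron(A[n,:], B[n,:]) outer products
```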


True, it is useless! Thanks :)

Interesting, if I run your code I get no improvement.

5.107 s (20001 allocations: 943.88 KiB)
1.660 s (20001 allocations: 943.88 KiB)
1.648 s (20001 allocations: 943.88 KiB)
526.596 ms (4 allocations: 224 bytes)
242.233 ms (4 allocations: 224 bytes)
273.597 ms (104 allocations: 5.67 KiB)
152.689 ms (19 allocations: 197.54 MiB)
173.030 ms (10 allocations: 188.00 MiB)

Could this be dependent on the processor or Julia version?

Did you start Julia with multiple threads?
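(For anyone reproducing the timings: threads must be requested at startup, e.g. `julia --threads 4` or the `JULIA_NUM_THREADS` environment variable; `@tturbo` and Tullio only parallelize when the session has more than one thread. A quick check:)

```julia
# Check how many threads this Julia session was started with;
# @tturbo / Tullio run serially when this is 1.
println(Threads.nthreads())
```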


Seeing the different allocation counts, I’d assume some differences. I’m on Julia 1.6.3 with 6 Intel cores.

Edit: it could be me who has the problem: my system is performing dynamic frequency scaling.

julia> Pkg.status("Tullio")
  [bc48ee85] Tullio v0.3.2
julia> Pkg.status("LoopVectorization")
  [bdcacae8] LoopVectorization v0.12.98

Yes! (I am using VSCodium with NumThreads specified in the settings, 4 in my case.) Threads.nthreads() returns 4.

Julia 1.7.0-dev, Intel with 4 physical cores; I will try with 1.6.3. What I still find strange is that in my case the fastest code is the one copied from Matlab.

No big change, except for Tullio, which allocates less on 1.6.3:

  1.586 s (20001 allocations: 943.88 KiB)
  1.495 s (30001 allocations: 1.07 MiB)
  510.505 ms (5 allocations: 6.59 KiB)
  228.873 ms (6 allocations: 6.61 KiB)
  217.190 ms (104 allocations: 5.67 KiB)
  152.197 ms (19 allocations: 197.54 MiB)
  166.631 ms (10 allocations: 188.00 MiB)

Did you re-enable @fastmath? I ask because I recently had the impression it could pessimize code.

What is going on here? I am getting this:

julia> @btime normalEquationsMatlabJulia($C,$Cx,$x,$A,$B); 
  3.808 ms (13 allocations: 9.17 MiB)

I’m copying the complete Julia session here, because I think I’m going crazy:

% julia
               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.6.3 (2021-09-23)
 _/ |\__'_|_|_|\__'_|  |  Official https://julialang.org/ release
|__/                   |

julia> using LoopVectorization

julia> function  normalEquationsMatlabJulia(CC,Cx,A,B,x)
           batchSize = 10000;
           stencil = ones(1, size(A,2))
           # @fastmath @inbounds @views for n = 1:batchSize:size(A,1)
           @views for n = 1:batchSize:size(A,1)
               idx = min(n+batchSize-1,size(A,1))
               @turbo temp = repeat(A[n:idx,:],1,size(B,2)).*kron(B[n:idx,:],stencil )
               @turbo CC .= CC.+temp'*temp
               @turbo Cx .= Cx.+temp'*x[n:idx]
           end
           return Cx 
       end
normalEquationsMatlabJulia (generic function with 1 method)

julia> N = 10000;

julia> R = 20;

julia> M = 40;

julia> A = rand(N,M);

julia> B = rand(N,R);

julia> C = rand(M*R,M*R);

julia> x = rand(N,);

julia> Cx = rand(M*R,);  

julia> @btime normalEquationsMatlabJulia($C,$Cx,$x,$A,$B); 
  3.808 ms (13 allocations: 9.17 MiB)


Is the result correct? @goerch had some @assert checks to verify it (they pass when I run it). Also, how does the @btime of that function compare to the others on your machine?

Well, the assertions are not correct, because they just compare the same array with itself:

normal!(C,Cx,x,A,B_r);
CxNormal = Cx  # CxNormal is just the same array as Cx

normalTurbo!(C,Cx,x,A,B);
@assert Cx ≈ CxNormal    # this will always be true

it should be CxNormal = copy(Cx).

Also, since the functions modify Cx, each version is not operating on the same values. The original Cx should be copied from call to call (I am not sure how that affects performance, but it may, and it is necessary to assert the values).
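A minimal sketch of this pitfall (the arrays here are illustrative): assigning an array to a new name does not copy it, so after an in-place update the `@assert` just compares the array with itself:

```julia
# Sketch of the aliasing pitfall: `=` binds a name, it does not copy.
Cx = rand(5)
CxNormal = Cx               # NOT a copy: both names refer to the same array
Cx .+= 1.0                  # mutating Cx also "changes" CxNormal ...
@assert Cx ≈ CxNormal       # ... so this check is vacuously true

CxSnapshot = copy(Cx)       # an independent snapshot
Cx .+= 1.0
@assert !(Cx ≈ CxSnapshot)  # now the comparison actually detects the change
```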


Sorry for the mistake. I tried to correct my post above. As far as I understand, we have three different results:

  • normal!, normalTurbo! and normalTturbo! computing the same
  3.404 s (20004 allocations: 5.81 MiB)
  1.184 s (20004 allocations: 5.81 MiB)
  1.179 s (20004 allocations: 5.81 MiB)
  • normalEquations!, normalEquationsP!, normalEquationsMatlab and normalEquationsMatlabJulia computing the same
  385.193 ms (8 allocations: 4.90 MiB)
  479.380 ms (8 allocations: 4.90 MiB)
  182.132 ms (19 allocations: 197.54 MiB)
  209.968 ms (13 allocations: 192.88 MiB)
  • normalEquationsTullio
 99.472 ms (156 allocations: 4.90 MiB)

Sorry, I didn’t notice that before I replied. FWIW, I think that annotations far from their effect make such misinterpretations likely.