# Outperformed by Matlab

Dear Julia users,

I am trying to set up the normal equations (https://en.wikipedia.org/wiki/Linear_least_squares), which means constructing X'X and X'y in

(X'X+\lambda M) w = X'y

where M is SPD, and in my case, each row of X is a Kronecker product of two matrices, i.e.:
X_{i,:}=A_{i,:} \otimes B_{i,:}
This means that X has Khatri-Rao structure. In Matlab I am using property #4 in https://en.wikipedia.org/wiki/Khatri%E2%80%93Rao_product to perform the Khatri-Rao product for n rows at once. In Julia, I would like to avoid this in order to take advantage of the fact that the code is compiled. However, all my attempts at generating a faster version of the code have failed.

In the code snippet below I tried naive rank-1 updates (“normal”), a for loop with LoopVectorization, and Tullio (@mcabbott helped me greatly in a previous question on the same topic). I find that Tullio is very fast; however, since it needs to be called twice (once for X'X and once for X'y), it ends up being slower than the for loop with @tturbo on my machine. I wonder if there is any way to speed the code up further by taking advantage of Julia. Here are the times on my machine:
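For concreteness, the row structure I am describing can be spelled out directly (a small toy sketch with made-up sizes, separate from the benchmark code below):

```julia
using LinearAlgebra

# Toy illustration of the Khatri-Rao row structure:
# row i of X is the Kronecker product of row i of A with row i of B.
N, M, R = 5, 3, 2
A = rand(N, M)
B = rand(N, R)

X = Matrix{Float64}(undef, N, M * R)
for i in 1:N
    X[i, :] = kron(A[i, :], B[i, :])
end

# X'X and X'y are the ingredients of the (regularized) normal equations.
y = rand(N)
G = X' * X          # (M*R) × (M*R) Gram matrix
g = X' * y          # (M*R) right-hand side

# Sanity check: X'X is the sum of rank-1 outer products of the rows,
# which is exactly what the rank-1-update versions below accumulate.
Gsum = sum(X[i, :] * X[i, :]' for i in 1:N)
@assert G ≈ Gsum
```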

4.333 s (226937 allocations: 6.06 MiB) Naive
1.558 s (226937 allocations: 6.06 MiB) Naive @turbo
1.531 s (226937 allocations: 6.06 MiB) Naive @tturbo
610.630 ms (51 allocations: 8.47 KiB) Loop @turbo
248.648 ms (51 allocations: 8.47 KiB) Loop @tturbo
257.008 ms (129 allocations: 6.17 KiB) Tullio
147.583 ms (43 allocations: 197.54 MiB) Matlab port
165.664 ms (35 allocations: 188.00 MiB) Wrongly improved Matlab port

For reference, Matlab takes 0.0869 seconds to run this code (almost twice as fast as the Matlab port).
This is currently the only part of my code in which Matlab performs better. If I can get Julia to run faster, I will abandon Matlab for good.


using LinearAlgebra
using LoopVectorization
using Tullio
using BenchmarkTools

N = 10000;
R = 20;
M = 40;

A = rand(N,M);
B = rand(N,R);
C = rand(M*R,M*R);
x = rand(N,);
Cx = rand(M*R,);

function normal!(C,Cx,x,A,B)
    temp = Vector{Float64}(undef,M*R)
    @fastmath @views @inbounds for n = 1:N
        kron!(temp,A[n,:],B[n,:])
        C .= C .+ temp.*temp'
        Cx .= Cx .+ temp.*x[n]
    end
    return nothing
end

function normalTurbo!(C,Cx,x,A,B)
    temp = Vector{Float64}(undef,M*R)
    @fastmath @views @inbounds for n = 1:N
        kron!(temp,A[n,:],B[n,:])
        @turbo C .= C .+ temp.*temp'
        @turbo Cx .= Cx .+ temp.*x[n]
    end
    return nothing
end

function normalTturbo!(C,Cx,x,A,B)
    temp = Vector{Float64}(undef,M*R)
    @fastmath @views @inbounds for n = 1:N
        kron!(temp,A[n,:],B[n,:])
        @tturbo C .= C .+ temp.*temp'
        @tturbo Cx .= Cx .+ temp.*x[n]
    end
    return nothing
end

function normalEquations!(C,Cx,x,A,B)
    temp = Matrix{Float64}(undef,M,R)
    C4 = reshape(C, M,R,M,R)  # writing into original C
    C4x = reshape(Cx, M,R)
    @fastmath @inbounds @turbo for m = 1:M
        for r = 1:R
            for mm = 1:M
                for rr = 1:R
                    for n = 1:N
                        temp = A[n,r] * B[n,m]
                        C4[m,r,mm,rr] = C4[m,r,mm,rr] + temp * A[n,rr] * B[n,mm]
                        C4x[m,r] = C4x[m,r] + temp*x[n]
                    end
                end
            end
        end
    end
    return nothing
end

function normalEquationsP!(C,Cx,x,A,B)
    temp = Matrix{Float64}(undef,M,R)
    C4 = reshape(C, M,R,M,R)  # writing into original C
    C4x = reshape(Cx, M,R)
    @fastmath @inbounds @tturbo for m = 1:M
        for r = 1:R
            for mm = 1:M
                for rr = 1:R
                    for n = 1:N
                        temp = A[n,r] * B[n,m]
                        C4[m,r,mm,rr] = C4[m,r,mm,rr] + temp * A[n,rr] * B[n,mm]
                        C4x[m,r] = C4x[m,r] + temp*x[n]
                    end
                end
            end
        end
    end
    return nothing
end

function normalEquationsTullio!(C,Cx,x,A,B)
    C4 = reshape(C, M,R,M,R)  # writing into original C
    Cx4 = reshape(Cx, M,R)
    @tullio C4[j,i,l,k] = A[n,i] * B[n,j] * A[n,k] * B[n,l]
    @tullio Cx4[j,i] = A[n,i] * B[n,j] * x[n]
    return nothing
end

function normalEquationsMatlab(CC,Cx,A,B,x)
    batchSize = 10000;
    for n = 1:batchSize:N
        idx = min(n+batchSize-1,N);
        temp = repeat(A[n:idx,:],1,R) .* kron(B[n:idx,:], ones(1, M));
        CC .= CC .+ temp'*temp;
        Cx .= Cx .+ temp'*x[n:idx];
    end
    return nothing
end

function normalEquationsMatlabJulia(CC,Cx,A,B,x)
    batchSize = 10000;
    stencil = ones(1, M)
    @fastmath @inbounds @views for n = 1:batchSize:N
        idx = min(n+batchSize-1,N)
        @turbo temp = repeat(A[n:idx,:],1,R) .* kron(B[n:idx,:], stencil)
        @turbo CC .= CC .+ temp'*temp
        @turbo Cx .= Cx .+ temp'*x[n:idx]
    end
    return nothing
end

@btime normal!(C,Cx,x,A,B);

@btime normalTurbo!(C,Cx,x,A,B);

@btime normalTturbo!(C,Cx,x,A,B);

@btime normalEquations!(C,Cx,x,B,A);

@btime normalEquationsP!(C,Cx,x,B,A);

@btime normalEquationsTullio!(C,Cx,x,B,A);

@btime normalEquationsMatlab(C,Cx,A,B,x);

@btime normalEquationsMatlabJulia(C,Cx,A,B,x);

1 Like

Low-hanging fruit includes:

- Your slices make copies; use views instead.
- You are accessing the global variables M and R.
- Your array accesses [n,:] are suboptimal; [:,n] is better.
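As a small illustration of the layout point (toy sizes, nothing to do with your benchmark specifically): Julia arrays are column-major, so `A[:, j]` reads contiguous memory while `A[j, :]` strides across it.

```julia
# Column-major layout: A[:, j] is one contiguous block of memory,
# while A[i, :] jumps by the column stride on every element.
A = rand(1000, 1000)

sum_cols(A) = sum(sum(@view A[:, j]) for j in axes(A, 2))  # contiguous reads
sum_rows(A) = sum(sum(@view A[i, :]) for i in axes(A, 1))  # strided reads

# Same result, very different memory-access pattern (the first is
# typically noticeably faster on large matrices).
@assert sum_cols(A) ≈ sum_rows(A)
```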

6 Likes

Edit: the following was due to a mistake I made while editing the MWE:

In function normalEquationsTullio!(C,Cx,x,A,B) I get

ERROR: LoadError: "range of index l must agree"


on Julia 1.6.3.

Around function normalEquations!(C,Cx,x,A,B) I intermittently see

Please submit a bug report with steps to reproduce this fault, and any error messages that follow (in their entirety). Thanks.
Exception: EXCEPTION_ACCESS_VIOLATION at 0x640d4acd -- macro expansion at C:\Users\Win10\.julia\packages\VectorizationBase\xHOp9\src\llvm_intrin\memory_addr.jl:635 [inlined]
[...]
Allocations: 157951623 (Pool: 157880074; Big: 71549); GC: 200


In function normalEquationsMatlab(CC,Cx,A,B,x) and function normalEquationsMatlabJulia(CC,Cx,A,B,x)  I get

ERROR: LoadError: DimensionMismatch("arrays could not be broadcast to a common size; got a dimension with lengths 20 and 1600")


This is after

• reducing N to 1000
• removing @inbounds and @fastmath everywhere
• using interpolation like @btime normal!($C,$Cx,$x,$A,$B);

Could part of my problems be due to the reshaping and modifications in function normalEquationsP!(C,Cx,x,A,B)?

P.S.: whenever I see @inbounds without result validation I get suspicious ;)

1 Like

The variables N, R, and M are being used as non-constant globals. You should probably either set them to be const or pass them in as function arguments. https://docs.julialang.org/en/v1/manual/performance-tips/#Avoid-global-variables

5 Likes

Is there a reason not to do w = X\y? For both languages, that should use QR to do linear least squares automatically, and be faster than setting up X'X. In your code, there might be no need to compute C or Cx?

3 Likes

Or even better, don’t use them at all within the function but use size(A, 1) etc. instead.

9 Likes

QR (used by X \ y) isn’t generally faster than the normal equations (it has the same complexity, but a worse constant factor). But it is more accurate in general, sometimes much more accurate (if X is badly conditioned). See Efficient way of doing linear regression - #33 by stevengj

11 Likes

Julia uses column-major memory layout, so you typically want to iterate the first index in the innermost loop. These nested loops are written with m as the outermost loop, but m is the first index in many of the arrays. I believe @turbo will sometimes rearrange the loops for you, but it is a general thing to keep in mind while writing nested loops in Julia. This is related to the advice given above of storing data so that you can slice it as a[:, i] instead of a[i, :].

2 Likes

Hi, I forgot to mention that the system is regularized, i.e. (X'X + \lambda M) w = X'y; that is why I am trying to compute these two quantities. (I am going to edit the question.)

Hello, I am using the @views macro at the beginning of each (outermost) loop when slicing; I have read that it should be equivalent?
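A quick sanity check of that (a toy sketch): @views applied to a block rewrites each slice expression inside it into a view, so it should match writing @view at every slice site explicitly.

```julia
A = rand(4, 3)
temp1 = Vector{Float64}(undef, 3)
temp2 = Vector{Float64}(undef, 3)

# Explicit per-slice view:
copyto!(temp1, @view A[2, :])

# @views on a whole block rewrites A[2, :] into view(A, 2, :):
@views begin
    copyto!(temp2, A[2, :])
end

@assert temp1 == temp2  # identical results, no copies made in either case
```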
Hello, try to call it with the arguments A and B inverted (as in the example); it works for me with bounds checks.

Sorry, my copy & paste mistake when adapting the MWE. Should I somehow edit/delete my previous comment?

No worries, just edit it perhaps?

Apologies on my side for the lazy indexing in the function

1 Like

Edit: corrected after @leandromartinez98’s observations…

Here is a modified MWE, adding the proposals of the thread and using @aplavin’s nice trick Y.A.t.Q : Yet Another @threads Question - #13 by aplavin

using LinearAlgebra
using LoopVectorization
using Tullio
using BenchmarkTools

function normal!(C,Cx,x,A,B)
    temp = Vector{Float64}(undef,size(A,2)*size(B,2))
    # @fastmath @views @inbounds for n = 1:size(A,1)
    @views for n = 1:size(A,1)
        kron!(temp,A[n,:],B[n,:])
        C .= C .+ temp.*temp'
        Cx .= Cx .+ temp.*x[n]
    end
    return Cx
end

function normalTurbo!(C,Cx,x,A,B)
    temp = Vector{Float64}(undef,size(A,2)*size(B,2))
    # @fastmath @views @inbounds for n = 1:size(A,1)
    @views for n = 1:size(A,1)
        kron!(temp,A[n,:],B[n,:])
        @turbo C .= C .+ temp.*temp'
        @turbo Cx .= Cx .+ temp.*x[n]
    end
    return Cx
end

function normalTturbo!(C,Cx,x,A,B)
    temp = Vector{Float64}(undef,size(A,2)*size(B,2))
    # @fastmath @views @inbounds for n = 1:size(A,1)
    @views for n = 1:size(A,1)
        kron!(temp,A[n,:],B[n,:])
        @tturbo C .= C .+ temp.*temp'
        @tturbo Cx .= Cx .+ temp.*x[n]
    end
    return Cx
end

# swapped arguments
function normalEquations!(C,Cx,x,A,B)
    temp = Matrix{Float64}(undef,size(B,2),size(A,2))
    C4 = reshape(C, size(B,2),size(A,2),size(B,2),size(A,2))  # writing into original C
    C4x = reshape(Cx, size(B,2),size(A,2))
    # @fastmath @inbounds @turbo for m = 1:size(B,2)
    @turbo for m = 1:size(B,2)
        for r = 1:size(A,2)
            for mm = 1:size(B,2)
                for rr = 1:size(A,2)
                    for n = 1:size(B,1)
                        temp = A[n,r] * B[n,m]
                        C4[m,r,mm,rr] = C4[m,r,mm,rr] + temp * A[n,rr] * B[n,mm]
                        C4x[m,r] = C4x[m,r] + temp*x[n]
                    end
                end
            end
        end
    end
    return Cx
end

# swapped arguments
function normalEquationsP!(C,Cx,x,A,B)
    temp = Matrix{Float64}(undef,size(B,2),size(A,2))
    C4 = reshape(C, size(B,2),size(A,2),size(B,2),size(A,2))  # writing into original C
    C4x = reshape(Cx, size(B,2),size(A,2))
    # @fastmath @inbounds @tturbo for m = 1:size(B,2)
    @tturbo for m = 1:size(B,2)
        for r = 1:size(A,2)
            for mm = 1:size(B,2)
                for rr = 1:size(A,2)
                    for n = 1:size(B,1)
                        temp = A[n,r] * B[n,m]
                        C4[m,r,mm,rr] = C4[m,r,mm,rr] + temp * A[n,rr] * B[n,mm]
                        C4x[m,r] = C4x[m,r] + temp*x[n]
                    end
                end
            end
        end
    end
    return Cx
end

# swapped arguments
function normalEquationsTullio!(C,Cx,x,A,B)
    C4 = reshape(C, size(B,2),size(A,2),size(B,2),size(A,2))  # writing into original C
    Cx4 = reshape(Cx, size(B,2),size(A,2))
    @tullio C4[j,i,l,k] = A[n,i] * B[n,j] * A[n,k] * B[n,l]
    @tullio Cx4[j,i] = A[n,i] * B[n,j] * x[n]
    return Cx
end

function normalEquationsMatlab(CC,Cx,A,B,x)
    batchSize = 10000;
    for n = 1:batchSize:size(A,1)
        idx = min(n+batchSize-1,size(A,1));
        temp = repeat(A[n:idx,:],1,size(B,2)) .* kron(B[n:idx,:], ones(1, size(A,2)));
        CC .= CC .+ temp'*temp;
        Cx .= Cx .+ temp'*x[n:idx];
    end
    return Cx
end

function normalEquationsMatlabJulia(CC,Cx,A,B,x)
    batchSize = 10000;
    stencil = ones(1, size(A,2))
    # @fastmath @inbounds @views for n = 1:batchSize:size(A,1)
    @views for n = 1:batchSize:size(A,1)
        idx = min(n+batchSize-1,size(A,1))
        @turbo temp = repeat(A[n:idx,:],1,size(B,2)) .* kron(B[n:idx,:], stencil)
        @turbo CC .= CC .+ temp'*temp
        @turbo Cx .= Cx .+ temp'*x[n:idx]
    end
    return Cx
end

N = 10000;
R = 20;
M = 40;

A = rand(N,M);
B = rand(N,R);
C = rand(M*R,M*R);
x = rand(N,);
Cx = rand(M*R,);

A_r = PermutedDimsArray(permutedims(A), (2, 1))
B_r = PermutedDimsArray(permutedims(B), (2, 1))

Cp = copy(C)
Cxp = copy(Cx)
normal!(Cp,Cxp,x,A,B_r);
CNormal = copy(Cp)
CxNormal = copy(Cxp)

Cp = copy(C)
Cxp = copy(Cx)
normalTurbo!(Cp,Cxp,x,A,B);
@assert Cp ≈ CNormal
@assert Cxp ≈ CxNormal

Cp = copy(C)
Cxp = copy(Cx)
normalTturbo!(Cp,Cxp,x,A,B);
@assert Cp ≈ CNormal
@assert Cxp ≈ CxNormal

Cp = copy(C)
Cxp = copy(Cx)
normalEquations!(Cp,Cxp,x,B_r,A);
CNormalEquations = copy(Cp)
println(norm(CNormalEquations-CNormal))
CxNormalEquations = copy(Cxp)
println(norm(CxNormalEquations-CxNormal))

Cp = copy(C)
Cxp = copy(Cx)
normalEquationsP!(Cp,Cxp,x,B,A);
@assert Cp ≈ CNormalEquations
@assert Cxp ≈ CxNormalEquations

Cp = copy(C)
Cxp = copy(Cx)
normalEquationsTullio!(Cp,Cxp,x,B,A);
CNormalEquationsTullio = copy(Cp)
println(norm(CNormalEquationsTullio-CNormal))
println(norm(CNormalEquationsTullio-CNormalEquations))
CxNormalEquationsTullio = copy(Cxp)
println(norm(CxNormalEquationsTullio-CxNormal))
println(norm(CxNormalEquationsTullio-CxNormalEquations))

Cp = copy(C)
Cxp = copy(Cx)
normalEquationsMatlab(Cp,Cxp,A,B,x);
@assert Cp ≈ CNormalEquations
@assert Cxp ≈ CxNormalEquations

Cp = copy(C)
Cxp = copy(Cx)
normalEquationsMatlabJulia(Cp,Cxp,A,B,x);
@assert Cp ≈ CNormalEquations
@assert Cxp ≈ CxNormalEquations

@btime normal!(copy($C),copy($Cx),$x,$A,$B_r);

@btime normalTurbo!(copy($C),copy($Cx),$x,$A,$B);

@btime normalTturbo!(copy($C),copy($Cx),$x,$A,$B);

@btime normalEquations!(copy($C),copy($Cx),$x,$B_r,$A);

@btime normalEquationsP!(copy($C),copy($Cx),$x,$B_r,$A);

@btime normalEquationsTullio!(copy($C),copy($Cx),$x,$B_r,$A);

@btime normalEquationsMatlab(copy($C),copy($Cx),$A,$B,$x);

@btime normalEquationsMatlabJulia(copy($C),copy($Cx),$A,$B,$x);


resulting in

3.404 s (20004 allocations: 5.81 MiB)
1.184 s (20004 allocations: 5.81 MiB)
1.179 s (20004 allocations: 5.81 MiB)
385.193 ms (8 allocations: 4.90 MiB)
479.380 ms (8 allocations: 4.90 MiB)
99.472 ms (156 allocations: 4.90 MiB)
182.132 ms (19 allocations: 197.54 MiB)
209.968 ms (13 allocations: 192.88 MiB)

2 Likes

In normalEquationsP! I think you can get rid of the initial allocation of temp; you later assign it a scalar value, so there should be no need to allocate the matrix at the start. It might not make a noticeable difference, but it was the thing I found when having a quick look.

2 Likes

True, it is useless! Thanks:)

Interesting, if I run your code I get no improvement.

5.107 s (20001 allocations: 943.88 KiB)
1.660 s (20001 allocations: 943.88 KiB)
1.648 s (20001 allocations: 943.88 KiB)
526.596 ms (4 allocations: 224 bytes)
242.233 ms (4 allocations: 224 bytes)
273.597 ms (104 allocations: 5.67 KiB)
152.689 ms (19 allocations: 197.54 MiB)
173.030 ms (10 allocations: 188.00 MiB)

Could this depend on the processor or the Julia version?

Did you start Julia with multiple threads?

1 Like

Seeing the different allocation counts I’d assume some differences. I’m on Julia 1.6.3 with 6 cores from Intel.

Edit: it could be me who has the problem: my system is performing dynamic frequency scaling.

julia> Pkg.status("Tullio")
[bc48ee85] Tullio v0.3.2

julia> Pkg.status("LoopVectorization")
[bdcacae8] LoopVectorization v0.12.98


Yes! (I am using VSCodium with NumThreads specified in the settings, 4 in my case.) Threads.nthreads() returns 4.