Hi, I forgot to mention that the system is regularized, so what is solved is (X'X + M)β = X'y; that is why I am trying to compute these two quantities (I am going to edit the question).
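Spelling that out (my reading of the MWE below, so take the symbols as assumptions): each row of the design matrix is the Kronecker product of the corresponding rows of A and B,

$$X_{n,:} = \mathrm{kron}(A_{n,:},\, B_{n,:}), \qquad (X^\top X + M)\,\beta = X^\top y,$$

so C accumulates $X^\top X$ and Cx accumulates $X^\top y$ (with x playing the role of y in the code).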
Hello, I am using the @views macro at the beginning of each (outermost) loop when slicing; I have read that it should be equivalent?
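For reference, a minimal sketch of what I understand @views to do when placed in front of a loop: every slicing expression in the body becomes a lazy, non-allocating view, so the two (made-up) functions below should compute the same thing while allocating very differently.

```julia
function rowsums_copy(A)
    s = 0.0
    for n in 1:size(A, 1)
        s += sum(A[n, :])      # A[n, :] allocates a fresh Vector each iteration
    end
    return s
end

function rowsums_view(A)
    s = 0.0
    @views for n in 1:size(A, 1)
        s += sum(A[n, :])      # rewritten to sum(view(A, n, :)): no copy
    end
    return s
end

A = rand(100, 100)
rowsums_copy(A); rowsums_view(A)   # warm up so @allocated excludes compilation
@assert rowsums_copy(A) ≈ rowsums_view(A)
@assert (@allocated rowsums_view(A)) < (@allocated rowsums_copy(A))
```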
Hello, try calling it with the arguments A and B swapped (as in the example); it works for me with bounds checks.
Sorry, my copy & paste mistake when adapting the MWE. Should I somehow edit/delete my previous comment?
No worries, just edit it perhaps? Apologies on my side for the lazy indexing in the function
Edit: corrected after @lmiq's observations…
Here is a modified MWE, incorporating the proposals from the thread and using @aplavin's nice trick from Y.A.t.Q : Yet Another @threads Question - #13 by aplavin
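For context, my understanding of the trick from that link: `PermutedDimsArray(permutedims(A), (2, 1))` first materializes the transpose and then lazily permutes it back, giving an array that indexes exactly like `A` but stores each row contiguously, which suits the row-wise access pattern in the loops of the MWE. A quick sanity check of that reading:

```julia
A   = rand(3, 4)
A_r = PermutedDimsArray(permutedims(A), (2, 1))

@assert A_r == A                 # same values under the same indices
@assert strides(A)   == (1, 3)   # column-major: columns contiguous
@assert strides(A_r) == (4, 1)   # row-major: rows contiguous
```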
using LinearAlgebra
using LoopVectorization
using Tullio
using BenchmarkTools
function normal!(C,Cx,x,A,B)
    temp = Vector{Float64}(undef, size(A,2)*size(B,2))
    # @fastmath @views @inbounds for n = 1:size(A,1)
    @views for n = 1:size(A,1)
        kron!(temp, A[n,:], B[n,:])
        C .= C .+ temp .* temp'
        Cx .= Cx .+ temp .* x[n]
    end
    return Cx
end
function normalTurbo!(C,Cx,x,A,B)
    temp = Vector{Float64}(undef, size(A,2)*size(B,2))
    # @fastmath @views @inbounds for n = 1:size(A,1)
    @views for n = 1:size(A,1)
        kron!(temp, A[n,:], B[n,:])
        @turbo C .= C .+ temp .* temp'
        @turbo Cx .= Cx .+ temp .* x[n]
    end
    return Cx
end
function normalTturbo!(C,Cx,x,A,B)
    temp = Vector{Float64}(undef, size(A,2)*size(B,2))
    # @fastmath @views @inbounds for n = 1:size(A,1)
    @views for n = 1:size(A,1)
        kron!(temp, A[n,:], B[n,:])
        @tturbo C .= C .+ temp .* temp'
        @tturbo Cx .= Cx .+ temp .* x[n]
    end
    return Cx
end
# swapped arguments
function normalEquations!(C,Cx,x,A,B)
    temp = Matrix{Float64}(undef, size(B,2), size(A,2))
    C4  = reshape(C, size(B,2), size(A,2), size(B,2), size(A,2)) # writing into original C
    C4x = reshape(Cx, size(B,2), size(A,2))
    # @fastmath @inbounds @turbo for m = 1:size(B,2)
    @turbo for m = 1:size(B,2)
        for r = 1:size(A,2)
            for mm = 1:size(B,2)
                for rr = 1:size(A,2)
                    for n = 1:size(B,1)
                        temp = A[n,r] * B[n,m]
                        C4[m,r,mm,rr] = C4[m,r,mm,rr] + temp * A[n,rr] * B[n,mm]
                        C4x[m,r] = C4x[m,r] + temp * x[n]
                    end
                end
            end
        end
    end
    return Cx
end
# swapped arguments
function normalEquationsP!(C,Cx,x,A,B)
    temp = Matrix{Float64}(undef, size(B,2), size(A,2))
    C4  = reshape(C, size(B,2), size(A,2), size(B,2), size(A,2)) # writing into original C
    C4x = reshape(Cx, size(B,2), size(A,2))
    # @fastmath @inbounds @tturbo for m = 1:size(B,2)
    @tturbo for m = 1:size(B,2)
        for r = 1:size(A,2)
            for mm = 1:size(B,2)
                for rr = 1:size(A,2)
                    for n = 1:size(B,1)
                        temp = A[n,r] * B[n,m]
                        C4[m,r,mm,rr] = C4[m,r,mm,rr] + temp * A[n,rr] * B[n,mm]
                        C4x[m,r] = C4x[m,r] + temp * x[n]
                    end
                end
            end
        end
    end
    return Cx
end
# swapped arguments
function normalEquationsTullio!(C,Cx,x,A,B)
    C4  = reshape(C, size(B,2), size(A,2), size(B,2), size(A,2)) # writing into original C
    Cx4 = reshape(Cx, size(B,2), size(A,2))
    @tullio C4[j,i,l,k] = A[n,i] * B[n,j] * A[n,k] * B[n,l]
    @tullio Cx4[j,i] = A[n,i] * B[n,j] * x[n]
    return Cx
end
function normalEquationsMatlab(CC,Cx,A,B,x)
    batchSize = 10000
    for n = 1:batchSize:size(A,1)
        idx = min(n+batchSize-1, size(A,1))
        temp = repeat(A[n:idx,:], 1, size(B,2)) .* kron(B[n:idx,:], ones(1, size(A,2)))
        CC .= CC .+ temp'*temp
        Cx .= Cx .+ temp'*x[n:idx]
    end
    return Cx
end
function normalEquationsMatlabJulia(CC,Cx,A,B,x)
    batchSize = 10000
    stencil = ones(1, size(A,2))
    # @fastmath @inbounds @views for n = 1:batchSize:size(A,1)
    @views for n = 1:batchSize:size(A,1)
        idx = min(n+batchSize-1, size(A,1))
        @turbo temp = repeat(A[n:idx,:], 1, size(B,2)) .* kron(B[n:idx,:], stencil)
        @turbo CC .= CC .+ temp'*temp
        @turbo Cx .= Cx .+ temp'*x[n:idx]
    end
    return Cx
end
N = 10000;
R = 20;
M = 40;
A = rand(N,M);
B = rand(N,R);
C = rand(M*R,M*R);
x = rand(N,);
Cx = rand(M*R,);
A_r = PermutedDimsArray(permutedims(A), (2, 1)) # @aplavin's trick: row-major storage, indexes like A
B_r = PermutedDimsArray(permutedims(B), (2, 1))
Cp = copy(C)
Cxp = copy(Cx)
normal!(Cp,Cxp,x,A,B_r);
CNormal = copy(Cp)
CxNormal = copy(Cxp)
Cp = copy(C)
Cxp = copy(Cx)
normalTurbo!(Cp,Cxp,x,A,B);
@assert Cp ≈ CNormal
@assert Cxp ≈ CxNormal
Cp = copy(C)
Cxp = copy(Cx)
normalTturbo!(Cp,Cxp,x,A,B);
@assert Cp ≈ CNormal
@assert Cxp ≈ CxNormal
Cp = copy(C)
Cxp = copy(Cx)
normalEquations!(Cp,Cxp,x,B_r,A);
CNormalEquations = copy(Cp)
println(norm(CNormalEquations-CNormal))
CxNormalEquations = copy(Cxp)
println(norm(CxNormalEquations-CxNormal))
Cp = copy(C)
Cxp = copy(Cx)
normalEquationsP!(Cp,Cxp,x,B,A);
@assert Cp ≈ CNormalEquations
@assert Cxp ≈ CxNormalEquations
Cp = copy(C)
Cxp = copy(Cx)
normalEquationsTullio!(Cp,Cxp,x,B,A);
CNormalEquationsTullio = copy(Cp)
println(norm(CNormalEquationsTullio-CNormal))
println(norm(CNormalEquationsTullio-CNormalEquations))
CxNormalEquationsTullio = copy(Cxp)
println(norm(CxNormalEquationsTullio-CxNormal))
println(norm(CxNormalEquationsTullio-CxNormalEquations))
Cp = copy(C)
Cxp = copy(Cx)
normalEquationsMatlab(Cp,Cxp,A,B,x);
@assert Cp ≈ CNormalEquations
@assert Cxp ≈ CxNormalEquations
Cp = copy(C)
Cxp = copy(Cx)
normalEquationsMatlabJulia(Cp,Cxp,A,B,x);
@assert Cp ≈ CNormalEquations
@assert Cxp ≈ CxNormalEquations
@btime normal!(copy($C),copy($Cx),$x,$A,$B_r);
@btime normalTurbo!(copy($C),copy($Cx),$x,$A,$B);
@btime normalTturbo!(copy($C),copy($Cx),$x,$A,$B);
@btime normalEquations!(copy($C),copy($Cx),$x,$B_r,$A);
@btime normalEquationsP!(copy($C),copy($Cx),$x,$B_r,$A);
@btime normalEquationsTullio!(copy($C),copy($Cx),$x,$B_r,$A);
@btime normalEquationsMatlab(copy($C),copy($Cx),$A,$B,$x);
@btime normalEquationsMatlabJulia(copy($C),copy($Cx),$A,$B,$x);
resulting in
3.404 s (20004 allocations: 5.81 MiB)
1.184 s (20004 allocations: 5.81 MiB)
1.179 s (20004 allocations: 5.81 MiB)
385.193 ms (8 allocations: 4.90 MiB)
479.380 ms (8 allocations: 4.90 MiB)
99.472 ms (156 allocations: 4.90 MiB)
182.132 ms (19 allocations: 197.54 MiB)
209.968 ms (13 allocations: 192.88 MiB)
In normalEquationsP! I think you can get rid of the initial allocation of temp: you later assign it a scalar value, so there should be no need to allocate the matrix at the start. It might not make a noticeable difference, but it was the thing I found when having a quick look.
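A sketch of what that change looks like (hypothetical name; the @tturbo annotation is left off here so the snippet runs with Base alone, but it would stay on the outer loop in the real function):

```julia
function normalEquationsP_noalloc!(C, Cx, x, A, B)
    # Same body as normalEquationsP!, minus the up-front
    # `temp = Matrix{Float64}(undef, ...)` line: `temp` below is a
    # loop-local scalar, so no array ever needs to be allocated for it.
    C4  = reshape(C,  size(B,2), size(A,2), size(B,2), size(A,2))
    C4x = reshape(Cx, size(B,2), size(A,2))
    for m = 1:size(B,2), r = 1:size(A,2), mm = 1:size(B,2), rr = 1:size(A,2)
        for n = 1:size(B,1)
            temp = A[n,r] * B[n,m]          # scalar; no Matrix needed
            C4[m,r,mm,rr] += temp * A[n,rr] * B[n,mm]
            C4x[m,r] += temp * x[n]
        end
    end
    return Cx
end

# Small smoke test against the explicit sum for one entry of C4:
A = rand(5, 3); B = rand(5, 2); x = rand(5)
C = zeros(6, 6); Cx = zeros(6)
normalEquationsP_noalloc!(C, Cx, x, A, B)
C4 = reshape(C, 2, 3, 2, 3)
@assert C4[1,2,2,3] ≈ sum(A[n,2]*B[n,1]*A[n,3]*B[n,2] for n in 1:5)
```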
True, it is useless! Thanks:)
Interesting: if I run your code I get no improvement.
5.107 s (20001 allocations: 943.88 KiB)
1.660 s (20001 allocations: 943.88 KiB)
1.648 s (20001 allocations: 943.88 KiB)
526.596 ms (4 allocations: 224 bytes)
242.233 ms (4 allocations: 224 bytes)
273.597 ms (104 allocations: 5.67 KiB)
152.689 ms (19 allocations: 197.54 MiB)
173.030 ms (10 allocations: 188.00 MiB)
Could this be dependent on the processor or the Julia version?
Did you start Julia with multiple threads?
Seeing the different allocation counts, I’d assume there are some differences. I’m on Julia 1.6.3 with a 6-core Intel CPU.
Edit: it could be me who has the problem: my system is performing dynamic frequency scaling.
julia> Pkg.status("Tullio")
[bc48ee85] Tullio v0.3.2
julia> Pkg.status("LoopVectorization")
[bdcacae8] LoopVectorization v0.12.98
Yes! (I am using VSCodium with NumThreads specified in the settings, 4 in my case.) Threads.nthreads() returns 4.
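For anyone reproducing this: the thread count has to be set before startup (an illustrative sketch; the VSCodium julia.NumThreads setting achieves the same thing):

```shell
# Set the thread count via the environment; Julia reads it at startup.
export JULIA_NUM_THREADS=4
echo "$JULIA_NUM_THREADS"
# Equivalent CLI flag (Julia >= 1.5):
#   julia -t 4
# Inside the session, Threads.nthreads() reports the count.
```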
Julia 1.7.0-DEV, Intel with 4 physical cores; I will try with 1.6.3. What I still find strange is that in my case the fastest code is the one copied from Matlab.
No big change, except for Tullio, which allocates less on 1.6.3:
1.586 s (20001 allocations: 943.88 KiB)
1.495 s (30001 allocations: 1.07 MiB)
510.505 ms (5 allocations: 6.59 KiB)
228.873 ms (6 allocations: 6.61 KiB)
217.190 ms (104 allocations: 5.67 KiB)
152.197 ms (19 allocations: 197.54 MiB)
166.631 ms (10 allocations: 188.00 MiB)
Did you re-enable @fastmath? I ask because I recently had the impression it could pessimize code.
What is going on here? I am getting this:
julia> @btime normalEquationsMatlabJulia($C,$Cx,$x,$A,$B);
3.808 ms (13 allocations: 9.17 MiB)
I’m copying here the complete Julia session, because I think I’m going crazy:
% julia
               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.6.3 (2021-09-23)
 _/ |\__'_|_|_|\__'_|  |  Official https://julialang.org/ release
|__/                   |
julia> using LoopVectorization
julia> function normalEquationsMatlabJulia(CC,Cx,A,B,x)
           batchSize = 10000
           stencil = ones(1, size(A,2))
           # @fastmath @inbounds @views for n = 1:batchSize:size(A,1)
           @views for n = 1:batchSize:size(A,1)
               idx = min(n+batchSize-1, size(A,1))
               @turbo temp = repeat(A[n:idx,:], 1, size(B,2)) .* kron(B[n:idx,:], stencil)
               @turbo CC .= CC .+ temp'*temp
               @turbo Cx .= Cx .+ temp'*x[n:idx]
           end
           return Cx
       end
normalEquationsMatlabJulia (generic function with 1 method)
julia> N = 10000;
julia> R = 20;
julia> M = 40;
julia> A = rand(N,M);
julia> B = rand(N,R);
julia> C = rand(M*R,M*R);
julia> x = rand(N,);
julia> Cx = rand(M*R,);
julia> @btime normalEquationsMatlabJulia($C,$Cx,$x,$A,$B);
3.808 ms (13 allocations: 9.17 MiB)
Is the result correct? @goerch had some @assert macros to check if it is correct (it is when I run it). Also, how does the @btime of that function compare to the others on your machine?
Well, the assertions are not correct, because they just compare the same array with itself:

normal!(C,Cx,x,A,B_r);
CxNormal = Cx # CxNormal is just the same array as Cx
normalTurbo!(C,Cx,x,A,B);
@assert Cx ≈ CxNormal # this will always be true

It should be CxNormal = copy(Cx).
Also, since the functions modify Cx, the different versions are not operating on the same values. The original Cx should be copied from call to call (I am not sure how much that affects performance, but it may, and it is necessary to assert the values correctly).
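A minimal illustration of the aliasing issue, independent of the benchmark code (f! here is just a stand-in for any mutating function):

```julia
f!(y) = (y .+= 1; y)        # stand-in for any mutating benchmark function

Cx  = rand(3)
ref = Cx                    # ref is the SAME array as Cx, just another name
f!(Cx)
@assert ref ≈ Cx            # always true, no matter what f! computed

snap = copy(Cx)             # an independent snapshot
f!(Cx)
@assert !(snap ≈ Cx)        # now the comparison actually tests something
```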
Sorry for the mistake. I tried to correct my post above. As far as I understand, we have three different results:

- normal!, normalTurbo! and normalTturbo! computing the same:
  3.404 s (20004 allocations: 5.81 MiB)
  1.184 s (20004 allocations: 5.81 MiB)
  1.179 s (20004 allocations: 5.81 MiB)
- normalEquations!, normalEquationsP!, normalEquationsMatlab and normalEquationsMatlabJulia computing the same:
  385.193 ms (8 allocations: 4.90 MiB)
  479.380 ms (8 allocations: 4.90 MiB)
  182.132 ms (19 allocations: 197.54 MiB)
  209.968 ms (13 allocations: 192.88 MiB)
- normalEquationsTullio! on its own:
  99.472 ms (156 allocations: 4.90 MiB)
Sorry, I didn’t notice that before I replied. FWIW, I think that annotations far from their effect make such misinterpretations likely.