OP should really post the Matlab code, since it seems mistakes in the Julia code were fixed along the way. I can run Matlab timings on my machine for a direct comparison.
For me, this whole adventure boils down to these two implementations:
using LinearAlgebra
using MKL
using BenchmarkTools
# In-place Kronecker product of two vectors: C = kron(A, B).
function _kron!(C::T1, A::T2, B::T3) where {T1 <: AbstractVector, T2 <: AbstractVector, T3 <: AbstractVector}
    @views for i = 1:size(A, 1)
        # scale B by A[i] into the i-th length-size(B,1) block of C
        mul!(C[(i-1)*size(B,1)+1:i*size(B,1)], A[i], B)
    end
    return C
end
function _kron(A::T1, B::T2) where {T1 <: AbstractVector, T2 <: AbstractVector}
    C = zeros(size(A,1) * size(B,1))
    _kron!(C, A, B)
    return C
end
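# Sanity check (added for illustration): _kron should agree with Base.kron.
let a = rand(3), b = rand(4)
    @assert _kron(a, b) ≈ kron(a, b)
end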
# Memory-lean variant: accumulate C and Cx with one rank-1 update per row.
function normal1!(C, Cx, x, A, B)
    temp = Vector{Float64}(undef, size(A,2)*size(B,2))
    @views for n = 1:size(A,1)
        _kron!(temp, B[n,:], A[n,:])     # temp = kron(B[n,:], A[n,:])
        BLAS.ger!(1.0, temp, temp, C)    # C  += temp * temp'
        BLAS.axpy!(x[n], temp, Cx)       # Cx += x[n] * temp
    end
    return Cx
end
# Transposed Khatri-Rao product: column n of C is kron(B[n,:], A[n,:]).
function kr!(C, A, B)
    for n = 1:size(A,1)
        @views for i = 1:size(B, 2)
            BLAS.axpy!(B[n,i], A[n,:], C[(i-1)*size(A,2)+1:i*size(A,2), n:n])
        end
    end
    return C
end
function kr(A, B)
    C = zeros(size(A,2)*size(B,2), size(A,1))
    kr!(C, A, B)
    return C
end
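# Sanity check (added for illustration): column n of kr(A, B) is the
# Kronecker product of the n-th rows, i.e. a transposed Khatri-Rao product.
let A0 = rand(5, 3), B0 = rand(5, 2)
    @assert kr(A0, B0)[:, 2] ≈ kron(B0[2, :], A0[2, :])
end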
# Memory-hungry variant: materialize the full Khatri-Rao matrix, then two GEMMs.
function normal2!(C, Cx, x, A, B)
    temp = kr(A, B)
    mul!(C, temp, temp')    # C  = temp * temp'
    mul!(Cx, temp, x)       # Cx = temp * x
    return Cx
end
N = 10000;
R = 20;
M = 40;
A = rand(N, M);
B = rand(N, R);
x = rand(N);
C = zeros(M*R, M*R);
Cx = zeros(M*R);
normal1!(C, Cx, x, A, B);
CNormal = copy(C)
CxNormal = copy(Cx)
C = zeros(M*R, M*R);
Cx = zeros(M*R);
normal2!(C, Cx, x, A, B);
@assert C ≈ CNormal    # both implementations must agree
@assert Cx ≈ CxNormal
@btime normal1!($C,$Cx,$x,$A,$B);
@btime normal2!($C,$Cx,$x,$A,$B);
which trade memory for time:
313.621 ms (1 allocation: 6.38 KiB)
49.573 ms (2 allocations: 61.04 MiB)
I don’t see an immediate need for other optimizations besides using BLAS.
Hello @goerch, I tested your addition without MKL, which I have trouble getting to work. Here are the results:
4.221 s (1 allocation: 6.38 KiB)
85.958 ms (2 allocations: 61.04 MiB)
1.443 s (1 allocation: 6.38 KiB)
1.422 s (10001 allocations: 162.62 KiB)
489.500 ms (4 allocations: 224 bytes)
192.034 ms (5 allocations: 240 bytes)
214.050 ms (104 allocations: 5.67 KiB)
161.611 ms (10 allocations: 188.00 MiB)
165.157 ms (10 allocations: 188.00 MiB)
@btime normal1!($C,$Cx,$x,$A,$B);
gives me essentially the same performance as Matlab's 0.0869 seconds, which is quite impressive (without MKL). Now I agree it is indeed a question of memory vs. time. However, suppose the constraint is on memory (you do not want to compute the large C matrix at once), which is indeed the case for me. Then one solution would be to do what you are doing in normal1!; however, it seems like @tullio and plain @turbo do a better job in that regard with the reduction etc., probably also because they manage the threads efficiently over the whole computation instead of having to deal with multiple calls to BLAS. So I think that (meaningful) performance gains would be found in the @turbo area of things. On a side note, in Matlab I am essentially doing what you are doing, except in batches, so as not to run out of memory. I was hoping (and it seems to be somewhat the case) that @turbo and @tullio can take care of the reductions in some kind of optimal way, as opposed to the empirical and probably far-from-optimal repeated calls to BLAS with chunks of data. So again, I think (perhaps naively, as I am new to Julia) that performance can probably be found using @turbo, perhaps by leveraging the fact that the rank-1 updates are symmetric (many calculations are repeated).
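To make the batching idea concrete, here is a minimal sketch of such a batched variant (my own construction, reusing kr! from above; normal_batched! and its default chunk size are assumptions, not code from this thread):

# Hypothetical middle ground: process `batchsize` rows at a time, so the
# Khatri-Rao temporary never exceeds (M*R) x batchsize. Assumes C and Cx
# start at zero, as in the driver script above.
function normal_batched!(C, Cx, x, A, B, batchsize=10_000)
    N = size(A, 1)
    temp = zeros(size(A,2) * size(B,2), batchsize)   # reused across batches
    for n in 1:batchsize:N
        idx = min(n + batchsize - 1, N)
        @views begin
            T = temp[:, 1:(idx - n + 1)]
            fill!(T, 0.0)                    # kr! accumulates via axpy!
            kr!(T, A[n:idx, :], B[n:idx, :])
            mul!(C, T, T', 1.0, 1.0)         # C  += T * T'
            mul!(Cx, T, x[n:idx], 1.0, 1.0)  # Cx += T * x[n:idx]
        end
    end
    return Cx
end

The chunk size then trades memory for GEMM efficiency, interpolating between normal1! (batchsize = 1) and normal2! (batchsize = N).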
While I have not read all variants, here's one way to fuse the kron into the mul!, with no temporary arrays:
julia> using Tullio, LoopVectorization
julia> function normal4!(C, Cx, x, A, B)
           # @cast temp[(i,j), z] := A[z,i] * B[z,j]  # z is the long index
           # @tullio C[ij, ij2] = temp[ij,z] * temp[ij2,z]
           C4 = reshape(C, size(A,2), size(B,2), size(A,2), size(B,2))
           @tullio C4[i,j, i2,j2] = A[z,i] * B[z,j] * A[z,i2] * B[z,j2]
           # @tullio Cx[ij] = temp[ij, z] * x[z]
           Cx2 = reshape(Cx, size(A,2), size(B,2))
           @tullio Cx2[i,j] = A[z,i] * B[z,j] * x[z]
           Cx
       end
normal4! (generic function with 1 method)
julia> @btime normal4!($C,$Cx,$x,$A,$B);
min 116.381 ms, mean 117.503 ms (105 allocations, 5.52 KiB)
julia> C ≈ CNormal
true
julia> Cx ≈ CxNormal
true
julia> @btime normal1!($C,$Cx,$x,$A,$B); # from @goerch just above
min 394.857 ms, mean 451.183 ms (1 allocation, 6.38 KiB)
julia> @btime normal2!($C,$Cx,$x,$A,$B); # from @goerch just above, without MKL
min 54.951 ms, mean 57.396 ms (2 allocations, 61.04 MiB. GC mean 1.81%)
julia> versioninfo()
Julia Version 1.8.0-DEV.1002
Commit 89c137c1ef (2021-11-17 05:04 UTC)
Platform Info:
OS: macOS (arm64-apple-darwin20.6.0)
CPU: Apple M1
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-12.0.1 (ORCJIT, cyclone)
Environment:
JULIA_NUM_THREADS = 4
P.S. Note that the function kr!, using a loop over BLAS.axpy! and calculating slices by hand, is quite a bit slower than just using broadcasting & reshape (although this is only 10% of normal2!'s time):
julia> using TensorCast
julia> AB = similar(kr(A, B));
julia> @btime kr!($AB, $A, $B);
min 7.544 ms, mean 7.592 ms (0 allocations)
julia> @btime @cast $AB[(i,j), z] = $A[z,i] * $B[z,j];
min 2.909 ms, mean 2.958 ms (2 allocations, 96 bytes)
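For reference, the broadcast-and-reshape form that @cast expands to is roughly this hand-written equivalent (my sketch, not the macro's literal output):

AB2 = reshape(reshape(A', size(A,2), 1, :) .* reshape(B', 1, size(B,2), :),
              size(A,2) * size(B,2), size(A,1))
@assert AB2 ≈ kr(A, B)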
Hello @tbeason, impressive performance on your 8 threads for @tullio and @tturbo! Here is the Matlab code:
function [CC,Cx] = normalEquations(A,B,x)
    [N,DA] = size(A);
    [~,DB] = size(B);
    CC = zeros(DA*DB,DA*DB);
    Cx = zeros(DA*DB,1);
    batchSize = 10000;
    for n = 1:batchSize:N
        idx = min(n+batchSize-1,N);
        % row-wise Kronecker (Khatri-Rao) product of the current batch
        temp = repmat(A(n:idx,:),1,DB).*kron(B(n:idx,:), ones(1, DA));
        CC = CC+temp'*temp;
        Cx = Cx+temp'*x(n:idx,:);
    end
end
Note that batchSize = 10000; means that in the current example, you would first form the whole X matrix and then compute X'X and X'y in two separate steps. In practice, I have been using batchSize = 10000 when N ≈ 10^7. As in my latest reply to @goerch, this is probably very naive, so my hope was that the smart threading of @tturbo and @tullio could do the trick. It seems that on your machine this is indeed the case! So even if @tullio computes some things twice in the two calls, it is still faster than the time-optimal BLAS approach from @goerch. The consequences of this are however not clear to me: does this mean that further improvements with LoopVectorization are possible by computing the rank-1 updates efficiently (accounting for symmetry)? I assume BLAS does this in ger! and other functions.
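For what it's worth, the BLAS routine that actually exploits that symmetry is the symmetric rank-1 update syr! (ger! is the general, non-symmetric one); a tiny sketch with made-up sizes:

using LinearAlgebra
v = rand(5)
C = zeros(5, 5)
BLAS.syr!('U', 1.0, v, C)        # C += v * v', writing only the upper triangle
@assert Symmetric(C, :U) ≈ v * v'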
I used your function and then ran
avgt = 0.0;
reps = 250;
for t = 1:reps
    tic
    [CC,Cx] = normalEquations(A,B,x);
    avgt = avgt + toc/reps;
end
avgt
The resulting avgt was 0.0390.
So your requirements are N around 10^7, but R and M stay at their current order of magnitude?
LinearAlgebra and the BLAS underneath are already multithreaded.
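For reference, the BLAS thread count can be inspected and adjusted from Julia (standard LinearAlgebra.BLAS API):

using LinearAlgebra
BLAS.get_num_threads()    # how many threads BLAS currently uses
BLAS.set_num_threads(6)   # e.g. match the number of physical cores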
Just tried my latest and greatest for N = 1000000:
@btime normal1!($C,$Cx,$x,$A,$B)
for blocksize in [10, 100, 1000, 10000, 100000]
    @btime normal2!($C,$Cx,$x,$A,$B,$blocksize)
end
with the following results:
48.886 s (1 allocation: 6.38 KiB)
10.600 s (7 allocations: 130.25 KiB)
7.888 s (8 allocations: 1.27 MiB)
8.257 s (9 allocations: 12.67 MiB)
8.619 s (10 allocations: 126.72 MiB)
8.492 s (10 allocations: 1.24 GiB)
My 6 cores are running hot. Most time is spent in gemm. How fast do you think we can get this?
Edit: an obvious improvement is implementing C += temp * temp' with rank-1 or rank-k updates via BLAS.syr! and BLAS.syrk!; here are the results:
64.247 s (1 allocation: 6.38 KiB)
6.456 s (4 allocations: 125.09 KiB)
4.752 s (4 allocations: 1.22 MiB)
5.423 s (4 allocations: 12.21 MiB)
6.103 s (4 allocations: 122.07 MiB)
5.422 s (4 allocations: 1.19 GiB)
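For concreteness, a rank-k update of that shape could look like this (my sketch with toy sizes, not the exact code behind the numbers above):

using LinearAlgebra
T = rand(6, 4)                          # stands in for one temp block
C = zeros(6, 6)
BLAS.syrk!('U', 'N', 1.0, T, 1.0, C)    # C += T * T', upper triangle only
@assert Symmetric(C, :U) ≈ T * T'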
Another edit, after the discussion: using symmetric matrices for C:
65.309 s (1 allocation: 6.38 KiB)
6.326 s (4 allocations: 125.09 KiB)
4.815 s (4 allocations: 1.22 MiB)
4.616 s (4 allocations: 12.21 MiB)
5.026 s (4 allocations: 122.07 MiB)
4.971 s (4 allocations: 1.19 GiB)
Thanks for your input. I believe your proposal is already contained in the OP’s benchmark in the following form:
function normalEquationsTullio!(C,Cx,x,A,B)
    C4 = reshape(C, size(B,2),size(A,2),size(B,2),size(A,2)) # writing into original C
    Cx4 = reshape(Cx, size(B,2),size(A,2))
    @tullio C4[j,i,l,k] = A[n,i] * B[n,j] * A[n,k] * B[n,l]
    @tullio Cx4[j,i] = A[n,i] * B[n,j] * x[n]
    return Cx
end
I just tested normal4! and it behaves very similarly.
Yes, this is the one part where I'm not that happy with BLAS: it also requires zero-initializing the result every time. The current version therefore uses
C[(i-1)*size(A,2)+1:i*size(A,2), n:n] .= A[n,:] .* B[n,i]
instead.
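A toy illustration of the difference (my own example): axpy! accumulates into its destination, so uninitialized memory poisons the result, while broadcast assignment simply overwrites it:

using LinearAlgebra
x = rand(3)
y = fill(NaN, 3)               # stands in for undef memory
# BLAS.axpy!(2.0, x, y)        # y .= y .+ 2.0 .* x  -> would stay NaN
y .= 2.0 .* x                  # plain overwrite, no zero-initialization needed
@assert y ≈ 2.0 .* x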
Hi Tyler,
thanks for your time and effort. It would be very useful to know in which unit this time is reported.
Oh right, sorry. Somehow I saw .* vs * and stopped reading the earlier code.
That is in seconds, so 39 ms. Similar to the fastest Julia code.
Thanks again. So I assume your Matlab doesn't use MKL either, and that you could increase the number of BLAS threads? That is good to hear! BTW, MKL outperforms OpenBLAS in this benchmark by around a factor of 2, as far as I can see.
I tried adding MKL but got errors (on Windows 10). Supposedly it works better with 1.7 but I haven’t gotten around to making that my daily driver yet.
I hear you. Just switched to 1.7.0-rc3 to be able to check OpenBLAS vs. MKL.
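For anyone following along: on 1.7 the switch is just a package load, thanks to libblastrampoline (assuming MKL.jl installs cleanly on your platform):

using MKL                # swap the default BLAS backend to MKL
using LinearAlgebra
BLAS.get_config()        # should now list the MKL library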
Lessons learned:
- Regarding kron: we had this discussion, which shows there is probably room to improve performance for Kronecker products and their kin, such as Hadamard and Khatri-Rao products.
- Regarding basic linear algebra: we had this and that discussion about how to dispatch high-level linear algebra operations to OpenBLAS or MKL.

For now I'd recommend that users coming from MATLAB and seeking high performance look into the lower-level BLAS interfaces. Would it be sensible to update the documentation to reflect this experience?
P.S.: a word on the differences w.r.t. the selected BLAS implementation wouldn't hurt either.