Outperformed by Matlab

The OP should really post the Matlab code, since it seems there were mistakes in the Julia code that were fixed along the way. I can run Matlab timings on my machine for a direct comparison.

8 Likes

For me, this whole adventure boils down to these two implementations:

using LinearAlgebra
using MKL
using BenchmarkTools

function _kron!(C::T1, A::T2, B::T3) where {T1 <: AbstractVector, T2 <: AbstractVector, T3 <: AbstractVector}
    # C = kron(A, B) for vectors: the i-th length-|B| block of C is A[i] * B
    @views for i = 1:size(A, 1)
        mul!(C[(i-1)*size(B,1)+1:i*size(B,1)],A[i],B)
    end
    return C
end

function _kron(A::T1, B::T2) where {T1 <: AbstractVector, T2 <: AbstractVector}
    C = zeros(size(A,1) * size(B,1))
    _kron!(C, A, B)
    return C
end

function normal1!(C,Cx,x,A,B)
    # Row-by-row accumulation: C += temp*temp' and Cx += x[n]*temp, where temp is
    # the Kronecker product of the n-th rows; only one temporary vector is allocated.
    temp = Vector{Float64}(undef,size(A,2)*size(B,2))
    @views for n = 1:size(A,1)
        _kron!(temp,B[n,:],A[n,:])
        BLAS.ger!(1.,temp,temp,C)
        BLAS.axpy!(x[n],temp,Cx)
    end
    return Cx
end

function kr!(C, A, B)
    # Transposed row-wise Khatri-Rao product: column n of C is kron(B[n,:], A[n,:]).
    # Note: C is accumulated into via axpy!, so it must be zero-initialized.
    for n = 1:size(A,1)
        @views for i = 1:size(B, 2)
            BLAS.axpy!(B[n,i],A[n,:],C[(i-1)*size(A,2)+1:i*size(A,2),n:n])
        end
    end
    return C
end

function kr(A, B)
    C = zeros(size(A,2)*size(B,2),size(A,1))
    kr!(C, A, B)
    return C
end

function normal2!(C,Cx,x,A,B)
    # Materialize the full Khatri-Rao product once, then use two BLAS calls.
    temp = kr(A, B)
    mul!(C, temp, temp')
    mul!(Cx, temp, x)
    return Cx
end

N = 10000;
R = 20;
M = 40;

A = rand(N,M);
B = rand(N,R);
x = rand(N,);

C = zeros(M*R,M*R);
Cx = zeros(M*R,);

normal1!(C,Cx,x,A,B);
CNormal = copy(C)
CxNormal = copy(Cx)

C = zeros(M*R,M*R);
Cx = zeros(M*R,);
normal2!(C,Cx,x,A,B);
@assert C ≈ CNormal
@assert Cx ≈ CxNormal

@btime normal1!($C,$Cx,$x,$A,$B);

@btime normal2!($C,$Cx,$x,$A,$B);

which trade memory for time:

  313.621 ms (1 allocation: 6.38 KiB)
  49.573 ms (2 allocations: 61.04 MiB)

I don’t see an immediate need for other optimizations besides using BLAS.

2 Likes

Hello @goerch, I tested your addition without MKL (which I’m having trouble getting to work). Here are the results:

  4.221 s (1 allocation: 6.38 KiB)
  85.958 ms (2 allocations: 61.04 MiB)
  1.443 s (1 allocation: 6.38 KiB)
  1.422 s (10001 allocations: 162.62 KiB)
  489.500 ms (4 allocations: 224 bytes)
  192.034 ms (5 allocations: 240 bytes)
  214.050 ms (104 allocations: 5.67 KiB)
  161.611 ms (10 allocations: 188.00 MiB)
  165.157 ms (10 allocations: 188.00 MiB)

@btime normal2!($C,$Cx,$x,$A,$B); gives me essentially the same performance as Matlab’s 0.0869 seconds, which is quite impressive (without MKL). So I agree it is indeed a question of memory vs. time. However, suppose the constraint is on memory (you do not want to form the large temporary matrix all at once), which is indeed my case. Then one solution would be to do what you are doing in normal1!, but it seems that @tullio and a plain @turbo do a better job there with the reduction, probably also because they manage the threads efficiently over the whole computation instead of dealing with many separate calls to BLAS. So I think that any meaningful performance gain lies in the @turbo area of things. As a side note, in Matlab I am essentially doing what you are doing, except in batches, so as not to run out of memory. I was hoping (and it seems to be somewhat the case) that @turbo and @tullio can take care of the reductions in a near-optimal way, as opposed to my empirical and probably far-from-optimal repeated calls to BLAS with chunks of data. So again, I think (perhaps naively, as I am new to Julia) that performance can probably be found with @turbo, perhaps by exploiting the fact that the rank-1 updates are symmetric (many calculations are repeated).
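
In Julia, that batched approach would look roughly like the sketch below (using @goerch’s kr from above; the helper name normal_batched! and the default batch size are just placeholders, not code I have benchmarked):

# Hypothetical batched variant: only a batch-sized slice of the Khatri-Rao
# product is ever materialized, and C/Cx are accumulated in place with 5-arg mul!.
function normal_batched!(C, Cx, x, A, B; batchsize=10_000)
    N = size(A, 1)
    for n in 1:batchsize:N
        idx = min(n + batchsize - 1, N)
        temp = kr(A[n:idx, :], B[n:idx, :])   # (M*R) × batch block
        mul!(C, temp, temp', 1.0, 1.0)        # C  += temp * temp'
        mul!(Cx, temp, x[n:idx], 1.0, 1.0)    # Cx += temp * x[n:idx]
    end
    return Cx
end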

While I have not read all the variants, here’s one way to fuse the kron into the mul!, with no temporary arrays:

julia> using Tullio, LoopVectorization

julia> function normal4!(C, Cx, x,A,B)
           # @cast temp[(i,j), z] := A[z,i] * B[z,j]  # z is the long index
           # @tullio C[ij, ij2] = temp[ij,z]* temp[ij2,z]
           C4 = reshape(C, size(A,2), size(B,2), size(A,2), size(B,2))
           @tullio C4[i,j, i2,j2] = A[z,i] * B[z,j] * A[z,i2] * B[z,j2]

           # @tullio Cx[ij] = temp[ij, z] * x[z]
           Cx2 = reshape(Cx, size(A,2), size(B,2))
           @tullio Cx2[i,j] = A[z,i] * B[z,j] * x[z]
           Cx
       end
normal4! (generic function with 1 method)

julia> @btime normal4!($C,$Cx,$x,$A,$B);
  min 116.381 ms, mean 117.503 ms (105 allocations, 5.52 KiB)

julia> C ≈ CNormal
true

julia> Cx ≈ CxNormal
true

julia> @btime normal1!($C,$Cx,$x,$A,$B);  # from @goerch just above
  min 394.857 ms, mean 451.183 ms (1 allocation, 6.38 KiB)

julia> @btime normal2!($C,$Cx,$x,$A,$B);  # from @goerch just above, without MKL
  min 54.951 ms, mean 57.396 ms (2 allocations, 61.04 MiB. GC mean 1.81%)

julia> versioninfo()
Julia Version 1.8.0-DEV.1002
Commit 89c137c1ef (2021-11-17 05:04 UTC)
Platform Info:
  OS: macOS (arm64-apple-darwin20.6.0)
  CPU: Apple M1
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.1 (ORCJIT, cyclone)
Environment:
  JULIA_NUM_THREADS = 4

P.S. Note that the function kr!, using a loop over BLAS.axpy! and calculating slices by hand, is quite a bit slower than just using broadcasting & reshape (although this is only about 10% of normal2!'s time):

julia> using TensorCast

julia> AB = similar(kr(A, B));

julia> @btime kr!($AB, $A, $B);
  min 7.544 ms, mean 7.592 ms (0 allocations)

julia> @btime @cast $AB[(i,j), z] = $A[z,i] * $B[z,j];
  min 2.909 ms, mean 2.958 ms (2 allocations, 96 bytes)
3 Likes

Hello @tbeason, impressive performance with your 8 threads for @tullio and @tturbo! Here is the Matlab code:

function [CC,Cx] = normalEquations(A,B,x)
    [N,DA] = size(A);
    [~,DB] = size(B);
    CC = zeros(DA*DB,DA*DB);
    Cx = zeros(DA*DB,1);
    batchSize = 10000;
    for n = 1:batchSize:N
        idx = min(n+batchSize-1,N);
        % row-wise Khatri-Rao product of the current batch of rows
        temp = repmat(A(n:idx,:),1,DB).*kron(B(n:idx,:), ones(1, DA));
        CC = CC+temp'*temp;
        Cx = Cx+temp'*x(n:idx,:);
    end
end

Note that batchSize = 10000; means that in the current example you would first form the whole X matrix and then compute X'X and X'y in two separate steps. In practice, I have been using batchSize = 10000 when N ≈ 10^7. As in my latest reply to @goerch, this is probably very naive, so my hope was that the smart threading of @tturbo and @tullio could do the trick. It seems that on your machine this is indeed the case! So even though @tullio computes some things twice across the two calls, it is still faster than the time-optimal BLAS approach from @goerch. The consequences of this are not clear to me, however: does it mean that further improvements with LoopVectorization are possible by computing the rank-1 updates efficiently (accounting for symmetry)? I assume BLAS does this in ger! and other functions.

1 Like

I used your function and then ran

avgt = 0.0;
reps = 250;
for t = 1:reps
    tic
    [CC,Cx] = normalEquations(A,B,x);
    avgt = avgt + toc/reps;
end

avgt was 0.0390.

1 Like

So your requirement is N around 10^7, while R and M stay in the same order of magnitude?

LinearAlgebra and the underlying BLAS are already multithreaded.
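
For example, the number of BLAS threads can be checked and changed at runtime (just an illustration, not part of the benchmark):

using LinearAlgebra

BLAS.get_num_threads()    # threads currently used by OpenBLAS/MKL
BLAS.set_num_threads(6)   # e.g. one per physical core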

Just tried my latest and greatest for N = 1000000:

@btime normal1!($C,$Cx,$x,$A,$B)

for blocksize in [10, 100, 1000, 10000, 100000]
    @btime normal2!($C,$Cx,$x,$A,$B,$blocksize)
end

with results

  48.886 s (1 allocation: 6.38 KiB)
  10.600 s (7 allocations: 130.25 KiB)
  7.888 s (8 allocations: 1.27 MiB)
  8.257 s (9 allocations: 12.67 MiB)
  8.619 s (10 allocations: 126.72 MiB)
  8.492 s (10 allocations: 1.24 GiB)

My 6 cores are running hot. Most time is spent in gemm. How fast do you think we can get this?

Edit: an obvious improvement is implementing C += temp * temp' with rank-1 or rank-k updates via BLAS.syr! and BLAS.syrk! (a rough sketch of this follows the timings below); here are the results:

  64.247 s (1 allocation: 6.38 KiB)
  6.456 s (4 allocations: 125.09 KiB)
  4.752 s (4 allocations: 1.22 MiB)
  5.423 s (4 allocations: 12.21 MiB)
  6.103 s (4 allocations: 122.07 MiB)
  5.422 s (4 allocations: 1.19 GiB)

Another edit, after the discussions above, now using symmetric matrices for C:

  65.309 s (1 allocation: 6.38 KiB)
  6.326 s (4 allocations: 125.09 KiB)
  4.815 s (4 allocations: 1.22 MiB)
  4.616 s (4 allocations: 12.21 MiB)
  5.026 s (4 allocations: 122.07 MiB)
  4.971 s (4 allocations: 1.19 GiB)
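
Since the blocked normal2! itself isn’t shown here, the idea is roughly the following (a simplified sketch rather than the exact code benchmarked above; only the upper triangle of C is written, so wrap it in Symmetric(C, :U) before using it as a full matrix):

function normal2_blocked!(C, Cx, x, A, B, blocksize)
    N = size(A, 1)
    for n in 1:blocksize:N
        idx = min(n + blocksize - 1, N)
        block = kr(A[n:idx, :], B[n:idx, :])            # Khatri-Rao of this batch of rows
        BLAS.syrk!('U', 'N', 1.0, block, 1.0, C)        # C  += block * block' (upper triangle only)
        BLAS.gemv!('N', 1.0, block, x[n:idx], 1.0, Cx)  # Cx += block * x[n:idx]
    end
    return Cx
end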
1 Like

Thanks for your input. I believe your proposal is already contained in the OP’s benchmark in the following form:

function normalEquationsTullio!(C,Cx,x,A,B)
    C4 = reshape(C, size(B,2),size(A,2),size(B,2),size(A,2))  # writing into original C
    Cx4 = reshape(Cx, size(B,2),size(A,2))
    @tullio C4[j,i,l,k] = A[n,i] * B[n,j] * A[n,k] * B[n,l]
    @tullio Cx4[j,i] = A[n,i] * B[n,j] * x[n]
    return Cx
end

I just tested normal4! and it behaves very similarly.

Yes, this is the one part where I’m not that happy with BLAS: it also requires zero-initializing the result every time. The current version therefore uses

C[(i-1)*size(A,2)+1:i*size(A,2), n:n] .= A[n,:] .* B[n,i]

instead.

Hi Tyler,

Thanks for your time and effort. It would be very useful to know the unit in which this time is reported.

Oh right, sorry. Somehow I saw .* vs * and stopped reading the earlier code.

1 Like

That is in seconds, so 39 ms, which is similar to the fastest Julia code.

2 Likes

Thanks again. So I assume your Matlab doesn’t use MKL either, and that you could increase the number of BLAS threads? That is good to hear! BTW, MKL outperforms OpenBLAS by around a factor of 2 in this benchmark, as far as I can see.

I tried adding MKL but got errors (on Windows 10). Supposedly it works better with 1.7 but I haven’t gotten around to making that my daily driver yet.

1 Like

I hear you. Just switched to 1.7.0-rc3 to be able to check OpenBLAS vs. MKL.
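
On 1.7 it is easy to check which backend is actually loaded (assuming MKL.jl is installed):

using LinearAlgebra
BLAS.get_config()   # lists the loaded BLAS/LAPACK libraries (OpenBLAS by default)

using MKL           # MKL.jl swaps the backend via libblastrampoline
BLAS.get_config()   # should now report MKL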

Lessons learned:

  1. Regarding kron: we had this discussion, which shows that there is probably room to improve the performance of Kronecker products and related operations such as Hadamard and Khatri–Rao products.
  2. Regarding basic linear algebra: we had this and that discussion about how high-level linear algebra operations are dispatched to OpenBLAS or MKL.

For now, I’d recommend that users coming from MATLAB and seeking high performance look into the lower-level BLAS interfaces. Would it be sensible to update the documentation to reflect this experience?

P.S.: a word on the performance differences between the available BLAS implementations wouldn’t hurt either.

1 Like