Matrix multiplication

Hi there,

I need to perform many matrix multiplications as follows. Here I proposed two functions (f1 and f2). I am wondering whether we could improve the speed of the function f2 further? Thanks in advance!

using LinearAlgebra
f1(C,X,Y,Z,T,x) = C .= .-x[1].*X*X' .- x[2].*Y*Y' .+ x[3].*Z*T;
function f2(C,X,Y,Z,T,x)
    x1 = -x[1]
    x2 = -x[2]
    x3 = x[3]
    fill!(C,0.0)    
    BLAS.syrk!('L', 'N', x1, X, 1.0, C)
    BLAS.syrk!('L', 'N', x2, Y, 1.0, C)
    conSym(C)
    mul!(C, Z, T, x3, 1.0)
end
function conSym(A)
   @inbounds for ii=1:size(A,1)-1
           for jj=ii+1:size(A,1)
             A[ii,jj] = A[jj,ii]   
           end
        end
end
v = 600;
r = 3;
b1 = 60;
b2 = 90;
x = rand(3);
C1 = zeros(Float64,v,v);
C2 = zeros(Float64,v,v);
C3 = zeros(Float64,v,v);

X = rand(v,b1);
Y = rand(v,b2);
Z = rand(v,v*r);
T = rand(v*r,v) ;         
@time f1(C1,X,Y,Z,T,x);
@time f2(C2,X,Y,Z,T,x);

@test all(C1 .≈ C2);

Are you aware that you are measuring the compilation time there as well? Add a call to f2 before that and you will get a better time estimate:

f2(C2,X,Y,Z,T,x);
@time f2(C2,X,Y,Z,T,x);
  0.012986 seconds

(instead of 0.063080 seconds (36.77 k allocations: 1.934 MiB), that you get in the first call)

Or even better, use BenchmarkTools.

You should read the Performance Tips section in the docs. For example, you are using global variables which affect type stability and the @time macro called once will also include the time needed for the first compilation. Use @btime from BenchmarkTools instead.

Are you aware that you are measuring the compilation time there as well? Add a call to f2 before that and you will get a better time estimate:

Yes, I am aware of that. I actually don’t compare the computing times of these functions for a first call. Also, the function f1 is just a way that I make sure that the calculation of function f2 is correct.

(instead of 0.063080 seconds (36.77 k allocations: 1.934 MiB), that you get in the first call)
Or even better, use BenchmarkTools.

Indeed, I actually used the command @btime. But somehow I got trouble when using this command @test all(C1 .≈ C2). This is because I thought that these commands @btime f1($C1,$X,$Y,$Z,$T,$x); @btime f2($C2,$X,$Y,$Z,$T,$x); will not change variables C1 and C2. Therefore, I used these commands @time f1(C1,X,Y,Z,T,x); @time f2(C2,X,Y,Z,T,x); in the original post so that my codes will run smoothly. But probably, I am wrong here. I am sorry about this. Nevertheless, this is computing times from my computer:

@btime f1($C1,$X,$Y,$Z,$T,$x);
  23.513 ms (12 allocations: 17.17 MiB)
@btime f2($C2,$X,$Y,$Z,$T,$x);
  17.317 ms (0 allocations: 0 bytes)

As you can see, the function f2 is not much faster than function f1 as I expected. I had an idea to write a loop. Since the dimensions of matrices X, Y, Z, and T are different, it is not straightforward to do that. So, do you have any suggestions? Thanks in advance!

To benchmark that with BenchmarkTools you should use:

setup = (
C3 = zeros(Float64,v,v);
Xin = X;
Yin = Y;
Zin = Z;
Tin = T;
)
@btime f2(C3,Xin,Yin,Zin,Tin,$x) setup=setup evals=1

On what remains, no, I do not see any obvious change to accelerate that code.

To benchmark that with BenchmarkTools you should use:

Yes, thanks for your tip!

Maybe less obvious. Depending on the application, one could switch to Float32, which should speed things up.