Outperformed by Matlab

So your requirement is N around 10^7, while R and M stay in the same order of magnitude?

LinearAlgebra and the BLAS underneath are already multithreaded.
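For reference, you can check and pin the BLAS thread count yourself; a quick sketch (the count of 6 is just my machine's core count):

using LinearAlgebra

BLAS.get_num_threads()   # query the current BLAS thread count
BLAS.set_num_threads(6)  # pin it to the number of physical cores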

Just tried my latest and greatest for N = 1000000:

using BenchmarkTools

@btime normal1!($C, $Cx, $x, $A, $B)              # unblocked baseline

for blocksize in [10, 100, 1000, 10000, 100000]   # one timing per block size
    @btime normal2!($C, $Cx, $x, $A, $B, $blocksize)
end

with results

  48.886 s (1 allocation: 6.38 KiB)
  10.600 s (7 allocations: 130.25 KiB)
  7.888 s (8 allocations: 1.27 MiB)
  8.257 s (9 allocations: 12.67 MiB)
  8.619 s (10 allocations: 126.72 MiB)
  8.492 s (10 allocations: 1.24 GiB)

My 6 cores are running hot. Most time is spent in gemm. How fast do you think we can get this?
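If you want to double-check where the time goes, a minimal profiling sketch (the normal2! call and the mincount threshold are just placeholders):

using Profile

Profile.clear()
@profile normal2!(C, Cx, x, A, B, 1000)
Profile.print(mincount = 50)   # show only hot frames; the BLAS gemm kernel should dominate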

Edit: an obvious improvement is implementing C += temp * temp' with rank-1 or rank-k updates via
BLAS.syr! and BLAS.syrk! (a sketch follows the results). Here are the results:

  64.247 s (1 allocation: 6.38 KiB)
  6.456 s (4 allocations: 125.09 KiB)
  4.752 s (4 allocations: 1.22 MiB)
  5.423 s (4 allocations: 12.21 MiB)
  6.103 s (4 allocations: 122.07 MiB)
  5.422 s (4 allocations: 1.19 GiB)
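For anyone following along, this is roughly what the rank-k and rank-1 updates look like; a minimal sketch with hypothetical sizes, not the exact code above:

using LinearAlgebra

n, k = 1000, 100                  # hypothetical sizes for illustration
C    = zeros(n, n)
temp = randn(n, k)
x    = randn(n)

# Rank-k update: C := temp * temp' + C, writing only the upper
# triangle ('U'); 'N' means temp is used as given, not transposed.
BLAS.syrk!('U', 'N', 1.0, temp, 1.0, C)

# Rank-1 update for a single column: C := x * x' + C
BLAS.syr!('U', 1.0, x, C)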

Another edit, after the discussion, using symmetric matrices for C (sketch below the results):

  65.309 s (1 allocation: 6.38 KiB)
  6.326 s (4 allocations: 125.09 KiB)
  4.815 s (4 allocations: 1.22 MiB)
  4.616 s (4 allocations: 12.21 MiB)
  5.026 s (4 allocations: 122.07 MiB)
  4.971 s (4 allocations: 1.19 GiB)
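The symmetric-matrix change amounts to wrapping C so that only one triangle is stored and updated; a minimal sketch, where the :U choice is my assumption:

using LinearAlgebra

n, k = 1000, 100                  # hypothetical sizes
C    = zeros(n, n)
temp = randn(n, k)

# Accumulate into the upper triangle only ...
BLAS.syrk!('U', 'N', 1.0, temp, 1.0, C)

# ... then view C as a full symmetric matrix without copying; downstream
# factorizations and multiplies can dispatch to symmetry-aware routines.
Cs = Symmetric(C, :U)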