So your requirements are N around 10^7, while R and M stay in their current order of magnitude? Note that `LinearAlgebra`, with BLAS underneath, is already multithreaded.
Just tried my latest and greatest for N = 1000000:

```julia
@btime normal1!($C, $Cx, $x, $A, $B)
for blocksize in [10, 100, 1000, 10000, 100000]
    @btime normal2!($C, $Cx, $x, $A, $B, $blocksize)
end
```
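For anyone reproducing this: the `$` interpolation in `@btime` splices the current value into the benchmarked expression, so global (and loop) variable lookups are not part of the measurement. A minimal standalone sketch with placeholder data (not the actual `normal1!`/`normal2!` setup):

```julia
using BenchmarkTools

# Placeholder data; the real benchmark above uses C, Cx, x, A, B.
A = randn(1000, 1000)

for n in [10, 100]
    # $A and $n are interpolated, so the timing excludes global-variable lookup.
    @btime sum(view($A, 1:$n, 1:$n))
end
```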
with results

```
normal1!:             48.886 s (1 allocation: 6.38 KiB)
blocksize = 10:       10.600 s (7 allocations: 130.25 KiB)
blocksize = 100:       7.888 s (8 allocations: 1.27 MiB)
blocksize = 1000:      8.257 s (9 allocations: 12.67 MiB)
blocksize = 10000:     8.619 s (10 allocations: 126.72 MiB)
blocksize = 100000:    8.492 s (10 allocations: 1.24 GiB)
```
My 6 cores are running hot; most of the time is spent in `gemm`. How fast do you think we can get this?
Edit: an obvious improvement is implementing `C += temp * temp'` with rank-1 or rank-k updates via `BLAS.syr!` and `BLAS.syrk!`. Here are the results:
```
normal1!:             64.247 s (1 allocation: 6.38 KiB)
blocksize = 10:        6.456 s (4 allocations: 125.09 KiB)
blocksize = 100:       4.752 s (4 allocations: 1.22 MiB)
blocksize = 1000:      5.423 s (4 allocations: 12.21 MiB)
blocksize = 10000:     6.103 s (4 allocations: 122.07 MiB)
blocksize = 100000:    5.422 s (4 allocations: 1.19 GiB)
```
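For concreteness, the rank-k update replaces an allocating `C .+= temp * temp'` with an in-place BLAS call that only writes one triangle of `C`. A minimal sketch with placeholder dimensions (`temp` standing in for one block of intermediate columns, not the actual variables from `normal2!`):

```julia
using LinearAlgebra

R, blocksize = 8, 100            # placeholder dimensions
C    = zeros(R, R)               # accumulator; only one triangle is kept valid
temp = randn(R, blocksize)       # one block of intermediate vectors, as columns

# Rank-k update: C = 1.0 * temp * temp' + 1.0 * C, upper triangle only
BLAS.syrk!('U', 'N', 1.0, temp, 1.0, C)

# Rank-1 analogue for a single vector:
v = randn(R)
BLAS.syr!('U', 1.0, v, C)

# The lower triangle of C is stale; read the result through a Symmetric wrapper
Csym = Symmetric(C, :U)
```

Since `syrk!` touches only `R*(R+1)/2` entries instead of `R^2` and skips the temporary from `temp * temp'`, it roughly halves the work of this step.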
Another edit, after some discussion: using `Symmetric` matrices for `C`:
```
normal1!:             65.309 s (1 allocation: 6.38 KiB)
blocksize = 10:        6.326 s (4 allocations: 125.09 KiB)
blocksize = 100:       4.815 s (4 allocations: 1.22 MiB)
blocksize = 1000:      4.616 s (4 allocations: 12.21 MiB)
blocksize = 10000:     5.026 s (4 allocations: 122.07 MiB)
blocksize = 100000:    4.971 s (4 allocations: 1.19 GiB)
```
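A sketch of the `Symmetric` idea, under the assumption that `C` is only ever updated through `syrk!`: wrapping the array tags one triangle as authoritative, costs no copy, and lets later products dispatch to symmetric BLAS kernels:

```julia
using LinearAlgebra

R = 8                                  # placeholder dimension
C = zeros(R, R)
A = randn(R, 100)
BLAS.syrk!('U', 'N', 1.0, A, 1.0, C)   # fills only the upper triangle of C

S = Symmetric(C, :U)   # zero-copy wrapper; the lower triangle is mirrored on read
x = randn(R)
y = S * x              # dispatches to a symmetric BLAS kernel (symv)
```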