So your requirements are N around 10^7, while R and M stay in their current order of magnitude? Note that `LinearAlgebra`, with BLAS underneath, is already multithreaded.
Just tried my latest and greatest for N = 1000000:

```julia
@btime normal1!($C, $Cx, $x, $A, $B)
for blocksize in [10, 100, 1000, 10000, 100000]
    @btime normal2!($C, $Cx, $x, $A, $B, $blocksize)
end
```
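For anyone reproducing this: the `$` interpolation in `@btime` splices the current value into the benchmarked expression, so global (and loop) variable lookups are not part of the measurement. A minimal standalone sketch with placeholder data (not the actual `normal1!`/`normal2!` setup):

```julia
using BenchmarkTools

# Placeholder data; the real benchmark above uses C, Cx, x, A, B.
A = randn(1000, 1000)

for n in [10, 100]
    # $A and $n are interpolated, so the timing excludes global-variable lookup.
    @btime sum(view($A, 1:$n, 1:$n))
end
```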
with results

```
normal1!:             48.886 s (1 allocation: 6.38 KiB)
blocksize = 10:       10.600 s (7 allocations: 130.25 KiB)
blocksize = 100:       7.888 s (8 allocations: 1.27 MiB)
blocksize = 1000:      8.257 s (9 allocations: 12.67 MiB)
blocksize = 10000:     8.619 s (10 allocations: 126.72 MiB)
blocksize = 100000:    8.492 s (10 allocations: 1.24 GiB)
```
My 6 cores are running hot; most of the time is spent in `gemm`. How fast do you think we can get this?
Edit: an obvious improvement is implementing `C += temp * temp'` with rank-1 or rank-k updates via `BLAS.syr!` and `BLAS.syrk!`. Here are the results:
```
normal1!:             64.247 s (1 allocation: 6.38 KiB)
blocksize = 10:        6.456 s (4 allocations: 125.09 KiB)
blocksize = 100:       4.752 s (4 allocations: 1.22 MiB)
blocksize = 1000:      5.423 s (4 allocations: 12.21 MiB)
blocksize = 10000:     6.103 s (4 allocations: 122.07 MiB)
blocksize = 100000:    5.422 s (4 allocations: 1.19 GiB)
```
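For concreteness, the rank-k update replaces an allocating `C .+= temp * temp'` with an in-place BLAS call that only writes one triangle of `C`. A minimal sketch with placeholder dimensions (`temp` standing in for one block of intermediate columns, not the actual variables from `normal2!`):

```julia
using LinearAlgebra

R, blocksize = 8, 100            # placeholder dimensions
C    = zeros(R, R)               # accumulator; only one triangle is kept valid
temp = randn(R, blocksize)       # one block of intermediate vectors, as columns

# Rank-k update: C = 1.0 * temp * temp' + 1.0 * C, upper triangle only
BLAS.syrk!('U', 'N', 1.0, temp, 1.0, C)

# Rank-1 analogue for a single vector:
v = randn(R)
BLAS.syr!('U', 1.0, v, C)

# The lower triangle of C is stale; read the result through a Symmetric wrapper
Csym = Symmetric(C, :U)
```

Since `syrk!` touches only `R*(R+1)/2` entries instead of `R^2` and skips the temporary from `temp * temp'`, it roughly halves the work of this step.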
Another edit, after some discussion: using `Symmetric` matrices for `C`:
```
normal1!:             65.309 s (1 allocation: 6.38 KiB)
blocksize = 10:        6.326 s (4 allocations: 125.09 KiB)
blocksize = 100:       4.815 s (4 allocations: 1.22 MiB)
blocksize = 1000:      4.616 s (4 allocations: 12.21 MiB)
blocksize = 10000:     5.026 s (4 allocations: 122.07 MiB)
blocksize = 100000:    4.971 s (4 allocations: 1.19 GiB)
```
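A sketch of the `Symmetric` idea, under the assumption that `C` is only ever updated through `syrk!`: wrapping the array tags one triangle as authoritative, costs no copy, and lets later products dispatch to symmetric BLAS kernels:

```julia
using LinearAlgebra

R = 8                                  # placeholder dimension
C = zeros(R, R)
A = randn(R, 100)
BLAS.syrk!('U', 'N', 1.0, A, 1.0, C)   # fills only the upper triangle of C

S = Symmetric(C, :U)   # zero-copy wrapper; the lower triangle is mirrored on read
x = randn(R)
y = S * x              # dispatches to a symmetric BLAS kernel (symv)
```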