I have some code that does a large matrix multiplication in a for loop, and it currently takes longer than I expected. Is it possible to improve the following MWE?
temp1 = randn(1600, 1600)
temp2 = randn(1600)
temp = Vector{Vector{Float64}}(undef, 10000)
@time for t = 1:10000
temp[t] = temp1 * temp2
end
6.276464 seconds (19.49 k allocations: 123.436 MiB, 0.34% gc time)
You expected wrong. Directly multiplying Float64 matrices simply calls OpenBLAS under the hood; there's nothing you can do (in fact, probably nothing can be done, period).
Not sure that’s totally fair; for example, you can try changing the number of threads OpenBLAS uses, because apparently its defaults are sometimes pretty bad. You can also try MKL.jl to use MKL instead.
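For instance (a minimal sketch; the thread count of 4 is just illustrative, pick whatever matches your machine):
julia> using LinearAlgebra
julia> BLAS.get_num_threads()   # inspect the current OpenBLAS default first
julia> BLAS.set_num_threads(4)  # illustrative; try matching your physical core count
Switching to MKL is then just a matter of adding the MKL.jl package and doing `using MKL`, which on recent Julia versions swaps the BLAS backend for the whole session.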
Especially for matrix-vector multiplication, which is completely limited by memory bandwidth. With a single thread, temp1 * temp2 and sum(temp1) are about equally fast:
julia> temp1 = randn(1600, 1600);
julia> temp2 = randn(1600);
julia> temp3 = similar(temp2);
julia> using LinearAlgebra, BenchmarkTools; BLAS.set_num_threads(1);
julia> @benchmark mul!($temp3, $temp1, $temp2)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 1.074 ms (0.00% GC)
median time: 1.099 ms (0.00% GC)
mean time: 1.115 ms (0.00% GC)
maximum time: 1.499 ms (0.00% GC)
--------------
samples: 4479
evals/sample: 1
julia> @benchmark sum($temp1)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 1.120 ms (0.00% GC)
median time: 1.149 ms (0.00% GC)
mean time: 1.169 ms (0.00% GC)
maximum time: 1.613 ms (0.00% GC)
--------------
samples: 4274
evals/sample: 1
(Re: the GPU suggestion, GPUs do have far higher memory bandwidth than any normal CPU, so they are good for this sort of memory-bandwidth-limited task [and not just compute-heavy things like matrix-matrix multiplication].)
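Something along these lines (a minimal sketch assuming an NVIDIA GPU with CUDA.jl installed, and BenchmarkTools loaded as above; variable names are just illustrative):
julia> using CUDA, LinearAlgebra
julia> d1 = CuArray(temp1); d2 = CuArray(temp2); d3 = CUDA.zeros(Float64, 1600);
julia> @benchmark CUDA.@sync mul!($d3, $d1, $d2)  # @sync so we time the actual kernel, not just the launch
Bear in mind that consumer GPUs are typically much slower in Float64 than in Float32, so the single-precision suggestion below matters even more there.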
Are your consecutive iterations dependent on each other? If not, you could maybe reformulate your problem to use matrix-matrix multiplication, which is not memory bound and is far more efficient.
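Something like this (a sketch assuming all 10000 right-hand-side vectors are known up front; temp2mat is a made-up name):
julia> temp2mat = randn(1600, 10000);  # all vectors stacked as columns of one matrix
julia> temp = temp1 * temp2mat;        # one BLAS-3 call instead of 10000 BLAS-2 calls
Column t of the result then corresponds to temp[t] in the original loop.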
If your application allows for it, you could also switch to single-precision Float32, which roughly halves the time compared with the benchmark above.
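Roughly like this (a sketch reusing the arrays from the benchmark above):
julia> temp1f = Float32.(temp1); temp2f = Float32.(temp2); temp3f = similar(temp2f);
julia> @benchmark mul!($temp3f, $temp1f, $temp2f)
Since matrix-vector multiplication is bandwidth bound, halving the bytes per element halves the data that has to be moved, which is where the ~2x comes from.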
The hint from @DNF is valuable and you should think about it. Maybe you can even improve the whole approach, or do something smart if your matrices are correlated or have a specific structure or property (symmetric, a logarithmic value range, etc.).
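For instance, if your matrix happened to be symmetric (a hypothetical sketch, not something stated in the original problem), wrapping it in Symmetric lets Julia dispatch to BLAS's symv kernel, which only has to read one triangle of the matrix:
julia> using LinearAlgebra
julia> S = Symmetric(temp1 + temp1');  # force symmetry, purely for illustration
julia> @benchmark mul!($temp3, $S, $temp2)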