 # How to improve the speed of matrix multiplication?

I have code that performs a large matrix-vector multiplication in a for loop, and it currently takes longer than I expected. I wonder if it is possible to improve the following MWE:

``````temp1 = randn(1600, 1600)
temp2 = randn(1600)
temp = Vector{Vector{Float64}}(undef, 10000)
@time for t = 1:10000
    temp[t] = temp1 * temp2
end

6.276464 seconds (19.49 k allocations: 123.436 MiB, 0.34% gc time)
``````

Use a GPU

4 Likes
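A minimal sketch of the GPU suggestion using CUDA.jl (this assumes an NVIDIA GPU is available; the variable names mirror the MWE and are illustrative):

``````using CUDA, LinearAlgebra

temp1_d = CuArray(randn(1600, 1600))   # copy the matrix to device memory
temp2_d = CuArray(randn(1600))
out_d   = CUDA.zeros(Float64, 1600)

# For CuArrays, mul! dispatches to cuBLAS; no per-iteration host allocation
for t = 1:10000
    mul!(out_d, temp1_d, temp2_d)
end

result = Array(out_d)   # copy back to the host only once, at the end
``````

Keeping the data on the device for all 10000 iterations matters: transferring the matrix back and forth each iteration would eat any bandwidth advantage the GPU has.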

You expected wrong. Directly multiplying `Float64` matrices simply calls OpenBLAS under the hood; there's nothing you can do (in fact, probably nothing can be done, period).

2 Likes

Not sure that’s totally fair; for example, you can try changing the number of threads OpenBLAS uses, because apparently its defaults are sometimes pretty bad. You can also try MKL.jl to use MKL instead.

3 Likes
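A sketch of both suggestions (the thread count here is a placeholder; tune it for your machine):

``````using LinearAlgebra

# Override OpenBLAS's default thread count, which can be a poor choice
BLAS.set_num_threads(4)
println(BLAS.get_num_threads())

# Alternatively, swap the BLAS backend entirely:
# using MKL   # loading MKL.jl routes BLAS calls through Intel MKL
``````

Note that for memory-bandwidth-bound operations like matrix-vector products, fewer threads can sometimes be faster than more.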

Not much faster for this example, but you can avoid some allocations by using

``````@time for t = 1:10000
    mul!(temp[t], temp1, temp2)
end
``````

but you have to allocate the output vectors properly beforehand…

4 Likes
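A sketch of the preallocation the reply mentions (the output vectors are created once, before the loop, so `mul!` writes into existing storage):

``````using LinearAlgebra

temp1 = randn(1600, 1600)
temp2 = randn(1600)

# Preallocate all 10000 output vectors up front
temp = [Vector{Float64}(undef, 1600) for _ in 1:10000]

@time for t = 1:10000
    mul!(temp[t], temp1, temp2)
end
``````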

Especially for matrix-vector multiplication, which is totally limited by memory bandwidth. When using a single thread, `temp1 * temp2` and `sum(temp1)` are about equally fast:

``````julia> using BenchmarkTools, LinearAlgebra

julia> temp1 = randn(1600, 1600);

julia> temp2 = randn(1600);

julia> temp3 = similar(temp2);

julia> @benchmark mul!($temp3, $temp1, $temp2)
BenchmarkTools.Trial:
memory estimate:  0 bytes
allocs estimate:  0
--------------
minimum time:     1.074 ms (0.00% GC)
median time:      1.099 ms (0.00% GC)
mean time:        1.115 ms (0.00% GC)
maximum time:     1.499 ms (0.00% GC)
--------------
samples:          4479
evals/sample:     1

julia> @benchmark sum($temp1)
BenchmarkTools.Trial:
memory estimate:  0 bytes
allocs estimate:  0
--------------
minimum time:     1.120 ms (0.00% GC)
median time:      1.149 ms (0.00% GC)
mean time:        1.169 ms (0.00% GC)
maximum time:     1.613 ms (0.00% GC)
--------------
samples:          4274
evals/sample:     1
``````

(Re: the GPU suggestion, GPUs do have far higher memory bandwidth than any normal CPU, so they are good for this sort of memory-bandwidth-limited work, and not just compute-heavy things like matrix-matrix multiplication.)

4 Likes

You are benchmarking in global scope, which is not a good idea.

Also, if you have some memory you can reuse, you can write

``````mul!(out, temp1, temp2)
``````

And if your matrices have some sort of special structure, that may be exploited.

1 Like
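A sketch of avoiding global scope by putting the loop inside a function (the function and variable names are illustrative):

``````using LinearAlgebra

# Benchmarking inside a function avoids the untyped-global penalty
function run!(outs, A, x)
    for t in eachindex(outs)
        mul!(outs[t], A, x)
    end
    return outs
end

A = randn(1600, 1600)
x = randn(1600)
outs = [Vector{Float64}(undef, 1600) for _ in 1:10000]
@time run!(outs, A, x)
``````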

Are your consecutive iterations dependent on each other? If not, you could maybe reformulate your problem to use matrix-matrix multiplication, which is not memory-bound and is far more efficient.

``````using BenchmarkTools
temp1 = randn(1600, 1600)
temp2 = randn(1600)

@btime $temp1 * $temp2
814.858 μs (1 allocation: 12.63 KiB)
``````

would take more than 8.1 s per 10000 iterations, while the same calculation reformulated to a matrix-matrix problem

``````temp3 = repeat(temp2, 1, 10000)
@btime $temp1 * $temp3
952.054 ms (2 allocations: 122.07 MiB)
``````

is about 8 times faster on my machine.

6 Likes

If your application allows for it, you could also switch to single precision `Float32`, which roughly halves the time compared with the benchmark above:

``````using BenchmarkTools

temp1 = randn(Float32, 1600, 1600)
temp2 = randn(Float32, 1600)

@btime $temp1 * $temp2
518.491 μs (1 allocation: 6.38 KiB)
``````
1 Like

I really appreciate your suggestion. I am able to reduce the time from 10 seconds to 2 seconds. Thank you very much.

The hint from @DNF is valuable and you should think about it. Maybe you can even improve the whole approach, or do something smart when your matrices are somehow correlated or have a specific structure or property (e.g. symmetric, logarithmic value range, etc.).
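As one example of exploiting structure, if the matrix happens to be symmetric, wrapping it in `Symmetric` lets the multiplication dispatch to the specialized symmetric BLAS kernel (a sketch; whether this applies depends on your actual matrices):

``````using LinearAlgebra

A = randn(1600, 1600)
A = A + A'                # construct a symmetric matrix for illustration
S = Symmetric(A)          # wrapper: only one triangle is referenced

x = randn(1600)
y = similar(x)

mul!(y, S, x)             # uses the symmetric matrix-vector routine
``````

Structured wrappers such as `Symmetric`, `Hermitian`, `Diagonal`, and the triangular types all hook into specialized multiplication methods this way.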