How to improve the speed of matrix multiplication?

I have code that performs a large matrix-vector multiplication in a for loop, and it currently takes longer than I expected. I wonder if it is possible to improve the following MWE:

temp1 = randn(1600, 1600)
temp2 = randn(1600)
temp = Vector{Vector{Float64}}(undef, 10000)
@time for t = 1:10000
    temp[t] = temp1 * temp2   # allocates a fresh result vector every iteration
end

6.276464 seconds (19.49 k allocations: 123.436 MiB, 0.34% gc time)

Use a GPU

5 Likes

You expected wrong. Directly multiplying Float64 matrices simply calls OpenBLAS under the hood, so there's nothing you can do (in fact, probably nothing can be done, period).

3 Likes

Not sure that’s totally fair; for example, you can try changing the number of threads OpenBLAS uses, because apparently its defaults are sometimes pretty bad. You can also try MKL.jl to use MKL instead.
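For example (a minimal sketch; the right thread count depends on your machine, so treat the value below as a placeholder):

using LinearAlgebra

BLAS.set_num_threads(4)   # placeholder: try values up to your physical core count
temp1 * temp2             # subsequent BLAS calls use the new thread count

# Alternatively, switch the BLAS backend to MKL (Julia >= 1.7):
# using MKL               # loading MKL.jl swaps in MKL via libblastrampoline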

3 Likes

Not much faster for this example, but you can avoid some allocations by using

using LinearAlgebra

temp = [Vector{Float64}(undef, 1600) for _ in 1:10000]  # preallocate the outputs
@time for t = 1:10000
    mul!(temp[t], temp1, temp2)   # in-place: writes into the preallocated vector
end

but you have to allocate the output vectors properly beforehand, as above…

4 Likes

Especially for matrix-vector multiplication, which is completely limited by memory bandwidth. When using a single thread, temp1 * temp2 and sum(temp1) will be about equally fast:

julia> temp1 = randn(1600, 1600);

julia> temp2 = randn(1600);

julia> temp3 = similar(temp2);

julia> using LinearAlgebra, BenchmarkTools; BLAS.set_num_threads(1);

julia> @benchmark mul!($temp3, $temp1, $temp2)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.074 ms (0.00% GC)
  median time:      1.099 ms (0.00% GC)
  mean time:        1.115 ms (0.00% GC)
  maximum time:     1.499 ms (0.00% GC)
  --------------
  samples:          4479
  evals/sample:     1

julia> @benchmark sum($temp1)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.120 ms (0.00% GC)
  median time:      1.149 ms (0.00% GC)
  mean time:        1.169 ms (0.00% GC)
  maximum time:     1.613 ms (0.00% GC)
  --------------
  samples:          4274
  evals/sample:     1

(Re: the GPU suggestion, GPUs do have far higher memory bandwidth than any normal CPU, so they are good for this sort of memory-bandwidth-limited work [and not just compute-heavy things like matrix-matrix multiplication].)
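A minimal sketch of that route with CUDA.jl (assumes an NVIDIA GPU; the variable names are just the ones from this thread):

using CUDA

d_temp1 = CuArray(temp1)    # copy the matrix to the GPU once
d_temp2 = CuArray(temp2)
d_out = d_temp1 * d_temp2   # dispatches to cuBLAS on the device
out = Array(d_out)          # copy back to the CPU only when needed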

5 Likes

You are benchmarking in global scope, which is not a good idea.
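For instance, wrapping the loop in a function avoids the untyped globals (a minimal sketch of the loop from the original post; run_mults! is just an illustrative name):

function run_mults!(temp, temp1, temp2)
    for t in eachindex(temp)
        temp[t] = temp1 * temp2
    end
    return temp
end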

Also, if you have some memory you can reuse, you can write

mul!(out, temp1, temp2)

instead.

And if your matrices have some sort of special structure, that may be exploited.
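For example (a sketch; Symmetric and Diagonal are just two of the structured wrappers LinearAlgebra provides):

using LinearAlgebra

S = Symmetric(temp1 + temp1')   # symmetric wrapper: dispatches to a specialized method
D = Diagonal(randn(1600))       # diagonal: multiply is O(n) instead of O(n^2)
S * temp2
D * temp2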

1 Like

Are your consecutive iterations dependent on each other? If not, you could maybe reformulate your problem to use matrix-matrix multiplication, which is not memory-bound and much more efficient.

using BenchmarkTools
temp1 = randn(1600, 1600)
temp2 = randn(1600)

@btime $temp1 * $temp2
  814.858 μs (1 allocation: 12.63 KiB)

would take more than 8.1 s for 10000 iterations, while the same calculation reformulated as a matrix-matrix problem

temp3 = repeat(temp2, 1, 10000)
@btime $temp1 * $temp3
  952.054 ms (2 allocations: 122.07 MiB)

is about 8 times faster on my machine.
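(In practice your 10000 vectors would differ, so you would collect them into a matrix first. A sketch, assuming they are stored in a vector-of-vectors called vecs:)

B = reduce(hcat, vecs)   # 1600×10000 matrix, one column per iteration's vector
C = temp1 * B            # one matrix-matrix multiply; column t of C is temp1 * vecs[t]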

6 Likes

If your application allows for it, you could also switch to single-precision Float32, which roughly halves the time compared with the benchmark above:

using BenchmarkTools

temp1 = randn(Float32, 1600, 1600)
temp2 = randn(Float32, 1600)

@btime $temp1 * $temp2
  518.491 μs (1 allocation: 6.38 KiB)

1 Like

I really appreciate your suggestions. I was able to reduce the time from 10 seconds to 2 seconds. Thank you very much.

The hint from @DNF is valuable and you should think about it. Maybe you can even improve the whole approach, or do something smart if your matrices are correlated or have a specific structure or property (symmetric, logarithmic value range, etc.).