I have some code that does a large matrix multiplication in a for loop, and it currently takes longer than I expected. Is it possible to improve the following MWE?
temp1 = randn(1600, 1600)
temp2 = randn(1600)
temp = Vector{Vector{Float64}}(undef, 10000)
@time for t = 1:10000
temp[t] = temp1 * temp2
end
6.276464 seconds (19.49 k allocations: 123.436 MiB, 0.34% gc time)
You expected wrong. Directly multiplying Float64 matrices simply calls OpenBLAS under the hood; there's nothing you can do (in fact, probably nothing can be done, period).
Not sure that’s totally fair; for example, you can try changing the number of threads OpenBLAS uses, because apparently its defaults are sometimes pretty bad. You can also try MKL.jl to use MKL instead.
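For instance (a minimal sketch; the thread count of 4 is just illustrative, pick whatever matches your machine):
julia> using LinearAlgebra
julia> BLAS.get_num_threads()   # inspect the current OpenBLAS default first
julia> BLAS.set_num_threads(4)  # illustrative; try matching your physical core count
Switching to MKL is then just a matter of adding the MKL.jl package and doing `using MKL`, which on recent Julia versions swaps the BLAS backend for the whole session.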
Especially for matrix-vector multiplication, which is completely limited by memory bandwidth. With a single thread, temp1 * temp2 and sum(temp1) are about equally fast:
julia> temp1 = randn(1600, 1600);
julia> temp2 = randn(1600);
julia> temp3 = similar(temp2);
julia> using LinearAlgebra, BenchmarkTools; BLAS.set_num_threads(1);
julia> @benchmark mul!($temp3, $temp1, $temp2)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 1.074 ms (0.00% GC)
median time: 1.099 ms (0.00% GC)
mean time: 1.115 ms (0.00% GC)
maximum time: 1.499 ms (0.00% GC)
--------------
samples: 4479
evals/sample: 1
julia> @benchmark sum($temp1)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 1.120 ms (0.00% GC)
median time: 1.149 ms (0.00% GC)
mean time: 1.169 ms (0.00% GC)
maximum time: 1.613 ms (0.00% GC)
--------------
samples: 4274
evals/sample: 1
(Re: the GPU suggestion, GPUs do have far higher memory bandwidth than any normal CPU, so they are good for this sort of memory-bandwidth-limited task [and not just compute-heavy things like matrix-matrix multiplication].)
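Something along these lines (a minimal sketch assuming an NVIDIA GPU with CUDA.jl installed, and BenchmarkTools loaded as above; variable names are just illustrative):
julia> using CUDA, LinearAlgebra
julia> d1 = CuArray(temp1); d2 = CuArray(temp2); d3 = CUDA.zeros(Float64, 1600);
julia> @benchmark CUDA.@sync mul!($d3, $d1, $d2)  # @sync so we time the actual kernel, not just the launch
Bear in mind that consumer GPUs are typically much slower in Float64 than in Float32, so the single-precision suggestion below matters even more there.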
Are your consecutive iterations dependent on each other? If not, you could maybe reformulate your problem to use matrix-matrix multiplication, which is not memory bound and is far more efficient.
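Something like this (a sketch assuming all 10000 right-hand-side vectors are known up front; temp2mat is a made-up name):
julia> temp2mat = randn(1600, 10000);  # all vectors stacked as columns of one matrix
julia> temp = temp1 * temp2mat;        # one BLAS-3 call instead of 10000 BLAS-2 calls
Column t of the result then corresponds to temp[t] in the original loop.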
If your application allows for it, you could also switch to single-precision Float32, which roughly halves the time compared with the benchmark above.
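Roughly like this (a sketch reusing the arrays from the benchmark above):
julia> temp1f = Float32.(temp1); temp2f = Float32.(temp2); temp3f = similar(temp2f);
julia> @benchmark mul!($temp3f, $temp1f, $temp2f)
Since matrix-vector multiplication is bandwidth bound, halving the bytes per element halves the data that has to be moved, which is where the ~2x comes from.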
The hint from @DNF is valuable and you should think about it. Maybe you can even improve the whole approach, or do something smart if your matrices are correlated or have a specific structure or property (symmetric, a logarithmic value range, etc.).
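For instance, if your matrix happened to be symmetric (a hypothetical sketch, not something stated in the original problem), wrapping it in Symmetric lets Julia dispatch to BLAS's symv kernel, which only has to read one triangle of the matrix:
julia> using LinearAlgebra
julia> S = Symmetric(temp1 + temp1');  # force symmetry, purely for illustration
julia> @benchmark mul!($temp3, $S, $temp2)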