How can I improve the speed of large matrix multiplication

I have some functions involving large matrix multiplication in my code. A sample code is as follows.

function foo1()
    temp = Array[]
    for i = 1:1000
        push!(temp, randn(400,400)*randn(400,1000))
    end
end

function foo2()
    for i = 1:1000
        randn(400,400)*randn(400,1000)
    end
end

@time foo1()
@time foo2()

I find that foo1() takes 17 seconds. Without push!, foo2() takes 11 seconds. Is there any way to improve the speed of foo1()? Thank you.

1 Like

The eltype of temp is Array, which isn’t a concrete type: the compiler has to assume it can hold any subtype of Array. Use temp = Matrix{Float64}[] instead.

julia> isconcretetype(Array)
false

julia> isconcretetype(Matrix{Float64})
true

That shouldn’t have any measurable effect on the speed of this code. You are more likely to be simply observing the cost of actually allocating and populating larger and larger memory. foo2 is just creating garbage in a loop so the GC, the libc and the OS can work together to reuse the memory.

Hmm, I timed it before posting my original response and saw a significant effect, but that looks to have been a fluke.

@zxjroger, your benchmark is largely just timing randn:

julia> @btime randn(400,400) * randn(400,1000);
  9.321 ms (6 allocations: 7.32 MiB)

julia> @btime a * b setup=(a=randn(400,400); b=randn(400,1000));
  4.182 ms (2 allocations: 3.05 MiB)

You can also preallocate the two random matrices outside the for loop and use randn!.

You can also use mul! instead of *.
Or use BLAS functions.

2 Likes

To expand on what @Juan said,

using LinearAlgebra, Random
using BenchmarkTools

function foo1()
    temp = Array[]
    for i = 1:1000
        push!(temp, randn(400,400)*randn(400,1000))
    end
end

function foo_prealloc(tmp::AbstractVector, A::AbstractMatrix, B::AbstractMatrix)
    @assert size(A, 2) == size(B, 1) # multiplicable
    @assert size(first(tmp)) == (size(A, 1), size(B, 2)) # fits into tmp
    for t in tmp
        randn!(A), randn!(B)
        mul!(t, A, B)
    end
    return tmp
end


A = zeros(400,400)
B = zeros(400,1000)
tmp = [zeros(400,1000) for _ in 1:1000]

@btime foo1(); # 22.646 s (6010 allocations: 7.15 GiB)
@btime foo_prealloc($tmp, $A, $B); # 1.368 s (0 allocations: 0 bytes)
1 Like

Thank you for all the suggestions. I will look into it.