Mul! vs * benchmarks

I have these 4 examles I’m benchmarking:

using SparseArrays
using LinearAlgebra
using BenchmarkTools
using Random

function myfun1(m::Integer, n::Integer)
    Random.seed!(1)
    A = rand(Float64, n, n)
    B = sparse(rand(Float64, m, m))

    temp = A' * kron(sparse(I, m, m), B) * A
end

function myfun2(m::Integer, n::Integer)
    Random.seed!(1)
    A = rand(Float64, n, n)
    B = sparse(rand(Float64, m, m))

    temp1 = kron(sparse(I, m, m), B) * A
    temp = A' * temp1
end

function myfun3(m::Integer, n::Integer)
    Random.seed!(1)
    A = rand(Float64, n, n)
    B = sparse(rand(Float64, m, m))

    temp = Array{Float64}(undef, n, n)
    mul!(temp, kron(sparse(I, m, m), B), A)
    B = A' * temp
end

function myfun4(m::Integer, n::Integer)
    Random.seed!(1)
    A = rand(Float64, n, n)
    B = sparse(rand(Float64, m, m))

    temp = Array{Float64}(undef, n, n)
    mul!(temp, kron(sparse(I, m, m), B), A)
    B = Array{Float64}(undef, n, n)
    mul!(B, A', temp)
end

function myfun5(m::Integer, n::Integer)
    Random.seed!(1)
    A = rand(Float64, n, n)
    B = sparse(rand(Float64, m, m))

    temp1 = Array{Float64}(undef, n, n)
    temp = Array{Float64}(undef, n, n)

    temp1 .= kron(sparse(I, m, m), B) * A
    temp .= A' * temp1
end


m::Int32 = 100
n::Int32 = m*m
@btime myfun1(m, n)
@btime myfun2(m, n)
@btime myfun3(m, n)
@btime myfun4(m, n)
@btime myfun5(m, n)

Results are:

36.307 s (28 allocations: 2.25 GiB)
24.707 s (28 allocations: 2.25 GiB)
25.310 s (28 allocations: 2.25 GiB)
25.378 s (28 allocations: 2.25 GiB)
26.327 s (32 allocations: 3.74 GiB)

Perhaps I could have created A and B out of the functions, but the seed is set, so there is only a fixed time issue.

What I’m wondering is why is the first example so far of the others?

Regarding the little differences of the others the conclusion I reach is that:

If you have to overwrite an existing matrix use mul! rather than simple multiplication.
If you don’t have to overwrite it, use simple multiplication without preallocation.

Am I right?

Yes that is right. In fact you can just look at the definition of * with

julia> A = zeros(10,10);
julia> @less A*A

and you will see that the definition is essentially (up to some type related stuff)

function (*)(A::Matrix, B::Matrix)
    mul!(similar(size(A,1), size(B,2)), A, B)
end

So * just allocates the output for you and then uses mul! to fill it.

2 Likes

You should see from myfun5’s benchmark that, despite the excess allocations you introduced there, it runs at virtually the same speed as the others. The cost of allocation just isn’t very big (compared to multiplication) for large matrices.

Since others have already commented on the merits of mul! vs *, I will comment on the benchmark itself instead, since this was presumably motivated by a real example.

While kron is a useful mathematical construct, it’s sometimes bad in code due to the large but structured matrices that result. Consider restructuring your calculation to account for this structure. Note that your calculation is
A^\top (I \otimes B) A = \begin{bmatrix} A_1 \\ A_2 \\ \vdots \end{bmatrix}^\top \begin{bmatrix} B & & \\ & B & \\ & & \ddots \end{bmatrix} \begin{bmatrix} A_1 \\ A_2 \\ \vdots \end{bmatrix} = \sum_i A_i^\top B A_i
where A_i are blocks of A of compatible size.

Also note that B is not actually sparse in your examples, so I’m going to stop treating it as such here. If it is truly sparse, then feel free to restore that. Here is my submission:

function myfun6(m::Integer, n::Integer)
    Random.seed!(1)
    A = rand(Float64, n, n)
    B = rand(Float64, m, m)

    result = zeros(Float64, n, n)
    temp = similar(result, m, n)
    for i in 1:m
        Ai = @view A[(i-1)*m .+ (1:m), :]
        mul!(temp, B, Ai) # temp = B * Ai
        mul!(result, Ai', temp, true, true) # result = Ai' * temp + result
    end
    return result
end

On my machine, this requires only about 60% of time (and less memory) than any of the others.

Alternatively, consider using Kronecker.jl to form a lazy kronecker product. Although it may not yield significant improvements in this example.

3 Likes