Mul! vs * benchmarks

Enlil50 · February 20, 2024, 10:09am

I have these 4 examles I’m benchmarking:

using SparseArrays
using LinearAlgebra
using BenchmarkTools
using Random

function myfun1(m::Integer, n::Integer)
    Random.seed!(1)
    A = rand(Float64, n, n)
    B = sparse(rand(Float64, m, m))

    temp = A' * kron(sparse(I, m, m), B) * A
end

function myfun2(m::Integer, n::Integer)
    Random.seed!(1)
    A = rand(Float64, n, n)
    B = sparse(rand(Float64, m, m))

    temp1 = kron(sparse(I, m, m), B) * A
    temp = A' * temp1
end

function myfun3(m::Integer, n::Integer)
    Random.seed!(1)
    A = rand(Float64, n, n)
    B = sparse(rand(Float64, m, m))

    temp = Array{Float64}(undef, n, n)
    mul!(temp, kron(sparse(I, m, m), B), A)
    B = A' * temp
end

function myfun4(m::Integer, n::Integer)
    Random.seed!(1)
    A = rand(Float64, n, n)
    B = sparse(rand(Float64, m, m))

    temp = Array{Float64}(undef, n, n)
    mul!(temp, kron(sparse(I, m, m), B), A)
    B = Array{Float64}(undef, n, n)
    mul!(B, A', temp)
end

function myfun5(m::Integer, n::Integer)
    Random.seed!(1)
    A = rand(Float64, n, n)
    B = sparse(rand(Float64, m, m))

    temp1 = Array{Float64}(undef, n, n)
    temp = Array{Float64}(undef, n, n)

    temp1 .= kron(sparse(I, m, m), B) * A
    temp .= A' * temp1
end


m::Int32 = 100
n::Int32 = m*m
@btime myfun1(m, n)
@btime myfun2(m, n)
@btime myfun3(m, n)
@btime myfun4(m, n)
@btime myfun5(m, n)

Results are:

36.307 s (28 allocations: 2.25 GiB)
24.707 s (28 allocations: 2.25 GiB)
25.310 s (28 allocations: 2.25 GiB)
25.378 s (28 allocations: 2.25 GiB)
26.327 s (32 allocations: 3.74 GiB)

Perhaps I could have created A and B out of the functions, but the seed is set, so there is only a fixed time issue.

What I’m wondering is why is the first example so far of the others?

Regarding the little differences of the others the conclusion I reach is that:

If you have to overwrite an existing matrix use mul! rather than simple multiplication.
If you don’t have to overwrite it, use simple multiplication without preallocation.

Am I right?

abraemer · February 20, 2024, 10:35am

Yes that is right. In fact you can just look at the definition of * with

julia> A = zeros(10,10);
julia> @less A*A

and you will see that the definition is essentially (up to some type related stuff)

function (*)(A::Matrix, B::Matrix)
    mul!(similar(size(A,1), size(B,2)), A, B)
end

So * just allocates the output for you and then uses mul! to fill it.

mikmoore · February 20, 2024, 4:41pm

You should see from myfun5’s benchmark that, despite the excess allocations you introduced there, it runs at virtually the same speed as the others. The cost of allocation just isn’t very big (compared to multiplication) for large matrices.

Since others have already commented on the merits of mul! vs *, I will comment on the benchmark itself instead, since this was presumably motivated by a real example.

While kron is a useful mathematical construct, it’s sometimes bad in code due to the large but structured matrices that result. Consider restructuring your calculation to account for this structure. Note that your calculation is
A^\top (I \otimes B) A = \begin{bmatrix} A_1 \\ A_2 \\ \vdots \end{bmatrix}^\top \begin{bmatrix} B & & \\ & B & \\ & & \ddots \end{bmatrix} \begin{bmatrix} A_1 \\ A_2 \\ \vdots \end{bmatrix} = \sum_i A_i^\top B A_i
where A_i are blocks of A of compatible size.

Also note that B is not actually sparse in your examples, so I’m going to stop treating it as such here. If it is truly sparse, then feel free to restore that. Here is my submission:

function myfun6(m::Integer, n::Integer)
    Random.seed!(1)
    A = rand(Float64, n, n)
    B = rand(Float64, m, m)

    result = zeros(Float64, n, n)
    temp = similar(result, m, n)
    for i in 1:m
        Ai = @view A[(i-1)*m .+ (1:m), :]
        mul!(temp, B, Ai) # temp = B * Ai
        mul!(result, Ai', temp, true, true) # result = Ai' * temp + result
    end
    return result
end

On my machine, this requires only about 60% of time (and less memory) than any of the others.

Alternatively, consider using Kronecker.jl to form a lazy kronecker product. Although it may not yield significant improvements in this example.

Topic		Replies	Views
Performance discrepancy in sparse matrix product Performance	4	985	June 29, 2018
Fastest matrix multiplication General Usage mkl , linearalgebra , matrix	20	1597	February 19, 2024
Mul! vs SparseMatrix Performance	2	1086	December 12, 2020
Why does `mul!(u, A, v)` allocate when `A` is sparse and `u, v` are views? Performance linearalgebra , sparse	9	602	November 10, 2023
Help with this code (deallocation/pointers on sparse matrices) Performance array , sparse , matrices , sparsearrays	31	696	February 3, 2024

Mul! vs * benchmarks

Related topics