Vector Multiplication on Submatrices

If I have a matrix which is a (horizontal) collection of matrices,

A := \begin{bmatrix} A_0 & A_1 & A_2 & \dots & A_t \end{bmatrix} , \quad A_i \in \mathbb{R}^{n \times n}, A \in \mathbb{R}^{n \times tn}

and I want to do vector multiplication on each submatrix, i.e., compute

Av := \begin{bmatrix} A_0v & A_1v & \dots & A_tv \end{bmatrix} , \quad v \in \mathbb{R}^{n}, Av \in \mathbb{R}^{n \times t}

then what is the fastest (and cleanest if equally fast) way to do this?

I have three naive ways that work below and a plot of their results from BenchmarkTools, but the performance varies a ton between small and large dimension n. This leaves me feeling like there is a better solution.

using LinearAlgebra

t = 500
n = 3 # results below are with n = 3, 10, & 300

A, v = rand(n,n*t), rand(n)

foo1(A, v) = (A .* v)[:, 1:n:end]

foo2(A,v) = view(A .* v, :, 1:n:n*t)

function foo3(A,v)
		Av = zeros(n, t)
		for i in axes(Av,2)
		    Av[:,i] = view(A, :, (i-1)*n + 1 : i*n) * v;
		end
		return Av
end

Some options:

  1. Store it as a vertical concatenation of matrices, not horizontal. Then you can just do A*v to get a vertical concatenation of the results.

  2. If you are really talking about n=3, i.e. a collection of 3x3 matrices and 3-component vectors, than you might want to consider using StaticArrays.jl. Make A a 1d array of 3x3 SMatrix elements, and make v a 3-component SVector. Then do A .* Ref(v) to multiply each 3x3 matrix times v.

  3. You might want to take a more “global” view of your approach. I’m always a bit suspicious when someone is optimizing such a simple operation, and worry that they are taking a classic Matlab/Numpy vectorized approach, which is often suboptimal. It’s often better to combine many operations into a single loop rather than trying to optimize a bunch of simple “vector” operations separately. Loops can be fast in Julia!

6 Likes

Thank you @stevengj. I was so buried in my method I didn’t even think of 1.

I am certainly guilty of 3. often but not this time!

First, your functions foo1 and foo2 don’t seem to be doing what you want, but instead use pointwise multiplication and then subselect some elements:

using LinearAlgebra
n = 3; t = 2;
A = hcat(I(3), 2 .* I(3))
v = [1,2,3]
julia> foo1(A, v)
3×2 Matrix{Int64}:
 1  2
 0  0
 0  0

julia> foo2(A, v)
3×2 view(::Matrix{Int64}, :, 1:3:4) with eltype Int64:
 1  2
 0  0
 0  0

julia> foo3(A, v)
3×2 Matrix{Float64}:
 1.0  2.0
 2.0  4.0
 3.0  6.0

As alternatives for the last version, @stevengj already suggested A * v with A \in \mathbb{R}^{tn \times n}. As Julia is column-major it might be faster to use the first dimensions for the multiplication, i.e., store the transposed matrices A_i horizontally and then use v' * A.
Further, if A is stored as a 3-tensor, i.e., A \in \mathbb{R}^{n \times n \times t}, several libraries allowing for a split-apply-combine strategy can be used, e.g., my JJ.jl:

A = rand(n, n, t)
v = rand(n)
rank"2 * 1"(A, v)

I made some tests with the 2. option of @stevengj, which seems very promising for small values of n.

I assume that only foo3 is doing what you want, so we compare these two functions:

function foo3(A, v, n, t)
    Av = zeros(n, t)
    for i in axes(Av, 2)
        Av[:, i] = view(A, :, (i - 1) * n + 1 : i * n) * v;
    end
    return Av
end

foo4(A, v) = A .* Ref(v)

Note that foo3 here uses n and t as inputs, not as globals.
The function foo4 needs a different matrix A2 and vector v2:

function get_A2(A, n, t)
    A2 = Vector{SMatrix{n, n, Float64, n^2}}(undef, t)
    for i in eachindex(A2)
        A2[i] = SMatrix{n,n}(A[:, (i - 1) * n + 1 : i * n])
    end
    return A2
end
A1, v1 = rand(n, n * t), rand(n)
A2, v2 = get_A2(A1, n, t), SVector{n}(v1)
julia> typeof(A2)
Vector{SMatrix{3, 3, Float64, 9}} (alias for Array{SArray{Tuple{3, 3}, Float64, 2, 9}, 1})

julia> typeof(v2)
SVector{3, Float64} (alias for SArray{Tuple{3}, Float64, 1, 3})

Note that foo4 returns a Vector{SVector{n, Float64}} and not a Matrix{Float64} like foo3!
A conversion can be done with a function like this:

function convert_to_matrix(Av, n, t)
    Av_matrix = zeros(n, t)
    for i in axes(Av_matrix, 2)
        Av_matrix[:, i] = Av[i]
    end
    return Av_matrix
end

Benchmark results:

function benchmark(n::Int, t::Int=500)
    A1, v1 = rand(n, n * t), rand(n)
    A2, v2 = get_A2(A1, n, t), SVector{n}(v1)
    @btime foo3($A1, $v1, $n, $t)
    @btime foo4($A2, $v2)
    @test foo3(A1, v1, n, t) ≈ convert_to_matrix(foo4(A2, v2), n, t)
end
julia> benchmark(3)
  40.797 μs (501 allocations: 50.94 KiB)
  1.486 μs (1 allocation: 11.88 KiB)
Test Passed

julia> benchmark(10)
  58.837 μs (502 allocations: 109.42 KiB)
  8.749 μs (2 allocations: 39.11 KiB)
Test Passed

julia> benchmark(40)
  176.178 μs (502 allocations: 351.61 KiB)
  140.093 μs (2 allocations: 156.30 KiB)
Test Passed

julia> benchmark(50)
  284.541 μs (502 allocations: 437.55 KiB)
  335.238 μs (2 allocations: 195.36 KiB)
Test Passed
Complete MWE
using LinearAlgebra, StaticArrays, BenchmarkTools, Test

function foo3(A, v)
    Av = zeros(n, t)
    for i in axes(Av, 2)
        Av[:, i] = view(A, :, (i - 1) * n + 1 : i * n) * v;
    end
    return Av
end

foo4(A, v) = A .* Ref(v)

function get_A2(A, n, t)
    A2 = Vector{SMatrix{n, n, Float64, n^2}}(undef, t)
    for i in eachindex(A2)
        A2[i] = SMatrix{n,n}(A[:, (i - 1) * n + 1 : i * n])
    end
    return A2
end

function convert_to_matrix(Av, n, t)
    Av_ = zeros(n, t)
    for i in axes(Av_, 2)
        Av_[:, i] = Av[i]
    end
    return Av_
end

function benchmark(n::Int, t::Int=500)
    A1, v1 = rand(n, n * t), rand(n)
    A2, v2 = get_A2(A1, n, t), SVector{n}(v1)
    @btime foo3($A1, $v1, $n, $t)
    @btime foo4($A2, $v2)
    @test foo3(A1, v1, n, t) ≈ convert_to_matrix(foo4(A2, v2), n, t)
end

benchmark(3)
benchmark(10)
benchmark(40)
benchmark(50)

This will not get the full benefit of StaticArrays. You need n to be a compile-time constant, e.g. passing it via Val, or by the length of an SVector v

1 Like

Thank you @stevengj for your reply!

What would be the correct solution with Val? Unfortunately, I am not able to get a further performance gain.

function get_A2(A::Matrix{T}, n, t, ::Val{N}) where {N, T}
    A2 = Vector{SMatrix{N, N, Float64, N^2}}(undef, t)
    for i in eachindex(A2)
        A2[i] = SMatrix{N, N}(A[:, (i - 1) * n + 1 : i * n])
    end
    return A2
end

function get_v2(x::Vector{T}, ::Val{N}) where {T,N}
    return SVector{N, T}(x)
end

function benchmark(n::Int, t::Int=500)
    A1, v1 = rand(n, n * t), rand(n)
    A2, v2 = get_A2(A1, n, t, Val(n)), get_v2(v1, Val(n))
    @btime foo3($A1, $v1, $n, $t)
    @btime foo4($A2, $v2)
    @test foo3(A1, v1, n, t) ≈ convert_to_matrix(foo4(A2, v2), n, t)
end
julia> benchmark(3)
  40.616 μs (501 allocations: 50.94 KiB)
  1.485 μs (1 allocation: 11.88 KiB)
Test Passed
1 Like

foo3 is actually pretty slow compared to foo1/2 for small dim. However, vertical storing + broadcasting + reshape() surpasses all the others I tried for small and large dimension (n ∈ [1, 300]), although I did not try StaticArrays.

If we’re applying some minimal effort to be efficient here, one should avoid the unnecessary temporaries with this alternative to foo3:

using LinearAlgebra

function foo3a(A, v, n, t)
    Av = similar(A, (n, t))
    for i in axes(Av, 2)
        @views mul!(Av[:, i], A[:, (i - 1) * n + 1 : i * n], v);
    end
    return Av
end

Although at small sizes, there’s still no beating StaticArrays.

Oh, since you’re interpolating into your @btime code it should be fine. In an actual application, however, you’d want to make sure that the size is set statically as early as possible.

1 Like