You can do this as one broadcast operation:
julia> A = rand(2,2);
julia> B = stack(r * r' for r in eachrow(A));  # stack needs julia 1.9, or `using Compat`
julia> B ≈ cat([r * r' for r in eachrow(A)]..., dims=3)
true
julia> B ≈ let A3 = insertdims(A, dims=3) # julia 1.11, or `using Compat`
           PermutedDimsArray(A3, (2,3,1)) .* PermutedDimsArray(A3, (3,2,1))
       end
true
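If you can't use insertdims, a plain reshape builds the same pair of orthogonal trivial dimensions on any Julia version; a minimal sketch of the same broadcast (At here is just a copy of the transpose, introduced for clarity):
julia> let At = permutedims(A), n = size(A,1), m = size(A,2)
           # reshape(At, m, 1, n)[i, 1, k] == A[k, i], and reshape(At, 1, m, n)[1, j, k] == A[k, j]
           B ≈ reshape(At, m, 1, n) .* reshape(At, 1, m, n)
       end
Since reshape shares memory with the permuted copy, only the final m × m × n array is materialised by the broadcast, as in the PermutedDimsArray version.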
julia> using TensorCast, TransmuteDims # nicer syntax for such operations
julia> B ≈ @cast B2[i,j,k] := A[k,i] * A[k,j]
true
julia> @pretty @cast B2[i,j,k] := A[k,i] * A[k,j] # @pretty is like @macroexpand
begin
    @boundscheck ndims(A) == 2 || throw(ArgumentError("expected a 2-tensor A[k, i]"))
    local rat = transmute(A, Val((2, nothing, 1)))
    local cormorant = transmute(A, Val((nothing, 2, 1)))
    B2 = @__dot__(rat * cormorant)
end
julia> B ≈ transmute(A, (2,3,1)) .* transmute(A, (3,2,1))
true
This should be much faster on a GPU, as loops like for r in eachrow(A)
involve many separate GPU operations, roughly one per row:
julia> using CUDA, BenchmarkTools
julia> A = rand(200,30) |> cu;
julia> B = @btime CUDA.@sync stack(r * r' for r in eachrow($A));
11.002 ms (21005 allocations: 456.47 KiB)
julia> B ≈ @btime CUDA.@sync cat([r * r' for r in eachrow($A)]..., dims=3)
13.150 ms (27820 allocations: 1018.27 KiB)
true
julia> B ≈ @btime CUDA.@sync let A3 = insertdims($A, dims=3)
           PermutedDimsArray(A3, (2,3,1)) .* PermutedDimsArray(A3, (3,2,1))
       end
60.674 μs (95 allocations: 2.72 KiB)
true
julia> B ≈ @btime CUDA.@sync @cast B2[i,j,k] := $A[k,i] * $A[k,j]
57.205 μs (85 allocations: 1.95 KiB)
true
julia> 11_002 / 60.7  # 11.002 ms vs 60.7 μs: roughly 180× speedup for the broadcast version
181.2520593080725