Outer product of each row of a matrix

I’m looking for an efficient way to take the outer product of each row of a matrix, e.g.:

julia> A = [1 2;4 8]
2×2 Matrix{Int64}:
 1  2
 4  8

julia> cat([r * r' for r in eachrow(A)]..., dims=3)
2×2×2 Array{Int64, 3}:
[:, :, 1] =
 1  2
 2  4

[:, :, 2] =
 16  32
 32  64

The number of rows will become large, so I’m more interested in efficiency than conciseness. The code above is not efficient, and ideally I’m looking for something that works with CUDA arrays and therefore relies mostly on broadcasting.

stack(r * r' for r in eachrow(A))

Try this one, though I’m not sure what CUDA compatibility would require:

function outerprod(A)
    nrow, ncol = size(A)
    rs = Vector{eltype(A)}(undef, ncol^2 * nrow)
    d = 1
    for r in eachrow(A)
        for e in r
            copyto!(rs, d, e .* r, 1, ncol)  # write one column of this row's outer product
            d += ncol
        end
    end
    reshape(rs, ncol, ncol, :)
end

or this

function outerprod1(A)
    rs=Array{eltype(A)}(undef,size(A,2),size(A,2),size(A,1))    
    for i in axes(A,2), j in axes(A,2), k in axes(A,1) 
        rs[i,j,k] = A[k,i]*A[k,j]
    end
    rs
end
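
A quick sanity check (my own, using the 2×2 A from the original post) that both versions agree with the cat-based result:

julia> A = [1 2; 4 8];

julia> outerprod(A) == outerprod1(A) == cat([r * r' for r in eachrow(A)]..., dims=3)
true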

You can do this as one broadcast operation:

julia> A = rand(2,2);

julia> B = stack(r * r' for r in eachrow(A));

julia> B ≈ cat([r * r' for r in eachrow(A)]..., dims=3)
true

julia> B ≈ let A3 = insertdims(A, dims=3)  # julia 1.11, or `using Compat`
          PermutedDimsArray(A3, (2,3,1)) .* PermutedDimsArray(A3, (3,2,1))
       end
true

julia> using TensorCast, TransmuteDims  # nicer syntax for such operations

julia> B ≈ @cast B2[i,j,k] := A[k,i] * A[k,j]
true

julia> @pretty @cast B2[i,j,k] := A[k,i] * A[k,j]  # @pretty is like @macroexpand
begin
    @boundscheck ndims(A) == 2 || throw(ArgumentError("expected a 2-tensor A[k, i]"))
    local rat = transmute(A, Val((2, nothing, 1)))
    local cormorant = transmute(A, Val((nothing, 2, 1)))
    B2 = @__dot__(rat * cormorant)
end

julia> B ≈ transmute(A, (2,3,1)) .* transmute(A, (3,2,1))
true
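
For reference, the same broadcast can also be written with plain Base functions, no packages needed (a sketch; outerprod_bc is my own name):

function outerprod_bc(A)
    At = permutedims(A)                                  # ncol × nrow copy, so At[i, k] == A[k, i]
    c, r = size(At)
    return reshape(At, c, 1, r) .* reshape(At, 1, c, r)  # B[i, j, k] = A[k, i] * A[k, j]
end

Since it only uses permutedims, reshape, and broadcasting, it should also work on CuArrays.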

Broadcasting like this should be faster on GPU, as loops like for r in eachrow(A) launch many separate kernels:

julia> using CUDA, BenchmarkTools

julia> A = rand(200,30) |> cu;

julia> B = @btime CUDA.@sync stack(r * r' for r in eachrow($A));
  11.002 ms (21005 allocations: 456.47 KiB)

julia> B ≈ @btime CUDA.@sync cat([r * r' for r in eachrow($A)]..., dims=3)
  13.150 ms (27820 allocations: 1018.27 KiB)
true

julia> B ≈ @btime CUDA.@sync let A3 = insertdims($A, dims=3)
          PermutedDimsArray(A3, (2,3,1)) .* PermutedDimsArray(A3, (3,2,1))
       end
  60.674 μs (95 allocations: 2.72 KiB)
true

julia> B ≈ @btime CUDA.@sync @cast B2[i,j,k] := $A[k,i] * $A[k,j]
  57.205 μs (85 allocations: 1.95 KiB)
true

julia> 11_002 / 60.7   # speedup of the fused broadcast over the stack version
181.2520593080725

A curiosity: what makes

stack(r * r' for r in eachrow(A))

perform better than

function outerprod1(A)
    rs=Array{eltype(A)}(undef,size(A,2),size(A,2),size(A,1))    
    for i in axes(A,2), j in axes(A,2), k in axes(A,1) 
        rs[i,j,k] = A[k,i]*A[k,j]
    end
    rs
end

?

Note that switching the order of i, j, k around makes outerprod1 faster than the stack approach, though the relative performance depends on the size of A.

using BenchmarkTools

function outerprod1_ijk(A)
    rs = Array{eltype(A)}(undef, size(A,2), size(A,2), size(A,1))    
    for i in axes(A,2), j in axes(A,2), k in axes(A,1)   # k changes fastest (cf. reads in A)
        rs[i,j,k] = A[k,i] * A[k,j]
    end
    return rs
end

function outerprod1_kji(A)
    rs = Array{eltype(A)}(undef, size(A,2), size(A,2), size(A,1))    
    for k in axes(A,1), j in axes(A,2), i in axes(A,2)  # i changes fastest (cf. writes in rs)
        rs[i,j,k] = A[k,i] * A[k,j]
    end
    return rs
end

outerprod_stack(A) = stack(r * r' for r in eachrow(A))

A = rand(100, 100)
@btime outerprod1_ijk($A);
#    4.658 ms (2 allocations: 7.63 MiB)
@btime outerprod1_kji($A);
#    1.541 ms (2 allocations: 7.63 MiB)
@btime outerprod_stack($A);
#    1.642 ms (202 allocations: 15.26 MiB)

A = rand(10, 1000)
@btime outerprod1_ijk($A);
#    59.464 ms (2 allocations: 76.29 MiB)
@btime outerprod1_kji($A);
#    16.400 ms (2 allocations: 76.29 MiB)
@btime outerprod_stack($A);
#    30.032 ms (22 allocations: 152.59 MiB)

A = rand(1000, 10)
@btime outerprod1_ijk($A);
#    167.800 μs (2 allocations: 781.31 KiB)
@btime outerprod1_kji($A);
#    102.900 μs (2 allocations: 781.31 KiB)
@btime outerprod_stack($A);
#    191.900 μs (1002 allocations: 1.62 MiB)

So I would assume the speed difference comes down to memory access order: Julia arrays are column-major, and rs has size(A,2) times as many elements as A, so making the write index i the innermost (fastest-changing) loop variable pays off more than making the reads of A contiguous.
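
A minimal sketch (my own, untimed) of pushing that a bit further: with bounds checks disabled and the A[k,j] read hoisted, the innermost loop does only the column-major writes into rs plus a strided read of A:

function outerprod1_kji_fast(A)
    rs = Array{eltype(A)}(undef, size(A,2), size(A,2), size(A,1))
    @inbounds for k in axes(A,1), j in axes(A,2)
        akj = A[k,j]                  # hoist the j-dependent read out of the innermost loop
        for i in axes(A,2)
            rs[i,j,k] = A[k,i] * akj  # writes walk rs along its first (fastest) dimension
        end
    end
    return rs
end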
