You can do this as one broadcast operation:
julia> A = rand(2,2);
julia> B = stack(r * r' for r in eachrow(A));  # stack needs julia 1.9, or `using Compat`
julia> B ≈ cat([r * r' for r in eachrow(A)]..., dims=3)
true
julia> B ≈ let A3 = insertdims(A, dims=3) # julia 1.11, or `using Compat`
           PermutedDimsArray(A3, (2,3,1)) .* PermutedDimsArray(A3, (3,2,1))
       end
true
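If you can't use insertdims, a plain reshape builds the same pair of orthogonal trivial dimensions on any Julia version; a minimal sketch of the same broadcast (At here is just a copy of the transpose, introduced for clarity):
julia> let At = permutedims(A), n = size(A,1), m = size(A,2)
           # reshape(At, m, 1, n)[i, 1, k] == A[k, i], and reshape(At, 1, m, n)[1, j, k] == A[k, j]
           B ≈ reshape(At, m, 1, n) .* reshape(At, 1, m, n)
       end
Since reshape shares memory with the permuted copy, only the final m × m × n array is materialised by the broadcast, as in the PermutedDimsArray version.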
julia> using TensorCast, TransmuteDims # nicer syntax for such operations
julia> B ≈ @cast B2[i,j,k] := A[k,i] * A[k,j]
true
julia> @pretty @cast B2[i,j,k] := A[k,i] * A[k,j] # @pretty is like @macroexpand
begin
    @boundscheck ndims(A) == 2 || throw(ArgumentError("expected a 2-tensor A[k, i]"))
    local rat = transmute(A, Val((2, nothing, 1)))
    local cormorant = transmute(A, Val((nothing, 2, 1)))
    B2 = @__dot__(rat * cormorant)
end
julia> B ≈ transmute(A, (2,3,1)) .* transmute(A, (3,2,1))
true
This should be much faster on a GPU, as loops like for r in eachrow(A)
involve many separate GPU operations, roughly one per row:
julia> using CUDA, BenchmarkTools
julia> A = rand(200,30) |> cu;
julia> B = @btime CUDA.@sync stack(r * r' for r in eachrow($A));
11.002 ms (21005 allocations: 456.47 KiB)
julia> B ≈ @btime CUDA.@sync cat([r * r' for r in eachrow($A)]..., dims=3)
13.150 ms (27820 allocations: 1018.27 KiB)
true
julia> B ≈ @btime CUDA.@sync let A3 = insertdims($A, dims=3)
           PermutedDimsArray(A3, (2,3,1)) .* PermutedDimsArray(A3, (3,2,1))
       end
60.674 μs (95 allocations: 2.72 KiB)
true
julia> B ≈ @btime CUDA.@sync @cast B2[i,j,k] := $A[k,i] * $A[k,j]
57.205 μs (85 allocations: 1.95 KiB)
true
julia> 11_002 / 60.7  # 11.002 ms vs 60.7 μs: roughly 180× speedup for the broadcast version
181.2520593080725