Repeated vector vector multiplication to obtain matrix

In my application I am running many iterations of a multiplication like this mat = vec1 .* vec2'.

The current implementation is like this

dP_dw = zeros(Float64, (number_of_states, settings.n_vis * 2, settings.n_hid))
for a = 1:number_of_states
    dP_dw[a, :, :] = states[a, :] .* tanh_term[a, :]'

I was wondering if there was a more ‘julianic’ / faster way of doing this instead of for loops using the broadcast function, but I cannot put it together as customizing it seems tough.

Here’s a quick attempt:

function outers1(states, tanh_term) 
    number_of_states = size(states, 1)
    size(states, 1) == number_of_states || error()
    dP_dw = zeros(Float64, (number_of_states, size(states,2), size(tanh_term,2)))
    @inbounds for a = 1:number_of_states
        @views dP_dw[a, :, :] .= states[a, :] .* tanh_term[a, :]'

using Einsum, Test
outers2(states, tanh_term) = @einsum dP_dw[a,s,t] := states[a, s] * tanh_term[a, t]
outers3(states, tanh_term) = @vielsum dP_dw[a,s,t] := states[a, s] * tanh_term[a, t]

outers4(states, tanh_term) = states .* reshape(tanh_term, size(states, 1), 1, :)

N = 50; states, tanh_term = randn(N,N), randn(N,N);
@test outers1(states, tanh_term) ≈ outers2(states, tanh_term) ≈ outers4(states, tanh_term)

#== results ==#

julia> @btime outers0($states, $tanh_term); # as in question
  364.486 μs (202 allocations: 1.96 MiB)

julia> @btime outers1($states, $tanh_term); # with @inbounds, @views, and .=
  189.673 μs (2 allocations: 976.64 KiB)

julia> @btime outers2($states, $tanh_term); 
  62.388 μs (2 allocations: 976.64 KiB)

julia> @btime outers3($states, $tanh_term);
  52.462 μs (62 allocations: 984.86 KiB)

julia> @btime outers4($states, $tanh_term);                   
  63.387 μs (4 allocations: 976.75 KiB) 

If you look at @macroexpand1 @einsum dP_dw[a,s,t] := states[a, s] * tanh_term[a, t], the main change is that it orders the loops with a innermost.

1 Like

First reconsider whether you should construct such matrices at all. You are constructing rank-1 matrices M = uv^*, which requires \Theta(n^2) storage and time for n-component vectors, and naively multiplying such a matrix by a vector (Mx) requires \Theta(n^2) time. In contrast, storing u and v separately requires \Theta(n) storage and you can multiply (uv^*)x = u(v^*x) in \Theta(n) time.

If you must compute the matrices explicitly, then

  1. Realize that dP_dw[a, :, :] = states[a, :] .* tanh_term[a, :]' first allocates two vectrors (states[a, :] and tanh_term[a, :]) since slicing makes a copy in Julia (unless you use @views), then allocates a matrix, then writes that matrix into dP_dw. That is a lot of allocations.

  2. You can partly eliminate the allocations by using .= and @views as in @mcabbot’s post, but I think you may need Julia 1.5 for this to elide allocations of view objects?

  3. If you have small fixed-size arrays whose size is known at compile-time, e.g. 3-component or 4-component arrays, you should use the StaticArrays package (which will completely unroll your loops and eliminate all allocations).

(I’m assuming that your critical code is actually in a function, of course — don’t use global loops.)


Thanks for the effort, I appreciate it! Impressive speed up!

Sadly I do need them, as these matrices represent derivatives in a neural network and I need every possible combination of the different neurons (fully connected layers), but it’s a good point.

a will usually be as large as 10000, while the second dimension s in @mcabbott post will be as large as 40 and t can go up to 20 - 30. StaticArrays won’t help sadly.

states is a Boolean Array. Does Einsum / Vielsum in some way or another incorporate this?

I’ll do a bit of benchmarking myself and see if I can make any further improvements. Thanks a lot again to both of you! (and no worries, everything’s in a function)