# Repeated vector vector multiplication to obtain matrix

In my application I am running many iterations of a multiplication like this `mat = vec1 .* vec2'`.

The current implementation is like this

```julia
dP_dw = zeros(Float64, (number_of_states, settings.n_vis * 2, settings.n_hid))
for a = 1:number_of_states
    dP_dw[a, :, :] = states[a, :] .* tanh_term[a, :]'
end
```

I was wondering whether there is a more idiomatic ("Julian") and faster way of doing this than a `for` loop, perhaps using broadcasting, but I cannot put it together; customizing the broadcast seems tough.

Here’s a quick attempt:

```julia
function outers1(states, tanh_term)
    number_of_states = size(states, 1)
    size(tanh_term, 1) == number_of_states || error("mismatched first dimensions")
    dP_dw = zeros(Float64, (number_of_states, size(states, 2), size(tanh_term, 2)))
    @inbounds for a = 1:number_of_states
        @views dP_dw[a, :, :] .= states[a, :] .* tanh_term[a, :]'
    end
    dP_dw
end

using Einsum, Test
outers2(states, tanh_term) = @einsum dP_dw[a,s,t] := states[a, s] * tanh_term[a, t]
outers3(states, tanh_term) = @vielsum dP_dw[a,s,t] := states[a, s] * tanh_term[a, t]  # threaded variant

outers4(states, tanh_term) = states .* reshape(tanh_term, size(states, 1), 1, :)

N = 50; states, tanh_term = randn(N, N), randn(N, N);
@test outers1(states, tanh_term) ≈ outers2(states, tanh_term) ≈ outers4(states, tanh_term)

#== results (using BenchmarkTools) ==#

julia> @btime outers0($states, $tanh_term); # as in question
  364.486 μs (202 allocations: 1.96 MiB)

julia> @btime outers1($states, $tanh_term); # with @inbounds, @views, and .=
  189.673 μs (2 allocations: 976.64 KiB)

julia> @btime outers2($states, $tanh_term);
  62.388 μs (2 allocations: 976.64 KiB)

julia> @btime outers3($states, $tanh_term);
  52.462 μs (62 allocations: 984.86 KiB)

julia> @btime outers4($states, $tanh_term);
  63.387 μs (4 allocations: 976.75 KiB)
```

If you look at `@macroexpand1 @einsum dP_dw[a,s,t] := states[a, s] * tanh_term[a, t]`, the main change is that it orders the loops with `a` innermost.
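A hand-written sketch of that loop order (my own version, not the exact macro output) makes the point concrete: Julia arrays are column-major, so putting `a` innermost means consecutive iterations touch contiguous memory in all three arrays.

```julia
# Sketch of the loop nest @einsum effectively generates, with `a` innermost.
# In column-major storage, the first index varies fastest in memory, so this
# order gives contiguous reads of states/tanh_term and contiguous writes to dP_dw.
function outers2_manual(states, tanh_term)
    A, S = size(states)
    T = size(tanh_term, 2)
    dP_dw = zeros(A, S, T)
    @inbounds for t in 1:T, s in 1:S, a in 1:A
        dP_dw[a, s, t] = states[a, s] * tanh_term[a, t]
    end
    dP_dw
end
```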


First reconsider whether you should construct such matrices at all. You are constructing rank-1 matrices M = uv*, which requires Θ(n²) storage and time for n-component vectors, and naively multiplying such a matrix by a vector (Mx) requires Θ(n²) time. In contrast, storing u and v separately requires Θ(n) storage, and you can multiply (uv*)x = u(v*x) in Θ(n) time.
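A minimal sketch of the difference, keeping the factors rather than the matrix:

```julia
u = randn(1000); v = randn(1000); x = randn(1000)

# Explicit rank-1 matrix: Θ(n²) storage and a Θ(n²) matrix-vector multiply
M = u * v'           # 1000×1000 dense matrix
y1 = M * x

# Factored form: v'x is a scalar, so no matrix is ever formed — Θ(n) work
y2 = u * (v' * x)

y1 ≈ y2              # same result
```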

If you must compute the matrices explicitly, then

1. Realize that `dP_dw[a, :, :] = states[a, :] .* tanh_term[a, :]'` first allocates two vectors (`states[a, :]` and `tanh_term[a, :]`) since slicing makes a copy in Julia (unless you use `@views`), then allocates a matrix, then writes that matrix into `dP_dw`. That is a lot of allocations.

2. You can partly eliminate the allocations by using `.=` and `@views` as in @mcabbott's post, but I think you may need Julia 1.5 or later for the allocations of the view objects themselves to be elided.

3. If you have small fixed-size arrays whose size is known at compile-time, e.g. 3-component or 4-component arrays, you should use the StaticArrays package (which will completely unroll your loops and eliminate all allocations).

(I’m assuming that your critical code is actually in a function, of course — don’t use global loops.)


Thanks for the effort, I appreciate it! Impressive speed up!

Sadly I do need them, as these matrices represent derivatives in a neural network and I need every possible combination of the different neurons (fully connected layers), but it’s a good point.

`a` will usually be as large as 10000, while the second dimension `s` in @mcabbott's post will be at most around 40, and `t` can go up to 20 or 30, so StaticArrays won't help, sadly.

`states` is a Boolean array. Do `@einsum` / `@vielsum` handle this in some way, or does it need converting first?

I’ll do a bit of benchmarking myself and see if I can make any further improvements. Thanks a lot again to both of you! (and no worries, everything’s in a function)
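As a first sanity check (just a sketch, not benchmarked), plain broadcasting already handles `Bool` inputs: `Bool * Float64` promotes to `Float64`, so no up-front conversion is needed for the `outers4`-style version.

```julia
states = rand(Bool, 5, 4)    # Boolean states, as in the application
tanh_term = randn(5, 3)

# Bool .* Float64 promotes elementwise, so the reshape/broadcast version
# from @mcabbott's post works unchanged on a Bool array:
dP_dw = states .* reshape(tanh_term, size(states, 1), 1, :)

eltype(dP_dw)                # Float64: promotion happened automatically
```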