Speeding up a sum involving 3 matrices

Thanks for your answers. It made a 20x improvement in my overall computation time.

May I get your help with another related task? I need to compute a 10,000+ row vector in which the i^{th} row is \sum_{jk} x^{}_{ij}z^{}_{ik}\pi^{}_{jk} where j,k < 10. x, z are data and \pi_{jk} are parameters to be estimated. I have written the following code to implement this sum. Are there any improvements I can make here?

x = data[:,1:8];
z = data[:,9:16];
pi = rand(Float64, (size(x,2), size(z,2)));

function nlsum(pi)
    s = zeros(eltype(pi), size(x, 1), 1)
    for j in axes(pi, 1)
        for k in axes(pi, 2)
            s[:] += x[:, j].*z[:, k].* pi[j, k]
    return s    

For one thing, your code is allocating like crazy, since s[:] += x[:, j].*z[:, k].* pi[j, k] allocates 4 temporary arrays on every iteration. You could use views, or just write out the loop over i, or use something like Tullio.jl. Also, don’t use global variables — pass x and z as parameters to your function. Just changing your function to @views function nlsum(x,z,pi) speeds it up by a factor of 3x on my machine.

Alternatively, if you think in terms of matrix operations, your function is exactly:

nlsum2(x,z,π)= sum(x .* (z * π'), dims=2)

and this gives me another factor of 25x, for overall almost 100x speedup. (And you could eke out some more performance by optimizing out the extra allocations. Tullio.jl is also worth trying.)

PS. Julia has 1-dimensional arrays, unlike Matlab. You can allocate s as a 1d array instead of a 2d array with 1 column.


pi is a built-in constant (equal to 3.14…) Overwriting it in global scope is probably not a good idea.

This is totally safe (if a bit unusual in the case of pi) — it won’t affect usage of pi in other modules. (You are shadowing MathConstants.pi with a new binding, not overwriting it.)

(In a real application you’d probably be putting this code into functions anyway.)

It’s actually pretty crucial that this is fine — Base alone exports almost 1000 symbols, so if shadowing names were unsafe you’d see a lot of inadvertent breakage, and it would be dangerous to add new exports in future versions.


Interesting, I find that this a factor of 2 slower on my machine than the OP’s version with your suggested fixes. What sizes did you use? I took the data to be 10_000 long in its first dimension.

Here’s my code, plus a Tullio.jl version that smokes them both:

function nlsum1(x, z, pi)
    s = zeros(eltype(pi), size(x, 1))
    for j in axes(pi, 1)
        for k in axes(pi, 2)
            @views s[:] .+= x[:, j].*z[:, k].* pi[j, k]
    return s    

nlsum2(x,z,π) = sum(x .* (z * π'), dims=2)

using Tullio, LoopVectorization
nlsum_3(x,z,π) = @tullio s[i] := x[i,j]*z[i,k]*π[j, k];

and timings:

julia> let x = randn(10_000, 8), z = randn(10_000, 5), π = rand(8, 5)
           out1 = @btime nlsum1($x, $z, $π)
           out2 = @btime nlsum2($x, $z, $π)
           out3 = @btime nlsum3($x, $z, $π)
           out1 ≈ out2 ≈ out3
  111.783 μs (2 allocations: 78.17 KiB)
  198.939 μs (10 allocations: 1.30 MiB)
  28.667 μs (19 allocations: 79.02 KiB)