# Sum operations between arrays

Hello there,

I was wondering if you knew how to efficiently compute the following matrices in Julia. In Matlab I am getting the result 3-6x faster than in Julia, which makes me think that there is room for improvement perhaps? Since I need to compute these thousands of times in my code, I am really interested to optimize this bit…

So, here is the computation in Matlab. I take advantage of the fact that I can perform the “all-against-all” summation in matlab by summing a Nx1 with a 1xN vector, which gives the NxN matrix. In Julia, as far as I know, it is not possible.

``````function h = myfunc(v,vp)
% Both v and vp are column vectors

f = (v.^(1/3)+(vp.^(1/3))').^2.*((v.^(2/9)+(vp.^(2/9))').^(1/2));
g = exp(-(((v*vp').^(1/3))./(v.^(1/3)+(vp.^(1/3))')).^4);
h = f.*g;

end
``````

I test the function with the following:

``````x=linspace(1,1000,1000)';
tic
[f,  g] = myfun(x,x);
toc
``````

Result:

``````Elapsed time is 0.093312 seconds.
``````

Now, in Julia, since I cannot perform the NxN summation this way, I can only think of a for-loop. So I use two functions:

``````function product(v,vp)
f = zeros(Float64,length(v),length(vp));
g = zeros(Float64,length(v),length(vp));
for i in eachindex(v), j in eachindex(vp)
@inbounds f[j,i] = (v[i]^(1/3)+(vp[j]^(1/3)))^2 *((v[i]^(2/9)+(vp[j]^(2/9)))^(1/2))*2^(1/3);
@inbounds g[j,i] = exp(-(((v[i]*vp[j])^(1/3))/(v[i]^(1/3)+(vp[j]^(1/3))))^4);
end
return f, g
end
``````
``````function myfun(v,vp)
f, g = product(v,vp);
h = @. f*g;
return h
end
``````

I test with the same input I used in Matlab (`x=collect(1:1000)*1.0`) and, using BenchmarkTools I get:

``````596.179 ms (8 allocations: 22.89 MiB)
``````

So Matlab is 6.3 times faster.

Now, I realize that, since I am evaluating this function using the same vector for both inputs, the result is a symmetric matrix. So I can improve things using LinearAlgebra:

``````function product2(v,vp)
f = zeros(Float64,length(v),length(vp));
g = zeros(Float64,length(v),length(vp));
N = 1;
for i in eachindex(v)
for j = 1:N
@inbounds f[j,i] = (v[i]^(1/3)+(vp[j]^(1/3)))^2 *((v[i]^(2/9)+(vp[j]^(2/9)))^(1/2))*2^(1/3);
@inbounds g[j,i] = exp(-(((v[i]*vp[j])^(1/3))/(v[i]^(1/3)+(vp[j]^(1/3))))^4);
end
N = N+1;
end
f = Symmetric(f);
g = Symmetric(g);
return f, g
end
``````

which gives:

``````293.099 ms (10 allocations: 22.89 MiB)
``````

Better, but still 3x slower than Matlab.

I would really appreciate any thoughts on this.

3 Likes

Sure it is. That’s broadcasting! Just use `.+`, or simply

``````f = @. (v^(1/3)+(vp^(1/3))')^2 * ((v^(2/9)+(vp^(2/9))')^(1/2))
``````

However, this is one of those cases where loop fusion may hurt you (see also #20875) — you end up computing things like `v^(1/3)` N² times instead of N times, whereas Matlab’s non-fused vector operations compute these powers only once per element. Your nested-loop implementation has the same problem (but it’s even more obvious there because the loops are explicit) — you are computing all of your `v` and `vp` powers N² times.

So it might be faster to hoist those computations out of the inner loops, even at the expense of allocating some temporary arrays:

``````function product3(v, vp)
v_1_3 = cbrt.(v)
vp_1_3 = cbrt.(vp)
v_2_9 = cbrt.(v_1_3).^2
vp_2_9 = cbrt.(vp_1_3).^2
f = @. (v_1_3 + vp_1_3')^2 * sqrt(v_2_9 + vp_2_9')
g = @. exp(-((v_1_3 * vp_1_3') / (v_1_3 + vp_1_3'))^4)
return f, g
end
``````

(Note that I would tend to use `sqrt` and `cbrt` rather than `^(1/2)` and `^(1/3)`, respectively.
Note also that your Julia implementation also includes a `* 2^(1/3)` factor that is not in your Matlab code.)

On my machine, this is 5x faster than your initial two-loop version:

``````julia> @btime product(\$x,\$x);
202.658 ms (5 allocations: 15.26 MiB)

julia> @btime product3(\$x,\$x);
42.112 ms (21 allocations: 15.29 MiB)
``````

By the way, is Matlab using multiple threads? Try running Matlab with `-singleCompThread` to see what its single-threaded performance is, so that you are comparing apples to apples.

(Conversely, you can use threads in Julia too with `Threads.@threads for ... end`, but it’s good to optimize the single-threaded performance before you start messing with parallelism.)

11 Likes

I was able to save another 2 ms (5%) on my machine by writing out the loops in order to eliminate the `vp_1_3` and `vp_2_9` allocations:

``````function product4(v, vp)
vp_1_3 = cbrt.(vp)
vp_2_9 = cbrt.(vp_1_3).^2
# compute the output type rather than hard-coding Float64, to be more type-generic
f = Matrix{typeof(zero(eltype(vp_1_3)) + cbrt(zero(eltype(v))))}(undef, length(vp),length(v))
g = similar(f)
@inbounds for i in eachindex(v)
v_1_3 = cbrt(v[i]); v_2_9 = cbrt(v_1_3)^2
for j in eachindex(vp)
f[j,i] = (v_1_3 + vp_1_3[j])^2 * sqrt(v_2_9 + vp_2_9[j])
g[j,i] = exp(-((v_1_3 * vp_1_3[j]) / (v_1_3 + vp_1_3[j]))^4)
end
end
return f, g
end
``````

(and of course you can still save another factor of 2 by specializing for the symmetric case).

7 Likes

This also gets a nice boost from LoopVectorization. (Although perhaps this should come with a warning that it’s slightly experimental, unlike the above. For instance it appears not to like powers, right now.)

``````julia> using LoopVectorization

julia> function product6(v, vp)
vp_1_3 = cbrt.(vp)
vp_2_9 = cbrt.(vp_1_3).^2
f = Matrix{Float64}(undef, length(vp),length(v))
g = similar(f)
@avx for i in eachindex(v)
v_1_3 = cbrt(v[i])
c = cbrt(v_1_3)
v_2_9 = c * c
for j in eachindex(vp)
d = (v_1_3 + vp_1_3[j])
f[j,i] = d * d * sqrt(v_2_9 + vp_2_9[j])
a = (v_1_3 * vp_1_3[j]) / (v_1_3 + vp_1_3[j])
b = a * a
g[j,i] = exp(-b * b)
end
end
return f, g
end

julia> @btime product4(\$x, \$x); # stevengj's function above
40.186 ms (7 allocations: 15.27 MiB)

julia> @btime product6(\$x, \$x); # using LoopVectorization
3.881 ms (7 allocations: 15.27 MiB)

julia> product4(x, x) ≈ product6(x, x)
true
``````

In fact, Julia’s own vectorisation gets you half-way there. Writing out the powers (`b = a * a` etc.) but just using `@inbounds` instead of `@avx`, I get this:

``````julia> @btime product8(\$x, \$x);
13.986 ms (7 allocations: 15.27 MiB)
``````
7 Likes

Hey, thanks a lot.

Sure it is. That’s broadcasting! Just use `.+` ,

I must say, I am surprised… I just assumed that `.+` would do the elementwise sum, not the entire matrix! Great to know!

However, this is one of those cases where loop fusion may hurt you (see also #20875) — you end up computing things like `v^(1/3)` N² times instead of N times, whereas Matlab’s non-fused vector operations compute these powers only once per element. Your nested-loop implementation has the same problem (but it’s even more obvious there because the loops are explicit) — you are computing all of your `v` and `vp` powers N² times.

Great! Thanks for the info!

So it might be faster to hoist those computations out of the inner loops, even at the expense of allocating some temporary arrays:

Fantastic! Just tried it out and worked perfectly (30 ms).

Thank you a lot!

1 Like

You don’t need to write out the powers, just use `@fastmath`. I did this on the `for j` loop, i.e. writing `@fastmath for j in eachindex(vp)`, and I got

``````julia> @btime product4(\$x,\$x);
13.455 ms (7 allocations: 15.27 MiB)
``````

The reason for this is that computing `x^4` by `(x^2)^2` is slightly less accurate, so it is not done by default. `@fastmath` switches over to faster (but less accurate) small-integer powers. (See also the discussion in #20637.)

3 Likes

Anyone know why the `@avx` macro chokes on literal powers? Is it just ‘not implemented yet’?

mcabbot filed this issue:

https://github.com/chriselrod/LoopVectorization.jl/issues/85

It is a case of “not implemented yet”, but it’s also not the trivial problem of needing to add missing methods. LoopVectorization will need special-case handling for `^`, just like Julia’s lowering has.

3 Likes

Did you try it with `@fastmath` as noted above?

I added support in LoopVectorization 0.6.29, so it should work now. Please file new issues if any more problems come up!

That `@fastmath` produces `pow_fast(x, ::Val{p})` is neat:

``````ulia> @macroexpand @fastmath a ^ 2
:(Base.FastMath.pow_fast(a, Val{2}()))
``````

I implemented the special handling of `^` to directly insert the power-by-squaring if the exponent is a literal, to improve cost modeling. That one `pow_fast` has a different cost to another would require a larger change to the package.
`a ^ b` is expensive in terms of time and register use (you probably wouldn’t want to unroll the loops at all), but `a ^ 0`, `a ^ 1`, and `a ^ 2` are cheap, and `a ^ -1` isn’t all that expensive (and doesn’t cost any extra registers).

6 Likes

I tried out the new `Threads.@spawn` functionality, and made a special implementation for the symmetric case:

``````
using Base.Threads: @spawn, @sync
using LinearAlgebra: Symmetric

function tproduct(x, y)
Y3 = cbrt.(y)
Y9 = cbrt.(Y3).^2
H = Matrix{eltype(x)}(undef, length(x), length(y))
@fastmath @sync for i in eachindex(x)
@spawn begin
x3 = cbrt(x[i])
x9 = cbrt(x3)^2
for j in eachindex(y)
A = x3 + Y3[j]
B = x3 * Y3[j]
C = x9 + Y9[j]
f = A^2 * sqrt(C)
g = exp(-(B / A)^4)
H[j, i] = f * g
end
end
end
return H
end

# symmetric case
function tproduct(x)
X3 = cbrt.(x)
X9 = cbrt.(X3).^2
H = Matrix{eltype(x)}(undef, length(x), length(x))
@fastmath @sync for i in eachindex(x)
@spawn begin
x3 = X3[i]
x9 = X9[i]
for j in firstindex(x):i
A = x3 + X3[j]
B = x3 * X3[j]
C = x9 + X9[j]
f = A^2 * sqrt(C)
g = exp(-(B / A)^4)
H[j, i] = f * g
end
end
end
return Symmetric(H)
end
``````

Comparing it to the `@avx` version, using 4 threads:

``````julia> @btime product6(x,x) setup=(x=rand(1000));
6.499 ms (4 allocations: 7.64 MiB)

julia> @btime tproduct(x, x) setup=(x=rand(1000));
3.689 ms (6015 allocations: 8.36 MiB)

julia> @btime tproduct(x) setup=(x=rand(1000));
1.990 ms (6016 allocations: 8.35 MiB)
``````

The last, symmetric, case is almost 30x as fast as the Matlab code, also with 4 threads, though the Matlab code does not exploit symmetry. Using `@spawn` rather than `@threads` is advantageous for the symmetric case, especially.

BTW: how’s the interaction between the `@avx` macro and multithreading? Can they co-exist, or does `@avx` already exploit multiple threads?

4 Likes

Yep, `@avx` plays nice with threads. Here’s an example on 6 threads:

``````using Base.Threads: @spawn, @sync, nthreads
using LinearAlgebra: Symmetric
using LoopVectorization: @avx

function tproduct_avx(x, y)
@fastmath begin
Y3 = cbrt.(y)
Y9 = cbrt.(Y3).^2
end
lx = length(x)
ly = length(y)
H = Matrix{eltype(x)}(undef, lx, ly)

@sync for x_index_chunk in Iterators.partition(eachindex(x), lx ÷ 2nthreads())
@spawn begin
@avx for i ∈ x_index_chunk
x3 = cbrt(x[i])
x9 = cbrt(x3)^2
for j in eachindex(y)
A = x3 + Y3[j]
B = x3 * Y3[j]
C = x9 + Y9[j]
f = A^2 * sqrt(C)
g = exp(-(B/A)^4)
H[j, i] = f * g
end
end
end
end
return H
end
``````
``````julia> nthreads()
6

jullia> @btime tproduct(x, x) setup=(x=rand(1000));
1.821 ms (6015 allocations: 8.41 MiB)

julia> @btime tproduct_avx(x, x) setup=(x=rand(1000));
761.404 μs (100 allocations: 7.66 MiB)
``````

LoopVectorization.jl doesn’t yet know how to do triangular loops so doing the symmetric case would require some gymnastics currently, but this is on the issue tracker as something that is intended to be supported eventually Support for 'triangular' loops · Issue #77 · JuliaSIMD/LoopVectorization.jl · GitHub

Edit: I changed an error in the implementation of `tproduct_avx`

3 Likes

Very cool.

Unfortunately, it doesn’t yet handle ‘triangular’ iteration, as far as I understand, so we can’t leverage the advantages of the symmetric case. Edit: what you said, I mean.

Okay, so it turns out that for this case, one doesn’t really need to care about vectorizing the outer loop, so this implementation of the Symmetric case with LoopVectorization is able to get the best of both worlds:

``````using Base.Threads: @spawn, @sync, nthreads
using LinearAlgebra: Symmetric
using LoopVectorization: @avx

function tproduct_avx(x)
X3 = @avx cbrt.(x)
X9 = @avx cbrt.(X3).^2
lx = length(x)
H = Matrix{eltype(x)}(undef, lx, lx)
@sync for x_index_chunk in Iterators.partition(eachindex(x), lx ÷ 2nthreads())
@spawn begin
for i ∈ x_index_chunk
x3 = X3[i]
x9 = X9[i]
js = firstindex(x):i
@avx for j in js
A = x3 + X3[j]
B = x3 * X3[j]
C = x9 + X9[j]
f = A^2 * sqrt(C)
g = exp(-(B/A)^4)
H[j, i] = f * g
end
end
end
end
return Symmetric(H)
end
``````
``````julia> @btime tproduct(x) setup=(x=rand(1000));
1.041 ms (6016 allocations: 8.39 MiB)

julia> @btime tproduct_avx(x) setup=(x=rand(1000));
404.577 μs (88 allocations: 7.66 MiB)
``````

Hats off to @Elrod, the man is a wizard.

13 Likes 15 Likes

One question, though. What’s the purpose of the explicit chunking? I’m getting better performance (>10%) with just ordinary `eachindex`:

``````
function tproduct(x)
X3 = cbrt.(x)
X9 = cbrt.(X3).^2
H = Matrix{eltype(x)}(undef, length(x), length(x))
@sync for i in eachindex(x)
@spawn begin
x3 = X3[i]
x9 = X9[i]
@avx for j in firstindex(x):i
A = x3 + X3[j]
B = x3 * X3[j]
C = x9 + X9[j]
f = A^2 * sqrt(C)
g = exp(-(B / A)^4)
H[j, i] = f * g
end
end
end
return Symmetric(H)
end
``````

Edit: BTW, I’ve realized that, even though there are `eachindex` and `firstindex` in here, this won’t work too well for inputs with non-standard indexing, since I’m using the ordinary `Matrix` type, and using the same indices for inputs and outputs. So the code is not as generic as it looks.

Because the early iterations of `for i in eachindex(x)` are much faster than the later iterations, evenly splitting them up is probably not the way to go. However, that still seems to be a lot faster on my computer than `@spawn` on every iteration.
Your `tproduct`:

``````julia> @benchmark tproduct(\$x)
BenchmarkTools.Trial:
memory estimate:  8.39 MiB
allocs estimate:  6015
--------------
minimum time:     744.784 μs (0.00% GC)
median time:      831.465 μs (0.00% GC)
mean time:        1.072 ms (16.54% GC)
maximum time:     2.916 ms (69.94% GC)
--------------
samples:          4654
evals/sample:     1
``````

This is hardly faster than not using threads at all:

``````julia> @benchmark tproduct_nothreads(\$x)
BenchmarkTools.Trial:
memory estimate:  7.65 MiB
allocs estimate:  5
--------------
minimum time:     833.721 μs (0.00% GC)
median time:      855.760 μs (0.00% GC)
mean time:        1.016 ms (8.90% GC)
maximum time:     2.251 ms (26.65% GC)
--------------
samples:          4912
evals/sample:     1
``````

while Mason’s was much faster, thanks to the lower overhead of few `@spawn`s:

``````julia> @benchmark tproduct_avx(\$x)
BenchmarkTools.Trial:
memory estimate:  7.67 MiB
allocs estimate:  240
--------------
minimum time:     85.435 μs (0.00% GC)
median time:      100.729 μs (0.00% GC)
mean time:        188.287 μs (43.70% GC)
maximum time:     901.825 μs (76.78% GC)
--------------
samples:          10000
evals/sample:     1
``````

We can improve on this by trying to address the issue of uneven numbers of iterations on the inner loop. Basically, we want to partition the outer loop into `nthreads()` pieces that have similar numbers of total inner loop iterations.
This means that the early outer loop chunks must be longer than the later outer loop chunks.

The total number of triangular iterations is `div(N * (N + 1), 2)`.

``````julia> binomial(length(x) + 1,2)
500500

julia> bin2p1(x::Integer) = (x * (x + 1)) >>> 1
bin2p1 (generic function with 2 methods)

julia> bin2p1(1000)
500500
``````

We can invert this and solve.

``````julia> outer_iter(n) = round(Int, @fastmath 0.5sqrt(1 + 8n)-0.5  )
outer_iter (generic function with 1 method)

julia> @. outer_iter(500500 * (0:8) ÷ 8)
9-element Array{Int64,1}:
0
353
500
612
707
790
866
935
1000
``````

Now these are boundaries for the outer loop that should give roughly similar numbers of total loop iterations, if we were to use 8 chunks and `length(x) == 1000`.

Putting this together:

``````using Base.Threads: @spawn, @sync, nthreads
using LinearAlgebra: Symmetric
using LoopVectorization: @avx

outer_iter(n) = round(Int, @fastmath 0.5sqrt(1 + 8n)-0.5  )
bin2p1(x::AbstractVector) = bin2p1(length(x))
bin2p1(x::Integer) = (x * (x + 1)) >>> 1

X3 = @avx cbrt.(x)
X9 = @avx cbrt.(X3).^2
H = Matrix{eltype(x)}(undef, length(x), length(x))
niter = bin2p1(length(x))
@sync for tid in 1:nchunks
@spawn begin
start = outer_iter(((tid - 1) * niter) ÷ nchunks) + 1
stop = outer_iter((tid * niter) ÷ nchunks)
for i in start:stop
x3 = X3[i]
x9 = X9[i]
@avx for j in firstindex(x):i
A = x3 + X3[j]
B = x3 * X3[j]
C = x9 + X9[j]
f = A^2 * sqrt(C)
g = exp(-(B / A)^4)
H[j, i] = f * g
end
end
end
end
return Symmetric(H)
end
``````

This yields

``````julia> @benchmark tproduct_threads(\$x)
BenchmarkTools.Trial:
memory estimate:  7.66 MiB
allocs estimate:  119
--------------
minimum time:     57.702 μs (0.00% GC)
median time:      70.051 μs (0.00% GC)
mean time:        234.578 μs (46.76% GC)
maximum time:     1.574 ms (56.24% GC)
--------------
samples:          10000
evals/sample:     1

julia> versioninfo()
Julia Version 1.5.0-DEV.585
Commit 4213cc651f* (2020-04-07 08:45 UTC)
Platform Info:
OS: Linux (x86_64-generic-linux)
CPU: Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-9.0.1 (ORCJIT, skylake)
Environment:
``````
2 Likes

Weird. I saw a 3x speedup from four threads, and also tproduct was 10% faster than tproduct_avx on my laptop.

Rerunning on Julia 1.4, where threading is in drastically better shape than Julia master, and using `@avx` for the `cbrt` broadcasts in each version:

``````julia> x = rand(1000);

BenchmarkTools.Trial:
memory estimate:  7.66 MiB
allocs estimate:  119
--------------
minimum time:     51.276 μs (0.00% GC)
median time:      82.728 μs (0.00% GC)
mean time:        296.117 μs (49.80% GC)
maximum time:     1.547 ms (64.31% GC)
--------------
samples:          10000
evals/sample:     1

julia> @benchmark tproduct_avx(\$x)
BenchmarkTools.Trial:
memory estimate:  7.67 MiB
allocs estimate:  240
--------------
minimum time:     62.949 μs (0.00% GC)
median time:      97.996 μs (0.00% GC)
mean time:        177.327 μs (45.42% GC)
maximum time:     899.084 μs (81.76% GC)
--------------
samples:          10000
evals/sample:     1

julia> @benchmark tproduct(\$x)
BenchmarkTools.Trial:
memory estimate:  8.39 MiB
allocs estimate:  6015
--------------
minimum time:     566.186 μs (0.00% GC)
median time:      681.781 μs (0.00% GC)
mean time:        915.924 μs (19.56% GC)
maximum time:     2.739 ms (43.76% GC)
--------------
samples:          5444
evals/sample:     1

BenchmarkTools.Trial:
memory estimate:  7.65 MiB
allocs estimate:  5
--------------
minimum time:     662.203 μs (0.00% GC)
median time:      672.920 μs (0.00% GC)
mean time:        740.551 μs (8.30% GC)
maximum time:     1.306 ms (44.12% GC)
--------------
samples:          6727
evals/sample:     1
``````

Seems I’m seeing all around much better performance on Julia 1.4 than Julia master. [EDIT: that is because I had configured SLEEFPirates differently on Julia master; reverting to the default results in roughly equivalent single threaded performance.]
But I am still seeing comparable performance between single threaded, and 18 threads with an `@spawn` on each iteration.

Which, given that we have 1000 iterations, and the overhead of `@spawn`:

``````julia> @benchmark @spawn nothing
BenchmarkTools.Trial:
memory estimate:  719 bytes
allocs estimate:  4
--------------
minimum time:     485.450 ns (0.00% GC)
median time:      572.517 ns (0.00% GC)
mean time:        615.971 ns (6.23% GC)
maximum time:     4.161 μs (77.13% GC)
--------------
samples:          9812
evals/sample:     825
``````

means it seems like it’s to be expected. 1000 `@spawn`s look like they’d take at least 485 microseconds. Even with just 18 `@spawn`s, they’d make up about 20% of the runtime of the `51.276 μs` minimum time from `tproduct_threads`. Minimizing the number of `@spawn`s seems key.

But that’s just because of how fast this code is. If it was taking milliseconds, we wouldn’t really care about half a microsecond.

2 Likes

that’s interesting, I see ` 648.043 μs (6016 allocations: 8.39 MiB)` on your version and `401.126 μs (88 allocations: 7.66 MiB)` on mine. Of course, when I quickly whipped that code up, I didn’t think about the fact that each iteration of the triangular loop takes a different amount of time, so Chris is right that it would be best to be more intelligent about how the threads are spawned.

What sort of processor does your laptop have? Maybe the difference is that it’s unable to take advantage of any AVX instructions?

Interestingly, your code is actually slightly slower than my loop naive code on on my machine:

``````julia> @btime tproduct_threads(x)  setup=(x=rand(1000));
430.115 μs (45 allocations: 7.65 MiB)
``````

but I suspect that the differences matter more on your system because the code is running so much faster on it and you have so many more threads available. My code just spawns twice as many tasks as the number of threads and hopes for the best, which probably isn’t so great if you have 18 threads.