# Speed up simple product accumulator loop

Hello Julia community,

I need help with speeding up an inner loop in a larger project (running on Julia 1.5.0). Here is an MWE. The loop accumulates multiplicatively into a vector.

```julia
using BenchmarkTools

function signal1(k, om, t)
    V = ones(Float64, size(t))
    for p = 1:length(om)
        @. V *= 1 - k[p]*sin(om[p]*t)^4
    end
    return V
end

N = 200    # typically > 50_000
nt = 101   # typically 200-500

k = rand(N, N)
om = rand(N, N)
t = collect(range(0, 0.5, length=nt))

@btime signal1($k, $om, $t);
```

The code appears type stable according to `@code_warntype`. Adding `@simd` or `@inbounds` to the loop makes no difference. `@btime` shows only one memory allocation, so the `@.` macro appears to be working.
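To double-check the single-allocation claim without BenchmarkTools, `@allocated` from Base can be used on a warmed-up call (a small sketch; the tiny 4×4 inputs here are made up for illustration):

```julia
# Same function as in the MWE above.
function signal1(k, om, t)
    V = ones(Float64, size(t))
    for p = 1:length(om)
        @. V *= 1 - k[p]*sin(om[p]*t)^4
    end
    return V
end

k = rand(4, 4); om = rand(4, 4)
t = collect(range(0, 0.5, length=5))

signal1(k, om, t)            # warm up (compile first)
@allocated signal1(k, om, t) # small: essentially just the output vector
```

The fused in-place broadcast `@. V *= ...` writes into `V` without temporaries, so the only heap allocation is `ones(Float64, size(t))` itself.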

As is, this is already 9x faster than my original Matlab version - impressive! However, I am wondering whether I am leaving performance on the table. Are there any ways to speed this up, without destroying readability?

Thanks.

EDIT: remove `V` from function argument list.

For real input, you could put `@avx` (offered by LoopVectorization) before `@.` to achieve better SIMD.


Some quick attempts:

```julia
julia> @btime signal1($k,$om,$t); # defining it without argument V
  117.286 ms (1 allocation: 896 bytes)

julia> using Tullio

julia> @tullio (*) V[a] := 1 - k[p,q] * sin(om[p,q] * t[a])^4;

julia> V ≈ signal1(k,om,t)
true

julia> @btime @tullio (*) V[a] := 1 - k[p,q] * sin(om[p,q] * t[a])^4;
  10.760 ms (125 allocations: 8.28 KiB)

julia> using LoopVectorization

julia> @btime @tullio (*) V[a] := 1 - k[p,q] * sin(om[p,q] * t[a])^4;
  6.161 ms (125 allocations: 8.28 KiB)
```

(Edit: adding `threads=false avx=false fastmath=false` makes this slow again. I think `@fastmath` helps with the power `^4` quite a bit.)
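Spelled out, those keyword options can be appended to the `@tullio` call (a sketch with small made-up inputs; option names as quoted in the edit above):

```julia
using Tullio

k  = rand(4, 4)
om = rand(4, 4)
t  = collect(range(0, 0.5, length=5))

# Default code generation: threading, SIMD (via LoopVectorization, if
# loaded) and @fastmath are all available to Tullio.
@tullio (*) Vfast[a] := 1 - k[p,q] * sin(om[p,q] * t[a])^4

# The same reduction with those features switched off; much slower on
# large inputs, but the values agree.
@tullio (*) Vslow[a] := 1 - k[p,q] * sin(om[p,q] * t[a])^4  threads=false avx=false fastmath=false
```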


One thing that will slow your computation down but make it much more accurate is accumulating additively in the log domain. Especially if the `om` terms are near 0, this will be much more numerically stable.
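In plain Base Julia that idea might look like this (a sketch; `signal_log` is a made-up name): accumulate the sum of `log1p` terms and exponentiate once at the end. `log1p(-x)` stays accurate when `x = k[p]*sin(om[p]*t)^4` is tiny, where `1 - x` would lose low-order bits.

```julia
# Log-domain sketch of the accumulator: sum log1p(-x) terms, then exp.
function signal_log(k, om, t)
    L = zeros(Float64, size(t))
    for p = 1:length(om)
        @. L += log1p(-k[p] * sin(om[p] * t)^4)
    end
    return exp.(L)
end
```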


Thank you all for these great suggestions (using `@avx` from LoopVectorization, using `@tullio` from Tullio, and using the logarithm for better accuracy)!

Here are the results (Julia 1.5.0, Win10, i7 5600U with 4 threads):

```julia
using BenchmarkTools
using LoopVectorization
using Tullio

# original reference implementation
function signal1(k, om, t)
    V = ones(Float64, size(t))
    for p = 1:length(om)
        @. V *= 1 - k[p]*sin(om[p]*t)^4
    end
    return V
end

# addition of @avx from LoopVectorization
function signal2(k, om, t)
    V = ones(Float64, size(t))
    for p = 1:length(om)
        @avx @. V *= 1 - k[p]*sin(om[p]*t)^4
    end
    return V
end

# using @tullio from Tullio
signal3(k, om, t) = @tullio (*) V[a] := 1 - k[p,q] * sin(om[p,q] * t[a])^4

# accumulating in the log domain (for improved accuracy)
signal4(k, om, t) = exp.(@tullio V[a] := log1p(-k[p,q] * sin(om[p,q] * t[a])^4))

N = 200    # typically > 50_000
nt = 101   # typically 200-500

k = rand(N, N)
om = rand(N, N)
t = collect(range(0, 0.5, length=nt))

@show Threads.nthreads()

V1 = signal1(k, om, t)
@show signal2(k, om, t) ≈ V1
@show signal3(k, om, t) ≈ V1
@show signal4(k, om, t) ≈ V1

@btime signal1($k, $om, $t); # original implementation
@btime signal2($k, $om, $t); # with @avx
@btime signal3($k, $om, $t); # with @tullio
@btime signal4($k, $om, $t); # via log domain, with @tullio
```

gives

```
Threads.nthreads() = 4
signal2(k, om, t) ≈ V1 = true
signal3(k, om, t) ≈ V1 = true
signal4(k, om, t) ≈ V1 = true
  61.664 ms (1 allocation: 896 bytes)
  19.983 ms (1 allocation: 896 bytes)
  9.676 ms (160 allocations: 10.84 KiB)
  27.789 ms (158 allocations: 11.59 KiB)
```

`signal3` with the `@tullio` one-line definition is clearly the fastest. Thanks!

The penalty for working in the log domain is not so bad here. One further (perhaps obscure) feature is that you may write the following, to apply `exp` after the sum:

```julia
@tullio V[a] := exp <| log1p(-k[p,q] * sin(om[p,q] * t[a])^4)
```

This reversed pipe operator actually looks quite intuitive to me.

`Tullio` is really impressive. It makes life so much easier in two respects: (a) taking the pain out of writing hot loops over arrays, and (b) automatically handling low-level performance annotations.