# Speed up simple product accumulator loop

Hello Julia community,

I need help with speeding up an inner loop in a larger project (running on Julia 1.5.0). Here is an MWE. The loop accumulates multiplicatively into a vector.

```julia
using BenchmarkTools

function signal1(k, om, t)
    V = ones(Float64, size(t))
    for p = 1:length(om)
        @. V *= 1 - k[p]*sin(om[p]*t)^4
    end
    return V
end

N = 200    # typically > 50_000
nt = 101   # typically 200-500

k = rand(N, N)
om = rand(N, N)
t = collect(range(0, 0.5, length=nt))

@btime signal1($k, $om, $t);
```

The code appears type stable according to `@code_warntype`. Adding `@simd` or `@inbounds` to the loop makes no difference. `@btime` shows only one memory allocation, so the `@.` macro appears to be working.
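To double-check the single-allocation claim without BenchmarkTools, `@allocated` from Base can be used on a warmed-up call (a small sketch; the tiny 4×4 inputs here are made up for illustration):

```julia
# Same function as in the MWE above.
function signal1(k, om, t)
    V = ones(Float64, size(t))
    for p = 1:length(om)
        @. V *= 1 - k[p]*sin(om[p]*t)^4
    end
    return V
end

k = rand(4, 4); om = rand(4, 4)
t = collect(range(0, 0.5, length=5))

signal1(k, om, t)            # warm up (compile first)
@allocated signal1(k, om, t) # small: essentially just the output vector
```

The fused in-place broadcast `@. V *= ...` writes into `V` without temporaries, so the only heap allocation is `ones(Float64, size(t))` itself.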

As is, this is already 9x faster than my original Matlab version - impressive! However, I am wondering whether I am leaving performance on the table. Are there any ways to speed this up, without destroying readability?

Thanks.

EDIT: remove `V` from function argument list.

For real input, you could put `@avx` (offered by LoopVectorization) before `@.` to achieve better SIMD.


Some quick attempts:

```julia
julia> @btime signal1($k,$om,$t); # defining it without argument V
  117.286 ms (1 allocation: 896 bytes)

julia> using Tullio

julia> @tullio (*) V[a] := 1 - k[p,q] * sin(om[p,q] * t[a])^4;

julia> V ≈ signal1(k,om,t)
true

julia> @btime @tullio (*) V[a] := 1 - k[p,q] * sin(om[p,q] * t[a])^4;
  10.760 ms (125 allocations: 8.28 KiB)

julia> using LoopVectorization

julia> @btime @tullio (*) V[a] := 1 - k[p,q] * sin(om[p,q] * t[a])^4;
  6.161 ms (125 allocations: 8.28 KiB)
```

(Edit: adding `threads=false avx=false fastmath=false` makes this slow again. I think `@fastmath` helps with the power `^4` quite a bit.)
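Spelled out, those keyword options can be appended to the `@tullio` call (a sketch with small made-up inputs; option names as quoted in the edit above):

```julia
using Tullio

k  = rand(4, 4)
om = rand(4, 4)
t  = collect(range(0, 0.5, length=5))

# Default code generation: threading, SIMD (via LoopVectorization, if
# loaded) and @fastmath are all available to Tullio.
@tullio (*) Vfast[a] := 1 - k[p,q] * sin(om[p,q] * t[a])^4

# The same reduction with those features switched off; much slower on
# large inputs, but the values agree.
@tullio (*) Vslow[a] := 1 - k[p,q] * sin(om[p,q] * t[a])^4  threads=false avx=false fastmath=false
```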


One thing that will slow your computation down but make it much more accurate is accumulating additively in the log domain. Especially if the `om` terms are near 0, this will be much more numerically stable.
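In plain Base Julia that idea might look like this (a sketch; `signal_log` is a made-up name): accumulate the sum of `log1p` terms and exponentiate once at the end. `log1p(-x)` stays accurate when `x = k[p]*sin(om[p]*t)^4` is tiny, where `1 - x` would lose low-order bits.

```julia
# Log-domain sketch of the accumulator: sum log1p(-x) terms, then exp.
function signal_log(k, om, t)
    L = zeros(Float64, size(t))
    for p = 1:length(om)
        @. L += log1p(-k[p] * sin(om[p] * t)^4)
    end
    return exp.(L)
end
```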


Thank you all for these great suggestions (using `@avx` from LoopVectorization, using `@tullio` from Tullio, and using the logarithm for better accuracy)!

Here are the results (Julia 1.5.0, Win10, i7 5600U with 4 threads):

```julia
using BenchmarkTools
using LoopVectorization
using Tullio

# original reference implementation
function signal1(k, om, t)
    V = ones(Float64, size(t))
    for p = 1:length(om)
        @. V *= 1 - k[p]*sin(om[p]*t)^4
    end
    return V
end

# addition of @avx from LoopVectorization
function signal2(k, om, t)
    V = ones(Float64, size(t))
    for p = 1:length(om)
        @avx @. V *= 1 - k[p]*sin(om[p]*t)^4
    end
    return V
end

# using @tullio from Tullio
signal3(k, om, t) = @tullio (*) V[a] := 1 - k[p,q] * sin(om[p,q] * t[a])^4

# accumulating in the log domain (for improved accuracy)
signal4(k, om, t) = exp.(@tullio V[a] := log1p(-k[p,q] * sin(om[p,q] * t[a])^4))

N = 200    # typically > 50_000
nt = 101   # typically 200-500

k = rand(N, N)
om = rand(N, N)
t = collect(range(0, 0.5, length=nt))

@show Threads.nthreads()

V1 = signal1(k, om, t)
@show signal2(k, om, t) ≈ V1
@show signal3(k, om, t) ≈ V1
@show signal4(k, om, t) ≈ V1

@btime signal1($k, $om, $t); # original implementation
@btime signal2($k, $om, $t); # with @avx
@btime signal3($k, $om, $t); # with @tullio
@btime signal4($k, $om, $t); # via log domain, with @tullio
```

gives

```
Threads.nthreads() = 4
signal2(k, om, t) ≈ V1 = true
signal3(k, om, t) ≈ V1 = true
signal4(k, om, t) ≈ V1 = true
  61.664 ms (1 allocation: 896 bytes)
  19.983 ms (1 allocation: 896 bytes)
  9.676 ms (160 allocations: 10.84 KiB)
  27.789 ms (158 allocations: 11.59 KiB)
```

`signal3` with the `@tullio` one-line definition is clearly the fastest. Thanks!

The penalty for working in the log domain is not so bad here. One further (perhaps obscure) feature is that you may write the following, to apply `exp` after the sum:

```julia
@tullio V[a] := exp <| log1p(-k[p,q] * sin(om[p,q] * t[a])^4)
```

This reversed pipe operator actually looks quite intuitive to me.

`Tullio` is really impressive. It makes life so much easier in two respects: (a) taking the pain out of writing hot loops over arrays, and (b) automatically handling low-level performance annotations.