# Speed up simple product accumulator loop

Hello Julia community,

I need help with speeding up an inner loop in a larger project (running on Julia 1.5.0). Here is an MWE. The loop accumulates multiplicatively into a vector.

```julia
using BenchmarkTools

function signal1(k, om, t)
    V = ones(Float64, size(t))
    for p = 1:length(om)
        @. V *= 1 - k[p]*sin(om[p]*t)^4
    end
    return V
end

N = 200    # typically > 50_000
nt = 101   # typically 200-500

k = rand(N, N)
om = rand(N, N)
t = collect(range(0, 0.5, length=nt))

@btime signal1($k, $om, $t);
```

The code appears type stable according to `@code_warntype`. Adding `@simd` or `@inbounds` to the loop makes no difference. `@btime` shows only one memory allocation, so the `@.` macro appears to be working.
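To double-check the single-allocation claim without BenchmarkTools, `@allocated` from Base can be used on a warmed-up call (a small sketch; the tiny 4×4 inputs here are made up for illustration):

```julia
# Same function as in the MWE above.
function signal1(k, om, t)
    V = ones(Float64, size(t))
    for p = 1:length(om)
        @. V *= 1 - k[p]*sin(om[p]*t)^4
    end
    return V
end

k = rand(4, 4); om = rand(4, 4)
t = collect(range(0, 0.5, length=5))

signal1(k, om, t)            # warm up (compile first)
@allocated signal1(k, om, t) # small: essentially just the output vector
```

The fused in-place broadcast `@. V *= ...` writes into `V` without temporaries, so the only heap allocation is `ones(Float64, size(t))` itself.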

As is, this is already 9x faster than my original Matlab version - impressive! However, I am wondering whether I am leaving performance on the table. Are there any ways to speed this up, without destroying readability?

Thanks.

EDIT: remove `V` from function argument list.

For real input, you could put `@avx` (offered by LoopVectorization) before `@.` to achieve better SIMD.


Some quick attempts:

```julia
julia> @btime signal1($k,$om,$t); # defining it without argument V
  117.286 ms (1 allocation: 896 bytes)

julia> using Tullio

julia> @tullio (*) V[a] := 1 - k[p,q] * sin(om[p,q] * t[a])^4;

julia> V ≈ signal1(k,om,t)
true

julia> @btime @tullio (*) V[a] := 1 - k[p,q] * sin(om[p,q] * t[a])^4;
  10.760 ms (125 allocations: 8.28 KiB)

julia> using LoopVectorization

julia> @btime @tullio (*) V[a] := 1 - k[p,q] * sin(om[p,q] * t[a])^4;
  6.161 ms (125 allocations: 8.28 KiB)
```

(Edit: adding `threads=false avx=false fastmath=false` makes this slow again. I think `@fastmath` helps with the power `^4` quite a bit.)
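Spelled out, those keyword options can be appended to the `@tullio` call (a sketch with small made-up inputs; option names as quoted in the edit above):

```julia
using Tullio

k  = rand(4, 4)
om = rand(4, 4)
t  = collect(range(0, 0.5, length=5))

# Default code generation: threading, SIMD (via LoopVectorization, if
# loaded) and @fastmath are all available to Tullio.
@tullio (*) Vfast[a] := 1 - k[p,q] * sin(om[p,q] * t[a])^4

# The same reduction with those features switched off; much slower on
# large inputs, but the values agree.
@tullio (*) Vslow[a] := 1 - k[p,q] * sin(om[p,q] * t[a])^4  threads=false avx=false fastmath=false
```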


One thing that will slow your computation down but make it much more accurate is accumulating additively in the log domain. Especially if the `om` terms are near 0, this will be much more numerically stable.
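In plain Base Julia that idea might look like this (a sketch; `signal_log` is a made-up name): accumulate the sum of `log1p` terms and exponentiate once at the end. `log1p(-x)` stays accurate when `x = k[p]*sin(om[p]*t)^4` is tiny, where `1 - x` would lose low-order bits.

```julia
# Log-domain sketch of the accumulator: sum log1p(-x) terms, then exp.
function signal_log(k, om, t)
    L = zeros(Float64, size(t))
    for p = 1:length(om)
        @. L += log1p(-k[p] * sin(om[p] * t)^4)
    end
    return exp.(L)
end
```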


Thank you all for these great suggestions (using `@avx` from LoopVectorization, using `@tullio` from Tullio, and using the logarithm for better accuracy)!

Here are the results (Julia 1.5.0, Win10, i7 5600U with 4 threads):

```julia
using BenchmarkTools
using LoopVectorization
using Tullio

# original reference implementation
function signal1(k, om, t)
    V = ones(Float64, size(t))
    for p = 1:length(om)
        @. V *= 1 - k[p]*sin(om[p]*t)^4
    end
    return V
end

# addition of @avx from LoopVectorization
function signal2(k, om, t)
    V = ones(Float64, size(t))
    for p = 1:length(om)
        @avx @. V *= 1 - k[p]*sin(om[p]*t)^4
    end
    return V
end

# using @tullio from Tullio
signal3(k, om, t) = @tullio (*) V[a] := 1 - k[p,q] * sin(om[p,q] * t[a])^4

# accumulating in the log domain (for improved accuracy)
signal4(k, om, t) = exp.(@tullio V[a] := log1p(-k[p,q] * sin(om[p,q] * t[a])^4))

N = 200    # typically > 50_000
nt = 101   # typically 200-500

k = rand(N, N)
om = rand(N, N)
t = collect(range(0, 0.5, length=nt))

@show Threads.nthreads()

V1 = signal1(k, om, t)
@show signal2(k, om, t) ≈ V1
@show signal3(k, om, t) ≈ V1
@show signal4(k, om, t) ≈ V1

@btime signal1($k, $om, $t); # original implementation
@btime signal2($k, $om, $t); # with @avx
@btime signal3($k, $om, $t); # with @tullio
@btime signal4($k, $om, $t); # via log domain, with @tullio
```

gives

```
Threads.nthreads() = 4
signal2(k, om, t) ≈ V1 = true
signal3(k, om, t) ≈ V1 = true
signal4(k, om, t) ≈ V1 = true
  61.664 ms (1 allocation: 896 bytes)
  19.983 ms (1 allocation: 896 bytes)
  9.676 ms (160 allocations: 10.84 KiB)
  27.789 ms (158 allocations: 11.59 KiB)
```

`signal3` with the `@tullio` one-line definition is clearly the fastest. Thanks!

The penalty for working in the log domain is not so bad here. One further (perhaps obscure) feature is that you may write the following, to apply `exp` after the sum:

```julia
@tullio V[a] := exp <| log1p(-k[p,q] * sin(om[p,q] * t[a])^4)
```

This reversed pipe operator actually looks quite intuitive to me.

`Tullio` is really impressive. It makes life so much easier in two respects: (a) taking the pain out of writing hot loops over arrays, and (b) automatically handling low-level performance annotations.