`rev_cumsum_exp!()` optimization

Another weird operation I would like to optimize:

function rev_cumsum_exp!(A,B,source)
    # This is equivalent to:
    #       A .= exp.(source)
    #       B .= 1 ./ reverse(cumsum(reverse(A)))
    s = zero(eltype(A))
    for j in length(source):-1:1
        A[j] = exp(source[j])
        s += A[j]
        B[j] = inv(s)
    end
    return nothing
end
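
# A quick sanity check of the equivalence claimed in the comment above
# (a small illustrative example; the names x, A1, B1 are just for this check):
x = randn(5); A1 = zeros(5); B1 = zeros(5)
rev_cumsum_exp!(A1, B1, x)
A1 ≈ exp.(x) && B1 ≈ 1 ./ reverse(cumsum(reverse(A1)))   # should return true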

# Sample data: 
using BenchmarkTools
N=10_000
source, A, B = randn(N), zeros(N), zeros(N)
@benchmark rev_cumsum_exp!(A,B,source)

This yields:

BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):   85.600 μs … 737.400 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):      96.600 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   107.616 μs ±  32.025 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▅ █▂ ▇  ▆▁ ▁▄   ▅                       ▂▁   ▁▁           ▂   ▁
  █▆██▇██▇██▇██▇▅▆█▇▆▆▇█▅▅▆▅▆▄▅▅▄▅▂▅▄▅▄▅▆▅██▇▇▇███▇▆▄▆▇▅▆▅█▇█▆▆ █
  85.6 μs       Histogram: log(frequency) by time        211 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

But if I try to profile it:

N=100_000_000
source, A, B = randn(N), zeros(N), zeros(N)
@profview rev_cumsum_exp!(A,B,source)

I get:

[profile flame graph screenshot]

Is there still something to be done, or am I doomed? I need both the A and B outputs to be produced.

You could turn off bounds checking; this gives a 10% speed-up on my device for your specific example.
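
A minimal sketch of what that could look like, assuming A, B and source always have matching lengths (the name rev_cumsum_exp_inbounds! is just for illustration):

function rev_cumsum_exp_inbounds!(A, B, source)
    s = zero(eltype(A))
    @inbounds for j in length(source):-1:1
        A[j] = exp(source[j])
        s += A[j]
        B[j] = inv(s)
    end
    return nothing
end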

Most of the time is in exp, at least for me. I got little gain from multi-threading that, but using AppleAccelerate got a factor of 2. (I think MKL.jl has similar "vectorised" functions, for Intel & friends.)

julia> @btime rev_cumsum_exp!($A,$B,$source)
  30.333 μs (0 allocations: 0 bytes)

julia> @btime $A .= exp.($source);
  24.708 μs (0 allocations: 0 bytes)

julia> @btime AppleAccelerate.exp!($A, $source);
  9.166 μs (0 allocations: 0 bytes)

julia> function twopass!(A, B, src)
         # A .= exp.(src)  # with this broadcast, same time as one-pass fn above
         AppleAccelerate.exp!(A, src)
         s = zero(eltype(A))
         for j in reverse(eachindex(A,B))
           s += A[j]
           B[j] = inv(s)
         end
       end
twopass! (generic function with 1 method)

julia> @btime twopass!($A,$B,$source)
  14.667 μs (0 allocations: 0 bytes)

# Try Float32 too:
julia> source, A, B = randn(Float32, N), zeros(Float32, N), zeros(Float32, N);

julia> @btime $A .= exp.($source);
  24.709 μs (0 allocations: 0 bytes)

julia> @btime AppleAccelerate.exp!($A, $source);
  3.396 μs (0 allocations: 0 bytes)

julia> @btime twopass!($A,$B,$source)
  8.666 μs (0 allocations: 0 bytes)
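
For reference, the multi-threaded exp pass mentioned above (which gave little gain here) might look roughly like the sketch below; it assumes Julia was started with several threads (e.g. julia -t 4), and the name exp_threaded! is made up for illustration:

function exp_threaded!(A, src)
    Threads.@threads for j in eachindex(A, src)
        @inbounds A[j] = exp(src[j])
    end
    return A
end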

Maybe there is some approximation trick which can be used (e.g. using exp(x) ≈ 1+x). In any case, that would require more context on the problem.
Can you give a sample of the values which would go into source?
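
If the entries of source did turn out to be small, a hypothetical variant along those lines could simply swap the exp call for its first-order expansion (only a sketch, and only valid when exp(x) ≈ 1+x is acceptable for the application):

function rev_cumsum_exp_approx!(A, B, source)
    s = zero(eltype(A))
    for j in length(source):-1:1
        A[j] = 1 + source[j]   # first-order Taylor approximation of exp(source[j])
        s += A[j]
        B[j] = inv(s)
    end
    return nothing
end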