Speeding up evaluation of small exponential-like series

I have an application where the following function to evaluate an exponential-like series is called many times for different p vectors (the last two inputs are fixed), and takes a good portion of the running time. I was wondering if anyone had suggestions for speeding up this function. Note that the number of terms in the series / length of p is fixed.

We’ve played around with using static vectors without any noticeable speedup, but I can’t say that we used them optimally. I can see several more potential optimizations, like not evaluating the exponential terms when p[i]*t is small enough, or pre-summing some of the coefficients to reduce the number of subtractions (sketched after the function below), but any other performance optimization suggestions would be much appreciated!

function expseries!(f, p, times, tswitch)
    i = 1
    @inbounds while times[i] <= tswitch
        t = times[i]
        f[i] = p[1] * (exp(p[2] * t)-1.0) +
            p[3] * (exp(p[4] * t)-1.0) +
            p[5] * (exp(p[6] * t)-1.0) +
            p[7] * (exp(p[8] * t)-1.0) +
            p[9] * t
        i += 1
    end

    switchval = p[1] * (exp(p[2] * tswitch)-1.0) +
            p[3] * (exp(p[4] * tswitch)-1.0) +
            p[5] * (exp(p[6] * tswitch)-1.0) +
            p[7] * (exp(p[8] * tswitch)-1.0) +
            p[9] * tswitch

    @inbounds while i <= length(times)
        t = times[i] - tswitch
        f[i] = p[10] * (exp(p[11] * t)-1.0) +
            p[12] * (exp(p[13] * t)-1.0) +
            p[14] * (exp(p[15] * t)-1.0) +
            p[16] * (exp(p[17] * t)-1.0) +
            p[18] * t + switchval
        i += 1
    end
    nothing
end
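
For concreteness, here is the pre-summing rearrangement mentioned above (just the algebra, an untested sketch):

# Each regime expands as
#   sum(p[2k-1] * (exp(p[2k]*t) - 1)) == sum(p[2k-1] * exp(p[2k]*t)) - sum(p[2k-1]),
# so the constant coefficient sum can be hoisted out of the time loop:
csum = p[1] + p[3] + p[5] + p[7]  # fixed for a given p
# inside the loop:
# f[i] = p[1]*exp(p[2]*t) + p[3]*exp(p[4]*t) +
#        p[5]*exp(p[6]*t) + p[7]*exp(p[8]*t) - csum + p[9]*t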

# example, representative data
times = collect(0.0:600.0)
expdata = zeros(length(times))  # 0.0:600.0 has 601 points
tswitch = 150.0
p = [-0.0023646887665542005, -0.17971576983395993, -0.00013140346280833014,
     0.022029709829507547, -0.01776402710248875, -0.06784184880837002,
     -0.535131802358805, -0.03029362682095963, 8.469962666009867e-5,
     0.000578370769602594, -0.17205707686147037, 0.19152866595531834,
     -0.006165370666034151, -1.5806598078316747e-8,  0.025194265785163362,
     0.06808509404148284, -0.01509240208654776, -0.00011611542785341768]

using Plots
expseries!(expdata, p, times, tswitch)
plot(times, expdata)

Note the times vector and tswitch are representative of common values we use for them.

Edit: there are a ton of really great suggestions below, but it appears Discourse only allows choosing one “solution”. For that reason I’m not going to pick a particular one, as there are a bunch that are really helpful. Thanks to everyone who has responded – this has been a great resource for me!

This is not performance-related, but you might be able to improve precision by switching to the expm1 function:

https://docs.julialang.org/en/v1/base/math/#Base.expm1
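
For small x, the explicit subtraction cancels the leading 1 and loses most of the significant digits; a quick illustration:

# exp(x) == 1 + x + ..., so exp(x) - 1 cancels the leading 1 when x is tiny;
# expm1 computes the same quantity without the cancellation.
x = 1e-12
exp(x) - 1  # only a few significant digits survive the cancellation
expm1(x)    # accurate to full Float64 precision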


Yes, we originally used that function but found it was much slower than exp(x)-1. In this case we don’t care about the accuracy enough for the slowdown to be worthwhile. But thanks for the suggestion!

In that case, you might want to try using Float32 and/or @fastmath, both of which will give you some speedup in exchange for accuracy loss.
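
A minimal sketch of both, assuming the accuracy tradeoff is acceptable (the Float32 caveat about literals is my addition):

# @fastmath swaps in a faster, less accurate exp and relaxes IEEE
# semantics on the surrounding arithmetic:
term_fast(a, b, t) = @fastmath a * (exp(b * t) - 1.0)

# For Float32, convert the inputs once up front; note that the 1.0
# literals in expseries! would promote intermediates back to Float64,
# so use 1.0f0 (or one(t)) inside the kernel to stay in Float32:
p32     = Float32.(p)
times32 = Float32.(times)
f32     = zeros(Float32, length(times32))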


Yes, good suggestions, thanks!

I suggest extracting the expensive computation into its own function, so that you can optimize it more directly.

I used expm1 and saw a 20% slowdown, as you indicated. I see that this was already discussed. You can revert it to exp(x)-1 for more speed.

function pexpsum(p, t)
    s = p[1] * expm1(p[2] * t) +
        p[3] * expm1(p[4] * t) +
        p[5] * expm1(p[6] * t) +
        p[7] * expm1(p[8] * t) +
        p[9] * t
    return s
end

# pt1 = ntuple(i -> p[i], Val(9))
# pt2 = ntuple(i -> p[i+9], Val(9))
# f[i] = pexpsum(pt1, times[i]) # use before tswitch
# f[i] = pexpsum(pt2, times[i] - tswitch) + switchval # use after tswitch

Others have already suggested using Float32 precision or finding/writing a custom exp with less precision (i.e., @fastmath) for some additional corner-cutting.

The suggestion I’ll add is that you attempt a vectorized calculation of the 4 exp/expm1 in this kernel. SIMD.jl alleges to have a vectorized exp, although I have no experience with that package and can’t say how much it improves performance.
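
Something along these lines, perhaps (untested, and assuming SIMD.jl’s exp overload behaves as advertised; p and t as in the OP):

using SIMD

# Pack the four coefficients and rates into SIMD vectors so a single
# vectorized exp evaluates all four exponential terms at once:
coeffs = Vec{4,Float64}((p[1], p[3], p[5], p[7]))
rates  = Vec{4,Float64}((p[2], p[4], p[6], p[8]))
f_i = sum(coeffs * (exp(rates * t) - 1.0)) + p[9] * t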

If that and other packages don’t work, if you rolled up your sleeves you could probably write your own version of exp/expm1 that operates on tuples and compiles to make heavy use of SIMD. Taking a glance at the source code in Base, this looks relatively doable. The main work is masking a bunch of operations to address the branches in the original without actual branching (which would usually prevent SIMD).


Yeah, I was secretly hoping someone could point me to a simple way to get this to SIMD – I had also seen the claim of a SIMD exp in SIMD.jl but have no experience writing such code so wanted to first see if someone knew an easy way to get it to work.

Also, several of the main users of the code are on M-series Macs, and I don’t have a good feeling for if SIMD tooling does much there currently. Does anyone know if libraries like SIMD.jl give any benefit on the ARM chips?

Actually, I got a big improvement to tuple performance with just the following, where I (presumably) removed a few branches:

# compare to @edit Base.Math.exp_impl_fast(1.0, Val{:ℯ}())

function custom_exp_impl_fast(x::Float64, base)
    T = Float64
    N_float = muladd(x, Base.Math.LogBo256INV(base, T), Base.Math.MAGIC_ROUND_CONST(T))
    N = reinterpret(UInt64, N_float) % Int32
    N_float = N_float - Base.Math.MAGIC_ROUND_CONST(T) #N_float now equals round(x*LogBo256INV(base, T))
    r = muladd(N_float, Base.Math.LogBo256U(base, T), x)
    r = muladd(N_float, Base.Math.LogBo256L(base, T), r)
    k = N >> 8
    # I suspect the table lookup in this next line does not vectorize
    jU = reinterpret(Float64, Base.Math.JU_CONST | (@inbounds Base.Math.J_TABLE[N&255 + 1] & Base.Math.JU_MASK))
    small_part = muladd(jU, Base.Math.expm1b_kernel(base, r), jU)
    twopk = Int64(k) << 52
    y = reinterpret(T, twopk + reinterpret(Int64, small_part))
    y = ifelse(x >= Base.Math.MAX_EXP(base, T), Inf, y)
    y = ifelse(x <= -Base.Math.SUBNORM_EXP(base, T), 0.0, y)
    return y
end
julia> using BenchmarkTools

julia> x = ntuple(Float64, 4)
(1.0, 2.0, 3.0, 4.0)

julia> @btime Base.Math.exp_impl_fast.($x, Val{:ℯ}())
  15.531 ns (0 allocations: 0 bytes)
(2.718281828459045, 7.3890560989306495, 20.085536923187664, 54.598150033144236)

julia> @btime custom_exp_impl_fast.($x, Val{:ℯ}())
  7.407 ns (0 allocations: 0 bytes)
(2.718281828459045, 7.3890560989306495, 20.085536923187664, 54.598150033144236)

Comparing the @code_llvm for these 4-tuples (i.e., @code_llvm debuginfo=:none custom_exp_impl_fast.(x, Val{:ℯ}())), we see that the Base.Math one uses almost no vectorized operations, while this slight adjustment uses many.

Note that I didn’t even need to make this one operate on tuples. I’m on x86; I can’t speak to other chips, but I think most have decent SIMD facilities for basic operations like these.

If someone gets really excited about this, it might be worth thinking about a PR. The only change (as I recall) was removing the short-circuit Inf/0.0 returns and replacing them with post-operation ifelse swaps.

So there’s something to get you started. exp_impl (the non-_fast version) has a little more going on, so it would be more effort.


Very cool, thanks for the suggestions!

This is another possible optimization for the function as it is used in the OP, i.e. with discrete constant timestepping:

function expseries3!(f, p, times, tswitch)
    t = times[1]
    dt = times[2]-times[1]
    preodd = (p[1],p[3],p[5],p[7])
    preeven = (p[2],p[4],p[6],p[8])
    e1 = @. preodd*exp(preeven*t)
    e2 = @. exp(preeven*dt)
    f[1] = sum(e1 .- preodd) + p[9]*t
    i = 2
    @inbounds while times[i] <= tswitch
        t += dt
        f[i] = f[i-1] + sum(e1 .* e2 .- e1) + p[9]*dt
        e1 = @. e1 * e2
        i += 1
    end
    posteven = (p[10],p[12],p[14],p[16])
    postodd = (p[11],p[13],p[15],p[17])
    switchval = sum(preodd .* (exp.(preeven .* tswitch) .- 1.0)) +
                p[9] * tswitch
    t = t - tswitch
    e1 = @. posteven*exp(postodd*t)
    e2 = @. exp(postodd*dt)
    f[i] = sum(e1 .- posteven) + p[18]*t + switchval
    @inbounds while i <= length(times)
        t += dt
        f[i] = f[i-1] + sum(e1 .* e2 .- e1) + p[18]*dt
        e1 = @. e1 * e2
        i += 1
    end
    nothing
end

Basically, the idea is to compute, once per term, the constant factor that advances the process by one step, and then just multiply (faster!) by that factor at each step instead of calling exp again.
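
Spelled out with the same names as the code above:

# For a fixed step dt, exp(b*(t + dt)) == exp(b*t) * exp(b*dt). With
#   e1 = coefficient * exp(rate * t)   (one entry per term)
#   e2 = exp(rate * dt)                (computed once, before the loop)
# each step needs only e1 = e1 .* e2, and the increment of the series is
#   sum(e1 .* e2 .- e1), i.e. the sum over terms of a*(exp(b*(t+dt)) - exp(b*t)).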

Specifically, with the code above:

julia> expdata3 = similar(expdata);

julia> expseries!(expdata, p, times, tswitch)

julia> expseries3!(expdata3, p, times, tswitch)

julia> expdata3 ≈ expdata
true

julia> @btime expseries!($expdata3, $p, $times, $tswitch)
  7.054 μs (0 allocations: 0 bytes)

julia> @btime expseries3!($expdata3, $p, $times, $tswitch)
  897.250 ns (0 allocations: 0 bytes)

which shows an ~8X improvement. This method is orthogonal to some of the other methods in this thread (e.g., using Float32). The downside might be some loss of accuracy (but the OP example didn’t seem to be derailed by this).


Another alternative, based on tuples (or StaticArrays) plus LoopVectorization, and sticking to the original proposal:

using LoopVectorization

function expseries_alternative!(f, p, times, tswitch)
    i = 1

    @inbounds while times[i] ≤ tswitch
        t     = times[i]
        f[i]  = foo1(p,t)
        i    += 1
    end

    switchval = foo1(p, tswitch)
 
    @inbounds while i ≤ length(times)
        t     = times[i] - tswitch
        f[i]  = foo2(p,t, switchval)            
        i    += 1
    end
    
    nothing
end

@inline function foo1(p, t)
    output = p[9] * t
    @turbo for i in 1:2:8
        output += p[i] * (exp(p[i+1] * t) - 1.0)
    end    

    return output
end

@inline function foo2(p, t, switchval)
    output = p[18] * t + switchval

    @turbo for i in 10:2:17
        output += p[i] * (exp(p[i+1] * t) - 1.0)
    end    

    return output
end

On my machine:
original → 6.525 μs
my alternative → 2.440 μs


Look into whether the Intel Vector Math Library can help you.
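
For illustration, a minimal, untested sketch via the IntelVectorMath.jl wrapper (the IVM calls here are my assumption of its API; x86/MKL only):

using IntelVectorMath  # exposes the IVM module

# Batch-evaluate one rate's exponential over all the times at once:
args = p[2] .* times
IVM.exp!(args)  # in-place: replaces each element with exp(element)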


Does p vary? Could it be treated as a constant? How do you compute the values of the elements of p?

SIMD implementations of common math functions are available through SIMDMathFunctions.jl:

using SIMD, SIMDMathFunctions
using BenchmarkTools

function pexpsumv(p, t)
    # gather the coefficients, rates, and linear term into SIMD lanes
    @inbounds p1, p2, r = SIMD.Vec(p[1], p[3], p[5], p[7]), SIMD.Vec(p[2], p[4], p[6], p[8]), p[9]
#   @inbounds p1, p2, r = SIMD.Vec(p[1], p[2], p[3], p[4]), SIMD.Vec(p[5], p[6], p[7], p[8]), p[9]
    pexpsumv(p1, p2, r, t)
end
pexpsumv(p, q, r, t) = muladd(r, t, sum(p * (exp(q * t) - 1)))

p = randn(9)
t = 1.0  # example time point (t was left undefined in the original snippet)
@btime pexpsumv($p, $t)

Thanks! While we don’t always have uniformly spaced times, it is the most common case we deal with, and this is a great suggestion for that case!

Thanks! I had known about LoopVectorization, but was hesitant to use it given the statements in the README about lack of maintenance going forward and type stability issues on 1.10. However, this is a really nice and compact solution using it.


This is probably worth looking into for our server workflows, thanks. (Often this code is being run on M-series Macs, so I imagine MKL-based solutions won’t be usable there, but our servers are all Intel.)

p varies almost every time we call this function from (currently) among something like 1,000,000 values (this may increase in the future). The values of p are pre-calculated outside of this application from fitting the exponential series to time-series data for known physical parameter values. We essentially have a map from 1,000,000 physical parameter sets to the 1,000,000 p vectors representing exponential series that best fit each original timeseries.


Thanks for pointing out that library – I hadn’t seen it before and it looks very relevant here too!

In my experience, nothing beats LoopVectorization. It handles SIMD and a lot of optimizations automatically, while requiring minimal changes to the code.

Just in case, note that LoopVectorization will be deprecated, but only because it’ll be replaced by an improved version (see the deprecation notice in its README).

The issue is that there’ll be a gap until it’s released.
