Make EWMA as fast as pandas

Hi

I don’t think Julia has an EWMA implementation I could easily use (there are some old packages, but they’re not maintained), so I tried to build my own based on the pandas implementation. This is what I’ve come up with:

using LinearAlgebra  # for dot, which is not in Base

function ewma(x::AbstractArray{T}; n::Int64=10, alpha::T=2.0/(n+1)) where {T<:Real}
  N = length(x)
  @assert n<N && n>0 "Argument n out of bounds."
  out = Vector{T}(undef, N)
  w = Vector{T}(undef, N)

  w = [(1-alpha)^i for i in N-1:-1:0]
  w_sum = sum(w[end-n+2:end])
  @inbounds for i in Iterators.drop(eachindex(x), n-1)
    w_sum += w[end-i+1]
    out[i] = dot(@view(w[end-i+1:end]), @view(x[1:i])) / w_sum
  end
  out[1:(n-1)] .= NaN
  return out
end

which is equivalent to the pandas call

df["col"].ewm(span=14, min_periods=14, adjust=True).mean()
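(With span=14, pandas sets alpha = 2/(span+1), which is exactly the alpha=2.0/(n+1) default above, so the matching Julia call is ewma(x; n=14).)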

My Julia code produces the same output but is over 5 times slower for an input vector of ~6000 elements (on my machine).

I was wondering: can we make my Julia code as fast as the pandas EWMA implementation? The formula pandas uses is given here (see the adjust parameter). Any advice is appreciated.
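For reference, the adjust=True formula from the pandas documentation is

$$y_t = \frac{x_t + (1-\alpha)\,x_{t-1} + (1-\alpha)^2\,x_{t-2} + \cdots + (1-\alpha)^t\,x_0}{1 + (1-\alpha) + (1-\alpha)^2 + \cdots + (1-\alpha)^t}$$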

I did try the following optimization, but it turned out to be twice as slow as the original, and I’m not sure why. Why does the line w[end - i] = w[end - i + 1] * c cause such a performance hit?

function ewma(x::AbstractArray{T}; n::Int64=10, alpha::T=2.0/(n+1)) where {T<:Real}
  N = length(x)
  @assert n<N && n>0 "Argument n out of bounds."
  out = Vector{T}(undef, N)
  w = Vector{T}(undef, N)

  c = 1-alpha
  w_sum = 1.0
  w[end] = 1.0
  @inbounds for i in Iterators.take(eachindex(x), N-1)
    w[end - i] = w[end - i + 1] * c  # c^i - why is multiplication slower than exponentiation?
    out[i] = dot(@view(w[end-i+1:end]), @view(x[1:i])) / w_sum
    w_sum += w[end-i]
  end
  out[end] = dot(w, x) / w_sum

  out[1:(n-1)] .= NaN
  return out
end
2 Likes

This might just be a typo, but note that the line w = Vector{T}(undef, N) in your first version has no effect, except possibly hurting performance, as the following comprehension overwrites w anyway.


1 Like

Not sure I have the right formula, but it checked out in one test case :stuck_out_tongue: :

function ewma3(x::Vector{Float64}; n=10, alpha=2.0/(n+1))
    N = length(x)
    c = 1-alpha
    ci = 1.0          # running power c^i
    r = 0.0           # running numerator alpha*(x[i] + c*x[i-1] + c^2*x[i-2] + ...)
    res = similar(x)
    @inbounds for i in eachindex(res)
        r = c*r + alpha*x[i]
        ci *= c
        res[i] = i < n ? NaN : r/(1-ci)  # 1-c^i = alpha*(1 + c + ... + c^(i-1))
    end
    return res
end

It is also a few times faster.
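For anyone who wants to reproduce the comparison, here is a minimal benchmarking sketch (it assumes BenchmarkTools.jl is installed; exact timings will differ per machine):

using BenchmarkTools

x = randn(6000)          # roughly the input size from the original post
@btime ewma($x; n=14);   # original weights-based version
@btime ewma3($x; n=14);  # recurrence version above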

3 Likes

Presumably this 1-c should just be alpha?

1 Like

1-c looks better (doesn’t change the benchmark)

Not sure, but an extra subtraction seems like it could decrease the accuracy a tiny bit?

Maybe… I can change it. But what would really speed things up is somehow avoiding a division on each iteration.

Maybe slightly faster: r += alpha*(x[i] - r)?

Not faster, but deffos educational: it shows r performing a ‘gradient step’ towards x[i] with learning rate alpha.
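Algebraically the two updates are identical, which is also why the gradient-step view works:

$$r + \alpha (x_i - r) = (1-\alpha)\, r + \alpha x_i = c\, r + \alpha x_i$$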

1 Like

@Dan This seems to check out for me, and it’s way faster than pandas (by at least 20 times on my machine). I’m trying to work out the maths in the general case. Did you rearrange the formula somehow? If I read the code correctly, you’re computing the following:

$$\frac{\alpha(1-\alpha)^2 x_1 + \alpha(1-\alpha) x_2 + \alpha x_3}{1-(1-\alpha)^3}$$

What’s the logic behind this improvement?

The logic goes back to the idea of an exponentially weighted moving average. The benefit of exponential weights is that the sum of all previous weights at the current step is the sum of weights at the previous step times a factor, plus a correction. So by multiplying by this factor, both the numerator and the denominator of the moving average can be updated cheaply. The denominator is a simple geometric series, for which there is a closed-form formula. The numerator correction depends on the new data element x[i].
So no trick beyond the geometric-series formula was used.
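In symbols (my notation, not from the post above): with c = 1-alpha and both running sums starting at zero, each step updates

$$N_i = c\,N_{i-1} + \alpha\,x_i, \qquad D_i = c\,D_{i-1} + \alpha = 1 - c^i$$

and the average at step i is N_i / D_i. The code keeps N_i in r and uses the geometric-series closed form 1 - c^i for the denominator directly; the common factor alpha cancels between numerator and denominator, which is why this agrees with the pandas weights.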

@Dan ah, I see: the truncation is equivalent to the formula on the pandas page. Thanks!

Here’s a version that may be slightly faster, and I think more accurate:

function ewma_4(
  x::AbstractArray{T,m},
  n::Int = 10,
  α::T = T(T(UInt8(2))/(n+1)),
  c::T = true - α,
) where {T, m}
  res = similar(x)
  num = zero(T)::T
  den = zero(T)::T
  for i ∈ Base.OneTo(length(x))
    j = i - 1
    num = fma(num, c, x[begin + j])
    den = fma(den, c, true)
    res[begin + j] = num/den
  end
  res::AbstractArray{T,m}
end
1 Like

A slight modification allows using Symbolics for verifying that the formula checks out:

function ewma_4_sym(x::AbstractArray{T,m}, c::T) where {T, m}
  res = zeros(T, size(x))
  num = zero(T)::T
  den = zero(T)::T
  for i ∈ Base.OneTo(length(x))
    j = i - 1
    num = muladd(num, c, x[begin + j])
    den = muladd(den, c, true)
    res[begin + j] = num/den
  end
  res::AbstractArray{T,m}
end
julia> include("/home/nsajko/EWMA/EWMA.jl");

julia> using Symbolics

julia> @variables x[1:6]
1-element Vector{Symbolics.Arr{Num, 1}}:
 x[1:6]

julia> @variables c
1-element Vector{Num}:
 c

julia> EWMA.ewma_4_sym(x, c)
6-element Vector{Num}:
                                                               x[1]
                                             (x[2] + c*x[1]) / (true + c)
                                  (x[3] + c*(x[2] + c*x[1])) / (true + (true + c)*c)
                       (x[4] + c*(x[3] + c*(x[2] + c*x[1]))) / (true + c*(true + (true + c)*c))
            (x[5] + c*(x[4] + c*(x[3] + c*(x[2] + c*x[1])))) / (true + c*(true + c*(true + (true + c)*c)))
 (x[6] + c*(x[5] + c*(x[4] + c*(x[3] + c*(x[2] + c*x[1]))))) / (true + c*(true + c*(true + c*(true + (true + c)*c))))
1 Like

Adding an @inbounds helps the benchmark. Also, limiting the parameter to a Vector instead of where {m} seems acceptable in this case (unless a dim parameter is added); a sketch of that follows below.
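A minimal sketch of that suggestion (the name ewma_4_vec and the AbstractFloat bound are mine, for illustration):

function ewma_4_vec(x::Vector{T}, c::T) where {T<:AbstractFloat}
  res = similar(x)
  num = zero(T)
  den = zero(T)
  @inbounds for i in eachindex(x)
    num = fma(num, c, x[i])     # num = num*c + x[i], fused
    den = fma(den, c, one(T))   # den = den*c + 1, fused
    res[i] = num / den
  end
  return res
end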

1 Like

On my machine it gets about another 2x faster if you use fma(num, c, x[i]) instead of num*c + x[i], and the same for den. (fma(a, b, c) computes a*b + c with a single rounding; on hardware with FMA support it compiles to a single instruction, which is where the speedup comes from.)

4 Likes

Already made the edit a few minutes ago :grin:

2 Likes

What’s the purpose of this type assertion, res::AbstractArray{T,m}? Particularly to an abstract type?

Unfortunately I can’t test right now. But how much faster is the fastest alternative relative to this simpler version:

function ewma5(x::AbstractArray{T}, c::T) where {T}
  res = zeros(T, size(x))
  num = zero(T)
  den = zero(T)
  for i ∈ eachindex(x)
    j = i - 1
    num = num * c + x[begin + j]
    den = den * c + 1
    res[begin + j] = num / den
  end
  return res
end

And to this one with @turbo (or @inbounds @simd) in the loop?

I’m asking because I feel that these excessive optimizations (although cool) scare new users and give the impression that proper performance in Julia is only achieved with these tricks.

2 Likes

Amazing! I started off at ~5 ms with my crappy algo, and it now runs in under 5 µs.

3 Likes