Make EWMA as fast as pandas

Hi

I don’t think Julia has an EWMA implementation I could easily use (there are some old packages, but they’re not maintained), so I tried to build my own based on the pandas implementation. This is what I’ve come up with:

using LinearAlgebra  # for dot, which is not in Base

function ewma(x::AbstractArray{T}; n::Int64=10, alpha::T=2.0/(n+1)) where {T<:Real}
  N = length(x)
  @assert n<N && n>0 "Argument n out of bounds."
  out = Vector{T}(undef, N)
  w = Vector{T}(undef, N)

  w = [(1-alpha)^i for i in N-1:-1:0]
  w_sum = sum(w[end-n+2:end])
  @inbounds for i in Iterators.drop(eachindex(x), n-1)
    w_sum += w[end-i+1]
    out[i] = dot(@view(w[end-i+1:end]), @view(x[1:i])) / w_sum
  end
  out[1:(n-1)] .= NaN
  return out
end

which is equivalent to the pandas call

df["col"].ewm(span=14, min_periods=14, adjust=True).mean()
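(With span=14, pandas sets alpha = 2/(span+1), which is exactly the alpha=2.0/(n+1) default above, so the matching Julia call is ewma(x; n=14).)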

My Julia code produces the same output but is over 5 times slower for an input vector of ~6000 elements (on my machine).

I was wondering: can we make my Julia code as fast as the pandas EWMA implementation? The formula pandas uses is given here (see the adjust parameter). Any advice is appreciated.
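For reference, the adjust=True formula from the pandas documentation is

$$y_t = \frac{x_t + (1-\alpha)\,x_{t-1} + (1-\alpha)^2\,x_{t-2} + \cdots + (1-\alpha)^t\,x_0}{1 + (1-\alpha) + (1-\alpha)^2 + \cdots + (1-\alpha)^t}$$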

I did try the following optimization, but it turned out to be twice as slow as the original, and I’m not sure why. Why does the line w[end - i] = w[end - i + 1] * c cause such a performance hit?

function ewma(x::AbstractArray{T}; n::Int64=10, alpha::T=2.0/(n+1)) where {T<:Real}
  N = length(x)
  @assert n<N && n>0 "Argument n out of bounds."
  out = Vector{T}(undef, N)
  w = Vector{T}(undef, N)

  c = 1-alpha
  w_sum = 1.0
  w[end] = 1.0
  @inbounds for i in Iterators.take(eachindex(x), N-1)
    w[end - i] = w[end - i + 1] * c  # c^i - why is multiplication slower than exponentiation?
    out[i] = dot(@view(w[end-i+1:end]), @view(x[1:i])) / w_sum
    w_sum += w[end-i]
  end
  out[end] = dot(w, x) / w_sum

  out[1:(n-1)] .= NaN
  return out
end
2 Likes

This might just be a typo, but note that the line w = Vector{T}(undef, N) in your first version has no effect, except possibly hurting performance, as the following comprehension overwrites w anyway.


1 Like

Not sure I have the right formula, but it checked out in one test case :stuck_out_tongue: :

function ewma3(x::Vector{Float64}; n=10, alpha=2.0/(n+1))
    N = length(x)
    c = 1-alpha
    ci = 1.0          # running power c^i
    r = 0.0           # running numerator alpha*(x[i] + c*x[i-1] + c^2*x[i-2] + ...)
    res = similar(x)
    @inbounds for i in eachindex(res)
        r = c*r + alpha*x[i]
        ci *= c
        res[i] = i < n ? NaN : r/(1-ci)  # 1-c^i = alpha*(1 + c + ... + c^(i-1))
    end
    return res
end

It is also a few times faster.
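For anyone who wants to reproduce the comparison, here is a minimal benchmarking sketch (it assumes BenchmarkTools.jl is installed; exact timings will differ per machine):

using BenchmarkTools

x = randn(6000)          # roughly the input size from the original post
@btime ewma($x; n=14);   # original weights-based version
@btime ewma3($x; n=14);  # recurrence version above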

3 Likes

Presumably this 1-c should just be alpha?

1 Like

1-c looks better (doesn’t change the benchmark)

Not sure, but an extra subtraction seems like it could decrease the accuracy a tiny bit?

Maybe… I can change it. But what would really speed things up is somehow avoiding a division on each iteration.

Maybe slightly faster: r += alpha*(x[i] - r)?

Not faster, but deffos educational: it shows r performing a ‘gradient step’ towards x[i] with learning rate alpha.
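Algebraically the two updates are identical, which is also why the gradient-step view works:

$$r + \alpha (x_i - r) = (1-\alpha)\, r + \alpha x_i = c\, r + \alpha x_i$$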

1 Like

@Dan This seems to check out for me, and it’s way faster than pandas (by at least 20 times on my machine). I’m trying to work out the maths in the general case. Did you rearrange the formula somehow? If I read the code correctly, you’re computing the following:

$$\frac{\alpha(1-\alpha)^2 x_1 + \alpha(1-\alpha) x_2 + \alpha x_3}{1-(1-\alpha)^3}$$

What’s the logic behind this improvement?

The logic goes back to the idea of an exponentially weighted moving average. The benefit of exponential weights is that the sum of all previous weights at the current step is the sum of weights at the previous step times a factor, plus a correction. So by multiplying by this factor, both the numerator and the denominator of the moving average can be updated cheaply. The denominator is a simple geometric series, for which there is a closed-form formula. The numerator correction depends on the new data element x[i].
So no trick beyond the geometric-series formula was used.
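In symbols (my notation, not from the post above): with c = 1-alpha and both running sums starting at zero, each step updates

$$N_i = c\,N_{i-1} + \alpha\,x_i, \qquad D_i = c\,D_{i-1} + \alpha = 1 - c^i$$

and the average at step i is N_i / D_i. The code keeps N_i in r and uses the geometric-series closed form 1 - c^i for the denominator directly; the common factor alpha cancels between numerator and denominator, which is why this agrees with the pandas weights.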

@Dan ah, I see: the truncation is equivalent to the formula on the pandas page. Thanks!

Here’s a version that may be slightly faster, and I think more accurate:

function ewma_4(
  x::AbstractArray{T,m},
  n::Int = 10,
  α::T = T(T(UInt8(2))/(n+1)),
  c::T = true - α,
) where {T, m}
  res = similar(x)
  num = zero(T)::T
  den = zero(T)::T
  for i ∈ Base.OneTo(length(x))
    j = i - 1
    num = fma(num, c, x[begin + j])
    den = fma(den, c, true)
    res[begin + j] = num/den
  end
  res::AbstractArray{T,m}
end
1 Like

A slight modification allows using Symbolics for verifying that the formula checks out:

function ewma_4_sym(x::AbstractArray{T,m}, c::T) where {T, m}
  res = zeros(T, size(x))
  num = zero(T)::T
  den = zero(T)::T
  for i ∈ Base.OneTo(length(x))
    j = i - 1
    num = muladd(num, c, x[begin + j])
    den = muladd(den, c, true)
    res[begin + j] = num/den
  end
  res::AbstractArray{T,m}
end
julia> include("/home/nsajko/EWMA/EWMA.jl");

julia> using Symbolics

julia> @variables x[1:6]
1-element Vector{Symbolics.Arr{Num, 1}}:
 x[1:6]

julia> @variables c
1-element Vector{Num}:
 c

julia> EWMA.ewma_4_sym(x, c)
6-element Vector{Num}:
                                                               x[1]
                                             (x[2] + c*x[1]) / (true + c)
                                  (x[3] + c*(x[2] + c*x[1])) / (true + (true + c)*c)
                       (x[4] + c*(x[3] + c*(x[2] + c*x[1]))) / (true + c*(true + (true + c)*c))
            (x[5] + c*(x[4] + c*(x[3] + c*(x[2] + c*x[1])))) / (true + c*(true + c*(true + (true + c)*c)))
 (x[6] + c*(x[5] + c*(x[4] + c*(x[3] + c*(x[2] + c*x[1]))))) / (true + c*(true + c*(true + c*(true + (true + c)*c))))
1 Like

Adding an @inbounds helps the benchmark. Also, limiting the parameter to a Vector instead of where {m} seems acceptable in this case (unless a dim parameter is added); a sketch of that follows below.
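A minimal sketch of that suggestion (the name ewma_4_vec and the AbstractFloat bound are mine, for illustration):

function ewma_4_vec(x::Vector{T}, c::T) where {T<:AbstractFloat}
  res = similar(x)
  num = zero(T)
  den = zero(T)
  @inbounds for i in eachindex(x)
    num = fma(num, c, x[i])     # num = num*c + x[i], fused
    den = fma(den, c, one(T))   # den = den*c + 1, fused
    res[i] = num / den
  end
  return res
end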

1 Like

On my machine it gets about another 2x faster if you use fma(num, c, x[i]) instead of num*c + x[i], and the same for den. (fma(a, b, c) computes a*b + c with a single rounding; on hardware with FMA support it compiles to a single instruction, which is where the speedup comes from.)

4 Likes

Already made the edit a few minutes ago :grin:

2 Likes

What’s the purpose of this type assertion, res::AbstractArray{T,m}? Particularly to an abstract type?

Unfortunately I can’t test right now. But how much faster is the fastest alternative relative to this simpler version:

function ewma5(x::AbstractArray{T}, c::T) where {T}
  res = zeros(T, size(x))
  num = zero(T)
  den = zero(T)
  for i ∈ eachindex(x)
    j = i - 1
    num = num * c + x[begin + j]
    den = den * c + 1
    res[begin + j] = num / den
  end
  return res
end

And to this one with @turbo (or @inbounds @simd) in the loop?

I’m asking because I feel that these excessive optimizations (although cool) scare new users and give the impression that proper performance in Julia is only achieved with these tricks.

2 Likes

Amazing! I started off at ~5 ms with my crappy algo, and it now runs in under 5 µs.

3 Likes