# Make EWMA as fast as pandas

Hi

I don’t think Julia has a EWMA implementation I could easily use (there are some old packages but they’re not maintained), so I tried to build my own based on the pandas implementation. This is what I’ve come up with

function ewma(x::AbstractArray{T}; n::Int64=10, alpha::T=2.0/(n+1)) where {T<:Real}
N = length(x)
@assert n<N && n>0 "Argument n out of bounds."
out = Vector{T}(undef, N)
w = Vector{T}(undef, N)

w = [(1-alpha)^i for i in N-1:-1:0]
w_sum = sum(w[end-n+2:end])
@inbounds for i in Iterators.drop(eachindex(x), n-1)
w_sum += w[end-i+1]
out[i] = dot(@view(w[end-i+1:end]), @view(x[1:i])) / w_sum
end
out[1:(n-1)] .= NaN
return out
end


which is equivalent to the pandas call

My julia code produces the same output but is over 5 times slower for an input vector of size 6000ish (on my machine).

I was wondering can we make my julia code as fast as the pandas ewma implementation? The formula for how pandas implements ewma is given here (see the adjust parameter). Advice appreciated

I did try the following optimization but that turned out to be twice as slow as the original, not sure why. Why does the line w[end - i] = w[end - i + 1] * c cause such a performance hit?

function ewma(x::AbstractArray{T}; n::Int64=10, alpha::T=2.0/(n+1)) where {T<:Real}
N = length(x)
@assert n<N && n>0 "Argument n out of bounds."
out = Vector{T}(undef, N)
w = Vector{T}(undef, N)

c = 1-alpha
w_sum = 1.0
w[end] = 1.0
@inbounds for i in Iterators.take(eachindex(x), N-1)
w[end - i] = w[end - i + 1] * c  # c^i - why is multiplication slower than exponentiation?
out[i] = dot(@view(w[end-i+1:end]), @view(x[1:i])) / w_sum
w_sum += w[end-i]
end
out[end] = dot(w, x) / w_sum

out[1:(n-1)] .= NaN
return out
end

2 Likes

This might just be a typo, but note that the first line has no effect, except possibly hurting performance, as the following line overwrites w anyway.

EDIT: deleted lots of irrelevant content.

1 Like

Not sure I have the right formula, but it checked out in one test-case :

function ewma3(x::Vector{Float64}; n=10, alpha=2.0/(n+1))
N = length(x)
c = 1-alpha
ci = 1.0
r = 0.0
res = similar(x)
@inbounds for i in eachindex(res)
r = c*r + alpha*x[i]
ci *= c
res[i] = i < n ? NaN : r/(1-ci)
end
return res
end


It is also a few times faster.

3 Likes

Presumably this should just be alpha?

1 Like

1-c looks better (doesn’t change benchmark)

Not sure, but an extra subtraction seems like it could decrease the accuracy a tiny bit?

Maybe… I can change it. But what would really speed things up is somehow avoiding a division each iteration

Maybe slightly faster: r += alpha*(x[i] - r)?

Not faster, but deffos educational: it shows r performing a ‘gradient step’ towards x[i] with learning rate alpha.

1 Like

@Dan This seems to check out for me and it’s way faster than pandas (by at least 20 times on my machine). I’m trying to work out the Maths in the general case. Did you rearrange the formula somehow? If I read the code correctly you’re computing the following

$$\frac{\alpha (1-\alpha )^2 x_1+\alpha (1-\alpha ) x_2+\alpha x_3}{1-(1-\alpha )^3}$$

What’s the logic behind this improvement?

The logic goes back to the idea of Exponentially Weighted Moving Average. The benefit of exponential weights is that the sum of all previous weights in the current step, is the sum of weights in the previous step times a factor plus a correction. So by multiplying by this factor, both numerator and denominator of the moving average can be calculated simply. The denominator is a simple geometric series, for which there is a formula. The numerator correction depends on the new data element x[i].
So no trick beyond the geometric series formula was used.

@Dan ah I see the truncation is equivalent to the formula on the pandas page. Thanks

Here’s a version that may be slightly faster, and I think more accurate:

function ewma_4(
x::AbstractArray{T,m},
n::Int = 10,
α::T = T(T(UInt8(2))/(n+1)),
c::T = true - α,
) where {T, m}
res = similar(x)
num = zero(T)::T
den = zero(T)::T
for i ∈ Base.OneTo(length(x))
j = i - 1
num = fma(num, c, x[begin + j])
den = fma(den, c, true)
res[begin + j] = num/den
end
res::AbstractArray{T,m}
end

1 Like

A slight modification allows using Symbolics for verifying that the formula checks out:

function ewma_4_sym(x::AbstractArray{T,m}, c::T) where {T, m}
res = zeros(T, size(x))
num = zero(T)::T
den = zero(T)::T
for i ∈ Base.OneTo(length(x))
j = i - 1
num = muladd(num, c, x[begin + j])
res[begin + j] = num/den
end
res::AbstractArray{T,m}
end

julia> include("/home/nsajko/EWMA/EWMA.jl");

julia> using Symbolics

julia> @variables x[1:6]
1-element Vector{Symbolics.Arr{Num, 1}}:
x[1:6]

julia> @variables c
1-element Vector{Num}:
c

julia> EWMA.ewma_4_sym(x, c)
6-element Vector{Num}:
x
(x + c*x) / (true + c)
(x + c*(x + c*x)) / (true + (true + c)*c)
(x + c*(x + c*(x + c*x))) / (true + c*(true + (true + c)*c))
(x + c*(x + c*(x + c*(x + c*x)))) / (true + c*(true + c*(true + (true + c)*c)))
(x + c*(x + c*(x + c*(x + c*(x + c*x))))) / (true + c*(true + c*(true + c*(true + (true + c)*c))))

1 Like

Adding an @inbounds helps the benchmark. Also, limiting parameter to a Vector instead of where {m} seems acceptable in this case (unless a dim parameter is added).

1 Like

on my machine gets about another 2x as fast if you use fma(num, c, x[i]) instead of num*c + x[i] and same for den

4 Likes

Already made the edit a few minutes ago 2 Likes

What’s the purpose of this type conversion? Particularly to an abstract type?

Unfortunately I can’t test now. But how much faster is the fastest alternative relative to the simpler:

function ewma5(x::AbstractArray{T}, c::T) where {T}
res = zeros(T, size(x))
num = zero(T)
den = zero(T)
for i ∈ eachindex(x)
j = i - 1
num += c * x[begin + j]
den += c
res[begin + j] = num / den
end
return res
end


And to this one with @turbo (or @inbounds @simd) in the loop?

I’m asking because I feel that these excessive optimizations (although cool) scary new users and cause an impression that proper performance in Julia is only achieved with these tricks.

2 Likes

amazing! started off at ~5ms with my crappy algo, and this is now runs under 5µs.

3 Likes