I don’t think Julia has a EWMA implementation I could easily use (there are some old packages but they’re not maintained), so I tried to build my own based on the pandas implementation. This is what I’ve come up with

function ewma(x::AbstractArray{T}; n::Int64=10, alpha::T=2.0/(n+1)) where {T<:Real}
N = length(x)
@assert n<N && n>0 "Argument n out of bounds."
out = Vector{T}(undef, N)
w = Vector{T}(undef, N)
w = [(1-alpha)^i for i in N-1:-1:0]
w_sum = sum(w[end-n+2:end])
@inbounds for i in Iterators.drop(eachindex(x), n-1)
w_sum += w[end-i+1]
out[i] = dot(@view(w[end-i+1:end]), @view(x[1:i])) / w_sum
end
out[1:(n-1)] .= NaN
return out
end

My julia code produces the same output but is over 5 times slower for an input vector of size 6000ish (on my machine).

I was wondering can we make my julia code as fast as the pandas ewma implementation? The formula for how pandas implements ewma is given here (see the adjust parameter). Advice appreciated

I did try the following optimization but that turned out to be twice as slow as the original, not sure why. Why does the line w[end - i] = w[end - i + 1] * c cause such a performance hit?

function ewma(x::AbstractArray{T}; n::Int64=10, alpha::T=2.0/(n+1)) where {T<:Real}
N = length(x)
@assert n<N && n>0 "Argument n out of bounds."
out = Vector{T}(undef, N)
w = Vector{T}(undef, N)
c = 1-alpha
w_sum = 1.0
w[end] = 1.0
@inbounds for i in Iterators.take(eachindex(x), N-1)
w[end - i] = w[end - i + 1] * c # c^i - why is multiplication slower than exponentiation?
out[i] = dot(@view(w[end-i+1:end]), @view(x[1:i])) / w_sum
w_sum += w[end-i]
end
out[end] = dot(w, x) / w_sum
out[1:(n-1)] .= NaN
return out
end

Not sure I have the right formula, but it checked out in one test-case :

function ewma3(x::Vector{Float64}; n=10, alpha=2.0/(n+1))
N = length(x)
c = 1-alpha
ci = 1.0
r = 0.0
res = similar(x)
@inbounds for i in eachindex(res)
r = c*r + alpha*x[i]
ci *= c
res[i] = i < n ? NaN : r/(1-ci)
end
return res
end

@Dan This seems to check out for me and it’s way faster than pandas (by at least 20 times on my machine). I’m trying to work out the Maths in the general case. Did you rearrange the formula somehow? If I read the code correctly you’re computing the following

The logic goes back to the idea of Exponentially Weighted Moving Average. The benefit of exponential weights is that the sum of all previous weights in the current step, is the sum of weights in the previous step times a factor plus a correction. So by multiplying by this factor, both numerator and denominator of the moving average can be calculated simply. The denominator is a simple geometric series, for which there is a formula. The numerator correction depends on the new data element x[i].
So no trick beyond the geometric series formula was used.

Here’s a version that may be slightly faster, and I think more accurate:

function ewma_4(
x::AbstractArray{T,m},
n::Int = 10,
α::T = T(T(UInt8(2))/(n+1)),
c::T = true - α,
) where {T, m}
res = similar(x)
num = zero(T)::T
den = zero(T)::T
for i ∈ Base.OneTo(length(x))
j = i - 1
num = fma(num, c, x[begin + j])
den = fma(den, c, true)
res[begin + j] = num/den
end
res::AbstractArray{T,m}
end

A slight modification allows using Symbolics for verifying that the formula checks out:

function ewma_4_sym(x::AbstractArray{T,m}, c::T) where {T, m}
res = zeros(T, size(x))
num = zero(T)::T
den = zero(T)::T
for i ∈ Base.OneTo(length(x))
j = i - 1
num = muladd(num, c, x[begin + j])
den = muladd(den, c, true)
res[begin + j] = num/den
end
res::AbstractArray{T,m}
end

Adding an @inbounds helps the benchmark. Also, limiting parameter to a Vector instead of where {m} seems acceptable in this case (unless a dim parameter is added).

Unfortunately I can’t test now. But how much faster is the fastest alternative relative to the simpler:

function ewma5(x::AbstractArray{T}, c::T) where {T}
res = zeros(T, size(x))
num = zero(T)
den = zero(T)
for i ∈ eachindex(x)
j = i - 1
num += c * x[begin + j]
den += c
res[begin + j] = num / den
end
return res
end

And to this one with @turbo (or @inbounds @simd) in the loop?

I’m asking because I feel that these excessive optimizations (although cool) scary new users and cause an impression that proper performance in Julia is only achieved with these tricks.