 How to improve performance of simple averaging function

Dear All

I have a program that is heavely based on functions like the `my_function!` below.

Is there a way to further improve speed and reduce memory allocation so as to make `my_function!` as efficient as possible?

``````using Statistics

input = rand(1000)
output = zeros(1000)

function my_function!( output::Array{Float64,1} , input::Array{Float64,1} , n::Int64 )
@views @fastmath @inbounds for i=n:length(input)
output[i] = Statistics.mean(input[i-n+1:i])
end
return output
end

function test_speed(input::Array{Float64,1},output::Array{Float64,1},n::Int64)
@time output = my_function!( output , input , n )
end
``````

When I run

`test_speed(input,output,3)`

I get

`0.000018 seconds (998 allocations: 46.781 KiB)`

Thank you very much
Francesco

Your code looks pretty good, although you should check out BenchmarkTools.jl for more accurate timing than just `@time`. I also think that `@fastmath` is probably not doing anything in your case, since the only calculations it’s affecting are the integer indices `i-n+1`.

However, there is an important algorithmic trick that you are currently missing. To compute a rolling average, it’s not necessary to separately compute the mean of each range. Instead, you can maintain a running total and just add and subtract one element at each iteration. In other words, you have a value `x`, starting at 0. At each iteration, you add `input[i]` to `x` and subtract `input[i - n]`, then you have `output[i] = x / n`.

That should save you about a factor of `n` in your computation time.

7 Likes

Thank you! In the code below I have modified `my_function` into `my_function1` as you suggested. It is much faster and it does not allocate memory.

``````using Statistics, BenchmarkTools

input = rand(1000)
output = zeros(1000)
output1 = zeros(1000)

function my_function!( output::Array{Float64,1} , input::Array{Float64,1} , n::Int64 )
@views @fastmath @inbounds for i=n:length(input)
output[i] = Statistics.mean(input[i-n+1:i])
end
return output
end

function my_function1!( output::Array{Float64,1} , input::Array{Float64,1} , n::Int64 )

# this puts in output[n] the value mean(input[1:n])
output[n] = 0.0		# reset/initialize
@views @fastmath @inbounds for i=1:n
output[n] += input[i]
end
output[n] = output[n]/n

# this puts in output[i] the value mean(input[i-n+1:i])
@views @fastmath @inbounds for i=n:length(input)-1
output[i+1] = output[i] - (input[i-n+1]-input[i+1])/n
end

return output
end

@btime output = my_function!( \$output , \$input , 300 )
@btime output1 = my_function1!( \$output1 , \$input , 300 )
``````

Running the above code I get:

``````35.495 μs (701 allocations: 32.86 KiB)    # my_function
1.497 μs (0 allocations: 0 bytes)         # my_function1
``````

I imagine this is close to the optimum in terms of execution time.

Thank you
Best
Francesco

1 Like

`@views` has no effect on code that uses simple scalar indices (no slicing). I don’t think `@fastmath` helps here, either, unless it allows the `1/n` factor to be hoisted from your second loop? `@simd` might help for this kind of loop, as might storing `output[n]` in a temporary variable like `s = zero(eltype(input))` since I don’t think the compiler can put `output[n]` in a register (especially with `@simd`) even though you are using it over and over.

The type declarations of your arguments don’t help performance, and are overly stringent for correctness. I would do something like `output::AbstractVector, input::AbstractVector, n::Integer` so that it supports any type with the requisite operations (that’s also why I suggested `zero(eltype(input))` rather than `0.0` above). Function argument types are a filter saying for what types the method works, not a performance hint — when you call the function, the compiler specializes the compiled code for whatever argument types you actually pass.

If you are using `@inbounds`, then for safety you should do a bounds check at the beginning of the function, before the loops, e.g.:

``````@boundscheck checkbounds(input, 1:n)
@boundscheck checkbounds(output, n:length(input))
``````

You should beware that the floating-point roundoff errors for this algorithm will accumulate as the length of your input grows, however. In particular, if you look closely it turns out that what you are doing is exactly equivalent to a special case of the sliding DFT for the k=0 “DC” Fourier component (which is your windowed sum of the inputs, not including your `1/n` scale factor). As your window slides along your data, the roundoff errors grow, and it was analyzed rigorously in

In particular, if I’m reading this paper correctly, if `L = length(input)`, then your root-mean-square relative error is expected to grow as O(√`L`) (theorem 4.2 in the paper), which could be problematic if `L` is large. Caveat emptor.

7 Likes

Thank you very much for all the indications, there is a significant improvement (mostly thanks to the `@simd`):

``````using Statistics, BenchmarkTools

input = 100*rand(1000)
output = zeros(1000)
output1 = zeros(1000)

function my_function!( output::Array{Float64,1} , input::Array{Float64,1} , n::Int64 )
@views @fastmath @inbounds for i=n:length(input)
output[i] = Statistics.mean(input[i-n+1:i])
end
return output
end

function my_function1!( output::AbstractVector, input::AbstractVector , n::Integer )

@boundscheck checkbounds(input, 1:n)
@boundscheck checkbounds(output, n:length(input))

# this puts in output[n] the value mean(input[1:n])
s = zero(eltype(input))
@inbounds @simd for i=1:n
s += input[i]
end
output[n] = s/n

# this puts in output[i] the value mean(input[i-n+1:i])
@fastmath @inbounds for i=n:length(input)-1
output[i+1] = output[i] - (input[i-n+1]-input[i+1])/n
end

return output
end

@btime output = my_function!( \$output , \$input , 300 )
@btime output1 = my_function1!( \$output1 , \$input , 300 )

maximum(abs.(output-output1))
``````

I get:

``````  35.067 μs (701 allocations: 32.86 KiB)  # my_function
1.240 μs (0 allocations: 0 bytes)       # my_function1
7.815970093361102e-14
``````

I will check the paper: there is indeed a roundoff error. At the moment its impact seems tolerable in my application.

Thank you