How to improve performance of simple averaging function

flibe · April 19, 2019, 11:44am

Dear All

I have a program that relies heavily on functions like the my_function! below.

Is there a way to further improve speed and reduce memory allocation so as to make my_function! as efficient as possible?

using Statistics

input = rand(1000)
output = zeros(1000)

function my_function!( output::Array{Float64,1} , input::Array{Float64,1} , n::Int64 )
		@views @fastmath @inbounds for i=n:length(input)
			output[i] = Statistics.mean(input[i-n+1:i])
		end
	return output
end

function test_speed(input::Array{Float64,1},output::Array{Float64,1},n::Int64)
	@time output = my_function!( output , input , n )
end

When I run

test_speed(input,output,3)

I get

0.000018 seconds (998 allocations: 46.781 KiB)

Thank you very much
Francesco

rdeits · April 19, 2019, 12:14pm

Your code looks pretty good, although you should check out BenchmarkTools.jl for more accurate timing than just @time. I also think that @fastmath is probably not doing anything in your case, since the only calculations it’s affecting are the integer indices i-n+1.

However, there is an important algorithmic trick that you are currently missing. To compute a rolling average, it’s not necessary to separately compute the mean of each range. Instead, you can maintain a running total and just add and subtract one element at each iteration. In other words, you have a value x, starting at 0. At each iteration, you add input[i] to x and subtract input[i - n], then you have output[i] = x / n.

That should save you about a factor of n in your computation time.
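As a minimal sketch of the running-total idea described above (the name rolling_mean! is made up for illustration, and the code assumes 1 <= n <= length(input) and that output is at least as long as input):

```julia
# Rolling mean via a running total: each element is added once and
# subtracted once, so the work no longer scales with the window size n.
# `rolling_mean!` is a hypothetical name used only for this sketch.
function rolling_mean!(output, input, n)
    x = 0.0                          # running total of the current window
    for i in 1:n                     # total of the first window, input[1:n]
        x += input[i]
    end
    output[n] = x / n
    for i in n+1:length(input)
        x += input[i] - input[i-n]   # add the newest element, drop the oldest
        output[i] = x / n
    end
    return output
end

input = collect(1.0:6.0)
output = zeros(6)
rolling_mean!(output, input, 3)      # output[3:6] now holds the 3-element window means
```

Entries output[1:n-1] are left untouched, matching the behaviour of my_function! in the original post.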

flibe · April 19, 2019, 2:09pm

Thank you! In the code below I have modified my_function into my_function1 as you suggested. It is much faster and it does not allocate memory.

using Statistics, BenchmarkTools

input = rand(1000)
output = zeros(1000)
output1 = zeros(1000)

function my_function!( output::Array{Float64,1} , input::Array{Float64,1} , n::Int64 )
		@views @fastmath @inbounds for i=n:length(input)
			output[i] = Statistics.mean(input[i-n+1:i])
		end
	return output
end

function my_function1!( output::Array{Float64,1} , input::Array{Float64,1} , n::Int64 )

		# this puts in output[n] the value mean(input[1:n])
		output[n] = 0.0		# reset/initialize 
		@views @fastmath @inbounds for i=1:n
			output[n] += input[i]
		end
		output[n] = output[n]/n
		
		# this puts in output[i] the value mean(input[i-n+1:i])
		@views @fastmath @inbounds for i=n:length(input)-1
			output[i+1] = output[i] - (input[i-n+1]-input[i+1])/n
		end
		
	return output
end


@btime output = my_function!( $output , $input , 300 )
@btime output1 = my_function1!( $output1 , $input , 300 )

Running the above code I get:

35.495 μs (701 allocations: 32.86 KiB)    # my_function
1.497 μs (0 allocations: 0 bytes)         # my_function1

I imagine this is close to the optimum in terms of execution time.

Thank you
Best
Francesco

stevengj · April 19, 2019, 3:01pm

@views has no effect on code that uses simple scalar indices (no slicing). I don’t think @fastmath helps here, either, unless it allows the 1/n factor to be hoisted from your second loop? @simd might help for this kind of loop, as might storing output[n] in a temporary variable like s = zero(eltype(input)) since I don’t think the compiler can put output[n] in a register (especially with @simd) even though you are using it over and over.

The type declarations of your arguments don’t help performance, and are overly stringent for correctness. I would do something like output::AbstractVector, input::AbstractVector, n::Integer so that it supports any type with the requisite operations (that’s also why I suggested zero(eltype(input)) rather than 0.0 above). Function argument types are a filter saying for what types the method works, not a performance hint — when you call the function, the compiler specializes the compiled code for whatever argument types you actually pass.
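To illustrate the point about loose signatures, here is a small sketch (window_sum is a hypothetical helper, not part of the code in this thread, and it assumes 1-based indexing) showing that a single method declared on AbstractVector and Integer accepts several input types, with the accumulator taking its type from the input:

```julia
# Hypothetical helper, named only for this sketch.
function window_sum(input::AbstractVector, n::Integer)
    s = zero(eltype(input))          # accumulator starts in the input's element type
    for i in 1:n
        s += input[i]
    end
    return s
end

a = window_sum(Float32[1, 2, 3, 4], 3)          # Float32 input gives a Float32 result
b = window_sum(view([1.0, 2.0, 3.0], 2:3), 2)   # a view (SubArray) is accepted too
c = window_sum(1:10, Int8(4))                   # so are a range and a non-Int64 n
```

With the original Array{Float64,1} and Int64 annotations, none of these three calls would find a matching method.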

If you are using @inbounds, then for safety you should do a bounds check at the beginning of the function, before the loops, e.g.:

@boundscheck checkbounds(input, 1:n)
@boundscheck checkbounds(output, n:length(input))
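A short sketch of how such a check might sit in front of an @inbounds loop (first_window_sum is a hypothetical name chosen for this example): the range is validated once up front, so the loop body can skip the per-element checks.

```julia
# Hypothetical helper, named only for this sketch.
function first_window_sum(input::AbstractVector, n::Integer)
    # Validate the whole index range once; throws a BoundsError if 1:n is invalid.
    @boundscheck checkbounds(input, 1:n)
    s = zero(eltype(input))
    @inbounds for i in 1:n           # safe to skip checks: 1:n was validated above
        s += input[i]
    end
    return s
end

ok = first_window_sum([1.0, 2.0, 3.0], 2)

# An out-of-range window is rejected before the unchecked loop runs.
threw = try
    first_window_sum([1.0, 2.0, 3.0], 5)
    false
catch err
    err isa BoundsError
end
```

Note that a @boundscheck block can itself be skipped when the function is inlined into a caller that uses @inbounds; in an ordinary call like the ones above it runs.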

You should beware that the floating-point roundoff errors for this algorithm will accumulate as the length of your input grows, however. In particular, if you look closely it turns out that what you are doing is exactly equivalent to a special case of the sliding DFT for the k=0 “DC” Fourier component (which is your windowed sum of the inputs, not including your 1/n scale factor). As your window slides along your data, the roundoff errors grow, and it was analyzed rigorously in

M. Tasche and H. Zeuner, “Roundoff Error Analysis of the Recursive Moving Window Discrete Fourier Transform”, Advances in Computational Mathematics vol 18, pp. 65–78 (2003).

In particular, if I’m reading this paper correctly, if L = length(input), then your root-mean-square relative error is expected to grow as O(√L) (theorem 4.2 in the paper), which could be problematic if L is large. Caveat emptor.
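One way to get a feel for the drift on your own data is to compare the recursive update against a direct recomputation of a window near the start and near the end of a long input. The following is only a sketch: recursive_mean and direct_mean are hypothetical names, the input is random, and the sizes are arbitrary choices, so the numbers will differ from run to run.

```julia
# Recursive update, same recurrence as my_function1! (hypothetical name for this sketch).
function recursive_mean(input::AbstractVector, n::Integer)
    output = zeros(float(eltype(input)), length(input))
    output[n] = sum(@view input[1:n]) / n
    for i in n:length(input)-1
        output[i+1] = output[i] - (input[i-n+1] - input[i+1]) / n
    end
    return output
end

# Direct recomputation of the window ending at index i, used as the reference.
direct_mean(input, n, i) = sum(@view input[i-n+1:i]) / n

input = rand(10^5)
n = 100
rec = recursive_mean(input, n)

# Absolute deviation from the reference at an early and a late position.
err_early = abs(rec[n+10] - direct_mean(input, n, n + 10))
err_late  = abs(rec[end]  - direct_mean(input, n, length(input)))
```

Comparing err_early with err_late for increasing input lengths gives an empirical picture of how the error behaves in a particular application.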

flibe · April 19, 2019, 9:12pm

Thank you very much for all the suggestions. There is a significant improvement (mostly thanks to @simd):

using Statistics, BenchmarkTools

input = 100*rand(1000)
output = zeros(1000)
output1 = zeros(1000)

function my_function!( output::Array{Float64,1} , input::Array{Float64,1} , n::Int64 )
		@views @fastmath @inbounds for i=n:length(input)
			output[i] = Statistics.mean(input[i-n+1:i])
		end
	return output
end

function my_function1!( output::AbstractVector, input::AbstractVector , n::Integer )

		@boundscheck checkbounds(input, 1:n)
		@boundscheck checkbounds(output, n:length(input))

		# this puts in output[n] the value mean(input[1:n])
		s = zero(eltype(input))
		@inbounds @simd for i=1:n
			s += input[i]
		end
		output[n] = s/n
		
		# this puts in output[i] the value mean(input[i-n+1:i])
		@fastmath @inbounds for i=n:length(input)-1
			output[i+1] = output[i] - (input[i-n+1]-input[i+1])/n
		end
		
	return output
end


@btime output = my_function!( $output , $input , 300 )
@btime output1 = my_function1!( $output1 , $input , 300 )

maximum(abs.(output-output1))

I get:

  35.067 μs (701 allocations: 32.86 KiB)  # my_function
  1.240 μs (0 allocations: 0 bytes)       # my_function1
7.815970093361102e-14

I will check the paper: there is indeed a roundoff error. At the moment its impact seems tolerable in my application.

Thank you
