How to improve performance of simple averaging function

Dear All

I have a program that is heavely based on functions like the my_function! below.

Is there a way to further improve speed and reduce memory allocation so as to make my_function! as efficient as possible?

using Statistics

input = rand(1000)
output = zeros(1000)

function my_function!( output::Array{Float64,1} , input::Array{Float64,1} , n::Int64 )
		@views @fastmath @inbounds for i=n:length(input)
			output[i] = Statistics.mean(input[i-n+1:i])
		end
	return output
end

function test_speed(input::Array{Float64,1},output::Array{Float64,1},n::Int64)
	@time output = my_function!( output , input , n )
end

When I run

test_speed(input,output,3)

I get

0.000018 seconds (998 allocations: 46.781 KiB)

Thank you very much
Francesco

Your code looks pretty good, although you should check out BenchmarkTools.jl for more accurate timing than just @time. I also think that @fastmath is probably not doing anything in your case, since the only calculations itā€™s affecting are the integer indices i-n+1.

However, there is an important algorithmic trick that you are currently missing. To compute a rolling average, itā€™s not necessary to separately compute the mean of each range. Instead, you can maintain a running total and just add and subtract one element at each iteration. In other words, you have a value x, starting at 0. At each iteration, you add input[i] to x and subtract input[i - n], then you have output[i] = x / n.

That should save you about a factor of n in your computation time.

7 Likes

Thank you! In the code below I have modified my_function into my_function1 as you suggested. It is much faster and it does not allocate memory.

using Statistics, BenchmarkTools

input = rand(1000)
output = zeros(1000)
output1 = zeros(1000)

function my_function!( output::Array{Float64,1} , input::Array{Float64,1} , n::Int64 )
		@views @fastmath @inbounds for i=n:length(input)
			output[i] = Statistics.mean(input[i-n+1:i])
		end
	return output
end

function my_function1!( output::Array{Float64,1} , input::Array{Float64,1} , n::Int64 )

		# this puts in output[n] the value mean(input[1:n])
		output[n] = 0.0		# reset/initialize 
		@views @fastmath @inbounds for i=1:n
			output[n] += input[i]
		end
		output[n] = output[n]/n
		
		# this puts in output[i] the value mean(input[i-n+1:i])
		@views @fastmath @inbounds for i=n:length(input)-1
			output[i+1] = output[i] - (input[i-n+1]-input[i+1])/n
		end
		
	return output
end


@btime output = my_function!( $output , $input , 300 )
@btime output1 = my_function1!( $output1 , $input , 300 )

Running the above code I get:

35.495 Ī¼s (701 allocations: 32.86 KiB)    # my_function
1.497 Ī¼s (0 allocations: 0 bytes)         # my_function1

I imagine this is close to the optimum in terms of execution time.

Thank you
Best
Francesco

1 Like

@views has no effect on code that uses simple scalar indices (no slicing). I donā€™t think @fastmath helps here, either, unless it allows the 1/n factor to be hoisted from your second loop? @simd might help for this kind of loop, as might storing output[n] in a temporary variable like s = zero(eltype(input)) since I donā€™t think the compiler can put output[n] in a register (especially with @simd) even though you are using it over and over.

The type declarations of your arguments donā€™t help performance, and are overly stringent for correctness. I would do something like output::AbstractVector, input::AbstractVector, n::Integer so that it supports any type with the requisite operations (thatā€™s also why I suggested zero(eltype(input)) rather than 0.0 above). Function argument types are a filter saying for what types the method works, not a performance hint ā€” when you call the function, the compiler specializes the compiled code for whatever argument types you actually pass.

If you are using @inbounds, then for safety you should do a bounds check at the beginning of the function, before the loops, e.g.:

@boundscheck checkbounds(input, 1:n)
@boundscheck checkbounds(output, n:length(input))

You should beware that the floating-point roundoff errors for this algorithm will accumulate as the length of your input grows, however. In particular, if you look closely it turns out that what you are doing is exactly equivalent to a special case of the sliding DFT for the k=0 ā€œDCā€ Fourier component (which is your windowed sum of the inputs, not including your 1/n scale factor). As your window slides along your data, the roundoff errors grow, and it was analyzed rigorously in

In particular, if Iā€™m reading this paper correctly, if L = length(input), then your root-mean-square relative error is expected to grow as O(āˆšL) (theorem 4.2 in the paper), which could be problematic if L is large. Caveat emptor.

7 Likes

Thank you very much for all the indications, there is a significant improvement (mostly thanks to the @simd):

using Statistics, BenchmarkTools

input = 100*rand(1000)
output = zeros(1000)
output1 = zeros(1000)

function my_function!( output::Array{Float64,1} , input::Array{Float64,1} , n::Int64 )
		@views @fastmath @inbounds for i=n:length(input)
			output[i] = Statistics.mean(input[i-n+1:i])
		end
	return output
end

function my_function1!( output::AbstractVector, input::AbstractVector , n::Integer )

		@boundscheck checkbounds(input, 1:n)
		@boundscheck checkbounds(output, n:length(input))

		# this puts in output[n] the value mean(input[1:n])
		s = zero(eltype(input))
		@inbounds @simd for i=1:n
			s += input[i]
		end
		output[n] = s/n
		
		# this puts in output[i] the value mean(input[i-n+1:i])
		@fastmath @inbounds for i=n:length(input)-1
			output[i+1] = output[i] - (input[i-n+1]-input[i+1])/n
		end
		
	return output
end


@btime output = my_function!( $output , $input , 300 )
@btime output1 = my_function1!( $output1 , $input , 300 )

maximum(abs.(output-output1))

I get:

  35.067 Ī¼s (701 allocations: 32.86 KiB)  # my_function
  1.240 Ī¼s (0 allocations: 0 bytes)       # my_function1
7.815970093361102e-14

I will check the paper: there is indeed a roundoff error. At the moment its impact seems tolerable in my application.

Thank you