Performance input on my code

Hey all! I have just picked up Julia as a fun language to learn and have been translating old python code both as a learning experience and to see just how much performance i can squeeze out of it. I think I have gotten all i can from just reading the performance tips, and thought it would be really beneficial to have some veteran eyes show me the way. If it is not too much to ask I am going to post some code snippets and see what can be done. These functions will be run thousands of times each so I’m looking to learn how to squeeze every last bit of speedup out of them. Any array used should be a 1D array with 10^6 → 10^8 elements. The signal varies from white noise to sinusoidal behavior. Thanks in advance for any cool tips or feedback!


# The data i have been using for benchmarking purposes
t = LinRange(0,10^5,10^7)
signal = 3sin.(2π*8*t) .+ sin.(2π*4*t) + .03*randn(length(t))
d = [11, 111, 1111, 11111]

"Pad an array by mirroring the endpoints"
function mirror(A,d::Int=3)
    w = div(d-1,2) #width of window, d is an ODD integer
    output = zeros(length(A)+2w)
    
    output[w+1:end-w] = vec(A) #center signal
    output[1:w] = reverse!(view(A,2:w+1))
    output[end-w:end]= reverse!(view(A,lastindex(A)-w-1:lastindex(A)-1))

    return output
end

"""
Fast Implementation of a minmax order statistics filter. Computes the upper and
lower envelopes of a signal.
"""
function stream_minmax!(env1::Array{Float64},env2::Array{Float64},a,w::Int64)
    upper = Int[] #buffer for indices
    lower = Int[]

    pushfirst!(upper,1)
    pushfirst!(lower,1)

    for i in 2:lastindex(a)
        if i >= w
            @views env1[i] = a[upper[1]]
            @views env2[i] = a[lower[1]]
        end
        
        if a[i] > a[i-1]
            pop!(upper)
            # remove maxima from buffer that are less than our new one
            while isempty(upper) != true && a[i] > a[upper[end]]
                pop!(upper)
            end
        elseif a[i] <= a[i-1]
            pop!(lower)
            # remove minima from buffer that are greater than our new one
            while isempty(lower) != true && a[i] < a[lower[end]]
                pop!(lower)
            end
        end
        
        push!(upper,i)
        push!(lower,i)
        if i == w + upper[1]
            popfirst!(upper)
        elseif i == w + lower[1]
            popfirst!(lower)
        end
    end
        nothing
end

"Moving average with window size d"
function moving_average(A,d::Int=3)
    ret = zeros(length(A))
    cumsum!(ret,A)
    ret = (ret[d+1:end] .- ret[1:end-d]) / d
    return ret
end

Please also include code for generating your data and calling your stuff. Some block like

julia> using BenchmarkTools
julia> v=rand(10^7);
julia> @btime foofun($v);
  5.392 ms (0 allocations: 0 bytes)

Otherwise we don’t know what to call with what data. If the statistics of your data are important for performance, then include code to generate data with appropriate statistics.

2 Likes

Included an edit to address this. Thanks!

moving_average drops the first average, is that a bug or intentional? You can fix that, and optimize it a bit, by doing this:

function moving_average(A, d::Int=3)
    cs = cumsum(A)
    T = typeof(one(eltype(A))/1)
    ret = Vector{T}(undef, length(A) - d + 1)
    ret[1] = cs[d] / d
    @inbounds for n = 1:length(ret)-1
        ret[n+1] = (cs[n+d] - cs[n]) / d
    end
    return ret
end

Timings:

julia> @btime moving_average_original($signal, 11);
  236.972 ms (10 allocations: 381.47 MiB)

julia> @btime moving_average_optimized($signal, 11)
  49.671 ms (4 allocations: 152.59 MiB)
1 Like

Cool! That worked great for me. I guess i was a little confused about the broadcasting and when those allocations occurred. Awesome solution!

julia> @btime moving_average($s,11111)
  53.817 ms (4 allocations: 152.50 MiB)

This thread made me realize my rookie mistake. I have reduced the mirror to be as optimized as i care about!

function mirror!(A::Array,d::Int=3)
    w = div(d-1,2) #width of window, d is an ODD integer
    prepend!(A,zeros(w))
    append!(A,zeros(w))
    @views A[1:w] = reverse(A[w+2:2w+1])
    @views A[end-w+1:end] = reverse(A[end-2w:end-w-1])
    return nothing
end

 @btime mirror!($s, 111)
  499.579 ns (6 allocations: 2.22 KiB)

Because the functions are being run thousands of times each, it may be worth preallocating storage, given that the sizes of the signals do not vary too much.
TO see why this is:

julia> @benchmark moving_average_optimized($signal, 11)
BenchmarkTools.Trial: 
  memory estimate:  152.59 MiB
  allocs estimate:  4
  --------------
  minimum time:     51.512 ms (1.03% GC)
  median time:      59.847 ms (11.52% GC)
  mean time:        60.327 ms (12.18% GC)
  maximum time:     102.204 ms (49.57% GC)
  --------------
  samples:          83
  evals/sample:     1

Over 10% of the time this code runs is spent on the garbage collector, on average. The fastest time, which you see via btime, is a run lucky enough not to see much garbage collector action.

If sizes of the arrays vary a lot, you could try views, but then you’d have to be careful to make sure the code still vectorizes correctly.

d, T = 11, Float64
cumulative_sum = Vector{T}(undef, length(signal));
ret = Vector{T}(undef, length(signal) - d + 1);
function moving_average_optimized!(ret, cs, A, d::Int=3)
    cumsum!(cs, A)
    invd = inv(d)
    ret[1] = cs[d] * invd
    @inbounds @simd for n = 1:length(ret)-1
        ret[n+1] = (cs[n+d] - cs[n]) * invd
    end
    return ret
end

This yields:

@benchmark moving_average_optimized!($ret, $cumulative_sum, $signal, 11)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     21.045 ms (0.00% GC)
  median time:      21.323 ms (0.00% GC)
  mean time:        21.475 ms (0.00% GC)
  maximum time:     22.514 ms (0.00% GC)
  --------------
  samples:          233
  evals/sample:     1

Also, changing the size of vectors is slow. If you know how large it is likely to be, sizehint! can make it faster, but still much slower than simply storing into an array you created of that size in the first place. Therefore, if you want to optimize for performance, it could be worth trying to figure out how to just allocate all the memory you need in one go, and track how much of the vector you’re using at that time, using that as the bound.
But that’s rather inconvenient. If that part of the code isn’t the bottleneck, you may prefer to spend your time elsewhere.

2 Likes

In this instance i am guaranteed a constant size for every signal. The window size is adaptive and thus changes, but i think i can take this into account. I hadn’t thought about garbage collection at all! Glad i posted this. I’ll see what kind of magic i can work with pre-allocation.

The cumsum can be reused to calculate the moving average for any window size, but if you’ll only use it once, you can optimize the code further:

function moving_average_no_cumsum(A, d::Int=3)
    T = typeof(one(eltype(A))/1)
    ret = Vector{T}(undef, length(A) - d + 1)
    id = 1 / d
    s = sum(view(A, 1:d))
    ret[1] = s * id
    @inbounds for n = 1:length(ret)-1
        s += A[n+d] - A[n]
        ret[n+1] = s * id
    end
    return ret
end

Timings:

julia> @btime moving_average_optimized($signal, 11)
  49.671 ms (4 allocations: 152.59 MiB)

julia> @btime moving_average_no_cumsum($signal, 11)
  28.331 ms (3 allocations: 76.29 MiB)

(Just keep in mind that numerical accuracy might be slightly sacrificed by the repeated addition/subtraction.)