Performance input on my code

jarrison · October 14, 2018, 8:25pm

Hey all! I have just picked up Julia as a fun language to learn and have been translating old Python code, both as a learning experience and to see just how much performance I can squeeze out of it. I think I have gotten all I can from reading the performance tips, and thought it would be really beneficial to have some veteran eyes show me the way. If it is not too much to ask, I am going to post some code snippets and see what can be done. These functions will be run thousands of times each, so I’m looking to learn how to squeeze every last bit of speedup out of them. Every array used is a 1D array with 10^6 to 10^8 elements. The signal varies from white noise to sinusoidal behavior. Thanks in advance for any cool tips or feedback!


# The data I have been using for benchmarking purposes
t = LinRange(0, 10^5, 10^7)
signal = 3 .* sin.(2π*8 .* t) .+ sin.(2π*4 .* t) .+ 0.03 .* randn(length(t))
d = [11, 111, 1111, 11111]

"Pad an array by mirroring the endpoints"
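function mirror(A, d::Int=3)
    w = div(d-1, 2) # half-width of the window; d is an ODD integer
    output = zeros(length(A) + 2w)

    output[w+1:end-w] = vec(A) # center signal
    # reverse (not reverse!) so the input A is left unmodified
    output[1:w] = reverse(view(A, 2:w+1))
    # the right pad covers exactly the last w slots, so the centered signal is not overwritten
    output[end-w+1:end] = reverse(view(A, lastindex(A)-w:lastindex(A)-1))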

    return output
end

"""
Fast Implementation of a minmax order statistics filter. Computes the upper and
lower envelopes of a signal.
"""
function stream_minmax!(env1::Array{Float64},env2::Array{Float64},a,w::Int64)
    upper = Int[] #buffer for indices
    lower = Int[]

    pushfirst!(upper,1)
    pushfirst!(lower,1)

    for i in 2:lastindex(a)
        if i >= w
            env1[i] = a[upper[1]] # front of each buffer indexes the current extremum
            env2[i] = a[lower[1]]
        end
        
        if a[i] > a[i-1]
            pop!(upper)
            # remove maxima from buffer that are less than our new one
            while !isempty(upper) && a[i] > a[upper[end]]
                pop!(upper)
            end
        elseif a[i] <= a[i-1]
            pop!(lower)
            # remove minima from buffer that are greater than our new one
            while !isempty(lower) && a[i] < a[lower[end]]
                pop!(lower)
            end
        end
        
        push!(upper,i)
        push!(lower,i)
        if i == w + upper[1]
            popfirst!(upper)
        elseif i == w + lower[1]
            popfirst!(lower)
        end
    end
    return nothing
end

"Moving average with window size d"
function moving_average(A,d::Int=3)
    ret = zeros(length(A))
    cumsum!(ret,A)
    ret = (ret[d+1:end] .- ret[1:end-d]) / d
    return ret
end

foobar_lv2 · October 14, 2018, 8:42pm

Please also include code for generating your data and calling your functions, something like:

julia> using BenchmarkTools
julia> v=rand(10^7);
julia> @btime foofun($v);
  5.392 ms (0 allocations: 0 bytes)

Otherwise we don’t know what to call with what data. If the statistics of your data are important for performance, then include code to generate data with appropriate statistics.

jarrison · October 14, 2018, 9:13pm

Included an edit to address this. Thanks!

bennedich · October 14, 2018, 9:36pm

moving_average drops the first average. Is that a bug or intentional? You can fix it, and optimize the function a bit, by doing this:

function moving_average(A, d::Int=3)
    cs = cumsum(A)
    T = typeof(one(eltype(A))/1)
    ret = Vector{T}(undef, length(A) - d + 1)
    ret[1] = cs[d] / d
    @inbounds for n = 1:length(ret)-1
        ret[n+1] = (cs[n+d] - cs[n]) / d
    end
    return ret
end

Timings:

julia> @btime moving_average_original($signal, 11);
  236.972 ms (10 allocations: 381.47 MiB)

julia> @btime moving_average_optimized($signal, 11)
  49.671 ms (4 allocations: 152.59 MiB)
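
To see the dropped first window concretely, here is a small check (not from the thread) that compares the original version against a slow direct reference on a five-element array. `reference_average` is a hypothetical helper introduced only for this illustration.

```julia
# Original version from the first post (it skips the window starting at index 1).
function moving_average_original(A, d::Int=3)
    ret = zeros(length(A))
    cumsum!(ret, A)
    return (ret[d+1:end] .- ret[1:end-d]) / d
end

# Hypothetical slow reference: the mean of every full window of width d.
reference_average(A, d) = [sum(A[i:i+d-1]) / d for i in 1:length(A)-d+1]

A = [1.0, 2.0, 3.0, 4.0, 5.0]
d = 3

ref  = reference_average(A, d)        # windows 1:3, 2:4, 3:5
orig = moving_average_original(A, d)  # windows 2:4, 3:5 only

@show ref    # ref = [2.0, 3.0, 4.0]
@show orig   # orig = [3.0, 4.0]
```

The original result matches the reference from the second window onward, which is why the fixed version fills ret[1] separately from cs[d].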

jarrison · October 14, 2018, 9:54pm

Cool! That worked great for me. I guess I was a little confused about the broadcasting and when those allocations occurred. Awesome solution!

julia> @btime moving_average($s,11111)
  53.817 ms (4 allocations: 152.50 MiB)

jarrison · October 14, 2018, 10:55pm

This thread made me realize my rookie mistake. I have reduced mirror to be as optimized as I care about!

function mirror!(A::Array,d::Int=3)
    w = div(d-1,2) #width of window, d is an ODD integer
    prepend!(A,zeros(w))
    append!(A,zeros(w))
    @views A[1:w] = reverse(A[w+2:2w+1])
    @views A[end-w+1:end] = reverse(A[end-2w:end-w-1])
    return nothing
end

julia> @btime mirror!($s, 111)
  499.579 ns (6 allocations: 2.22 KiB)

Elrod · October 14, 2018, 11:04pm

Because the functions are being run thousands of times each, it may be worth preallocating storage, given that the sizes of the signals do not vary too much.
To see why:

julia> @benchmark moving_average_optimized($signal, 11)
BenchmarkTools.Trial: 
  memory estimate:  152.59 MiB
  allocs estimate:  4
  --------------
  minimum time:     51.512 ms (1.03% GC)
  median time:      59.847 ms (11.52% GC)
  mean time:        60.327 ms (12.18% GC)
  maximum time:     102.204 ms (49.57% GC)
  --------------
  samples:          83
  evals/sample:     1

Over 10% of the time this code runs is spent in the garbage collector, on average. The fastest time, which is what @btime reports, comes from a run lucky enough not to see much garbage collector action.

If sizes of the arrays vary a lot, you could try views, but then you’d have to be careful to make sure the code still vectorizes correctly.

d, T = 11, Float64
cumulative_sum = Vector{T}(undef, length(signal));
ret = Vector{T}(undef, length(signal) - d + 1);
function moving_average_optimized!(ret, cs, A, d::Int=3)
    cumsum!(cs, A)
    invd = inv(d)
    ret[1] = cs[d] * invd
    @inbounds @simd for n = 1:length(ret)-1
        ret[n+1] = (cs[n+d] - cs[n]) * invd
    end
    return ret
end

This yields:

@benchmark moving_average_optimized!($ret, $cumulative_sum, $signal, 11)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     21.045 ms (0.00% GC)
  median time:      21.323 ms (0.00% GC)
  mean time:        21.475 ms (0.00% GC)
  maximum time:     22.514 ms (0.00% GC)
  --------------
  samples:          233
  evals/sample:     1
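
As a rough sketch of the views remark above (an assumption about how one might do it, not something benchmarked in this thread): allocate the work buffers once at the largest expected length and pass correctly sized views for each signal. The names `maxlen`, `cs_buf`, and `ret_buf` are made up for this example.

```julia
# Same function as above, repeated so the example is self-contained.
function moving_average_optimized!(ret, cs, A, d::Int=3)
    cumsum!(cs, A)
    invd = inv(d)
    ret[1] = cs[d] * invd
    @inbounds @simd for n = 1:length(ret)-1
        ret[n+1] = (cs[n+d] - cs[n]) * invd
    end
    return ret
end

maxlen  = 1_000                           # assumed upper bound on signal length
cs_buf  = Vector{Float64}(undef, maxlen)  # allocated once, reused for every call
ret_buf = Vector{Float64}(undef, maxlen)

sig = collect(1.0:200.0)                  # a shorter signal than maxlen
d   = 11

# Contiguous views select just the part of each buffer this signal needs.
cs  = view(cs_buf, 1:length(sig))
ret = view(ret_buf, 1:length(sig)-d+1)
moving_average_optimized!(ret, cs, sig, d)

@show length(ret) ret[1] ret[end]
```

Whether the loop still vectorizes with views is worth confirming on your own machine, for example by benchmarking both variants.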

Also, changing the size of vectors is slow. If you know how large a vector is likely to be, sizehint! can make resizing faster, but it is still much slower than simply storing into an array you created at that size in the first place. Therefore, if you want to optimize for performance, it could be worth trying to figure out how to allocate all the memory you need in one go, track how much of the vector you’re using at any given time, and use that as the bound.
But that’s rather inconvenient. If that part of the code isn’t the bottleneck, you may prefer to spend your time elsewhere.
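
A minimal sketch of the "allocate once and track the used range" idea, applied to index buffers like upper and lower in stream_minmax!. `IndexQueue` and its functions are hypothetical names, and the sketch assumes each index is pushed at most once, so a capacity of length(a) cannot overflow.

```julia
# Hypothetical helper: an index queue backed by one preallocated vector.
# `head` and `tail` mark the part of `buf` that is currently in use.
mutable struct IndexQueue
    buf::Vector{Int}
    head::Int   # position of the first live element
    tail::Int   # position of the last live element (head - 1 when empty)
end

IndexQueue(capacity::Int) = IndexQueue(Vector{Int}(undef, capacity), 1, 0)

Base.isempty(q::IndexQueue) = q.tail < q.head
Base.first(q::IndexQueue)   = q.buf[q.head]
Base.last(q::IndexQueue)    = q.buf[q.tail]

# None of these resize `buf`; they only move the two markers.
pushback!(q::IndexQueue, i::Int) = (q.tail += 1; q.buf[q.tail] = i; q)
popback!(q::IndexQueue)          = (q.tail -= 1; q)
popfront!(q::IndexQueue)         = (q.head += 1; q)
reset!(q::IndexQueue)            = (q.head = 1; q.tail = 0; q)  # reuse between calls

q = IndexQueue(5)
pushback!(q, 1); pushback!(q, 2); pushback!(q, 3)
popback!(q)     # drops 3
popfront!(q)    # drops 1
@show first(q) last(q) isempty(q)
```

This trades memory for simplicity: a ring buffer sized to the window would use far less space, at the cost of wrap-around index arithmetic.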

jarrison · October 14, 2018, 11:16pm

In this instance I am guaranteed a constant size for every signal. The window size is adaptive and thus changes, but I think I can take this into account. I hadn’t thought about garbage collection at all! Glad I posted this. I’ll see what kind of magic I can work with preallocation.

bennedich · October 15, 2018, 5:00pm

The cumsum can be reused to calculate the moving average for any window size, but if you’ll only use it once, you can optimize the code further:

function moving_average_no_cumsum(A, d::Int=3)
    T = typeof(one(eltype(A))/1)
    ret = Vector{T}(undef, length(A) - d + 1)
    id = 1 / d
    s = sum(view(A, 1:d))
    ret[1] = s * id
    @inbounds for n = 1:length(ret)-1
        s += A[n+d] - A[n]
        ret[n+1] = s * id
    end
    return ret
end

Timings:

julia> @btime moving_average_optimized($signal, 11)
  49.671 ms (4 allocations: 152.59 MiB)

julia> @btime moving_average_no_cumsum($signal, 11)
  28.331 ms (3 allocations: 76.29 MiB)

(Just keep in mind that numerical accuracy might be slightly sacrificed by the repeated addition/subtraction.)
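
One way to gauge that accuracy loss on your own data is to compare against a direct window-by-window mean. The sketch below is an illustration, not part of the thread: `reference_average` is a hypothetical helper, and the test signal with a large constant offset is an assumption chosen to make rounding drift easier to see.

```julia
# Running-sum version from the post above, repeated for self-containment.
function moving_average_no_cumsum(A, d::Int=3)
    T = typeof(one(eltype(A))/1)
    ret = Vector{T}(undef, length(A) - d + 1)
    id = 1 / d
    s = sum(view(A, 1:d))
    ret[1] = s * id
    @inbounds for n = 1:length(ret)-1
        s += A[n+d] - A[n]
        ret[n+1] = s * id
    end
    return ret
end

# Hypothetical slow reference: recompute every window sum from scratch.
reference_average(A, d) = [sum(view(A, i:i+d-1)) / d for i in 1:length(A)-d+1]

# Assumed test data: a slow sine riding on a large offset.
A = [sin(0.001 * k) + 1e6 for k in 1:100_000]
d = 11

fast = moving_average_no_cumsum(A, d)
ref  = reference_average(A, d)

# Largest absolute deviation between the two results.
maxerr = maximum(abs.(fast .- ref))
@show maxerr
```

If the deviation matters for your application, the cumsum-based version or a periodic recomputation of the running sum are possible alternatives.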
