The Case of the Phantom Allocations πŸ‘»

Hey Julia experts,

I have a performance puzzle. This function is designed to be completely allocation-free, but @benchmark tells a different story. I’m hoping a fresh pair of eyes can spot the problem.

This is the function:


using Statistics, StatsBase

_default_reducer(::Val{:median}) = median!
_default_reducer(::Val{:mean}) = mean
_default_reducer(::Val{:std}) = std

function sigma_clip!(x::AbstractVector{T},
    mask::BitVector,
    buffer::AbstractVector{T};
    upper::Real=3,
    lower::Real=3,
    cent_func::Symbol=:median,
    std_func::Symbol=:std,
    maxiter::Int=5) where {T<:Real}

    maxiter > 0 || throw(ArgumentError("maxiter must be positive"))
    upper > 0 && lower > 0 || throw(ArgumentError("upper and lower must be positive"))

    cent_reducer = _default_reducer(Val(cent_func))
    std_reducer = _default_reducer(Val(std_func))

    # initialize buffer following mask (isfinite) to avoid calling the reducer over NaN or Inf
    N = 0
    @inbounds for i in eachindex(x)
        xi = x[i]
        if isfinite(xi)
            N += 1
            buffer[N] = xi
            mask[i] = true
        else
            mask[i] = false
        end
    end

    N == 0 && return mask

    iter = 0
    @inbounds while iter < maxiter
        # calculate statistic
        w = view(buf, 1:N)

        c = cent_reducer(w)
        s = std_reducer(w)
        iszero(s) && return mask

        upper_bound = c + s * upper
        lower_bound = c - s * lower

        N = 0
        n_clipped = 0

        @inbounds for i in eachindex(x)
            if mask[i]
                xi = x[i]
                if xi > upper_bound || xi < lower_bound
                    mask[i] = false
                    n_clipped += 1
                else
                    N += 1
                    buffer[N] = xi
                end
            end
        end

        n_clipped == 0 && return mask
        iter += 1
    end

    return mask
end

The Evidence (The Benchmark):

@benchmark sigma_clip!(x, y, buf; upper=1.5, lower=1.5, cent_func=:mean) setup=begin
           x = rand(100)
           y = trues(100) 
           buf = rand(100)
       end
BenchmarkTools.Trial: 10000 samples with 8 evaluations per sample.
 Range (min … max):  3.302 ΞΌs … 254.620 ΞΌs  β”Š GC (min … max): 0.00% … 97.24%
 Time  (median):     5.255 ΞΌs               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   5.725 ΞΌs Β±   4.792 ΞΌs  β”Š GC (mean Β± Οƒ):  3.23% Β±  4.42%

                  β–β–‡β–ˆβ–„    β–‚
  β–‚β–‚β–β–β–‚β–‚β–β–β–β–β–β–β–β–‚β–‚β–„β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–ˆβ–ˆβ–‡β–…β–„β–ƒβ–ƒβ–ƒβ–„β–„β–„β–„β–„β–„β–„β–ƒβ–ƒβ–ƒβ–ƒβ–ƒβ–ƒβ–ƒβ–ƒβ–ƒβ–ƒβ–ƒβ–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚ β–ƒ
  3.3 ΞΌs          Histogram: frequency by time        8.42 ΞΌs <

 Memory estimate: 5.50 KiB, allocs estimate: 346.

I’ve ruled out the usual suspects (global variables, benchmark setup for mutating functions). What am I missing?

Thanks for any clues!

I think

cent_reducer = _default_reducer(Val(cent_func))

is type-unstable, because Val(runtime_value) produces a type that depends on a runtime value, so every call of the reducer may be dynamically dispatched.
Try hardcoding it

cent_reducer = median!

To see if it changes anything.
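For what it's worth, you can see the instability directly with `Base.return_types`, no tooling required. A minimal sketch mirroring the post's pattern (the helper names here are made up):

```julia
using Statistics

# Two reducers behind Val dispatch, as in the post
_reducer(::Val{:median}) = median!
_reducer(::Val{:mean})   = mean

pick_runtime(s::Symbol) = _reducer(Val(s))     # Val of a runtime Symbol
pick_static()           = _reducer(Val(:mean)) # Val of a literal

# The runtime version can only be inferred as an abstract type (Any or a
# small Union); the literal version infers to the concrete typeof(mean).
Base.return_types(pick_runtime, (Symbol,))[1]
Base.return_types(pick_static, ())[1]
```

When the Symbol is a literal, constant propagation makes `Val(:mean)` a compile-time constant, so the dispatch is resolved statically.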

Also check out Cthulhu.jl (JuliaDebug/Cthulhu.jl on GitHub, "The slow descent into madness") to see the inferred types throughout the function.

I tried to hard-code the center function and the std function, like this:

cent_reducer = mean
std_reducer = std

but got this:

@benchmark sigma_clip!($x, $mask, $buf ;upper=1.5, lower=1.5, cent_func=$:mean)
BenchmarkTools.Trial: 10000 samples with 4 evaluations per sample.
 Range (min … max):  7.188 ΞΌs … 346.198 ΞΌs  β”Š GC (min … max): 0.00% … 96.71%
 Time  (median):     7.979 ΞΌs               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   8.351 ΞΌs Β±   5.017 ΞΌs  β”Š GC (mean Β± Οƒ):  2.12% Β±  4.58%

               β–ƒβ–ˆβ–†β–ˆβ–
  β–β–β–β–‚β–‚β–…β–…β–†β–†β–…β–„β–†β–‡β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–…β–ƒβ–„β–„β–ƒβ–„β–…β–ƒβ–„β–„β–ƒβ–ƒβ–ƒβ–…β–‡β–‡β–‡β–ˆβ–…β–„β–ƒβ–‚β–‚β–‚β–‚β–β–β–β–β–β–β–β–β–β–β–β–β–β–β– β–ƒ
  7.19 ΞΌs         Histogram: frequency by time        9.86 ΞΌs <

 Memory estimate: 13.31 KiB, allocs estimate: 837.

Also: the benchmark was done in a fresh REPL.

Maybe this line:

 w = view(buf, 1:N)

The docs say that view needs to allocate a SubArray object. The compiler may be smart enough to stack-allocate it, or it may end up on the heap. Perhaps the latter is happening here.
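For what it's worth, when the array comes in as a function argument, the SubArray wrapper is usually elided entirely. A quick check with Base's `@allocated` (helper name made up):

```julia
const v = rand(1000)  # const so the top-level @allocated call below is fully inferred

partial_sum(a, n) = sum(view(a, 1:n))  # view over a function argument

partial_sum(v, 500)             # warm up, so compilation isn't measured
@allocated partial_sum(v, 500)  # typically 0: the view never reaches the heap
```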

Here it should be buffer, not buf (I think). As written, buf is just some non-constant global variable you defined outside the function, and that is what's allocating.
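That would explain it: a non-const global has no fixed type, so every use of it inside a function forces dynamic dispatch and boxing. A minimal demonstration (names made up):

```julia
data = rand(100)        # non-const global: its type could change at any time
const cdata = rand(100) # const global: type is fixed

sum_nonconst() = sum(view(data, 1:50))  # `data` is inferred as Any
sum_const()    = sum(view(cdata, 1:50)) # fully inferred

sum_nonconst(); sum_const()  # warm up

@allocated sum_nonconst()  # > 0: heap-allocated view plus boxed result
@allocated sum_const()     # 0
```

Passing the array as an argument (as `buffer` is) has the same effect as the const version: the type is known inside the function.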


I'm not at a PC right now, so unfortunately I can't try it myself. For benchmarking, maybe set a random seed and also set maxiter to 1: if you allocate or dynamically dispatch on every iteration, the result may depend on the input (and thus on the number of iterations) more than on the code.
I would be quite surprised if the hard-coded version performed worse on the exact same input.
If Cthulhu is intimidating (been there), another classic approach is to comment out everything and slowly bring the functionality back line by line to see which operations introduce the allocations.

oooh.
You are right!
It was a typo all along…

@benchmark sigma_clip!($x, $mask, $buf ;upper=1.5, lower=1.5, maxiter=100, cent_func=$:mean)
BenchmarkTools.Trial: 10000 samples with 7 evaluations per sample.
 Range (min … max):  4.292 ΞΌs …  12.536 ΞΌs  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     4.333 ΞΌs               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   4.439 ΞΌs Β± 242.706 ns  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

  β–ƒβ–‡β–ˆβ–…β–ƒ              ▁▄▆▅▅▃▂  β–‚β–‚β–‚                             β–‚
  β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–†β–‡β–…β–„β–…β–„β–β–β–ƒβ–„β–„β–…β–„β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–†β–…β–…β–ƒβ–ƒβ–ƒβ–β–„β–β–ƒβ–ƒβ–„β–ƒβ–β–„β–„β–†β–ƒβ–…β–„β–„β–„β–ƒβ–…β–ƒβ–… β–ˆ
  4.29 ΞΌs      Histogram: log(frequency) by time      5.14 ΞΌs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Thank you so much!
