The Case of the Phantom Allocations πŸ‘»

Hey Julia experts,

I have a performance puzzle. This function is designed to be completely allocation-free, but @benchmark tells a different story. I’m hoping a fresh pair of eyes can spot the problem.

This is the function:


using Statistics, StatsBase

_default_reducer(::Val{:median}) = median!
_default_reducer(::Val{:mean}) = mean
_default_reducer(::Val{:std}) = std

function sigma_clip!(x::AbstractVector{T},
    mask::BitVector,
    buffer::AbstractVector{T};
    upper::Real=3,
    lower::Real=3,
    cent_func::Symbol=:median,
    std_func::Symbol=:std,
    maxiter::Int=5) where {T<:Real}

    maxiter > 0 || throw(ArgumentError("maxiter must be positive"))
    upper > 0 && lower > 0 || throw(ArgumentError("upper and lower must be positive"))

    cent_reducer = _default_reducer(Val(cent_func))
    std_reducer = _default_reducer(Val(std_func))

    # initialize buffer following mask (isfinite) to avoid calling the reducer over NaN or Inf
    N = 0
    @inbounds for i in eachindex(x)
        xi = x[i]
        if isfinite(xi)
            N += 1
            buffer[N] = xi
            mask[i] = true
        else
            mask[i] = false
        end
    end

    N == 0 && return mask

    iter = 0
    @inbounds while iter < maxiter
        # calculate statistic
        w = view(buf, 1:N)

        c = cent_reducer(w)
        s = std_reducer(w)
        iszero(s) && return mask

        upper_bound = c + s * upper
        lower_bound = c - s * lower

        N = 0
        n_clipped = 0

        @inbounds for i in eachindex(x)
            if mask[i]
                xi = x[i]
                if xi > upper_bound || xi < lower_bound
                    mask[i] = false
                    n_clipped += 1
                else
                    N += 1
                    buffer[N] = xi
                end
            end
        end

        n_clipped == 0 && return mask
        iter += 1
    end

    return mask
end

The Evidence (The Benchmark):

@benchmark sigma_clip!(x, y, buf; upper=1.5, lower=1.5, cent_func=:mean) setup=begin
           x = rand(100)
           y = trues(100) 
           buf = rand(100)
       end
BenchmarkTools.Trial: 10000 samples with 8 evaluations per sample.
 Range (min … max):  3.302 ΞΌs … 254.620 ΞΌs  β”Š GC (min … max): 0.00% … 97.24%
 Time  (median):     5.255 ΞΌs               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   5.725 ΞΌs Β±   4.792 ΞΌs  β”Š GC (mean Β± Οƒ):  3.23% Β±  4.42%

                  β–β–‡β–ˆβ–„    β–‚
  β–‚β–‚β–β–β–‚β–‚β–β–β–β–β–β–β–β–‚β–‚β–„β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–ˆβ–ˆβ–‡β–…β–„β–ƒβ–ƒβ–ƒβ–„β–„β–„β–„β–„β–„β–„β–ƒβ–ƒβ–ƒβ–ƒβ–ƒβ–ƒβ–ƒβ–ƒβ–ƒβ–ƒβ–ƒβ–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚ β–ƒ
  3.3 ΞΌs          Histogram: frequency by time        8.42 ΞΌs <

 Memory estimate: 5.50 KiB, allocs estimate: 346.

I’ve ruled out the usual suspects (global variables, benchmark setup for mutating functions). What am I missing?

Thanks for any clues!

I think

cent_reducer = _default_reducer(Val(cent_func))

is type-unstable, because Val(runtime_value) produces a type that depends on a runtime value, so every call of the reducer may be dynamically dispatched.
Try hardcoding it

cent_reducer = median!

To see if it changes anything.
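For what it's worth, you can see the instability directly with `Base.return_types`, no tooling required. A minimal sketch mirroring the post's pattern (the helper names here are made up):

```julia
using Statistics

# Two reducers behind Val dispatch, as in the post
_reducer(::Val{:median}) = median!
_reducer(::Val{:mean})   = mean

pick_runtime(s::Symbol) = _reducer(Val(s))     # Val of a runtime Symbol
pick_static()           = _reducer(Val(:mean)) # Val of a literal

# The runtime version can only be inferred as an abstract type (Any or a
# small Union); the literal version infers to the concrete typeof(mean).
Base.return_types(pick_runtime, (Symbol,))[1]
Base.return_types(pick_static, ())[1]
```

When the Symbol is a literal, constant propagation makes `Val(:mean)` a compile-time constant, so the dispatch is resolved statically.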

Also check out Cthulhu.jl (JuliaDebug/Cthulhu.jl on GitHub, "The slow descent into madness") to see the inferred types throughout the function.

I tried to hard-code the center function and the std function, like this:

cent_reducer = mean
std_reducer = std

but got this:

@benchmark sigma_clip!($x, $mask, $buf ;upper=1.5, lower=1.5, cent_func=$:mean)
BenchmarkTools.Trial: 10000 samples with 4 evaluations per sample.
 Range (min … max):  7.188 ΞΌs … 346.198 ΞΌs  β”Š GC (min … max): 0.00% … 96.71%
 Time  (median):     7.979 ΞΌs               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   8.351 ΞΌs Β±   5.017 ΞΌs  β”Š GC (mean Β± Οƒ):  2.12% Β±  4.58%

               β–ƒβ–ˆβ–†β–ˆβ–
  β–β–β–β–‚β–‚β–…β–…β–†β–†β–…β–„β–†β–‡β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–…β–ƒβ–„β–„β–ƒβ–„β–…β–ƒβ–„β–„β–ƒβ–ƒβ–ƒβ–…β–‡β–‡β–‡β–ˆβ–…β–„β–ƒβ–‚β–‚β–‚β–‚β–β–β–β–β–β–β–β–β–β–β–β–β–β–β– β–ƒ
  7.19 ΞΌs         Histogram: frequency by time        9.86 ΞΌs <

 Memory estimate: 13.31 KiB, allocs estimate: 837.

Also: the benchmark was done in a fresh REPL.

Maybe this line:

 w = view(buf, 1:N)

The docs say that view needs to allocate a SubArray object. The compiler may be smart enough to stack-allocate it, or it may end up on the heap. Perhaps the latter is happening here.
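For what it's worth, when the array comes in as a function argument, the SubArray wrapper is usually elided entirely. A quick check with Base's `@allocated` (helper name made up):

```julia
const v = rand(1000)  # const so the top-level @allocated call below is fully inferred

partial_sum(a, n) = sum(view(a, 1:n))  # view over a function argument

partial_sum(v, 500)             # warm up, so compilation isn't measured
@allocated partial_sum(v, 500)  # typically 0: the view never reaches the heap
```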

Here it should be buffer, not buf (I think). As written, buf is just some non-constant global variable you defined outside the function, and that is what's allocating.
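That would explain it: a non-const global has no fixed type, so every use of it inside a function forces dynamic dispatch and boxing. A minimal demonstration (names made up):

```julia
data = rand(100)        # non-const global: its type could change at any time
const cdata = rand(100) # const global: type is fixed

sum_nonconst() = sum(view(data, 1:50))  # `data` is inferred as Any
sum_const()    = sum(view(cdata, 1:50)) # fully inferred

sum_nonconst(); sum_const()  # warm up

@allocated sum_nonconst()  # > 0: heap-allocated view plus boxed result
@allocated sum_const()     # 0
```

Passing the array as an argument (as `buffer` is) has the same effect as the const version: the type is known inside the function.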


I'm not at a PC right now, so unfortunately I can't try it myself. For benchmarking, maybe set a random seed and also set maxiter to 1: if you allocate or dynamically dispatch on every iteration, the result may depend on the input (and thus on the number of iterations) more than on the code.
I would be quite surprised if the hard-coded version performed worse on the exact same input.
If Cthulhu is intimidating (been there), another classic approach is to comment out everything and slowly bring the functionality back line by line to see which operations introduce the allocations.

oooh.
You are right!
It was a typo all along…

@benchmark sigma_clip!($x, $mask, $buf ;upper=1.5, lower=1.5, maxiter=100, cent_func=$:mean)
BenchmarkTools.Trial: 10000 samples with 7 evaluations per sample.
 Range (min … max):  4.292 ΞΌs …  12.536 ΞΌs  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     4.333 ΞΌs               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   4.439 ΞΌs Β± 242.706 ns  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

  β–ƒβ–‡β–ˆβ–…β–ƒ              ▁▄▆▅▅▃▂  β–‚β–‚β–‚                             β–‚
  β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–†β–‡β–…β–„β–…β–„β–β–β–ƒβ–„β–„β–…β–„β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–†β–…β–…β–ƒβ–ƒβ–ƒβ–β–„β–β–ƒβ–ƒβ–„β–ƒβ–β–„β–„β–†β–ƒβ–…β–„β–„β–„β–ƒβ–…β–ƒβ–… β–ˆ
  4.29 ΞΌs      Histogram: log(frequency) by time      5.14 ΞΌs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Thank you so much!
