Is it possible to reduce allocations when replacing missing values?

Hi, I want to replace the missing values in a large vector with 0, but I get a huge memory allocation.

For example:

A = rand(10_000_000);
@time replace!(A, -Inf => 0, Inf => 0,  NaN => 0);

This takes only 0.065884 seconds on the first run (44.28 k allocations: 2.426 MiB, 39.79% compilation time)

But if I include the replacement for missing:

@time replace!(A, -Inf => 0, Inf => 0,  missing => 0, NaN => 0);

This takes 1.328083 seconds (60.00 M allocations: 2.533 GiB, 21.87% gc time)

Am I doing something wrong? Thanks.

This performs much better, although I’m not sure what the tradeoffs are, or whether replace! could do better here:

julia> @time map!(x -> (ismissing(x) | isnan(x) | isinf(x)) ? 0 : x, A, A);
  0.056996 seconds (16.95 k allocations: 961.940 KiB, 57.23% compilation time)

julia> f(x) = (ismissing(x) | isnan(x) | isinf(x)) ? 0 : x
f (generic function with 1 method)

julia> @time map!(f, A, A);
  0.034643 seconds (16.91 k allocations: 958.653 KiB, 33.26% compilation time)

julia> @time map!(f, A, A);
  0.022496 seconds

julia> @btime map!($f, A, A) setup=(A = Union{Missing,Float64}[ (rand() > 0.1 ? rand() : missing) for _ in 1:10^7 ]) evals=1
  24.532 ms (0 allocations: 0 bytes)


(The compilation time in the first example comes from the anonymous function, which is compiled again on every run.)
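A small sketch of why the anonymous function costs compilation time each time the line is re-entered: every anonymous function expression gets its own type, so map! is specialized again for it, whereas a named function keeps a single type. The names g1, g2, and h are made up for this illustration.

```julia
# Two textually identical anonymous functions still have distinct types.
g1 = x -> x + 1
g2 = x -> x + 1
same_anonymous_type = typeof(g1) == typeof(g2)   # false

# A named function has one type, so its compiled specializations are reused.
h(x) = x + 1
same_named_type = typeof(h) == typeof(h)         # true
```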

There are some things that are a little strange in this MWE.

  1. The vector A cannot contain -Inf, Inf, or NaN, because rand draws from [0, 1). So you are only ever timing a pass that replaces nothing.
  2. The literal 0 is an Int (an alias for Int32 or Int64, depending on your system), so there is probably an automatic conversion (maybe optimized away) to 0.0. Otherwise replace! would not work, since it modifies the vector in place and you cannot store an Int in a Vector{Float64} without converting.
  3. rand returns a Vector{Float64}, and that type cannot hold missing values. Another red flag is the number of allocations: there is no reason for replace! to allocate anything. I believe something inside replace! becomes type-unstable because of the mixed types in the pairs. Ideally every pair should have a Float64 on both sides, unless your vector is a Vector{Union{Float64,Missing}}.
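The points above can be checked with a short sketch. The vector B and its contents are made up for illustration. It makes no claim about allocations, only about the types involved and about using Float64 replacement values.

```julia
A = rand(5)
elt = eltype(A)                      # Float64: there is no slot for `missing`
has_special = any(x -> isnan(x) || isinf(x), A)   # false: values lie in [0, 1)

# The literal 0 is an Int, so this pair mixes Float64 and Int.
mixed_pair_type = typeof(-Inf => 0)

# A vector that can actually hold `missing` needs a Union element type.
B = Union{Missing, Float64}[1.0, missing, NaN, Inf, -Inf]

# Replacement values written as 0.0 so they already match the element type.
replace!(B, -Inf => 0.0, Inf => 0.0, missing => 0.0, NaN => 0.0)
```

After the call, B holds 1.0 followed by four zeros.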

Setting aside the doubts about what is really happening, here is another proposal:

julia> @btime replace!(repl, A);
  13.607 ms (1 allocation: 16 bytes)

On my laptop:

julia> @btime map!($f, A, A) setup=(A = Union{Missing,Float64}[ (rand() > 0.1 ? rand() : missing) for _ in 1:10^7 ]) evals=1;
  53.493 ms (0 allocations: 0 bytes)

My guess, without checking, is that the Dict created by replace! from the arguments is a Dict{Any,Any} when missing is present, as in the example.

You probably won’t get allocations if you use interpolation:


The problem seems to be a type-unstable for-loop inside Base.replace_pairs!. Using recursion instead of a loop solves the problem for this case.

julia> _new(x) = x
_new (generic function with 1 method)

julia> _new(x, p, ps...) = isequal(first(p), x) ? last(p) : _new(x, ps...)
_new (generic function with 2 methods)

julia> function Base.replace_pairs!(res, A, count::Int, old_new::Tuple{Vararg{Pair}})
           Base._replace!(res, A, count) do x
               _new(x, old_new...)
           end
       end

julia> @time replace!(A, -Inf => 0, Inf => 0, missing => 0, NaN => 0);
  0.074939 seconds (68.23 k allocations: 3.670 MiB, 31.72% compilation time)
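The same recursive lookup can be tried without redefining a Base method. The following is only a sketch: lookup and replace_with_pairs! are hypothetical names, the helper is built on map! rather than on Base internals, and it ignores the count keyword that replace! supports.

```julia
# Recursive lookup over a heterogeneous tuple of pairs: peel off one pair
# per call instead of looping over pairs with different types.
lookup(x) = x
lookup(x, p, ps...) = isequal(first(p), x) ? last(p) : lookup(x, ps...)

# Hypothetical helper that applies the pairs in place.
function replace_with_pairs!(A, old_new::Pair...)
    map!(x -> lookup(x, old_new...), A, A)
end

B = Union{Missing, Float64}[1.0, missing, NaN, Inf, -Inf, 2.0]
replace_with_pairs!(B, -Inf => 0.0, Inf => 0.0, missing => 0.0, NaN => 0.0)
```

Because the comparison uses isequal, NaN and missing both match their own pairs, so B ends up with only 1.0, 2.0, and zeros.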