How to convert all nothings in dataframe to missing?

I have a dataframe with a bunch of nothings mixed between union columns, etc. How can I change all nothings to missings efficiently?



From the point of view of the function itself, I do not know if you can get faster than the simple loop:

julia> function nothing_to_missing!(x)
         for i in eachindex(x)
           if isnothing(x[i])
             x[i] = missing

However, the performance is very much dependent on the type of vector:

# Vector Any[]
julia> @btime nothing_to_missing!(x) setup=(x=Any[isodd(i) ? nothing : 1 for i in 1:1000 ]) evals=1
  4.905 μs (0 allocations: 0 bytes)

# Vector Union{Int,Nothing,Missing}
julia> @btime nothing_to_missing!(x) setup=(x=Union{Int,Nothing,Missing}[isodd(i) ? nothing : 1 for i in 1:1000 ]) evals=1
  1.343 μs (0 allocations: 0 bytes)

This for arrays. I do not know if dataframes have some specific behavior concerning these values.

Edit: With a DataFrame it seems to be much slower. But I am not completely sure if this benchmark makes sense, I am not a regular user of data frames:

julia> function nothing_to_missing!(df,col)
         for i in eachindex(df[col])
           if isnothing(df[col][i])
             df[col][i] = missing

julia> @btime nothing_to_missing!(df,1) setup=(df=DataFrame([Any[isodd(i) ? nothing : 1 for i in 1:1000 ]],[:x])) evals=1
  145.723 μs (1978 allocations: 46.52 KiB)

julia> @btime nothing_to_missing!(df,1) setup=(df=DataFrame([Union{Int,Missing,Nothing}[isodd(i) ? nothing : 1 for i in 1:1000 ]],[:x])) evals=1
  127.832 μs (1978 allocations: 46.52 KiB)

1 Like

Thank you. The loop will do then! Thanks again!

Not sure if this is faster:

df = DataFrame(A=["A","B",nothing, "C"])
df.A = (df.A .|> a -> isnothing(a) ? missing : a)

Yes, for data frames it is faster than the above loop, and much dependent on the type of array as well:

julia> f(df) = (df.x .|> a -> isnothing(a) ? missing : a )
f (generic function with 1 method)

julia> @btime f(df) setup=(df=DataFrame([Any[isodd(i) ? nothing : 1 for i in 1:1000 ]],[:x])) evals=1;
  53.321 μs (501 allocations: 16.95 KiB)

julia> @btime f(df) setup=(df=DataFrame([Union{Int,Missing,Nothing}[isodd(i) ? nothing : 1 for i in 1:1000 ]],[:x])) evals=1;
  2.655 μs (11 allocations: 9.29 KiB)

(but it does not mutate the original data frame, it creates a new one)

replace is the most obvious answer here

df = DataFrame(a = [1, 2, nothing, 4])
df.a = replace(df.a, nothing => missing)

I recommend creating a new Vector and assigning it to the column, as @MatFi did. Otherwise you need that the DataFrame column is of type Any[] or any other type that supports both missing and nothing what is not common nor recommended. I personally use:

df.A = ifelse.(isnothing.(df.A), missing, df.A)

This is actually less performant, since you create isnothing.(df.A) as a temporary vector.

This also creates a new vector and puts it in the place of df.a. What if one wants to mutate the values? Is there any alternative that will perform well?

No, this approach does the minimum amount of copying possibe, since there is no copying in the assignment to df.a.

In the ifelse.(isnothing.(df.A)...) case, the isnothing.(df.A) creates an intermediate temporary vector that is never used.

There are no options to do this entirely in-place, since a vector, say [1, 2, nothing, 4] only has the memory footprint for Int64 and nothing laid out for it. You need a new vector slotted to hold Int64 and missings.

1 Like

Complementing the answer to my own question, there is replace!, but one need to explicitly define the type of the array to accept both missing and nothing values, as pointed above. And the difference in performance is huge from Any to Union{Missing,Nothing,...}:

julia> @btime replace!(df.x,nothing=>missing) setup=(df=DataFrame([Any[isodd(i) ? nothing : 1 for i in 1:10000 ]],[:x])) evals=1;
  112.331 μs (0 allocations: 0 bytes)

julia> @btime replace!(df.x,nothing=>missing) setup=(df=DataFrame([Union{Missing,Nothing,Int}[isodd(i) ? nothing : 1 for i in 1:10000 ]],[:x])) evals=1;
  5.607 μs (0 allocations: 0 bytes)

1 Like

Do I? Is this not a case of broadcast fusion?

I think that if there is less than 4 type in the Union and all of them are concrete, the code will be reasonably fast. Nevertheless, it is not usual to declare such Union{Missing,Nothing,...} columns. A CSV reader will probably read as either Union{Missing,...} or Union{Nothing,...}, and if the user intends to change the convention/standard the original array will not have the element type necessary without allocating a new array anyway.

1 Like

Not quite, I guess

julia> x = [rand() < .2 ? rand() : nothing for i in 1:1_000_000];

julia> function f1(x)
       replace(x, nothing => missing)

julia> function f2(x)
       ifelse.(isnothing.(x), missing, x)

julia> @btime f1($x);
  3.296 ms (2 allocations: 8.58 MiB)

julia> @btime f2($x);
  24.277 ms (13 allocations: 8.58 MiB)
1 Like

This is strange, because it is clear that there is no allocation of an intermediary Vector otherwise my method would use the double of the memory (or at least something significantly larger), but any difference in the memory used is less than 1%.

The number of allocations however is higher, maybe the broadcast machinery use @views or something that allocates without duplicating the whole intermediary array, and for such simple task these extra allocations make the difference.