How to convert all nothings in dataframe to missing?

Joe_Schnetzler · January 26, 2021, 5:42pm

I have a dataframe with a bunch of nothings mixed between union columns, etc. How can I change all nothings to missings efficiently?

Thanks!

Best,
Joe

lmiq · January 26, 2021, 6:22pm

From the point of view of the function itself, I do not know if you can get faster than the simple loop:

julia> function nothing_to_missing!(x)
         for i in eachindex(x)
           if isnothing(x[i])
             x[i] = missing
           end
         end
       end

However, the performance is very much dependent on the type of vector:

# Vector Any[]
julia> @btime nothing_to_missing!(x) setup=(x=Any[isodd(i) ? nothing : 1 for i in 1:1000 ]) evals=1
  4.905 μs (0 allocations: 0 bytes)

# Vector Union{Int,Nothing,Missing}
julia> @btime nothing_to_missing!(x) setup=(x=Union{Int,Nothing,Missing}[isodd(i) ? nothing : 1 for i in 1:1000 ]) evals=1
  1.343 μs (0 allocations: 0 bytes)

This for arrays. I do not know if dataframes have some specific behavior concerning these values.

Edit: With a DataFrame it seems to be much slower. But I am not completely sure if this benchmark makes sense, I am not a regular user of data frames:

julia> function nothing_to_missing!(df,col)
         for i in eachindex(df[col])
           if isnothing(df[col][i])
             df[col][i] = missing
           end
         end
       end

julia> @btime nothing_to_missing!(df,1) setup=(df=DataFrame([Any[isodd(i) ? nothing : 1 for i in 1:1000 ]],[:x])) evals=1
  145.723 μs (1978 allocations: 46.52 KiB)

julia> @btime nothing_to_missing!(df,1) setup=(df=DataFrame([Union{Int,Missing,Nothing}[isodd(i) ? nothing : 1 for i in 1:1000 ]],[:x])) evals=1
  127.832 μs (1978 allocations: 46.52 KiB)

Joe_Schnetzler · January 26, 2021, 6:53pm

Thank you. The loop will do then! Thanks again!

MatFi · January 26, 2021, 6:59pm

Not sure if this is faster:

df = DataFrame(A=["A","B",nothing, "C"])
df.A = (df.A .|> a -> isnothing(a) ? missing : a)

lmiq · January 26, 2021, 7:06pm

Yes, for data frames it is faster than the above loop, and much dependent on the type of array as well:

julia> f(df) = (df.x .|> a -> isnothing(a) ? missing : a )
f (generic function with 1 method)

julia> @btime f(df) setup=(df=DataFrame([Any[isodd(i) ? nothing : 1 for i in 1:1000 ]],[:x])) evals=1;
  53.321 μs (501 allocations: 16.95 KiB)

julia> @btime f(df) setup=(df=DataFrame([Union{Int,Missing,Nothing}[isodd(i) ? nothing : 1 for i in 1:1000 ]],[:x])) evals=1;
  2.655 μs (11 allocations: 9.29 KiB)

(but it does not mutate the original data frame, it creates a new one)

pdeffebach · January 26, 2021, 7:12pm

replace is the most obvious answer here

df = DataFrame(a = [1, 2, nothing, 4])
df.a = replace(df.a, nothing => missing)

Henrique_Becker · January 26, 2021, 7:12pm

I recommend creating a new Vector and assigning it to the column, as @MatFi did. Otherwise you need that the DataFrame column is of type Any[] or any other type that supports both missing and nothing what is not common nor recommended. I personally use:

df.A = ifelse.(isnothing.(df.A), missing, df.A)

pdeffebach · January 26, 2021, 7:32pm

This is actually less performant, since you create isnothing.(df.A) as a temporary vector.

lmiq · January 26, 2021, 7:37pm

This also creates a new vector and puts it in the place of df.a. What if one wants to mutate the values? Is there any alternative that will perform well?

pdeffebach · January 26, 2021, 7:41pm

No, this approach does the minimum amount of copying possibe, since there is no copying in the assignment to df.a.

In the ifelse.(isnothing.(df.A)...) case, the isnothing.(df.A) creates an intermediate temporary vector that is never used.

There are no options to do this entirely in-place, since a vector, say [1, 2, nothing, 4] only has the memory footprint for Int64 and nothing laid out for it. You need a new vector slotted to hold Int64 and missings.

lmiq · January 26, 2021, 7:44pm

Complementing the answer to my own question, there is replace!, but one need to explicitly define the type of the array to accept both missing and nothing values, as pointed above. And the difference in performance is huge from Any to Union{Missing,Nothing,...}:

julia> @btime replace!(df.x,nothing=>missing) setup=(df=DataFrame([Any[isodd(i) ? nothing : 1 for i in 1:10000 ]],[:x])) evals=1;
  112.331 μs (0 allocations: 0 bytes)

julia> @btime replace!(df.x,nothing=>missing) setup=(df=DataFrame([Union{Missing,Nothing,Int}[isodd(i) ? nothing : 1 for i in 1:10000 ]],[:x])) evals=1;
  5.607 μs (0 allocations: 0 bytes)

Henrique_Becker · January 26, 2021, 8:02pm

Do I? Is this not a case of broadcast fusion?

Henrique_Becker · January 26, 2021, 8:08pm

I think that if there is less than 4 type in the Union and all of them are concrete, the code will be reasonably fast. Nevertheless, it is not usual to declare such Union{Missing,Nothing,...} columns. A CSV reader will probably read as either Union{Missing,...} or Union{Nothing,...}, and if the user intends to change the convention/standard the original array will not have the element type necessary without allocating a new array anyway.

pdeffebach · January 26, 2021, 8:29pm

Not quite, I guess

julia> x = [rand() < .2 ? rand() : nothing for i in 1:1_000_000];

julia> function f1(x)
       replace(x, nothing => missing)
       end;

julia> function f2(x)
       ifelse.(isnothing.(x), missing, x)
       end;

julia> @btime f1($x);
  3.296 ms (2 allocations: 8.58 MiB)

julia> @btime f2($x);
  24.277 ms (13 allocations: 8.58 MiB)

Henrique_Becker · January 26, 2021, 8:43pm

This is strange, because it is clear that there is no allocation of an intermediary Vector otherwise my method would use the double of the memory (or at least something significantly larger), but any difference in the memory used is less than 1%.

The number of allocations however is higher, maybe the broadcast machinery use @views or something that allocates without duplicating the whole intermediary array, and for such simple task these extra allocations make the difference.