I have a dataframe with a bunch of nothings mixed between union columns, etc. How can I change all nothings to missings efficiently?
Thanks!
Best,
Joe
I have a dataframe with a bunch of nothings mixed between union columns, etc. How can I change all nothings to missings efficiently?
Thanks!
Best,
Joe
From the point of view of the function itself, I do not know if you can get faster than the simple loop:
julia> function nothing_to_missing!(x)
for i in eachindex(x)
if isnothing(x[i])
x[i] = missing
end
end
end
However, the performance is very much dependent on the type of vector:
# Vector Any[]
julia> @btime nothing_to_missing!(x) setup=(x=Any[isodd(i) ? nothing : 1 for i in 1:1000 ]) evals=1
4.905 μs (0 allocations: 0 bytes)
# Vector Union{Int,Nothing,Missing}
julia> @btime nothing_to_missing!(x) setup=(x=Union{Int,Nothing,Missing}[isodd(i) ? nothing : 1 for i in 1:1000 ]) evals=1
1.343 μs (0 allocations: 0 bytes)
This for arrays. I do not know if dataframes have some specific behavior concerning these values.
Edit: With a DataFrame
it seems to be much slower. But I am not completely sure if this benchmark makes sense, I am not a regular user of data frames:
julia> function nothing_to_missing!(df,col)
for i in eachindex(df[col])
if isnothing(df[col][i])
df[col][i] = missing
end
end
end
julia> @btime nothing_to_missing!(df,1) setup=(df=DataFrame([Any[isodd(i) ? nothing : 1 for i in 1:1000 ]],[:x])) evals=1
145.723 μs (1978 allocations: 46.52 KiB)
julia> @btime nothing_to_missing!(df,1) setup=(df=DataFrame([Union{Int,Missing,Nothing}[isodd(i) ? nothing : 1 for i in 1:1000 ]],[:x])) evals=1
127.832 μs (1978 allocations: 46.52 KiB)
Thank you. The loop will do then! Thanks again!
Not sure if this is faster:
df = DataFrame(A=["A","B",nothing, "C"])
df.A = (df.A .|> a -> isnothing(a) ? missing : a)
Yes, for data frames it is faster than the above loop, and much dependent on the type of array as well:
julia> f(df) = (df.x .|> a -> isnothing(a) ? missing : a )
f (generic function with 1 method)
julia> @btime f(df) setup=(df=DataFrame([Any[isodd(i) ? nothing : 1 for i in 1:1000 ]],[:x])) evals=1;
53.321 μs (501 allocations: 16.95 KiB)
julia> @btime f(df) setup=(df=DataFrame([Union{Int,Missing,Nothing}[isodd(i) ? nothing : 1 for i in 1:1000 ]],[:x])) evals=1;
2.655 μs (11 allocations: 9.29 KiB)
(but it does not mutate the original data frame, it creates a new one)
replace
is the most obvious answer here
df = DataFrame(a = [1, 2, nothing, 4])
df.a = replace(df.a, nothing => missing)
I recommend creating a new Vector
and assigning it to the column, as @MatFi did. Otherwise you need that the DataFrame column is of type Any[]
or any other type that supports both missing
and nothing
what is not common nor recommended. I personally use:
df.A = ifelse.(isnothing.(df.A), missing, df.A)
This is actually less performant, since you create isnothing.(df.A)
as a temporary vector.
This also creates a new vector and puts it in the place of df.a
. What if one wants to mutate the values? Is there any alternative that will perform well?
No, this approach does the minimum amount of copying possibe, since there is no copying in the assignment to df.a
.
In the ifelse.(isnothing.(df.A)...)
case, the isnothing.(df.A)
creates an intermediate temporary vector that is never used.
There are no options to do this entirely in-place, since a vector, say [1, 2, nothing, 4]
only has the memory footprint for Int64
and nothing
laid out for it. You need a new vector slotted to hold Int64
and missing
s.
Complementing the answer to my own question, there is replace!
, but one need to explicitly define the type of the array to accept both missing
and nothing
values, as pointed above. And the difference in performance is huge from Any
to Union{Missing,Nothing,...}
:
julia> @btime replace!(df.x,nothing=>missing) setup=(df=DataFrame([Any[isodd(i) ? nothing : 1 for i in 1:10000 ]],[:x])) evals=1;
112.331 μs (0 allocations: 0 bytes)
julia> @btime replace!(df.x,nothing=>missing) setup=(df=DataFrame([Union{Missing,Nothing,Int}[isodd(i) ? nothing : 1 for i in 1:10000 ]],[:x])) evals=1;
5.607 μs (0 allocations: 0 bytes)
Do I? Is this not a case of broadcast fusion?
I think that if there is less than 4 type in the Union
and all of them are concrete, the code will be reasonably fast. Nevertheless, it is not usual to declare such Union{Missing,Nothing,...}
columns. A CSV reader will probably read as either Union{Missing,...}
or Union{Nothing,...}
, and if the user intends to change the convention/standard the original array will not have the element type necessary without allocating a new array anyway.
Not quite, I guess
julia> x = [rand() < .2 ? rand() : nothing for i in 1:1_000_000];
julia> function f1(x)
replace(x, nothing => missing)
end;
julia> function f2(x)
ifelse.(isnothing.(x), missing, x)
end;
julia> @btime f1($x);
3.296 ms (2 allocations: 8.58 MiB)
julia> @btime f2($x);
24.277 ms (13 allocations: 8.58 MiB)
This is strange, because it is clear that there is no allocation of an intermediary Vector
otherwise my method would use the double of the memory (or at least something significantly larger), but any difference in the memory used is less than 1%.
The number of allocations however is higher, maybe the broadcast machinery use @views
or something that allocates without duplicating the whole intermediary array, and for such simple task these extra allocations make the difference.