In some cases, CSV.read returns a dataframe with vectors that are of type Vector{Union{Missing, String}}.
Is there an easy way to ‘remove’ the missing, i.e. convert to a ‘pure’ datatype?
Also, how can I check how many ‘elements’ the union has?
The code below seems to work. But I was wondering if there is an easier way.
x = [2, 3]
y = convert(Vector{Union{Missing,Int64}}, x)
typeof(y)

"""
Tries to convert x::Vector{Union{Missing,T}} to type Vector{T}.
Returns nothing if the element type is not a Union or the conversion fails.
"""
function tryToRemoveMissingFromType(x::AbstractVector)
    @show elt = eltype(x)
    # Only Union element types are handled
    if typeof(elt) != Union
        return nothing
    end
    # A two-member Union stores its members in the fields a and b
    a = elt.a
    b = elt.b
    targetT = ifelse(a == Missing, b, a)
    try
        res = convert(Vector{targetT}, x)
        return res
    catch
        @warn("oops")
    end
    return nothing
end

z = tryToRemoveMissingFromType(y)
@show eltype(z), eltype(x), eltype(y)
I just noticed that the allowmissing=:none option of CSV.read (see the CSV.jl documentation) might solve my issue. I am still interested in an answer to my question above, though.
The disallowmissing function in the Missings package is the usual way of performing this transformation. There are methods in the DataFrames package to apply this to all the columns of a DataFrame. See also disallowmissing! in the DataFrames package.
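For a single vector, the same idea can be sketched with Base alone. This is only an illustration, not how disallowmissing is implemented: strip_missing is a hypothetical helper name, and the sketch assumes Julia 1.3 or later, where nonmissingtype is exported from Base.

```julia
# Base-only sketch; `strip_missing` is a hypothetical helper, not a library function.
# Assumes Julia 1.3+, where `nonmissingtype` is available in Base.
function strip_missing(v::AbstractVector)
    # Drop Missing from the element type, e.g. Union{Missing,Int64} -> Int64
    T = nonmissingtype(eltype(v))
    # convert throws an error if v actually contains a missing value
    return convert(Vector{T}, v)
end

y = Union{Missing,Int64}[2, 3]
z = strip_missing(y)
@show eltype(y) eltype(z)
```

Because the conversion fails loudly when a missing value is present, a successful call also confirms that every observation has a value.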
I’m not sure what you mean by your question about the number of elements that the union has. Perhaps you could rephrase it.
It might be worth asking if it’s really necessary to convert the element types of your Vectors. One of the ideas behind Missing was that this usually shouldn’t be necessary, granted that is more true in 0.7 than it is in 0.6.
Here’s some simple code I used to clean these things up in 0.6:
sanitize(::Type{Missing}, v::AbstractVector{Union{T,Missing}}) where T = convert(AbstractVector{T}, v)

function sanitize!(::Type{Missing}, df::AbstractDataFrame)
    for i ∈ 1:size(df,2)
        if typeof(df[i]) <: AbstractVector{Union{T,Missing}} where {T}
            if count(ismissing, df[i]) == 0
                df[i] = sanitize(Missing, df[i])
            end
        end
    end
    df
end
So you can just do sanitize!(Missing, df).
Thank you both. Indeed, I should possibly keep the type as it is. But I am using a custom algorithm of mine, and I am really not sure what would happen with missing values, as I have not read up on it. Therefore it seems safer to get rid of the Missing in the type (and thus know that I have values for each observation).
Regarding the “number of elements the Union has”.
Well, could I have a type Union{String,Int64,Missing}? If yes, that seems to have 3 “elements” whereas Union{Missing, Int64} only has 2.
It seems the following might have been my answer, but it returns 2. Why?
length(fieldnames(typeof(eltype(convert(Vector{Union{Missing,String,Int64}},x)))))
Apologies for the nasty one-liner.
Break that up into pieces and you will see that the type is fine:
julia> using Missings

julia> x = [2,3]
2-element Array{Int64,1}:
 2
 3

julia> y = convert(Vector{Union{Missing,String,Int64}},x)
2-element Array{Union{Int64, Missings.Missing, String},1}:
 2
 3

julia> T = eltype(y)
Union{Int64, Missings.Missing, String}
but you need to use the appropriate accessor:
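julia> Base.uniontypes(T)
3-element Array{Any,1}:
 Missings.Missing
 String
 Int64

because of implementation details of Union: a Union is stored as a nested pair with two fields, a and b, so fieldnames reports 2 no matter how many types the Union contains.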
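To make the difference concrete, here is a small sketch that counts the members of a Union both ways. Note that Base.uniontypes is an unexported internal helper, so it is assumed here to behave as in the session above and may change between Julia versions.

```julia
# Count the member types of a Union.
# Base.uniontypes is internal (not exported), so treat this as a sketch.
T = Union{Missing,String,Int64}

members = Base.uniontypes(T)          # flat list of the member types
n_members = length(members)           # 3

# The original one-liner inspects the fields of the Union type itself,
# which are always just `a` and `b`.
n_fields = length(fieldnames(typeof(T)))   # 2

# Checking whether Missing is one of the members needs no internals:
allows_missing = Missing <: T

@show n_members n_fields allows_missing
```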