Union Types and conversion thereof

In some cases, CSV.read returns a DataFrame whose columns are of type Vector{Union{Missing, String}}.
Is there an easy way to ‘remove’ the missing, i.e. convert to a ‘pure’ datatype?

Also, how can I check how many ‘elements’ the union has?

The code below seems to work. But I was wondering if there is an easier way.


x=[2,3]
y=convert(Vector{Union{Missing,Int64}},x)
typeof(y)

"""
tries to convert x::Vector{Union{Missing,T}} to type Vector{T}
"""
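function tryToRemoveMissingFromType(x::AbstractVector)
    @show elt = eltype(x)
    # Only element types that are a Union can be narrowed
    if typeof(elt) != Union
        return nothing
    end
    # A two-member Union stores its members in the fields `a` and `b`
    a = elt.a
    b = elt.b
    targetT = ifelse(a == Missing, b, a)
    try
        return convert(Vector{targetT}, x)
    catch
        # Conversion fails, e.g. if x actually contains a missing value
        @warn("oops")
    end
    return nothing
end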


z=tryToRemoveMissingFromType(y)
@show eltype(z),eltype(x),eltype(y)

I just noticed that the option allowmissing=:none of CSV.read (see the CSV.jl documentation) might solve my issue, although I am still interested in an answer to my question above.

The disallowmissing function in the Missings package is the usual way of performing this transformation. There are methods in the DataFrames package to apply this to all the columns of a DataFrame. See also disallowmissing! in the DataFrames package.
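As a minimal sketch of that suggestion, reusing the vectors from the question and assuming the Missings.jl package is installed (the variable names are illustrative only):

```julia
using Missings  # assumption: the Missings.jl package is installed

x = [2, 3]
y = convert(Vector{Union{Missing,Int64}}, x)   # same setup as in the question

# Narrow the element type from Union{Missing,Int64} to Int64
z = disallowmissing(y)
@show eltype(y) eltype(z)

# If the vector really contains a missing value, the conversion throws
w = Union{Missing,Int64}[1, missing]
try
    disallowmissing(w)
catch err
    println("conversion failed: ", typeof(err))
end
```

The try/catch is only there to show the failure case; in real code it is usually better to let the error surface, since it signals that the data does contain missing values.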

I’m not sure what you mean by your question about the number of elements that the union has. Perhaps you could rephrase it.

It might be worth asking if it’s really necessary to convert the element types of your Vectors. One of the ideas behind Missing was that this usually shouldn’t be necessary, although that is more true in 0.7 than in 0.6.

Here’s some simple code I used to clean these things up in 0.6:

sanitize(::Type{Missing}, v::AbstractVector{Union{T,Missing}}) where T = convert(AbstractVector{T}, v)

function sanitize!(::Type{Missing}, df::AbstractDataFrame)
    for i ∈ 1:size(df,2)
        if typeof(df[i]) <: AbstractVector{Union{T,Missing}} where {T}
            if count(ismissing, df[i]) == 0
                df[i] = sanitize(Missing, df[i])
            end
        end
    end
    df
end

So you can just do sanitize!(Missing, df).
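For a single vector, the same "convert only when no missing values are present" idea can be sketched with Base functions alone. This assumes a newer Julia than the 0.6 code above (nonmissingtype is available in Base from Julia 1.3), and drop_missing_type is a hypothetical helper name, not part of any package:

```julia
# Hypothetical helper (not from any package): narrow the element type only
# when the vector holds no missing values; otherwise return it unchanged.
function drop_missing_type(v::AbstractVector)
    T = nonmissingtype(eltype(v))      # Union{Missing,Int64} -> Int64
    any(ismissing, v) && return v      # real missings present: leave as is
    return convert(Vector{T}, v)
end

a = Union{Missing,Int64}[1, 2, 3]
b = Union{Missing,Int64}[1, missing, 3]

@show eltype(drop_missing_type(a))
@show eltype(drop_missing_type(b))
```

Returning the input unchanged when missings are present mirrors the count(ismissing, ...) == 0 check in the loop above; throwing an error instead would be the stricter alternative.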

Thank you both. Indeed, I should possibly keep the type as it is. But I am using a custom algorithm of mine, and I am really not sure what would happen with missing values, as I have not read up on it. Therefore it seems safer to get rid of the Missing part of the type (and thus know that I have values for each observation).

Regarding the “number of elements the Union has”.
Well, could I have a type Union{String,Int64,Missing}? If yes, that seems to have 3 “elements” whereas Union{Missing, Int64} only has 2.

It seems the following might have been my answer, but it returns 2. Why?

length(fieldnames(typeof(eltype(convert(Vector{Union{Missing,String,Int64}},x)))))

Apologies for the nasty one-liner.

Break that up into pieces and you will see that the type is fine:

julia> using Missings

julia> x = [2,3]
2-element Array{Int64,1}:
 2
 3

julia> y = convert(Vector{Union{Missing,String,Int64}},x)
2-element Array{Union{Int64, Missings.Missing, String},1}:
 2
 3

julia> T = eltype(y)
Union{Int64, Missings.Missing, String}

but you need to use the appropriate accessor:

julia> Base.uniontypes(T)
3-element Array{Any,1}:
 Missings.Missing
 String          
 Int64           

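because of implementation details of Union: a Union is stored as a nested pair with the two fields a and b, so fieldnames(typeof(T)) returns the field names of Union itself, which are always those two, no matter how many types the union contains.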
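A short sketch of the difference, assuming a Julia 1.x session. Note that Base.uniontypes is unexported, so it may change between versions:

```julia
T = Union{Missing,String,Int64}

# typeof(T) is the type `Union` itself, whose fields are fixed
@show typeof(T)
@show fieldnames(typeof(T))      # the fields of `Union`, not the members of T

# Base.uniontypes flattens the nested structure into a vector of member types
members = Base.uniontypes(T)
@show length(members)

# Membership can be checked without relying on the order of `members`
@show Missing in members
```

The order of the returned members is an implementation detail, so it is safer to test membership than to index into the result.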