Union Types and conversion thereof

bernhard · June 11, 2018, 6:16am

In some cases, CSV.read returns a dataframe with vectors that are of type Vector{Union{Missing, String}}.
Is there an easy way to ‘remove’ the missing, i.e. convert to a ‘pure’ datatype?

Also, how can I check how many ‘elements’ the union has?

The code below seems to work. But I was wondering if there is an easier way.


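x = [2, 3]
y = convert(Vector{Union{Missing,Int64}}, x)
typeof(y)

"""
Try to convert x::Vector{Union{Missing,T}} to type Vector{T}.
Returns nothing if the element type is not a Union or the conversion fails.
"""
function tryToRemoveMissingFromType(x::AbstractVector)
    elt = eltype(x)
    # Only a Union element type can be narrowed
    if !(elt isa Union)
        return nothing
    end
    # A Union stores its members in the fields a and b
    a = elt.a
    b = elt.b
    targetT = ifelse(a == Missing, b, a)
    try
        return convert(Vector{targetT}, x)
    catch
        # The conversion fails, e.g. when x actually contains missing values
        @warn("conversion failed")
    end
    return nothing
end

z = tryToRemoveMissingFromType(y)
@show eltype(z), eltype(x), eltype(y)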

bernhard · June 11, 2018, 6:23am

I just noticed that the option allowmissing=:none of CSV.read (see the CSV.jl documentation) might solve my issue. I am still interested in an answer to my question above, though.

dmbates · June 11, 2018, 6:08pm

The disallowmissing function in the Missings package is the usual way of performing this transformation. There are methods in the DataFrames package to apply this to all the columns of a DataFrame. See also disallowmissing! in the DataFrames package.
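As a rough illustration of what this transformation amounts to, here is a minimal sketch that uses only Base. It assumes Julia 1.3 or later, where `nonmissingtype` is exported from Base. The helper name `strip_missing_type` is hypothetical and is not part of the Missings or DataFrames API.

```julia
# Hypothetical helper (not the Missings API): narrow the element type
# Union{Missing,T} to T. Assumes Julia >= 1.3 for `nonmissingtype`.
# Throws an error if x actually contains a missing value.
function strip_missing_type(x::AbstractVector)
    T = nonmissingtype(eltype(x))
    return convert(Vector{T}, x)
end

y = Union{Missing,Int}[2, 3]
z = strip_missing_type(y)
@show eltype(y) eltype(z)
```

Because the conversion errors out when a missing value is present, a successful call also confirms that every observation has a value.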

I’m not sure what you mean by your question about the number of elements that the union has. Perhaps you could rephrase it.

ExpandingMan · June 11, 2018, 6:26pm

It might be worth asking whether it’s really necessary to convert the element types of your Vectors. One of the ideas behind Missing was that this usually shouldn’t be necessary, although admittedly that is more true in 0.7 than in 0.6.

Here’s some simple code I used to clean these things up in 0.6:

sanitize(::Type{Missing}, v::AbstractVector{Union{T,Missing}}) where T = convert(AbstractVector{T}, v)

function sanitize!(::Type{Missing}, df::AbstractDataFrame)
    for i ∈ 1:size(df,2)
        if typeof(df[i]) <: AbstractVector{Union{T,Missing}} where {T}
            if count(ismissing, df[i]) == 0
                df[i] = sanitize(Missing, df[i])
            end
        end
    end
    df
end

So you can just do sanitize!(Missing, df).
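The same idea (narrow a column only when it holds no missing values) can be sketched without DataFrames. The version below is an illustration under assumptions: it uses a NamedTuple of vectors as a stand-in for a table, the helper name `narrow` is hypothetical, and it needs Julia 1.3 or later for `nonmissingtype`.

```julia
# Hypothetical helper: drop Missing from the element type only when
# the column allows missing but does not actually contain any.
function narrow(col::AbstractVector)
    if Missing <: eltype(col) && !any(ismissing, col)
        return convert(Vector{nonmissingtype(eltype(col))}, col)
    end
    return col  # leave the column untouched otherwise
end

# A NamedTuple of vectors stands in for the columns of a table.
cols = (a = Union{Missing,Int}[1, 2],
        b = Union{Missing,String}["x", missing])

cleaned = map(narrow, cols)
@show eltype(cleaned.a) eltype(cleaned.b)
```

Column `a` is narrowed to `Int`, while column `b` keeps its Union element type because it really contains a missing value.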

bernhard · June 11, 2018, 7:13pm

Thank you both. Indeed, I possibly should keep the type as it is. But I am using a custom algorithm of mine, and I am really not sure what will happen with missing values, as I have not read up on it. Therefore it seems safer to get rid of the Union type (and thus know that I have values for each observation).
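For context on what happens when missing values reach a computation: missing generally propagates through arithmetic, and `skipmissing` iterates over the non-missing entries only. A small sketch with made-up data, using only Base (Julia 1.0 or later assumed):

```julia
v = Union{Missing,Int}[1, missing, 3]

total = sum(v)                     # the missing entry propagates to the result
total_known = sum(skipmissing(v))  # sums only the non-missing entries
has_gaps = any(ismissing, v)       # true when at least one value is missing

@show total total_known has_gaps
```

Whether a custom algorithm tolerates this propagation depends on the operations it uses, so checking `any(ismissing, v)` up front is one cautious option.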

Regarding the “number of elements the Union has”.
Well, could I have a type Union{String,Int64,Missing}? If yes, that seems to have 3 “elements” whereas Union{Missing, Int64} only has 2.

It seems the following might have been my answer, but it returns 2. Why?

length(fieldnames(typeof(eltype(convert(Vector{Union{Missing,String,Int64}},x)))))

Apologies for the nasty one-liner.

Tamas_Papp · June 12, 2018, 4:52am

Break that up into pieces and you will see that the type is fine:

julia> using Missings

julia> x = [2,3]
2-element Array{Int64,1}:
 2
 3

julia> y = convert(Vector{Union{Missing,String,Int64}},x)
2-element Array{Union{Int64, Missings.Missing, String},1}:
 2
 3

julia> T = eltype(y)
Union{Int64, Missings.Missing, String}

but you need to use the appropriate accessor:

julia> Base.uniontypes(T)
3-element Array{Any,1}:
 Missings.Missing
 String          
 Int64

because of implementation details of Union.
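To make the difference concrete, here is a short sketch comparing the two approaches. Note that `Base.uniontypes` is an unexported internal function, so relying on it is an assumption that may not hold across Julia versions.

```julia
T = Union{Missing,String,Int64}

# typeof(T) is Union, whose representation has exactly two fields,
# a and b, no matter how many member types the union has.
fields = fieldnames(typeof(T))

# Base.uniontypes (internal, unexported) flattens the nested
# representation into a vector of the member types.
members = Base.uniontypes(T)

@show fields length(fields) length(members)
```

Counting fields therefore always gives 2 for any Union, whereas the length of the flattened member list gives the number of types in it.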
