Finding the "minimal" necessary type of a vector and converting to it

I am having a hard time finding the “minimal” type necessary for a vector and converting to it. By “minimal” I mean the type that can still hold all of the values without being too general. For example:

String > Float64 > Int

since a String can represent any Float64, which in turn can represent an Int. I understand that not every Int can be expressed exactly as a Float64, but promotion still picks Float64:

julia> promote_type(Int, Float64)
Float64
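
To make the exactness caveat concrete, here is a small check in plain Base Julia. It is only an illustration: the value is chosen for the example, and `Int64` is written out so the result does not depend on the platform's default `Int`.

```julia
# 2^53 + 1 is the smallest positive integer that Float64 cannot represent exactly.
n = Int64(2)^53 + 1
x = Float64(n)            # rounds to the nearest representable value

x == n                    # false: the conversion lost the last bit
Int64(x) == Int64(2)^53   # true: the round trip lands on the neighbouring integer
```

Integer/float comparison in Julia is exact, which is why `x == n` reports the loss instead of hiding it.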

My attempt so far is the following:

using Parsers
using StatsBase  # provides `sample`, which guesstype uses below
struct ConvType
    T::DataType
    needsmissing::Bool
end

function guesstype(v; n = 10000)
    if n >= length(v)
        vu = copy(v)
    else
        inds = sample(1:length(v), n, replace = false)
        vu = v[inds] 
    end
    missings = ismissing.(vu)
    needsmissing = any(missings)
    vu = vu[.!missings]
    min_T = Int
    for val in vu
        new_T = _promote(val)
        if  new_T <: AbstractFloat || new_T == String
            min_T = new_T
        end
        min_T == String && break
    end
    return ConvType(min_T, needsmissing)
end

function _promote(s::String)
    s = strip(s)
    p = Parsers.tryparse(Float64, s) 
    if isnothing(p)
        return String
    else
        return _promote(p)
    end
end

function _promote(n::T) where T <: AbstractFloat
    if round(n) == n
        return Int
    else
        return T
    end
end

function _promote(a::T) where T
    return T
end

function convone(s::String, ::Type{T}) where T <: Integer
    s = replace(s, r"\.\d*"=>"")
    return Parsers.parse(T, s)
end

convone(s::String, ::Type{T}) where T <: Number = Parsers.parse(T, s)

function convone(n::T, ::Type{String}) where T <: Number
    string(n)
end

convone(a::T, ::Type{T}) where T = a

function convone(a::T, ::Type{S}) where T<:Number where S<:Number
    S(a)
end

convone(::Missing, ::Type{T}) where T = missing

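function conveach(v)
    T = guesstype(v)
    OT = T.needsmissing ? Union{T.T, Missing} : T.T
    # Build a new vector instead of writing into `v`, so the input is not
    # mutated and a concretely typed input cannot reject the converted values.
    return OT[convone(val, T.T) for val in v]
end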
        

Here is some example output:

julia> conveach(Any[1, "123", 1., "1.00"])
4-element Array{Int64,1}:
   1
 123
   1
   1

julia> conveach(["1.1", "2", 1, 2])
4-element Array{Float64,1}:
 1.1
 2.0
 1.0
 2.0

julia> conveach(["a", "1", 1, 1.1])
4-element Array{String,1}:
 "a"
 "1"
 "1"
 "1.1"

I understand that Julia’s type system/hierarchy is extremely complicated but this code seems quite involved just to promote between String, Float64 and Int. Any suggestions on making this easier?

Your algorithm is inevitably going to be a bit messy here because it is type-unstable: your output depends on the values of the data and not just on the types. Moreover, that type instability will make everything downstream of your conveach function more complicated as well, because it has to deal with data that might be represented by strings or numbers…
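
A minimal sketch of what type instability means here. The names `unstable` and `stable` are hypothetical and not part of the code above; they only contrast a return type that depends on the value with one that depends on the argument type alone.

```julia
# Hypothetical functions, for illustration only.
# The return type of `unstable` depends on the value of x: an Int or a String.
unstable(x::Int) = x > 0 ? x : string(x)

# `stable` returns a Float64 whatever the value of x is.
stable(x::Int) = x > 0 ? float(x) : 0.0

unstable(1) isa Int        # true
unstable(-1) isa String    # true
stable(1) isa Float64      # true
stable(-1) isa Float64     # true
```

In the REPL, `@code_warntype unstable(1)` should flag the non-concrete return type, which is the same situation any caller of `conveach` ends up in.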

In what context does this arise? If you are dealing with data in a messy format, I would try to write a preprocessing script that cleans up your data before processing it. For example, why not just make everything floating-point?

Yeah, the data is kind of weird. It’s basically a bunch of individual responses from an API pasted together, and the API does not seem to be consistent. So I figured I would write something generic, and it got out of hand. I guess just converting everything to Float64 is the best solution. The convert vs parse situation is still a bit annoying but easy to handle. Thanks for your help!
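
A minimal sketch of the "everything to Float64" route, assuming every non-missing entry is either a number or a string that parses as a number. `tofloat` is a hypothetical helper name; dispatch decides between converting and parsing.

```julia
# Hypothetical helper: one method per kind of input.
tofloat(x::Real) = Float64(x)                    # numbers: convert
tofloat(s::AbstractString) = parse(Float64, s)   # strings: parse (throws on non-numeric text)
tofloat(::Missing) = missing                     # keep missing values as they are

r1 = tofloat.(Any[1, "123", 1.0, "1.00"])   # element type Float64
r2 = tofloat.(Any[1, "1.1", missing])       # element type Union{Missing, Float64}
```

Broadcasting narrows the element type from the results, so no separate type-guessing pass is needed. A string such as "a" raises an error here rather than silently turning the whole column into strings.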

You might want to check out BangBang.jl for this, especially push!!. If you just need regular promote behavior, it’s as easy as:

julia> using BangBang

julia> foldl(push!!, Any[1, 1.1, 0x2, 3f0], init=Union{}[])
4-element Array{Float64,1}:
 1.0
 1.1
 2.0
 3.0
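
For intuition, the widening step can be sketched with Base only. This is not BangBang's implementation; `widenpush` is a hypothetical helper that assumes `promote_type` is the desired widening rule, as in the example above.

```julia
# Hypothetical helper: append x, widening the element type when needed.
function widenpush(xs::Vector{T}, x) where {T}
    U = promote_type(T, typeof(x))
    U === T && return push!(xs, x)    # element type is already wide enough
    return push!(Vector{U}(xs), x)    # copy into a wider vector, then append
end

widened = foldl(widenpush, Any[1, 1.1, 0x2, 3f0]; init = Union{}[])
# widened is a Vector{Float64} holding 1.0, 1.1, 2.0, 3.0
```

Each widening copies the vector, so this is only meant to show the idea. Mixing numbers and strings promotes to `Any`, which is why the per-element conversion is still needed for the behavior asked about.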

To get the behavior you want, it’s a little more complicated, but using your functions convone and _promote, not much more:

julia> foldl(Any[1, "123", 1., "1.00"], init=Union{}[]) do a, i
           push!!(a, convone(i, _promote(i)))
       end
4-element Array{Int64,1}:
   1
 123
   1
   1

Thank you. I’ll try that out.