DataFrames: convert column data type

I can absolutely see how this is confusing. Just to explain where that comes from:

  • Int is a type: it is an alias for Int64 or Int32, depending on the platform, so Int(...) is a constructor.
  • float, on the other hand, is an ordinary function, and there is no type or alias Float, so that can’t work. float(...) tries to infer an appropriate floating-point type to represent the input.
  • String is a peculiar case. String is obviously a type, and String(...) does work, but it does something other than what you might expect. The constructor only accepts things that can be converted directly to the String type, such as other string types, Symbols, or vectors of characters or bytes, and gives you a String object. string(...), on the other hand, will turn anything into a string and is therefore different from String(...), similar to how you get an Int from a String via parse(Int, somestring), not Int(somestring).
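
To make the three cases concrete, here is a small sketch in plain Julia (no packages). The variable names are just for illustration, and the comments describe what I would expect on a current 1.x release:

```julia
# Int is an alias for the platform's native integer type.
int_is_native = Int === (Sys.WORD_SIZE == 64 ? Int64 : Int32)

# Int(...) is a constructor: it accepts numbers that are exactly representable.
a = Int(3.0)

# float(...) is an ordinary function that picks a suitable floating-point type.
b = float(1)      # an Int input gives a Float64
c = float(1f0)    # a Float32 input stays a Float32

# String(...) only accepts string-like inputs; string(...) accepts anything.
d = String(:abc)
e = string(23)

# Going from text to a number is done with parse, with the target type spelled out.
n = parse(Int, "23")
```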

I appreciate your empathy and your explanation @FPGro :slight_smile: I am now reading up on constructors. It would still be nice to know:

  1. why is there no constructor/alias to Float64 or Float32 called Float?

  2. why is there no generic function with n methods called int?

Good questions! I’m not “old” enough with Julia to answer that in full, but there used to be a Float and an int, which were dropped at some point, so there’s probably a good reason for both :smiley:

https://github.com/JuliaLang/julia/issues/1231

https://github.com/JuliaLang/julia/issues/1470

Those may be good starts if you want to dig deeper


Even 32-bit CPUs use Float64 as the default floating point type, so there’s no need for a Float alias like for Int.

What would be the point of adding it? We already have Int. string and float are necessary evils, not something that we want to replicate for all types if we can avoid them.


I don’t understand these comments: Int("23") does not work. And the following also doesn’t work:

using DataFrames

df = DataFrame(B=["23"])
Int.(df[!,:B])

Consistency really, that’s all. From the perspective of a user that is in the middle of a project, with no time to study constructors, Python’s user-friendly df[['B']].astype('T') is ugly, but effective.

PS. As we can see from the discussion around the functioning of Int with @sijo there is also a form of “history dependence” that is unexpected. A “Markov” approach (vis a vis the state of the data) to the behaviour of these functions/constructors would also help with consistency. I accept that Python’s function may also be subject to this criticism (I haven’t tested it).

PPS. For comparison, I have just tested Python. Whether I run @sijo’s experiment or my own,

df3[['B']].astype('int')
df3[['B']].astype('int64')
df3[['B']].astype('float')
df3[['B']].astype('str')

all work as expected. I think this kind of robustness is essential.

What can I say, except that it works for me, and that it fits the bill for what constructors are supposed to do.

Correction: I think I know why it works. In my experiment, I started with an array of integers and then converted it to an array of strings. Then I converted it back to integers. (This may sound like something that one would never do in the field, but I needed it last year in a project in R. It is one of the reasons I am looking into Julia.)

In your experiment, you start with strings, so I guess that “Julia” is trying to force you into using parse.

Probably you called Int on values that were already numbers (not strings). Prove me wrong :slight_smile:


DIY:

df = DataFrame(B = [23])
string.(df[!,:B])
Int.(df[!,:B])

QED :slight_smile:

Your second line does not change the data frame.

julia> df = DataFrame(B = [23])
1×1 DataFrame
 Row │ B
     │ Int64
─────┼───────
   1 │    23

julia> string.(df[!,:B])
1-element Vector{String}:
 "23"

julia> df[!, :B]
1-element Vector{Int64}:
 23

If you want to make a change, you should reassign the column

julia> df[!, :B] = string.(df[!,:B])
1-element Vector{String}:
 "23"

julia> df[!, :B]
1-element Vector{String}:
 "23"

Of course, Int. no longer works

julia> Int.(df[!,:B])
ERROR: MethodError: no method matching Int64(::String)
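
To get back from the string column to integers, parse can be broadcast over the column instead of Int. A minimal sketch, using a plain vector `col` (my own stand-in for `df[!, :B]`, so that it runs without DataFrames):

```julia
# Stand-in for df[!, :B] after the column was reassigned to strings.
col = ["23", "42"]

# Broadcast parse over the column; Int is treated as a scalar argument.
nums = parse.(Int, col)

# With the data frame from above, the same idea would read:
# df[!, :B] = parse.(Int, df[!, :B])
```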


Thanks @Skoffer and @sijo and apologies to all for the confusion. I stand corrected. Int only works on numbers (not strings).

To round up,

  • Only parse(T, x) converts strings to numbers (and parse only accepts strings).

  • The constructor String(x) only works on strings and string-like objects.

  • The constructors Int(x), Float64(x), etc. only work on numbers.

  • convert(T, x) is a conservative alternative to constructors.

  • string(x) works on anything.

  • All my comments on “history dependence” are nonsense. The “state” of the data is “Markovian”, just as it should be.
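
The round-up above can be checked with a short sketch in plain Julia. `throws` is a made-up helper for this example, not something from Base:

```julia
# Hypothetical helper for this sketch: true if calling f() raises an error.
function throws(f)
    try
        f()
        return false
    catch
        return true
    end
end

# Conversions that work
p = parse(Int, "23")                     # text -> number
s = String(SubString("hello", 1, 2))     # string-like -> String
i = Int(23.0)                            # number -> number, value is exactly representable
k = convert(Int, 23.0)                   # same result as the constructor here
t = string(23.5)                         # anything -> text

# Conversions that are rejected
r1 = throws(() -> Int("23"))             # the constructor does not parse text
r2 = throws(() -> parse(Int, 23))        # parse wants a string
r3 = throws(() -> String(23))            # String does not stringify numbers
r4 = throws(() -> Int(23.5))             # 23.5 is not an exact integer
```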

Final whinge: I think a lot of this would be much clearer if the domain and range of each function were clearly specified somewhere. All this T(x) in the manual would be much improved by writing T : X → Y.

Thanks all. I guess you can’t make an omelette without cracking a few eggs. :slight_smile:


I also don’t really feel the difference between constructors (Int(x), Float64(x), String(x)) and the convert(T, x) function: in practice both convert from one type to another when these types represent “the same” kind of thing (e.g. number to number).
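
One place where the two do seem to differ, as far as I can tell: a constructor may accept inputs that convert refuses, because convert is restricted to conversions considered safe enough to happen implicitly. A small sketch (`throws` is a made-up helper, and the Symbol case is my own choice of example):

```julia
# Hypothetical helper for this sketch: true if calling f() raises an error.
function throws(f)
    try
        f()
        return false
    catch
        return true
    end
end

# For number-to-number conversions, constructor and convert agree.
same = Float64(1) === convert(Float64, 1)

# The String constructor accepts a Symbol ...
from_symbol = String(:abc)

# ... but, to my knowledge, there is no convert method from Symbol to String.
convert_rejected = throws(() -> convert(String, :abc))
```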

The distinction between constructors and functions like string(x), parse(T, x) is pretty clear, however: string and parse translate between conceptually different kinds of things, e.g. strings and numbers. There is no uniquely natural way to do this conversion in general: for example, what base should the string <-> number conversion use?
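
The base question is a good illustration: both parse and string take a `base` keyword, so the same text or number can map to different results depending on the choice.

```julia
# The same digits mean different numbers in different bases.
n16 = parse(Int, "ff", base = 16)
n2  = parse(Int, "101", base = 2)

# And the same number has different textual forms.
s16 = string(255, base = 16)
s10 = string(255)
```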


The manual says something about convert being called implicitly. (I have not been able to find a definition or example of “implicit call”, however.)

The following link (thanks to @FPGro for posting it) has a pretty good discussion about the process that went into convert and constructors
https://github.com/JuliaLang/julia/issues/1470

That discussion also covers consistency concerns and many of the points made here, including a relevant comment by Stefan Karpinski on concern for newcomers :slight_smile: I guess he lost that battle.

I think the implicit calling just refers to these cases:

https://docs.julialang.org/en/v1/manual/conversion-and-promotion/#When-is-convert-called?

So when you do:

julia> struct IntInDisguise
         the::Int
       end

julia> IntInDisguise(1)
IntInDisguise(1)

julia> IntInDisguise(1.)
IntInDisguise(1)

julia> IntInDisguise("1")
ERROR: MethodError: Cannot `convert` an object of type String to an object of type Int64
Closest candidates are:
  convert(::Type{T}, ::T) where T<:Number at number.jl:6
  convert(::Type{T}, ::Number) where T<:Number at number.jl:7
  convert(::Type{T}, ::Base.TwicePrecision) where T<:Number at twiceprecision.jl:250
  ...
Stacktrace:
 [1] IntInDisguise(the::String)
   @ Main ./REPL[1]:2
 [2] top-level scope
   @ REPL[4]:1

The constructor does call convert, although convert never appears directly in the constructor. There’s really nothing more to it. You can extend these as normal (don’t do that in real code: adding a Base.convert method for types you do not own is type piracy):

julia> Base.convert(::Type{Int}, s::String) = parse(Int,s)

julia> IntInDisguise("1")
IntInDisguise(1)
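
The manual page linked above lists a few more situations where convert is called implicitly, such as assigning into an array, assigning to a variable with a declared type, and returning from a function with a declared return type. A rough sketch of those three (the function names are made up for this example):

```julia
# Assigning into an array converts to the array's element type.
v = Int[1, 2]
v[1] = 3.0
push!(v, 4.0)

# A declared return type converts the returned value.
function one_as_float()::Float64
    return 1
end

# A local variable with a declared type converts on assignment.
function typed_local()
    x::Float64 = 1
    return x
end
```

In all three cases a value such as 3.5 assigned into `v` would raise an error instead, since it cannot be represented exactly as an Int.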

very nice

The answer to the following question is also very good.
https://stackoverflow.com/questions/12036037/explicit-call-to-a-constructor
From the comments in the Julia manual (the page you cite) convert can also be implicitly called, so it seems natural to conclude that T(x) are explicit, one-argument constructors.

To change only the column’s element type (its Union type) without changing the data in it, I have written this little function:

function add_type!(df::DataFrame, colname::Symbol, appendtypes::Type...)
    df[!, colname] =
        Vector{Union{appendtypes..., Base.uniontypes(eltype(df[!, colname]))...}}(df[!, colname])
    return df
end

This way, you can add types as a Union type to the column of interest, which will allow you to later add values of that type to the column without getting errors.
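
For anyone who wants to see the widening step in isolation, here is a sketch on a plain vector, so it runs without DataFrames. Note that `Base.uniontypes` is an unexported internal, so its behaviour is an assumption that may change between Julia versions:

```julia
# Stand-in for a data frame column with element type Int.
col = [1, 2]

# Build the widened element type from the existing one plus Missing,
# then copy the data into a vector of that type.
widened = Vector{Union{Missing, Base.uniontypes(eltype(col))...}}(col)

# The data is unchanged, but missing values can now be added.
push!(widened, missing)
```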

Thanks pdeffebach, that is very useful information. (passmissing)
Is there also a way that works when 100% of the values are missing?

using DataFrames, Dates

# Vector of string dates
df_str_date = DataFrame(date_ = ["2022-02-10", "2022-02-11"])
# Convert date_ from String to Date
df_str_date.date_ = parse.(Date, df_str_date[:, :date_])  # eltype of date_ is now Date, great

# Vector of string dates, including missing values
df_str_date2 = DataFrame(date_ = ["2022-02-10", missing])
# Convert date_ from String to Date; Missings.passmissing does the trick
df_str_date2.date_ = passmissing(parse).(Date, df_str_date2.date_)  # eltype is now Union{Missing, Date}

# Vector of "string" dates, 100% missing values
df_str_date3 = DataFrame(date_ = [missing, missing])
df_str_date3.date_ = passmissing(parse).(Date, df_str_date3.date_)  # eltype is still Missing, not Union{Missing, Date} as I had hoped

Best I can think of is

julia> t = [missing, missing]
2-element Vector{Missing}:
 missing
 missing

julia> Union{Date, Missing}[passmissing(parse)(Date, ti) for ti in t]
2-element Vector{Union{Missing, Date}}:
 missing
 missing

Although I guess at that point you could just write

julia> missings(Date, 2)
2-element Vector{Union{Missing, Date}}:
 missing
 missing

given that you have to manually specify things here anyway (there’s no way to tell you want a Date column when everything is missing other than explicitly saying you do)

EDIT: ah sorry, I guess what you proposed could be a function that always produces a Date column, like this:

julia> always_parse_Date(x) = identity.(Union{Missing, Date}[passmissing(parse)(Date, xᵢ) for xᵢ ∈ x])
always_parse_Date (generic function with 1 method)

julia> always_parse_Date(["2020-1-1", "2020-1-2"])
2-element Vector{Date}:
 2020-01-01
 2020-01-02

julia> always_parse_Date(["2020-1-1", missing])
2-element Vector{Union{Missing, Date}}:
 2020-01-01
 missing

julia> always_parse_Date([missing, missing])
2-element Vector{Union{Missing, Date}}:
 missing
 missing
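
Since `identity.` narrows the element type to the values actually present (that is why the first call above returns a Vector{Date}), it is worth double-checking the all-missing case on your own Julia version. A variant without the narrowing step, which keeps Union{Missing, Date} regardless of content, could look like this. `parse_dates` is a made-up name, and the sketch only needs the Dates standard library:

```julia
using Dates

# Hypothetical helper: always returns a Vector{Union{Missing, Date}},
# whether the input holds only strings, only missings, or a mix.
parse_dates(x) = Union{Missing, Date}[ismissing(xi) ? missing : parse(Date, xi) for xi in x]

all_dates   = parse_dates(["2022-02-10", "2022-02-11"])
some_dates  = parse_dates(["2022-02-10", missing])
no_dates    = parse_dates([missing, missing])
```

The trade-off is that a column without any missing values also stays Union{Missing, Date}, which is what you want if later rows may contain missings.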