I can absolutely see how this is confusing. To explain where it comes from:
Int is a type: it’s an alias for Int64 or Int32, depending on the platform. Int(...) is therefore a constructor.
float, on the other hand, is an ordinary function, and there is no type or alias Float, so Float(...) can’t work. float(...) tries to infer an appropriate floating-point type to represent the input.
String is a peculiar case. String is obviously a type, and String(...) does work, but it does something other than what you may expect. The constructor accepts only things that can be converted directly to the String type and gives you a String object, so it works for codepoints and other String-like objects. string(...), on the other hand, will turn anything into a string and is therefore different from String(...), similar to how you get an Int from a String via parse(Int, somestring), not Int(somestring).
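To make the distinction concrete, here is a minimal sketch using only Base, assuming Julia 1.x. The helper `throws` is a made-up name for illustration and is not part of Julia.

```julia
# Hypothetical helper: returns true if calling f() throws an exception of type E.
function throws(E, f)
    try
        f()
        return false
    catch err
        return err isa E
    end
end

from_float  = Int(3.0)                               # constructor, exact value: 3
inexact     = throws(InexactError, () -> Int(3.5))   # 3.5 has no exact Int value
no_parse    = throws(MethodError, () -> Int("42"))   # the constructor does not parse strings
parsed      = parse(Int, "42")                       # parsing is what parse is for: 42
widened     = float(1)                               # function picks a float type: 1.0
kept32      = float(1f0)                             # a Float32 input stays Float32
from_symbol = String(:abc)                           # String-like input is accepted: "abc"
no_string   = throws(MethodError, () -> String(42))  # a number is not String-like
any_string  = string(42)                             # string(...) accepts anything: "42"
```

The lowercase functions (`float`, `string`, `parse`) do the "figure it out" work, while the capitalised constructors only accept inputs that already represent the same kind of value.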
Good questions! I’m not “old” enough with Julia to answer that in full, but there used to be Float and int, which were dropped at some point, so there’s probably a good reason for both.
Even 32-bit CPUs use Float64 as the default floating point type, so there’s no need for a Float alias like for Int.
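This is easy to check on your own machine. A small sketch, assuming Julia 1.x; the variable names are only for illustration.

```julia
# Floating-point literals are Float64 regardless of the platform word size.
default_float = typeof(1.0)

# Int, by contrast, follows the platform: Int64 on 64-bit systems, Int32 on 32-bit ones.
int_matches_word_size = Int === (Sys.WORD_SIZE == 64 ? Int64 : Int32)
```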
What would be the point of adding it? We already have Int. string and float are necessary evils, not something that we want to replicate for all types if we can avoid them.
Consistency really, that’s all. From the perspective of a user that is in the middle of a project, with no time to study constructors, Python’s user-friendly df[['B']].astype('T') is ugly, but effective.
PS. As we can see from the discussion around the functioning of Int with @sijo there is also a form of “history dependence” that is unexpected. A “Markov” approach (vis a vis the state of the data) to the behaviour of these functions/constructors would also help with consistency. I accept that Python’s function may also be subject to this criticism (I haven’t tested it).
PPS. For comparison, I have just tested Python. Whether I run @sijo’s experiment or my own,
df3[['B']].astype('int')
df3[['B']].astype('int64')
df3[['B']].astype('float')
df3[['B']].astype('str')
all work as expected. I think this kind of robustness is essential.
What can I say, except that it works for me, and that it fits the bill for what constructors are supposed to do.
Correction: I think I know why it works. In my experiment, I started with an array of integers and converted it to an array of strings. Then I converted it back to integers. (This may sound like something that one would never do in the field, but I needed it last year in a project in R. It is one of the reasons I am looking into Julia.)
In your experiment, you start with strings, so I guess that “Julia” is trying to force you into using parse.
Thanks @Skoffer and @sijo and apologies to all for the confusion. I stand corrected. Int only works on numbers (not strings).
To sum up:
Only parse(T, x) will convert strings to numbers (and parse expects text input, not numbers).
The constructor String(x) will only work on strings and string-like objects.
The constructors Int(x), Float64(x), etc. only work on numbers.
convert(T, x) is a conservative alternative to constructors.
string(x) works on anything.
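The round trip from the experiment above (integers to strings and back) can be sketched as follows, assuming Julia 1.x and only Base. The helper `throws` is a hypothetical name introduced for illustration.

```julia
# Hypothetical helper: returns true if calling f() throws an exception of type E.
function throws(E, f)
    try
        f()
        return false
    catch err
        return err isa E
    end
end

ints = [1, 2, 3]
strs = string.(ints)        # numbers -> strings with string: ["1", "2", "3"]
back = parse.(Int, strs)    # strings -> numbers with parse:  [1, 2, 3]

constructor_on_strings = throws(MethodError, () -> Int.(strs))          # Int does not parse
parse_on_number        = throws(MethodError, () -> parse(Int, 3))       # parse wants text
as_float               = convert(Float64, 2)                            # number -> number: 2.0
lossy_convert          = throws(InexactError, () -> convert(Int, 2.5))  # convert refuses to round
```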
All my comments on “history dependence” are nonsense. The “state” of the data is “Markovian”, just as it should be.
Final whinge: I think a lot of this would be much clearer if the domain and range of each function were clearly specified somewhere. All this T(x) in the manual would be much improved by writing T : X \rightarrow Y.
Thanks all. I guess you can’t make an omelette without cracking a few eggs.
I also don’t really feel the difference between constructors (Int(x), Float64(x), String(x)) and the convert(T, x) function: in practice both convert from one type to another when these types represent “the same” kind of thing (e.g. number to number).
The distinction between constructors and functions like string(x), parse(T, x) is pretty clear, however: string and parse translate between conceptually different kinds of things, e.g. strings and numbers. There is no uniquely natural way to do this conversion in general: for example, what base should the string <-> number conversion use?
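The base question is a good illustration: both `parse` and `string` take a `base` keyword for integers, so the same text can mean different numbers. A minimal sketch, assuming Julia 1.x:

```julia
hex_value  = parse(Int, "ff"; base = 16)    # 255
bin_value  = parse(Int, "101"; base = 2)    # 5
dec_value  = parse(Int, "101")              # base 10 by default: 101
hex_string = string(255; base = 16)         # "ff"
bin_string = string(255; base = 2)          # "11111111"
```

A one-argument constructor such as Int(x) has no room for this kind of choice, which is one way to see why the conversion lives in separate functions.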
The manual mentions that convert can be called implicitly. (However, I have not been able to find a definition or example of an “implicit call”.)
That discussion also covers consistency concerns and many of the points made here, including a relevant comment by Stefan Karpinski on concern for newcomers. I guess he lost that battle.
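As far as I understand it, an “implicit call” means that Julia inserts convert for you when a value has to fit a declared type. A minimal sketch of three such places, assuming Julia 1.x; the function names are made up for illustration.

```julia
# 1. Storing into a typed array converts the value to the element type.
a = Float64[0.0, 0.0]
a[1] = 1                      # behaves like a[1] = convert(Float64, 1)

# 2. Assigning to a local variable with a declared type converts on assignment.
function typed_local()
    x::Float64 = 2            # stored as 2.0
    return x
end

# 3. A declared return type converts the returned value.
as_int(x)::Int = x

stored   = a[1]               # 1.0
local_x  = typed_local()      # 2.0
returned = as_int(3.0)        # 3
```

In none of these cases does convert appear in the code, yet a value that cannot be converted (for example as_int(3.5)) raises an error from convert.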
julia> struct IntInDisguise
           the::Int
end
julia> IntInDisguise(1)
IntInDisguise(1)
julia> IntInDisguise(1.)
IntInDisguise(1)
julia> IntInDisguise("1")
ERROR: MethodError: Cannot `convert` an object of type String to an object of type Int64
Closest candidates are:
convert(::Type{T}, ::T) where T<:Number at number.jl:6
convert(::Type{T}, ::Number) where T<:Number at number.jl:7
convert(::Type{T}, ::Base.TwicePrecision) where T<:Number at twiceprecision.jl:250
...
Stacktrace:
[1] IntInDisguise(the::String)
@ Main ./REPL[1]:2
[2] top-level scope
@ REPL[4]:1
The constructor does call convert, although convert never appears directly in the constructor. There’s really nothing more to it; you can extend these as normal (but don’t do that in real code).
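A minimal sketch of what such an extension could look like. To avoid changing the behaviour of Base types, it uses two hypothetical types, `Meters` and `Track`, that are not from this thread; the string rule is an assumption chosen only for illustration.

```julia
# Hypothetical wrapper type owned by this example.
struct Meters
    value::Float64
end

# New convert method: how to turn a string into Meters (illustrative rule only).
Base.convert(::Type{Meters}, s::AbstractString) = Meters(parse(Float64, s))

# A struct with a Meters field; its default constructor converts each argument
# to the declared field type.
struct Track
    length::Meters
end

t = Track("400")   # works because convert(Meters, "400") is now defined
```

Defining the same kind of method for types you do not own, such as convert(::Type{Int}, ::String), would silently change behaviour for every package in the session, which is why it is discouraged in real code.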
The answer to the following question is also very good. https://stackoverflow.com/questions/12036037/explicit-call-to-a-constructor
From the comments in the Julia manual (the page you cite) convert can also be implicitly called, so it seems natural to conclude that T(x) are explicit, one-argument constructors.
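To widen a column’s element type without changing the data in it (only its union type), I have written this little function:
using DataFrames

# Widen the element type of column `colname` to a Union that also includes `appendtypes`.
# The stored values are copied unchanged; only the column's element type changes.
function add_type!(df::DataFrame, colname::Symbol, appendtypes::Type...)
    df[!, colname] =
        Vector{Union{appendtypes..., Base.uniontypes(eltype(df[!, colname]))...}}(df[!, colname])
    return df
end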
This way, you can add types as a Union type to the column of interest, which will allow you to later add values of that type to the column without getting errors.
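The same idea can be seen on a plain vector, without DataFrames. A minimal sketch, assuming Julia 1.x; the variable names are only for illustration.

```julia
col = [1, 2, 3]                                    # Vector{Int}

# Pushing a missing into the narrow vector fails, because missing cannot become an Int.
narrow_fails = try
    push!(col, missing)
    false
catch err
    err isa MethodError
end

# Copy the values into a vector with a wider element type; the data is unchanged.
wide = Vector{Union{Missing, eltype(col)}}(col)
push!(wide, missing)                               # now allowed
```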
Thanks pdeffebach, that is very useful information. (passmissing)
Is there even a way that works when there are 100% missings?
using DataFrames, Dates
# Vector of string dates
df_str_date = DataFrame(date_ = ["2022-02-10", "2022-02-11"])
# Convert date_ from type String to type Date
df_str_date.date_ = parse.(Date, df_str_date[:, :date_]) # Type of date_ is now Date, great
# Vector of string dates, including missing values
df_str_date2 = DataFrame(date_ = ["2022-02-10", missing])
# Convert date_ from type String to type Date
df_str_date2.date_ = passmissing(parse).(Date, df_str_date2.date_) # Type of date_ is now Union{Missing, Date}. Missings.passmissing does the trick
# Vector of "string" dates, 100% missing values
df_str_date3 = DataFrame(date_ = [missing, missing])
df_str_date3.date_ = passmissing(parse).(Date, df_str_date3.date_) # Type of date_ is still Missing, not Union{Missing, Date} as I had hoped
given that you have to manually specify things here anyway (there’s no way to tell you want a Date column when everything is missing other than explicitly saying you do)
EDIT: ah sorry, I guess what you proposed could be a function that always produces a Date column, like this:
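One way such a function might look, as a minimal sketch that needs only the Dates standard library. The name `to_date_column` and the `format` keyword are hypothetical and not from any package.

```julia
using Dates

# Hypothetical helper: always returns a Vector{Union{Missing, Date}},
# even when every input value is missing.
function to_date_column(col; format = dateformat"yyyy-mm-dd")
    return Union{Missing, Date}[ismissing(x) ? missing : Date(x, format) for x in col]
end

mixed       = to_date_column(["2022-02-10", missing])
all_missing = to_date_column([missing, missing])
```

The typed comprehension fixes the element type up front, so the result does not depend on which values happen to be present. It could then be assigned to a column, for example df_str_date3.date_ = to_date_column(df_str_date3.date_).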