How to convert String type columns of a dataframe to a DateTime type and a Float64 type, using either plain DataFrames.jl or Query.jl

I have the following dataframe, with original columns of String type:

using DataFrames
df_0 = DataFrame(Inicio = ["10 Jan 2021  15:00 ", "10 Jan 2021  16:43 ", "10 Jan 2021  17:13 "], Q1 = ["0.50", "0.00", "0.50"])

I am able to convert the column named “Q1” to a Float64 type with (à la Query) the command:

using Query
df_1 = df_0 |> @mutate(Q1 = parse(Float64, _.Q1)) |> DataFrame

It does mutate the original df_0 to a new df_1 with “Inicio” remaining of String type and “Q1” now a Float64 type, as expected (is there a better or more efficient way?).

However, if then I try the naive commands below, inspired by Standalone Query Commands · Query.jl , to convert the column named “Inicio” to a DateTime type with proper format( notice the spaces within the strings!):

using Dates
dtformat = DateFormat("d u Y  H:M ")
df_2 = df_1 |> @mutate(Inicio = DateTime(_.Inicio, dtformat)) |> DataFrame

it generates another df_2 with the columns “Inicio” and “Q1” now both of type Any (sic!). The conversion of the column Inicio to the DateTime seems to have worked, however. In fact, when I just type:

df_2

it displays the dataframe with the types Any (in slight gray, on the REPL), right below the two named columns, whereas when I type, e.g.:

typeof(df_2.Inicio[1]), typeof(df.Q1[1]

I get the wanted output tuple: (DateTime, Float64)…
What is happening here? Shouldn’t the columns of df_2 be of type DateTime and Float64, respectively?

In plain DataFrames.jl you would write this transformation as:

julia> transform(df_0, :Inicio => ByRow(x -> DateTime(x, DateFormat("d u Y  H:M "))), :Q1 => ByRow(x -> parse(Float64, x)), renamecols=false)
3×2 DataFrame
 Row │ Inicio               Q1
     │ DateTime             Float64
─────┼──────────────────────────────
   1 │ 2021-01-10T15:00:00      0.5
   2 │ 2021-01-10T16:43:00      0.0
   3 │ 2021-01-10T17:13:00      0.5

but this is a bit verbose.

You could also just do:

julia> DataFrame(Inicio=DateTime.(df_0.Inicio, DateFormat("d u Y  H:M ")), Q1=parse.(Float64, df_0.Q1))
3×2 DataFrame
 Row │ Inicio               Q1
     │ DateTime             Float64
─────┼──────────────────────────────
   1 │ 2021-01-10T15:00:00      0.5
   2 │ 2021-01-10T16:43:00      0.0
   3 │ 2021-01-10T17:13:00      0.5

which is shorter (but would create only two columns, while transform would keep all other columns if needed).

Also, if it is OK for you to work in-place you can just write:

df_0.Inicio = DateTime.(df_0.Inicio, DateFormat("d u Y  H:M "))
df_0.Q1 = parse.(Float64, df_0.Q1))

which should be quite readable.

3 Likes

Your last use of DataFrame’s is indeed quite readable; thank you!
But why does my second use of Query’s @mutate does not generate columns with the “correct” types, rather than with “Any” types?

Probably @davidanthoff can help you with this!

Thanks for the prompt reply!