Converting columns in a DataFrame from String to Date (with a specific format)

I am trying to convert a DataFrame column populated with dates whose type is String and format DD/MM/YYYY to dates with Date type and format YYYYMMDD

Tried A[!,:DATE] = parse.(Date, A[!,:DATE])

But it errors, with: ArgumentError: Unable to parse date time. Expected directive Delim(-)

Which would be the correct way to achieve this conversion? Thank you!

Parsing into a Date is done by:

julia> d1="02/02/2022"
"02/02/2022"

julia> date=Date(d1,"dd/mm/yyyy")
2022-02-02

julia> typeof(date)
Date

Output (as String) is done by:

julia> s=Dates.format(date,"Ymmdd")
"20220202"

So:

A[!,:DATE]= Dates.format.(Date.( A[!,:DATE],"dd/mm/yyyy"),"Ymmdd")

Typically, when itā€™s about DataFrames, after a while, much more elegant versions are popping up, just wait for itā€¦

2 Likes

Thank you very much!

How did you get date strings into a data frame in the first place?

In case that you use CSV.jl, you can specify a date format when reading the file Reading Ā· CSV.jl

2 Likes

Donā€™t call Dates.format on your column, because it converts Dates to String so in general it is a bad practice. Leave it as Date unless you want to present data.

using Dates
f(x; df = dateformat"m/d/y") = Date(x, df)
transform(df, :date=>ByRow(f))

ā€¦apologies but how does this work for a column of times i.e. converting a dataframe column typically ā€œ01:10:46ā€ string to 01:10:46 Time. I blow up with either ā€˜no method matchingā€™ or ā€˜no method matching Int64(::Vector{Any})ā€™ etc etc. Dataframe column is loaded with 700 rows of Any.

The first thing to realise is that converting DataFrame columns is not different from converting regular Arrays of any type to any other type, as DataFrame columns are just vectors. So you really just need to know how to convert a String to a Time object, irrespective of where this String is stored.

The second thing to note is that getting a numerical value out of a string is generally referred to as ā€œparsingā€, rather than ā€œconversionā€. With this, you have:

julia> using Dates

julia> parse(Time, "01:10:46")
01:10:46

julia> typeof(ans)
Time

and then broadcast that over your data, i.e. parse.(Time, df.timecol) (although I canā€™t guarantee that all your strings have the right format to be parsed, so you might have to fiddle with it a bit!)

PS how did you end up with 700 columns of type Any? If you are using XLSX.jl to read an Excel file, consider the infer_eltypes = true kwarg.

thank you very much for the reply. I have a df column with 700 rows of string time in a df.Time column which has come from an xslx but I had missed the infer_eltypes option.

parse(Time, df.ā€œTimeā€) gives methodError

Iā€™m trying to re-write the entire df.Time column. In the xlsx file the times are in time format and add up.

new to Julia and coming from a Python background. Again thanks for the comment, Iā€™ll keep at it.

My suggestion was parse.(Time, df."Time") - note the dot after parse to broadcast (apply element-wise) the function.

(also note the quotes arenā€™t necessary if you have a column name which is a single word without special characters)


Tried the broadcast (sorry) my mis-typing.

julia> parse.(Time, df.Time)
ERROR: MethodError: no method matching parse(::Type{Time}, ::Time)

You have already converted the df.Time column to a Time type. It is no longer a string, so parse doesnā€™t work. Try again on your ā€œrawā€ data frame with string types.

Good spot :slight_smile: the df.Time column got corrected with the great infer_eltype suggestion. Other similar columns in the df remained Any and a parse.(Time, df.ā€œAvg Paceā€) givesā€¦

The problem is the same, as before. Your column isnā€™t all Strings and parse only works on strings.

How are you importing your data? Itā€™s not a great sign to have these Any columns.

Interesting, perhaps I go back a raw CSV.File input, which I moved away from to

DataFrame(XLSX.readtable(ā€œfile.xlsxā€, ā€œActivitiesā€, infer_eltypes = true))

with the useful suggestion of infer_eltypes that worked on the first Time column leaving the remaining others.

Think Iā€™ll take a few steps backwards. Didnā€™t want to take up too much of peopleā€™s time etc. From the comments I think its the raw data more than my logic which is a step. Thanks.

Donā€™t worry! These are annoying problems to run into.

One solution would be a helper function

get_to_time(x::String) = parse(Time, x)
get_to_time(x::Time) = x
get_to_time(x) = missing
get_to_time.(df.Time)