Most efficient way: create/mutate DataFrame time column

Suppose you have a DataFrame, which has a column called Year (with Int64). Then it has a column called Month (with String). Yes, suppose you have 12 rows with the integer year 2000 and the 12 months January to December as Strings with that, and under it 2001 and its 12 months.

You look at it. Now you want to create a corresponding column with Julia’s own Date type.

What is the most direct, most efficient, and easy way to get the result whether using some querying package or only Julia’s own included tools?


I thought of using a for loop to create a new array, but it might not be the best here.

Alternatively, is it possible to simply enforce Julia’s recognition of Dates by editing a CSV beforehand somehow?

I don’t see a function in Dates for parsing month names, however there is a function for generating them.

You can do

mdict = Dict(monthname(i)=>i for i ∈ 1:12)
Date.(years, [mdict[m] for m ∈ months])

assuming the months and years are in two separate columns.

If you had a column with parseable timestamps you can set the column type with the types keyword, see the CSV.jl documentation, but that doesn’t sound like it’s the case here.

Month names could be parsed like:

# df is the DataFrame described above
Date.(string.(df.Year).*"-".*df.Month, "y-U")

For info on the "y-U" string, see ?DateFormat

Thanks, both. I will have a look at that (and maybe come back to you).

Your way, @Mattriks, gives me an error:

ArgumentError: Unable to parse date time. Expected directive DatePart(U) at char 6

I wonder, @ExpandingMan, if yours is really applicable to this situation. Maybe I don’t understand or I explained wrongly.


Let me try to explain (and someone can then extend this code):

### a simple DataFrame
df = DataFrame(
  Year = [2000, 2000, 2000, 2000],
  Month = ["January", "February", "March", "April"]
);


Question: how do you easily, automatically and most efficiently create a third column, which will have Julia’s Date type, where the elements/Dates will have the format according to2000-01-01, 2000-02-01, which you would normally get from Date(2000,1,1), Date(2000,2,1), etc.?

Using the df in your last post, these both work for me:

df.date = Date.(.*(string.(df.Year), "-", df.Month), "y-U")
df.date2 = Date.(string.(df.Year).*"-".*df.Month, "y-U")
2 Likes

That works.

I wish there were pre-made convenient functions for this kind of thing (which may be common encounters), even that it would nicely work with different String month names. I thought of it, but haven’t come up with a very nice solution.

Is there in any case an alternative to such code as you wrote closer to the source (earlier in the process)? For example, if you edit or make a CSV beforehand, can you do it in such a way (by doing a column in some way) so as to force Date recognition?

You can do (see ?CSV.File):

CSV.read("myfile.csv", dateformat=dateformat"y-U", types=Dict(1=>Date))  

which assumes that column 1 contains e.g. 2000-January that can be converted to a Date. But that only works for single columns, I don’t know of a way for combining multiple columns into a Date in CSV.read.

1 Like

Thanks.