Split DateTime column into separate Date and Time Columns

phantom · April 28, 2022, 8:07pm

Suppose I have a Dataframe F with a column of DateTime data as follows,

        Number  DateTime            
        Int64      DateTime            

    1 │73    2022-03-21T20:00:00
    2 │80    2022-03-21T19:56:00
    3 │85    2022-03-21T19:47:00
    4 │62    2022-03-21T19:15:00

what is the optimal way to split the DateTime Column into separate Date and Time columns while preserving the Date and Time data so that it would be in the following form.

        Number  Date         Time            
        Int64       Date         Time            

    1 │73    2022-03-21  20:00:00
    2 │80    2022-03-21  19:56:00
    3 │85    2022-03-21  19:47:00
    4 │62    2022-03-21  19:15:00

Currently I am trying something like this.

G = DataFrame(Date = Date.(F.DateTime),Time = Time.(F.DateTime) )
F = F[:, [:Number]]
F = hcat(G,F)

but because I am looping through thousands of DataFrames I was wondering if there was a more efficient way to do this without having to create a new DataFrame, removing the old column and then combining them.

After some digging I think that this should be able to be done with split or transform but I cannot seem to get it to work. Any pointers would be appreciated thank you!

zweiglimmergneis · April 28, 2022, 8:23pm

julia> transform!(F, :DateTime => ByRow(x -> (Date(x), Time(x))) => [:Date, :Time])
4×4 DataFrame
 Row │ Number  DateTime             Date        Time
     │ Int64   DateTime             Date        Time
─────┼───────────────────────────────────────────────────
   1 │     73  2022-03-21T20:00:00  2022-03-21  20:00:00
   2 │     80  2022-03-21T19:56:00  2022-03-21  19:56:00
   3 │     85  2022-03-21T19:47:00  2022-03-21  19:47:00
   4 │     62  2022-03-21T19:15:00  2022-03-21  19:15:00

bkamins · April 28, 2022, 8:46pm

or a more fancy way - just to practice the DataFrames.jl minilanguage (though I prefer the solution by @zweiglimmergneis as it is easier to understand)

transform!(F, @. :DateTime => ByRow([Date, Time]) => [:Date, :Time])

For completeness let me mention that one could also write:

transform!(F, :DateTime => ByRow(x -> (Date=Date(x), Time=Time(x))) => AsTable)

phantom · April 28, 2022, 9:24pm

Thanks this is awesome! would you mind explaining what the x -> (Date(x), Time(x) is doing? Is x -> equivalent to x = , that is to say is it just an assignment operator? Sorry really don’t know what I’m doing.

phantom · April 28, 2022, 9:24pm

Thanks so much! this is very helpful. Still working through the code so I can understand it better. also I’m just curious because when I just directly create a new DataFrame and concatenate them I am getting the following time on a 4876 x 1 DataFrame of DateTimes

function splitcolumn(F)
     G = DataFrame(Date = Date.(F.DateTime),Time = Time.(F.DateTime))
     F = select!(F, Not([:DateTime]))
     F = hcat(G,F)
end
splitcolumn (generic function with 1 method)

And I have a DataFrame G that is 4976 x1 I get a time of

@btime splitcolumn(G)
  14.750 μs (101 allocations: 235.05 KiB)

Whereas if I run

 @btime transform!(G, @. :DateTime  => ByRow([Date, Time]) => [:Date, :Time])
  25.750 μs (219 allocations: 87.09 KiB)

Why is one method slower than the other?

pdeffebach · April 28, 2022, 9:29pm

transform! probably has a bit more overhead. With a larger data frame the times should be similar.

bkamins · April 28, 2022, 10:09pm

it creates a lambda function, see here.

bkamins · April 28, 2022, 10:13pm

Exactly. However, this is a microsecond overhead that is constant (i.e. it does not grow with the number of rows of a data frame).

The reason is that transform! has a lot of logic to handle many different things you might want to perform with it.

Also note that the difference is that your splitcolumn function is most likely not doing what you assume it does as it:

removes :DateTime column from the data frame that you passed to it in-place
creates a new data frame and returns it (the columns generated in G are not added to the passed data frame - they are only added to the returned data frame)

phantom · April 29, 2022, 2:31am

Thanks so much for your explanation! It’s really helpful and appreciate you taking the time. Just had one more follow up. So then splitcolumn behaves a little like transform (without the exclamation mark) in that it creates a new DataFrame leaving the original DataFrame untouched whereas, transform! actually modifies the original? But suppose I have a DataFrame f and I set another DataFrame d = f

d = f
transform!(d, @. :DateTime => ByRow([Date, Time]) =>[:Date, :Time])

This not only alters the DataFrame d in the function, but also makes the same changes to the original DataFrame f. So by modifying d I also changed f. This is different then the behavior in scalars where

c
4

d = c
4

if I then modify d in the following manner with a unary function c remains unchanged.

d=sqrt(d)
2.0

c
4

Just curious why the original is modified in the first case and not the latter.

phantom · April 29, 2022, 3:41am

Also just curious why is it that if I run the three commands separately

 G = DataFrame(Date = Date.(F.DateTime),Time = Time.(F.DateTime))
 F = select!(F, Not([:DateTime]))
 F = hcat(G,F)

G is added to F whereas when they are run as the splitcolumn function

function splitcolumn(F)
     G = DataFrame(Date = Date.(F.DateTime),Time = Time.(F.DateTime))
     F = select!(F, Not([:DateTime]))
     F = hcat(G,F)
end

G is only added to the returned DataFrame whereas F is unchanged.

HerAdri · April 29, 2022, 6:11am

Inside the function splitcolumn(F)
the variable F behaves like a global variable?!!

julia> using Dates, DataFrames, BenchmarkTools

julia> F = DataFrame(Number=[73,80,85,62],
            DateTime=["2022-03-21T20:00:00",
             "2022-03-21T19:56:00",
             "2022-03-21T19:47:00",
              "2022-03-21T19:15:00"])
4x2 DataFrame
 Row │ Number  DateTime
     │ Int64   String
─────┼─────────────────────────────
   1 │     73  2022-03-21T20:00:00
   2 │     80  2022-03-21T19:56:00
   3 │     85  2022-03-21T19:47:00
   4 │     62  2022-03-21T19:15:00

julia> F.DateTime = DateTime.(F.DateTime, DateFormat("y-m-dTH:M:SZ"))
4-element Vector{DateTime}:
 2022-03-21T20:00:00
 2022-03-21T19:56:00
 2022-03-21T19:47:00
 2022-03-21T19:15:00

julia> F
4x2 DataFrame
 Row │ Number  DateTime
     │ Int64   DateTime
─────┼─────────────────────────────
   1 │     73  2022-03-21T20:00:00
   2 │     80  2022-03-21T19:56:00
   3 │     85  2022-03-21T19:47:00
   4 │     62  2022-03-21T19:15:00

julia> function splitcolumn(F)
            G = DataFrame(Date = Date.(F.DateTime),Time = Time.(F.DateTime))
            F = select!(F, Not([:DateTime]))
            F = hcat(G,F)
       end
splitcolumn (generic function with 1 method)

julia> @btime splitcolumn(F)
ERROR: ArgumentError: column name :DateTime not found in the data frame
Stacktrace:
  [1] lookupname
    @ C:\Users\Hermesr\.julia\packages\DataFrames\6xBiG\src\other\index.jl:392 [inlined]
  [2] getindex
    @ C:\Users\Hermesr\.julia\packages\DataFrames\6xBiG\src\other\index.jl:401 [inlined]
  [3] getindex(df::DataFrame, #unused#::typeof(!), col_ind::Symbol)
    @ DataFrames C:\Users\Hermesr\.julia\packages\DataFrames\6xBiG\src\dataframe\dataframe.jl:525
  [4] getproperty
    @ C:\Users\Hermesr\.julia\packages\DataFrames\6xBiG\src\abstractdataframe\abstractdataframe.jl:378 [inlined]
  [5] splitcolumn(F::DataFrame)
    @ Main .\REPL[13]:2
  [6] var"##core#303"()
    @ Main C:\Users\Hermesr\.julia\packages\BenchmarkTools\7xSXH\src\execution.jl:489
  [7] var"##sample#304"(::Tuple{}, __params::BenchmarkTools.Parameters)
    @ Main C:\Users\Hermesr\.julia\packages\BenchmarkTools\7xSXH\src\execution.jl:495
  [8] _run(b::BenchmarkTools.Benchmark, p::BenchmarkTools.Parameters; verbose::Bool, pad::String, kwargs::Base.Pairs{Sym
bol, Integer, NTuple{4, Symbol}, NamedTuple{(:samples, :evals, :gctrial, :gcsample), Tuple{Int64, Int64, Bool, Bool}}})
    @ BenchmarkTools C:\Users\Hermesr\.julia\packages\BenchmarkTools\7xSXH\src\execution.jl:99
  [9] #invokelatest#2
    @ .\essentials.jl:718 [inlined]
 [10] #run_result#45
    @ C:\Users\Hermesr\.julia\packages\BenchmarkTools\7xSXH\src\execution.jl:34 [inlined]
 [11] run(b::BenchmarkTools.Benchmark, p::BenchmarkTools.Parameters; progressid::Nothing, nleaves::Float64, ndone::Float
64, kwargs::Base.Pairs{Symbol, Integer, NTuple{5, Symbol}, NamedTuple{(:verbose, :samples, :evals, :gctrial, :gcsample),
 Tuple{Bool, Int64, Int64, Bool, Bool}}})
    @ BenchmarkTools C:\Users\Hermesr\.julia\packages\BenchmarkTools\7xSXH\src\execution.jl:117
 [12] #warmup#54
    @ C:\Users\Hermesr\.julia\packages\BenchmarkTools\7xSXH\src\execution.jl:169 [inlined]
 [13] tune!(b::BenchmarkTools.Benchmark, p::BenchmarkTools.Parameters; progressid::Nothing, nleaves::Float64, ndone::Flo
at64, verbose::Bool, pad::String, kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ BenchmarkTools C:\Users\Hermesr\.julia\packages\BenchmarkTools\7xSXH\src\execution.jl:250
 [14] tune! (repeats 2 times)
    @ C:\Users\Hermesr\.julia\packages\BenchmarkTools\7xSXH\src\execution.jl:250 [inlined]
 [15] top-level scope
    @ C:\Users\Hermesr\.julia\packages\BenchmarkTools\7xSXH\src\execution.jl:576

julia> F
4x1 DataFrame
 Row │ Number
     │ Int64
─────┼────────
   1 │     73
   2 │     80
   3 │     85
   4 │     62

julia> G
ERROR: UndefVarError: G not defined

nilshg · April 29, 2022, 6:22am

Or you could just do

F.Date = Date.(F.DateTime)
F.Time = Time.(F.DateTime)

bkamins · April 29, 2022, 7:23am

Because data frame is mutable, and integer is immutable. Therefore d = sqrt(d) does not change d but assigns a new binding to d value (in this case d is an integer - immutable). As a self promotion, the difference between binding a value and copying it is discussed in chapter 2 of my upcoming book Julia for Data Analysis; you should be able to preview it for free. See also the ! convention explanation at Style Guide · The Julia Language.

No, it also modifies the original data frame, as you run the select!(F, Not([:DateTime])) command there which modifies the source.

bkamins · April 29, 2022, 7:24am

As commented above F passed to splitcolumn is changed by the select!(F, Not([:DateTime])) operation.

bkamins · April 29, 2022, 7:26am

No, F is a function local variable. However, the select!(F, Not([:DateTime])) operation mutates source F.

HerAdri · April 29, 2022, 7:50am

Is it possible to avoid that behavior? another line of code, for example!!

nilshg · April 29, 2022, 7:53am

Not sure if I’m benchmarking this right (I usually don’t use setup and it seems when doing so one can’t interpolate?) but FWIW this is what I get for the size that OP is interested in:

julia> using DataFrames, Dates, BenchmarkTools

julia> f1!(F) = transform!(F, :DateTime => ByRow(x -> (Date(x), Time(x))) => [:Date, :Time]);

julia> f2!(F) = transform!(F, @. :DateTime => ByRow([Date, Time]) => [:Date, :Time]);

julia> f3!(F) = transform!(F, :DateTime => ByRow(x -> (Date=Date(x), Time=Time(x))) => AsTable);

julia> function f4!(F)
           F.Date = Date.(F.DateTime)
           F.Time = Time.(F.DateTime)
       end;

julia> @btime f1!(df_test) setup = (df_test = DataFrame(DateTime = rand(now():Minute(1):now()+Year(2), 5_000)));
  247.100 μs (20177 allocations: 634.48 KiB)

julia> @btime f2!(df_test) setup = (df_test = DataFrame(DateTime = rand(now():Minute(1):now()+Year(2), 5_000)));
  40.300 μs (254 allocations: 92.56 KiB)

julia> @btime f3!(df_test) setup = (df_test = DataFrame(DateTime = rand(now():Minute(1):now()+Year(2), 5_000)));
  283.000 μs (20159 allocations: 633.48 KiB)

julia> @btime f4!(df_test) setup = (df_test = DataFrame(DateTime = rand(now():Minute(1):now()+Year(2), 5_000)));
  12.400 μs (16 allocations: 78.62 KiB)

pdeffebach · April 29, 2022, 8:10am

You can avoid the behavior by doing

t  = select(F, Not([:DateTime]))

that is, use the non-mutating select. But you would have to assign this to a new variable.

Another alternative would be to write F = copy(F) at the beginning of the function.

Though this is a digression from OPs main point. Please open up a new thread if you have more questions.

phantom · April 29, 2022, 9:13pm

This is very helpful again thanks so much! And thank you for sharing your book, I’ll be sure to check it out! Sorry what I meant was not that F was unchanged but that the assignment F = hcat(G,F) as a stand alone line would bind G to the original DataFrame F. However why is it when the assignment is included in the function above, as you mentioned before, it only works to concatenate G to the returned Dataframe but not to F itself?

phantom · April 29, 2022, 9:22pm

Wow this is really great thanks! Yeah I got similar benchmark times. There was also a few earlier comments from pdeffebach and bkamins which I understood to mean that the advantage of using transform for smaller datasets would be limited because of the initial overhead.

Topic		Replies	Views
Create a new DateTime dataframe column from Date and Time columns General Usage dataframes , datetime	2	43	May 28, 2025
Converting DateTime in DataFrames indo Data New to Julia question , dates , dataframes	9	561	April 15, 2021
Datetime operations New to Julia dataframes , time	5	237	November 12, 2024
Most efficient way: create/mutate DataFrame time column General Usage	8	1081	November 28, 2019
How to split up a Float column in array or dataframe (not String) New to Julia	8	1577	June 2, 2020

Split DateTime column into separate Date and Time Columns

Related topics