Split DateTime column into separate Date and Time Columns

Suppose I have a Dataframe F with a column of DateTime data as follows,

        Number  DateTime            
        Int64      DateTime            

    1 │73    2022-03-21T20:00:00
    2 │80    2022-03-21T19:56:00
    3 │85    2022-03-21T19:47:00
    4 │62    2022-03-21T19:15:00

what is the optimal way to split the DateTime Column into separate Date and Time columns while preserving the Date and Time data so that it would be in the following form.

        Number  Date         Time            
        Int64       Date         Time            

    1 │73    2022-03-21  20:00:00
    2 │80    2022-03-21  19:56:00
    3 │85    2022-03-21  19:47:00
    4 │62    2022-03-21  19:15:00

Currently I am trying something like this.

G = DataFrame(Date = Date.(F.DateTime),Time = Time.(F.DateTime) )
F = F[:, [:Number]]
F = hcat(G,F)

but because I am looping through thousands of DataFrames I was wondering if there was a more efficient way to do this without having to create a new DataFrame, removing the old column and then combining them.

After some digging I think that this should be able to be done with split or transform but I cannot seem to get it to work. Any pointers would be appreciated thank you!

julia> transform!(F, :DateTime => ByRow(x -> (Date(x), Time(x))) => [:Date, :Time])
4×4 DataFrame
 Row │ Number  DateTime             Date        Time
     │ Int64   DateTime             Date        Time
─────┼───────────────────────────────────────────────────
   1 │     73  2022-03-21T20:00:00  2022-03-21  20:00:00
   2 │     80  2022-03-21T19:56:00  2022-03-21  19:56:00
   3 │     85  2022-03-21T19:47:00  2022-03-21  19:47:00
   4 │     62  2022-03-21T19:15:00  2022-03-21  19:15:00
3 Likes

or a more fancy way - just to practice the DataFrames.jl minilanguage :smile: (though I prefer the solution by @zweiglimmergneis as it is easier to understand)

transform!(F, @. :DateTime => ByRow([Date, Time]) => [:Date, :Time])

For completeness let me mention that one could also write:

transform!(F, :DateTime => ByRow(x -> (Date=Date(x), Time=Time(x))) => AsTable)
5 Likes

Thanks this is awesome! would you mind explaining what the x -> (Date(x), Time(x) is doing? Is x -> equivalent to x = , that is to say is it just an assignment operator? Sorry really don’t know what I’m doing.

Thanks so much! this is very helpful. Still working through the code so I can understand it better. also I’m just curious because when I just directly create a new DataFrame and concatenate them I am getting the following time on a 4876 x 1 DataFrame of DateTimes

function splitcolumn(F)
     G = DataFrame(Date = Date.(F.DateTime),Time = Time.(F.DateTime))
     F = select!(F, Not([:DateTime]))
     F = hcat(G,F)
end
splitcolumn (generic function with 1 method)

And I have a DataFrame G that is 4976 x1 I get a time of

@btime splitcolumn(G)
  14.750 μs (101 allocations: 235.05 KiB)

Whereas if I run

 @btime transform!(G, @. :DateTime  => ByRow([Date, Time]) => [:Date, :Time])
  25.750 μs (219 allocations: 87.09 KiB)

Why is one method slower than the other?

transform! probably has a bit more overhead. With a larger data frame the times should be similar.

1 Like

it creates a lambda function, see here.

2 Likes

Exactly. However, this is a microsecond overhead that is constant (i.e. it does not grow with the number of rows of a data frame).

The reason is that transform! has a lot of logic to handle many different things you might want to perform with it.

Also note that the difference is that your splitcolumn function is most likely not doing what you assume it does as it:

  1. removes :DateTime column from the data frame that you passed to it in-place
  2. creates a new data frame and returns it (the columns generated in G are not added to the passed data frame - they are only added to the returned data frame)
3 Likes

Thanks so much for your explanation! It’s really helpful and appreciate you taking the time. Just had one more follow up. So then splitcolumn behaves a little like transform (without the exclamation mark) in that it creates a new DataFrame leaving the original DataFrame untouched whereas, transform! actually modifies the original? But suppose I have a DataFrame f and I set another DataFrame d = f

d = f
transform!(d, @. :DateTime => ByRow([Date, Time]) =>[:Date, :Time]) 

This not only alters the DataFrame d in the function, but also makes the same changes to the original DataFrame f. So by modifying d I also changed f. This is different then the behavior in scalars where

c
4

d = c
4

if I then modify d in the following manner with a unary function c remains unchanged.

d=sqrt(d)
2.0

c
4 

Just curious why the original is modified in the first case and not the latter.

Also just curious why is it that if I run the three commands separately

 G = DataFrame(Date = Date.(F.DateTime),Time = Time.(F.DateTime))
 F = select!(F, Not([:DateTime]))
 F = hcat(G,F)

G is added to F whereas when they are run as the splitcolumn function

function splitcolumn(F)
     G = DataFrame(Date = Date.(F.DateTime),Time = Time.(F.DateTime))
     F = select!(F, Not([:DateTime]))
     F = hcat(G,F)
end

G is only added to the returned DataFrame whereas F is unchanged.

Inside the function splitcolumn(F)
the variable F behaves like a global variable?!!

julia> using Dates, DataFrames, BenchmarkTools

julia> F = DataFrame(Number=[73,80,85,62],
            DateTime=["2022-03-21T20:00:00",
             "2022-03-21T19:56:00",
             "2022-03-21T19:47:00",
              "2022-03-21T19:15:00"])
4x2 DataFrame
 Row │ Number  DateTime
     │ Int64   String
─────┼─────────────────────────────
   1 │     73  2022-03-21T20:00:00
   2 │     80  2022-03-21T19:56:00
   3 │     85  2022-03-21T19:47:00
   4 │     62  2022-03-21T19:15:00

julia> F.DateTime = DateTime.(F.DateTime, DateFormat("y-m-dTH:M:SZ"))
4-element Vector{DateTime}:
 2022-03-21T20:00:00
 2022-03-21T19:56:00
 2022-03-21T19:47:00
 2022-03-21T19:15:00

julia> F
4x2 DataFrame
 Row │ Number  DateTime
     │ Int64   DateTime
─────┼─────────────────────────────
   1 │     73  2022-03-21T20:00:00
   2 │     80  2022-03-21T19:56:00
   3 │     85  2022-03-21T19:47:00
   4 │     62  2022-03-21T19:15:00

julia> function splitcolumn(F)
            G = DataFrame(Date = Date.(F.DateTime),Time = Time.(F.DateTime))
            F = select!(F, Not([:DateTime]))
            F = hcat(G,F)
       end
splitcolumn (generic function with 1 method)

julia> @btime splitcolumn(F)
ERROR: ArgumentError: column name :DateTime not found in the data frame
Stacktrace:
  [1] lookupname
    @ C:\Users\Hermesr\.julia\packages\DataFrames\6xBiG\src\other\index.jl:392 [inlined]
  [2] getindex
    @ C:\Users\Hermesr\.julia\packages\DataFrames\6xBiG\src\other\index.jl:401 [inlined]
  [3] getindex(df::DataFrame, #unused#::typeof(!), col_ind::Symbol)
    @ DataFrames C:\Users\Hermesr\.julia\packages\DataFrames\6xBiG\src\dataframe\dataframe.jl:525
  [4] getproperty
    @ C:\Users\Hermesr\.julia\packages\DataFrames\6xBiG\src\abstractdataframe\abstractdataframe.jl:378 [inlined]
  [5] splitcolumn(F::DataFrame)
    @ Main .\REPL[13]:2
  [6] var"##core#303"()
    @ Main C:\Users\Hermesr\.julia\packages\BenchmarkTools\7xSXH\src\execution.jl:489
  [7] var"##sample#304"(::Tuple{}, __params::BenchmarkTools.Parameters)
    @ Main C:\Users\Hermesr\.julia\packages\BenchmarkTools\7xSXH\src\execution.jl:495
  [8] _run(b::BenchmarkTools.Benchmark, p::BenchmarkTools.Parameters; verbose::Bool, pad::String, kwargs::Base.Pairs{Sym
bol, Integer, NTuple{4, Symbol}, NamedTuple{(:samples, :evals, :gctrial, :gcsample), Tuple{Int64, Int64, Bool, Bool}}})
    @ BenchmarkTools C:\Users\Hermesr\.julia\packages\BenchmarkTools\7xSXH\src\execution.jl:99
  [9] #invokelatest#2
    @ .\essentials.jl:718 [inlined]
 [10] #run_result#45
    @ C:\Users\Hermesr\.julia\packages\BenchmarkTools\7xSXH\src\execution.jl:34 [inlined]
 [11] run(b::BenchmarkTools.Benchmark, p::BenchmarkTools.Parameters; progressid::Nothing, nleaves::Float64, ndone::Float
64, kwargs::Base.Pairs{Symbol, Integer, NTuple{5, Symbol}, NamedTuple{(:verbose, :samples, :evals, :gctrial, :gcsample),
 Tuple{Bool, Int64, Int64, Bool, Bool}}})
    @ BenchmarkTools C:\Users\Hermesr\.julia\packages\BenchmarkTools\7xSXH\src\execution.jl:117
 [12] #warmup#54
    @ C:\Users\Hermesr\.julia\packages\BenchmarkTools\7xSXH\src\execution.jl:169 [inlined]
 [13] tune!(b::BenchmarkTools.Benchmark, p::BenchmarkTools.Parameters; progressid::Nothing, nleaves::Float64, ndone::Flo
at64, verbose::Bool, pad::String, kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ BenchmarkTools C:\Users\Hermesr\.julia\packages\BenchmarkTools\7xSXH\src\execution.jl:250
 [14] tune! (repeats 2 times)
    @ C:\Users\Hermesr\.julia\packages\BenchmarkTools\7xSXH\src\execution.jl:250 [inlined]
 [15] top-level scope
    @ C:\Users\Hermesr\.julia\packages\BenchmarkTools\7xSXH\src\execution.jl:576

julia> F
4x1 DataFrame
 Row │ Number
     │ Int64
─────┼────────
   1 │     73
   2 │     80
   3 │     85
   4 │     62

julia> G
ERROR: UndefVarError: G not defined
1 Like

Or you could just do

F.Date = Date.(F.DateTime)
F.Time = Time.(F.DateTime)
3 Likes

Because data frame is mutable, and integer is immutable. Therefore d = sqrt(d) does not change d but assigns a new binding to d value (in this case d is an integer - immutable). As a self promotion, the difference between binding a value and copying it is discussed in chapter 2 of my upcoming book Julia for Data Analysis; you should be able to preview it for free. See also the ! convention explanation at Style Guide · The Julia Language.

No, it also modifies the original data frame, as you run the select!(F, Not([:DateTime])) command there which modifies the source.

1 Like

As commented above F passed to splitcolumn is changed by the select!(F, Not([:DateTime])) operation.

1 Like

No, F is a function local variable. However, the select!(F, Not([:DateTime])) operation mutates source F.

1 Like

Is it possible to avoid that behavior? another line of code, for example!!

Not sure if I’m benchmarking this right (I usually don’t use setup and it seems when doing so one can’t interpolate?) but FWIW this is what I get for the size that OP is interested in:

julia> using DataFrames, Dates, BenchmarkTools

julia> f1!(F) = transform!(F, :DateTime => ByRow(x -> (Date(x), Time(x))) => [:Date, :Time]);

julia> f2!(F) = transform!(F, @. :DateTime => ByRow([Date, Time]) => [:Date, :Time]);

julia> f3!(F) = transform!(F, :DateTime => ByRow(x -> (Date=Date(x), Time=Time(x))) => AsTable);

julia> function f4!(F)
           F.Date = Date.(F.DateTime)
           F.Time = Time.(F.DateTime)
       end;

julia> @btime f1!(df_test) setup = (df_test = DataFrame(DateTime = rand(now():Minute(1):now()+Year(2), 5_000)));
  247.100 μs (20177 allocations: 634.48 KiB)

julia> @btime f2!(df_test) setup = (df_test = DataFrame(DateTime = rand(now():Minute(1):now()+Year(2), 5_000)));
  40.300 μs (254 allocations: 92.56 KiB)

julia> @btime f3!(df_test) setup = (df_test = DataFrame(DateTime = rand(now():Minute(1):now()+Year(2), 5_000)));
  283.000 μs (20159 allocations: 633.48 KiB)

julia> @btime f4!(df_test) setup = (df_test = DataFrame(DateTime = rand(now():Minute(1):now()+Year(2), 5_000)));
  12.400 μs (16 allocations: 78.62 KiB)
1 Like

You can avoid the behavior by doing

t  = select(F, Not([:DateTime]))

that is, use the non-mutating select. But you would have to assign this to a new variable.

Another alternative would be to write F = copy(F) at the beginning of the function.

Though this is a digression from OPs main point. Please open up a new thread if you have more questions.

3 Likes

This is very helpful again thanks so much! And thank you for sharing your book, I’ll be sure to check it out! Sorry what I meant was not that F was unchanged but that the assignment F = hcat(G,F) as a stand alone line would bind G to the original DataFrame F. However why is it when the assignment is included in the function above, as you mentioned before, it only works to concatenate G to the returned Dataframe but not to F itself?

Wow this is really great thanks! Yeah I got similar benchmark times. There was also a few earlier comments from pdeffebach and bkamins which I understood to mean that the advantage of using transform for smaller datasets would be limited because of the initial overhead.