Parsing date column when reading in CSV

Hello all,

I have a CSV file with a date column that I am trying to parse when reading it in. I have tried several options but they don’t work.

df = DataFrame(CSV.File(joinpath(data_path, "PJM_2018.csv"))) 
8751×8 DataFrame
  Row │ time_stamp        Coal     Nuclear  Gas      Oil      Wind     Hydro    Solar   
      │ String31          Float64  Float64  Float64  Float64  Float64  Float64  Float64 
──────┼─────────────────────────────────────────────────────────────────────────────────
    1 │ 01.01.2018 00:00    48.55    35.53    23.38     1.54     3.91     0.7    0.0
    2 │ 01.01.2018 01:00    48.06    35.55    22.77     1.56     3.41     1.04   0.0
  ⋮   │        ⋮             ⋮        ⋮        ⋮        ⋮        ⋮        ⋮        ⋮
 8751 │ 12/31/2018 14:00    24.49    34.76    29.05     0.22     3.84     2.14   0.0456

The data the following format

 "12/31/2018 14:00"

Any suggestions on how to parse the time_stamp coloumn?

Thanks

Have you tried to use the dateformat kwarg as specified in the docs?

https://csv.juliadata.org/stable/reading.html#dateformat

Based on the sample data you posted above though this will fail because your time_stamp column has multiple different date formats. You’ll have to manually parse this after reading in the CSV.

1 Like

Thanks for pointing out the different date formats. I formatted them to a consistent format (mm/dd/yyyy) and parsed them using the following code:

df.time_stamp =  DateTime.(df.time_stamp, "mm/dd/yyyy HH:MM")

Thanks

1 Like

I don’t know if the following is useful to the problem nor, even less, if it is efficient, but it was just to do an exercise on what is illustrated here

data = """
col1;col2;col3;col4;col5
"05.02.2023";1000,01;2000,02;3000,03;12:00:00
"06.02.2023";4000,04;5000,05;6000,06;12:00:00
"06.02.2023";4000,04;5000,05;6000,06;12:00:00
"02/06/2023";4000,05;5000,05;6000,06;12:00:00
"02/06/2023";4000,06;5000,05;6000,06;12:00:00
"02/06/2023";4000,07;5000,05;6000,06;12:00:00
"""
tbl = CSV.File(IOBuffer(data); delim=';') |> columntable

tbl.col1

replace.(tbl.col1, r"(\d\d/)(?<ng>\d\d/)" => s"\g<ng>\1","."=>"/")

I don’t have much experience with regular expressions, but this case seems like it could be solved maybe more clearly


replace.(tbl.col1, r"(\d\d/)(\d\d/)" => s"\2\1", "."=>"/")

1 Like
using CSV, DataFrames, Dates

data = """
time_stamp,Coal,Nuclear,Gas,Oil,Wind,Hydro,Solar
01.01.2018 00:00,48.55,35.53,23.38,1.54,3.91,0.7,0.0
01.01.2018 01:00,48.06,35.55,22.77,1.56,3.41,1.04,0.0
12/31/2018 14:00,24.49,34.76,29.05,0.22,3.84,2.14,0.045
"""

df = CSV.read(IOBuffer(data), DataFrame)

df.time_stamp .= replace.(df.time_stamp, "." => "/")
df.time_stamp .= DateTime.(df.time_stamp, "mm/dd/yyyy HH:MM")
df
1 Like

@rafael.guerra thanks for the nifty code. Really helpful!

@rocco_sprmnt21 thanks for this