I have a large data-set with date columns in the format yyyy-mm-dd
. The challenge is that some of the rows have dates that are ambiguous and contain either the yyyy
e.g. 2019 or yyyy-mm
e.g. 2019-09.
I would like to import and not modify the date so that I can identify the rows with ambiguous date entries and either analyse them separately or remove them from the analysis. I am using the CSV.jl
and DataFrames.jl
packages to import and analyse the data.
However, after running,
data = """
code,date
0,2019-02
1,2019-01
3,2019
4,2019-04-23
"""
"code,date\n0,2019-02\n1,2019-01\n3,2019\n4,2019-04-23\n"
Followed by something like this,
file = CSV.File(IOBuffer(data)) |> DataFrame
4×2 DataFrame
Row │ code date
│ Int64 Date
─────┼───────────────────
1 │ 0 2019-02-01
2 │ 1 2019-01-01
3 │ 3 2019-01-01
4 │ 4 2019-04-23
The date column “fills” to a default format (either January if the month and day are missing, or the first day of the month if just the day that is missing) and therefore I cannot tell which is which because the actual rows with the correct 2019-01-01
gets mixed up with the interpolated ones and missing dates.
I am wondering what is the best way to import the data and retain the date format and respect the date in each row and so i can later filter all entries with a complete date or ambiguous date.