Importing data with ambiguous dates

George_Githinji · March 4, 2022, 6:06am

I have a large dataset with date columns in the format yyyy-mm-dd. The challenge is that some rows have ambiguous dates that contain only the year (e.g. 2019) or only the year and month (e.g. 2019-09).

I would like to import the dates without modifying them, so that I can identify the rows with ambiguous date entries and either analyse them separately or remove them from the analysis. I am using the CSV.jl and DataFrames.jl packages to import and analyse the data.

However, after running,

data = """
              code,date
              0,2019-02
              1,2019-01
              3,2019
              4,2019-04-23
              """
"code,date\n0,2019-02\n1,2019-01\n3,2019\n4,2019-04-23\n"

Followed by something like this,

using CSV, DataFrames
file = CSV.File(IOBuffer(data)) |> DataFrame
4×2 DataFrame
 Row │ code   date
     │ Int64  Date
─────┼───────────────────
   1 │     0  2019-02-01
   2 │     1  2019-01-01
   3 │     3  2019-01-01
   4 │     4  2019-04-23

The date column is filled in with defaults (January 1 if both the month and day are missing, or the first day of the month if only the day is missing). As a result, I cannot tell which is which, because rows that genuinely contain 2019-01-01 get mixed up with the rows whose missing parts were filled in.

What is the best way to import the data so that the date in each row is kept exactly as written, so that I can later filter entries by whether the date is complete or ambiguous?

lawless-m · March 4, 2022, 7:00am

You can turn off type detection by using the types keyword argument.

One such way:

file = CSV.File(IOBuffer(data); types=[Int, String]) |> DataFrame
4×2 DataFrame
 Row │ code   date       
     │ Int64  String     
─────┼───────────────────
   1 │     0  2019-02
   2 │     1  2019-01
   3 │     3  2019
   4 │     4  2019-04-23

You can even do it for just a single column:

file = CSV.File(IOBuffer(data); types=Dict("date"=>String)) |> DataFrame
or
file = CSV.File(IOBuffer(data); types=Dict(:date=>String)) |> DataFrame

Then post-process the DataFrame to do what you want, using filter, subset, or similar.
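
As a rough sketch of that post-processing step (not part of the original reply), one option is to test each string against the yyyy-mm-dd pattern and split the table accordingly. The helper name is_complete and the variables complete and ambiguous are hypothetical, chosen only for illustration. The check is done on the string rather than by parsing, on the assumption that Date parsing fills in missing trailing parts, as the CSV.jl output in the question suggests.

```julia
using CSV, DataFrames, Dates

data = """
code,date
0,2019-02
1,2019-01
3,2019
4,2019-04-23
"""

# Read the date column as plain strings so nothing is filled in.
df = CSV.File(IOBuffer(data); types=Dict(:date => String)) |> DataFrame

# Hypothetical helper: a date counts as complete only if it has the
# exact shape yyyy-mm-dd. This checks the shape, not calendar validity.
is_complete(s) = occursin(r"^\d{4}-\d{2}-\d{2}$", s)

# Split the rows into complete and ambiguous dates.
complete  = filter(:date => is_complete, df)
ambiguous = filter(:date => !is_complete, df)

# Convert the complete rows to real Date values for further analysis.
complete.date = [Date(s, dateformat"yyyy-mm-dd") for s in complete.date]
```

With the sample data, complete should hold only the row with code 4, and ambiguous the other three rows, still as strings exactly as they appeared in the file.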

George_Githinji · March 4, 2022, 1:13pm

Thanks! That worked!

lawless-m · March 4, 2022, 2:22pm

Could you mark it as solved, please? It changes the appearance in the listings.
