Importing data with ambiguous dates

I have a large data-set with date columns in the format yyyy-mm-dd. The challenge is that some of the rows have dates that are ambiguous and contain either the yyyy e.g. 2019 or yyyy-mm e.g. 2019-09.

I would like to import and not modify the date so that I can identify the rows with ambiguous date entries and either analyse them separately or remove them from the analysis. I am using the CSV.jl and DataFrames.jl packages to import and analyse the data.

However, after running,

data = """

Followed by something like this,

file = CSV.File(IOBuffer(data)) |> DataFrame
4×2 DataFrame
 Row │ code   date
     │ Int64  Date
   1 │     0  2019-02-01
   2 │     1  2019-01-01
   3 │     3  2019-01-01
   4 │     4  2019-04-23

The date column “fills” to a default format (either January if the month and day are missing, or the first day of the month if just the day that is missing) and therefore I cannot tell which is which because the actual rows with the correct 2019-01-01 gets mixed up with the interpolated ones and missing dates.

I am wondering what is the best way to import the data and retain the date format and respect the date in each row and so i can later filter all entries with a complete date or ambiguous date.

You can turn off type detection by using the types kw argument

One such way:

file = CSV.File(IOBuffer(data); types=[Int, String]) |> DataFrame
4×2 DataFrame
 Row │ code   date       
     │ Int64  String     
   1 │     0  2019-02
   2 │     1  2019-01
   3 │     3  2019
   4 │     4  2019-04-23

You can even do it just for a single column

file = CSV.File(IOBuffer(data); types=Dict("date"=>String)) |> DataFrame
file = CSV.File(IOBuffer(data); types=Dict(:date=>String)) |> DataFrame

and post process the DataFrame to do what you want using filter or subset or whatever


Thanks! That worked!

1 Like

could you marked it solved please - it it changes the appearance in the listings

1 Like