How to force an error when a value can't be parsed as a date in CSV.jl?

I’m trying to parse a column in my CSV file as dates, so I use CSV.File(..., dateformat=...) like this:

using CSV, DataFrames, Dates

df = CSV.File(
	IOBuffer("Date,Time\n01/01/20, 10:10:10\n01/01/20, 11:11:11\ninvalid, 12:12:12"),
	dateformat=Dict(:Date=>dateformat"dd/mm/yy", :Time=>dateformat"HH:MM:SS"),
	strict=true
) |> DataFrame

Clearly, invalid is not a valid date. However, this code runs without errors or warnings (even though strict=true) and produces a dataframe where the Date column consists of strings:

eltype(df.Date) # InlineStrings.String15

How can I force an error here?


I can “fix” this by specifying types=Dict(:Date=>Dates.Date) in the call to CSV.File, but is it not already clear from the dateformat argument that I’d like the Date column to contain Dates.Dates? So I feel like I’m unnecessarily duplicating my intent to parse dates here…

Part of the issue is that if you specify a DateFormat for a specific column, it’s unclear, unless we do some kind of heuristic inspection of the DateFormat, whether you want a Date, Time, or DateTime for the column type.

Seems like that inference would be quite effective:

  • has a date component and no time component: it’s a date
  • has a time component and no date component: it’s a time
  • has both: it’s a datetime
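The three rules above can be sketched as a small function. This is only an illustration: guess_type is a hypothetical helper, not part of CSV.jl or Dates, and it looks at a plain format string rather than a DateFormat object. It assumes the format contains no literal letters that collide with format codes.

```julia
using Dates

# Hypothetical helper (not a CSV.jl or Dates function): guess the target
# type from the format codes that appear in a date format string.
function guess_type(format::AbstractString)
    has_date = any(c -> c in "yYmuUdeE", format)  # year, month, day, weekday codes
    has_time = any(c -> c in "HIMSsp", format)    # hour, minute, second, AM/PM codes
    has_date && has_time && return DateTime
    has_date && return Date
    has_time && return Time
    return nothing  # no recognised date or time codes
end

guess_type("dd/mm/yy")             # Date
guess_type("HH:MM:SS")             # Time
guess_type("yyyy-mm-dd HH:MM:SS")  # DateTime
```

Note that the codes are case-sensitive: lowercase m is a month, uppercase M is a minute.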

Why not simply infer all types based on the first row of the CSV file?

  1. Try to parse each column of the first (non-header) row.
  2. Assign to each column of the overall CSV the types of this first row’s columns.
  3. Go on as if the types argument to CSV.File was set to the types that were inferred from the first row.

I guess special handling of the user-supplied types argument will be needed, but the code could again try to fully parse and type the first row, respecting the types argument, then fill out types of each column based on the first row and then go to step (3) above.

A very dumb but straightforward algorithm:

  1. Run CSV.File for the first row of the CSV.
  2. Store type information in first_row_types.
  3. Rerun CSV.File(..., types=first_row_types) for the entire file.
  4. Bingo, now each column has a type inferred from the first row of the CSV. If some other row doesn’t conform, that’s an error.
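A minimal sketch of those four steps, assuming the CSV text is available as a string so it can be read twice. read_typed_by_first_row is a hypothetical name; limit, types, strict and stringtype are CSV.File keywords as I understand them, so check them against your CSV.jl version.

```julia
using CSV, DataFrames

# Hypothetical helper: take column types from the first data row only,
# then re-read the whole input with those types enforced.
function read_typed_by_first_row(csv_text::AbstractString; kwargs...)
    # Pass 1: parse a single row. stringtype=String avoids fixed-width
    # inline string types that later, longer values might not fit into.
    first_row = DataFrame(CSV.File(IOBuffer(csv_text); kwargs...,
                                   limit=1, stringtype=String))
    first_row_types = Dict(Symbol(name) => eltype(first_row[!, name])
                           for name in names(first_row))
    # Pass 2: re-read everything; with strict=true a value that does not
    # parse as its column's type raises an error instead of a warning.
    return DataFrame(CSV.File(IOBuffer(csv_text); kwargs...,
                              types=first_row_types, strict=true))
end

df = read_typed_by_first_row("a,b\n1,2.5\n3,4.5")
```

One caveat with this approach: if the first row has a missing value in some column, the type inferred for that column will not be useful, so the sketch would need extra handling for that case.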

You’re certainly welcome to code up your own function that implements your algorithm if it suits your purposes. CSV.jl will indeed respect the types passed from the user and emit warnings for values that don’t parse (or error if strict=true is passed).
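For the example from the question, that amounts to passing the column type explicitly alongside the date format. The sketch below wraps the call in try/catch only to show that the failure can be captured; the exact exception type is not assumed here.

```julia
using CSV, DataFrames, Dates

data = "Date,Time\n01/01/20, 10:10:10\n01/01/20, 11:11:11\ninvalid, 12:12:12"

# Declaring the column type makes "invalid" a parse failure rather than
# a reason to fall back to a string column.
result = try
    CSV.File(
        IOBuffer(data);
        dateformat=Dict(:Date => dateformat"dd/mm/yy", :Time => dateformat"HH:MM:SS"),
        types=Dict(:Date => Date),
        strict=true,
    ) |> DataFrame
catch err
    err  # keep the exception so it can be inspected
end

result isa Exception  # check whether the strict parse was rejected
```

Dropping strict=true should turn the error into a warning, with the unparseable value read as missing.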
