I’m trying to parse a column in my CSV file as dates, so I tried using CSV.File(..., dateformat=...) like this:
df = CSV.File(
IOBuffer("Date,Time\n01/01/20, 10:10:10\n01/01/20, 11:11:11\ninvalid, 12:12:12"),
) |> DataFrame
Here, invalid is not a valid date. However, this code runs without errors or warnings (even though strict=true) and produces a dataframe where the Date column consists of strings:
eltype(df.Date) # InlineStrings.String15
How can I force an error here?
I can “fix” this by specifying types=Dict(:Date=>Dates.Date) in the call to CSV.File, but is it not already clear from the dateformat argument that I’d like the Date column to contain Dates.Date values? So I feel like I’m unnecessarily duplicating my intent to parse dates here…
Part of the issue is that if you specify a DateFormat for a specific column, it’s unclear, unless we do some kind of heuristic inspection of the DateFormat, whether you want a DateTime for the column type.
Seems like that inference would be quite effective:
- has a date component and no time component: it’s a date
- has a time component and no date component: it’s a time
- has both: it’s a datetime
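The inference listed above could be sketched roughly as follows. This is only an illustration that scans the format string for the format codes documented by the Dates standard library; infer_temporal_type is a hypothetical helper, not part of CSV.jl or Dates, and it does not handle escaped or literal text that happens to contain code letters.

```julia
using Dates

# Hypothetical helper: guess the intended column type from a date format string.
# Assumes every letter in `fmt` is a format code (no escaped literal text).
function infer_temporal_type(fmt::AbstractString)
    has_date = any(c -> c in "yYmuUdeE", fmt)  # year, month, day, weekday codes
    has_time = any(c -> c in "HIMSsp", fmt)    # hour, minute, second, ms, AM/PM codes
    has_date && has_time && return DateTime
    has_date && return Date
    has_time && return Time
    return nothing  # no recognizable date or time component
end

infer_temporal_type("dd/mm/yy")             # Date
infer_temporal_type("HH:MM:SS")             # Time
infer_temporal_type("yyyy-mm-dd HH:MM:SS")  # DateTime
```

Note that lowercase m is the month code and uppercase M is the minute code, so the two character sets do not overlap.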
Why not simply infer all types based on the first row of the CSV file?
- Try to parse each column of the first (non-header) row.
- Assign to each column of the overall CSV the types of this first row’s columns.
- Go on as if the types argument to CSV.File was set to the types that were inferred from the first row.
I guess special handling of the user-supplied types argument will be needed, but the code could again try to fully parse and type the first row, respecting the types argument, then fill out types of each column based on the first row and then go to step (3) above.
A very dumb but straightforward algorithm:
- Call CSV.File for the first row of the CSV.
- Store the resulting type information in first_row_types.
- Call CSV.File(..., types=first_row_types) for the entire file.
- Bingo, now each column has a type inferred from the first row of the CSV. If some other row doesn’t conform, that’s an error.
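A minimal sketch of the steps above, written as user code on top of CSV.jl rather than as anything CSV.jl provides itself. The helper name read_with_first_row_types is hypothetical, and the dateformat "dd/mm/yy" in the usage line is an assumption, since the original call does not show the format. Edge cases such as a missing value in the first row, or later strings that do not fit the inline string type detected from the first row, are not handled.

```julia
using CSV, DataFrames, Dates

# Hypothetical helper, not part of CSV.jl: infer column types from the first
# data row, then re-read the whole input with those types enforced.
function read_with_first_row_types(data::AbstractString; kwargs...)
    # Pass 1: parse only the first data row and record the detected column types.
    first_row = CSV.File(IOBuffer(data); limit=1, kwargs...) |> DataFrame
    first_row_types = Dict(Symbol(name) => eltype(first_row[!, name])
                           for name in names(first_row))

    # Pass 2: parse the entire input with those types. With strict=true, a value
    # that does not parse as its column's type raises an error.
    return CSV.File(IOBuffer(data); types=first_row_types, strict=true, kwargs...) |> DataFrame
end

df = read_with_first_row_types(
    "Date,Time\n01/01/20, 10:10:10\n02/01/20, 11:11:11";
    dateformat="dd/mm/yy",  # assumed format
)
```

The cost of this approach is that the first row is parsed twice and that a single unrepresentative first row decides the type of the whole column.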
You’re certainly welcome to code up your own function that implements your algorithm if it suits your purposes. CSV.jl will indeed respect the types passed from the user and emit warnings for values that don’t parse (or error if strict=true is passed).
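To illustrate the difference between the two modes, here is a small sketch using data like that in the original question. The dateformat "dd/mm/yy" is an assumption, since the original call does not show it.

```julia
using CSV, DataFrames, Dates

data = "Date,Time\n01/01/20, 10:10:10\ninvalid, 12:12:12"
# User-supplied type for the Date column; the format string is an assumption.
opts = (types=Dict(:Date => Date), dateformat="dd/mm/yy")

# Default (strict=false): the unparseable value is replaced by `missing`
# and a warning is logged.
lenient = CSV.File(IOBuffer(data); opts...) |> DataFrame

# strict=true: the same value raises an error instead.
strict_failed = try
    CSV.File(IOBuffer(data); opts..., strict=true)
    false
catch
    true
end
```

In the lenient case the Date column ends up allowing missing values, so downstream code has to handle them explicitly.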