How to force an error when a value can't be parsed as a date in CSV.jl?

ForceBru · November 29, 2022, 7:31pm

I’m trying to parse a column in my CSV file as dates, so I try to use CSV.File(..., dateformat=...) like this:

using CSV, DataFrames, Dates

df = CSV.File(
    IOBuffer("Date,Time\n01/01/20, 10:10:10\n01/01/20, 11:11:11\ninvalid, 12:12:12"),
    dateformat=Dict(:Date=>dateformat"dd/mm/yy", :Time=>dateformat"HH:MM:SS"),
    strict=true
) |> DataFrame

Clearly, invalid is not a valid date. However, this code runs without errors or warnings (even though strict=true) and produces a dataframe where the Date column is comprised of strings:

eltype(df.Date) # InlineStrings.String15

How to force an error here?

I can “fix” this by specifying types=Dict(:Date=>Dates.Date) in the call to CSV.File, but is it not already clear from the dateformat argument that I’d like the Date column to contain Dates.Dates? So I feel like I’m unnecessarily duplicating my intent to parse dates here…
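For reference, here is a minimal sketch of that workaround. The CSV text and the `Value` column are made up for illustration, and I'm assuming `types`, `dateformat` and `strict` combine the way the CSV.jl docs describe (the exact exception type is not something I'd rely on):

```julia
using CSV, DataFrames, Dates

# Illustrative data: the last row has a value that cannot be a date.
csv_text = "Date,Value\n01/01/20,1\n02/01/20,2\ninvalid,3"

# Request the Date type explicitly; with strict=true a value that fails
# to parse should raise an error instead of silently becoming a string.
outcome = try
    CSV.read(
        IOBuffer(csv_text), DataFrame;
        types=Dict(:Date => Date),
        dateformat=Dict(:Date => dateformat"dd/mm/yy"),
        strict=true,
    )
catch err
    err  # keep the exception so it can be inspected
end

println(outcome isa Exception ? "parsing failed: $(typeof(outcome))" : "parsed without error")
```

The intent is still stated twice (once in `types`, once in `dateformat`), which is exactly the duplication I'd like to avoid.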

quinnj · November 29, 2022, 8:38pm

Part of the issue is that if you specify a DateFormat for a specific column, it’s unclear, unless we do some kind of heuristic inspection of the DateFormat, whether you want a Date, Time, or DateTime for the column type.

jar1 · November 29, 2022, 8:50pm

Seems like that inference would be quite effective:

has a date component and no time component: it’s a date
has a time component and no date component: it’s a time
has both: it’s a datetime
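A rough sketch of those three rules, working on the format pattern as a plain string rather than on a `DateFormat` object. The function name `infer_temporal_type` is hypothetical (it is not part of CSV.jl or Dates), and the sketch assumes the pattern contains no escaped literal text that happens to use the same letters:

```julia
using Dates

# Format codes understood by Dates, split by what they describe.
const DATE_CODES = Set("yYmuUdeE")  # year, month, day, day of week
const TIME_CODES = Set("HIMSsp")    # hour, minute, second, millisecond, AM/PM

# Hypothetical helper: pick Date, Time or DateTime from a format pattern.
function infer_temporal_type(pattern::AbstractString)
    has_date = any(in(DATE_CODES), pattern)
    has_time = any(in(TIME_CODES), pattern)
    has_date && has_time && return DateTime
    has_date && return Date
    has_time && return Time
    throw(ArgumentError("no date or time codes found in $(repr(pattern))"))
end

infer_temporal_type("dd/mm/yy")             # Date
infer_temporal_type("HH:MM:SS")             # Time
infer_temporal_type("yyyy-mm-dd HH:MM:SS")  # DateTime
```

Note that `m` (month) and `M` (minute) are different codes, so the check has to be case-sensitive.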

ForceBru · November 29, 2022, 10:38pm

Why not simply infer all types based on the first row of the CSV file?

1. Try to parse each column of the first (non-header) row.
2. Assign to each column of the overall CSV the types of this first row’s columns.
3. Go on as if the types argument to CSV.File was set to the types that were inferred from the first row.

I guess special handling of the user-supplied types argument will be needed, but the code could again try to fully parse and type the first row, respecting the types argument, then fill out types of each column based on the first row and then go to step (3) above.

A very dumb but straightforward algorithm:

Run CSV.File for the first row of the CSV.
Store type information in first_row_types.
Rerun CSV.File(..., types= first_row_types) for the entire file.
Bingo, now each column has a type inferred from the first row of the CSV. If some other row doesn’t conform, that’s an error.
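Something like the following is what I have in mind. It is only a sketch: `read_with_first_row_types` is a hypothetical helper name, it takes the CSV text as a string so the input can be read twice, and it assumes the `limit` keyword restricts type detection to the rows actually read:

```julia
using CSV, DataFrames, Dates

# Hypothetical helper: infer column types from the first data row only,
# then re-read the whole input with those types enforced.
function read_with_first_row_types(data::AbstractString; kwargs...)
    # Pass 1: read just one data row and record each column's element type.
    first_row = CSV.read(IOBuffer(data), DataFrame; limit=1, kwargs...)
    first_row_types = Dict(
        name => eltype(col) for (name, col) in pairs(eachcol(first_row))
    )
    # Pass 2: read everything, treating the recorded types as mandatory.
    return CSV.read(IOBuffer(data), DataFrame;
                    kwargs..., types=first_row_types, strict=true)
end

fmt = Dict(:Date => dateformat"dd/mm/yy")

good = read_with_first_row_types("Date,Value\n01/01/20,1\n02/01/20,2"; dateformat=fmt)
eltype(good.Date)  # expected to be Date

# The same call on data with a non-conforming later row should throw.
bad = try
    read_with_first_row_types("Date,Value\n01/01/20,1\ninvalid,2"; dateformat=fmt)
catch err
    err
end
```

One obvious weakness: if the first row has a missing value, or an integer where later rows have floats, the inferred type will be too narrow and perfectly good files will fail.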

quinnj · November 30, 2022, 12:40am

You’re certainly welcome to code up your own function that implements your algorithm if it suits your purposes. CSV.jl will indeed respect the types passed from the user and emit warnings for values that don’t parse (or error if strict=true is passed).
