Stricter date parsing

Tamas_Papp · July 12, 2018, 2:40pm

I don’t know if this is intended or not, but date parsing is somewhat too permissive:

julia> VERSION
v"0.7.0-beta.275"

julia> using Dates

julia> tryparse(Date, "19800101", dateformat"yyyy-mm-dd")
19800101-01-01

Is this a bug or a feature? How can I get stricter validation?

StefanKarpinski · July 12, 2018, 3:30pm

Yikes, that seems rather iffy! Do date formats implicitly not require the entire format to match?

rened · July 12, 2018, 4:31pm

It might not be as clear cut - what about 123 A.D.?

ExpandingMan · July 12, 2018, 4:42pm

I would definitely call that a bug, you should probably file an issue.

StefanKarpinski · July 12, 2018, 5:04pm

The bigger issue to me is that it ignores the trailing -mm-dd part of the date format. That part of the data seems like it would be mandatory, not optional. The year could potentially be 19800101, although it might be sensible to only parse years as long as the given number of format digits.

Tamas_Papp · July 12, 2018, 8:38pm

Thanks for the discussion. Opened issue
https://github.com/JuliaLang/julia/issues/28090

quinnj · July 12, 2018, 8:43pm

The DateFormat docs could probably be clearer here, but essentially dateformat"y-m-d" ends up parsing the exact same dates as dateformat"yyyy-mm-dd"; the presence of a y or m just means, “parse the digits until another delimiter is encountered”. So the doc clarification should mention there’s currently no way to restrict the number of digits parsed.

There’s also not currently a way to mark a date part as “mandatory”; if some parts are parsed, then defaults are assumed for the rest. That just goes back to the fact that we have Date(2018, 1), which defaults to January 1st, 2018, even though no “day” part was given.

So while this might be surprising and maybe some enhancements need to be made, there’s not really any “bugs” here; this is definitely working as intended.
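A minimal sketch of the two points above, assuming a Julia version where Dates is a standard library (0.7 or later). It only illustrates the constructor defaults and that the repeated format characters make no difference for a well-formed delimited string; the variable names are illustrative.

```julia
using Dates

# Constructor defaults: an omitted month or day falls back to 1.
first_of_january = Date(2018, 1)   # same date as Date(2018, 1, 1)

# In a delimited format, repeating a format character does not change
# which strings are accepted, so both formats give the same date here.
a = Date("2018-07-13", dateformat"y-m-d")
b = Date("2018-07-13", dateformat"yyyy-mm-dd")

println(first_of_january == Date(2018, 1, 1))
println(a == b)
```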

ExpandingMan · July 12, 2018, 8:49pm

Makes sense to me.

I take it this is standard date parsing behavior and others do this as well?

This definitely seems like it would be a nasty surprise for somebody somewhere, but if it is some sort of universal standard behavior for date parsing the burden would seem to be on user. If it is a Julia specific quirk the burden would seem to be on Julia.

(I was going to check what Python does but I can’t be bothered looking up how to do it right now lol)

Tamas_Papp · July 12, 2018, 8:49pm

Possibly, but at least the documentation should be improved then. ?DateFormat says

yyyymmdd 19960101 Matches fixed-width year, month, and day

which may lead the user to conclude that repeating characters leads to fixed-width parsing.

Also, it can be argued that if the parser is not validating things like yyyy-mm-dd, perhaps it should, or at least syntax should be implemented for it. But that’s a wishlist item.

quinnj · July 12, 2018, 8:56pm

Yeah, there’s just a big difference parsing-wise between fixed-width and delimited. With fixed-width, obviously you’re expecting only a certain # of characters and more or less than what is specified is going to blow things up.

With delimited though, what if you had a mix of date strings like ["2018-1-1", "2018-1-2", "2018-1-20"], how do you specify the “day” part for parsing? You either have a dateformat string like "yyyy-mm-dd" or "yyyy-m-d", but you can’t really require one or two digits, because there might be both in a set of date strings to be parsed.

So, generally, the approach is that fixed-width parsing is always more strict, while delimited parsing is more lenient.
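To make the distinction concrete, here is a small sketch. The sample strings come from the post above; the variable names are illustrative. A single delimited format handles the mixed one- and two-digit days, while the fixed-width format yyyymmdd is the one listed in ?DateFormat.

```julia
using Dates

# Delimited: each part is read up to the next delimiter, so one format
# covers both "2018-1-2" and "2018-1-20".
mixed = ["2018-1-1", "2018-1-2", "2018-1-20"]
delimited = dateformat"yyyy-mm-dd"
dates = [Date(s, delimited) for s in mixed]

# Fixed-width: no delimiters, so the widths in the format decide where
# each part starts and ends.
fixed = dateformat"yyyymmdd"
d = Date("20180120", fixed)

println(dates)
println(d)
```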

Another thing maybe we should consider is taking away the default values for month & day parts; that would mean you’d have to explicitly call Date(2018, 1, 1) instead of relying on defaults. That would have helped this situation because Date(19800101) would have thrown an error because no month or day arguments were given. I’m not sure how widely used/expected/relied upon those default arguments are though.

Tamas_Papp · July 13, 2018, 8:46am

I would be OK with keeping the default options in the Date constructor, but not have the parser fill in missing fields. When I parse a dateformat"y-m-d", I implicitly expect an m and d to be there, also the two -s.

A few words on my actual use case: sometimes I parse datasets from CSV or similar where someone thought that having sentinel values like 99999 is a great idea… even in non-numerical columns. I don’t want this to parse as Date(99999).
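One possible workaround for this use case, sketched under the assumption that only the y-m-d format matters: check that both delimiters are present before handing the string to tryparse. The helper name tryparse_ymd is hypothetical and not part of Dates.

```julia
using Dates

# Hypothetical helper, not part of Dates: insist on exactly two '-'
# delimiters, then let tryparse handle the rest.
function tryparse_ymd(s::AbstractString)
    count(isequal('-'), s) == 2 || return nothing
    return tryparse(Date, s, dateformat"y-m-d")
end

println(tryparse_ymd("2018-07-13"))
println(tryparse_ymd("99999"))      # rejected before parsing: no delimiters
println(tryparse_ymd("19800101"))   # rejected for the same reason
```

This rejects sentinel values such as 99999 because they contain no delimiters. It does not limit the number of digits in any part, so an over-long year followed by two delimited parts would still get through.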

ExpandingMan · July 13, 2018, 1:32pm

I think what I get hung up on is that since the characters get repeated (i.e. "yyyy") I expect that to indicate a specific width or at least an upper limit, though I do vaguely seem to remember there being different characters for fixed width.

I usually find that it’s better to replace those values with missing (or whatever) before trying to parse the set.

Tamas_Papp · July 13, 2018, 1:34pm

Eventually, yes, but first I have to know what they are. Imagine a half-TB dataset with undocumented conventions for missing values that change occasionally. For the first pass, I just want to know if something is a date of a certain kind, or not, collect the invalid values and look at them.

In any case, I think that restricting the parsing to be less permissive in the direction I suggested is generally useful.

ExpandingMan · July 13, 2018, 1:37pm

I feel your pain. This is definitely a case for very strict parsing.

By the way, do we have any methods that return missing in the event of a parsing error? try catch can be rather slow, as I understand.

Tamas_Papp · July 13, 2018, 1:43pm

tryparse returns nothing (in v0.7), which works fine. It is very fast.
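A rough sketch of the first-pass workflow using tryparse instead of try/catch: parse what can be parsed and collect the rest for inspection. The function name split_by_parse and the sample values are made up for illustration.

```julia
using Dates

# Hypothetical helper: separate values that parse under `fmt` from those
# that do not, without throwing or catching exceptions.
function split_by_parse(values, fmt::DateFormat)
    parsed = Date[]
    rejected = String[]
    for v in values
        d = tryparse(Date, v, fmt)
        if d === nothing
            push!(rejected, v)   # keep the raw value for later inspection
        else
            push!(parsed, d)
        end
    end
    return parsed, rejected
end

parsed, rejected = split_by_parse(["2018-07-13", "n/a", "2018-07-14"],
                                  dateformat"yyyy-mm-dd")
println(parsed)
println(rejected)
```

Note that this inherits the permissiveness discussed in this thread: an all-digit sentinel can still parse as a bare year with a delimited format, so it would land in parsed rather than rejected.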

chakravala · July 13, 2018, 2:19pm

Do you need it just for this one date format?

julia> checkdate(date::String) = occursin(r"[0-9]{1,4}-[0-9]{1,2}-[0-9]{1,2}",date)
checkdate (generic function with 1 method)

julia> checkdate("2018-07-13")
true

julia> checkdate("19800101")
false

Or do you need something that works for a generic date format specification?

chakravala · July 13, 2018, 2:29pm

Here is a slightly more generic solution

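julia> function checkdate(date::String,format::String)
           # number of format characters in each '-'-separated part, e.g. [4, 2, 2]
           l = length.(split(format,'-'))
           # allow 1 to l[k] digits in part k
           r = join(["[0-9]{1,$(l[k])}" for k ∈ 1:length(l)],'-')
           # anchored so the whole string has to match
           occursin(Regex("^$r\$"),date)
       end;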

julia> checkdate("2018-07-13","yyyy-mm-dd")
true

julia> checkdate("19800101","yyyy-mm-dd")
false

This actually sets an upper limit to the number of digits after counting them.

Tamas_Papp · July 13, 2018, 2:39pm

I would prefer something generic. Also note that since v0.6, date parsing using DateFormat is very heavily optimized; I did not benchmark but I imagine a regex-based solution would be orders of magnitude slower.
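For comparison, the same shape check can be written without a regex. This is only a sketch under the same assumption as the snippet above ('-' is the only delimiter). The name matches_shape is hypothetical, and no timing claim is made; it would need benchmarking against both the regex version and plain tryparse.

```julia
# Hypothetical helper: compare the '-'-separated parts of the input with
# the parts of the format string. Each input part must consist of 1 to N
# ASCII digits, where N is the length of the matching format part.
function matches_shape(s::AbstractString, format::AbstractString)
    fparts = split(format, '-')
    sparts = split(s, '-')
    length(fparts) == length(sparts) || return false
    return all(1 <= length(sp) <= length(fp) && all(isdigit, sp)
               for (sp, fp) in zip(sparts, fparts))
end

println(matches_shape("2018-07-13", "yyyy-mm-dd"))
println(matches_shape("19800101", "yyyy-mm-dd"))
```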

StefanKarpinski · July 13, 2018, 2:44pm

Agree, I don’t see why constructor defaults and allowing incomplete parsing are coupled. It seems reasonable for constructors to supply defaults while format parsers require the entire format to be matched (and anything not in the format to be supplied by default). The current behavior strikes me as dangerous enough to be considered something of a design bug.

Liso · July 14, 2018, 7:24am

This could probably be interesting here too:
