Stricter date parsing

question

#1

I don’t know if this is intended or not, but date parsing is somewhat too permissive:

julia> VERSION
v"0.7.0-beta.275"

julia> using Dates

julia> tryparse(Date, "19800101", dateformat"yyyy-mm-dd")
19800101-01-01

Is this a bug or a feature? How can I get stricter validation?


#2

Yikes, that seems rather iffy! Do date formats implicitly not require the entire format to match?


#3

It might not be as clear cut - what about 123 A.D.?


#4

I would definitely call that a bug; you should probably file an issue.


#5

The bigger issue to me is that it ignores the trailing -mm-dd part of the date format. That part of the data seems like it would be mandatory, not optional. The year could potentially be 19800101, although it might be sensible to only parse years as long as the given number of format digits.


#6

Thanks for the discussion. Opened issue


#7

The DateFormat docs could probably be clearer here, but essentially dateformat"y-m-d" ends up parsing the exact same dates as dateformat"yyyy-mm-dd"; the presence of a y or m just means, “parse the digits until another delimiter is encountered”. So the doc clarification should mention there’s currently no way to restrict the number of digits parsed.

There’s also not currently a way to mark a date part as “mandatory”; if some parts are parsed, then defaults are assumed for the rest. That just goes back to the fact that we have Date(2018, 1), which defaults to January 1st, 2018, even though no “day” part was given.

So while this might be surprising and maybe some enhancements need to be made, there aren’t really any “bugs” here; this is definitely working as intended.
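For reference, the constructor defaults mentioned above can be checked directly. This is only a small sketch of the constructor behavior, not of the parser internals:

```julia
using Dates

# The constructor fills in omitted trailing parts with 1.
@assert Date(2018, 1) == Date(2018, 1, 1)
@assert Date(2018) == Date(2018, 1, 1)

# A single integer is taken as the year, however large it is.
d = Date(19800101)
@assert year(d) == 19800101
@assert (month(d), day(d)) == (1, 1)
```

This is the same shape of result as in the original post: a very large year with month and day filled in by default.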


#8

Makes sense to me.

I take it this is standard date parsing behavior and others do this as well?

This definitely seems like it would be a nasty surprise for somebody somewhere, but if it is some sort of universal standard behavior for date parsing, the burden would seem to be on the user. If it is a Julia-specific quirk, the burden would seem to be on Julia.

(I was going to check what Python does but I can’t be bothered looking up how to do it right now lol)


#9

Possibly, but at least the documentation should be improved then. ?DateFormat says

yyyymmdd 19960101 Matches fixed-width year, month, and day

which may lead the user to conclude that repeating characters leads to fixed-width parsing.

Also, it can be argued that if the parser is not validating things like yyyy-mm-dd, perhaps it should, or at least syntax should be implemented for it. But that’s a wishlist item.


#10

Yeah, there’s just a big difference parsing-wise between fixed-width and delimited. With fixed-width, obviously you’re expecting only a certain # of characters and more or less than what is specified is going to blow things up.

With delimited though, what if you had a mix of date strings like ["2018-1-1", "2018-1-2", "2018-1-20"]? How do you specify the “day” part for parsing? You either have a dateformat string like "yyyy-mm-dd" or "yyyy-m-d", but you can’t really require one or two digits, because there might be both in a set of date strings to be parsed.

So, generally, the approach is that fixed-width parsing is always more strict, while delimited parsing is more lenient.

Another thing maybe we should consider is taking away the default values for month & day parts; that would mean you’d have to explicitly call Date(2018, 1, 1) instead of relying on defaults. That would have helped this situation because Date(19800101) would have thrown an error because no month or day arguments were given. I’m not sure how widely used/expected/relied upon those default arguments are though.
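A minimal sketch of the fixed-width versus delimited distinction described above. The format strings come from this thread and the DateFormat docs; the variable names are only for illustration:

```julia
using Dates

# Fixed-width: no delimiters between the parts, as in the docs example.
fixed = dateformat"yyyymmdd"
@assert Date("19960101", fixed) == Date(1996, 1, 1)

# Delimited: each part reads digits up to the next delimiter,
# so one- and two-digit days parse with the same format.
delimited = dateformat"y-m-d"
strs = ["2018-1-1", "2018-1-2", "2018-1-20"]
dates = [Date(s, delimited) for s in strs]
@assert dates == [Date(2018, 1, 1), Date(2018, 1, 2), Date(2018, 1, 20)]
```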


#11

I would be OK with keeping the default options in the Date constructor, but I would not have the parser fill in missing fields. When I parse with dateformat"y-m-d", I implicitly expect an m and a d to be there, as well as the two -s.

A few words on my actual use case: sometimes I parse datasets from CSV or similar where someone thought that having sentinel values like 99999 is a great idea… even in non-numerical columns. I don’t want this to parse as Date(99999).


#12

I think what I get hung up on is that since the characters get repeated (i.e. "yyyy") I expect that to indicate a specific width or at least an upper limit, though I do vaguely seem to remember there being different characters for fixed width.

I usually find that it’s better to replace those values with missing (or whatever) before trying to parse the set.


#13

Eventually, yes, but first I have to know what they are. Imagine a half-TB dataset with undocumented conventions for missing values that change occasionally. For the first pass, I just want to know if something is a date of a certain kind, or not, collect the invalid values and look at them.

In any case, I think that restricting the parsing to be less permissive in the direction I suggested is generally useful.
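A rough sketch of that kind of first pass, assuming the expected shape is exactly four, two, and two digits separated by hyphens. The helper name triage_dates and the shape regex are made up for illustration and are not part of Dates:

```julia
using Dates

# Hypothetical helper: keep a value only if it has the assumed yyyy-mm-dd
# shape AND names a real calendar date; collect everything else.
function triage_dates(values)
    shape = r"^[0-9]{4}-[0-9]{2}-[0-9]{2}$"   # assumed shape, adjust as needed
    df = dateformat"y-m-d"
    dates = Date[]
    rejects = String[]
    for v in values
        # Only call the parser when the shape check passes.
        d = occursin(shape, v) ? tryparse(Date, v, df) : nothing
        if d === nothing
            push!(rejects, v)
        else
            push!(dates, d)
        end
    end
    return dates, rejects
end

dates, rejects = triage_dates(["2018-07-13", "99999", "19800101", "2018-02-30"])
@assert dates == [Date(2018, 7, 13)]
@assert rejects == ["99999", "19800101", "2018-02-30"]
```

The rejects vector can then be inspected (for example by counting distinct values) to discover which sentinel conventions the dataset uses.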


#14

I feel your pain. This is definitely a case for very strict parsing.

By the way, do we have any methods that return missing in the event of a parsing error? try catch can be rather slow, as I understand.


#15

tryparse returns nothing (in v0.7), which works fine. It is very fast.
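A small sketch of what that looks like, plus one possible way to turn nothing into missing. The helper parse_or_missing is hypothetical, not something Dates provides:

```julia
using Dates

df = dateformat"y-m-d"

@assert tryparse(Date, "2018-07-13", df) == Date(2018, 7, 13)
@assert tryparse(Date, "not a date", df) === nothing
@assert tryparse(Date, "2018-02-30", df) === nothing   # no such day

# Hypothetical helper: map a parse failure to `missing` without try/catch.
parse_or_missing(s, df) = something(tryparse(Date, s, df), missing)

@assert parse_or_missing("n/a", df) === missing
@assert parse_or_missing("2018-07-13", df) == Date(2018, 7, 13)
```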


#16

Do you need it just for this one date format?

julia> checkdate(date::String) = occursin(r"^[0-9]{1,4}-[0-9]{1,2}-[0-9]{1,2}$",date)
checkdate (generic function with 1 method)

julia> checkdate("2018-07-13")
true

julia> checkdate("19800101")
false

Or do you need something that works for a generic date format specification?


#17

Here is a slightly more generic solution

julia> function checkdate(date::String,format::String)
           l = length.(split(format,'-'))
           r = join(["[0-9]{1,$(l[k])}" for k ∈ 1:length(l)],'-')
           occursin(Regex("^$r\$"),date)
       end;

julia> checkdate("2018-07-13","yyyy-mm-dd")
true

julia> checkdate("19800101","yyyy-mm-dd")
false

This actually sets an upper limit to the number of digits after counting them.
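A possible variation on the same idea, sketched under the assumption that the format is purely numeric and '-'-delimited: require the exact number of digits per part, then hand the string to tryparse so that calendar-invalid dates are rejected as well. The name strictparse is made up for this example:

```julia
using Dates

# Hypothetical helper: strict parsing for numeric, '-'-delimited formats.
# Each part must have exactly as many digits as the format has letters.
function strictparse(str::AbstractString, format::AbstractString)
    widths = length.(split(format, '-'))
    shape = Regex("^" * join(("[0-9]{$w}" for w in widths), "-") * "\$")
    occursin(shape, str) || return nothing
    # Building the DateFormat on every call is slow; cache it in real use.
    return tryparse(Date, str, DateFormat(format))
end

@assert strictparse("2018-07-13", "yyyy-mm-dd") == Date(2018, 7, 13)
@assert strictparse("19800101", "yyyy-mm-dd") === nothing
@assert strictparse("2018-7-13", "yyyy-mm-dd") === nothing    # wrong width
@assert strictparse("2018-02-30", "yyyy-mm-dd") === nothing   # no such day
```

I have not benchmarked this; the regex check and the per-call DateFormat construction both add overhead on top of the parser itself.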


#18

I would prefer something generic. Also note that since v0.6, date parsing using DateFormat is very heavily optimized; I did not benchmark but I imagine a regex-based solution would be orders of magnitude slower.


#19

Agree, I don’t see why constructor defaults and allowing incomplete parsing are coupled. It seems reasonable for constructors to supply defaults while format parsers require the entire format to be matched (and anything not in the format to be supplied by default). The current behavior strikes me as dangerous enough to be considered something of a design bug.


#20

This could probably be interesting here too: