If you have an array of Date() objects and you want to find the middle one, what do you do?
If you have an odd number of objects, just take the middle one.
If you have an even number, you would normally take the mean of the middle two. But arithmetic such as adding two dates or halving a date isn’t defined, nor should it be, so you can’t do that.
Returning an array of the middle two also sounds like odd behavior, since it leads to type instability. At the same time, it seems reasonable to want the middle(ish) value of an array of dates.
Are there any standard practices for dealing with this?
This appears to be a conceptual question, not something specific to Julia or even programming. Nevertheless, sample quantiles can be defined for ordinal data: you just don’t interpolate, and you need to break ties, e.g. by rounding the index up.
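To make the "don't interpolate, round the index up" idea concrete, here is a minimal sketch. `ordinal_quantile` is a made-up name, not something in Base or Statistics, and clamping the index so that p = 0 returns the first element is my own assumption.

```julia
using Dates

# Hypothetical helper (not part of Base or Statistics): an uninterpolated
# quantile that only needs the values to be sortable.
function ordinal_quantile(values, p::Real)
    0 <= p <= 1 || throw(ArgumentError("p must be in [0, 1]"))
    isempty(values) && throw(ArgumentError("values must not be empty"))
    sorted = sort(collect(values))
    n = length(sorted)
    # Break ties by rounding the index up; clamp so that p = 0 maps to index 1.
    idx = clamp(ceil(Int, p * n), 1, n)
    return sorted[idx]
end

dates = [Date(y) for y in 2001:2006]
ordinal_quantile(dates, 0.5)   # Date(2003, 1, 1): the lower of the two middle dates
```

With an even number of elements this returns one of the actual elements rather than a midpoint, so the result always has the same type as the input.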
Yes, it is conceptual, but it’s related to how my rewrite of DataFrames’s describe would work with quantiles.
I suppose the Julian flavor of this question is whether there is a standard for doing this within Julia packages, particularly with regard to the Date type.
We want the summary statistics to be as flexible as possible, so we use try...catch to see if we can get an output for non-numeric types. Say, for example, you have MyType defined and MyType(1) < MyType(2); then we can still tell you the minimum of that column, returning nothing if it’s not defined.
For returning the 25th, 50th, and 75th percentiles, we are using the quantile function, which requires arithmetic operations. So MyType wouldn’t work.
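A rough sketch of that pattern, for illustration only. This `MyType` definition and the `try_stat` helper are hypothetical, and this is not the actual code in the pull request.

```julia
using Statistics

# Hypothetical MyType: only an ordering is defined, no arithmetic.
struct MyType
    x::Int
end
Base.isless(a::MyType, b::MyType) = isless(a.x, b.x)

# Hypothetical helper: apply a summary function and return `nothing` if it throws.
function try_stat(f, col)
    try
        return f(col)
    catch
        return nothing
    end
end

col = [MyType(3), MyType(1), MyType(2)]
try_stat(minimum, col)                  # MyType(1), since only an ordering is needed
try_stat(c -> quantile(c, 0.5), col)    # nothing, because quantile needs arithmetic on MyType
```

A bare `catch` like this swallows every error, not just a missing method, which keeps the summary from ever failing but can also hide genuine bugs.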
The obvious answer is to say, “hey, if you want us to give you something other than nothing in return, write a method for quantile that gives you what you want.” That’s a fine solution, I think, but maybe there is a more general function for all types of ordinal data that I don’t know about, and that is commonly used in Julia packages.
I am not sure about the obvious answer, but a solution could involve traits that carry information about whether values of a type are cardinal (everything <: Real), ordinal (e.g. Date), or nominal (the default). Reporting would then just use this information, e.g. interpolate for cardinal values, use uninterpolated quantiles for ordinal ones, and maybe just show the 5 most common values for nominal ones.
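Roughly what I have in mind, as a sketch only. Every name here (`ScaleType`, `Cardinal`, `Ordinal`, `Nominal`, `scaletype`, `summarize`) is made up for illustration and is not an existing Julia or DataFrames API. For brevity, `summarize` only reports a median-like value, or the most common values for nominal data.

```julia
using Dates

# Hypothetical trait types describing the measurement scale of a column.
abstract type ScaleType end
struct Cardinal <: ScaleType end
struct Ordinal  <: ScaleType end
struct Nominal  <: ScaleType end

scaletype(::Type) = Nominal()                     # default for unknown types
scaletype(::Type{<:Real}) = Cardinal()
scaletype(::Type{<:Dates.TimeType}) = Ordinal()   # e.g. Date, DateTime

# Dispatch on the trait of the element type.
summarize(col::AbstractVector{T}) where {T} = summarize(scaletype(T), col)

# Cardinal: interpolate between the two middle values when the length is even.
function summarize(::Cardinal, col)
    s = sort(col)
    n = length(s)
    return isodd(n) ? s[(n + 1) ÷ 2] : (s[n ÷ 2] + s[n ÷ 2 + 1]) / 2
end

# Ordinal: no interpolation, take the lower of the two middle elements.
summarize(::Ordinal, col) = sort(col)[cld(length(col), 2)]

# Nominal: up to 5 most common values, most frequent first.
function summarize(::Nominal, col)
    counts = Dict{eltype(col), Int}()
    for v in col
        counts[v] = get(counts, v, 0) + 1
    end
    ranked = sort(collect(counts); by = last, rev = true)
    return first.(ranked[1:min(5, end)])
end

summarize([1, 2, 3, 4])                     # 2.5
summarize([Date(y) for y in 2001:2006])     # Date(2003, 1, 1)
summarize(["a", "b", "a"])                  # ["a", "b"]
```

A user-defined ordered type would fall into the nominal default unless its author opts in, e.g. by defining `scaletype` for that type to return `Ordinal()`.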
I think I would just call sort and use something like floor(length(col) * .25) as the index, then have that in the documentation.
However, you don’t want to do anything too expensive with, say, strings and cause describe to be slow. Probably best to finish up the pull request now and then see what people think.
library(lubridate)
# Note that we have an even number of dates, so median
# is not obviously defined.
dates = ymd("20010101", "20020101", "20030101", "20040101", "20050101", "20060101")
median(dates)
> 2003-07-02
# returns the midpoint. Also, note that it does not drop down into
# seconds, but will round down instead of doing that.
quantile(dates)
> Error # you need a special option
quantile(dates, type = 1) #Whatever that means
> 0%
2001-01-01
25%
2002-01-01
50%
2003-01-01
75%
2004-01-01
100%
2005-01-01
# Clearly it rounds down.
As far as I can tell, Python throws an error for anything, be it pandas’ date format or datetime’s date format. Though I could have sworn I had something the other day… If anyone works with dates regularly in Python, feel free to pitch in.
Given that quantile might be moved into stdlib soon, maybe now we could make a push to add an option like what R has.
Then in DataFrames we could use try...catch twice: once to see if the user-defined object has a normal median working with it, and a second time to see if a special ordinal option works, then return whatever that gives for quantile and median.
But then add another method for strings… since the user probably isn’t interested in the minimum and maximum string.
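The two-step try...catch fallback could look something like this. It is only a sketch: `ordinal_median` and `describe_median` are hypothetical names, and it assumes (as this thread suggests) that the regular median throws for types without arithmetic, such as Date.

```julia
using Dates, Statistics

# Hypothetical ordinal fallback: the lower of the two middle elements after sorting.
ordinal_median(col) = sort(col)[cld(length(col), 2)]

# Hypothetical helper: try the usual median first, then the ordinal version,
# and give up with `nothing` if neither works.
function describe_median(col)
    try
        return median(col)
    catch
        try
            return ordinal_median(col)
        catch
            return nothing
        end
    end
end

describe_median([1, 2, 3, 4])                     # 2.5, from the regular median
describe_median([Date(y) for y in 2001:2006])     # falls back to the ordinal version
describe_median([1 + 2im, 3 + 4im])               # nothing: complex numbers can't be sorted
```

Strings would need separate handling here, since they sort fine and would otherwise get an "ordinal median" nobody asked for.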
Opened up an issue here! I think it makes sense for it to live in Base because the current quantile function is pretty complicated and an ordinal version would only change the very last step. But I’m sure the developers hear that a lot!