Functions for median of ordinal data

pdeffebach · May 17, 2018, 6:54pm

If you have an array of Date() objects and you want to find the middle one, what do you do?

If you have an odd number of objects, just take the middle one.
If you have an even number, you would normally take the mean of the middle two. But the arithmetic that requires (adding two dates and halving the result) isn’t defined for dates, nor should it be, so you can’t do that.

Returning an array of the middle two also sounds like odd behavior, since it leads to type instability. At the same time, it seems reasonable to want the middle(ish) value of an array of dates.

Are there any standard practices for dealing with this?

Tamas_Papp · May 18, 2018, 7:16pm

This appears to be a conceptual question, not something specific to Julia or even programming. Nevertheless, sample quantiles can be defined for ordinal data: you just don’t interpolate, and you need to break ties, e.g. by rounding the index up.
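One possible reading of that suggestion, as a minimal Julia sketch. The name `ordinal_quantile` is hypothetical (it is not a function in Base or Statistics), and the rule used here, taking the element at index `ceil(n * p)` of the sorted data, is just one way of rounding the index up:

```julia
using Dates

# Hypothetical helper: uninterpolated quantile for any type with a total order.
# Picks the element at index ceil(n * p) of the sorted data (index 1 when p == 0).
function ordinal_quantile(v::AbstractVector, p::Real)
    0 <= p <= 1 || throw(ArgumentError("p must be in [0, 1]"))
    isempty(v) && throw(ArgumentError("collection must be non-empty"))
    sorted = sort(v)
    k = clamp(ceil(Int, length(sorted) * p), 1, length(sorted))
    return sorted[k]
end

dates = [Date(2001), Date(2002), Date(2003), Date(2004), Date(2005), Date(2006)]
ordinal_quantile(dates, 0.5)   # Date(2003, 1, 1)
```

Because no interpolation happens, the result is always an element of the input and has the same type, which avoids the type-instability concern raised above.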

pdeffebach · May 18, 2018, 7:26pm

Yes, it is conceptual, but it’s related to how my rewrite of DataFrames’ describe would work with quantiles.

I suppose the Julian flavor of this question is whether there is a standard for doing this within Julia packages, particularly with regard to the Date type.

We want the summary statistics to be as flexible as possible, so we use try...catch to see if we can get an output for non-numeric types. Say, for example, you have MyType defined and MyType(1) < MyType(2); then we can still tell you the minimum of that column, returning nothing if it’s not defined.
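A rough sketch of that pattern, not the actual DataFrames code. `MyType` is given a made-up field and ordering here, and `try_stat` is a hypothetical helper name:

```julia
# Hypothetical type with an ordering but no arithmetic.
struct MyType
    x::Int
end
Base.isless(a::MyType, b::MyType) = isless(a.x, b.x)

# Hypothetical helper: apply `f` to a column, returning `nothing` if it throws.
function try_stat(f, col)
    try
        return f(col)
    catch
        return nothing
    end
end

col = [MyType(2), MyType(1), MyType(3)]
try_stat(minimum, col)   # MyType(1), because isless is defined
try_stat(sum, col)       # nothing, because + is not defined for MyType
```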

For returning the 25th, 50th, and 75th percentiles, we are using the quantile function, which requires mathematical operations. So MyType wouldn’t work.

The obvious answer is to say, “hey, if you want us to give you something other than nothing in return, write a method for quantile that gives you what you want.” That’s a fine solution, I think, but maybe there is a more general function for all types of ordinal data that I don’t know about, and that is commonly used in Julia packages.

Tamas_Papp · May 18, 2018, 7:33pm

I am not sure about the obvious answer, but a solution could involve traits, which carry information about whether values of a type are cardinal (everything <: Real), ordinal (e.g. Date), or nominal (the default). Then reporting would just use this information, e.g. interpolate for cardinal values, use uninterpolated quantiles for ordinal, and maybe just show the 5 most common values for nominal.
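A minimal sketch of what such traits might look like. All names here (`Scale`, `Cardinal`, `Ordinal`, `Nominal`, `scale`, `summary_kind`) are hypothetical and do not come from any existing package; the strings stand in for the real reporting code:

```julia
using Dates

# Hypothetical trait types for the measurement scale of a column's element type.
abstract type Scale end
struct Cardinal <: Scale end
struct Ordinal  <: Scale end
struct Nominal  <: Scale end

scale(::Type) = Nominal()                      # default for unknown types
scale(::Type{<:Real}) = Cardinal()             # arithmetic is meaningful
scale(::Type{<:Dates.TimeType}) = Ordinal()    # ordered, e.g. Date and DateTime

# Dispatch on the trait to choose how a column would be summarised.
summary_kind(col::AbstractVector) = summary_kind(scale(eltype(col)), col)
summary_kind(::Cardinal, col) = "interpolated quantiles"
summary_kind(::Ordinal, col)  = "uninterpolated quantiles"
summary_kind(::Nominal, col)  = "most common values"

summary_kind([1.5, 2.5])                  # "interpolated quantiles"
summary_kind([Date(2001), Date(2002)])    # "uninterpolated quantiles"
summary_kind(["a", "b"])                  # "most common values"
```

A package author could opt a custom ordered type in by adding one `scale` method, without touching the reporting code.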

pdeffebach · May 19, 2018, 1:00am

I think I would just call sort and use something like floor(length(col) * .25) as the index, then have that in the documentation.

However, you don’t want to do anything too expensive with, say, strings, and cause describe to be slow. Probably best to finish up the pull request now and then see what people think.
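If a full sort turns out to be a concern, one option (an assumption on my part, not something the pull request does) is Base’s `partialsort`, which finds the k-th smallest element without fully sorting. The helper name `lower_quartile` is hypothetical, and comparisons on strings still cost something:

```julia
# Hypothetical helper: uninterpolated lower quartile using the floor rule above.
function lower_quartile(col::AbstractVector)
    isempty(col) && throw(ArgumentError("column must be non-empty"))
    k = max(1, floor(Int, length(col) * 0.25))   # guard against index 0 for short columns
    return partialsort(col, k)                   # k-th smallest element; `col` is not mutated
end

fruits = ["pear", "apple", "fig", "kiwi", "plum", "date", "lime", "yuzu"]
lower_quartile(fruits)   # "date" (2nd smallest of 8)
```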

nalimilan · May 19, 2018, 9:11pm

I’m curious how other software handles this.

pdeffebach · May 23, 2018, 8:46pm

In R

library(lubridate)
# Note that we have an even number of dates, so median
# is not obviously defined.
dates = ymd("20010101", "20020101", "20030101", "20040101", "20050101", "20060101")
median(dates) 
> 2003-07-02 
# returns the midpoint. Also, note that it does not drop down into 
# seconds, but will round down instead of doing that. 
quantile(dates) 
> Error # you need a special option
quantile(dates, type = 1)  #Whatever that means
> 0%
2001-01-01
25%
2002-01-01
50%
2003-01-01
75%
2004-01-01
100%
2005-01-01
# Clearly it rounds down.

As far as I can tell, Python throws an error for anything, be it pandas’ date format or datetime’s date format. Though I could have sworn I had something the other day… If anyone works with dates regularly in Python, feel free to pitch in.

pdeffebach · May 23, 2018, 8:52pm

Given that quantile might be moved into stdlib soon, maybe now we could make a push to add an option like what R has.

Then in DataFrames we could use try...catch twice: once to see if the normal median works with the user-defined object, and a second time to see if a special ordinal option works, then return whatever that is for quantile and median.
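Roughly, the two-step fallback could look like the sketch below. `Grade`, `ordinal_median`, and `describe_median` are hypothetical names; `ordinal_median` stands in for whatever ordinal option quantile might eventually get, and here it simply takes the lower of the two middle elements:

```julia
using Statistics

# Hypothetical ordered type without arithmetic.
struct Grade
    rank::Int
end
Base.isless(a::Grade, b::Grade) = isless(a.rank, b.rank)

# Hypothetical ordinal fallback: the lower of the two middle elements for even length.
ordinal_median(col) = sort(col)[cld(length(col), 2)]

# Try the usual median first, then the ordinal version; return nothing if both fail.
function describe_median(col)
    for f in (median, ordinal_median)
        try
            return f(col)
        catch
            # fall through to the next candidate
        end
    end
    return nothing
end

describe_median([1, 2, 3, 4])                              # 2.5
describe_median([Grade(3), Grade(1), Grade(4), Grade(2)])  # Grade(2)
```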

But then add another method for strings… since the user probably isn’t interested in the minimum and maximum string.

pdeffebach · June 1, 2018, 3:02pm

https://github.com/JuliaLang/julia/issues/27367

Opened up an issue here! I think it makes sense for it to live in Base because the current quantile function is pretty complicated and an ordinal version would only change the very last step. But I’m sure the developers hear that a lot!
