VectorizedStatistics.jl

Hi,

I came across VectorizedStatistics.jl but can’t see it mentioned in the package announcements (although it’s available in the General registry). For some use cases it appears to be an enormous leap in performance for percentiles, so I am eager to make use of it.

Sorry to @ you, @brenhinkeller, but is it ready for general usage?

Further (apologies in advance for the lack of an MWE, as I’m on my phone): in the few examples I ran, I saw the “classic” performance degradation (10x+ in some cases) of the quickselect algorithm (which is used for percentiles) when the target vector contains many duplicates. Is there a plan to mitigate this, or have you observed it yourself?

Regards,


Been using it and teaching with it in a Scientific Computing course.
Congratulations on the package!


Hi @djholiver, good question. I wasn’t very active on Discourse when we registered it, so I never made an announcement, but I hope VectorizedStatistics.jl (and the NaN-ignoring equivalents in NaNStatistics.jl) are generally usable, albeit with a particular set of tradeoffs: compilation time on first use may be significant, it has more dependencies than some other ways of implementing these statistics, and both packages are relatively new, so we may not have found all the bugs yet. If I get tenure they’ll at least be maintained for a good long while though :).

As you noted, the sorting implementation that underlies vmedian/vpercentile/vquantile is relatively naive and, while fast for some cases, may not be for many others. More generally, a major improvement would be a more explicitly SIMD’d sorting algorithm altogether (a relatively major undertaking which I haven’t had time for). PRs would be very welcome on this front from anyone interested.
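For readers who want to see the duplicate-heavy slowdown in isolation, here is a rough, language-neutral sketch in Python. It is not the code in VectorizedStatistics.jl and makes no claim about how that package partitions. It assumes a textbook quickselect with a two-way (Lomuto) partition and compares it against a three-way partition, one common mitigation. The function names and the `steps` counter are made up for illustration.

```python
import random

def quickselect_two_way(values, k):
    """k-th smallest (0-based) via a two-way Lomuto partition.

    Returns (value, steps), where steps counts elements inspected
    while partitioning. Illustrative only.
    """
    a = list(values)
    lo, hi = 0, len(a) - 1
    steps = 0
    while lo < hi:
        # Random pivot, moved to the end of the active range.
        p = random.randint(lo, hi)
        a[p], a[hi] = a[hi], a[p]
        pivot = a[hi]
        store = lo
        for j in range(lo, hi):
            steps += 1
            if a[j] < pivot:
                a[store], a[j] = a[j], a[store]
                store += 1
        a[store], a[hi] = a[hi], a[store]  # pivot lands at its final index
        if k == store:
            break
        elif k < store:
            hi = store - 1
        else:
            lo = store + 1
    return a[k], steps

def quickselect_three_way(values, k):
    """Same contract, but elements equal to the pivot are grouped together."""
    a = list(values)
    lo, hi = 0, len(a) - 1
    steps = 0
    while lo < hi:
        pivot = a[random.randint(lo, hi)]
        lt, i, gt = lo, lo, hi
        while i <= gt:
            steps += 1
            if a[i] < pivot:
                a[lt], a[i] = a[i], a[lt]
                lt += 1
                i += 1
            elif a[i] > pivot:
                a[i], a[gt] = a[gt], a[i]
                gt -= 1
            else:
                i += 1
        # Now a[lo:lt] < pivot, a[lt:gt+1] == pivot, a[gt+1:hi+1] > pivot.
        if k < lt:
            hi = lt - 1
        elif k > gt:
            lo = gt + 1
        else:
            break  # k falls inside the block of pivot-equal elements
    return a[k], steps

n = 2000
k = n // 2  # roughly the median
distinct = random.sample(range(10 * n), n)
all_equal = [7] * n  # extreme case of "many duplicates"

for name, data in [("distinct", distinct), ("all equal", all_equal)]:
    v2, s2 = quickselect_two_way(data, k)
    v3, s3 = quickselect_three_way(data, k)
    assert v2 == v3 == sorted(data)[k]
    print(f"{name:>9}: two-way {s2} steps, three-way {s3} steps")
```

On the all-equal input the two-way version only shrinks the active range by one element per pass, so its step count grows quadratically, whereas the three-way version finishes in a single pass. Whether a similar change would help the package’s own partial sort is something only profiling the actual implementation can tell.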


Oh awesome, thanks!