What functions/packages should I use to sort and "group by" as fast as possible...?


#1

I’m looking for fast functions to group by or sort large dataframes with several columns of strings (with few different values) or numbers.

There are many packages and options: Base, ShortStrings, SortingLab, FastGroupBy, SortingAlgorithms…

I have only been able to install and use ShortStrings and SortingAlgorithms; the others produce errors.

Some packages say their functionality will be merged into Base Julia or DataFrames.

What do you suggest using?
Any special function from those packages? Or should I just stick with the defaults in DataFrames, or use alternatives such as IndexedTables or CategoricalArrays?

There are other threads about this:



but I don’t want to be told again that I’m posting in old threads, even if they cover the same topic.


#3

Most of them are my posts. I am learning how to make the package work in v1, but my package only handles single-column sorts well; multi-column sorts would need some work. Basically, to make it run fast for strings you need to radix sort the strings, and the sort should also return the sortperm, which can then be used to reorder the other columns. This is the fastest way I have found.
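To make the idea above concrete, here is a minimal sketch of a byte-wise LSD radix sort over strings that returns a permutation, which is then used to reorder a second column. It uses only Base Julia. The names `bucket`, `radix_sortperm`, `group` and `value` are made up for illustration, and this is not the implementation used in SortingLab.jl or any other package mentioned here; it only shows the shape of the technique.

```julia
# Bucket for byte position `pos`: 0 if the string is shorter than `pos`,
# otherwise the byte value + 1 (so buckets range over 0..256).
bucket(s::String, pos::Int) = pos <= ncodeunits(s) ? Int(codeunit(s, pos)) + 1 : 0

# Hypothetical helper: stable LSD radix sort on the bytes of each string.
# Returns the permutation that sorts `strs`, like `sortperm` does.
function radix_sortperm(strs::Vector{String})
    n = length(strs)
    n == 0 && return Int[]
    perm = collect(1:n)
    tmp = similar(perm)
    maxlen = maximum(ncodeunits, strs)
    # Process the least significant (last) byte first.
    for pos in maxlen:-1:1
        counts = zeros(Int, 258)
        # Histogram of buckets, shifted by one slot for the prefix sum.
        for i in perm
            counts[bucket(strs[i], pos) + 2] += 1
        end
        # After this loop, counts[b + 1] is the number of keys with bucket < b.
        for k in 2:258
            counts[k] += counts[k - 1]
        end
        # Stable scatter into `tmp`.
        for i in perm
            b = bucket(strs[i], pos) + 1
            counts[b] += 1
            tmp[counts[b]] = i
        end
        perm, tmp = tmp, perm
    end
    return perm
end

group = ["pear", "apple", "fig", "apple"]   # key column (strings)
value = [10, 20, 30, 40]                    # another column of the same table

p = radix_sortperm(group)
sorted_group = group[p]
sorted_value = value[p]   # the same permutation reorders every other column
```

The point of returning the permutation rather than the sorted strings is that the expensive string comparison work is done once, and every other column is reordered with a cheap indexing operation.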


#4

But do you suggest keeping the data in data frames and using some package to perform the radix sort? Which package (ShortStrings, SortingLab, FastGroupBy, SortingAlgorithms, …)?
Or should we use another framework such as IndexedTables, CategoricalArrays, StaticArrays… instead?


#5

The third post in your original post remains the most up-to-date advice. I think you should just use DataFramesMeta.jl or the functions in DataFrames.jl directly. They are not fast yet, but they remain the best hope because they might get updated with faster algorithms. OK, I will devote the next week to getting SortingLab.jl and FastGroupBy.jl ready for Julia v1.
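For reference, using the functions in DataFrames.jl directly looks roughly like the sketch below. It assumes a recent DataFrames.jl release (the `groupby`/`combine` API of the 1.x series, which is newer than this thread), and the column names `group` and `value` are made up for illustration.

```julia
using DataFrames

df = DataFrame(group = ["b", "a", "c", "a"],
               value = [10, 20, 30, 40])

# Multi-column sort: by `group` first, then by `value` within each group.
sorted = sort(df, [:group, :value])

# Group by `group`, then compute a per-group sum and row count.
totals = combine(groupby(df, :group),
                 :value => sum => :total,
                 nrow => :n)
```

The order of the groups in `totals` is not something this sketch relies on; sort the result explicitly if a particular order is needed.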


#6

While looking at different benchmarks I’ve seen other promising solutions such as Dask, SciDB and MapD.