What functions/packages should I use to sort and "group by" as fast as possible...?

Juan · December 15, 2018, 1:02pm

I’m looking for fast functions to group by or sort large DataFrames with several columns of strings (with few distinct values) or numbers.

There are many packages and options: Base, ShortStrings, SortingLab, FastGroupBy, SortingAlgorithms…

I have only been able to install and use ShortStrings and SortingAlgorithms; the others produce errors.

Some packages say their functionality will be merged into Base Julia or DataFrames.

What do you suggest using?
Any special function from those packages? Or should I just stick with the defaults in DataFrames, or with alternatives such as IndexedTables or CategoricalArrays?

There are other threads about this:

but I don’t want to be told again that I’m posting in old threads, even if they cover the same topic.

xiaodai · December 15, 2018, 10:33pm

Most of them are my posts. I am learning how to make the package work in v1, but my package only handles single-column sorts well; multi-column sorts would need some work. Basically, to make it run fast for strings you need to radix sort the strings, and the sort should also return the sortperm, which can then be used to sort the other columns. This is the fastest way I have found.
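A minimal sketch of that idea, written in Python rather than Julia so it stays self-contained, and not taken from SortingLab.jl or any other package mentioned here. The function name `radix_sortperm` and the sample columns are hypothetical. It runs an LSD radix sort over the UTF-8 bytes of the string column and returns the sorting permutation, which is then reused to reorder a second column:

```python
def radix_sortperm(strings):
    """Return the permutation that sorts `strings` bytewise (LSD radix sort)."""
    keys = [s.encode("utf-8") for s in strings]
    width = max((len(k) for k in keys), default=0)
    perm = list(range(len(keys)))
    # Walk byte positions from last to first; each pass is a stable bucket sort.
    for pos in range(width - 1, -1, -1):
        # Bucket 0 holds strings shorter than `pos`, so prefixes sort first.
        buckets = [[] for _ in range(257)]
        for i in perm:
            k = keys[i]
            b = k[pos] + 1 if pos < len(k) else 0
            buckets[b].append(i)
        perm = [i for bucket in buckets for i in bucket]
    return perm


# Hypothetical two-column table: sort by `region`, carry `sales` along.
region = ["west", "east", "north", "east", "west"]
sales = [10, 20, 30, 40, 50]

perm = radix_sortperm(region)
sorted_region = [region[i] for i in perm]
sorted_sales = [sales[i] for i in perm]
```

Because every pass is stable, rows with equal keys keep their original relative order, so after sorting each group sits in one contiguous run, which is what a group-by needs.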

Juan · December 15, 2018, 11:32pm

But do you suggest keeping the data in DataFrames and using some package to perform the radix sort? Which package (ShortStrings, SortingLab, FastGroupBy, SortingAlgorithms, …)?
Or should we use another framework such as IndexedTables, CategoricalArrays, StaticArrays… instead?

xiaodai · December 15, 2018, 11:47pm

The third link in your original post remains the most up-to-date advice. I think you should just use DataFramesMeta.jl or the functions in DataFrames.jl directly. They are not fast yet, but they remain the best hope because they might get updated with faster algorithms. OK, I will devote the next week to getting SortingLab.jl and FastGroupBy.jl ready for Julia v1.

Juan · December 16, 2018, 1:05am

While looking at different benchmarks I’ve seen other promising solutions such as Dask, SciDB and MapD.
