Groupby for regular array

tk3369 · March 22, 2018, 5:26am

Hi there.

Does anyone know of a function or package that does what the DataFrame groupby function does, but on a regular array? e.g.

julia> A = [1,2,3,4,1,2,4,4]
8-element Array{Int64,1}:
 1
 2
 3
 4
 1
 2
 4
 4

Then, sortperm gets me halfway:

julia> sortperm(A)
8-element Array{Int64,1}:
 1
 5
 2
 6
 3
 4
 7
 8

…as I want to get something like:

[1,5], [2,6], [3], [4,7,8]
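
One way to get from the sortperm output to those groups is to walk the permutation and start a new group whenever the value changes. This is only a minimal sketch: the name groupby_sortperm is made up for illustration, and it relies on sortperm returning the indices of equal elements in ascending order.

```julia
# Hypothetical helper (not from any package): group the indices of `A`
# by value, using the permutation returned by `sortperm`.
function groupby_sortperm(A::AbstractVector)
    p = sortperm(A)                     # indices that sort A; ties keep index order
    groups = Vector{Vector{Int}}()
    for i in p
        # Start a new group when the value differs from the current group's value.
        if isempty(groups) || A[i] != A[groups[end][1]]
            push!(groups, [i])
        else
            push!(groups[end], i)
        end
    end
    return groups
end

A = [1, 2, 3, 4, 1, 2, 4, 4]
groupby_sortperm(A)   # [[1, 5], [2, 6], [3], [4, 7, 8]]
```

The sort makes this O(n log n), and the groups come out ordered by value.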

tk3369 · March 22, 2018, 6:04am

Never mind. There is already a good post.

davidanthoff · March 22, 2018, 5:02pm

Query.jl works on regular arrays, so you can just use its @groupby command.

tk3369 · March 23, 2018, 4:36am

Interesting. I’ve tried it, but I’m unsure how to get the indices rather than the values.

julia> A = [1,2,3,4,1,2,4,4]
8-element Array{Int64,1}:
 1
 2
 3
 4
 1
 2
 4
 4

julia> A |> @groupby(_)
?-element query result
 [1, 1]
 [2, 2]
 [3]
 [4, 4, 4]

tk3369 · March 23, 2018, 4:52am

I got it!

julia> 1:length(A) |> @groupby(A[_]) |> collect
4-element Array{QueryOperators.Grouping{Any,Int64},1}:
 [1, 5]   
 [2, 6]   
 [3]      
 [4, 7, 8]

tk3369 · March 23, 2018, 7:44am

I was curious about performance. Since it’s not a tough problem, I’ve created a custom groupby function and compared it with Query.jl. Just sharing results:

https://gist.github.com/tk3369/f979a4292fd696bee753f37cae93b45c
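
For readers who do not want to open the gist, a custom groupby of this kind can be sketched with a Dict that maps each value to the indices where it occurs. This is not the code from the gist, just one plausible single-pass approach; the name groupby_dict is made up for illustration.

```julia
# Hypothetical helper (not the gist's implementation): map each distinct
# value of `A` to the vector of indices at which it occurs.
function groupby_dict(A::AbstractVector)
    groups = Dict{eltype(A),Vector{Int}}()
    for i in eachindex(A)
        # get! returns the existing index vector for A[i], or inserts a new empty one.
        push!(get!(() -> Int[], groups, A[i]), i)
    end
    return groups
end

A = [1, 2, 3, 4, 1, 2, 4, 4]
g = groupby_dict(A)
g[4]   # [4, 7, 8]
```

Because it only needs hashing and equality, the same function also works for strings or mixed-type arrays. The Dict has no defined iteration order, so sort the keys if the order matters.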

oheil · March 23, 2018, 12:41pm

The function sampleRanks from the package NormalizeQuantiles.jl may also be a solution. Just for your information.

oheil · March 23, 2018, 12:41pm

julia> Pkg.add("NormalizeQuantiles")
julia> using NormalizeQuantiles
julia> A = [1,2,3,4,1,2,4,4];
julia> (r,m) = sampleRanks(A,resultMatrix=true);
julia> m
Dict{Int64,Array{Int64,N} where N} with 4 entries:
  4 => [4, 7, 8]
  2 => [2, 6]
  3 => [3]
  1 => [1, 5]

oheil · March 23, 2018, 12:47pm

julia> @time 1:length(A) |> @groupby(A[_]) |> collect
0.027596 seconds (9.28 k allocations: 522.957 KiB)

julia> @time A |> @groupby(_)
0.004368 seconds (1.95 k allocations: 118.646 KiB)

julia> @time (r,m) = sampleRanks(A,resultMatrix=true)
0.000128 seconds (234 allocations: 11.938 KiB)

tk3369 · March 25, 2018, 7:19am

Need to use BenchmarkTools for proper benchmarking.
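
The reason is that a single @time call on a freshly defined function also measures compilation. A rough Base-only illustration of the effect is sketched below; index_groups and best_time are made-up names, and BenchmarkTools' @btime handles warm-up and repetition far more carefully than this does.

```julia
# Hypothetical function to time: for each distinct value, find all of its indices.
index_groups(A) = [findall(isequal(v), A) for v in unique(A)]

# Hypothetical helper: best wall-clock time of f(x) over several runs.
function best_time(f, x; runs = 1000)
    best = Inf
    for _ in 1:runs
        t = @elapsed f(x)
        best = min(best, t)
    end
    return best
end

A = [1, 2, 3, 4, 1, 2, 4, 4]

t_first = @elapsed index_groups(A)   # first call: includes compilation
t_warm  = best_time(index_groups, A) # later calls: compiled code only
println("first call: ", t_first, " s, best warm call: ", t_warm, " s")
```

The first number is typically much larger than the second, which is why one-off @time results on tiny inputs say little about the actual algorithms.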

I tried NormalizeQuantiles. It returns a Dict, which is nice, but it’s much slower than the others and it does not scale well to larger arrays. See the updated gist for benchmark details.

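| Method             | Function | 8 elements | 1,000 elements | 10,000 elements |
|--------------------|----------|------------|----------------|-----------------|
| Custom function    | woz      | 1 μs       | 20 μs          | 174 μs          |
| Query              | foo      | 4 μs       | 49 μs          | 425 μs          |
| NormalizeQuantiles | bar      | 47 μs      | 10,658 μs      | 133,827 μs      |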

oheil · March 25, 2018, 10:35am

Your custom function is quite fast. I think my main performance loss comes from having to expect dirty data, e.g.
A = [1,2,NaN,3,4,1,"5,0",5.0,2,4,4]
where anything that is not of type “a number in general” is treated as NA (not available, missing, …, like NA in R).
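
To make the dirty-data case concrete, here is a minimal sketch of grouping indices while setting aside the entries that are not usable numbers. It is not how NormalizeQuantiles.jl does it; the predicate is_available and the function groupby_clean are made-up names, and treating NaN and non-numbers as NA is an assumption based on the description above.

```julia
# Hypothetical predicate: an entry is "available" only if it is a real
# number and not NaN (an assumption, not NormalizeQuantiles.jl's actual rule).
is_available(x) = x isa Real && !isnan(x)

# Hypothetical helper: group indices of available values, collect NA indices separately.
function groupby_clean(A::AbstractVector)
    groups = Dict{Float64,Vector{Int}}()
    na = Int[]
    for i in eachindex(A)
        x = A[i]
        if is_available(x)
            # Convert to Float64 so that 5 and 5.0 share one key type.
            push!(get!(() -> Int[], groups, Float64(x)), i)
        else
            push!(na, i)
        end
    end
    return groups, na
end

A = [1, 2, NaN, 3, 4, 1, "5,0", 5.0, 2, 4, 4]
groups, na = groupby_clean(A)
groups[4.0]   # [5, 10, 11]
na            # [3, 7]
```

The extra type check on every element is part of what such a function pays for compared with one that assumes clean input.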

However, while comparing with Query.jl and your function I found a bug in my code (if the last value in the array is NA or all values are NA), which is now resolved.

The next step is to analyse your code in depth, so maybe I can improve my code.

@groupby seems to be the most versatile, as it also groups e.g. strings: A = ["a","a","b","a","b","c"] or any other mix of types.
