Groupby for regular array

Hi there.

Does anyone know of a function or package that does what the DataFrame groupby function does, but on a regular array? For example:

julia> A = [1,2,3,4,1,2,4,4]
8-element Array{Int64,1}:
 1
 2
 3
 4
 1
 2
 4
 4

sortperm gets me halfway:

julia> sortperm(A)
8-element Array{Int64,1}:
 1
 5
 2
 6
 3
 4
 7
 8

…but what I want is something like:

[1,5], [2,6], [3], [4,7,8]
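
One way to get from the sortperm output to those index groups is to walk the permutation and start a new group whenever the value changes. This is only a minimal sketch for a one-dimensional array; the name group_indices_sorted is made up for illustration and does not come from any package.

```julia
# Group the indices of `A` by value, walking the sorted permutation.
function group_indices_sorted(A::AbstractVector)
    p = sortperm(A)                  # indices of A in sorted order; ties keep ascending index order
    groups = Vector{Vector{Int}}()
    for (k, i) in enumerate(p)
        if k == 1 || !isequal(A[i], A[p[k-1]])
            push!(groups, [i])       # value differs from the previous one: start a new group
        else
            push!(groups[end], i)    # same value as the previous one: extend the current group
        end
    end
    return groups
end

A = [1,2,3,4,1,2,4,4]
group_indices_sorted(A)
```

The groups come back ordered by value, and because sortperm keeps equal elements in ascending index order, the indices inside each group are ascending too.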

Never mind. There is already a good post.

Query.jl works on regular arrays, so you can just use its @groupby command.


Interesting. I've tried it, but I'm unsure how to get the indices rather than the values.

julia> A = [1,2,3,4,1,2,4,4]
8-element Array{Int64,1}:
 1
 2
 3
 4
 1
 2
 4
 4

julia> A |> @groupby(_)
?-element query result
 [1, 1]
 [2, 2]
 [3]
 [4, 4, 4]

I got it!

julia> 1:length(A) |> @groupby(A[_]) |> collect
4-element Array{QueryOperators.Grouping{Any,Int64},1}:
 [1, 5]   
 [2, 6]   
 [3]      
 [4, 7, 8]

I was curious about performance. Since it’s not a tough problem, I’ve created a custom groupby function and compared it with Query.jl. Just sharing results:

https://gist.github.com/tk3369/f979a4292fd696bee753f37cae93b45c
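
For readers who don't want to open the gist: the sketch below is not the gist's code, just one plausible shape for such a custom function, assuming a Dict from each distinct value to the indices where it occurs. The name group_indices is hypothetical.

```julia
# Map each distinct value of `A` to the vector of indices where it occurs.
function group_indices(A::AbstractVector)
    groups = Dict{eltype(A), Vector{Int}}()
    for (i, x) in pairs(A)
        # get! returns the existing index vector for x, or inserts a new empty one
        push!(get!(() -> Int[], groups, x), i)
    end
    return groups
end

A = [1,2,3,4,1,2,4,4]
g = group_indices(A)
g[4]   # indices at which the value 4 occurs
```

This makes a single pass over the array with no sorting, and it works for any element type that can be hashed, strings included. The Dict does not preserve any particular key order.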

The function sampleRanks from the NormalizeQuantiles.jl package may also be a solution. Just for your information:

julia> Pkg.add("NormalizeQuantiles")
julia> using NormalizeQuantiles
julia> A = [1,2,3,4,1,2,4,4];
julia> (r,m) = sampleRanks(A,resultMatrix=true);
julia> m
Dict{Int64,Array{Int64,N} where N} with 4 entries:
  4 => [4, 7, 8]
  2 => [2, 6]
  3 => [3]
  1 => [1, 5]

julia> @time 1:length(A) |> @groupby(A[_]) |> collect
0.027596 seconds (9.28 k allocations: 522.957 KiB)

julia> @time A |> @groupby(_)
0.004368 seconds (1.95 k allocations: 118.646 KiB)

julia> @time (r,m) = sampleRanks(A,resultMatrix=true)
0.000128 seconds (234 allocations: 11.938 KiB)

You need to use BenchmarkTools for proper benchmarking.
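
A single @time call can include compilation time and is noisy at the microsecond scale, which is why the numbers above are hard to compare. A rough sketch of the BenchmarkTools approach follows; index_groups is a hypothetical stand-in for whichever grouping function is being measured, and BenchmarkTools is a third-party package that has to be installed first.

```julia
using BenchmarkTools   # third-party package; install it with Pkg.add("BenchmarkTools")

A = [1,2,3,4,1,2,4,4]

# Hypothetical stand-in for the function under test (simple, not optimised).
index_groups(v) = [findall(==(x), v) for x in unique(v)]

# `$A` interpolates the global variable so that its lookup is not measured.
# @btime evaluates the call many times and prints the minimum time and allocations.
@btime index_groups($A)
```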

I tried NormalizeQuantiles. It returns a Dict, which is nice, but it's much slower than the others and does not scale well to larger arrays. See the updated gist for benchmark details.

Method              Function  8 elements  1,000 elements  10,000 elements
Custom function     woz       1 μs        20 μs           174 μs
Query               foo       4 μs        49 μs           425 μs
NormalizeQuantiles  bar       47 μs       10,658 μs       133,827 μs

Your custom function is quite fast. I think the main performance loss in mine comes from having to expect dirty data, e.g.
A = [1,2,NaN,3,4,1,"5,0",5.0,2,4,4]
where anything that is not a number in the general sense is treated as NA (not available, missing, …, like NA in R).
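
To make the dirty-data case concrete, here is a rough sketch of grouping that sets NA entries aside. It is not the NormalizeQuantiles implementation; is_na and group_indices_skipna are hypothetical names, and treating NaN and every non-number as NA is an assumption based on the description above.

```julia
# Assumption: an entry is NA if it is not a number at all, or if it is a floating-point NaN.
is_na(x) = !(x isa Number) || (x isa AbstractFloat && isnan(x))

# Group the indices of the valid entries by value; collect the indices of NA entries separately.
function group_indices_skipna(A::AbstractVector)
    groups = Dict{Any, Vector{Int}}()
    na = Int[]
    for (i, x) in pairs(A)
        if is_na(x)
            push!(na, i)
        else
            push!(get!(() -> Int[], groups, x), i)
        end
    end
    return groups, na
end

A = Any[1, 2, NaN, 3, 4, 1, "5,0", 5.0, 2, 4, 4]
groups, na = group_indices_skipna(A)
```

The per-element type check and the untyped Dict{Any, ...} are the kind of overhead a function pays when it cannot assume a clean, homogeneous array.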

However, while comparing my code with Query.jl and your function, I found a bug in it (triggered when the last value in the array is NA or when all values are NA), which is now resolved.

The next step is to analyse your code in depth so that I can perhaps improve mine.

@groupby seems to be the most versatile, as it also groups e.g. strings, A = ["a","a","b","a","b","c"], or any other mix of types.
