Allocation - groupby

I’m attempting to create a function groupby(f, itr) that returns a dictionary whose keys are the results of applying f to the elements of itr, and the values are some collection of the elements themselves. Similar to Mathematica’s GroupBy. Example, groupby(iseven, 1:5) --> Dict(true=>(2, 4), false =>(1, 3, 5)). The type of the values doesn’t need to be a tuple if there’s a better alternative.

This is partly because I need it, and partly because I’m learning Julia. So, both answers that explain the performance considerations, and answers that tell me “hey, there’s this built-in, you don’t need to do anything” are appreciated.

In my use case, the collection is a very long collection of small items, which only appear uniquely a low number of times. Overall, it takes a good amount of memory. So, the resulting dictionary would be long, and the values are each short and don’t take more than a few bytes, and ideally I wouldn’t be using up 10 times the space for intermediate calculations.

I’m thinking of creating a dictionary d, looping over itr, if the key doesn’t exist create it, and then append the value to the collection in d[key].

Now, what’s the best type for the values of the dict?

  • Vector seems like a bad idea, because it seems to have a lot of memory overhead for small lists.
  • Tuple or some other immutable thing like StaticArrays.SVector only seem to take up the space needed. But, is there any con if they happen to get long-ish? Furthermore, I think I would have to reallocate every time I push, because even though they are immutable, they are inside a mutable(?); and they need to grow. Also, does the garbage collector have to keep track of every tuple that I create and destroy, and wouldn’t that be a huge performance overhead? Also, the fact that they change type every time I push a new value, does it mean there’s extra overhead on some dict header keeping track of all the different types?
  • Tuple but with predefined size. Say I could guarantee that there will be no more than n elements in each bin, and don’t mind raising an error eventually. I could find a value that represents “nothing here yet”, or make them tuples of Union{T, Void} or something (overhead), and only in the end trim those away. This seems ugly, and it also wouldn’t avoid reallocating since they are immutables that can’t be changed in place (?reinterpret?/?compiler optimizations?). In that case, StaticArrays.MVector is mutable but it doesn’t seem to have that much overhead. I wonder if it makes a difference for the garbage collector whether they are mutables or immutables given that they live in the heap anyway.

Thanks!

Vector is going to be by far the easiest approach of the ones you’ve listed (and, I suspect, the best). The problem with any Tuple type is that the length of the tuple is part of its type, so if you have varying-length tuples, then your dictionary will have to have a non-concrete element type, which will probably undo any performance or memory savings you hope to gain. Your idea of a fixed-sized MVector of Union{T, Void} is possible (and may perform better on v0.7 than v0.6), but will be a) complicated b) hard to explain and c) probably not even more efficient, since you’re paying the memory overhead to store the maximum number of elements of each tuple rather than just the couple of extra fields that a Vector needs.

I would recommend:

  • Just use a Dict{OutputType, Vector{InputType}} and use regular push!() operations (exactly as you suggested)
  • If you find that this isn’t fast enough, post a concrete reproducible example here. There are a lot of people on this forum that love speeding up Julia code, but they generally can’t help you unless you can provide a concrete example of something that isn’t fast enough.

Here goes an attempt at a Vector version

function groupby(f, itr::Vector{T}, ::Type{K}) where {T,K}
    d = Dict{K, Vector{T}}()
    
    for x in itr
        key = f(x)
        if !haskey(d, key)
            d[key] = [x]
        else
            push!(d[key], x)
        end
    end
    d
end

So,

const t = round.(UInt32, [ 3e7*rand() for _ in 1:1e8 ]);

t has 100M x UInt32, 4 bytes each, taking up 400MB.

@time const g = groupby(identity, t, UInt32);

takes up 56.7s, ~6GB allocations, 50% of the time collecting garbage. Running gc() takes an extra 9s. For (unfair) comparison, unique(t) takes up 8 seconds and allocates 768MB, 0.76% time collecting garbage.

g is a dict of ~30M keys of 4B each (120MB) + 100M values of 4B each (400MB) spread inside 30M arrays (8B each for the pointer). So I would expect g to be about 760MB. In fact, Base.summarysize(g) gives 1.2GB, so I’m missing something big… Apprently, it seems the dictionary fields g.keys and g.vals are of length 67M even though g has 30M elements. But even then…

I think arrays have some header, extra allocated storage and stuff which this doesn’t measure. From a before-after look at the resident memory on Linux after running gc() (I’m not sure if this makes perfect sense), g takes up ~4.6GB.

In fact, converting it after the fact to tuples, emptying g and collecting garbage,

const gt = Dict{UInt32, Any}( k=> (v...) for (k,v) in g)
g = typeof(g)()
gc()

turns down res memory usage by 1.8GB. Base.summarysize(gt) gives the same 1.2G. Which makes sense, since the “pointer” for each array was replaced by a pointer because now the dict values are of type Any. But, the magic mysterious extra memory overhead went down a lot. Running gc() now takes 2 seconds, while when the Vector version was in memory, it took 8.5s.

That’s really interesting–thanks for posting the full example. One thing to consider about the Tuple version is that, even if it reduces your mystery memory overhead, it’s going to create performance problems down the line when you actually try to use the results of groupby, since each element you pull out of the dict will not have an inferable type.

1 Like

Yep, that’s true.

You could also check out the SplitApplyCombine package for a possible implementation (with Vectors).

1 Like