I’m attempting to create a function groupby(f, itr)
that returns a dictionary whose keys are the results of applying f
to the elements of itr
, and the values are some collection of the elements themselves. Similar to Mathematica’s GroupBy
. Example, groupby(iseven, 1:5) --> Dict(true=>(2, 4), false =>(1, 3, 5))
. The type of the values doesn’t need to be a tuple if there’s a better alternative.
This is partly because I need it, and partly because I’m learning Julia. So, both answers that explain the performance considerations, and answers that tell me “hey, there’s this built-in, you don’t need to do anything” are appreciated.
In my use case, the collection is a very long collection of small items, which only appear uniquely a low number of times. Overall, it takes a good amount of memory. So, the resulting dictionary would be long, and the values are each short and don’t take more than a few bytes, and ideally I wouldn’t be using up 10 times the space for intermediate calculations.
I’m thinking of creating a dictionary d
, looping over itr
, if the key doesn’t exist create it, and then append the value to the collection in d[key]
.
Now, what’s the best type for the values of the dict?
-
Vector
seems like a bad idea, because it seems to have a lot of memory overhead for small lists. -
Tuple
or some other immutable thing likeStaticArrays.SVector
only seem to take up the space needed. But, is there any con if they happen to get long-ish? Furthermore, I think I would have to reallocate every time I push, because even though they are immutable, they are inside a mutable(?); and they need to grow. Also, does the garbage collector have to keep track of every tuple that I create and destroy, and wouldn’t that be a huge performance overhead? Also, the fact that they change type every time I push a new value, does it mean there’s extra overhead on some dict header keeping track of all the different types? -
Tuple
but with predefined size. Say I could guarantee that there will be no more thann
elements in each bin, and don’t mind raising an error eventually. I could find a value that represents “nothing here yet”, or make them tuples ofUnion{T, Void}
or something (overhead), and only in the end trim those away. This seems ugly, and it also wouldn’t avoid reallocating since they are immutables that can’t be changed in place (?reinterpret?/?compiler optimizations?). In that case,StaticArrays.MVector
is mutable but it doesn’t seem to have that much overhead. I wonder if it makes a difference for the garbage collector whether they are mutables or immutables given that they live in the heap anyway.
Thanks!