**Update 12th Nov 2017**

**2nd Update** Implemented a multi-threaded version that is almost 2x faster than data.table!

Implemented a new radixgroup method that is 18% faster than the previous method on 250-million-element arrays.

**Update 29th Oct 2017**

I managed to repurpose Radixsort to bring Julia's group-by performance on par with data.table, and it can even beat it in some cases!

**Original post**

I am preparing to release some new benchmarks of Julia vs R data.table group-by performance. I am now investigating a number of strategies to compute the mean of a value by group.

My MWE is in the gist below. I basically came up with 4 approaches; please do suggest improvements to the code, as I am new to Julia.

I am trying to find the mean value of v1 by distinct values of id4. Normally v1 and id4 would be part of a DataFrame or table of some sort, but for now I have included them as plain vectors.

**Update (2017-10-22)**

Thanks to @nalimilan's helpful tips and comments I managed to get a `meany_best` approach (= Approach 1 below).

# Approach 1 (`meany_best`) - zip with Dict

In this approach I use one Dict that maps each group to its accumulated value and count, and I use Base's internal `ht_keyindex` to access values in the Dict quickly. This approach is general and works for any data type, so it is superior to Approach 4. The only approach that still beats it is Approach 3, which requires a large amount of prior knowledge.

## timing ~10 seconds
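For reference, here is a minimal sketch of the zip-with-Dict idea (illustrative names, not the exact gist code; the real `meany_best` additionally uses `Base.ht_keyindex` for faster lookups, which this sketch omits): accumulate a running (sum, count) per key, then divide at the end.

```julia
# Minimal sketch of the zip-with-Dict accumulation:
# keep a running (sum, count) per key, then divide to get the means.
function meany_zip(id4::AbstractVector, v1::AbstractVector{<:Real})
    acc = Dict{eltype(id4), Tuple{Float64, Int}}()
    for (k, v) in zip(id4, v1)
        s, n = get(acc, k, (0.0, 0))
        acc[k] = (s + v, n + 1)
    end
    Dict(k => s / n for (k, (s, n)) in acc)
end

# hypothetical usage
id4 = rand(1:100, 1_000_000)
v1  = rand(1_000_000)
means = meany_zip(id4, v1)
```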

# Approach 2 (`meany2`) - enumerate with Dict

Similar to Approach 1, but instead of zipping the two vectors I simply enumerate over one of them and index into the other.

## timing ~60 seconds
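A corresponding sketch of the enumerate variant (again illustrative, not the gist code):

```julia
# Sketch of the enumerate-with-Dict variant: iterate id4 with its index
# and look up the matching v1 element by position.
function meany_enum(id4::AbstractVector, v1::AbstractVector{<:Real})
    acc = Dict{eltype(id4), Tuple{Float64, Int}}()
    for (i, k) in enumerate(id4)
        s, n = get(acc, k, (0.0, 0))
        acc[k] = (s + v1[i], n + 1)
    end
    Dict(k => s / n for (k, (s, n)) in acc)
end
```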

# Approach 3 (`meany3`) - what if I know the number of unique values in id4 beforehand?

I can just use two arrays, `res` and `wt`, enumerate through id4, keep adding to them, and then summarise to get the mean.

## timing ~0.865 seconds
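A sketch of this idea, assuming `id4` holds integer group ids in `1:ngroups` (that is the prior knowledge required):

```julia
# Sketch assuming id4 contains integer group ids in 1:ngroups.
function meany_known(id4::AbstractVector{<:Integer}, v1::AbstractVector{<:Real}, ngroups::Integer)
    res = zeros(Float64, ngroups)   # running sum per group
    wt  = zeros(Int, ngroups)       # count per group
    @inbounds for (i, k) in enumerate(id4)
        res[k] += v1[i]
        wt[k]  += 1
    end
    res ./ wt                       # mean per group
end
```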

# Approach 4 (`meany4`) - what if I work out the unique values in id4 first?

I use `countmap` to get the unique values in `id4` and then follow Approach 3.

## timing ~18 seconds
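A sketch of this, using `countmap` from StatsBase to discover the groups, giving each a dense index, and then accumulating as in Approach 3 (illustrative names, not the gist code):

```julia
using StatsBase  # for countmap

# Sketch: discover the groups first, give each a dense index,
# then accumulate sums and counts into plain arrays as in Approach 3.
function meany_cm(id4::AbstractVector, v1::AbstractVector{<:Real})
    groups = collect(keys(countmap(id4)))
    index  = Dict(k => i for (i, k) in enumerate(groups))
    res = zeros(Float64, length(groups))
    wt  = zeros(Int, length(groups))
    for (i, k) in enumerate(id4)
        j = index[k]
        res[j] += v1[i]
        wt[j]  += 1
    end
    Dict(groups[j] => res[j] / wt[j] for j in eachindex(groups))
end
```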

# some comments & questions

Clearly Approach 3 only works in limited situations, but Approach 4 is more generalisable. Am I right that Approaches 1 & 2 are slow because of the large amount of memory allocation? How can I reduce the number of allocations?

Are there even faster methods I have missed? Multithreading?
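On the multithreading question, one possible direction (a hedged sketch, not the implementation mentioned in the updates above) is to parallelise the Approach 3 accumulation: give each chunk of the data its own sum/count buffers so tasks never write to the same slot, then combine at the end. This assumes `id4` holds integers in `1:ngroups` and that Julia is started with multiple threads.

```julia
# Hedged sketch of a chunked, multi-threaded version of Approach 3.
# Each chunk accumulates into its own column, avoiding write contention.
function meany_threaded(id4::Vector{<:Integer}, v1::Vector{Float64}, ngroups::Integer)
    nchunks = Threads.nthreads()
    sums    = zeros(Float64, ngroups, nchunks)
    counts  = zeros(Int, ngroups, nchunks)
    n = length(id4)
    bounds = [div(n * c, nchunks) for c in 0:nchunks]  # chunk boundaries
    Threads.@threads for c in 1:nchunks
        @inbounds for i in (bounds[c] + 1):bounds[c + 1]
            k = id4[i]
            sums[k, c]   += v1[i]
            counts[k, c] += 1
        end
    end
    vec(sum(sums, dims = 2)) ./ vec(sum(counts, dims = 2))
end
```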