If I may give any advice, I would take a look at the implementation I did there, which is 60% slower than that of @ettersi. It can be improved as well, but fundamentally it uses much less lower level Julia syntax. I am saying that because while @ettersi’s implementation is great, I feel it is somewhat hard to understand and may be frustrating.
The general patterns in both cases relative to your original code is to use your own data structure and rely less on vectorized operations and more on the explicit writing of the logic of the algorithm.
@lmiq I think you are right. Maybe I should first start with yours, and later try to understand what @ettersi did there. By the way, In my original implementation, I supported a variable number of factor arguments in order to use loop fusion in the multiplication step (which is where most of the time is spent when operating on larger arrays). I’m gonna try to extend yours to do the same.
What I did there is a literal translation of the arrows of your figure there. I know nothing about the original problem, so maybe it is not general enough, perhaps even not completely correct in the general case. What I do is what one would do “by hand”: count the number of equal factors, build the output table, fill the the table running over the original data.
The mastery etersi has of those iterator functions is uncommon. I don’t think you need that to write efficient Julia code in general (they are not even exported from Base, as you can see). At the same time, my mastery of vectorized operations is minimal, so probably some parts of my code can be simplified and shortened a lot.