Naively speaking (and this is really just my mental model of how Julia operates, so it may be far from reality), when you go through the loop, the compiler cannot determine which function is going to be run next, so that has to be determined at run time (dynamic dispatch). That means that while the program is executing, for each new value of i it pauses and works out which function should be called now. It looks this up with the help of a big and complicated dictionary, and that takes a lot of time.
On the other hand, when you apply map, because of the way that function is implemented, the compiler turns the call into something like (f1(x), f2(x), f3(x), f4(x)), so it knows at compile time which function is going to be called when, and it does not need to do this expensive dynamic lookup.
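To make the difference concrete, here is a minimal sketch of the two patterns. The functions f1 to f4, the tuple fs, and the helpers apply_loop and apply_map are hypothetical names I made up for illustration; they are not from the original code. How much slower the loop is depends on the Julia version and the number of functions, since newer compilers can sometimes optimize small cases.

```julia
# Hypothetical functions; in Julia every function has its own distinct type.
f1(x) = x + 1
f2(x) = 2x
f3(x) = x^2
f4(x) = x - 3

# A tuple keeps the type of every element: Tuple{typeof(f1), typeof(f2), ...}
const fs = (f1, f2, f3, f4)

# Loop version: the type of fs[i] depends on the run-time value of i,
# so the call fs[i](x) generally has to be resolved at run time.
function apply_loop(fs, x)
    out = Vector{Float64}(undef, length(fs))
    for i in 1:length(fs)
        out[i] = fs[i](x)
    end
    return out
end

# map version: for a short tuple the call is effectively expanded to
# (f1(x), f2(x), f3(x), f4(x)), so every call target is known at compile time.
apply_map(fs, x) = map(f -> f(x), fs)
```

Comparing `@code_warntype apply_loop(fs, 2.0)` with `@code_warntype apply_map(fs, 2.0)` (from InteractiveUtils, loaded by default in the REPL) is one way to check what inference does in each case.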
Yep, @rdeits, that appears to be the best solution so far. Using an organizing type helped and took the whole calculation down to 172 ms, but it’s clunky. The tuple recursion is pretty simple, takes it down to 140 ms, and the new profile is looking great.
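For anyone finding this later, the tuple recursion looks roughly like the sketch below. This is a generic version of the pattern, not the exact code from my project; apply_all, g1, g2, and g3 are placeholder names.

```julia
# Placeholder functions standing in for the real ones.
g1(x) = x + 1
g2(x) = 2x
g3(x) = x^2

# Base case: an empty tuple of functions gives an empty tuple of results.
apply_all(::Tuple{}, x) = ()

# Recursive case: call the first function, then recurse on the rest.
# Base.tail(fs) drops the first element, so each step sees a shorter tuple
# with a known type and the compiler can resolve every call statically.
apply_all(fs::Tuple, x) = (first(fs)(x), apply_all(Base.tail(fs), x)...)

results = apply_all((g1, g2, g3), 3.0)
```

Because the recursion is driven by the tuple's type rather than by a run-time index, each function call is known at compile time, which is the same effect map gives on a short tuple.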