Alright, the example in that post is better on master (roughly 4x faster):
julia> @btime inversions1($p)
179.826 ns (21 allocations: 640 bytes)
45
compared to 1.0.1:
julia> @btime inversions1($p)
717.085 ns (111 allocations: 3.44 KiB)
45
So while there is still a slowdown compared to the hand-written loop, things are improving!
It might not be possible to get every high-level abstraction to exactly match the performance of a hand-written loop, but the goal is to make them close enough that it shouldn't be a big problem in practice.
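For anyone who wants to try this kind of comparison themselves, here is a minimal sketch. The definitions from the linked post are not repeated here, so `inversions_generic` and `inversions_loop` are hypothetical stand-ins, and the input `p` is an assumption too (a reversed `1:10` happens to have 45 inversions, but that may not be the `p` used above).

```julia
# Hypothetical stand-ins, NOT the definitions from the linked post.

# High-level version: count over a nested generator.
inversions_generic(p) =
    count(p[i] > p[j] for i in 1:length(p) for j in i+1:length(p))

# Hand-written version: explicit nested loop with an integer counter.
function inversions_loop(p)
    n = 0
    for i in 1:length(p), j in i+1:length(p)
        n += p[i] > p[j]   # Bool promotes to Int when added
    end
    return n
end

# Assumed input: a fully reversed vector, so every pair is an inversion.
p = collect(10:-1:1)
@assert inversions_generic(p) == inversions_loop(p) == 45
```

With BenchmarkTools loaded, `@btime inversions_generic($p)` and `@btime inversions_loop($p)` can then be compared on each Julia version; the numbers will depend on the machine and the build.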