I suspect it is a cache effect: in the individual benchmarks some of the data fits into a low-level CPU cache, while in the combined benchmark it only fits into a higher-level one. I can replicate the phenomenon on Julia 1.4, with multiple runs giving somewhat inconsistent timings.
In any case, making x and v 10x larger resolves the inconsistency for me (you may have to increase them further if you have a recent desktop CPU; mine is a puny laptop CPU with little cache), e.g.
julia> @btime toPolar!($x)
4.520 ms (0 allocations: 0 bytes)
julia> @btime toCartesian!($x)
1.284 ms (0 allocations: 0 bytes)
julia> @btime move!($x, $v, $T)
148.501 μs (0 allocations: 0 bytes)
julia> @btime outerFunction!($x, $v, $T)
5.503 ms (0 allocations: 0 bytes)
Which benchmark is relevant depends on your data size, I guess.
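A rough way to see how data size changes the picture is to time the same in-place kernel over arrays of increasing length and compare the cost per element. The sketch below is only an illustration: `scale!` and `ns_per_element` are hypothetical names standing in for the functions in this thread, and it uses `@elapsed` from Base rather than `@btime`, so the numbers will be noisier than BenchmarkTools results.

```julia
# Hypothetical stand-in for the kernels discussed here: one in-place pass over a vector.
function scale!(x::Vector{Float64}, a::Float64)
    @inbounds for i in eachindex(x)
        x[i] *= a
    end
    return x
end

# Estimate nanoseconds per element for vectors of length n.
# Short vectors are traversed several times per measurement so that the
# timed region is long enough for the timer; the minimum over reps is kept.
function ns_per_element(n::Int; reps::Int = 20)
    x = ones(Float64, n)
    passes = max(1, 10^6 ÷ n)
    scale!(x, 1.0000001)                # warm-up call, triggers compilation
    best = Inf
    for _ in 1:reps
        t = @elapsed for _ in 1:passes
            scale!(x, 1.0000001)
        end
        best = min(best, t)
    end
    return 1e9 * best / (n * passes)
end

for n in (10^3, 10^4, 10^5, 10^6, 10^7)
    kib = n * sizeof(Float64) / 1024    # working-set size of one vector
    println(rpad(n, 10), rpad(round(kib, digits = 1), 12), "KiB   ",
            round(ns_per_element(n), digits = 3), " ns/element")
end
```

If the cost per element jumps somewhere along that list, the jump probably marks where the vector stops fitting into one of the cache levels, and benchmarks taken on either side of it are not directly comparable.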
Also, cf