I get a difference of about 36%, not 3x:
julia> @btime plain!($U, $V, $W, $Jx, $Jy, $Jz, $NX, $NY, $NZ, $NG);
112.766 ms (0 allocations: 0 bytes)
julia> @btime no_tmp!($U, $V, $W, $Jx, $Jy, $Jz, $NX, $NY, $NZ, $NG);
153.984 ms (0 allocations: 0 bytes)
In the first version, you perform all of your “reads” of the array before you perform any “writes”. In the second version the reads and writes are interleaved. This will make a difference because the compiler doesn’t know that the Jx, Jy, Jz arrays are distinct from U, V, W.
For example, you read U[i, j, k] when computing upp, and again when computing ux1, uy1, and uz1. In plain!, the compiler probably factors out this common subexpression — it loads U[i,j,k] only once. However, in no_tmp!, because the reads and writes are interleaved, the compiler will have to load U[i,j,k] three times, since it may have been overwritten by the writes to the J arrays (which might be the same as U for all the compiler knows).
Interestingly, most of the difference goes away if I simply put @inbounds in front of the for loops (i.e. write @inbounds for).
Note also that the order of floating-point operations is different in the two cases, because:
-(U[i-1, j, k] + U[i, j, k]) / 2 * (U[i-1, j, k] + U[i, j, k]) / 2
is not the same thing as
upp * ux1 == (-(U[i-1, j, k] + U[i, j, k]) / 2) * ((U[i-1, j, k] + U[i, j, k]) / 2)
(Note parentheses.) Not only may this produce slightly different results due to differences in floating-point roundoff errors, but it in the no_tmp! case it is also likely to prevent the compiler from realizing that upp == -ux1 and pulling out the common subexpression.