Strange performance problem

I get a difference of about 36%, not 3x:

julia> @btime plain!($U, $V, $W, $Jx, $Jy, $Jz, $NX, $NY, $NZ, $NG);
  112.766 ms (0 allocations: 0 bytes)

julia> @btime no_tmp!($U, $V, $W, $Jx, $Jy, $Jz, $NX, $NY, $NZ, $NG);
  153.984 ms (0 allocations: 0 bytes)

In the first version, you perform all of your “reads” of the arrays before you perform any “writes”. In the second version the reads and writes are interleaved. This makes a difference because the compiler doesn’t know that the Jx, Jy, Jz arrays are distinct from U, V, W (for all it knows, they might alias).
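
The two functions themselves aren’t quoted above, but schematically the difference looks like this (a minimal, hypothetical 1-d sketch; the real kernels use 3-d stencils over U, V, W and Jx, Jy, Jz):

# plain!-style: every read of U happens before the write to J,
# so one load of each element can serve the whole expression.
function reads_then_writes!(J, U)
    for i in 2:length(U)-1
        s = U[i-1] + U[i] + U[i+1]  # all reads first
        J[i] = s                    # single write at the end
    end
    return J
end

# no_tmp!-style: writes to J are interleaved with reads of U.
# Since J might alias U, every read after a write must hit memory again.
function interleaved!(J, U)
    for i in 2:length(U)-1
        J[i] = U[i-1]
        J[i] += U[i]     # U[i] can't be assumed unchanged here
        J[i] += U[i+1]   # ...nor U[i+1] here
    end
    return J
end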

For example, you read U[i, j, k] when computing upp, and again when computing ux1, uy1, and uz1. In plain!, the compiler probably factors out this common subexpression: it loads U[i, j, k] only once. However, in no_tmp!, because the reads and writes are interleaved, the compiler will have to load U[i, j, k] three times, since it may have been overwritten by the writes to the J arrays (which might be the same as U for all the compiler knows).
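
One way to sidestep the reloads without restructuring the loop is to hoist the shared load into a local variable yourself; a local cannot be invalidated by a store to Jx, Jy, or Jz. A sketch of the idea, reusing the names above (the exact stencils are an assumption, since the original code isn’t quoted here):

# inside the i, j, k loop body of no_tmp!:
u  = U[i, j, k]         # load once into a local
um = U[i-1, j, k]
upp = -(um + u) / 2     # locals can't be clobbered by the later
ux1 =  (um + u) / 2     # writes to Jx/Jy/Jz, so no reload is needed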

Interestingly, most of the difference goes away if I simply put @inbounds in front of the for loops (i.e. write @inbounds for).
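
For reference, that change is just one word per loop nest; something like this, assuming a typical interior-point triple loop with NG ghost cells:

@inbounds for k in 1+NG:NZ-NG, j in 1+NG:NY-NG, i in 1+NG:NX-NG
    # ... loop body unchanged ...
end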

Note also that the order of floating-point operations is different in the two cases, because:

-(U[i-1, j, k] + U[i, j, k]) / 2 * (U[i-1, j, k] + U[i, j, k]) / 2

is not the same thing as

upp * ux1 == (-(U[i-1, j, k] + U[i, j, k]) / 2) * ((U[i-1, j, k] + U[i, j, k]) / 2)

(Note the parentheses.) Not only may this produce slightly different results due to differences in floating-point roundoff, but in the no_tmp! case it is also likely to prevent the compiler from realizing that upp == -ux1 and pulling out the common subexpression.
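
If that reuse matters, you can make the relationship explicit so the compiler doesn’t have to prove it, e.g. (a sketch with the same names):

# inside the loop body, instead of recomputing both expressions:
ux1 = (U[i-1, j, k] + U[i, j, k]) / 2   # compute the average once
upp = -ux1                              # the negation is now explicit,
                                        # so no CSE analysis is required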
