Strange performance problem

Hi all, I am new to this forum. Nice to meet you.
Today I encountered a strange performance problem.
I benchmarked the code below with BenchmarkTools.

using Random, BenchmarkTools

NG = 2
N = NX = NY = NZ = 256
U = rand!(zeros(N + 2NG, N + 2NG, N + 2NG));
V = rand!(zeros(N + 2NG, N + 2NG, N + 2NG));
W = rand!(zeros(N + 2NG, N + 2NG, N + 2NG));
Jx = zeros(N + 2NG, N + 2NG, N + 2NG);
Jy = zeros(N + 2NG, N + 2NG, N + 2NG);
Jz = zeros(N + 2NG, N + 2NG, N + 2NG);

function plain!(U, V, W, Jx, Jy, Jz, NX, NY, NZ, NG)
    for k in NG:NZ+1+NG, j in NG:NY+1+NG, i in NG:NX+1+NG
        upp = -(U[i-1, j, k] + U[i, j, k]) / 2
        vpp = -(V[i, j, k] + V[i+1, j, k]) / 2
        wpp = -(W[i, j, k] + W[i+1, j, k]) / 2

        ux1 = (U[i-1, j, k] + U[i, j, k]) / 2
        uy1 = (U[i, j, k] + U[i, j+1, k]) / 2
        uz1 = (U[i, j, k] + U[i, j, k+1]) / 2

        Jx[i, j, k] = upp * ux1
        Jy[i, j, k] = vpp * uy1
        Jz[i, j, k] = wpp * uz1
    end
end

@benchmark plain!($U, $V, $W, $Jx, $Jy, $Jz, $NX, $NY, $NZ, $NG)

function no_tmp!(U, V, W, Jx, Jy, Jz, NX, NY, NZ, NG)
    for k in NG:NZ+1+NG, j in NG:NY+1+NG, i in NG:NX+1+NG
        Jx[i, j, k] = -(U[i-1, j, k] + U[i, j, k]) / 2 * (U[i-1, j, k] + U[i, j, k]) / 2
        Jy[i, j, k] = -(V[i, j, k] + V[i+1, j, k]) / 2 * (U[i, j, k] + U[i, j+1, k]) / 2
        Jz[i, j, k] = -(W[i, j, k] + W[i+1, j, k]) / 2 * (U[i, j, k] + U[i, j, k+1]) / 2
    end
end

@benchmark no_tmp!($U, $V, $W, $Jx, $Jy, $Jz, $NX, $NY, $NZ, $NG)

As you can see, the only difference between the two functions is whether the intermediate values are assigned to temporary variables.
BenchmarkTools tells me the former is 3x faster than the latter.
Why?

Julia Version 1.10.3
Commit 0b4590a5507 (2024-04-30 10:59 UTC)
Build Info:
Official https://julialang.org/ release
Platform Info:
OS: Linux (x86_64-linux-gnu)
CPU: 24 × 13th Gen Intel(R) Core(TM) i7-13700K
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-15.0.7 (ORCJIT, goldmont)
Threads: 24 default, 0 interactive, 12 GC (on 24 virtual cores)
Environment:
JULIA_EDITOR = code

Can you give sample data that you used, so that other people can run this?

Thank you for replying. I added sample data.

I tried to run your code and I got roughly the same performance on 1.10.4; can you post the output of versioninfo()?

Thank you for replying. I added the version info. I ran it in both Jupyter and the REPL, but the same problem occurred.

I get a difference of about 36%, not 3x:

julia> @btime plain!($U, $V, $W, $Jx, $Jy, $Jz, $NX, $NY, $NZ, $NG);
  112.766 ms (0 allocations: 0 bytes)

julia> @btime no_tmp!($U, $V, $W, $Jx, $Jy, $Jz, $NX, $NY, $NZ, $NG);
  153.984 ms (0 allocations: 0 bytes)

In the first version, you perform all of your β€œreads” of the array before you perform any β€œwrites”. In the second version the reads and writes are interleaved. This will make a difference because the compiler doesn’t know that the Jx, Jy, Jz arrays are distinct from U, V, W.

For example, you read U[i, j, k] when computing upp, and again when computing ux1, uy1, and uz1. In plain!, the compiler probably factors out this common subexpression β€” it loads U[i,j,k] only once. However, in no_tmp!, because the reads and writes are interleaved, the compiler will have to load U[i,j,k] three times, since it may have been overwritten by the writes to the J arrays (which might be the same as U for all the compiler knows).
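The effect can be seen with a tiny sketch (my own example, not from the thread): here `J` and `U` are deliberately the same array, which is exactly the possibility the compiler must guard against.

```julia
# Minimal aliasing sketch: J and U are the same array, so a write
# through J invalidates a previously loaded value of U.
U = collect(1.0:4.0)
J = U                   # J aliases U (no copy is made)
x = U[2]                # load once: 2.0
J[2] = 100.0            # interleaved write through the alias
@assert x == 2.0        # the cached load keeps the old value
@assert U[2] == 100.0   # a fresh load must see the write
```

Because plain Arrays can always alias like this, the compiler must conservatively reload U[i, j, k] after each store to Jx, Jy, or Jz unless it can prove the arrays are distinct.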

Interestingly, most of the difference goes away if I simply put @inbounds in front of the for loops (i.e. write @inbounds for).
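For concreteness, applying it to the first kernel would look like the sketch below (the name plain_inbounds! is mine, not from the thread). @inbounds is only safe here because the loop ranges stay inside the padded arrays.

```julia
function plain_inbounds!(U, V, W, Jx, Jy, Jz, NX, NY, NZ, NG)
    # Same loop as plain!, but with bounds checks elided.
    @inbounds for k in NG:NZ+1+NG, j in NG:NY+1+NG, i in NG:NX+1+NG
        upp = -(U[i-1, j, k] + U[i, j, k]) / 2
        vpp = -(V[i, j, k] + V[i+1, j, k]) / 2
        wpp = -(W[i, j, k] + W[i+1, j, k]) / 2

        ux1 = (U[i-1, j, k] + U[i, j, k]) / 2
        uy1 = (U[i, j, k] + U[i, j+1, k]) / 2
        uz1 = (U[i, j, k] + U[i, j, k+1]) / 2

        Jx[i, j, k] = upp * ux1
        Jy[i, j, k] = vpp * uy1
        Jz[i, j, k] = wpp * uz1
    end
end
```

Removing the bounds-check branches between the loads and stores apparently gives the compiler more freedom to schedule the memory operations.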

Note also that the order of floating-point operations is different in the two cases, because:

-(U[i-1, j, k] + U[i, j, k]) / 2 * (U[i-1, j, k] + U[i, j, k]) / 2

is not the same thing as

upp * ux1 == (-(U[i-1, j, k] + U[i, j, k]) / 2) * ((U[i-1, j, k] + U[i, j, k]) / 2)

(Note the parentheses.) Not only may this produce slightly different results due to differences in floating-point roundoff, but in the no_tmp! case it is also likely to prevent the compiler from realizing that upp == -ux1 and pulling out the common subexpression.
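One way to confirm how Julia groups the unparenthesized chain is to ask the parser directly (a small sketch; a and b stand in for the array reads):

```julia
# * and / have equal precedence and associate left to right,
# so the chain parses as ((-a / 2) * b) / 2, not (-a/2) * (b/2).
chained = Meta.parse("-(a)/2*(b)/2")
grouped = :(((-a / 2) * b) / 2)
@assert chained == grouped
```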

3 Likes

I don’t get anywhere near 3x, but still a bigger difference than I would have expected:

julia> @btime plain!($U, $V, $W, $Jx, $Jy, $Jz, $NX, $NY, $NZ, $NG)
  56.065 ms (0 allocations: 0 bytes)

julia> @btime no_tmp!($U, $V, $W, $Jx, $Jy, $Jz, $NX, $NY, $NZ, $NG)
  69.160 ms (0 allocations: 0 bytes)

My CPU is quite similar to yours (20 Γ— 13th Gen Intel(R) Core(TM) i7-13800H)

Thank you all. I am surprised at such quick replies.
I am completely convinced by stevengj and understand that reads should be done before writes. Also, the floating-point operation-order difference is interesting.

But I still have a question about the 3x slowdown, and @inbounds did not fix the problem.
I got 101.028 ms from the former and 324.322 ms from the latter.
This is very slow compared to nilshg's results.

1 Like

@benchmark returns a lot of different statistics. Are you looking at the minimum time? This is what @btime reports, and is often the most meaningful number.
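A toy illustration with made-up numbers of why the two statistics can diverge: timing noise is one-sided, so outliers inflate the mean but not the minimum.

```julia
using Statistics

# Hypothetical timings in ms; a few runs hit background noise.
times = [35.5, 35.9, 36.0, 52.0, 58.9]
minimum(times)   # 35.5, the least-noisy run (what @btime reports)
mean(times)      # 43.66, dragged upward by the two slow runs
```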

Thank you for the answer. I was probably looking at the mean time. The full outputs are as follows.

julia> @benchmark plain!($U, $V, $W, $Jx, $Jy, $Jz, $NX, $NY, $NZ, $NG)
BenchmarkTools.Trial: 131 samples with 1 evaluation.
 Range (min … max):  35.522 ms … 58.969 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     35.885 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   38.408 ms ±  5.696 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █▇▂ ▁▁                                               ▁
  ███▆██▄▄▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁██▄█▄▄ ▄
  35.5 ms      Histogram: log(frequency) by time      52.4 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark noassign!($U, $V, $W, $Jx, $Jy, $Jz, $NX, $NY, $NZ, $NG)
BenchmarkTools.Trial: 57 samples with 1 evaluation.
 Range (min … max):  88.406 ms …  90.032 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     88.995 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   89.011 ms ± 318.081 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

                            █ ▁▁   ▁       ▁
  ▄▁▁▁▇▄▁▄▁▄▄▄▄▄▄▄▁▄▄▄▁▄▇▁▇▄█▄██▄▄▄█▄▇▄▁▁▇▄█▁▁▁▁▁▄▁▁▁▄▄▁▄▁▁▁▁▇ ▁
  88.4 ms         Histogram: frequency by time         89.7 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

Aside from performance, why use rand!(zeros(..)) and not just rand(..)? The latter would be more concise and faster.
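For comparison, a minimal sketch of the more direct form:

```julia
# rand(dims...) allocates and fills in one step; rand!(zeros(dims...))
# first zero-fills the array and then overwrites every element.
A = rand(4, 4, 4)
@assert size(A) == (4, 4, 4)
@assert all(x -> 0 <= x < 1, A)
```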

In Julia version 1.11.0-DEV, I get identical timings (with the --check-bounds=no option):

BenchmarkTools.Trial: 69 samples with 1 evaluation.
 Range (min … max):  71.886 ms … 81.659 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     72.572 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   73.115 ms ±  1.723 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

    ▄█▅ ▂
  ▄▆███▃█▅▄▆▅▃▄▄▁▃▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▃ ▁
  71.9 ms         Histogram: frequency by time        80.4 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.
BenchmarkTools.Trial: 68 samples with 1 evaluation.
 Range (min … max):  72.288 ms … 96.033 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     73.023 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   74.107 ms ±  4.257 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▅██▅
  ████▅▅▁▁▁▁▁▁▁▁▅▁▁▁▁▁▁▅▁▁▅▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▅▁▁▁▁▁▅ ▁
  72.3 ms      Histogram: log(frequency) by time        92 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

Thank you for benchmarking. I agree rand! is a roundabout way.
I also tried --check-bounds=no, but the same problem occurred.
I'm convinced that's just the way it is.

Thank you all for the answers. If someone encounters the same problem or knows the reason, please tell me.

What’s noassign! here?

I am sorry, noassign! is the same function as no_tmp!.