Strange performance problem

Hi all, I am new to this forum. Nice to meet you.
Today I encountered a strange performance problem.
I benchmarked the code below with BenchmarkTools.

```julia
using Random, BenchmarkTools

NG = 2
N = NX = NY = NZ = 256
U = rand!(zeros(N + 2NG, N + 2NG, N + 2NG));
V = rand!(zeros(N + 2NG, N + 2NG, N + 2NG));
W = rand!(zeros(N + 2NG, N + 2NG, N + 2NG));
Jx = zeros(N + 2NG, N + 2NG, N + 2NG);
Jy = zeros(N + 2NG, N + 2NG, N + 2NG);
Jz = zeros(N + 2NG, N + 2NG, N + 2NG);

function plain!(U, V, W, Jx, Jy, Jz, NX, NY, NZ, NG)
    for k in NG:NZ+1+NG, j in NG:NY+1+NG, i in NG:NX+1+NG
        upp = -(U[i-1, j, k] + U[i, j, k]) / 2
        vpp = -(V[i, j, k] + V[i+1, j, k]) / 2
        wpp = -(W[i, j, k] + W[i+1, j, k]) / 2

        ux1 = (U[i-1, j, k] + U[i, j, k]) / 2
        uy1 = (U[i, j, k] + U[i, j+1, k]) / 2
        uz1 = (U[i, j, k] + U[i, j, k+1]) / 2

        Jx[i, j, k] = upp * ux1
        Jy[i, j, k] = vpp * uy1
        Jz[i, j, k] = wpp * uz1
    end
end

@benchmark plain!($U, $V, $W, $Jx, $Jy, $Jz, $NX, $NY, $NZ, $NG)

function no_tmp!(U, V, W, Jx, Jy, Jz, NX, NY, NZ, NG)
    for k in NG:NZ+1+NG, j in NG:NY+1+NG, i in NG:NX+1+NG
        Jx[i, j, k] = -(U[i-1, j, k] + U[i, j, k]) / 2 * (U[i-1, j, k] + U[i, j, k]) / 2
        Jy[i, j, k] = -(V[i, j, k] + V[i+1, j, k]) / 2 * (U[i, j, k] + U[i, j+1, k]) / 2
        Jz[i, j, k] = -(W[i, j, k] + W[i+1, j, k]) / 2 * (U[i, j, k] + U[i, j, k+1]) / 2
    end
end

@benchmark no_tmp!($U, $V, $W, $Jx, $Jy, $Jz, $NX, $NY, $NZ, $NG)
```

As you can see, the only difference between the two functions is whether temporary variables are assigned.
BenchmarkTools tells me the former is 3x faster than the latter.
Why?

```
Julia Version 1.10.3
Commit 0b4590a5507 (2024-04-30 10:59 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 24 × 13th Gen Intel(R) Core(TM) i7-13700K
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, goldmont)
Threads: 24 default, 0 interactive, 12 GC (on 24 virtual cores)
Environment:
  JULIA_EDITOR = code
```

Can you give sample data that you used, so that other people can run this?

Thank you for replying. I added sample data.

I tried to run your code and got more or less the same performance on 1.10.4; can you post the output of `versioninfo()`?

Thank you for replying. I added the version info. I ran it in both Jupyter and the REPL, but the same problem occurred.

I get a difference of about 36%, not 3x:

```julia
julia> @btime plain!($U, $V, $W, $Jx, $Jy, $Jz, $NX, $NY, $NZ, $NG);
  112.766 ms (0 allocations: 0 bytes)

julia> @btime no_tmp!($U, $V, $W, $Jx, $Jy, $Jz, $NX, $NY, $NZ, $NG);
  153.984 ms (0 allocations: 0 bytes)
```

In the first version, you perform all of your "reads" of the array before you perform any "writes". In the second version the reads and writes are interleaved. This will make a difference because the compiler doesn't know that the `Jx, Jy, Jz` arrays are distinct from `U, V, W`.

For example, you read `U[i, j, k]` when computing `upp`, and again when computing `ux1`, `uy1`, and `uz1`. In `plain!`, the compiler probably factors out this common subexpression, so it loads `U[i,j,k]` only once. However, in `no_tmp!`, because the reads and writes are interleaved, the compiler will have to load `U[i,j,k]` three times, since it may have been overwritten by the writes to the `J` arrays (which might be the same as `U` for all the compiler knows).
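To make the aliasing concern concrete, here is a small sketch (my own illustration, not code from this thread) where the output array really does alias the input, so caching a load of `U` across a store would give the wrong answer:

```julia
U = collect(1.0:5.0)
J = U                      # J and U are the same array (aliased)
for i in 2:4
    # the write to J[i] changes the value the next iteration reads as U[i-1]
    J[i] = U[i-1] + U[i]
end
U   # → [1.0, 3.0, 6.0, 10.0, 5.0] — a running sum, not independent pairwise sums
```

Because aliasing like this is legal, in `no_tmp!` the compiler must conservatively reload from `U` after every store unless it can prove the arrays are distinct.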

Interestingly, most of the difference goes away if I simply put `@inbounds` in front of the `for` loops (i.e. write `@inbounds for`).
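For reference, by "`@inbounds` in front of the `for` loops" I mean something like the following sketch (the same body as the original `no_tmp!`, just with bounds checks elided; verify the index ranges are in bounds before using it):

```julia
function no_tmp_inbounds!(U, V, W, Jx, Jy, Jz, NX, NY, NZ, NG)
    # @inbounds applies to the whole nested loop below
    @inbounds for k in NG:NZ+1+NG, j in NG:NY+1+NG, i in NG:NX+1+NG
        Jx[i, j, k] = -(U[i-1, j, k] + U[i, j, k]) / 2 * (U[i-1, j, k] + U[i, j, k]) / 2
        Jy[i, j, k] = -(V[i, j, k] + V[i+1, j, k]) / 2 * (U[i, j, k] + U[i, j+1, k]) / 2
        Jz[i, j, k] = -(W[i, j, k] + W[i+1, j, k]) / 2 * (U[i, j, k] + U[i, j, k+1]) / 2
    end
end
```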

Note also that the order of floating-point operations is different in the two cases, because:

```julia
-(U[i-1, j, k] + U[i, j, k]) / 2 * (U[i-1, j, k] + U[i, j, k]) / 2
```

is not the same thing as

```julia
upp * ux1 == (-(U[i-1, j, k] + U[i, j, k]) / 2) * ((U[i-1, j, k] + U[i, j, k]) / 2)
```

(Note the parentheses.) Not only may this produce slightly different results due to differences in floating-point roundoff errors, but in the `no_tmp!` case it is also likely to prevent the compiler from realizing that `upp == -ux1` and pulling out the common subexpression.
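You can check the grouping with `Meta.parse`: since `*` and `/` have equal precedence and associate left to right, the fused expression is not parsed as a product of two halved terms:

```julia
# Left-to-right association: -(a+b)/2*(a+b)/2 groups as (((-(a+b))/2)*(a+b))/2
Meta.parse("-(a + b) / 2 * (a + b) / 2") ==
    Meta.parse("((-(a + b) / 2) * (a + b)) / 2")   # → true

# ...which is a different expression tree from the explicit upp * ux1 grouping:
Meta.parse("-(a + b) / 2 * (a + b) / 2") ==
    Meta.parse("(-(a + b) / 2) * ((a + b) / 2)")   # → false
```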


I donβt get anywhere near 3x, but still a bigger difference than I would have expected:

```julia
julia> @btime plain!($U, $V, $W, $Jx, $Jy, $Jz, $NX, $NY, $NZ, $NG)
  56.065 ms (0 allocations: 0 bytes)

julia> @btime no_tmp!($U, $V, $W, $Jx, $Jy, $Jz, $NX, $NY, $NZ, $NG)
  69.160 ms (0 allocations: 0 bytes)
```

My CPU is quite similar to yours (`20 × 13th Gen Intel(R) Core(TM) i7-13800H`)

Thank you all. I am surprised at such quick replies.
I am completely convinced by stevengj's explanation, and I understand that reads should be done before writes. The floating-point operation difference is also interesting.

But I still have a question about the 3x slowdown, and `@inbounds` did not fix it for me.
I got 101.028 ms from the former and 324.322 ms from the latter.
This is much slower than nilshg's results.


`@benchmark` returns a lot of different statistics. Are you looking at the minimum time? This is what `@btime` reports, and is often the most meaningful number.

Thank you for the answer. I was probably looking at the mean time. The full outputs are as follows.

```julia
julia> @benchmark plain!($U, $V, $W, $Jx, $Jy, $Jz, $NX, $NY, $NZ, $NG)
BenchmarkTools.Trial: 131 samples with 1 evaluation.
 Range (min … max):  35.522 ms … 58.969 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     35.885 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   38.408 ms ±  5.696 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

  [histogram bars omitted]
  35.5 ms      Histogram: log(frequency) by time      52.4 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark noassign!($U, $V, $W, $Jx, $Jy, $Jz, $NX, $NY, $NZ, $NG)
BenchmarkTools.Trial: 57 samples with 1 evaluation.
 Range (min … max):  88.406 ms …  90.032 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     88.995 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   89.011 ms ± 318.081 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  [histogram bars omitted]
  88.4 ms         Histogram: frequency by time         89.7 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.
```

Aside from performance, why use `rand!(zeros(..))` and not just `rand(..)`? The latter would be more concise and faster.

In Julia version 1.11.0-DEV, I get identical timings (with the `--check-bounds=no` option):

```julia
BenchmarkTools.Trial: 69 samples with 1 evaluation.
 Range (min … max):  71.886 ms … 81.659 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     72.572 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   73.115 ms ±  1.723 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

  [histogram bars omitted]
  71.9 ms         Histogram: frequency by time        80.4 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

BenchmarkTools.Trial: 68 samples with 1 evaluation.
 Range (min … max):  72.288 ms … 96.033 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     73.023 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   74.107 ms ±  4.257 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

  [histogram bars omitted]
  72.3 ms      Histogram: log(frequency) by time        92 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.
```

Thank you for benchmarking. I agree that `rand!` is a roundabout way.
I also tried `--check-bounds=no`, but the same problem occurred.
I'm convinced that's just the way it is.

Thank you all for the answers. If anyone runs into the same problem or knows the reason, please let me know.

Whatβs `noassign!` here?

I am sorry, `noassign!` is the same function as `no_tmp!`.