Same code much faster on a Ryzen than on a Xeon?

ederag · May 4, 2021, 5:09pm

The time_evolution ODE kernel below is blazing fast on a Ryzen, and more than 4 times slower on a Xeon.
This difference is striking, as usually both machines have similar single thread performances.
(more on that below)
I’m looking for clues to unravel this mystery.
The kernel comes from a method of lines / finite difference scheme, with spatially varying parameters.
Here is the stripped down version used for the benchmarks below:

using BenchmarkTools

const N_points = 40_000
const left_C = 500 .+ randn(N_points);
const right_C = 500 .+ randn(N_points);
const sigma = rand(N_points);
const γ = 1e-5

const u0 = randn(N_points);
const du0 = randn(N_points);
const ddu = zeros(Float64, N_points);

function time_evolution(ddu, du, u, γ, t)
	@inbounds for s in 2:N_points-1
		ddu[s] = (
			left_C[s] * (u[s-1] - u[s]) 
			+ right_C[s] * (u[s+1] - u[s])
			+ sigma[s]
			- 2γ * du[s]
		)
	end
end

Machine A (AMD ryzen):

@benchmark time_evolution(ddu, du0, u0, γ, 0.0)
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     16.201 μs (0.00% GC)
  median time:      16.982 μs (0.00% GC)
  mean time:        17.056 μs (0.00% GC)
  maximum time:     156.137 μs (0.00% GC)

Machine B (Intel xeon):

@benchmark time_evolution(ddu, du0, u0, γ, 0.0)
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     69.923 μs (0.00% GC)
  median time:      70.935 μs (0.00% GC)
  mean time:        71.088 μs (0.00% GC)
  maximum time:     149.609 μs (0.00% GC)

16 µs for the AMD ryzen, 70 µs for the Intel xeon, so the xeon is about 4.3 times slower.
This is reproducible, and both machines are idle otherwise.

Results are similar with @avx from LoopVectorization.jl.

Machines comparison

julia-1.6.1
Machine A: single AMD Ryzen 9 3900X, 64 GB RAM
Machine B: bi-Xeon Gold 6146, 128 GB ECC RAM

versioninfo() details

versioninfo()
Julia Version 1.6.1
Commit 6aaedecc447 (2021-04-23 05:59 UTC)
Platform Info:
  OS: Linux (x86_64-suse-linux)
  CPU: AMD Ryzen 9 3900X 12-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, znver2)

Machine B:

Julia Version 1.6.1
Commit 6aaedecc447 (2021-04-23 05:59 UTC)
Platform Info:
  OS: Linux (x86_64-suse-linux)
  CPU: Intel(R) Xeon(R) Gold 6146 CPU @ 3.20GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, skylake-avx512)

Note: the issue was first noticed while under julia 1.5.3, it is not a 1.6 regression.

Single thread memory access is slower on xeons,
but that can’t explain the 4.3 factor slowdown.
machine A (AMD Ryzen):

sysbench memory run --memory-block-size=64 | grep MiB/sec
(567.96 MiB/sec)
sysbench memory run --memory-block-size=1M | grep MiB/sec
(27673.99 MiB/sec)

Machine B (Intel Xeon):

sysbench memory run --memory-block-size=64 | grep MiB/sec
(550.30 MiB/sec)
sysbench memory run --memory-block-size=1M | grep MiB/sec
(19341.05 MiB/sec)

likwid-bench -s 10 -t stream -w S0:100kB:1

Results

Machine A (AMD Ryzen):

LIKWID MICRO BENCHMARK
Test: stream
--------------------------------------------------------------------------------
Using 1 work groups
Using 1 threads
--------------------------------------------------------------------------------
Running without Marker API. Activate Marker API with -m on commandline.
--------------------------------------------------------------------------------
Group: 0 Thread 0 Global Thread 0 running on hwthread 0 - Vector length 4164 Offset 0
--------------------------------------------------------------------------------
Cycles:                 45549095124
CPU Clock:              3792770502
Cycle Clock:            3792770502
Time:                   1.200945e+01 sec
Iterations:             8388608
Iterations per thread:  8388608
Inner loop executions:  1041
Size (Byte):            99936
Size per thread:        99936
Number of Flops:        69860327424
MFlops/s:               5817.11
Data volume (Byte):     838323929088
MByte/s:                69805.34
Cycles per update:      1.304005
Cycles per cacheline:   10.432037
Loads per update:       2
Stores per update:      1
Load bytes per element: 16
Store bytes per elem.:  8
Load/store ratio:       2.00
Instructions:           165918277649
UOPs:                   227046064128

Machine B (intel Xeon):

LIKWID MICRO BENCHMARK
Test: stream
--------------------------------------------------------------------------------
Using 1 work groups
Using 1 threads
--------------------------------------------------------------------------------
Running without Marker API. Activate Marker API with -m on commandline.
--------------------------------------------------------------------------------
Group: 0 Thread 0 Global Thread 0 running on hwthread 0 - Vector length 4164 Offset 0
--------------------------------------------------------------------------------
Cycles:                 54668218488
CPU Clock:              3192441309
Cycle Clock:            3192441309
Time:                   1.712427e+01 sec
Iterations:             8388608
Iterations per thread:  8388608
Inner loop executions:  1041
Size (Byte):            99936
Size per thread:        99936
Number of Flops:        69860327424
MFlops/s:               4079.61
Data volume (Byte):     838323929088
MByte/s:                48955.32
Cycles per update:      1.565072
Cycles per cacheline:   12.520575
Loads per update:       2
Stores per update:      1
Load bytes per element: 16
Store bytes per elem.:  8
Load/store ratio:       2.00
Instructions:           165918277649
UOPs:                   227046064128

General benchmarks, from hardinfo --generate-report

Results

Machine A (AMD Ryzen):

CPU Blowfish: 0.309  # seconds, lower better, multiple cores
CPU CryptoHash: 2320.186  # MiB/s, higher better, multiple cores
CPU Fibonacci: 0.925  # seconds, lower better, single core
CPU N-Queens: 0.330  # seconds, lower better, single core
FPU FFT: 0.673  # seconds, lower is better, single core
FPU Raytracing: 11.102  # John Walker's FBENCH, seconds, lower is better multiple cores

Machine B (Intel Xeon):

CPU Blowfish: 0.359
CPU CryptoHash: 3070.564
CPU Fibonacci: 0.968
CPU N-Queens: 0.395
FPU FFT: 0.563
FPU Raytracing: 8.367

Generated code

On both machines, the following commands
give strictly the same result (checked with diff):

@code_warntype time_evolution(ddu, du0, u0, γ, 0.0)

result

Variables
  #self#::Core.Const(time_evolution)
  ddu::Vector{Float64}
  du::Vector{Float64}
  u::Vector{Float64}
  γ::Float64
  t::Float64
  @_7::UNION{NOTHING, TUPLE{INT64, INT64}}
  val::Nothing
  s::Int64

Body::Nothing
1 ─       Core.NewvarNode(:(val))
│         $(Expr(:inbounds, true))
│   %3  = (Main.N_points - 1)::Core.Const(39999)
│   %4  = (2:%3)::Core.Const(2:39999)
│         (@_7 = Base.iterate(%4))
│   %6  = (@_7::Core.Const((2, 2)) === nothing)::Core.Const(false)
│   %7  = Base.not_int(%6)::Core.Const(true)
└──       goto #4 if not %7
2 ┄ %9  = @_7::Tuple{Int64, Int64}::Tuple{Int64, Int64}
│         (s = Core.getfield(%9, 1))
│   %11 = Core.getfield(%9, 2)::Int64
│   %12 = Base.getindex(Main.left_C, s)::Float64
│   %13 = (s - 1)::Int64
│   %14 = Base.getindex(u, %13)::Float64
│   %15 = Base.getindex(u, s)::Float64
│   %16 = (%14 - %15)::Float64
│   %17 = (%12 * %16)::Float64
│   %18 = Base.getindex(Main.right_C, s)::Float64
│   %19 = (s + 1)::Int64
│   %20 = Base.getindex(u, %19)::Float64
│   %21 = Base.getindex(u, s)::Float64
│   %22 = (%20 - %21)::Float64
│   %23 = (%18 * %22)::Float64
│   %24 = Base.getindex(Main.sigma, s)::Float64
│   %25 = (%17 + %23 + %24)::Float64
│   %26 = (2 * γ)::Float64
│   %27 = Base.getindex(du, s)::Float64
│   %28 = (%26 * %27)::Float64
│   %29 = (%25 - %28)::Float64
│         Base.setindex!(ddu, %29, s)
│         (@_7 = Base.iterate(%4, %11))
│   %32 = (@_7 === nothing)::Bool
│   %33 = Base.not_int(%32)::Bool
└──       goto #4 if not %33
3 ─       goto #2
4 ┄       (val = nothing)
│         $(Expr(:inbounds, :pop))
└──       return val

@code_lowered time_evolution(ddu, du0, u0, γ, 0.0)

Result

CodeInfo(
1 ─       Core.NewvarNode(:(val))
│         $(Expr(:inbounds, true))
│   %3  = Main.N_points - 1
│   %4  = 2:%3
│         @_7 = Base.iterate(%4)
│   %6  = @_7 === nothing
│   %7  = Base.not_int(%6)
└──       goto #4 if not %7
2 ┄ %9  = @_7
│         s = Core.getfield(%9, 1)
│   %11 = Core.getfield(%9, 2)
│   %12 = Base.getindex(Main.left_C, s)
│   %13 = s - 1
│   %14 = Base.getindex(u, %13)
│   %15 = Base.getindex(u, s)
│   %16 = %14 - %15
│   %17 = %12 * %16
│   %18 = Base.getindex(Main.right_C, s)
│   %19 = s + 1
│   %20 = Base.getindex(u, %19)
│   %21 = Base.getindex(u, s)
│   %22 = %20 - %21
│   %23 = %18 * %22
│   %24 = Base.getindex(Main.sigma, s)
│   %25 = %17 + %23 + %24
│   %26 = 2 * γ
│   %27 = Base.getindex(du, s)
│   %28 = %26 * %27
│   %29 = %25 - %28
│         Base.setindex!(ddu, %29, s)
│         @_7 = Base.iterate(%4, %11)
│   %32 = @_7 === nothing
│   %33 = Base.not_int(%32)
└──       goto #4 if not %33
3 ─       goto #2
4 ┄       val = nothing
│         $(Expr(:inbounds, :pop))
└──       return val
)

This one gives almost the same result:
@code_llvm time_evolution(ddu, du0, u0, γ, 0.0)

Result on machine A

;  @ /home/ederag/share/coll/combe/oscpar/julia/oscpar/plutos/time_evolution_mwe.jl:56 within `time_evolution'
define void @julia_time_evolution_1094({}* nonnull align 16 dereferenceable(40) %0, {}* nonnull align 16 dereferenceable(40) %1, {}* nonnull align 16 dereferenceable(40) %2, double %3, double %4) {
top:
;  @ /home/ederag/share/coll/combe/oscpar/julia/oscpar/plutos/time_evolution_mwe.jl:60 within `time_evolution'
; ┌ @ array.jl within `getindex'
   %5 = load double*, double** inttoptr (i64 139749451363712 to double**), align 8
   %6 = bitcast {}* %2 to double**
   %7 = load double*, double** %6, align 8
   %8 = load double*, double** inttoptr (i64 139749451364272 to double**), align 8
   %9 = load double*, double** inttoptr (i64 139749435769008 to double**), align 8
; └
; ┌ @ promotion.jl:322 within `*' @ float.jl:0
   %10 = fmul double %3, 2.000000e+00
; └
; ┌ @ array.jl within `getindex'
   %11 = bitcast {}* %1 to double**
   %12 = load double*, double** %11, align 8
; └
; ┌ @ array.jl within `setindex!'
   %13 = bitcast {}* %0 to double**
   %14 = load double*, double** %13, align 8
; └
;  @ /home/ederag/share/coll/combe/oscpar/julia/oscpar/plutos/time_evolution_mwe.jl:59 within `time_evolution'
  %scevgep = getelementptr double, double* %14, i64 1
  %scevgep9 = getelementptr double, double* %14, i64 39999
  %scevgep11 = getelementptr double, double* %5, i64 1
  %scevgep13 = getelementptr double, double* %5, i64 39999
  %scevgep15 = getelementptr double, double* %7, i64 40000
  %scevgep17 = getelementptr double, double* %8, i64 1
  %scevgep19 = getelementptr double, double* %8, i64 39999
  %scevgep21 = getelementptr double, double* %9, i64 1
  %scevgep23 = getelementptr double, double* %9, i64 39999
  %scevgep25 = getelementptr double, double* %12, i64 1
  %scevgep27 = getelementptr double, double* %12, i64 39999
  %bound0 = icmp ult double* %scevgep, %scevgep13
  %bound1 = icmp ult double* %scevgep11, %scevgep9
  %found.conflict = and i1 %bound0, %bound1
  %bound029 = icmp ult double* %scevgep, %scevgep15
  %bound130 = icmp ult double* %7, %scevgep9
  %found.conflict31 = and i1 %bound029, %bound130
  %conflict.rdx = or i1 %found.conflict, %found.conflict31
  %bound032 = icmp ult double* %scevgep, %scevgep19
  %bound133 = icmp ult double* %scevgep17, %scevgep9
  %found.conflict34 = and i1 %bound032, %bound133
  %conflict.rdx35 = or i1 %conflict.rdx, %found.conflict34
  %bound036 = icmp ult double* %scevgep, %scevgep23
  %bound137 = icmp ult double* %scevgep21, %scevgep9
  %found.conflict38 = and i1 %bound036, %bound137
  %conflict.rdx39 = or i1 %conflict.rdx35, %found.conflict38
  %bound040 = icmp ult double* %scevgep, %scevgep27
  %bound141 = icmp ult double* %scevgep25, %scevgep9
  %found.conflict42 = and i1 %bound040, %bound141
  %conflict.rdx43 = or i1 %conflict.rdx39, %found.conflict42
  br i1 %conflict.rdx43, label %scalar.ph, label %vector.ph

vector.ph:                                        ; preds = %top
  %broadcast.splatinsert = insertelement <4 x double> undef, double %10, i32 0
  %broadcast.splat = shufflevector <4 x double> %broadcast.splatinsert, <4 x double> undef, <4 x i32> zeroinitializer
  br label %vector.body

vector.body:                                      ; preds = %vector.body, %vector.ph
  %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
  %offset.idx = or i64 %index, 2
;  @ /home/ederag/share/coll/combe/oscpar/julia/oscpar/plutos/time_evolution_mwe.jl:60 within `time_evolution'
; ┌ @ array.jl:801 within `getindex'
   %15 = add nsw i64 %offset.idx, -1
   %16 = getelementptr inbounds double, double* %5, i64 %15
   %17 = bitcast double* %16 to <4 x double>*
   %wide.load = load <4 x double>, <4 x double>* %17, align 8
   %18 = getelementptr inbounds double, double* %7, i64 %index
   %19 = bitcast double* %18 to <4 x double>*
   %wide.load44 = load <4 x double>, <4 x double>* %19, align 8
   %20 = getelementptr inbounds double, double* %7, i64 %15
   %21 = bitcast double* %20 to <4 x double>*
   %wide.load45 = load <4 x double>, <4 x double>* %21, align 8
; └
; ┌ @ float.jl:329 within `-'
   %22 = fsub <4 x double> %wide.load44, %wide.load45
; └
; ┌ @ float.jl:332 within `*'
   %23 = fmul <4 x double> %wide.load, %22
; └
; ┌ @ array.jl:801 within `getindex'
   %24 = getelementptr inbounds double, double* %8, i64 %15
   %25 = bitcast double* %24 to <4 x double>*
   %wide.load46 = load <4 x double>, <4 x double>* %25, align 8
   %26 = getelementptr inbounds double, double* %7, i64 %offset.idx
   %27 = bitcast double* %26 to <4 x double>*
   %wide.load47 = load <4 x double>, <4 x double>* %27, align 8
; └
; ┌ @ float.jl:329 within `-'
   %28 = fsub <4 x double> %wide.load47, %wide.load45
; └
; ┌ @ float.jl:332 within `*'
   %29 = fmul <4 x double> %wide.load46, %28
; └
; ┌ @ array.jl:801 within `getindex'
   %30 = getelementptr inbounds double, double* %9, i64 %15
   %31 = bitcast double* %30 to <4 x double>*
   %wide.load48 = load <4 x double>, <4 x double>* %31, align 8
; └
; ┌ @ operators.jl:560 within `+' @ float.jl:326
   %32 = fadd <4 x double> %23, %29
   %33 = fadd <4 x double> %wide.load48, %32
; └
; ┌ @ array.jl:801 within `getindex'
   %34 = getelementptr inbounds double, double* %12, i64 %15
   %35 = bitcast double* %34 to <4 x double>*
   %wide.load49 = load <4 x double>, <4 x double>* %35, align 8
; └
; ┌ @ float.jl:332 within `*'
   %36 = fmul <4 x double> %broadcast.splat, %wide.load49
; └
; ┌ @ float.jl:329 within `-'
   %37 = fsub <4 x double> %33, %36
; └
; ┌ @ array.jl:839 within `setindex!'
   %38 = getelementptr inbounds double, double* %14, i64 %15
   %39 = bitcast double* %38 to <4 x double>*
   store <4 x double> %37, <4 x double>* %39, align 8
   %index.next = add i64 %index, 4
   %40 = icmp eq i64 %index.next, 39996
   br i1 %40, label %scalar.ph, label %vector.body

scalar.ph:                                        ; preds = %vector.body, %top
   %bc.resume.val = phi i64 [ 2, %top ], [ 39998, %vector.body ]
; └
;  @ /home/ederag/share/coll/combe/oscpar/julia/oscpar/plutos/time_evolution_mwe.jl:59 within `time_evolution'
  br label %L2

L2:                                               ; preds = %L2, %scalar.ph
  %value_phi = phi i64 [ %bc.resume.val, %scalar.ph ], [ %66, %L2 ]
;  @ /home/ederag/share/coll/combe/oscpar/julia/oscpar/plutos/time_evolution_mwe.jl:60 within `time_evolution'
; ┌ @ array.jl:801 within `getindex'
   %41 = add nsw i64 %value_phi, -1
   %42 = getelementptr inbounds double, double* %5, i64 %41
   %43 = load double, double* %42, align 8
   %44 = add nsw i64 %value_phi, -2
   %45 = getelementptr inbounds double, double* %7, i64 %44
   %46 = load double, double* %45, align 8
   %47 = getelementptr inbounds double, double* %7, i64 %41
   %48 = load double, double* %47, align 8
; └
; ┌ @ float.jl:329 within `-'
   %49 = fsub double %46, %48
; └
; ┌ @ float.jl:332 within `*'
   %50 = fmul double %43, %49
; └
; ┌ @ array.jl:801 within `getindex'
   %51 = getelementptr inbounds double, double* %8, i64 %41
   %52 = load double, double* %51, align 8
   %53 = getelementptr inbounds double, double* %7, i64 %value_phi
   %54 = load double, double* %53, align 8
; └
; ┌ @ float.jl:329 within `-'
   %55 = fsub double %54, %48
; └
; ┌ @ float.jl:332 within `*'
   %56 = fmul double %52, %55
; └
; ┌ @ array.jl:801 within `getindex'
   %57 = getelementptr inbounds double, double* %9, i64 %41
   %58 = load double, double* %57, align 8
; └
; ┌ @ operators.jl:560 within `+' @ float.jl:326
   %59 = fadd double %50, %56
   %60 = fadd double %58, %59
; └
; ┌ @ array.jl:801 within `getindex'
   %61 = getelementptr inbounds double, double* %12, i64 %41
   %62 = load double, double* %61, align 8
; └
; ┌ @ float.jl:332 within `*'
   %63 = fmul double %10, %62
; └
; ┌ @ float.jl:329 within `-'
   %64 = fsub double %60, %63
; └
; ┌ @ array.jl:839 within `setindex!'
   %65 = getelementptr inbounds double, double* %14, i64 %41
   store double %64, double* %65, align 8
; └
; ┌ @ range.jl:674 within `iterate'
; │┌ @ promotion.jl:410 within `=='
    %.not.not = icmp eq i64 %value_phi, 39999
; │└
   %66 = add nuw nsw i64 %value_phi, 1
; └
  br i1 %.not.not, label %L39, label %L2

L39:                                              ; preds = %L2
  ret void
}

save from a few ids:

Differences between machine B and A

< define void @julia_time_evolution_1094({}* nonnull align 16 dereferenceable(40) %0, {}* nonnull align 16 dereferenceable(40) %1, {}* nonnull align 16 dereferenceable(40) %2, double %3, double %4) {
---
> define void @julia_time_evolution_1171({}* nonnull align 16 dereferenceable(40) %0, {}* nonnull align 16 dereferenceable(40) %1, {}* nonnull align 16 dereferenceable(40) %2, double %3, double %4) {
6c6
<    %5 = load double*, double** inttoptr (i64 139749451363712 to double**), align 8
---
>    %5 = load double*, double** inttoptr (i64 140648620321984 to double**), align 8
9,10c9,10
<    %8 = load double*, double** inttoptr (i64 139749451364272 to double**), align 8
<    %9 = load double*, double** inttoptr (i64 139749435769008 to double**), align 8
---
>    %8 = load double*, double** inttoptr (i64 140648620322544 to double**), align 8
>    %9 = load double*, double** inttoptr (i64 140648577288320 to double**), align 8

The native code significantly differs
@code_native time_evolution(ddu, du0, u0, γ, 0.0)

Result on machine A (AMD Ryzen)

        .text
; ┌ @ time_evolution_mwe.jl:56 within `time_evolution'
        pushq   %rbp
        pushq   %r15
        pushq   %r14
        pushq   %r13
        pushq   %r12
        pushq   %rbx
        movabsq $139749451363712, %rax          # imm = 0x7F19F467F580
; │ @ time_evolution_mwe.jl:60 within `time_evolution'
; │┌ @ array.jl within `setindex!'
        movq    (%rdi), %rdi
        movq    (%rdx), %rcx
        movq    (%rsi), %rsi
        vaddsd  %xmm0, %xmm0, %xmm0
        movq    (%rax), %r8
        movabsq $139749451364272, %rax          # imm = 0x7F19F467F7B0
        movq    (%rax), %r9
        movabsq $139749435769008, %rax          # imm = 0x7F19F37A00B0
        movq    (%rax), %r10
; │└
; │ @ time_evolution_mwe.jl:59 within `time_evolution'
        leaq    8(%rdi), %rdx
        leaq    319992(%rdi), %rbp
        leaq    320000(%rcx), %r11
        leaq    319992(%r8), %rbx
        leaq    8(%r8), %rax
        cmpq    %rbx, %rdx
        leaq    319992(%r9), %r12
        leaq    8(%r9), %r15
        setb    -1(%rsp)
        cmpq    %rbp, %rax
        leaq    319992(%r10), %rbx
        leaq    8(%r10), %r13
        setb    %r14b
        cmpq    %r11, %rdx
        setb    %al
        cmpq    %rbp, %rcx
        setb    -2(%rsp)
        cmpq    %r12, %rdx
        setb    %r11b
        cmpq    %rbp, %r15
        setb    %r15b
        cmpq    %rbx, %rdx
        leaq    319992(%rsi), %rbx
        setb    %r12b
        cmpq    %rbp, %r13
        setb    -3(%rsp)
        cmpq    %rbx, %rdx
        leaq    8(%rsi), %rdx
        setb    %r13b
        cmpq    %rbp, %rdx
        movl    $2, %edx
        setb    %bpl
        testb   %r14b, -1(%rsp)
        jne     L323
        andb    -2(%rsp), %al
        jne     L323
        andb    %r15b, %r11b
        jne     L323
        andb    -3(%rsp), %r12b
        jne     L323
        andb    %bpl, %r13b
        jne     L323
        vbroadcastsd    %xmm0, %ymm1
        xorl    %eax, %eax
        nop
; │ @ time_evolution_mwe.jl:60 within `time_evolution'
; │┌ @ array.jl:801 within `getindex'
L240:
        vmovupd (%rcx,%rax), %ymm2
        vmovupd 8(%rcx,%rax), %ymm3
        vmovupd 16(%rcx,%rax), %ymm4
; │└
; │┌ @ float.jl:329 within `-'
        vsubpd  %ymm3, %ymm2, %ymm2
        vsubpd  %ymm3, %ymm4, %ymm3
; │└
; │┌ @ float.jl:332 within `*'
        vmulpd  8(%r8,%rax), %ymm2, %ymm2
        vmulpd  8(%r9,%rax), %ymm3, %ymm3
        vmulpd  8(%rsi,%rax), %ymm1, %ymm4
; │└
; │┌ @ operators.jl:560 within `+' @ float.jl:326
        vaddpd  %ymm3, %ymm2, %ymm2
        vaddpd  8(%r10,%rax), %ymm2, %ymm2
; │└
; │┌ @ float.jl:329 within `-'
        vsubpd  %ymm4, %ymm2, %ymm2
; │└
; │┌ @ array.jl:839 within `setindex!'
        vmovupd %ymm2, 8(%rdi,%rax)
        addq    $32, %rax
        cmpq    $319968, %rax                   # imm = 0x4E1E0
        jne     L240
; │└
; │ @ array.jl within `time_evolution'
        movl    $39998, %edx                    # imm = 0x9C3E
; │ @ time_evolution_mwe.jl:59 within `time_evolution'
L323:
        shlq    $3, %rdx
        addq    $-8, %rdi
        addq    $-8, %rsi
        addq    $-8, %r10
        addq    $-8, %r9
        addq    $-8, %r8
        movl    $320000, %eax                   # imm = 0x4E200
; │ @ time_evolution_mwe.jl:60 within `time_evolution'
; │┌ @ array.jl:801 within `getindex'
L352:
        vmovsd  -16(%rcx,%rdx), %xmm1           # xmm1 = mem[0],zero
        vmovsd  -8(%rcx,%rdx), %xmm2            # xmm2 = mem[0],zero
        vmovsd  (%rcx,%rdx), %xmm3              # xmm3 = mem[0],zero
; │└
; │┌ @ range.jl:674 within `iterate'
; ││┌ @ promotion.jl:410 within `=='
        addq    $8, %rcx
        addq    $-8, %rax
; │└└
; │┌ @ float.jl:329 within `-'
        vsubsd  %xmm2, %xmm1, %xmm1
        vsubsd  %xmm2, %xmm3, %xmm2
; │└
; │┌ @ float.jl:332 within `*'
        vmulsd  (%r8,%rdx), %xmm1, %xmm1
        vmulsd  (%r9,%rdx), %xmm2, %xmm2
        vmulsd  (%rsi,%rdx), %xmm0, %xmm3
; │└
; │┌ @ range.jl:674 within `iterate'
; ││┌ @ promotion.jl:410 within `=='
        addq    $8, %rsi
        addq    $8, %r9
        addq    $8, %r8
; │└└
; │┌ @ operators.jl:560 within `+' @ float.jl:326
        vaddsd  %xmm2, %xmm1, %xmm1
        vaddsd  (%r10,%rdx), %xmm1, %xmm1
; │└
; │┌ @ range.jl:674 within `iterate'
; ││┌ @ promotion.jl:410 within `=='
        addq    $8, %r10
; │└└
; │┌ @ float.jl:329 within `-'
        vsubsd  %xmm3, %xmm1, %xmm1
; │└
; │┌ @ array.jl:839 within `setindex!'
        vmovsd  %xmm1, (%rdi,%rdx)
; │└
; │┌ @ range.jl:674 within `iterate'
; ││┌ @ promotion.jl:410 within `=='
        addq    $8, %rdi
        cmpq    %rax, %rdx
; │└└
        jne     L352
        popq    %rbx
        popq    %r12
        popq    %r13
        popq    %r14
        popq    %r15
        popq    %rbp
        vzeroupper
        retq
        nopl    (%rax)
; └

Result on machine B (Intel Xeon)

        .text
; ┌ @ time_evolution_mwe.jl:56 within `time_evolution'
        pushq   %rbp
        pushq   %r15
        pushq   %r14
        pushq   %r13
        pushq   %r12
        pushq   %rbx
        movabsq $140648620321984, %rax          # imm = 0x7FEB4F0D6CC0
; │ @ time_evolution_mwe.jl:60 within `time_evolution'
; │┌ @ array.jl within `getindex'
        movq    (%rax), %r8
        movq    (%rdx), %rcx
        movabsq $140648620322544, %rax          # imm = 0x7FEB4F0D6EF0
        movq    (%rax), %r9
        movabsq $140648577288320, %rax          # imm = 0x7FEB4C7CC880
        movq    (%rax), %r10
        movq    (%rsi), %rsi
        movq    (%rdi), %rdi
; │└
; │ @ time_evolution_mwe.jl:59 within `time_evolution'
        leaq    8(%rdi), %rdx
        leaq    319992(%rdi), %rbp
        leaq    8(%r8), %rax
        leaq    319992(%r8), %rbx
        leaq    320000(%rcx), %r11
        leaq    8(%r9), %r15
        leaq    319992(%r9), %r12
        cmpq    %rbx, %rdx
        setb    -1(%rsp)
        cmpq    %rbp, %rax
        leaq    8(%r10), %r13
        setb    %r14b
        cmpq    %r11, %rdx
        setb    %al
        cmpq    %rbp, %rcx
        setb    -2(%rsp)
        cmpq    %r12, %rdx
        setb    %r11b
        cmpq    %rbp, %r15
        leaq    319992(%r10), %rbx
        setb    %r15b
        cmpq    %rbx, %rdx
        setb    %r12b
        cmpq    %rbp, %r13
        leaq    319992(%rsi), %rbx
        setb    -3(%rsp)
        cmpq    %rbx, %rdx
        leaq    8(%rsi), %rdx
        setb    %r13b
        cmpq    %rbp, %rdx
; │ @ time_evolution_mwe.jl:60 within `time_evolution'
; │┌ @ promotion.jl:322 within `*' @ float.jl:0
        vaddsd  %xmm0, %xmm0, %xmm0
; │└
; │ @ time_evolution_mwe.jl:59 within `time_evolution'
        setb    %bpl
        movl    $2, %edx
        testb   %r14b, -1(%rsp)
        jne     L323
        andb    -2(%rsp), %al
        jne     L323
        andb    %r15b, %r11b
        jne     L323
        andb    -3(%rsp), %r12b
        jne     L323
        andb    %bpl, %r13b
        jne     L323
        vbroadcastsd    %xmm0, %ymm1
        xorl    %eax, %eax
        nop
; │ @ time_evolution_mwe.jl:60 within `time_evolution'
; │┌ @ array.jl:801 within `getindex'
L240:
        vmovupd (%rcx,%rax), %ymm2
        vmovupd 8(%rcx,%rax), %ymm3
        vmovupd 16(%rcx,%rax), %ymm4
; │└
; │┌ @ float.jl:329 within `-'
        vsubpd  %ymm3, %ymm2, %ymm2
; │└
; │┌ @ float.jl:332 within `*'
        vmulpd  8(%r8,%rax), %ymm2, %ymm2
; │└
; │┌ @ float.jl:329 within `-'
        vsubpd  %ymm3, %ymm4, %ymm3
; │└
; │┌ @ float.jl:332 within `*'
        vmulpd  8(%r9,%rax), %ymm3, %ymm3
; │└
; │┌ @ operators.jl:560 within `+' @ float.jl:326
        vaddpd  %ymm3, %ymm2, %ymm2
        vaddpd  8(%r10,%rax), %ymm2, %ymm2
; │└
; │┌ @ float.jl:332 within `*'
        vmulpd  8(%rsi,%rax), %ymm1, %ymm3
; │└
; │┌ @ float.jl:329 within `-'
        vsubpd  %ymm3, %ymm2, %ymm2
; │└
; │┌ @ array.jl:839 within `setindex!'
        vmovupd %ymm2, 8(%rdi,%rax)
        addq    $32, %rax
        cmpq    $319968, %rax                   # imm = 0x4E1E0
        jne     L240
; │└
; │ @ array.jl within `time_evolution'
        movl    $39998, %edx                    # imm = 0x9C3E
; │ @ time_evolution_mwe.jl:59 within `time_evolution'
L323:
        shlq    $3, %rdx
        addq    $-8, %rdi
        addq    $-8, %rsi
        addq    $-8, %r10
        addq    $-8, %r9
        addq    $-8, %r8
        movl    $320000, %eax                   # imm = 0x4E200
; │ @ time_evolution_mwe.jl:60 within `time_evolution'
; │┌ @ array.jl:801 within `getindex'
L352:
        vmovsd  -16(%rcx,%rdx), %xmm1           # xmm1 = mem[0],zero
        vmovsd  -8(%rcx,%rdx), %xmm2            # xmm2 = mem[0],zero
; │└
; │┌ @ float.jl:329 within `-'
        vsubsd  %xmm2, %xmm1, %xmm1
; │└
; │┌ @ float.jl:332 within `*'
        vmulsd  (%r8,%rdx), %xmm1, %xmm1
; │└
; │┌ @ array.jl:801 within `getindex'
        vmovsd  (%rcx,%rdx), %xmm3              # xmm3 = mem[0],zero
; │└
; │┌ @ float.jl:329 within `-'
        vsubsd  %xmm2, %xmm3, %xmm2
; │└
; │┌ @ float.jl:332 within `*'
        vmulsd  (%r9,%rdx), %xmm2, %xmm2
; │└
; │┌ @ operators.jl:560 within `+' @ float.jl:326
        vaddsd  %xmm2, %xmm1, %xmm1
        vaddsd  (%r10,%rdx), %xmm1, %xmm1
; │└
; │┌ @ float.jl:332 within `*'
        vmulsd  (%rsi,%rdx), %xmm0, %xmm2
; │└
; │┌ @ float.jl:329 within `-'
        vsubsd  %xmm2, %xmm1, %xmm1
; │└
; │┌ @ array.jl:839 within `setindex!'
        vmovsd  %xmm1, (%rdi,%rdx)
; │└
; │┌ @ range.jl:674 within `iterate'
; ││┌ @ promotion.jl:410 within `=='
        addq    $8, %rcx
        addq    $8, %rdi
        addq    $8, %rsi
        addq    $8, %r10
        addq    $8, %r9
        addq    $8, %r8
        addq    $-8, %rax
        cmpq    %rax, %rdx
; │└└
        jne     L352
        popq    %rbx
        popq    %r12
        popq    %r13
        popq    %r14
        popq    %r15
        popq    %rbp
        vzeroupper
        retq
        nopl    (%rax)
; └

What next ?

This is already far beyond my zone of comfort, but I’d like to get to the bottom of it,
for intellectual satisfaction, and because it’s a strong bottleneck for the ODE on the Xeon.
[To be fair, it’s already faster than a vectorized octave version,
so julia + DifferentialEquations.jl rocks !]
Any idea ?

chris-b1 · May 4, 2021, 7:46pm

Just poking through the assembly - what sticks out to me is that the instructions are identical, just in a different order. Take a look below, from L352 to the iteration check, which I believe is most of the core loop.

Truly only speculation - but perhaps the more packed order in the Ryzen code (moves, subs, muls are all grouped together) allow the processor to take advantage of more instruction level parallelism, that the Xeon can’t see.

Ryzen

L352:
        vmovsd  -16(%rcx,%rdx), %xmm1           # xmm1 = mem[0],zero
        vmovsd  -8(%rcx,%rdx), %xmm2            # xmm2 = mem[0],zero
        vmovsd  (%rcx,%rdx), %xmm3              # xmm3 = mem[0],zero
        addq    $8, %rcx
        addq    $-8, %rax
        vsubsd  %xmm2, %xmm1, %xmm1
        vsubsd  %xmm2, %xmm3, %xmm2
        vmulsd  (%r8,%rdx), %xmm1, %xmm1
        vmulsd  (%r9,%rdx), %xmm2, %xmm2
        vmulsd  (%rsi,%rdx), %xmm0, %xmm3
        addq    $8, %rsi
        addq    $8, %r9
        addq    $8, %r8
        vaddsd  %xmm2, %xmm1, %xmm1
        vaddsd  (%r10,%rdx), %xmm1, %xmm1
        addq    $8, %r10
        vsubsd  %xmm3, %xmm1, %xmm1
        vmovsd  %xmm1, (%rdi,%rdx)
        addq    $8, %rdi
        cmpq    %rax, %rdx
; │└└
        jne     L352

Xeon

L352:
        vmovsd  -16(%rcx,%rdx), %xmm1           # xmm1 = mem[0],zero
        vmovsd  -8(%rcx,%rdx), %xmm2            # xmm2 = mem[0],zero
        vsubsd  %xmm2, %xmm1, %xmm1
        vmulsd  (%r8,%rdx), %xmm1, %xmm1
        vmovsd  (%rcx,%rdx), %xmm3              # xmm3 = mem[0],zero
        vsubsd  %xmm2, %xmm3, %xmm2
        vmulsd  (%r9,%rdx), %xmm2, %xmm2
        vaddsd  %xmm2, %xmm1, %xmm1
        vaddsd  (%r10,%rdx), %xmm1, %xmm1
        vmulsd  (%rsi,%rdx), %xmm0, %xmm2
        vsubsd  %xmm2, %xmm1, %xmm1
        vmovsd  %xmm1, (%rdi,%rdx)
        addq    $8, %rcx
        addq    $8, %rdi
        addq    $8, %rsi
        addq    $8, %r10
        addq    $8, %r9

        addq    $8, %r8
        addq    $-8, %rax
        cmpq    %rax, %rdx
; │└└
        jne     L352

pjentsch0 · May 4, 2021, 7:57pm

I was curious, so I tested on a Zen 3 Ryzen

julia> @benchmark time_evolution(ddu, du0, u0, γ, 0.0)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     14.770 μs (0.00% GC)
  median time:      17.060 μs (0.00% GC)
  mean time:        17.153 μs (0.00% GC)
  maximum time:     70.999 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

and on an Intel 9900hk

BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     25.340 μs (0.00% GC)
  median time:      25.803 μs (0.00% GC)
  mean time:        26.297 μs (0.00% GC)
  maximum time:     93.291 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

the problem seems to be pre-coffee lake Xeon-specific I guess

jroon · May 4, 2021, 8:10pm

So I’m a total Julia noob so I can’t comment on the code side of it - but in general don’t Ryzen chips tend to have a truck load more cache than intel chips? Looking up the two chips you mention:

Intel® Xeon® Gold 6146 Processor: L3 Cache 24.75 MB
AMD Ryzen™ 9 3900X: Total L3 Cache 64MB

Could that be a factor? (at least I’d heard Julia makes efficient use of cache!)

foobar_lv2 · May 4, 2021, 8:37pm

You should be memory limited. Try running the following benchmark (for me on a dektop skylake without avx512).

From this, you can see: Small N_points operate out of caches, large ones operate out of main memory. Furthermore, there is a significant (almost 20%) and stable (under repetitions of the benchmark) dependence on where the allocator chooses to place your arrays. This is presumably due to limited cache associativity.

You should still be memory limited (your data fits in L3 and significant parts fit into the super beefy L2 of your xeon). But maybe this benchmark helps figure out the issue.

If I’m right about you being memory/cache-limited (i.e. small enough that caches are useful, but too large to comfortably fit into cache) then there are only very few changes in code that could help. You could consider non-temporal writes to ddu, but that probably won’t help in your real workload (which presumably needs to read from ddu in between calls to time_evolution).

You could try something with explicit/manual prefetching, but the access pattern should be simple enough for your hardware prefetchers.

If there is some extreme cache-way conflict (i.e. some combination of addresses gives super fast timings), then you could try to discover the lucky alignment numbers and then manually allocate your arrays at good adresses.

julia> using BenchmarkTools

julia> lowbits(a)=reinterpret(UInt64,pointer(a))%UInt16;

julia> function time_evolution(ddu, du, u, sigma, right_C, left_C, gamma, N_points)
               @inbounds for s in 2:N_points-1
                       ddu[s] = (
                               left_C[s] * (u[s-1] - u[s]) 
                               + right_C[s] * (u[s+1] - u[s])
                               + sigma[s]
                               - 2*gamma * du[s]
                       )
               end
       end
time_evolution (generic function with 1 method)

julia> for N in [1 , 4, 10, 40, 100, 400, 1000, 4_000, 10_000, 40_000, 100_000, 400_000]
       N_points = N*100
       mem_kb = N_points * 6 * 8 / 1024
       @show N_points, mem_kb
       for i=1:4
       u0 = randn(N_points)
       left_C = randn(N_points).+500
       right_C = randn(N_points).+500
       sigma = randn(N_points)
       gamma = 1e-5
       u0 = randn(N_points)
       du0 = randn(N_points)
       ddu = zeros(Float64, N_points)
       @show lowbits.([ddu, du0, u0, sigma, right_C, left_C])
       t_nanos = 1e9 * @belapsed time_evolution($ddu, $du0, $u0, $sigma, $right_C, $left_C, $gamma, $N_points)
       mem_gbps = (1e9 / (1<<30))* N_points * 8 * 6 / t_nanos
       @show t_nanos/1000, mem_gbps, t_nanos/N_points
       t_nanos = 1e9 * @belapsed time_evolution($ddu, $du0, $u0, $sigma, $right_C, $left_C, $gamma, $N_points)
       mem_gbps = (1e9 / (1<<30))* N_points * 8 * 6 / t_nanos
       @show t_nanos/1000, mem_gbps, t_nanos/N_points

       end
       println()
       end
(N_points, mem_kb) = (100, 4.6875)
lowbits.([ddu, du0, u0, sigma, right_C, left_C]) = UInt16[0x8ad0, 0x8750, 0x83d0, 0x8050, 0xfbd0, 0xf4d0]
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (0.03394662638469285, 131.6875588017211, 0.33946626384692846)
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (0.03252870090634441, 137.4278170845242, 0.3252870090634441)
lowbits.([ddu, du0, u0, sigma, right_C, left_C]) = UInt16[0xaa50, 0xa6d0, 0xa350, 0x9fd0, 0x9c50, 0x9550]
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (0.03742237903225806, 119.45655176815082, 0.37422379032258063)
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (0.03730544354838709, 119.83099336042, 0.37305443548387096)
lowbits.([ddu, du0, u0, sigma, right_C, left_C]) = UInt16[0x6a50, 0x6350, 0x43d0, 0x4050, 0xbbd0, 0xb4d0]
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (0.03387512588116817, 131.96551280240257, 0.3387512588116817)
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (0.03388016112789526, 131.94590017677427, 0.33880161127895264)
lowbits.([ddu, du0, u0, sigma, right_C, left_C]) = UInt16[0x4050, 0x7bd0, 0x7850, 0x74d0, 0x7150, 0x98d0]
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (0.033860020140986914, 132.02438567787576, 0.33860020140986913)
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (0.033854984894259824, 132.04402164457184, 0.3385498489425982)

(N_points, mem_kb) = (400, 18.75)
lowbits.([ddu, du0, u0, sigma, right_C, left_C]) = UInt16[0xcec0, 0x6700, 0xd6c0, 0x9400, 0x7340, 0xae40]
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (0.1320775462962963, 135.38556654060596, 0.3301938657407408)
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (0.13207060185185185, 135.39268529021595, 0.3301765046296296)
lowbits.([ddu, du0, u0, sigma, right_C, left_C]) = UInt16[0xd140, 0xab00, 0x9280, 0x2340, 0x3340, 0x6480]
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (0.11707658643326038, 152.73244614806472, 0.292691466083151)
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (0.11238840262582057, 159.10354640549934, 0.28097100656455143)
lowbits.([ddu, du0, u0, sigma, right_C, left_C]) = UInt16[0xa200, 0x76c0, 0x4e40, 0x8b80, 0x3940, 0x2c00]
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (0.11332272228320527, 157.79177443275432, 0.28330680570801314)
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (0.11329088913282108, 157.83611170756393, 0.2832272228320527)
lowbits.([ddu, du0, u0, sigma, right_C, left_C]) = UInt16[0x5300, 0x3100, 0xbfc0, 0x2340, 0xd5c0, 0x4200]
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (0.11647182320441989, 153.52548745831447, 0.2911795580110497)
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (0.11646850828729283, 153.52985708814234, 0.29117127071823207)

(N_points, mem_kb) = (1000, 46.875)
lowbits.([ddu, du0, u0, sigma, right_C, left_C]) = UInt16[0xe8c0, 0xc8c0, 0x60c0, 0x40c0, 0x44c0, 0xd140]
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (0.4567040816326531, 97.88282036309876, 0.4567040816326531)
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (0.4566751269035533, 97.88902646100111, 0.4566751269035533)
lowbits.([ddu, du0, u0, sigma, right_C, left_C]) = UInt16[0xd880, 0xb880, 0x52c0, 0x32c0, 0x12c0, 0xd240]
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (0.4500561224489796, 99.32868669420392, 0.4500561224489796)
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (0.4510765306122449, 99.1039891188022, 0.4510765306122449)
lowbits.([ddu, du0, u0, sigma, right_C, left_C]) = UInt16[0x9ec0, 0x7ec0, 0x6680, 0x4680, 0x4680, 0xb4c0]
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (0.4403248730964467, 101.5238663834267, 0.4403248730964467)
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (0.4412323232323232, 101.31506969856586, 0.4412323232323232)
lowbits.([ddu, du0, u0, sigma, right_C, left_C]) = UInt16[0x2c00, 0x6340, 0x76c0, 0x56c0, 0xa540, 0x4600]
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (0.4345989847715736, 102.86145423243278, 0.4345989847715736)
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (0.43455837563451777, 102.87106655410669, 0.43455837563451777)

(N_points, mem_kb) = (4000, 187.5)
lowbits.([ddu, du0, u0, sigma, right_C, left_C]) = UInt16[0x6d00, 0x52c0, 0xd580, 0x8000, 0x2180, 0x7f40]
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (1.855, 96.39565192785545, 0.46375)
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (1.8455, 96.89186362837815, 0.461375)
lowbits.([ddu, du0, u0, sigma, right_C, left_C]) = UInt16[0x9b40, 0x1600, 0x98c0, 0xe080, 0x6340, 0x2b40]
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (1.8642, 95.91993043995916, 0.46605)
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (1.8587, 96.20376302048307, 0.464675)
lowbits.([ddu, du0, u0, sigma, right_C, left_C]) = UInt16[0x0e80, 0x5280, 0xd540, 0x3700, 0xb9c0, 0x6500]
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (1.8464, 96.84463514199084, 0.4616)
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (1.8493, 96.6927671692921, 0.462325)
lowbits.([ddu, du0, u0, sigma, right_C, left_C]) = UInt16[0x8080, 0x9140, 0x1400, 0xf300, 0xf280, 0xd800]
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (1.9028, 93.9741088533592, 0.4757)
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (1.9028, 93.9741088533592, 0.4757)

(N_points, mem_kb) = (10000, 468.75)
lowbits.([ddu, du0, u0, sigma, right_C, left_C]) = UInt16[0x4bc0, 0x1300, 0xda40, 0xa180, 0x68c0, 0xf740]
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (5.930166666666667, 75.38318245391018, 0.5930166666666667)
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (5.804, 77.02185317288588, 0.5804)
lowbits.([ddu, du0, u0, sigma, right_C, left_C]) = UInt16[0xae00, 0x7540, 0x3c80, 0x03c0, 0xcb00, 0x5980]
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (5.923166666666667, 75.47227031972137, 0.5923166666666667)
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (5.923, 75.47439402590405, 0.5923)
lowbits.([ddu, du0, u0, sigma, right_C, left_C]) = UInt16[0xe780, 0xaec0, 0x7600, 0x3d40, 0x0480, 0x9300]
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (5.9265, 75.42982127991726, 0.59265)
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (5.944833333333333, 75.1972024697238, 0.5944833333333333)
lowbits.([ddu, du0, u0, sigma, right_C, left_C]) = UInt16[0x9180, 0x58c0, 0x2000, 0xe740, 0xae80, 0x3d00]
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (5.847666666666667, 76.44670281287631, 0.5847666666666667)
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (5.853833333333333, 76.36617073975965, 0.5853833333333333)

(N_points, mem_kb) = (40000, 1875.0)
lowbits.([ddu, du0, u0, sigma, right_C, left_C]) = UInt16[0x1600, 0x33c0, 0x5180, 0x6f40, 0x8d00, 0xc880]
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (24.466, 73.08670576562244, 0.61165)
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (24.457, 73.11360114739006, 0.611425)
lowbits.([ddu, du0, u0, sigma, right_C, left_C]) = UInt16[0x4300, 0x60c0, 0x7e80, 0x9c40, 0xba00, 0xf580]
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (23.921, 74.75186418885994, 0.598025)
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (24.446, 73.14650017433195, 0.61115)
lowbits.([ddu, du0, u0, sigma, right_C, left_C]) = UInt16[0x5500, 0x72c0, 0x9080, 0xae40, 0xcc00, 0x0780]
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (25.661, 69.68315121241257, 0.641525)
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (25.677, 69.63972984623277, 0.641925)
lowbits.([ddu, du0, u0, sigma, right_C, left_C]) = UInt16[0x6700, 0x84c0, 0xa280, 0xc040, 0xde00, 0x1980]
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (25.593, 69.86829770881565, 0.639825)
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (25.675, 69.64515455741845, 0.641875)

(N_points, mem_kb) = (100000, 4687.5)
lowbits.([ddu, du0, u0, sigma, right_C, left_C]) = UInt16[0x1040, 0x5040, 0x9040, 0xd040, 0x1040, 0x9040]
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (77.693, 57.53862456275722, 0.77693)
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (77.631, 57.58457778663546, 0.77631)
lowbits.([ddu, du0, u0, sigma, right_C, left_C]) = UInt16[0xf340, 0xbe00, 0x88c0, 0x5380, 0x1e40, 0xb3c0]
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (66.863, 66.8583275975397, 0.66863)
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (65.823, 67.91468572010236, 0.65823)
lowbits.([ddu, du0, u0, sigma, right_C, left_C]) = UInt16[0xd280, 0x9d40, 0x6800, 0x32c0, 0xfd80, 0x9300]
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (65.941, 67.7931538519934, 0.65941)
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (66.938, 66.78341686567117, 0.66938)
lowbits.([ddu, du0, u0, sigma, right_C, left_C]) = UInt16[0x7c80, 0x4740, 0x1200, 0xdcc0, 0xa780, 0x3d00]
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (66.028, 67.70382804498541, 0.66028)
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (65.816, 67.92190893026464, 0.65816)

(N_points, mem_kb) = (400000, 18750.0)
lowbits.([ddu, du0, u0, sigma, right_C, left_C]) = UInt16[0x3040, 0x1040, 0xf040, 0xd040, 0xb040, 0x7040]
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (436.723, 40.94447380288464, 1.0918075)
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (431.508, 41.43930919616134, 1.07877)
lowbits.([ddu, du0, u0, sigma, right_C, left_C]) = UInt16[0x53c0, 0x7f80, 0xab40, 0xd700, 0x02c0, 0x5a40]
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (443.625, 40.30745208817625, 1.1090625)
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (445.717, 40.11826659655608, 1.1142925)
lowbits.([ddu, du0, u0, sigma, right_C, left_C]) = UInt16[0xca00, 0xf5c0, 0x2180, 0x4d40, 0x7900, 0xd080]
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (441.404, 40.51026595277158, 1.10351)
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (433.791, 41.22121812720224, 1.0844775)
lowbits.([ddu, du0, u0, sigma, right_C, left_C]) = UInt16[0x4040, 0x6c00, 0x97c0, 0xc380, 0xef40, 0x46c0]
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (440.782, 40.56743113969533, 1.101955)
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (442.69, 40.39258495248862, 1.106725)

(N_points, mem_kb) = (1000000, 46875.0)
lowbits.([ddu, du0, u0, sigma, right_C, left_C]) = UInt16[0x1480, 0x3040, 0x5040, 0x7040, 0x9040, 0xd040]
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (1559.064, 28.673283188851112, 1.559064)
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (1555.83, 28.732884429239036, 1.55583)
lowbits.([ddu, du0, u0, sigma, right_C, left_C]) = UInt16[0xb8c0, 0xa680, 0x9440, 0x8200, 0x6fc0, 0x4b40]
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (1455.033, 30.723346880478292, 1.455033)
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (1449.022, 30.850797007597517, 1.449022)
lowbits.([ddu, du0, u0, sigma, right_C, left_C]) = UInt16[0x5d00, 0x4ac0, 0x3880, 0x2640, 0x1400, 0xef80]
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (1452.639, 30.77398003326564, 1.452639)
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (1448.2, 30.86830795576783, 1.4482)
lowbits.([ddu, du0, u0, sigma, right_C, left_C]) = UInt16[0x0140, 0xef00, 0xdcc0, 0xca80, 0xb840, 0x93c0]
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (1423.281, 31.408754547796935, 1.423281)
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (1422.887, 31.41745168909616, 1.422887)

(N_points, mem_kb) = (4000000, 187500.0)
lowbits.([ddu, du0, u0, sigma, right_C, left_C]) = UInt16[0x6040, 0xb040, 0x0040, 0x5040, 0xa040, 0x4040]
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (8069.494999999999, 22.159247180421065, 2.01737375)
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (8072.107, 22.15207681540543, 2.01802675)
lowbits.([ddu, du0, u0, sigma, right_C, left_C]) = UInt16[0x5580, 0x0d40, 0xc500, 0x7cc0, 0x3480, 0xa400]
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (7226.929, 24.74272741937438, 1.80673225)
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (7228.528, 24.737254158270105, 1.807132)
lowbits.([ddu, du0, u0, sigma, right_C, left_C]) = UInt16[0xdfc0, 0x9780, 0x4f40, 0x0700, 0xbec0, 0x2e40]
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (7227.96, 24.739198103776427, 1.80699)
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (7197.324, 24.844502529853024, 1.799331)
lowbits.([ddu, du0, u0, sigma, right_C, left_C]) = UInt16[0x21c0, 0xd980, 0x9140, 0x4900, 0x00c0, 0x7040]
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (7221.523, 24.761249715076982, 1.80538075)
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (7220.353, 24.765262075991558, 1.80508825)

(N_points, mem_kb) = (10000000, 468750.0)
lowbits.([ddu, du0, u0, sigma, right_C, left_C]) = UInt16[0xa040, 0x6040, 0x2040, 0xe040, 0xa040, 0x2040]
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (20965.002, 21.322909285457243, 2.0965002)
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (20953.444, 21.334671083924423, 2.0953444)
lowbits.([ddu, du0, u0, sigma, right_C, left_C]) = UInt16[0xa040, 0x6040, 0x2040, 0xe040, 0xa040, 0x2040]
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (21026.907, 21.26013282958971, 2.1026907)
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (20924.579, 21.364101797002927, 2.0924579)
lowbits.([ddu, du0, u0, sigma, right_C, left_C]) = UInt16[0x2040, 0xe040, 0xa040, 0x6040, 0x2040, 0xa040]
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (20993.041, 21.29442970246329, 2.0993041)
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (21018.918, 21.268213512009975, 2.1018918)
lowbits.([ddu, du0, u0, sigma, right_C, left_C]) = UInt16[0xa040, 0x6040, 0x2040, 0xe040, 0xa040, 0x6040]
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (20947.731, 21.340489612714126, 2.0947731)
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (20960.897, 21.32708518225292, 2.0960897)

(N_points, mem_kb) = (40000000, 1.875e6)
lowbits.([ddu, du0, u0, sigma, right_C, left_C]) = UInt16[0xc040, 0xa040, 0x8040, 0x6040, 0x4040, 0x0040]
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (84812.457, 21.08345172999432, 2.120311425)
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (84807.175, 21.08476485936147, 2.120179375)
lowbits.([ddu, du0, u0, sigma, right_C, left_C]) = UInt16[0x8040, 0x6040, 0x2040, 0xe040, 0x4040, 0x0040]
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (84841.027, 21.076351931262202, 2.121025675)
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (84796.896, 21.087320734732067, 2.1199224)
lowbits.([ddu, du0, u0, sigma, right_C, left_C]) = UInt16[0x4040, 0x2040, 0x2040, 0xe040, 0x0040, 0xc040]
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (84839.67, 21.076689044897496, 2.12099175)
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (84784.099, 21.090503577347903, 2.119602475)
lowbits.([ddu, du0, u0, sigma, right_C, left_C]) = UInt16[0x0040, 0xe040, 0xe040, 0xa040, 0xc040, 0x8040]
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (84898.934, 21.06197638785098, 2.12247335)
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (85035.511, 21.028148384522776, 2.125887775)

julia> a=rand(1000*1000*100); b=copy(a); t=@belapsed a.+=b; mem_gbps = length(a)*2*8  / (t * (1<<30))
17.184361168680823

ederag · May 4, 2021, 11:45pm

[message part 1 (due to discourse size limit)]

Thanks for the answers !

@jroon Indeed, I did not suspect cache issue since the xeon has a larger caches:
Machine A (AMD Ryzen)
Screenshot_20210505_004242

Machine B (Intel Xeon)
Screenshot_20210505_003753
(from hardinfo gui, Devices > Processor > click one of the cores)

But @foobar_lv2, your test is enlighting.
Indeed, there’s a cross-over between 10_000 (the 6 arrays fit in the L2 cache) and 40_000 points (too large).

Machine B (Intel Xeon)

julia> using BenchmarkTools

julia> lowbits(a)=reinterpret(UInt64,pointer(a))%UInt16;

julia> function time_evolution(ddu, du, u, sigma, right_C, left_C, gamma, N_points)
                      @inbounds for s in 2:N_points-1
                              ddu[s] = (
                                      left_C[s] * (u[s-1] - u[s]) 
                                      + right_C[s] * (u[s+1] - u[s])
                                      + sigma[s]
                                      - 2*gamma * du[s]
                              )
                      end
              end
time_evolution (generic function with 1 method)

julia> for N in [1 , 4, 10, 40, 100, 400, 1000, 4_000, 10_000, 40_000, 100_000, 400_000]
                  N_points = N*100
                  mem_kb = N_points * 6 * 8 / 1024
                  @show N_points, mem_kb
                  for i=1:4
                      u0 = randn(N_points)
                      left_C = randn(N_points).+500
                      right_C = randn(N_points).+500
                      sigma = randn(N_points)
                      gamma = 1e-5
                      u0 = randn(N_points)
                      du0 = randn(N_points)
                      ddu = zeros(Float64, N_points)
                      @show lowbits.([ddu, du0, u0, sigma, right_C, left_C])
                      t_nanos = 1e9 * @belapsed time_evolution($ddu, $du0, $u0, $sigma, $right_C, $left_C, $gamma, $N_points)
                      mem_gbps = (1e9 / (1<<30))* N_points * 8 * 6 / t_nanos
                      @show t_nanos/1000, mem_gbps, t_nanos/N_points
                      t_nanos = 1e9 * @belapsed time_evolution($ddu, $du0, $u0, $sigma, $right_C, $left_C, $gamma, $N_points)
                      mem_gbps = (1e9 / (1<<30))* N_points * 8 * 6 / t_nanos
                      @show t_nanos/1000, mem_gbps, t_nanos/N_points

                  end
                  println()
              end
(N_points, mem_kb) = (100, 4.6875)
lowbits.([ddu, du0, u0, sigma, right_C, left_C]) = UInt16[0xa350, 0x9fd0, 0x9c50, 0x98d0, 0x9550, 0x8e50]
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (0.04987955465587044, 89.62286028807138, 0.49879554655870445)
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (0.04987449392712551, 89.63195425473751, 0.49874493927125507)
lowbits.([ddu, du0, u0, sigma, right_C, left_C]) = UInt16[0x9c50, 0x98d0, 0x9550, 0x91d0, 0x8e50, 0x8750]
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (0.050037411526794744, 89.34012015710388, 0.5003741152679474)
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (0.050188068756319516, 89.07193420530662, 0.5018806875631951)
lowbits.([ddu, du0, u0, sigma, right_C, left_C]) = UInt16[0xb850, 0xb4d0, 0xb150, 0xadd0, 0xaa50, 0xa350]
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (0.046411764705882354, 96.31929288798865, 0.4641176470588235)
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (0.046411943319838056, 96.31892220818766, 0.4641194331983806)
lowbits.([ddu, du0, u0, sigma, right_C, left_C]) = UInt16[0xb850, 0xb4d0, 0xb150, 0xadd0, 0x83d0, 0x8ad0]
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (0.046393724696356275, 96.35674624989518, 0.46393724696356275)
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (0.04639068825910931, 96.36305314518579, 0.4639068825910931)

(N_points, mem_kb) = (400, 18.75)
lowbits.([ddu, du0, u0, sigma, right_C, left_C]) = UInt16[0x5840, 0xc4c0, 0x02c0, 0x68c0, 0x4400, 0x2380]
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (0.1674450331125828, 106.78963179871995, 0.41861258278145697)
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (0.16816666666666666, 106.33137819197535, 0.42041666666666666)
lowbits.([ddu, du0, u0, sigma, right_C, left_C]) = UInt16[0xb940, 0x0ac0, 0xda40, 0xcc40, 0x8640, 0xeec0]
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (0.16800933333333334, 106.4309528396271, 0.4200233333333333)
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (0.16915415549597856, 105.71063643211707, 0.4228853887399464)
lowbits.([ddu, du0, u0, sigma, right_C, left_C]) = UInt16[0x53c0, 0x6080, 0xadc0, 0xf500, 0xed40, 0x2c00]
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (0.1680407055630936, 106.41108279507507, 0.420101763907734)
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (0.168202170963365, 106.30893364932736, 0.4205054274084125)
lowbits.([ddu, du0, u0, sigma, right_C, left_C]) = UInt16[0x61c0, 0xcb80, 0x5680, 0x9a80, 0x51c0, 0x7b80]
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (0.16645718050065875, 107.42338287140711, 0.41614295125164685)
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (0.16632674571805006, 107.50762515926907, 0.41581686429512515)

(N_points, mem_kb) = (1000, 46.875)
lowbits.([ddu, du0, u0, sigma, right_C, left_C]) = UInt16[0xd5c0, 0xb5c0, 0x95c0, 0x9700, 0x7700, 0xee00]
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (0.6510674846625767, 68.66182789747374, 0.6510674846625767)
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (0.6508773006134969, 68.68189064114978, 0.6508773006134969)
lowbits.([ddu, du0, u0, sigma, right_C, left_C]) = UInt16[0xc0c0, 0x4240, 0x2240, 0x1f80, 0xff80, 0x4840]
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (0.636452380952381, 70.23853617241423, 0.636452380952381)
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (0.6362976190476191, 70.25561976556361, 0.6362976190476191)
lowbits.([ddu, du0, u0, sigma, right_C, left_C]) = UInt16[0x88c0, 0x27c0, 0x07c0, 0xb4c0, 0x94c0, 0x7980]
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (0.6394850299401197, 69.9054418611315, 0.6394850299401197)
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (0.6394491017964071, 69.90936957446226, 0.6394491017964071)
lowbits.([ddu, du0, u0, sigma, right_C, left_C]) = UInt16[0x4040, 0xd2c0, 0xb2c0, 0xe280, 0x7a80, 0x8200]
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (0.6450181818181818, 69.30577283191069, 0.6450181818181818)
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (0.6454363636363636, 69.26086923532672, 0.6454363636363636)

(N_points, mem_kb) = (4000, 187.5)
lowbits.([ddu, du0, u0, sigma, right_C, left_C]) = UInt16[0x1300, 0x95c0, 0x6740, 0xea00, 0xba00, 0xd000]
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (2.4694444444444446, 72.41059207808985, 0.6173611111111111)
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (2.4693333333333336, 72.41385029407608, 0.6173333333333334)
lowbits.([ddu, du0, u0, sigma, right_C, left_C]) = UInt16[0xf700, 0x79c0, 0xf4c0, 0x7780, 0x7b40, 0xe900]
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (2.445, 73.13453346673697, 0.61125)
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (2.4456666666666664, 73.11459765278937, 0.6114166666666666)
lowbits.([ddu, du0, u0, sigma, right_C, left_C]) = UInt16[0x3d40, 0xf6c0, 0x7980, 0x1500, 0x97c0, 0xb480]
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (2.458222222222222, 72.74115932632195, 0.6145555555555555)
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (2.4584444444444444, 72.7345841514755, 0.6146111111111111)
lowbits.([ddu, du0, u0, sigma, right_C, left_C]) = UInt16[0xd940, 0x5c00, 0xbc00, 0x3ec0, 0x9900, 0x5d00]
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (2.5593333333333335, 69.86738772838181, 0.6398333333333334)
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (2.5606666666666666, 69.83100793784375, 0.6401666666666667)

(N_points, mem_kb) = (10000, 468.75)
lowbits.([ddu, du0, u0, sigma, right_C, left_C]) = UInt16[0x0c40, 0xd380, 0x9ac0, 0x6200, 0x2940, 0xb7c0]
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (6.168399999999999, 72.47176509555634, 0.6168399999999999)
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (6.1621999999999995, 72.54468141498648, 0.61622)
lowbits.([ddu, du0, u0, sigma, right_C, left_C]) = UInt16[0x0c80, 0xd3c0, 0x9b00, 0x6240, 0x2980, 0xb800]
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (6.168399999999999, 72.47176509555634, 0.6168399999999999)
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (6.1632, 72.53291079559801, 0.61632)
lowbits.([ddu, du0, u0, sigma, right_C, left_C]) = UInt16[0x6280, 0x29c0, 0xf100, 0xb840, 0x7f80, 0x4500]
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (6.160399999999999, 72.56587815976718, 0.6160399999999999)
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (6.1632, 72.53291079559801, 0.61632)
lowbits.([ddu, du0, u0, sigma, right_C, left_C]) = UInt16[0x7fc0, 0x4700, 0x0e40, 0xd580, 0x9cc0, 0x4580]
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (6.2661999999999995, 71.34065874300688, 0.62662)
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (6.263, 71.37710934303524, 0.6263)

(N_points, mem_kb) = (40000, 1875.0)
lowbits.([ddu, du0, u0, sigma, right_C, left_C]) = UInt16[0x9040, 0x8040, 0x7040, 0x6040, 0x5040, 0x3040]
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (69.969, 25.556165491313564, 1.749225)
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (69.901, 25.581026641417417, 1.747525)
lowbits.([ddu, du0, u0, sigma, right_C, left_C]) = UInt16[0xcd00, 0xeac0, 0x0880, 0x2640, 0x4400, 0x7f80]
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (69.221, 25.83232463070049, 1.730525)
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (69.481, 25.735659291917486, 1.737025)
lowbits.([ddu, du0, u0, sigma, right_C, left_C]) = UInt16[0xc1c0, 0xdf80, 0xfd40, 0x1b00, 0x38c0, 0x7440]
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (69.053, 25.895172451040775, 1.726325)
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (69.34, 25.787991682459168, 1.7335)
lowbits.([ddu, du0, u0, sigma, right_C, left_C]) = UInt16[0xb600, 0xd3c0, 0xf180, 0x0f40, 0x2d00, 0x6880]
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (70.251, 25.453578500828726, 1.756275)
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (68.335, 26.16725460249826, 1.708375)

(N_points, mem_kb) = (100000, 4687.5)
lowbits.([ddu, du0, u0, sigma, right_C, left_C]) = UInt16[0x5040, 0x9040, 0xd040, 0x1040, 0x5040, 0xd040]
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (178.227, 25.082329603002332, 1.78227)
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (178.281, 25.074732350358687, 1.78281)
lowbits.([ddu, du0, u0, sigma, right_C, left_C]) = UInt16[0x4240, 0x0d00, 0xd7c0, 0xa280, 0x6d40, 0x02c0]
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (176.16, 25.376636910503503, 1.7616)
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (177.582, 25.17343175633959, 1.77582)
lowbits.([ddu, du0, u0, sigma, right_C, left_C]) = UInt16[0x2180, 0xec40, 0xb700, 0x81c0, 0x4c80, 0xe200]
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (177.554, 25.1774015688427, 1.77554)
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (177.604, 25.17031349606032, 1.77604)
lowbits.([ddu, du0, u0, sigma, right_C, left_C]) = UInt16[0x00c0, 0xcb80, 0x9640, 0x6100, 0x2bc0, 0xc140]
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (177.447, 25.19258346522791, 1.77447)
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (175.629, 25.453361108668254, 1.75629)

(N_points, mem_kb) = (400000, 18750.0)
lowbits.([ddu, du0, u0, sigma, right_C, left_C]) = UInt16[0x7040, 0x5040, 0x3040, 0x1040, 0xf040, 0xb040]
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (740.867, 24.13576719251524, 1.8521675)
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (741.322, 24.12095342188305, 1.853305)
lowbits.([ddu, du0, u0, sigma, right_C, left_C]) = UInt16[0xd800, 0x03c0, 0x2f80, 0x5b40, 0x8700, 0xde80]
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (730.895, 24.465064657190414, 1.8272375)
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (731.379, 24.448874567928787, 1.8284475)
lowbits.([ddu, du0, u0, sigma, right_C, left_C]) = UInt16[0x4e40, 0x7a00, 0xa5c0, 0xd180, 0xfd40, 0x54c0]
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (727.084, 24.593297930661638, 1.81771)
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (728.166, 24.556754136580377, 1.820415)
lowbits.([ddu, du0, u0, sigma, right_C, left_C]) = UInt16[0xc480, 0xf040, 0x1c00, 0x47c0, 0x7380, 0xcb00]
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (742.281, 24.089790029136118, 1.8557025)
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (737.916, 24.232288543163705, 1.84479)

(N_points, mem_kb) = (1000000, 46875.0)
lowbits.([ddu, du0, u0, sigma, right_C, left_C]) = UInt16[0x98c0, 0x7040, 0x9040, 0xb040, 0xd040, 0x1040]
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (2558.256, 17.47420257454413, 2.558256)
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (2576.688, 17.34920315596726, 2.576688)
lowbits.([ddu, du0, u0, sigma, right_C, left_C]) = UInt16[0x3d00, 0x2ac0, 0x1880, 0x0640, 0xf400, 0xcf80]
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (2522.034, 17.725170866666733, 2.522034)
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (2522.888, 17.71917087938227, 2.522888)
lowbits.([ddu, du0, u0, sigma, right_C, left_C]) = UInt16[0xe140, 0xcf00, 0xbcc0, 0xaa80, 0x9840, 0x73c0]
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (2500.808, 17.875616033515154, 2.500808)
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (2511.586, 17.798906181808217, 2.511586)
lowbits.([ddu, du0, u0, sigma, right_C, left_C]) = UInt16[0x8580, 0x7340, 0x6100, 0x4ec0, 0x3c80, 0x1800]
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (2503.022, 17.859804500936455, 2.503022)
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (2513.545, 17.785034117767125, 2.513545)

(N_points, mem_kb) = (4000000, 187500.0)
lowbits.([ddu, du0, u0, sigma, right_C, left_C]) = UInt16[0xa040, 0xf040, 0x4040, 0x9040, 0xe040, 0x8040]
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (12994.315, 13.76093578816366, 3.24857875)
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (13010.586, 13.743726402959242, 3.2526465)
lowbits.([ddu, du0, u0, sigma, right_C, left_C]) = UInt16[0xd9c0, 0x9180, 0x4940, 0x0100, 0xb8c0, 0x2840]
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (12875.05, 13.888406982976523, 3.2187625)
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (12881.221, 13.88175347089937, 3.22030525)
lowbits.([ddu, du0, u0, sigma, right_C, left_C]) = UInt16[0x6400, 0x1bc0, 0xd380, 0x8b40, 0x4300, 0xb280]
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (12963.264, 13.793897457166025, 3.240816)
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (12960.208, 13.797150040043483, 3.240052)
lowbits.([ddu, du0, u0, sigma, right_C, left_C]) = UInt16[0xa600, 0x5dc0, 0x1580, 0xcd40, 0x8500, 0xf480]
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (12926.117, 13.833538279606465, 3.23152925)
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (12934.029, 13.82507603208342, 3.23350725)

(N_points, mem_kb) = (10000000, 468750.0)
lowbits.([ddu, du0, u0, sigma, right_C, left_C]) = UInt16[0xe040, 0xa040, 0x6040, 0x2040, 0xe040, 0x6040]
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (33977.639, 13.156736282218716, 3.3977639)
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (33996.742, 13.149343422832391, 3.3996742)
lowbits.([ddu, du0, u0, sigma, right_C, left_C]) = UInt16[0xe040, 0xa040, 0x6040, 0x2040, 0xe040, 0x6040]
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (33799.499, 13.226078759789596, 3.3799499)
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (33802.126, 13.225050868558673, 3.3802126)
lowbits.([ddu, du0, u0, sigma, right_C, left_C]) = UInt16[0x6040, 0x2040, 0xe040, 0xa040, 0x6040, 0xe040]
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (33983.385, 13.154511706689304, 3.3983385)
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (33976.375, 13.157225743341652, 3.3976375)
lowbits.([ddu, du0, u0, sigma, right_C, left_C]) = UInt16[0xe040, 0xa040, 0x6040, 0x2040, 0xe040, 0xa040]
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (33786.237, 13.231270348794087, 3.3786237)
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (33809.631, 13.222115195975658, 3.3809631)

(N_points, mem_kb) = (40000000, 1.875e6)
lowbits.([ddu, du0, u0, sigma, right_C, left_C]) = UInt16[0x0040, 0xe040, 0xc040, 0xa040, 0x8040, 0x4040]
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (138239.607, 12.935072531432462, 3.455990175)
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (138315.92, 12.927935867843114, 3.457898)
lowbits.([ddu, du0, u0, sigma, right_C, left_C]) = UInt16[0x0040, 0xe040, 0xc040, 0xa040, 0x8040, 0x4040]
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (138544.52, 12.906604629773295, 3.463613)
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (138581.4, 12.903169857294838, 3.464535)
lowbits.([ddu, du0, u0, sigma, right_C, left_C]) = UInt16[0x8040, 0x6040, 0x6040, 0x2040, 0x4040, 0x6040]
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (139461.388, 12.821752091422743, 3.4865347)
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (139498.011, 12.818385942877127, 3.487450275)
lowbits.([ddu, du0, u0, sigma, right_C, left_C]) = UInt16[0x4040, 0x2040, 0x2040, 0x2040, 0x0040, 0xc040]
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (136778.032, 13.07329340183604, 3.4194508)
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (136776.169, 13.07347147047026, 3.419404225)


julia> a=rand(1000*1000*100); b=copy(a); t=@belapsed a.+=b; mem_gbps = length(a)*2*8  / (t * (1<<30))
12.531947410633537

there is a significant (almost 20%) and stable (under repetitions of the benchmark) dependence on where the allocator chooses to place your arrays.

Indeed, the same is observed on the xeon.

ederag · May 4, 2021, 11:45pm

[ message part 2 (due to discourse size limit) ]

The Ryzen seems less sensitive to L2 cache filling, which is very interesting,
and explains the discrepancy between both machines.

Machine A (AMD Ryzen)

julia> using BenchmarkTools

julia> lowbits(a)=reinterpret(UInt64,pointer(a))%UInt16;

julia> function time_evolution(ddu, du, u, sigma, right_C, left_C, gamma, N_points)
                      @inbounds for s in 2:N_points-1
                              ddu[s] = (
                                      left_C[s] * (u[s-1] - u[s]) 
                                      + right_C[s] * (u[s+1] - u[s])
                                      + sigma[s]
                                      - 2*gamma * du[s]
                              )
                      end
              end
time_evolution (generic function with 1 method)

julia> for N in [1 , 4, 10, 40, 100, 400, 1000, 4_000, 10_000, 40_000, 100_000, 400_000]
                  N_points = N*100
                  mem_kb = N_points * 6 * 8 / 1024
                  @show N_points, mem_kb
                  for i=1:4
                      u0 = randn(N_points)
                      left_C = randn(N_points).+500
                      right_C = randn(N_points).+500
                      sigma = randn(N_points)
                      gamma = 1e-5
                      u0 = randn(N_points)
                      du0 = randn(N_points)
                      ddu = zeros(Float64, N_points)
                      @show lowbits.([ddu, du0, u0, sigma, right_C, left_C])
                      t_nanos = 1e9 * @belapsed time_evolution($ddu, $du0, $u0, $sigma, $right_C, $left_C, $gamma, $N_points)
                      mem_gbps = (1e9 / (1<<30))* N_points * 8 * 6 / t_nanos
                      @show t_nanos/1000, mem_gbps, t_nanos/N_points
                      t_nanos = 1e9 * @belapsed time_evolution($ddu, $du0, $u0, $sigma, $right_C, $left_C, $gamma, $N_points)
                      mem_gbps = (1e9 / (1<<30))* N_points * 8 * 6 / t_nanos
                      @show t_nanos/1000, mem_gbps, t_nanos/N_points

                  end
                  println()
              end
(N_points, mem_kb) = (100, 4.6875)
lowbits.([ddu, du0, u0, sigma, right_C, left_C]) = UInt16[0x4050, 0x3bd0, 0x3850, 0x34d0, 0x3150, 0x2a50]
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (0.03353776435045317, 133.2929741959348, 0.33537764350453175)
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (0.0334773413897281, 133.53355351945422, 0.33477341389728094)
lowbits.([ddu, du0, u0, sigma, right_C, left_C]) = UInt16[0x1fd0, 0x1c50, 0x18d0, 0x1550, 0x11d0, 0x0ad0]
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (0.03373011077542799, 132.532869160065, 0.33730110775427996)
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (0.03369989929506546, 132.65168299208753, 0.33699899295065455)
lowbits.([ddu, du0, u0, sigma, right_C, left_C]) = UInt16[0x0ad0, 0x0750, 0x03d0, 0x0050, 0x3bd0, 0x34d0]
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (0.03353877139979859, 133.28897188467502, 0.3353877139979859)
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (0.03390130916414904, 131.86359076898816, 0.3390130916414904)
lowbits.([ddu, du0, u0, sigma, right_C, left_C]) = UInt16[0x34d0, 0x3150, 0x2dd0, 0x2a50, 0x1550, 0x0e50]
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (0.03383081570996979, 132.1383556482472, 0.3383081570996979)
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (0.033639476334340376, 132.88995089352224, 0.3363947633434038)

(N_points, mem_kb) = (400, 18.75)
lowbits.([ddu, du0, u0, sigma, right_C, left_C]) = UInt16[0x0f80, 0x8380, 0xa3c0, 0xea80, 0x3e00, 0x7d80]
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (0.1277793952967525, 139.9395684342516, 0.31944848824188127)
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (0.12844754464285713, 139.21164069220342, 0.32111886160714287)
lowbits.([ddu, du0, u0, sigma, right_C, left_C]) = UInt16[0x2100, 0x7240, 0xf440, 0x2e00, 0xbb00, 0x2c40]
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (0.12412555555555554, 144.05891963652815, 0.3103138888888889)
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (0.12355345211581294, 144.72597184784485, 0.3088836302895323)
lowbits.([ddu, du0, u0, sigma, right_C, left_C]) = UInt16[0x3cc0, 0xa100, 0xe000, 0x0480, 0x4300, 0x4740]
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (0.12389977728285077, 144.32143321610465, 0.30974944320712694)
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (0.124265625, 143.89653963127122, 0.3106640625)
lowbits.([ddu, du0, u0, sigma, right_C, left_C]) = UInt16[0x2a40, 0x3440, 0x8dc0, 0xf900, 0xde40, 0xc600]
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (0.1238963963963964, 144.32537146122579, 0.309740990990991)
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (0.12324164810690424, 145.0921316559011, 0.30810412026726064)

(N_points, mem_kb) = (1000, 46.875)
lowbits.([ddu, du0, u0, sigma, right_C, left_C]) = UInt16[0x9d00, 0x7d00, 0x5d00, 0x1c80, 0xfc80, 0xf400]
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (0.3585971563981043, 124.66212512827191, 0.3585971563981043)
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (0.35878095238095237, 124.59826332719292, 0.35878095238095237)
lowbits.([ddu, du0, u0, sigma, right_C, left_C]) = UInt16[0xd500, 0xb500, 0x2600, 0x0600, 0x3a40, 0x4800]
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (0.3535990566037736, 126.42421620382201, 0.3535990566037736)
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (0.35364622641509436, 126.40735357111372, 0.35364622641509436)
lowbits.([ddu, du0, u0, sigma, right_C, left_C]) = UInt16[0xb7c0, 0x97c0, 0x77c0, 0xef80, 0xcb40, 0x9600]
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (0.3648269230769231, 122.53340078225894, 0.3648269230769231)
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (0.360875, 123.87525758654095, 0.360875)
lowbits.([ddu, du0, u0, sigma, right_C, left_C]) = UInt16[0x1300, 0xf300, 0x87c0, 0x67c0, 0x47c0, 0xf3c0]
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (0.35776388888888894, 124.95247555693518, 0.35776388888888894)
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (0.3558909952606635, 125.61004402148758, 0.3558909952606635)

(N_points, mem_kb) = (4000, 187.5)
lowbits.([ddu, du0, u0, sigma, right_C, left_C]) = UInt16[0x2cc0, 0x7f40, 0x2c40, 0xe5c0, 0x3c40, 0x3d40]
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (1.4387, 124.28854822143036, 0.359675)
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (1.4558, 122.82864014711627, 0.36395)
lowbits.([ddu, du0, u0, sigma, right_C, left_C]) = UInt16[0xd580, 0x5840, 0x9180, 0x3bc0, 0x3fc0, 0x3f40]
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (1.4127, 126.57601353873567, 0.353175)
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (1.4136, 126.49542609378317, 0.3534)
lowbits.([ddu, du0, u0, sigma, right_C, left_C]) = UInt16[0x8bc0, 0x0e80, 0x3cc0, 0x4140, 0x4040, 0x3dc0]
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (1.3745999999999998, 130.08434040897126, 0.34364999999999996)
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (1.3756, 129.98977488090426, 0.3439)
lowbits.([ddu, du0, u0, sigma, right_C, left_C]) = UInt16[0xee80, 0x4780, 0xca40, 0x3700, 0xb9c0, 0x40c0]
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (1.3725999999999998, 130.27388483620274, 0.34314999999999996)
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (1.3716, 130.36886433812472, 0.3429)

(N_points, mem_kb) = (10000, 468.75)
lowbits.([ddu, du0, u0, sigma, right_C, left_C]) = UInt16[0xb580, 0x7cc0, 0x4400, 0x0b40, 0xd280, 0x6100]
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (3.68325, 121.36966967092368, 0.368325)
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (3.6845, 121.32849391109504, 0.36845)
lowbits.([ddu, du0, u0, sigma, right_C, left_C]) = UInt16[0x7d00, 0x4440, 0x0b80, 0xd2c0, 0x9a00, 0x2880]
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (3.6995, 120.83655516027292, 0.36995)
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (3.727125, 119.94092921901725, 0.3727125)
lowbits.([ddu, du0, u0, sigma, right_C, left_C]) = UInt16[0xd300, 0x9a40, 0x6180, 0x28c0, 0xf000, 0xee40]
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (3.645625, 122.62227623944582, 0.3645625)
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (3.63075, 123.12465353313495, 0.363075)
lowbits.([ddu, du0, u0, sigma, right_C, left_C]) = UInt16[0xf040, 0xb780, 0x7ec0, 0x4600, 0x0d40, 0xef40]
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (3.6445, 122.660127813261, 0.36445)
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (3.650625, 122.45432927661146, 0.3650625)

(N_points, mem_kb) = (40000, 1875.0)
lowbits.([ddu, du0, u0, sigma, right_C, left_C]) = UInt16[0x3c80, 0x5a40, 0x7800, 0x95c0, 0xb380, 0xef00]
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (16.691, 107.13194795169365, 0.417275)
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (16.642, 107.44738272213188, 0.41605)
lowbits.([ddu, du0, u0, sigma, right_C, left_C]) = UInt16[0x3140, 0x4f00, 0x6cc0, 0x8a80, 0xa840, 0xe3c0]
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (17.212, 103.88910895083191, 0.4303)
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (17.202, 103.94950257305655, 0.43005)
lowbits.([ddu, du0, u0, sigma, right_C, left_C]) = UInt16[0x2580, 0x4340, 0x6100, 0x7ec0, 0x9c80, 0xd800]
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (17.243, 103.70233388979405, 0.431075)
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (16.912, 105.7319857652388, 0.4228)
lowbits.([ddu, du0, u0, sigma, right_C, left_C]) = UInt16[0x19c0, 0x3780, 0x5540, 0x7300, 0x90c0, 0xcc40]
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (16.832, 106.23451421469336, 0.4208)
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (16.712, 106.99732786391328, 0.4178)

(N_points, mem_kb) = (100000, 4687.5)
lowbits.([ddu, du0, u0, sigma, right_C, left_C]) = UInt16[0xa600, 0x70c0, 0x3b80, 0x0640, 0xd100, 0x6680]
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (42.611, 104.91066527784602, 0.42611)
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (43.343, 103.13887728478178, 0.43343)
lowbits.([ddu, du0, u0, sigma, right_C, left_C]) = UInt16[0x8540, 0x5000, 0x1ac0, 0xe580, 0xb040, 0x45c0]
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (43.462, 102.85648056127874, 0.43462)
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (43.352, 103.1174653569454, 0.43352)
lowbits.([ddu, du0, u0, sigma, right_C, left_C]) = UInt16[0x6480, 0x2f40, 0xfa00, 0xc4c0, 0x8f80, 0x2500]
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (43.072, 103.7878054920667, 0.43072)
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (43.021, 103.91084256884537, 0.43021)
lowbits.([ddu, du0, u0, sigma, right_C, left_C]) = UInt16[0x43c0, 0x0e80, 0xd940, 0xa400, 0x6ec0, 0x0440]
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (42.561, 105.0339126936467, 0.42561)
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (43.302, 103.23653314291019, 0.43302)

(N_points, mem_kb) = (400000, 18750.0)
lowbits.([ddu, du0, u0, sigma, right_C, left_C]) = UInt16[0xe040, 0xc040, 0xa040, 0x8040, 0x6040, 0x2040]
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (388.088, 46.07561540840528, 0.97022)
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (382.698, 46.72455417226426, 0.956745)
lowbits.([ddu, du0, u0, sigma, right_C, left_C]) = UInt16[0x1b00, 0x46c0, 0x7280, 0x9e40, 0xca00, 0x2180]
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (386.454, 46.27043175285335, 0.966135)
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (389.891, 45.862544743574965, 0.9747275)
lowbits.([ddu, du0, u0, sigma, right_C, left_C]) = UInt16[0x9140, 0xbd00, 0xe8c0, 0x1480, 0x4040, 0x97c0]
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (360.025, 49.66708820947764, 0.9000625)
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (359.684, 49.7141753111542, 0.89921)
lowbits.([ddu, du0, u0, sigma, right_C, left_C]) = UInt16[0x0780, 0x3340, 0x5f00, 0x8ac0, 0xb680, 0x0e00]
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (354.834, 50.39368671721759, 0.887085)
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (356.989, 50.089480159380784, 0.8924725)

(N_points, mem_kb) = (1000000, 46875.0)
lowbits.([ddu, du0, u0, sigma, right_C, left_C]) = UInt16[0xdbc0, 0xe040, 0x0040, 0x2040, 0x4040, 0x8040]
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (1571.378, 28.448586897323857, 1.571378)
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (1569.705, 28.478907553675988, 1.569705)
lowbits.([ddu, du0, u0, sigma, right_C, left_C]) = UInt16[0x8000, 0x6dc0, 0x5b80, 0x4940, 0x3700, 0x1280]
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (1524.71, 29.31933520573943, 1.52471)
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (1545.269, 28.929256706465328, 1.545269)
lowbits.([ddu, du0, u0, sigma, right_C, left_C]) = UInt16[0x2440, 0x1200, 0xffc0, 0xed80, 0xdb40, 0xb6c0]
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (1551.17, 28.8192032991503, 1.55117)
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (1551.66, 28.81010245900711, 1.55166)
lowbits.([ddu, du0, u0, sigma, right_C, left_C]) = UInt16[0xc880, 0xb640, 0xa400, 0x91c0, 0x7f80, 0x5b00]
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (1550.629, 28.829258050470468, 1.550629)
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (1538.917, 29.048664470886322, 1.538917)

(N_points, mem_kb) = (4000000, 187500.0)
lowbits.([ddu, du0, u0, sigma, right_C, left_C]) = UInt16[0x1040, 0x6040, 0xb040, 0x0040, 0x5040, 0xf040]
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (6284.297, 28.45408712003457, 1.57107425)
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (6388.277, 27.99094878418263, 1.59706925)
lowbits.([ddu, du0, u0, sigma, right_C, left_C]) = UInt16[0x1cc0, 0xd480, 0x8c40, 0x4400, 0xfbc0, 0x6b40]
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (6247.069, 28.623652840423546, 1.56176725)
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (6245.536, 28.63067866811942, 1.561384)
lowbits.([ddu, du0, u0, sigma, right_C, left_C]) = UInt16[0xa700, 0x5ec0, 0x1680, 0xce40, 0x8600, 0xf580]
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (6258.099, 28.57320319256245, 1.56452475)
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (6254.132, 28.591327193953035, 1.563533)
lowbits.([ddu, du0, u0, sigma, right_C, left_C]) = UInt16[0xe900, 0xa0c0, 0x5880, 0x1040, 0xc800, 0x3780]
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (6167.197, 28.994360700034697, 1.54179925)
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (6186.633, 28.903271670741077, 1.54665825)

(N_points, mem_kb) = (10000000, 468750.0)
lowbits.([ddu, du0, u0, sigma, right_C, left_C]) = UInt16[0x5040, 0x1040, 0xd040, 0x9040, 0x5040, 0xd040]
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (16120.955, 27.730046750668908, 1.6120955)
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (16131.303, 27.71225832255644, 1.6131303)
lowbits.([ddu, du0, u0, sigma, right_C, left_C]) = UInt16[0x5040, 0x1040, 0xd040, 0x9040, 0x5040, 0xd040]
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (16100.224, 27.765752564400948, 1.6100224)
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (16102.849999999999, 27.76122461647657, 1.6102849999999997)
lowbits.([ddu, du0, u0, sigma, right_C, left_C]) = UInt16[0xd040, 0x9040, 0x5040, 0x1040, 0xd040, 0x5040]
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (16164.086000000001, 27.656054033332268, 1.6164086000000002)
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (15997.331, 27.94433870346433, 1.5997331)
lowbits.([ddu, du0, u0, sigma, right_C, left_C]) = UInt16[0x5040, 0x1040, 0xd040, 0x9040, 0x5040, 0x1040]
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (16254.438999999998, 27.502323261690528, 1.6254438999999998)
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (16229.961000000001, 27.54380221957586, 1.6229961000000002)

(N_points, mem_kb) = (40000000, 1.875e6)
lowbits.([ddu, du0, u0, sigma, right_C, left_C]) = UInt16[0x7040, 0x5040, 0x3040, 0x1040, 0xf040, 0xb040]
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (64217.66099999999, 27.84497777428734, 1.6054415249999998)
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (64295.337999999996, 27.811337476159142, 1.6073834499999997)
lowbits.([ddu, du0, u0, sigma, right_C, left_C]) = UInt16[0x7040, 0x5040, 0x3040, 0x1040, 0xf040, 0xb040]
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (63684.296, 28.07818340744033, 1.5921074)
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (64843.39, 27.576277909926034, 1.62108475)
lowbits.([ddu, du0, u0, sigma, right_C, left_C]) = UInt16[0xf040, 0xd040, 0xd040, 0x9040, 0xb040, 0xd040]
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (64967.03600000001, 27.52379442494065, 1.6241759000000002)
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (64889.238, 27.556793674502984, 1.62223095)
lowbits.([ddu, du0, u0, sigma, right_C, left_C]) = UInt16[0xb040, 0x9040, 0x9040, 0x9040, 0x7040, 0x3040]
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (64075.481, 27.906764262319292, 1.601887025)
(t_nanos / 1000, mem_gbps, t_nanos / N_points) = (64158.76900000001, 27.870536968402845, 1.6039692250000002)


julia> a=rand(1000*1000*100); b=copy(a); t=@belapsed a.+=b; mem_gbps = length(a)*2*8  / (t * (1<<30))
19.768651298981965

So thanks a lot for solving this mystery !

johnh · May 5, 2021, 2:43am

Just a comment regarding caches and memory access. I configure AMD Rome and soon Milan systems. There is a setting called Numa Per Socket
It might be interesting to see how Julia benchmarks are changed by this setting

The four logical quadrants in a Rome processor allow the CPU to be partitioned into different NUMA domains. This setting is called NUMA per socket or NPS .

https://www.dell.com/support/kbdoc/en-uk/000137696/amd-rome-is-it-for-real-architecture-and-initial-hpc-performance

johnh · May 5, 2021, 2:45am

To get a good overview of the topology of caches and IO devices the utility lstopo is very useful. On a Linux ssytem just use your package manager to install it.

https://www.open-mpi.org/projects/hwloc/lstopo/

ederag · May 5, 2021, 10:00am

Here is the output of lstopo-no-graphics:

machine A (12 cores AMD Ryzen 9 3900X)

Machine (63GB total)
  Package L#0
    NUMANode L#0 (P#0 63GB)
    L3 L#0 (16MB)
      L2 L#0 (512KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0 + PU L#0 (P#0)
      L2 L#1 (512KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1 + PU L#1 (P#1)
      L2 L#2 (512KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2 + PU L#2 (P#2)
    L3 L#1 (16MB)
      L2 L#3 (512KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3 + PU L#3 (P#3)
      L2 L#4 (512KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4 + PU L#4 (P#4)
      L2 L#5 (512KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5 + PU L#5 (P#5)
    L3 L#2 (16MB)
      L2 L#6 (512KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6 + PU L#6 (P#6)
      L2 L#7 (512KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7 + PU L#7 (P#7)
      L2 L#8 (512KB) + L1d L#8 (32KB) + L1i L#8 (32KB) + Core L#8 + PU L#8 (P#8)
    L3 L#3 (16MB)
      L2 L#9 (512KB) + L1d L#9 (32KB) + L1i L#9 (32KB) + Core L#9 + PU L#9 (P#9)
      L2 L#10 (512KB) + L1d L#10 (32KB) + L1i L#10 (32KB) + Core L#10 + PU L#10 (P#10)
      L2 L#11 (512KB) + L1d L#11 (32KB) + L1i L#11 (32KB) + Core L#11 + PU L#11 (P#11)

Machine B (2 x 12 cores Intel Xeon Gold 6146)

Machine (126GB total)
  Package L#0
    NUMANode L#0 (P#0 63GB)
    L3 L#0 (25MB)
      L2 L#0 (1024KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0 + PU L#0 (P#0)
      L2 L#1 (1024KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1 + PU L#1 (P#1)
      L2 L#2 (1024KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2 + PU L#2 (P#2)
      L2 L#3 (1024KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3 + PU L#3 (P#3)
      L2 L#4 (1024KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4 + PU L#4 (P#4)
      L2 L#5 (1024KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5 + PU L#5 (P#5)
      L2 L#6 (1024KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6 + PU L#6 (P#6)
      L2 L#7 (1024KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7 + PU L#7 (P#7)
      L2 L#8 (1024KB) + L1d L#8 (32KB) + L1i L#8 (32KB) + Core L#8 + PU L#8 (P#8)
      L2 L#9 (1024KB) + L1d L#9 (32KB) + L1i L#9 (32KB) + Core L#9 + PU L#9 (P#9)
      L2 L#10 (1024KB) + L1d L#10 (32KB) + L1i L#10 (32KB) + Core L#10 + PU L#10 (P#10)
      L2 L#11 (1024KB) + L1d L#11 (32KB) + L1i L#11 (32KB) + Core L#11 + PU L#11 (P#11)
    HostBridge
      PCI 00:16.2 (IDE)
      PCI 00:17.0 (RAID)
        Block(Disk) "sdd"
        Block(Disk) "sdb"
        Block(Disk) "sdc"
        Block(Disk) "sda"
      PCI 00:1f.6 (Ethernet)
        Net "eth0"
    HostBridge
      PCIBridge
        PCI 9e:00.0 (VGA)
  Package L#1
    NUMANode L#1 (P#1 63GB)
    L3 L#1 (25MB)
      L2 L#12 (1024KB) + L1d L#12 (32KB) + L1i L#12 (32KB) + Core L#12 + PU L#12 (P#12)
      L2 L#13 (1024KB) + L1d L#13 (32KB) + L1i L#13 (32KB) + Core L#13 + PU L#13 (P#13)
      L2 L#14 (1024KB) + L1d L#14 (32KB) + L1i L#14 (32KB) + Core L#14 + PU L#14 (P#14)
      L2 L#15 (1024KB) + L1d L#15 (32KB) + L1i L#15 (32KB) + Core L#15 + PU L#15 (P#15)
      L2 L#16 (1024KB) + L1d L#16 (32KB) + L1i L#16 (32KB) + Core L#16 + PU L#16 (P#16)
      L2 L#17 (1024KB) + L1d L#17 (32KB) + L1i L#17 (32KB) + Core L#17 + PU L#17 (P#17)
      L2 L#18 (1024KB) + L1d L#18 (32KB) + L1i L#18 (32KB) + Core L#18 + PU L#18 (P#18)
      L2 L#19 (1024KB) + L1d L#19 (32KB) + L1i L#19 (32KB) + Core L#19 + PU L#19 (P#19)
      L2 L#20 (1024KB) + L1d L#20 (32KB) + L1i L#20 (32KB) + Core L#20 + PU L#20 (P#20)
      L2 L#21 (1024KB) + L1d L#21 (32KB) + L1i L#21 (32KB) + Core L#21 + PU L#21 (P#21)
      L2 L#22 (1024KB) + L1d L#22 (32KB) + L1i L#22 (32KB) + Core L#22 + PU L#22 (P#22)
      L2 L#23 (1024KB) + L1d L#23 (32KB) + L1i L#23 (32KB) + Core L#23 + PU L#23 (P#23)

And the graphical lstopo is even clearer:
A:
Screenshot_20210505_121038

B:

Thanks for the tip !

ederag · May 16, 2021, 10:41pm

Timings (lower better) plotted below, with the same scales.

Machine A ( AMD Ryzen 9 3900X):

Machine B (bi-Xeon Gold 6146):

Bottom axis is the number of points, top axis is the memory used by the 6 arrays of N points.
The blue vertical lines indicate the L1, L2, L3 cache sizes.

Bottom line: the Ryzen has much faster L3 caches than the xeon.

Topic		Replies	Views
Could this kernel go faster? Performance	3	816	January 4, 2022
Julia/JuMP efficiency comparison between different CPUs Performance jump	5	118	September 2, 2024
Why is my GPU kernel an order of magnitude slower than my CPU function? GPU question	8	218	June 4, 2025
Simulating quantum Hamiltonian evolution with Yao Quantum question , package	8	351	August 13, 2024
Relative performance discrepancy across cpus even with --cpu-target set Performance simd , benchmark	5	1015	August 14, 2019

Same code much faster on a Ryzen than on a Xeon?

Machines comparison

Generated code

What next ?

Related topics