The time_evolution ODE kernel below is blazing fast on a Ryzen, and more than 4 times slower on a Xeon.
This difference is striking, as both machines usually have similar single-thread performance (more on that below).
I'm looking for clues to unravel this mystery.
The kernel comes from a method of lines / finite difference scheme, with spatially varying parameters.
Here is the stripped down version used for the benchmarks below:
using BenchmarkTools

const N_points = 40_000
const left_C = 500 .+ randn(N_points);
const right_C = 500 .+ randn(N_points);
const sigma = rand(N_points);
const γ = 1e-5
const u0 = randn(N_points);
const du0 = randn(N_points);
const ddu = zeros(Float64, N_points);

function time_evolution(ddu, du, u, γ, t)
    @inbounds for s in 2:N_points-1
        ddu[s] = (
            left_C[s] * (u[s-1] - u[s])
            + right_C[s] * (u[s+1] - u[s])
            + sigma[s]
            - 2γ * du[s]
        )
    end
end
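Read off the loop body, the kernel evaluates the per-point update below. The notation is introduced here for illustration only: $C^{L}_s$, $C^{R}_s$ and $\sigma_s$ stand for left_C[s], right_C[s] and sigma[s], $N$ for N_points, and the dots assume that du and ddu hold the first and second time derivatives of u.

$$
\ddot{u}_s = C^{L}_s\,(u_{s-1} - u_s) + C^{R}_s\,(u_{s+1} - u_s) + \sigma_s - 2\gamma\,\dot{u}_s,
\qquad s = 2, \dots, N-1
$$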
Machine A (AMD Ryzen):
@benchmark time_evolution(ddu, du0, u0, γ, 0.0)
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 16.201 μs (0.00% GC)
median time: 16.982 μs (0.00% GC)
mean time: 17.056 μs (0.00% GC)
maximum time: 156.137 μs (0.00% GC)
Machine B (Intel Xeon):
@benchmark time_evolution(ddu, du0, u0, γ, 0.0)
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 69.923 μs (0.00% GC)
median time: 70.935 μs (0.00% GC)
mean time: 71.088 μs (0.00% GC)
maximum time: 149.609 μs (0.00% GC)
16 µs for the AMD Ryzen, 70 µs for the Intel Xeon, so the Xeon is about 4.3 times slower.
This is reproducible, and both machines are idle otherwise.
Results are similar with @avx from LoopVectorization.jl.
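For scale, here is a back-of-the-envelope estimate of the data touched per call and of the effective data rate implied by the minimum times above. It is only a sketch: it assumes each of the six arrays is streamed exactly once per call and ignores any write-allocate traffic on ddu. The variable names are made up for the illustration.

```julia
# Rough estimate of the working set and of the effective data rate per call.
# Assumption: each array is streamed once per call (5 read, 1 written).
N = 40_000                       # same value as N_points above
n_arrays = 6                     # left_C, right_C, sigma, u, du (read) + ddu (written)

working_set_bytes = n_arrays * N * sizeof(Float64)

# Effective data rate in GB/s for a call that takes `t` seconds.
data_rate(t) = working_set_bytes / t / 1e9

t_ryzen = 16.201e-6              # minimum time on machine A, in seconds
t_xeon  = 69.923e-6              # minimum time on machine B, in seconds

println("working set: ", working_set_bytes / 1e6, " MB")
println("Ryzen: ", round(data_rate(t_ryzen), digits = 1), " GB/s")
println("Xeon:  ", round(data_rate(t_xeon), digits = 1), " GB/s")
println("ratio: ", round(t_xeon / t_ryzen, digits = 2))
```

The working set comes out at 1.92 MB, far larger than the 100 kB buffer used in the likwid-bench run further down, so that run and this kernel may not be exercising the same cache level.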
Machine comparison
julia-1.6.1
Machine A: single AMD Ryzen 9 3900X, 64 GB RAM
Machine B: dual Xeon Gold 6146, 128 GB ECC RAM
versioninfo() details
versioninfo()
Machine A:
Julia Version 1.6.1
Commit 6aaedecc447 (2021-04-23 05:59 UTC)
Platform Info:
OS: Linux (x86_64-suse-linux)
CPU: AMD Ryzen 9 3900X 12-Core Processor
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-11.0.1 (ORCJIT, znver2)
Machine B:
Julia Version 1.6.1
Commit 6aaedecc447 (2021-04-23 05:59 UTC)
Platform Info:
OS: Linux (x86_64-suse-linux)
CPU: Intel(R) Xeon(R) Gold 6146 CPU @ 3.20GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-11.0.1 (ORCJIT, skylake-avx512)
Note: the issue was first noticed under Julia 1.5.3, so it is not a 1.6 regression.
Single-thread memory access is slower on Xeons,
but that can't explain the 4.3× slowdown.
Machine A (AMD Ryzen):
sysbench memory run --memory-block-size=64 | grep MiB/sec
(567.96 MiB/sec)
sysbench memory run --memory-block-size=1M | grep MiB/sec
(27673.99 MiB/sec)
Machine B (Intel Xeon):
sysbench memory run --memory-block-size=64 | grep MiB/sec
(550.30 MiB/sec)
sysbench memory run --memory-block-size=1M | grep MiB/sec
(19341.05 MiB/sec)
likwid-bench -s 10 -t stream -w S0:100kB:1
Results
Machine A (AMD Ryzen):
LIKWID MICRO BENCHMARK
Test: stream
--------------------------------------------------------------------------------
Using 1 work groups
Using 1 threads
--------------------------------------------------------------------------------
Running without Marker API. Activate Marker API with -m on commandline.
--------------------------------------------------------------------------------
Group: 0 Thread 0 Global Thread 0 running on hwthread 0 - Vector length 4164 Offset 0
--------------------------------------------------------------------------------
Cycles: 45549095124
CPU Clock: 3792770502
Cycle Clock: 3792770502
Time: 1.200945e+01 sec
Iterations: 8388608
Iterations per thread: 8388608
Inner loop executions: 1041
Size (Byte): 99936
Size per thread: 99936
Number of Flops: 69860327424
MFlops/s: 5817.11
Data volume (Byte): 838323929088
MByte/s: 69805.34
Cycles per update: 1.304005
Cycles per cacheline: 10.432037
Loads per update: 2
Stores per update: 1
Load bytes per element: 16
Store bytes per elem.: 8
Load/store ratio: 2.00
Instructions: 165918277649
UOPs: 227046064128
Machine B (Intel Xeon):
LIKWID MICRO BENCHMARK
Test: stream
--------------------------------------------------------------------------------
Using 1 work groups
Using 1 threads
--------------------------------------------------------------------------------
Running without Marker API. Activate Marker API with -m on commandline.
--------------------------------------------------------------------------------
Group: 0 Thread 0 Global Thread 0 running on hwthread 0 - Vector length 4164 Offset 0
--------------------------------------------------------------------------------
Cycles: 54668218488
CPU Clock: 3192441309
Cycle Clock: 3192441309
Time: 1.712427e+01 sec
Iterations: 8388608
Iterations per thread: 8388608
Inner loop executions: 1041
Size (Byte): 99936
Size per thread: 99936
Number of Flops: 69860327424
MFlops/s: 4079.61
Data volume (Byte): 838323929088
MByte/s: 48955.32
Cycles per update: 1.565072
Cycles per cacheline: 12.520575
Loads per update: 2
Stores per update: 1
Load bytes per element: 16
Store bytes per elem.: 8
Load/store ratio: 2.00
Instructions: 165918277649
UOPs: 227046064128
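One way to test whether the gap depends on which cache level the data lives in is to time the same stencil at several problem sizes on both machines. The sketch below does that with Base's @elapsed rather than BenchmarkTools, to stay dependency-free. kernel! and ns_per_point are names introduced here, and kernel! is a hypothetical variant of time_evolution that takes every array as an argument instead of reading const globals. A jump in ns/point between two sizes would hint at a cache boundary; a flat curve would point elsewhere.

```julia
# Hypothetical variant of the kernel that takes every array as an argument,
# so that the problem size can be varied (the original reads const globals).
function kernel!(ddu, du, u, left_C, right_C, sigma, γ)
    @inbounds for s in 2:length(u)-1
        ddu[s] = (
            left_C[s] * (u[s-1] - u[s])
            + right_C[s] * (u[s+1] - u[s])
            + sigma[s]
            - 2γ * du[s]
        )
    end
    return nothing
end

# Best-of-`reps` time per interior point, in nanoseconds.
# Each timing covers `inner` calls so small sizes stay above timer resolution.
function ns_per_point(n; reps = 20)
    left_C, right_C = 500 .+ randn(n), 500 .+ randn(n)
    sigma, u, du, ddu = rand(n), randn(n), randn(n), zeros(n)
    kernel!(ddu, du, u, left_C, right_C, sigma, 1e-5)   # compile before timing
    inner = max(1, 1_000_000 ÷ n)
    best = Inf
    for _ in 1:reps
        t = @elapsed for _ in 1:inner
            kernel!(ddu, du, u, left_C, right_C, sigma, 1e-5)
        end
        best = min(best, t)
    end
    return 1e9 * best / (inner * (n - 2))
end

for n in (1_000, 4_000, 16_000, 40_000, 160_000, 640_000)
    kB = 6 * n * sizeof(Float64) / 1000   # six Float64 arrays of length n
    println(n, " points, ", kB, " kB: ", round(ns_per_point(n), digits = 3), " ns/point")
end
```

The sizes are arbitrary; they are only meant to bracket the 40_000-point case from well below to well above it.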
General benchmarks, from hardinfo --generate-report
Results
Machine A (AMD Ryzen):
CPU Blowfish: 0.309 # seconds, lower better, multiple cores
CPU CryptoHash: 2320.186 # MiB/s, higher better, multiple cores
CPU Fibonacci: 0.925 # seconds, lower better, single core
CPU N-Queens: 0.330 # seconds, lower better, single core
FPU FFT: 0.673 # seconds, lower is better, single core
FPU Raytracing: 11.102 # John Walker's FBENCH, seconds, lower is better multiple cores
Machine B (Intel Xeon):
CPU Blowfish: 0.359
CPU CryptoHash: 3070.564
CPU Fibonacci: 0.968
CPU N-Queens: 0.395
FPU FFT: 0.563
FPU Raytracing: 8.367
Generated code
On both machines, the following commands
give strictly the same result (checked with diff):
@code_warntype time_evolution(ddu, du0, u0, γ, 0.0)
result
Variables
#self#::Core.Const(time_evolution)
ddu::Vector{Float64}
du::Vector{Float64}
u::Vector{Float64}
γ::Float64
t::Float64
@_7::UNION{NOTHING, TUPLE{INT64, INT64}}
val::Nothing
s::Int64
Body::Nothing
1 ─ Core.NewvarNode(:(val))
│ $(Expr(:inbounds, true))
│ %3 = (Main.N_points - 1)::Core.Const(39999)
│ %4 = (2:%3)::Core.Const(2:39999)
│ (@_7 = Base.iterate(%4))
│ %6 = (@_7::Core.Const((2, 2)) === nothing)::Core.Const(false)
│ %7 = Base.not_int(%6)::Core.Const(true)
└── goto #4 if not %7
2 ┄ %9 = @_7::Tuple{Int64, Int64}::Tuple{Int64, Int64}
│ (s = Core.getfield(%9, 1))
│ %11 = Core.getfield(%9, 2)::Int64
│ %12 = Base.getindex(Main.left_C, s)::Float64
│ %13 = (s - 1)::Int64
│ %14 = Base.getindex(u, %13)::Float64
│ %15 = Base.getindex(u, s)::Float64
│ %16 = (%14 - %15)::Float64
│ %17 = (%12 * %16)::Float64
│ %18 = Base.getindex(Main.right_C, s)::Float64
│ %19 = (s + 1)::Int64
│ %20 = Base.getindex(u, %19)::Float64
│ %21 = Base.getindex(u, s)::Float64
│ %22 = (%20 - %21)::Float64
│ %23 = (%18 * %22)::Float64
│ %24 = Base.getindex(Main.sigma, s)::Float64
│ %25 = (%17 + %23 + %24)::Float64
│ %26 = (2 * γ)::Float64
│ %27 = Base.getindex(du, s)::Float64
│ %28 = (%26 * %27)::Float64
│ %29 = (%25 - %28)::Float64
│ Base.setindex!(ddu, %29, s)
│ (@_7 = Base.iterate(%4, %11))
│ %32 = (@_7 === nothing)::Bool
│ %33 = Base.not_int(%32)::Bool
└── goto #4 if not %33
3 ─ goto #2
4 ┄ (val = nothing)
│ $(Expr(:inbounds, :pop))
└── return val
@code_lowered time_evolution(ddu, du0, u0, γ, 0.0)
Result
CodeInfo(
1 ─ Core.NewvarNode(:(val))
│ $(Expr(:inbounds, true))
│ %3 = Main.N_points - 1
│ %4 = 2:%3
│ @_7 = Base.iterate(%4)
│ %6 = @_7 === nothing
│ %7 = Base.not_int(%6)
└── goto #4 if not %7
2 ┄ %9 = @_7
│ s = Core.getfield(%9, 1)
│ %11 = Core.getfield(%9, 2)
│ %12 = Base.getindex(Main.left_C, s)
│ %13 = s - 1
│ %14 = Base.getindex(u, %13)
│ %15 = Base.getindex(u, s)
│ %16 = %14 - %15
│ %17 = %12 * %16
│ %18 = Base.getindex(Main.right_C, s)
│ %19 = s + 1
│ %20 = Base.getindex(u, %19)
│ %21 = Base.getindex(u, s)
│ %22 = %20 - %21
│ %23 = %18 * %22
│ %24 = Base.getindex(Main.sigma, s)
│ %25 = %17 + %23 + %24
│ %26 = 2 * γ
│ %27 = Base.getindex(du, s)
│ %28 = %26 * %27
│ %29 = %25 - %28
│ Base.setindex!(ddu, %29, s)
│ @_7 = Base.iterate(%4, %11)
│ %32 = @_7 === nothing
│ %33 = Base.not_int(%32)
└── goto #4 if not %33
3 ─ goto #2
4 ┄ val = nothing
│ $(Expr(:inbounds, :pop))
└── return val
)
This one gives almost the same result:
@code_llvm time_evolution(ddu, du0, u0, γ, 0.0)
Result on machine A
; @ /home/ederag/share/coll/combe/oscpar/julia/oscpar/plutos/time_evolution_mwe.jl:56 within `time_evolution'
define void @julia_time_evolution_1094({}* nonnull align 16 dereferenceable(40) %0, {}* nonnull align 16 dereferenceable(40) %1, {}* nonnull align 16 dereferenceable(40) %2, double %3, double %4) {
top:
; @ /home/ederag/share/coll/combe/oscpar/julia/oscpar/plutos/time_evolution_mwe.jl:60 within `time_evolution'
; ┌ @ array.jl within `getindex'
%5 = load double*, double** inttoptr (i64 139749451363712 to double**), align 8
%6 = bitcast {}* %2 to double**
%7 = load double*, double** %6, align 8
%8 = load double*, double** inttoptr (i64 139749451364272 to double**), align 8
%9 = load double*, double** inttoptr (i64 139749435769008 to double**), align 8
; └
; ┌ @ promotion.jl:322 within `*' @ float.jl:0
%10 = fmul double %3, 2.000000e+00
; └
; ┌ @ array.jl within `getindex'
%11 = bitcast {}* %1 to double**
%12 = load double*, double** %11, align 8
; └
; ┌ @ array.jl within `setindex!'
%13 = bitcast {}* %0 to double**
%14 = load double*, double** %13, align 8
; └
; @ /home/ederag/share/coll/combe/oscpar/julia/oscpar/plutos/time_evolution_mwe.jl:59 within `time_evolution'
%scevgep = getelementptr double, double* %14, i64 1
%scevgep9 = getelementptr double, double* %14, i64 39999
%scevgep11 = getelementptr double, double* %5, i64 1
%scevgep13 = getelementptr double, double* %5, i64 39999
%scevgep15 = getelementptr double, double* %7, i64 40000
%scevgep17 = getelementptr double, double* %8, i64 1
%scevgep19 = getelementptr double, double* %8, i64 39999
%scevgep21 = getelementptr double, double* %9, i64 1
%scevgep23 = getelementptr double, double* %9, i64 39999
%scevgep25 = getelementptr double, double* %12, i64 1
%scevgep27 = getelementptr double, double* %12, i64 39999
%bound0 = icmp ult double* %scevgep, %scevgep13
%bound1 = icmp ult double* %scevgep11, %scevgep9
%found.conflict = and i1 %bound0, %bound1
%bound029 = icmp ult double* %scevgep, %scevgep15
%bound130 = icmp ult double* %7, %scevgep9
%found.conflict31 = and i1 %bound029, %bound130
%conflict.rdx = or i1 %found.conflict, %found.conflict31
%bound032 = icmp ult double* %scevgep, %scevgep19
%bound133 = icmp ult double* %scevgep17, %scevgep9
%found.conflict34 = and i1 %bound032, %bound133
%conflict.rdx35 = or i1 %conflict.rdx, %found.conflict34
%bound036 = icmp ult double* %scevgep, %scevgep23
%bound137 = icmp ult double* %scevgep21, %scevgep9
%found.conflict38 = and i1 %bound036, %bound137
%conflict.rdx39 = or i1 %conflict.rdx35, %found.conflict38
%bound040 = icmp ult double* %scevgep, %scevgep27
%bound141 = icmp ult double* %scevgep25, %scevgep9
%found.conflict42 = and i1 %bound040, %bound141
%conflict.rdx43 = or i1 %conflict.rdx39, %found.conflict42
br i1 %conflict.rdx43, label %scalar.ph, label %vector.ph
vector.ph: ; preds = %top
%broadcast.splatinsert = insertelement <4 x double> undef, double %10, i32 0
%broadcast.splat = shufflevector <4 x double> %broadcast.splatinsert, <4 x double> undef, <4 x i32> zeroinitializer
br label %vector.body
vector.body: ; preds = %vector.body, %vector.ph
%index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
%offset.idx = or i64 %index, 2
; @ /home/ederag/share/coll/combe/oscpar/julia/oscpar/plutos/time_evolution_mwe.jl:60 within `time_evolution'
; ┌ @ array.jl:801 within `getindex'
%15 = add nsw i64 %offset.idx, -1
%16 = getelementptr inbounds double, double* %5, i64 %15
%17 = bitcast double* %16 to <4 x double>*
%wide.load = load <4 x double>, <4 x double>* %17, align 8
%18 = getelementptr inbounds double, double* %7, i64 %index
%19 = bitcast double* %18 to <4 x double>*
%wide.load44 = load <4 x double>, <4 x double>* %19, align 8
%20 = getelementptr inbounds double, double* %7, i64 %15
%21 = bitcast double* %20 to <4 x double>*
%wide.load45 = load <4 x double>, <4 x double>* %21, align 8
; └
; ┌ @ float.jl:329 within `-'
%22 = fsub <4 x double> %wide.load44, %wide.load45
; └
; ┌ @ float.jl:332 within `*'
%23 = fmul <4 x double> %wide.load, %22
; └
; ┌ @ array.jl:801 within `getindex'
%24 = getelementptr inbounds double, double* %8, i64 %15
%25 = bitcast double* %24 to <4 x double>*
%wide.load46 = load <4 x double>, <4 x double>* %25, align 8
%26 = getelementptr inbounds double, double* %7, i64 %offset.idx
%27 = bitcast double* %26 to <4 x double>*
%wide.load47 = load <4 x double>, <4 x double>* %27, align 8
; └
; ┌ @ float.jl:329 within `-'
%28 = fsub <4 x double> %wide.load47, %wide.load45
; └
; ┌ @ float.jl:332 within `*'
%29 = fmul <4 x double> %wide.load46, %28
; └
; ┌ @ array.jl:801 within `getindex'
%30 = getelementptr inbounds double, double* %9, i64 %15
%31 = bitcast double* %30 to <4 x double>*
%wide.load48 = load <4 x double>, <4 x double>* %31, align 8
; └
; ┌ @ operators.jl:560 within `+' @ float.jl:326
%32 = fadd <4 x double> %23, %29
%33 = fadd <4 x double> %wide.load48, %32
; └
; ┌ @ array.jl:801 within `getindex'
%34 = getelementptr inbounds double, double* %12, i64 %15
%35 = bitcast double* %34 to <4 x double>*
%wide.load49 = load <4 x double>, <4 x double>* %35, align 8
; └
; ┌ @ float.jl:332 within `*'
%36 = fmul <4 x double> %broadcast.splat, %wide.load49
; └
; ┌ @ float.jl:329 within `-'
%37 = fsub <4 x double> %33, %36
; └
; ┌ @ array.jl:839 within `setindex!'
%38 = getelementptr inbounds double, double* %14, i64 %15
%39 = bitcast double* %38 to <4 x double>*
store <4 x double> %37, <4 x double>* %39, align 8
%index.next = add i64 %index, 4
%40 = icmp eq i64 %index.next, 39996
br i1 %40, label %scalar.ph, label %vector.body
scalar.ph: ; preds = %vector.body, %top
%bc.resume.val = phi i64 [ 2, %top ], [ 39998, %vector.body ]
; └
; @ /home/ederag/share/coll/combe/oscpar/julia/oscpar/plutos/time_evolution_mwe.jl:59 within `time_evolution'
br label %L2
L2: ; preds = %L2, %scalar.ph
%value_phi = phi i64 [ %bc.resume.val, %scalar.ph ], [ %66, %L2 ]
; @ /home/ederag/share/coll/combe/oscpar/julia/oscpar/plutos/time_evolution_mwe.jl:60 within `time_evolution'
; ┌ @ array.jl:801 within `getindex'
%41 = add nsw i64 %value_phi, -1
%42 = getelementptr inbounds double, double* %5, i64 %41
%43 = load double, double* %42, align 8
%44 = add nsw i64 %value_phi, -2
%45 = getelementptr inbounds double, double* %7, i64 %44
%46 = load double, double* %45, align 8
%47 = getelementptr inbounds double, double* %7, i64 %41
%48 = load double, double* %47, align 8
; └
; ┌ @ float.jl:329 within `-'
%49 = fsub double %46, %48
; └
; ┌ @ float.jl:332 within `*'
%50 = fmul double %43, %49
; └
; ┌ @ array.jl:801 within `getindex'
%51 = getelementptr inbounds double, double* %8, i64 %41
%52 = load double, double* %51, align 8
%53 = getelementptr inbounds double, double* %7, i64 %value_phi
%54 = load double, double* %53, align 8
; └
; ┌ @ float.jl:329 within `-'
%55 = fsub double %54, %48
; └
; ┌ @ float.jl:332 within `*'
%56 = fmul double %52, %55
; └
; ┌ @ array.jl:801 within `getindex'
%57 = getelementptr inbounds double, double* %9, i64 %41
%58 = load double, double* %57, align 8
; └
; ┌ @ operators.jl:560 within `+' @ float.jl:326
%59 = fadd double %50, %56
%60 = fadd double %58, %59
; └
; ┌ @ array.jl:801 within `getindex'
%61 = getelementptr inbounds double, double* %12, i64 %41
%62 = load double, double* %61, align 8
; └
; ┌ @ float.jl:332 within `*'
%63 = fmul double %10, %62
; └
; ┌ @ float.jl:329 within `-'
%64 = fsub double %60, %63
; └
; ┌ @ array.jl:839 within `setindex!'
%65 = getelementptr inbounds double, double* %14, i64 %41
store double %64, double* %65, align 8
; └
; ┌ @ range.jl:674 within `iterate'
; │┌ @ promotion.jl:410 within `=='
%.not.not = icmp eq i64 %value_phi, 39999
; │└
%66 = add nuw nsw i64 %value_phi, 1
; └
br i1 %.not.not, label %L39, label %L2
L39: ; preds = %L2
ret void
}
The output is the same on both machines, save for a few ids:
Differences between machine B and A
< define void @julia_time_evolution_1094({}* nonnull align 16 dereferenceable(40) %0, {}* nonnull align 16 dereferenceable(40) %1, {}* nonnull align 16 dereferenceable(40) %2, double %3, double %4) {
---
> define void @julia_time_evolution_1171({}* nonnull align 16 dereferenceable(40) %0, {}* nonnull align 16 dereferenceable(40) %1, {}* nonnull align 16 dereferenceable(40) %2, double %3, double %4) {
6c6
< %5 = load double*, double** inttoptr (i64 139749451363712 to double**), align 8
---
> %5 = load double*, double** inttoptr (i64 140648620321984 to double**), align 8
9,10c9,10
< %8 = load double*, double** inttoptr (i64 139749451364272 to double**), align 8
< %9 = load double*, double** inttoptr (i64 139749435769008 to double**), align 8
---
> %8 = load double*, double** inttoptr (i64 140648620322544 to double**), align 8
> %9 = load double*, double** inttoptr (i64 140648577288320 to double**), align 8
The native code, however, differs significantly:
@code_native time_evolution(ddu, du0, u0, γ, 0.0)
Result on machine A (AMD Ryzen)
.text
; ┌ @ time_evolution_mwe.jl:56 within `time_evolution'
pushq %rbp
pushq %r15
pushq %r14
pushq %r13
pushq %r12
pushq %rbx
movabsq $139749451363712, %rax # imm = 0x7F19F467F580
; │ @ time_evolution_mwe.jl:60 within `time_evolution'
; │┌ @ array.jl within `setindex!'
movq (%rdi), %rdi
movq (%rdx), %rcx
movq (%rsi), %rsi
vaddsd %xmm0, %xmm0, %xmm0
movq (%rax), %r8
movabsq $139749451364272, %rax # imm = 0x7F19F467F7B0
movq (%rax), %r9
movabsq $139749435769008, %rax # imm = 0x7F19F37A00B0
movq (%rax), %r10
; │└
; │ @ time_evolution_mwe.jl:59 within `time_evolution'
leaq 8(%rdi), %rdx
leaq 319992(%rdi), %rbp
leaq 320000(%rcx), %r11
leaq 319992(%r8), %rbx
leaq 8(%r8), %rax
cmpq %rbx, %rdx
leaq 319992(%r9), %r12
leaq 8(%r9), %r15
setb -1(%rsp)
cmpq %rbp, %rax
leaq 319992(%r10), %rbx
leaq 8(%r10), %r13
setb %r14b
cmpq %r11, %rdx
setb %al
cmpq %rbp, %rcx
setb -2(%rsp)
cmpq %r12, %rdx
setb %r11b
cmpq %rbp, %r15
setb %r15b
cmpq %rbx, %rdx
leaq 319992(%rsi), %rbx
setb %r12b
cmpq %rbp, %r13
setb -3(%rsp)
cmpq %rbx, %rdx
leaq 8(%rsi), %rdx
setb %r13b
cmpq %rbp, %rdx
movl $2, %edx
setb %bpl
testb %r14b, -1(%rsp)
jne L323
andb -2(%rsp), %al
jne L323
andb %r15b, %r11b
jne L323
andb -3(%rsp), %r12b
jne L323
andb %bpl, %r13b
jne L323
vbroadcastsd %xmm0, %ymm1
xorl %eax, %eax
nop
; │ @ time_evolution_mwe.jl:60 within `time_evolution'
; │┌ @ array.jl:801 within `getindex'
L240:
vmovupd (%rcx,%rax), %ymm2
vmovupd 8(%rcx,%rax), %ymm3
vmovupd 16(%rcx,%rax), %ymm4
; │└
; │┌ @ float.jl:329 within `-'
vsubpd %ymm3, %ymm2, %ymm2
vsubpd %ymm3, %ymm4, %ymm3
; │└
; │┌ @ float.jl:332 within `*'
vmulpd 8(%r8,%rax), %ymm2, %ymm2
vmulpd 8(%r9,%rax), %ymm3, %ymm3
vmulpd 8(%rsi,%rax), %ymm1, %ymm4
; │└
; │┌ @ operators.jl:560 within `+' @ float.jl:326
vaddpd %ymm3, %ymm2, %ymm2
vaddpd 8(%r10,%rax), %ymm2, %ymm2
; │└
; │┌ @ float.jl:329 within `-'
vsubpd %ymm4, %ymm2, %ymm2
; │└
; │┌ @ array.jl:839 within `setindex!'
vmovupd %ymm2, 8(%rdi,%rax)
addq $32, %rax
cmpq $319968, %rax # imm = 0x4E1E0
jne L240
; │└
; │ @ array.jl within `time_evolution'
movl $39998, %edx # imm = 0x9C3E
; │ @ time_evolution_mwe.jl:59 within `time_evolution'
L323:
shlq $3, %rdx
addq $-8, %rdi
addq $-8, %rsi
addq $-8, %r10
addq $-8, %r9
addq $-8, %r8
movl $320000, %eax # imm = 0x4E200
; │ @ time_evolution_mwe.jl:60 within `time_evolution'
; │┌ @ array.jl:801 within `getindex'
L352:
vmovsd -16(%rcx,%rdx), %xmm1 # xmm1 = mem[0],zero
vmovsd -8(%rcx,%rdx), %xmm2 # xmm2 = mem[0],zero
vmovsd (%rcx,%rdx), %xmm3 # xmm3 = mem[0],zero
; │└
; │┌ @ range.jl:674 within `iterate'
; ││┌ @ promotion.jl:410 within `=='
addq $8, %rcx
addq $-8, %rax
; │└└
; │┌ @ float.jl:329 within `-'
vsubsd %xmm2, %xmm1, %xmm1
vsubsd %xmm2, %xmm3, %xmm2
; │└
; │┌ @ float.jl:332 within `*'
vmulsd (%r8,%rdx), %xmm1, %xmm1
vmulsd (%r9,%rdx), %xmm2, %xmm2
vmulsd (%rsi,%rdx), %xmm0, %xmm3
; │└
; │┌ @ range.jl:674 within `iterate'
; ││┌ @ promotion.jl:410 within `=='
addq $8, %rsi
addq $8, %r9
addq $8, %r8
; │└└
; │┌ @ operators.jl:560 within `+' @ float.jl:326
vaddsd %xmm2, %xmm1, %xmm1
vaddsd (%r10,%rdx), %xmm1, %xmm1
; │└
; │┌ @ range.jl:674 within `iterate'
; ││┌ @ promotion.jl:410 within `=='
addq $8, %r10
; │└└
; │┌ @ float.jl:329 within `-'
vsubsd %xmm3, %xmm1, %xmm1
; │└
; │┌ @ array.jl:839 within `setindex!'
vmovsd %xmm1, (%rdi,%rdx)
; │└
; │┌ @ range.jl:674 within `iterate'
; ││┌ @ promotion.jl:410 within `=='
addq $8, %rdi
cmpq %rax, %rdx
; │└└
jne L352
popq %rbx
popq %r12
popq %r13
popq %r14
popq %r15
popq %rbp
vzeroupper
retq
nopl (%rax)
; └
Result on machine B (Intel Xeon)
.text
; ┌ @ time_evolution_mwe.jl:56 within `time_evolution'
pushq %rbp
pushq %r15
pushq %r14
pushq %r13
pushq %r12
pushq %rbx
movabsq $140648620321984, %rax # imm = 0x7FEB4F0D6CC0
; │ @ time_evolution_mwe.jl:60 within `time_evolution'
; │┌ @ array.jl within `getindex'
movq (%rax), %r8
movq (%rdx), %rcx
movabsq $140648620322544, %rax # imm = 0x7FEB4F0D6EF0
movq (%rax), %r9
movabsq $140648577288320, %rax # imm = 0x7FEB4C7CC880
movq (%rax), %r10
movq (%rsi), %rsi
movq (%rdi), %rdi
; │└
; │ @ time_evolution_mwe.jl:59 within `time_evolution'
leaq 8(%rdi), %rdx
leaq 319992(%rdi), %rbp
leaq 8(%r8), %rax
leaq 319992(%r8), %rbx
leaq 320000(%rcx), %r11
leaq 8(%r9), %r15
leaq 319992(%r9), %r12
cmpq %rbx, %rdx
setb -1(%rsp)
cmpq %rbp, %rax
leaq 8(%r10), %r13
setb %r14b
cmpq %r11, %rdx
setb %al
cmpq %rbp, %rcx
setb -2(%rsp)
cmpq %r12, %rdx
setb %r11b
cmpq %rbp, %r15
leaq 319992(%r10), %rbx
setb %r15b
cmpq %rbx, %rdx
setb %r12b
cmpq %rbp, %r13
leaq 319992(%rsi), %rbx
setb -3(%rsp)
cmpq %rbx, %rdx
leaq 8(%rsi), %rdx
setb %r13b
cmpq %rbp, %rdx
; │ @ time_evolution_mwe.jl:60 within `time_evolution'
; │┌ @ promotion.jl:322 within `*' @ float.jl:0
vaddsd %xmm0, %xmm0, %xmm0
; │└
; │ @ time_evolution_mwe.jl:59 within `time_evolution'
setb %bpl
movl $2, %edx
testb %r14b, -1(%rsp)
jne L323
andb -2(%rsp), %al
jne L323
andb %r15b, %r11b
jne L323
andb -3(%rsp), %r12b
jne L323
andb %bpl, %r13b
jne L323
vbroadcastsd %xmm0, %ymm1
xorl %eax, %eax
nop
; │ @ time_evolution_mwe.jl:60 within `time_evolution'
; │┌ @ array.jl:801 within `getindex'
L240:
vmovupd (%rcx,%rax), %ymm2
vmovupd 8(%rcx,%rax), %ymm3
vmovupd 16(%rcx,%rax), %ymm4
; │└
; │┌ @ float.jl:329 within `-'
vsubpd %ymm3, %ymm2, %ymm2
; │└
; │┌ @ float.jl:332 within `*'
vmulpd 8(%r8,%rax), %ymm2, %ymm2
; │└
; │┌ @ float.jl:329 within `-'
vsubpd %ymm3, %ymm4, %ymm3
; │└
; │┌ @ float.jl:332 within `*'
vmulpd 8(%r9,%rax), %ymm3, %ymm3
; │└
; │┌ @ operators.jl:560 within `+' @ float.jl:326
vaddpd %ymm3, %ymm2, %ymm2
vaddpd 8(%r10,%rax), %ymm2, %ymm2
; │└
; │┌ @ float.jl:332 within `*'
vmulpd 8(%rsi,%rax), %ymm1, %ymm3
; │└
; │┌ @ float.jl:329 within `-'
vsubpd %ymm3, %ymm2, %ymm2
; │└
; │┌ @ array.jl:839 within `setindex!'
vmovupd %ymm2, 8(%rdi,%rax)
addq $32, %rax
cmpq $319968, %rax # imm = 0x4E1E0
jne L240
; │└
; │ @ array.jl within `time_evolution'
movl $39998, %edx # imm = 0x9C3E
; │ @ time_evolution_mwe.jl:59 within `time_evolution'
L323:
shlq $3, %rdx
addq $-8, %rdi
addq $-8, %rsi
addq $-8, %r10
addq $-8, %r9
addq $-8, %r8
movl $320000, %eax # imm = 0x4E200
; │ @ time_evolution_mwe.jl:60 within `time_evolution'
; │┌ @ array.jl:801 within `getindex'
L352:
vmovsd -16(%rcx,%rdx), %xmm1 # xmm1 = mem[0],zero
vmovsd -8(%rcx,%rdx), %xmm2 # xmm2 = mem[0],zero
; │└
; │┌ @ float.jl:329 within `-'
vsubsd %xmm2, %xmm1, %xmm1
; │└
; │┌ @ float.jl:332 within `*'
vmulsd (%r8,%rdx), %xmm1, %xmm1
; │└
; │┌ @ array.jl:801 within `getindex'
vmovsd (%rcx,%rdx), %xmm3 # xmm3 = mem[0],zero
; │└
; │┌ @ float.jl:329 within `-'
vsubsd %xmm2, %xmm3, %xmm2
; │└
; │┌ @ float.jl:332 within `*'
vmulsd (%r9,%rdx), %xmm2, %xmm2
; │└
; │┌ @ operators.jl:560 within `+' @ float.jl:326
vaddsd %xmm2, %xmm1, %xmm1
vaddsd (%r10,%rdx), %xmm1, %xmm1
; │└
; │┌ @ float.jl:332 within `*'
vmulsd (%rsi,%rdx), %xmm0, %xmm2
; │└
; │┌ @ float.jl:329 within `-'
vsubsd %xmm2, %xmm1, %xmm1
; │└
; │┌ @ array.jl:839 within `setindex!'
vmovsd %xmm1, (%rdi,%rdx)
; │└
; │┌ @ range.jl:674 within `iterate'
; ││┌ @ promotion.jl:410 within `=='
addq $8, %rcx
addq $8, %rdi
addq $8, %rsi
addq $8, %r10
addq $8, %r9
addq $8, %r8
addq $-8, %rax
cmpq %rax, %rdx
; │└└
jne L352
popq %rbx
popq %r12
popq %r13
popq %r14
popq %r15
popq %rbp
vzeroupper
retq
nopl (%rax)
; └
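The two listings above are easier to compare once the source-location comments are stripped. A minimal sketch of how that could be done: write the native code to a file on each machine with debuginfo = :none, then diff the files. dump_native is a helper name made up here, and axpy! is only a toy stand-in so the snippet runs on its own.

```julia
using InteractiveUtils  # provides code_native outside the REPL

# Toy stand-in for time_evolution, only here to keep the sketch self-contained.
function axpy!(y, a, x)
    @inbounds for i in eachindex(x, y)
        y[i] += a * x[i]
    end
    return y
end

# Write the native code of `f` for argument types `types` to `path`,
# without the source-location comments.
function dump_native(path, f, types)
    open(path, "w") do io
        code_native(io, f, types; debuginfo = :none)
    end
    return path
end

path = dump_native(joinpath(tempdir(), "axpy_native.s"), axpy!,
                   Tuple{Vector{Float64}, Float64, Vector{Float64}})
println(countlines(path), " lines written to ", path)
```

For the kernel in this post the call would be dump_native("native_A.s", time_evolution, Tuple{Vector{Float64}, Vector{Float64}, Vector{Float64}, Float64, Float64}) on machine A, the same with another file name on machine B, followed by a plain diff of the two files.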
What next?
This is already far outside my comfort zone, but I'd like to get to the bottom of it,
for intellectual satisfaction, and because it's a strong bottleneck for the ODE on the Xeon.
[To be fair, it's already faster than a vectorized Octave version,
so Julia + DifferentialEquations.jl rocks!]
Any ideas?