And check this out (Apple M1 in a Mac Mini)…
julia> for N in 1_000:1_000:10_000
           @show N
           A = @btime eval_exp($N)
           B = @btime eval_exp_tvectorize($N)
           @assert A ≈ B
           println()
       end
N = 1000
  5.005 ms (24 allocations: 15.26 MiB)
  1.515 ms (2 allocations: 15.26 MiB)

N = 2000
  19.180 ms (24 allocations: 61.04 MiB)
  6.081 ms (2 allocations: 61.04 MiB)

N = 3000
  43.262 ms (23 allocations: 137.33 MiB)
  15.970 ms (2 allocations: 137.33 MiB)

N = 4000
  77.505 ms (23 allocations: 244.14 MiB)
  26.313 ms (2 allocations: 244.14 MiB)

N = 5000
  119.022 ms (23 allocations: 381.47 MiB)
  40.525 ms (2 allocations: 381.47 MiB)

N = 6000
  169.070 ms (23 allocations: 549.32 MiB)
  58.247 ms (2 allocations: 549.32 MiB)

N = 7000
  228.990 ms (23 allocations: 747.68 MiB)
  78.732 ms (2 allocations: 747.68 MiB)

N = 8000
  296.763 ms (23 allocations: 976.56 MiB)
  103.725 ms (5 allocations: 976.56 MiB)

N = 9000
  375.306 ms (23 allocations: 1.21 GiB)
  130.583 ms (5 allocations: 1.21 GiB)

N = 10000
  459.623 ms (23 allocations: 1.49 GiB)
  159.329 ms (5 allocations: 1.49 GiB)
julia> versioninfo()
Julia Version 1.8.0-DEV.54
Commit 6d2c0a7766* (2021-06-19 00:28 UTC)
Platform Info:
  OS: macOS (arm64-apple-darwin20.5.0)
  CPU: Apple M1
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.0 (ORCJIT, cyclone)
Environment:
  JULIA_NUM_THREADS = 4
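The definitions of `eval_exp` and `eval_exp_tvectorize` appear earlier in the thread and aren't reproduced here. For readers skimming just this post, a hypothetical sketch of the shape of the serial version (the grid and the exponent below are my assumptions, not the thread's actual code) would fill an `N × N` `ComplexF64` matrix, which at 16·N² bytes matches the reported allocations (15.26 MiB at N = 1_000):

```julia
# Hypothetical sketch only -- the real definitions live earlier in the thread.
# Fills an N x N ComplexF64 matrix (16*N^2 bytes), consistent with the
# allocation sizes reported by @btime above.
function eval_exp(N)
    a = range(0, stop = 2, length = N)        # assumed grid
    A = Matrix{ComplexF64}(undef, N, N)
    for j in 1:N, i in 1:N
        A[i, j] = exp(im * (a[i]^2 + a[j]^2)) # assumed integrand
    end
    return A
end
```

The vectorized variant would apply the same kernel under `@tturbo` from LoopVectorization.jl, which is what makes the 2-allocation, ~3x-faster runs plausible.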
The N = 10_000 case requires 100x the work of the N = 1_000 case, and the M1 takes about 100x longer. Meanwhile, the 2600X took about 180x longer, and the 10980XE about 400x longer. That is good scaling with increasing problem size on the M1.
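Those ratios fall straight out of the timings above; a quick sanity check:

```julia
# Ratios of the N = 10_000 to N = 1_000 timings reported above.
# The work is proportional to N^2, so 100x more work at N = 10_000.
t1, t10 = 5.005, 459.623      # eval_exp, ms
println(t10 / t1)             # roughly 92x slower for 100x the work

t1v, t10v = 1.515, 159.329    # eval_exp_tvectorize, ms
println(t10v / t1v)           # roughly 105x
```

Both variants stay within a few percent of the ideal 100x, i.e. throughput barely degrades as the working set grows past every cache level.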
We have a lot of stencil benchmarks in this thread, which are memory bound: because we evaluate them at a single time point, there is no temporal locality to take advantage of. There, the M1 was around 2x faster at solving the ODEs. It seems to do well on these memory-intensive tasks.
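To illustrate why a single-time-point stencil is memory bound (this is a generic example, not one of the thread's actual benchmarks): each output element costs only a handful of flops but three loads and a store, so on large arrays the memory bus saturates long before the core does.

```julia
# Generic 1D three-point (Laplacian-like) stencil, one sweep over the data.
# ~3 flops per element vs 4 memory operations: bandwidth bound for large n.
function stencil!(out, u)
    n = length(u)
    @inbounds for i in 2:n-1
        out[i] = u[i-1] - 2u[i] + u[i+1]
    end
    return out
end

u = rand(10^6)      # far larger than any cache level
out = similar(u)
stencil!(out, u)
```

With no reuse of loaded values across time steps, a core that is 2x faster at feeding its memory system finishes roughly 2x sooner, regardless of its peak flops.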