Here’s how your new kernel could be written using LoopVectorization.jl:
using LoopVectorization: @tturbo

function eval_exp_tturbo(N)
    a = range(0, stop=2*pi, length=N)
    A = Matrix{ComplexF64}(undef, N, N)
    # View A as a 2×N×N Float64 array: _A[1,i,j] is the real part of A[i,j]
    # and _A[2,i,j] the imaginary part, so the loop only works with real numbers.
    _A = reinterpret(reshape, Float64, A)
    @tturbo for j in 1:N, i in 1:N
        x = sqrt(a[i]^2 + a[j]^2)
        prefac = exp(-x)
        s, c = sincos(100*x)
        _A[1,i,j] = prefac * c
        _A[2,i,j] = prefac * s
    end
    return A
end
It might be a little hard to tell what I’m doing here, but since i(100+i) = -1 + 100i, I’m basically writing
\exp\left(i (100+i) \sqrt{a_i^2 + a_j^2} \right) = \exp\left(-\sqrt{a_i^2 + a_j^2} \right)\left(\cos\left(100 \sqrt{a_i^2 + a_j^2} \right) + i \sin\left(100 \sqrt{a_i^2 + a_j^2} \right) \right)
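To double-check that rewrite, here’s a quick numerical sanity check of the identity (a minimal sketch; the test value of x is arbitrary):
let x = 1.2345
    lhs = exp((100 + im) * im * x)                      # original form
    rhs = exp(-x) * (cos(100 * x) + im * sin(100 * x))  # rewritten form
    @assert lhs ≈ rhs
end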
Here’s how it stacks up against your version:
function eval_exp(N)
    a = range(0, stop=2*pi, length=N)
    A = Matrix{ComplexF64}(undef, N, N)
    Threads.@threads for j in 1:N
        for i in 1:N
            A[i,j] = exp((100+im)*im*sqrt(a[i]^2 + a[j]^2))
        end
    end
    return A
end
using BenchmarkTools

println("running loop on ", Threads.nthreads(), " threads")
for N in 1_000:1_000:10_000
    @show N
    A = @btime eval_exp($N)
    B = @btime eval_exp_tturbo($N)
    @assert A ≈ B
    println()
end
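A side note before the numbers: the thread count is fixed when Julia starts, so this only runs on 6 threads because the session was launched that way (sketch, assuming Julia 1.5 or newer):
Threads.nthreads()  # reports 1 unless Julia was started with threads enabled,
                    # e.g. julia -t 6 or JULIA_NUM_THREADS=6 julia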
running loop on 6 threads
N = 1000
4.868 ms (33 allocations: 15.26 MiB)
1.332 ms (2 allocations: 15.26 MiB)
N = 2000
22.427 ms (33 allocations: 61.04 MiB)
8.384 ms (2 allocations: 61.04 MiB)
N = 3000
51.026 ms (33 allocations: 137.33 MiB)
19.169 ms (2 allocations: 137.33 MiB)
N = 4000
90.038 ms (33 allocations: 244.14 MiB)
35.894 ms (2 allocations: 244.14 MiB)
N = 5000
140.807 ms (35 allocations: 381.47 MiB)
54.000 ms (4 allocations: 381.47 MiB)
N = 6000
208.826 ms (35 allocations: 549.32 MiB)
76.401 ms (3 allocations: 549.32 MiB)
N = 7000
276.598 ms (35 allocations: 747.68 MiB)
106.647 ms (4 allocations: 747.68 MiB)
N = 8000
371.500 ms (35 allocations: 976.57 MiB)
187.425 ms (5 allocations: 976.56 MiB)
N = 9000
460.661 ms (36 allocations: 1.21 GiB)
233.272 ms (5 allocations: 1.21 GiB)
N = 10000
616.968 ms (35 allocations: 1.49 GiB)
242.980 ms (5 allocations: 1.49 GiB)
As you can see, this is a huge win: between roughly 2x and 3.7x faster at every size tested. I wouldn’t be super surprised if Fortran could go a bit faster here, but I’ll also point out that this isn’t the first time Julia and LoopVectorization.jl have slayed Fortran performance numbers (Simple summation 8x slower than in Julia - Fortran Discourse), with no real suggestions from experts on how to do better without improving the compiler.
I should also note that my CPU is a Zen 1+ CPU, which means its SIMD performance isn’t that great. I’d expect an even more impressive gap between these results on an AVX-512 capable Intel CPU, or maybe a Zen 3 CPU.
Regarding MKL, Julia does not use it by default; it needs to be installed and loaded manually.
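If you do want to try MKL, here’s a minimal sketch of the usual route (assuming Julia 1.7 or newer and the MKL.jl package; note this only affects BLAS/LAPACK calls, not a hand-written kernel like the one above):
using Pkg
Pkg.add("MKL")       # install Intel MKL through the MKL.jl package
using MKL            # loading it switches the default BLAS/LAPACK backend to MKL
using LinearAlgebra
BLAS.get_config()    # should now report MKL as the active BLAS library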