Here’s how your new kernel could be written using LoopVectorization.jl:
using LoopVectorization: @tturbo
function eval_exp_tturbo(N)
    a = range(0, stop=2*pi, length=N)
    A = Matrix{ComplexF64}(undef, N, N)
    _A = reinterpret(reshape, Float64, A) # view A as a 2×N×N array of Float64 (real and imaginary parts)
    @tturbo for j in 1:N, i in 1:N
        x = sqrt(a[i]^2 + a[j]^2)
        prefac = exp(-x)
        s, c = sincos(100*x)
        
        _A[1,i,j] = prefac * c
        _A[2,i,j] = prefac * s
    end
    return A
end
It might be a little hard to tell what I’m doing here: the reinterpret(reshape, Float64, A) call views the complex matrix as a 2×N×N array of plain Float64s, so the loop can fill in the real and imaginary parts separately. What the loop body is basically writing is
$$\exp\left(i (100+i) \sqrt{a_i^2 + a_j^2} \right) = \exp\left(-\sqrt{a_i^2 + a_j^2} \right)\left(\cos\left(100 \sqrt{a_i^2 + a_j^2} \right) + i \sin\left(100 \sqrt{a_i^2 + a_j^2} \right) \right)$$
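As a quick sanity check (just an illustrative snippet with made-up numbers, not part of the benchmark), you can verify both the identity and the reinterpret layout at the REPL:
x = sqrt(1.3^2 + 2.1^2)                      # arbitrary test point
exp((100 + im)*im*x) ≈ exp(-x)*(cos(100x) + im*sin(100x))   # true

A = Matrix{ComplexF64}(undef, 2, 2)
_A = reinterpret(reshape, Float64, A)        # 2×2×2 Float64 view of A
_A[1, 1, 1] = 1.0                            # real part of A[1,1]
_A[2, 1, 1] = 2.0                            # imaginary part of A[1,1]
A[1, 1]                                      # 1.0 + 2.0im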
Here’s how it stacks up against your version:
function eval_exp(N)
    a = range(0, stop=2*pi, length=N)
    A = Matrix{ComplexF64}(undef, N, N)
    Threads.@threads for j in 1:N
        for i in 1:N
            A[i,j] = exp((100+im)*im*sqrt(a[i]^2 + a[j]^2))
        end
    end
    return A
end
using BenchmarkTools
print(string("running loop on ", Threads.nthreads(), " threads \n"))
for N in 1_000:1_000:10_000
    @show N
    A = @btime eval_exp($N)
    B = @btime eval_exp_tturbo($N)
    @assert A ≈ B
    println()
end
And here’s what I get:
running loop on 6 threads 
N = 1000
  4.868 ms (33 allocations: 15.26 MiB)
  1.332 ms (2 allocations: 15.26 MiB)
N = 2000
  22.427 ms (33 allocations: 61.04 MiB)
  8.384 ms (2 allocations: 61.04 MiB)
N = 3000
  51.026 ms (33 allocations: 137.33 MiB)
  19.169 ms (2 allocations: 137.33 MiB)
N = 4000
  90.038 ms (33 allocations: 244.14 MiB)
  35.894 ms (2 allocations: 244.14 MiB)
N = 5000
  140.807 ms (35 allocations: 381.47 MiB)
  54.000 ms (4 allocations: 381.47 MiB)
N = 6000
  208.826 ms (35 allocations: 549.32 MiB)
  76.401 ms (3 allocations: 549.32 MiB)
N = 7000
  276.598 ms (35 allocations: 747.68 MiB)
  106.647 ms (4 allocations: 747.68 MiB)
N = 8000
  371.500 ms (35 allocations: 976.57 MiB)
  187.425 ms (5 allocations: 976.56 MiB)
N = 9000
  460.661 ms (36 allocations: 1.21 GiB)
  233.272 ms (5 allocations: 1.21 GiB)
N = 10000
  616.968 ms (35 allocations: 1.49 GiB)
  242.980 ms (5 allocations: 1.49 GiB)
As you can see, this is a huge win. I wouldn’t be super surprised if Fortran could go a bit faster here, but I’ll also point out that this isn’t the first time Julia and LoopVectorization.jl have slayed Fortran performance numbers (see “Simple summation 8x slower than in Julia” on the Fortran Discourse), with no real suggestions from experts on how to do better without improving the compiler.
I should also note that my CPU is a Zen 1+ CPU, which means its SIMD performance isn’t that great. I’d expect an even more impressive gap between these results on an AVX-512 Intel CPU, or maybe a Zen 3 CPU.
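If you want to see what CPU Julia detects on your own machine (and hence roughly what SIMD features it will target), a quick way to check — purely an aside, not part of the benchmark — is:
using InteractiveUtils
versioninfo()     # prints the Julia version and the detected CPU, among other things
Sys.CPU_NAME      # LLVM's name for the host CPU, e.g. "znver1" on Zen 1/1+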
Regarding MKL: Julia does not use it by default; it has to be installed manually (e.g. with the MKL.jl package) before it gets used.
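In case anyone wants to try that, here’s a minimal sketch (assuming Julia ≥ 1.7 and the MKL.jl package; note that this particular benchmark never calls into BLAS, so it wouldn’t change the numbers above):
using Pkg; Pkg.add("MKL")    # one-time install
using MKL                    # loading it swaps the BLAS/LAPACK backend to MKL
using LinearAlgebra
BLAS.get_config()            # should now list MKL rather than OpenBLAS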