Dear all,

I am trying to rewrite my Matlab code that calculates certain band structure to Julia. Unfortunately, I am unable to get close to the performance of Matlab code for some reason that I don’t understand. I will be very thankful if you help me.

My Julia code is the following:

```
#assert Threads.nthreads() == 8
using BenchmarkTools
using FFTW
using MKL
using LinearAlgebra
LinearAlgebra.BLAS.set_num_threads(1);
FFTW.set_provider!("mkl")
function Hamiltonian(nk, a, v)
nx = length(v);
E = zeros(ComplexF64, nk);
K = 2*π*collect(fftshift(fftfreq(nx, nx)))
Hf = zeros(ComplexF64, nx, nx);
k = collect(range(-π, π, nk));
fft!(v)
fftshift!(v, v/length(v))
for j=1:nx, i=1:nx
idx = i - j + floor(Int64, nx/2 + 1)
if 0 < idx <= nx
@inbounds Hf[i, j] = v[idx];
end
end
Threads.@threads for i in eachindex(k)
e = eigvals(Hermitian(Diagonal(-0.5.*abs.(k[i] .+ K).^a) .+ Hf));
sort!(e, lt = (x, y) -> abs(real(x)) < abs(real(y)));
@inbounds E[i] = e[1];
end
return E
end
x = range(0, 1, 1001);
@btime H = Hamiltonian(100, 1.5, complex(100*cos.(2*pi*x).*sinh.(-x)));
```

Julia version is 1.10. I am using MKL explicitly because Matlab uses MKL on intel processor and it does give better performance on my system. I have turned off auto parallelism of MKL because parallelism is already achieved by Threads.@threads on outer for loop, and again it gives better performance. I have also tried @distrbuted and got roughly the same results. This code is around 50% slower than its Matlab 2023a counter part:

```
function e = Hamiltonian(nk, a, p)
p = fftshift(fft(p))/length(p);
k = linspace(-pi, pi, nk);
L = length(p);
m = fix(L/2);
M = -m:m;
K = 2*pi*M;
Hf = zeros(L, L);
[i, j] = meshgrid(1:L, 1:L);
mask = round(i - j + floor(L/2));
Hf(mask >= 0 & mask < L) = p(mask(mask >= 0 & mask < L)+1);
e = zeros(1, nk);
parfor i=1:nk
H = diag(-0.5*abs(k(i) + K).^a) + Hf;
E = eig(H, 'vector');
[~, m] = sort(abs(real(E)));
E = E(m);
e(i) = real(E(1));
end
tic
x = linspace(0, 1, 1001);
e = Hamiltonian(100, 1.5, 100*cos(2*pi*x).*sinh(-x));
toc
```

The performance of these two are roughly the same for smaller matrices <500x500 size. But as matrix matrix becomes larger, I see huger gap. In general the potential term p,v (third argument of Hamiltonian) is Complex, so I use fft not rfft. Also I know performance wise the whole function is irrelevant, only the part eigenvalues are calculated causes bad performance. Moreover these two functions are completely equivalent (i.e eig solver in both cases sees the same matrix and produce same results).