Hi there,

I have been trying to implement a form of the scaled FFT for use in optical propagation. The code was also rewritten in MATLAB to act as a comparison. My Julia implementation seems much slower than what I was expecting. MATLAB for reference takes around 0.38 s compared to ~0.95 s in Julia to run for identical parameters used in the Julia script (MATLAB cpu usage is significantly higher throughout). I have used the built-in profiler to try and analyse the situation and it seems to indicate that Julia is only using 1 thread despite me setting the environment variable in vscode to use 8 threads. I have also tried running in a standalone instance of Julia where I have set the number of threads to 8 before launching (-t 8). Does anyone have any insight into the possible slowdown?

If all is working with the multithreading then does anyone have any ideas as to why my code is noticeably slower in Julia? Am I missing an obvious trick that is causing wild inefficiencies in Julia?

Thanks in advance for any help!

```
using FFTW
using GridCreation
using IJulia
using LazyGrids
using ProfileView
# notebook()
N = 1024;
M = 1024;
alpha = 1.0;
beta = 1.0;
dx = 0.1; # Sampling unit for function
dy = 0.1;
Y_array, X_array = ndgrid(-N/2 : N/2 -1,-M/2 : M/2 -1);
typX = typeof(X_array);
typY = typeof(Y_array);
f = @. cis(pi * (dx*X_array + dy*Y_array));
function SFFT2(f::Array{ComplexF64}, X_array::typX, Y_array::typY, alpha::Float64, beta::Float64, M::Int64, N::Int64)
function exp_fact(kernel::Float64, array, A::Int64,shift)
@.(cis(kernel * (array - 2.0*A*shift)^2 /A))
end
exp_factor_n = exp_fact(pi*alpha,X_array,N,0);
exp_factor_m = exp_fact(pi*beta,Y_array,M,0);
exp_factor_n_shift = exp_fact(pi*alpha,X_array,N,1);
exp_factor_m_shift = exp_fact(pi*beta,Y_array,M,1);
exp_factor = 1.0 ./ (exp_factor_n .* exp_factor_m);
g1_bar_pad = Matrix{ComplexF64}(undef,nextpow(2,N+M-1),nextpow(2,N+M-1));
g2_bar_pad = copy(g1_bar_pad);
g1_bar_pad[1:N,1:M] = @.(f * cis(pi*(alpha*X_array + beta*Y_array)) * exp_factor);
g2_bar_pad[1:N,1:M] = 1.0./exp_factor; # top left chunk
g2_bar_pad[N+1:2*N,1:M] = exp_factor_n_shift .* exp_factor_m; # bottom left chunk
g2_bar_pad[1:N,M+1:2*M] = exp_factor_m_shift .* exp_factor_n; # top right chunk
g2_bar_pad[N+1:2*N,M+1:2*M] = exp_factor_n_shift .* exp_factor_m_shift; # bottom right chunk
conv_g1g2 = ifft(fft(g1_bar_pad) .* fft(g2_bar_pad))[1:N,1:M];
@.(cis(pi*(alpha*(X_array - 0.5*N) + beta*(Y_array - 0.5*M))) * exp_factor * conv_g1g2)
end
@profview SFFT(f, X_array, Y_array, alpha, beta, M, N)
```