Scaled FFT implementation seems to only use one thread despite setting no. of threads to 8

Hi there,

I have been trying to implement a form of the scaled FFT for use in optical propagation. The code was also rewritten in MATLAB to act as a comparison. My Julia implementation seems much slower than what I was expecting. MATLAB for reference takes around 0.38 s compared to ~0.95 s in Julia to run for identical parameters used in the Julia script (MATLAB cpu usage is significantly higher throughout). I have used the built-in profiler to try and analyse the situation and it seems to indicate that Julia is only using 1 thread despite me setting the environment variable in vscode to use 8 threads. I have also tried running in a standalone instance of Julia where I have set the number of threads to 8 before launching (-t 8). Does anyone have any insight into the possible slowdown?

If all is working with the multithreading then does anyone have any ideas as to why my code is noticeably slower in Julia? Am I missing an obvious trick that is causing wild inefficiencies in Julia?

Thanks in advance for any help!

using FFTW
using GridCreation
using IJulia
using LazyGrids
using ProfileView
# notebook()

N = 1024;
M = 1024;
alpha = 1.0;
beta = 1.0;

dx = 0.1; # Sampling unit for function 
dy = 0.1;

Y_array, X_array = ndgrid(-N/2 : N/2 -1,-M/2 : M/2 -1);

typX = typeof(X_array);
typY = typeof(Y_array);

f = @. cis(pi * (dx*X_array + dy*Y_array));

function SFFT2(f::Array{ComplexF64}, X_array::typX, Y_array::typY, alpha::Float64, beta::Float64, M::Int64, N::Int64)
    function exp_fact(kernel::Float64, array, A::Int64,shift)
        @.(cis(kernel * (array - 2.0*A*shift)^2 /A))
    exp_factor_n = exp_fact(pi*alpha,X_array,N,0);
    exp_factor_m = exp_fact(pi*beta,Y_array,M,0);
    exp_factor_n_shift = exp_fact(pi*alpha,X_array,N,1);
    exp_factor_m_shift = exp_fact(pi*beta,Y_array,M,1);

    exp_factor = 1.0 ./ (exp_factor_n .* exp_factor_m);
    g1_bar_pad = Matrix{ComplexF64}(undef,nextpow(2,N+M-1),nextpow(2,N+M-1));
    g2_bar_pad = copy(g1_bar_pad);
    g1_bar_pad[1:N,1:M] = @.(f * cis(pi*(alpha*X_array + beta*Y_array)) * exp_factor);
    g2_bar_pad[1:N,1:M] = 1.0./exp_factor; # top left chunk
    g2_bar_pad[N+1:2*N,1:M] = exp_factor_n_shift .* exp_factor_m; # bottom left chunk
    g2_bar_pad[1:N,M+1:2*M] = exp_factor_m_shift .* exp_factor_n; # top right chunk
    g2_bar_pad[N+1:2*N,M+1:2*M] = exp_factor_n_shift .* exp_factor_m_shift; # bottom right chunk

    conv_g1g2 = ifft(fft(g1_bar_pad) .* fft(g2_bar_pad))[1:N,1:M];
    @.(cis(pi*(alpha*(X_array - 0.5*N) + beta*(Y_array - 0.5*M))) * exp_factor * conv_g1g2)

@profview SFFT(f, X_array, Y_array, alpha, beta, M, N)

If you care about efficiency, you should use plan_fft to precompute a plan and re-use it, rather than calling fft repeatedly. (And ideally pass flags=FFTW.MEASURE or flags=FFTW.PATIENT.) This is the key to using FFTW effectively, as described in the FFTW FAQ.

(Matlab precomputes plans for a set of common sizes IIRC.)

There is also no need to use nextpow(2, …). FFTW is fast as long the sizes are composites of small factors 2,3,5,7. In particular, you can use nextprod([2,3,5], n) or nextprod([2,3,5,7], n) as discussed in another thread: How to set the length of fft … but it is better to set your length to be such a number to begin with rather than padding (which changes the transform, not to mention incurring copies).

1 Like

Great thank you for the suggestions I’ll implement them. I’ll also post the profiler breakdown.