Scaled FFT implementation seems to only use one thread despite setting no. of threads to 8

Hi there,

I have been trying to implement a form of the scaled FFT for use in optical propagation. I also rewrote the code in MATLAB as a comparison, and my Julia implementation is much slower than I was expecting: with identical parameters to the Julia script below, MATLAB takes around 0.38 s compared to ~0.95 s in Julia (MATLAB's CPU usage is also significantly higher throughout). I have used the built-in profiler to try to analyse the situation, and it seems to indicate that Julia is only using 1 thread, despite me setting the environment variable in VS Code to use 8 threads. I have also tried running in a standalone instance of Julia launched with 8 threads (-t 8). Does anyone have any insight into the possible slowdown?
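
A minimal sanity check along these lines (assuming FFTW's thread count is configured separately from Julia's -t setting and has to be set explicitly with FFTW.set_num_threads) confirms the Julia thread count and sets FFTW's to match:

using FFTW

# Number of threads the Julia session was actually launched with
println("Julia threads: ", Threads.nthreads())

# FFTW's internal threading is independent of Julia's -t setting;
# set it explicitly to match (8 here mirrors the -t 8 launch above)
FFTW.set_num_threads(8)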

If the multithreading is in fact working, does anyone have any ideas as to why my code is noticeably slower in Julia? Am I missing an obvious trick that is causing wild inefficiencies?

Thanks in advance for any help!

using FFTW
using GridCreation
using IJulia
using LazyGrids
using ProfileView
# notebook()

N = 1024;
M = 1024;
alpha = 1.0;
beta = 1.0;

dx = 0.1; # Sampling unit for function 
dy = 0.1;

Y_array, X_array = ndgrid(-N/2 : N/2 -1,-M/2 : M/2 -1);

# Concrete types of the (lazy) grids returned by ndgrid, used in the argument annotations below
const typX = typeof(X_array);
const typY = typeof(Y_array);

f = @. cis(pi * (dx*X_array + dy*Y_array));

function SFFT2(f::Array{ComplexF64}, X_array::typX, Y_array::typY, alpha::Float64, beta::Float64, M::Int64, N::Int64)
    function exp_fact(kernel::Float64, array, A::Int64,shift)
        @.(cis(kernel * (array - 2.0*A*shift)^2 /A))
    end
   
    exp_factor_n = exp_fact(pi*alpha,X_array,N,0);
    exp_factor_m = exp_fact(pi*beta,Y_array,M,0);
    exp_factor_n_shift = exp_fact(pi*alpha,X_array,N,1);
    exp_factor_m_shift = exp_fact(pi*beta,Y_array,M,1);

    exp_factor = 1.0 ./ (exp_factor_n .* exp_factor_m);
    
    # Zero-padded buffers for the linear convolution; use zeros(...) rather than
    # undef so the padding region does not contain garbage that the FFT picks up
    P = nextpow(2, N + M - 1);
    g1_bar_pad = zeros(ComplexF64, P, P);
    g2_bar_pad = zeros(ComplexF64, P, P);
    g1_bar_pad[1:N,1:M] = @.(f * cis(pi*(alpha*X_array + beta*Y_array)) * exp_factor);
    
    g2_bar_pad[1:N,1:M] = 1.0./exp_factor; # top left chunk
    g2_bar_pad[N+1:2*N,1:M] = exp_factor_n_shift .* exp_factor_m; # bottom left chunk
    g2_bar_pad[1:N,M+1:2*M] = exp_factor_m_shift .* exp_factor_n; # top right chunk
    g2_bar_pad[N+1:2*N,M+1:2*M] = exp_factor_n_shift .* exp_factor_m_shift; # bottom right chunk

    conv_g1g2 = ifft(fft(g1_bar_pad) .* fft(g2_bar_pad))[1:N,1:M];
    @.(cis(pi*(alpha*(X_array - 0.5*N) + beta*(Y_array - 0.5*M))) * exp_factor * conv_g1g2)
end

@profview SFFT2(f, X_array, Y_array, alpha, beta, M, N)

If you care about efficiency, you should use plan_fft to precompute a plan and re-use it, rather than calling fft repeatedly. (And ideally pass flags=FFTW.MEASURE or flags=FFTW.PATIENT.) This is the key to using FFTW effectively, as described in the FFTW FAQ.
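
A minimal sketch of that pattern (the array size and flags are only illustrative; note that planning with FFTW.MEASURE may scribble over the input array, so plan before filling it with real data):

using FFTW, LinearAlgebra, Random

A = Array{ComplexF64}(undef, 2048, 2048)

# Plan once; MEASURE benchmarks several FFT algorithms up front
p  = plan_fft(A; flags=FFTW.MEASURE)
ip = plan_ifft(A; flags=FFTW.MEASURE)

rand!(A)   # now fill A with the actual data

# Apply the precomputed plans instead of calling fft/ifft each time
B = p * A
C = ip * B

# Writing into a preallocated output with mul! avoids a fresh allocation
mul!(B, p, A)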

(Matlab precomputes plans for a set of common sizes IIRC.)

There is also no need to use nextpow(2, …). FFTW is fast as long as the sizes are products of the small factors 2, 3, 5, and 7. In particular, you can use nextprod([2,3,5], n) or nextprod([2,3,5,7], n), as discussed in another thread: How to set the length of fft … but it is better to choose your length to be such a number to begin with rather than padding (which changes the transform, not to mention incurring copies).
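
A small illustrative comparison (the length 1100 is just an arbitrary example):

nextpow(2, 1100)              # 2048 -> nearly doubles the array
nextprod([2, 3, 5, 7], 1100)  # 1120 -> a 7-smooth length much closer to the original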

Great, thank you for the suggestions, I'll implement them. I'll also post the profiler breakdown.