I have a couple of matrices that I FFT, apply a nonlinear function to, and then IFFT.

My question: does using a multithreaded FFT plan create too much overhead each time FFTW is called? Is there any way to speed up or optimize this code?

Here’s an example:

``````using FFTW
using Base.Threads   # for @threads
#using MKL
#setting provider to MKL
#FFTW.set_provider!("mkl")
#using .SRRSBPMMod

sz = 2^13
Q0 = rand(ComplexF64,(sz,sz));
Q1 = rand(ComplexF64,(sz,sz));
Q2 = rand(ComplexF64,(sz,sz));
Operator = rand(ComplexF64,(sz,sz));

fr = Vector(FFTW.fftfreq(sz));
dwj = rand(10);
``````
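The `plan_thread` used below is an in-place, threaded 1D plan along the first (column) dimension, something like:

``````#assumed setup — the exact planning call wasn't shown above
FFTW.set_num_threads(Threads.nthreads());
plan_thread = FFTW.plan_fft!(Q0, 1, flags=FFTW.MEASURE);
``````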
``````function calculation!(plan, Qa::Matrix{ComplexF64}, Qb::Matrix{ComplexF64}, Qc::Matrix{ComplexF64}, OP::Matrix{ComplexF64}, f::Vector{Float64}, wj::Vector{Float64})

#FFT
plan*Qa;
plan*Qb;
plan*Qc;

#columnwise operations, threaded over columns
@threads for l in 1:size(Qa,2)

for j in 1:size(wj)[1]

@views @. Qa[:,l] = Qa[:,l] + wj[j]*abs2(OP[:,l]) / (-(f)^2 - 1.0im*(f)*wj[j] + wj[j]^2);

@views @. Qb[:,l] = Qb[:,l] + wj[j]*abs2(OP[:,l]) / (-(f)^2 - 1.0im*(f)*wj[j] + wj[j]^2);

@views @. Qc[:,l] = Qc[:,l] + wj[j]*abs2(OP[:,l]) / (-(f)^2 - 1.0im*(f)*wj[j] + wj[j]^2);

end
end

#IFFT
plan\Qa;
plan\Qb;
plan\Qc;

end
``````

I run it like:

`calculation!(plan_thread, Q0, Q1, Q2, Operator, fr, dwj);`

And benchmark with:

`@btime calculation!(plan_thread, $Q0, $Q1, $Q2, $Operator, $fr, $dwj);`

On my 32-core workstation with an Intel Xeon E5-2640 v3 @ 2.6 GHz I get:
`2.855 s (4086145 allocations: 174.87 MiB)`


What fraction of the time is spent in this loop? (If I were you I would replace the `for j in 1:size(wj)[1]` loop with an explicit `for i in 1:size(Q0,1), j in 1:length(wj)` loop and get rid of the broadcasting — broadcasting over each column multiple times in the `j` loop has poor temporal locality.) Make sure you know what you actually need to optimize!

Are you using FFTW or MKL? MKL’s multi-threading is not as composable with Julia’s (it uses its own thread pool), so you may get some thread contention, though maybe it isn’t a problem here since you aren’t nesting the FFT calls in a `@threads` loop.

In principle you can improve multithreading of the FFTs by replacing your 3 FFTs with a single FFT over a 3d array (i.e. “stack” your arrays into a `(sz,sz,3)` array, and transform along the first direction). That will let it parallelize the 3 transforms instead of performing them serially.
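For example, a quick sketch of what I mean:

``````Q = cat(Q0, Q1, Q2; dims=3);                       # stack into a (sz, sz, 3) array
plan3 = FFTW.plan_fft!(Q, 1, flags=FFTW.MEASURE);  # one threaded plan covering all three
plan3 * Q;                                         # transforms every column of all three matrices at once
``````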

Alternatively, you can plan a serial 1d FFT only and do it inside your `@threads for l in 1:size(Qa,2)` loop, so that you get better cache locality: transform a column, do some processing on it, and then inverse transform before moving on to the next column. (If you plan 1d FFTs, I would do them out-of-place to a pre-allocated buffer using `mul!`, since in-place 1d FFTs are usually less efficient. Note that FFTW plan execution is thread-safe, and you want a serial plan since it is nested inside a `@threads` loop.) This will also give better cache locality on your other processing, since it will always be on the same buffers.

I would also suggest explicitly planning a “backwards” plan (via `plan_bfft`), rather than using `plan \ ...`. A backwards plan is an unnormalized inverse, so it is missing a 1/N factor, but you can combine that scalar 1/N factor with your other processing loop at essentially no cost. This saves you a (non-threaded) extra pass over the array (since the inverse plan internally does a backwards FFT and then does a separate loop to scale the array). See also the FFTW FAQ on inverse transforms.
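For example, a minimal sketch of the idea (with a trivial per-element update standing in for your processing loop):

``````using FFTW
A = rand(ComplexF64, 8, 4);
fp = FFTW.plan_fft!(A, 1);    # forward, in-place, along dim 1
bp = FFTW.plan_bfft!(A, 1);   # unnormalized backward transform
invN = 1/size(A, 1);
fp * A;
@. A = invN * (A + 1);        # fold the 1/N normalization into the processing pass
bp * A;                       # no separate scaling pass needed afterwards
``````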

About 1/3 of the time is spent on the FFTs. And yes, about knowing what I need to optimize: that's why I'm asking whether there's a way to optimize the FFT part or whether I should just focus on the other parts of the code.

So from what I know, column-wise operations are faster, which is why I did that.

> `for i in 1:size(Q0,1), j in 1:length(wj)`

How would that get rid of the broadcasting? I would still need to divide each element in the column by each element in the frequency vector plus the `wj`.

> Alternatively, you can plan a serial 1d FFT only and do it inside your `@threads for l in 1:size(Qa,2)` loop, so that you get better cache locality: transform a column, do some processing on it, and then inverse transform before moving on to the next column. (If you plan 1d FFTs, I would do them out-of-place to a pre-allocated buffer using `mul!`, since in-place 1d FFTs are usually less efficient. Note that FFTW plan execution is thread-safe, and you want a serial plan since it is nested inside a `@threads` loop.) This will also give better cache locality on your other processing, since it will always be on the same buffers.

Interesting, I’ll give this a shot.

For example:

``````for i in 1:size(Q0,1), j in 1:length(wj)
Q0[i,l] = Q0[i,l] +  wj[j] *abs2(OP[i,l]) / (-f[i]^2 - 1.0im*(f[i])*wj[j] + wj[j]^2)
....
end
``````

This replaces the broadcasting with the `i` loop, and more importantly it switches the loop orders so that the `j` loop is innermost. This should improve cache locality, because now all of the calculations on `Q0[i,l]` are completed before moving on to the next `i`.

Ok so I tried a couple of things:

1. I tried rewriting the loop so that there wouldn't be any broadcasting, using a Cartesian index:
``````function calculation1!(plan, Qa::Matrix{ComplexF64}, Qb::Matrix{ComplexF64}, Qc::Matrix{ComplexF64}, OP::Matrix{ComplexF64}, f::Vector{Float64}, wj::Vector{Float64}, dwj::Vector{Float64}, gj::Vector{Float64})

#FFT
plan*Qa;
plan*Qb;
plan*Qc;

@threads for l in CartesianIndices(Qa)
m = l[1];
n = l[2];
for j in 1:size(wj)[1]

Qa[l] = Qa[l] + 0.5*wj[j]*dwj[j]*gj[j]/(-f[m]^2 - 1.0im*f[m]*dwj[j] + wj[j]^2)*abs2(OP[l]);

Qb[l] = Qb[l] + 0.5*wj[j]*dwj[j]*gj[j]/(-f[m]^2 - 1.0im*f[m]*dwj[j] + wj[j]^2)*conj(OP[l])*OP[l];

Qc[l] = Qc[l] + 0.5*wj[j]*dwj[j]*gj[j]/(-f[m]^2 - 1.0im*f[m]*dwj[j] + wj[j]^2)*conj(OP[l])*OP[l];

end
end

#IFFT
plan\Qa;
plan\Qb;
plan\Qc;

end
``````

Benchmarking gave me:

``````@btime calculation1!(plan_thread, Q0, Q1, Q2, Operator, fr, wj,dwj, gj );
3.122 s (609 allocations: 71.98 KiB)
``````

So this was actually worse.

2. Next I realized that the frequency operation (involving `wj`, `f`, `dwj`, and `gj`) is the same for all three matrices and can be computed once rather than once per matrix:
``````function calculation2!(plan, Qa::Matrix{ComplexF64}, Qb::Matrix{ComplexF64}, Qc::Matrix{ComplexF64}, OP::Matrix{ComplexF64}, f::Vector{Float64}, wj::Vector{Float64}, dwj::Vector{Float64}, gj::Vector{Float64})

#FFT
plan*Qa;
plan*Qb;
plan*Qc;

@threads for l in CartesianIndices(Qa)
m = l[1];
n = l[2];
for j in 1:size(wj)[1]

gainop = 0.5*wj[j]*dwj[j]*gj[j]/(-f[m]^2 - 1.0im*f[m]*dwj[j] + wj[j]^2);

Qa[l] = Qa[l] + gainop*abs2(OP[l]);

Qb[l] = Qb[l] + gainop*conj(OP[l])*OP[l];

Qc[l] = Qc[l] + gainop*conj(OP[l])*OP[l];

end
end

#IFFT
plan\Qa;
plan\Qb;
plan\Qc;

end
``````

With benchmarking of:

``````@btime calculation2!(plan_thread, Q0, Q1, Q2, Operator, fr, wj,dwj, gj );
2.023 s (204 allocations: 19.98 KiB)
``````

This definitely improved it and the code looks much cleaner than broadcasting.

3. Next, rather than use `CartesianIndices` (which I read somewhere a while ago is slow?) I used the loop configuration that @stevengj recommended:
``````function calculation3!(plan, Qa::Matrix{ComplexF64}, Qb::Matrix{ComplexF64}, Qc::Matrix{ComplexF64}, OP::Matrix{ComplexF64}, f::Vector{Float64}, wj::Vector{Float64}, dwj::Vector{Float64}, gj::Vector{Float64})

#FFT
plan*Qa;
plan*Qb;
plan*Qc;

@threads for l in 1:size(Qa,2)
for i in 1:size(Qa,1), j in 1:length(wj)

gainop = 0.5*wj[j]*dwj[j]*gj[j]/(-f[i]^2 - 1.0im*f[i]*dwj[j] + wj[j]^2);

Qa[i,l] = Qa[i,l] + gainop*abs2(OP[i,l]);

Qb[i,l] = Qb[i,l] + gainop*conj(OP[i,l])*OP[i,l];

Qc[i,l] = Qc[i,l] + gainop*conj(OP[i,l])*OP[i,l];

end
end

#IFFT
plan\Qa;
plan\Qb;
plan\Qc;

end
``````

With benchmarking:

``````@btime calculation3!(plan_thread, Q0, Q1, Q2, Operator, fr, wj,dwj, gj );
1.995 s (206 allocations: 20.05 KiB)
``````

This improved it just a tad. It makes me wonder what the difference is between `CartesianIndices` and specifying the loop indices explicitly.

4. Next, because the frequency operator (which I call `gOperator`) is the same for all matrices and constant throughout the computation, I precomputed it as a matrix operator:
``````wj = rand(10);
dwj = rand(10);
gj = rand(10);
gop = zeros(ComplexF64, size(Q0));

function goperator!(gOP::Matrix{ComplexF64}, f::Vector{Float64}, wj::Vector{Float64}, dwj::Vector{Float64}, gj::Vector{Float64})

for l in 1:size(gOP,2)
for m in 1:size(gOP,1), j in 1:length(wj)

#accumulate the sum over j into the precomputed operator
gOP[m,l] = gOP[m,l] + 0.5*wj[j]*dwj[j]*gj[j]/(-f[m]^2 - 1.0im*f[m]*dwj[j] + wj[j]^2);

end
end

end
``````

With this I made a new calculation function:

``````function calculation4!(plan, In::Matrix{ComplexF64}, Qa::Matrix{ComplexF64}, Qb::Matrix{ComplexF64}, Qc::Matrix{ComplexF64}, gOP::Matrix{ComplexF64})

#FFT
plan*Qa;
plan*Qb;
plan*Qc;

@threads for l in eachindex(Qa)

Qa[l] = Qa[l] + gOP[l]*abs2(In[l]);

Qb[l] = Qb[l] + gOP[l]*conj(In[l])*In[l];

Qc[l] = Qc[l] + gOP[l]*conj(In[l])*In[l];

end

#IFFT
plan\Qa;
plan\Qb;
plan\Qc;

end
``````

With the benchmark:

``````@btime calculation4!(plan_thread, Intensity, Q0, Q1, Q2, gop);
1.268 s (214 allocations: 19.23 KiB)
``````

This made it significantly faster, although if the `gOperator` were actually changing this wouldn't be possible. For my problem it works.

NOTE: I used the `eachindex()` function for the loop over `l`.

5. Next, rather than using `eachindex()` to loop, I used an explicit for loop:
``````function calculation5!(plan, In::Matrix{ComplexF64}, Q::Array{ComplexF64, 3}, gOP::Matrix{ComplexF64})

#FFT
plan*Q;

@threads for l in 1:size(Q,2)
for m in 1:size(Q,1)

Q[m,l,1] = Q[m,l,1] + gOP[m,l]*abs2(In[m,l]);

Q[m,l,2] = Q[m,l,2] + gOP[m,l]*conj(In[m,l])*In[m,l];

Q[m,l,3] = Q[m,l,3] + gOP[m,l]*conj(In[m,l])*In[m,l];

end
end

#IFFT
plan\Q;
end
``````
``````@btime calculation5!(plan_thread3D, Intensity,Q3D, gop);
1.147 s (680 allocations: 56.45 KiB)
``````

This was again faster than using `CartesianIndices` or `eachindex`.

6. Next, I combined my Q matrices into a 3D array of shape (sz, sz, 3).
NOTE: For 6 and 7 I had to switch back from MKL to the default FFTW provider because MKL did not want to do a non-square FFT.

`Q3D = rand(ComplexF64, (sz,sz,3));`

and formed an FFT plan along the 1st dimension:

`plan_thread3D = FFTW.plan_fft!(Q3D,1,flags=FFTW.MEASURE);`

``````function calculation6!(plan, In::Matrix{ComplexF64}, Q::Array{ComplexF64, 3}, gOP::Matrix{ComplexF64})

#FFT
plan*Q;

@threads for l in 1:size(Q,3)
for m in 1:size(Q,2)

Q[1,m,l] = Q[1,m,l] + gOP[m,l]*abs2(In[m,l]);

Q[2,m,l] = Q[2,m,l] + gOP[m,l]*conj(In[m,l])*In[m,l];

Q[3,m,l] = Q[3,m,l] + gOP[m,l]*conj(In[m,l])*In[m,l];

end
end

#IFFT
plan\Q;

end
``````
``````@btime calculation6!(plan_thread3D, Intensity,Q3D, gop);
1.352 s (678 allocations: 56.42 KiB)
``````

This was slower than doing individual 1D FFTs on each of the three 2D matrices, although not by much.

7. I thought maybe the slowdown in 6 came from the way the matrices were stacked, so I flipped the order:
``````Q3Da = rand(ComplexF64, (3,sz,sz));
``````

and ran `calculation6!` on it:

``````@btime calculation6!(plan_thread3Da, Intensity,Q3Da, gop);
1.238 s (678 allocations: 56.42 KiB)
``````
8. Last, I made a serial FFT plan to see if I could incorporate it into the multithreaded operations:
``````serialarray = rand(ComplexF64, sz);
serialplan = FFTW.plan_fft!(serialarray, flags=FFTW.MEASURE);
``````
``````function calculation7!(plan, In::Matrix{ComplexF64}, Qa::Matrix{ComplexF64}, Qb::Matrix{ComplexF64}, Qc::Matrix{ComplexF64}, gOP::Matrix{ComplexF64})

#FFT
#plan*Q;

@threads for l in 1:size(Qa,2)

plan*Qa[:,l];
plan*Qb[:,l];
plan*Qc[:,l];

for m in 1:size(Qa,1)

Qa[m,l] = Qa[m,l] + gOP[m,l]*abs2(In[m,l]);

Qb[m,l] = Qb[m,l] + gOP[m,l]*conj(In[m,l])*In[m,l];

Qc[m,l] = Qc[m,l] + gOP[m,l]*conj(In[m,l])*In[m,l];

end
plan\Qa[:,l];
plan\Qb[:,l];
plan\Qc[:,l];

end

#plan\Q;

end
``````

With the benchmark:

``````@btime calculation7!(serialplan, Intensity, Q0, Q1, Q2, gop);
1.504 s (10226925 allocations: 6.84 GiB)
``````

The allocations for this were huge, and it did not provide any time benefit.

CONCLUSION

• There’s still more to be learned about the best way to optimize FFTs over many matrices using multithreading.
• There’s something I don’t understand about explicitly setting the loop indices versus using `CartesianIndices` and `eachindex`. Additionally, I’m not sure why broadcasting doesn’t do the same thing as explicit indexing in for loops? Perhaps @stevengj can comment on this.


Are you forgetting `@views`? (Your `Qa[:,l]`-style slices in `calculation7!` allocate a copy of each column.)

You might also try `@inbounds` or, better yet, LoopVectorization.jl, on your loops. (If you use `@inbounds`, probably do a check before the beginning of your loop that all of your arrays match in size/axes.)
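For example, a check along these lines before the loop (using your array names):

``````axes(Qa) == axes(Qb) == axes(Qc) == axes(gOP) == axes(In) ||
    throw(DimensionMismatch("Qa, Qb, Qc, gOP, and In must have matching axes"))
``````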

Here are the new trials:

1. Using a serial 1D FFT but with `@views`:
``````function calculation8!(plan, In::Matrix{ComplexF64}, Qa::Matrix{ComplexF64}, Qb::Matrix{ComplexF64}, Qc::Matrix{ComplexF64}, gOP::Matrix{ComplexF64})

#FFT
#plan*Q;

@threads for l in 1:size(Qa,2)

@views plan*Qa[:,l];
@views plan*Qb[:,l];
@views plan*Qc[:,l];

for m in 1:size(Qa,1)

Qa[m,l] = Qa[m,l] + gOP[m,l]*abs2(In[m,l]);

Qb[m,l] = Qb[m,l] + gOP[m,l]*conj(In[m,l])*In[m,l];

Qc[m,l] = Qc[m,l] + gOP[m,l]*conj(In[m,l])*In[m,l];

end
@views plan\Qa[:,l];
@views plan\Qb[:,l];
@views plan\Qc[:,l];

end

#plan\Q;

end
``````

Benchmarking:

``````calculation8!(serialplan, Intensity, Q0, Q1, Q2, gop);
@btime calculation8!(serialplan, Intensity, Q0, Q1, Q2, gop);
421.126 ms (49356 allocations: 2.64 MiB)
``````

That is a pretty amazing speedup.

2. Next I added `@inbounds`, although I wasn't sure exactly where to put it relative to `@threads`:
``````function calculation9!(plan, In::Matrix{ComplexF64}, Qa::Matrix{ComplexF64}, Qb::Matrix{ComplexF64}, Qc::Matrix{ComplexF64}, gOP::Matrix{ComplexF64})

#FFT
#plan*Q;

@inbounds @threads for l in 1:size(Qa,2)
@views plan*Qa[:,l];
@views plan*Qb[:,l];
@views plan*Qc[:,l];

@inbounds for m in 1:size(Qa,1)

Qa[m,l] = Qa[m,l]  +gOP[m,l]*abs2(In[m,l]);

Qb[m,l] = Qb[m,l] +gOP[m,l]*conj(In[m,l])*In[m,l];

Qc[m,l]= Qc[m,l] +gOP[m,l]*conj(In[m,l])*In[m,l];

end
@views plan\Qa[:,l];
@views plan\Qb[:,l];
@views plan\Qc[:,l];

end

#plan\Q;

end
``````
``````calculation9!(serialplan, Intensity, Q0, Q1, Q2, gop);
@btime calculation9!(serialplan, Intensity, Q0, Q1, Q2, gop);
401.456 ms (49355 allocations: 2.64 MiB)
``````

This provided a nominal speedup which is great!

CONCLUSION
Using a serial 1D FFT inside the multithreaded loop (9) is about 2x faster than (6-7), which used a multithreaded 1D FFT outside a multithreaded loop.

NOTE: LoopVectorization.jl doesn't work with complex numbers, so I wasn't able to get it to work for this application.

As I mentioned earlier, you should be able to get an additional speedup if you explicitly form a “backwards” plan (with `plan_bfft!`) instead of `plan \ ...`, and then fold the 1/N normalization factor into your `for m` loop.

`conj(In[m,l])*In[m,l]` is mathematically the same as (but less efficient than) `abs2(In[m,l])`. Also, it looks like you are calculating `gOP[m,l]*abs2(In[m,l])` three times?

Yeah, it was more of a stand-in for multiplying two different `In` matrices. For example, in my real code it would be something like:
`conj(In1[m,l])*In2[m,l];`

I’ll give the backwards transform a shot but I’m actually pretty happy and surprised at how much I’ve learned from this.

@stevengj I'm going back and implementing this suggestion now that I have more time. The only problem is that `mul!` with an out-of-place plan and a pre-allocated output doesn't seem to work for a 1D FFT. Is there another function for vector products?

Or would the following be equivalent for pre-allocated `Qa` and `qa` matrices?

`@. qa[:,l] = backwardFFTplan1D * Qa[:,l] ;`

Works for me:

``````julia> using FFTW, LinearAlgebra

julia> x = rand(ComplexF64, 1000); y = similar(x);

julia> p = plan_fft(x);

julia> mul!(y, p, x);

julia> y ≈ fft(x)
true
``````

The only tricky part, in your context, is that you want a pre-allocated buffer per thread/task in order to do multiple FFTs in parallel.

No, an FFT is not an element-wise operation so you can’t use broadcasting for this.
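Instead, apply the out-of-place plan column by column with `mul!` and views, so the result lands in your pre-allocated `qa`; something like:

``````@views mul!(qa[:,l], backwardFFTplan1D, Qa[:,l]);
``````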

> Works for me:

Huh, something weird was going on with my plans, so I restarted the kernel and it seems to work now.

@stevengj a couple of things:

1. `mul!(y, p, x)` seems to work properly in some code and not in others when using vectors `x` and `y`. For example, inside a function:
``````function doFFT1Don2D!(Y::Matrix{ComplexF64}, X::Matrix{ComplexF64}, plan1D)

for m in 1:size(Y,2)

mul!(Y[:,m], plan1D, X[:,m]);

end
end
``````

This did not work (all of `Y` ended up as zeros). Instead I had to use `@view`:

``````function doFFT1Don2D!(Y::Matrix{ComplexF64}, X::Matrix{ComplexF64}, plan1D)

for m in 1:size(Y,2)

mul!((@view Y[:,m]), plan1D, X[:,m]);

end
end
``````

This issue happened for both serial and threaded plans.

> The only tricky part, in your context, is that you want a pre-allocated buffer per thread/task in order to do multiple FFTs in parallel.

2. Can I get around this by having pre-allocated matrices the same size as the 2D matrix (say `X`) on which I want to perform a columnwise 1D FFT? For example:

``````function calculation9!(forplan, bacplan, In::Matrix{ComplexF64}, Qa::Matrix{ComplexF64}, Qb::Matrix{ComplexF64}, Qc::Matrix{ComplexF64}, Qd::Matrix{ComplexF64}, Qe::Matrix{ComplexF64},Qf::Matrix{ComplexF64},  gOP::Matrix{ComplexF64})

#FFT
#forplan - out of place forward unnormalized fft
#bacplan - out of place backwards unnormalized fft
#get size of 1D array for normalization
sz1 = size(Qa,1);
invsz1 = 1.0/sz1;

@inbounds @threads for l in 1:size(Qa,2)
#forward transform
mul!((@view Qd[:,m]), forplan, Qa[:,m]);
mul!((@view Qe[:,m]), forplan, Qb[:,m]);
mul!((@view Qf[:,m]), forplan, Qc[:,m]);

@inbounds for m in 1:size(Qa,1)

#divide by sz1 for normalization
Qa[m,l] = invsz1*(Qd[m,l]  +gOP[m,l]*abs2(In[m,l]) );

Qb[m,l] = invsz1*( Qe[m,l] +gOP[m,l]*conj(In[m,l])*In[m,l] );

Qc[m,l]= invsz1*( Qf[m,l] +gOP[m,l]*conj(In[m,l])*In[m,l] );

end
#backwards transform
mul!((@view Qd[:,m]), bacplan, Qa[:,m]);
mul!((@view Qe[:,m]), bacplan, Qb[:,m]);
mul!((@view Qf[:,m]), bacplan, Qc[:,m]);

end

end
``````

I believe doing this solves the race issue, since each thread writes only to its own pre-allocated columns. Of course this isn't ideal, because if the Q matrices are too large it will consume a lot of memory, and each column's buffer might not be in the local cache of the thread working on it. I'm hoping to learn more about pre-allocating a buffer per thread/task, but for now I've got to get this up and running.

This has nothing to do with FFTW and is just a basic consequence of how Julia works. It won’t work because `Y[:, m]` creates a copy of the m-th column, so the output of the FFT goes into the copy and not into `Y`. All slices in Julia create copies by default. Just use:

``````@views mul!(Y[:,m], plan1D, X[:,m])
``````

so that `Y[:,m]` is a view (i.e. the output goes into `Y`), and `X[:,m]` is also a view (just to improve efficiency by avoiding a copy of the column of `X`).

Yes, but this is a lot more buffer space than strictly necessary. You really only need one 1d buffer per thread/task.
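For example, a rough sketch (illustrative names, untested) of one way to do that: give each task its own chunk of columns and a single 1d buffer, with a serial out-of-place forward plan and an unnormalized backward plan:

``````using FFTW, LinearAlgebra, Base.Threads

function process_columns!(Qa, In, gOP)
    n = size(Qa, 1)
    fplan = plan_fft(Vector{ComplexF64}(undef, n))    # serial, out-of-place forward plan for one column
    bplan = plan_bfft(Vector{ComplexF64}(undef, n))   # unnormalized backward plan
    invN = 1 / n
    chunks = Iterators.partition(axes(Qa, 2), cld(size(Qa, 2), nthreads()))
    @sync for chunk in chunks
        @spawn begin
            buf = Vector{ComplexF64}(undef, n)        # one 1d buffer per task
            for l in chunk
                @views mul!(buf, fplan, Qa[:, l])                          # forward FFT of one column
                @views @. buf = invN * (buf + gOP[:, l] * abs2(In[:, l]))  # processing, with 1/N folded in
                mul!((@view Qa[:, l]), bplan, buf)                         # unnormalized inverse, back into Qa
            end
        end
    end
    return Qa
end
``````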

PS. You don’t need a semicolon after every line in Julia. Semicolons after a line are only useful when running code in the REPL or other interactive prompt, to suppress output.


Shouldn't it be `l` rather than `m` inside the `mul!` calls here?

One thing is that BLAS `mul!` runs multi-threaded by default. Since you are threading at a higher level, you might want to disable BLAS threading with `BLAS.set_num_threads(1)`. That may be beneficial here.

This is not a BLAS call. `mul!(y, plan, x)` calls `fftw_execute_dft(plan, x, y)` to execute `plan` (perform an FFT) on `x` with output in `y`.

(`mul!` is the appropriate function to overload here because an FFT is a linear operator.)
