I have a couple of matrices that I FFT, apply a nonlinear function to, and then IFFT.

My question: does using a multithreaded FFT plan create too much overhead each time FFTW is called? Is there any way to speed up or optimize this code?

Here’s an example:

``````using FFTW
using Base.Threads   # for @threads
#using MKL
#setting provider to MKL
#FFTW.set_provider!("mkl")
#using .SRRSBPMMod

sz = 2^13
Q0 = rand(ComplexF64,(sz,sz));
Q1 = rand(ComplexF64,(sz,sz));
Q2 = rand(ComplexF64,(sz,sz));
Operator = rand(ComplexF64,(sz,sz));

fr = Vector(FFTW.fftfreq(sz));
dwj = rand(10);
``````
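The `plan_thread` used below is an in-place, threaded 1D plan along the first (column) dimension, something like:

``````#assumed setup — the exact planning call wasn't shown above
FFTW.set_num_threads(Threads.nthreads());
plan_thread = FFTW.plan_fft!(Q0, 1, flags=FFTW.MEASURE);
``````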
``````function calculation!(plan, Qa::Matrix{ComplexF64}, Qb::Matrix{ComplexF64}, Qc::Matrix{ComplexF64}, OP::Matrix{ComplexF64}, f::Vector{Float64}, wj::Vector{Float64})

#FFT
plan*Qa;
plan*Qb;
plan*Qc;

#columnwise operations, threaded over columns
@threads for l in 1:size(Qa,2)

for j in 1:size(wj)[1]

@views @. Qa[:,l] = Qa[:,l] + wj[j]*abs2(OP[:,l]) / (-(f)^2 - 1.0im*(f)*wj[j] + wj[j]^2);

@views @. Qb[:,l] = Qb[:,l] + wj[j]*abs2(OP[:,l]) / (-(f)^2 - 1.0im*(f)*wj[j] + wj[j]^2);

@views @. Qc[:,l] = Qc[:,l] + wj[j]*abs2(OP[:,l]) / (-(f)^2 - 1.0im*(f)*wj[j] + wj[j]^2);

end
end

#IFFT
plan\Qa;
plan\Qb;
plan\Qc;

end
``````

I run it like:

`calculation!(plan_thread, Q0, Q1, Q2, Operator, fr, dwj);`

And benchmark with:

`@btime calculation!(plan_thread, $Q0, $Q1, $Q2, $Operator, $fr, $dwj);`

On my 32-core workstation with an Intel Xeon E5-2640 v3 @ 2.6 GHz I get:
`2.855 s (4086145 allocations: 174.87 MiB)`


What fraction of the time is spent in this loop? (If I were you I would replace the `for j in 1:size(wj)[1]` loop with an explicit `for i in 1:size(Q0,1), j in 1:length(wj)` loop and get rid of the broadcasting — broadcasting over each column multiple times in the `j` loop has poor temporal locality.) Make sure you know what you actually need to optimize!

Are you using FFTW or MKL? MKL’s multi-threading is not as composable with Julia’s (it uses its own thread pool), so you may get some thread contention, though maybe it isn’t a problem here since you aren’t nesting the FFT calls in a `@threads` loop.

In principle you can improve multithreading of the FFTs by replacing your 3 FFTs with a single FFT over a 3d array (i.e. “stack” your arrays into a `(sz,sz,3)` array, and transform along the first direction). That will let it parallelize the 3 transforms instead of performing them serially.
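For example, a quick sketch of what I mean:

``````Q = cat(Q0, Q1, Q2; dims=3);                       # stack into a (sz, sz, 3) array
plan3 = FFTW.plan_fft!(Q, 1, flags=FFTW.MEASURE);  # one threaded plan covering all three
plan3 * Q;                                         # transforms every column of all three matrices at once
``````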

Alternatively, you can plan a serial 1d FFT only and do it inside your `@threads for l in 1:size(Qa,2)` loop, so that you get better cache locality: transform a column, do some processing on it, and then inverse transform before moving on to the next column. (If you plan 1d FFTs, I would do them out-of-place to a pre-allocated buffer using `mul!`, since in-place 1d FFTs are usually less efficient. Note that FFTW plan execution is thread-safe, and you want a serial plan since it is nested inside a `@threads` loop.) This will also give better cache locality on your other processing, since it will always be on the same buffers.

I would also suggest explicitly planning a “backwards” plan (via `plan_bfft`), rather than using `plan \ ...`. A backwards plan is an unnormalized inverse, so it is missing a 1/N factor, but you can combine that scalar 1/N factor with your other processing loop at essentially no cost. This saves you a (non-threaded) extra pass over the array (since the inverse plan internally does a backwards FFT and then does a separate loop to scale the array). See also the FFTW FAQ on inverse transforms.
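For example, a minimal sketch of the idea (with a trivial per-element update standing in for your processing loop):

``````using FFTW
A = rand(ComplexF64, 8, 4);
fp = FFTW.plan_fft!(A, 1);    # forward, in-place, along dim 1
bp = FFTW.plan_bfft!(A, 1);   # unnormalized backward transform
invN = 1/size(A, 1);
fp * A;
@. A = invN * (A + 1);        # fold the 1/N normalization into the processing pass
bp * A;                       # no separate scaling pass needed afterwards
``````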

About 1/3 of the time is spent on the FFTs. And yes, about knowing what I need to optimize: that's why I'm asking whether there's a way to optimize the FFT part or whether I should just focus on the other parts of the code.

So from what I know, column-wise operations are faster, which is why I did that.

> `for i in 1:size(Q0,1), j in 1:length(wj)`

How would that get rid of the broadcasting? I would still need to divide each element in the column by each element in the frequency vector plus the `wj`.

> Alternatively, you can plan a serial 1d FFT only and do it inside your `@threads for l in 1:size(Qa,2)` loop, so that you get better cache locality: transform a column, do some processing on it, and then inverse transform before moving on to the next column. (If you plan 1d FFTs, I would do them out-of-place to a pre-allocated buffer using `mul!`, since in-place 1d FFTs are usually less efficient. Note that FFTW plan execution is thread-safe, and you want a serial plan since it is nested inside a `@threads` loop.) This will also give better cache locality on your other processing, since it will always be on the same buffers.

Interesting, I’ll give this a shot.

For example:

``````for i in 1:size(Q0,1), j in 1:length(wj)
Q0[i,l] = Q0[i,l] +  wj[j] *abs2(OP[i,l]) / (-f[i]^2 - 1.0im*(f[i])*wj[j] + wj[j]^2)
....
end
``````

This replaces the broadcasting with the `i` loop, and more importantly it switches the loop orders so that the `j` loop is innermost. This should improve cache locality, because now all of the calculations on `Q0[i,l]` are completed before moving on to the next `i`.

Ok so I tried a couple of things:

1. I tried rewriting the loop so that there wouldn't be any broadcasting, using a Cartesian index:
``````function calculation1!(plan, Qa::Matrix{ComplexF64}, Qb::Matrix{ComplexF64}, Qc::Matrix{ComplexF64}, OP::Matrix{ComplexF64}, f::Vector{Float64}, wj::Vector{Float64}, dwj::Vector{Float64}, gj::Vector{Float64})

#FFT
plan*Qa;
plan*Qb;
plan*Qc;

@threads for l in CartesianIndices(Qa)
m = l[1];
n = l[2];
for j in 1:size(wj)[1]

Qa[l] = Qa[l] + 0.5*wj[j]*dwj[j]*gj[j]/(-f[m]^2 - 1.0im*f[m]*dwj[j] + wj[j]^2)*abs2(OP[l]);

Qb[l] = Qb[l] + 0.5*wj[j]*dwj[j]*gj[j]/(-f[m]^2 - 1.0im*f[m]*dwj[j] + wj[j]^2)*conj(OP[l])*OP[l];

Qc[l] = Qc[l] + 0.5*wj[j]*dwj[j]*gj[j]/(-f[m]^2 - 1.0im*f[m]*dwj[j] + wj[j]^2)*conj(OP[l])*OP[l];

end
end

#IFFT
plan\Qa;
plan\Qb;
plan\Qc;

end
``````

Benchmarking gave me:

``````@btime calculation1!(plan_thread, Q0, Q1, Q2, Operator, fr, wj,dwj, gj );
3.122 s (609 allocations: 71.98 KiB)
``````

So this was actually worse.

2. Next I realized that the frequency operation (involving `wj`, `f`, `dwj`, and `gj`) is the same for all three matrices and can be computed once rather than once per matrix:
``````function calculation2!(plan, Qa::Matrix{ComplexF64}, Qb::Matrix{ComplexF64}, Qc::Matrix{ComplexF64}, OP::Matrix{ComplexF64}, f::Vector{Float64}, wj::Vector{Float64}, dwj::Vector{Float64}, gj::Vector{Float64})

#FFT
plan*Qa;
plan*Qb;
plan*Qc;

@threads for l in CartesianIndices(Qa)
m = l[1];
n = l[2];
for j in 1:size(wj)[1]

gainop = 0.5*wj[j]*dwj[j]*gj[j]/(-f[m]^2 - 1.0im*f[m]*dwj[j] + wj[j]^2);

Qa[l] = Qa[l] + gainop*abs2(OP[l]);

Qb[l] = Qb[l] + gainop*conj(OP[l])*OP[l];

Qc[l] = Qc[l] + gainop*conj(OP[l])*OP[l];

end
end

#IFFT
plan\Qa;
plan\Qb;
plan\Qc;

end
``````

With benchmarking of:

``````@btime calculation2!(plan_thread, Q0, Q1, Q2, Operator, fr, wj,dwj, gj );
2.023 s (204 allocations: 19.98 KiB)
``````

This definitely improved it and the code looks much cleaner than broadcasting.

3. Next, rather than use `CartesianIndices` (which I read somewhere a while ago is slow?) I used the loop configuration that @stevengj recommended:
``````function calculation3!(plan, Qa::Matrix{ComplexF64}, Qb::Matrix{ComplexF64}, Qc::Matrix{ComplexF64}, OP::Matrix{ComplexF64}, f::Vector{Float64}, wj::Vector{Float64}, dwj::Vector{Float64}, gj::Vector{Float64})

#FFT
plan*Qa;
plan*Qb;
plan*Qc;

@threads for l in 1:size(Qa,2)
for i in 1:size(Qa,1), j in 1:length(wj)

gainop = 0.5*wj[j]*dwj[j]*gj[j]/(-f[i]^2 - 1.0im*f[i]*dwj[j] + wj[j]^2);

Qa[i,l] = Qa[i,l] + gainop*abs2(OP[i,l]);

Qb[i,l] = Qb[i,l] + gainop*conj(OP[i,l])*OP[i,l];

Qc[i,l] = Qc[i,l] + gainop*conj(OP[i,l])*OP[i,l];

end
end

#IFFT
plan\Qa;
plan\Qb;
plan\Qc;

end
``````

With benchmarking:

``````@btime calculation3!(plan_thread, Q0, Q1, Q2, Operator, fr, wj,dwj, gj );
1.995 s (206 allocations: 20.05 KiB)
``````

This improved it just a tad. It makes me wonder what the difference is between `CartesianIndices` and specifying the loop indices explicitly.

4. Next, because the frequency operator (which I call `gOperator`) is the same for all matrices and constant throughout the computation, I precomputed it as a matrix operator:
``````wj = rand(10);
dwj = rand(10);
gj = rand(10);
gop = zeros(ComplexF64, size(Q0));

function goperator!(gOP::Matrix{ComplexF64}, f::Vector{Float64}, wj::Vector{Float64}, dwj::Vector{Float64}, gj::Vector{Float64})

for l in 1:size(gOP,2)
for m in 1:size(gOP,1), j in 1:length(wj)

#accumulate the sum over j into the precomputed operator
gOP[m,l] = gOP[m,l] + 0.5*wj[j]*dwj[j]*gj[j]/(-f[m]^2 - 1.0im*f[m]*dwj[j] + wj[j]^2);

end
end

end
``````

With this I made a new calculation function:

``````function calculation4!(plan, In::Matrix{ComplexF64}, Qa::Matrix{ComplexF64}, Qb::Matrix{ComplexF64}, Qc::Matrix{ComplexF64}, gOP::Matrix{ComplexF64})

#FFT
plan*Qa;
plan*Qb;
plan*Qc;

@threads for l in eachindex(Qa)

Qa[l] = Qa[l] + gOP[l]*abs2(In[l]);

Qb[l] = Qb[l] + gOP[l]*conj(In[l])*In[l];

Qc[l] = Qc[l] + gOP[l]*conj(In[l])*In[l];

end

#IFFT
plan\Qa;
plan\Qb;
plan\Qc;

end
``````

With the benchmark:

``````@btime calculation4!(plan_thread, Intensity, Q0, Q1, Q2, gop);
1.268 s (214 allocations: 19.23 KiB)
``````

This made it significantly faster, although if the `gOperator` were actually changing this wouldn't be possible. For my problem it works.

NOTE: I used the `eachindex()` function for the loop over `l`.

5. Next, rather than using `eachindex()` to loop, I used an explicit for loop:
``````function calculation5!(plan, In::Matrix{ComplexF64}, Q::Array{ComplexF64, 3}, gOP::Matrix{ComplexF64})

#FFT
plan*Q;

@threads for l in 1:size(Q,2)
for m in 1:size(Q,1)

Q[m,l,1] = Q[m,l,1] + gOP[m,l]*abs2(In[m,l]);

Q[m,l,2] = Q[m,l,2] + gOP[m,l]*conj(In[m,l])*In[m,l];

Q[m,l,3] = Q[m,l,3] + gOP[m,l]*conj(In[m,l])*In[m,l];

end
end

#IFFT
plan\Q;
end
``````
``````@btime calculation5!(plan_thread3D, Intensity,Q3D, gop);
1.147 s (680 allocations: 56.45 KiB)
``````

This was again faster than using `CartesianIndices` or `eachindex`.

6. Next, I combined my Q matrices into a 3D array of shape (sz, sz, 3).
NOTE: For 6 and 7 I had to switch back from MKL to the default FFTW provider because MKL did not want to do a non-square FFT.

`Q3D = rand(ComplexF64, (sz,sz,3));`

and formed an FFT plan along the 1st dimension:

`plan_thread3D = FFTW.plan_fft!(Q3D,1,flags=FFTW.MEASURE);`

``````function calculation6!(plan, In::Matrix{ComplexF64}, Q::Array{ComplexF64, 3}, gOP::Matrix{ComplexF64})

#FFT
plan*Q;

@threads for l in 1:size(Q,3)
for m in 1:size(Q,2)

Q[1,m,l] = Q[1,m,l] + gOP[m,l]*abs2(In[m,l]);

Q[2,m,l] = Q[2,m,l] + gOP[m,l]*conj(In[m,l])*In[m,l];

Q[3,m,l] = Q[3,m,l] + gOP[m,l]*conj(In[m,l])*In[m,l];

end
end

#IFFT
plan\Q;

end
``````
``````@btime calculation6!(plan_thread3D, Intensity,Q3D, gop);
1.352 s (678 allocations: 56.42 KiB)
``````

This was slower than doing individual 1D FFTs on each of the three 2D matrices, although not by much.

7. I thought maybe the slowdown in 6 came from the way the matrices were stacked, so I flipped the order:
``````Q3Da = rand(ComplexF64, (3,sz,sz));
``````

and ran `calculation6!` on it:

``````@btime calculation6!(plan_thread3Da, Intensity,Q3Da, gop);
1.238 s (678 allocations: 56.42 KiB)
``````
8. Last, I made a serial FFT plan to see if I could incorporate it into the multithreaded operations:
``````serialarray = rand(ComplexF64, sz);
serialplan = FFTW.plan_fft!(serialarray, flags=FFTW.MEASURE);
``````
``````function calculation7!(plan, In::Matrix{ComplexF64}, Qa::Matrix{ComplexF64}, Qb::Matrix{ComplexF64}, Qc::Matrix{ComplexF64}, gOP::Matrix{ComplexF64})

#FFT
#plan*Q;

@threads for l in 1:size(Qa,2)

plan*Qa[:,l];
plan*Qb[:,l];
plan*Qc[:,l];

for m in 1:size(Qa,1)

Qa[m,l] = Qa[m,l] + gOP[m,l]*abs2(In[m,l]);

Qb[m,l] = Qb[m,l] + gOP[m,l]*conj(In[m,l])*In[m,l];

Qc[m,l] = Qc[m,l] + gOP[m,l]*conj(In[m,l])*In[m,l];

end
plan\Qa[:,l];
plan\Qb[:,l];
plan\Qc[:,l];

end

#plan\Q;

end
``````

With the benchmark:

``````@btime calculation7!(serialplan, Intensity, Q0, Q1, Q2, gop);
1.504 s (10226925 allocations: 6.84 GiB)
``````

The allocations for this were huge, and it did not provide any time benefit.

CONCLUSION

• There’s still more to be learned about the best way to optimize FFTs over many matrices using multithreading.
• There’s something I don’t understand about explicitly setting the loop indices versus using `CartesianIndices` and `eachindex`. Additionally, I’m not sure why broadcasting doesn’t do the same thing as explicit indexing in for loops? Perhaps @stevengj can comment on this.


Are you forgetting `@views`? (Your `Qa[:,l]`-style slices in `calculation7!` allocate a copy of each column.)

You might also try `@inbounds` or, better yet, LoopVectorization.jl, on your loops. (If you use `@inbounds`, probably do a check before the beginning of your loop that all of your arrays match in size/axes.)
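For example, a check along these lines before the loop (using your array names):

``````axes(Qa) == axes(Qb) == axes(Qc) == axes(gOP) == axes(In) ||
    throw(DimensionMismatch("Qa, Qb, Qc, gOP, and In must have matching axes"))
``````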

Here are the new trials:

1. Using a serial 1D FFT but with `@views`:
``````function calculation8!(plan, In::Matrix{ComplexF64}, Qa::Matrix{ComplexF64}, Qb::Matrix{ComplexF64}, Qc::Matrix{ComplexF64}, gOP::Matrix{ComplexF64})

#FFT
#plan*Q;

@threads for l in 1:size(Qa,2)

@views plan*Qa[:,l];
@views plan*Qb[:,l];
@views plan*Qc[:,l];

for m in 1:size(Qa,1)

Qa[m,l] = Qa[m,l] + gOP[m,l]*abs2(In[m,l]);

Qb[m,l] = Qb[m,l] + gOP[m,l]*conj(In[m,l])*In[m,l];

Qc[m,l] = Qc[m,l] + gOP[m,l]*conj(In[m,l])*In[m,l];

end
@views plan\Qa[:,l];
@views plan\Qb[:,l];
@views plan\Qc[:,l];

end

#plan\Q;

end
``````

Benchmarking:

``````calculation8!(serialplan, Intensity, Q0, Q1, Q2, gop);
@btime calculation8!(serialplan, Intensity, Q0, Q1, Q2, gop);
421.126 ms (49356 allocations: 2.64 MiB)
``````

That is a pretty amazing speedup.

2. Next I added `@inbounds`, although I wasn't sure exactly where to put it relative to `@threads`:
``````function calculation9!(plan, In::Matrix{ComplexF64}, Qa::Matrix{ComplexF64}, Qb::Matrix{ComplexF64}, Qc::Matrix{ComplexF64}, gOP::Matrix{ComplexF64})

#FFT
#plan*Q;

@inbounds @threads for l in 1:size(Qa,2)
@views plan*Qa[:,l];
@views plan*Qb[:,l];
@views plan*Qc[:,l];

@inbounds for m in 1:size(Qa,1)

Qa[m,l] = Qa[m,l]  +gOP[m,l]*abs2(In[m,l]);

Qb[m,l] = Qb[m,l] +gOP[m,l]*conj(In[m,l])*In[m,l];

Qc[m,l]= Qc[m,l] +gOP[m,l]*conj(In[m,l])*In[m,l];

end
@views plan\Qa[:,l];
@views plan\Qb[:,l];
@views plan\Qc[:,l];

end

#plan\Q;

end
``````
``````calculation9!(serialplan, Intensity, Q0, Q1, Q2, gop);
@btime calculation9!(serialplan, Intensity, Q0, Q1, Q2, gop);
401.456 ms (49355 allocations: 2.64 MiB)
``````

This provided a nominal speedup which is great!

CONCLUSION
Using a serial 1D FFT inside the multithreaded loop (9) is about 2x faster than (6-7), which used a multithreaded 1D FFT outside a multithreaded loop.

NOTE: LoopVectorization.jl doesn't work with complex numbers, so I wasn't able to get it to work for this application.

As I mentioned earlier, you should be able to get an additional speedup if you explicitly form a “backwards” plan (with `plan_bfft!`) instead of `plan \ ...`, and then fold the 1/N normalization factor into your `for m` loop.

`conj(In[m,l])*In[m,l]` is mathematically the same as (but less efficient than) `abs2(In[m,l])`. Also, it looks like you are calculating `gOP[m,l]*abs2(In[m,l])` three times?

Yeah, it was more of a stand-in for multiplying two different `In` matrices. For example, in my real code it would be something like:
`conj(In1[m,l])*In2[m,l];`

I’ll give the backwards transform a shot but I’m actually pretty happy and surprised at how much I’ve learned from this.

@stevengj I'm going back and implementing this suggestion now that I have more time. The only problem is that `mul!` with an out-of-place plan and a pre-allocated output doesn't seem to work for a 1D FFT. Is there another function for vector products?

Or would the following be equivalent for pre-allocated `Qa` and `qa` matrices?

`@. qa[:,l] = backwardFFTplan1D * Qa[:,l] ;`

Works for me:

``````julia> using FFTW, LinearAlgebra

julia> x = rand(ComplexF64, 1000); y = similar(x);

julia> p = plan_fft(x);

julia> mul!(y, p, x);

julia> y ≈ fft(x)
true
``````

The only tricky part, in your context, is that you want a pre-allocated buffer per thread/task in order to do multiple FFTs in parallel.

No, an FFT is not an element-wise operation so you can’t use broadcasting for this.
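Instead, apply the out-of-place plan column by column with `mul!` and views, so the result lands in your pre-allocated `qa`; something like:

``````@views mul!(qa[:,l], backwardFFTplan1D, Qa[:,l]);
``````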

> Works for me:

Huh, something weird was going on with my plans, so I restarted the kernel and it seems to work now.

@stevengj a couple of things:

1. `mul!(y, p, x)` seems to work properly in some code and not in others when using vectors `x` and `y`. For example, inside a function:
``````function doFFT1Don2D!(Y::Matrix{ComplexF64}, X::Matrix{ComplexF64}, plan1D)

for m in 1:size(Y,2)

mul!(Y[:,m], plan1D, X[:,m]);

end
end
``````

This did not work (all of `Y` ended up as zeros). Instead I had to use `@view`:

``````function doFFT1Don2D!(Y::Matrix{ComplexF64}, X::Matrix{ComplexF64}, plan1D)

for m in 1:size(Y,2)

mul!((@view Y[:,m]), plan1D, X[:,m]);

end
end
``````

This issue happened for both serial and threaded plans.

> The only tricky part, in your context, is that you want a pre-allocated buffer per thread/task in order to do multiple FFTs in parallel.

2. Can I get around this by having pre-allocated matrices the same size as the 2D matrix (say `X`) on which I want to perform a columnwise 1D FFT? For example:

``````function calculation9!(forplan, bacplan, In::Matrix{ComplexF64}, Qa::Matrix{ComplexF64}, Qb::Matrix{ComplexF64}, Qc::Matrix{ComplexF64}, Qd::Matrix{ComplexF64}, Qe::Matrix{ComplexF64},Qf::Matrix{ComplexF64},  gOP::Matrix{ComplexF64})

#FFT
#forplan - out of place forward unnormalized fft
#bacplan - out of place backwards unnormalized fft
#get size of 1D array for normalization
sz1 = size(Qa,1);
invsz1 = 1.0/sz1;

@inbounds @threads for l in 1:size(Qa,2)
#forward transform
mul!((@view Qd[:,m]), forplan, Qa[:,m]);
mul!((@view Qe[:,m]), forplan, Qb[:,m]);
mul!((@view Qf[:,m]), forplan, Qc[:,m]);

@inbounds for m in 1:size(Qa,1)

#divide by sz1 for normalization
Qa[m,l] = invsz1*(Qd[m,l]  +gOP[m,l]*abs2(In[m,l]) );

Qb[m,l] = invsz1*( Qe[m,l] +gOP[m,l]*conj(In[m,l])*In[m,l] );

Qc[m,l]= invsz1*( Qf[m,l] +gOP[m,l]*conj(In[m,l])*In[m,l] );

end
#backwards transform
mul!((@view Qd[:,m]), bacplan, Qa[:,m]);
mul!((@view Qe[:,m]), bacplan, Qb[:,m]);
mul!((@view Qf[:,m]), bacplan, Qc[:,m]);

end

end
``````

I believe doing this solves the race issue, since each thread writes only to its own pre-allocated columns. Of course this isn't ideal, because if the Q matrices are too large it will consume a lot of memory, and each column's buffer might not be in the local cache of the thread working on it. I'm hoping to learn more about pre-allocating a buffer per thread/task, but for now I've got to get this up and running.

This has nothing to do with FFTW and is just a basic consequence of how Julia works. It won’t work because `Y[:, m]` creates a copy of the m-th column, so the output of the FFT goes into the copy and not into `Y`. All slices in Julia create copies by default. Just use:

``````@views mul!(Y[:,m], plan1D, X[:,m])
``````

so that `Y[:,m]` is a view (i.e. the output goes into `Y`), and `X[:,m]` is also a view (just to improve efficiency by avoiding a copy of the column of `X`).

Yes, but this is a lot more buffer space than strictly necessary. You really only need one 1d buffer per thread/task.
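For example, a rough sketch (illustrative names, untested) of one way to do that: give each task its own chunk of columns and a single 1d buffer, with a serial out-of-place forward plan and an unnormalized backward plan:

``````using FFTW, LinearAlgebra, Base.Threads

function process_columns!(Qa, In, gOP)
    n = size(Qa, 1)
    fplan = plan_fft(Vector{ComplexF64}(undef, n))    # serial, out-of-place forward plan for one column
    bplan = plan_bfft(Vector{ComplexF64}(undef, n))   # unnormalized backward plan
    invN = 1 / n
    chunks = Iterators.partition(axes(Qa, 2), cld(size(Qa, 2), nthreads()))
    @sync for chunk in chunks
        @spawn begin
            buf = Vector{ComplexF64}(undef, n)        # one 1d buffer per task
            for l in chunk
                @views mul!(buf, fplan, Qa[:, l])                          # forward FFT of one column
                @views @. buf = invN * (buf + gOP[:, l] * abs2(In[:, l]))  # processing, with 1/N folded in
                mul!((@view Qa[:, l]), bplan, buf)                         # unnormalized inverse, back into Qa
            end
        end
    end
    return Qa
end
``````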

PS. You don’t need a semicolon after every line in Julia. Semicolons after a line are only useful when running code in the REPL or other interactive prompt, to suppress output.


Shouldn't it be `l` rather than `m` inside the `mul!` calls here?

One thing is that BLAS `mul!` runs multi-threaded by default. Since you are threading at a higher level, you might want to disable BLAS threading with `BLAS.set_num_threads(1)`. That may be beneficial here.

This is not a BLAS call. `mul!(y, plan, x)` calls `fftw_execute_dft(plan, x, y)` to execute `plan` (perform an FFT) on `x` with output in `y`.

(`mul!` is the appropriate function to overload here because an FFT is a linear operator.)
