```julia
using FFTW  # with the MKL provider enabled, to reproduce the issue

test = Array{ComplexF64}(undef, 4, 4, 4, 4)
p1 = plan_fft(test, [3, 4])                 # plan a transform over dims 3 and 4 only
x = reshape(ComplexF64.(1:4^4), 4, 4, 4, 4)
result1 = fft(x)                            # full 4-D FFT
result2 = p1 * x                            # 2-D FFT along dims 3 and 4
result1 == result2  # `false` is the correct outcome: the dims-(3,4) transform must differ from the full FFT
```
The transform along an interior dimension (dimension 3 of 4 here) is not supported in MKL because it can’t be mapped to a constant distance between starting points of the FFTs.
MKL has a single distance parameter per direction for this (DFTI_INPUT_DISTANCE / DFTI_OUTPUT_DISTANCE), which implies a 1-D set of starting points equally spaced at that distance. FFTW can take a multi-dimensional set of starting points through the “howmany_rank” and “howmany_dims” parameters of its guru interface.
This limitation is documented in MKL’s FFTW3 interface docs (“the only supported values for parameter howmany_rank in guru and guru64 plan creation functions are 0 and 1”).
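To make that concrete, here is a small sketch (my own, based on the explanation above rather than on the FFTW.jl source) of the starting offsets that the loop over the non-transformed dims has to visit for two choices of dims in this 4×4×4×4 example:

```julia
# Starting offsets (0-based linear indices) of each sub-transform of a
# 4×4×4×4 column-major array; st holds the strides of the four dims.
st = (1, 4, 16, 64)

# Transform over dims [3,4]: the loop runs over dims 1 and 2.
off34 = vec([(i - 1) * st[1] + (j - 1) * st[2] for i in 1:4, j in 1:4])
# 0, 1, 2, …, 15 — these happen to be equally spaced, so in principle a
# single DFTI_INPUT_DISTANCE could describe them, but the plan reaches
# MKL as a rank-2 "howmany" loop, which its wrapper does not accept.

# Transform over dims [2,4]: the loop runs over dims 1 and 3.
off24 = vec([(i - 1) * st[1] + (k - 1) * st[3] for i in 1:4, k in 1:4])
# 0, 1, 2, 3, 16, 17, 18, 19, 32, … — no single constant distance can
# describe these, so howmany_rank = 2 is genuinely required here.
```

The first case is what the reshape trick below exploits; the second is the one that really needs a manual loop.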
You could use a lazy reshape “trick” to make MKL work on this particular example, since dims 1 and 2 can be merged; see the sketch below.
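Something along these lines (my own sketch, not from the original post) should do the merge for this example:

```julia
using FFTW

x = reshape(ComplexF64.(1:4^4), 4, 4, 4, 4)

# Dims 1 and 2 are contiguous in memory, so they can be merged for free;
# the transform over dims [3,4] of x becomes a transform over dims [2,3]
# of the reshaped array, whose loop runs over dim 1 only (howmany_rank = 1).
xm = reshape(x, 16, 4, 4)
ym = fft(xm, [2, 3])
y  = reshape(ym, 4, 4, 4, 4)

y ≈ fft(x, [3, 4])  # true, when checked against a correct (plain FFTW) backend
```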
Ideally FFTW.jl would do this merging for us, but even then plan_fft(test, [2, 4]) would still be “broken”, since dims 1 and 3 cannot be merged. In that case we need to loop over dim 3 manually (see the sketch below), and it’s hard to make that efficient for all sizes.
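For completeness, a minimal (and unoptimized) sketch of that manual loop; the helper name fft_dims24 is mine:

```julia
using FFTW

# fft over dims [2,4] of a 4-D array by looping over dim 3 manually:
# each slice x[:, :, k, :] is 3-D, and its dims [2,3] correspond to
# dims [2,4] of x, so each slice only needs a rank-1 "howmany" loop.
function fft_dims24(x::AbstractArray{<:Number,4})
    y = similar(x, ComplexF64)
    for k in axes(x, 3)
        y[:, :, k, :] = fft(x[:, :, k, :], [2, 3])
    end
    return y
end

x = reshape(ComplexF64.(1:4^4), 4, 4, 4, 4)
fft_dims24(x) ≈ fft(x, [2, 4])  # true, against a correct backend
```

Precomputing a plan for the slice shape and reusing it would cut the allocations, but as said above, the right strategy depends on the array sizes.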
Edit: if you can’t use the FFTW backend for some reason and you don’t need rfft, you can try this patched version: GitHub - N5N3/FFTW.jl at SelfUse.
The reason I raised the question is that, on my personal computer, MKL makes fft faster and supports allocation-free multithreading.
I have since moved to cuFFT for large 3-D ffts on a cluster, and it works well, except that CUDA.jl currently has no high-level wrapper for multi-GPU fft. PencilFFTs now supports CuArray, but it is complicated to set up if the cluster does not have CUDA-aware MPI installed.
Hopefully CUDA.jl and MPI.jl can someday just provide CUDA-aware MPI via artifacts.