Does FFTW.jl support fftw_plan_many_dft?

Hi - I have a scenario where I’m working with a large number of 3d arrays and want to efficiently calculate the FFT of each column. I’ve written a few test functions, shown below along with test code, that iterate in various ways across the columns to calculate the associated 1d FFTs. The slicing approach is fastest, though reshaping comes close (and with much better memory use and fewer allocations). Digging into the FFTW library itself, it appears to allow creating a plan that does essentially the same thing via fftw_plan_many_dft. Does FFTW.jl expose this ability?

using FFTW
using BenchmarkTools

function scdfft(data::Array{ComplexF64, 3})

    #get size tuple
    stple = size(data)
    
    #create data container
    S = Array{ComplexF64}(undef, stple)
    
    #fft across the rows, i.e. each column
    for idx2 in 1:stple[2]
        for idx3 in 1:stple[3]
            @inbounds S[:, idx2, idx3] = fftshift(fft(@view data[:,idx2,idx3]))
        end
    end
    
    return S
end
function scdfftslice(data::Array{ComplexF64, 3})
    
    #create data container
    S = Array{ComplexF64}(undef, size(data))
    
    #fft down the columns
    for (idx, col) in enumerate(eachslice(data, dims=2))
        @inbounds S[:, idx, :] = fftshift(fft(col, 1), 1)
    end
    
    return S
end
function scdfftreshape(data::Array{ComplexF64, 3})
    
    #get size tuple
    stple = size(data)
    
    #fft down the columns after reshaping
    temp = fftshift(fft(reshape(data, (stple[1], :)), 1), 1)
    
    return reshape(temp, stple)
end

Testing performance…

tst = rand(ComplexF64, (1024, 128,128));
#compile and confirm
scdfft(tst) == scdfftslice(tst) == scdfftreshape(tst) ? println("good") : error("bad")

good

@benchmark scdfft($tst)

Time (mean ± σ): 345.378 ms ± 14.753 ms ┊ GC (mean ± σ): 4.57% ± 0.13%
Memory estimate: 776.75 MiB, allocs estimate: 98

@benchmark scdfftslice($tst)

Time (mean ± σ): 207.515 ms ± 6.647 ms ┊ GC (mean ± σ): 5.50% ± 0.14%
Memory estimate: 768.04 MiB, allocs estimate: 898

@benchmark scdfftreshape($tst)

Time (mean ± σ): 269.400 ms ± 11.756 ms ┊ GC (mean ± σ): 8.55% ± 2.11%
Memory estimate: 512.00 MiB, allocs estimate: 11

Yes, you can simply pass the dims argument to fft or plan_fft to transform one or more dimensions of a multidimensional array.

Note that if you care about performance, you should precompute a plan, and apply it to a preallocated output array with mul!. Also I would omit the fftshift and just work directly with the data in the natural order for the DFT.
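For reference, a minimal sketch of that pattern, using the array shape from the question (`mul!` comes from the LinearAlgebra standard library):

```julia
using FFTW, LinearAlgebra

A = rand(ComplexF64, 1024, 128, 128)

# transform only along dimension 1 (each column), across all 128×128 columns
X = fft(A, 1)

# or precompute the plan once and apply it into a pre-allocated output
p = plan_fft(A, 1)
B = similar(A)
mul!(B, p, A)   # allocation-free application; B ≈ fft(A, 1)
```

Passing `1` as the dims argument is what makes this a batch of 1d transforms over a 3d array, rather than a full 3d FFT.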

Never mind - I tracked the problem down to using mkl as the FFT provider. It doesn’t seem to support transforming a single dimension, i.e., dims=1 for higher-dimensional arrays; I get an “FFT can not create a plan” error. If I change the provider to fftw, then it works as expected. Hmmm, mkl has been faster on my machine for 2d or smaller arrays.

It would help if you used FFTW as intended, i.e. by creating a plan with FFTW.MEASURE via plan_fft and re-using it with mul!, rather than re-creating an ESTIMATE plan each time by calling fft. See also the FFTW FAQ.

(Of course, there are still cases where MKL beats FFTW, but the two should be much closer if used properly.)
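Concretely, the suggested usage looks something like the sketch below (sizes taken from the example above; the plan is created once and reused):

```julia
using FFTW, LinearAlgebra, Random

A = Array{ComplexF64}(undef, 1024, 128, 128)
B = similar(A)

# MEASURE actually times candidate algorithms at plan-creation time, so
# planning is slower, but the resulting plan is usually faster than the
# ESTIMATE default. Caution: MEASURE may overwrite the array's contents
# during planning, so create the plan before filling A with real data
# (or plan on a scratch array of the same size and type).
p = plan_fft(A, 1; flags=FFTW.MEASURE)

rand!(A)        # now fill with data
mul!(B, p, A)   # reuse the plan as many times as needed, allocation-free
```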


Ok. I went in and updated some of the functions, as well as added some to my original post. This time I took into account plan_fft when using fftw. For my computer and for the size of data I’m using, mkl is still better for the iterative 2d slicing loop across the 3d array. Perhaps more surprising is that fftw is faster with the iterative 2d slicing loop and a simpler plan than when applying a plan to the whole 3d array. And, yes, if I pre-allocated an array outside of the functions it would be faster and have fewer allocations; that applies equally to all of them, so I didn’t bother.

function scdfft(data::Array{ComplexF64, 3})
    
    stple = size(data)
    
    #create data container
    S = Array{ComplexF64}(undef, stple)
    
    #fft across the rows, i.e. each column
    for idx2 in 1:stple[2]
        for idx3 in 1:stple[3]
            @inbounds S[:, idx2, idx3] = fftshift(fft(@view data[:,idx2,idx3]))
        end
    end
    
    return S
end
function scdfftslice(data::Array{ComplexF64, 3})
    
    #create data container
    S = Array{ComplexF64}(undef, size(data))
    #create fft plan from a representative 2d slice of the data
    fftplan = plan_fft((@view data[:,1,:]), 1)
    
    #fft down the columns
    for (idx, col) in enumerate(eachslice(data, dims=2))
        #@inbounds S[:, idx, :] = fftshift(fft(col, 1), 1)
        @inbounds S[:, idx, :] = fftshift(fftplan*col, 1)
    end
    
    return S
end
function scdfftsliceMKL(data::Array{ComplexF64, 3})
    
    #create data container
    S = Array{ComplexF64}(undef, size(data))
    
    #fft down the columns
    for (idx, col) in enumerate(eachslice(data, dims=2))
        @inbounds S[:, idx, :] = fftshift(fft(col, 1), 1)
    end
    
    return S
end
function scdfftreshape(data::Array{ComplexF64, 3})
    
    #get size tuple
    stple = size(data)
    
    #fft down the columns after reshaping
    temp = fftshift(fft(reshape(data, (stple[1], :)), 1), 1)
    
    return reshape(temp, stple)
end
function scdfftplan(data::Array{ComplexF64, 3}, fftplan)
    
    #create data container
    S = Array{ComplexF64}(undef, size(data))
    
    #fft according to the precomputed plan
    fftshift!(S, fftplan * data, 1)
    
    return S
end

Confirm functions…

tst = rand(ComplexF64, (1024, 128,128));

if FFTW.get_provider() == "fftw"
    fftplan = plan_fft(tst, 1)
    #compile and confirm
    scdfftplan(tst, fftplan) == scdfft(tst) == scdfftslice(tst) == scdfftreshape(tst) ? println("good") : error("bad")
elseif FFTW.get_provider() == "mkl"
    scdfft(tst) == scdfftsliceMKL(tst) == scdfftreshape(tst) ? println("good") : error("bad")
end

good

Now set up the benchmarking…

suite = BenchmarkGroup()
suite[FFTW.get_provider()] = BenchmarkGroup([])

#functions tested depend on fft provider
if FFTW.get_provider() == "fftw"
    
    for f in (scdfft, scdfftslice, scdfftreshape)
        suite[FFTW.get_provider()][string(f)] = @benchmarkable $(f)($tst)
    end
    suite[FFTW.get_provider()][string(scdfftplan)] = @benchmarkable $(scdfftplan)($tst, $fftplan)
    
elseif FFTW.get_provider() == "mkl"
    
    for f in (scdfft, scdfftsliceMKL, scdfftreshape)
        suite[FFTW.get_provider()][string(f)] = @benchmarkable $(f)($tst)
    end
end

Tune and run benchmarks…

tune!(suite)
results = run(suite, verbose = true);

Ran the above twice, once with fftw and once with mkl. Summary is:

 1-element BenchmarkTools.BenchmarkGroup:
  tags: []
  "mkl" => 3-element BenchmarkTools.BenchmarkGroup:
	  tags: []
	  "scdfftsliceMKL" => Trial(202.087 ms)
	  "scdfftreshape" => Trial(230.172 ms)
	  "scdfft" => Trial(728.470 ms)

1-element BenchmarkTools.BenchmarkGroup:
  tags: []
  "fftw" => 4-element BenchmarkTools.BenchmarkGroup:
	  tags: []
	  "scdfftreshape" => Trial(248.244 ms)
	  "scdfftslice" => Trial(224.825 ms)
	  "scdfft" => Trial(314.189 ms)
	  "scdfftplan" => Trial(241.369 ms)

On my machine they are comparable, but more careful benchmarking shows that the time is hugely distorted by the fftshift (which kills the cache by making a separate pass over the array after the FFT), and the fftshift locality is worse in the scdfftplan case because you only do the fftshift after transforming the whole array, rather than after each slice while it is still in cache. Furthermore, pre-allocating has a huge impact on performance on my machine, to the point where I think that benchmarking non-preallocated functions is misleading; it’s not really telling you the FFT speed per se.

In particular, for:

using FFTW, BenchmarkTools, LinearAlgebra
tst = rand(ComplexF64, (1024, 128,128));
res = similar(tst); # pre-allocated output array
fftplan = plan_fft(tst, 1);
fftplan2 = plan_fft(tst, 1, flags=FFTW.MEASURE);

I get:

julia> @btime scdfftplan($tst, $fftplan);    # with fftshift
  306.982 ms (12 allocations: 512.00 MiB)

julia> @btime scdfftslice($tst);             # with fftshift & by slices
  306.143 ms (517 allocations: 768.01 MiB)

julia> @btime $fftplan * $tst;               # no fftshift
  177.628 ms (2 allocations: 256.00 MiB)

julia> @btime mul!($res, $fftplan, $tst);    # no fftshift, pre-allocated output
  70.448 ms (0 allocations: 0 bytes)

julia> @btime mul!($res, $fftplan2, $tst);   # no fftshift, pre-allocated output, MEASURE plan
  60.636 ms (0 allocations: 0 bytes)

You can see that the final result is 5x faster than scdfftplan — 80% of your time is due to the fftshift and the lack of pre-allocation!

Moral: create a plan, pre-allocate, and don’t fftshift in performance-critical circumstances. (Learn to use the FFT outputs in natural order. fftshift is something I would only recommend in low-performance situations like visualization.)
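One way to get comfortable with the natural order is via fftfreq, which gives the frequency corresponding to each unshifted output bin. A small sketch with a hypothetical pure tone, so the peak bin is predictable:

```julia
using FFTW

N = 8
k = 3                                    # tone at the 3rd positive frequency bin
x = [cis(2π * k * n / N) for n in 0:N-1]
X = fft(x)                               # natural (unshifted) DFT order
f = fftfreq(N)                           # matching frequencies: [0,1,2,3,-4,-3,-2,-1]/8
fpeak = f[argmax(abs.(X))]               # = 3/8; no fftshift needed
```

Indexing the fftfreq vector with the same index as the FFT output gives the frequency of any bin directly, with no extra pass over the data.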


Excellent point on fftshift. You’re right that there is no need for it except for visualization. I had also assumed (I know, I know) that it would be a fixed timing bias across my functions, which, as you pointed out, is not true. I updated the functions to pass in a pre-allocated array… the results are similar to yours, though I didn’t test the FFTW.MEASURE flag.

removed fftshift, pre-allocated output... fftw
  scdfftreshape: 185.212 ms (     9 allocations: 256.000 MiB)
    scdfftslice: 103.251 ms (   259 allocations: 256.006 MiB)
         scdfft: 221.150 ms ( 81920 allocations: 262.750 MiB)
     scdfftplan:  61.441 ms (     0 allocations:   0.000 MiB)

removed fftshift, pre-allocated output... mkl
 scdfftsliceMKL:  84.687 ms (   640 allocations: 256.039 MiB)
  scdfftreshape: 159.433 ms (     9 allocations: 256.000 MiB)
    scdfftslice:  80.945 ms (   259 allocations: 256.006 MiB)
         scdfft: 620.653 ms ( 81920 allocations: 262.750 MiB)

Thanks for the discussion!