How to avoid memory blow-up with FFT without expensive garbage collection?

I am trying to write code with a for loop inside which an array is broken into chunks, the chunks are sent to multiple processors, and an FFT is performed on each chunk. Without a call to GC.gc() at the end of each iteration, memory usage keeps increasing until the computer eventually crashes. On the other hand, garbage collection slows the code down by a factor of more than 10, which is prohibitively expensive for the project I am working on. To reproduce the issue, I wrote the following simple example:

using Distributed
addprocs(2)
x_in = rand(100, 100)

@everywhere workers() begin
    using FFTW
    # in-place FFT plan, built once per worker for this array size and type
    const fft_plan = plan_fft!(ComplexF64.($x_in))
    function update_y(y_in)
        for i in 1:100
            y_in .= fft_plan * y_in
        end
        return y_in
    end
end

function test(x)
    for t in 1:1000
        @show t
        @everywhere workers() begin
            y = ComplexF64.(copy($x))  # fresh allocation on every iteration
            y .= update_y(y)
            # GC.gc()
        end
    end
    return nothing
end
@time test(x_in)

On my laptop (Julia 1.10.4, Ubuntu 22, i5 processor), this code takes about 8 seconds to run. If I uncomment the GC.gc() line, it takes about 132 seconds.

I have been trying to find a way to permanently allocate memory for the FFT operation, but I do not understand how to do that. Even with the in-place FFT plan, memory keeps blowing up.

I would really appreciate any help. Thank you so much.

Maybe you need to pre-allocate the output array and use in-place mul!, as explained in the documentation of plan_fft?

Here is an example (for a 1-d FFT of real data on GPU, so you'll need to adapt it, mostly by removing the CUDA prefixes; a CPU sketch follows the REPL session below).

julia> using CUDA

julia> using CUDA.CUFFT

julia> N = 10_000_000

julia> a = CUDA.rand(N);

julia> tf = rfft(a);

julia> plan = plan_rfft(a)
CUFFT real-to-complex forward plan for 10000000-element CuArray of Float32

julia> CUFFT.mul!(tf, plan, a)

julia> CUDA.@time CUFFT.mul!(tf, plan, a);
  0.003897 seconds (261 CPU allocations: 4.078 KiB)
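For the CPU case in the question, the same pattern would look roughly like this (a minimal sketch with FFTW and LinearAlgebra; I have not timed it, and it is simply the same calls without the CUDA prefixes):

using FFTW, LinearAlgebra

N = 10_000_000
a = rand(Float32, N)

tf = rfft(a)         # allocate the output array once
plan = plan_rfft(a)  # reusable real-to-complex plan for this size and type

mul!(tf, plan, a)        # in-place: writes the transform into tf
@time mul!(tf, plan, a)  # repeated calls allocate essentially nothing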

Got to go; others will probably be able to help with the multi-processing aspects and give better advice.

One thing that would help a lot is to use multi-threading rather than Distributed. Unless you are using multiple computers, multi-threading generally has lower overhead.
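For example, a threaded version of the benchmark might look like the sketch below (assuming Julia is started with several threads, e.g. julia -t 2; the function and buffer names are just illustrative):

using FFTW, LinearAlgebra, Base.Threads

x_in = rand(100, 100)

# Build the plan once up front: executing a plan on distinct buffers from
# several threads is safe, but creating plans concurrently is not.
plan = plan_fft(ComplexF64.(x_in))

# One pair of preallocated buffers per thread, reused on every iteration.
bufs = [(zeros(ComplexF64, 100, 100), zeros(ComplexF64, 100, 100)) for _ in 1:nthreads()]

function test_threaded(x, bufs, plan)
    for t in 1:1000
        @threads for k in 1:length(bufs)
            y, y_tmp = bufs[k]
            y .= x                    # overwrite the buffer, no new allocation
            for i in 1:100
                mul!(y_tmp, plan, y)  # FFT into the scratch buffer
                y, y_tmp = y_tmp, y   # swap so successive FFTs chain
            end
        end
    end
    return nothing
end

@time test_threaded(x_in, bufs, plan)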

I was able to make it work by rewriting the code as follows:

using Distributed
addprocs(2)
x_in = rand(100, 100)

@everywhere workers() begin
    using LinearAlgebra, FFTW
    function update_y(y_in)
        fft_plan = plan_fft(y_in)  # out-of-place plan
        y_out = copy(y_in)         # output buffer, reused by every mul! below
        for i in 1:100
            mul!(y_out, fft_plan, y_in)  # writes the FFT into y_out, no new allocation
        end
        return y_out
    end
end


function test(x)
    # y is allocated once per worker and reused on every iteration
    @everywhere workers() y = zeros(ComplexF64, 100, 100)
    for t in 1:10000
        @show t
        @everywhere workers() begin
            y .= $x  # overwrite the existing buffer (converting to ComplexF64) instead of allocating
            y .= update_y(y)
        end
    end
    return nothing
end
@time test(x_in)

No garbage collection is required, and the code runs fast.
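One further refinement, in case the remaining per-call allocations ever matter: update_y above still builds a plan and an output copy on every call. Both can be hoisted to worker-level constants, as in this minimal sketch (untested; the name update_y! is mine, and it chains the FFTs the way the original example did):

@everywhere workers() begin
    using LinearAlgebra, FFTW
    const fft_plan = plan_fft(zeros(ComplexF64, 100, 100))  # built once per worker
    const y_tmp = zeros(ComplexF64, 100, 100)               # scratch buffer, reused
    function update_y!(y_in)
        for i in 1:100
            mul!(y_tmp, fft_plan, y_in)  # FFT into the scratch buffer
            y_in .= y_tmp                # copy back so the next FFT uses the result
        end
        return y_in
    end
end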
