GPU/CPU Agnostic FFT code

bpsomu · February 25, 2025, 10:48am

I am looking to port my Professor’s 15 year old C++ code to Julia in a GPU/CPU Agnostic way maybe by using AcceleratedKernels.jl. The C++ code uses FFTW library, I know CUDA has CUDAFFTs but I don’t know if AcceleratedKernels.jl or JuliaGPU has such a framwork for FFTs. Is this a good idea? Or is this too ambitious and I should go for CUDAFFTs or something else?

yolhan_mannes · February 25, 2025, 10:55am

Julia has a FFTW.jl library that dispatch on cpu and gpu and no problem mixing CuArray programing and kernel programing, both will be agnostic, if you need something more than AcceleratedKernels.jl, KernelAbstraction.jl creates agnostic kernels for cpu and gpu. Also, CxxWrap.jl can help you port things that you don’t know how to translate but that won’t be agnostic. For instance in this code

using FFTW, CUDA, BenchmarkTools
julia> function try_FFT_on_cpu()
           values = rand(256, 256, 256)
           value_complex = ComplexF32.(values)
           cvalues = similar((value_complex), ComplexF32)
           copyto!(cvalues, values)
           cy = similar(cvalues)
           cF = plan_fft!(cvalues, flags=FFTW.MEASURE)
           @btime a = ($cF*$cy)
           return nothing
       end
try_FFT_on_cpu (generic function with 1 method)

julia> function try_FFT_on_cuda()
           values = rand(256, 256, 256)
           value_complex = ComplexF32.(values)
           cvalues = similar(cu(value_complex), ComplexF32)
           copyto!(cvalues, values)
           cy = similar(cvalues)
           cF = plan_fft!(cvalues)
           @btime CUDA.@sync a = ($cF*$cy)
           return nothing
       end
try_FFT_on_cuda (generic function with 1 method)

from Unreasonably fast FFT on CUDA - #8 by roflmaostc

roflmaostc · February 25, 2025, 11:16am

It’s not FFTW.jl but instead AbstractFFTs.jl and CUDA.jl which dispatches CuArray to CUDA.

bpsomu · April 28, 2025, 8:35am

figured out the way to do it, when I used just AbstractFFTs.jl naively as follows for a GPU/CPU agnostic implementation, it did not work:

function poisson_solve!(out_array::AbstractArray{T,2}, 
                                       inp_array::AbstractArray{T,2},
                                       coefficient::AbstractArray{T,2}) where T
            println("Backend agnostic FFT for Poisson solve")
            fourier_array = rfft(inp_array)
            fourier_array .*= coefficient
            out_array .= irfft(fourier_array, size(inp_array, 2))
            return out_array
end

Apparently, you need to specify the backend and without it I was getting StackOverflow errors.

So, I tried the following with multiple dispatch and the Requires.jl library:

function __init__()
    # CUDA implementation - only loaded if CUDA.jl is available
    @require CUDA="052768ef-5323-5732-b1bb-66c8b64840ba" begin
        function poisson_solve!(out_array::CUDA.CuArray{T,2}, 
                              inp_array::CUDA.CuArray{T,2},
                              coefficient::CUDA.CuArray{T,2}) where T
            println("Using CUDA FFT for Poisson solve")
            fourier_array = rfft(inp_array)
            fourier_array .*= coefficient
            out_array .= irfft(fourier_array, size(inp_array, 2))
            return out_array
        end
    end

    # AMDGPU implementation - only loaded if AMDGPU.jl is available
    @require AMDGPU="21141c5a-9bdb-4563-92ae-f87d6854732e" begin
        function poisson_solve!(out_array::AMDGPU.ROCArray{T,2}, 
                                inp_array::AMDGPU.ROCArray{T,2},
                                coefficient::AMDGPU.ROCArray{T,2}) where T
            println("Using ROCm FFT for Poisson solve")
            fourier_array = rfft(inp_array)
            fourier_array .*= coefficient
            out_array .= irfft(fourier_array, size(inp_array, 2))
            return out_array
        end
    end
end

for CUDA and ROCm and the following for CPU, oneAPI and Metal:

function poisson_solve!(out_array::AbstractArray{T,2}, 
                        inp_array::AbstractArray{T,2},
                        coefficient::AbstractArray{T,2}) where T
    # Convert inputs to CPU Arrays if they aren't already
    # This handles oneAPI and Metal arrays that need conversion
    inp_cpu = Array(inp_array)
    coeff_cpu = Array(coefficient)

    # Use FFTW's implementation explicitly
    fourier_array = FFTW.rfft(inp_cpu)

    # Multiply in Fourier space
    fourier_array .*= coeff_cpu

    # Perform inverse FFT
    result = FFTW.irfft(fourier_array, size(inp_cpu, 2))

    # Copy back to output array (handles conversion back to original array type)
    out_array .= result

    return out_array
end

Dependancies: using AbstractFFTs, using FFTW, using Requires, using AMDGPU or using CUDA
Now, it works regardless of backend. But maybe folks have suggestions for a cleaner implementation.

roflmaostc · April 28, 2025, 9:08am

Sorry, I might not understand.

But why is this not working?


using AbstractFFTs, FFTW, AMDGPU, CUDA
function poisson_solve!(out_array::AbstractArray{T,2}, 
                                       inp_array::AbstractArray{T,2},
                                       coefficient::AbstractArray{T,2}) where T

            fourier_array = rfft(inp_array)
            fourier_array .*= coefficient
            out_array .= irfft(fourier_array, size(inp_array, 2))
            return out_array
end

rfft is provided by AbstractFFTs.jl and CUDA, AMDGPU, FFTW already do the dispatch for the concrete types such as CuArray, …

bpsomu · April 28, 2025, 11:03am

You are right, it does work for ROCArray, CuArray and Array. I had trouble because I had failed to import FFTW and therefore I was getting StackOverflow errors for CPU Array.

A separate implementation/dispatch is needed for MtlArray and oneArray because they do not have their own FFT backend.

bpsomu · April 29, 2025, 11:29am

Found a better way to do it such that it works on CPUs, AMDGPUs, CUDA GPUs, oneAPI and Metal (oneAPI and Metal fallback to CPU):
Dependancies depending on backend: using AMDGPU, using CUDA, using oneAPI, using Metal.
Here is the code:

using AbstractFFTs
using Requires
using FFTW
#using AMDGPU or CUDA or oneAPI or Metal

# Define traits for array types based on FFT support
abstract type FFTSupport end
struct NativeFFTSupport <: FFTSupport end
struct CPUFallbackFFT <: FFTSupport end

# Trait functions to determine FFT support
fft_trait(::Type{<:AbstractArray{T,N}}) where {T,N} = NativeFFTSupport()  # Default: native support
fft_trait(::Type) = CPUFallbackFFT()  # Unknown types: CPU fallback

# Specialize for types that need CPU fallback
function __init__()
    @require oneAPI="8f75cd03-7ff8-4ecb-9b8f-daf728133b1b" begin
        fft_trait(::Type{<:oneAPI.oneArray}) = CPUFallbackFFT()
    end
    @require Metal="dde4c033-4e86-420c-a63e-0dd931031962" begin
        fft_trait(::Type{<:Metal.MtlArray}) = CPUFallbackFFT()
    end
end

function poisson_solve!(out_array::AbstractArray{T,2}, 
                        inp_array::AbstractArray{T,2},
                        coefficient::AbstractArray{T,2}, 
                        ::NativeFFTSupport = fft_trait(typeof(inp_array))) where T
    # Do Fourier Transform
    fourier_array = rfft(inp_array)

    # Multiply in Fourier space
    fourier_array .*= coefficient

    # Perform inverse FFT
    out_array .= irfft(fourier_array, size(inp_array, 2))

    return out_array
end

# Method for arrays requiring CPU fallback
function poisson_solve!(out_array::AbstractArray{T,2}, 
                        inp_array::AbstractArray{T,2},
                        coefficient::AbstractArray{T,2},
                        ::CPUFallbackFFT = fft_trait(typeof(inp_array))) where T
    # CPU fallback implementation
    array_type = typeof(inp_array).name.wrapper

    # Convert to CPU
    inp_cpu = Array(inp_array)
    coef_cpu = Array(coefficient)

    # Perform FFT on CPU
    fourier_array = rfft(inp_cpu)
    # Multiply coefficient in k-space
    fourier_array .*= coef_cpu
    # Do inverse fourier transform
    result_cpu = irfft(fourier_array, size(inp_cpu, 2))

    # Convert back to original type
    out_array .= array_type(result_cpu)

    return out_array
end

bpsomu · June 10, 2025, 10:21am

Have improved it further to make FFT plans that are reused, this helps do FFT on arrays that are of size ‘x \times y’:

using AbstractFFTs
using FFTW 

"""
Plan caches for forward and inverse transforms:
    These caches will store FFT plans for different array types and sizes
        to avoid redundant computations.
    The keys are generated based on the type and size of the arrays.
    The values are the FFT plans created by the FFTW or AbstractFFTs.jl library.
    The cache is implemented as a dictionary for efficient lookups.
    The cache is cleared when the module is reloaded or when explicitly requested.
"""
const FFT_PLAN_CACHE = Dict{UInt64, Any}()
const IFFT_PLAN_CACHE = Dict{UInt64, Any}()

"""
Key Generation Function:
This function generates a unique key for the cache based on the type and size of the array.
    It uses the `hash` function to create a hash value that serves as the key.
    The key is used to store and retrieve FFT plans from the cache.
    The key is a tuple containing the array type, element type, and size.
"""
function plan_cache_key(array)
    # Include type and size in the hash
    return hash((typeof(array).name.wrapper, eltype(array), size(array)))
end


"""
The function implements a way to check for existing FFT plans and if not found, create new ones:
    1) Generate unique key for the given array using its type and size using the 
`plan_cache_key()` function.
    2) Check if the key exists in the cache.
    3) If it exists, return the cached plan.
    4) If it doesn't exist, create a new FFT plan using the `plan_rfft` function.
    5) Store the new plan in the cache using the generated key.
"""
function get_or_create_fft_plan(array)
    key = plan_cache_key(array)
    get!(FFT_PLAN_CACHE, key) do
        plan_rfft(array)
    end
end

"""
1) Generate unique key for the given array using its type and size via the 
hash() function using the `complex_array` and `original_size` arguments.
2) This is needed since inverse FFTs need to know the original size of the array.
"""
function get_or_create_ifft_plan(complex_array, original_size)
    key = hash((typeof(complex_array).name.wrapper, eltype(complex_array), 
                size(complex_array), original_size))
    get!(IFFT_PLAN_CACHE, key) do
        plan_irfft(complex_array, original_size)
    end
end


function clear_plan_cache!()
    empty!(FFT_PLAN_CACHE)
    empty!(IFFT_PLAN_CACHE)
    GC.gc()  # Encourage garbage collection
    return nothing
end

function poisson_solve!(out_array::AbstractArray{T,2}, 
                        inp_array::AbstractArray{T,2},
                        coefficient::AbstractArray{T,2}, 
                        ::NativeFFTSupport = fft_trait(typeof(inp_array))) where T
    # Get cached forward plan
    fft_plan = get_or_create_fft_plan(inp_array)
    
    # Execute forward FFT with plan
    fourier_array = fft_plan * inp_array

    # Get cached inverse plan
    # Note: size(inp_array, 1) is used to ensure the output size matches the input
    ifft_plan = get_or_create_ifft_plan(fourier_array, size(inp_array, 1))

    # Multiply in Fourier space
    fourier_array .*= coefficient

    # Execute inverse FFT with plan
    out_array .= ifft_plan * fourier_array

    return out_array
end

Please suggest improvements if you see the possibility.

Topic		Replies	Views
CuPy CuFFT ~2x faster than CUDA.jl CuFFT GPU performance , cuda , fft	15	2821	February 27, 2023
[ANN] RustFFT.jl: Compute forward and inverse FFTs with RustFFT Package Announcements	13	1530	May 30, 2023
Generic Julia FFT Numerics	1	843	July 10, 2017
Why fft with MEASURE plan 10x slower than calling fft directly with CUDA.CUFFT? Performance gpu , cuda	7	170	September 22, 2024
Plan FFTs both in julia 0.6.3 and 0.7 General Usage fftw	2	1243	June 17, 2018

GPU/CPU Agnostic FFT code

Related topics