Are CUDA.jl and FFTW thread-safe?

I am getting the following error when using CUDA.jl FFT plans from multiple threads. Is this interface not thread-safe? If not, do I just need a mutex around plan_fft!(), or might the FFT execution itself not be thread-safe either?

using CUDA, FFTW

function gpu_fft_thread()
    try
        X = CUDA.randn(ComplexF32, 1024, 1024)
        myfft = plan_fft!(X, 1)   # in-place FFT plan along dim 1
        myfft * X
    catch e
        rethrow()                 # preserve the original exception
    end
    return nothing
end

function run_fft_threads()
    try
        for _ in 1:10
            tids = [Threads.@spawn gpu_fft_thread() for _ in 1:10]
            while !all(istaskdone.(tids))
                yield()
            end
        end
    catch e
        rethrow()
    end
end

run_fft_threads()

Output:

error in running finalizer: ErrorException("val already in a list")
error at ./error.jl:35
push! at ./linked_list.jl:53 [inlined]
_wait2 at ./condition.jl:87
#wait#621 at ./condition.jl:127
wait at ./condition.jl:125 [inlined]
slowlock at ./lock.jl:156
lock at ./lock.jl:147 [inlined]
lock at ./lock.jl:227
push! at /home/jon/.julia/packages/CUDA/p5OVK/lib/utils/cache.jl:72 [inlined]
cufftReleasePlan at /home/jon/.julia/packages/CUDA/p5OVK/lib/cufft/wrappers.jl:158 [inlined]
#137 at /home/jon/.julia/packages/CUDA/p5OVK/lib/cufft/fft.jl:30 [inlined]
#context!#59 at /home/jon/.julia/packages/CUDA/p5OVK/lib/cudadrv/state.jl:170 [inlined]
context! at /home/jon/.julia/packages/CUDA/p5OVK/lib/cudadrv/state.jl:165 [inlined]
unsafe_free! at /home/jon/.julia/packages/CUDA/p5OVK/lib/cufft/fft.jl:29
unknown function (ip: 0x7fefe0126a62)
_jl_invoke at /cache/build/default-amdci4-4/julialang/julia-release-1-dot-9/src/gf.c:2758 [inlined]
ijl_apply_generic at /cache/build/default-amdci4-4/julialang/julia-release-1-dot-9/src/gf.c:2940
run_finalizer at /cache/build/default-amdci4-4/julialang/julia-release-1-dot-9/src/gc.c:417
jl_gc_run_finalizers_in_list at /cache/build/default-amdci4-4/julialang/julia-release-1-dot-9/src/gc.c:507
run_finalizers at /cache/build/default-amdci4-4/julialang/julia-release-1-dot-9/src/gc.c:553
enable_finalizers at ./gcutils.jl:126 [inlined]
unlock at ./locks-mt.jl:68 [inlined]
push! at ./task.jl:703
enq_work at ./task.jl:783
yield at ./task.jl:862
run_fft_threads at /home/jon/processing_analysis/fftw_threads.jl:19
unknown function (ip: 0x7fefe015402f)
_jl_invoke at /cache/build/default-amdci4-4/julialang/julia-release-1-dot-9/src/gf.c:2758 [inlined]
ijl_apply_generic at /cache/build/default-amdci4-4/julialang/julia-release-1-dot-9/src/gf.c:2940
jl_apply at /cache/build/default-amdci4-4/julialang/julia-release-1-dot-9/src/julia.h:1879 [inlined]
do_call at /cache/build/default-amdci4-4/julialang/julia-release-1-dot-9/src/interpreter.c:126
eval_value at /cache/build/default-amdci4-4/julialang/julia-release-1-dot-9/src/interpreter.c:226
eval_stmt_value at /cache/build/default-amdci4-4/julialang/julia-release-1-dot-9/src/interpreter.c:177 [inlined]
eval_body at /cache/build/default-amdci4-4/julialang/julia-release-1-dot-9/src/interpreter.c:624
jl_interpret_toplevel_thunk at /cache/build/default-amdci4-4/julialang/julia-release-1-dot-9/src/interpreter.c:762
top-level scope at /home/jon/processing_analysis/fftw_threads.jl:27
jl_toplevel_eval_flex at /cache/build/default-amdci4-4/julialang/julia-release-1-dot-9/src/toplevel.c:912
jl_toplevel_eval_flex at /cache/build/default-amdci4-4/julialang/julia-release-1-dot-9/src/toplevel.c:856
ijl_toplevel_eval_in at /cache/build/default-amdci4-4/julialang/julia-release-1-dot-9/src/toplevel.c:971
eval at ./boot.jl:370 [inlined]
include_string at ./loading.jl:1864
_jl_invoke at /cache/build/default-amdci4-4/julialang/julia-release-1-dot-9/src/gf.c:2758 [inlined]
ijl_apply_generic at /cache/build/default-amdci4-4/julialang/julia-release-1-dot-9/src/gf.c:2940
_include at ./loading.jl:1924
include at ./client.jl:478
unknown function (ip: 0x7fefe00998a2)
_jl_invoke at /cache/build/default-amdci4-4/julialang/julia-release-1-dot-9/src/gf.c:2758 [inlined]
ijl_apply_generic at /cache/build/default-amdci4-4/julialang/julia-release-1-dot-9/src/gf.c:2940
jl_apply at /cache/build/default-amdci4-4/julialang/julia-release-1-dot-9/src/julia.h:1879 [inlined]
do_call at /cache/build/default-amdci4-4/julialang/julia-release-1-dot-9/src/interpreter.c:126
eval_value at /cache/build/default-amdci4-4/julialang/julia-release-1-dot-9/src/interpreter.c:226
eval_stmt_value at /cache/build/default-amdci4-4/julialang/julia-release-1-dot-9/src/interpreter.c:177 [inlined]
eval_body at /cache/build/default-amdci4-4/julialang/julia-release-1-dot-9/src/interpreter.c:624
jl_interpret_toplevel_thunk at /cache/build/default-amdci4-4/julialang/julia-release-1-dot-9/src/interpreter.c:762
top-level scope at REPL[1]:1
jl_toplevel_eval_flex at /cache/build/default-amdci4-4/julialang/julia-release-1-dot-9/src/toplevel.c:912
jl_toplevel_eval_flex at /cache/build/default-amdci4-4/julialang/julia-release-1-dot-9/src/toplevel.c:856
jl_toplevel_eval_flex at /cache/build/default-amdci4-4/julialang/julia-release-1-dot-9/src/toplevel.c:856
ijl_toplevel_eval_in at /cache/build/default-amdci4-4/julialang/julia-release-1-dot-9/src/toplevel.c:971
eval at ./boot.jl:370 [inlined]
eval at ./Base.jl:68 [inlined]
repleval at /home/jon/.vscode-server/extensions/julialang.language-julia-1.47.2/scripts/packages/VSCodeServer/src/repl.jl:222
#107 at /home/jon/.vscode-server/extensions/julialang.language-julia-1.47.2/scripts/packages/VSCodeServer/src/repl.jl:186
unknown function (ip: 0x7fefe009982f)
_jl_invoke at /cache/build/default-amdci4-4/julialang/julia-release-1-dot-9/src/gf.c:2758 [inlined]
ijl_apply_generic at /cache/build/default-amdci4-4/julialang/julia-release-1-dot-9/src/gf.c:2940
with_logstate at ./logging.jl:514
with_logger at ./logging.jl:626 [inlined]
#106 at /home/jon/.vscode-server/extensions/julialang.language-julia-1.47.2/scripts/packages/VSCodeServer/src/repl.jl:187
unknown function (ip: 0x7fefe009960f)
_jl_invoke at /cache/build/default-amdci4-4/julialang/julia-release-1-dot-9/src/gf.c:2758 [inlined]
ijl_apply_generic at /cache/build/default-amdci4-4/julialang/julia-release-1-dot-9/src/gf.c:2940
jl_apply at /cache/build/default-amdci4-4/julialang/julia-release-1-dot-9/src/julia.h:1879 [inlined]
jl_f__call_latest at /cache/build/default-amdci4-4/julialang/julia-release-1-dot-9/src/builtins.c:774
#invokelatest#2 at ./essentials.jl:816 [inlined]
invokelatest at ./essentials.jl:813
unknown function (ip: 0x7fefe0098772)
_jl_invoke at /cache/build/default-amdci4-4/julialang/julia-release-1-dot-9/src/gf.c:2758 [inlined]
ijl_apply_generic at /cache/build/default-amdci4-4/julialang/julia-release-1-dot-9/src/gf.c:2940
jl_apply at /cache/build/default-amdci4-4/julialang/julia-release-1-dot-9/src/julia.h:1879 [inlined]
do_apply at /cache/build/default-amdci4-4/julialang/julia-release-1-dot-9/src/builtins.c:730
macro expansion at /home/jon/.vscode-server/extensions/julialang.language-julia-1.47.2/scripts/packages/VSCodeServer/src/eval.jl:34 [inlined]
#61 at ./task.jl:514
unknown function (ip: 0x7fefe007bb9f)
_jl_invoke at /cache/build/default-amdci4-4/julialang/julia-release-1-dot-9/src/gf.c:2758 [inlined]
ijl_apply_generic at /cache/build/default-amdci4-4/julialang/julia-release-1-dot-9/src/gf.c:2940
jl_apply at /cache/build/default-amdci4-4/julialang/julia-release-1-dot-9/src/julia.h:1879 [inlined]
start_task at /cache/build/default-amdci4-4/julialang/julia-release-1-dot-9/src/task.c:1092

The error is not fully deterministic, but it occurs most of the time on my machine with the above code. If you have difficulty reproducing it, try running it multiple times or increasing the number of loops/threads.
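In case a mutex is the answer, the workaround I have in mind would look roughly like this (a sketch only, not verified to fix the crash; PLAN_LOCK is a name I made up, and whether execution also needs the lock is exactly my question):

```julia
using CUDA, FFTW

# Hypothetical workaround: serialize plan creation behind one global lock.
const PLAN_LOCK = ReentrantLock()

function gpu_fft_thread_locked()
    X = CUDA.randn(ComplexF32, 1024, 1024)
    myfft = lock(PLAN_LOCK) do
        plan_fft!(X, 1)   # guard only plan creation
    end
    myfft * X             # execution left unguarded; that is the open question
    return nothing
end
```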

Hi,

I can confirm the crash.

Just to clarify: you don't need to load FFTW.jl; using CUDA.CUFFT is enough. FFTW.jl only handles Arrays, whereas CUDA.CUFFT handles CuArrays.

I'm also wondering why you don't use batched FFTs. That ensures the FFT is properly parallelized on the GPU, which would not happen with a for loop. Or does that not work in your application?

julia> using CUDA, CUDA.CUFFT

julia> arr = CUDA.randn(32, 124, 124);

julia> p = plan_fft(arr, (1,2))
CUFFT complex forward plan for 32×124×124 CuArray of ComplexF32

julia> CUDA.@time CUDA.@sync p * arr; # FFT along first and second dim
  0.026062 seconds (13.54 k CPU allocations: 958.493 KiB) (3 GPU allocations: 11.262 MiB, 0.07% memmgmt time)

julia> CUDA.@time CUDA.@sync p * arr; # FFT along first and second dim
  0.000897 seconds (92 CPU allocations: 4.406 KiB) (3 GPU allocations: 11.262 MiB, 35.61% memmgmt time)

The code snippet is a simple MWE designed only to reproduce the crash. My actual problem is more complicated and organized differently: I am doing more than just FFTs, and I am using threads both to maintain separate GPU streams and to parallelize CPU-bound tasks.
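For context on the stream side, my understanding is that CUDA.jl gives each Julia task its own stream, so the @spawn pattern is how I get concurrent GPU work. Roughly (a sketch, names are mine):

```julia
using CUDA

# CUDA.jl assigns each Julia task its own CUDA stream, so spawning one
# task per unit of GPU work lets kernels from different tasks overlap.
function overlapped_matmuls(n)
    ts = [Threads.@spawn begin
              A = CUDA.randn(Float32, 1024, 1024)
              CUDA.@sync A * A      # synchronizes this task's stream only
          end for _ in 1:n]
    foreach(wait, ts)
    return nothing
end
```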

Thanks for the CUFFT hint; I had thought I needed to load FFTW to access the plan interface. I still get the same crash with CUDA.CUFFT, though.


The functionality in question was supposed to be thread-safe (all accesses to cache.active_handles are behind a lock), and it's the lock operation itself that is failing here:

error in running finalizer: ErrorException("val already in a list")
error at ./error.jl:35
push! at ./linked_list.jl:53 [inlined]
_wait2 at ./condition.jl:87
#wait#621 at ./condition.jl:127
wait at ./condition.jl:125 [inlined]
slowlock at ./lock.jl:156
lock at ./lock.jl:147 [inlined]
lock at ./lock.jl:227
push! at /home/jon/.julia/packages/CUDA/p5OVK/lib/utils/cache.jl:72 [inlined]

Please open an issue, and be sure to mention which Julia version you’re using.
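The usual details to include in the report can be gathered like this (CUDA.versioninfo() is CUDA.jl's own helper for driver/toolkit details):

```julia
julia> versioninfo()        # Julia version, OS, and thread count

julia> using CUDA

julia> CUDA.versioninfo()   # driver, toolkit, and CUDA.jl versions
```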


Thanks for the quick response. I submitted an issue.
