Occasional long delays in CUDA.jl

We have a real-time CUDA.jl application that must stay within a given time budget in order to keep up with a streaming input signal. Although steady-state performance meets timing, we occasionally see fairly long delays that can exceed our buffer depths, causing the application to fall behind and drop data. Below is an MWE with some simple broadcast arithmetic. Each loop iteration typically takes around 15 ms, but longer times of 30-50 ms occur regularly, and occasionally a very long stall of >500 ms is observed. This isn’t due to compilation; it occurs at random points throughout the loop. In this simple example such long stalls are relatively infrequent (one or two in 1000 iterations), but in our multithreaded application they seem to occur more regularly, as often as every few seconds.

We initially suspected GC, but that does not appear to be the issue, assuming the GC stats can be trusted. Running a similar experiment on the CPU produces much more uniform results. Is there something in CUDA.jl or the CUDA driver that might be causing this behavior?

using CUDA

function simpleDummyLoop(gpu = true, Nloops = 1000)

    timevec = zeros(Float64,Nloops)

    if gpu
        x = CUDA.randn(1000,1000,100);
        y = CUDA.randn(1000,1000,100);
    else
        #reduce size for CPU ops to keep compute time similar
        x = randn(1000,1000,5);
        y = randn(1000,1000,5);
    end

    maxGC = 0;
    for n=1:Nloops
        gcstats = Base.gc_num()

        # sleep(0.01);

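        # make sure all previously queued GPU work has finished before timing this iteration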
        CUDA.synchronize()
        t = @CUDA.elapsed begin
            x .= y .* 2.0
            y .= x ./ 2.0
            CUDA.synchronize()
        end

        timevec[n] = t*1e3
        gcstats = Base.GC_Diff(gcstats, Base.gc_num())
        gctime = gcstats.total_time / 1e6
        gcevents = gcstats.pause
        if gctime > maxGC 
            maxGC = gctime
        end

        @info "Loop $n, Time $(round(t*1e3,digits=1))ms -- GC: $gcevents events, $(round(gctime,digits=1))ms"
    end

    tmax = round(maximum(timevec),digits=1)
    tmean = round(sum(timevec)/Nloops,digits=1)

    if gpu
        @info "GPU:"
    else    
        @info "CPU:"
    end
    @info "$Nloops loops, mean time $(tmean)ms, max time: $(tmax)ms, max GC time: $(maxGC)ms"
end
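
Calling it for both backends, e.g.

simpleDummyLoop(true)     # GPU run
simpleDummyLoop(false)    # CPU run (smaller arrays to keep compute time similar)

gives output like: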
[ Info: GPU:
[ Info: 1000 loops, mean time 14.5ms, max time: 757.9ms, max GC time: 0ms
[ Info: CPU:
[ Info: 1000 loops, mean time 11.1ms, max time: 12.3ms, max GC time: 0ms

GC pauses would be my main thought as well. Try running under NSight Systems to see where the stalls occur.

Another pause I’ve run into is due to the CUDA driver itself compacting its memory pool (GPU stalls due to stream synchronization -- even when idle? - CUDA Programming and Performance - NVIDIA Developer Forums), which should have been improved by allowing the pool to consume all memory (Slight changes to pool management by maleadt · Pull Request #1344 · JuliaGPU/CUDA.jl · GitHub). That’s part of CUDA.jl 3.8 – which version are you using? And which version of the CUDA driver?
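
As an aside, CUDA.versioninfo() should report most of this in one go:

using CUDA
CUDA.versioninfo()   # prints toolkit/driver versions and the visible devices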


I am running CUDA v11.6 and CUDA.jl v3.10.0

Looking at the Nsight trace, it appears that the GPU is fully idle and all CPU threads are waiting on a mutex (repeated calls to pthread_mutex_trylock and jl_task_get_next). I don’t think I see any GC activity apart from several calls to get_pgcstack_static, which don’t appear to take any time. However, I’m new to using Nsight Systems, so it’s possible I’m missing something, and I’m not quite sure what a GC event would look like.

Here is a list of all events within the first 5ms or so of the stall:

Name Start Duration TID
pthread_cond_wait 6.65738s 15.852 ms 7072
pthread_cond_wait 6.6574s 15.805 ms 7066
pthread_cond_wait 6.6574s 15.874 ms 7074
pthread_cond_wait 6.6574s 15.815 ms 7067
pthread_cond_wait 6.6574s 15.874 ms 7071
pthread_cond_wait 6.65741s 15.845 ms 7070
pthread_cond_wait 6.65741s 15.844 ms 7069
pthread_cond_wait 6.65741s 15.840 ms 7075
pthread_cond_wait 6.65741s 15.837 ms 7076
pthread_cond_wait 6.65747s 15.753 ms 7068
pthread_cond_wait 6.65748s 15.771 ms 7073
sem_wait 6.67305s 425.543 ms 8822
poll 6.67311s 83.114 ms 7946
pthread_cond_signal 6.67317s 2.431 µs 7058
Nsight Systems frames 6.67319s - 7058
pthread_cond_signal 6.67319s 2.201 µs 7058
pthread_cond_signal 6.67319s 2.338 µs 7058
pthread_cond_signal 6.67319s 1.960 µs 7058
pthread_cond_signal 6.6732s 2.124 µs 7058
pthread_cond_signal 6.6732s 2.444 µs 7058
pthread_cond_signal 6.6732s 2.457 µs 7058
pthread_cond_signal 6.6732s 2.144 µs 7058
pthread_cond_signal 6.67321s 5.645 µs 7058
pthread_cond_signal 6.67321s 2.754 µs 7058
pthread_cond_signal 6.67322s 2.774 µs 7058
epoll_wait 6.67324s 2.991 µs 7058
pthread_cond_wait 6.67324s 53.882 µs 7066
pthread_cond_wait 6.67325s 380.147 ms 7068
pthread_cond_signal 6.67325s 32.334 µs 7058
pthread_cond_wait 6.67325s 2.158 ms 7067
pthread_cond_signal 6.67328s 2.751 µs 7058
pthread_cond_signal 6.67328s 16.447 µs 7058
pthread_mutex_trylock 6.67337s 4.710 µs 7066
pthread_mutex_trylock 6.67337s 6.033 µs 7076
pthread_mutex_trylock 6.67337s 63.120 µs 7073
Nsight Systems frames 6.67382s - 7072
libjulia-internal.so.1.7!get_next_task 6.67387s - 7070
libjulia-internal.so.1.7!get_next_task 6.67402s - 7071
libToolsInjection64.so!NSYS_OSRT_pthread_mutex_trylock_0 6.6741s - 7073
libjulia-internal.so.1.7!get_next_task 6.67417s - 7076
libjulia-internal.so.1.7!uv_mutex_trylock 6.67423s - 7075
pthread_mutex_trylock 6.67434s 14.903 µs 7066
pthread_mutex_trylock 6.67437s 3.147 µs 7074
pthread_mutex_trylock 6.67437s 4.217 µs 7076
libjulia-internal.so.1.7!get_next_task 6.67455s - 7074
libjulia-internal.so.1.7!get_next_task 6.67462s - 7069
libjulia-internal.so.1.7!get_next_task 6.67495s - 7058
pthread_mutex_trylock 6.67511s 57.013 µs 7073
pthread_mutex_trylock 6.67535s 60.289 µs 7073
pthread_mutex_trylock 6.67545s 1.072 µs 7074
libjulia-internal.so.1.7!get_next_task 6.67545s - 7070
libjulia-internal.so.1.7!get_next_task 6.67546s - 7072
libjulia-internal.so.1.7!get_next_task 6.67564s - 7071

That’s likely the implementation of synchronize, and specifically the non-blocking synchronization that tries to yield to other tasks while waiting for the GPU to finish: https://github.com/JuliaGPU/CUDA.jl/blob/183e21b8d651bff367a5586ce5203b0de7b8cf4a/lib/cudadrv/stream.jl#L137-L189=. However, here the GPU is seemingly idle. Are you sure all threads are stuck like that? One thing you could try is to remove the call to nonblocking_synchronize from our synchronize implementation: https://github.com/JuliaGPU/CUDA.jl/blob/183e21b8d651bff367a5586ce5203b0de7b8cf4a/lib/cudadrv/stream.jl#L126-L128=. It will then fall straight through to the (blocking) API call, which shouldn’t make a functional difference but should be much clearer in the Nsight API trace.
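
For anyone following along, the experiment amounts to something like the following in lib/cudadrv/stream.jl (a rough sketch of the patched method, not the exact upstream source):

function synchronize(stream::CuStream=stream())
    # nonblocking_synchronize(stream)   # removed for the experiment
    cuStreamSynchronize(stream)         # block in the driver until the stream is idle
end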

Very interesting. Removing the nonblocking_synchronize call appears to eliminate the stalls. Is nonblocking_synchronize primarily there to allow other tasks within the same thread to continue running during GPU sync? If we have a multithreaded application with a single Task per GPU thread, would this call still be useful/needed?

Regarding the thread usage during stalls: I am not super familiar with Nsight Systems, but if I set the filter to the time slice of the stall and select “Show in Events View” on the top-level Threads item, then sort by duration, the only non-zero-duration events I see are sem_wait, pthread_cond_wait, poll, and pthread_mutex_trylock. Most of the zero-duration events are get_next_task and Nsight frame calls, along with a handful of other things that look fairly benign. I’m happy to share the report if you like.

I can try to further instrument nonblocking_synchronize to see why it is stalling when the GPU is idle. Let me know if you have any specific suggestions of what to look for.

I added some timers as well as a direct polling loop to nonblocking_synchronize and am getting some seemingly inconsistent results, where the stall appears to occur at different points.

This suggests to me that the root cause may be slow context switches between threads?

EDIT: I’ve added NVTX events and confirmed that in situations where my polling loop is “slow” (tpoll >> 15ms), the polling loop is simply not running during that interval. So the root cause appears to be thread preemption, either by the Julia scheduler or by the OS.

[ Info: nonblocking_synchronize: tpoll: 0.10452413558959961, tevent: 0.1044621467590332, tcatch: 0.10453915596008301, tdone: 0.1045839786529541
[ Info: nonblocking_synchronize: tpoll: 0.015796899795532227, tevent: 0.01578497886657715, tcatch: 0.7607128620147705, tdone: 0.7607369422912598
[ Info: nonblocking_synchronize: tpoll: 0.015258073806762695, tevent: 0.38925600051879883, tcatch: 0.38937807083129883, tdone: 0.3894209861755371

Code:

@inline function nonblocking_synchronize(stream::CuStream)
    # fast path
    t1 = time()
    isdone(stream) && return

    # minimize latency of short operations by busy-waiting,
    # initially without even yielding to other tasks
    spins = 0
    while spins < 256
        if spins < 32
            ccall(:jl_cpu_pause, Cvoid, ())
            # Temporary solution before we have gc transition support in codegen.
            ccall(:jl_gc_safepoint, Cvoid, ())
        else
            yield()
        end
        isdone(stream) && return
        spins += 1
    end

    # minimize CPU usage of long-running kernels by waiting for an event signalled by CUDA
    event = Base.Event()
    launch(; stream) do
        notify(event)
    end

    tevent = -1.0
    tcatch = -1.0
    tdone = -1.0
    tpoll = -1.0

    # if an error occurs, the callback may never fire, so use a timer to detect such cases
    dev = device()
    timer = Timer(0; interval=1)
    Base.@sync begin
        Threads.@spawn try
            device!(dev)
            while true
                try
                    Base.wait(timer)
                catch err
                    tcatch = time() - t1
                    if err isa EOFError
                        break
                    end
                    rethrow()
                end
                if unsafe_cuStreamQuery(stream) != ERROR_NOT_READY
                    break
                end
            end
        finally
            notify(event)
        end

        Threads.@spawn begin
            Base.wait(event)
            tevent = time() - t1
            close(timer)
        end

        #add direct polling
        while(!isdone(stream))
            sleep(0.001)
        end
        tpoll = time() - t1

    end

    tdone = time() - t1
    @info "nonblocking_synchronize: tpoll: $tpoll, tevent: $tevent, tcatch: $tcatch, tdone: $tdone"

    return
end
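
For reference, the NVTX markers mentioned in the edit above can be emitted along these lines (an illustrative sketch assuming the NVTX.@mark macro from the NVTX.jl package; not the exact instrumentation used):

using NVTX

# inside the direct polling loop added above
while !isdone(stream)
    NVTX.@mark "poll tick"   # if no markers appear during a stall, this task isn't being scheduled
    sleep(0.001)
end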

Correct, so you can probably remove it for your app.

I’m not sure what you mean. That loop isn’t in upstream CUDA.jl; what are you trying to measure?
It’d be interesting to know how the nonblocking_synchronize finishes in the slow case. There are multiple possible paths: first it spins, then it spins + yields, then it waits for CUDA to notify an event, and finally there’s a timer to catch cases where the stream errored and the aforementioned event would never be notified.

That loop isn’t in upstream CUDA.jl; what are you trying to measure?

Correct. I wanted to understand what was hanging, so I added the polling loop to see whether isdone(stream) would return true once the GPU was actually idle.

The stalls don’t seem to occur during the initial spin stage, but during the event/timer stage. I saw this manifest in several different ways: sometimes the thread waiting for the event would return and call close(timer) in a timely manner, but then the watchdog thread wouldn’t catch the close event until several hundred ms later. Or wait(event) itself would return late, long after the polling loop showed the stream was ready. Finally, the polling loop in the main thread would itself sometimes stall. By emitting an NVTX marker inside this loop, I could see that no markers were emitted during such stalls, and thus the thread must not have been running. Thread starvation/preemption or slow context switching would also explain the delays observed in the other threads. Overall, three context switches are required for the timer stage to complete, and if any one of these threads is blocked or a context switch is slow, the overall sync call will be slow to return. I can’t quite fathom how all the threads could be stalled for hundreds of ms while the CPU is not otherwise doing much, but I’m aware there are some known issues and limitations in the Julia scheduler…

I did just test our application without using nonblocking_synchronize and it does seem to be substantially more stable. Would it be possible to add a flag or setting to turn this on and off? Ideally such a flag could be set globally or per stream, since synchronize() is sometimes called directly within CUDA.jl.

Thanks for all your help in getting to the bottom of this!

Just a follow-up with some data to highlight the impact of this issue for our use case. In the toy example above, the stalls occurred rather infrequently, and although they introduced occasional long delays they didn’t seriously degrade average throughput. However, in our compute-heavy multithreaded application this becomes a regular occurrence with a massive impact on performance. Below is an Nsight capture of our actual application with 5 pipelined GPU streams under full load. Note that we are not actually calling CUDA.synchronize() from our application code, but it is called from within CUDA.findall().

Upstream implementation with nonblocking_synchronize enabled: stalls occur every few seconds, massively reducing throughput and increasing latency. Ultimately we are unable to keep up with the real-time input in this case and drop data on each stall.

Patch with nonblocking_synchronize disabled. No stalls occur, the GPU remains loaded, and data input buffer remains near empty.

Overall, I think this issue is fundamentally due to a flaw in the Julia scheduler rather than a CUDA.jl problem per se, but it would be very useful to have a workaround available. I’d also be interested in any thoughts on why the Julia scheduler behaves this way, but that may be a topic for another post.

Thanks for digging into this.

That one may be due to CUDA notifying the event late, i.e., not immediately after the stream becomes idle. It’s not clear that this is caused by the Julia scheduler (although it likely is).

That one seems unambiguously caused by Julia though. Extracting the relevant parts from nonblocking_synchronize and annotating with your findings:

# event that gets signalled externally when a condition is met
event = Base.Event()

# sometimes though, the event doesn't get notified,
# so use a timer to separately check the condition
timer = Timer(0; interval=1)
Base.@sync begin
    Threads.@spawn try
        while true
            try
                Base.wait(timer)
            catch err
                err isa EOFError && break
                rethrow()
            end

            if condition_met(...)
                break
            end
        end
    finally
        notify(event)
    end

    # in case the event was properly signalled when the condition was met,
    # close the timer
    Threads.@spawn begin
        Base.wait(event)
        close(timer)
        # XXX: sometimes there's a very large delay between this close,
        #      and the above timer task actually returning
    end
end

I’m not able to reproduce any large delays between closing the timer and its task actually returning, but it may still be that I’m doing something suboptimally. Let’s call in some experts: @tkf, @jameson, is there anything I’m doing wrong here that may explain large latencies? Is there a better way to have low-latency event signalling without falling back to polling (we used to just do while !condition; sleep; yield; end)?

Threads are not expected to do IO. So they have to wait on thread 1 to do it for them.

Could you elaborate on this? Is this a limitation of the underlying libuv implementation, of Julia, of the scheduler, or some form of global lock on the IO subsystem?

With respect, your statement would indicate to me that there is a massive performance limitation for anyone doing any form of parallel IO in Julia. Isn’t one of the major use cases of threading to allow for parallel IO?

What constitutes I/O here? Just the timer? Or also the Event and notifying/waiting for it?

This is interesting, but I’m still confused as to what is going on, since the main thread is not blocked and yet we are seeing long stalls of hundreds of ms. Using a third-party profiler (NVIDIA Nsight), I indeed see that the call to close(timer) results in a write() call from the main thread, which is presumably the close on the timer fd. Immediately following this, in the watchdog thread I see a call to libjulia-internal.so.1.7!get_next_task and then a pthread_cond_wait that does not return for 127ms.

During that 127ms there is essentially no activity in any of the threads, apart from mutex polling. Specifically, in the main thread there are periodic calls to libjulia-internal.so.1.7!get_next_task, pthread_cond_signal, pthread_mutex_trylock, and libjulia-internal.so.1.7!uv_mutex_trylock. Perhaps something is somehow locking libuv? There are a number of other threads all similarly calling get_next_task and pthread_mutex_trylock. No threads seem to be busy with anything. Eventually I see a read() in the main thread, followed by the catch of the EOF in the watchdog thread.

Hopefully this summary makes sense. I can share detailed logs if that would be helpful.

Edit: If you are trying to reproduce, I sometimes have to run the MWE code several times to catch one of these stall events, but I have observed this behavior on several different machines.

Edit 2: We have a data service running on our system that performs a lot of heavy I/O, reading from a DMA device and writing to a shared-memory file. This service does not use Julia or libuv, and does not interact with the test code. When I disable this service, the number and severity of the stalls seen in the test code appear to be reduced, although they still do occur. Is it possible that a separate I/O-heavy process could result in extended preemption of the libuv subsystem (and, by extension, Timers)?

From talking with @vchuravy, I think the limitation is “just” I/O that’s performed using libuv, due to its single-threaded event loop. So IIUC channels, locks, etc. do not fall under this restriction, but here we’re using AsyncCondition and Timer, both of which are implemented using libuv.
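
To illustrate the distinction: a plain Base.Event is a scheduler-level primitive, while a Timer is serviced by libuv’s event loop, which runs on a single thread. A minimal sketch (illustrative only):

# scheduler-level wait: no libuv involvement
done = Base.Event()
Threads.@spawn begin
    # ... background work ...
    notify(done)
end
wait(done)

# libuv-backed wait: the timer fires from the libuv event loop
t = Timer(0; interval=1)
wait(t)
close(t)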

Sorry, I’m not familiar enough with libuv to know what could cause this. Maybe you’re calling into a library from the libuv thread, and that system library is then doing a blocking syscall?

For a solution to your issue, we could either introduce an environment variable that disables the nonblocking synchronization, or rework it so that it doesn’t rely on libuv. The simplest alternative that comes to mind is a busy-loop that calls yield() (to give other tasks the opportunity to execute), maybe with some exponential back-off and a sleep() so it doesn’t hog the CPU too much (although CUDA’s default synchronization behavior is to spin as well).
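
A minimal sketch of that second option, reusing isdone(stream) from earlier in the thread (the function name and back-off constants are illustrative, not an actual CUDA.jl API):

function polling_synchronize(stream::CuStream)
    delay = 0.0                                    # start by only yielding
    while !isdone(stream)
        yield()                                    # give other Julia tasks a chance to run
        delay > 0 && sleep(delay)                  # back off to keep CPU usage bounded
        delay = min(max(2 * delay, 0.001), 0.05)   # exponential back-off, capped at 50 ms
    end
    return
end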

Hopefully fixed by Perform synchronization on a worker thread by maleadt · Pull Request #2025 · JuliaGPU/CUDA.jl · GitHub.
