We have a real-time CUDA.jl application that must stay within a fixed time budget in order to keep up with a streaming input signal. Although the steady-state performance meets timing, we occasionally see fairly long delays that can exceed our buffer depths, causing the application to fall behind and drop data. Below is an MWE with some simple broadcast arithmetic. Each loop typically takes around 15 ms, but longer loop times of 30-50 ms occur regularly, and occasionally a very long stall of >500 ms is observed. These stalls are not due to compilation: they occur at random points throughout the loop. In this simple example, such long stalls are relatively infrequent (one or two in 1000 iterations), but in our multithreaded application they seem to occur more regularly, as often as every few seconds.
We initially suspected GC, but that does not appear to be the issue, assuming the GC stats can be trusted. Running a similar experiment on the CPU produces much more uniform results. Is there something in CUDA.jl or the CUDA driver that might be causing this behavior?
using CUDA

function simpleDummyLoop(gpu = true, Nloops = 1000)
    timevec = zeros(Float64, Nloops)
    if gpu
        x = CUDA.randn(1000, 1000, 100)
        y = CUDA.randn(1000, 1000, 100)
    else
        # reduce size for CPU ops to keep compute time similar
        x = randn(1000, 1000, 5)
        y = randn(1000, 1000, 5)
    end
    maxGC = 0
    for n = 1:Nloops
        gcstats = Base.gc_num()
        # sleep(0.01)
        CUDA.synchronize()
        t = @CUDA.elapsed begin
            x .= y .* 2.0
            y .= x ./ 2.0
            CUDA.synchronize()
        end
        timevec[n] = t * 1e3
        # Base.GC_Diff takes (new, old); the reverse order yields negative differences
        gcstats = Base.GC_Diff(Base.gc_num(), gcstats)
        gctime = gcstats.total_time / 1e6  # ns -> ms
        gcevents = gcstats.pause
        if gctime > maxGC
            maxGC = gctime
        end
        @info "Loop $n, Time $(round(t*1e3,digits=1))ms -- GC: $gcevents events, $(round(gctime,digits=1))ms"
    end
    tmax = round(maximum(timevec), digits=1)
    tmean = round(sum(timevec) / Nloops, digits=1)
    if gpu
        @info "GPU:"
    else
        @info "CPU:"
    end
    @info "$Nloops loops, mean time $(tmean)ms, max time: $(tmax)ms, max GC time: $(maxGC)ms"
end
[ Info: GPU:
[ Info: 1000 loops, mean time 14.5ms, max time: 757.9ms, max GC time: 0ms
[ Info: CPU:
[ Info: 1000 loops, mean time 11.1ms, max time: 12.3ms, max GC time: 0ms
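The mean and max alone do not show how often the stalls happen or at which iterations. As a rough sketch (not part of the MWE above), a helper along these lines could summarize the per-loop times. The name `summarize_latency`, the `stall_factor` keyword, and the definition of a stall as "more than twice the median" are all assumptions made for illustration, and it presumes `simpleDummyLoop` is changed to return `timevec`. The example below uses synthetic numbers, not measured data.

```julia
using Statistics  # stdlib: mean, median, quantile

# Hypothetical helper (not from the original post): summarize per-iteration
# times in milliseconds and list the iterations that exceed a stall threshold.
function summarize_latency(times_ms::AbstractVector{<:Real}; stall_factor::Real = 2.0)
    isempty(times_ms) && throw(ArgumentError("times_ms must not be empty"))
    med = median(times_ms)
    threshold = stall_factor * med            # assumed definition of a "stall"
    stalls = findall(>(threshold), times_ms)  # indices of the slow iterations
    return (mean = mean(times_ms),
            median = med,
            p99 = quantile(times_ms, 0.99),
            max = maximum(times_ms),
            threshold = threshold,
            nstalls = length(stalls),
            stall_indices = stalls)
end

# Synthetic example: 998 loops at 15 ms plus two slow iterations.
times = fill(15.0, 1000)
times[250] = 45.0
times[900] = 750.0
stats = summarize_latency(times)
@info "median $(stats.median) ms, max $(stats.max) ms, $(stats.nstalls) stalls at $(stats.stall_indices)"
```

Logging the stall indices next to the per-loop GC numbers would make it easier to see whether a long iteration coincides with a GC pause or with something else on the host.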