Crashes and high utilization while training with Flux on the GPU

Hey there lovely Julia community!

I’m quite new to Julia, but I’m so excited about all the awesome packages and the wonderful community. I’ve already used PyTorch a lot, also with GPUs on several machines, and always got it working.

But now I’ve installed Flux and tried to execute the MNIST CNN from the Model Zoo (https://github.com/FluxML/model-zoo/blob/master/vision/mnist/conv.jl). The first time it ran through, but I noticed spikes in GPU usage during which I couldn’t use my apps for a few seconds - I couldn’t scroll in Chrome, but I could still move the mouse cursor.
About 90% of the following attempts failed because my monitors went black, and I think some driver crashed - the video driver or the CUDA driver (I’m not sure whether these are the “same”?).
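
For context, this is roughly the pattern the script follows - a stripped-down sketch from memory using the Flux v0.10-era API from my environment below, not the exact conv.jl code:

using Flux
using Flux: onehotbatch, crossentropy, params
using Flux.Data.MNIST
using CuArrays          # loading CuArrays lets gpu() move arrays onto the GPU

# Only a small subset here, just to exercise the GPU; the real script trains on all of MNIST.
imgs   = MNIST.images()[1:1024]
labels = MNIST.labels()[1:1024]

# Stack the 28x28 grayscale images into a 28x28x1xN Float32 array (WHCN layout).
X = Array{Float32}(undef, 28, 28, 1, length(imgs))
for i in eachindex(imgs)
    X[:, :, 1, i] = Float32.(imgs[i])
end
X = gpu(X)
Y = gpu(onehotbatch(labels, 0:9))

model = Chain(
    Conv((3, 3), 1 => 16, relu),  MaxPool((2, 2)),
    Conv((3, 3), 16 => 32, relu), MaxPool((2, 2)),
    x -> reshape(x, :, size(x, 4)),   # flatten to features x batch
    Dense(5 * 5 * 32, 10),
    softmax) |> gpu

loss(x, y) = crossentropy(model(x), y)
opt = ADAM()

for epoch in 1:5
    Flux.train!(loss, params(model), [(X, Y)], opt)
    @show loss(X, Y)
end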

At first I thought it was maybe because of RTX Voice, Folding@Home or BOINC running (but paused). But now I’ve killed all those tasks and it still crashed halfway through training.
It always crashes after a different amount of time - sometimes right after training starts, sometimes in the middle or near the end.

It’s saying this:

ERROR: WARNING: Error while freeing CUDAdrv.CuPtr{Nothing}(0x0000000d32000000):
CUDAdrv.CuError(code=CUDAdrv.cudaError_enum(0x000003e7), meta=nothing)

In my defense: I’ve also installed it on a different machine (a P5000), also on Windows, and there it worked flawlessly without these laggy performance spikes. But Flux used a lot of VRAM the whole time - as far as I’ve read, that’s because it reserves a lot of VRAM without actually using all of it (so not bad, just an allocation strategy)?
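
For what it’s worth, the driver-level numbers (what nvidia-smi and the Task Manager show) can also be read from inside Julia. A small sketch, assuming CUDAdrv 6.x exposes Mem.info() returning free and total bytes as I remember:

using CUDAdrv

# The driver's view includes everything CuArrays has reserved for its pool,
# not just the arrays that are currently alive.
free, total = CUDAdrv.Mem.info()
used_gib  = (total - free) / 2^30
total_gib = total / 2^30
println("GPU memory in use (driver view): $(round(used_gib, digits = 2)) / $(round(total_gib, digits = 2)) GiB")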

I’ve tried reinstalling CUDA. I’ve also restarted my computer several times. I’ve been trying to get this running for the past 4 days.
Tested on Julia 1.4 and 1.4.1, installing Flux, CuArrays, CUDAdrv and CUDAnative in many different ways. I also used the script from the Model Zoo.

GPU RTX 2080 TI (EVGA Hybrid Cooled)
Using 3 monitors, so about 2 GB of VRAM is already in use
NVMe SSD as the main drive, with only 40 GB free - but that shouldn’t be a problem?
CUDA 10.2
CUDNN installed (Version for 10.2)
GPU Driver Version: 445.87
OS Microsoft Windows 10 Education (got all the Windows updates!)
Version 10.0.18363 Build 18363
Processor Intel(R) Core™ i7-9800X CPU @ 3.80GHz, 3792 MHz, 8 cores
Mainboard Asus WS X299 SAGE/10G
RAM 32.0 GB

[Task Manager screenshot of GPU utilization spikes during training]
It crashed on the last spike, after only 1-5 minutes of training.
CUDAErrorLogCrash.txt

As I’ve already said, I’ve tried many package setups, but this is the current one:

Status  `C:\Users\Peter\.julia\environments\v1.4\Project.toml`
[c52e3926] Atom v0.12.10
[fbb218c0] BSON v0.2.6
[3895d2a7] CUDAapi v4.0.0
[c5f51814] CUDAdrv v6.3.0
[be33ccc6] CUDAnative v3.1.0
[3a865a2d] CuArrays v2.1.0
[7a1cc6ca] FFTW v1.2.1
[1a297f60] FillArrays v0.8.9
[587475ba] Flux v0.10.4
[0c68f7d7] GPUArrays v3.3.0
[7073ff75] IJulia v1.21.2
[d61cbc2d] JuliaTemplatePlayground v0.1.0 [ `C:\Users\Peter\.julia\dev\JuliaTemplatePlayground` ]
[e5e0dc1b] Juno v0.8.1
[d4b2101a] Lint v0.0.0 #master (https://github.com/tonyhffong/Lint.jl)
[d96e819e] Parameters v0.12.1
[14b8a8f1] PkgTemplates v0.6.4
[b3cc710f] StaticLint v4.3.0

Thank you so much in advance!
Best Regards
Peter

I’ve fixed it… It was what I suspected at first. But when I tried to kill the RTX Voice task it didn’t actually terminate, which is why I thought for days that the tool wasn’t the problem…
RTX Voice is the problem. For anybody who doesn’t know what this tool is: it removes noise from a microphone and speakers with a DL model.
NVIDIA RTX Voice
The weird thing is that it doesn’t use many resources, nearly none (much less than 4%).
NVIDIA claims that the tool uses the Tensor cores built into their RTX series.

I’ve run a few different test cycles, and every time I start RTX Voice before running the train() method, I get these sky-rocketing spikes to full 3D utilization.
Without RTX Voice you can barely see any utilization under 3D usage.

So I’m not sure why this is a problem? Is there some restriction that you can’t use two CUDA-using applications simultaneously?
When I used PyTorch with CUDA/GPU (and RTX Voice) it worked fine, so maybe there is still a problem with Flux or one of its dependencies?
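
If anyone wants to narrow it down with me, the minimal check I have in mind is something like this - a sketch using only CuArrays (no Flux, no Zygote), run while RTX Voice is active:

using CuArrays

# Keep the GPU busy with plain CuArrays work to see whether the driver crash
# needs Flux at all, or whether any sustained CUDA workload next to RTX Voice triggers it.
a = CuArrays.rand(Float32, 4096, 4096)
b = CuArrays.rand(Float32, 4096, 4096)

for i in 1:500
    c = a * b          # cuBLAS matrix multiply on the GPU
    s = sum(c)         # reducing to a scalar forces the work to finish
    i % 100 == 0 && println("iteration $i, sum = $s")
end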

EDIT:
I’ve also let Folding@Home run GPU work, which also uses CUDA as far as I know, and when I ran the Julia CNN at the same time the same thing happened: the CUDA/GPU driver(?) crashed. The Task Manager looked very different from when I used RTX Voice, but it crashed after about 15 seconds.

It looks like you can’t run two CUDA applications simultaneously with Flux (or CuArrays, CUDAnative, CUDAdrv, whatever…)? While you can with PyTorch.

Thank you for the update.
I would have suggested running nvidia-smi in a terminal while your Julia application executes.
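
Something like this from a second Julia session (or simply the equivalent command in a terminal) logs utilization and memory once a second; it only relies on nvidia-smi’s standard --query-gpu options:

# Poll nvidia-smi once per second while training runs elsewhere; stop with Ctrl+C.
while true
    run(`nvidia-smi --query-gpu=timestamp,utilization.gpu,memory.used,memory.total --format=csv,noheader`)
    sleep(1)
end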
