Hey there lovely Julia community!
I’m quite new to Julia, but I’m so excited about all the awesome packages and the wonderful community. I’ve already used PyTorch a lot, including on the GPU on several machines, and always got it working.
But now I’ve installed Flux and tried to run the MNIST CNN from the Model Zoo (https://github.com/FluxML/model-zoo/blob/master/vision/mnist/conv.jl). The first time it ran through, but I noticed spikes in GPU usage during which I couldn’t use my apps for a few seconds: I couldn’t scroll in Chrome, but I could still move the mouse cursor.
About 90% of the following attempts failed because my monitors went black, and I think some driver crashed, either the video driver or the CUDA driver (I’m not sure whether these are the “same”).
At first I thought it might be because of RTX Voice, Folding@Home or BOINC running (but paused). But now I’ve killed all of those tasks and it still crashed halfway through training.
It crashes at a different point each time: sometimes right after training starts, sometimes in the middle or near the end.
The error message is:
ERROR: WARNING: Error while freeing CUDAdrv.CuPtr{Nothing}(0x0000000d32000000):
CUDAdrv.CuError(code=CUDAdrv.cudaError_enum(0x000003e7), meta=nothing)
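For reference, the hex code in that message can be converted to decimal with a short sketch like the one below. As far as I can tell, 999 corresponds to `CUDA_ERROR_UNKNOWN` in the CUDA driver API, so the code on its own does not say much about the cause. The variable names are just for illustration.

```julia
# Error code copied from the CuError message above.
code = 0x000003e7

# Hex literals are unsigned integers in Julia; convert to Int to read the decimal value.
decimal = Int(code)

println(decimal)  # prints 999
```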
In my defense: I’ve also installed it on a different machine (P5000), also running Windows, and there it worked flawlessly, without these laggy performance spikes. Flux did use a lot of VRAM the whole time, though. As far as I have read, that is because it reserves a lot of VRAM without actually using it (so not bad, just a strategy)?
I’ve tried reinstalling CUDA. I’ve also restarted my computer several times. I have been trying to get this running for the past 4 days.
Tested on Julia 1.4 and 1.4.1, installing Flux, CuArrays, CUDAdrv and CUDAnative in many different ways. I also used the script(?) from the Model Zoo.
GPU RTX 2080 Ti (EVGA Hybrid Cooled)
Using 3 monitors, so about 2GB of VRAM is already in use
The main drive is an NVMe SSD with only 40GB free, but that shouldn’t be a problem?
CUDA 10.2
CUDNN installed (Version for 10.2)
GPU Driver Version: 445.87
OS Microsoft Windows 10 Education (got all the Windows updates!)
Version 10.0.18363 Build 18363
Processor Intel(R) Core™ i7-9800X CPU @ 3.80GHz, 3792 MHz, 8 cores
Mainboard Asus WS X299 SAGE/10G
RAM 32,0 GB
It crashed on the last spike, after only about 1 to 5 minutes of training.
CUDAErrorLogCrash.txt
As I’ve already said, I tried many package installations, but this is the current one:
Status `C:\Users\Peter\.julia\environments\v1.4\Project.toml`
[c52e3926] Atom v0.12.10
[fbb218c0] BSON v0.2.6
[3895d2a7] CUDAapi v4.0.0
[c5f51814] CUDAdrv v6.3.0
[be33ccc6] CUDAnative v3.1.0
[3a865a2d] CuArrays v2.1.0
[7a1cc6ca] FFTW v1.2.1
[1a297f60] FillArrays v0.8.9
[587475ba] Flux v0.10.4
[0c68f7d7] GPUArrays v3.3.0
[7073ff75] IJulia v1.21.2
[d61cbc2d] JuliaTemplatePlayground v0.1.0 [ `C:\Users\Peter\.julia\dev\JuliaTemplatePlayground` ]
[e5e0dc1b] Juno v0.8.1
[d4b2101a] Lint v0.0.0 #master (https://github.com/tonyhffong/Lint.jl)
[d96e819e] Parameters v0.12.1
[14b8a8f1] PkgTemplates v0.6.4
[b3cc710f] StaticLint v4.3.0
Thank you so much in advance!
Best Regards
Peter