I am trying to run the MNIST digit recognition example from model-zoo on a GPU, using the latest Flux version with Zygote. I had to make some tweaks to the code to make it run: https://github.com/tanhevg/model-zoo/blob/tanhevg/mnist-zygote/vision/mnist/conv.jl. One strange pattern I observe is that after training for a while and almost converging (test accuracy > 0.99), the accuracy suddenly drops below 0.1 and stays there. Inspecting the model parameters, they are full of NaNs. If I save the model just before it degenerates and compute its gradients, NaNs are returned non-deterministically (i.e. sometimes it returns a valid gradient, and sometimes NaNs).
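For reference, this is roughly how I probe the saved model for NaN gradients before applying an update. It is a minimal sketch assuming the `model`, `loss`, and `opt` from conv.jl are in scope; `has_nan_grads` is my own helper, not a Flux function:

```julia
using Flux

# Hypothetical helper: scan every parameter's gradient for NaNs,
# so a bad step can be detected (or skipped) before it poisons the model.
function has_nan_grads(gs, ps)
    any(p -> gs[p] !== nothing && any(isnan, gs[p]), ps)
end

ps = Flux.params(model)                 # `model` is the conv net from conv.jl
gs = gradient(() -> loss(x, y), ps)     # `loss`, `x`, `y` as in the training loop

if has_nan_grads(gs, ps)
    @warn "NaN gradient detected; skipping this update"
else
    Flux.Optimise.update!(opt, ps, gs)
end
```

Skipping the update this way only masks the problem, but it at least confirms that the NaNs originate in the gradients rather than in the optimiser state.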
I suspect this is related to running on a GPU; I have not observed this behaviour on a CPU. Interestingly, the learning curve on the CPU looks completely different from the one on the GPU (in addition to being much slower, which is expected): e.g. on the GPU the accuracy after the first epoch is > 0.9, while on the CPU it is < 0.11.
Has anyone else run into this? Can anyone suggest a workaround?
Thanks in advance.
julia> versioninfo()
Julia Version 1.3.0
Commit 46ce4d7933 (2019-11-26 06:09 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Xeon(R) CPU E5-2640 0 @ 2.50GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.1 (ORCJIT, sandybridge)
Environment:
  JULIA_DEPOT_PATH = /data/.julia
  JULIA_WORKER_TIMEOUT = 300
  JULIA_EDITOR = atom -a
  JULIA_NUM_THREADS = 12
  JULIA_PROJECT = @.
$ nvidia-smi -L
GPU 0: Quadro P2000
$ nvidia-smi | grep SMI
| NVIDIA-SMI 418.87.00    Driver Version: 440.44    CUDA Version: 10.2 |