I'm fairly new to Flux, and I've run into a problem where training halts after some number of epochs. There is no CUDA out-of-memory error, but GPU memory usage is extremely high for this simple linear model (99.97% on a 1080 Ti). The code sometimes finishes all 500 epochs without problems, but other times halts around epoch 150. Here's an example code snippet:
```julia
using LinearAlgebra
using Flux
using CuArrays, CUDAnative
using Flux.Optimise: update!
using Flux: crossentropy

device!(1)                  # select the second GPU
CuArrays.allowscalar(false)

pred_loss(x, y) = sum((x .- y) .^ 2)

# dimensions
B = 250

linear = Dense(400, 144) |> gpu

# normalize each column of the weight matrix to unit norm
linear.W .= linear.W ./ sqrt.(sum(linear.W .^ 2, dims=1))

# training
E = 500
opt_U = Descent(0.01)
for e = 1:E
    running_l = 0.0
    c = 0
    for b = 1:100
        y = rand(144, B) |> gpu
        R = zeros(400, size(y, 2)) |> gpu
        l = 0.0
        grads = gradient(params(linear.W)) do
            l = pred_loss(y, linear(R))
            return l
        end
        running_l += l      # accumulate outside the differentiated closure
        update!(opt_U, linear.W, grads[linear.W])
        # re-normalize the columns after each gradient step
        linear.W .= linear.W ./ sqrt.(sum(linear.W .^ 2, dims=1))
        c += 1
    end
    println("Epoch: $e, Running loss: $(running_l / c)")
end
```
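One workaround I'm considering (just a sketch, under my unconfirmed assumption that the CuArrays pooling allocator holding on to cached memory is what drives usage near 100%) is to periodically force a collection between epochs:

```julia
# Sketch: periodically release cached GPU memory, assuming the
# pooling allocator is the culprit (not verified).
for e = 1:E
    # ... inner training loop as above ...
    if e % 10 == 0
        GC.gc()             # free unreferenced CuArrays on the Julia side
        CuArrays.reclaim()  # return cached pool memory to the CUDA driver
    end
end
```

I don't know yet whether this addresses the hang or just the high memory reading.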
I'm having this problem on Ubuntu 18.04, using CuArrays v2.1.0. I'd appreciate any pointers on this.