I’m fairly new to CuArrays and Flux, and I’ve run into a problem where training halts after some epochs. There is no CUDA out-of-memory error, but memory usage is extremely high for this simple linear model (99.97% on a 1080 Ti). The code sometimes finishes all 500 epochs without problems, but other times halts around epoch 150. Here’s an example code snippet:
using LinearAlgebra
using Flux
using CuArrays, CUDAnative
using Flux.Optimise: update!
using Flux: crossentropy
device!(1)
CuArrays.allowscalar(false)
pred_loss(x, y) = sum((x .- y) .^ 2)
# batch size
B = 250
linear = Dense(400, 144) |> gpu
# normalize the columns of W to unit norm
linear.W .= linear.W ./ sqrt.(sum(linear.W .^ 2, dims=1));
# training loop
E = 500
opt_U = Descent(0.01)
for e = 1:E
    running_l = 0
    c = 0
    for b = 1:100
        y = rand(144, B) |> gpu
        R = zeros(400, size(y, 2)) |> gpu
        l = 0
        grads = gradient(params(linear.W)) do
            l = pred_loss(y, linear(R))
            running_l += l
            return l
        end
        update!(opt_U, linear.W, grads[linear.W])
        # re-normalize the columns of W after each update
        linear.W .= linear.W ./ sqrt.(sum(linear.W .^ 2, dims=1))
        c += 1
    end
    println("Epoch: $e, Running loss: $(running_l / c)")
end
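One thing I suspect is that allocating fresh `y` and `R` arrays on every inner iteration keeps the CuArrays memory pool nearly full. Below is a sketch of a workaround I’ve been trying (I’m not sure it’s the right fix): allocate the GPU buffers once outside the loop and periodically release cached pool memory. `GC.gc()` and `CuArrays.reclaim()` are my assumptions about the relevant calls, and the reclaim interval of 10 epochs is arbitrary.

# Sketch: reuse GPU buffers and periodically reclaim pooled memory.
# Assumes the same `linear`, `B`, and `E` as above.
y = CuArrays.zeros(Float32, 144, B)   # batch targets, allocated once
R = CuArrays.zeros(Float32, 400, B)   # model input, allocated once
for e = 1:E
    for b = 1:100
        # overwrite the existing GPU buffer instead of allocating a new one
        copyto!(y, rand(Float32, 144, B))
        # ... gradient step as in the original loop ...
    end
    if e % 10 == 0
        GC.gc()              # free dead CuArray references on the Julia side
        CuArrays.reclaim()   # return cached blocks from the pool to the driver
    end
end

With the buffers hoisted out of the loop, the pool should stay roughly constant in size, which might also make the intermittent halting easier to diagnose.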
I’m seeing this on Ubuntu 18.04 with CuArrays v2.1.0. I’d appreciate any pointers.