I'm fairly new to Flux, and I've run into a problem where training halts after some number of epochs. There is no CUDA out-of-memory error, but GPU memory usage is extremely high for this simple linear model (99.97% on a 1080 Ti). The code sometimes finishes all 500 epochs without problems, but other times halts around epoch 150. Here's an example code snippet:
```julia
using LinearAlgebra
using Flux
using CuArrays, CUDAnative
using Flux.Optimise: update!
using Flux: crossentropy

device!(1)                  # select the second GPU
CuArrays.allowscalar(false)

pred_loss(x, y) = sum((x .- y) .^ 2)

# dimensions
B = 250

linear = Dense(400, 144) |> gpu

# normalize each column of the weight matrix to unit norm
linear.W .= linear.W ./ sqrt.(sum(linear.W .^ 2, dims=1))

# training
E = 500
opt_U = Descent(0.01)
for e = 1:E
    running_l = 0.0
    c = 0
    for b = 1:100
        y = rand(144, B) |> gpu
        R = zeros(400, size(y, 2)) |> gpu
        l = 0.0
        grads = gradient(params(linear.W)) do
            l = pred_loss(y, linear(R))
            return l
        end
        running_l += l      # accumulate outside the differentiated closure
        update!(opt_U, linear.W, grads[linear.W])
        # re-normalize the columns after each gradient step
        linear.W .= linear.W ./ sqrt.(sum(linear.W .^ 2, dims=1))
        c += 1
    end
    println("Epoch: $e, Running loss: $(running_l / c)")
end
```
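One workaround I'm considering (just a sketch, under my unconfirmed assumption that the CuArrays pooling allocator holding on to cached memory is what drives usage near 100%) is to periodically force a collection between epochs:

```julia
# Sketch: periodically release cached GPU memory, assuming the
# pooling allocator is the culprit (not verified).
for e = 1:E
    # ... inner training loop as above ...
    if e % 10 == 0
        GC.gc()             # free unreferenced CuArrays on the Julia side
        CuArrays.reclaim()  # return cached pool memory to the CUDA driver
    end
end
```

I don't know yet whether this addresses the hang or just the high memory reading.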
I'm having this problem on Ubuntu 18.04, using CuArrays v2.1.0. I'd appreciate any pointers on this.