Training Halts when Using CuArrays

lpjiang97 · April 23, 2020, 4:40am

(open issue here)

I’m fairly new to CuArrays and Flux, and I’ve run into a problem where training halts after some number of epochs. There is no CUDA out-of-memory error, but GPU usage is extremely high for such a simple linear model (99.97% on a 1080 Ti). The code sometimes finishes all 500 epochs without problems, but at other times it halts around epoch 150. Here’s an example code snippet:

using LinearAlgebra
using Flux
using CuArrays, CUDAnative
using Flux.Optimise: update!
using Flux: crossentropy

device!(1)
CuArrays.allowscalar(false)
pred_loss(x, y) = sum((x .- y) .^ 2)

# batch size
B = 250
linear = Dense(400, 144) |> gpu
# normalize each column of W to unit L2 norm
linear.W .= linear.W ./ sqrt.(sum(linear.W .^ 2, dims=1));
# training
E = 500
opt_U = Descent(0.01)
for e = 1:E
    running_l = 0
    c = 0
    for b = 1:100
        y = rand(144, B) |> gpu
        R = zeros(400, size(y, 2)) |> gpu
        l = 0
        grads = gradient(params(linear.W)) do
            l = pred_loss(y, linear(R))
            running_l += l
            return l
        end
        update!(opt_U, linear.W, grads[linear.W])
        linear.W .= linear.W ./ sqrt.(sum(linear.W .^ 2, dims=1))
        c += 1
    end
    println("Epoch: $e, Running loss: $(running_l / c)")
end
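
To make the normalization step in the loop explicit: it is meant to rescale every column of the weight matrix to unit L2 norm after each update. Here is a minimal CPU-only sketch of just that step. The matrix `W` below is a hypothetical stand-in with random contents and the same shape as `linear.W`; it is not part of the training code above and does not touch the GPU.

```julia
using LinearAlgebra

# Hypothetical stand-in for linear.W: Dense(400, 144) stores a 144×400 matrix
W = rand(Float32, 144, 400)

# Same expression as in the training loop: sum of squares over rows (dims=1)
# gives one value per column, so each column is divided by its own L2 norm
W .= W ./ sqrt.(sum(W .^ 2, dims=1))

# Check that every column now has norm ≈ 1
@assert all(c -> isapprox(norm(c), 1; atol=1e-5), eachcol(W))
```

This step behaves as I expect on the CPU, so I don’t think the normalization arithmetic itself is the issue.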

I’m seeing this on Ubuntu 18.04 with CuArrays v2.1.0. I would appreciate any pointers.
