Hi there
I was trying to run the VGG16 vision model from the Flux model-zoo: https://github.com/FluxML/model-zoo/blob/master/vision/cifar10/cifar10.jl
I found that performance on the GPU was quite slow, mainly because of how eagerly Flux allocates memory. So I modified the code to keep clearing memory as training goes along, and it now takes about one minute per epoch, so the whole training run finishes in under an hour.
To optimise further, and to avoid blocking the GPU while the evalcb callback runs, I thought I would offload that function to the CPU. So far so good.
But I'm not an expert at this, and unfortunately nothing is shown or printed by the callback that is supposed to display the loss.
Any ideas on how to do this, please?
using Distributed
addprocs(2)

# Most of the code from the Model Zoo link....

function train(; kws...)
    # Initialize the hyperparameters
    args = Args(; kws...)

    # Load the train, validation data
    train, val, train_gpu, val_gpu = get_processed_data(args)

    @info("Constructing Model")
    # Defining the loss and accuracy functions
    m = vgg16()
    loss(x, y) = logitcrossentropy(m(x), y)

    ## Training
    # Defining the callback and the optimizer
    function free_mem()
        GC.gc()
        CUDA.reclaim()
    end

    function evalcb()
        m_cpu = m |> cpu
        remote_do(() -> @show(logitcrossentropy(m_cpu(val[1]), val[2])), 2)
    end

    opt = ADAM(args.lr)

    @info("Training....")
    # Starting to train models
    Flux.@epochs args.epochs Flux.train!(loss, params(m), train_gpu, opt, cb = [free_mem, throttle(evalcb, 100)])

    return m
end
The line remote_do(() -> @show(logitcrossentropy(m_cpu(val[1]), val[2])), 2) is supposed to offload that computation to worker 2 on the CPU, but I don't see any results in Jupyter.
Please help
Output:
┌ Info: Constructing Model
└ @ Main In[149]:8
┌ Info: Training…
└ @ Main In[149]:27
┌ Info: Epoch 1
└ @ Main C:\Users\Gurvesh Sanghera\.julia\packages\Flux\05b38\src\optimise\train.jl:114
┌ Info: Epoch 2
└ @ Main C:\Users\Gurvesh Sanghera\.julia\packages\Flux\05b38\src\optimise\train.jl:114
┌ Info: Epoch 3
└ @ Main C:\Users\Gurvesh Sanghera\.julia\packages\Flux\05b38\src\optimise\train.jl:114
…
and so on… no Loss
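For what it's worth, here is a minimal sketch (toy values, no Flux involved) of what I think is going on: remote_do discards its return value and any @show inside it prints on the worker's own stdout, which Jupyter may not capture, whereas remotecall_fetch brings the value back to the master process where it can be printed normally.

```julia
using Distributed

# Spin up one worker for the experiment (assumes a fresh Julia session).
addprocs(1)

# remote_do discards the return value; the @show inside runs on the
# worker, so its output goes to the worker's stdout, not this cell.
remote_do(() -> @show(2 + 2), workers()[1])

# remotecall_fetch runs the closure on the worker but returns the result
# to the master process, where @show prints as usual.
result = remotecall_fetch(() -> 2 + 2, workers()[1])
@show result
```

So one option might be to have evalcb fetch the loss value back with remotecall_fetch (or an async variant) and print it on the master, rather than relying on the worker's stdout.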