Flux usage: Sending callback function to CPU on a separate thread

Hi there

I was trying to run the Vision vgg16 model in the model-zoo in Flux. https://github.com/FluxML/model-zoo/blob/master/vision/cifar10/cifar10.jl

I found that performance on the GPU was quite slow, mainly because of how eagerly Flux allocates memory. So I modified the code to keep clearing memory as training goes along; it now takes about 1 minute per epoch, so the whole training run finishes in under an hour.
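
For reference, the gist of that memory-clearing change is a callback along these lines (a minimal sketch, assuming CUDA.jl is loaded; the full version is in the code further down):

using CUDA

# Force a Julia GC pass, then hand freed GPU buffers back to the CUDA memory pool.
free_mem() = (GC.gc(); CUDA.reclaim())

This gets passed as one of the callbacks to Flux.train!, as shown below.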

To optimise further, and to avoid blocking the GPU while the evalcb callback runs, I thought I would offload that function to the CPU. So far so good.

But I'm not an expert in this, and unfortunately nothing is shown or printed by the callback that is supposed to display the loss.

Any ideas on how to do this please?

using Distributed
addprocs(2)

# Most of the code from the Model Zoo link....

function train(; kws...)
    # Initialize the hyperparameters
    args = Args(; kws...)
	
    # Load the train, validation data 
    train, val, train_gpu, val_gpu = get_processed_data(args)

    @info("Constructing Model")	
    # Defining the loss and accuracy functions
    m = vgg16()

    loss(x, y) = logitcrossentropy(m(x), y)

    ## Training
    # Defining the callback and the optimizer
    function free_mem()
        GC.gc()
        CUDA.reclaim()
    end
    
    function evalcb()
        m_cpu = m |> cpu    # copy the model weights back to the CPU
        # intended to evaluate the validation loss on worker 2, away from the GPU
        remote_do(() -> @show(logitcrossentropy(m_cpu(val[1]), val[2])), 2)
    end
    
    opt = ADAM(args.lr)
    @info("Training....")
    # Starting to train models
    Flux.@epochs args.epochs Flux.train!(loss, params(m), train_gpu, opt, cb = [free_mem, throttle(evalcb, 100)])

    return m
end

The line remote_do(() -> @show(logitcrossentropy(m_cpu(val[1]), val[2])), 2) is supposed to offload that computation to the CPU, but I don’t see any results in Jupyter.
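
Stripped down to just the Distributed part, this is the pattern I am going for (a minimal sketch, nothing Flux-specific):

using Distributed
addprocs(1)          # one worker, which gets id 2

# remote_do discards the return value, so the only visible effect is whatever
# the worker itself prints.
remote_do(() -> println("hello from worker ", myid()), 2)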

Please help 🙂

Output:

┌ Info: Constructing Model
└ @ Main In[149]:8
┌ Info: Training…
└ @ Main In[149]:27
┌ Info: Epoch 1
└ @ Main C:\Users\Gurvesh Sanghera\.julia\packages\Flux\05b38\src\optimise\train.jl:114
┌ Info: Epoch 2
└ @ Main C:\Users\Gurvesh Sanghera\.julia\packages\Flux\05b38\src\optimise\train.jl:114
┌ Info: Epoch 3
└ @ Main C:\Users\Gurvesh Sanghera\.julia\packages\Flux\05b38\src\optimise\train.jl:114

and so on… but no loss is ever printed.

Try running this code as a script directly or from the REPL (i.e. not Jupyter). The callback should be running, so my hunch is that Jupyter isn’t capturing the output.
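
If you want the value back on the master process, so it gets printed there regardless of how worker output is forwarded, one alternative to remote_do would be remotecall_fetch (just a sketch, not tested against your model):

using Distributed
addprocs(1)

# remotecall_fetch runs the closure on worker 2 and returns its result to the
# master, so the master can print it itself.
loss_val = remotecall_fetch(() -> sum(abs2, rand(100)), 2)
@show loss_val

Note that remotecall_fetch blocks the caller until the worker finishes, so inside a training callback you would probably wrap it in @async to keep the GPU loop going.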

Changing my previous reply - it works! It is a lot more involved than I thought. The whole code needs to be made available to all the procs, so Jupyter was never going to work! I was also trying to run this from the wrong folder…

For anyone else looking at how to do this:

This is the final code of my train function. Everything else is the same as in the model-zoo, except that I keep both the CPU and GPU versions of all the arrays, so the callback can easily be run on the CPU.
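
The data loading itself is the model-zoo's; the tweak is just that it returns both copies. A hypothetical sketch with dummy data, only to show the shape of what get_processed_data now returns:

using Flux

# Dummy stand-in for the real CIFAR10 loader: keep the CPU arrays and also
# make GPU copies, so the callback can work purely with the CPU versions.
function get_processed_data(args)
    X = rand(Float32, 32, 32, 3, 64)            # fake images
    Y = Flux.onehotbatch(rand(0:9, 64), 0:9)    # fake labels
    train = [(X, Y)]                            # one CPU minibatch
    val   = (X, Y)                              # CPU validation set
    return train, val, gpu.(train), gpu(val)
end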

function train(; kws...)
    # Initialize the hyperparameters
    args = Args(; kws...)
	
    # Load the train, validation data 
    train, val, train_gpu, val_gpu = get_processed_data(args)

    @info("Constructing Model")	
    # Defining the loss and accuracy functions
    m = vgg16()

    loss(x, y) = logitcrossentropy(m(x), y)

    ## Training
    # Defining the callback and the optimizer
    function free_mem()
        GC.gc()
        CUDA.reclaim()
    end
    
    function ecb()
        # ecb captures m and val; remote_do ships this closure to worker 2,
        # where cpu(m) moves the weights off the GPU before the loss is evaluated
        @show(logitcrossentropy(cpu(m)(val[1]), val[2]))
    end
    
    opt = ADAM(args.lr)
    @info("Training....")
    # Start training; the throttled callback fires the loss evaluation on worker 2
    Flux.@epochs args.epochs Flux.train!(loss, params(m), train_gpu, opt;
        cb = [free_mem, throttle(() -> remote_do(ecb, 2), 240)])

    return m
end

After this, save the file as a script. Then start Julia either with julia -p 2, or load Distributed and call addprocs.
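
In other words, either of these (nothing model-specific):

# Option 1: start Julia with two workers already attached
#   julia -p 2

# Option 2: add them from inside the session
using Distributed
addprocs(2)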

And finally - profit! Note that we need to load all the deps, and the full code, on all the procs (hence the liberal sprinkling of @everywhere).

julia> @everywhere using Pkg; @everywhere Pkg.activate("."); @everywhere Pkg.instantiate()
 Activating environment at `D:\Dropbox\Code\julia\model-zoo\tutorials\Project.toml`
      From worker 3:     Activating environment at `D:\Dropbox\Code\julia\model-zoo\tutorials\Project.toml`
      From worker 2:     Activating environment at `D:\Dropbox\Code\julia\model-zoo\tutorials\Project.toml`

julia> @everywhere include("cifar10_new.jl")

julia> m = train()
[ Info: Constructing Model
[ Info: Training....
[ Info: Epoch 1
      From worker 2:    logitcrossentropy((cpu(m))(val[1]), val[2]) = 2.3026755f0
[ Info: Epoch 2
      From worker 2:    logitcrossentropy((cpu(m))(val[1]), val[2]) = 1.287541f0
[ Info: Epoch 3
      From worker 2:    logitcrossentropy((cpu(m))(val[1]), val[2]) = 0.96682256f0
      From worker 2:    logitcrossentropy((cpu(m))(val[1]), val[2]) = 0.99137723f0
      From worker 2:    logitcrossentropy((cpu(m))(val[1]), val[2]) = 1.2300518f0
[ Info: Epoch 4
      From worker 2:    logitcrossentropy((cpu(m))(val[1]), val[2]) = 0.8948067f0
...... and so on

The CPU and GPU run independently, with no blocking (I verified in Task Manager). Training happens on the GPU, and the loss callback runs on the CPU.
