Flux: GPU not working as expected

https://github.com/FluxML/model-zoo/blob/master/vision/mnist/mlp.jl

I am adapting the above example to my own set of inputs. The example itself works as expected.
When I change the input from 28x28 MNIST pictures to my own input (10x5 Array{Float32,2}), the GPU only works hard if I keep the training set small. If I increase the size of the training set to > 500,000 items, the data loads onto the GPU (example: 5.2 GB used of 6 GB available), but Task Manager shows the GPU sitting idle and training is very slow (similar to CPU speed).

I am using the latest packages of Flux & Zygote.

How can I get the DataLoader to make use of my GPU?
@with_kw mutable struct Args
    η::Float64 = 3e-4       # learning rate
    batchsize::Int = 32     # batch size
    epochs::Int = 10        # number of epochs
    device::Function = gpu  # set as gpu, if gpu available
end

train_data,test_data = getdata(args)

# Construct model
m = build_model()
train_data = args.device.(train_data)
test_data = args.device.(test_data)
m = args.device(m)

Could you post the actual data setup and training code? The snippet above doesn’t actually run the model on any data, so it’s impossible to tell where the performance bottleneck is.

Sure, it’s essentially the same as the Model Zoo example (I swapped the 28x28xN input pictures for 5x10xN, also Float32).
I’m only utilising 7% of the GPU [Nvidia 1060] when I increase N above a small number (i.e. the CPU is faster).

if has_cuda()		# Check if CUDA is available
    @info "CUDA is on"
    import CuArrays		# If CUDA is available, import CuArrays
    CuArrays.allowscalar(false)
end

@with_kw mutable struct Args
    η::Float64 = 3e-4       # learning rate
    batchsize::Int = 32   # batch size
    epochs::Int = 10        # number of epochs
    device::Function = cpu  # set as gpu, if gpu available
end
function getdata(args)
    # Loading Dataset
    endOfTrain=Int64(trunc(length(Output)*0.7))
    xtrain=Imatrix[:,:,1:endOfTrain]
    xtest=Imatrix[:,:,endOfTrain+1:length(Output)]

    ytrain=Output[1:endOfTrain]
    ytest=Output[endOfTrain+1:length(Output)]
    # Flatten each image into a linear array
    xtrain = Flux.flatten(xtrain)
    xtest = Flux.flatten(xtest)

    # One-hot-encode the labels
    ytrain, ytest = onehotbatch(ytrain, 0:1), onehotbatch(ytest, 0:1)

    # Batching
    train_data = DataLoader(xtrain, ytrain, batchsize=args.batchsize, shuffle=true)
    test_data = DataLoader(xtest, ytest, batchsize=args.batchsize)

    return train_data, test_data
end

function build_model(; imgsize=(5,10,1), nclasses=2)
    return Chain(
        Dense(prod(imgsize), 32, relu),
        Dense(32, 32, relu),
        Dense(32, nclasses))
end

function loss_all(dataloader, model)
    l = 0f0
    for (x,y) in dataloader
        l += logitcrossentropy(model(x), y)
    end
    l/length(dataloader)
end

function train(; kws...)
    # Initializing Model parameters
    args = Args(; kws...)

    # Load Data
    train_data,test_data = getdata(args)

    # Construct model
    m = build_model()
    train_data = args.device.(train_data)
    test_data = args.device.(test_data)
    m = args.device(m)
    loss(x,y) = logitcrossentropy(m(x), y)

    ## Training
    evalcb = Flux.throttle(() -> @show(loss_all(train_data, m), accuracy(test_data, m)), 5)
    
    opt = ADAM(args.η)

    @epochs args.epochs Flux.train!(loss, Flux.params(m), train_data, opt)

    @show accuracy(train_data, m)

    @show accuracy(test_data, m)
    return m
end

Continuing the discussion from Flux: GPU not working as expected:

  • set device::Function = gpu when testing the speed of the GPU (see the sketch below)
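Since train(; kws...) forwards keywords into Args(; kws...), the same thing can be done at the call site instead of editing the struct default. A minimal sketch, reusing the train and has_cuda definitions quoted above:

# Override the Args default (device::Function = cpu) for this run only;
# fall back to cpu if CUDA is not available.
train(device = has_cuda() ? gpu : cpu)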

(Quick tip: use triple-backtick (```) or the preformatted text button for code samples. It’s very difficult to follow the snippets above. You should be able to edit the previous posts.)

WRT your original question, I would try to increase the batch size. A 5x10xN matrix multiply is tiny by GPU standards and is unlikely to gain you much of a performance advantage unless N is ludicrously large (think upwards of 1000). Note how the current implementation uses a batch size of 1024 for size 28x28 = 784 inputs. Those are an order of magnitude larger, and yet I only see ~16% utilization on an RTX 2070!

TL;DR moving data to the GPU is important, but isn’t going to make things faster if only a small batch is run through the model at a time. If your data is small enough, it may be advantageous to stick with CPU-based Flux and not worry about all the complexity that GPU acceleration adds.
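As a concrete illustration of the batch-size suggestion (reusing the train(; kws...) entry point from the code above; 1024 is just a starting point to experiment with, not a tuned value):

# batchsize flows through Args into DataLoader(xtrain, ytrain, batchsize=args.batchsize),
# so a larger value gives the GPU more parallel work per kernel launch.
m = train(batchsize = 1024, device = gpu)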


Thanks for the quick response and the suggestion

Increasing the batch size (to 2048) improves speed a lot, but the CPU still beats the GPU.

The data is 10m - 30m output data points depending on setup, so I would have thought it’s possible to use the GPU at more than 15%.

Is it possible to multithread in Flux on CPU?

Considering the MNIST input size is >10x that of your inputs, you may have to increase batch size even further (instead of just 2x), assuming your GPU has enough VRAM.

I should clarify that the total size of your dataset has no impact on CPU vs GPU performance. Since each iteration of SGD only works with one batch, the total size is irrelevant and only the batch size should be considered.
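One way to see this is to time a single batch through the model on CPU vs GPU, independent of the full dataset. A rough sketch, assuming BenchmarkTools is installed and reusing build_model from above (the sum is only there to force the result back to the host so the GPU timing isn’t cut short by asynchronous execution):

using Flux, BenchmarkTools
import CuArrays

m_cpu = build_model()               # Dense(50, 32) -> Dense(32, 32) -> Dense(32, 2)
m_gpu = gpu(m_cpu)

x_cpu = rand(Float32, 50, 2048)     # one batch of 2048 flattened 5x10 inputs
x_gpu = gpu(x_cpu)

@btime sum($m_cpu($x_cpu))          # per-batch cost on CPU
@btime sum($m_gpu($x_gpu))          # per-batch cost on GPU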

The underlying linear algebra functions Flux uses are multi-threaded by default. If you want to scale up even further, I would suggest looking into parallel/distributed programming in Julia.
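Concretely, the Dense layers bottom out in OpenBLAS matrix multiplies, which manage their own thread pool; Julia-level threads for your own data-prep code are a separate setting chosen at startup. A sketch using only standard library calls, nothing Flux-specific:

using LinearAlgebra

BLAS.set_num_threads(Sys.CPU_THREADS)  # let OpenBLAS use all available cores
Threads.nthreads()                     # how many Julia threads this session was started with
# Julia threads are set at startup, e.g. JULIA_NUM_THREADS=8 julia
# (or `julia --threads 8` on Julia >= 1.5)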