Flux training: different results on GPU vs CPU

Hi,
I have a problem when training on GPU vs CPU. On GPU, even a single call to my train function goes completely off, while on CPU it just converges and doesn’t wander further. Below is an example of 3 runs on GPU. The loss blows up, and I don’t understand why running train2() again starts close to where the previous run ended; it shouldn’t, because as far as I can tell the variables are not global.
On CPU I can run it as many times as I want and it always gives a result similar to the first one: it starts from different initial values but then converges to roughly the same place. On GPU it just goes off.

julia> m, argsout, resout = train2(tr_data, te_data, modelfct;...);
[ Info: Training on GPU
[ Info: Dataset: 1600 train and 400 test examples
[ Info: Model: 1185 trainable params. Chain(Dense(1, 64, relu), Dense(64, 16, relu), Dropout(0.5), Dense(16, 1), #879)
[ Info: Start Training . . .
Epoch: 0    Train: 1.0487   Test: 43.2849   
Epoch: 30   Train: 1.1291   Test: 1.1006    
Epoch: 60   Train: 1.1291   Test: 1.1006    
Epoch: 90   Train: 1.1291   Test: 1.1006    
Epoch: 120  Train: 1.1291   Test: 1.1006    
Epoch: 150  Train: 1.1291   Test: 1.1006    

julia> m, argsout, resout = train2(tr_data, te_data, modelfct;...);
[ Info: Training on GPU
[ Info: Dataset: 1600 train and 400 test examples
[ Info: Model: 1185 trainable params. Chain(Dense(1, 64, relu), Dense(64, 16, relu), Dropout(0.5), Dense(16, 1), #879)
[ Info: Start Training . . .
Epoch: 0    Train: 1.1291   Test: 1.1006    
Epoch: 30   Train: 1.1291   Test: 1.1006    
Epoch: 60   Train: 1.1291   Test: 1.1006    
Epoch: 90   Train: 1.1291   Test: 1.1006    
Epoch: 120  Train: 131.7943     Test: 169.8546  
Epoch: 150  Train: 46.2814  Test: 97.1106   

julia> m, argsout, resout = train2(tr_data, te_data, modelfct;...);
[ Info: Training on GPU
[ Info: Dataset: 1600 train and 400 test examples
[ Info: Model: 1185 trainable params. Chain(Dense(1, 64, relu), Dense(64, 16, relu), Dropout(0.5), Dense(16, 1), #879)
[ Info: Start Training . . .
Epoch: 0    Train: 160.3743     Test: 143.1688  
Epoch: 30   Train: 160.3743     Test: 152.41    
Epoch: 60   Train: 116.2331     Test: 95.1814   
Epoch: 90   Train: 107.7877     Test: 159.5695  
Epoch: 120  Train: 105.0404     Test: 54.0994   
Epoch: 150  Train: 99.2565  Test: 69.0809   

Any ideas are more than welcome.

Based on the output, it definitely seems like something is being mutated. So each successive run is picking up where the previous run left off instead of starting from scratch. Do you mind posting the CPU version of the output you shared? From your description, it seems like the CPU version doesn’t have the mutation issue.
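To illustrate what I mean (this is purely a hypothetical sketch, not your code): if the model is created once at the top level and only captured inside the training function, the same parameter arrays get updated in place on every call, so each run starts from where the last one stopped.

using Flux

# Hypothetical: `model` is built once at top level, outside the training function.
model = Chain(Dense(1, 64, relu), Dense(64, 1))

function train_once!(data; opt = Descent(0.01))
    ps = Flux.params(model)   # captures the *global* model's parameter arrays
    for (x, y) in data
        gs = gradient(() -> Flux.Losses.mse(model(x), y), ps)
        Flux.Optimise.update!(opt, ps, gs)   # mutates the shared arrays in place
    end
    return model   # a second call to train_once! continues from these weights
end

If something like this is going on, constructing a fresh model (and optimizer) inside the function on every call would make each run start from scratch.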

Hi, thanks. It does look like what you describe, but I just can’t find where it happens. Here are 3 runs on CPU and 3 runs on GPU. The CPU behaves as I would expect: it learns a bit, has the same starting loss every run, and doesn’t wander off a stable path.

julia> m, argsout, resout = train2(tr_data, te_data, modelfct; cuda=false,...);
[ Info: Training on CPU
[ Info: Dataset: 1600 train and 400 test examples
[ Info: Model: 1185 trainable params. Chain(Dense(1, 64, relu), Dense(64, 16, relu), Dropout(0.5), Dense(16, 1), #185)
[ Info: Start Training . . .
Epoch: 0    Train: 1.1564   Test: 1.1672    
Epoch: 30   Train: 1.0754   Test: 1.08  
Epoch: 60   Train: 1.0749   Test: 1.0777    
Epoch: 90   Train: 1.0767   Test: 1.0828    
Epoch: 120  Train: 1.073    Test: 1.08  
Epoch: 150  Train: 1.0771   Test: 1.0833    

julia> m, argsout, resout = train2(tr_data, te_data, modelfct; cuda=false,...);
[ Info: Training on CPU
[ Info: Dataset: 1600 train and 400 test examples
[ Info: Model: 1185 trainable params. Chain(Dense(1, 64, relu), Dense(64, 16, relu), Dropout(0.5), Dense(16, 1), #185)
[ Info: Start Training . . .
Epoch: 0    Train: 1.1564   Test: 1.1672    
Epoch: 30   Train: 1.0754   Test: 1.08  
Epoch: 60   Train: 1.0749   Test: 1.0777    
Epoch: 90   Train: 1.0767   Test: 1.0828    
Epoch: 120  Train: 1.073    Test: 1.08  
Epoch: 150  Train: 1.0771   Test: 1.0833    

julia> m, argsout, resout = train2(tr_data, te_data, modelfct; cuda=false,...);
[ Info: Training on CPU
[ Info: Dataset: 1600 train and 400 test examples
[ Info: Model: 1185 trainable params. Chain(Dense(1, 64, relu), Dense(64, 16, relu), Dropout(0.5), Dense(16, 1), #185)
[ Info: Start Training . . .
Epoch: 0    Train: 1.1564   Test: 1.1672    
Epoch: 30   Train: 1.0754   Test: 1.08  
Epoch: 60   Train: 1.0749   Test: 1.0777    
Epoch: 90   Train: 1.0767   Test: 1.0828    
Epoch: 120  Train: 1.073    Test: 1.08  
Epoch: 150  Train: 1.0771   Test: 1.0833    

julia> m, argsout, resout = train2(tr_data, te_data, modelfct; cuda=true,...);
[ Info: Training on GPU
[ Info: Dataset: 1600 train and 400 test examples
[ Info: Model: 1185 trainable params. Chain(Dense(1, 64, relu), Dense(64, 16, relu), Dropout(0.5), Dense(16, 1), #185)
[ Info: Start Training . . .
Epoch: 0    Train: 1.1589   Test: 1.2147    
Epoch: 30   Train: 1.0768   Test: 1.0803    
Epoch: 60   Train: 41.8823  Test: 41.5863   
Epoch: 90   Train: 1.068    Test: 1.0709    
Epoch: 120  Train: 37.0247  Test: 82.1282   
Epoch: 150  Train: 44.2864  Test: 1.0749    

julia> m, argsout, resout = train2(tr_data, te_data, modelfct; cuda=true,...);
[ Info: Training on GPU
[ Info: Dataset: 1600 train and 400 test examples
[ Info: Model: 1185 trainable params. Chain(Dense(1, 64, relu), Dense(64, 16, relu), Dropout(0.5), Dense(16, 1), #185)
[ Info: Start Training . . .
Epoch: 0    Train: 34.6607  Test: 40.7039   
Epoch: 30   Train: 43.6453  Test: 1.0864    
Epoch: 60   Train: 12.0337  Test: 80.6082   
Epoch: 90   Train: 48.6058  Test: 45.9999   
Epoch: 120  Train: 22.2186  Test: 1.0751    
Epoch: 150  Train: 22.3228  Test: 138.1419  

julia> m, argsout, resout = train2(tr_data, te_data, modelfct; cuda=true,...);
[ Info: Training on GPU
[ Info: Dataset: 1600 train and 400 test examples
[ Info: Model: 1185 trainable params. Chain(Dense(1, 64, relu), Dense(64, 16, relu), Dropout(0.5), Dense(16, 1), #185)
[ Info: Start Training . . .
Epoch: 0    Train: 31.0082  Test: 50.5443   
Epoch: 30   Train: 45.2076  Test: 49.8443   
Epoch: 60   Train: 46.5327  Test: 1.0789    
Epoch: 90   Train: 55.8642  Test: 122.2016  
Epoch: 120  Train: 45.7922  Test: 46.6472   
Epoch: 150  Train: 86.8372  Test: 47.0398   

Hi, I would like to add a figure showing where I got to from yesterday’s situation.
It works “better”, but training on GPU is still very random. The code is exactly the same, the data too; I only set the device to gpu or cpu, and the losses look like the attached figure. I just don’t understand it.

Does anyone know what might be the problem? Thanks a lot!

You’ll have to post the inside of train2, since that’s where the mutation must be occurring. This seems like less of an ML issue and more of a programming error.


I think the problem was calling Flux.reset!() too many times inside the for loop over 1:epochs when training an RNN. I rewrote the code almost from scratch, inspired by mlp.jl from the model-zoo, and found that Flux.reset!() is only needed in the loss function and when I actually run the model. One more thing that might be helpful for someone: I put the model into a let ... end block, because I wanted to save the prediction whenever it improved:

# save the current best prediction whenever the model improves
if condition_that_model_improved
    let m1 = cpu(model)                         # copy of the model on the CPU
        update_best_yhat = m1(test_data_input)  # store the prediction on the test inputs
    end
end

and then continue training.
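Roughly, what I mean is something like this (a minimal sketch with made-up names, not my actual code): Flux.reset!() goes once inside the loss, before each sequence, and once before a standalone forward pass at evaluation time.

using Flux

rnn_model = Chain(RNN(1, 16), Dense(16, 1))

function seq_loss(m, xs, ys)
    Flux.reset!(m)   # clear the hidden state before each sequence
    sum(Flux.Losses.mse(m(x), y) for (x, y) in zip(xs, ys))
end

# ... and again before running the model on its own:
Flux.reset!(rnn_model)
# yhat = [rnn_model(x) for x in test_sequence]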

I am glad you figured it out. Since RNNs have hidden state, doing inference passes in the middle of training can cause issues. The cleanest solution here would probably be to not use Flux.train! (if you are) and to write your own training loop instead. This would let you keep a copy of each forward pass, which you can then store for future reference. For example,

# `m`, `opt`, `data`, `nepochs`, `loss`, and `improved` are placeholders for your own definitions
using Flux
using Flux.Optimise: update!

besty = nothing                  # best prediction so far
for epoch in 1:nepochs
  for (x, y) in data
    local ypred
    grad = gradient(params(m)) do
      ypred = m(x)               # keep a copy of the forward pass
      loss(ypred, y)             # return the loss so there is something to differentiate
    end
    update!(opt, params(m), grad)

    if besty === nothing || improved(ypred, besty)
      besty = ypred
    end
  end
end

@darsnack I have a similar problem, but I believe it is not related to GPU vs CPU. The problem appears when I wrap the training code inside a function, which actually prevents me from upgrading the code to a more dynamic custom training loop. When I do the latter, it always gives me the same result, but when I run it from the REPL it works fine. I have made a separate post for this issue. I don’t know if it would be better to just continue in this one… let me know @luboshanus if that’s OK.