Flux training: different results on GPU vs CPU

Hi,
I have a problem when training on GPU vs CPU. On GPU, even a single call to my train function goes completely off, while on CPU it just converges and doesn’t wander further. Below is an example of 3 runs on GPU. The loss blows up, and I don’t understand why running train2() again starts close to where the previous run ended; it shouldn’t, because as far as I can tell the variables are not global.
On CPU I can run it as many times as I want and it always gives a result similar to the first one: it starts from different initial values but then converges to roughly the same place. On GPU it just goes off.

julia> m, argsout, resout = train2(tr_data, te_data, modelfct;...);
[ Info: Training on GPU
[ Info: Dataset: 1600 train and 400 test examples
[ Info: Model: 1185 trainable params. Chain(Dense(1, 64, relu), Dense(64, 16, relu), Dropout(0.5), Dense(16, 1), #879)
[ Info: Start Training . . .
Epoch: 0    Train: 1.0487   Test: 43.2849   
Epoch: 30   Train: 1.1291   Test: 1.1006    
Epoch: 60   Train: 1.1291   Test: 1.1006    
Epoch: 90   Train: 1.1291   Test: 1.1006    
Epoch: 120  Train: 1.1291   Test: 1.1006    
Epoch: 150  Train: 1.1291   Test: 1.1006    

julia> m, argsout, resout = train2(tr_data, te_data, modelfct;...);
[ Info: Training on GPU
[ Info: Dataset: 1600 train and 400 test examples
[ Info: Model: 1185 trainable params. Chain(Dense(1, 64, relu), Dense(64, 16, relu), Dropout(0.5), Dense(16, 1), #879)
[ Info: Start Training . . .
Epoch: 0    Train: 1.1291   Test: 1.1006    
Epoch: 30   Train: 1.1291   Test: 1.1006    
Epoch: 60   Train: 1.1291   Test: 1.1006    
Epoch: 90   Train: 1.1291   Test: 1.1006    
Epoch: 120  Train: 131.7943     Test: 169.8546  
Epoch: 150  Train: 46.2814  Test: 97.1106   

julia> m, argsout, resout = train2(tr_data, te_data, modelfct;...);
[ Info: Training on GPU
[ Info: Dataset: 1600 train and 400 test examples
[ Info: Model: 1185 trainable params. Chain(Dense(1, 64, relu), Dense(64, 16, relu), Dropout(0.5), Dense(16, 1), #879)
[ Info: Start Training . . .
Epoch: 0    Train: 160.3743     Test: 143.1688  
Epoch: 30   Train: 160.3743     Test: 152.41    
Epoch: 60   Train: 116.2331     Test: 95.1814   
Epoch: 90   Train: 107.7877     Test: 159.5695  
Epoch: 120  Train: 105.0404     Test: 54.0994   
Epoch: 150  Train: 99.2565  Test: 69.0809   

Any ideas are more than welcome.

Based on the output, it definitely seems like something is being mutated. So each successive run is picking up where the previous run left off instead of starting from scratch. Do you mind posting the CPU version of the output you shared? From your description, it seems like the CPU version doesn’t have the mutation issue.
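To illustrate what I mean (this is purely a hypothetical sketch, not your code): if the model is created once at the top level and only captured inside the training function, the same parameter arrays get updated in place on every call, so each run starts from where the last one stopped.

using Flux

# Hypothetical: `model` is built once at top level, outside the training function.
model = Chain(Dense(1, 64, relu), Dense(64, 1))

function train_once!(data; opt = Descent(0.01))
    ps = Flux.params(model)   # captures the *global* model's parameter arrays
    for (x, y) in data
        gs = gradient(() -> Flux.Losses.mse(model(x), y), ps)
        Flux.Optimise.update!(opt, ps, gs)   # mutates the shared arrays in place
    end
    return model   # a second call to train_once! continues from these weights
end

If something like this is going on, constructing a fresh model (and optimizer) inside the function on every call would make each run start from scratch.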

Hi, thanks. It does look like what you describe, but I just can’t find where it happens. Here are 3 runs on CPU and 3 runs on GPU. The CPU behaves as I would expect: it learns a bit, has the same starting loss every run, and doesn’t wander off a stable path.

julia> m, argsout, resout = train2(tr_data, te_data, modelfct; cuda=false,...);
[ Info: Training on CPU
[ Info: Dataset: 1600 train and 400 test examples
[ Info: Model: 1185 trainable params. Chain(Dense(1, 64, relu), Dense(64, 16, relu), Dropout(0.5), Dense(16, 1), #185)
[ Info: Start Training . . .
Epoch: 0    Train: 1.1564   Test: 1.1672    
Epoch: 30   Train: 1.0754   Test: 1.08  
Epoch: 60   Train: 1.0749   Test: 1.0777    
Epoch: 90   Train: 1.0767   Test: 1.0828    
Epoch: 120  Train: 1.073    Test: 1.08  
Epoch: 150  Train: 1.0771   Test: 1.0833    

julia> m, argsout, resout = train2(tr_data, te_data, modelfct; cuda=false,...);
[ Info: Training on CPU
[ Info: Dataset: 1600 train and 400 test examples
[ Info: Model: 1185 trainable params. Chain(Dense(1, 64, relu), Dense(64, 16, relu), Dropout(0.5), Dense(16, 1), #185)
[ Info: Start Training . . .
Epoch: 0    Train: 1.1564   Test: 1.1672    
Epoch: 30   Train: 1.0754   Test: 1.08  
Epoch: 60   Train: 1.0749   Test: 1.0777    
Epoch: 90   Train: 1.0767   Test: 1.0828    
Epoch: 120  Train: 1.073    Test: 1.08  
Epoch: 150  Train: 1.0771   Test: 1.0833    

julia> m, argsout, resout = train2(tr_data, te_data, modelfct; cuda=false,...);
[ Info: Training on CPU
[ Info: Dataset: 1600 train and 400 test examples
[ Info: Model: 1185 trainable params. Chain(Dense(1, 64, relu), Dense(64, 16, relu), Dropout(0.5), Dense(16, 1), #185)
[ Info: Start Training . . .
Epoch: 0    Train: 1.1564   Test: 1.1672    
Epoch: 30   Train: 1.0754   Test: 1.08  
Epoch: 60   Train: 1.0749   Test: 1.0777    
Epoch: 90   Train: 1.0767   Test: 1.0828    
Epoch: 120  Train: 1.073    Test: 1.08  
Epoch: 150  Train: 1.0771   Test: 1.0833    

julia> m, argsout, resout = train2(tr_data, te_data, modelfct; cuda=true,...);
[ Info: Training on GPU
[ Info: Dataset: 1600 train and 400 test examples
[ Info: Model: 1185 trainable params. Chain(Dense(1, 64, relu), Dense(64, 16, relu), Dropout(0.5), Dense(16, 1), #185)
[ Info: Start Training . . .
Epoch: 0    Train: 1.1589   Test: 1.2147    
Epoch: 30   Train: 1.0768   Test: 1.0803    
Epoch: 60   Train: 41.8823  Test: 41.5863   
Epoch: 90   Train: 1.068    Test: 1.0709    
Epoch: 120  Train: 37.0247  Test: 82.1282   
Epoch: 150  Train: 44.2864  Test: 1.0749    

julia> m, argsout, resout = train2(tr_data, te_data, modelfct; cuda=true,...);
[ Info: Training on GPU
[ Info: Dataset: 1600 train and 400 test examples
[ Info: Model: 1185 trainable params. Chain(Dense(1, 64, relu), Dense(64, 16, relu), Dropout(0.5), Dense(16, 1), #185)
[ Info: Start Training . . .
Epoch: 0    Train: 34.6607  Test: 40.7039   
Epoch: 30   Train: 43.6453  Test: 1.0864    
Epoch: 60   Train: 12.0337  Test: 80.6082   
Epoch: 90   Train: 48.6058  Test: 45.9999   
Epoch: 120  Train: 22.2186  Test: 1.0751    
Epoch: 150  Train: 22.3228  Test: 138.1419  

julia> m, argsout, resout = train2(tr_data, te_data, modelfct; cuda=true,...);
[ Info: Training on GPU
[ Info: Dataset: 1600 train and 400 test examples
[ Info: Model: 1185 trainable params. Chain(Dense(1, 64, relu), Dense(64, 16, relu), Dropout(0.5), Dense(16, 1), #185)
[ Info: Start Training . . .
Epoch: 0    Train: 31.0082  Test: 50.5443   
Epoch: 30   Train: 45.2076  Test: 49.8443   
Epoch: 60   Train: 46.5327  Test: 1.0789    
Epoch: 90   Train: 55.8642  Test: 122.2016  
Epoch: 120  Train: 45.7922  Test: 46.6472   
Epoch: 150  Train: 86.8372  Test: 47.0398   

Hi, I would like to add a figure showing where I got to from yesterday’s situation.
It works “better”, but training on GPU is still very random. The code is exactly the same, the data too; I only set the device to gpu or cpu, and the losses look like the attached figure. I just don’t understand it.

Does anyone know what might be the problem? Thanks a lot!

You’ll have to post the inside of train2, since that’s where the mutation must be occurring. This seems like less of an ML issue and more of a programming error.


I think the problem was calling Flux.reset!() too many times inside the for loop over 1:epochs when training an RNN. I rewrote the code almost from scratch, inspired by mlp.jl from the model-zoo, and found that Flux.reset!() is only needed in the loss function and when I actually run the model. One more thing that might be helpful for someone: I put the model into a let ... end block, because I wanted to save the prediction whenever it improved:

# save the current best prediction whenever the model improves
if condition_that_model_improved
    let m1 = cpu(model)                         # copy of the model on the CPU
        update_best_yhat = m1(test_data_input)  # store the prediction on the test inputs
    end
end

and then continue training.
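Roughly, what I mean is something like this (a minimal sketch with made-up names, not my actual code): Flux.reset!() goes once inside the loss, before each sequence, and once before a standalone forward pass at evaluation time.

using Flux

rnn_model = Chain(RNN(1, 16), Dense(16, 1))

function seq_loss(m, xs, ys)
    Flux.reset!(m)   # clear the hidden state before each sequence
    sum(Flux.Losses.mse(m(x), y) for (x, y) in zip(xs, ys))
end

# ... and again before running the model on its own:
Flux.reset!(rnn_model)
# yhat = [rnn_model(x) for x in test_sequence]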

I am glad you figured it out. Since RNNs have hidden state, doing inference passes in the middle of training can cause issues. The cleanest solution here would probably be to not use Flux.train! (if you are) and to write your own training loop instead. This would let you keep a copy of each forward pass, which you can then store for future reference. For example,

# `m`, `opt`, `data`, `nepochs`, `loss`, and `improved` are placeholders for your own definitions
using Flux
using Flux.Optimise: update!

besty = nothing                  # best prediction so far
for epoch in 1:nepochs
  for (x, y) in data
    local ypred
    grad = gradient(params(m)) do
      ypred = m(x)               # keep a copy of the forward pass
      loss(ypred, y)             # return the loss so there is something to differentiate
    end
    update!(opt, params(m), grad)

    if besty === nothing || improved(ypred, besty)
      besty = ypred
    end
  end
end

@darsnack I have a similar problem, but I believe it is not related to GPU vs CPU. The problem appears when I wrap the training code inside a function, which actually prevents me from upgrading the code to a more dynamic custom training loop. When I do the latter, it always gives me the same result, but when I run it from the REPL it works fine. I have made a separate post for this issue. I don’t know if it would be better to just continue in this one… let me know @luboshanus if that’s OK.