Deepcopy Flux Model

Hi,

I am struggling with warm-starting Flux models. My idea is to deepcopy both the model and the optimizer. However, if I do that, the training error goes up significantly when retraining. If I use BSON.@save/@load it works as expected.

So my questions are:

  • Are Flux models / optimizers not deep-copyable?
  • Is there any other way to achieve an in-memory copy of the model?

In the end I would actually like to have a single model and just load/store the parameters. But that does not work either. It seems that the model somehow has some global state that prevents having independent copies.

Thanks, Tobi

Without an MWE it is difficult to pinpoint the exact issue, but one thing that would not work with deepcopy is that the stateful optimizers use an IdDict to map parameters to state, and identity is by definition not preserved by deepcopy (i.e. deepcopy(a) !== a). I believe that BSON.@save "modelopt.bson" model opt will preserve identities when model and opt are loaded back.
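To make the identity point concrete, here is a minimal sketch in plain Julia (no Flux needed), where W stands in for a parameter array and the IdDict stands in for the optimizer state:

W = rand(2, 2)
state = IdDict(W => zeros(2, 2))   # stands in for opt.state

Wcopy = deepcopy(W)
Wcopy == W             # true: same values
Wcopy === W            # false: a different object
haskey(state, Wcopy)   # false: the copied parameter has no state entry, so training restarts "cold"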

For the second part of the question about storing parameters, Flux has the function loadparams!, which loads parameters into an existing model.
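For example, a minimal sketch of the save/load-parameters workflow (make_model here is just a placeholder for whatever rebuilds your architecture; Flux.loadparams! copies the saved values into the existing parameter arrays):

using Flux, BSON

weights = collect(params(model))   # snapshot the parameter arrays
BSON.@save "weights.bson" weights

BSON.@load "weights.bson" weights
model2 = make_model()              # rebuild the same architecture
Flux.loadparams!(model2, weights)  # copy the saved values into model2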

I don’t think Flux has a way to load optimizer state in the same way, though. You can try to do so manually, matching parameters by equality (==) instead of identity (===). I think that if you do BSON.@save "paropt.bson" pars opt where pars = params(model), you’ll get the identity preservation when loading, so if that works for you I suggest doing it that way.

Note that BSON might not guarantee long-term compatibility, so if you make changes to your project there is a risk that an old model won’t load. I haven’t used it in a while, so I might be wrong about this, though.

Thanks. An MWE would of course be better, but your answer already indicates that deepcopy will not work. I know about loadparams!, but since the optimizer does not seem to be deep-copyable, that does not help.

My goal is actually rather simple: I want to duplicate the model/optimizer (everything that belongs together). BSON is not ideal for me since loading the network takes a lot of time and the first train! call is very slow (> 1 minute), presumably because of the gradient calculation. My hope is that there is some in-memory way to achieve this more efficiently.

Ah, then I misunderstood you. I believe you can do modelcopy, optcopy = deepcopy((model, opt)). deepcopy is a bit of a problematic function, as its semantics tend to be not well defined. For instance, the example I gave might seem to contradict what one would think a deep copy is, and maybe it will not work for deeper objects (I only tried with a Vector and an IdDict which has the vector as a key).
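Something along the lines of that Vector/IdDict test, showing that sharing is preserved within a single deepcopy call:

v = [1.0, 2.0]
d = IdDict(v => "state")

v2, d2 = deepcopy((v, d))   # copy both as one object graph
haskey(d2, v2)              # true: the copied dict is keyed by the copied vector
haskey(d2, v)               # false: the copy no longer references the original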

If deepcopy does not work for you, I think that something like modelcopy, optcopy = Flux.fmap(copy, (model, opt)) should work.

Ok, thanks, I will try this out. For my understanding: is the issue that the optimizer holds references into the model? Otherwise I don’t understand what is going on here. Furthermore, it’s interesting that the deepcopy approach using the tuple is able to transfer these references to the copy.

Yes, the optimizer (if stateful, e.g. Momentum or ADAM) holds references to the parameter arrays of the model. That is how it knows what the state is for each parameter. It uses an IdDict to do this. If you use a stateless optimizer (e.g. Descent) then this problem should not exist.
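A rough sketch of what that looks like, assuming the implicit-params API used in this thread, where ADAM keeps its state in opt.state::IdDict:

using Flux

model = Dense(2, 2)
opt = ADAM()
data = [(rand(2, 4), rand(2, 4))]
Flux.train!((x, y) -> sum(abs2, model(x) .- y), params(model), data, opt)

# after one update the state is keyed by the parameter arrays themselves
all(p -> haskey(opt.state, p), params(model))       # true
haskey(opt.state, deepcopy(first(params(model))))   # false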

Furthermore, it’s interesting that the deepcopy approach using the tuple is able to transfer these references to the copy

Yes, this was mildly surprising to me as well. I guess it has to do with deepcopy being (or trying to be?) equivalent to xcopy = deserialize(serialize(x)), and the latter would be less useful if it didn’t behave like this.
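For what it’s worth, a serialization round trip does preserve sharing in the same way (serialize needs an IO in practice):

using Serialization

v = [1.0, 2.0]
d = IdDict(v => "state")

io = IOBuffer()
serialize(io, (v, d))
seekstart(io)
v2, d2 = deserialize(io)
haskey(d2, v2)   # true: sharing within one serialized object graph survives the round trip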

Fwiw, there has been some effort to make the optimizers stateless (i.e. not have an IdDict) and instead be explicit about the state (similar to how Julia’s iterator protocol works), but I don’t know how actively it is being pursued.

I have now moved my code to the GPU, and now even the BSON version does not work anymore. My code roughly looks like this:

model = make_model() |> gpu
opt = ADAM()

# do the training

model = model |> cpu

BSON.@save "model.bson" model opt

#### now let's get it back

BSON.@load "model.bson" model opt

# if I now train again, the first iterations are very far off.

I know that the line where I move the model back to the CPU might be problematic, but I don’t see another way of doing it. Any ideas?

The gpu function does basically the same thing as deepcopy (but in a more manual, structured way), so you’ll run into the same type of problems: the line model = model |> cpu makes a new deep copy of the model with all CuArrays replaced by Arrays, but the optimizer still has the original CuArrays as keys.

I don’t have a Flux install at hand, but try the same deepcopy trick with the cpu/gpu functions:

model, opt = cpu((model, opt))
# Deepcopy or save
model, opt = gpu((model, opt))

If it works, consider submitting an issue or PR to Flux to add this to the GPU support section of the docs.


Unfortunately this doesn’t work because cpu/gpu don’t recurse into Dicts, but it led me to something that should :)

ps_gpu = params(model)
model = cpu(model)

# you could also create a new ADAM() here
opt.state = IdDict(pc => cpu(opt.state[pg]) for (pc, pg) in zip(params(model), ps_gpu))

BSON.@save "model.bson" model opt

#### now let's get it back

BSON.@load "model.bson" model opt

ps_cpu = params(model)
model = gpu(model)

# you could also create a new ADAM() here
opt.state = IdDict(pg => gpu(opt.state[pc]) for (pc, pg) in zip(ps_cpu, params(model)))

This can be pulled out into a function:

function load_opt_state!(opt::ADAM, ps_dest, ps_src; transform=identity)
  opt.state = IdDict(p_dest => transform(opt.state[p_src]) for (p_dest, p_src) in zip(ps_dest, ps_src))
end

# example usage
load_opt_state!(opt, ps_cpu, ps_gpu, transform=cpu)