BSON error when loading Flux model

The command

m = BSON.load("model.bson", @__MODULE__)[:m]

is throwing

ERROR: MethodError: Cannot `convert` an object of type CuContext to an object of type CuPtr{Nothing}
Closest candidates are:
  convert(::Type{CuPtr{T}}, ::CUDA.Mem.UnifiedBuffer) where T at /scratch/drozda/.julia/packages/CUDA/9T5Sq/lib/cudadrv/memory.jl:234
  convert(::Type{CuPtr{T}}, ::CUDA.Mem.HostBuffer) where T at /scratch/drozda/.julia/packages/CUDA/9T5Sq/lib/cudadrv/memory.jl:129
  convert(::Type{CuPtr{T}}, ::CUDA.Mem.DeviceBuffer) where T at /scratch/drozda/.julia/packages/CUDA/9T5Sq/lib/cudadrv/memory.jl:59
  ...

Does anyone know what it could be?

Does the machine this is running on have a CUDA-supported GPU? Otherwise the note in Saving & Loading · Flux may apply.


Thanks for the reply @ToucheSir.
I do offload the model back to the CPU before saving

m = model |> cpu
@save "model.bson" m opt

Ah, but opt is not offloaded. Can you get away with not saving the optimizer state as well?

From what I’ve seen in the Flux model zoo, you don’t need to offload the optimizer to any device. Instantiating opt = ADAM(), for instance, will work on both CPU and GPU.
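Following that suggestion, a minimal sketch of saving only the model and re-creating the optimizer fresh on load (rather than serializing opt) could look like this — the filename and variable names are just the ones from the thread:

```julia
using Flux, BSON

model = Dense(2, 2) |> gpu  # stand-in for the trained model

# Move parameters back to the CPU so no CUDA objects end up in the file.
m = model |> cpu
BSON.@save "model.bson" m   # save only the model, not the optimizer

# Later, when resuming or doing inference:
m = BSON.load("model.bson", @__MODULE__)[:m]
opt = ADAM()                # fresh optimizer; state is rebuilt during training
```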

What’s strange to me is that the reported issue appears only for models saved some time ago.

If I train a new model, save it (with the optimizer) and load it for inference, the issue isn’t raised.

Yes, but if you offload the model itself and load it back in, the references in the optimizer will no longer point to the same thing. So no error is raised, but all the optimizer state has been silently invalidated. This is a big footgun and something we haven’t been able to address until very recently, since it’s a fundamental issue with the design of the current optimizer interface.
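To illustrate the invalidation: Flux’s (implicit-style) optimizers keep their state in an IdDict keyed by object identity of the parameter arrays, so any round-trip that produces new arrays orphans the state. This is a hedged sketch — the state field and apply! call match recent Flux versions but may differ in older ones, and deepcopy stands in for a save/load round-trip:

```julia
using Flux

m = Dense(2, 2)
opt = ADAM()

# Populate optimizer state for this particular weight array.
x = m.weight
Flux.Optimise.apply!(opt, x, zero(x))
haskey(opt.state, x)          # state is keyed by this exact array

m2 = deepcopy(m)              # stands in for saving and loading the model
haskey(opt.state, m2.weight)  # false: the new arrays are not in the IdDict
```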

That’s interesting, I assume you didn’t offload those older models before saving? If so, I wonder if internal changes in CUDA.jl might be causing the errors then.

I do offload these older models back to the CPU before saving.

Could modules changing over time lead to such an issue?
I’m using the @__MODULE__ option when loading,
but if the current version of a module differs from the one used when saving,
issues could appear, I suppose.

Did you save anything other than the model? Optimizer state could be a problem as mentioned above.

If you’re willing/able to provide one of these troublesome BSON files, I could take a look. Another idea would be reading them back in with https://github.com/ancapdev/LightBSON.jl and trying to recover the data manually. Even printing out a tree of all the type tags in the file could help identify what is holding onto CUDA state.