I’m trying to understand the optimization code for momentum which reads
```julia
function apply!(o::Momentum, x, Δ)
  η, ρ = o.eta, o.rho
  v = get!(o.velocity, x, zero(x))::typeof(x)
  @. v = ρ * v - η * Δ
  @. Δ = -v
end
```
I don’t understand the line involving `get!`. This idiom is used in all the other (stateful) optimizers. Can you please explain what this whole `IdDict()` thing is about?
`o.velocity` contains a velocity array for each parameter array in the model being optimized. One way to do that would be to create another instance of the model type that contains velocities instead of parameters, but a lighter-weight option is to use an `IdDict`, a data structure that allows using the object identities (essentially, “memory locations”) of arbitrary Julia values as keys.
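To make that concrete, here is a minimal self-contained sketch of such a stateful optimizer (modeled on the snippet above; the constructor default values are an assumption, not necessarily Flux’s):

```julia
# Sketch of a Momentum optimizer whose per-parameter velocity state
# lives in an IdDict keyed by the parameter arrays themselves.
struct Momentum
  eta::Float64
  rho::Float64
  velocity::IdDict
end
Momentum(η = 0.01, ρ = 0.9) = Momentum(η, ρ, IdDict())

function apply!(o::Momentum, x, Δ)
  η, ρ = o.eta, o.rho
  v = get!(o.velocity, x, zero(x))::typeof(x)  # fetch-or-create this x's velocity
  @. v = ρ * v - η * Δ                         # update the velocity in place
  @. Δ = -v                                    # overwrite Δ with the actual step
end

opt = Momentum()
x = [1.0, 2.0]
Δ = [0.1, 0.1]
apply!(opt, x, Δ)   # first step: velocity starts at zero, so Δ becomes η .* gradient
@show Δ             # [0.001, 0.001]
```

Because the `IdDict` is keyed by `x` itself, calling `apply!` again with the same array picks up the velocity accumulated so far.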
To be more concrete, if you used a normal `Dict`, then looking up the velocity array that corresponds to a parameter array would require hashing the entire parameter array, when just hashing the parameter array’s location in memory is enough (and in fact we want two parameter arrays that happen to have the same values in them to still have independent velocity arrays).
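A small sketch of the difference (not Flux code, just `Base.Dict` versus `Base.IdDict`):

```julia
a = [1.0, 2.0]
b = [1.0, 2.0]           # equal values, but a distinct object

d = Dict(a => "velocity of a")
d[b] = "velocity of b"   # isequal(a, b) holds, so this OVERWRITES a's entry
@show length(d)          # 1

id = IdDict(a => "velocity of a")
id[b] = "velocity of b"  # a !== b, so each array keeps its own entry
@show length(id)         # 2
```

With a plain `Dict`, two parameter arrays that happen to hold the same values would share one velocity slot, which is exactly the behavior we don’t want.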
The `get!` function looks up a key in any dict-like object and, if the key isn’t present, creates an entry with that key and a given initial value (in this case `zero(x)`); either way, it then returns the value for that key.
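The idiom in isolation looks like this (a minimal sketch, not the optimizer itself):

```julia
velocity = IdDict()
x = [1.0, 2.0, 3.0]

v1 = get!(velocity, x, zero(x))  # key absent: inserts zero(x) and returns it
v1 .= 0.5                        # mutate the stored state in place

v2 = get!(velocity, x, zero(x))  # key present: returns the SAME array, ignoring the default
@show v1 === v2                  # true
```

So each call to `apply!` transparently retrieves (or lazily initializes) the velocity state belonging to that particular parameter array.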
For a possible answer see stackoverflow.
`ObjectIdDict` is the old name for `IdDict`.
Initially I wondered why you did not store a vector of velocity vectors, one entry per parameter. But then I realized this won’t work if the parameters have different shapes, so that’s why you use a dict? Why not just a single massive “flattened” vector of velocities for all the params combined? Is it because that would require packing and unpacking the gradient between flat and structured formats?
Basically that, and also because there’s no canonical order for the parameters in a network.
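For illustration, here is a hypothetical sketch of what that flattened alternative would entail: every update would have to pack the differently-shaped gradients into one flat vector and unpack the result afterwards (none of this is Flux code):

```julia
params = [randn(2, 3), randn(4)]    # parameters of different shapes

# pack: concatenate the flattened parameters into one vector
flat = reduce(vcat, vec.(params))
@show length(flat)                  # 6 + 4 = 10

# unpack: slice the flat vector back into the original shapes
offsets = cumsum([0; length.(params)])
unpacked = [reshape(flat[offsets[i]+1:offsets[i+1]], size(params[i]))
            for i in eachindex(params)]
@show all(unpacked .== params)      # true
```

And even this sketch has to pick *some* ordering of the parameters to define the layout of `flat`, which is the “no canonical order” problem above.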
FWIW, it is a perfectly viable design to do the optimization with immutable arrays and structural identity (as opposed to object identity as Flux does by default) instead. We do that on TPU because it allows us to fuse both the model and the optimizer into one big function to offload to the TPU.