Please can you explain the use of get! and IdDict in optimisers.jl

murphyk · February 11, 2019, 11:13pm

I’m trying to understand the optimization code for momentum which reads

function apply!(o::Momentum, x, Δ)
  η, ρ = o.eta, o.rho
  v = get!(o.velocity, x, zero(x))::typeof(x)
  @. v = ρ * v - η * Δ
  @. Δ = -v
end

I don’t understand the line involving get!. This idiom is used in all the other (stateful) optimizers. Please can you explain what this whole IdDict() thing is about.

jekbradbury · February 12, 2019, 12:18am

The object o.velocity contains a velocity array for each parameter array in the model being optimized. One way to do that would be to create another instance of the model type that contains velocities instead of parameters, but a lighter-weight option is to use an IdDict, a data structure that allows using the object identities (essentially, “memory locations”) of arbitrary Julia values as keys.

To be more concrete, if you used a normal Dict, then looking up the velocity array that corresponds to a parameter array would require hashing the entire contents of the parameter array, when hashing just the parameter array’s location in memory is enough. In fact, we want two parameter arrays that happen to hold the same values to still have independent velocity arrays, which a content-keyed Dict would not give us.
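A minimal sketch of that difference, using only Base Julia. The arrays `a` and `b` and the string values are made up for illustration; they stand in for two parameter arrays that happen to hold equal values.

```julia
a = [1.0, 2.0]
b = [1.0, 2.0]        # equal contents, but a distinct object

@assert a == b        # same values
@assert !(a === b)    # different identities

d = Dict{Any,Any}()
d[a] = "state for a"
d[b] = "state for b"  # Dict compares keys by contents, so this overwrites the entry for a
@assert length(d) == 1

id = IdDict{Any,Any}()
id[a] = "state for a"
id[b] = "state for b" # IdDict compares keys with ===, so this is a second entry
@assert length(id) == 2

a[1] = 100.0          # mutating the key does not change its identity
@assert id[a] == "state for a"
```

The last two lines matter for an optimiser: parameters change on every step, and an identity-keyed lookup keeps finding the same entry regardless.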

The get! function looks up a key in any dict-like object and, if the key isn’t present, creates an entry with that key and a given initial value (in this case zero(x)); either way it then returns the value for the key.
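As a rough illustration of that behaviour (the names `velocity`, `x`, `v1`, `v2`, `v3` are hypothetical, chosen to mirror the momentum code above):

```julia
velocity = IdDict()                 # same as IdDict{Any,Any}()
x = [1.0, 2.0, 3.0]

v1 = get!(velocity, x, zero(x))     # key absent: stores zero(x) and returns it
@assert v1 == [0.0, 0.0, 0.0]

v1 .= [0.1, 0.2, 0.3]               # in-place update mutates the stored array

v2 = get!(velocity, x, zero(x))     # key present: returns the array stored earlier
@assert v2 === v1
@assert v2 == [0.1, 0.2, 0.3]

# The function form only evaluates the default when the key is missing,
# which avoids allocating a throwaway zero(x) on every call.
v3 = get!(velocity, x) do
  zero(x)
end
@assert v3 === v1
@assert length(velocity) == 1
```

Because get! returns the stored array itself rather than a copy, the in-place `@. v = ...` in apply! updates the optimiser's state directly. The values of an untyped IdDict() are typed as Any, which is presumably why the original line adds the `::typeof(x)` assertion.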

jandehaan · February 12, 2019, 12:25am

For a possible answer, see Stack Overflow. ObjectIdDict is the old name for IdDict.

murphyk · February 12, 2019, 12:35am

Initially I wondered why you did not store a vector of velocity vectors, one entry for every parameter. But then I realized this won’t work if the parameters have different shapes. So that’s why you use a dict? Why not just a single massive “flattened” vector of velocities for all the params combined? Is it because that would require packing and unpacking the gradient into flat and structured format?

jekbradbury · February 12, 2019, 12:42am

Basically that, and also because there’s no canonical order for the parameters in a network.
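A small self-contained sketch of that point. `ToyMomentum` and `toy_apply!` are hypothetical stand-ins, not Flux's own types; the body of `toy_apply!` copies the apply! method quoted at the top of the thread. Parameters of different shapes each get their own velocity array, with no flattening and no ordering needed.

```julia
# Hypothetical stand-in for the Momentum optimiser discussed above.
struct ToyMomentum
  eta::Float64
  rho::Float64
  velocity::IdDict
end
ToyMomentum(eta, rho) = ToyMomentum(eta, rho, IdDict())

# Same body as the quoted apply!: returns the rescaled Δ;
# applying it to x is left to the caller.
function toy_apply!(o::ToyMomentum, x, Δ)
  η, ρ = o.eta, o.rho
  v = get!(o.velocity, x, zero(x))::typeof(x)
  @. v = ρ * v - η * Δ
  @. Δ = -v
end

W = [1.0 2.0; 3.0 4.0]   # a 2×2 "weight" parameter
b = [0.5, 0.5]           # a length-2 "bias" parameter
opt = ToyMomentum(0.1, 0.9)

step1 = toy_apply!(opt, W, ones(2, 2))   # first call creates the velocity for W
toy_apply!(opt, b, ones(2))              # separate velocity, different shape
step2 = toy_apply!(opt, W, ones(2, 2))   # second call reuses W's velocity

@assert length(opt.velocity) == 2
@assert size(opt.velocity[W]) == (2, 2)
@assert size(opt.velocity[b]) == (2,)
@assert all(step2 .≈ 0.19)               # 0.9 * 0.1 + 0.1 * 1, momentum has built up
```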

Keno · February 12, 2019, 12:51am

FWIW, it is a perfectly viable design to do the optimization with immutable arrays and structural identity (as opposed to object identity as Flux does by default) instead. We do that on TPU because it allows us to fuse both the model and the optimizer into one big function to offload to the TPU.
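For contrast, here is a rough sketch of what a structural-identity design could look like. It is an illustration under stated assumptions, not the TPU code mentioned above: `momentum_step`, `params`, `state`, and `grads` are hypothetical names, and tuples stand in for immutable arrays. The optimiser state mirrors the layout of the parameters, so a velocity is matched to its parameter by field name rather than by object identity.

```julia
# Pure momentum step: returns new values instead of mutating its inputs.
function momentum_step(x, v, Δ; η = 0.1, ρ = 0.9)
  v_new = @. ρ * v - η * Δ
  x_new = @. x + v_new
  return x_new, v_new
end

params = (W = (1.0, 2.0), b = (0.5,))    # tuples are immutable
state  = map(p -> zero.(p), params)      # same field names, zero velocities
grads  = (W = (1.0, 1.0), b = (1.0,))

# map over NamedTuples pairs up fields with the same name.
result     = map(momentum_step, params, state, grads)
new_params = map(first, result)
new_state  = map(last, result)

@assert keys(new_state) == (:W, :b)
@assert new_state.W == (-0.1, -0.1)
@assert new_params.W[1] ≈ 0.9
@assert new_params.b[1] ≈ 0.4
```

Since nothing is mutated and no dictionary is consulted, the whole update is an ordinary function of its inputs, which is the property that makes it possible to fuse with the model.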
