Hi,
I want to understand the meaning of Δ used in different methods (i.e. for different optimizers) in Flux.
For example: in Descent, the meaning of Δ seems to be:
[∂f/∂θᵢ] i=1,2,3,…
In AdaGrad, however, the update takes time into account as well (if I understood AdaGrad correctly), so Δ would mathematically be a diagonal matrix whose ith element is:

, assuming my understanding is correct. How can Δ be given such different (types of) values? It is part of the API and is usually passed through the update! function, which eventually performs x -= Δ. I can't wrap my head around how that operation (even when vectorised) can be a valid one.
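
To make the question concrete, here is a minimal sketch of one way the subtraction could stay valid for both optimizers. It assumes that Δ always enters as the plain gradient (an array shaped like x), that each optimizer rescales it in place, and that a diagonal matrix is stored as an array of its diagonal entries. The names SketchDescent, SketchAdaGrad, sketch_apply! and sketch_update! are hypothetical and are not Flux's API, and the 1e-8 constant is an arbitrary choice to avoid division by zero.

```julia
# Hypothetical stand-ins for illustration only; not Flux's actual source.
struct SketchDescent
    eta::Float64                 # learning rate
end

struct SketchAdaGrad
    eta::Float64                 # learning rate
    acc::IdDict{Any,Any}         # per-parameter running sum of squared gradients
end
SketchAdaGrad(eta) = SketchAdaGrad(eta, IdDict{Any,Any}())

# Descent: the step is just the gradient scaled by the learning rate.
function sketch_apply!(o::SketchDescent, x, Δ)
    Δ .*= o.eta
    return Δ
end

# AdaGrad: the diagonal matrix is kept as an array with the same shape as x,
# so multiplying by it reduces to an elementwise operation.
function sketch_apply!(o::SketchAdaGrad, x, Δ)
    acc = get!(() -> zeros(size(x)), o.acc, x)   # state is created on first use
    @. acc += Δ^2                                # accumulate squared gradients
    @. Δ *= o.eta / (sqrt(acc) + 1e-8)           # rescale each gradient entry
    return Δ
end

# The same subtraction works for both, because Δ keeps the shape of x.
function sketch_update!(opt, x, Δ)
    x .-= sketch_apply!(opt, x, Δ)
    return x
end

g = [0.5, -1.0]                                  # example gradient

x = [1.0, 2.0]
sketch_update!(SketchDescent(0.1), x, copy(g))

y = [1.0, 2.0]
sketch_update!(SketchAdaGrad(0.1), y, copy(g))
```

Under these assumptions Δ never changes type: it is the gradient on the way in and the rescaled step on the way out, and the time dependence lives in the optimizer's stored state rather than in Δ itself.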