When you start thinking about this carefully, it’s a real eye-opening experience. Did you know that gradient descent is utter nonsense?
Agree that it’s fascinating when you start thinking about it carefully. I’ll add a more technically correct statement for the point you are making: gradient descent depends on the choice of an inner product.
The intuition is as follows: In a gradient descent situation, suppose that
\begin{eqnarray}
\mathrm{loss} &=& 6 \\
\partial_x \mathrm{loss} &=& -3 \\
\partial_y \mathrm{loss} &=& -2
\end{eqnarray}
Now you’re wondering whether you should make 2 steps in the x direction, 3 steps in the y direction, or some combination of the two.
An inner product is what allows you to compare distances along different coordinate axes, so you can choose the combination that makes (the linear approximation to) the loss function vanish with the shortest step.
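To make that concrete, here is a minimal sketch, assuming a diagonal inner product \langle a, b \rangle = \sum_i w_i a_i b_i. The function name `shortest_zeroing_step` and the weights `(1.0, 0.01)` are made up for illustration. It computes the shortest step that zeroes the linearized loss and shows that the answer changes with the weights.

```python
loss = 6.0
partials = (-3.0, -2.0)  # (d loss / dx, d loss / dy) from the example above


def shortest_zeroing_step(loss, partials, weights):
    """Shortest step s, measured by <a, b> = sum(w_i * a_i * b_i),
    that satisfies loss + sum(partials_i * s_i) == 0."""
    # Gradient with respect to the weighted inner product.
    grad = [p / w for p, w in zip(partials, weights)]
    # Squared length of the gradient, <grad, grad>.
    grad_sq = sum(w * g * g for w, g in zip(weights, grad))
    # Step against the gradient, scaled so the linearized loss reaches zero.
    return [-loss * g / grad_sq for g in grad]


def linearized_loss(step):
    return loss + sum(p * s for p, s in zip(partials, step))


# Steps in x and y count equally.
step_equal = shortest_zeroing_step(loss, partials, (1.0, 1.0))
# Hypothetical weights: steps in y are "cheap" compared to steps in x.
step_cheap_y = shortest_zeroing_step(loss, partials, (1.0, 0.01))

print(step_equal, linearized_loss(step_equal))
print(step_cheap_y, linearized_loss(step_cheap_y))
```

With equal weights the step mixes both directions; as steps in y get cheaper, it approaches "3 steps in the y direction".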
I’ll illustrate that with an example: if x and y are orthogonal and x is measured in meters but y is measured in feet, then your inner product would look like
\langle a, b \rangle = a_x b_x + (0.3048)^2 a_y b_y
The gradient, \nabla \mathrm{loss}, is defined as the unique vector such that
\langle \nabla \mathrm{loss}, b \rangle = \partial_x \mathrm{loss} \cdot b_x + \partial_y \mathrm{loss} \cdot b_y
for every vector b. So the only solution is
\begin{eqnarray}
\nabla\mathrm{loss}_x &=& -3 \\
\nabla\mathrm{loss}_y &=& -2(0.3048)^{-2}
\end{eqnarray}
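That’s exactly what you would get if you first convert y into meters, compute the gradient “naively”, and then convert \nabla\mathrm{loss}_y back to feet: the naive partial derivative with respect to y in meters is -2(0.3048)^{-1}, and converting that component from meters back to feet divides by 0.3048 once more.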
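A quick numerical check of the meters-and-feet example, as a sketch in plain Python. The helper names `inner` and `gradient` are just illustrative, and the shortcut of dividing by the metric entries only works because this inner product is diagonal.

```python
FT_TO_M = 0.3048  # meters per foot

# Partial derivatives from the example: x in meters, y in feet.
partials = (-3.0, -2.0)

# Diagonal inner product <a, b> = a_x*b_x + 0.3048**2 * a_y*b_y.
metric_diag = (1.0, FT_TO_M ** 2)


def inner(a, b):
    return sum(m * ai * bi for m, ai, bi in zip(metric_diag, a, b))


def gradient(partials):
    # Solve <grad, b> = sum(partials_i * b_i) for every b.
    # For a diagonal metric this is a componentwise division.
    return tuple(p / m for p, m in zip(partials, metric_diag))


grad = gradient(partials)

# Alternative route: convert y to meters, differentiate naively, convert back.
dloss_dy_per_meter = partials[1] / FT_TO_M    # chain rule: y_m = 0.3048 * y_ft
grad_y_in_meters = dloss_dy_per_meter         # naive (Euclidean) gradient component
grad_y_in_feet = grad_y_in_meters / FT_TO_M   # express that length in feet

print(grad)
print(grad_y_in_feet)

# The defining property holds for an arbitrary test vector b.
b = (0.7, -1.9)
print(inner(grad, b), sum(p * bi for p, bi in zip(partials, b)))
```

Both routes give the same y component, and the last line prints the same number twice up to rounding.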
Btw, in this example with meters and feet, at least the type of unit is the same. In typical machine learning applications, you might be mixing features with different types of units that you can’t naturally add at all. In that case, you have to tell the computer how to compare e.g. “a step of x EUR in price” and “a step of y minutes of driving time”. That relative weighting is just a hyperparameter to tune. I think in practice people achieve this effect by pre-scaling the input. That’s also what @j-fu says:
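Here is a small sketch of why pre-scaling and choosing an inner product amount to the same thing. The reference scales, starting point and learning rate are all hypothetical numbers, chosen only for illustration. One plain gradient step on rescaled inputs lands at the same point as one step in the original units with the inner product \langle a, b \rangle = \sum_i a_i b_i / s_i^2.

```python
# Hypothetical reference scales: 100 EUR and 10 minutes each count as "one unit".
scales = (100.0, 10.0)
partials = (-3.0, -2.0)  # d loss / d price (per EUR), d loss / d time (per minute)
x = (250.0, 40.0)        # hypothetical current point (EUR, minutes)
lr = 0.1                 # hypothetical learning rate

# Route 1: rescale to unitless u = x / scale, take a plain gradient step, map back.
u = [xi / s for xi, s in zip(x, scales)]
partials_u = [p * s for p, s in zip(partials, scales)]  # chain rule: x = s * u
u_new = [ui - lr * pu for ui, pu in zip(u, partials_u)]
x_via_scaling = [ui * s for ui, s in zip(u_new, scales)]

# Route 2: stay in the original units and use the weighted inner product
# <a, b> = sum(a_i * b_i / scale_i**2) to define the gradient.
weights = [1.0 / s ** 2 for s in scales]
grad = [p / w for p, w in zip(partials, weights)]
x_via_metric = [xi - lr * g for xi, g in zip(x, grad)]

print(x_via_scaling)
print(x_via_metric)
```

So tuning the reference scales is the same knob as tuning the diagonal entries of the inner product.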
It seems that in fact before we hand over data to such an algorithm we kind of always implicitly strip them of their units by passing ratios wrt. reference quantities.