Although I’d like to pose my thoughts and questions about a fairly general class of optimization problems, they grew out of working on the following concrete problem:
The second lens surface is a 2D spline surface. The problem is to optimize the control points of this surface so as to achieve a desired illumination pattern on the detector screen. The illumination pattern is computed via differentiable ray tracing. So we have:
- A set of variables to optimize, given by the z coordinates of the control points
- An objective function, given by the difference between the current and the desired illumination pattern (a minimal sketch of this setup follows below)
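To make the setup concrete, here is a minimal sketch in PyTorch. Everything named in it is an assumption for illustration: `render_pattern` is a dummy differentiable stand-in for the actual ray tracer, `target` is a placeholder for the desired pattern, and the control-point and pixel counts are made up.

```python
import torch

torch.manual_seed(0)
n_ctrl = 32 * 32   # number of spline control points (illustrative value)
n_pix = 64 * 64    # detector pixels (illustrative value)

# Dummy stand-in for the differentiable ray tracer: any differentiable map
# from control-point z coordinates to an illumination pattern will do here.
# The real renderer replaces this; it exists only so the sketch runs.
A = torch.randn(n_pix, n_ctrl) / n_ctrl ** 0.5

def render_pattern(z):
    return torch.sigmoid(A @ z)

target = torch.rand(n_pix)   # desired illumination pattern (placeholder)

def objective(z):
    # difference between the current and the desired illumination pattern
    return torch.mean((render_pattern(z) - target) ** 2)
```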
This defines a smooth non-linear optimization problem, and thus an opportunity to apply one of the many (!) solvers for such problems. I’ve made the interesting observation that if you compare these two approaches:
- Solving the problem directly with the Adam optimizer;
- Defining a densely connected MLP with a few layers of the same width as the number of control points, feeding it a trivial input of 1, using the output as the control point z coordinates, and optimizing the MLP parameters with the Adam optimizer;
then the second approach performs much better, in the sense that it achieves smaller loss values (a sketch of both approaches follows below). If you squint a bit, given the ‘trivial’ use of the MLP, you could say this NN is just a parameter space transformation that is effectively part of the optimizer.
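Here is a sketch of the two approaches, reusing `objective` and `n_ctrl` from the snippet above; the layer count, widths, learning rate, and step count are illustrative rather than tuned values.

```python
import torch
from torch import nn

# Approach 1: optimize the control-point z coordinates directly with Adam.
z = torch.zeros(n_ctrl, requires_grad=True)
opt_direct = torch.optim.Adam([z], lr=1e-2)
for _ in range(2000):
    opt_direct.zero_grad()
    objective(z).backward()
    opt_direct.step()

# Approach 2: produce the z coordinates with a dense MLP fed a constant
# input of 1, and let Adam update the MLP weights instead of z directly.
mlp = nn.Sequential(
    nn.Linear(1, n_ctrl), nn.ReLU(),
    nn.Linear(n_ctrl, n_ctrl), nn.ReLU(),
    nn.Linear(n_ctrl, n_ctrl),
)
opt_mlp = torch.optim.Adam(mlp.parameters(), lr=1e-2)
one = torch.ones(1)   # the ‘trivial’ input

for _ in range(2000):
    opt_mlp.zero_grad()
    objective(mlp(one)).backward()   # z = mlp(one)
    opt_mlp.step()
```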
An obvious difference between these approaches is that the second one optimizes over a higher-dimensional parameter space than the first. But my intuition here is that the second approach ‘sees more opportunities for loss reduction’: a single parameter in a hidden layer of the MLP affects all control points to a greater or lesser extent. So maybe it is helpful to think of each node in a hidden layer as representing a direction in the control point z coordinate space, determined by the parameter values in the following layers, whereas in the plain Adam optimizer approach each parameter corresponds to just one coordinate direction?
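One way to make that intuition slightly more precise, as a rough first-order sketch rather than a real analysis: write the reparameterization as $z = f(\theta)$ with Jacobian $J = \partial z / \partial \theta$. For plain gradient descent with step size $\eta$ on $\theta$, the induced change in the control points is, to first order,

$$
\Delta z \;\approx\; J\,\Delta\theta \;=\; -\eta\, J J^{\top} \nabla_z L,
$$

so the MLP effectively applies a parameter-dependent preconditioner $J J^{\top}$ to the z-space gradient: each hidden weight couples many control points at once, whereas direct optimization moves each z coordinate only along its own axis. Adam replaces the single step size with per-parameter scaling (in $\theta$-space for the MLP, in z-space for the direct approach), but the coupling through $J$ is the same.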
There’s also the question of what kind of cleverness different optimizers employ to find the ‘best’ step direction in parameter space, and how that interacts with the above.
I’d love to hear your thoughts!