Smooth delta functions to compute Jacobian and Hessian from samples

Hello,

Given an expensive function \boldsymbol{G} : \mathbb{R}^n \to \mathbb{R}^m whose value is known at samples \boldsymbol{x}^1,\ldots, \boldsymbol{x}^N, I would like to compute the Jacobian and Hessian at each sample, or at least at the mean of the samples \overline{\boldsymbol{x}}, without new evaluations of \boldsymbol{G}.

The package https://github.com/Mattriks/CoupledFields.jl might be a solution. @Mattriks, can you give me more background on the techniques used to construct these gradients?

I would like to get your feedback on the following technique that relies on smooth delta functions:

Let's start with the continuous form:

\boldsymbol{G}(\overline{\boldsymbol{x}}) = \int_{\boldsymbol{x} \in \mathbb{R}^n} \boldsymbol{G}(\boldsymbol{x}) \delta(\overline{\boldsymbol{x}} - \boldsymbol{x}) \mathrm{d}\boldsymbol{x} \mbox{ with } \delta \mbox{ the Dirac delta function.}

Thus we can get the Jacobian at \overline{\boldsymbol{x}} as:

\partial_{\overline{\boldsymbol{x}}_i} G(\overline{\boldsymbol{x}}) = \int_{\boldsymbol{x} \in \mathbb{R}^n} \boldsymbol{G}(\boldsymbol{x}) \delta_i^{(1)}(\overline{\boldsymbol{x}} - \boldsymbol{x}) \mathrm{d}\boldsymbol{x} \mbox{ with }\delta^{(1)}_i(\boldsymbol{z}) = \frac{\partial \delta(\boldsymbol{z})}{\partial z_i}.

We can replace the Dirac delta function by a smoothed version \delta_h with compact support.

The function \delta(\boldsymbol{z}) can be factorized as a product of one-dimensional delta functions, \delta(\boldsymbol{z}) = \prod_{i=1}^{n} \delta(z_i) (this also holds for \delta_{h}). Therefore we can easily construct the derivative kernels \delta^{(1)}_i(\boldsymbol{z}) = \partial \delta(\boldsymbol{z})/\partial z_i and \delta^{(2)}_{ij}(\boldsymbol{z}) = \partial^2 \delta(\boldsymbol{z})/\partial z_i\partial z_j.
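
Spelled out via the product rule, the derivative kernels are (the same expressions hold with \delta_h in place of \delta):

\delta^{(1)}_i(\boldsymbol{z}) = \delta'(z_i) \prod_{k \neq i} \delta(z_k), \qquad \delta^{(2)}_{ij}(\boldsymbol{z}) = \begin{cases} \delta''(z_i) \prod_{k \neq i} \delta(z_k) & \text{if } i = j,\\ \delta'(z_i)\, \delta'(z_j) \prod_{k \neq i,j} \delta(z_k) & \text{if } i \neq j. \end{cases}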

For our discrete samples, we get:

\partial_{\overline{\boldsymbol{x}}_i} G(\overline{\boldsymbol{x}}) \approx \sum_{k=1}^N \boldsymbol{G}(\boldsymbol{x}^k) \delta_i^{(1)}(\overline{\boldsymbol{x}} - \boldsymbol{x}^k)

where h is set according to the spread of the samples about \overline{\boldsymbol{x}}.
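
To make this concrete, here is a rough sketch of the estimator I have in mind, in Julia; the Gaussian choice of \delta_h, the per-dimension bandwidths, and the normalization by the summed kernel weights are all choices/assumptions on my part:

```julia
using Statistics  # std

# Sketch: Jacobian estimate at the sample mean from smoothed-delta derivative kernels.
# xs[k] is the k-th input sample (length n), ys[k] = G(xs[k]) (length m).
function delta_jacobian(xs::Vector{<:AbstractVector}, ys::Vector{<:AbstractVector})
    N, n, m = length(xs), length(first(xs)), length(first(ys))
    xbar = sum(xs) / N
    h = [std(getindex.(xs, i)) for i in 1:n]               # bandwidths from the sample spread
    dh(z, hi)  = exp(-0.5 * (z / hi)^2) / (hi * sqrt(2π))   # smoothed 1-D delta
    dh1(z, hi) = -(z / hi^2) * dh(z, hi)                    # its derivative
    J, W = zeros(m, n), 0.0
    for k in 1:N
        z = xbar .- xs[k]
        w = [dh(z[i], h[i]) for i in 1:n]                   # the n one-dimensional factors
        W += prod(w)
        for i in 1:n
            # derivative kernel: the i-th factor is replaced by dh1
            wi = dh1(z[i], h[i]) * prod((w[j] for j in 1:n if j != i); init = 1.0)
            J[:, i] .+= wi .* ys[k]
        end
    end
    return J / W   # crude normalization by the summed kernel weights (assumption)
end
```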

The usual thing to do would be to construct an interpolant, e.g. with Chebyshev polynomials. Once you have a smooth interpolant from your sample points, you can compute any derivatives you want.

(There are several packages that can do Chebyshev interpolation for you in Julia, e.g. ApproxFun and ChebyshevApprox.)
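
For example, a one-dimensional sketch with ApproxFun (the sampled function here is just a cheap stand-in):

```julia
using ApproxFun

# Build a Chebyshev interpolant on [-1, 1]; for an expensive G one would
# evaluate it at the Chebyshev points chosen by the constructor.
g = Fun(x -> exp(-x^2) * sin(3x), -1..1)

dg  = g'      # derivative of the interpolant, itself a Fun
d2g = g''     # second derivative

dg(0.2), d2g(0.2)
```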

Two key questions: what is the dimension n of your input, and can you choose the sample points? Interpolation is easiest if n is moderate (\lesssim 5) and you can choose the sample points (e.g. Chebyshev nodes).

Another possible interpolant, which is vaguely similar to your "smooth delta functions" suggestion, is to use radial basis functions. (Approximate delta functions by themselves won't work well… that direction leads towards finite-difference approximation, which is what an approximate \partial\delta/\partial x essentially is.)
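
A bare-bones sketch of an RBF fit with an analytic Jacobian (Gaussian kernel; the shape parameter and ridge term here are arbitrary choices):

```julia
using LinearAlgebra

# Gaussian RBF interpolant of samples: X is n×N (columns are samples),
# Y is m×N (columns are outputs), with a small ridge term for stability.
struct RBF
    X::Matrix{Float64}   # centers (the samples themselves)
    C::Matrix{Float64}   # coefficients, m×N
    ε::Float64           # shape parameter
end

function fit_rbf(X, Y; ε = 1.0, ridge = 1e-8)
    N = size(X, 2)
    Φ = [exp(-ε^2 * norm(X[:, i] - X[:, j])^2) for i in 1:N, j in 1:N]
    RBF(X, Y / (Φ + ridge * I), ε)    # solve C * Φ ≈ Y
end

# Analytic Jacobian: d/dx exp(-ε² |x - c|²) = -2ε² (x - c) exp(-ε² |x - c|²)
function rbf_jacobian(f::RBF, x)
    J = zeros(size(f.C, 1), length(x))
    for j in 1:size(f.X, 2)
        d = x - f.X[:, j]
        φ = exp(-f.ε^2 * dot(d, d))
        J .+= f.C[:, j] * (-2 * f.ε^2 * φ .* d)'
    end
    return J
end
```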


Thank you for your answer,

The input dimension n can vary from 50 to 100, the output dimension m is 50, and we have 50-200 samples. Unfortunately, I can't choose the sample points. They are samples from a prior distribution; a Gaussian approximation of that distribution is reasonable.

In that case, I would think about adjoint methods (automated à la Zygote, or by hand) to compute derivatives of G directly. Accurate interpolation in 50 dimensions is hard unless the function is very nicely behaved (effectively low-dimensional), and taking derivatives tends to amplify interpolation errors…
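
If G is written in (differentiable) Julia code, that can be as little as (G here is a cheap stand-in):

```julia
using Zygote, ForwardDiff

# Cheap stand-in for the expensive map G : ℝⁿ → ℝᵐ
G(x) = [sum(abs2, x), prod(sin, x), x[1] * x[end]]

x0 = randn(5)

J_rev = Zygote.jacobian(G, x0)[1]       # reverse mode: one sweep per output
J_fwd = ForwardDiff.jacobian(G, x0)     # forward mode: one sweep per input
H1    = ForwardDiff.hessian(x -> G(x)[1], x0)  # Hessian of one output component
```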


I forgot to mention that the samples have a relatively small spread about the mean, so we are only interested in very local behavior of the function.

Thank you for your notes. I am not familiar with adjoint methods. How can I reformulate my problem \boldsymbol{y} = \boldsymbol{G}(\boldsymbol{x}) to fit into the nonlinear-equations section of your paper, i.e. \boldsymbol{f}(\boldsymbol{x}, \boldsymbol{p}) = \boldsymbol{0}?
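
My guess at the mapping (please correct me): treat the inputs as the parameters and the outputs as the unknowns,

\boldsymbol{f}(\boldsymbol{y}, \boldsymbol{p}) = \boldsymbol{y} - \boldsymbol{G}(\boldsymbol{p}) = \boldsymbol{0} \mbox{ with } \boldsymbol{p} \equiv \boldsymbol{x},

so that \partial \boldsymbol{f}/\partial \boldsymbol{y} = I, the adjoint solve is trivial, and what remains is the Jacobian \partial \boldsymbol{G}/\partial \boldsymbol{p} itself.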

It's hard to say more without knowing anything about the function G that you want to differentiate.

If the function G isn't changing much among your samples, so that it is well approximated by a low-order polynomial, then I would just try a polynomial or radial-basis-function fit.
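
In the locally-quadratic case, an ordinary least-squares fit about the mean yields the gradient and Hessian of each output component directly. A rough sketch for one component (note that a full quadratic has 1 + n + n(n+1)/2 coefficients, so with 50-200 samples it only works for small n or a restricted model):

```julia
using LinearAlgebra

# Fit y ≈ c + g'd + d'H*d/2 with d = x - xbar by least squares;
# the fitted g and H approximate the gradient and Hessian at xbar.
function local_quadratic(xs::Vector{<:AbstractVector}, y::Vector{<:Real})
    N, n = length(xs), length(first(xs))
    xbar = sum(xs) / N
    ncoef = 1 + n + n * (n + 1) ÷ 2
    A = zeros(N, ncoef)
    for k in 1:N
        d = xs[k] - xbar
        A[k, 1] = 1.0
        A[k, 2:n+1] = d
        col = n + 2
        for i in 1:n, j in i:n
            A[k, col] = (i == j ? 0.5 : 1.0) * d[i] * d[j]   # ½dᵢ² or dᵢdⱼ terms
            col += 1
        end
    end
    coef = A \ y                      # least-squares coefficients
    g = coef[2:n+1]
    H = zeros(n, n); col = n + 2
    for i in 1:n, j in i:n
        H[i, j] = H[j, i] = coef[col]
        col += 1
    end
    return coef[1], g, H              # value, gradient, Hessian at xbar
end
```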

If I could afford more evaluations of the function, would the best approach be to use a Smolyak grid to construct an interpolant and then apply automatic differentiation?

Smolyak/sparse grids aren't a panacea. They don't work well if the Hessian has lots of non-negligible off-diagonal elements.

In high dimensions, there is no free lunch for interpolation: there has to be something special about your function. That's why I said that it may be easier just to figure out how to differentiate G directly.


Going back to the OP, CoupledFields is based on kernel regression.
If \boldsymbol{y}_i = g(\boldsymbol{x}_i), then it follows from kernel ridge regression that:

\nabla g(\boldsymbol{x}_i) = \nabla K(\boldsymbol{x}_i, \cdot)\, (G_X + 10^{b} n I)^{-1} Y

where G_X is the Gram matrix of X, and \nabla K(\cdot, \cdot) is the gradient function of the kernel.

In CoupledFields, for gradvecfield([a b], ...), a is a smoothness parameter (that scales an auto-estimate of kernel width), and b is a ridge parameter (as in the equation above).
Using a=1.0 should typically work well.
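
A rough usage sketch with synthetic data (see the package docs for the exact argument types and the layout of the returned gradients):

```julia
using CoupledFields

# Synthetic data: N samples of an n-dimensional input, two output columns.
N, n = 100, 3
X = randn(N, n)                                   # samples in rows
Y = hcat(sin.(X[:, 1]) .* X[:, 2], X[:, 3] .^ 2)  # toy responses

kpars = GaussianKP(X)
# a = 1.0 (smoothness), b = -7.0 (ridge exponent, the 10^b in the equation above)
∇g = gradvecfield([1.0 -7.0], X, Y, kpars)        # gradient matrix at each sample
```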

Here's an example. The black line is the true value; the gradient function is estimated using gradvecfield([a -7], X, Y, GaussianKP(X)) for 3 values of a.

[Figure CF_sinc: estimated gradients for the 3 values of a, with the true value shown in black.]

For high-dimensional problems, the average product of jacobians can be used to find an active subspace.
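
Continuing the sketch above, that could look something like:

```julia
using LinearAlgebra, Statistics

# Average outer product of the per-sample gradient matrices; its leading
# eigenvectors span an (approximate) active subspace of the inputs.
C = mean(J * J' for J in ∇g)            # assumes each J has a column per output; use J' * J otherwise
F = eigen(Symmetric(C))                 # eigenvalues in ascending order
active_dirs = F.vectors[:, end-1:end]   # e.g. keep the two leading directions
```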


I use sparse grids (aka Smolyak) for reduced order models in computational chemistry. As @stevengj says, they don't solve all your problems. Our function evaluations are very expensive and we can only populate grids for moderate orders and lowish (<=10) dimensions before the computations start taking weeks.

If you don't care about boundary conditions, you may have better luck. We do, and have to mix polynomial and trigonometric bases. This is not easy. I have a student who is about to get a PhD for getting this into the ORNL Tasmanian code.

You could also get some Gibbs effects for your delta function approximation.
