I am trying to optimize a function that is expensive to evaluate but has many component parts that can be computed in parallel. However, DualNumbers are not bits types and therefore cannot be stored in SharedArrays. Is there an analogous way to avoid repeatedly allocating and distributing memory across many processes for intermediate results that contain DualNumbers?
As an example, let’s say we wanted to minimize f(X, theta) with respect to theta, where X is a large SharedArray that f does not modify. (We use a SharedArray to avoid the overhead of passing X to all of the worker processes many times.)
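For concreteness, a minimal sketch of that setup might look like the following. The worker count and the dimensions of X are arbitrary placeholders; the point is only that X is allocated once and shared, rather than copied to each process.

```julia
using Distributed
addprocs(2)                       # spawn local worker processes (count is arbitrary)
@everywhere using SharedArrays

# X is allocated once in shared memory; all workers on this machine see
# the same underlying buffer, so it is never copied between processes.
# (SharedArrays only share memory among processes on a single machine.)
X = SharedArray{Float64}(10_000, 5)
X .= randn(10_000, 5)
```

Note that this only works because Float64 is a bits type; replacing the element type with a DualNumber is exactly what fails.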
Concretely, let f(X, theta) = sum_k h(x_k, g(X, theta)), where x_k is the k-th row of X. Clearly, once g = g(X, theta) has been computed, f(X, theta) = sum_k h(x_k, g) is embarrassingly parallel, so the “obvious” way to evaluate f(X, theta) is to compute g once and parallelize the sum over k. However, during optimization using automatic differentiation, g will take on dual-number values. That means we can’t just stick g in a SharedArray. Instead we have to pass g to each process, even though g isn’t modified at all during the parallel computation of sum_k h(x_k, g).
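To make the structure explicit, here is a toy sketch of the evaluation pattern I mean. The functions g and h below are hypothetical stand-ins (the real ones are expensive); the comments mark where the repeated shipping of g happens.

```julia
using Distributed, SharedArrays

# Hypothetical stand-ins for the g and h described above.
@everywhere g(X, theta) = theta[1] * sum(X) / length(X)   # some scalar summary
@everywhere h(x, gval) = sum(abs2, x .- gval)

function f(X::SharedArray, theta)
    gval = g(X, theta)            # compute g once...
    # ...but gval is captured by the closure below, so it gets serialized
    # and sent to every worker on every call to f. When theta carries dual
    # numbers, gval does too, and it cannot live in a SharedArray instead.
    @distributed (+) for k in 1:size(X, 1)
        h(view(X, k, :), gval)
    end
end
```

The sum over k parallelizes cleanly; the question is only about avoiding the repeated serialization of the read-only gval.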
Does anyone have a suggestion for a good way to avoid reallocating and passing g around many times in such a case, especially since it isn’t even modified during the parallel computation?