BatchSolve.jl provides functionality for solving batchable, vectorized residual functions (for root finding) and objective functions (for minimization). As long as your function is vectorized, BatchSolve.jl takes care of the rest!
In summary, this package features:
- Solvers for (GPU) vectorized residual/merit functions
- The `AutoBatch` automatic-differentiation (AD) type for computing batches of derivatives of CPU/GPU vectorized/parallel functions, fully compatible with `DifferentiationInterface.jl` and any of its supported AD backends
- Specialized bindings for `FiniteDiff.jl` with `AutoBatch` to compute CPU/GPU vectorized finite differences
- Solvers that plug directly into CUDA's cuBLAS library for batched linear solving
What do we mean by "vectorized"? Say you want to find the root `x` of this function given some parameters `p`:
```julia
f(x, p) = (x - 1)^2 - p^2
```
You can do this easily with a Newton or Brent method, of course, probably available in some other root-finding package. But what if you want to solve this for 1 million different p? Well, if we let p be a vector of those million different parameters, then we can evaluate this function for a million different x using the broadcast operator “.” in Julia:
```julia
f_vectorized(x, p) = f.(x, p)
```
The result of f_vectorized will be a vector where each element is the corresponding f(x,p), and the evaluation will (hopefully) take advantage of SIMD if your CPU supports it. Better yet, if x and p are GPU arrays (e.g. CuArray for NVIDIA CUDA), then the evaluation will be GPU-parallelized!
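To make this concrete, here is the vectorized residual evaluated at the analytic roots x = 1 + p. This is a small self-contained check in plain Julia; no BatchSolve.jl required:

```julia
# The scalar residual from above and its broadcast version.
f(x, p) = (x - 1)^2 - p^2
f_vectorized(x, p) = f.(x, p)

p = [1.0, 2.0, 3.0]
x = 1 .+ p                # x = 1 + p is an exact root for each p
f_vectorized(x, p)        # → [0.0, 0.0, 0.0]
```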
With BatchSolve.jl, we can do root finding on this function in a vectorized way too. All we need to do is specify the `batchdim` - the dimension of the input array along which the independent elements of the batch lie, where 1 corresponds to the rows and 2 to the columns. This allows for batched solving of many n-dimensional systems. For our 1D example here we would just choose 1. E.g. for a batch with p = 1:10000,
```julia
sol = newton(f_vectorized, zeros(10000), Constant(1:10000), batchdim=1)
```
where we specified that p is a `Constant` - not mutated during function evaluation. Arguments can instead be specified as `Cache` if they are mutated. These constructs are re-exported from DifferentiationInterface.jl, and more details can be found here. BatchSolve uses DifferentiationInterface for all AD purposes, so it is easy to swap in a different AD backend, e.g. Enzyme:
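As a quick illustration of the `Constant` context in DifferentiationInterface.jl itself (assuming `ForwardDiff` is installed; `AutoForwardDiff` is re-exported by DifferentiationInterface):

```julia
using DifferentiationInterface
import ForwardDiff

# The extra argument p is wrapped in Constant, so it is treated as a
# fixed parameter rather than a differentiation target.
h(x, p) = sum((x .- p) .^ 2)
g = gradient(h, AutoForwardDiff(), [1.0, 2.0], Constant([0.5, 0.5]))
# The gradient is 2 .* (x .- p), here [1.0, 3.0]
```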
```julia
using Enzyme
sol = newton(f_vectorized, zeros(10000), Constant(1:10000), batchdim=1, autodiff=AutoEnzyme())
```
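For intuition, here is a hand-rolled, fully broadcast Newton iteration for the scalar example. This is only a sketch of what "batched" means here, not the package's actual implementation:

```julia
# A hand-rolled, fully broadcast Newton iteration for the 1D example.
# This only illustrates what "batched" means; it is not the package's
# actual implementation.
f(x, p)  = (x - 1)^2 - p^2
df(x, p) = 2 * (x - 1)            # analytic derivative with respect to x

p = collect(1.0:100.0)
x = fill(2.0, length(p))          # start away from the singular point x = 1
for _ in 1:60
    x .-= f.(x, p) ./ df.(x, p)   # one Newton step for every batch element at once
end
maximum(abs.(x .- (1 .+ p)))      # every element converges to the root x = 1 + p
```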
The returned object is a NamedTuple with the fields:
- `u`: an array with size equal to the input array, containing the inputs at the roots
- `jac`: a (sparse) matrix storing the Jacobian for the entire batched system at the last iteration
- `retcode`: an array of return codes for each element in the batch, where `0x0` is success, `0x1` is failure, and `0x2` means the maximum number of iterations was reached
- `iters`: an array storing the number of iterations until convergence for each element in the batch
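These fields can be accessed like those of any NamedTuple. The mock object below only illustrates the access pattern; a real `sol` comes from `newton`:

```julia
# A mock result with the same field names, purely to illustrate access;
# a real `sol` is returned by `newton`.
sol = (u = [2.0, 3.0], jac = zeros(2, 2), retcode = UInt8[0x0, 0x2], iters = [4, 100])

n_converged = count(==(0x0), sol.retcode)   # → 1
failed = findall(!=(0x0), sol.retcode)      # batch indices that did not succeed → [2]
```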
Of course, BatchSolve.jl is not limited to only 1D functions. Consider this example:
```julia
g(x, p) = x .+ p
```
This function takes in a vector x and adds a vector p to it. Suppose we want to do a Newton search on this system now for vectors of length 2, for many different p, say 10000. In this simple case we can evaluate g in a vectorized way for all systems simply by initializing x and p as matrices, either 2 x 10000 or 10000 x 2. For batchdim=1, this would be 10000 x 2:
```julia
x0 = zeros(10000, 2)
p = rand(10000, 2)
sol = newton(g, x0, Constant(p), batchdim=1)
```
`sol.u` will be a 10000 x 2 matrix with the roots. For `batchdim=2`:
```julia
x0 = zeros(2, 10000)
p = rand(2, 10000)
sol = newton(g, x0, Constant(p), batchdim=2)
```
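Both layouts work because `g` is purely element-wise, so broadcasting does not care which dimension holds the batch; only `batchdim` tells the solver how to slice it:

```julia
g(x, p) = x .+ p

# batchdim = 1 layout: batch along the rows (3 systems of size 2)
g(zeros(3, 2), ones(3, 2)) == ones(3, 2)   # → true

# batchdim = 2 layout: batch along the columns, same values transposed
g(zeros(2, 3), ones(2, 3)) == ones(2, 3)   # → true
```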
Of course, you could also treat the whole batch as one vector of size 20000 (i.e. a "batch" with a single element). This can be done by not setting `batchdim`, which recovers a regular non-batched Newton solver:
```julia
x0 = zeros(20000)
p = rand(20000)
sol = newton(g, x0, Constant(p))
```
The problem with this approach is that instead of solving many small 2 x 2 systems, the linear solver must now handle one giant 20000 x 20000 matrix equation. This is generally much slower than solving many small systems. In fact, when we tested this approach with CUDA's cuDSS solver, we found that performance scaled poorly beyond 60000 x 60000 systems.
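The underlying structure is block diagonal: the big Jacobian consists of many independent small blocks. The sketch below (plain LinearAlgebra/SparseArrays, no BatchSolve.jl) shows that solving the blocks independently gives the same answer as one monolithic sparse solve, which is why the batched route can be so much faster:

```julia
using LinearAlgebra, SparseArrays

n, b = 2, 100                                 # block size, batch size
blocks = [rand(n, n) + 2n * I for _ in 1:b]   # well-conditioned 2x2 blocks
rhs    = [rand(n) for _ in 1:b]

# Batched route: solve every small system independently.
xs = [A \ r for (A, r) in zip(blocks, rhs)]

# Monolithic route: one big block-diagonal sparse system.
big  = blockdiag(sparse.(blocks)...)
xbig = big \ vcat(rhs...)

maximum(abs.(vcat(xs...) .- xbig))            # agrees to machine precision
```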
For CUDA arrays, BatchSolve.jl plugs directly into CUDA’s cuBLAS library which provides GPU-accelerated batched linear system solving.
See the rest of the README here for more details and benchmarks.