BatchSolve.jl provides functionality for solving batchable, vectorized residual functions (for root finding) and objective functions (for minimization). As long as your function is vectorized, BatchSolve.jl takes care of the rest!
In summary, this package features:
- Solvers for (GPU) vectorized residual/merit functions
- The `AutoBatch` automatic-differentiation (AD) type for computing batches of derivatives of CPU/GPU vectorized/parallel functions, fully compatible with `DifferentiationInterface.jl` and any of its supported AD backends
- Specialized bindings for `FiniteDiff.jl` with `AutoBatch` to compute CPU/GPU vectorized finite differences
- Solvers that plug directly into CUDA's cuBLAS library for batched linear solving
What do we mean by "vectorized"? Say you want to find the root `x` of this function given some parameters `p`:
```julia
f(x, p) = (x - 1)^2 - p^2
```
You can do this easily with a Newton or Brent method, of course, probably available in some other root-finding package. But what if you want to solve this for 1 million different p? Well, if we let p be a vector of those million different parameters, then we can evaluate this function for a million different x using the broadcast operator “.” in Julia:
```julia
f_vectorized(x, p) = f.(x, p)
```
The result of f_vectorized will be a vector where each element is the corresponding f(x,p), and the evaluation will (hopefully) take advantage of SIMD if your CPU supports it. Better yet, if x and p are GPU arrays (e.g. CuArray for NVIDIA CUDA), then the evaluation will be GPU-parallelized!
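To make this concrete, here is the vectorized residual evaluated at the analytic roots x = 1 + p. This is a small self-contained check in plain Julia; no BatchSolve.jl required:

```julia
# The scalar residual from above and its broadcast version.
f(x, p) = (x - 1)^2 - p^2
f_vectorized(x, p) = f.(x, p)

p = [1.0, 2.0, 3.0]
x = 1 .+ p                # x = 1 + p is an exact root for each p
f_vectorized(x, p)        # → [0.0, 0.0, 0.0]
```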
With BatchSolve.jl, we can do root finding on this function in a vectorized way too. All we need to do is specify the `batchdim` - the dimension of the input array along which the independent elements of the batch lie, where 1 corresponds to the rows and 2 to the columns. This allows for batched solving of many n-dimensional systems. For our 1D example here we would just choose 1. E.g. for a batch with p = 1:10000,
```julia
sol = newton(f_vectorized, zeros(10000), Constant(1:10000), batchdim=1)
```
where we specified that p is a `Constant` - not mutated during function evaluation. Arguments can instead be specified as `Cache` if they are mutated. These constructs are re-exported from DifferentiationInterface.jl, and more details can be found here. BatchSolve uses DifferentiationInterface for all AD purposes, so it is easy to swap in a different AD backend, e.g. Enzyme:
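As a quick illustration of the `Constant` context in DifferentiationInterface.jl itself (assuming `ForwardDiff` is installed; `AutoForwardDiff` is re-exported by DifferentiationInterface):

```julia
using DifferentiationInterface
import ForwardDiff

# The extra argument p is wrapped in Constant, so it is treated as a
# fixed parameter rather than a differentiation target.
h(x, p) = sum((x .- p) .^ 2)
g = gradient(h, AutoForwardDiff(), [1.0, 2.0], Constant([0.5, 0.5]))
# The gradient is 2 .* (x .- p), here [1.0, 3.0]
```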
```julia
using Enzyme
sol = newton(f_vectorized, zeros(10000), Constant(1:10000), batchdim=1, autodiff=AutoEnzyme())
```
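For intuition, here is a hand-rolled, fully broadcast Newton iteration for the scalar example. This is only a sketch of what "batched" means here, not the package's actual implementation:

```julia
# A hand-rolled, fully broadcast Newton iteration for the 1D example.
# This only illustrates what "batched" means; it is not the package's
# actual implementation.
f(x, p)  = (x - 1)^2 - p^2
df(x, p) = 2 * (x - 1)            # analytic derivative with respect to x

p = collect(1.0:100.0)
x = fill(2.0, length(p))          # start away from the singular point x = 1
for _ in 1:60
    x .-= f.(x, p) ./ df.(x, p)   # one Newton step for every batch element at once
end
maximum(abs.(x .- (1 .+ p)))      # every element converges to the root x = 1 + p
```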
The returned object is a NamedTuple with the fields:
- `u`: an array with size equal to the input array, containing the inputs at the roots
- `jac`: a (sparse) matrix storing the Jacobian for the entire batched system at the last iteration
- `retcode`: an array of return codes for each element in the batch, where `0x0` is success, `0x1` is failure, and `0x2` means the maximum number of iterations was reached
- `iters`: an array storing the number of iterations until convergence for each element in the batch
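These fields can be accessed like those of any NamedTuple. The mock object below only illustrates the access pattern; a real `sol` comes from `newton`:

```julia
# A mock result with the same field names, purely to illustrate access;
# a real `sol` is returned by `newton`.
sol = (u = [2.0, 3.0], jac = zeros(2, 2), retcode = UInt8[0x0, 0x2], iters = [4, 100])

n_converged = count(==(0x0), sol.retcode)   # → 1
failed = findall(!=(0x0), sol.retcode)      # batch indices that did not succeed → [2]
```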
Of course, BatchSolve.jl is not limited to only 1D functions. Consider this example:
```julia
g(x, p) = x .+ p
```
This function takes in a vector x and adds a vector p to it. Suppose we want to do a Newton search on this system now for vectors of length 2, for many different p, say 10000. In this simple case we can evaluate g in a vectorized way for all systems simply by initializing x and p as matrices, either 2 x 10000 or 10000 x 2. For batchdim=1, this would be 10000 x 2:
```julia
x0 = zeros(10000, 2)
p = rand(10000, 2)
sol = newton(g, x0, Constant(p), batchdim=1)
```
`sol.u` will be a 10000 x 2 matrix with the roots. For `batchdim=2`:
```julia
x0 = zeros(2, 10000)
p = rand(2, 10000)
sol = newton(g, x0, Constant(p), batchdim=2)
```
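Both layouts work because `g` is purely element-wise, so broadcasting does not care which dimension holds the batch; only `batchdim` tells the solver how to slice it:

```julia
g(x, p) = x .+ p

# batchdim = 1 layout: batch along the rows (3 systems of size 2)
g(zeros(3, 2), ones(3, 2)) == ones(3, 2)   # → true

# batchdim = 2 layout: batch along the columns, same values transposed
g(zeros(2, 3), ones(2, 3)) == ones(2, 3)   # → true
```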
Of course, you could also treat the whole batch as one vector of size 20000 (i.e. a "batch" with a single element). This can be done by not setting `batchdim`, which recovers a regular non-batched Newton solver:
```julia
x0 = zeros(20000)
p = rand(20000)
sol = newton(g, x0, Constant(p))
```
The problem with this approach is that instead of solving many small 2 x 2 systems, the linear solver must now handle one giant 20000 x 20000 matrix equation. This is generally much slower than solving many small systems. In fact, when we tested this approach with CUDA's cuDSS solver, we found that performance scaled poorly beyond 60000 x 60000 systems.
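The underlying structure is block diagonal: the big Jacobian consists of many independent small blocks. The sketch below (plain LinearAlgebra/SparseArrays, no BatchSolve.jl) shows that solving the blocks independently gives the same answer as one monolithic sparse solve, which is why the batched route can be so much faster:

```julia
using LinearAlgebra, SparseArrays

n, b = 2, 100                                 # block size, batch size
blocks = [rand(n, n) + 2n * I for _ in 1:b]   # well-conditioned 2x2 blocks
rhs    = [rand(n) for _ in 1:b]

# Batched route: solve every small system independently.
xs = [A \ r for (A, r) in zip(blocks, rhs)]

# Monolithic route: one big block-diagonal sparse system.
big  = blockdiag(sparse.(blocks)...)
xbig = big \ vcat(rhs...)

maximum(abs.(vcat(xs...) .- xbig))            # agrees to machine precision
```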
For CUDA arrays, BatchSolve.jl plugs directly into CUDA’s cuBLAS library which provides GPU-accelerated batched linear system solving.
See the rest of the README here for more details and benchmarks.