Hi and greetings

I am trying to compute derivatives (later on up to the 3rd or 4th degree) of vectorized functions using ForwardDiff. The functions are generally in the following form:

```
using BenchmarkTools
using StaticArrays
using ForwardDiff
# broadcasting inside function
vars = [rand(1000) for _ in 1:2]
func(vector_of_vars::Vector{<:Vector{<:Number}}) = 5.0 .* vector_of_vars[1] .* vector_of_vars[1] .* vector_of_vars[1] .* vector_of_vars[2]
@btime func(vars) # -> 657.443 ns (1 allocation: 7.94 KiB)
```

Some research yielded the following approach: (Vectorization of ForwardDiff.gradient function - #4 by rdeits)

```
# vectorize outside the function
vars_rows = [rand(2) for _ in 1:1000]
func_rows(vars_rows::Vector{<:Number}) = 5 * vars_rows[1] * vars_rows[1] * vars_rows[1] * vars_rows[2]
@btime func_rows.(vars_rows) # -> 1.158 ÎĽs (4 allocations: 8.00 KiB)
grad_func_rows = x -> ForwardDiff.gradient(func_rows, x)
@btime grad_func_rows.(vars_rows) # 229.625 ÎĽs (4500 allocations: 297.19 KiB)
```

Seeing as there is this enourmous performance decrease, I tried to see if I can copy and reimplement some simplified basis of ForwardDiff with some help from various sources including the paper from ForwardDiff and Youtube yielded the following:

```
struct Dual{T, N} <: Number
value::T
partials::N
end
import Base: convert, promote_rule, show, *
Base.convert(::Type{<:Dual}, val::Real) = Dual(val, zero(val))
Base.promote_rule(::Type{<:Dual}, ::Type{<:Number}) = Dual
Base.show(io::IO, x::Dual) = print(io, x.value, " + ", x.partials, "đťś–")
*(x::Dual, y::Dual) = Dual(x.value .* y.value, (x.value .* y.partials .+ x.partials .* y.value))
vars_dual = [[Dual(rand(), @SVector[1.0, 0.0]) for _ in 1:1000], [Dual(rand(), @SVector[0.0, 1.0]) for _ in 1:1000]]
@btime func(vars_dual) # -> 1.613 ÎĽs (2 allocations: 23.48 KiB)
```

The manual approach uses the vectorization inside the function and (I guess) has better access to SIMD and other optimizations.

On top of the decrease seen here, in my real code, the function evaluations are quite nested and go many levels deep, which favors the vectorization inside rather than outside even more (I would think).

So my questions is, can I somehow make ForwardDiff work with the vector evaluations. Maybe by manually creating a vectors of Duals using the ForwardDiff infrastructure? Do you have any other insights I might have missed?

Thank you very much in advance