ForwardDiff.jacobian of softmax vs analytic Jacobian

I am trying out automatic differentiation on part of my code in place of the analytic derivative, but ForwardDiff.jacobian applied to a softmax seems to take about twice as long as the analytic Jacobian. I'm not sure whether this is expected or whether I'm doing something that slows ForwardDiff down. Any feedback is appreciated.

Below is an MWE with data representative of my problem. The loop inside the functions is also representative (i.e. computing a list of Jacobians, one per 10-row block of a larger array). I tried passing an anonymous function to ForwardDiff.jacobian, as well as a function that doesn't compute the softmax in place (both left commented out below). I know that pushing to an Array{Any} is not great, but that's just for the MWE.

using ForwardDiff
using BenchmarkTools

# softmax adapted from StatsFuns, minus the convert (and with a type-stable accumulator)
function softmax2(x::AbstractArray{T}) where {T<:Real}
    n = length(x)
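    # subtract the maximum before exponentiating for numerical stability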
    u = maximum(x)
    s = zero(T)  # a 0. literal is Float64 and makes the loop type-unstable for Dual inputs
    r = similar(x)
    @inbounds for i = 1:n
        s += (r[i] = exp(x[i] - u))
    end
    invs = inv(s)
    @inbounds for i = 1:n
        r[i] *= invs
    end
    r
end
function softmax2!(r::AbstractArray{T}, x::AbstractArray{T}) where {T<:Real}
    n = length(x)
    length(r) == n || throw(DimensionMismatch("Inconsistent array lengths."))
    u = maximum(x)
    s = zero(T)  # type-stable accumulator, as in softmax2
    @inbounds for i = 1:n
        s += (r[i] = exp(x[i] - u))
    end
    invs = inv(s)
    @inbounds for i = 1:n
        r[i] *= invs
    end
    r
end

function jac1(x, beta)
    result = Array{Any}(undef, 0)
    for i in 1:10:(size(x, 1) - 9)
        thisx = x[i:(i + 9), :]
        pp!(y::Vector, b::Vector) = softmax2!(y, thisx * b)
        # pp! = (y, b) -> softmax2!(y, thisx * b)
        p = zeros(Float64, 10)
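        # Jacobian of b -> softmax2!(p, thisx * b); p is also overwritten with the primal softmax values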
        lambda = ForwardDiff.jacobian(pp!, p, beta)::Array{Float64, 2}
        # pp = b -> softmax2(thisx * b)
        # 𝐩 = pp(beta)
        # lambda = ForwardDiff.jacobian(pp, beta)
        push!(result, lambda)
    end
    return result
end
function jac2(x, beta)
    result = Array{Any}(undef, 0)
    for i in 1:10:(size(x, 1) - 9)
        thisx = x[i:(i + 9), :]
        p = softmax2(thisx * beta)
        lambda = similar(thisx)
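        # analytic Jacobian of p = softmax(thisx * beta):
        #   dp[t]/dbeta[k] = p[t] * (thisx[t, k] - sum_s thisx[s, k] * p[s])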
        for k in 1:length(beta)
            last_term = 0.
            for s in 1:10
                last_term += thisx[s, k] * p[s]
            end
            for t in 1:10
                lambda[t, k] = p[t] * (thisx[t, k] - last_term)
            end
        end
        push!(result, lambda)
    end
    return result
end

X = [0.398106 0.496961; -0.612026 -0.224875; 0.34112 -1.11714;
     -1.12936 -0.394995; 1.43302 1.54983; 1.9804 -0.743514; -0.367221 -2.33171;
     -1.04413 0.812245; 0.56972 -0.501311; -0.135055 -0.510887;
     2.40162 -1.21536; -0.03924 -0.0225586; 0.689739 0.701239;
     0.0280022 -0.587482; -0.743273 -0.606728; 0.188792 1.09664;
     -1.80496 -0.24751; 1.46555 -0.159902; 0.153253 -0.625778;
     2.17261 0.900435]
β = [0.924499; 0.869371]
X = repeat(X, outer = (5, 2))
β = repeat(β, 2)
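# X is now 100×4 and β has length 4, so each function returns 10 Jacobians of size 10×4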

display(isapprox(jac1(X, β), jac2(X, β)))

@btime jac1(X, β)
@btime jac2(X, β)

Output

true
  8.137 μs (94 allocations: 25.50 KiB)
  4.086 μs (44 allocations: 12.22 KiB)
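
One more variant I considered but haven't benchmarked carefully: hoisting the block buffer, the output vector, and a ForwardDiff.JacobianConfig out of the loop, so the closure and the dual work buffers are built once instead of per iteration. jac1_cfg is just a name I made up for this sketch (it reuses softmax2! from above).

function jac1_cfg(x, beta)
    result = Matrix{Float64}[]
    thisx = similar(x, 10, size(x, 2))  # block buffer, overwritten each iteration
    pp!(y, b) = softmax2!(y, thisx * b)
    p = zeros(Float64, 10)
    cfg = ForwardDiff.JacobianConfig(pp!, p, beta)
    for i in 1:10:(size(x, 1) - 9)
        thisx .= @view x[i:(i + 9), :]
        push!(result, ForwardDiff.jacobian(pp!, p, beta, cfg))
    end
    return result
end

Since pp! captures thisx and that array is only mutated in place, reusing the same config across iterations should be valid, and it should at least cut the per-iteration allocations.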